
Q: How can I make a Selenium/BS4 program that uses Pandas and DataFrames more optimized and elegant?

Stack Overflow user

Asked 2022-02-28 19:33:44

Answers: 1, Views: 71, Followers: 0, Votes: 2

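I'm learning web scraping and found an interesting challenge: scraping a JavaScript Handlebars table from this page: Samsung Knox Devices

I eventually got the output I wanted, but the code feels inelegant, so I'd welcome any improvements that make it cleaner.

The desired output is a dataframe/CSV table with the columns Device, Model_Nums, OS/Platform, and Knox Version. I don't need anything else on the page, and I will split/expand and melt the model names separately.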

import pandas as pd

# Libraries for this task: 
from bs4 import BeautifulSoup
from selenium import webdriver

# Because the target table is built using Javascript handlebars, we have to use Selenium and a webdriver
driver = webdriver.Edge("MY_PATH") # REPLACE WITH >YOUR< PATH!

# Point the driver at the target webpage:
driver.get('https://www.samsungknox.com/en/knox-platform/supported-devices')

# Get the page content
html = driver.page_source
# Typically I'd do something like: soup = BeautifulSoup(html, "lxml")
# Link below suggested the following, which works; I don't know if it matters
sp = BeautifulSoup(html, "html.parser")

# The 'table' here is really a bunch of nested divs
# (the parsed document is stored in `sp`, not `soup`)
tables = sp.find_all("div", class_='table-row')
# https://www.angularfix.com/2021/09/how-to-extract-text-from-inside-div-tag.html
rows = []
for t in tables:
    row = t.text
    rows.append(row)

# These are the table-row div classes within each table-row from the output at the previous step that I want:    
    # div class="supported-devices pivot-fixed"
    # div class="model"
    # div class="operating-system"
    # div class="knox-version"

# Define div class names:
targets = ["supported-devices pivot-fixed", "model", "operating-system", "knox-version"]

# Create an empty list and loop through each target div class; append to list
data = []
for t in targets:
    hold = sp.find_all("div", class_=t)
    for h in hold:
        row = h.text
        data.append({'column': t, 'value': row}) 

df = pd.DataFrame(data)

# This feels like a hack, but I got stuck and it works, so \shrug/
# Create Series from filtered df based on 'column' value (corresponding to the four "targets" above)
name = pd.Series(df['value'][df['column']=='supported-devices pivot-fixed']).reset_index(drop=True)
model = pd.Series(df['value'][df['column']=='model']).reset_index(drop=True)
os_col = pd.Series(df['value'][df['column']=='operating-system']).reset_index(drop=True)  # renamed from `os` to avoid shadowing the stdlib module name
knox = pd.Series(df['value'][df['column']=='knox-version']).reset_index(drop=True)
# Concatenate Series into df (use the Series defined above; df_name etc. were never defined)
df2 = pd.concat([name, model, os_col, knox], axis=1)

# Make the first row the column names:
new_header = df2.iloc[0] #grab the first row for the header
sam_knox_table = df2[1:] #take the data less the header row
sam_knox_table.columns = new_header #set the header row as the df header

# Bob's your uncle
sam_knox_table.to_csv('sam_knox.csv', index=False)
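
One possible way to avoid the filter-and-concatenate step is to walk the table-row divs once and pull the four cells out of each row, so every record is built in column order from the start. The following is only a minimal sketch under stated assumptions: SAMPLE_HTML is a hypothetical stand-in for driver.page_source (its header labels, OS values and Knox versions are made up for illustration), rows_to_frame is a hypothetical helper name, and the first matching row is assumed to be the header row, as in the code above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source. The class names come from the
# question; the cell values for OS and Knox version are invented sample data.
SAMPLE_HTML = """
<div class="table-row table-header">
  <div class="supported-devices pivot-fixed">Device</div>
  <div class="model">Model Code</div>
  <div class="operating-system">OS/Platform</div>
  <div class="knox-version">Knox Version</div>
</div>
<div class="table-row">
  <div class="supported-devices pivot-fixed">Galaxy A52</div>
  <div class="model">SM-A525F, SM-A525M</div>
  <div class="operating-system">Android 11</div>
  <div class="knox-version">3.7</div>
</div>
<div class="table-row">
  <div class="supported-devices pivot-fixed">Galaxy A52 5G</div>
  <div class="model">SM-A5260</div>
  <div class="operating-system">Android 11</div>
  <div class="knox-version">3.7</div>
</div>
"""

# CSS class of each wanted cell, in output column order.
CELL_CLASSES = ["supported-devices", "model", "operating-system", "knox-version"]


def rows_to_frame(html):
    """Build a DataFrame from the nested 'table-row' divs, one record per row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("div.table-row"):
        cells = [row.select_one(f"div.{cls}") for cls in CELL_CLASSES]
        if any(cell is None for cell in cells):
            continue  # skip rows that lack one of the four cells
        records.append([cell.get_text(strip=True) for cell in cells])
    # Assumption: the first complete row holds the column names.
    header, *body = records
    return pd.DataFrame(body, columns=header)


table = rows_to_frame(SAMPLE_HTML)
print(table)
```

With the real page, rows_to_frame(driver.page_source) would take the place of the sample string, and the result could be written out with to_csv as before. The function raises a ValueError if no complete row is found, which is probably acceptable for a one-off script.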

selenium

web-scraping

beautifulsoup

python

dataframe

1 Answer

Stack Overflow user

Accepted answer

Posted 2022-02-28 20:11:24

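To scrape the text from the Device and Model Code columns, induce WebDriverWait for visibility_of_all_elements_located() to create a list of the desired texts, then write it into a DataFrame using Pandas (see https://stackoverflow.com/a/69947765/7429447 and https://stackoverflow.com/a/64770041/7429447). You can use the following locator strategies: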

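Code block:

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `driver` is the webdriver instance created in the question
driver.get("https://www.samsungknox.com/en/knox-platform/supported-devices")
# Wait up to 20 seconds for the device cells of every non-header row to become visible
devices = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-row:not(.table-header) > div.supported-devices")))]
# The selector for this column was lost in the source; "div.model" is assumed from the class names listed in the question
models = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-row:not(.table-header) > div.model")))]
df = pd.DataFrame(data=list(zip(devices, models)), columns=['Device', 'Model Code'])
print(df)
driver.quit()
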
Console output:

               Device                                      Model Code
0       Galaxy A42 5G          SM-A426N, SM-A426U, SM-A4260, SM-A426B
1          Galaxy A52                              SM-A525F, SM-A525M
2       Galaxy A52 5G                                        SM-A5260
3       Galaxy A52 5G            SM-A526U, SC-53B, SM-A526W, SM-A526B
4      Galaxy A52s 5G                              SM-A528B, SM-A528N
..                ...                                             ...
371        Gear Sport                                         SM-R600
372   Gear S3 Classic                                        SM-R775V
373  Gear S3 Frontier                                        SM-R765V
374           Gear S2           SM-R720, SM-R730A, SM-R730S, SM-R730V
375   Gear S2 Classic  SM-R732, SM-R735, SM-R735A, SM-R735V, SM-R735S

[376 rows x 2 columns]
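
The question also mentions splitting and melting the model names afterwards. A minimal sketch of one way to do that with pandas is shown here, assuming the comma-separated layout seen in the console output; the two sample rows are copied from that output, and long_df is just an illustrative name.

```python
import pandas as pd

# Two rows taken from the answer's console output.
df = pd.DataFrame(
    {
        "Device": ["Galaxy A52", "Galaxy A52 5G"],
        "Model Code": ["SM-A525F, SM-A525M", "SM-A5260"],
    }
)

# Split each comma-separated string into a list, then give every list item its own row.
long_df = df.assign(**{"Model Code": df["Model Code"].str.split(",")}).explode(
    "Model Code", ignore_index=True
)
# Remove the space left behind after each comma.
long_df["Model Code"] = long_df["Model Code"].str.strip()
print(long_df)
```

Each device then appears once per model code, which tends to be easier to filter or join on than the packed string. The ignore_index argument of explode needs pandas 1.1 or later.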

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/71300182
