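I need to scrape the 7 main news items from this website, tengrinews.kz: the date, time and title of each item. I am using Selenium and have Firefox Developer Edition installed.
I inspected the site, and the 7 news items sit in this structure: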
... some stuff
BIG MAJOR NEWS TEXT
BIG MAJOR NEWS TEXT
news1 TEXT
news1 TEXT
news2 TEXT
news2 TEXT
news3 TEXT
news3 TEXT
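I located the div that contains all 7 news items by XPath or CSS selector. I do get a Firefox WebElement result back, but it is a list, and it is empty!
If I try to locate a single href or div, it returns a 'list' of web elements. According to the Selenium documentation this href should have a text attribute, but I get the error "no attribute text".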
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://tengrinews.kz")
css_to_big_news = 'html body div.my-app main section.tn-main-section.tn-container div.tn-main-news-container.tn-sub-container div.tn-main-news-grid div.tn-main-news-item.firs-column.tn-three-column.tn-background-cover a.tn-link'
href_big = driver.find_elements_by_css_selector(css_to_big_news)
print('type of href_big is %s and length is %d' %(type(href_big), len(href_big)))
print(href_big[0].text) #this is wrong
print(href_big.text()) # this is wrong with parenthesis
What is wrong?
Posted on 2020-07-20 02:38:36
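To extract the texts (the date and time, and the title) from each of the news items using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies: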
CSS_SELECTOR:
driver.get("https://tengrinews.kz/")
print("Date and Time:")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.tn-main-news-grid div.tn-main-news-item ul.tn-data-list>li>span time")))])
print("Title:")
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.tn-main-news-grid div.tn-main-news-item span.tn-main-news-title")))])
XPATH:
driver.get("https://tengrinews.kz/")
print("Date and Time:")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='tn-main-news-grid']//div[contains(@class, 'tn-main-news-item')]//ul[@class='tn-data-list']/li/span//time")))])
print("Title:")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='tn-main-news-grid']//div[contains(@class, 'tn-main-news-item')]//span[@class='tn-main-news-title']")))])
Console output:
Date and Time:
['вчера, 18:27', 'вчера, 21:45', 'вчера, 20:52', 'вчера, 19:48', 'вчера, 17:34', 'вчера, 14:50', 'вчера, 14:32']
Title:
['Жара до 42 градусов ожидается в регионах Казахстана', 'Строгий карантин вводят в Мангистауской области', 'Нехватку вакцин и новую "суровую" волну COVID-19 предрекли в мире', 'Столицу Казахстана "оживили"', 'Жители Актау собрались на площади из-за отсутствия лекарств в аптеках', 'Строгий карантин в Нур-Султане продлили до 2 августа', '"Едят антибиотики". Врач из Павлодара объяснил рост числа тяжелых больных']
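Since the question asks for the date, time and title of each item together, the two lists printed above can be paired element by element. The following is only a sketch: `NewsItem` and `pair_news` are hypothetical names, and it assumes each timestamp has the "day, HH:MM" shape seen in the output, which may not hold for every entry on the site.

```python
from typing import List, NamedTuple


class NewsItem(NamedTuple):
    """Hypothetical record holding one news item's day label, time and title."""
    day: str
    time: str
    title: str


def pair_news(timestamps: List[str], titles: List[str]) -> List[NewsItem]:
    """Pair each timestamp with the title at the same position."""
    if len(timestamps) != len(titles):
        # A mismatch usually means one of the two locators matched extra or fewer nodes.
        raise ValueError("timestamps and titles differ in length")
    items = []
    for stamp, title in zip(timestamps, titles):
        # Assumption: the stamp looks like 'вчера, 18:27' (day label, comma, clock time).
        # If there is no comma, the whole stamp ends up in `day` and `time` is empty.
        day, _, clock = stamp.partition(",")
        items.append(NewsItem(day.strip(), clock.strip(), title.strip()))
    return items


if __name__ == "__main__":
    sample_times = ["вчера, 18:27", "вчера, 21:45"]
    sample_titles = [
        "Жара до 42 градусов ожидается в регионах Казахстана",
        "Строгий карантин вводят в Мангистауской области",
    ]
    for item in pair_news(sample_times, sample_titles):
        print(item.day, item.time, item.title, sep=" | ")
```

The length check is there because the two lists come from separate locators, so silently zipping lists of different lengths would attach titles to the wrong timestamps.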
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
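As for the errors in the question itself: the find_elements_* methods return a Python list, so `.text` has to be read from each element in the list, not from the list, and `text` is a property, so it takes no parentheses. When nothing matches, the list is empty and indexing it with `[0]` raises IndexError. The sketch below illustrates this without a browser; `FakeElement` is a hypothetical stand-in for a WebElement that models only `.text`, and the helper names are likewise made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only `.text` is modelled."""
    text: str


def texts(elements: List[FakeElement]) -> List[str]:
    """Read `.text` from every element; the list itself has no such attribute."""
    return [element.text for element in elements]


def first_text(elements: List[FakeElement], default: Optional[str] = None) -> Optional[str]:
    """Return the first element's text, or `default` when the list is empty."""
    return elements[0].text if elements else default


if __name__ == "__main__":
    found = [FakeElement("news1 TEXT"), FakeElement("news2 TEXT")]
    not_found: List[FakeElement] = []  # what a locator with no matches yields

    print(texts(found))              # text of each element
    print(first_text(not_found))     # falls back to the default, no IndexError
    print(hasattr(found, "text"))    # a list has no .text attribute
```

With a real driver the same pattern applies to the list returned by the wait in the answer: iterate over it, or guard against it being empty before taking index 0.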
Outro
Links to useful documentation:
get_attribute() method: gets the given attribute or property of the element.
text attribute: returns the text of the element.
https://stackoverflow.com/questions/62983709