Each time the page loads, it may contain 0, 1, 2, or 3 of these divs.
Below is an example with 3 divs:
<div class="SOME_dIV">
<span class="class_title">TITLE-1</span>
<span class="_some_class1">VALUE-1</span>
<span class="_some_class2">VALUE-2</span>
<span class="_some_class3">VALUE-3</span>
</div>
<div class="SOME_dIV">
<span class="class_title">TITLE-2</span>
<span class="_some_class1">VALUE-10</span>
<span class="_some_class2">VALUE-20</span>
<span class="_some_class3">VALUE-30</span>
</div>
<div class="SOME_dIV">
<span class="class_title">TITLE-3</span>
<span class="_some_class1">VALUE-100</span>
<span class="_some_class2">VALUE-200</span>
<span class="_some_class3">VALUE-300</span>
</div>
My Python code:
from bs4 import BeautifulSoup as bs
from selenium import webdriver

html = webdriver.Firefox()
html.get('DYNAMIC_URL')
html_source = html.page_source
html_source_bs = bs(html_source, 'html.parser')
all_divs = html_source_bs.find_all('div', class_='SOME_DIV')
# Only the first div (index 0) is read here
span_title = all_divs[0].find('span', class_='class_title')
span_1 = all_divs[0].find_all('span', class_=lambda c: c and '_some_class1' in c)
span_2 = all_divs[0].find_all('span', class_=lambda c: c and '_some_class2' in c)
span_3 = all_divs[0].find_all('span', class_=lambda c: c and '_some_class3' in c)
title_list = ['Title']
span1_list = ['Span1']
span2_list = ['Span2']
span3_list = ['Span3']
# Iterating over a single tag yields its child nodes (here, the text string)
for l_title in span_title:
    result = l_title.strip()
    title_list.append(result)
for l_1 in span_1:
    result = l_1.text.strip()
    span1_list.append(result)
for l_2 in span_2:
    result = l_2.text.strip()
    span2_list.append(result)
for l_3 in span_3:
    result = l_3.text.strip()
    span3_list.append(result)
print(title_list)
print(span1_list)
print(span2_list)
print(span3_list)
Output:
['Title', 'TITLE-1']
['Span1', 'VALUE-1']
['Span2', 'VALUE-2']
['Span3', 'VALUE-3']
Expected output when there are 3 divs:
['Title', 'TITLE-1']
['Span1', 'VALUE-1']
['Span2', 'VALUE-2']
['Span3', 'VALUE-3']
['Title', 'TITLE-2']
['Span1', 'VALUE-10']
['Span2', 'VALUE-20']
['Span3', 'VALUE-30']
['Title', 'TITLE-3']
['Span1', 'VALUE-100']
['Span2', 'VALUE-200']
['Span3', 'VALUE-300']
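I am scraping information from a website. When the site loads, I may find one div with the class 'SOME_DIV', two divs, three divs, or even more, or none at all (0).
If there are 3 divs with the class 'SOME_DIV' when webdriver loads the page, I want to get the information from all of them.
At the moment I can only get the data of the first div, using all_divs[0].find_all. I want to get the data of the other divs as well (if they exist), but I don't know how many divs there will be until the page has loaded.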
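Answered 2020-04-24 07:14:36
You can take the length of all_divs and use a for loop with the corresponding index to scrape and parse the data from every div.
See the sample code below.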
all_divs = html_source_bs.find_all('div', class_='SOME_DIV')
span_title = []
span_1 = []
span_2 = []
span_3 = []
for i in range(len(all_divs)):
    # Index with i (not 0) so that every div is visited
    span_title.append(all_divs[i].find('span', class_='class_title'))
    span_1.append(all_divs[i].find_all('span', class_=lambda c: c and '_some_class1' in c))
    # Add span_2 and span_3 here in the same way
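For reference, here is a minimal self-contained sketch of the same idea that iterates over the divs directly instead of using an index, and prints rows in the format the question asks for. It runs against the sample markup from the question rather than a live page. The names `SAMPLE_HTML` and `extract_rows` are made up for this illustration, and the class name is spelled `SOME_dIV` as in the sample HTML. BeautifulSoup matches class names case-sensitively, so it has to be spelled exactly as on the real page.

```python
from bs4 import BeautifulSoup

# Sample markup from the question; on the real site this would be
# the page_source string returned by the Selenium driver.
SAMPLE_HTML = """
<div class="SOME_dIV">
<span class="class_title">TITLE-1</span>
<span class="_some_class1">VALUE-1</span>
<span class="_some_class2">VALUE-2</span>
<span class="_some_class3">VALUE-3</span>
</div>
<div class="SOME_dIV">
<span class="class_title">TITLE-2</span>
<span class="_some_class1">VALUE-10</span>
<span class="_some_class2">VALUE-20</span>
<span class="_some_class3">VALUE-30</span>
</div>
<div class="SOME_dIV">
<span class="class_title">TITLE-3</span>
<span class="_some_class1">VALUE-100</span>
<span class="_some_class2">VALUE-200</span>
<span class="_some_class3">VALUE-300</span>
</div>
"""

# (label, class fragment) pairs, in the order they should be printed
FIELDS = (
    ("Span1", "_some_class1"),
    ("Span2", "_some_class2"),
    ("Span3", "_some_class3"),
)


def extract_rows(html, div_class="SOME_dIV"):
    """Return [label, value] rows for every matching div (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # find_all returns an empty list when nothing matches,
    # so a page with zero divs simply produces no rows.
    for div in soup.find_all("div", class_=div_class):
        title = div.find("span", class_="class_title")
        rows.append(["Title", title.get_text(strip=True) if title else ""])
        for label, fragment in FIELDS:
            # Substring match on the class, as in the question's lambda;
            # "c and" guards against spans that have no class attribute.
            for span in div.find_all("span", class_=lambda c: c and fragment in c):
                rows.append([label, span.get_text(strip=True)])
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

Because the loop runs once per div that was found, the same code handles 0, 1, 3, or more divs without knowing the count in advance.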
Answered 2020-04-24 07:45:01
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''
<div class="SOME_dIV">
<span class="class_title">TITLE-1</span>
<span class="_some_class1">VALUE-1</span>
<span class="_some_class2">VALUE-2</span>
<span class="_some_class3">VALUE-3</span>
</div>
<div class="SOME_dIV">
<span class="class_title">TITLE-2</span>
<span class="_some_class1">VALUE-10</span>
<span class="_some_class2">VALUE-20</span>
<span class="_some_class3">VALUE-30</span>
</div>
<div class="SOME_dIV">
<span class="class_title">TITLE-3</span>
<span class="_some_class1">VALUE-100</span>
<span class="_some_class2">VALUE-200</span>
<span class="_some_class3">VALUE-300</span>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.selects('div.SOME_dIV')
titles = divs.select('span.class_title')
for title in titles:
    print(title.text, title.nexts.text)
Result:
TITLE-1 ['VALUE-1', 'VALUE-2', 'VALUE-3']
TITLE-2 ['VALUE-10', 'VALUE-20', 'VALUE-30']
TITLE-3 ['VALUE-100', 'VALUE-200', 'VALUE-300']
There are more examples here: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
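If you would rather stay with BeautifulSoup, the same title-to-values grouping can be sketched with sibling navigation. This is only an illustration under the assumption that the value spans always follow the title span inside the same div, as in the sample markup. `group_by_title` and `SAMPLE_HTML` are names invented for this example.

```python
from bs4 import BeautifulSoup

# Sample markup from the question
SAMPLE_HTML = """
<div class="SOME_dIV">
<span class="class_title">TITLE-1</span>
<span class="_some_class1">VALUE-1</span>
<span class="_some_class2">VALUE-2</span>
<span class="_some_class3">VALUE-3</span>
</div>
<div class="SOME_dIV">
<span class="class_title">TITLE-2</span>
<span class="_some_class1">VALUE-10</span>
<span class="_some_class2">VALUE-20</span>
<span class="_some_class3">VALUE-30</span>
</div>
<div class="SOME_dIV">
<span class="class_title">TITLE-3</span>
<span class="_some_class1">VALUE-100</span>
<span class="_some_class2">VALUE-200</span>
<span class="_some_class3">VALUE-300</span>
</div>
"""


def group_by_title(html, div_class="SOME_dIV"):
    """Map each title to the texts of the spans that follow it (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for div in soup.find_all("div", class_=div_class):
        title = div.find("span", class_="class_title")
        if title is None:
            # Skip divs without a title span
            continue
        # find_next_siblings returns the span tags that come after
        # the title at the same nesting level
        values = [s.get_text(strip=True) for s in title.find_next_siblings("span")]
        grouped[title.get_text(strip=True)] = values
    return grouped


if __name__ == "__main__":
    for title, values in group_by_title(SAMPLE_HTML).items():
        print(title, values)
```

Note that a dict keeps only one entry per title, so if two divs could share the same title text, a list of (title, values) pairs would be the safer return type.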
https://stackoverflow.com/questions/61397668