BeautifulSoup only extracts the first 10 elements

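Stack Overflow user

Asked 2019-12-09 13:56:51

Answers: 1 · Views: 618 · Followers: 0 · Votes: 2

I am trying to extract information from the kununu page for Volkswagen, for example the "Pro" entries.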

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.kununu.com/de/volkswagen/kommentare'
page = requests.get(url)

soup = bs(page.text, 'html.parser')
divs = soup.find_all(class_="col-xs-12 col-lg-12")

for h2 in soup.find_all('h2', class_='h3', text=['Pro']):
    print(h2.find_next_sibling('p').get_text())

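But the output contains only the first 10 "Pro" entries. By default the page shows only the first 10 comments, yet all the hidden comments also sit under the class "col-xs-12 col-lg-12". Or maybe I am missing something. Could you help me extract all of the data, not just the first 10?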

python

beautifulsoup

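1 Answer

Stack Overflow user

Accepted answer

Answered 2019-12-09 14:04:58

You can mimic the XHR requests that the browser sends to load more comments dynamically.

Working code (note: it uses f-strings, so it needs Python 3.6+; on earlier versions, use .format() instead):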

from bs4 import BeautifulSoup
import requests


comments = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }

    page = 1
    while True:
        print(f"Processing page {page}..")

        url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
        response = session.get(url)

        soup = BeautifulSoup(response.text, 'html.parser')
        new_comments = [
            pro.find_next_sibling('p').get_text()
            for pro in soup.find_all('h2', text='Pro')
        ]
        if not new_comments:
            print(f"No more comments. Page: {page}")
            break

        comments += new_comments

        # just to see current progress so far
        print(comments)
        print(len(comments))

        page += 1

print(comments)

Note how a requests.Session() object is instantiated and reused when sending multiple requests to the same host (this provides a performance benefit).
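
The "request page 1, 2, 3, ... until a page comes back empty" pattern used above can be separated from the network code, which makes it easy to try offline. The following is only a minimal sketch under stated assumptions: `build_url`, `collect_all`, `fetch_items`, `max_pages`, and the `fake_site` data are hypothetical names introduced for illustration, not part of the original answer. The URL is built with .format(), so it also shows the pre-3.6 alternative to the f-string.

```python
from typing import Callable, List

BASE_URL = 'https://www.kununu.com/de/volkswagen/kommentare/{}'


def build_url(page: int) -> str:
    # str.format() equivalent of the f-string in the answer; works before Python 3.6 too.
    return BASE_URL.format(page)


def collect_all(fetch_items: Callable[[str], List[str]],
                max_pages: int = 1000) -> List[str]:
    """Collect items from consecutive pages until one page yields nothing.

    fetch_items is a hypothetical callable: it receives a page URL and returns
    the items found on that page (an empty list once the pages run out).
    max_pages is an assumed safety cap; the original loop has no such limit.
    """
    items: List[str] = []
    for page in range(1, max_pages + 1):
        new_items = fetch_items(build_url(page))
        if not new_items:
            break  # an empty page means there is nothing left to load
        items.extend(new_items)
    return items


# Offline stand-in for the real site (made-up data): two pages, then nothing.
fake_site = {
    build_url(1): ['good salary', 'nice colleagues'],
    build_url(2): ['flexible hours'],
}

all_comments = collect_all(lambda url: fake_site.get(url, []))
print(all_comments)
```

To use it against the real site, `fetch_items` could wrap the `session.get(url)` call and the BeautifulSoup list comprehension from the answer. The cap is a design choice: it guards against looping forever if the site never returns an empty page.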

Votes: 6

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/59250391
