Crawling Baidu Zhidao Q&A: Purpose, Page Analysis, Summary

DC童生

Published 2018-12-27 16:48:38

Included in the column: 机器学习原理 (Machine Learning Principles)

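Purpose

I have recently been developing a question-answering system, and obtaining data is a problem, so I want to use a crawler to collect questions and their best answers from Baidu Zhidao (百度知道).

Page analysis

First find the main listing page, then use it to find the link to each question.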

Main pages (to obtain the URLs); the pn parameter is 0 on the first page and 10 on the second:
Page 1: https://zhidao.baidu.com/search?word=5G&ie=gbk&site=-1&sites=0&date=4&pn=0
Page 2: https://zhidao.baidu.com/search?word=5G&ie=gbk&site=-1&sites=0&date=4&pn=10
Data pages (to obtain the question and the answer):
https://zhidao.baidu.com/question/1116524242596320819.html?fr=iks&word=5G&ie=gbk
https://zhidao.baidu.com/question/1760121401250070668.html?fr=iks&word=5G&ie=gbk
http://zhidao.baidu.com/question/461385357861423285.html?fr=iks&word=5G&ie=gbk
XPath of the question title: //*[@id="wgt-ask"]/h1
XPath of one answer's content block (the numeric id differs per answer): //*[@id="answer-2965098849"]/div[2]
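The two search URLs differ only in pn, so the listing pages can probably be generated rather than copied by hand. The following is a minimal sketch under that assumption: build_search_urls, page_size and the step of 10 are illustrative names and values inferred from the two example URLs, and encoding the keyword as GBK is a guess based on the ie=gbk parameter.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://zhidao.baidu.com/search"


def build_search_urls(word, pages, page_size=10):
    """Return the search URLs for the first `pages` result pages."""
    urls = []
    for page in range(pages):
        params = {
            "word": word,
            "ie": "gbk",
            "site": -1,
            "sites": 0,
            "date": 4,
            # Assumption: pn is the offset of the first result on the page.
            "pn": page * page_size,
        }
        # Assumption: ie=gbk means a non-ASCII keyword is percent-encoded as GBK.
        urls.append(SEARCH_URL + "?" + urlencode(params, encoding="gbk"))
    return urls


for url in build_search_urls("5G", 2):
    print(url)
```

For the keyword 5G and two pages this reproduces the two listing URLs shown above.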


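Problem 1: the text of the answer block splits into several lines, so the longest element of the resulting list has to be returned as the answer.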
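Picking the longest line can be isolated in a small helper. This is only a sketch: longest_line is a hypothetical name, the sample text is made up for illustration, and stripping whitespace and skipping blank lines are additions that the post itself does not mention.

```python
def longest_line(text):
    """Return the longest non-empty line of `text`, without surrounding whitespace."""
    lines = [line.strip() for line in text.split("\n")]
    lines = [line for line in lines if line]
    # max() returns the first of several equally long lines;
    # default="" covers text that has no non-empty line at all.
    return max(lines, key=len, default="")


# Made-up sample standing in for the text of an answer block.
sample = "Best answer\n\n5G is the fifth generation of mobile network technology.\nExpand\n"
print(longest_line(sample))
```

The heuristic assumes that the real answer is longer than any label or button text around it, which may not hold for very short answers.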

Then find the links and crawl the content behind each one.

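import requests
from bs4 import BeautifulSoup


def url_open(url):
    # The post never shows url_open; this requests-based version is an assumption.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    return response.content  # raw bytes; BeautifulSoup detects the encoding


# Search results page: collect the link to every question in the list.
url = "https://zhidao.baidu.com/search?word=5G&ie=gbk&site=-1&sites=0&date=4&pn=0"
data = url_open(url)
# An earlier lxml attempt used "//*[@class='dt mb-4 line']/a/href"; selecting an
# attribute in XPath needs "@href", and the post settles on BeautifulSoup instead.
soup = BeautifulSoup(data, 'html.parser')
# Each result title is a <dt> inside the element with id="wgt-list".
links = [dt.find("a")["href"] for dt in soup.find(id="wgt-list").find_all("dt")]
print(links)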
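The question links above appear with both http and https and carry tracking parameters, so the same question could plausibly turn up more than once when several listing pages are crawled. A minimal sketch for de-duplicating by question id follows; question_id and unique_question_urls are hypothetical helpers, the third sample link is an invented duplicate, and the assumption is that the number in /question/<id>.html identifies the question.

```python
import re
from urllib.parse import urlsplit

# Assumed URL shape, taken from the example links: /question/<digits>.html
QUESTION_PATH = re.compile(r"^/question/(\d+)\.html$")


def question_id(url):
    """Return the numeric question id of a question URL, or None if it has none."""
    match = QUESTION_PATH.match(urlsplit(url).path)
    return match.group(1) if match else None


def unique_question_urls(urls):
    """Keep the first URL seen for each question id, preserving order."""
    seen = set()
    result = []
    for url in urls:
        qid = question_id(url)
        if qid is None or qid in seen:
            continue
        seen.add(qid)
        result.append(url)
    return result


links = [
    "https://zhidao.baidu.com/question/1116524242596320819.html?fr=iks&word=5G&ie=gbk",
    "http://zhidao.baidu.com/question/461385357861423285.html?fr=iks&word=5G&ie=gbk",
    # Invented duplicate: the first question again, this time over http.
    "http://zhidao.baidu.com/question/1116524242596320819.html?fr=iks&word=5G&ie=gbk",
]
print(unique_question_urls(links))
```

Links that do not match the assumed pattern are dropped rather than crawled, which keeps non-question results out of the data set.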

# Question page: extract the question title and the answer text.
# url_open and BeautifulSoup are the same as for the search results page.
url = "http://zhidao.baidu.com/question/461385357861423285.html?fr=iks&word=5G&ie=gbk"
data = url_open(url)
soup = BeautifulSoup(data, 'html.parser')

question = soup.find(class_="ask-title").text
content = soup.find(class_="line content").text
# The block's text splits into several lines; keep the longest one as the answer.
answer = max(content.split("\n"), key=len)
print(question)
print(answer)

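Summary

This should have been a very simple little crawler, yet it took more than two hours. Mastering one approach thoroughly is enough: I was tempted to use a framework, then XPath, and in the end BeautifulSoup got the job done cleanly.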

Originally published: 2018.12.19
