Part 0: A First Look at Web Crawlers

Source: 企鹅号 (Tencent Penguin Account) - Joe干货柜

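Before the Preface

This is the first post in the #3y10h1wh# series. I started the series mainly to share what I learn with "friends all across the internet" and to hold myself accountable for my own studying.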

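Preface

This post is a first, hands-on introduction to the BeautifulSoup library: we scrape Baidu's "今日热点事件排行榜" (Today's Hot Events Ranking) and learn the basics along the way.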

Preparation

url1 Beautiful Soup:

https://www.crummy.com/software/BeautifulSoup/

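url2 Baidu "今日热点事件排行榜" (Today's Hot Events Ranking):

View the page source of url2 and find exactly where the keywords we want are located.

Contents of url2

As the figure shows, the keywords sit in the <a> tags inside the <table>.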
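To make the "<a> inside <table>" idea concrete, here is a minimal sketch that runs offline. The markup in SAMPLE_HTML is a made-up stand-in, not the real page source, and the names SAMPLE_HTML and link_texts are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the ranking page; the real page differs.
SAMPLE_HTML = """
<html><body>
<a href="/home">home</a>
<table>
  <tr><td><a href="/1">Topic A</a></td><td><a href="/s">search</a></td></tr>
  <tr><td><a href="/2">Topic B</a></td><td><a href="/n">新闻</a></td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag in the document
table = soup.find("table")

# find_all() returns every matching tag nested anywhere inside `table`
links = table.find_all("a")

# .string is the text of a tag that contains a single piece of text
link_texts = [str(a.string) for a in links]
print(link_texts)
```

The "home" link is not in the result because it lives outside the table, while labels such as "search" and "新闻" are still there. That is why the script needs a filtering step after find_all().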

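Code

# CrawBaiduTop.py

import requests
from bs4 import BeautifulSoup

# url2, the Baidu ranking page; its address is not given in this post, so fill it in before running
url = ""

# Empty list that will hold the entries
tops = []

# Fetch the page, with a 40-second timeout
r = requests.get(url, timeout=40)

# Raise an exception if the request failed (4xx or 5xx status)
r.raise_for_status()

# Use the encoding guessed from the response body instead of the one from the headers
r.encoding = r.apparent_encoding

# The HTML text of the page
html = r.text

# Parse the text as HTML and take the first <table>
table = BeautifulSoup(html, "html.parser").find("table")

# Walk every <a> inside the <table>
for words in table.find_all("a"):
    # Skip the navigation links; everything else is a ranking entry
    if words.string not in ("search", "新闻", "视频", "图片"):
        # append() adds a new object to the end of the list
        tops.append(words.string)

print(tops)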
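One thing to watch in the loop above: .string is None when an <a> tag is empty or holds more than one child, and find() returns None when the page has no <table>. The sketch below is one possible way to guard against both, not the only one. The function name extract_titles, the SKIP set, and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Link labels to drop: the same four the original condition checks for.
SKIP = {"search", "新闻", "视频", "图片"}


def extract_titles(html, skip=SKIP):
    """Return the text of every <a> in the first <table>, minus `skip`."""
    table = BeautifulSoup(html, "html.parser").find("table")
    if table is None:  # find() returns None when nothing matches
        return []
    titles = []
    for link in table.find_all("a"):
        # get_text() always returns a str, even for empty or nested tags
        text = link.get_text(" ", strip=True)
        if text and text not in skip:
            titles.append(text)
    return titles


# Hypothetical sample markup, for illustration only.
sample = """
<table>
  <tr><td><a href="/1">Topic A</a></td><td><a href="/s">search</a></td></tr>
  <tr><td><a href="/2">Topic <b>B</b></a></td><td><a href="/v">视频</a></td></tr>
  <tr><td><a href="/3"></a></td></tr>
</table>
"""

print(extract_titles(sample))
print(extract_titles("<p>no table here</p>"))
```

Keeping the parsing in a function that takes an HTML string means it can be tried on a saved copy of the page without sending a request each time.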

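Scraping result

Published: 2018-07-22 18:48:41
Original link: https://kuaibao.qq.com/s/20180722G1462K00?refer=cp_1026
The 腾讯云开发者社区 (Tencent Cloud Developer Community) is one of the distribution channels for accounts on the Tencent Content Open Platform (企鹅号), and republishes this content under the Tencent Content Open Platform Service Agreement.
If this content infringes your rights, please contact cloudcommunity@tencent.com to have it removed.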
