After a rough pass through the official requests documentation and the BeautifulSoup documentation, this post puts that knowledge to work by scraping and analyzing article information from freebuf.com.
Open Freebuf, go to the WEB security column, and view the page source: the information we want is all there, namely the article title, author, article URL, summary, view count, and publication time.
The natural approach is to fetch the page source with requests, extract the desired fields with BeautifulSoup, and finally save the results locally. The code below follows exactly this plan.
II. Implementation
First we build the HTTP request. Since most sites have some anti-crawler measures, we add browser-like request headers and randomized delays, and handle request exceptions so that a failed request is retried instead of crashing the script.
import random
import time

import requests


def get_html(url, data=None):
    # Browser-like headers so the request is less likely to be rejected by
    # basic anti-crawler checks.
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
    }
    # Randomized per-request timeout, in seconds.
    timeout = random.choice(range(80, 100))
    while True:
        try:
            response = requests.get(url, headers=header, timeout=timeout)
            response.encoding = 'utf-8'
            break
        # requests wraps the low-level socket/http.client errors in its own
        # exception types, so those are what we catch; sleep a random interval
        # before retrying.
        except requests.exceptions.Timeout as e:
            print(e)
            time.sleep(random.choice(range(20, 60)))
        except requests.exceptions.ConnectionError as e:
            print(e)
            time.sleep(random.choice(range(0, 60)))
        except requests.exceptions.ChunkedEncodingError as e:
            print(e)
            time.sleep(random.choice(range(30, 60)))
        except requests.exceptions.RequestException as e:
            print(e)
            time.sleep(random.choice(range(20, 60)))
    return response.text
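Before moving on to the parser, it can help to smoke-test the request helper on a single page; a minimal check (the page-1 URL simply follows the format used in the main function later):

# Quick sanity check (not part of the final script): fetch one column page
# and peek at the returned HTML.
page = get_html('https://www.freebuf.com/articles/web/page/1')
print(len(page))     # length of the returned document
print(page[:200])    # first characters, enough to confirm we got real HTML back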
Once the request succeeds, we have the page HTML. Next we use BeautifulSoup to pull out the fields we want. Because the page structure is simple and no element is missing, a single for loop is enough to pack each article's fields into a dict and append it to a list.
from bs4 import BeautifulSoup


def get_data(html_text):
    result = []
    bs = BeautifulSoup(html_text, "html.parser")
    # Each selector targets one field of the article entries under #timeline.
    titles = bs.select('#timeline > div > div.news-info > dl > dt > a')
    urls = bs.select('#timeline > div > div.news-info > dl > dt > a')
    descs = bs.select('#timeline > div > div.news-info > dl > dd.text')
    writers = bs.select('#timeline > div > div.news-info > dl > dd:nth-child(2) > span.name > a')
    pvs = bs.select('#timeline > div > div.news-info > div > span.look > strong:nth-child(1)')
    uptimes = bs.select('#timeline > div > div.news-info > dl > dd:nth-child(2) > span.time')
    # Walk the parallel lists and build one dict per article.
    for title, writer, url, desc, pv, uptime in zip(titles, writers, urls, descs, pvs, uptimes):
        data = {
            'title': title.get_text(),
            'writer': writer.get_text(),
            'url': url.get('href'),    # same <a> tag as the title, but here we take its href
            'desc': desc.get_text(),
            'pv': pv.get_text(),
            'uptime': uptime.get_text()
        }
        result.append(data)
    return result
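To make the selectors concrete, here is a hypothetical, heavily simplified fragment of the article-list markup they assume, together with the record get_data would build from it (the real Freebuf page has more attributes and nesting):

# Hypothetical, simplified markup mirroring the structure the selectors expect.
sample_html = '''
<div id="timeline">
  <div>
    <div class="news-info">
      <dl>
        <dt><a href="https://www.freebuf.com/articles/web/0000.html">Sample title</a></dt>
        <dd><span class="name"><a>Sample author</a></span> <span class="time">2018-08-20</span></dd>
        <dd class="text">Sample summary</dd>
      </dl>
      <div><span class="look"><strong>1024</strong></span></div>
    </div>
  </div>
</div>
'''
print(get_data(sample_html))
# [{'title': 'Sample title', 'writer': 'Sample author',
#   'url': 'https://www.freebuf.com/articles/web/0000.html',
#   'desc': 'Sample summary', 'pv': '1024', 'uptime': '2018-08-20'}]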
Since the data has just been processed item by item, every dict in the list is one properly structured article record. Next we write the code that saves the cleaned-up data to a local CSV file.
import csv


def data_output(info, filename):
    with open(filename, 'a', encoding='utf-8', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        # One CSV row per article dict, in a fixed column order.
        for item in info:
            row = [item['title'], item['writer'], item['url'], item['desc'], item['pv'], item['uptime']]
            f_csv.writerow(row)
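Since each record returned by get_data is already a dict, csv.DictWriter is a natural alternative that also writes a header row. A minimal sketch (the helper name is made up here; note it opens the file in 'w' mode, which overwrites rather than appends):

import csv

# Alternative sketch: map fields by name instead of by position.
def data_output_dictwriter(info, filename):
    fields = ['title', 'writer', 'url', 'desc', 'pv', 'uptime']
    with open(filename, 'w', encoding='utf-8', errors='ignore', newline='') as f:
        f_csv = csv.DictWriter(f, fieldnames=fields)
        f_csv.writeheader()
        f_csv.writerows(info)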
This completes the pipeline sketched at the beginning (fetch with requests -> parse with BeautifulSoup -> save locally). Finally, we build the main entry point that ties the pieces together.
if __name__ == '__main__':
    # Crawl the first two pages of the WEB security column.
    for num in range(1, 3):
        url = 'https://www.freebuf.com/articles/web/page/' + str(num)
        html_text = get_html(url)
        result = get_data(html_text)
        data_output(result, 'freebuf.csv')
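After a run, a quick way to confirm the output is to read freebuf.csv back and count the saved rows (a small check added for illustration, not part of the original script):

import csv

# Hypothetical check: load the generated file and count the records.
with open('freebuf.csv', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(len(rows), 'articles saved')
print(rows[0])    # first record: title, writer, url, desc, pv, uptime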
Result screenshot