1. Encoding problem
This is crawler code that collects reviews of the film 《邪不压正》 from Douban and extracts the data with regular expressions (I have only just learned regex). The code is as follows:
# -*- coding:utf-8 -*-
import urllib2
import re

def loadpage(page):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"}
    for start in range(page):
        # Douban review pages step by 20, so the offset is start * 20
        # (the original passed 0, 1, 2, ..., which refetches the first page).
        url = "https://movie.douban.com/subject/26366496/reviews?start=" + str(start * 20)
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        m = response.read().decode("utf-8")
        print m
        # NOTE: the HTML tags surrounding the non-greedy core of the original
        # pattern were lost when this post was archived; only r".*?" and the
        # re.S flag survive, so the pattern below is incomplete.
        pattern = re.compile(r".*?", re.S)
        item_list = pattern.findall(m)
        #print item_list
        fp = open("3.txt", 'a')
        fp.write(str(item_list))
        fp.close()  # was fp.close -- without the parentheses the file is never closed

if __name__ == "__main__":
    page = int(raw_input("Number of pages to crawl: "))
    loadpage(page)
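For reference, the non-greedy `(.*?)` plus `re.S` idiom the code relies on can be tried on a toy snippet (Python 3 here; the tag and class name are invented for the demo, not Douban's real markup):

```python
import re

# Toy HTML standing in for a review page; the markup is made up.
html = '<div class="short">first review</div>\n<div class="short">second\nreview</div>'

# re.S (DOTALL) lets "." match newlines, so a review spanning lines still matches;
# ".*?" is non-greedy, so each match stops at the first closing </div>.
pattern = re.compile(r'<div class="short">(.*?)</div>', re.S)
print(pattern.findall(html))  # ['first review', 'second\nreview']
```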
Main problem: I am already decoding as UTF-8, yet after the conversion the scraped data still comes out garbled.
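A likely cause, sketched below: in Python 2, `fp.write(str(item_list))` writes the list's repr, and the repr of a unicode string escapes every non-ASCII character as `\uXXXX`, so the file looks garbled even though the decode succeeded. Python 3's `ascii()` reproduces the same escaping; the fix is to write the strings themselves (the review strings below are stand-ins, not real extracted data):

```python
# Stand-ins for the extracted review strings:
reviews = ["不错", "一般"]

# repr-style escaping of non-ASCII text -- what Python 2's str(item_list) wrote:
print(ascii(reviews))  # ['\u4e0d\u9519', '\u4e00\u822c']

# Write the strings themselves, explicitly UTF-8 encoded, instead of the list's repr:
with open("3.txt", "a", encoding="utf-8") as fp:
    for r in reviews:
        fp.write(r + "\n")
```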
2. lxml parsing problem
This is a crawler that downloads images from a Baidu Tieba forum, using the "手机摄影吧" (mobile-photography forum) as an example. The code is as follows:
# -*- coding:utf-8 -*-
import os
import urllib
import urllib2
from lxml import etree

def loadmainpage(name, page):
    m = {"kw": name}
    keyword = urllib.urlencode(m)
    url = "https://tieba.baidu.com/f?" + keyword
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    }
    for pn in range(page):
        pn = pn * 50  # Tieba pages step by 50 threads
        fullurl = url + "&pn=" + str(pn)
        print fullurl
        request = urllib2.Request(fullurl, headers=headers)
        response = urllib2.urlopen(request)
        xmlpage(response)

def xmlpage(response):
    response = response.read()
    html = etree.HTML(response)
    links = html.xpath('//*[@id="thread_list"]/li/div/div/div/div/a/@href')
    print links
    for link in links:
        url = "http://tieba.baidu.com" + link  # was "tiebai.baidu.com" (typo)
        print url
        #loadpage(url)

def loadpage(url):
    # NOTE: headers is a local variable of loadmainpage() and get_image();
    # it is not in scope here, so calling this as written raises NameError.
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    html = etree.HTML(response.read())  # was etree.HTML(response): parse the body, not the response object
    # The original XPath put /@src inside the predicate
    # (img[@class="BDE_Image"/@src]), which is an XPath syntax error:
    imagelink = html.xpath('//div[@class="d_post_content_main"]//img[@class="BDE_Image"]/@src')
    for link in imagelink:
        print link
        get_image(link)

def get_image(link):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    }
    request = urllib2.Request(link, headers=headers)
    response = urllib2.urlopen(request).read()
    # NOTE: "./images/" + ".png" names every file the same,
    # so each download overwrites the previous one.
    file_name = open("./images/" + ".png", "wb")
    file_name.write(response)
    file_name.close()  # was file_name.close (missing parentheses)

if __name__ == "__main__":
    name = raw_input("Tieba forum name to crawl: ")
    page = int(raw_input("Number of pages to crawl: "))
    loadmainpage(name, page)
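One sketch of fixing the overwrite in get_image(): derive the file name from the image URL instead of the constant ".png". The helper name and the sample URL below are made up for illustration:

```python
import os

def image_path(link, folder="./images"):
    # Keep the last path segment of the URL as the file name,
    # so each image gets a distinct name.
    name = link.rsplit("/", 1)[-1]
    return os.path.join(folder, name)

# Hypothetical Tieba image URL, for illustration only:
print(image_path("https://imgsa.baidu.com/forum/abc123.jpg"))
```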
There are no syntax errors in the code, but the first XPath (meant to extract the links of all threads in the forum) matches nothing at all. I wrote one rule myself and also typed one in from a tutorial, and both behave the same way: nothing is matched.
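One plausible explanation (an assumption, since the failing response isn't shown): at the time, Tieba served the thread list inside HTML comments to non-browser clients, and lxml's XPath does not descend into comment nodes. A minimal sketch of the workaround, on a toy page instead of a live request:

```python
from lxml import etree

# Toy page imitating a thread list hidden inside an HTML comment:
page = '''<html><body><div>
<!--<ul id="thread_list"><li><a href="/p/123">a post</a></li></ul>-->
</div></body></html>'''

html = etree.HTML(page)
# The normal query finds nothing, because the list is inside a comment:
print(html.xpath('//*[@id="thread_list"]//a/@href'))  # []

# Re-parse each comment's text as HTML, then query that:
for comment in html.xpath('//comment()'):
    inner = etree.HTML(comment.text)
    print(inner.xpath('//*[@id="thread_list"]//a/@href'))  # ['/p/123']
```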