Python Web Scraping for Beginners (1): Fetching the Page Source

小歪

Published 2018-04-04 12:12:25

Collected in the column: Python爬虫与算法进阶

As an example, let's scrape some data from Zhihu Daily (知乎日报): http://daily.zhihu.com/

1. Fetching the page source

import requests

url = 'http://daily.zhihu.com/'
res = requests.get(url).text
print(res)

I personally prefer requests. Requesting the page directly, however, returns a 500 error:

C:\Python35\python.exe F:/PyCharm/爬虫/daily.py
<html><body><h1>500 ServerError</h1>
An internal server error occured.
</body></html>

Process finished with exit code 0

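Judging from experience, Zhihu is blocking crawlers, so the request needs some disguise. Let's see what happens when we send a browser User-Agent: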

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
url = 'http://daily.zhihu.com/'
res = requests.get(url, headers=headers).text
print(res)

This time the response contains the data we need:

C:\Python35\python.exe F:/PyCharm/爬虫/daily.py
<!DOCTYPE html><html><head><title>知乎日报 - 每天 3 次，每次 7 分钟</title><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta name="description" content="在中国，资讯类移动应用的人均阅读时长是 5 分钟，而在知乎日报，这个数字是 21。以独有的方式为你提供最高质、最深度、最有收获的阅读体验。"><link rel="stylesheet" href="/css/base.auto.css"><link rel="stylesheet" href="/css/new_home_v3.auto.css"><script src="/js/jquery.1.9.1.js"></script><script src="/js/new_index_v3/home.js"></script><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"><base target="_blank"><style>h1,h2,h3{padding: 0;margin:0}</style><base target="_blank"></head><body class="home"><a href="javascript:;" title="回到顶部" class="back-to-top"></a><div><div><div><a href="javascript:;" data-offset="470"><span>浏览内容</span></a><a href="javascript:;" data-offset="0" class="active"><span>App 下载</span></a></div><h1 class="logo"><a href="http://daily.zhihu.com/" title="知乎日报">知乎日报</a></h1></div></div><div class="download">
...

But does this approach work for every website? The answer is no.
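The first run above printed a 500 error page, yet the script still exited with code 0, so a failure like that is easy to miss. Below is a minimal sketch of one way to surface it. The helper name fetch_html, its client parameter, and the 10-second timeout are assumptions made for this illustration and are not part of the original script.

```python
import requests

# Same browser User-Agent as in the example above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    )
}


def fetch_html(url, client=requests, timeout=10):
    """Return the page text, raising requests.HTTPError on a 4xx/5xx reply.

    `client` is anything with a requests-style get(), such as the requests
    module itself or a requests.Session (hypothetical parameter, added so
    the helper can be exercised without touching the network).
    """
    response = client.get(url, headers=HEADERS, timeout=timeout)
    # Turn an error page such as "500 ServerError" into an exception
    # instead of returning its HTML as if it were the real page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('http://daily.zhihu.com/') would then either return the HTML or raise an exception, rather than quietly printing an error page.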

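2. Setting up a proxy

If the same IP keeps scraping the same site, the site's server will eventually block it. The fix is to switch IPs by sending requests through a proxy: the target site then sees the proxy server's IP address instead of our real one.

http://www.xicidaili.com/nn/ Xici Proxy (西刺代理) lists many usable IPs in mainland China that can be used directly.

So how do we add a proxy to a crawler? Let's see what the official requests documentation says: http://docs.python-requests.org/zh_CN/latest/user/advanced.html#proxies

If you need to use a proxy, you can configure individual requests with the proxies argument to any request method: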

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)
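The keys of the proxies dictionary are URL schemes: requests uses the "http" entry for http:// URLs and the "https" entry for https:// URLs. The sketch below is a simplified model of that lookup, not the library's own code. pick_proxy is a hypothetical helper written only for illustration, and it ignores the more specific key forms that requests also accepts.

```python
from urllib.parse import urlsplit

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}


def pick_proxy(url, proxies):
    """Return the proxy whose key matches the URL's scheme, or None."""
    scheme = urlsplit(url).scheme
    return proxies.get(scheme)


print(pick_proxy("http://example.org", proxies))   # http://10.10.1.10:3128
print(pick_proxy("https://example.org", proxies))  # http://10.10.1.10:1080
```

For a plain http:// page such as http://daily.zhihu.com/, only the "http" entry is used.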

Usage is simple: just pass the proxies argument.

import requests

proxies = {
    "http": "http://121.201.24.248:8088",
    "https": "http://36.249.194.52:8118",
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
url = 'http://daily.zhihu.com/'
res = requests.get(url, headers=headers, proxies=proxies).text
print(len(res))

To keep the test simple, we only print the length of the returned data:

C:\Python35\python.exe F:/PyCharm/爬虫/daily.py
10830

Process finished with exit code 0

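The page was fetched successfully through the proxy, and the response is 10830 characters long. Now let's deliberately get a proxy address wrong and see what happens: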

import requests

proxies = {
    # Note: this address is written with a full-width colon (：), as in the run that produced the traceback below
    "http": "http://121.201.24.248：8088",
    "https": "http://36.249.194.52: 222",
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
url = 'http://daily.zhihu.com/'
res = requests.get(url,headers=headers,proxies=proxies).text
print(len(res))

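Here "https": "http://36.249.194.52:8118" was changed to "https": "http://36.249.194.52: 222", and the run below fails to fetch the page. Note, however, that the URL being requested uses http://, so requests routes it through the "http" entry, and that is the proxy named in the traceback: its address is written with a full-width colon (：), so the whole string is treated as a host name that cannot be resolved. Either way the lesson holds: when a crawler that goes through a proxy starts raising exceptions, check whether the proxy address is wrong or the proxy has stopped working. You could also write a crawler that continuously collects fresh proxy IPs to use.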

Traceback (most recent call last):
  File "F:/PyCharm/爬虫/daily.py", line 9, in <module>
    res = requests.get(url,headers=headers,proxies=proxies).text
  File "C:\Python35\lib\site-packages\requests\api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python35\lib\site-packages\requests\api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python35\lib\site-packages\requests\adapters.py", line 485, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPConnectionPool(host='121.201.24.248：8088', port=80): Max retries exceeded with url: http://daily.zhihu.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000000003860DA0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',)))
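The traceback above reports host='121.201.24.248：8088', port=80, which suggests the address was never split into a host and a port. A cheap sanity check before using a proxy address can catch typos like this, and trying several candidates in turn is one way to cope with proxies that have gone stale. The sketch below assumes plain HTTP(S) proxies and Python 3.7 or later. Both function names are hypothetical, and `get` stands for a requests-style callable such as requests.get that the caller passes in.

```python
from urllib.parse import urlsplit


def proxy_url_problem(proxy_url):
    """Return a short description of what is wrong with proxy_url, or None."""
    if not proxy_url.isascii():  # str.isascii() needs Python 3.7+
        return "non-ASCII character (for example a full-width colon)"
    if any(ch.isspace() for ch in proxy_url):
        return "whitespace inside the URL"
    try:
        parts = urlsplit(proxy_url)
        port = parts.port  # raises ValueError for a malformed port
    except ValueError as exc:
        return str(exc)
    if parts.scheme not in ("http", "https"):  # assumption: HTTP(S) proxies only
        return "scheme should be http or https"
    if not parts.hostname:
        return "missing host"
    if port is None:
        return "missing port"
    return None


def fetch_with_fallback(url, candidates, get, **kwargs):
    """Try each candidate proxy in turn and return the first response.

    The exceptions raised by requests derive from OSError, so proxy and
    connection failures are caught here and the next candidate is tried.
    """
    last_error = None
    for proxy_url in candidates:
        if proxy_url_problem(proxy_url) is not None:
            continue  # skip addresses that cannot work as written
        try:
            return get(url, proxies={"http": proxy_url, "https": proxy_url}, **kwargs)
        except OSError as exc:
            last_error = exc
    raise RuntimeError("no working proxy for " + url) from last_error
```

With requests installed, a call might look like fetch_with_fallback(url, proxy_list, requests.get, headers=headers, timeout=10), where proxy_list is whatever list of addresses you have collected.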

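3. Simulating a login

Some sites only show their content after you log in. Zhihu is one example: fetching the Zhihu home page directly with requests returns a page that asks you to log in, and the data is only visible once you have.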

<button type="button" class="signin-switch-button">手机验证码登录</button>
<a href="#">无法登录？</a>
</div>
<div data-za-module="SNSSignIn">
<span class="name js-toggle-sns-buttons">社交帐号登录</span>
<div>
</div>

Back to the official documentation again: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html#cookie

If a response contains some cookies, you can quickly access them:

>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)
>>> r.cookies['example_cookie_name']
'example_cookie_value'

To send your own cookies to the server, you can use the cookies parameter:

>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')
>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'
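The login script further down uses requests.Session rather than bare requests.get calls, because a session keeps its cookies and sends them again on later requests. As a small offline illustration, the sketch below stores a cookie on a session and prepares a request without sending it, so the outgoing Cookie header can be inspected. The cookie name and URL are reused from the documentation excerpt above.

```python
import requests

session = requests.Session()
# Store a cookie on the session; it is attached to later requests
# made through this same session object.
session.cookies.set("cookies_are", "working")

# Build the request without sending it, to inspect what would go out.
request = requests.Request("GET", "http://httpbin.org/cookies")
prepared = session.prepare_request(request)
print(prepared.headers["Cookie"])
```

After a real login, the cookies set by the server end up in session.cookies in the same way, which is what lets the later requests in the script count as logged in.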

For a detailed walkthrough of the analysis, see the article and video by xchaoinfo, which explain it very clearly: https://zhuanlan.zhihu.com/p/25633789. The code is below.

import requests
from bs4 import BeautifulSoup
import os, time
import re
# import http.cookiejar as cookielib

# Build the request headers
agent = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
headers = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    'User-Agent': agent
}

######### Create the session used for all network requests
session = requests.Session()
# session.cookies = cookielib.LWPCookieJar(filename='zhihucookie')
# try:
#     session.cookies.load(ignore_discard=True)
# except:
#     print('could not load the cookie file')

############ Get the xsrf_token
homeurl = 'https://www.zhihu.com'
homeresponse = session.get(url=homeurl, headers=headers)
homesoup = BeautifulSoup(homeresponse.text, 'html.parser')
xsrfinput = homesoup.find('input', {'name': '_xsrf'})
xsrf_token = xsrfinput['value']
print("xsrf_token obtained: ", xsrf_token)

########## Download the captcha image
randomtime = str(int(time.time() * 1000))
captchaurl = 'https://www.zhihu.com/captcha.gif?r=' + randomtime + "&type=login"
captcharesponse = session.get(url=captchaurl, headers=headers)
with open('checkcode.gif', 'wb') as f:
    f.write(captcharesponse.content)  # the with block closes the file
# os.startfile('checkcode.gif')
captcha = input('Enter the captcha: ')
print(captcha)

########### Start logging in
headers['X-Xsrftoken'] = xsrf_token
headers['X-Requested-With'] = 'XMLHttpRequest'
loginurl = 'https://www.zhihu.com/login/email'
postdata = {
    '_xsrf': xsrf_token,
    'email': 'your_email@qq.com',
    'password': 'your_password'
}
loginresponse = session.post(url=loginurl, headers=headers, data=postdata)
print('Server response status code:', loginresponse.status_code)
print(loginresponse.json())

# Login failed because of the captcha: presumably the captcha request made in this session has expired
if loginresponse.json()['r'] == 1:
    # Entering the captcha again makes the login go through. In other words, the first
    # attempt can skip the captcha or use a wrong one; only the second attempt counts.
    randomtime = str(int(time.time() * 1000))
    captchaurl = 'https://www.zhihu.com/captcha.gif?r=' + randomtime + "&type=login"
    captcharesponse = session.get(url=captchaurl, headers=headers)
    with open('checkcode.gif', 'wb') as f:
        f.write(captcharesponse.content)
    os.startfile('checkcode.gif')  # Windows only: opens the image in the default viewer
    captcha = input('Enter the captcha: ')
    print(captcha)
    postdata['captcha'] = captcha
    loginresponse = session.post(url=loginurl, headers=headers, data=postdata)
    print('Server response status code:', loginresponse.status_code)
    print(loginresponse.json())

########################## Save the cookies after logging in
# session.cookies.save()

############################ Check whether the login succeeded
profileurl = 'https://www.zhihu.com/settings/profile'
profileresponse = session.get(url=profileurl, headers=headers)
print('Profile page status code:', profileresponse.status_code)
profilesoup = BeautifulSoup(profileresponse.text, 'html.parser')
div = profilesoup.find('div', {'id': 'rename-section'})
print(div)
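The commented-out lines in the script hint at saving the cookies to a file so that a later run can skip the login. A minimal offline sketch of that round trip with the standard library's LWPCookieJar follows. The cookie is built by hand purely for illustration, make_cookie is a hypothetical helper, and the temporary directory is an assumption so the example does not write into the working directory.

```python
import http.cookiejar as cookielib
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (after a real login the server sets these)."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cookie_file = os.path.join(tempfile.mkdtemp(), "zhihucookie")

# Save: in the script above, a jar like this would be assigned to session.cookies.
jar = cookielib.LWPCookieJar(filename=cookie_file)
jar.set_cookie(make_cookie("example_cookie_name", "example_cookie_value", "www.zhihu.com"))
jar.save(ignore_discard=True)  # without this flag, session ("discard") cookies are skipped

# Load: a later run can start from the saved cookies.
restored = cookielib.LWPCookieJar(filename=cookie_file)
restored.load(ignore_discard=True)
print([(c.name, c.value) for c in restored])
```

Whether saved cookies are still accepted depends on how long the site keeps the login valid, so a real crawler would still need to fall back to logging in again.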

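That wraps up the first step of writing a crawler, fetching the page source. This part covered a lot, but in practice most sites can be scraped normally once you add a User-Agent and a proxy IP. The next part will show how to parse pages with XPath and extract the data we want.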

A quick plug: next Tuesday I will be giving an introductory talk on Python web scraping on 趣直播, and everyone is welcome to join. http://m.quzhiboapp.com/?liveId=522#!/intro/522

Originally published 2017-05-19 on the WeChat official account Python爬虫与算法进阶.
