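Background: Web crawlers have become the main way to collect data from the Internet automatically. Requests is a third-party Python module that covers everyday HTTP requests and is simple to use. An earlier article introduced how to call the Requests library (see the further reading at the end of this article); this one moves on to hands-on examples.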
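1 A general code framework for fetching web pages
2 Fetching a JD.com product page
3 Fetching an Amazon product page
4 Submitting search keywords to Baidu and 360
5 Fetching and saving an image from the web
6 Automatically looking up the location of an IP address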
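
1 A general code framework for fetching web pages

import requests

url = "http://www.baidu.com"
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()              # raise an exception for 4xx/5xx status codes
    r.encoding = r.apparent_encoding  # use the encoding detected from the content
    print(r.text[:1000])
except Exception as exc:
    print('There was a problem: %s' % (exc))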
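
The framework above can be wrapped in a function so that the later examples do not have to repeat it. The following is a minimal sketch, not a definitive implementation: the name `get_html_text` and the choice to return `None` on failure are assumptions made for illustration, and it catches `requests.RequestException` (the base class of the exceptions Requests raises) rather than every `Exception`.

```python
import requests


def get_html_text(url, timeout=30):
    """Return the decoded page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return r.text
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        print(f"There was a problem: {exc}")
        return None
```

Calling `get_html_text("http://www.baidu.com")` would return the page text when the request succeeds, so the caller only needs to check for `None`.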

2 Fetching a JD.com product page

import requests

url = "https://item.jd.com/100014565820.html"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}
try:
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except Exception as exc:
    print('There was a problem: %s' % (exc))
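
The 'User-Agent' value must be supplied in the custom headers; otherwise the server returns a redirect to a login page instead of the product page. The header contents can be obtained with the help of https://curl.trillworks.com/, which converts a curl command (for example, one copied from the browser's developer tools) into Python code.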
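
To see what the custom header changes, the following minimal sketch compares the User-Agent that Requests would send by default with the customized one. It only prepares the two requests and sends nothing over the network; the variable names are illustrative and not part of the original example.

```python
import requests

URL = "https://item.jd.com/100014565820.html"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"
)

session = requests.Session()

# Prepared with the session defaults only: the User-Agent identifies the library.
default_req = session.prepare_request(requests.Request("GET", URL))

# Prepared with a custom header: the per-request value overrides the default.
custom_req = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": BROWSER_UA})
)

print(default_req.headers["User-Agent"])  # starts with "python-requests/"
print(custom_req.headers["User-Agent"])   # the browser string defined above
```

A server that filters on the User-Agent can tell the first request comes from a script, which is consistent with the login redirect described above.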

3 Fetching an Amazon product page

import requests

url = "https://www.amazon.cn/dp/B07FQKB4TM?_encoding=UTF8&ref_=sa_menu_kindle_l3_ki"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}
try:
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except Exception as exc:
    print('There was a problem: %s' % (exc))

4 Submitting search keywords to Baidu and 360

Baidu keyword interface: http://www.baidu.com/s?wd=keyword
360 keyword interface: http://www.so.com/s?q=keyword

import requests

# Baidu: the keyword is passed in the "wd" query parameter
url = "http://www.baidu.com/s"
kv = {'wd': 'Python'}
try:
    r = requests.get(url, params=kv, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.request.url)  # the final URL, including the encoded query string
    print(len(r.text))
except Exception as exc:
    print('There was a problem: %s' % (exc))

import requests

# 360: the keyword is passed in the "q" query parameter
url = "http://www.so.com/s"
kv = {'q': 'Python'}
try:
    r = requests.get(url, params=kv, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.request.url)
    print(len(r.text))
except Exception as exc:
    print('There was a problem: %s' % (exc))
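
The way `params` is turned into the keyword interfaces listed above can be checked without sending anything. The following minimal sketch prepares a request and reads back its URL; the `ENGINES` table and the `build_search_url` helper are hypothetical names introduced only for this illustration.

```python
import requests

# Search endpoints and keyword parameter names, taken from the two interfaces above.
ENGINES = {
    "baidu": ("http://www.baidu.com/s", "wd"),
    "360": ("http://www.so.com/s", "q"),
}


def build_search_url(engine, keyword):
    """Return the URL Requests would request for this keyword, without sending it."""
    base_url, param_name = ENGINES[engine]
    prepared = requests.Request("GET", base_url, params={param_name: keyword}).prepare()
    return prepared.url


print(build_search_url("baidu", "Python"))   # http://www.baidu.com/s?wd=Python
print(build_search_url("360", "Python"))     # http://www.so.com/s?q=Python
print(build_search_url("baidu", "网络爬虫"))  # non-ASCII keywords are percent-encoded
```

Passing the keyword through `params` rather than concatenating strings leaves the URL encoding to Requests, which matters for keywords that contain spaces or non-ASCII characters.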

5 Fetching and saving an image from the web

Format of a web image link: http://www.example.com/picture.jpg
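
import os
import requests

url = "http://image.ngchina.com.cn/2020/1125/20201125015958238.jpg"
root = "E:\\python123\\网络爬虫"
path = os.path.join(root, url.split('/')[-1])  # name the file after the last URL segment
try:
    if not os.path.exists(root):
        os.makedirs(root)  # also creates missing parent directories
    if not os.path.exists(path):
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, 'wb') as f:
            f.write(r.content)  # the with block closes the file automatically
        print("File saved successfully")
    else:
        print("File already exists")
except Exception as exc:
    print('There was a problem: %s' % (exc))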
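
The example above takes the file name from `url.split('/')[-1]`, which works for links in the plain format shown but would keep a query string if the link had one. The following is a minimal sketch of a more tolerant alternative using only the standard library; the helper name `filename_from_url`, its `default` argument, and the second test URL are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default  # fall back when the path ends with "/"


print(filename_from_url("http://image.ngchina.com.cn/2020/1125/20201125015958238.jpg"))
# 20201125015958238.jpg
print(filename_from_url("http://www.example.com/picture.jpg?size=large"))
# picture.jpg
```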

6 Automatically looking up the location of an IP address

Lookup site: https://m.ip138.com/ip.html

import requests

url = "https://m.ip138.com/iplookup.asp?ip="
ip = '202.204.80.112'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}
try:
    r = requests.get(url + ip, headers=headers, timeout=30)
    print(r.request.url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[2000:4000])  # print only a slice of the returned page
except Exception as exc:
    print('There was a problem: %s' % (exc))
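
Because the lookup URL is built by appending the address to a string, it can help to validate the address first. The following minimal sketch does this with the standard-library `ipaddress` module and lets Requests build the query string; the helper name `build_lookup_url` is hypothetical, and nothing is sent over the network.

```python
import ipaddress

import requests

LOOKUP_URL = "https://m.ip138.com/iplookup.asp"


def build_lookup_url(ip):
    """Validate an IP address string and return the lookup URL for it."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid IPv4/IPv6 address
    return requests.Request("GET", LOOKUP_URL, params={"ip": ip}).prepare().url


print(build_lookup_url("202.204.80.112"))
# https://m.ip138.com/iplookup.asp?ip=202.204.80.112

try:
    build_lookup_url("202.204.80.999")  # 999 is out of range for an IPv4 octet
except ValueError as exc:
    print(f"Rejected: {exc}")
```

The returned URL can then be passed to `requests.get` together with the `headers` dictionary used in the example above.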