
Python: Hands-on Web Scraping with the Requests Library

Author: Exploring
Published: 2022-09-20 13:58:00
This article is included in the column: 数据处理与编程实践 (Data Processing and Programming Practice)

Background: Web crawlers have become the main way to collect data from the internet automatically. Requests is a third-party Python module that covers everyday HTTP needs and is simple to use. A previous post covered how to call the Requests library (see the further reading at the end); this one moves on to practice.

1 A generic code framework for fetching web pages
2 Scraping a JD product page
3 Scraping an Amazon product page
4 Submitting search keywords to Baidu/360
5 Fetching and saving web images
6 Automatic lookup of an IP address's location

1 A generic code framework for fetching web pages
import requests

url = "http://www.baidu.com"

try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()              # raise an HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding  # decode with the encoding guessed from the body
    print(r.text[:1000])
except Exception as exc:
    print('There was a problem: %s' % (exc))
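The framework above can be folded into a reusable helper so each later example only has to pass a URL. A minimal sketch (the function name `get_html_text` is mine, not from the original):

```python
import requests

def get_html_text(url, timeout=30):
    """Return the decoded page text, or an error message on failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # 4xx/5xx -> exception
        r.encoding = r.apparent_encoding  # decode using the detected encoding
        return r.text
    except Exception as exc:
        return 'There was a problem: %s' % (exc)
```

Because every failure path is caught inside the function, the caller always gets a string back and never has to wrap the call in its own try/except.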
2 Scraping a JD product page
import requests

url = "https://item.jd.com/100014565820.html"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}

try:
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except Exception as exc:
    print('There was a problem: %s' % (exc))

A custom 'User-Agent' value must be supplied in the headers dict; otherwise JD returns a redirect to a login page. The headers content can be generated with https://curl.trillworks.com/.
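The reason the custom header helps: by default Requests announces itself as python-requests, which some sites block or redirect. Both the default and the overridden value can be inspected offline, without sending any request (a sketch):

```python
import requests

# The User-Agent Requests sends when you don't override it:
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.x'

# A browser-like header dict, as in the example above:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
req = requests.Request('GET', 'https://item.jd.com/100014565820.html',
                       headers=headers).prepare()
print(req.headers['User-Agent'])  # the value that would actually be sent
```

`Request(...).prepare()` builds the outgoing request object without sending it, which makes it a convenient way to check exactly what headers a site will see.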

3 Scraping an Amazon product page
import requests

url = "https://www.amazon.cn/dp/B07FQKB4TM?_encoding=UTF8&ref_=sa_menu_kindle_l3_ki"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}

try:
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except Exception as exc:
    print('There was a problem: %s' % (exc))
4 Submitting search keywords to Baidu/360

Baidu's keyword interface: http://www.baidu.com/s?wd=keyword

360's keyword interface: http://www.so.com/s?q=keyword

import requests
url = "http://www.baidu.com/s"
kv = {'wd':'Python'}

try:
    r = requests.get(url, params=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.request.url)
    print(len(r.text)) 
except Exception as exc:
    print('There was a problem: %s' % (exc))
import requests
url = "http://www.so.com/s"
kv = {'q':'Python'}

try:
    r = requests.get(url, params=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.request.url)
    print(len(r.text)) 
except Exception as exc:
    print('There was a problem: %s' % (exc))
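In both snippets, the `params` dict is URL-encoded and appended to the base URL automatically, which is what `r.request.url` prints. This encoding step can be verified without any network call, using Requests' own URL preparation (a sketch):

```python
from requests.models import PreparedRequest

# Build the final URL exactly as requests.get(url, params=kv) would
req = PreparedRequest()
req.prepare_url("http://www.baidu.com/s", {"wd": "Python"})
print(req.url)  # http://www.baidu.com/s?wd=Python
```

Passing keywords through `params` instead of string concatenation also gets special characters (spaces, Chinese text) percent-encoded correctly for free.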
5 Fetching and saving web images

Format of a web image link: http://www.example.com/picture.jpg

import requests, os

url = "http://image.ngchina.com.cn/2020/1125/20201125015958238.jpg"

root = "E:\\python123\\网络爬虫"
path = os.path.join(root, url.split('/')[-1])  # file name taken from the URL

try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()            # don't save an error page as an image
        with open(path, 'wb') as f:     # the with block closes the file automatically
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except Exception as exc:
    print('There was a problem: %s' % (exc))
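`r.content` buffers the whole file in memory before writing, which is fine for one image but wasteful for large downloads. Requests also supports streaming with `stream=True` and `iter_content`; a sketch of the same save step (the helper name and chunk size are my choices, not from the original):

```python
import requests

def save_file(url, path, chunk_size=8192):
    """Stream a download to disk instead of buffering it all in memory."""
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=chunk_size):
            f.write(chunk)
```

With `stream=True`, the body is only read as the loop consumes it, so peak memory stays at roughly one chunk regardless of file size.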
6 Automatic lookup of an IP address's location

Lookup site: https://m.ip138.com/ip.html

import requests

url = "https://m.ip138.com/iplookup.asp?ip="
ip = '202.204.80.112'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}

try:
    r = requests.get(url + ip, headers=headers)
    print(r.request.url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[2000:4000])
except Exception as exc:
    print('There was a problem: %s' % (exc))
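Since the IP is concatenated straight into the URL, it is worth validating it before querying; the standard-library `ipaddress` module does this in a few lines (the helper name is mine):

```python
import ipaddress

def is_valid_ip(s):
    """True if s parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(s)
        return True
    except ValueError:
        return False

print(is_valid_ip('202.204.80.112'))  # True
print(is_valid_ip('999.1.1.1'))       # False: 999 is out of range
```

Rejecting malformed input locally avoids sending a pointless request and then having to guess why the lookup page came back empty.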

References:

[1] 中国大学MOOC: Python网络爬虫与信息提取 (https://www.icourse163.org/course/BIT-1001870001)
[2] Python编程快速上手—让繁琐工作自动化 (https://ddz.red/AFTmO)
[3] requests 请求京东商品搜索页返回登录页面问题 (https://www.v2ex.com/t/540449)

Originally published 2020-11-29 on the WeChat public account 数据处理与编程实践.