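I have an application running on Amazon Web Services that makes a request to a page
in order to pull its meta tags using requests. I found that the page allows curl
requests but not requests made with the requests library.
Works: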
curl https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/
Hangs forever:
import requests
requests.get('https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/')
What is the difference between curl and requests here? Should I just spawn a curl process to make my request?
Answered 2021-05-20 10:26:17
Both of the user agents below work. You can also use the user_agent module (available on PyPI) to generate random, valid web user agents.
import requests
agent = (
"Mozilla/5.0 (X11; Linux x86_64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/85.0.4183.102 Safari/537.36"
)
# or can use
# agent = "curl/7.61.1"
url = ("https://www.seattletimes.com/nation-world/"
"mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = requests.get(url, headers={'user-agent': agent})
Alternatively, using the user_agent module:
import requests
from user_agent import generate_user_agent
agent = generate_user_agent()
url = ("https://www.seattletimes.com/nation-world/"
"mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = requests.get(url, headers={'user-agent': agent})
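To explain further: requests sets a default User-Agent header (python-requests/<version>), and The Seattle Times is blocking that user agent. With requests it is easy to change the headers sent with a request, as shown above.
To see the default headers: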
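If the application makes many requests, setting the header once on a Session may be more convenient than passing headers= on every call. The following is a minimal sketch, not something from the original answer: make_session is a hypothetical helper name, and https://example.com/ is a placeholder URL. The request is only prepared, never sent, so the merged headers can be inspected without touching the network.

```python
import requests

# Browser-like agent string, the same one used in the answer above.
AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/85.0.4183.102 Safari/537.36"
)


def make_session(agent=AGENT):
    """Hypothetical helper: a Session that sends `agent` on every request."""
    session = requests.Session()
    # Session-level headers are merged into each request the session makes.
    session.headers.update({"User-Agent": agent})
    return session


session = make_session()

# Prepare (but do not send) a request to see which headers would go out.
# The URL is a placeholder; nothing is fetched here.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/"))
print(prepared.headers["User-Agent"])
```

Any later session.get(url) call would then carry the same User-Agent without further arguments.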
r = requests.get('https://google.com/')
print(r.request.headers)
>>> {'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
With the updated header:
agent = "curl/7.61.1"
r = requests.get('https://google.com/', headers={'user-agent': agent})
print(r.request.headers)
>>>{'user-agent': 'curl/7.61.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
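Separately from the user agent, the original call could hang forever because requests applies no timeout unless one is passed. The sketch below is an addition, not part of the original answer: fetch is a hypothetical helper, and the (5, 15) connect/read timeout values are arbitrary assumptions to tune for your application.

```python
import requests


def fetch(url, agent, timeout=(5, 15)):
    """Hypothetical helper: GET `url` with a custom User-Agent.

    `timeout` is a (connect, read) pair in seconds; the values are
    illustrative. Returns the body text, or None on any requests error.
    """
    try:
        r = requests.get(url, headers={"User-Agent": agent}, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions instead of returning them.
        r.raise_for_status()
        return r.text
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection errors, bad URLs, and HTTP errors.
        print(f"request failed: {exc}")
        return None
```

With this in place, a blocked or stalled server should produce an error after the timeout rather than an indefinitely hanging process.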
https://stackoverflow.com/questions/67591766