I'm trying to write a Python web crawler, but for some reason, when I crawl a site such as Amazon, the only thing my program prints is 'None'.
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Apython&page=' + str(page) + '&keywords=python&ie=UTF8&qid=1482022018&spIA=B01M63XMN1,B00WFP9S2E'
        source = requests.get(url)
        plain_text = source.text
        obj = BeautifulSoup(plain_text, "html5lib")
        for link in obj.find_all('a'):
            href = link.get(url)
            print(href)
        page += 1

spider(1)
You need to take the request's headers and parameters into account. You can do the following to see the difference:
Without a User-Agent:
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Apython&page=1%20%27&keywords=python&ie=UTF8&qid=1482022018&spIA=B01M63XMN1,B00WFP9S2E%27
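(A note on how that traceback arises: requests does not raise on HTTP error status codes by itself, so the HTTPError above implies the status was checked explicitly. A minimal sketch, assuming raise_for_status() was used:)

import requests

url = 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Apython&page=1&keywords=python&ie=UTF8&qid=1482022018&spIA=B01M63XMN1,B00WFP9S2E'
r = requests.get(url)   # no User-Agent header sent
r.raise_for_status()    # Amazon answers with 503, so this raises HTTPError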
With a User-Agent:
import requests

# A minimal User-Agent is enough to get a normal response here.
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Apython&page=1%20%27&keywords=python&ie=UTF8&qid=1482022018&spIA=B01M63XMN1,B00WFP9S2E%27', headers=headers)
It works fine.
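There is also a second problem in the posted code: link.get(url) asks BeautifulSoup for an attribute literally named after the URL string, which no <a> tag has, so every lookup returns None. The attribute you want is 'href'. A minimal sketch of the crawler with both fixes applied (the User-Agent header plus link.get('href')):

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    headers = {'User-Agent': 'Mozilla/5.0'}  # identify the client so Amazon serves the page
    page = 1
    while page <= max_pages:
        url = ('https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Apython'
               '&page=' + str(page) + '&keywords=python&ie=UTF8'
               '&qid=1482022018&spIA=B01M63XMN1,B00WFP9S2E')
        source = requests.get(url, headers=headers)
        soup = BeautifulSoup(source.text, 'html5lib')
        for link in soup.find_all('a'):
            href = link.get('href')  # attribute name, not the page URL
            if href is not None:     # skip anchors that have no href at all
                print(href)
        page += 1

spider(1)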
You can read "How to prevent getting blacklisted while scraping" to understand why a User-Agent should be used.
https://stackoverflow.com/questions/41204559