I'm trying to write a web scraper in Python that goes through all of an AliExpress seller's products. My problem is that when I browse the store without logging in, I end up redirected to the login page. I added a login step to my code, but it didn't help. I'd appreciate any suggestions.
My code:
import requests
from bs4 import BeautifulSoup
import re
import sys
from lxml import html

def go_through_paginator(link):
    source_code = requests.get(link, data=payload, headers=dict(referer=link))
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    print(soup)
    for page in soup.findAll('div', {'class': 'ui-pagination-navi util-left'}):
        for next_page in page.findAll('a', {'class': 'ui-pagination-next'}):
            next_page_link = "https:" + next_page.get('href')
            print(next_page_link)
            gather_all_products(next_page_link)

def gather_all_products(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item in soup.findAll('a', {'class': 'pic-rind'}):
        product_link = item.get('href')
    go_through_paginator(url)

payload = {
    "loginId": "EMAIL",
    "password": "LOGIN",
}

LOGIN_URL = 'https://login.aliexpress.com/buyer.htm?spm=2114.12010608.1000002.4.EihgQ5&return=https%3A%2F%2Fwww.aliexpress.com%2Fstore%2F1816376%3Fspm%3D2114.10010108.0.0.fs2frD&random=CAB39130D12E432D4F5D75ED04DC0A84'

session_requests = requests.session()
source_code = session_requests.get(LOGIN_URL)
source_code = session_requests.post(LOGIN_URL, data=payload)

URL = 'https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD'
source_code = requests.get(URL, data=payload, headers=dict(referer=URL))
plain_text = source_code.text
soup = BeautifulSoup(plain_text)

for L1 in soup.findAll('li', {'id': 'product-nav'}):
    for L1_link in L1.findAll('a', {'class': 'nav-link'}):
        link = "https:" + L1_link.get('href')
        gather_all_products(link)
Posted on 2017-06-13 20:28:25
Try setting cookie values such as xman_t and intl_common_forever from the response cookies.
I was trying to fetch all the product information directly. Before setting xman_t and intl_common_forever, I could only get 7 products. After setting them, I successfully got all 50.
Hope this helps you scrape their products.
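As a rough sketch of this idea (the cookie values are placeholders you would capture from your browser's dev tools, not real tokens): a requests.Session already keeps cookies it receives across requests, and you can also seed it explicitly with the cookies mentioned above before fetching the product pages.

    import requests

    session = requests.Session()

    # Seed the session with the anti-bot cookies; the values here are
    # placeholders - capture the real ones from a logged-in browser session.
    session.cookies.set("xman_t", "VALUE_FROM_BROWSER", domain=".aliexpress.com")
    session.cookies.set("intl_common_forever", "VALUE_FROM_BROWSER", domain=".aliexpress.com")

    # Every later session.get()/session.post() now sends these cookies,
    # along with any cookies the server sets in its responses.
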
Posted on 2022-06-08 05:01:40
Your question is a bit broad for Stack Overflow, but let's go through a few tips that can make a big difference.
What's happening here is that AliExpress suspects you are a bot and asks you to log in to continue.
To avoid that, you need to harden your scraper a bit. Let's take a look at a few tricks that are common for AliExpress.
First, you should clean up your URLs and remove the tracking parameters:
URL='https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD'
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# we don't need this parameter, it just helps AliExpress to identify us
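Stripping the tracker can be done by hand, but a small helper keeps it systematic. This is a sketch using only the standard library; the parameter name spm comes from the URL above, and you could extend the list with other trackers you spot.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def clean_url(url, tracking_params=("spm",)):
        """Drop known tracking query parameters from a URL."""
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in tracking_params]
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

    print(clean_url('https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD'))
    # → https://www.aliexpress.com/store/1816376
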
Then, we should strengthen our requests session with headers that look like a real browser's:
BASE_HEADERS = {
    # let's use headers of Chrome 96 on the Windows operating system to blend in:
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
# requests.session() takes no arguments; set default headers on the session instead:
session = requests.Session()
session.headers.update(BASE_HEADERS)
These two tips alone will greatly reduce the number of login redirects!
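Putting both tips together, here is a sketch that verifies what the hardened session would transmit, without actually sending anything over the network (the store URL is the cleaned one from above):

    import requests

    BASE_HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept-language": "en-US;en;q=0.9",
    }

    session = requests.Session()
    session.headers.update(BASE_HEADERS)

    # Prepare (but do not send) a request to inspect the headers it would
    # carry; note the cleaned URL without the spm tracking parameter.
    req = requests.Request("GET", "https://www.aliexpress.com/store/1816376")
    prepared = session.prepare_request(req)
    print(prepared.headers["user-agent"])
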
That's just the tip of the iceberg, though -- for more information, see the blog tutorial I wrote on how to scrape AliExpress.
https://stackoverflow.com/questions/41644475