How to scrape data from Amazon based on location?

Stack Overflow user
Asked on 2021-03-26 12:09:13
2 answers · 1.9K views · 0 followers · 3 votes

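Whenever I try to scrape amazon.com, I fail, because the product information changes depending on the delivery location set on amazon.com.

The information that changes is as follows:

  • 1-Price
  • 2-Shipping
  • 3-Customs fees
  • 4-Shipping status

Changing the location with selenium is easy, but it is very slow. That is why I need to do this with Scrapy or requests.

However, even though I mimic the browser's cookies and headers, amazon.com does not let me change the location.

There are two big problems.

  1. There is a piece of data called "ubid" that I cannot reproduce. Without this data, Amazon does not allow the location to be changed.
  2. Although I send the same header data, the output differs. Example: I use exactly the same headers as in the browser, but in the browser the response content type is JSON, while in my code it is text/html;charset=UTF-8.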
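
A mismatch like the one in point 2 can be detected before trying to parse the body. The snippet below is only a sketch: `is_json_response` is a hypothetical helper, not part of requests, and it assumes the value comes from something like `response.headers["Content-Type"]`.

```python
def is_json_response(content_type: str) -> bool:
    """Return True when a Content-Type header value denotes JSON."""
    # Drop parameters such as ";charset=UTF-8" and normalise case and spacing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


print(is_json_response("application/json;charset=UTF-8"))
print(is_json_response("text/html;charset=UTF-8"))
```

If the check fails for the address-change request, the server most likely answered with an HTML page instead of the JSON result, which suggests that the request was not accepted as a valid AJAX call.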

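It is very surprising that there is no information on this topic. It seems you cannot do location-based scraping on the world's largest shopping site.

If anyone knows the answer, please tell me. A solution using either Scrapy or requests would be enough. Seriously, I have not been able to solve this problem for a year.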

import requests
import urllib3

# verify=False is used below, so silence urllib3's certificate warning.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    

def location():
    headersdelivery = {
            'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
            'content-type':'application/x-www-form-urlencoded',
            'accept':'text/html,*/*',
            'x-requested-with':'XMLHttpRequest',
            'contenttype':'application/x-www-form-urlencoded;charset=utf-8',
            'origin':'https://www.amazon.com',
            'sec-fetch-site':'same-origin',
            'sec-fetch-mode':'cors',
            'sec-fetch-dest':'empty',
            'referer':'https://www.amazon.com/',
            'accept-encoding':'gzip, deflate, br',
            'accept-language':'tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7'
            }

    payload = {
    'locationType':'LOCATION_INPUT',
    'zipCode':'34249',
    'storeContext':'generic',
    'deviceType':'web',
    'pageType':'Gateway',
    'actionSource':'glow',
    'almBrandId':'undefined'}


    sessionid = requests.session()
    url = "https://www.amazon.com/gp/delivery/ajax/address-change.html"
    ulkesecmereq = sessionid.post(url, headers=headersdelivery, data=payload,verify=False)

    return sessionid


def response(locationsession):
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'}

    postdata = {
    'storeContext':'generic',
    'pageType':'Gateway'
    }
    req = locationsession.post("https://www.amazon.com/gp/glow/get-location-label.html",headers=headers, data=postdata, verify=False)
    print(req.content)


locationsession = location()
response(locationsession)

2 Answers

Stack Overflow user

Accepted answer

Posted on 2021-12-02 14:18:05

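First of all, you should get the token anti-csrftoken-a2z from the base Amazon page.

  1. Make a request to www.amazon.com with a specific user agent: Mozilla ...

  2. Get the JSON data with the XPath selector: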

//span[@id='nav-global-location-data-modal-action']/@data-a-modal

Example of the JSON returned by this selector:

{
  "width": 375,
  "closeButton": "false",
  "popoverLabel": "Choose your location",
  "ajaxHeaders": {
    "anti-csrftoken-a2z": "ajaxHeaders >> anti-csrftoken-a2z"
  },
  "name": "glow-modal",
  "url": "/gp/glow/get-address-selections.html?deviceType=desktop&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal",
  "footer": "<span class=\"a-declarative\" data-action=\"a-popover-close\" data-a-popover-close=\"{}\"><span class=\"a-button a-button-primary\"><span class=\"a-button-inner\"><button name=\"glowDoneButton\" class=\"a-button-text\" type=\"button\">Done</button></span></span></span>",
  "header": "Choose your location"
}
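
As a rough illustration of this step, the attribute can be read and decoded with the standard library alone. This is a sketch under assumptions: `ModalAttributeParser`, `extract_ajax_token` and `SAMPLE_HTML` are made-up names and data, and the real page markup may differ.

```python
import json
from html.parser import HTMLParser


class ModalAttributeParser(HTMLParser):
    """Collect the data-a-modal attribute of the location span."""

    def __init__(self):
        super().__init__()
        self.modal_json = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("id") == "nav-global-location-data-modal-action":
            # HTMLParser has already unescaped entities such as &quot;.
            self.modal_json = attributes.get("data-a-modal")


def extract_ajax_token(html: str) -> str:
    """Return the anti-csrftoken-a2z value embedded in the modal JSON."""
    parser = ModalAttributeParser()
    parser.feed(html)
    if not parser.modal_json:
        raise ValueError("location modal span not found")
    return json.loads(parser.modal_json)["ajaxHeaders"]["anti-csrftoken-a2z"]


# Made-up sample markup, shaped like the JSON example above.
SAMPLE_HTML = (
    '<span id="nav-global-location-data-modal-action" '
    'data-a-modal="{&quot;ajaxHeaders&quot;:{&quot;anti-csrftoken-a2z&quot;:'
    '&quot;sample-token&quot;}}"></span>'
)
print(extract_ajax_token(SAMPLE_HTML))  # sample-token
```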

  3. Set the headers for the next request:

headers = {
    "anti-csrftoken-a2z": "gMDCYRgjYFVWvjfmU70/qMURqYh7kAko11WlenYAAAAMAAAAAGGokFZyYXcAAAAA",
    "user-agent": "Mozilla ..."
}

  4. Make a request to https://www.amazon.com/gp/glow/get-address-selections.html?deviceType=desktop&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal with the headers set in the previous step and the response cookies from the first request.

Extract CSRF_TOKEN from the response with the regex: 'CSRF_TOKEN : "(.+?)"'
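
The extraction with this regex can be sketched in plain Python as follows. `extract_csrf_token` and `sample_body` are illustrative names, and the sample text is invented rather than a real Amazon response.

```python
import re

# Same pattern as quoted above; the non-greedy group stops at the first closing quote.
CSRF_TOKEN_PATTERN = re.compile(r'CSRF_TOKEN : "(.+?)"')


def extract_csrf_token(body: str) -> str:
    """Return the first CSRF_TOKEN value found in the response body."""
    match = CSRF_TOKEN_PATTERN.search(body)
    if match is None:
        raise ValueError("CSRF token not found")
    return match.group(1)


# Invented sample body for demonstration only.
sample_body = 'var config = { CSRF_TOKEN : "sample-csrf-token", other : "x" };'
print(extract_csrf_token(sample_body))  # sample-csrf-token
```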

  5. Set the headers for the next request:

headers = {
    "anti-csrftoken-a2z": "CSRF token from step 4",
    "user-agent": "Mozilla ..."
}

  6. Make a POST request to https://www.amazon.com/gp/delivery/ajax/address-change.html with the following form data:

{
        "locationType": "LOCATION_INPUT",
        "zipCode": "zip-code",
        "storeContext": "generic",
        "deviceType": "web",
        "pageType": "Gateway",
        "actionSource": "glow",
        "almBrandId": "undefined",
}

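Send it with the headers set in the previous step and the response cookies from the get-address-selections request.

If everything is OK, you should get a response like this: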

{
    'isValidAddress': 1, 
    'isTransitOutOfAis': 0, 
    'address': {'locationType': 'LOCATION_INPUT', 'district': None, 
    'zipCode': '30322', 'addressId': None, 'isDefaultShippingAddress': 'false', 'obfuscatedId': None, 'isAccountAddress': 'false', 'state': 'GA', 
    'countryCode': 'US', 'addressLabel': None, 
    'city': 'ATLANTA', 'addressLine1': None}, 'sembuUpdated': 1
}
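
A response of this shape can be checked before the cookies are trusted. The helper below is a minimal sketch: `location_changed` is a hypothetical name, and it relies only on the `isValidAddress` and `address.zipCode` fields shown in the example response.

```python
def location_changed(payload: dict, expected_zip: str) -> bool:
    """Return True when the payload reports a valid address with the expected ZIP code."""
    address = payload.get("address") or {}
    return bool(payload.get("isValidAddress")) and address.get("zipCode") == expected_zip


# Trimmed copy of the example response above.
sample_payload = {
    "isValidAddress": 1,
    "address": {"zipCode": "30322", "state": "GA", "city": "ATLANTA"},
}
print(location_changed(sample_payload, "30322"))  # True
print(location_changed(sample_payload, "10001"))  # False
```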

  7. Save the response cookies from the POST request and use them for further requests.
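
One possible way to keep the cookies between runs is to store them as JSON. This is only a sketch: `save_cookies` and `load_cookies` are hypothetical helpers, and the cookie name and value in the demo are made up.

```python
import json
from pathlib import Path


def save_cookies(cookies: dict, path: Path) -> None:
    """Write a plain name -> value cookie mapping to a JSON file."""
    path.write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(path: Path) -> dict:
    """Read the cookie mapping back from the JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    cookie_file = Path("location_cookies.json")  # illustrative file name
    save_cookies({"session-id": "example-value"}, cookie_file)
    print(load_cookies(cookie_file))
```

The loaded dict can then be passed as the `cookies=` argument of `requests.get` for later product-page requests.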

A Python script with all the logic:

import json

import requests
from parsel import Selector

AMAZON_US_URL = "https://www.amazon.com/"
AMAZON_ADDRESS_CHANGE_URL = (
    "https://www.amazon.com/gp/delivery/ajax/address-change.html"
)
AMAZON_CSRF_TOKEN_URL = (
    "https://www.amazon.com/gp/glow/get-address-selections.html?deviceType=desktop"
    "&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal"
)
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 ..."
)
DEFAULT_REQUEST_HEADERS = {"Accept-Language": "en", "User-Agent": DEFAULT_USER_AGENT}


def get_amazon_content(start_url: str, cookies: dict = None) -> tuple:
    response = requests.get(
        url=start_url, headers=DEFAULT_REQUEST_HEADERS, cookies=cookies
    )
    response.raise_for_status()
    return Selector(text=response.text), response.cookies


def get_ajax_token(content: Selector):
    data = content.xpath(
        "//span[@id='nav-global-location-data-modal-action']/@data-a-modal"
    ).get()
    if not data:
        raise ValueError("Invalid page content")
    json_data = json.loads(data)
    return json_data["ajaxHeaders"]["anti-csrftoken-a2z"]


def get_session_id(content: Selector):
    session_id = content.re_first(r'session: \{id: "(.+?)"')
    if not session_id:
        raise ValueError("Session id not found")
    return session_id


def get_token(content: Selector):
    csrf_token = content.re_first(r'CSRF_TOKEN : "(.+?)"')
    if not csrf_token:
        raise ValueError("CSRF token not found")
    return csrf_token


def send_change_location_request(zip_code: str, headers: dict, cookies: dict):
    response = requests.post(
        url=AMAZON_ADDRESS_CHANGE_URL,
        data={
            "locationType": "LOCATION_INPUT",
            "zipCode": zip_code,
            "storeContext": "generic",
            "deviceType": "web",
            "pageType": "Gateway",
            "actionSource": "glow",
            "almBrandId": "undefined",
        },
        headers=headers,
        cookies=cookies,
    )
    assert response.json()["isValidAddress"], "Invalid change response"
    return response.cookies


def get_session_cookies(zip_code: str):
    response = requests.get(url=AMAZON_US_URL, headers=DEFAULT_REQUEST_HEADERS)
    content = Selector(text=response.text)

    headers = {
        "anti-csrftoken-a2z": get_ajax_token(content=content),
        "user-agent": DEFAULT_USER_AGENT,
    }
    response = requests.get(
        url=AMAZON_CSRF_TOKEN_URL, headers=headers, cookies=response.cookies
    )
    content = Selector(text=response.text)

    headers = {
        "anti-csrftoken-a2z": get_token(content=content),
        "user-agent": DEFAULT_USER_AGENT,
    }
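    location_cookies = send_change_location_request(
        zip_code=zip_code, headers=headers, cookies=dict(response.cookies)
    )
    # Merge the session cookies with any cookies set by the address change.
    session_cookies = {**dict(response.cookies), **dict(location_cookies)}

    # Verify that the location changed correctly.
    response = requests.get(
        url=AMAZON_US_URL, headers=DEFAULT_REQUEST_HEADERS, cookies=session_cookies
    )
    content = Selector(text=response.text)
    location_label = (
        content.css("span#glow-ingress-line2::text").get(default="").strip()
    )

    assert zip_code in location_label, "Location was not changed"
    return session_cookies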


if __name__ == "__main__":
    get_session_cookies(zip_code="30322")

Also, similar logic using the Scrapy framework:

from http.cookies import SimpleCookie

from scrapy import FormRequest, Request, Spider
from scrapy.http import HtmlResponse


class AmazonSessionSpider(Spider):
    """
    Amazon spider for extracting location cookies.
    """

    name = "amazon.com:location-session"

    address_change_endpoint = "/gp/delivery/ajax/address-change.html"
    csrf_token_endpoint = (
        "/gp/glow/get-address-selections.html?deviceType=desktop"
        "&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal"
    )
    countries_base_urls = {
        "US": "https://www.amazon.com",
        "GB": "https://www.amazon.co.uk",
        "DE": "https://www.amazon.de",
        "ES": "https://www.amazon.es",
    }

    default_headers = {
        "sec-fetch-site": "none",
        "sec-fetch-dest": "document",
        "accept-language": "ru-RU,ru;q=0.9",
        "connection": "close",
    }

    def __init__(self, country: str, zip_code: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.country = country
        self.zip_code = zip_code

    def start_requests(self):
        """
        Make start request to main Amazon country page.
        """
        request = Request(
            url=self.countries_base_urls[self.country],
            headers=self.default_headers,
            callback=self.parse_ajax_token,
        )
        yield request

    def parse_ajax_token(self, response: HtmlResponse):
        """
        Parse ajax token from response.
        """
        yield response.request.replace(
            url=self.countries_base_urls[self.country] + self.csrf_token_endpoint,
            headers={
                "anti-csrftoken-a2z": self.get_ajax_token(response=response),
                **self.default_headers,
            },
            callback=self.parse_csrf_token,
        )

    def parse_csrf_token(self, response: HtmlResponse):
        """
        Parse CSRF token from response and make request to change Amazon location.
        """
        yield FormRequest(
            method="POST",
            url=self.countries_base_urls[self.country] + self.address_change_endpoint,
            formdata={
                "locationType": "LOCATION_INPUT",
                "zipCode": self.zip_code,
                "storeContext": "generic",
                "deviceType": "web",
                "pageType": "Gateway",
                "actionSource": "glow",
                "almBrandId": "undefined",
            },
            headers={
                "anti-csrftoken-a2z": self.get_csrf_token(response=response),
                **self.default_headers,
            },
            callback=self.parse_session_cookies,
        )

    def parse_session_cookies(self, response: HtmlResponse) -> dict:
        """
        Return cookies dict if location changed successfully.
        """
        json_data = response.json()
        if not json_data.get("isValidAddress"):
            return {}
        return self.extract_response_cookies(response=response)

    @staticmethod
    def get_ajax_token(response: HtmlResponse) -> str:
        """
        Extract ajax token from response.
        """
        data = response.xpath("//input[@id='glowValidationToken']/@value").get()
        if not data:
            raise ValueError("Invalid page content")
        return data

    @staticmethod
    def get_csrf_token(response: HtmlResponse) -> str:
        """
        Extract CSRF token from response.
        """
        csrf_token = response.css("script").re_first(r'CSRF_TOKEN : "(.+?)"')
        if not csrf_token:
            raise ValueError("CSRF token not found")
        return csrf_token

    @staticmethod
    def extract_response_cookies(response: HtmlResponse) -> dict:
        """
        Extract cookies from response object
        and return it in valid format.
        """
        cookies = {}
        cookie_headers = response.headers.getlist("Set-Cookie", [])
        for cookie_str in cookie_headers:
            cookie = SimpleCookie()
            cookie.load(cookie_str.decode("utf-8"))
            for key, raw_value in cookie.items():
                cookies[key] = raw_value.value
        return cookies

Shell command:

 scrapy crawl amazon.com:location-session -a country=US -a zip_code=30332
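
The cookie parsing done by `extract_response_cookies` in the spider can be tried without Scrapy. The sketch below repeats the same `SimpleCookie` approach on raw `Set-Cookie` values; `cookies_from_set_cookie_headers` is a hypothetical name and the header values are made up.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie_headers(header_values) -> dict:
    """Turn raw Set-Cookie header values (bytes or str) into a name -> value dict."""
    cookies = {}
    for raw in header_values:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        parsed = SimpleCookie()
        parsed.load(text)
        # Attributes such as Domain and Path are stored on the morsel, not as cookies.
        for name, morsel in parsed.items():
            cookies[name] = morsel.value
    return cookies


# Made-up header values for demonstration only.
sample_headers = [
    b"session-id=123-4567890; Domain=.amazon.com; Path=/; Secure",
    b"ubid=987-6543210; Domain=.amazon.com; Path=/; Secure",
]
print(cookies_from_set_cookie_headers(sample_headers))
```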
Votes: 7

Stack Overflow user

Posted on 2021-04-20 13:42:42

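I see a CSRF token (anti-csrftoken-a2z) in the headers that is missing from your location request, and you also skipped an additional location request (https://www.amazon.co.uk/gp/glow/get-address-selections.html?deviceType=desktop&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal). You should reproduce all the requests that the browser makes.

A simple way to do this in Chrome: Chrome -> DevTools -> Network -> XHR, choose "Copy as cURL", then convert the command to requests-library code here (https://curl.trillworks.com/).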
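
As a small sketch of the first point, the token can be added to an existing header dict, such as the one used in the question, before the location request is sent. `with_csrf_header` is a hypothetical helper and the token value is a placeholder.

```python
def with_csrf_header(base_headers: dict, csrf_token: str) -> dict:
    """Return a copy of base_headers with the anti-csrftoken-a2z header added."""
    if not csrf_token:
        raise ValueError("csrf_token must not be empty")
    headers = dict(base_headers)  # copy, so the caller's dict is left untouched
    headers["anti-csrftoken-a2z"] = csrf_token
    return headers


base = {"user-agent": "Mozilla/5.0 ...", "x-requested-with": "XMLHttpRequest"}
print(with_csrf_header(base, "placeholder-token"))
```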

Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/66816666
