
Python crawler (6): scraping Amazon data with the Scrapy framework

lpe234
Published 2020-07-27 17:03:44

Extracting the data with xpath() is fairly straightforward; the tricky parts are the URL navigation and recursion, which held me up for quite a while. Douban is much nicer in this respect, with its tidy, regular URLs. Amazon's URLs are a mess... or maybe I just don't understand them well enough yet.

amazon
├── amazon
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── msic
│   │   ├── __init__.py
│   │   └── pad_urls.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── pad_spider.py
│       └── pad_spider.pyc
├── pad.xml
└── scrapy.cfg

(1)items.py

from scrapy import Item, Field


class PadItem(Item):
    sno = Field()
    price = Field()

(2)pad_spider.py

# -*- coding: utf-8 -*-
from scrapy import Spider, Selector
from scrapy.http import Request
from amazon.items import PadItem


class PadSpider(Spider):
    name = "pad"
    allowed_domains = ["amazon.cn"]  # the search URLs below are on amazon.cn, not amazon.com

    start_urls = []
    u1 = 'http://www.amazon.cn/s/ref=sr_pg_'
    u2 = '?rh=n%3A2016116051%2Cn%3A!2016117051%2Cn%3A888465051%2Cn%3A106200071&page='
    u3 = '&ie=UTF8&qid=1408641827'
    for i in range(181):
        url = u1 + str(i+1) + u2 + str(i+1) + u3
        start_urls.append(url)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="rsltGrid prod celwidget"]')
        items = []
        for site in sites:
            item = PadItem()
            item['sno'] = site.xpath('@name').extract()[0]
            try:
                item['price'] = site.xpath('ul/li/div/a/span/text()').extract()[0]
            # IndexError: no regular price element, so this is a new product
            except IndexError:
                item['price'] = site.xpath('ul/li/a/span/text()').extract()[0]
            items.append(item)
        return items
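The 181 start URLs above are built by string concatenation. As a sketch of the same idea, here is a hypothetical helper (not part of the original spider) that assembles them with the standard library's urlencode, using the query parameters from the code above. Note that urlencode percent-encodes `!` as `%21`, a cosmetic difference from the hand-built string:

```python
from urllib.parse import urlencode

BASE = 'http://www.amazon.cn/s/ref=sr_pg_'
# Same category filter as the spider's u2 fragment, in decoded form
RH = 'n:2016116051,n:!2016117051,n:888465051,n:106200071'


def build_start_urls(pages=181):
    """Build one search URL per result page, 1 through `pages`."""
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({'rh': RH, 'page': page,
                           'ie': 'UTF8', 'qid': '1408641827'})
        urls.append('%s%d?%s' % (BASE, page, query))
    return urls


print(len(build_start_urls()))  # 181
```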

(3)settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for amazon project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'amazon'

SPIDER_MODULES = ['amazon.spiders']
NEWSPIDER_MODULE = 'amazon.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'amazon (+http://www.yourdomain.com)'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

FEED_URI = 'pad.xml'
FEED_FORMAT = 'xml'

(4) The resulting pad.xml

<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <sno>B00JWCIJ78</sno>
        <price>¥3199.00</price>
    </item>
    <item>
        <sno>B00E907DKM</sno>
        <price>¥3079.00</price>
    </item>
    <item>
        <sno>B00L8R7HKA</sno>
        <price>¥3679.00</price>
    </item>
    <item>
        <sno>B00IZ8W4F8</sno>
        <price>¥3399.00</price>
    </item>
    <item>
        <sno>B00MJMW4BU</sno>
        <price>¥4399.00</price>
    </item>
    <item>
        <sno>B00HV7KAMI</sno>
        <price>¥3799.00</price>
    </item>
    ...
</items>
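As a quick sanity check, the exported feed can be read back with the standard library. This snippet parses an inline copy of the first two items shown above:

```python
import xml.etree.ElementTree as ET

# Inline copy of the first two items from pad.xml above
xml_data = '''<?xml version="1.0" encoding="utf-8"?>
<items>
    <item><sno>B00JWCIJ78</sno><price>¥3199.00</price></item>
    <item><sno>B00E907DKM</sno><price>¥3079.00</price></item>
</items>'''

# fromstring() needs bytes when the document carries an encoding declaration
root = ET.fromstring(xml_data.encode('utf-8'))
records = [(i.findtext('sno'), i.findtext('price'))
           for i in root.findall('item')]
print(records[0])  # ('B00JWCIJ78', '¥3199.00')
```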

(5) Saving the data to a database

...
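The post leaves this step blank. A minimal sketch, assuming SQLite as the target store (the class name, table name, and database file are hypothetical, not from the original project), could be an item pipeline like this:

```python
import sqlite3


class SQLitePipeline(object):
    """Hypothetical pipeline writing each scraped item into SQLite."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('pad.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS pad (sno TEXT PRIMARY KEY, price TEXT)')

    def process_item(self, item, spider):
        # PadItem supports dict-style access, so item['sno'] works here
        self.conn.execute(
            'INSERT OR REPLACE INTO pad (sno, price) VALUES (?, ?)',
            (item['sno'], item['price']))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()
```

It would then be enabled in settings.py, e.g. `ITEM_PIPELINES = {'amazon.pipelines.SQLitePipeline': 300}`.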

-- 2014-08-22 04:12:43
