Scraping Tencent's Social Recruitment Site with Scrapy

User 2337871

Published 2019-07-19 15:51:47


Included in column: git

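Goal: for each job posting, collect the position name, position type, position link, number of openings, work location, and publish date.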

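I. Workflow for creating the Scrapy project

1) Create the project for crawling Tencent's job postings: scrapy startproject tencent

2) Enter the project directory with cd tencent, then generate the spider: scrapy genspider tencentPosition hr.tencent.com

3) Open the project in PyCharm

4) Define the item fields in items.py according to the goal above

5) Write the spider

6) Write the pipeline

7) Configure settings.py

8) Screenshot: the project opened in PyCharm.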

II. Code for each file

1. tencentPosition.py

import scrapy

from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = 'tencentPosition'
    allowed_domains = ['hr.tencent.com']
    offset = 0
    url = "https://hr.tencent.com/position.php?&start="
    start_urls = [url + str(offset) + '#a', ]

    def parse(self, response):
        # Each position is a table row with class "even" or "odd".
        position_lists = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for position in position_lists:
            item = TencentItem()
            # .get() returns None for an empty cell instead of raising IndexError.
            item["position_name"] = position.xpath("./td[1]/a/text()").get()
            item["position_link"] = position.xpath("./td[1]/a/@href").get()
            item["position_type"] = position.xpath("./td[2]/text()").get()
            item["people_num"] = position.xpath("./td[3]/text()").get()
            item["work_address"] = position.xpath("./td[4]/text()").get()
            item["publish_time"] = position.xpath("./td[5]/text()").get()
            yield item

        # Next page: handled once per response, after all rows have been yielded.
        # The span text is compared with the offset, so it is treated as the
        # total number of positions.
        total = response.xpath('//div[@class="left"]/span/text()').get()
        if total and self.offset + 10 < int(total):
            self.offset += 10
            new_url = self.url + str(self.offset) + "#a"
            yield scrapy.Request(new_url, callback=self.parse)
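The paging rule in the spider can be checked on its own, without running a crawl. The following is a minimal sketch, assuming the page reports the total number of positions and lists 10 rows per page; next_offset and page_urls are hypothetical helper names, not part of Scrapy or of the project above.

```python
def next_offset(offset, total, page_size=10):
    """Return the offset of the next listing page, or None on the last page."""
    candidate = offset + page_size
    return candidate if candidate < total else None


def page_urls(base_url, total, page_size=10):
    """Yield every listing URL for `total` positions, starting at offset 0."""
    offset = 0
    while offset is not None:
        yield f"{base_url}{offset}#a"
        offset = next_offset(offset, total, page_size)


# Hypothetical total of 25 positions: three pages, at offsets 0, 10 and 20.
urls = list(page_urls("https://hr.tencent.com/position.php?&start=", 25))
```

Requesting the next page only when its offset is still below the total avoids fetching one empty page at the end of the crawl.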

2. items.py

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    position_name = scrapy.Field()
    position_link = scrapy.Field()
    position_type = scrapy.Field()
    people_num = scrapy.Field()
    work_address = scrapy.Field()
    publish_time = scrapy.Field()

Note: the field names must match the keys used in tencentPosition.py.

3. pipelines.py

import json


class TencentPipeline(object):
    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        print("=======start========")
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        print("=====ing=======")
        dict_item = dict(item)  # convert the item to a plain dict
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable.
        json_text = json.dumps(dict_item, ensure_ascii=False) + "\n"
        self.file.write(json_text)
        return item

    def close_spider(self, spider):
        print("=======end===========")
        self.file.close()
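Because the pipeline writes one JSON object per line, tencent.json ends up in JSON Lines form rather than as a single JSON document, so it has to be read back line by line. A minimal sketch of that round trip using only the standard library; write_json_lines and read_json_lines are hypothetical helper names, and the sample record is made up for illustration.

```python
import io
import json


def write_json_lines(items, fp):
    """Write each item as one JSON object per line, keeping non-ASCII text readable."""
    for item in items:
        fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")


def read_json_lines(fp):
    """Parse a JSON Lines stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


# Made-up sample record, not real scraped data.
sample = [{"position_name": "后台开发工程师", "work_address": "深圳"}]

buffer = io.StringIO()  # stands in for the open tencent.json file
write_json_lines(sample, buffer)
buffer.seek(0)
restored = read_json_lines(buffer)
```

With ensure_ascii=False the Chinese text is stored as-is instead of as \uXXXX escapes, which is why the file is opened with encoding="utf-8".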

4. settings.py
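No settings are listed here, but the pipeline only runs if it is registered in ITEM_PIPELINES. A minimal sketch of the relevant part of settings.py: the first three values are what scrapy startproject tencent generates, the pipeline path follows from the class name in pipelines.py, and DOWNLOAD_DELAY is an assumption added for politeness.

```python
# settings.py (sketch, not the complete generated file)
BOT_NAME = "tencent"

SPIDER_MODULES = ["tencent.spiders"]
NEWSPIDER_MODULE = "tencent.spiders"

# Register the pipeline from pipelines.py. The number sets the order when
# several pipelines are enabled: lower values run first.
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}

# Assumption: wait about one second between requests to the site.
DOWNLOAD_DELAY = 1
```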

5. Running the spider

1) Create main.py in the project root directory

2) Contents of main.py:

from scrapy import cmdline

cmdline.execute("scrapy crawl tencentPosition".split())

III. Run results (screenshot)

This article is part of the Tencent Cloud self-media syndication program and is shared from the author's personal site/blog.
