
Scraping Zhihu with Scrapy: Fetching User Profile Information

andrew_a · Published 2019-07-30 · From the column: Python爬虫与数据分析

I'll write this up in more detail this time; looking back at what I wrote before, it feels a bit vague.

Create a new Scrapy project: scrapy startproject zhihuscrapy (I use the name zhihuscrapy here so it matches the imports in the code below).

The log directory is something I added myself; the layout of a freshly generated project looks like the sketch below.
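For reference, scrapy startproject zhihuscrapy generates roughly this layout (the logs directory mentioned above is my own addition; Scrapy does not create it):

zhihuscrapy/
├── scrapy.cfg
└── zhihuscrapy/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py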

Then create your own spider file under the spiders directory; the name I picked is a bit odd.
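You can create the file by hand or let Scrapy scaffold it; either way, the spider's name attribute ('zhihutest' in the code below) is what scrapy crawl expects:

scrapy genspider zhihutest zhihu.com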

Now you can write the spider itself. I'm posting the login code from the previous article again, this time all in one place.

import os
import re
import json
import time
import hmac
import base64
from hashlib import sha1
from urllib.parse import urlencode

import scrapy
from scrapy import Selector  # note: scrapy.log is deprecated, so the spider uses self.logger instead
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# Gender, People and HEADER come from this project's own constants module
from zhihuscrapy.constants import Gender, People, HEADER
from zhihuscrapy.items import ZhihuPeopleItem, ZhihuRelationItem


class ZhihuComSpider(scrapy.Spider):
    name = 'zhihutest'
    allowed_domains = ['zhihu.com']
    start_url = 'https://www.zhihu.com/people/tmjzxy88'
    # Note: rules only takes effect on a CrawlSpider; on a plain Spider
    # this attribute is ignored, so it is kept here for reference only.
    rules = (Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),)

    agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    headers = {
        'Connection': 'keep-alive',
        'Host': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com/signin',
        'User-Agent': agent
        # 'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }
    client_id = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
    grant_type = 'password'
    source = 'com.zhihu.web'
    timestamp = str(int(time.time() * 1000))  # millisecond timestamp used in the signature
    followee_ids = []

    # Compute the request signature
    def get_signature(self, grant_type, client_id, source, timestamp):
        """
        Compute the signature with HMAC-SHA1.
        The message is just a few fixed strings plus the timestamp
        (sequential update() calls hash the concatenation).
        :param timestamp: millisecond timestamp
        :return: hex digest of the signature
        """
        hm = hmac.new(b'd1b964811afb40118a12068ff74a12f4', None, sha1)
        hm.update(str.encode(grant_type))
        hm.update(str.encode(client_id))
        hm.update(str.encode(source))
        hm.update(str.encode(timestamp))
        return str(hm.hexdigest())

    def start_requests(self):
        # Hit the captcha endpoint first; the callback decides how to log in
        yield scrapy.Request(
            'https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
            headers=self.headers,
            callback=self.start_login,
            meta={'cookiejar': 1},  # enable a per-session cookiejar
        )

    def start_login(self, response):
        # Check whether a captcha is required
        need_cap = json.loads(response.body)['show_captcha']
        print(need_cap)
        if need_cap:
            print('captcha required')
            yield scrapy.Request(
                'https://www.zhihu.com/api/v3/oauth/captcha?lang=en',
                headers=self.headers,
                callback=self.capture,
                method='PUT',
                meta={'cookiejar': response.meta['cookiejar']},
            )
        else:
            print('no captcha required')
            post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
            post_data = {
                'client_id': self.client_id,
                'grant_type': self.grant_type,
                'timestamp': self.timestamp,
                'source': self.source,
                'signature': self.get_signature(self.grant_type, self.client_id, self.source, self.timestamp),
                'username': '+86166666666',  # replace with your own account
                'password': '123456789',
                'captcha': '',
                # 'cn' switches to the upside-down-Chinese-character captcha
                'lang': 'en',
                'ref_source': 'other_',
                'utm_source': ''}
            yield scrapy.FormRequest(
                url=post_url,
                formdata=post_data,
                headers=self.headers,
                callback=self.after_login,  # the original omitted this callback
                meta={'cookiejar': response.meta['cookiejar']},
            )

    def capture(self, response):
        try:
            img = json.loads(response.body)['img_base64']
        except ValueError:
            print('failed to read img_base64 from the response!')
        else:
            img = img.encode('utf8')
            img_data = base64.b64decode(img)
            # save the captcha image so it can be read by eye
            with open('zhihu.gif', 'wb') as f:
                f.write(img_data)
        captcha = input('enter the captcha: ')
        post_data = {
            'client_id': self.client_id,
            'grant_type': self.grant_type,
            'timestamp': self.timestamp,
            'source': self.source,
            'signature': self.get_signature(self.grant_type, self.client_id, self.source, self.timestamp),
            'username': '+8617777775',  # replace with your own account
            'password': '123456789',
            'captcha': captcha,
            # 'cn' switches to the upside-down-Chinese-character captcha
            'lang': 'en',
            'ref_source': 'other_',
            'utm_source': '',
            # hard-coded for demonstration; in practice read _xsrf from the cookies
            '_xsrf': '0sQhRIVITLlEX8kQWA09VOqsPlSqRJQT'
        }
        yield scrapy.FormRequest(
            url='https://www.zhihu.com/api/v3/oauth/sign_in',  # same endpoint as the no-captcha branch
            formdata=post_data,
            callback=self.after_login,
            headers=self.headers,
            meta={'cookiejar': response.meta['cookiejar']},
        )

    def after_login(self, response):
        if response.status == 200:
            print("login succeeded")
            # once logged in, start crawling from the first user
            return [scrapy.Request(
                self.start_url,
                meta={'cookiejar': response.meta['cookiejar']},
                callback=self.parse_people,
                errback=self.parse_err,  # error handler, not shown in this article
            )]
        else:
            print("login failed")

Next, fetch the profile page information.

Getting the user's profile information is even simpler. The previous article finished the mock login; once logged in, open a user's profile page, press F12, then refresh with F5. On the activities page there is a script tag with id="js-initialData"; that script holds all of the user's information.
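To make the parsing code easier to follow, here is a trimmed sketch of the JSON inside that script tag. The field names are the ones the code below reads; the values are made up, and the exact shape varies from user to user, which is why every lookup below is wrapped in try/except:

{
  "initialState": {
    "entities": {
      "users": {
        "tmjzxy88": {
          "name": "...",
          "gender": 1,
          "locations": [{"name": "..."}],
          "business": [{"name": "..."}],
          "employments": [{"company": {"name": "..."}, "job": {"name": "..."}}],
          "educations": [{"school": {"name": "..."}, "major": {"name": "..."}, "diploma": 3}],
          "followingCount": 100,
          "followerCount": 200
        }
      }
    }
  }
}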

def parse_people(self, response):
    """
    Parse a user's profile page.
    """
    if "need_login=true" in response.url:
        # we got redirected to the login page; dump the HTML for debugging
        with open('need_login.html', 'w', encoding="utf8") as f:
            f.write(response.text)
    selector = Selector(response)

    try:
        zhihu_id = os.path.split(response.url)[-1]
        userlinks = selector.xpath('//script[@id="js-initialData"]/text()').extract_first()
        userlinks = json.loads(userlinks)
        userlinks = userlinks['initialState']['entities']['users'][zhihu_id]
        nickname = userlinks['name']

        try:
            # location
            location = userlinks['locations'][0]['name']
        except (KeyError, IndexError):
            location = 'unknown'
        try:
            # company
            employment = userlinks['employments'][0]['company']['name']
            # job title
            position = userlinks['employments'][0]['job']['name']
        except (KeyError, IndexError):
            employment = 'unknown'
            position = 'unknown'
        try:
            # industry
            business = userlinks['business'][0]['name']
        except (KeyError, IndexError):
            business = 'unknown'
        try:
            # school name
            school_name = userlinks['educations'][0]['school']['name']
            self.logger.info(school_name)
            # major
            major = userlinks['educations'][0]['major']['name']
            # diploma: 1 high school or below, 2 associate, 3 bachelor,
            # 4 master, 5 doctorate or above
            edu = userlinks['educations'][0]['diploma']
            education = {
                1: 'high school or below',
                2: 'associate degree',
                3: 'bachelor',
                4: 'master',
                5: 'doctorate or above',
            }.get(edu, 'unknown')
        except (KeyError, IndexError):
            school_name = 'unknown'
            major = 'unknown'
            education = 'unknown'
        try:
            gender = userlinks['gender']
            gender = 'male' if gender == 1 else 'female'
        except KeyError:  # a missing dict key raises KeyError, not IndexError
            gender = 'unknown'
        # avatar URL; the trailing 'jpg' is stripped here and re-appended
        # when the item is built below
        image_url = selector.xpath(
            '//div[@class="UserAvatar ProfileHeader-avatar"]/img/@src'
        ).extract_first('')[0:-3]
        # extracted but not used further in this article
        follow_urls = selector.xpath(
            '//div[@class="NumberBoard FollowshipCard-counts NumberBoard--divider"]/a/@href'
        ).extract()
        followee_count = userlinks['followingCount']
        follower_count = userlinks['followerCount']
        item = ZhihuPeopleItem(
            nickname=nickname,
            zhihu_id=zhihu_id,
            location=location,
            business=business,
            gender=gender,
            employment=employment,
            position=position,
            education=education,
            school_name=school_name,
            major=major,
            followee_count=followee_count,
            follower_count=follower_count,
            image_url=image_url + 'jpg',
        )
        yield item
    except Exception as e:
        self.logger.error('current user does not exist: ' + str(e))

I only parsed the data I needed and did not fetch the information on the other pages; feel free to try that yourself. As for items.py and pipelines.py, you can fill those in on your own: one declares the data to save, the other connects to the database and stores it. That is all the code for fetching a user's profile information. If you run into problems, feel free to leave a comment.
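For completeness, here is a minimal sketch of what items.py could look like so the ZhihuPeopleItem constructor above works; the field list simply mirrors the keyword arguments in parse_people, and ZhihuRelationItem is stubbed only so the spider's import succeeds. The pipeline is likewise just an illustration: it appends items to a JSON-lines file, since the original leaves the real database code to you.

# items.py -- field list mirrors the kwargs used in parse_people
import scrapy

class ZhihuPeopleItem(scrapy.Item):
    nickname = scrapy.Field()
    zhihu_id = scrapy.Field()
    location = scrapy.Field()
    business = scrapy.Field()
    gender = scrapy.Field()
    employment = scrapy.Field()
    position = scrapy.Field()
    education = scrapy.Field()
    school_name = scrapy.Field()
    major = scrapy.Field()
    followee_count = scrapy.Field()
    follower_count = scrapy.Field()
    image_url = scrapy.Field()

class ZhihuRelationItem(scrapy.Item):
    # imported by the spider but unused in this article; define fields
    # here when you crawl follower relations
    pass

# pipelines.py -- illustrative stand-in for a database pipeline
import json

class ZhihuPeoplePipeline:
    def open_spider(self, spider):
        self.file = open('people.jl', 'a', encoding='utf8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

Remember to enable the pipeline in settings.py with ITEM_PIPELINES = {'zhihuscrapy.pipelines.ZhihuPeoplePipeline': 300}.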

Originally published 2019-02-11 via the WeChat public account Python爬虫scrapy.