
How to pick a satisfying used car out of thousands? Python can do it for you (source code included)

Preface

An old hand takes you car shopping: with just a few dozen lines of code, we can grab thousands of used-car listings from the web and save the data to our local machine.

Knowledge points:

1. Python basics
2. Functions
3. The requests library
4. XPath

Suitable for complete beginners.

Environment:

Windows + PyCharm + Python 3

Crawler workflow:

1. Target URL
2. Send a request and get the response
3. Parse the page and extract the data
4. Save the data

(A minimal sketch of this pattern follows below.)
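To make the four steps concrete before we apply them to Guazi, here is the same pattern in miniature (the target URL is a placeholder for illustration, not the real site):

import requests
from lxml import etree

url = 'https://example.com/'                     # 1. target URL
resp = requests.get(url)                         # 2. send request, get response
html = etree.HTML(resp.content.decode('utf-8'))  # 3. parse the page...
title = html.xpath('//title/text()')             #    ...and extract data
with open('data.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(title))                    # 4. save the data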

Steps:

1. Import the tools

import requests            # pip install requests
from lxml import etree     # pip install lxml

2. Get the URL of each car's detail page and parse the site

def get_detail_urls(url):
    # Listing page, e.g. 'https://www.guazi.com/cs/buy/o3/'
    # Send the request and get the response
    resp = requests.get(url, headers=headers)
    text = resp.content.decode('utf-8')
    # Parse the page
    html = etree.HTML(text)
    ul = html.xpath('//ul[@class="carlist clearfix js-top"]')[0]
    lis = ul.xpath('./li')
    detail_urls = []
    for li in lis:
        # Each <li> is one listing; its <a href> holds a relative link
        detail_url = li.xpath('./a/@href')
        detail_url = 'https://www.guazi.com' + detail_url[0]
        detail_urls.append(detail_url)

    return detail_urls
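As a quick sanity check, you can call the function on a single listing page (headers must already be defined, as in the next step; page o3 is just an example):

detail_urls = get_detail_urls('https://www.guazi.com/cs/buy/o3/')
print(len(detail_urls))   # how many listings were found
print(detail_urls[:3])    # peek at the first few detail-page URLs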

3. Add request headers. Without a browser-like User-Agent and Cookie, the site tends to reject the request, so copy fresh values from your own browser.

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'Cookie':'uuid=5a823c6f-3504-47a9-8360-f9a5040e5f23; ganji_uuid=4238534742401031078259; lg=1; Hm_lvt_936a6d5df3f3d309bda39e92da3dd52f=1590045325; track_id=79952087417704448; antipas=q7222002m3213k0641719; cityDomain=cs; clueSourceCode=%2A%2300; user_city_id=204; sessionid=38afa34e-f972-431b-ce65-010f82a03571; close_finance_popup=2020-05-23; cainfo=%7B%22ca_a%22%3A%22-%22%2C%22ca_b%22%3A%22-%22%2C%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22pcbiaoti%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22-%22%2C%22ca_i%22%3A%22-%22%2C%22scode%22%3A%22-%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22ca_transid%22%3A%22%22%2C%22platform%22%3A%221%22%2C%22version%22%3A1%2C%22track_id%22%3A%2279952087417704448%22%2C%22display_finance_flag%22%3A%22-%22%2C%22client_ab%22%3A%22-%22%2C%22guid%22%3A%225a823c6f-3504-47a9-8360-f9a5040e5f23%22%2C%22ca_city%22%3A%22cs%22%2C%22sessionid%22%3A%2238afa34e-f972-431b-ce65-010f82a03571%22%7D; preTime=%7B%22last%22%3A1590217273%2C%22this%22%3A1586866452%2C%22pre%22%3A1586866452%7D',
}
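The Cookie above is tied to one browser session and will have expired; copy a fresh User-Agent and Cookie from your browser's developer tools (Network tab). A quick, hedged way to check whether the site is accepting your headers:

resp = requests.get('https://www.guazi.com/cs/buy/o1/', headers=headers)
print(resp.status_code)          # 200 means the request went through
print('carlist' in resp.text)    # True if the listing markup came back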

4. Extract the data from each car's detail page

def parse_detail_page(url):
    resp = requests.get(url, headers=headers)
    text = resp.content.decode('utf-8')
    html = etree.HTML(text)
    # Title
    title = html.xpath('//div[@class="product-textbox"]/h2/text()')[0]
    title = title.strip()
    print(title)
    # Spec list: registration date, mileage, displacement, gearbox
    info = html.xpath('//div[@class="product-textbox"]/ul/li/span/text()')

    infos = {}
    cardtime = info[0]        # registration date
    km = info[1]              # mileage
    displacement = info[2]    # engine displacement
    speedbox = info[3]        # gearbox type

    infos['title'] = title
    infos['cardtime'] = cardtime
    infos['km'] = km
    infos['displacement'] = displacement
    infos['speedbox'] = speedbox
    print(infos)
    return infos
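The [0] indexing raises IndexError whenever a detail page deviates from the expected layout (a sold car, a captcha page, and so on). A minimal defensive wrapper, my own addition rather than part of the original code:

def safe_parse_detail_page(url):
    # Return None instead of crashing on an unexpected page layout
    try:
        return parse_detail_page(url)
    except IndexError:
        print('Unexpected page layout, skipping:', url)
        return None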

5. Save the data

def save_data(infos, f):
    # Append one comma-separated line per car
    f.write('{},{},{},{},{}\n'.format(infos['title'], infos['cardtime'], infos['km'],
                                      infos['displacement'], infos['speedbox']))
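Car titles can themselves contain commas, which would corrupt the hand-rolled format above. A sketch of the same function using the standard csv module instead (open the file with newline='' in that case):

import csv

def save_data_csv(infos, f):
    # csv.writer quotes any field that contains a comma
    writer = csv.writer(f)
    writer.writerow([infos['title'], infos['cardtime'], infos['km'],
                     infos['displacement'], infos['speedbox']])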



if __name__ == '__main__':

    # Crawl listing pages o1 ... o50
    base_url = 'https://www.guazi.com/cs/buy/o{}/'
    with open('guazi.csv', 'a', encoding='utf-8') as f:
        for x in range(1, 51):
            url = base_url.format(x)
            detail_urls = get_detail_urls(url)
            for detail_url in detail_urls:
                infos = parse_detail_page(detail_url)
                save_data(infos, f)
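Fetching 50 listing pages plus every detail page back-to-back is an easy way to get your Cookie blocked. A common courtesy, not in the original code, is a short random pause between requests (the 1-3 second range is my own guess); the inner loop above would then read:

import random
import time

for detail_url in detail_urls:
    infos = parse_detail_page(detail_url)
    save_data(infos, f)
    time.sleep(random.uniform(1, 3))   # pause 1-3 s between requests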

Finally, run the code. Each listing's details are printed to the console and appended to guazi.csv.

This article was first shared via the WeChat public account 松鼠爱吃饼干 (Python_ku); author: 松鼠爱吃饼干.


Originally published: 2020-06-09

