
Practice Python – Scraping Data

System: [macOS]

Crawling, Step by Step

Obtain the link

Read the content behind the link

Extract the key elements
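The three steps above can be sketched as a minimal pipeline. Everything here is illustrative: `fetch` wraps the `urlopen` call used later in this article, and `extract_title` is a hypothetical helper that naively pulls one key element out of a sample page.

```python
import urllib.request as urlrequest

def fetch(url):
    # Step 2: read the raw content behind the link
    return urlrequest.urlopen(url).read()

def extract_title(html):
    # Step 3: pull one key element (the <title>) out of the page
    start = html.find("<title>") + len("<title>")
    end = html.find("</title>")
    return html[start:end]

# Step 1: obtain a link, e.g. url = "https://book.douban.com/...",
# then: extract_title(fetch(url).decode("utf-8"))
print(extract_title("<html><title>Demo</title></html>"))
```

Real pages need a proper parser (BeautifulSoup, used in Parts Two and Three); the string search above only shows where each step fits.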

Crawling approaches: via the site's API

via the HTML page itself

Overview

API (Application Program Interface): data scraped through an API comes back in JSON format.

(1) Check the API's request rate limits

(2) Check which parameters the API requires

(3) Write code that calls the API and receives the JSON

(4) Store the result locally or in a database
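Steps (2) and (4) can be sketched without touching a real service. The base URL, parameters, and file name below are placeholders, not an actual API:

```python
import json
import urllib.parse

def build_api_url(base, params):
    # Step (2): attach the parameters the API expects, properly URL-encoded
    return base + "?" + urllib.parse.urlencode(params)

def save_result(record, path):
    # Step (4): store the parsed JSON result in a local file
    with open(path, "w") as f:
        json.dump(record, f, ensure_ascii=False)

url = build_api_url("https://api.example.com/v2/book/search", {"q": "python", "count": 5})
print(url)
```

Step (3), actually calling the endpoint, is what Part One below does with `urlopen` and `json.loads`.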

JSON: JavaScript Object Notation. JSON is a lightweight data-interchange format; when scraping data through an API, the response usually comes back as JSON.
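The format is easy to see with Python's standard json module (the payload below is made up for illustration):

```python
import json

# a typical JSON response body, as an API would return it
payload = '{"title": "Demo Book", "rating": {"average": 8.7, "numRaters": 1024}}'

data = json.loads(payload)        # JSON text -> Python dict
print(data["rating"]["average"])  # nested values are reached key by key
```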

Preparation

In a terminal shell, enter the environment: source activate python_study

(leave the environment: source deactivate python_study)

(note: the python_study environment was created in Anaconda beforehand)

Install Jupyter: conda install jupyter (this takes a while)

Start it with: jupyter notebook

(note: Python packages can be installed either with pip install or with conda install)

The difference between the two:

· Pip installs from PyPI. There are no releases of the basemap package on PyPI, it is just a simple registration page pointing at the real download location (SourceForge).

· Conda pulls from its own repository, typically with convenience builds of libraries common to the community Conda is aimed at. Conda's repository has a version of the basemap package available for installation, so it succeeds.

· This is not to say that Pip is "worse" than Conda in this instance, as you could easily download the package and install it with pip locally. This particular library has just opted to not add releases to PyPI.

Part One

Reading values out of a JSON object's key/value pairs: via "." or "[]". In Python, parsed JSON becomes a dict, so "[]" is the accessor; "." access belongs to JavaScript objects.
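A quick comparison of the access styles on a Python dict (sample data made up for the example):

```python
book = {"title": "Demo Book", "rating": {"average": 9.0}}

print(book["rating"]["average"])    # bracket access; raises KeyError if the key is missing
print(book.get("subtitle", "n/a"))  # .get() falls back to a default instead of raising
```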


# Douban Books, via the douban API
# pull in urllib.request and json
import urllib.request as urlrequest
import json

# fetch the content
#search_name = '满月之夜白鲸现'  # the book we already know we want
# the Chinese title must be converted into URL-safe text first
#url_name = "https://book.douban.com/subject_search?search_text={}&cat=1001".format(search_name)

# NOTE: the original leaves url undefined; the douban v2 book endpoint has the
# form https://api.douban.com/v2/book/<subject_id> (see the API wiki in the
# references) -- the id below is a placeholder, replace it with a real one
book_id = "1000001"
url = "https://api.douban.com/v2/book/{}".format(book_id)

# "." calls a method on the object
http_content = urlrequest.urlopen(url).read()

# decode the bytes into readable text
#print(http_content.decode("utf-8"))

# parse the JSON with Python
json_content = json.loads(http_content)
#print(json_content)

# pull out the key fields
rank = json_content["rating"]["average"]
RateNumbers = json_content["rating"]["numRaters"]
title_name = json_content["title"]

# print the result
print("\t""Title_name:", title_name, "\n\t", "RateNumbers:", RateNumbers, "\n\t", "Rank:", rank)

# save to a local file
with open("douban_practice.txt", "w") as result:
    # ".write" writes to the file; format() fills the values into a fixed template
    result.write("{} {} {}\n".format(title_name, RateNumbers, rank))
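The commented-out search URL above cannot take the Chinese title directly; it has to be percent-encoded first, which is what `urllib.parse.quote` does (the search URL is the one from the original comments):

```python
import urllib.parse

search_name = '满月之夜白鲸现'
encoded = urllib.parse.quote(search_name)  # percent-encode the UTF-8 bytes
url_name = "https://book.douban.com/subject_search?search_text={}&cat=1001".format(encoded)
print(url_name)
```

`urllib.parse.unquote` reverses the encoding, which is a handy way to sanity-check the result.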

Part Two

(Output screenshots omitted; they compare the result of the API approach with the result of scraping the page directly.)

# Douban, via the web page itself
import urllib.request as urlrequest
from bs4 import BeautifulSoup

# fetch the page content
url = "https://movie.douban.com/subject/27186619/"
http_content = urlrequest.urlopen(url).read()
#print(http_content.decode("utf-8"))

# parse the page with BeautifulSoup
soup = BeautifulSoup(http_content, "html.parser")
#print(soup.prettify())

# locate the element beforehand with Chrome developer tools: property="v:average"
# ".get_text()" extracts the text content
#soup.find(property="v:average")
rank = soup.find(property="v:average").get_text()
print(rank)
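The same `find(property=...)` lookup works on any HTML, so it can be tried without fetching the live page. The snippet below is a made-up fragment that mimics douban's rating markup:

```python
from bs4 import BeautifulSoup

html = '<div><strong class="rating_num" property="v:average">7.8</strong></div>'
soup = BeautifulSoup(html, "html.parser")

# keyword arguments to find() filter on tag attributes
rank = soup.find(property="v:average").get_text()
print(rank)
```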

Part Three

Parsing HTML with BeautifulSoup

The main things the BeautifulSoup library offers:

the find function returns the block of code matching a given tag

the prettify function prints the HTML page in a friendly, readable form

the next & previous functions walk to the surrounding tags in context
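A small demonstration of all three on an inline snippet (in current bs4 the next/previous lookups are the `find_next` and `find_previous` methods):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>first</li><li>second</li></ul>", "html.parser")

first = soup.find("li")                       # find: first tag matching the filter
second = first.find_next("li")                # next: the following <li> in the document
print(second.get_text())
print(second.find_previous("li").get_text())  # previous: back to the earlier <li>
print(soup.prettify())                        # indented, human-readable HTML
```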


import urllib.request as urlrequest
from bs4 import BeautifulSoup

for page in range(24):
    number = page + 1
    url = "http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{}".format(number)
    http_content = urlrequest.urlopen(url).read()
    #print(http_content.decode("GBK"))
    soup = BeautifulSoup(http_content, "html.parser", from_encoding="gbk")
    #print(soup.prettify())
    book_content = soup.find_all(class_="name")
    #print(book_content)
    for each_book in book_content:
        book_title = each_book.find("a")["title"]
        book_ahref = each_book.find("a")["href"]
        print(book_title, book_ahref)
        #with open("dangdang.txt", "w") as outputfile:
        #    outputfile.write("{} {} \n".format(book_title, book_ahref))
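The commented-out writing code above has two pitfalls: the file name needs quotes, and opening with "w" inside the loop would overwrite earlier results on every iteration. One fix is to open the file once, outside the loops; the records below are stand-ins for the scraped titles and links:

```python
records = [("Book A", "http://example.com/a"),
           ("Book B", "http://example.com/b")]

# open once with "w", then keep writing into the same handle
with open("dangdang.txt", "w") as outputfile:
    for book_title, book_ahref in records:
        outputfile.write("{} {}\n".format(book_title, book_ahref))

with open("dangdang.txt") as f:
    print(f.read())
```

Opening with mode "a" (append) inside the loop would also work, but holding one handle open avoids reopening the file for every book.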

References

https://stackoverflow.com/questions/31899966/pip-install-vs-conda-install

Douban API: https://developers.douban.com/wiki/?title=api_v2

Further reading

JSON tutorial: http://www.runoob.com/json/json-tutorial.html

Next installment

Er... there will definitely be one~!

  • Original link: http://kuaibao.qq.com/s/20180116G10BXP00?refer=cp_1026
  • Republished by the Tencent Cloud Developer Community, a Tencent Content Open Platform account, under the Tencent Content Open Platform service agreement.
