Web Scraping, Step by Step
Get the URL
Read the page content
Extract the key elements
Scraping approaches: via a website API
via the HTML page
Introduction
Scraping through an API (Application Program Interface) returns results in JSON format. The typical workflow:
(1) Check the API's request rate limits
(2) Check the parameters the API requires
(3) Write code that calls the API and retrieves the JSON response
(4) Store the results locally or in a database
JSON (JavaScript Object Notation) is a lightweight data-interchange format; when scraping data through an API, the response is usually returned as JSON.
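As a minimal illustration (not from the original notes) of what a JSON response looks like and how Python parses it, the snippet below round-trips a string shaped like a simplified Douban book response:

```python
import json

# A JSON string shaped like a (simplified) Douban book API response
raw = '{"title": "满月之夜白鲸现", "rating": {"average": "8.3", "numRaters": 278}}'

data = json.loads(raw)                       # parse JSON text into a Python dict
print(data["title"])                         # 满月之夜白鲸现
print(data["rating"]["numRaters"])           # 278

text = json.dumps(data, ensure_ascii=False)  # serialize the dict back to JSON text
```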
Preparation
In a terminal shell, enter the environment: source activate python_study
(leave the environment: source deactivate)
(note: the environment python_study was created in Anaconda beforehand)
Install Jupyter: conda install jupyter (this takes a while)
Launch jupyter notebook
(note: Python packages can be installed with either pip install or conda install)
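The setup steps above, sketched as one shell session (assuming the python_study environment already exists in Anaconda):

```shell
source activate python_study   # enter the environment
conda install jupyter          # install Jupyter (takes a while)
jupyter notebook               # launch Jupyter Notebook
source deactivate              # leave the environment when finished
```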
The difference between the two:
· Pip installs from PyPI. There are no releases of the basemap package on PyPI, it is just a simple registration page pointing at the real download location (SourceForge).
· Conda pulls from its own repository, typically with convenience builds of libraries common to the community Conda is aimed at. Conda's repository has a version of the basemap package available for installation, so it succeeds.
· This is not to say that Pip is "worse" than Conda in this instance, as you could easily download the package and install it with pip locally. This particular library has just opted to not add releases to PyPI.
Part One
A JSON object stores key/value pairs; its values are accessed with the "." or "[]" notation.
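A small sketch (names made up for illustration) of the "[]" access style: in Python, a parsed JSON object becomes a dict, so values are read with ["key"], while the "." notation is how JavaScript reads the same object:

```python
import json

parsed = json.loads('{"title": "demo", "rating": {"average": "9.0", "numRaters": 100}}')

# Nested key access with "[]": a dict within a dict
average = parsed["rating"]["average"]

# .get() is the safe variant: it returns a default instead of raising KeyError
title = parsed.get("title", "unknown")
missing = parsed.get("isbn")   # no "isbn" key, so this is None
print(title, average)          # demo 9.0
```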
# Scrape Douban book data via the Douban API
# uses the urllib.request package
import urllib.request as urlrequest
import json

# Fetch the page content
# search_name = '满月之夜白鲸现'  # title of the book we are looking for
# To search by title, the Chinese text must first be converted into
# URL-safe characters (e.g. with urllib.parse.quote):
# url = "https://book.douban.com/subject_search?search_text={}&cat=1001".format(search_name)
# "." invokes a method on an object
book_id = "1084336"  # placeholder: the numeric Douban ID of the target book
url = "https://api.douban.com/v2/book/{}".format(book_id)
http_content = urlrequest.urlopen(url).read()
# Decode the bytes into readable text
# print(http_content.decode("utf-8"))
# Parse the JSON with Python
json_content = json.loads(http_content)
# print(json_content)
# Extract the key fields
rank = json_content["rating"]["average"]
RateNumbers = json_content["rating"]["numRaters"]
title_name = json_content["title"]
# Print the result
print("\t" "Title_name:", title_name, "\n\t", "RateNumbers:", RateNumbers, "\n\t", "Rank:", rank)
# Save to a local file
with open("douban_practice.txt", "w") as result:
    # ".write" writes to the file; format() interpolates the values into a fixed template
    result.write("{} {} {}\n".format(title_name, RateNumbers, rank))
Part Two
# Scrape the Douban web page directly
import urllib.request as urlrequest
from bs4 import BeautifulSoup

# Fetch the page content
url = "https://movie.douban.com/subject/27186619/"
http_content = urlrequest.urlopen(url).read()
# print(http_content.decode("utf-8"))
# Parse the page content with BeautifulSoup
soup = BeautifulSoup(http_content, "html.parser")
# print(soup.prettify())
# Locate the element in advance with Chrome developer tools: property="v:average"
# ".get_text()" extracts a tag's text content
# soup.find(property="v:average")
rank = soup.find(property="v:average").get_text()
print(rank)
Part Three
Parsing HTML with BeautifulSoup
The main features of the BeautifulSoup library:
· find returns the code block matching a given tag
· prettify pretty-prints the HTML page
· the next & previous family of functions walk to the surrounding tags
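The three features above can be tried offline on a small hand-written HTML fragment (the fragment and its tag names below are made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, loosely imitating a bestseller list
html = """<html><body>
<h1>Bestsellers</h1>
<div class="name"><a href="/book/1" title="Book One">Book One</a></div>
<div class="name"><a href="/book/2" title="Book Two">Book Two</a></div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())                    # pretty-print the parsed HTML

first = soup.find(class_="name")          # find: first tag matching the filter
print(first.a["title"])                   # Book One

second = first.find_next(class_="name")   # find_next: walk forward in the document
print(second.a["href"])                   # /book/2

back = second.find_previous("h1")         # find_previous: walk backward
print(back.get_text())                    # Bestsellers
```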
import urllib.request as urlrequest
from bs4 import BeautifulSoup

for page in range(24):
    number = page + 1
    url = "http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{}".format(number)
    http_content = urlrequest.urlopen(url).read()
    # print(http_content.decode("GBK"))
    soup = BeautifulSoup(http_content, "html.parser", from_encoding="gbk")
    # print(soup.prettify())
    book_content = soup.find_all(class_="name")
    # print(book_content)
    for each_book in book_content:
        book_title = each_book.find("a")["title"]
        book_ahref = each_book.find("a")["href"]
        print(book_title, book_ahref)
        # with open("dangdang.txt", "w") as outputfile:
        #     outputfile.write("{} {}\n".format(book_title, book_ahref))
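One design note on the commented-out write above: opening dangdang.txt in "w" mode inside the loop would truncate the file on every book, so only the last record would survive. The usual fix is to open the file once before the loop; below is a self-contained sketch, with hypothetical hard-coded results standing in for the scraped (book_title, book_ahref) pairs:

```python
# Hypothetical results standing in for scraped (book_title, book_ahref) pairs
books = [
    ("Book One", "http://product.dangdang.com/1.html"),
    ("Book Two", "http://product.dangdang.com/2.html"),
]

# Open the output file once, then write every record inside the loop
with open("dangdang.txt", "w", encoding="utf-8") as outputfile:
    for book_title, book_ahref in books:
        outputfile.write("{} {}\n".format(book_title, book_ahref))

# Read the file back to confirm both records were kept
with open("dangdang.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))   # 2
```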
References
https://stackoverflow.com/questions/31899966/pip-install-vs-conda-install
豆瓣API:https://developers.douban.com/wiki/?title=api_v2
Further Reading
JSON tutorial: http://www.runoob.com/json/json-tutorial.html
Coming Next
Er... there will definitely be one~!