Search for a package:
pip search lxml
Set the pip index: configure pip.conf so the index is applied automatically
# mkdir ~/.pip/
# vim ~/.pip/pip.conf
[global]
index-url=https://pypi.tuna.tsinghua.edu.cn/simple
You can also specify the index explicitly on each install:
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple lxml
Physical layer: electrical connections
Data link layer: switches, STP, Frame Relay
Network layer: routers, the IP protocol
Transport layer: the TCP and UDP protocols
Session layer: establishing and managing communication sessions, e.g. network dial-up
Presentation layer: data representation, encryption, and compression
Application layer: HTTP, FTP
Accept: text/plain
Accept-Charset: utf-8
Accept-Encoding: gzip, deflate
Accept-Language: en-US
Connection: keep-alive
Content-Length: 348
Content-Type: application/x-www-form-urlencoded
Date: Tue, 15 Nov 1994 08:12:31 GMT
Host: en.wikipedia.org:80
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/21.0
Cookie: $Version=1; Skin=new;
HTTP is a classic request/response protocol: the client sends a request, and the server responds to it. In older HTTP versions, every request opened a new client-to-server connection; the request was sent and the response received over that connection. The big advantage of this model is that it is simple, easy to understand, and easy to implement; the big disadvantage is that it is inefficient, which is why Keep-Alive was introduced.
Keep-Alive keeps the client-to-server connection open, so follow-up requests to the same server avoid establishing (or re-establishing) a connection.
By default, all connections in HTTP/1.1 are kept alive unless the request or response headers explicitly close them with: Connection: Close
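As a minimal sketch (Python 3's http.client, assuming the server keeps the connection open), consecutive requests on one HTTPConnection object reuse the same TCP connection:

import http.client

conn = http.client.HTTPConnection('en.wikipedia.org')
for path in ('/', '/wiki/HTTP'):
    conn.request('GET', path, headers={'Connection': 'keep-alive'})
    resp = conn.getresponse()
    body = resp.read()  # drain the body before reusing the connection
    print(resp.status, len(body))
conn.close()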
2XX success, 3XX redirection, 4XX client error, 5XX server error
300 Multiple Choices: multiple representations of the resource exist; process one or drop
301 Moved Permanently: permanent redirect
302 Found: temporary redirect
304 Not Modified: the requested resource has not changed; drop
400 Bad Request: the request has a syntax error the server cannot understand
401 Unauthorized: the request is unauthorized; this status code must be used together with the WWW-Authenticate header
403 Forbidden: the server received the request but refuses to serve it
404 Not Found: the requested resource does not exist, e.g. a mistyped URL
500 Internal Server Error: the server hit an unexpected error
503 Service Unavailable: the server cannot handle the request right now; it may recover after a while
400 Bad Request: check the request parameters or path
401 Unauthorized: if the page requires authorization, try logging in again
403 Forbidden, 404 Not Found: drop the URL
5XX server errors: drop the URL and count the failure; if requests keep failing, log a WARNING and stop crawling
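A hedged sketch of this policy in Python; the action names, the failure counter, and the threshold are hypothetical placeholders for your crawler's own logic:

import logging

MAX_FAILURES = 5  # assumed threshold for consecutive 5XX failures

def handle_status(status, consecutive_5xx):
    # map an HTTP status code to a crawler action (hypothetical action names)
    if 200 <= status < 300:
        return 'process', 0
    if status in (301, 302):
        return 'follow_redirect', 0
    if status == 400:
        return 'check_request', 0
    if status == 401:
        return 'relogin', 0
    if status in (304, 403, 404):
        return 'drop', 0
    if status >= 500:
        consecutive_5xx += 1
        if consecutive_5xx >= MAX_FAILURES:
            logging.warning('too many consecutive 5XX errors, stopping crawl')
            return 'stop', consecutive_5xx
        return 'drop', consecutive_5xx
    return 'drop', 0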
site:www.mafengwo.cn
pip install murmurhash3 bitarray
from bitarray import bitarray
import mmh3

# mmh3.hash() returns a signed 32-bit int;
# adding 2^31 shifts it into the unsigned range 0 ~ 2^32-1
offset = 2**31
bit_array = bitarray(4*1024*1024*1024)  # 2^32 bits
bit_array.setall(0)

url = 'http://www.mafengwo.cn/'  # example input
b1 = mmh3.hash(url, 42) + offset
bit_array[b1] = 1
A Bloom Filter uses several hash functions rather than one. Create an m-bit BitSet and initialize all bits to 0, then choose k different hash functions. The result of the i-th hash function applied to a string str is written h(i, str), and h(i, str) ranges from 0 to m-1.
Insert only, no deletion!!
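A minimal sketch of this k-hash scheme, using mmh3 with k different seeds as the k hash functions (the bit size m and the seeds are illustrative assumptions):

from bitarray import bitarray
import mmh3

M = 2**22                            # m bits in the BitSet (assumed)
SEEDS = [5, 7, 11, 13, 31, 37, 61]   # k hash functions = k mmh3 seeds

bits = bitarray(M)
bits.setall(0)

def bf_add(s):
    for seed in SEEDS:
        bits[mmh3.hash(s, seed) % M] = 1

def bf_contains(s):
    # all k bits set => probably present; any bit clear => definitely absent
    return all(bits[mmh3.hash(s, seed) % M] for seed in SEEDS)

bf_add('http://www.mafengwo.cn/')
print(bf_contains('http://www.mafengwo.cn/'))  # True
print(bf_contains('http://example.com/'))      # False (with high probability)

In practice, a ready-made implementation is usually used: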
pip install pybloomfiltermmap3   # installs the pybloomfilter module
git clone https://github.com/axiak/pybloomfiltermmap.git
Example (note: the capacity is not actually enforced; if you need a lower error_rate, allocate a larger capacity):
>>> import pybloomfilter
>>> fruit = pybloomfilter.BloomFilter(100000, 0.1, '/tmp/words.bloom')
>>> fruit.update(('apple', 'pear', 'orange', 'apple'))
>>> len(fruit)
3
>>> 'mike' in fruit
False
>>> 'apple' in fruit
True
官方文档: https://media.readthedocs.org/pdf/pybloomfiltermmap3/latest/pybloomfiltermmap3.pdf
robots.txt location (append robots.txt to the site root): https://www.mafengwo.cn/robots.txt
Sitemap: http://www.mafengwo.cn/sitemapIndex.xml, which essentially describes the structure of the site
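To honor robots.txt programmatically, a minimal sketch with Python 3's standard-library parser (the test path is a hypothetical example):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.mafengwo.cn/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.mafengwo.cn/some/path'))  # hypothetical path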
Doc: https://docs.python.org/2/library/re.html
Useful Methods: findall(pattern, string, flags=0)
Useful Patterns:
str | Description | str | Description | str | Description |
---|---|---|---|---|---|
. | Any char | * | 0 or more repetitions | \ | Escape |
^ | Start | + | 1 or more repetitions | {m,n} | m to n repetitions |
$ | End | ? | 0 or 1 repetitions | [] | A set of characters |
\A | Start of string | \d | Decimal digit | \| | Or, e.g. A\|B |
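For instance, a quick findall() sketch that pulls href values out of a toy HTML string:

import re

html = '<a href="http://a.example/1">one</a> <a href="http://a.example/2">two</a>'
print(re.findall(r'href="([^"]*)"', html))
# ['http://a.example/1', 'http://a.example/2']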
Summary: DOM selectors for web pages, to quickly locate and manipulate HTML elements; also usable for XML
Install: pip install lxml
Official doc: http://lxml.de/
Methods:
from lxml import etree
import lxml.html

# XPath selectors
html = etree.HTML(html_content.lower().decode('utf-8'))
hrefs = html.xpath(u"//a")                          # all <a> elements
hrefs = html.xpath(u'//a[@class="last-page"]')      # <a> elements with the class
hrefs = html.xpath(u'//*[@class="last-page"]')      # any element with the class

# CSS selectors (requires the cssselect package)
html = lxml.html.fromstring(html_content)
elements = html.cssselect('div#page-let > a.last-page')
Docs: https://dev.mysql.com/doc/connector-python/en/
from hdfs import InsecureClient
from hdfs.util import HdfsError
import mysql.connector
from mysql.connector import errorcode, pooling
import http.client

# HDFS client pointed at the WebHDFS endpoint
hdfs_client = InsecureClient('http://54.223.92.169:50070', user='ec2-user')

# MySQL connection pool shared by the worker threads;
# dbconfig is a dict of connection parameters (host, user, password, database)
self.cnxpool = mysql.connector.pooling.MySQLConnectionPool(
    pool_name="mypool",
    pool_size=max_num_thread,
    **dbconfig)
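A hedged usage sketch: borrow a connection from the pool, run a query, and return it by closing (the table and column names are hypothetical):

cnx = self.cnxpool.get_connection()
cursor = cnx.cursor()
cursor.execute('SELECT url FROM urls_to_crawl LIMIT 10')  # hypothetical table
for (url,) in cursor:
    print(url)
cursor.close()
cnx.close()  # returns the connection to the pool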
pip install hdfs
Methods | Desc |
---|---|
read() | Read a file. |
write() | Write to a file. |
delete() | Remove a file or directory from HDFS. |
rename() | Move a file or folder. |
download() | Download a file or folder from HDFS and save it locally. |
list() | Return names of files contained in a remote folder. |
makedirs() | Create a remote directory, recursively if necessary. |
resolve() | Return absolute, normalized path, with special markers expanded. |
upload() | Upload a file or directory to HDFS. |
walk() | Depth-first walk of remote filesystem. |
from hdfs import InsecureClient
from hdfs.util import HdfsError

hdfs_client = InsecureClient('http://[host]:[port]', user='[user]')
try:
    with hdfs_client.write('/htmls/mfw/%s.html' % (filename)) as writer:
        writer.write(html_page)
except HdfsError as Arguments:
    print Arguments
import socket

# create an INET, STREAMing socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connect to the server below
s.connect((socket.gethostname(), 20010))
# create an INET, STREAMing socket
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# bind the socket to a public host, and a well-known port
serversocket.bind((socket.gethostname(), 20010))
# become a server socket
serversocket.listen(5)

while True:
    # accept connections from outside
    (clientsocket, address) = serversocket.accept()
    # now do something with the clientsocket;
    # in this case, we'll pretend this is a threaded server
    ct = client_thread(clientsocket)
    ct.run()
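client_thread is not defined above; a hypothetical minimal echo version could look like this:

import threading

class client_thread(threading.Thread):
    def __init__(self, clientsocket):
        threading.Thread.__init__(self)
        self.clientsocket = clientsocket

    def run(self):
        data = self.clientsocket.recv(1024)  # read one request
        self.clientsocket.sendall(data)      # echo it back
        self.clientsocket.close()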
pip install networkx
import networkx as nx

g = nx.DiGraph()       # build a directed graph
g.add_node(url)        # add a node (url is a string)
g.add_edge(src, dest)  # add an edge
nx.pagerank(g, 0.9)    # compute PageRank; alpha=0.9 is the damping factor (1-alpha is the random-jump probability)
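A self-contained toy run (the node names are arbitrary):

import networkx as nx

g = nx.DiGraph()
g.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'A'), ('A', 'C')])
print(nx.pagerank(g, 0.9))  # dict of node -> PageRank score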
import urllib
import urllib2

url = 'http://jc.lo/dev/login/login.php'
headers = {
'host': "jc.lo",
'connection': "keep-alive",
'cache-control': "no-cache",
'content-type': "application/x-www-form-urlencoded",
'upgrade-insecure-requests': "1",
'accept-language': "zh-CN,en-US;q=0.8,en;q=0.6",
}
data = {'name':'caca', "password":'c'}
payload = urllib.urlencode(data)
request = urllib2.Request(url, payload , headers=headers)
response = urllib2.urlopen(request)
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')
import cookielib

cj = cookielib.CookieJar()  # create the CookieJar object
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # register the handler
request = urllib2.Request(url, payload, headers=headers)
response = opener.open(request)
print response.info().items()  # print the response headers
print response.read()  # print the returned page
# print the cookie contents
for cookie in cj:
    print cookie.name, cookie.value, cookie.domain
# import webdriver from selenium
from selenium import webdriver
# load PhantomJS driver
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
# set window size, better to fit the whole page in order to
# avoid dynamically loading data
driver.set_window_size(1280, 2400) # optional
# load the page content
driver.get(cur_url)
# use page_source to get html content
content = driver.page_source
Through the browser driver, Selenium supports a large number of HTML and JavaScript operations. Commonly used ones include:
• page_source: the current HTML text
• title: the HTML title
• current_url: the URL of the current page
• get_cookie() & get_cookies(): read the current cookies
• delete_cookie() & delete_all_cookies(): delete cookies
• add_cookie(): add a cookie
• set_page_load_timeout(): set the page-load timeout
• execute_script(): execute a JavaScript snippet synchronously
• execute_async_script(): execute a JavaScript snippet asynchronously
Selenium talks to the browser process through the embedded browser driver, so on exit you must call driver.close() and driver.quit() to shut PhantomJS down; otherwise PhantomJS keeps running in the background and holds on to system resources.
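For example, one way to guarantee the shutdown even when scraping fails:

try:
    driver.get(cur_url)
    content = driver.page_source
finally:
    driver.close()
    driver.quit()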
--ignore-ssl-errors=[true|false] ignores SSL errors, such as expired or self-signed certificate errors (default is false). Also accepted: [yes|no].
--load-images=[true|false] load all inlined images (default is true). Also accepted: [yes|no].
--disk-cache=[true|false] enables disk cache (at desktop services cache storage location, default is false). Also accepted: [yes|no].
--cookies-file=/path/to/cookies.txt specifies the file name to store the persistent Cookies.
--debug=[true|false] prints additional warning and debug messages (default is false). Also accepted: [yes|no].
--config specifies JSON-formatted configuration file (see below).
--ignore-ssl-errors=[true|false]
Some certificates are not signed by a CA (often self-made certificates), so the browser reports them as untrusted, which normally requires user interaction (clicking continue or trust). With this option, such errors are ignored automatically.
--load-images=[true|false]
Pages usually contain many images that are useless on a first crawl pass; --load-images=false skips downloading them and speeds up the fetch.
import subprocess
subprocess.call('pgrep phantomjs | xargs kill', shell=True)  # shell=True is needed for the pipe
Use a Scrapy command to create a Spider subclass, then run the spider with
scrapy runspider xxx.py
or
scrapy crawl xxx
and a crawl job starts. For example:
scrapy runspider quotes_spider.py -o quotes.json
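A minimal sketch of what quotes_spider.py could contain, following the standard Scrapy tutorial layout (the site and selectors are assumptions):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }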
scrapy startproject chdemo   # create a new project
scrapy genspider mfw mafengwo.com   # create the mfw spider
pip install simhash
from simhash import Simhash

str0 = 'The Apache Hadoop software library is a framework that allows for the distributed processing large data'
str1 = 'The Apache Hadoop software library is a framework that allows for the distributed processing big data'

# build the SimHash objects
sh0 = Simhash(str0)
sh1 = Simhash(str1)

# unweighted Hamming distance
sh0.distance(sh1)

# feature list with keyword weights
features = [('Apache', 10), ('Hadoop', 15), ('framework', 3), ('distributed', 10), ('data', 6)]

# weighted Hamming distance
sh0.build_by_features(features)
sh1.build_by_features(features)
sh0.distance(sh1)
Pillow is an imaging toolkit; it provides an Image class for image processing.
pip install pillow
from io import BytesIO
from PIL import Image
import lxml.html

def extract_image(html):
    tree = lxml.html.fromstring(html)
    img_data = tree.cssselect('div#recaptcha img')[0].get('src')
    # the src is a data URI; keep only the base64 payload after the comma
    img_data = img_data.partition(',')[-1]
    binary_img_data = img_data.decode('base64')  # Python 2 base64 codec
    img_data = BytesIO(binary_img_data)
    img = Image.open(img_data)
    img.save('test.png')
    return img
Tesseract-OCR is an open-source OCR (Optical Character Recognition) engine led by Google. It has many open-source Python wrappers, e.g.:
pip install pytesseract
import pytesseract
pytesseract.image_to_string(bw)  # bw is a PIL Image, e.g. the binarized captcha below
Recognition workflow
Many captchas deliberately add noise, so the first step is to find the noise and remove it.
http://www.bjhjyd.gov.cn/
Find the captcha's color by counting the pixels of each color:
pixdata = img.load()
colors = {}
# count the pixels of each color
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pixdata[x, y] in colors:
            colors[pixdata[x, y]] += 1
        else:
            colors[pixdata[x, y]] = 1
# the most frequent color is the background, the second is the captcha text
colors = sorted(colors.items(), key=lambda d: d[1], reverse=True)
((240, 240, 240), 1996)  # first: the background color
((51, 153, 0), 645)      # second: the captcha font color
((241, 244, 237), 168), ((192, 168, 185), 37), ((161, 250, 53), 1)
Denoising: set the captcha color to black and every other color to white:
significant = colors[1][0]  # the captcha font color found above
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pixdata[x, y] != significant:
            pixdata[x, y] = (255, 255, 255)  # everything else -> white
        else:
            pixdata[x, y] = (0, 0, 0)        # captcha pixels -> black
Call Tesseract OCR to recognize the text:
word = pytesseract.image_to_string(img, lang='eng', config='ocr.conf')
To limit recognition to A~Z, a~z, 0~9, set tessedit_char_whitelist in the config file:
tessedit_char_whitelist abdefghijklmnoprstuvwxyzABDEFGHIJKLMNOPQRSTUVWXYZ1234567890
Page-scrolling command:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
Removing JavaScript and CSS: lxml's clean module can strip the CSS and script content embedded in HTML, such as:
<script language="javascript" type="text/javascript">if (typeof M !== "undefined" && typeof M.loadResource === "function") {M.loadResource("http://js.mafengwo.net/js/cv/js+pageletcommon+pageHeadUserInfoWWWDark:js++ACnzzGaLog:js+ARecruit:js+ALazyLoad^Z11V^1489552560.js");}</script>
from lxml.html import clean
cleaner = clean.Cleaner(style=True, scripts=True, comments=True, javascript=True, page_structure=False, safe_attrs_only=False)
content = cleaner.clean_html(content.decode('utf-8')).encode('utf-8')
Stripping all HTML tags: the regular expression below also removes the tags and their attributes, leaving only the body text:
import re

reg = re.compile("<[^>]*>")
content = reg.sub('', content)