Search for a package:
pip search lxml
Set the pip index: configure pip.conf so the index is applied automatically
# mkdir ~/.pip/
# vim ~/.pip/pip.conf
[global]
index-url=https://pypi.tuna.tsinghua.edu.cn/simple
You can also specify the index explicitly on each install:
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple lxml
Physical layer: electrical connections
Data link layer: switches, STP, Frame Relay
Network layer: routers, the IP protocol
Transport layer: the TCP and UDP protocols
Session layer: establishing and managing communication sessions, e.g. network dial-up
Presentation layer: data representation, encryption, and compression
Application layer: HTTP, FTP
Accept: text/plain
Accept-Charset: utf-8
Accept-Encoding: gzip, deflate
Accept-Language: en-US
Connection: keep-alive
Content-Length: 348
Content-Type: application/x-www-form-urlencoded
Date: Tue, 15 Nov 1994 08:12:31 GMT
Host: en.wikipedia.org:80
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/21.0
Cookie: $Version=1; Skin=new;
HTTP is a classic request/response protocol: the client sends a request, and the server responds to it. In older HTTP versions, every request opened a new client-to-server connection; the request was sent and the response received over that connection. The big advantage of this model is that it is simple, easy to understand, and easy to implement; the big disadvantage is that it is inefficient, which is why Keep-Alive was introduced.
Keep-Alive keeps the client-to-server connection open, so follow-up requests to the same server avoid establishing (or re-establishing) a connection.
By default, all connections in HTTP/1.1 are kept alive unless the request or response headers explicitly close them with: Connection: Close
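As a minimal sketch (Python 3's http.client, assuming the server keeps the connection open), consecutive requests on one HTTPConnection object reuse the same TCP connection:

import http.client

conn = http.client.HTTPConnection('en.wikipedia.org')
for path in ('/', '/wiki/HTTP'):
    conn.request('GET', path, headers={'Connection': 'keep-alive'})
    resp = conn.getresponse()
    body = resp.read()  # drain the body before reusing the connection
    print(resp.status, len(body))
conn.close()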
2XX success, 3XX redirection, 4XX client error, 5XX server error
300 Multiple Choices: multiple representations of the resource exist; process one or drop
301 Moved Permanently: permanent redirect
302 Found: temporary redirect
304 Not Modified: the requested resource has not changed; drop
400 Bad Request: the request has a syntax error the server cannot understand
401 Unauthorized: the request is unauthorized; this status code must be used together with the WWW-Authenticate header
403 Forbidden: the server received the request but refuses to serve it
404 Not Found: the requested resource does not exist, e.g. a mistyped URL
500 Internal Server Error: the server hit an unexpected error
503 Service Unavailable: the server cannot handle the request right now; it may recover after a while
400 Bad Request: check the request parameters or path
401 Unauthorized: if the page requires authorization, try logging in again
403 Forbidden, 404 Not Found: drop the URL
5XX server errors: drop the URL and count the failure; if requests keep failing, log a WARNING and stop crawling
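A hedged sketch of this policy in Python; the action names, the failure counter, and the threshold are hypothetical placeholders for your crawler's own logic:

import logging

MAX_FAILURES = 5  # assumed threshold for consecutive 5XX failures

def handle_status(status, consecutive_5xx):
    # map an HTTP status code to a crawler action (hypothetical action names)
    if 200 <= status < 300:
        return 'process', 0
    if status in (301, 302):
        return 'follow_redirect', 0
    if status == 400:
        return 'check_request', 0
    if status == 401:
        return 'relogin', 0
    if status in (304, 403, 404):
        return 'drop', 0
    if status >= 500:
        consecutive_5xx += 1
        if consecutive_5xx >= MAX_FAILURES:
            logging.warning('too many consecutive 5XX errors, stopping crawl')
            return 'stop', consecutive_5xx
        return 'drop', consecutive_5xx
    return 'drop', 0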
site:www.mafengwo.cn
pip install murmurhash3 bitarray
from bitarray import bitarray
import mmh3

# mmh3.hash() returns a signed 32-bit int;
# adding 2^31 shifts it into the unsigned range 0 ~ 2^32-1
offset = 2**31
bit_array = bitarray(4*1024*1024*1024)  # 2^32 bits
bit_array.setall(0)

url = 'http://www.mafengwo.cn/'  # example input
b1 = mmh3.hash(url, 42) + offset
bit_array[b1] = 1
A Bloom Filter uses several hash functions rather than one. Create an m-bit BitSet and initialize all bits to 0, then choose k different hash functions. The result of the i-th hash function applied to a string str is written h(i, str), and h(i, str) ranges from 0 to m-1.
Insert only, no deletion!!
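A minimal sketch of this k-hash scheme, using mmh3 with k different seeds as the k hash functions (the bit size m and the seeds are illustrative assumptions):

from bitarray import bitarray
import mmh3

M = 2**22                            # m bits in the BitSet (assumed)
SEEDS = [5, 7, 11, 13, 31, 37, 61]   # k hash functions = k mmh3 seeds

bits = bitarray(M)
bits.setall(0)

def bf_add(s):
    for seed in SEEDS:
        bits[mmh3.hash(s, seed) % M] = 1

def bf_contains(s):
    # all k bits set => probably present; any bit clear => definitely absent
    return all(bits[mmh3.hash(s, seed) % M] for seed in SEEDS)

bf_add('http://www.mafengwo.cn/')
print(bf_contains('http://www.mafengwo.cn/'))  # True
print(bf_contains('http://example.com/'))      # False (with high probability)

In practice, a ready-made implementation is usually used: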
pip install pybloomfiltermmap3   # installs the pybloomfilter module
git clone https://github.com/axiak/pybloomfiltermmap.git
Example (note: the capacity is not actually enforced; if you need a lower error_rate, allocate a larger capacity):
>>> import pybloomfilter
>>> fruit = pybloomfilter.BloomFilter(100000, 0.1, '/tmp/words.bloom')
>>> fruit.update(('apple', 'pear', 'orange', 'apple'))
>>> len(fruit)
3
>>> 'mike' in fruit
False
>>> 'apple' in fruit
True
官方文档: https://media.readthedocs.org/pdf/pybloomfiltermmap3/latest/pybloomfiltermmap3.pdf
robots.txt location (append robots.txt to the site root): https://www.mafengwo.cn/robots.txt
Sitemap: http://www.mafengwo.cn/sitemapIndex.xml, which essentially describes the structure of the site
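To honor robots.txt programmatically, a minimal sketch with Python 3's standard-library parser (the test path is a hypothetical example):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.mafengwo.cn/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.mafengwo.cn/some/path'))  # hypothetical path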
Doc: https://docs.python.org/2/library/re.html
Useful Methods: findall(pattern, string, flags=0)
Useful Patterns:
str | Description | str | Description | str | Description |
---|---|---|---|---|---|
. | Any char | * | 0 or more repetitions | \ | Escape |
^ | Start | + | 1 or more repetitions | {m,n} | m to n repetitions |
$ | End | ? | 0 or 1 repetitions | [] | A set of characters |
\A | Start of string | \d | Decimal digit | \| | Or, e.g. A\|B |
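For instance, a quick findall() sketch that pulls href values out of a toy HTML string:

import re

html = '<a href="http://a.example/1">one</a> <a href="http://a.example/2">two</a>'
print(re.findall(r'href="([^"]*)"', html))
# ['http://a.example/1', 'http://a.example/2']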
Summary: DOM selectors for web pages, to quickly locate and manipulate HTML elements; also usable for XML
Install: pip install lxml
Official doc: http://lxml.de/
Methods:
from lxml import etree
import lxml.html

# XPath selectors
html = etree.HTML(html_content.lower().decode('utf-8'))
hrefs = html.xpath(u"//a")                          # all <a> elements
hrefs = html.xpath(u'//a[@class="last-page"]')      # <a> elements with the class
hrefs = html.xpath(u'//*[@class="last-page"]')      # any element with the class

# CSS selectors (requires the cssselect package)
html = lxml.html.fromstring(html_content)
elements = html.cssselect('div#page-let > a.last-page')
Docs: https://dev.mysql.com/doc/connector-python/en/
from hdfs import InsecureClient
from hdfs.util import HdfsError
import mysql.connector
from mysql.connector import errorcode, pooling
import http.client

# HDFS client pointed at the WebHDFS endpoint
hdfs_client = InsecureClient('http://54.223.92.169:50070', user='ec2-user')

# MySQL connection pool shared by the worker threads;
# dbconfig is a dict of connection parameters (host, user, password, database)
self.cnxpool = mysql.connector.pooling.MySQLConnectionPool(
    pool_name="mypool",
    pool_size=max_num_thread,
    **dbconfig)
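A hedged usage sketch: borrow a connection from the pool, run a query, and return it by closing (the table and column names are hypothetical):

cnx = self.cnxpool.get_connection()
cursor = cnx.cursor()
cursor.execute('SELECT url FROM urls_to_crawl LIMIT 10')  # hypothetical table
for (url,) in cursor:
    print(url)
cursor.close()
cnx.close()  # returns the connection to the pool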
pip install hdfs
Methods | Desc |
---|---|
read() | Read a file. |
write() | Write to a file. |
delete() | Remove a file or directory from HDFS. |
rename() | Move a file or folder. |
download() | Download a file or folder from HDFS and save it locally. |
list() | Return names of files contained in a remote folder. |
makedirs() | Create a remote directory, recursively if necessary. |
resolve() | Return absolute, normalized path, with special markers expanded. |
upload() | Upload a file or directory to HDFS. |
walk() | Depth-first walk of remote filesystem. |
from hdfs import InsecureClient
from hdfs.util import HdfsError

hdfs_client = InsecureClient('http://[host]:[port]', user='[user]')
try:
    with hdfs_client.write('/htmls/mfw/%s.html' % (filename)) as writer:
        writer.write(html_page)
except HdfsError as Arguments:
    print Arguments
import socket

# create an INET, STREAMing socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connect to the server below
s.connect((socket.gethostname(), 20010))
# create an INET, STREAMing socket
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# bind the socket to a public host, and a well-known port
serversocket.bind((socket.gethostname(), 20010))
# become a server socket
serversocket.listen(5)

while True:
    # accept connections from outside
    (clientsocket, address) = serversocket.accept()
    # now do something with the clientsocket;
    # in this case, we'll pretend this is a threaded server
    ct = client_thread(clientsocket)
    ct.run()
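client_thread is not defined above; a hypothetical minimal echo version could look like this:

import threading

class client_thread(threading.Thread):
    def __init__(self, clientsocket):
        threading.Thread.__init__(self)
        self.clientsocket = clientsocket

    def run(self):
        data = self.clientsocket.recv(1024)  # read one request
        self.clientsocket.sendall(data)      # echo it back
        self.clientsocket.close()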
pip install networkx
import networkx as nx

g = nx.DiGraph()       # build a directed graph
g.add_node(url)        # add a node (url is a string)
g.add_edge(src, dest)  # add an edge
nx.pagerank(g, 0.9)    # compute PageRank; alpha=0.9 is the damping factor (1-alpha is the random-jump probability)
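A self-contained toy run (the node names are arbitrary):

import networkx as nx

g = nx.DiGraph()
g.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'A'), ('A', 'C')])
print(nx.pagerank(g, 0.9))  # dict of node -> PageRank score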
import urllib
import urllib2

url = 'http://jc.lo/dev/login/login.php'
headers = {
'host': "jc.lo",
'connection': "keep-alive",
'cache-control': "no-cache",
'content-type': "application/x-www-form-urlencoded",
'upgrade-insecure-requests': "1",
'accept-language': "zh-CN,en-US;q=0.8,en;q=0.6",
}
data = {'name':'caca', "password":'c'}
payload = urllib.urlencode(data)
request = urllib2.Request(url, payload , headers=headers)
response = urllib2.urlopen(request)
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')
import cookielib

cj = cookielib.CookieJar()  # create the CookieJar object
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # register the handler
request = urllib2.Request(url, payload, headers=headers)
response = opener.open(request)
print response.info().items()  # print the response headers
print response.read()  # print the returned page
# print the cookie contents
for cookie in cj:
    print cookie.name, cookie.value, cookie.domain
# import webdriver from selenium
from selenium import webdriver
# load PhantomJS driver
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
# set window size, better to fit the whole page in order to
# avoid dynamically loading data
driver.set_window_size(1280, 2400) # optional
# load the page content
driver.get(cur_url)
# use page_source to get html content
content = driver.page_source
Through the browser driver, Selenium supports a large number of HTML and JavaScript operations. Commonly used ones include:
• page_source: the current HTML text
• title: the HTML title
• current_url: the URL of the current page
• get_cookie() & get_cookies(): read the current cookies
• delete_cookie() & delete_all_cookies(): delete cookies
• add_cookie(): add a cookie
• set_page_load_timeout(): set the page-load timeout
• execute_script(): execute a JavaScript snippet synchronously
• execute_async_script(): execute a JavaScript snippet asynchronously
Selenium talks to the browser process through the embedded browser driver, so on exit you must call driver.close() and driver.quit() to shut PhantomJS down; otherwise PhantomJS keeps running in the background and holds on to system resources.
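For example, one way to guarantee the shutdown even when scraping fails:

try:
    driver.get(cur_url)
    content = driver.page_source
finally:
    driver.close()
    driver.quit()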
--ignore-ssl-errors=[true|false] ignores SSL errors, such as expired or self-signed certificate errors (default is false). Also accepted: [yes|no].
--load-images=[true|false] load all inlined images (default is true). Also accepted: [yes|no].
--disk-cache=[true|false] enables disk cache (at desktop services cache storage location, default is false). Also accepted: [yes|no].
--cookies-file=/path/to/cookies.txt specifies the file name to store the persistent Cookies.
--debug=[true|false] prints additional warning and debug messages (default is false). Also accepted: [yes|no].
--config specifies JSON-formatted configuration file (see below).
--ignore-ssl-errors=[true|false]
Some certificates are not signed by a CA (often self-made certificates), so the browser reports them as untrusted, which normally requires user interaction (clicking continue or trust). With this option, such errors are ignored automatically.
--load-images=[true|false]
Pages usually contain many images that are useless on a first crawl pass; --load-images=false skips downloading them and speeds up the fetch.
import subprocess
subprocess.call('pgrep phantomjs | xargs kill', shell=True)  # shell=True is needed for the pipe
Use a Scrapy command to create a Spider subclass, then run the spider with
scrapy runspider xxx.py
or
scrapy crawl xxx
and a crawl job starts. For example:
scrapy runspider quotes_spider.py -o quotes.json
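A minimal sketch of what quotes_spider.py could contain, following the standard Scrapy tutorial layout (the site and selectors are assumptions):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }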
scrapy startproject chdemo   # create a new project
scrapy genspider mfw mafengwo.com   # create the mfw spider
pip install simhash
from simhash import Simhash

str0 = 'The Apache Hadoop software library is a framework that allows for the distributed processing large data'
str1 = 'The Apache Hadoop software library is a framework that allows for the distributed processing big data'

# build the SimHash objects
sh0 = Simhash(str0)
sh1 = Simhash(str1)

# unweighted Hamming distance
sh0.distance(sh1)

# feature list with keyword weights
features = [('Apache', 10), ('Hadoop', 15), ('framework', 3), ('distributed', 10), ('data', 6)]

# weighted Hamming distance
sh0.build_by_features(features)
sh1.build_by_features(features)
sh0.distance(sh1)
Pillow is an imaging toolkit; it provides an Image class for image processing.
pip install pillow
from io import BytesIO
from PIL import Image
import lxml.html

def extract_image(html):
    tree = lxml.html.fromstring(html)
    img_data = tree.cssselect('div#recaptcha img')[0].get('src')
    # the src is a data URI; keep only the base64 payload after the comma
    img_data = img_data.partition(',')[-1]
    binary_img_data = img_data.decode('base64')  # Python 2 base64 codec
    img_data = BytesIO(binary_img_data)
    img = Image.open(img_data)
    img.save('test.png')
    return img
Tesseract-OCR is an open-source OCR (Optical Character Recognition) engine led by Google. It has many open-source Python wrappers, e.g.:
pip install pytesseract
import pytesseract
pytesseract.image_to_string(bw)  # bw is a PIL Image, e.g. the binarized captcha below
Recognition workflow
Many captchas deliberately add noise, so the first step is to find the noise and remove it.
http://www.bjhjyd.gov.cn/
Find the captcha's color by counting the pixels of each color:
pixdata = img.load()
colors = {}
# count the pixels of each color
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pixdata[x, y] in colors:
            colors[pixdata[x, y]] += 1
        else:
            colors[pixdata[x, y]] = 1
# the most frequent color is the background, the second is the captcha text
colors = sorted(colors.items(), key=lambda d: d[1], reverse=True)
((240, 240, 240), 1996)  # first: the background color
((51, 153, 0), 645)      # second: the captcha font color
((241, 244, 237), 168), ((192, 168, 185), 37), ((161, 250, 53), 1)
Denoising: set the captcha color to black and every other color to white:
significant = colors[1][0]  # the captcha font color found above
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pixdata[x, y] != significant:
            pixdata[x, y] = (255, 255, 255)  # everything else -> white
        else:
            pixdata[x, y] = (0, 0, 0)        # captcha pixels -> black
Call Tesseract OCR to recognize the text:
word = pytesseract.image_to_string(img, lang='eng', config='ocr.conf')
To limit recognition to A~Z, a~z, 0~9, set tessedit_char_whitelist in the config file:
tessedit_char_whitelist abdefghijklmnoprstuvwxyzABDEFGHIJKLMNOPQRSTUVWXYZ1234567890
Page-scrolling command:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
Removing JavaScript and CSS: lxml's clean module can strip the CSS and script content embedded in HTML, such as:
<script language="javascript" type="text/javascript">if (typeof M !== "undefined" && typeof M.loadResource === "function") {M.loadResource("http://js.mafengwo.net/js/cv/js+pageletcommon+pageHeadUserInfoWWWDark:js++ACnzzGaLog:js+ARecruit:js+ALazyLoad^Z11V^1489552560.js");}</script>
from lxml.html import clean
cleaner = clean.Cleaner(style=True, scripts=True, comments=True, javascript=True, page_structure=False, safe_attrs_only=False)
content = cleaner.clean_html(content.decode('utf-8')).encode('utf-8')
Stripping all HTML tags: the regular expression below also removes the tags and their attributes, leaving only the body text:
import re

reg = re.compile("<[^>]*>")
content = reg.sub('', content)