6. BeautifulSoup4: Automatic Website Login (Manual Captcha Version)

酱紫安

Published 2018-04-16 15:42:24

This article is part of the column: python学习路

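A small example every day: (Logging in automatically to the site used in the tutorial video worked easily. When I practiced on other sites by myself, I kept running into problems.)

This time I analyzed the login for BOSS Zhipin (boss直聘) on my own. It took me a whole afternoon, and I still type the captcha in by hand; I have not yet managed to recognize and enter it automatically. As the saying goes, the master opens the door but the practice is up to you, so I need to practice more.

Step 1. Visit the site first and work out what data the login requires

Step 2. Create a Beautiful Soup object with an explicit parser, then extract the data used for login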

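data = {
    'regionCode': '+86',
    'account': account,      # your account
    'password': password,    # your password
    'captcha': captcha,      # the captcha you typed in
    'randomKey': randomKey   # the randomKey carried by the captcha URL
}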

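Step 3. After a successful login you can do things that require being logged in. I could not think of anything in particular, so I simply grab some job listings, which works without logging in too. I am just practicing Beautiful Soup.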

import requests
from bs4 import BeautifulSoup

# Step 1. Visit the site first and work out what data the login requires
# A Session keeps cookies between requests; without it, every request
# would have to carry the authorized cookies explicitly.
session = requests.Session()
bossUrl = 'https://login.zhipin.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
response = session.get(url=bossUrl, headers=headers)


# Step 2. Create a Beautiful Soup object with an explicit parser, then extract the data used for login
# The data dict below holds what the login needs
soup = BeautifulSoup(response.text, 'lxml')

# Get the captcha URL
captchaUrl = soup.select('span .verifyimg')[0].get('src')
# Fetch the image through the same session so it shares the login cookies
img = session.get(bossUrl + captchaUrl, headers=headers)

# Get the randomKey
randomKey = captchaUrl.split('=')[1]

# Save the captcha image
with open('captcha.png', 'wb') as f:
    f.write(img.content)
# Type in the captcha by hand (the prompt means "Please enter the captcha:")
captcha = input('请输入验证码：')

# Placeholders: replace with your own credentials
account = 'your_account'
password = 'your_password'

data = {
    'regionCode': '+86',
    'account': account,
    'password': password,
    'captcha': captcha,
    'randomKey': randomKey
}
loginUrl = 'https://login.zhipin.com/login/account.json'
login = session.post(loginUrl, data=data, headers=headers)


# Step 3. After a successful login you can do things that require being logged in.
# The data scraped below is available without logging in; this is just practice.
boss = session.get("https://www.zhipin.com/", headers=headers)
jobSoup = BeautifulSoup(boss.text, 'lxml')
jobPrimary = jobSoup.select('.sub-li')

for job in jobPrimary:
    job_info = job.find('p').get_text()
    try:
        job_text = job.find_all(name='p', class_='job-text')[0].get_text()
    except IndexError:
        # No <p class="job-text"> inside this item
        job_text = ''

    print(job_info, job_text)
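The captcha handling above builds the image URL by string concatenation and reads the randomKey with split('='), which depends on the exact shape of the src attribute. A slightly more defensive sketch using only the standard library is shown below. The src value is a made-up example (the real attribute on the site may look different), and captcha_info is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlparse, parse_qs


def captcha_info(base_url, captcha_src):
    """Return (absolute captcha URL, randomKey) for a captcha <img> src."""
    # urljoin handles leading slashes, so no doubled "//" in the path
    full_url = urljoin(base_url, captcha_src)
    # parse_qs maps each query parameter name to a list of values
    query = parse_qs(urlparse(full_url).query)
    random_key = query.get('randomKey', [''])[0]
    return full_url, random_key


# Hypothetical src value, for illustration only
url, key = captcha_info('https://login.zhipin.com/', '/captcha/?randomKey=abc123')
print(url)
print(key)
```

With this approach an extra query parameter after randomKey would not end up inside the key, which split('=')[1] cannot guarantee.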

Result: I only extracted the data in a simple way, without tidying it up

D:\python.exe F:/django_test/spider/captcha_test.py
请输入验证码：5n47
架构师30K - 45K  杭州3-5年本科
web前端20K - 40K  北京3-5年本科
测试开发20K - 40K  杭州经验不限本科
销售运营23K - 24K  上海5-10年本科
内容营销产品运营15K - 30K  北京5-10年本科
Android22K - 44K  北京3-5年本科
iOS(高P)38K - 58K  北京5-10年硕士

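Like lxml, Beautiful Soup is an HTML/XML parser whose main job is parsing HTML/XML and extracting data from it.

lxml is implemented in C on top of libxml2, whereas Beautiful Soup loads the whole document and builds a DOM-style tree of Python objects, so its time and memory overhead is considerably larger and its performance is lower than lxml's.
Beautiful Soup makes parsing HTML fairly simple and has a very friendly API. It supports CSS selectors and the HTML parser in the Python standard library, and it can also use lxml's HTML and XML parsers.
Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4 Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0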
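Since the parser is chosen by name when the object is created, one possible way to prefer lxml but still run on a machine without it is sketched below. make_soup is a hypothetical helper, not part of Beautiful Soup; the sketch assumes that a missing parser raises bs4.FeatureNotFound, which is how the library reports an unavailable tree builder.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use the standard library parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # 'html.parser' ships with Python, so it is always available
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p class="story">Hello</p>')
print(soup.p.get_text())
```

The two parsers can build slightly different trees for broken markup, so it is worth sticking to one parser within a project.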

First, import it:

# Import the module
from bs4 import BeautifulSoup


html = html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story总共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

# Create the Beautiful Soup object and specify a parser; without one you get a warning
'''
 UserWarning: No parser was explicitly specified, so I'm using the best.... 

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
'''
soup = BeautifulSoup(html,'lxml')

# Create the object from a local HTML file instead
# soup = BeautifulSoup(open('hello.html'),'lxml')

# Find the first a tag
tag1 = soup.find(name='a')
# Find all a tags
tag2 = soup.find_all(name='a')
# Find the tag with id=link2 (select() returns a list)
tag3 = soup.select('#link2')
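The three calls above return different kinds of objects, which matters when nothing matches. The following is a small sketch on a cut-down document (not the article's html_doc), using the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<div><a id="link1">one</a><a id="link2">two</a></div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')            # a single Tag, or None when nothing matches
all_links = soup.find_all('a')    # a list-like ResultSet of Tag objects
by_id = soup.select('#link2')     # a list, even when there is only one match
one = soup.select_one('#link2')   # a single Tag, or None
missing = soup.find('table')      # no <table> in the document

print(first.get('id'), len(all_links), by_id[0].get_text(), one.get_text(), missing)
```

Because find() and select_one() return None on a miss, it is usually safer to check the result before calling methods on it.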

1. find_all(name, attrs, recursive, text, **kwargs): get all matching tags

import re

tags = soup.find_all('a')
print(tags)

tags = soup.find_all('a', limit=1)
print(tags)

tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tags)


# ####### Lists: match any value in the list #######
v = soup.find_all(name=['a', 'div'])
print(v)

v = soup.find_all(class_=['sister0', 'sister'])
print(v)

v = soup.find_all(text=['Tillie'])
print(v, type(v[0]))


v = soup.find_all(id=['link1', 'link2'])
print(v)

v = soup.find_all(href=['link1', 'link2'])
print(v)

# ####### Regular expressions #######
# rep = re.compile('p')   # tag names containing "p"
rep = re.compile('^p')    # tag names starting with "p"
v = soup.find_all(name=rep)
print(v)

rep = re.compile('sister.*')
v = soup.find_all(class_=rep)
print(v)

rep = re.compile('http://www.oldboy.com/static/.*')
v = soup.find_all(href=rep)
print(v)

# ####### Filtering with a function #######
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')
v = soup.find_all(name=func)
print(v)


# ## get: read a tag attribute
tag = soup.find('a')
v = tag.get('id')
print(v)
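To make the effect of each filter type easier to see, here is a compact sketch that prints only the ids of the matched tags. It uses a reduced version of the sample document and the 'html.parser' parser; the ids helper is just for display. It also uses string=, which Beautiful Soup 4.4.0 and later accept as the newer name for the text= argument.

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="story">
  <a class="sister0" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
  <p class="story">...</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')


def ids(tags):
    """Collect the id attribute of every tag (illustrative helper)."""
    return [t.get('id') for t in tags]


by_string = soup.find_all('a', string='Lacie')                     # exact text match
by_regex = soup.find_all(href=re.compile(r'/(lacie|tillie)$'))      # regex on an attribute
by_func = soup.find_all(lambda t: t.name == 'a' and not t.has_attr('href'))  # function filter
limited = soup.find_all('a', limit=2)                               # stop after two matches

print(ids(by_string), ids(by_regex), ids(by_func), ids(limited))
```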

2. find(name, attrs, recursive, text, **kwargs): get the first matching tag

tag = soup.find('a')
print(tag)
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(text='Lacie')
print(tag)

3. name: the tag name; attrs: the tag attributes

tag = soup.find('a')
# a
print(tag.name)

attrs = tag.attrs    # get
print(tag.attrs)
# {'class': ['sister0'], 'id': 'link1'}

tag.attrs = {'i': 123}     # set: replaces all attributes
tag.attrs['id'] = 'iiiii'  # set: a single attribute
print(tag.attrs)
# {'i': 123, 'id': 'iiiii'}
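Besides going through tag.attrs, a tag can be treated like a dictionary. The sketch below (on a made-up one-line document, parsed with 'html.parser') shows reading, adding and deleting single attributes, and why class comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister big" id="link1" href="/x">t</a>', 'html.parser')
tag = soup.a

classes = tag['class']                   # class is multi-valued, so it is a list
title = tag.get('title')                 # missing attribute: get() returns None
fallback = tag.get('title', 'no title')  # or a default of your choice

tag['title'] = 'hello'   # add or overwrite one attribute
del tag['id']            # remove one attribute

print(classes, title, fallback)
print(tag.attrs)
```

Indexing with tag['title'] raises KeyError for a missing attribute, so get() is the safer choice when an attribute is optional.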

4. children: all direct child nodes

'''
.children does not return a list, but we can iterate over it to get every child node.
Printing .children shows that it is a list iterator object.
'''
div = soup.find('div',class_="story")
print(div.children)

for child in div.children:
    print(child)


# Result:
'''Once upon a time there were three little sisters; and their names were
    
<a class="sister0" id="link1">Els<span>f</span>ie</a>
,
    
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and
    
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
;
and they lived at the bottom of a well.'''

5. descendants: all descendant nodes (children, grandchildren, and so on)

div = soup.find('div',class_="story")
print(div.descendants)

for child in div.descendants:
    print(child)


# Result:
'''Once upon a time there were three little sisters; and their names were
    
<a class="sister0" id="link1">Els<span>f</span>ie</a>
Els
<span>f</span>
f
ie
,
    
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and
    
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.'''
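As the two result listings show, both .children and .descendants yield text nodes as well as tags. The sketch below counts both on a shortened div (parsed with 'html.parser') and shows one way to keep only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<div>Hello <a>El<span>f</span>ie</a>, <a>Lacie</a></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

children = list(div.children)        # direct children only: text, <a>, text, <a>
descendants = list(div.descendants)  # everything nested inside, in document order

# Keep only Tag objects, dropping the text nodes
child_tags = [c for c in div.children if isinstance(c, Tag)]
# Same idea through find_all: recursive=False looks at direct children only
direct_links = div.find_all('a', recursive=False)

print(len(children), len(descendants), len(child_tags), len(direct_links))
```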

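6. CSS selectors

This is another way of searching that achieves much the same as find_all.

When writing CSS, a tag name is written bare, a class name is prefixed with ., and an id is prefixed with #.
We can filter elements the same way here using soup.select(), which returns a list.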

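# Search by tag name
print(soup.select('title'))

# Search by class name
print(soup.select('.sister'))

# Search by id
print(soup.select('#link1'))

# Combined search
'''Combined selectors work the same way as in a CSS file, where tag names, class names
   and id names are combined. For example, to find the element with id link1 inside
   a div tag, separate the two with a space.'''
print(soup.select('div #link1'))

# Attribute search
'''You can also add attribute conditions, written in square brackets. Parts that belong
   to different nodes are separated by a space. Note that an attribute and its tag belong
   to the same node, so there must be no space between them, or nothing will match.'''
print(soup.select('a[class="sister"]'))
print(soup.select('div a[class="sister"]'))

# Getting content: all the select() calls above return lists,
# so you can loop over the results and call get_text() to get the content.
for title in soup.select('title'):
    print(title.get_text())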
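A few more selector forms are often useful alongside the ones above. The sketch below uses a small made-up document and 'html.parser' to contrast the descendant combinator (a space) with the child combinator (>), and adds an attribute prefix match and select_one().

```python
from bs4 import BeautifulSoup

html = '''<div class="story"><a class="sister0" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<p><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p></div>'''
soup = BeautifulSoup(html, 'html.parser')

# Descendant combinator: any <a> anywhere inside the div
inside = [a['id'] for a in soup.select('div.story a')]
# Child combinator: only <a> tags that are direct children of the div
direct = [a['id'] for a in soup.select('div.story > a')]
# Attribute prefix match: href values starting with the given text
with_href = [a['id'] for a in soup.select('a[href^="http://example.com/"]')]
# select_one returns the first match (or None) instead of a list
first_sister = soup.select_one('a.sister')

print(inside, direct, with_href, first_sister.get_text())
```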

7. clear: remove everything inside a tag (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)


'''Result:
<html><head><title>The Dormouse's story</title></head>
<body></body>
</html>'''

8. decompose: remove the tag itself and, recursively, everything inside it

tag = soup.find('body')
tag.decompose()
print(soup)


'''Result:
<html><head><title>The Dormouse's story</title></head>

</html>'''
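The difference between clear() and decompose() is easiest to see side by side. The sketch below applies both to a tiny made-up document, and also shows extract(), a related method that removes a tag from the tree but hands it back so it can be reused.

```python
from bs4 import BeautifulSoup

html = '<div><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>'
soup = BeautifulSoup(html, 'html.parser')

soup.find(id='a').clear()              # empties the tag, keeps <p id="a"></p>
soup.find(id='b').decompose()          # removes the tag and destroys it
removed = soup.find(id='c').extract()  # removes the tag and returns it

print(soup)
print(removed)
```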

9. decode: convert to a string (including the current tag); decode_contents: the same, excluding the current tag

tag = soup.find('body')
v = tag.decode()
print(type(soup))
print(type(v))


'''Result:
<class 'bs4.BeautifulSoup'>
<class 'str'>'''

10. encode: convert to bytes (including the current tag); encode_contents: the same, excluding the current tag

tag = soup.find('body')
v = tag.encode()
print(type(soup))
print(type(v))
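The four methods from sections 9 and 10 differ only in the return type (str or bytes) and in whether the tag's own markup is included. A small sketch on a made-up fragment, parsed with 'html.parser':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>hi</p></body>', 'html.parser')
tag = soup.body

as_text = tag.decode()              # str, including <body> itself
inner_text = tag.decode_contents()  # str, only what is inside <body>
as_bytes = tag.encode()             # bytes (UTF-8 by default), including <body>
inner_bytes = tag.encode_contents() # bytes, only what is inside <body>

print(as_text)
print(inner_text)
print(as_bytes)
print(inner_bytes)
```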

11. has_attr: check whether a tag has a given attribute; get_text: get the text inside a tag; index: get the position of a child within its parent tag
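The three methods in section 11 have no example in the article, so here is a minimal sketch on a made-up fragment (parsed with 'html.parser'). Note that index() is called on the parent and compares children by identity.

```python
from bs4 import BeautifulSoup

html = '<div><a id="x">El<span>s</span>ie</a><b>bold</b></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div
a = soup.a

has_id = a.has_attr('id')               # True: the attribute exists
has_href = a.has_attr('href')           # False: there is no href
text = a.get_text()                     # all nested text joined together
joined = div.get_text(separator='|')    # same, with a separator between pieces
position = div.index(soup.b)            # position of <b> among div's children

print(has_id, has_href, text, joined, position)
```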

12. Nodes related to the current tag

soup.next
soup.next_element
soup.next_elements
soup.next_sibling
soup.next_siblings


tag.previous
tag.previous_element
tag.previous_elements
tag.previous_sibling
tag.previous_siblings


tag.parent
tag.parents
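A common surprise with these attributes is that siblings and elements include text nodes, and that next_element walks into a tag while next_sibling stays on the same level. A short sketch on a made-up fragment, parsed with 'html.parser':

```python
from bs4 import BeautifulSoup

html = '<div><a id="link1">Elsie</a>, <a id="link2">Lacie</a></div>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.find(id='link1')

sibling = first.next_sibling            # the text node ", " between the two links
element = first.next_element            # the text "Elsie" inside the first link
second = first.next_sibling.next_sibling  # skip the text node to reach the next <a>
parent_name = first.parent.name
ancestor_names = [p.name for p in first.parents]  # ends at the document object

print(repr(sibling), repr(element), second['id'], parent_name, ancestor_names)
```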

13. Searching among a tag's related tags

tag.find_next(...)
tag.find_all_next(...)
tag.find_next_sibling(...)
tag.find_next_siblings(...)

tag.find_previous(...)
tag.find_all_previous(...)
tag.find_previous_sibling(...)
tag.find_previous_siblings(...)

tag.find_parent(...)
tag.find_parents(...)

The parameters are the same as for find_all
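Because these methods accept the same filters as find_all, they skip over text nodes that the plain navigation attributes would return. A sketch on a shortened story div, parsed with 'html.parser':

```python
from bs4 import BeautifulSoup

html = ('<div class="story"><a id="link1">Elsie</a>, '
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></div>')
soup = BeautifulSoup(html, 'html.parser')
first = soup.find(id='link1')
last = soup.find(id='link3')

next_link = first.find_next_sibling('a')                          # nearest following <a> sibling
later_links = [a['id'] for a in first.find_next_siblings('a')]    # all following <a> siblings
earlier_links = [a['id'] for a in last.find_all_previous('a')]    # nearest first
container = first.find_parent('div')                              # closest enclosing <div>

print(next_link['id'], later_links, earlier_links, container['class'])
```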

14. Setting up relationships between tags

tag = soup.find('div')
a = soup.find('a')
tag.setup(previous_sibling=a)
print(tag.previous_sibling)

15. Creating a tag

from bs4.element import Tag


obj = Tag(name='a',attrs={'id':'qqq','class':'djj'})

obj.string ='kv'

print(obj)


# Result (attribute order may differ between Beautiful Soup versions): <a class="djj" id="qqq">kv</a>
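Beautiful Soup also offers soup.new_tag() as a documented way to create a tag that belongs to an existing document. The sketch below builds the same kind of link as above with it and attaches it to a made-up empty div, using 'html.parser'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div></div>', 'html.parser')

# Attributes can be passed as keyword arguments
link = soup.new_tag('a', id='qqq', href='http://example.com/')
# class is a Python keyword, so set it with item syntax instead
link['class'] = 'djj'
link.string = 'kv'

soup.div.append(link)  # a new tag is not in the tree until it is inserted
print(soup)
```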

16. insert_after, insert_before: insert after or before the current tag; append: append a tag inside the current tag; insert: insert a tag at a given position inside the current tag
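The four insertion methods in section 16 can be compared on a single list. In the sketch below, make_li is a hypothetical helper for building list items, and the document is a made-up one-item list parsed with 'html.parser'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>b</li></ul>', 'html.parser')
ul = soup.ul
b = ul.li


def make_li(text):
    """Build a new <li> holding the given text (illustrative helper)."""
    li = soup.new_tag('li')
    li.string = text
    return li


b.insert_before(make_li('a'))  # sibling placed just before <li>b</li>
b.insert_after(make_li('c'))   # sibling placed just after <li>b</li>
ul.append(make_li('d'))        # last child of <ul>
ul.insert(0, make_li('z'))     # child at position 0 of <ul>

order = [li.get_text() for li in ul.find_all('li')]
print(order)
```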

17. wrap: wrap the current tag in the given tag; unwrap: remove the current tag but keep what it contains
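wrap() and unwrap() are roughly opposites, which the following sketch shows on a made-up paragraph parsed with 'html.parser': first the b tag is wrapped in a new i tag, then the b tag is unwrapped so only its text remains.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

# wrap: put a new <i> around the existing <b>
soup.b.wrap(soup.new_tag('i'))
after_wrap = str(soup)

# unwrap: drop the <b> tag itself but keep its contents in place
soup.b.unwrap()
after_unwrap = str(soup)

print(after_wrap)
print(after_unwrap)
```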

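This article is shared from the author's personal site/blog as part of the Tencent Cloud self-media sharing program.

Originally published: 2018-02-22. In case of infringement, contact cloudcommunity@tencent.com for removal.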
