Python Web Scraping: Beautiful Soup

羊羽shine
Published 2019-05-28 13:33:10
Included in the column: Golang开发

Beautiful Soup

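Beautiful Soup is a Python library for parsing HTML and XML. The examples below pass 'lxml' as the parser, so both the Beautiful Soup package and the lxml package need to be installed (see the official Beautiful Soup download page).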


Installing Beautiful Soup

Code language: shell
pip install beautifulsoup4 lxml
Code language: python
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>HelloPython</p>','lxml')
print(soup.p.string)
# HelloPython
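
The second argument to BeautifulSoup names the parser, and lxml is a separate package from Beautiful Soup itself. As a minimal sketch, assuming lxml may not be installed on every machine, the following falls back to Python's built-in html.parser. The helper name make_soup is hypothetical and not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(markup, 'html.parser')

soup = make_soup('<p>HelloPython</p>')
print(soup.p.string)
```

The two parsers can build slightly different trees for incomplete documents (lxml wraps fragments in html and body tags, html.parser does not), so it is safer to stick with one parser within a project.
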
Getting attributes
Code language: python
from bs4 import BeautifulSoup
html = '''
<html>
<head><title>BeautifulSoup Demo</title></head>
<body>
<p class="titleClass" name="titleName">titleContent</p>
</body>
</html>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs)
# {'class': ['titleClass'], 'name': 'titleName'}
print(soup.p.attrs['name'])
# titleName
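
Besides the attrs dictionary, a tag can be indexed directly or queried with get(). The sketch below shows the difference for a missing attribute. It uses the built-in html.parser so that it runs without lxml, and the attribute name id and the default 'no-id' are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="titleClass" name="titleName">titleContent</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

name_value = p['name']            # index access raises KeyError if the attribute is missing
class_value = p['class']          # class is multi-valued, so it comes back as a list
missing = p.get('id')             # get() returns None instead of raising
fallback = p.get('id', 'no-id')   # or a default of your choice

print(name_value, class_value, missing, fallback)
```
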
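Getting content

The string attribute returns the text of a node. It is only set when the node has a single child (a text node, or one tag that itself has a single text child); otherwise it is None.

Code language: python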
from bs4 import BeautifulSoup
html = '''
<html>
<head><title>BeautifulSoup Demo</title></head>
<body>
<p class="titleClass" name="titleName">titleContent</p>
</body>
</html>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.p.string)
# titleContent
print(soup.head.string)
# BeautifulSoup Demo (head has a single child, title, so its string is used)
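
When a node mixes text and child tags, string is None and get_text() is the usual alternative. A small sketch, assuming a made-up paragraph and the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>World</b></p>', 'html.parser')

only_child = soup.b.string     # <b> has exactly one text child
mixed = soup.p.string          # <p> has two children (text and <b>), so this is None
all_text = soup.p.get_text()   # concatenates every text fragment under <p>

print(only_child, mixed, all_text)
```
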

find_all

Finding content by tag name

Code language: python
from bs4 import BeautifulSoup
html = '''
<html>
<head><title>BeautifulSoup Demo</title></head>
<body>
<div class='classContent1'>
content0
</div>
<div class='classContent2'>
<li>content1</li>
<li>content2</li>
<li>content3</li>
</div>
</body>
</html>
'''

soup = BeautifulSoup(html,'lxml')
result = soup.find_all('div')
print(result)
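
find_all returns a list of every match. The sketch below, using a shortened copy of the markup and the built-in html.parser, shows two common variations: limit to cap the number of results, and find to get only the first match.

```python
from bs4 import BeautifulSoup

html = '''
<div class="classContent2">
<li>content1</li>
<li>content2</li>
<li>content3</li>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_two = soup.find_all('li', limit=2)   # stop after two matches
first = soup.find('li')                    # first match only
missing = soup.find('table')               # None when nothing matches

# Each result is a tag, so the text is read from it afterwards
texts = [li.get_text() for li in soup.find_all('li')]
print(texts)
```
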

Finding by attribute

Code language: python
from bs4 import BeautifulSoup
html = '''
<div class='classContent'>
<li>content1</li>
<li>content2</li>
<li>content3</li>
</div>
'''

soup = BeautifulSoup(html,'lxml')
result = soup.find_all(attrs={'class':'classContent'})
print(result)
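
The attrs dictionary has keyword shortcuts for the common cases. A brief sketch, assuming made-up markup with an extra id and a second div so the filters have something to exclude:

```python
from bs4 import BeautifulSoup

html = '''
<div class="classContent" id="main">
<li>content1</li>
</div>
<div class="other">
<li>content2</li>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ has a trailing underscore because class is a Python keyword
by_class = soup.find_all('div', class_='classContent')
by_id = soup.find_all(id='main')
by_attrs = soup.find_all(attrs={'class': 'classContent'})

print(len(by_class), len(by_id), len(by_attrs))
```

All three calls select the same single div here; the attrs form remains useful for attribute names that are not valid Python identifiers, such as data-* attributes.
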

Finding nodes by text content

Code language: python
from bs4 import BeautifulSoup
import re
html = '''
<div class='classContent'>
<li>content1</li>
<li>content2</li>
<li>content3</li>
</div>
'''

soup = BeautifulSoup(html,'lxml')
# string= is the current name of this argument; older code spells it text=
result = soup.find_all(string=re.compile('content'))
print(result)
# ['content1', 'content2', 'content3']
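
Searching by text alone returns the text nodes, not the tags around them. As a sketch, assuming the string argument (called text in older code) and a made-up third item that should not match, combining a tag name with a pattern returns the tags instead:

```python
from bs4 import BeautifulSoup
import re

html = '''
<div class="classContent">
<li>content1</li>
<li>content2</li>
<li>other</li>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'content\d')

strings = soup.find_all(string=pattern)       # matching text nodes
tags = soup.find_all('li', string=pattern)    # <li> tags whose string matches
parents = [s.parent.name for s in strings]    # each text node still knows its tag

print(strings, tags, parents)
```
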

select: CSS selectors

Code language: python
from bs4 import BeautifulSoup
html = '''
<div class='classContent'>
<li>content1</li>
<li>content2</li>
<li>content3</li>
</div>
'''

soup = BeautifulSoup(html,'lxml')
result = soup.select('div li')
print(result)
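
select accepts ordinary CSS selector syntax, so classes, ids, child combinators and attribute selectors all work. A short sketch, assuming made-up id and data-pos attributes added to the markup for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<div class="classContent" id="list">
<li>content1</li>
<li data-pos="2">content2</li>
<li>content3</li>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

children = soup.select('div.classContent > li')   # direct <li> children of the div
by_id = soup.select('#list li')                   # descendants of the element with id="list"
with_attr = soup.select('li[data-pos]')           # <li> tags that carry a data-pos attribute
first = soup.select_one('li')                     # first match only, or None

print([li.get_text() for li in children], first.get_text())
```
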

Scraping Douban Books

Code language: python
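from bs4 import BeautifulSoup
import requests

# One URL per page of the Top 250 list, 25 books per page
urls = ['https://book.douban.com/top250?start={}'.format(n) for n in range(0, 250, 25)]

# Assumption: the site may reject the default requests User-Agent,
# so a browser-like one is sent instead
headers = {'User-Agent': 'Mozilla/5.0'}

def get_book(url):
    wb_data = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.pl2 > a')
    imgs = soup.select('a.nbg > img')
    cates = soup.select('p.quote > span')
    # zip stops at the shortest list, so a book without a p.quote
    # element shifts the pairing for the rest of the page
    for title, img, cate in zip(titles, imgs, cates):
        data = {
            'title': title.get_text(),
            'img': img.get('src'),
            'cate': cate.get_text()
        }
        print(data)

for page_url in urls:
    get_book(page_url)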
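
Because the three select calls above build independent lists, one entry with a missing element would misalign titles, images and quotes. A possible alternative, sketched offline here, is to select each book's container first and then query inside it. The sample markup, the tr.item container selector and the function name parse_books are assumptions made for illustration, not the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two books, the second one has no quote
sample = '''
<table><tr class="item">
  <td><a class="nbg" href="#"><img src="a.jpg"></a></td>
  <td><div class="pl2"><a href="#">Book A</a></div>
      <p class="quote"><span>Quote A</span></p></td>
</tr></table>
<table><tr class="item">
  <td><a class="nbg" href="#"><img src="b.jpg"></a></td>
  <td><div class="pl2"><a href="#">Book B</a></div></td>
</tr></table>
'''

def parse_books(html):
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    for item in soup.select('tr.item'):   # one container per book (assumed selector)
        title = item.select_one('div.pl2 > a')
        img = item.select_one('a.nbg > img')
        quote = item.select_one('p.quote > span')
        # Each field is looked up inside its own container, so a missing
        # element becomes None instead of shifting the other fields
        books.append({
            'title': title.get_text(strip=True) if title else None,
            'img': img.get('src') if img else None,
            'quote': quote.get_text(strip=True) if quote else None,
        })
    return books

books = parse_books(sample)
print(books)
```

The same function could be fed wb_data.text from the request loop once the container selector has been checked against the live page.
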

Originally published: 2018.07.22
