
How to filter out unwanted tags

Stack Overflow user
Asked on 2016-02-02 00:30:40
Answers: 1 · Views: 119 · Followers: 0 · Votes: 1

I'm trying to extract a bulleted list from this page: http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/

Specifically, the bullets highlighted in yellow in the screenshot below.

First, I use Beautiful Soup to select all <ul> tags that have no attributes:

import requests
from bs4 import BeautifulSoup
text = BeautifulSoup(requests.get('http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/', timeout=7.00).text, 'html.parser')
bullets = text.find_all(lambda tag: tag.name == 'ul' and not tag.attrs)

Here are the two <ul> tags that are returned:

<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>

<ul><li class="share-item share-fb" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="facebook" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Facebook"></li><li class="share-item share-tw" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="twitter" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Twitter"></li><li class="share-item share-gp" data-lang="en-US" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="googlePlus" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Google+"></li><li class="share-item share-pn" data-media="http://bodetree.com/wp-content/uploads/2015/04/pain-points.png" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="pinterest" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Pinterest"></li></ul>

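I only want to extract the <ul> tags that appear in the body of the page, so I'd like to filter out the second <ul> tag, which contains junk. It looks like the <ul> tags that are not in the page body have <li> tags with attributes, so we can filter on that basis. Basically, I want tag structures of the form <ul><li>string</li></ul>. So in this case, the only <ul> I want returned is: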

<ul> 
<li>You are experiencing a decrease in sales and customers</li> 
<li>If your brand design does not reflect what you deliver</li> 
<li>If you want to attract a new target audience</li> 
<li>Management change</li> 
<li>19 Questions to Ask Yourself Before You Start Rebranding</li>
</ul> 

Is there a way to do this with find_all()?


1 Answer

Stack Overflow user

Accepted answer

Posted on 2016-02-02 00:38:41

Search for the ul inside the article body, which is a div with class="entry-content":

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/', timeout=7.00).text, 'html.parser')

# Restrict the search to list items inside the article body.
bullets = soup.select("div.entry-content ul li")
print([bullet.get_text() for bullet in bullets])

Prints:

[
    'You are experiencing a decrease in sales and customers', 
    'If your brand design does not reflect what you deliver', 
    'If you want to attract a new target audience', 
    'Management change', 
    '19 Questions to Ask Yourself Before You Start Rebranding'
]
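
The question asked specifically about find_all(), so here is a rough sketch of that route as well: pass find_all() a predicate that accepts an attribute-free <ul> only when its direct <li> children are attribute-free too. The `SAMPLE_HTML` snippet and the `is_plain_list` helper are made up for illustration (a trimmed stand-in for the real page, so the example runs without a network request), and the heuristic assumes that junk lists always carry attributes on their <li> tags, which may not hold on other pages.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page in the question.
SAMPLE_HTML = """
<div class="entry-content">
<ul>
<li>Management change</li>
<li><a href="http://example.com/">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>
</div>
<ul><li class="share-item share-fb" title="Facebook"></li><li class="share-item share-tw" title="Twitter"></li></ul>
"""


def is_plain_list(tag):
    """Match an attribute-free <ul> whose direct <li> children also have no attributes."""
    if tag.name != "ul" or tag.attrs:
        return False
    items = tag.find_all("li", recursive=False)
    # Require at least one <li>, and reject the list if any <li> carries attributes.
    return bool(items) and all(not li.attrs for li in items)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
plain_lists = soup.find_all(is_plain_list)

# Flatten the surviving lists into the text of their items.
bullets = [
    li.get_text(strip=True)
    for ul in plain_lists
    for li in ul.find_all("li", recursive=False)
]
print(bullets)
```

Compared with the CSS selector in the answer, this does not depend on the article's container class, but it is only as reliable as the "no attributes" assumption; scoping to div.entry-content is likely the sturdier choice when the container is known.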
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/35143015
