Q: Python: Parse HTML to remove tags and apply a text transformation to all text after a tag

Stack Overflow user

Asked 2018-07-19 04:00:52

Answers 2 · Views 41 · Followers 0 · Votes 0

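I am trying to detect strings that contain the HTML tag <p><strong class="title"> </strong></p> with the specific words "shared" OR "amenities" inside the tag, and to append the word "shared" to every comma-separated substring that appears after that tag. Is there a simple way to do this?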

Sample input:

</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">

Sample output:

swimming pool, barbecue, beach shared, tennis courts shared

python

regex

string

beautifulsoup

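2 Answers

Stack Overflow user

Posted 2018-07-19 04:37:29

There are a few different libraries you can use for this; the common choices are Beautiful Soup and lxml. I prefer lxml because its XPath queries, much like regex, have implementations in most languages, so I feel I get more out of the investment.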

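from lxml import html

stuff = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
# Parse the fragment; lxml tolerates the stray and unclosed tags
tree = html.fromstring(stuff)
# Select the text of any child of <p> whose text contains AMENITIES or SHARED
# (XPath contains() is case-sensitive)
ptag = tree.xpath('//p/*[contains(text(),"AMENITIES") or contains(text(), "SHARED")]//text()')
print(ptag)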
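
The XPath above locates the heading but does not yet transform the text that follows it. As a rough sketch of the remaining steps, here is a dependency-free version that swaps lxml for the standard library's `html.parser`. The names `AmenityTagger`, `tag_shared`, and `KEYWORDS` are made up for illustration, and it assumes each heading governs the items up to the next heading.

```python
from html.parser import HTMLParser

# Assumed trigger words, taken from the question (matched case-insensitively)
KEYWORDS = ("shared", "amenities")


class AmenityTagger(HTMLParser):
    """Collect comma-separated items, tagging those that follow a matching heading."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while inside <strong class="title">
        self.shared = False    # True when the latest heading matched a keyword
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong" and dict(attrs).get("class") == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            # Heading text decides how the following items are treated
            self.shared = any(k in data.lower() for k in KEYWORDS)
            return
        for item in data.split(","):
            item = item.strip()
            if item:
                self.items.append(item + " shared" if self.shared else item)


def tag_shared(fragment):
    """Return the cleaned, comma-joined items of an HTML fragment."""
    parser = AmenityTagger()
    parser.feed(fragment)
    parser.close()
    return ", ".join(parser.items)


sample = ('</strong></p> swimming pool, barbecue <hr /> '
          '<p><strong class="title">SHARED CLUB AMENITIES</strong></p> '
          'beach, tennis courts <hr /> <p><strong class="title">')
print(tag_shared(sample))
```

If the tagging should instead persist to the end of the string regardless of later headings, the flag could be set once and never reset.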

Votes 0

Stack Overflow user

Posted 2018-07-19 05:21:05

I got it working with the code below. Any comments and suggestions are welcome!

import pandas as pd
from bs4 import BeautifulSoup

html_to_parse = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_to_parse, 'html.parser')

# Text of the first <strong class="title"> heading
shared_indicator = soup.find('strong', 'title').get_text()

# Split the raw HTML once at the heading text
before, after = html_to_parse.split(shared_indicator, 1)

# Items before the heading: only strip the markup
non_shared_amenities = BeautifulSoup(before, 'html.parser').get_text().strip()

# Items after the heading: split on commas, clean up, append " shared"
shared_amenities_array = (
    pd.Series(BeautifulSoup(after, 'html.parser').get_text().split(','))
    .replace("[^A-Za-z0-9'`]+", " ", regex=True)
    .str.strip()
    .apply(lambda x: "{}{}".format(x, ' shared'))
)

shared_amenities_tagged = ", ".join(shared_amenities_array)

result = non_shared_amenities + ', ' + shared_amenities_tagged
print(result)
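
The split-and-tag idea in this answer does not strictly need pandas. A minimal sketch using only the `re` module follows. The helper names `clean_items` and `tag_after_heading` are hypothetical, and the pattern assumes the heading is written exactly as `<strong class="title">...</strong>` with no nested tags; regex is fragile on arbitrary HTML, so treat this as an illustration rather than a general parser.

```python
import re

# Heading whose text contains "shared" or "amenities" (case-insensitive).
# Assumes the attribute is written exactly as class="title".
HEADING = re.compile(
    r'<strong class="title">([^<]*(?:shared|amenities)[^<]*)</strong>',
    re.IGNORECASE,
)
TAG = re.compile(r"<[^>]*>")


def clean_items(fragment):
    """Strip tags, split on commas, and drop empty pieces."""
    text = TAG.sub(" ", fragment)
    return [" ".join(part.split()) for part in text.split(",") if part.strip()]


def tag_after_heading(fragment, suffix=" shared"):
    """Append `suffix` to every item that follows the first matching heading."""
    match = HEADING.search(fragment)
    if match is None:
        # No matching heading: just return the cleaned items
        return ", ".join(clean_items(fragment))
    before = clean_items(fragment[:match.start()])
    after = [item + suffix for item in clean_items(fragment[match.end():])]
    return ", ".join(before + after)


sample = ('</strong></p> swimming pool, barbecue <hr /> '
          '<p><strong class="title">SHARED CLUB AMENITIES</strong></p> '
          'beach, tennis courts <hr /> <p><strong class="title">')
print(tag_after_heading(sample))
```

Splitting at the match position rather than at the heading text avoids a false split if the same words happen to appear earlier in the string.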

Votes 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/51410127
