Q: How do I parse HTML-encoded text from all rows of a specific column in a CSV file?

Stack Overflow user

Asked 2019-04-24 03:42:41

Answers: 1 | Views: 94 | Followers: 0 | Votes: 1

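Here is an image of what the data in the 'content' column looks like:

I have loaded a CSV file into pandas. In the 'Content' column, every row contains HTML-encoded text of varying length; some rows run to 500+ words. My goal is to strip all of the HTML markup from every row of the 'content' column.

Can someone help me with the code for this?

This is all I have so far.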

import pandas as pd

dataset = pd.read_csv('NuggetData.csv')

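'Content' is the 9th column of the table (counting the first column as 0), and there are about 17,000 rows.

Sample text from the content column (this is not even the full text of row 1, by the way; the real value is longer):

Row 1: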

<h2>A bold new toy commercial debuted last week, and it's got the internet talking.</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e9536dca7292001f000008/attachments/toygif1-65977b573530a2407626f8a4aad22a4e.gif" class=""><div class="image-caption"><p>GIFs via Smyths Toys.</p></div></div></div><h2>In some ways, it was pretty standard because a boy's love for rocket ships isn't all that unique.</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e953b8e2d8c7001f00002d/attachments/toygif2-6ef9ddacf2a56c63a84d773645450563.gif" class=""></div></div><h2>Neither is his love of Legos.</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e95558e2d8c7002b000025/attachments/toygif4-4f0829dad2602f7dd6ed52813e6791a5.gif" class=""></div></div><h2>Plenty of boys like to (pretend to) drive motorcycles, too.</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e95595ca72920034000029/attachments/toygif5-e1824fae63099796ac2947ba76ea185d.gif" class=""></div></div><h2>But ... playing dress-up as a queen in front of a crowd of cheering supporters?</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e954c0e2d8c7002d00001e/attachments/toygif3-21ea60c5917fd80da817919c655a4c96.gif" class=""></div></div><p><em>That's</em> extraordinary. </p><h2>

python

python-3.x

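1 Answer

Stack Overflow user

Answered 2019-04-24 05:15:47

I suggest using BeautifulSoup (a library) and a list comprehension to parse your content column.

First, you need to know which content you want to extract from the HTML. For the sake of explanation, I have made some assumptions: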

You are looking for content inside DIV tags (findAll('div'))

You are looking for the text inside those tags (.text)

You need the text from the third DIV tag ([2])
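The three assumptions above can be traced step by step on a single string. The following is only a minimal sketch: the `html` value is a hypothetical fragment shaped like the sample row in the question, and it uses the built-in `'html.parser'` backend rather than `lxml` so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, shaped like the sample row in the question
html = (
    '<h2>Heading</h2>'
    '<div><div class="push-wrapper--mobile"><img src="x.gif">'
    '<div class="image-caption"><p>GIFs via Smyths Toys.</p></div></div></div>'
    '<p><em>That\'s</em> extraordinary. </p>'
)

soup = BeautifulSoup(html, 'html.parser')  # 'lxml' also works if it is installed

divs = soup.find_all('div')      # assumption 1: every <div>, in document order
third_div = divs[2]              # assumption 3: the third one (index 2)
third_div_text = third_div.text  # assumption 2: its text with the tags removed

# Text of the whole fragment, for comparison
all_text = soup.get_text(separator=' ', strip=True)
```

On a fragment shaped like this one, the third DIV is the image caption, so `third_div_text` holds only the caption while `all_text` holds the headings and paragraphs as well. Which of the two is wanted depends on what should be kept from each row.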

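from bs4 import BeautifulSoup as bs

# Text of the third <div> ([2]) in each row. Column names are case-sensitive, so use
# the name exactly as it appears in the CSV header ('content' or 'Content').
# Rows with fewer than three <div> tags raise IndexError.
dataset['parsed_content'] = [bs(x, 'lxml').findAll('div')[2].text for x in dataset['content']]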

The code above adds a new column to the dataframe, so the original content column is never modified.
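If the goal is the one stated in the question, removing all markup from every row rather than keeping one particular DIV, a variation along the following lines may be closer. This is a sketch under assumptions, not the answerer's method: `strip_html` is a hypothetical helper name, the small `dataset` frame stands in for the one loaded from the CSV, and `'html.parser'` is used in place of `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup


def strip_html(value):
    """Return the visible text of an HTML fragment, or '' for missing values."""
    if not isinstance(value, str):  # empty CSV cells arrive as NaN (a float) or None
        return ''
    soup = BeautifulSoup(value, 'html.parser')
    return soup.get_text(separator=' ', strip=True)


# Hypothetical stand-in for the frame loaded from the CSV
dataset = pd.DataFrame({
    'content': [
        '<h2>A bold new toy commercial</h2><div><p>GIFs via Smyths Toys.</p></div>',
        '<p><em>That\'s</em> extraordinary. </p>',
        None,  # a row with no content
    ]
})

# New column; the original 'content' column is left untouched
dataset['parsed_content'] = dataset['content'].apply(strip_html)
```

Because the helper checks the type first, rows with missing values yield an empty string instead of raising, and no row depends on a fixed number of DIV tags being present.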

The dependencies BeautifulSoup and lxml can be installed with pip (pip install beautifulsoup4 lxml).

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/55818338
