Q: Remove all javascript tags and style tags from html with python and the lxml module

Stack Overflow user

Asked 2011-12-19 03:01:20

Answers 3 | Views 20.7K | Followers 0 | Votes 33

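I am parsing an HTML document with the lxml library (http://lxml.de/). So far I have worked out how to strip tags from an HTML document (see "In lxml, how do I remove a tag but retain all contents?"), but the method described in that post keeps all the text: it strips the tags without removing the actual script content. I also found the class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is far from clear how to actually use the class to clean a document. Any help, perhaps a short example, would be appreciated!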

python

html

lxml

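3 Answers

Stack Overflow user

Accepted answer

Answered 2011-12-19 03:37:47

Here is an example that does what you want. For HTML documents, Cleaner is a better general-purpose solution than strip_elements, because in cases like this you want to remove not only <script> tags but also things like onclick=function() attributes on other tags.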

#!/usr/bin/env python

import lxml.html  # import the submodule explicitly; a bare "import lxml" does not load lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True # This is True because we want to activate the javascript filter
cleaner.style = True      # This is True because we want to activate the styles & stylesheet filter

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse('http://www.google.com')))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse('http://www.google.com'))))
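The example above fetches a live page. As a minimal offline sketch of the same idea, the snippet below runs the cleaner on an in-memory string. The sample markup in html_src is made up for illustration, and on recent lxml releases the lxml.html.clean module may require the separate lxml_html_clean package to be installed.

```python
from lxml.html.clean import Cleaner

# Hypothetical sample markup, invented for this illustration.
html_src = (
    '<div>'
    '<style>p { color: red; }</style>'
    '<script>alert("hi");</script>'
    '<p onclick="doSomething()">Hello <b>world</b></p>'
    '</div>'
)

cleaner = Cleaner()
cleaner.javascript = True  # strip JavaScript, including on* event attributes
cleaner.style = True       # strip <style> elements and stylesheet links

# Passing a string in returns a string.
cleaned = cleaner.clean_html(html_src)
print(cleaned)
```

The script body, the CSS rule and the onclick attribute should all be gone from the output, while the visible text stays in place.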

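The lxml.html.clean.Cleaner documentation lists the options you can set. Some are simple True/False flags (check the documentation for each option's default), while others take a list, for example: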

cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']

Note the difference between kill and remove:

remove_tags:
  A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
  A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
  A list of tags to include (default include all).
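To make the kill/remove distinction concrete, here is a small sketch that cleans the same string twice, once with each option. The sample markup and the variable names (remover, killer) are invented for illustration.

```python
from lxml.html.clean import Cleaner

# Hypothetical sample markup, invented for this illustration.
html_src = '<div><h1>Title</h1><p>Keep <a href="http://example.com">link text</a> here</p></div>'

# remove_tags: the <a> tag itself goes away, its text is pulled up into <p>.
remover = Cleaner(remove_tags=['a'])
removed = remover.clean_html(html_src)

# kill_tags: the <a> tag and everything inside it are dropped.
killer = Cleaner(kill_tags=['a'])
killed = killer.clean_html(html_src)

print(removed)
print(killed)
```

With remove_tags the words "link text" survive inside the paragraph; with kill_tags they disappear along with the tag, while the text that follows the link is kept.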

Votes 66

Stack Overflow user

Answered 2011-12-19 03:11:06

You can use the strip_elements method to remove the scripts, then use the strip_tags method to remove other tags:

from lxml import etree

etree.strip_elements(fragment, 'script')  # drops each <script> element together with its content
etree.strip_tags(fragment, 'a', 'p') # and other tags that you want to remove
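As a runnable sketch of the two calls above, the snippet below builds a small fragment and applies both. The sample markup is made up, and with_tail=False is a choice made here (not part of the answer) so that text immediately following a removed element is kept.

```python
from lxml import etree, html

# Hypothetical sample markup, invented for this illustration.
fragment = html.fromstring(
    '<div><script>alert(1)</script>Intro. '
    '<p>Some <a href="#">linked</a> text</p></div>'
)

# Remove <script> elements and their content. with_tail=False keeps the
# text that follows the element ("Intro. "), which is dropped by default.
etree.strip_elements(fragment, 'script', with_tail=False)

# Remove the <a> and <p> tags but merge their text into the parent.
etree.strip_tags(fragment, 'a', 'p')

result = html.tostring(fragment, encoding='unicode')
print(result)
```

Unlike Cleaner, this approach only touches the tags you name, so event-handler attributes such as onclick on other elements are left in place.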

Votes 4

Stack Overflow user

Answered 2017-01-13 13:17:35

You can also use the bs4 library for this.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_src, "lxml")
for tag in soup.find_all(['script', 'style']):
    tag.extract()  # detach the element and its contents from the tree
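A self-contained version of the same approach might look like the sketch below. The sample html_src is invented for illustration, and decompose() is used here in place of extract() because the removed elements are not needed afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html_src = (
    "<html><head><style>p{color:red}</style>"
    "<script>alert(1)</script></head>"
    "<body><p>Hello</p></body></html>"
)

soup = BeautifulSoup(html_src, "lxml")

# Remove every <script> and <style> element along with its contents.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

cleaned = str(soup)
print(cleaned)
```

Note that this only removes the elements themselves; inline event attributes such as onclick would need separate handling.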

Votes 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/8554035
