Q: Removing script and style elements while scraping in Python

Stack Overflow user

Asked 2018-08-16 01:56:41

Answers: 1 · Views: 144 · Followers: 0 · Votes: 1

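I am using the code below to scrape a web page and strip its script and style elements, so that I am left with only the text of the page.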

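    from urllib.request import Request, urlopen
    from bs4 import BeautifulSoup

    link = "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
    url = Request(link, headers={'User-Agent': 'Chrome/5.0'})
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly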

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(text)

Example: suppose the soup from the site contains

<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul></div></td> 
</tr><tr><th scope="row">Years active</th><td>

I want it to print

Technology entrepreneur philanthropist Years active

but instead it prints

Technology entrepreneurphilanthropistYears active

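I would like it to insert a space wherever it removes the script and style elements. Any suggestions on the code above are appreciated. You can run it against the original URL to check.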
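For reference, the run-together output can be reproduced without fetching the page. This is a minimal sketch, assuming a tidied-up version of the fragment above: the `<table>` wrapper and the name `fragment` are illustrative additions, not part of the original page. `get_text()` joins text nodes with an empty string by default, so nothing separates the text of adjacent elements.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed version of the fragment from the question
# (the <table> wrapper is an assumption added for illustration).
fragment = (
    "<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul>"
    '<table><tr><th scope="row">Years active</th></tr></table>'
)

soup = BeautifulSoup(fragment, "html.parser")

# With no separator argument, the text nodes are concatenated directly.
joined = soup.get_text()
print(joined)  # Technology entrepreneurphilanthropistYears active
```

The missing spaces therefore come from the tag boundaries in general, not only from the places where script and style elements were removed.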

python

html

text

web-scraping

beautifulsoup

1 Answer

Stack Overflow user

Answered 2018-08-16 02:22:57

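After extracting the script tags, you can convert the HTML back to a string and replace the remaining tags using a regular expression.

This worked for me: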

    import re

    import requests
    from bs4 import BeautifulSoup

    link = "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
    r = requests.get(link, headers={'User-Agent': 'Chrome/5.0'})
    html = r.text
    # lxml is used here for speed; other parsers such as html.parser also work
    soup = BeautifulSoup(html, "lxml")
    # remove <script> and <style> elements so their contents do not end up in the text
    for tag in soup.find_all(['script', 'style']):
        tag.extract()
    html = str(soup)
    # replace every remaining tag with a space so adjacent text stays separated
    html = re.sub(r'<.+?>', ' ', html)
    # collapse runs of whitespace into single spaces
    html = " ".join(html.strip().split())
    print(html)
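An alternative that avoids the regular expression may be worth trying: `get_text()` accepts a `separator` argument, which is placed between text nodes. The following is only a sketch under stated assumptions: the helper name `visible_text` and the `sample` markup are hypothetical and chosen for illustration, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup


def visible_text(html: str) -> str:
    """Return the text of an HTML document with script/style contents removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop <script> and <style> elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Put a space between text nodes, then collapse all whitespace runs.
    return " ".join(soup.get_text(separator=" ").split())


# Hypothetical sample markup, loosely based on the fragment in the question.
sample = (
    "<style>li { color: red; }</style>"
    "<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul>"
    "<script>var x = 1;</script>"
    '<table><tr><th scope="row">Years active</th></tr></table>'
)

print(visible_text(sample))  # Technology entrepreneur philanthropist Years active
```

One caveat with this approach: the separator is also inserted between inline elements, so markup such as `<b>foo</b>bar` comes out as `foo bar`.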

Edited after I understood what was actually being asked.

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51863989
