I am using the code below to scrape a page and strip out the script and style elements, so that I get only the text from the web page.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
link = "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
url = Request(link, headers={'User-Agent': 'Chrome/5.0'})
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Example: suppose the soup from the website is
<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul></div></td>
</tr><tr><th scope="row">Years active</th><td>
I want it to print
Technology entrepreneur philanthropist Years active
whereas it actually prints
Technology entrepreneurphilanthropistYears active
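I want it to insert a space at the points where it strips out the script and style elements. Any suggestions on the code above are appreciated. You can run it against the original URL to check.
Posted on 2018-08-16 02:22:57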
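After extracting the script tags, you can convert the HTML to a string and replace the remaining tags using a regular expression.
This worked for me: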
import requests
from bs4 import BeautifulSoup
import re
link= "https://en.wikipedia.org/wiki/Mark_Zuckerberg"
r = requests.get(link, headers={'User-Agent': 'Chrome/5.0'})
html = r.text
soup = BeautifulSoup(html, "lxml") # feel free to use other parsers, e.g. html.parser, I use lxml as it's the fastest one...
for script in soup.find_all('script'):
    script.extract()
html = str(soup)
html = re.sub('<.+?>', ' ', html)
html = " ".join(html.strip().split())
print(html)
Edited after I understood what was actually wanted.
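To see the effect in isolation, here is a minimal sketch that applies the same tag-to-space idea to the HTML fragment from the question, using only the standard library. The helper name `tags_to_spaces` and the variable `sample` are made up for illustration, and the pattern `<[^>]+>` is a slight variation on the one in the answer, not the answer's exact regex.

```python
import re

# The fragment quoted in the question (unbalanced closing tags included).
sample = (
    '<ul><li>Technology entrepreneur</li><li>philanthropist</li></ul></div></td>\n'
    '</tr><tr><th scope="row">Years active</th><td>'
)


def tags_to_spaces(markup):
    """Replace every tag with a space, then collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", markup)
    # split() with no argument splits on any whitespace and drops empty pieces
    return " ".join(no_tags.split())


result = tags_to_spaces(sample)
print(result)
```

If you would rather stay within BeautifulSoup, `get_text` also accepts a separator, for example `soup.get_text(separator=" ", strip=True)`, which places a space between adjacent text nodes without any regex.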
https://stackoverflow.com/questions/51863989