<a href="https://github.com/Alir3z4/html2text" rel="noreferrer">html2text</a> is a Python program that does a pretty good job at this.

PyParsing does a great job. The PyParsing wiki was taken down, so here is another location with examples of PyParsing in use (<a href="http://www.ccp4.ac.uk/dist/checkout/pyparsing-2.0.1/examples/0README.html" rel="nofollow noreferrer">example link</a>). One reason to invest a little time in pyparsing is that its author has also written a very brief, well-organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and the entity issues are not that hard to deal with: you can convert the entities before you run BeautifulSoup.
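As a rough illustration of converting entities up front, Python 3's standard library provides <code>html.unescape</code>. The sample string below is made up, and this is only a sketch of the idea, not the exact preprocessing the answer has in mind.

```python
import html

# Hypothetical input containing named and numeric character references
raw = "<p>Fish &amp; chips &#8211; caf&eacute;</p>"

# html.unescape replaces references such as &amp; or &#8211; with the
# characters they stand for; the tags themselves are left alone
decoded = html.unescape(raw)
print(decoded)
```

Note that unescaping before parsing also turns <code>&amp;lt;</code> into a literal <code>&lt;</code>, so escaped markup in the input would become real markup. Whether that matters depends on your data.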

Good luck

You can also use the <code>html2text</code> method in the stripogram library.

<pre><code>from stripogram import html2text
text = html2text(your_html_string)
</code></pre>

To install stripogram, run <code>sudo easy_install stripogram</code>.

I found myself facing just the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

<pre><code># Python 2 code; in Python 3 the import is: from html.parser import HTMLParser
from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) &gt; 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        &lt;html&gt;
            &lt;body&gt;
                &lt;b&gt;Project:&lt;/b&gt; DeHTML&lt;br&gt;
                &lt;b&gt;Description&lt;/b&gt;:&lt;br&gt;
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            &lt;/body&gt;
        &lt;/html&gt;
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
</code></pre>
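For readers on Python 3, the same idea can be sketched with <code>html.parser</code>. This is not the author's code but a minimal port under a few assumptions: the class and function names are my own, and entity decoding is left to the parser's <code>convert_charrefs</code> option.

```python
from html.parser import HTMLParser
import re


class DeHTMLParser(HTMLParser):
    """Collects text nodes, adding line breaks for <p> and <br>."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities before handle_data
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_data(self, data):
        # Collapse runs of whitespace inside each text node
        text = re.sub(r'[ \t\r\n]+', ' ', data.strip())
        if text:
            self._parts.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br/>
        if tag == 'p':
            self._parts.append('\n\n')
        elif tag == 'br':
            self._parts.append('\n')

    def text(self):
        return ''.join(self._parts).strip()


def dehtml(markup):
    parser = DeHTMLParser()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered by the parser
    return parser.text()


print(dehtml('<b>Project:</b> DeHTML<br><b>Description</b>:<br>Plain   text &amp; more.'))
```

Like the original, this sketch does nothing special for <code>script</code> or <code>style</code> elements, so their contents would end up in the output.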

NOTE: NLTK no longer supports the <code>clean_html</code> function.

The original answer is below, with an alternative in the comments section.

<hr>

Use <a href="https://pypi.python.org/pypi/nltk" rel="noreferrer">NLTK</a>.

I wasted 4-5 hours fixing issues with html2text. Luckily, I came across NLTK.
It works like magic.

<pre><code>import nltk 
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm" 
html = urlopen(url).read() 
raw = nltk.clean_html(html) 
print(raw)
</code></pre>

Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console:

<pre><code>from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello&lt;br&gt;there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)
</code></pre>

This isn't exactly a Python solution, but it also captures the text that JavaScript would generate, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

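<pre><code>import subprocess
import tempfile

# Write the HTML to a temporary file that Links can read
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
    fname = f.name
proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
text = proc.stdout.read()
</code></pre>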

There is also the Pattern library for data mining.

<a href="http://www.clips.ua.ac.be/pages/pattern-web" rel="noreferrer">http://www.clips.ua.ac.be/pages/pattern-web</a>

You can even decide what tags to keep:

<pre><code>from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)
</code></pre>

Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

<pre><code>import copy
import re

import BeautifulSoup  # BeautifulSoup 3 API (not bs4)

def getsoup(data, to_unicode=False):
    data = data.replace("&amp;nbsp;", " ")
    # Fixes for bad markup I've seen in the wild. Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('&lt;!-([^-])'), lambda match: '&lt;!--' + match.group(1)),
        (re.compile('&lt;!WWWAnswer T[=\w\d\s]*&gt;'), lambda match: '&lt;!--' + match.group(0) + '--&gt;'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""
</code></pre>

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &amp;#39;) and HTML entities (e.g., &amp;amp;).

It also includes a trivial plain-text-to-html inverse converter.

<pre><code>"""
HTML &lt;-&gt; text conversions.
"""
# Python 2 code: in Python 3 these modules are html.parser and html.entities,
# HTMLParseError no longer exists, and unichr is spelled chr.
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, and skips the contents of script
    and style elements.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with &lt;a&gt; tags
    and converting confusing chars into html entities. Newlines are left
    untouched.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&amp;':'&amp;amp;', "'":'&amp;#39;', '"':'&amp;quot;', '&lt;':'&amp;lt;', '&gt;':'&amp;gt;'}.get(t)
        return '&lt;a href="%s"&gt;%s&lt;/a&gt;' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&amp;\'"&lt;&gt;]', f, text)
</code></pre>
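The inverse direction can be sketched in Python 3 as well. The version below is my own illustration, not a port of the function above: it escapes with <code>html.escape</code>, links bare URLs, and additionally turns newlines into <code>&lt;br&gt;</code>. The URL pattern is an assumption and deliberately loose.

```python
import html
import re

# Assumption: a URL is "http(s)://" followed by anything except whitespace,
# parentheses, angle brackets, quotes, semicolons and closing square brackets
URL_RE = re.compile(r'https?://[^\s()<>"\';\]]+')


def text_to_html(text):
    """Escape plain text for HTML, link bare URLs and turn newlines into <br>."""
    pieces = []
    last = 0
    for match in URL_RE.finditer(text):
        # Escape the plain text that precedes this URL
        pieces.append(html.escape(text[last:match.start()], quote=True))
        url = html.escape(match.group(), quote=True)
        pieces.append('<a href="%s">%s</a>' % (url, url))
        last = match.end()
    # Escape whatever follows the last URL
    pieces.append(html.escape(text[last:], quote=True))
    return ''.join(pieces).replace('\n', '<br>\n')


print(text_to_html('Tom & Jerry <3\nSee http://example.com/a?b=1&c=2 now'))
```

Escaping the text between matches, rather than escaping everything first, keeps the <code>&amp;</code> inside a URL from being escaped twice.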

In Python 3.x you can do this very easily with the <code>imaplib</code> and <code>email</code> packages. Although this is an older post, maybe my answer can help newcomers to it.

<pre><code>status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1])
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue
</code></pre>

Now you can print the <code>body</code> variable and it will be in plain-text format :) If it is good enough for you, it would be nice to select it as the accepted answer.
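To try the walking logic without an IMAP connection, here is a self-contained sketch that builds a small multipart message in memory. The helper name <code>plain_text_body</code> and the demo message are my own assumptions, and falling back to UTF-8 when no charset is declared is a guess that may not suit every mailbox.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def plain_text_body(raw_bytes):
    """Return the first text/plain part of a message, or None if there is none."""
    msg = email.message_from_bytes(raw_bytes)
    # walk() also yields the message itself, so single-part mail works too
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload(decode=True)  # undo Base64 / quoted-printable
            charset = part.get_content_charset() or 'utf-8'
            return payload.decode(charset, errors='replace')
    return None


# Build a small multipart/alternative message to stand in for fetched mail
demo = MIMEMultipart('alternative')
demo.attach(MIMEText('Hello, world!', 'plain', 'utf-8'))
demo.attach(MIMEText('<p>Hello, <b>world</b>!</p>', 'html', 'utf-8'))

print(plain_text_body(demo.as_bytes()))
```

If a message only has an HTML part, this returns <code>None</code>, and you would still need one of the HTML-to-text approaches from the other answers.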

Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

<pre><code>lynx -dump html_to_convert.html &gt; converted_html.txt
</code></pre>

This can be done within a Python script as follows:

<pre><code>import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)
</code></pre>

It won't give you only the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.

I recommend a Python package called goose-extractor.
Goose will try to extract the following information:

<ul>
<li>Main text of an article</li>
<li>Main image of an article</li>
<li>Any YouTube/Vimeo movies embedded in the article</li>
<li>Meta description</li>
<li>Meta tags</li>
</ul>

More: <a href="https://pypi.python.org/pypi/goose-extractor/" rel="nofollow">https://pypi.python.org/pypi/goose-extractor/</a>

Another non-Python solution is LibreOffice:

<pre><code>soffice --headless --invisible --convert-to txt input1.html
</code></pre>

The reason I prefer this one over other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats.

A simple way:

<pre><code>import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'&lt;(.*?)&gt;', '', html_text)
</code></pre>

This code finds every part of html_text that starts with '&lt;' and ends with '&gt;' and replaces each match with an empty string.
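The regex approach has a few blind spots worth seeing in action: script bodies are kept, entities stay encoded, and a tag that spans two lines is missed because <code>.</code> does not match a newline by default. The sample string and the "more careful" variant below are my own sketch. It is still a regex applied to HTML, so it will not cope with every document.

```python
import html
import re

# Made-up sample: an entity, a script block, and a tag split across two lines
sample = '<p>Tom &amp; Jerry</p><script>var x = "<b>";</script> <a\nhref="#">link</a>'

# The pattern from the answer, unchanged
naive = re.sub(r'<(.*?)>', '', sample)

# Drop script/style blocks first, let tags span lines, then decode entities
no_blocks = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', '', sample)
careful = html.unescape(re.sub(r'(?s)<.*?>', '', no_blocks))

print(repr(naive))
print(repr(careful))
```

Comparing the two printed values shows the leftover script text and the unmatched multi-line tag in the naive result.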

If you need more speed and less accuracy, you could use raw lxml.

<pre><code>import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()
</code></pre>

I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, <a href="https://stackoverflow.com/questions/761824/python-how-to-convert-markdown-formatted-text-to-text">here</a>.
<pre><code>from bs4 import BeautifulSoup

text = ' '.join(BeautifulSoup(some_html_string, &quot;html.parser&quot;).findAll(text=True))
</code></pre>
<h2>Update</h2>
Based on Fraser's comment, here is a more elegant solution:
<pre><code>from bs4 import BeautifulSoup

clean_text = ' '.join(BeautifulSoup(some_html_string, &quot;html.parser&quot;).stripped_strings)
</code></pre>

@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using <code>decompose</code> instead of <code>extract</code>, but it still didn't work. So I created my own, which also formats the text using the <code>&lt;p&gt;</code> tags and replaces <code>&lt;a&gt;</code> tags with the href link. It also copes with links inside text. It is available at <a href="https://gist.github.com/racitup/2ded9c06c2563049e7e12b25bf2a8369" rel="nofollow noreferrer">this gist</a> with a test doc embedded.

<pre><code>from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc
</code></pre>

Has anyone tried <code>bleach.clean(html,tags=[],strip=True)</code> with <a href="https://pypi.python.org/pypi/bleach" rel="nofollow noreferrer">bleach</a>? It works for me.

Install html2text using

<blockquote>
 pip install html2text
</blockquote>

Then:

<pre><code>&gt;&gt;&gt; import html2text
&gt;&gt;&gt;
&gt;&gt;&gt; h = html2text.HTML2Text()
&gt;&gt;&gt; # Ignore converting links from HTML
&gt;&gt;&gt; h.ignore_links = True
&gt;&gt;&gt; print(h.handle("&lt;p&gt;Hello, &lt;a href='http://earth.google.com/'&gt;world&lt;/a&gt;!"))
Hello, world!
</code></pre>

Here's the code I use on a regular basis.

<pre><code>from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text
</code></pre>

I hope that helps.

I know there are plenty of answers here already, but I think <a href="https://pypi.python.org/pypi/newspaper" rel="noreferrer" title="newspaper3k">newspaper3k</a> also deserves a mention. I recently needed to complete a similar task, extracting the text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.

<pre><code>from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text
</code></pre>

If you already have the HTML files downloaded you can do something like this:

<pre><code>article = Article('')
article.set_html(html)
article.parse()
article.text
</code></pre>

It even has a few NLP features for summarizing the topics of articles:

<pre><code>article.nlp()
article.summary
</code></pre>

What worked best for me is inscriptis.

<a href="https://github.com/weblyzard/inscriptis" rel="nofollow noreferrer">https://github.com/weblyzard/inscriptis</a>

<pre><code>import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)
</code></pre>

The results are really good.

You can extract only the text from HTML with BeautifulSoup:

<pre><code>from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)
</code></pre>

I've had good results with <a href="https://tika.apache.org/" rel="nofollow noreferrer">Apache Tika</a>. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box.

Tika can be run as a <a href="https://wiki.apache.org/tika/TikaJAXRS" rel="nofollow noreferrer">server</a>, is trivial to run / deploy in a Docker container, and from there can be accessed via <a href="https://github.com/chrismattmann/tika-python" rel="nofollow noreferrer">Python bindings</a>.

The comment about LibreOffice Writer has merit, since the application can run Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off task rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

While a lot of people have mentioned using regex to strip HTML tags, that approach has a lot of downsides.

For example:

<pre><code>&lt;p&gt;hello&amp;nbsp;world&lt;/p&gt;I love you
</code></pre>

Should be parsed to:

<pre><code>Hello world
I love you
</code></pre>

Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm:

<pre><code>import re
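import html

def html2text(htm):
    # Decode HTML entities first
    ret = html.unescape(htm)
    # Map typographic characters to plain ASCII equivalents
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    # Turn every whitespace character (including newlines) into a space
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    # Line-breaking tags become newlines
    ret = re.sub(r"&lt;br&gt;|&lt;br /&gt;|&lt;/p&gt;|&lt;/div&gt;|&lt;/h\d&gt;", "\n", ret, flags = re.IGNORECASE)
    # Any remaining tag becomes a space
    ret = re.sub('&lt;.*?&gt;', ' ', ret, flags=re.DOTALL)
    # Collapse runs of spaces
    ret = re.sub(r" +", " ", ret)
    return ret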
</code></pre>
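For comparison, here is a minimal sketch of a parser-based alternative that uses only the standard library's `html.parser`. The `TextExtractor` class and `extract_text` function are names made up for illustration, and the set of tags treated as line breaks is an assumption you would likely adjust. It skips the contents of script and style elements and lets the parser decode entities.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script and style."""

    SKIP = {"script", "style"}
    # Assumed set of closing tags that should end a line of text
    BLOCK = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities for us
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered after the last tag
    text = "".join(parser.parts).replace("\xa0", " ")
    # Normalize whitespace inside each line and drop empty lines
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


print(extract_text("<p>hello&nbsp;world</p><script>var x = 1;</script>I love you"))
```

This is a sketch rather than a full replacement for a real HTML library: badly nested markup or unusual tags may need extra handling.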

Another example using BeautifulSoup4 in Python 2.7.9+.

Imports:

<pre><code>import urllib2
from bs4 import BeautifulSoup
</code></pre>

Code:

<pre><code>def read_website_to_text(url):
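    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # Remove script and style elements so their contents are not extracted
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # Strip leading and trailing whitespace from each line
    lines = (line.strip() for line in text.splitlines())
    # Split phrases separated by two spaces onto their own lines
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))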
</code></pre>

Explained:

Read the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using <code>.get_text()</code>. Break the text into lines and strip leading and trailing whitespace from each, then break multi-headlines (phrases separated by two spaces) into a line each: <code>chunks = (phrase.strip() for line in lines for phrase in line.split("  "))</code>. Then use <code>text = '\n'.join(...)</code> to drop blank lines, and finally return the result encoded as UTF-8.

Notes: 

<ul>
<li>On some systems this will fail for https:// connections because of SSL certificate issues; you can turn off verification to work around that. Example fix: <a href="http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/" rel="nofollow noreferrer">http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/</a></li>
<li>Python &lt; 2.7.9 may have some issues running this.</li>
<li>text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.</li>
</ul>
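The cleanup steps described above can be tried in isolation on a plain string. This is a small Python 3 sketch, assuming the phrase separator is two spaces as in the explanation; `tidy_text` is a hypothetical helper name, and it returns a `str` rather than encoded bytes, since in Python 3 wrapping `text.encode('utf-8')` in `str()` would produce a `b'...'` representation.

```python
def tidy_text(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    # Remove leading and trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # Phrases separated by two spaces each get their own line
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Skip the empty chunks left behind by blank lines
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Headline one  Headline two \n\n\n   Body text here.  \n"
print(tidy_text(raw))
```

The same function can be applied to the output of `soup.get_text()` once the script and style elements have been removed.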

I had a similar question and actually used one of the BeautifulSoup answers.
The problem was that it was really slow. I ended up using a library called selectolax.
It's pretty limited, but it works for this task.
The only issue was that I had to manually remove unnecessary whitespace.
But it seems to work much faster than the BeautifulSoup solution.
<pre><code>from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    # Drop script and style elements so their contents are not extracted
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='')
    text = " ".join(text.split())  # collapse runs of whitespace into single spaces
    return text
</code></pre>

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad. 

I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &amp;#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

Update: <code>html2text</code> looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.

<hr>

Related questions:

<ul>
<li><a href="https://stackoverflow.com/questions/37486/filter-out-html-tags-and-resolve-entities-in-python">Filter out HTML tags and resolve entities in python</a></li>
<li><a href="https://stackoverflow.com/questions/57708/convert-xmlhtml-entities-into-unicode-string-in-python">Convert XML/HTML Entities into Unicode String in Python</a></li>
</ul>

Extracting text from HTML file using Python
