Try this:

<pre><code>public String noTags(String str) {
    Document d = Jsoup.parse(str);
    TextNode tn = new TextNode(d.body().html(), "");
    return tn.getWholeText();
}
</code></pre>

With

<pre><code>Jsoup.parse("A\nB").text();
</code></pre>

you get the output

<pre><code>"A B" 
</code></pre>

and not

<pre><code>A

B
</code></pre>

For this I'm using:

<pre><code>descrizione = Jsoup.parse(html.replaceAll("(?i)&lt;br[^&gt;]*&gt;", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
</code></pre>
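The regular expression in the snippet above can be tried on its own, without jsoup. The following is only a minimal sketch of the two string substitutions around the `.text()` call; the class name `BrPlaceholderDemo` and the methods `markBreaks` and `restoreBreaks` are made up for illustration and are not part of jsoup or of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (hypothetical names), not part of jsoup. */
class BrPlaceholderDemo {
    // Same pattern as in the answer: "<br", anything up to the next ">", case-insensitive.
    private static final Pattern BR = Pattern.compile("(?i)<br[^>]*>");

    /** Replaces every br tag in the raw HTML with the given placeholder token. */
    static String markBreaks(String html, String placeholder) {
        // quoteReplacement keeps "$" and "\" in the placeholder literal
        return BR.matcher(html).replaceAll(Matcher.quoteReplacement(placeholder));
    }

    /** Turns the placeholder back into real newlines once the tags have been stripped. */
    static String restoreBreaks(String text, String placeholder) {
        // String.replace is a literal (non-regex) replacement
        return text.replace(placeholder, "\n");
    }
}
```

Note that the pattern matches any tag whose name merely starts with `br`, which is harmless for standard HTML but worth knowing for custom markup.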

<pre><code>Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
</code></pre>

We're using this method here:

<pre><code>public static String clean(String bodyHtml,
                           String baseUri,
                           Whitelist whitelist,
                           Document.OutputSettings outputSettings)
</code></pre>

By passing it <code>Whitelist.none()</code>, we make sure that all HTML is removed.

By passing <code>new OutputSettings().prettyPrint(false)</code>, we make sure that the output is not reformatted and that line breaks are preserved.

Try this with jsoup:

<pre><code>public static String cleanPreserveLineBreaks(String bodyHtml) {

    // get pretty-printed HTML with the br and p tags preserved
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with line breaks preserved by disabling prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}
</code></pre>

You can traverse a given element

<pre><code>public String convertNodeToText(Element element)
{
    final StringBuilder buffer = new StringBuilder();

    new NodeTraversor(new NodeVisitor() {
        boolean isNewline = true;

        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                String text = textNode.text().replace('\u00A0', ' ').trim();
                if(!text.isEmpty())
                {
                    buffer.append(text);
                    isNewline = false;
                }
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (!isNewline)
                {
                    if((element.isBlock() || element.tagName().equals("br")))
                    {
                        buffer.append("\n");
                        isNewline = true;
                    }
                }
            }
        }

        @Override
        public void tail(Node node, int depth) {
        }
    }).traverse(element);

    return buffer.toString();
}
</code></pre>

And for your code:

<pre><code>String result = convertNodeToText(Jsoup.parse(html));
</code></pre>

Use <code>textNodes()</code> to get a list of the text nodes, then concatenate them with <code>\n</code> as the separator.
Here's some Scala code I use for this; a Java port should be easy:

<pre><code>val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
 .asScala.mkString("&lt;br /&gt;\n")
</code></pre>

<pre><code>/**
 * Recursive method to replace HTML br tags with Java \n. The recursion ensures that the placeholder never already exists in the text being processed.
 * @param html the HTML to convert
 * @param linebreakerString temporary placeholder that stands in for each br tag
 * @return the text of the HTML as a String, with Java newlines in place of br tags
 */
public static String replaceBrWithNewLine(String html, String linebreakerString){
    String result = "";
    if(html.contains(linebreakerString)){
        result = replaceBrWithNewLine(html, linebreakerString+"1");
    } else {
        result = Jsoup.parse(html.replaceAll("(?i)&lt;br[^&gt;]*&gt;", linebreakerString)).text(); // replace any HTML line breaks with the placeholder
        result = result.replaceAll(linebreakerString, "\n"); // then swap the placeholder for a Java newline
    }
    return result;
}
</code></pre>

Call it with the HTML in question (the markup containing the br tags) and whatever string you wish to use as the temporary newline placeholder.
For example:

<pre><code>replaceBrWithNewLine(element.html(), "br2n")
</code></pre>

The recursion ensures that the string you use as the newline placeholder never actually occurs in the source HTML: it keeps appending a "1" until the placeholder string is no longer found in the HTML. It won't have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.
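The placeholder search described above does not strictly need recursion. As a rough sketch under that assumption, the same "append a 1 until it is unused" rule can be written as a loop; `UniquePlaceholder` and `pick` are hypothetical names chosen for this illustration, not part of jsoup or of the answer.

```java
/** Illustrative helper (hypothetical names): picks a placeholder that is absent from the text. */
class UniquePlaceholder {
    /** Appends "1" to the seed until the candidate no longer occurs in the text. */
    static String pick(String text, String seed) {
        String candidate = seed;
        // The loop terminates because the candidate eventually grows longer than the text.
        while (text.contains(candidate)) {
            candidate = candidate + "1";
        }
        return candidate;
    }
}
```

One thing to keep in mind when combining this with the method above: `String.replaceAll` treats its first argument as a regular expression, so a placeholder containing regex metacharacters would need `Pattern.quote`, or the literal `String.replace` instead.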

<pre><code>text = Jsoup.parse(html.replaceAll("(?i)&lt;br[^&gt;]*&gt;", "br2n")).text();
text = text.replaceAll("br2n", "\n");
</code></pre>

works only if the HTML itself doesn't contain "br2n".

So,

<pre><code>text = Jsoup.parse(html.replaceAll("(?i)&lt;br[^&gt;]*&gt;", "&lt;pre&gt;\n&lt;/pre&gt;")).text();
</code></pre>

works more reliably and is simpler.

Based on user121196's and Green Beret's answers, with the <code>select</code>s and <code>&lt;pre&gt;</code>s, the only solution that works for me is:

<pre><code>org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("&lt;pre&gt;\n&lt;/pre&gt;");
elementWithHtml.select("p").prepend("&lt;pre&gt;\n\n&lt;/pre&gt;");
elementWithHtml.text();
</code></pre>

This is my version of translating HTML to text (a modified version of user121196's answer, actually).

It doesn't just preserve line breaks; it also formats the text and removes excessive line breaks and HTML escape symbols, so you get a much better result from your HTML (in my case I'm receiving it from mail).

It was originally written in Scala, but you can easily port it to Java:

<pre><code>def html2text( rawHtml : String ) : String = {

    val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
    htmlDoc.select("br").append("\\nl")
    htmlDoc.select("div").prepend("\\nl").append("\\nl")
    htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")

    org.jsoup.parser.Parser.unescapeEntities(
        Jsoup.clean(
          htmlDoc.html(),
          "",
          Whitelist.none(),
          new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
        ),false
    ).
    replaceAll("\\\\nl", "\n").
    replaceAll("\r","").
    replaceAll("\n\\s+\n","\n").
    replaceAll("\n\n+","\n\n").
    trim()
}
</code></pre>
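For anyone porting this to Java, the whitespace cleanup at the end of the function is plain `String` work and can be checked in isolation. Below is a minimal sketch of just that chain (not of the jsoup calls); `NewlineTidy` and `tidy` are hypothetical names used only for this illustration.

```java
/** Illustrative Java port (hypothetical names) of the cleanup chain at the end of html2text. */
class NewlineTidy {
    static String tidy(String text) {
        return text
                .replaceAll("\r", "")          // drop carriage returns
                .replaceAll("\n\\s+\n", "\n")  // newline, whitespace (including further newlines), newline: one newline
                .replaceAll("\n\n+", "\n\n")   // cap any remaining run of newlines at two
                .trim();                       // strip leading and trailing whitespace
    }
}
```

Because `\s` also matches `\n`, a run of three or more consecutive newlines is collapsed to a single newline by the second step, while exactly two newlines (one blank line) pass through unchanged. Whether that is the desired behaviour depends on the input, so it may be worth adjusting the patterns.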

Based on the other answers and the comments on this question, it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain-text representation of an HTML document. I know I was.

Fortunately, jsoup already provides a pretty comprehensive example of how to achieve this: <a href="https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java" rel="nofollow noreferrer" title="HtmlToPlainText.java">HtmlToPlainText.java</a>

The example <code>FormattingVisitor</code> can easily be tweaked to your preference and deals with most block elements and line wrapping. 

To avoid link rot, here is <a href="https://stackoverflow.com/users/153184/jonathan-hedley">Jonathan Hedley</a>'s solution in full:

<pre><code>package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

import java.io.IOException;

/**
 * HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
 * plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
 * scrape.
 * &lt;p&gt;
 * Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
 * &lt;/p&gt;
 * &lt;p&gt;
 * To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:&lt;/p&gt;
 * &lt;p&gt;&lt;code&gt;java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]&lt;/code&gt;&lt;/p&gt;
 * where &lt;i&gt;url&lt;/i&gt; is the URL to fetch, and &lt;i&gt;selector&lt;/i&gt; is an optional CSS selector.
 * 
 * @author Jonathan Hedley, jonathan@hedley.net
 */
public class HtmlToPlainText {
    private static final String userAgent = "Mozilla/5.0 (jsoup)";
    private static final int timeout = 5 * 1000;

    public static void main(String... args) throws IOException {
        Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
        final String url = args[0];
        final String selector = args.length == 2 ? args[1] : null;

        // fetch the specified URL and parse to a HTML DOM
        Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();

        HtmlToPlainText formatter = new HtmlToPlainText();

        if (selector != null) {
            Elements elements = doc.select(selector); // get each element that matches the CSS selector
            for (Element element : elements) {
                String plainText = formatter.getPlainText(element); // format that element to plain text
                System.out.println(plainText);
            }
        } else { // format the whole doc
            String plainText = formatter.getPlainText(doc);
            System.out.println(plainText);
        }
    }

    /**
     * Format an Element to plain-text
     * @param element the root element to format
     * @return formatted text
     */
    public String getPlainText(Element element) {
        FormattingVisitor formatter = new FormattingVisitor();
        NodeTraversor traversor = new NodeTraversor(formatter);
        traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node

        return formatter.toString();
    }

    // the formatting rules, implemented in a breadth-first DOM traverse
    private class FormattingVisitor implements NodeVisitor {
        private static final int maxWidth = 80;
        private int width = 0;
        private StringBuilder accum = new StringBuilder(); // holds the accumulated text

        // hit when the node is first seen
        public void head(Node node, int depth) {
            String name = node.nodeName();
            if (node instanceof TextNode)
                append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
            else if (name.equals("li"))
                append("\n * ");
            else if (name.equals("dt"))
                append("  ");
            else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
                append("\n");
        }

        // hit when all of the node's children (if any) have been visited
        public void tail(Node node, int depth) {
            String name = node.nodeName();
            if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
                append("\n");
            else if (name.equals("a"))
                append(String.format(" &lt;%s&gt;", node.absUrl("href")));
        }

        // appends text to the string builder with a simple word wrap method
        private void append(String text) {
            if (text.startsWith("\n"))
                width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
            if (text.equals(" ") &amp;&amp;
                    (accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
                return; // don't accumulate long runs of empty spaces

            if (text.length() + width &gt; maxWidth) { // won't fit, needs to wrap
                String words[] = text.split("\\s+");
                for (int i = 0; i &lt; words.length; i++) {
                    String word = words[i];
                    boolean last = i == words.length - 1;
                    if (!last) // insert a space if not the last word
                        word = word + " ";
                    if (word.length() + width &gt; maxWidth) { // wrap and reset counter
                        accum.append("\n").append(word);
                        width = word.length();
                    } else {
                        accum.append(word);
                        width += word.length();
                    }
                }
            } else { // fits as is, without need to wrap text
                accum.append(text);
                width += text.length();
            }
        }

        @Override
        public String toString() {
            return accum.toString();
        }
    }
}
</code></pre>
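If only the line-wrapping part of `FormattingVisitor` is of interest, the idea can be tried without jsoup. What follows is a simplified greedy word wrap in the same spirit as the `append` method above, not a copy of it: the names `WordWrapSketch` and `wrap` are hypothetical, and unlike the original it normalises all whitespace between words to single spaces.

```java
/** Illustrative greedy word wrap (hypothetical names), loosely modelled on FormattingVisitor.append. */
class WordWrapSketch {
    static String wrap(String text, int maxWidth) {
        StringBuilder out = new StringBuilder();
        int width = 0; // number of characters already on the current line
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input is blank
            }
            if (width > 0 && width + 1 + word.length() > maxWidth) {
                out.append('\n'); // word plus its leading space would not fit: start a new line
                width = 0;
            } else if (width > 0) {
                out.append(' ');  // separate from the previous word on the same line
                width++;
            }
            out.append(word);     // a word longer than maxWidth is kept whole on its own line
            width += word.length();
        }
        return out.toString();
    }
}
```

For instance, `WordWrapSketch.wrap("aaa bbb ccc", 7)` puts `aaa bbb` on the first line and `ccc` on the second.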

Try this by using jsoup:

<pre><code> doc.outputSettings(new OutputSettings().prettyPrint(false));

 //select all &lt;br&gt; tags and append \n after that
 doc.select("br").after("\\n");

 //select all &lt;p&gt; tags and prepend \n before that
 doc.select("p").before("\\n");

 //get the HTML from the document, and retaining original new lines
 String str = doc.html().replaceAll("\\\\n", "\n");
</code></pre>

For more complex HTML, none of the above solutions worked quite right; I was able to do the conversion successfully while preserving line breaks with:

<pre><code>Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);
</code></pre>

(version 1.10.3)

As of jsoup v1.11.2, we can use <code>Element.wholeText()</code>.

Example code:

<pre><code>String cleanString = Jsoup.parse(htmlString).wholeText();
</code></pre>

<code>user121196's</code> <a href="https://stackoverflow.com/a/19602313/1767167">answer</a> still works, but <code>wholeText()</code> preserves the alignment of the text.

I have the following code:

<pre><code>import org.jsoup.Jsoup;

public class NewClass {
    public String noTags(String str){
        return Jsoup.parse(str).text();
    }

    public static void main(String args[]) {
        String strings="&lt;!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \"&gt;" +
        "&lt;HTML&gt; &lt;HEAD&gt; &lt;TITLE&gt;&lt;/TITLE&gt; &lt;style&gt;body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}&lt;/style&gt; &lt;/HEAD&gt; &lt;BODY&gt;&lt;p&gt;&lt;b&gt;hello world&lt;/b&gt;&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;b&gt;yo&lt;/b&gt; &lt;a href=\"http://google.com\"&gt;googlez&lt;/a&gt;&lt;/p&gt;&lt;/BODY&gt; &lt;/HTML&gt; ";

        NewClass text = new NewClass();
        System.out.println((text.noTags(strings)));
    }
}
</code></pre>

And I get this result:

<pre><code>hello world yo googlez
</code></pre>

But I want the line break preserved:

<pre><code>hello world
yo googlez
</code></pre>

I have looked at <a href="https://jsoup.org/apidocs/org/jsoup/nodes/TextNode.html#getWholeText--">jsoup's TextNode#getWholeText()</a> but I can't figure out how to use it.

If there's a <code>&lt;br&gt;</code> in the markup I parse, how can I get a line break in my resulting output?

How do I preserve line breaks when using jsoup to convert html to plain text?

Java
