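I have snippets of HTML stored in a table. Not entire pages, no tags or the like, just basic formatting.
I would like to be able to display that HTML as text only, with no formatting, on a given page (actually just the first 30 - 50 characters, but that's the easy bit).
How can I place the "text" within that HTML into a string as plain text?
So this piece of code: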
<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>
becomes:
Hello World. Is there anyone out there?
Answered 2009-07-13 19:17:45
The MIT-licensed HtmlAgilityPack has, in one of its samples, a method that converts HTML to plain text.
var plainText = HtmlUtilities.ConvertToPlainText(html); // html is a string containing the markup
Feed it an HTML string like
<b>hello, <i>world!</i></b>
and you'll get a plain text result like:
hello, world!
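Because the helper above lives in the library's samples rather than in the library itself, it has to be copied into your project. For simple fragments, a rough alternative is to read the parsed document's InnerText directly. The sketch below assumes the HtmlAgilityPack NuGet package is installed; PlainTextSketch and ToPlainText are hypothetical names used only for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class PlainTextSketch
{
    // Hypothetical helper, not part of HtmlAgilityPack itself.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text of all descendant nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<b>hello, <i>world!</i></b>"));
    }
}
```

Unlike the sample method, this shortcut does not skip the contents of script or style elements and inserts no line breaks for paragraphs, so it is probably only suitable for small formatting-only fragments like the ones in the question.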
Answered 2013-05-07 05:06:12
I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:
// Requires: using System; using System.Text.RegularExpressions;
private static string HtmlToPlainText(string html)
{
    // matches one or more non-word characters (white space and line breaks, but also punctuation) between '>' and '<'
    const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";
    // matches any character between '<' and '>', even when the end tag is missing
    const string stripFormatting = @"<[^>]*(>|$)";
    // matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    // Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);
    // Decode HTML entities last, so that encoded text such as &lt;b&gt; is not mistaken for a tag and stripped
    text = System.Net.WebUtility.HtmlDecode(text);
    return text;
}
Answered 2017-10-11 14:38:46
A three-step process for converting HTML into plain text
First, install the NuGet package for HtmlAgilityPack. Second, create this class:
using System.IO;
using HtmlAgilityPack;

public class HtmlToText
{
    public HtmlToText()
    {
    }

    // Converts the HTML file at the given path to plain text.
    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    // Converts an HTML string to plain text.
    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;
            case HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }
                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}
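The class above is based on the sample referenced in Judah Himango's answer.
Third, create an object of the above class and use its ConvertHtml(HTMLContent)
method to convert HTML into plain text, rather than ConvertToPlainText(string html):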
HtmlToText htt = new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);
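The question also asks for only the first 30 - 50 characters of the converted text. A minimal sketch of that last step, using only the base class library, could look like the following. PreviewText and Truncate are hypothetical names, and the input is assumed to be text that has already been converted to plain text by one of the approaches above.

```csharp
using System;

public static class PreviewText
{
    // Hypothetical helper: returns at most maxLength characters of the given plain text.
    // Note: Substring counts UTF-16 code units, so a surrogate pair could be split at the cut.
    public static string Truncate(string plainText, int maxLength)
    {
        if (string.IsNullOrEmpty(plainText) || maxLength <= 0)
            return string.Empty;

        return plainText.Length <= maxLength
            ? plainText
            : plainText.Substring(0, maxLength);
    }
}

public static class Program
{
    public static void Main()
    {
        string plainText = "Hello World. Is there anyone out there?";
        Console.WriteLine(PreviewText.Truncate(plainText, 30));
    }
}
```

Truncating after the conversion, rather than before it, avoids cutting through the middle of a tag or an entity.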
https://stackoverflow.com/questions/286813