You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
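For the second half of that suggestion, here is a minimal sketch of querying the result with LINQ to XML. It assumes the Tidy step has already produced well-formed XHTML in a string, with no DOCTYPE and no HTML-only entities such as &amp;nbsp;; the TidyNet call itself is not shown, and the class and method names are made up for illustration. Note that XHTML elements live in the XHTML namespace, so element names must be qualified:

<pre><code>using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XhtmlLinks
{
    // XHTML elements belong to this namespace; unqualified names would match nothing.
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Hypothetical helper: returns the href of every anchor that has one.
    public static List&lt;string&gt; GetHrefs(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants(Xhtml + "a")
                  .Select(a =&gt; (string)a.Attribute("href"))   // null when the attribute is missing
                  .Where(href =&gt; href != null)
                  .ToList();
    }
}
</code></pre>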

Another option is to use the built-in engine, mshtml:

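<pre><code>using mshtml;
...
// IHTMLDocument2.write takes an array of arguments, so wrap the HTML string in an object[]
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);   // parse the markup into the in-memory document
</code></pre>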

This lets you use JavaScript-like DOM functions such as getElementById().
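As a rough sketch of such a lookup: in the mshtml interop assembly, getElementById is declared on IHTMLDocument3, so the document needs one more cast. The helper name below is made up, and the code assumes a project reference to Microsoft.mshtml and that the document is ready to query as soon as it has been written, which may not hold for pages that load external resources:

<pre><code>using mshtml;

static class MshtmlLookup
{
    // Hypothetical helper: returns the inner text of the element with the given id, or null.
    public static string GetTextById(string html, string id)
    {
        object[] pageText = { html };
        HTMLDocument doc = new HTMLDocumentClass();
        IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
        doc2.write(pageText);
        doc2.close();   // signal that no more markup will be written

        // getElementById lives on IHTMLDocument3, not IHTMLDocument2.
        IHTMLDocument3 doc3 = (IHTMLDocument3)doc;
        IHTMLElement element = doc3.getElementById(id);
        return element == null ? null : element.innerText;
    }
}
</code></pre>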

I found a project called Fizzler that takes a jQuery/Sizzle approach to selecting HTML elements. It's based on HTML Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors over nasty XPath.

<a href="http://code.google.com/p/fizzler/" rel="nofollow noreferrer">http://code.google.com/p/fizzler/</a>
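A minimal sketch of what that looks like, assuming the HtmlAgilityPack and Fizzler packages are referenced. The markup and the selector are made-up examples; QuerySelectorAll is the extension method Fizzler adds to HtmlNode:

<pre><code>using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

static class FizzlerSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("&lt;ul&gt;&lt;li class='item'&gt;&lt;a href='/one'&gt;One&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href='/two'&gt;Two&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;");

        // CSS selector instead of XPath: anchors inside list items that have class "item".
        foreach (HtmlNode a in doc.DocumentNode.QuerySelectorAll("li.item a"))
            Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}
</code></pre>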

You can do a lot without going nuts on 3rd-party products and mshtml (i.e. interop). Use System.Windows.Forms.WebBrowser. From there, you can call GetElementById on an HtmlDocument or GetElementsByTagName on an HtmlElement. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (IMO a lesser evil than interop) to do it:

<pre><code>var wb = new WebBrowser();
</code></pre>

... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:

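<pre><code>var doc = wb.Document;   // WebBrowser exposes the loaded page as Document
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;   // the underlying COM DOM element
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);
// Alternative without reflection: elem.InvokeMember("click");
</code></pre>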

You can use similar reflection to submit forms, etc.
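Here is a rough sketch of the whole flow for a form submit. It swaps the reflection call for HtmlElement.InvokeMember, which WebBrowser's HtmlElement already provides. The helper name and the formId parameter are made up, and the code assumes it runs inside a Windows Forms application (STA thread with a message loop) so that DocumentCompleted actually fires:

<pre><code>using System.Windows.Forms;

static class FormSubmitSketch
{
    // Hypothetical helper: navigates to url, then submits the form with id formId once.
    public static void SubmitWhenLoaded(WebBrowser wb, string url, string formId)
    {
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (sender, e) =&gt;
        {
            // DocumentCompleted also fires for each frame; wait for the top-level page.
            if (!e.Url.Equals(wb.Url)) return;

            // Unsubscribe first so the page loaded by the submit does not trigger another submit.
            wb.DocumentCompleted -= handler;

            HtmlElement form = wb.Document.GetElementById(formId);
            if (form != null)
                form.InvokeMember("submit");
        };
        wb.DocumentCompleted += handler;
        wb.Navigate(url);
    }
}
</code></pre>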

Enjoy.

I've written some code that provides "LINQ to HTML" functionality, and I thought I would share it here. It is based on Majestic-12: it takes the Majestic-12 results and produces LINQ to XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:

<pre><code>IEnumerable&lt;XNode&gt; auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);

foreach (XElement anchorTag in auctionNodes.OfType&lt;XElement&gt;().DescendantsAndSelf("a")) {

    if (anchorTag.Attribute("href") == null)
        continue;

    Console.WriteLine(anchorTag.Attribute("href").Value);
}
</code></pre>

I wanted to use Majestic-12 because I know it has a lot of built-in knowledge about HTML that is found in the wild. What I've found, though, is that mapping the Majestic-12 results to something that LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use it you will find pages that are rejected. You'll need to fix up the code to address that. When an exception is thrown, check exception.Data["source"], as it is likely set to the HTML tag that caused the exception. Handling the HTML in a nice manner is at times not trivial...

So now that expectations are realistically low, here's the code :)

<pre><code>using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace Majestic12ToXml {
public class Majestic12ToXml {

    static public IEnumerable&lt;XNode&gt; ConvertNodesToXml(byte[] htmlAsBytes) {

        HTMLparser parser = OpenParser();
        parser.Init(htmlAsBytes);

        XElement currentNode = new XElement("document");

        HTMLchunk m12chunk = null;

        int xmlnsAttributeIndex = 0;
        string originalHtml = "";

        while ((m12chunk = parser.ParseNext()) != null) {

            try {

                Debug.Assert(!m12chunk.bHashMode);  // popular default for Majestic-12 setting

                XNode newNode = null;
                XElement newNodesParent = null;

                switch (m12chunk.oType) {
                    case HTMLchunkType.OpenTag:

                        // Tags are added as a child to the current tag,
                        // except when the new tag implies the closure of
                        // some number of ancestor tags.

                        newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                        if (newNode != null) {
                            currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                            newNodesParent = currentNode;

                            newNodesParent.Add(newNode);

                            currentNode = newNode as XElement;
                        }

                        break;

                    case HTMLchunkType.CloseTag:

                        if (m12chunk.bEndClosure) {

                            newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                            if (newNode != null) {
                                currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                newNodesParent = currentNode;
                                newNodesParent.Add(newNode);
                            }
                        }
                        else {
                            XElement nodeToClose = currentNode;

                            string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                            while (nodeToClose != null &amp;&amp; nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                nodeToClose = nodeToClose.Parent;

                            if (nodeToClose != null)
                                currentNode = nodeToClose.Parent;

                            Debug.Assert(currentNode != null);
                        }

                        break;

                    case HTMLchunkType.Script:

                        newNode = new XElement("script", "REMOVED");
                        newNodesParent = currentNode;
                        newNodesParent.Add(newNode);
                        break;

                    case HTMLchunkType.Comment:

                        newNodesParent = currentNode;

                        if (m12chunk.sTag == "!--")
                            newNode = new XComment(m12chunk.oHTML);
                        else if (m12chunk.sTag == "![CDATA[")
                            newNode = new XCData(m12chunk.oHTML);
                        else
                            throw new Exception("Unrecognized comment sTag");

                        newNodesParent.Add(newNode);

                        break;

                    case HTMLchunkType.Text:

                        currentNode.Add(m12chunk.oHTML);
                        break;

                    default:
                        break;
                }
            }
            catch (Exception e) {
                var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);

                // the original html is copied for tracing/debugging purposes
                originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                    .Take(m12chunk.iChunkLength)
                    .Select(B =&gt; (char)B).ToArray());

                wrappedE.Data.Add("source", originalHtml);

                throw wrappedE;
            }
        }

        while (currentNode.Parent != null)
            currentNode = currentNode.Parent;

        return currentNode.Nodes();
    }

    static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {

        string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

        XElement discoveredParent = null;

        // Get a list of all ancestors
        List&lt;XElement&gt; ancestors = new List&lt;XElement&gt;();
        XElement ancestor = nextPotentialParent;
        while (ancestor != null) {
            ancestors.Add(ancestor);
            ancestor = ancestor.Parent;
        }

        // Check if the new tag implies a previous tag was closed.
        if ("form" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .Where(XE =&gt; m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE =&gt; XE.Parent)
                .FirstOrDefault();
        }
        else if ("td" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE =&gt; "tr" != XE.Name)
                .Where(XE =&gt; m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE =&gt; XE.Parent)
                .FirstOrDefault();
        }
        else if ("tr" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE =&gt; !("table" == XE.Name
                                    || "thead" == XE.Name
                                    || "tbody" == XE.Name
                                    || "tfoot" == XE.Name))
                .Where(XE =&gt; m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE =&gt; XE.Parent)
                .FirstOrDefault();
        }
        else if ("thead" == m12chunkCleanedTag
                  || "tbody" == m12chunkCleanedTag
                  || "tfoot" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE =&gt; "table" != XE.Name)
                .Where(XE =&gt; m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE =&gt; XE.Parent)
                .FirstOrDefault();
        }

        return discoveredParent ?? nextPotentialParent;
    }

    static string CleanupTagName(string originalName, string originalHtml) {

        string tagName = originalName;

        tagName = tagName.TrimStart(new char[] { '?' });  // for nodes &lt;?xml &gt;

        if (tagName.Contains(':'))
            tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);

        return tagName;
    }

    static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);

    static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {

        result = null;
        string attributeName = originalName;

        if (string.IsNullOrEmpty(originalName))
            return false;

        if (_startsAsNumeric.IsMatch(originalName))
            return false;

        //
        // transform xmlns attributes so they don't actually create any XML namespaces
        //
        if (attributeName.ToLower().Equals("xmlns")) {

            attributeName = "xmlns_" + xmlnsIndex.ToString();
            xmlnsIndex++;
        }
        else {
            if (attributeName.ToLower().StartsWith("xmlns:")) {
                attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
            }

            //
            // trim trailing \"
            //
            attributeName = attributeName.TrimEnd(new char[] { '\"' });

            attributeName = attributeName.Replace(":", "_");
        }

        result = attributeName;

        return true;
    }

    static Regex _weirdTag = new Regex(@"^&lt;!\[.*\]&gt;$");       // matches "&lt;![if !supportEmptyParas]&gt;"
    static Regex _aspnetPrecompiled = new Regex(@"^&lt;%.*%&gt;$"); // matches "&lt;%@ ... %&gt;"
    static Regex _shortHtmlComment = new Regex(@"^&lt;!-.*-&gt;$"); // matches "&lt;!-Extra_Images-&gt;"

    static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {

        if (string.IsNullOrEmpty(m12chunk.sTag)) {

            if (m12chunk.sParams.Length &gt; 0 &amp;&amp; m12chunk.sParams[0].ToLower().Equals("doctype"))
                return new XElement("doctype");

            if (_weirdTag.IsMatch(originalHtml))
                return new XElement("REMOVED_weirdBlockParenthesisTag");

            if (_aspnetPrecompiled.IsMatch(originalHtml))
                return new XElement("REMOVED_ASPNET_PrecompiledDirective");

            if (_shortHtmlComment.IsMatch(originalHtml))
                return new XElement("REMOVED_ShortHtmlComment");

            // Nodes like "&lt;br &lt;br&gt;" will end up with a m12chunk.sTag==""...  We discard these nodes.
            return null;
        }

        string tagName = CleanupTagName(m12chunk.sTag, originalHtml);

        XElement result = new XElement(tagName);

        List&lt;XAttribute&gt; attributes = new List&lt;XAttribute&gt;();

        for (int i = 0; i &lt; m12chunk.iParams; i++) {

            if (m12chunk.sParams[i] == "&lt;!--") {

                // an HTML comment was embedded within a tag.  This comment and its contents
                // will be interpreted as attributes by Majestic-12... skip these attributes
                for (; i &lt; m12chunk.iParams; i++) {

                    // stop at the param that closes the comment
                    // (the original tested m12chunk.sTag here, which never changes inside this loop)
                    if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "--&gt;")
                        break;
                }

                continue;
            }

            if (m12chunk.sParams[i] == "?" &amp;&amp; string.IsNullOrEmpty(m12chunk.sValues[i]))
                continue;

            string attributeName = m12chunk.sParams[i];

            if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
                continue;

            attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
        }

        // If attributes are duplicated with different values, we complain.
        // If attributes are duplicated with the same value, we remove all but 1.
        var duplicatedAttributes = attributes.GroupBy(A =&gt; A.Name).Where(G =&gt; G.Count() &gt; 1);

        foreach (var duplicatedAttribute in duplicatedAttributes) {

            if (duplicatedAttribute.GroupBy(DA =&gt; DA.Value).Count() &gt; 1)
                throw new Exception("Attribute value was given different values");

            attributes.RemoveAll(A =&gt; A.Name == duplicatedAttribute.Key);
            attributes.Add(duplicatedAttribute.First());
        }

        result.Add(attributes);

        return result;
    }

    static HTMLparser OpenParser() {
        HTMLparser oP = new HTMLparser();

        // The code+comments in this function are from the Majestic-12 sample documentation.

        // ...

        // This is optional, but if you want high performance then you may
        // want to set chunk hash mode to FALSE. This would result in tag params
        // being added to string arrays in HTMLchunk object called sParams and sValues, with number
        // of actual params being in iParams. See code below for details.
        //
        // When TRUE (and it's the default) tag params will be added to hashtable HTMLchunk (object).oParams
        oP.SetChunkHashMode(false);

        // if you set this to true then original parsed HTML for given chunk will be kept -
        // this will reduce performance somewhat, but may be desirable in some cases where
        // reconstruction of HTML may be necessary
        oP.bKeepRawHTML = false;

        // if set to true (it is false by default), then entities will be decoded: this is essential
        // if you want to get strings that contain final representation of the data in HTML, however
        // you should be aware that if you want to use such strings into output HTML string then you will
        // need to do Entity encoding or same string may fail later
        oP.bDecodeEntities = true;

        // we have option to keep most entities as is - only replace stuff like &amp;nbsp;
        // this is called Mini Entities mode - it is handy when HTML will need
        // to be re-created after it was parsed, though in this case really
        // entities should not be parsed at all
        oP.bDecodeMiniEntities = true;

        if (!oP.bDecodeEntities &amp;&amp; oP.bDecodeMiniEntities)
            oP.InitMiniEntities();

        // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
        // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
        // this only works if auto extraction is enabled
        oP.bAutoExtractBetweenTagsOnly = true;

        // if true then comments will be extracted automatically
        oP.bAutoKeepComments = true;

        // if true then scripts will be extracted automatically:
        oP.bAutoKeepScripts = true;

        // if this option is true then whitespace before start of tag will be compressed to single
        // space character in string: " ", if false then full whitespace before tag will be returned (slower)
        // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
        // a waste of CPU cycles
        oP.bCompressWhiteSpaceBeforeTag = true;

        // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
        // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
        // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
        // or open
        oP.bAutoMarkClosedTagsWithParamsAsOpen = false;

        return oP;
    }
}
} 
</code></pre>
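The advice above about exception.Data["source"] can be sketched like this. The wrapper class and method names are made up for illustration, and reading the page from a file is just one way to get the bytes; it assumes the Majestic12ToXml class above is compiled into the same project:

<pre><code>using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

static class ConvertWithDiagnostics
{
    // Hypothetical wrapper: returns the converted nodes, or null when the page is rejected.
    public static IEnumerable&lt;XNode&gt; TryConvert(string path)
    {
        byte[] htmlAsBytes = File.ReadAllBytes(path);
        try
        {
            return Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(htmlAsBytes);
        }
        catch (Exception e)
        {
            Console.Error.WriteLine(e.Message);

            // ConvertNodesToXml stores the offending chunk of HTML under the "source" key.
            if (e.Data.Contains("source"))
                Console.Error.WriteLine("Offending HTML: " + e.Data["source"]);

            return null;
        }
    }
}
</code></pre>

Logging the offending tag this way gives you a concrete input to work from when you extend the cleanup code.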

The Html Agility Pack has been mentioned before - if you are going for speed, you might also want to check out <a href="http://www.majestic12.co.uk/projects/html_parser.php" rel="nofollow noreferrer">the Majestic-12 HTML parser</a>. Its handling is rather clunky, but it delivers a really fast parsing experience.

A solution that needs no third-party library: the WebBrowser class, which can be used from a console application as well as from ASP.NET.

<pre><code>using System;
using System.Threading;
using System.Windows.Forms;

class ParseHTML
{
    private string ReturnString;

    public string doParsing(string html)
    {
        // WebBrowser is a COM-based control, so it has to run on an STA thread.
        Thread t = new Thread(TParseMain);
        t.SetApartmentState(ApartmentState.STA); // the ApartmentState property is obsolete
        t.Start(html);
        t.Join(); // wait for the worker so that ReturnString has been set
        return ReturnString;
    }

    private void TParseMain(object html)
    {
        using (WebBrowser wbc = new WebBrowser())
        {
            // "Magic words": assigning some placeholder text first appears to be
            // needed so that wbc.Document is available before OpenNew is called.
            wbc.DocumentText = "feces of a dummy";
            HtmlDocument doc = wbc.Document.OpenNew(true);
            doc.Write((string)html);
            // Do the actual work on the parsed document here.
            this.ReturnString = doc.Body.InnerHtml + " do here something";
        }
    }
}
</code></pre>

Usage:

<pre><code>string myhtml = "&lt;HTML&gt;&lt;BODY&gt;This is a new HTML document.&lt;/BODY&gt;&lt;/HTML&gt;";
Console.WriteLine("before:" + myhtml);
myhtml = (new ParseHTML()).doParsing(myhtml);
Console.WriteLine("after:" + myhtml);
</code></pre>

The trouble with parsing HTML is that it isn't an exact science. If it were XHTML that you were parsing, things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.

I've used <a href="http://www.codeproject.com/KB/cs/ZetaHtmlTidy.aspx" rel="nofollow noreferrer">ZetaHtmlTidy</a> in the past to load random websites and then query various parts of the content with XPath (e.g. /html/body//p[@class='textblock']). It worked well, but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.
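
To illustrate the XPath step only, here is a minimal sketch that runs the query above with the standard System.Xml classes. It assumes the page has already been tidied into well-formed XHTML (the tidy step itself is not shown) and that the markup carries no default namespace; the class and method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of ZetaHtmlTidy.
static class XhtmlXPathSketch
{
    // Returns the trimmed text of every <p class="textblock"> below /html/body.
    // Assumption: the input is well-formed XHTML without a default namespace.
    // If the root declares xmlns="http://www.w3.org/1999/xhtml", the query
    // needs an XmlNamespaceManager and prefixed element names instead.
    public static List<string> ExtractTextBlocks(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        var result = new List<string>();
        foreach (XmlNode p in doc.SelectNodes("/html/body//p[@class='textblock']"))
        {
            result.Add(p.InnerText.Trim());
        }
        return result;
    }
}
```

Calling `XhtmlXPathSketch.ExtractTextBlocks(xhtml)` gives the matching paragraphs in document order. The XmlException on malformed input is exactly why the tidy step has to come first.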

You could use an HTML DTD together with the generic XML parsing libraries.

Use WatiN if you need to see the effect of JavaScript on the page (and you're prepared to start a browser).

Depending on your needs, you might go for one of the more feature-rich libraries. I tried most of the solutions suggested here, and the one that stood head and shoulders above the rest was Html Agility Pack. It is a very forgiving and flexible parser.
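
As a rough sketch of what "forgiving" means in practice, the snippet below loads markup with unclosed tags and collects the link targets. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack", not part of .NET itself

// Hypothetical helper for illustration only.
static class AgilityPackSketch
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates malformed markup instead of throwing

        var links = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }
        foreach (HtmlNode a in anchors)
        {
            links.Add(a.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The null check is the detail that is easy to miss: a query with no matches does not produce an empty collection.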

Try this script.

<a href="http://www.biterscripting.com/SS_URLs.html" rel="nofollow noreferrer">http://www.biterscripting.com/SS_URLs.html</a>

When I use it with this URL,

<pre><code>script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")
</code></pre>

it shows me all the links on the page for this thread.

<pre><code>http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.
</code></pre>

You can modify that script to check for images, variables, whatever.

I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.

You can read an article about them and download the source code at <a href="http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c" rel="nofollow">http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c</a>.

There's also an article about a generic parsing helper class at <a href="http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class" rel="nofollow">http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class</a>.

I'm looking for a library/method to parse an HTML file with more HTML-specific features than generic XML parsing libraries.

What is the best way to parse HTML in C#?
