If I remember correctly from a similar problem I ran into a while ago, you can "ignore" the namespace by mapping it to <code>None</code> like this:

<pre><code>sel = CSSSelector('#maincontent .rprt_all a', namespaces={None: "http://www.w3.org/1999/xhtml"})
</code></pre>

Good luck getting a standard XML/DOM parser to work on most real-world HTML. Your best bet would be to use <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow">BeautifulSoup</a> (<code>pip install beautifulsoup4</code> or <code>easy_install beautifulsoup4</code>), which does a lot to cope with malformed markup. Maybe something like this instead?

<pre><code>import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
# Name the parser explicitly so the result does not depend on what is installed
bs = BeautifulSoup(response.content, 'html.parser')
div = bs.find('div', class_='linkoutlist')
links = [a['href'] for a in div.find_all('a')]

&gt;&gt;&gt; links
['http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&amp;volume=19&amp;issue=3&amp;spage=125', 'http://ovidsp.ovid.com/ovidweb.cgi?T=JS&amp;PAGE=linkout&amp;SEARCH=15107654.ui', 'https://www.researchgate.net/publication/e/pm/15107654?ln_t=p&amp;ln_o=linkout', 'http://www.diseaseinfosearch.org/result/2199', 'http://www.nlm.nih.gov/medlineplus/antidepressants.html', 'http://toxnet.nlm.nih.gov/cgi-bin/sis/search/r?dbs+hsdb:@term+@rn+24219-97-4']
</code></pre>

I know it's not the library you were looking to use, but I have historically slammed my head into walls on many occasions when it comes to DOM. The creators of BeautifulSoup have circumvented many edge cases that tend to happen in the wild.

You need to handle namespaces, including an empty one.

Working solution:

<pre><code>from pyquery import PyQuery as pq
import requests


response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')

namespaces = {'xi': 'http://www.w3.org/2001/XInclude', 'test': 'http://www.w3.org/1999/xhtml'}
links = pq('#maincontent .linkoutlist test|a', response.content, namespaces=namespaces)
for link in links:
    print(link.attrib.get("title", "No title"))
</code></pre>

Prints titles of all links matching the selector:

<pre><code>Full text at publisher's site
No title
Free resource
Free resource
Free resource
Free resource
</code></pre>
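
To see why the prefix is needed, here is a minimal sketch using only the standard library's <code>xml.etree.ElementTree</code> rather than lxml or PyQuery (an assumption made so it runs without extra packages; the namespace handling it illustrates is the same). The fragment and the <code>example.com</code> URLs are made up to resemble the <code>.linkoutlist</code> div from the question.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical fragment shaped like the .linkoutlist div from the question.
SNIPPET = (
    '<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">'
    '<ul>'
    '<li><a title="Full text at publisher\'s site" href="http://example.com/a">A</a></li>'
    '<li><a href="http://example.com/b">B</a></li>'
    '</ul>'
    '</div>'
)

root = ET.fromstring(SNIPPET)

# The default namespace is folded into every element's tag name.
print(root.tag)

# An unqualified name therefore does not match the namespaced <a> elements.
plain = root.findall(".//a")

# Binding the namespace to a prefix does match. The prefix name is arbitrary;
# "test" simply mirrors the one used in the answer.
prefixed = root.findall(".//test:a", {"test": XHTML_NS})

# Attributes are not namespaced, so "title" is looked up as usual.
titles = [a.get("title", "No title") for a in prefixed]
print(len(plain), titles)
```

In an XML parse the element is no longer plain <code>a</code> but <code>a</code> in the XHTML namespace, which is why the unprefixed selector comes back empty.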

<hr>

Or, just set the <code>parser</code> to <code>"html"</code> and forget about namespaces:

<pre><code>links = pq('#maincontent .linkoutlist a', response.content, parser="html")
for link in links:
    print(link.attrib.get("title", "No title"))
</code></pre>
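
The reason the HTML route works is that an HTML parser treats <code>xmlns</code> as just another attribute, so tag names stay unqualified. As a rough illustration of that idea, here is a sketch built on the standard library's <code>html.parser</code> instead of PyQuery (an assumption, to keep it dependency-free). The <code>LinkoutTitles</code> class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkoutTitles(HTMLParser):
    """Collect the title of every <a> inside a <div class="linkoutlist">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # <div> nesting depth inside the target div; 0 means outside
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter the target div, or track a nested div once inside it.
            if self.depth or "linkoutlist" in classes:
                self.depth += 1
        elif tag == "a" and self.depth:
            self.titles.append(attrs.get("title") or "No title")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


# Hypothetical markup: the xmlns attribute is present and the <li> tags are unclosed.
SNIPPET = """
<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">
  <ul>
    <li><a title="Full text at publisher's site" href="http://example.com/a">A</a>
    <li><a href="http://example.com/b">B</a>
  </ul>
</div>
<a title="outside" href="http://example.com/c">C</a>
"""

parser = LinkoutTitles()
parser.feed(SNIPPET)
print(parser.titles)
```

The <code>xmlns</code> attribute is never consulted here, and the link after the closing <code>&lt;/div&gt;</code> is skipped because it falls outside the target div.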

This is driving me totally nuts; I've been struggling with it for many hours. Any help would be much appreciated.

I'm using <a href="https://pypi.python.org/pypi/pyquery" rel="noreferrer">PyQuery</a> 1.2.9 (which is built on top of <code>lxml</code>) to scrape <a href="http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care" rel="noreferrer">this URL</a>. I just want to get a list of all the links in the <code>.linkoutlist</code> section. 

This is my request in full:

<pre><code>import requests
from pyquery import PyQuery as pq

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
doc = pq(response.content)
links = doc('#maincontent .linkoutlist a')
print(links)
</code></pre>

But that returns an empty array. If I use this query instead:

<pre><code>links = doc('#maincontent .linkoutlist')
</code></pre>

Then I get back this HTML:

<pre><code>&lt;div xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude" class="linkoutlist"&gt;
 &lt;h4&gt;Full Text Sources&lt;/h4&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a title="Full text at publisher's site" href="http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&amp;amp;volume=19&amp;amp;issue=3&amp;amp;spage=125" ref="itool=Abstract&amp;amp;PrId=3159&amp;amp;uid=15107654&amp;amp;db=pubmed&amp;amp;log$=linkoutlink&amp;amp;nlmid=8609061" target="_blank"&gt;Lippincott Williams &amp;amp; Wilkins&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://ovidsp.ovid.com/ovidweb.cgi?T=JS&amp;amp;PAGE=linkout&amp;amp;SEARCH=15107654.ui" ref="itool=Abstract&amp;amp;PrId=3682&amp;amp;uid=15107654&amp;amp;db=pubmed&amp;amp;log$=linkoutlink&amp;amp;nlmid=8609061" target="_blank"&gt;Ovid Technologies, Inc.&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;h4&gt;Other Literature Sources&lt;/h4&gt;
 ...
&lt;/div&gt;
</code></pre>

So the parent selectors do return HTML with lots of <code>&lt;a&gt;</code> tags. This also appears to be valid HTML. 

More experimenting reveals that lxml does not like the <code>xmlns</code> attribute on the opening div, for some reason. 

How can I ignore this in lxml, and just parse it like regular HTML?

UPDATE: Trying <code>ns_clean</code>, still failing:

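<pre><code>from io import BytesIO
from lxml import etree
from lxml.cssselect import CSSSelector

# response.content is bytes, so wrap it in BytesIO for etree.parse
parser = etree.XMLParser(ns_clean=True)
tree = etree.parse(BytesIO(response.content), parser)
sel = CSSSelector('#maincontent .rprt_all a')
print(sel(tree))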
</code></pre>

Using lxml to parse namespaced HTML?
