Content sourced from Stack Overflow, translated and used under the CC BY-SA 3.0 license.
I'm using PyQuery 1.2.9 (which is built on lxml) to scrape this URL. I just want to get a list of the links in the .linkoutlist section.

Here's my full request:
import requests
from pyquery import PyQuery as pq

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
doc = pq(response.content)
links = doc('#maincontent .linkoutlist a')
print links
But this returns an empty result. If I use this query instead:
links = doc('#maincontent .linkoutlist')
then I get this HTML back:
<div xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude" class="linkoutlist">
    <h4>Full Text Sources</h4>
    <ul>
        <li><a title="Full text at publisher's site" href="http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&volume=19&issue=3&spage=125" ref="itool=Abstract&PrId=3159&uid=15107654&db=pubmed&log$=linkoutlink&nlmid=8609061" target="_blank">Lippincott Williams & Wilkins</a></li>
        <li><a href="http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=15107654.ui" ref="itool=Abstract&PrId=3682&uid=15107654&db=pubmed&log$=linkoutlink&nlmid=8609061" target="_blank">Ovid Technologies, Inc.</a></li>
    </ul>
    <h4>Other Literature Sources</h4>
    ...
</div>
How can I get lxml to ignore the namespaces and parse this like regular HTML?
You need to handle the namespaces. For example:
from pyquery import PyQuery as pq
import requests

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')

namespaces = {'xi': 'http://www.w3.org/2001/XInclude', 'test': 'http://www.w3.org/1999/xhtml'}
links = pq('#maincontent .linkoutlist test|a', response.content, namespaces=namespaces)
for link in links:
    print link.attrib.get("title", "No title")
This prints the titles of all links matching the selector:
Full text at publisher's site
No title
Free resource
Free resource
Free resource
Free resource
Alternatively, you can set the parser to "html":
links = pq('#maincontent .linkoutlist a', response.content, parser="html")
for link in links:
    print link.attrib.get("title", "No title")
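Since the snippets above need network access, here is a self-contained sketch of the same idea using lxml's HTML parser directly (the sample markup is mine, not from the page): the HTML parser treats xmlns as an ordinary attribute, so elements end up un-namespaced and plain tag names match again.

```python
from lxml import html

# Trimmed stand-in for the scraped fragment (hypothetical sample data)
snippet = '''<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">
  <h4>Full Text Sources</h4>
  <ul><li><a title="Full text" href="http://example.com/full">Publisher</a></li></ul>
</div>'''

doc = html.fromstring(snippet)

# With the HTML parser, xmlns is just another attribute, so an
# unprefixed XPath tag name matches the links:
hrefs = doc.xpath('//div[@class="linkoutlist"]//a/@href')
print(hrefs)  # -> ['http://example.com/full']
```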
Alternatively, with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
bs = BeautifulSoup(response.content)
div = bs.find('div', class_='linkoutlist')
links = [a['href'] for a in div.find_all('a')]

>>> links
['http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&volume=19&issue=3&spage=125',
 'http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=15107654.ui',
 'https://www.researchgate.net/publication/e/pm/15107654?ln_t=p&ln_o=linkout',
 'http://www.diseaseinfosearch.org/result/2199',
 'http://www.nlm.nih.gov/medlineplus/antidepressants.html',
 'http://toxnet.nlm.nih.gov/cgi-bin/sis/search/r?dbs+hsdb:@term+@rn+24219-97-4']