I am using Ubuntu 14.04.3 and Python 2.7.6 to parse a Yellowpages.com (YP) apartments page, using lxml and xpath. The YP pages appear to share the same layout. Each page lists the apartments down the center column, and each center-column apartment has an index number, 30 indexed apartments per page. Other apartments shown at the top, bottom, and right of the page look like ads and are of no interest for parsing. I parse the page and get a count of each item extracted for the listed apartments. Even though there are 30 numbered apartments, I get different counts for the different items, which seems to be the problem. For example:
lenIdxBusNames = 30
lenBusinessNames = 32
lenStreets = 30
lenPageHrefs = 15
I write the items/elements into rows of a CSV. The BusinessName and pageHref columns are misaligned: the BusinessName column is shifted up by one row, and there are only 15 pageHrefs, which means some are missing. Instead of staying on the same rows as the other items, they are listed in the first 16 rows of that column (a sketch of this row-pairing step is shown after the question). Some of the search paths include:
idxBusNames = tree.xpath('//h3[@class="n"]/text()'),
businessNames = tree.xpath('//h3/a[@class="business-name"]/text()'),
streets = tree.xpath('//p[@class="adr"]/span[1]/text()') and
pageHrefs = tree.xpath('//a[@class="track-visit-website"]/@href')
I found the xpaths using Firefox Firebug. More details are in the attachment.
Thanks for your help, Bob
Posted on 2015-10-13 22:45:43
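For illustration only (the question does not include the CSV-writing code): if the four page-wide lists are simply paired index by index, the rows cannot stay aligned, because the two extra business names and the fifteen missing website links shift everything after them. A minimal sketch of that pairing, assuming Python 2.7 as in the question and a hypothetical output file apartments.csv:

import csv
from itertools import izip_longest  # zip_longest in Python 3

# Pairing the four independently-queried lists: with 32 business names but
# only 30 index numbers and 15 hrefs, item i of one list no longer belongs
# to the same apartment as item i of the others, so the padded rows come
# out misaligned in the way described above.
with open('apartments.csv', 'wb') as f:
    writer = csv.writer(f)
    for row in izip_longest(idxBusNames, businessNames, streets, pageHrefs,
                            fillvalue=''):
        writer.writerow([v.encode('utf-8') for v in row])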
Based on my comment: grab one //div[@class='info'] node per listing and evaluate the field xpaths relative to that node, so each listing's fields stay together and the ad blocks outside the results column are never matched.
import requests
from lxml import etree
url="""http://www.yellowpages.com/search?search_terms=apartment"""
url+="""&insert geo params here"""
r = requests.get(url)
h = etree.HTMLParser()
tree = etree.fromstring(r.text, h)
xp_info_nodes = """//div[@class='info']"""
xp_id = """h3[@class='n']/text()"""
xp_name = """h3[@class='n']/a[@class='business-name']/text()"""
xp_adr = """div[@class='info-section info-primary']/p[@class='adr']/span[1]/text()"""
xp_link = """.//a[@class='track-visit-website']/@href"""
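# Sanity check (not part of the original answer): each center-column listing
# corresponds to one div.info node, so this should print roughly 30 for the
# page described in the question.
print len(tree.xpath(xp_info_nodes))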
info_nodes = tree.xpath(xp_info_nodes)
all_data = []
for node in info_nodes:
    # mandatory fields
    data = [
        node.xpath(xp_id),
        node.xpath(xp_name),
        node.xpath(xp_adr),
    ]
    # insert some function to clean up data[0] here; it's returning weird strings
    ldata = len(data)
    data = [d for d in data if d]
    # skip nodes that are missing a mandatory field (ads, partial listings)
    if len(data) != ldata:
        continue
    # optional field: the website link is missing for some listings, so keep
    # it as its own (possibly empty) xpath result list
    optional_data = [node.xpath(xp_link)]
    all_data.append(data + optional_data)

for row in all_data:
    """print a line of your csv"""
https://stackoverflow.com/questions/33109493