I've written a script that performs a reverse search on a website using names and Lids from a predefined CSV file. When the search is done, it puts the results containing the address and phone number next to those names and Lids, creating a new CSV file. It is doing its job flawlessly at the moment. I tried to make the whole process airtight. Any suggestions to improve this script will be highly appreciated. Here is the code I've tried:
import csv
import requests
from lxml import html
with open("predefined.csv", "r") as f, open('newly_created.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = reader.fieldnames + ['Address', 'Phone']
    writer = csv.writer = csv.DictWriter(g, fieldnames = newfieldnames)
    writer.writeheader()
    for entry in reader:
        Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
        response = requests.get(Page)
        tree = html.fromstring(response.text)
        titles = tree.xpath('//article[contains(@class,"business-card")]')
        for title in tree.xpath('//article[contains(@class,"business-card")]'):
            Address= title.xpath('.//p[@class="address"]/span/text()')[0]
            Contact = title.xpath('.//p[@class="phone"]/text()')[0]
            print(Address,Contact)
            new_row = entry
            new_row['Address'] = Address
            new_row['Phone'] = Contact
            writer.writerow(new_row)
Here is a link to the results.
Posted on 2017-07-09 18:54:10
There are a number of things we can do to improve the code:

- PEP8 and naming consistency - for example: Page should be page - or, better, url; Address would be address; Contact would be contact; f could be input_file; g could be output_file
- you don't actually need the titles variable
- writer = csv.writer = csv.DictWriter(...) - just assign writer directly to the DictWriter instance
- reusing a requests.Session() instance should have a positive impact on performance
- use the .findtext() method instead of calling xpath() and then taking the first item
- extract a crawl function to keep the web-scraping logic separated

Here is the modified code, incorporating the above and other improvements:
import csv
import requests
from lxml import html
URL_TEMPLATE = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}"


def crawl(entries):
    with requests.Session() as session:
        for entry in entries:
            url = URL_TEMPLATE.format(entry["Name"].replace(" ", "-"), entry["Lid"])
            response = session.get(url)
            tree = html.fromstring(response.text)

            for title in tree.xpath('//article[contains(@class,"business-card")]'):
                address = title.findtext('.//p[@class="address"]/span')
                contact = title.findtext('.//p[@class="phone"]')
                print(address, contact)

                entry['Address'] = address
                entry['Phone'] = contact
                yield entry


if __name__ == '__main__':
    with open("predefined.csv", "r") as input_file, open('newly_created.csv', 'w', newline='') as output_file:
        reader = csv.DictReader(input_file)

        field_names = reader.fieldnames + ['Address', 'Phone']
        writer = csv.DictWriter(output_file, fieldnames=field_names)
        writer.writeheader()

        for entry in crawl(reader):
            writer.writerow(entry)
(Not tested)
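As a side note (a minimal sketch of my own, not part of the answer above): `xpath(...)[0]` raises IndexError when nothing matches, while `.findtext()` simply returns None, so it can be worth isolating the extraction into a small helper that tolerates pages with missing elements. The `extract_details` helper name below is hypothetical:

```python
# Hypothetical helper (not from the answer above): pull address/phone
# out of a parsed page, tolerating pages with no matching elements.
from lxml import html


def extract_details(tree):
    """Return (address, phone) for the first business card, or (None, None)."""
    cards = tree.xpath('//article[contains(@class,"business-card")]')
    if not cards:
        return None, None
    card = cards[0]
    # .findtext() returns None instead of raising when the element is absent
    address = card.findtext('.//p[@class="address"]/span')
    phone = card.findtext('.//p[@class="phone"]')
    return address, phone


# Exercise it against an inline snippet instead of a live request:
snippet = '''<div>
<article class="business-card">
  <p class="address"><span>123 Main St</span></p>
  <p class="phone">555-0100</p>
</article>
</div>'''
print(extract_details(html.fromstring(snippet)))
print(extract_details(html.fromstring('<div></div>')))
```

This keeps the network call and the parsing concerns separate, so the main loop can decide what to do with rows that came back empty (skip them, or write blank fields).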
https://codereview.stackexchange.com/questions/168750