I built a web scraper that collects job listings from Facebook and other websites, but I want to break the code up into functions that I can reuse for other sites. The structure works, but I think it could be organized better with functions, and I'm stuck on how to structure them. It only pulls two pages for testing.
from time import time
from requests import get
from time import sleep
from random import randint
from IPython.core.display import clear_output
from warnings import warn
from bs4 import BeautifulSoup
import csv
# Range of only 2 pages
pages = [str(i) for i in range(1, 3)]
cities = ["Menlo%20Park%2C%20CA",
"Fremont%2C%20CA",
"Los%20Angeles%2C%20CA",
"Mountain%20View%2C%20CA",
"Northridge%2CCA",
"Redmond%2C%20WA",
"San%20Francisco%2C%20CA",
"Santa%20Clara%2C%20CA",
"Seattle%2C%20WA",
"Woodland%20Hills%2C%20CA"]
# Preparing the monitoring of the loop
start_time = time()
requests = 0
with open('facebook_job_list.csv', 'w', newline='') as f:
    header = csv.writer(f)
    header.writerow(["Website", "Title", "Location", "Job URL"])

for page in pages:
    for c in cities:
        # Request the html page
        response = get("https://www.facebook.com/careers/jobs/?page=" + page +
                       "&results_per_page=100&locations[0]=" + c)

        # Pause the loop between 8 and 15 seconds
        sleep(randint(8, 15))

        # Monitor the frequency of requests
        requests += 1
        elapsed_time = time() - start_time
        print("Request:{}; Frequency: {} request/s".format(requests, requests / elapsed_time))
        clear_output(wait=True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn("Request: {}; Status code: {}".format(requests, response.status_code))

        # Break the loop if the number of requests is greater than expected
        if requests > 2:
            warn("Number of requests was greater than expected.")
            break

        # Parse the content of the request with BeautifulSoup
        page_soup = BeautifulSoup(response.text, 'html.parser')
        job_containers = page_soup.find_all("a", "_69jm")

        # Select all 100 job containers from a single page
        for container in job_containers:
            site = page_soup.find("title").text
            title = container.find("div", "_69jo").text
            location = container.find("div", "_1n-z _6hy- _21-h").text
            link = container.get("href")
            job_link = "https://www.facebook.com" + link

            with open('facebook_job_list.csv', 'a', newline='') as f:
                rows = csv.writer(f)
                rows.writerow([site, title, location, job_link])
A few quick suggestions:

If you use the requests module, the params keyword argument can build the query string for you:
import requests
cities = ["Menlo Park, CA"]
pages = range(1, 3)
url = "https://www.facebook.com/careers/jobs/"
for city in cities:
    for page in pages:
        params = {"page": page, "results_per_page": 100, "locations[0]": city}
        response = requests.get(url, params=params)
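As a quick sanity check (an illustrative sketch, not part of the original answer), you can preview the URL that requests builds from params and see that spaces and commas in the city names are encoded automatically, so the hand-written %20/%2C strings from the question are no longer needed:

import requests

# Hypothetical check: preview the URL that requests would build from params
url = "https://www.facebook.com/careers/jobs/"
params = {"page": 1, "results_per_page": 100, "locations[0]": "Menlo Park, CA"}
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)
# e.g. https://www.facebook.com/careers/jobs/?page=1&results_per_page=100&locations%5B0%5D=Menlo+Park%2C+CA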
Organize the code with functions. This lets you give them readable names (and even add docstrings):
def get_job_infos(response):
    """Parse the content of the request to get all job postings."""
    page_soup = BeautifulSoup(response.text, 'lxml')
    job_containers = page_soup.find_all("a", "_69jm")
    # Select all 100 job containers from a single page
    for container in job_containers:
        site = page_soup.find("title").text
        title = container.find("div", "_69jo").text
        location = container.find("div", "_1n-z _6hy- _21-h").text
        job_link = "https://www.facebook.com" + container.get("href")
        yield site, title, location, job_link
This is a generator that you can iterate over. Using the lxml parser is usually faster.
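For example (an assumed usage sketch, not from the original answer; url and params as defined above), you can consume the generator directly in a loop:

# Hypothetical usage of the generator defined above
response = requests.get(url, params=params)
for site, title, location, job_link in get_job_infos(response):
    print(title, "-", location)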
Note that csv can write several rows at once with writer.writerows, which accepts any iterable of rows:
with open('facebook_job_list.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(get_job_infos(response))
This way the file only needs to be opened once per page instead of a hundred times. Even better, make the whole program a generator, so you open the file only once while writing all the rows:
def get_all_jobs(url, cities, pages):
    for city in cities:
        for page in pages:
            params = {"page": page, "results_per_page": 100, "locations[0]": city}
            response = requests.get(url, params=params)
            # check status code
            yield from get_job_infos(response)
            # rate throttling, etc here
            ...


if __name__ == "__main__":
    cities = ["Menlo Park, CA", ...]
    pages = range(1, 3)
    url = "https://www.facebook.com/careers/jobs/"
    with open('facebook_job_list.csv', "w", newline='') as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(url, cities, pages))
This way the get_all_jobs generator yields jobs lazily as it is iterated over, fetching the next page only when it is needed.
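One way to fill in the two placeholder comments (a sketch that assumes you still want the non-200 warning and the random delay from the original loop; nothing here is prescribed by the answer) is:

from random import randint
from time import sleep
from warnings import warn

import requests


def get_all_jobs(url, cities, pages):
    for city in cities:
        for page in pages:
            params = {"page": page, "results_per_page": 100, "locations[0]": city}
            response = requests.get(url, params=params)
            # Warn on non-200 status codes instead of silently parsing an error page
            if response.status_code != 200:
                warn("Status code: {}".format(response.status_code))
            yield from get_job_infos(response)
            # Throttle requests, as in the original loop
            sleep(randint(8, 15))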
https://codereview.stackexchange.com/questions/216135