Scraping job listings from Facebook

Code Review user
Asked 2019-03-25 02:48:13
1 answer · Viewed 385 times · 0 followers · Score 3

I built a web scraper that pulls job listings from Facebook and other sites, but I want to break the code into functions that can be reused for other websites. The structure works as it is, but I think it could be more efficient with functions. I'm stuck on how to structure the functions. It only pulls two pages for testing purposes.

Code language: Python
from time import time
from requests import get
from time import sleep
from random import randint
from IPython.core.display import clear_output
from warnings import warn
from bs4 import BeautifulSoup
import csv

# Range of only 2 pages
pages = [str(i) for i in range(1, 3)]
cities = ["Menlo%20Park%2C%20CA",
          "Fremont%2C%20CA",
          "Los%20Angeles%2C%20CA",
          "Mountain%20View%2C%20CA",
          "Northridge%2CCA",
          "Redmond%2C%20WA",
          "San%20Francisco%2C%20CA",
          "Santa%20Clara%2C%20CA",
          "Seattle%2C%20WA",
          "Woodland%20Hills%2C%20CA"]

# Preparing the monitoring of the loop
start_time = time()
requests = 0

with open('facebook_job_list.csv', 'w', newline='') as f:
    header = csv.writer(f)
    header.writerow(["Website", "Title", "Location", "Job URL"])

for page in pages:
    for c in cities:
        # Requests the html page
        response = get("https://www.facebook.com/careers/jobs/?page=" + page +
                       "&results_per_page=100&locations[0]=" + c)

        # Pauses the loop between 8 and 15 seconds
        sleep(randint(8, 15))

        # Monitor the frequency of requests
        requests += 1
        elapsed_time = time() - start_time
        print("Request:{}; Frequency: {} request/s".format(requests, requests/elapsed_time))
        clear_output(wait=True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn("Request: {}; Status code: {}".format(requests, response.status_code))

        # Break the loop if number of requests is greater than expected
        if requests > 2:
            warn("Number of requests was greater than expected.")
            break

        # Parse the content of the request with BeautifulSoup
        page_soup = BeautifulSoup(response.text, 'html.parser')
        job_containers = page_soup.find_all("a", "_69jm")

        # Select all 100 jobs containers from a single page
        for container in job_containers:
            site = page_soup.find("title").text
            title = container.find("div", "_69jo").text
            location = container.find("div", "_1n-z _6hy- _21-h").text
            link = container.get("href")
            job_link = "https://www.facebook.com" + link

            with open('facebook_job_list.csv', 'a', newline='') as f:
                rows = csv.writer(f)
                rows.writerow([site, title, location, job_link])

1 Answer

Code Review user

Accepted answer

Answered 2019-03-25 08:35:31

A few quick suggestions:

If you pass the params keyword to requests, the module can build the query string for you:

Code language: Python
import requests

cities = ["Menlo Park, CA"]
pages = range(1, 3)
url = "https://www.facebook.com/careers/jobs/"

for city in cities:
    for page in pages:
        params = {"page": page, "results_per_page": 100, "locations[0]": city}
        response = requests.get(url, params=params)
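
As an illustrative sketch of what params buys you, requests.Request(...).prepare() (part of the requests API) builds the request without sending it, so you can inspect the URL it encodes from the params dict:

Code language: Python

import requests

# Sketch: build the request without sending anything over the network,
# then inspect the query string that requests encodes from params.
req = requests.Request(
    "GET",
    "https://www.facebook.com/careers/jobs/",
    params={"page": 1, "results_per_page": 100, "locations[0]": "Menlo Park, CA"},
).prepare()
print(req.url)
# e.g. .../careers/jobs/?page=1&results_per_page=100&locations%5B0%5D=Menlo+Park%2C+CA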

Organize your code with functions. This lets you give each one a readable name (and even add a docstring):

Code language: Python
from bs4 import BeautifulSoup

def get_job_infos(response):
    """Parse the content of the request to get all job postings"""
    page_soup = BeautifulSoup(response.text, 'lxml')
    job_containers = page_soup.find_all("a", "_69jm")

    # Select all 100 jobs containers from a single page
    for container in job_containers:
        site = page_soup.find("title").text
        title = container.find("div", "_69jo").text
        location = container.find("div", "_1n-z _6hy- _21-h").text
        job_link = "https://www.facebook.com" + container.get("href")
        yield site, title, location, job_link

This is a generator that you can iterate over. Using the lxml parser is usually faster than the built-in html.parser (it requires the lxml package to be installed).
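
As a minimal usage sketch (assuming response holds a page fetched as above), consuming the generator is just a for loop:

Code language: Python

# Each iteration lazily yields one (site, title, location, job_link)
# tuple parsed from the page.
for site, title, location, job_link in get_job_infos(response):
    print(title, "|", location)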

Note that csv can write multiple rows at once with writer.writerows, which accepts any iterable of rows:

Code language: Python
with open('facebook_job_list.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(get_job_infos(response))

This way the file only needs to be opened once per page, instead of a hundred times. Even better, make the whole program a generator, so you can write all the rows while opening the file just once:

Code language: Python
import csv
import requests

def get_all_jobs(url, cities, pages):
    for city in cities:
        for page in pages:
            params = {"page": page, "results_per_page": 100, "locations[0]": city}
            response = requests.get(url, params=params)
            # check status code

            yield from get_job_infos(response)

            # rate throttling, etc here
            ...

if __name__ == "__main__":
    cities = ["Menlo Park, CA", ...]
    pages = range(1, 3)
    url = "https://www.facebook.com/careers/jobs/"

    with open('facebook_job_list.csv', "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(url, cities, pages))

This way the get_all_jobs generator yields jobs as you iterate over it, fetching the next page only when it is needed.
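
To see that laziness in action, here is a small illustrative sketch using itertools.islice: only the requests needed to produce the first few rows are actually sent.

Code language: Python

from itertools import islice

# Take just the first five jobs: get_all_jobs only fetches the pages
# required to yield those five rows, and stops there.
first_five = list(islice(get_all_jobs(url, cities, pages), 5))
print(first_five)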

Score: 3
Original content provided by Code Review (Stack Exchange).
Source: https://codereview.stackexchange.com/questions/216135