
Today we will write a Python script that uses web-scraping modules (Selenium, Beautiful Soup, and urllib) to scrape data from Craigslist, a classified-ads site. The script drives a browser to open the Craigslist search results and extracts the title, link, and other details of each listing.

First, let's take a look at the site we are going to scrape:

From the input parameters we assemble the URL ahead of time; the parameters are the ZIP code, the maximum price, the search radius, and the Craigslist subdomain for the location:
https://sfbay.craigslist.org/search/sss?search_distance=5&postal=94201&max_price=500
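Incidentally, if you would rather not build the query string by hand, urllib's urlencode can assemble it from a dict. This is just a minimal sketch producing the same URL as above, with example values:
from urllib.parse import urlencode

location = "sfbay"  # example values matching the URL above
params = {"search_distance": 5, "postal": 94201, "max_price": 500}
url = f"https://{location}.craigslist.org/search/sss?{urlencode(params)}"
print(url)  # https://sfbay.craigslist.org/search/sss?search_distance=5&postal=94201&max_price=500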
Let's walk through the code for this step by step; the complete script is shown at the end:
First, import the packages we will use:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import urllib.request

Next, we define a class that implements the scraping. Its attributes are:
location: the Craigslist subdomain for the region being searched
postal: ZIP code
max_price: maximum price
radius: search radius
url: the search URL assembled from the parameters above
driver: the Chrome browser driver
delay: how many seconds to wait for the page to load
class CraiglistScraper(object):
  def __init__(self, location, postal, max_price, radius):
    self.location = location
    self.postal = postal
    self.max_price = max_price
    self.radius = radius
    self.url = f"https://{location}.craigslist.org/search/sss?search_distance={radius}&postal={postal}&max_price={max_price}"
    self.driver = webdriver.Chrome('chromedriver.exe')  # path to the ChromeDriver executable (Selenium 3 style)
    self.delay = 3

Next, we define a load_craigslist_url method on the class. It opens the URL with Selenium and waits up to self.delay (3) seconds for the search form, the element with id searchform, to appear:

The method looks like this:
def load_craigslist_url(self):
    self.driver.get(self.url)
    try:
      wait = WebDriverWait(self.driver,self.delay)
      wait.until(EC.presence_of_element_located((By.ID, "searchform")))
      print("页面已经初始化完毕")
    except TimeoutException:
      print("加载页面超时")根据网站源码可知,搜索结果是由li标签组成并且样式为class="result-row":

Based on this structure, we write an extract_post_information method that pulls the title, price, and date out of each result:
def extract_post_information(self):
    all_posts = self.driver.find_elements_by_class_name("result-row")  # Selenium 3 API; Selenium 4 uses find_elements(By.CLASS_NAME, "result-row")
    dates = []
    titles = []
    prices = []
    for post in all_posts:
      title = post.text.split("$")
      if title[0] == '':
        title = title[1]
      else:
        title = title[0]
      title = title.split("\n")
      price = title[0]
      title = title[-1]
      title = title.split(" ")
      month = title[0]
      day = title[1]
      title = ' '.join(title[2:])
      date = month + " " + day
      titles.append(title)
      prices.append(price)
      dates.append(date)
    return titles,prices,dates
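The string splitting above is easier to follow on a concrete value. The sketch below runs the same steps on a made-up example of what post.text might return for one row (the real text can vary, which is why this parsing is somewhat fragile):
# Made-up example of post.text for one result row
text = "$500\nDec 21 Mountain bike in good condition"

parts = text.split("$")              # ['', '500\nDec 21 Mountain bike in good condition']
parts = parts[1] if parts[0] == '' else parts[0]
lines = parts.split("\n")            # ['500', 'Dec 21 Mountain bike in good condition']
price = lines[0]                     # '500'
words = lines[-1].split(" ")         # ['Dec', '21', 'Mountain', 'bike', ...]
date = words[0] + " " + words[1]     # 'Dec 21'
title = ' '.join(words[2:])          # 'Mountain bike in good condition'
print(price, date, title)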

Next we extract the link to each listing. From the page source, each link is an a tag with class="result-title hdrlnk", so we write an extract_post_urls method that fetches the page with urllib and parses it with BeautifulSoup:
def extract_post_urls(self):
    url_list = []
    html_page = urllib.request.urlopen(self.url)
    soup = BeautifulSoup(html_page, "lxml")
    for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
      print(link["href"])
      url_list.append(link["href"])
    return url_list
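Note that extract_post_urls downloads the page a second time with urllib even though Selenium has already loaded it. If you would rather reuse the page the browser is showing, a minimal alternative (assuming the same class names) is to parse driver.page_source instead:
from bs4 import BeautifulSoup

def extract_post_urls_from_page_source(driver):
    # Reuse the page Selenium already rendered instead of fetching it again
    soup = BeautifulSoup(driver.page_source, "lxml")
    return [link["href"] for link in soup.find_all("a", {"class": "result-title hdrlnk"})]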
Then we add a method that closes the browser when we are done:
  def quit(self):
    self.driver.close()  # closes the current browser window; driver.quit() would end the session entirely

Finally, we call the class and run the scrape:
# Run a test
location = "sfbay"
postal = "94201"
max_price = "500"
radius = "5"
scraper = CraiglistScraper(location, postal, max_price, radius)
scraper.load_craigslist_url()
titles, prices, dates = scraper.extract_post_information()
print(titles)
scraper.extract_post_urls()
scraper.quit()

Run the script to see the results. The complete code is shown below:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import urllib.request
class CraiglistScraper(object):
  def __init__(self, location, postal, max_price, radius):
    self.location = location
    self.postal = postal
    self.max_price = max_price
    self.radius = radius
    self.url = f"https://{location}.craigslist.org/search/sss?search_distance={radius}&postal={postal}&max_price={max_price}"
    self.driver = webdriver.Chrome('chromedriver.exe')
    self.delay = 3
  def load_craigslist_url(self):
    self.driver.get(self.url)
    try:
      wait = WebDriverWait(self.driver,self.delay)
      wait.until(EC.presence_of_element_located((By.ID, "searchform")))
      print("页面已经初始化完毕")
    except TimeoutException:
      print("加载页面超时")
  def extract_post_information(self):
    all_posts = self.driver.find_elements_by_class_name("result-row")
    dates = []
    titles = []
    prices = []
    for post in all_posts:
      title = post.text.split("$")
      if title[0] == '':
        title = title[1]
      else:
        title = title[0]
      title = title.split("\n")
      price = title[0]
      title = title[-1]
      title = title.split(" ")
      month = title[0]
      day = title[1]
      title = ' '.join(title[2:])
      date = month + " " + day
      titles.append(title)
      prices.append(price)
      dates.append(date)
    return titles,prices,dates
  def extract_post_urls(self):
    url_list = []
    html_page = urllib.request.urlopen(self.url)
    soup = BeautifulSoup(html_page, "lxml")
    for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
      print(link["href"])
      url_list.append(link["href"])
    return url_list
  def quit(self):
    self.driver.close()
# Run a test
location = "sfbay"
postal = "94201"
max_price = "500"
radius = "5"
scraper = CraiglistScraper(location, postal, max_price, radius)
scraper.load_craigslist_url()
titles, prices, dates = scraper.extract_post_information()
print(titles)
scraper.extract_post_urls()
scraper.quit()
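As a small optional tweak, you can zip the three returned lists together to print one line per listing instead of only the titles; a quick sketch, using the variables from the test run above:
for title, price, date in zip(titles, prices, dates):
    print(f"{date} | ${price} | {title}")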
    
If you would like to try this yourself, go ahead and run it. If you are not yet familiar with Selenium or BeautifulSoup, you can refer to the earlier posts:
Web scraping: grabbing Tianya forum posts for practice
Web scraping: driving the browser with Selenium to collect data
That's it for today's lesson. See you in the next one!
