Using a Python Crawler to Scrape 游民福利 from Gamersky

不可言诉的深渊

Published 2019-07-26 17:31:26


This article is included in the column Python机器学习算法说书人.


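Choosing a website

The site I picked is one that many gamers and game developers have visited: 游民星空 (Gamersky, https://www.gamersky.com/). Open it in a browser, click 娱乐 (Entertainment), and on the new page click 游民福利. That takes you to another page whose address is https://www.gamersky.com/ent/xz/, shown in the figure.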

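Fetching the data

What we want to crawl is every unordered-list item under the 游民福利 heading. Before crawling, check how many items there are in total: scroll all the way to the bottom, as shown in the figure.

The list turns out to be paginated, and each page shows only a fixed number of items. Never mind that for now; jump to page 2, as shown in the figure.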

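Going to page 2 does not change the URL, which means the list is loaded dynamically rather than written into the HTML. Dynamic pages of this kind usually load their data in one of two ways: an Ajax (XMLHttpRequest) call, or a script loaded by JavaScript. To find out which one this site uses, open the browser's developer tools and try it, as shown in the figure.

Note that Ajax requests show up under the XHR filter (the one I have selected), while script loads show up under the JS filter next to it. Switch from page 2 back to page 1 and watch whether a new request appears under XHR, as shown in the figure.

No new request appears there, so switch to JS, as shown in the figure.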

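There are three JS requests in total, so which one holds the data? Go through them one at a time. Click the first one, then click Response, and search the response body with Ctrl+F. But what should we search for? Each list item's title is a hyperlink, and opening one leads to a page with many images. So the thing to find is the hyperlink behind each title. Search the first JS request's response for one of those URLs, as shown in the figure.

Sure enough, the content is in the first JS request. That raises a new question: how is the URL of this request put together? Click Headers, as shown in the figure.

The request URL is far too long to read comfortably, so skip to the next fields: the request is a GET and the status code is 200 (OK). Scroll down to the position shown in the figure.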

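Verifying the request

The request carries three parameters. At a glance, only jsondata looks necessary and the others probably are not. Is that really the case? A small test program will tell us.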

  from requests import get
  print(get("https://db2.gamersky.com/LabelJsonpAjax.aspx", params={
      'jsondata': '{"type":"updatenodelabel","isCache":true,"cacheTime":60,"nodeId":"20119","isNodeId":"true","page":1}'
  }).content.decode())

The output is shown in the figure.
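The jsondata value in the request above is a hand-written JSON string, which is easy to mistype once the page number has to change. As a minimal sketch, the same string could be produced with json.dumps. The helper name build_params and its node_id argument are made up for illustration; the field names and values are the ones used in the request above.

```python
import json


def build_params(page, node_id="20119"):
    """Return the query parameters for one listing page (hypothetical helper)."""
    payload = {
        "type": "updatenodelabel",
        "isCache": True,       # serialized as JSON true
        "cacheTime": 60,
        "nodeId": node_id,
        "isNodeId": "true",    # the article sends this one as a string
        "page": page,
    }
    # Compact separators reproduce the string used in the article exactly.
    return {"jsondata": json.dumps(payload, separators=(",", ":"))}


print(build_params(1)["jsondata"])
```

The returned dict can be passed straight to the params argument of requests.get.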

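The output shows that the data still comes back without any error. This only fetches page 1, though; how do we get every page? The jsondata parameter contains a page field, whose value should be the page number. That raises another question: how many pages are there? The answer is already in the output: it is the value of the totalPages field. How do we extract it? The fields are separated by commas, and each field name is separated from its value by a colon, so plain string methods are enough. Split the response on commas and take the second piece (index 1), split that piece on colons and take the second piece (index 1), then convert it to int.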

  from requests import get
  response = get("https://db2.gamersky.com/LabelJsonpAjax.aspx", params={
      'jsondata': '{"type":"updatenodelabel","isCache":true,"cacheTime":60,"nodeId":"20119","isNodeId":"true","page":1}'
  }).content.decode()
  total_page = int(response.split(',')[1].split(':')[1])
  print(total_page)

The output is shown in the figure.
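The split-based extraction works as long as totalPages stays the second comma-separated field. A slightly more tolerant sketch looks the field up by name with a regular expression. The sample string below is invented to resemble the response shape described above, not copied from the real site, and the function name is hypothetical.

```python
import re

# Match the field by name, wherever it appears in the response text.
TOTAL_PAGES_RE = re.compile(r'"totalPages"\s*:\s*(\d+)')


def extract_total_pages(text):
    """Return the totalPages value found in the response text."""
    match = TOTAL_PAGES_RE.search(text)
    if match is None:
        raise ValueError("totalPages not found in response")
    return int(match.group(1))


# Made-up sample; the real response body is much longer.
sample = '({"status":"ok","totalPages":65,"body":"<ul>...</ul>"});'
print(extract_total_pages(sample))
```

Raising an error when the field is missing makes a changed response format fail loudly instead of producing a wrong page count.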

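Filtering the data

The output confirms that we got the total page count. Before crawling every page, first filter page 1 with a regular expression. What we need from it is the hyperlinks, so the pattern is easy to write: r'<a href=\\"(.*?)\\".*?>'. Next comes fetching and filtering every page. The filtering is the same for each page, so the only real question is how to fetch them, and that is simple too: page 1 has already been fetched, so a for loop runs from page 2 through the last page, whose number equals the total page count. Here is the code.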

  from requests import get
  from re import compile
  # Fetch page 1 first: it also tells us the total number of pages.
  response = get("https://db2.gamersky.com/LabelJsonpAjax.aspx", params={
      'jsondata': '{"type":"updatenodelabel","isCache":true,"cacheTime":60,"nodeId":"20119","isNodeId":"true","page":1}'
  }).content.decode()
  # Quotes inside the response are escaped, so the pattern matches a literal \" around each URL.
  sub_url_pattern = compile(r'<a href=\\"(.*?)\\".*?>')
  sub_urls = sub_url_pattern.findall(response)
  print(len(sub_urls))
  for sub_url in sub_urls:
      print(1, sub_url)
  total_page = int(response.split(',')[1].split(':')[1])
  # Pages 2..total_page use the same request with a different "page" value.
  for page in range(2, total_page+1):
      response = get("https://db2.gamersky.com/LabelJsonpAjax.aspx", params={
          'jsondata': '{"type":"updatenodelabel","isCache":true,"cacheTime":60,"nodeId":"20119","isNodeId":"true",'
                      '"page":%d}' % page}).content.decode()
      # Reuse the pattern compiled above.
      sub_urls = sub_url_pattern.findall(response)
      for sub_url in sub_urls:
          print(page, sub_url)
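The doubled backslashes in the link pattern can look odd at first. The reason is that the HTML arrives inside a JSON string, so every quote in it is preceded by a backslash, and the pattern has to match that backslash literally. The sketch below runs the same pattern on an invented fragment; only the article URL in it comes from this post, the surrounding markup is an assumption.

```python
import re

# Same pattern as in the article: \\ in the regex matches one literal backslash.
SUB_URL_RE = re.compile(r'<a href=\\"(.*?)\\".*?>')

# Made-up fragment: in the raw text, each quote appears as backslash + quote.
sample = r'{"body":"<li><a href=\"https://www.gamersky.com/ent/201901/1145126.shtml\" target=\"_blank\">title</a></li>"}'

print(SUB_URL_RE.findall(sample))
```

If the response were decoded as JSON first, the backslashes would disappear and the pattern would need plain quotes instead.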

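Back in the browser, click the title of any list item. It jumps to another page, as shown in the figure.

Scroll down to the position shown in the figure.

This page is paginated too. Jump to page 2 and check whether the URL changes: it becomes https://www.gamersky.com/ent/201901/1145126_2.shtml, which means the data here is written into the HTML and not loaded dynamically. That raises another question: can page 1 be reached at https://www.gamersky.com/ent/201901/1145126_1.shtml? Trying it shows that it cannot: the server returns 404, page not found. So page 1 has to be handled separately, and for the later pages we stop at the first 404. Here is the code.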

  for sub_url in sub_urls:
      # Page 1 is the plain URL with no page suffix.
      print(sub_url)
      sub_web_response = get(sub_url).content.decode()
      page = 2
      while True:
          # Later pages are <name>_2.shtml, <name>_3.shtml, ...
          next_url = sub_url.replace(".shtml", f"_{page}.shtml")
          status_code = get(next_url).status_code
          if status_code == 404:
              break  # the first 404 marks the end of this article
          print(next_url)
          page += 1
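The URL rule described above (page 1 has no suffix, page N ends in _N.shtml) can also be written as a small pure function, which keeps the crawl loop free of string juggling. This is only a sketch; the name page_url is hypothetical, and it assumes every article URL ends in .shtml as in the example above.

```python
def page_url(first_page_url, page):
    """Build the URL of a given page of an article from its first-page URL."""
    if page == 1:
        # Page 1 has no suffix; the _1.shtml form returns 404.
        return first_page_url
    suffix = ".shtml"
    if not first_page_url.endswith(suffix):
        raise ValueError("expected a URL ending in .shtml")
    return f"{first_page_url[:-len(suffix)]}_{page}{suffix}"


base = "https://www.gamersky.com/ent/201901/1145126.shtml"
for number in (1, 2, 3):
    print(page_url(base, number))
```

A crawl loop can then call page_url(sub_url, page) with page starting at 1 and stop at the first 404.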

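Next comes getting the image URLs. Back in the browser, inspect any image, as shown in the figure.

Study the selected piece of HTML and write a regular expression for the image URL: r'<img class="picact" alt="游民星空" src="(.*?)".*?>'. Putting everything together gives the complete source code of the crawler.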

  from os import makedirs
  from re import compile
  from requests import get
  API_URL = "https://db2.gamersky.com/LabelJsonpAjax.aspx"
  JSONDATA = ('{"type":"updatenodelabel","isCache":true,"cacheTime":60,'
              '"nodeId":"20119","isNodeId":"true","page":%d}')
  sub_url_pattern = compile(r'<a href=\\"(.*?)\\".*?>')
  image_pattern = compile(r'<img class="picact" alt="游民星空" src="(.*?)".*?>')
  def get_list_page(page):
      # Fetch one page of the 游民福利 listing as text.
      return get(API_URL, params={'jsondata': JSONDATA % page}).content.decode()
  # Create the output directory if it is missing.
  makedirs("image", exist_ok=True)
  count = 0
  response = get_list_page(1)
  total_page = int(response.split(',')[1].split(':')[1])
  for list_page in range(1, total_page + 1):
      if list_page > 1:
          response = get_list_page(list_page)
      for sub_url in sub_url_pattern.findall(response):
          page = 1
          while True:
              # Page 1 has no suffix; later pages end in _2.shtml, _3.shtml, ...
              url = sub_url if page == 1 else sub_url.replace(".shtml", f"_{page}.shtml")
              sub_web_response = get(url)
              if sub_web_response.status_code == 404:
                  break  # past the last page of this article
              for image_url in image_pattern.findall(sub_web_response.content.decode()):
                  print(image_url)
                  # Download the image and save it under a running number.
                  with open(f"image/{count}.jpg", "wb") as image_file:
                      image_file.write(get(image_url).content)
                  count += 1
              page += 1
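The script saves every download as .jpg, whatever the real file type is. If the image URLs end in a file name with an extension (an assumption, not something checked in this post), the extension could be taken from the URL instead. The helper name image_filename and the example.com URLs below are made up for illustration.

```python
from os.path import splitext
from urllib.parse import urlparse


def image_filename(index, image_url, default_ext=".jpg"):
    """Name a downloaded image by its running number and the URL's extension."""
    # Only the path part of the URL is inspected, so query strings are ignored.
    ext = splitext(urlparse(image_url).path)[1].lower()
    return f"{index}{ext or default_ext}"


print(image_filename(0, "https://example.com/a/b/pic_01.PNG?size=big"))
print(image_filename(1, "https://example.com/a/b/pic"))
```

In the crawler, the result would replace the hard-coded f"{count}.jpg" part of the output path.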

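This article is part of the Tencent Cloud self-media sync exposure program and is shared from a WeChat official account.

Originally published: 2019-01-18. In case of infringement, contact cloudcommunity@tencent.com to have it removed.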
