Backend API Crawler for WeCode, a GitHub Trending App

阳仔

Published on 2019-07-31 17:24:20

Included in the column: 终身开发者

GitHub Trending is GitHub's daily ranking of popular projects and libraries.

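The home screen of the WeCode app uses data from the GitHub Trending ranking, which I fetch with a Python crawler. The WeCode source code is also open source on GitHub at https://github.com/wecodexyz/WeCode; if you are interested, feel free to give it a star.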

Let's look at how this crawler is implemented.

Development environment

Python 2.7
requests
BeautifulSoup

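Python ships with the urllib2 networking library, but requests wraps HTTP in a friendlier API that makes it easy to set cookies, headers, proxies, and similar options, so it is more comfortable to use than the built-in urllib2. I strongly recommend requests as the HTTP library.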
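As a minimal sketch of the convenience described above, the following configures headers, cookies, and a proxy on a `requests.Session` and prepares a request without sending it. The UA string, cookie name, and proxy address are placeholders for illustration, not values used by WeCode.

```python
import requests

# Placeholder values for illustration only; none of them come from WeCode.
session = requests.Session()
session.headers.update({'User-Agent': 'example-crawler/0.1'})
session.cookies.set('example_cookie', '1')
session.proxies.update({'https': 'http://127.0.0.1:8080'})

# Build the request without sending it, so the merged settings can be inspected.
request = requests.Request('GET', 'https://github.com/trending')
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])
print(session.cookies.get('example_cookie'))
```

Settings placed on the session apply to every request made through it, which keeps per-call code short.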

BeautifulSoup is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, a complete application takes very little code.
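To give a feel for how little code is involved, here is a small sketch that extracts links from an inline HTML fragment. The fragment is made up for this example and is not the real GitHub Trending markup; it uses the built-in `html.parser` rather than `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment; it is not the real GitHub Trending markup.
html = """
<ul>
  <li class="repo"><a href="/octocat/hello-world">octocat/hello-world</a></li>
  <li class="repo"><a href="/octocat/spoon-knife">octocat/spoon-knife</a></li>
</ul>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, 'html.parser')

# Find every <li class="repo"> and read the href of the link inside it.
links = [li.find('a')['href'] for li in soup.find_all('li', {'class': 'repo'})]
print(links)
```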

Fetching the page

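import requests


def read_page(url, timeout=30):
    # Send a mobile User-Agent along with the request.
    header = {'User-Agent': USER_AGENT}
    try:
        response = requests.get(url=url, timeout=timeout, headers=header)
    except requests.exceptions.ConnectionError as e:
        # print() with a single argument works on both Python 2.7 and 3.
        print(e)
        return None, False

    # Return the response together with its HTTP status code.
    return response, response.status_code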
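Note that `read_page` catches only `ConnectionError`, so a read timeout (`requests.exceptions.Timeout`) would still propagate to the caller. A possible variant, offered as a sketch rather than the author's code, catches the base class `RequestException` instead. The name `read_page_safe` and the UA string below are hypothetical.

```python
import requests

# Hypothetical UA string for this sketch; the post defines its own USER_AGENT.
USER_AGENT = 'example-crawler/0.1'


def read_page_safe(url, timeout=30):
    """Return (response, status_code), or (None, None) if the request fails."""
    headers = {'User-Agent': USER_AGENT}
    try:
        response = requests.get(url, timeout=timeout, headers=headers)
    except requests.exceptions.RequestException as e:
        # RequestException is the base class of ConnectionError, Timeout, etc.
        print(e)
        return None, None
    return response, response.status_code
```

Callers can then check `code == 200` exactly as before, since `None` never equals 200.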

Here USER_AGENT is a mobile UA:

USER_AGENT = ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36')

Parsing the HTML

from bs4 import BeautifulSoup

# Assumed value: the post uses GITHUB without showing its definition.
GITHUB = 'https://github.com'


def parser_repos(response):
    repos = []
    soup = BeautifulSoup(response.text, "lxml")
    # Each trending repository is rendered as one list-item div.
    for div in soup.find_all('div', {'class': 'list-item with-avatar'}):
        avatar = div.find('img', {'class': 'avatar'})['src']
        # The title has the form "owner/repo".
        name_string = div.find('strong', {'class': 'list-item-title'}).string
        parts = name_string.split('/')
        owner = parts[0]
        repo_name = parts[1]
        url = GITHUB + "/" + name_string

        # The "meta" element holds today's star count; it may be missing.
        meta = div.find('strong', {'class': 'meta'})
        if meta:
            stars = meta.contents[0].strip()
        else:
            stars = "0"

        # parser_desc is a helper that is not shown in this post.
        desc = parser_desc(div.find('div', {'class': 'repo-description'}))

        repos.append({
            "owner": owner,
            "name": repo_name,
            "avatar": avatar,
            "stars_today": stars,
            "desc": desc,
            "url": url
        })

    return repos
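The post calls `parser_desc` without showing it. A hypothetical stand-in, assuming the helper only needs to return the description text with whitespace collapsed (and an empty string when the element is missing), might look like the sketch below. The HTML fragment is made up and only mimics the class name used in `parser_repos`.

```python
from bs4 import BeautifulSoup


def parser_desc(desc_tag):
    """Hypothetical stand-in: return the tag's text with whitespace collapsed."""
    if desc_tag is None:
        # Some repositories have no description element at all.
        return ''
    return ' '.join(desc_tag.get_text().split())


# Made-up fragment that only mimics the class name used by parser_repos.
html = '<div class="repo-description">\n  A sample   description\n</div>'
soup = BeautifulSoup(html, 'html.parser')

print(parser_desc(soup.find('div', {'class': 'repo-description'})))
print(repr(parser_desc(soup.find('div', {'class': 'missing'}))))
```

Handling `None` inside the helper keeps the calling loop free of extra checks.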

Testing

if __name__ == '__main__':
    response, code = read_page('https://github.com/trending')
    if code == 200:
        # Print the parsed repositories so the result can be inspected.
        print(parser_repos(response))

Besides local testing, the crawler is also deployed to a backend service; you can view the result at http://angrycode.leanapp.cn/api/github/trending.
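The post does not show how the backend formats its response. As a rough sketch, assuming the endpoint simply returns the parsed list as JSON, the records built by `parser_repos` could be serialized like this (all sample values are made up):

```python
import json

# Sample record using the same keys that parser_repos builds; values are made up.
repos = [{
    "owner": "octocat",
    "name": "hello-world",
    "avatar": "https://example.com/avatar.png",
    "stars_today": "12 stars today",
    "desc": "A sample description",
    "url": "https://github.com/octocat/hello-world",
}]

# ensure_ascii=False keeps non-ASCII descriptions readable in the output.
payload = json.dumps(repos, ensure_ascii=False)
print(payload)
```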

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2019-06-30. In case of infringement, contact cloudcommunity@tencent.com for removal.

This article is shared from the 终身开发者 WeChat official account.
