Website Detection

不可言诉的深渊

Published on 2020-02-26 14:21:10


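These days most readers are probably stuck at home, and so am I. I got bored recently and wanted to watch a movie without paying for a VIP membership, so I went looking online for free resources, only to find that almost none of the free-resource sites I had bookmarked still worked. That is why I decided to write a website filtering program that checks whether sites like these are still usable.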

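Overview

Let's start with a brief analysis. First, we need to check a batch of free-resource websites. Second, we use the HTTP status code to judge whether a site is usable (a status code cannot give 100% accurate detection, but it is right most of the time). Third, the list may contain duplicate sites, so duplicates have to be filtered out. Finally, we need to store the results.

From the above, the program has three main functions: detection, filtering, and saving. With that in mind, the skeleton is easy to write: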

 class WebsiteDetection:
     def __init__(self):
         pass
 
     def detect(self):
         pass
 
     def filter(self):
         pass
 
     def save(self):
         pass
 
 
 if __name__ == '__main__':
     website_detection = WebsiteDetection()
     website_detection.filter()
     website_detection.detect()
     website_detection.save()

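Initialization

Because we need to check a batch of websites, we need a list to hold them. Some people may ask: since we will deduplicate sooner or later anyway, why not use a set? I suggest starting with a list, because deduplication is not as simple as it looks. We also need a dictionary that maps status codes to scores. The idea is this: if the status code starts with 5, the score is 0; if it starts with 4, the score is 1; and so on, up to a score of 3 for a code that starts with 2. The reasoning is that a code starting with 5 means the site is least likely to still be usable, so it gets the lowest score, and the others follow the same logic. Finally, we define a dictionary that maps each website to its score and initialize it to an empty dictionary.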

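     def __init__(self):
         # Read one URL per line; the with block closes the file afterwards.
         with open('websites.txt') as f:
             self.websites = f.readlines()
         # First digit of the status code -> score (higher means more likely usable).
         self.status_code_score = {'5': 0, '4': 1, '3': 2, '2': 3}
         self.website_score = {}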
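
To make the scoring rule concrete, here is a minimal sketch of the lookup on its own. The helper name `score_for` and the module-level constant are hypothetical and only serve the illustration; the table itself is the one from the article.

```python
# Hypothetical helper: map an HTTP status code to the article's score
# by looking only at its first digit.
STATUS_CODE_SCORE = {'5': 0, '4': 1, '3': 2, '2': 3}


def score_for(status_code):
    """Return the score for a status code, or None if its class is not in the table."""
    return STATUS_CODE_SCORE.get(str(status_code)[0])


print(score_for(200))  # 3
print(score_for(404))  # 1
print(score_for(503))  # 0
print(score_for(101))  # None, because 1xx is not in the table
```

Unlike this sketch, the article's code indexes the dictionary directly, so an unmapped status class raises a KeyError there. Also note that requests follows redirects by default for GET requests, so a 3xx code will rarely be the final status that gets scored.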

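Filtering Duplicate Websites

Let's look at a few websites first; after that it will be clear how to filter out duplicates.

We can see that these addresses share the same protocol and domain, which means they point to the same website. So we cannot deduplicate on the full address, and that is why we did not put the data into a set earlier. Once that is understood, filtering duplicate sites is straightforward: a single set comprehension does it.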

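     def filter(self):
         # Keep only 'protocol://domain/'; [:3] also works for URLs that have no path at all.
         self.websites = {'/'.join(website.strip().split('/', maxsplit=3)[:3]) + '/'
                          for website in self.websites if website.strip()}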
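
The same idea can be sketched with the standard library's `urllib.parse.urlsplit` instead of manual string splitting. This is an alternative, not the article's method; the function name `site_root` is hypothetical, and `example.com` / `example.org` are placeholder domains rather than the sites the article looked at.

```python
from urllib.parse import urlsplit


def site_root(url):
    """Reduce a URL to 'scheme://host/' so that pages of one site compare equal."""
    parts = urlsplit(url.strip())
    return f'{parts.scheme}://{parts.netloc}/'


# Placeholder input, shaped like lines read from websites.txt.
urls = [
    'https://example.com/movie/1.html\n',
    'https://example.com/tv/list?page=2\n',
    'https://example.com\n',
    'http://example.org/index.html\n',
]
roots = {site_root(url) for url in urls}
print(sorted(roots))  # ['http://example.org/', 'https://example.com/']
```

The three `example.com` addresses collapse into one entry, including the one without a trailing path.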

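Website Detection

Detection is simple. Take an element from the website set (not "the first" one, because a set is unordered), then update the dictionary: the key is the website and the value is its score, which is determined by the first digit of the status code.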

     def detect(self):
         for website in self.websites:
             print(website)
             try:
                 # The timeout keeps one unresponsive site from hanging the whole run.
                 self.website_score[website] = self.status_code_score[str(get(website, timeout=10, headers={
                     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/6'
                                   '3.0.3239.132 Safari/537.36 QIHU 360SE'}).status_code)[0]]
             except Exception as e:
                 # Network errors and unmapped status codes end up here; report and move on.
                 print(e)

Here I print each site as simple progress output and wrap the request in try...except, because a network program can easily crash halfway through.
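
The effect of the try...except can be seen without touching the network. The sketch below is a simplified stand-alone version of the loop, assuming a pluggable `fetch_status` callable; that parameter, the function `fake_fetch_status`, and the `*.example` URLs are all hypothetical and exist only for this illustration.

```python
STATUS_CODE_SCORE = {'5': 0, '4': 1, '3': 2, '2': 3}


def detect(websites, fetch_status, table=STATUS_CODE_SCORE):
    """Score every website whose status can be fetched; skip the ones that fail."""
    scores = {}
    for website in websites:
        try:
            scores[website] = table[str(fetch_status(website))[0]]
        except Exception as exc:
            # A failed request must not stop the loop; the site just gets no score.
            print(f'skipped {website}: {exc!r}')
    return scores


def fake_fetch_status(url):
    """Stand-in for a real HTTP request, returning canned status codes."""
    canned = {'https://up.example/': 200, 'https://gone.example/': 404}
    if url not in canned:
        raise ConnectionError('no route to host')
    return canned[url]


scores = detect(
    ['https://up.example/', 'https://gone.example/', 'https://down.example/'],
    fake_fetch_status,
)
print(scores)  # {'https://up.example/': 3, 'https://gone.example/': 1}
```

With the real library, `fetch_status` could be something like `lambda url: requests.get(url, timeout=10).status_code`. The unreachable site never appears in the result, which is also what happens in the article's version.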

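Storage

Now for storage. First, we must not store duplicate websites. Second, the websites should be ordered by some rule: the more likely a site is to be reachable, the higher its score, and the nearer the front it ranks. Finally, the data needs to be persisted. Given these three points, a redis sorted set is a good fit.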

     def save(self):
         redis = Redis()
         for website, score in self.website_score.items():
             # redis-py 3.0 and later expect a {member: score} mapping here.
             redis.zadd('websites', {website: score})
         redis.connection_pool.disconnect()
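
To preview the order in which the sorted set would hand the sites back, the ranking can be emulated in plain Python without a running redis server. This is only a sketch of the ordering rule: the function name `rank` and the `*.example` members are hypothetical, and it assumes ASCII member names so that Python's string comparison matches redis's byte-wise comparison.

```python
def rank(website_score):
    """Order (member, score) pairs from highest to lowest score.

    Ties are broken in descending lexicographic order of the member,
    which is how a reverse-range read of a redis sorted set behaves.
    """
    return sorted(website_score.items(),
                  key=lambda item: (item[1], item[0]),
                  reverse=True)


# A dict, like a sorted set, holds each member once with a single score.
website_score = {
    'https://a.example/': 3,
    'https://b.example/': 1,
    'https://c.example/': 3,
}
for website, score in rank(website_score):
    print(score, website)
```

Against a real server, the equivalent read with redis-py would be `zrevrange('websites', 0, -1, withscores=True)`, which returns the scores as floats.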

Summary

Finally, here is the complete source code of the program.

 from requests import get
 from redis.client import Redis
 
 
 class WebsiteDetection:
     def __init__(self):
         # Read one URL per line; the with block closes the file afterwards.
         with open('websites.txt') as f:
             self.websites = f.readlines()
         # First digit of the status code -> score (higher means more likely usable).
         self.status_code_score = {'5': 0, '4': 1, '3': 2, '2': 3}
         self.website_score = {}
 
     def detect(self):
         for website in self.websites:
             print(website)
             try:
                 # The timeout keeps one unresponsive site from hanging the whole run.
                 self.website_score[website] = self.status_code_score[str(get(website, timeout=10, headers={
                     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/6'
                                   '3.0.3239.132 Safari/537.36 QIHU 360SE'}).status_code)[0]]
             except Exception as e:
                 # Network errors and unmapped status codes end up here; report and move on.
                 print(e)
 
     def filter(self):
         # Keep only 'protocol://domain/'; [:3] also works for URLs that have no path at all.
         self.websites = {'/'.join(website.strip().split('/', maxsplit=3)[:3]) + '/'
                          for website in self.websites if website.strip()}
 
     def save(self):
         redis = Redis()
         for website, score in self.website_score.items():
             # redis-py 3.0 and later expect a {member: score} mapping here.
             redis.zadd('websites', {website: score})
         redis.connection_pool.disconnect()
 
 
 if __name__ == '__main__':
     website_detection = WebsiteDetection()
     website_detection.filter()
     website_detection.detect()
     website_detection.save()

The run output and the data in redis are shown in the figure.