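I have been learning a bit about web crawlers lately. The crawling itself is fairly simple and tutorials are everywhere online, so today I want to talk about batch downloading instead.
When a crawler collects a large number of resources, such as images or short videos, a few hundred of them can be fetched one after another: just let the script run for a good part of the day and it is done. With hundreds of thousands or even millions of files, the time each download spends waiting in line is no longer negligible, and that is when multithreading is worth considering.
That is today's topic. The code is as follows: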
#!/usr/bin/env python3.6
# _*_ coding:utf-8 _*_
# __author__: Ed Frey
# DATE: 2019/2/28
import threading
import os
import sys
import urllib.request
import urllib.error
import time
from time import ctime
import socket

# to get links from website
def getUrlData(url):
    …………

# show the rate of downloading
def _progress(block_num, block_size, total_size):
    …………

# download the video
def getDown_urllib(url, file_path, error_path, videoName):
    …………

# method of downloading
def getVideo_urllib(url_m3u8, path, videoName, error_path):
    …………

# main process
def run(videoName):
    …………

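class MyThread(threading.Thread):
    """Thread that calls func(*args) and keeps the return value."""

    def __init__(self, func, args, name=''):
        threading.Thread.__init__(self)
        self.func = func
        self.name = name
        self.args = args
        self.res = None  # stays None if func raises before returning

    def run(self):
        print('Ready to go', self.name, ' in:', ctime())
        self.res = self.func(*self.args)
        print(self.name, 'end with:', ctime())

    def getResult(self):
        # call only after join(), otherwise the result may not be set yet
        return self.res
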
def main():
    print('multi_mode')
    threads = []
    t1 = MyThread(run, (1000,), run.__name__)
    threads.append(t1)
    t2 = MyThread(run, (1000,), run.__name__)
    threads.append(t2)
    t3 = MyThread(run, (1000,), run.__name__)
    threads.append(t3)
    # start the threads one second apart
    for i in range(len(threads)):
        print(i)
        threads[i].start()
        time.sleep(1)
    # wait for every thread to finish, then print its result
    for i in range(len(threads)):
        threads[i].join()
        print(threads[i].getResult())
    print('all threads have finished!')


if __name__ == '__main__':
    main()
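The helper functions in the middle are hidden because they are hard on the eyes; anyone interested can message me privately to get them. The main point is how the threads are invoked: class MyThread(threading.Thread) subclasses the thread class, and main() sets up several threads and then runs them. As written, this downloads three short videos at once. You can of course set up ten or twenty threads instead; as long as the machine can handle the load, this will be far more efficient than downloading the files one at a time.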
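If you do want to go from three threads to ten or twenty, creating each one by hand gets tedious. One possible approach, sketched below, is the standard library's concurrent.futures.ThreadPoolExecutor, which caps how many downloads run at once while it works through a longer list. This is only an illustration and not the hidden code from this post: fake_download, download_all, and the video_N names are made up for the example, and fake_download simply sleeps to imitate waiting on the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(video_name, delay=0.2):
    """Hypothetical stand-in for run(videoName): sleeps instead of fetching."""
    time.sleep(delay)  # imitates waiting on the network (I/O-bound work)
    return f"{video_name}: done"


def download_all(video_names, max_workers=3):
    """Call fake_download for every name, at most max_workers at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each Future back to the name it was submitted for.
        futures = {pool.submit(fake_download, name): name for name in video_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failed download should not stop the others.
                results[name] = f"{name}: failed ({exc})"
    return results


if __name__ == "__main__":
    names = [f"video_{i}" for i in range(6)]
    start = time.perf_counter()
    results = download_all(names, max_workers=3)
    elapsed = time.perf_counter() - start
    for name in names:
        print(results[name])
    print(f"elapsed: {elapsed:.2f}s")
```

Raising max_workers plays the same role as adding more MyThread instances in main(), except that the pool reuses its threads and queues the remaining names for you. How high it is worth going depends on your bandwidth and on what the server tolerates.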