Hi! I'm trying to write a web crawler in Python, and I want to use Python multithreading. Even after reading earlier suggested papers and tutorials, I still have a problem. My code is here (the whole source code is here):
class Crawler(threading.Thread):
    global g_URLsDict
    varLock = threading.Lock()
    count = 0

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.url = self.queue.get()

    def run(self):
        while 1:
            print self.getName()+" started"
            self.page = getPage(self.url)
            self.parsedPage = getParsedPage(self.page, fix=True)
            self.urls = getLinksFromParsedPage(self.parsedPage)

            for url in self.urls:
                self.fp = hashlib.sha1(url).hexdigest()
                #url-seen check
                Crawler.varLock.acquire() #lock for global variable g_URLs
                if self.fp in g_URLsDict:
                    Crawler.varLock.release() #releasing lock
                else:
                    #print url+" does not exist"
                    Crawler.count +=1
                    print "total links: %d"%len(g_URLsDict)
                    print self.fp
                    g_URLsDict[self.fp] = url
                    Crawler.varLock.release() #releasing lock
                    self.queue.put(url)
                    print self.getName()+ " %d"%self.queue.qsize()
                    self.queue.task_done()
                #self.queue.task_done()
            #self.queue.task_done()

print g_URLsDict
queue = Queue.Queue()
queue.put("http://www.ertir.com")

for i in range(5):
    t = Crawler(queue)
    t.setDaemon(True)
    t.start()

queue.join()
It does not work as needed: it gives no results after thread 1, it runs differently each time, and sometimes it gives this error:
Exception in thread Thread-2 (most likely raised during interpreter shutdown):
How can I fix it? Also, I don't think this is any more efficient than a plain for loop.

I have tried to fix run():
def run(self):
    while 1:
        print self.getName()+" started"
        self.page = getPage(self.url)
        self.parsedPage = getParsedPage(self.page, fix=True)
        self.urls = getLinksFromParsedPage(self.parsedPage)

        for url in self.urls:
            self.fp = hashlib.sha1(url).hexdigest()
            #url-seen check
            Crawler.varLock.acquire() #lock for global variable g_URLs
            if self.fp in g_URLsDict:
                Crawler.varLock.release() #releasing lock
            else:
                #print url+" does not exist"
                print self.fp
                g_URLsDict[self.fp] = url
                Crawler.varLock.release() #releasing lock
                self.queue.put(url)
                print self.getName()+ " %d"%self.queue.qsize()
                #self.queue.task_done()
            #self.queue.task_done()
        self.queue.task_done()
I have tried the task_done() call in different places; can anyone explain the difference it makes?
Posted on 2012-05-29 22:55:39
self.url = self.queue.get() is only called when the thread is initialized. If you want to pick up new urls for later processing, you need to fetch urls from the queue again inside the while loop.

Try replacing self.page = getPage(self.url) with self.page = getPage(self.queue.get()). Note that the get function will block indefinitely. You will probably want to time out after a while and add some way for your background threads to exit gracefully on request (which would eliminate the Exception you saw).

There are some good examples on effbot.org that use get() in the way I described above.
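As a rough sketch of that idea (mine, not part of the original answer): the workers could poll the queue with a timeout and check a hypothetical threading.Event named stop_event, so the main program can ask them to exit cleanly instead of them blocking forever on an empty queue. getPage comes from the question's own code.

import Queue
import threading

stop_event = threading.Event()       # hypothetical flag; the main program calls stop_event.set() to shut down

def run(self):
    while not stop_event.is_set():
        try:
            url = self.queue.get(timeout=1)   # wait at most 1 second for a url
        except Queue.Empty:
            continue                          # queue is empty: loop around and re-check stop_event
        try:
            self.page = getPage(url)          # process the url as in the question's run()
            # ... parse the page and enqueue new links here ...
        finally:
            self.queue.task_done()            # pair every successful get() with a task_done()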
Edit - in answer to your initial comment:

Take a look at the docs for task_done(); for every call to get() (that does not time out) you should call task_done(), which tells any blocking calls to join() that everything on that queue has now been processed. Each call to get() will block (sleep) while it waits for a new url to be posted on the queue.
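To make that contract concrete, here is a tiny standalone illustration (mine, not from the answer); the worker function and the three queued items are placeholders only:

import Queue
import threading

q = Queue.Queue()

def worker():
    while 1:
        item = q.get()        # blocks until an item is available
        print "processing", item
        q.task_done()         # tell the queue this item is finished

t = threading.Thread(target=worker)
t.setDaemon(True)
t.start()

for i in range(3):
    q.put(i)

q.join()                      # returns only after task_done() has been called 3 times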
Edit2 - try this alternative run function:
def run(self):
    while 1:
        print self.getName()+" started"
        url = self.queue.get() # <-- note that we're blocking here to wait for a url from the queue
        self.page = getPage(url)
        self.parsedPage = getParsedPage(self.page, fix=True)
        self.urls = getLinksFromParsedPage(self.parsedPage)

        for url in self.urls:
            self.fp = hashlib.sha1(url).hexdigest()
            #url-seen check
            Crawler.varLock.acquire() #lock for global variable g_URLs
            if self.fp in g_URLsDict:
                Crawler.varLock.release() #releasing lock
            else:
                #print url+" does not exist"
                Crawler.count +=1
                print "total links: %d"%len(g_URLsDict)
                print self.fp
                g_URLsDict[self.fp] = url
                Crawler.varLock.release() #releasing lock
                self.queue.put(url)
                print self.getName()+ " %d"%self.queue.qsize()

        self.queue.task_done() # <-- We've processed the url this thread pulled off the queue so indicate we're done with it.
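One stylistic aside, as a suggestion rather than something from the answer: the manual acquire()/release() pairs around g_URLsDict can be written with the lock as a context manager, so the lock is released even if the code inside the critical section raises. A sketch of the url-seen check in that form, keeping the same behaviour:

# hypothetical restructuring of the url-seen check from the run() above
with Crawler.varLock:                 # the with-block releases varLock automatically
    is_new = self.fp not in g_URLsDict
    if is_new:
        Crawler.count += 1
        g_URLsDict[self.fp] = url
if is_new:
    self.queue.put(url)               # enqueue outside the lock, as in the original code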
https://stackoverflow.com/questions/10800593