Try this solution.

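<pre><code>import threading

import requests
from bs4 import BeautifulSoup

results = {}  # url -> list of matching links

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # A thread target's return value is discarded, so store the result instead
    results[url] = soup.find_all("a", {"class": "dev-link"})

threads = [threading.Thread(target=fetch_links, args=(url,))
           for url in websites]

for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every download to finish
</code></pre>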

Downloading web page content via <code>requests.get()</code> is a blocking I/O operation: each thread spends most of its time waiting on the network, so Python threading can actually improve performance here by overlapping those waits.
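As a variation on the same idea, the standard library's <code>concurrent.futures.ThreadPoolExecutor</code> handles starting, joining, and collecting results for you. This is a minimal sketch, not the answer's own code: <code>scrape_all</code> and <code>fetch_links_stub</code> are hypothetical names, and the stub only sleeps to mimic a blocking download so the example runs without network access. In real use you would pass a function that calls <code>requests.get()</code> and <code>BeautifulSoup</code> and returns the links.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_links_stub(url):
    """Hypothetical stand-in for a real fetch function: sleeps to mimic a blocking download."""
    time.sleep(0.2)
    return [url + "/a", url + "/b"]


def scrape_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL in a thread pool and return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each URL with its result
        return dict(zip(urls, pool.map(fetch, urls)))


# Placeholder URLs; the stub never contacts them
websites = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]
links_by_url = scrape_all(websites, fetch_links_stub)
```

Because the pool caps the number of worker threads, a long list of sites does not spawn one thread per URL, which is the main reason to prefer it over creating <code>threading.Thread</code> objects by hand.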

If you want to use multithreading, you can subclass <code>threading.Thread</code>:
<pre><code>import threading
import requests
from bs4 import BeautifulSoup

class Scraper(threading.Thread):
    def __init__(self, threadId, name, url):
        threading.Thread.__init__(self)
        self.name = name
        self.id = threadId
        self.url = url
        self.links = []  # filled in by run()

    def run(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # start() discards run()'s return value, so keep the result on the instance
        self.links = soup.find_all(&quot;a&quot;)

# list the websites to scrape here
websites = []
threads = [Scraper(i, &quot;thread&quot; + str(i), url)
           for i, url in enumerate(websites, start=1)]
for thread in threads:
    thread.start()  # start() runs run() in a new thread; calling run() directly would block
for thread in threads:
    thread.join()
    # print(thread.links)
</code></pre>
This might be helpful.

When it comes to Python and scraping, <a href="https://scrapy.org/" rel="nofollow noreferrer">Scrapy</a> is probably the way to go.

Scrapy uses the <a href="https://twistedmatrix.com/trac/" rel="nofollow noreferrer">Twisted</a> library for concurrency, so you don't have to worry about threading and the <a href="https://wiki.python.org/moin/GlobalInterpreterLock" rel="nofollow noreferrer">Python GIL</a>.

If you must use BeautifulSoup, check out <a href="https://github.com/alecxe/scrapy-beautifulsoup" rel="nofollow noreferrer">this library</a>.

I'm building a web scraping app in Python with the Django web framework. I need to scrape multiple queries using the BeautifulSoup library. Here is a snippet of the code I have written:
<pre><code>for url in websites:
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    links = soup.find_all(&quot;a&quot;, {&quot;class&quot;:&quot;dev-link&quot;})
</code></pre>
As written, the pages are scraped sequentially, and I want to scrape them in parallel. I don't know much about threading in Python.
Can someone tell me how to scrape in parallel? Any help would be appreciated.

How to scrape multiple HTML pages in parallel with BeautifulSoup in Python?
