I'm trying to scrape this Italian site of genealogical records: https://www.natitrentino.mondotrentino.net/. My code works fine, but there are 1,300,000 pages, so I'd like to add multithreading to scrape the data faster. Can anyone help?
Thanks in advance!
Here is my code without multithreading:
import requests
import tabula
import pandas as pd

first_page = 16999
last_page = 17500

def url_ok(url):
    r = requests.head(url)
    return r.status_code == 200

general = pd.DataFrame(columns=['Cognome', 'Nome', 'Nome del padre',
                                'Cognome della madre', 'Nome della madre',
                                'Data di nascita', 'Parrocchia', 'Comune', 'URL'])

for i in range(first_page, last_page):
    pdf_path = f'https://www.natitrentino.mondotrentino.net/natiintrentino/viewprint/{i}'
    if not url_ok(pdf_path):
        continue  # `i = i + 1` has no effect inside a for loop; just skip the page
    # coordinates of the table area on the first page of the PDF
    y1 = 89.45
    x1 = 25.8
    y2 = y1 + 193.199
    x2 = x1 + 447
    dfs = tabula.read_pdf(pdf_path, stream=True, pages=1,
                          area=(y1, x1, y2, x2), guess=False)
    intermediate = dfs[0].set_index("Cognome").T
    intermediate['URL'] = pdf_path
    general = general.append(intermediate)
    print(general)
    print(i)
Posted on 2021-09-15 14:51:21
Here is one solution.
The main idea is to divide the 17500 - 16999 pages among the threads.
This is done by

offsets = [(last_page - first_page) // number_of_threads +
           (1 if x < (last_page - first_page) % number_of_threads else 0)
           for x in range(number_of_threads)]

Each offset is then added to first_page to advance the window, and that slice of pages is handed to its own thread.
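As a quick sanity check, the split can be computed in isolation (the values below are the ones from the answer; no scraping is involved):

```python
# standalone check of the offset split described above
first_page = 16999
last_page = 17500
number_of_threads = 4

offsets = [(last_page - first_page) // number_of_threads +
           (1 if x < (last_page - first_page) % number_of_threads else 0)
           for x in range(number_of_threads)]

print(offsets)       # [126, 125, 125, 125]
print(sum(offsets))  # 501, so every page is covered exactly once
```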
import requests
import tabula
import pandas as pd
import threading

first_page = 16999
last_page = 17500
number_of_threads = 4

# divide last_page - first_page into (nearly) equal parts;
# with 4 threads, 17500 - 16999 = 501 pages splits as [126, 125, 125, 125]
# adjust number_of_threads to change the parallelism
offsets = [(last_page - first_page) // number_of_threads +
           (1 if x < (last_page - first_page) % number_of_threads else 0)
           for x in range(number_of_threads)]

def url_ok(url):
    r = requests.head(url)
    return r.status_code == 200

general = pd.DataFrame(columns=['Cognome', 'Nome', 'Nome del padre',
                                'Cognome della madre', 'Nome della madre',
                                'Data di nascita', 'Parrocchia', 'Comune', 'URL'])
lock = threading.Lock()  # protects the shared DataFrame across threads

def scrape(start, end):
    global general
    print(f'starting to scrape from {start} to {end}')
    for i in range(start, end):
        pdf_path = f'https://www.natitrentino.mondotrentino.net/natiintrentino/viewprint/{i}'
        if not url_ok(pdf_path):
            continue
        y1 = 89.45
        x1 = 25.8
        y2 = y1 + 193.199
        x2 = x1 + 447
        dfs = tabula.read_pdf(pdf_path, stream=True,
                              pages=1, area=(y1, x1, y2, x2), guess=False)
        intermediate = dfs[0].set_index("Cognome").T
        intermediate['URL'] = pdf_path
        with lock:
            # DataFrame.append is deprecated in newer pandas;
            # pd.concat([general, intermediate]) is the modern equivalent
            general = general.append(intermediate)
        print(i)

for offset in offsets:
    start = first_page
    first_page += offset
    threading.Thread(name="child", target=scrape, args=(start, first_page)).start()

Note: there may be some overlap in the offsets.
https://stackoverflow.com/questions/69194470