Multithreading - web-scraping - Python
Stack Overflow user
Asked on 2021-09-15 13:50:15
1 answer · 76 views · 0 following · 0 votes

I am trying to scrape this Italian website containing genealogical records: https://www.natitrentino.mondotrentino.net/. My code works fine, but there are 1,300,000 pages. I would like to add multithreading to scrape the data faster. Can anyone help me?

Thanks in advance!

Here is my code without multithreading:

import requests
import tabula
import pandas as pd

first_page = 16999
last_page = 17500

def url_ok(url):
    r = requests.head(url)
    return r.status_code == 200

general = pd.DataFrame(columns=['Cognome', 'Nome', 'Nome del padre', 'Cognome della madre', 'Nome della madre', 'Data di nascita', 'Parrocchia', 'Comune', 'URL'])
for i in range(first_page, last_page):
    pdf_path = f'https://www.natitrentino.mondotrentino.net/natiintrentino/viewprint/{i}'
    if not url_ok(pdf_path):
        continue
    # extract the record table from a fixed area of the first PDF page
    y1 = 89.45
    x1 = 25.8
    y2 = y1 + 193.199
    x2 = x1 + 447
    dfs = tabula.read_pdf(pdf_path, stream=True, pages=1, area=(y1, x1, y2, x2), guess=False)
    intermediate = dfs[0].set_index("Cognome").T
    intermediate['URL'] = pdf_path
    general = general.append(intermediate)
    print(general)
    print(i)

1 Answer

Stack Overflow user

Answered on 2021-09-15 14:51:21

Here is a solution.

The main idea is to divide the 17500 - 16999 pages among the threads.

This is done with:

offsets = [(last_page - first_page) // number_of_threads
           + (1 if x < (last_page - first_page) % number_of_threads else 0)
           for x in range(number_of_threads)]

Each offset is then added to first_page to move the start forward, and that chunk of pages is assigned to its own thread.
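For example, with 4 threads the 501 pages come out as chunks of 126, 125, 125 and 125. This is a small runnable check of what the formula above produces and of the (start, end) range each thread would get; it is only an illustration, not part of the original answer:

first_page, last_page, number_of_threads = 16999, 17500, 4

# same formula as in the answer
offsets = [(last_page - first_page) // number_of_threads
           + (1 if x < (last_page - first_page) % number_of_threads else 0)
           for x in range(number_of_threads)]
print(offsets)  # [126, 125, 125, 125]

start = first_page
for offset in offsets:
    print((start, start + offset))  # (16999, 17125) (17125, 17250) (17250, 17375) (17375, 17500)
    start += offset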

import requests
import tabula
import pandas as pd
import threading

first_page = 16999
last_page = 17500

number_of_threads = 4

# divide last_page - first_page into roughly equal parts;
# with 4 threads this splits 17500 - 16999 into [126, 125, 125, 125]
# (adjust number_of_threads to use more threads)
offsets = [(last_page - first_page) // number_of_threads
           + (1 if x < (last_page - first_page) % number_of_threads else 0)
           for x in range(number_of_threads)]


def url_ok(url):
    r = requests.head(url)
    return r.status_code == 200


def scrape(start, end):
    print(f'starting to scrape from {start} to {end}')
    # each thread builds its own DataFrame, so no state is shared between threads
    general = pd.DataFrame(columns=['Cognome', 'Nome', 'Nome del padre', 'Cognome della madre', 'Nome della madre', 'Data di nascita', 'Parrocchia', 'Comune', 'URL'])
    for i in range(start, end):
        pdf_path = f'https://www.natitrentino.mondotrentino.net/natiintrentino/viewprint/{i}'
        if not url_ok(pdf_path):
            continue
        y1 = 89.45
        x1 = 25.8
        y2 = y1 + 193.199
        x2 = x1 + 447
        dfs = tabula.read_pdf(pdf_path, stream=True, pages=1, area=(y1, x1, y2, x2), guess=False)
        intermediate = dfs[0].set_index("Cognome").T
        intermediate['URL'] = pdf_path
        general = general.append(intermediate)
        print(general)
        print(i)


for offset in offsets:
    start = first_page
    first_page += offset
    threading.Thread(name="child", target=scrape, args=(start, first_page)).start()

Note: there may be some overlap in the offsets.
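As a side note beyond the original answer, the same page-splitting can also be expressed with concurrent.futures.ThreadPoolExecutor, which waits for the workers and makes it easy to collect each chunk's rows back into one DataFrame. This is a minimal sketch under the assumption that the same url_ok check and tabula area as above are used; it uses pd.concat rather than DataFrame.append, since append has been removed in recent pandas versions:

import concurrent.futures

import pandas as pd
import requests
import tabula

BASE_URL = 'https://www.natitrentino.mondotrentino.net/natiintrentino/viewprint/{}'

def url_ok(url):
    return requests.head(url).status_code == 200

def scrape_range(start, end):
    # scrape one chunk of pages and return its rows as a list of DataFrames
    frames = []
    for i in range(start, end):
        pdf_path = BASE_URL.format(i)
        if not url_ok(pdf_path):
            continue
        area = (89.45, 25.8, 89.45 + 193.199, 25.8 + 447)
        dfs = tabula.read_pdf(pdf_path, stream=True, pages=1, area=area, guess=False)
        row = dfs[0].set_index("Cognome").T
        row['URL'] = pdf_path
        frames.append(row)
    return frames

first_page, last_page, workers = 16999, 17500, 4
step = -(-(last_page - first_page) // workers)  # ceiling division
chunks = [(s, min(s + step, last_page)) for s in range(first_page, last_page, step)]

with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
    per_chunk = list(pool.map(lambda c: scrape_range(*c), chunks))

# flatten the per-chunk lists into one DataFrame
general = pd.concat([row for frames in per_chunk for row in frames])
print(general)

Because each worker returns its own list of rows and the chunks are disjoint, no locking around a shared DataFrame is needed.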

Votes: 0
Original question: https://stackoverflow.com/questions/69194470
