Scraping PDFs from dynamic links with a checksum

Stack Overflow user
Asked 2021-08-15 03:18:33
Answers: 1 · Views: 36 · Followers: 0 · Votes: 0

I have been trying to scrape the PDFs from a page like this one: https://www.oecd-ilibrary.org/science-and-technology/oecd-digital-economy-papers_20716826?page=4

Using BeautifulSoup did not work.

How can I scrape the actual PDF documents?


1 Answer

Stack Overflow user

Accepted answer

Posted 2021-08-15 05:54:17

import trio
import httpx
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urljoin, urlparse

mainurl = 'https://www.oecd-ilibrary.org/science-and-technology/oecd-digital-economy-papers_20716826'


async def downloader(client, link, channel):
    # Name the file after the last path segment; the query string is ignored.
    fname = urlparse(link).path.split('/')[-1]
    async with channel, await trio.open_file(fname, 'wb') as f:
        r = await client.get(link)
        await f.write(r.content)
        print(f'Downloaded: {link}')


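async def get_links(content):
    # Parse only the element with id="collectionsort", then resolve each PDF link against mainurl.
    soup = BeautifulSoup(content, 'lxml', parse_only=SoupStrainer(id='collectionsort'))
    return (urljoin(mainurl, x['href']) for x in soup.select('a.action-pdf'))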


async def worker(channel):
    async with channel:
        # Each message carries the shared client, a page number, and the nursery.
        async for client, page, nurse in channel:
            params = {
                'page': page
            }
            r = await client.get(mainurl, params=params)
            links = await get_links(r.text)
            for link in links:
                # Download each PDF in its own task.
                nurse.start_soon(downloader, client, link, channel.clone())


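async def main():
    async with httpx.AsyncClient(timeout=None) as client, trio.open_nursery() as nurse:
        # Unbuffered channel: send() waits until a worker is ready to receive.
        sender, receiver = trio.open_memory_channel(0)

        async with receiver:
            # Start five workers that pull page numbers from the channel.
            for _ in range(5):
                nurse.start_soon(worker, receiver.clone())

            async with sender:
                # Listing pages 1 through 18.
                for page in range(1, 19):
                    await sender.send([client, page, nurse])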


if __name__ == "__main__":
    try:
        trio.run(main)
    except KeyboardInterrupt:
        # The built-in exit() is meant for interactive use; raise SystemExit in a script.
        raise SystemExit('Bye!')
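
The downloader above names each file after the last path segment of its link, so anything in the query string (for example, the checksum mentioned in the question title) plays no part in the filename. Below is a minimal sketch of those two URL steps using only the standard library. The href is a made-up example, not a real OECD link, and `resolve_link` and `filename_for` are hypothetical helper names that do not appear in the answer.

```python
from urllib.parse import urljoin, urlparse

MAIN_URL = 'https://www.oecd-ilibrary.org/science-and-technology/oecd-digital-economy-papers_20716826'


def resolve_link(href, base=MAIN_URL):
    """Turn an href from the listing page into an absolute URL."""
    return urljoin(base, href)


def filename_for(link):
    """Use the last path segment as the filename; the query string is dropped."""
    return urlparse(link).path.split('/')[-1]


# Hypothetical href, for illustration only.
href = '/docserver/example-en.pdf?expires=0&checksum=ABC123'
link = resolve_link(href)
print(link)
print(filename_for(link))
```

The query string stays on the URL that is actually requested; it is only left out when choosing the local filename.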
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/68788457
