Q: Extracting the full hyperlink string from a PDF with PyMuPDF

Stack Overflow user

Asked 2020-03-12 22:13:25

Answers: 1 · Views: 358 · Followers: 0 · Votes: 1

I am trying to extract every link from a PDF. With the code below I can get each hyperlink:

import csv
import os

import fitz  # PyMuPDF

folder = "test_folder"
# Collect every .pdf file below the folder
folder_data = [os.path.join(dp, f) for dp, dn, filenames in os.walk(folder) for f in filenames if os.path.splitext(f)[1] == '.pdf']

data = [loc.replace("\\", "/") for loc in folder_data]

output = []  # rows of [pdf path, link word, link target]
count = 1

for loc in data:
    doc = fitz.open(loc)
    #color_check(doc, count)
    file_name = loc.split("/")[-1]
    print(f"INFO: Crawling over file {file_name}, number {count} of {len(data)}")
    count += 1

    for page in doc:
        links = page.getLinks()  # named get_links() in newer PyMuPDF releases
        print(links)
        for link in links:
            uri_rect = [round(c, 2) for c in link['from']]
            words_in_document = page.getTextWords()  # newer releases: page.get_text("words")
            for word in words_in_document:
                word_rect = [round(c, 2) for c in word[:4]]
                # Share of rounded coordinates the two rectangles have in common, in percent
                rect_dif_percentage = len(set(uri_rect) & set(word_rect)) / float(len(set(uri_rect) | set(word_rect))) * 100
                if rect_dif_percentage >= 60:
                    # If the link points to a file
                    try:
                        referenced_file_name = link['file'].split("/")[1]
                        referenced_file_path = link['file'].split("/")[0]
                        for file_loc in range(len(data)):
                            if referenced_file_name in data[file_loc]:
                                referenced_file_path = data[file_loc]
                        output.append([loc, word[4], referenced_file_path])

                    # If the link points to a website
                    except (KeyError, IndexError):
                        referenced_file_name = "N/A"
                        referenced_file_path = link['uri']
                        output.append([loc, word[4], referenced_file_path])

with open("output.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(output)

print("INFO: Crawling completed, you can close this window and check output.csv")

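The problem is this: if a hyperlink spans several words, I cannot get the second word, because I match words against the rectangle returned by page.getLinks(), and that approach only finds the hyperlink's first word.

For example, given the following hyperlink: Click me!

My code only retrieves the string 'Click'.

What can I do to fix this? I'm stuck and can't think of anything. Also, if you have a solution that does not use PyMuPDF, it is welcome!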

pymupdf

python

pdf

Stack Overflow user

Answered 2021-09-09 15:05:39

You should be able to simply use page.get_textbox() to get the link's text, like this:

...
    for link in links:
        x = 10  # padding added on every side of the link rect; adjust the value as needed
        linkText = page.get_textbox(link["from"] + (-x, -x, x, x))
...
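
To make the fragment above easier to try out, here is a minimal sketch of the same idea as a complete script. It is not the answerer's code: the helper names pad_rect, link_texts and extract_link_texts, the default padding of 2 points, and the whitespace cleanup are assumptions made for illustration. It relies only on page.get_links() and page.get_textbox() (the current names of the calls used in this thread) and assumes get_textbox() accepts a plain 4-tuple as a rect-like argument.

```python
def pad_rect(rect, pad):
    """Return (x0, y0, x1, y1) grown by `pad` points on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - pad, y0 - pad, x1 + pad, y1 + pad)


def link_texts(page, pad=2):
    """Pair the visible text of each link on `page` with the link's target.

    `page` is expected to behave like a PyMuPDF page: it must offer
    get_links() and get_textbox(rect).
    """
    results = []
    for link in page.get_links():
        # Read all text inside the (slightly enlarged) link rectangle,
        # not just the first word.
        text = page.get_textbox(pad_rect(link["from"], pad))
        # External links carry "uri", links to other files carry "file".
        target = link.get("uri") or link.get("file") or ""
        # Collapse line breaks and repeated spaces into single spaces.
        results.append((" ".join(text.split()), target))
    return results


def extract_link_texts(pdf_path, pad=2):
    """Return (page number, link text, target) for every link in a PDF."""
    import fitz  # PyMuPDF; imported here so the helpers above need no install

    with fitz.open(pdf_path) as doc:
        return [
            (page.number, text, target)
            for page in doc
            for text, target in link_texts(page, pad)
        ]
```

Calling extract_link_texts("test_folder/example.pdf") (a hypothetical path) would return one tuple per link. A large padding can pull in neighbouring text, so it is probably worth starting small and increasing the value only if link text comes back truncated.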

Votes: 1


Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/60655909
