Q: PDFMiner returns strange letters/characters

Stack Overflow user

Asked on 2018-10-18 05:02:53

1 answer, 1K views, 0 followers, 3 votes

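I am using pdfminer with Python 3, and I am getting strange letters in the text extracted from a PDF.

For example, I get signiﬁcant instead of significant (note that the letters f and i are merged into a single character).

I have no idea why this happens. This is the code I am using.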

from io import StringIO

from nltk.tokenize import sent_tokenize
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # Open the file in a with-block so it is closed once extraction is done.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()

    # Split the extracted text into sentences and print each one.
    sentences = sent_tokenize(text)

    for s in sentences:
        print(s)
        print("\n\n")

My only guess so far is that it may be related to the encoding, but it looks like there is no way to retrieve the encoding of a PDF.

python

python-3.x

pdf

text

pdfminer

1 answer

Stack Overflow user

Accepted answer

Posted on 2018-10-18 05:27:18

PDFMiner is working correctly. The character in question is the Unicode character U+FB01, the fi ligature.
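One way to confirm which character is actually in the extracted text is to list its non-ASCII characters with their code points and Unicode names. The following is a minimal sketch using only the standard library; the helper name `describe_non_ascii` and the sample string are made up for illustration and are not part of pdfminer.

```python
import unicodedata

# Hypothetical sample standing in for text extracted by pdfminer.
sample = "signi\ufb01cant"


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


for entry in describe_non_ascii(sample):
    print(entry)
```

For the sample above this reports U+FB01 with the name LATIN SMALL LIGATURE FI, which matches the diagnosis in this answer.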

Add a line to your code that replaces ﬁ with fi:

for s in sentences:
    s = s.replace('ﬁ', 'fi')
    print(s)

Unicode defines another very common, purely typographic(*) ligature: U+FB02, the fl ligature. Treat it the same way:

    s = s.replace('ﬂ', 'fl')

There are a few others in the Alphabetic Presentation Forms block, which you may want to include as well.
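If you want to handle all of the Latin ligatures in that block at once, a translation table avoids chaining several `replace` calls. This is only a sketch: the name `expand_ligatures` is hypothetical, and mapping U+FB05 (long s + t) to plain "st" is a choice made here, not something the answer prescribes.

```python
# Latin ligatures in the Alphabetic Presentation Forms block (U+FB00 to U+FB06).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t; mapped to plain "st" by choice
    "\ufb06": "st",
}

# Build the table once; str.translate then does all replacements in one pass.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace purely typographic Latin ligatures with their component letters."""
    return text.translate(_LIGATURE_TABLE)


print(expand_ligatures("signi\ufb01cant \ufb02ow in the o\ufb03ce"))
```

In the question's code, this could be applied to `text` before it is passed to `sent_tokenize`, so that every sentence is already cleaned.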

(*) Do not make the mistake of changing æ to ae and œ to oe. These are not "purely typographic ligatures" but valid characters in their own right.
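An alternative that is not part of the answer above, offered as an assumption about what may suit your data: Unicode compatibility normalization (NFKC) from the standard library expands ﬁ and ﬂ into separate letters while leaving æ and œ alone, since those two have no decomposition. Be aware that NFKC also rewrites other compatibility characters (for example, superscript ² becomes 2), so it is broader than the explicit replacements. The function name `normalize_text` is made up for this sketch.

```python
import unicodedata


def normalize_text(text):
    """Apply NFKC normalization, which expands compatibility ligatures."""
    return unicodedata.normalize("NFKC", text)


# The fi and fl ligatures are expanded; the letters ae and oe are kept as-is.
print(normalize_text("signi\ufb01cant \ufb02ow"))
print(normalize_text("\u00e6 \u0153"))
```

If you only want to touch the ligatures and nothing else, the explicit replacements are the safer option.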

4 votes

Page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/52863575
