Q: How can I extract tables as text from a PDF using Python?

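Stack Overflow user

Asked 2017-11-28 14:23:25

4 answers · 135.5K views · 0 followers · 47 votes

I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF.

Right now I locate the tables by hand, page by page. Once I find one, I capture that page and save it to another PDF.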

import PyPDF2

PDFfilename = "Sammamish.pdf"  # path to the source PDF

# Names below are from PyPDF2 3.x; older releases used
# PdfFileReader/getPage/PdfFileWriter/addPage instead.
with open(PDFfilename, "rb") as inputStream:
    pfr = PyPDF2.PdfReader(inputStream)
    pg127 = pfr.pages[126]  # pages are zero-indexed, so this is page 127

    writer = PyPDF2.PdfWriter()
    writer.add_page(pg127)

    NewPDFfilename = "allTables.pdf"  # path for the new PDF
    with open(NewPDFfilename, "wb") as outputStream:
        writer.write(outputStream)  # write the selected page to the new PDF

My goal is to extract the tables from the entire PDF document.

pdf

pdf-parsing

python

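4 Answers

Stack Overflow user

Accepted answer

Posted 2020-04-25 19:50:36

This answer is for anyone dealing with PDFs that contain images and therefore need OCR. I couldn't find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

Here are the steps I found to work.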

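1. Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the PDF into images.
2. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
3. Use OpenCV to find and extract the tables.
4. Use OpenCV to find and extract each cell from a table.
5. Use OpenCV to crop and clean up each cell so that no noise confuses the OCR software.
6. Use Tesseract to OCR each cell.
7. Combine the extracted text of each cell into the format you need.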
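The final step, combining the per-cell text, could be sketched roughly as follows. This assumes the OCR results have already been grouped into rows of strings; the function name cells_to_csv and the choice of CSV as the output format are illustrative assumptions, not part of the package mentioned in this answer.

```python
import csv
import io

def cells_to_csv(rows):
    """Serialize rows of OCR'd cell text (a list of lists of strings) as CSV.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Strip the stray whitespace and newlines OCR tends to leave around text.
        writer.writerow([text.strip() for text in row])
    return buffer.getvalue()

# Made-up OCR output: two rows of two cells each.
ocr_rows = [["Name ", "Qty\n"], ["Widget, large", " 3"]]
print(cells_to_csv(ocr_rows))
```

The csv module quotes any cell that contains a comma, so text such as "Widget, large" stays in a single column.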

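I wrote a Python package with modules that can help with these steps.

Repository: https://github.com/eihli/image-table-ocr

Docs and source: ocr.html

Some of the steps don't require code; they rely on external tools such as pdfimages and tesseract. I'll give brief examples for a few of the steps that do require code.

Finding tables:

This link was a helpful reference while working out how to find tables: https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/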

import cv2

def find_tables(image):
    BLUR_KERNEL_SIZE = (17, 17)
    STD_DEV_X_DIRECTION = 0
    STD_DEV_Y_DIRECTION = 0
    blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, STD_DEV_Y_DIRECTION)
    MAX_COLOR_VAL = 255
    BLOCK_SIZE = 15
    SUBTRACT_FROM_MEAN = -2

    img_bin = cv2.adaptiveThreshold(
        ~blurred,
        MAX_COLOR_VAL,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY,
        BLOCK_SIZE,
        SUBTRACT_FROM_MEAN,
    )
    SCALE = 5
    # NumPy image arrays are shaped (rows, columns), i.e. (height, width).
    image_height, image_width = img_bin.shape
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
    horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
    vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)

    horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))

    mask = horizontally_dilated + vertically_dilated
    contours, hierarchy = cv2.findContours(
        mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
    )

    MIN_TABLE_AREA = 1e5
    contours = [c for c in contours if cv2.contourArea(c) > MIN_TABLE_AREA]
    perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
    epsilons = [0.1 * p for p in perimeter_lengths]
    approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
    bounding_rects = [cv2.boundingRect(a) for a in approx_polys]

    # The link where a lot of this code was borrowed from recommends an
    # additional step to check the number of "joints" inside this bounding rectangle.
    # A table should have a lot of intersections. We might have a rectangular image
    # here though which would only have 4 intersections, 1 at each corner.
    # Leaving that step as a future TODO if it is ever necessary.
    images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
    return images

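Extracting cells from a table.

This is very similar to finding the tables, so I won't include all of the code. The part worth showing is how the cells are sorted.

We want to identify the cells from left to right, top to bottom.

We find the top-left-most rectangle. Then we find all of the rectangles whose center lies between the top and bottom y-values of that top-left rectangle. Then we sort those rectangles by the x-value of their centers. We remove those rectangles from the list and repeat.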

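def cell_in_same_row(c1, c2):
    # Cells are (x, y, w, h) bounding rectangles.
    # c1 is in c2's row if c1's vertical center lies between c2's top and bottom.
    c1_center = c1[1] + c1[3] / 2
    c2_bottom = c2[1] + c2[3]
    c2_top = c2[1]
    return c2_top < c1_center < c2_bottom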

orig_cells = [c for c in cells]
rows = []
while cells:
    first = cells[0]
    rest = cells[1:]
    cells_in_same_row = sorted(
        [
            c for c in rest
            if cell_in_same_row(c, first)
        ],
        key=lambda c: c[0]
    )

    row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    rows.append(row_cells)
    cells = [
        c for c in rest
        if not cell_in_same_row(c, first)
    ]

# Sort rows by average height of their center.
def avg_height_of_center(row):
    centers = [y + h - h / 2 for x, y, w, h in row]
    return sum(centers) / len(centers)

rows.sort(key=avg_height_of_center)
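To see the ordering on concrete input, here is a self-contained sketch that wraps the same logic in a function and runs it on four made-up rectangles. The name group_cells_into_rows and the sample coordinates are assumptions for illustration; cells are taken to be (x, y, w, h) tuples such as those returned by cv2.boundingRect.

```python
def cell_in_same_row(c1, c2):
    # Cells are (x, y, w, h); c1 is in c2's row if c1's vertical center
    # falls strictly between c2's top and bottom edges.
    c1_center = c1[1] + c1[3] / 2
    return c2[1] < c1_center < c2[1] + c2[3]

def group_cells_into_rows(cells):
    """Return rows ordered top to bottom, each row ordered left to right."""
    remaining = list(cells)
    rows = []
    while remaining:
        first, rest = remaining[0], remaining[1:]
        same_row = [c for c in rest if cell_in_same_row(c, first)]
        # Order the cells of this row by their left edge.
        rows.append(sorted([first] + same_row, key=lambda c: c[0]))
        remaining = [c for c in rest if not cell_in_same_row(c, first)]
    # Order the rows by the average vertical center of their cells.
    rows.sort(key=lambda row: sum(y + h / 2 for _, y, _, h in row) / len(row))
    return rows

# A made-up 2x2 grid, given in scrambled order with slightly uneven y values.
cells = [(110, 52, 100, 40), (10, 10, 100, 40), (10, 50, 100, 40), (110, 12, 100, 40)]
for row in group_cells_into_rows(cells):
    print(row)
# [(10, 10, 100, 40), (110, 12, 100, 40)]
# [(10, 50, 100, 40), (110, 52, 100, 40)]
```

Comparing vertical centers rather than exact y values is what lets cells that are a pixel or two out of line still land in the same row.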

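23 votes

Stack Overflow user

Posted 2018-10-29 17:01:41

I suggest extracting the tables with tabula.
Pass your PDF as an argument to tabula and it returns the tables as DataFrames.
Each table in your PDF is returned as one DataFrame.
The tables come back as a list of DataFrames; to work with them you need pandas.

Here is my code for extracting from a PDF.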

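import os
import pandas as pd
import tabula

file = "filename.pdf"
path = os.path.join('enter your directory path here', file)
# With multiple_tables=True, read_pdf returns a list of DataFrames, one per table.
dfs = tabula.read_pdf(path, pages='1', multiple_tables=True)
print(dfs)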

For more details, see my repository.
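Because the result is a list, each table can be handled on its own. As a minimal sketch, assuming every item in the list offers the pandas-style to_csv(path, index=False) method, the tables could be written to separate CSV files like this. The helper name save_tables and the file naming scheme are hypothetical.

```python
import os

def save_tables(tables, out_dir, prefix="table"):
    """Write each table to its own CSV file and return the file paths.

    `tables` is assumed to be the list returned by
    tabula.read_pdf(..., multiple_tables=True); each item only needs a
    pandas-style to_csv(path, index=False) method.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, table in enumerate(tables, start=1):
        path = os.path.join(out_dir, f"{prefix}_{i}.csv")
        table.to_csv(path, index=False)  # index=False omits the row-number column
        paths.append(path)
    return paths

# Possible usage with the list from the answer's snippet:
#   save_tables(tabula.read_pdf(path, pages='1', multiple_tables=True), "tables_out")
```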

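23 votes

Stack Overflow user

Posted 2020-10-29 14:51:13

If your PDF is text-based rather than a scanned document (i.e., if you can click and drag to select the text of the table in a PDF viewer), you can use the camelot-py module: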

import camelot
tables = camelot.read_pdf('foo.pdf')

You can then choose how to save the tables (as csv, json, excel, html, or sqlite) and whether the output should be compressed into a ZIP archive.

tables.export('foo.csv', f='csv', compress=False)

Edit: tabula-py came out roughly 6 times faster than camelot-py in the single-page lattice benchmark below, so it should be used instead.

import camelot
import cProfile
import pstats
import tabula

cmd_tabula = "tabula.read_pdf('table.pdf', pages='1', lattice=True)"
prof_tabula = cProfile.Profile().run(cmd_tabula)
time_tabula = pstats.Stats(prof_tabula).total_tt

cmd_camelot = "camelot.read_pdf('table.pdf', pages='1', flavor='lattice')"
prof_camelot = cProfile.Profile().run(cmd_camelot)
time_camelot = pstats.Stats(prof_camelot).total_tt

print(time_tabula, time_camelot, time_camelot/time_tabula)

which gave

1.8495559890000015 11.057014036000016 5.978199147125147
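cProfile instruments every function call, which adds overhead of its own, so a plain wall-clock comparison can be a useful cross-check. The following is only a sketch: best_wall_time and speed_ratio are hypothetical helper names, and the sleep-based workloads stand in for the real tabula and camelot calls so that the snippet runs without either library.

```python
import time

def best_wall_time(func, repeats=3):
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

def speed_ratio(slow, fast, repeats=3):
    """How many times longer `slow` takes than `fast`."""
    return best_wall_time(slow, repeats) / best_wall_time(fast, repeats)

# Stand-in workloads. In practice these would be something like
#   lambda: tabula.read_pdf('table.pdf', pages='1', lattice=True)
#   lambda: camelot.read_pdf('table.pdf', pages='1', flavor='lattice')
fast = lambda: time.sleep(0.01)
slow = lambda: time.sleep(0.05)
print(speed_ratio(slow, fast))
```

Taking the minimum over a few repeats reduces the influence of one-off costs such as start-up and disk caching.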

2 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/47533875
