Q: Removing headers and footers with Python's pdftotext module

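Stack Overflow user

Asked 2021-05-13 08:35:51

3 answers · 3.9K views · 0 followers · 2 votes

I am using the pdftotext Python package to extract text from a PDF, but I need to remove the headers and footers from the text file so that only the content is extracted.

There are two possible approaches:

Use regular expressions on the text file
Apply some filter while extracting the text from the PDF

The problem is that the headers and footers are not consistent from page to page.

For example, the first 1-2 lines of the header may contain the contractor's address, which is consistent, but the third line of the header holds the section and the topic that the page covers. Similarly, the footer includes the project number (not a fixed numeric value either), the section number, and a few design-related words, followed by a date that should be consistent (but differs per project). Note also that the PDF files can run to 500+ pages per project, though they will most likely be split by section.

I am currently using this code to extract the information. Is there a parameter I am not aware of that can be used to remove the headers and footers?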

import pdftotext

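def get_data(pdf_path):
    # Load the PDF; each element of `pdf` is the text of one page.
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)

    print("Pages:", len(pdf))

    # Write all pages to one text file, separated by blank lines.
    # The `with` blocks close both files, so no explicit close() is needed.
    with open('text-pdftotext.txt', 'w') as k:
        k.write("\n\n".join(pdf))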

get_data('specification_file.pdf')

text-extraction

pdftotext

python

ocr

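3 Answers

Stack Overflow user

Posted 2022-05-07 02:52:26

pdftotext is best used as designed, that is, as a command-line tool run from any shell.

So to remove page breaks, headers, and footers, run the command with the options it was designed to take.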

pdftotext -nopgbrk -margint <number> -marginb <number> filename

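With xpdf 4.04, this gives you the body text without the header and footer lines.

If you use the Poppler variant, you need to use these options instead: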

  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
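The crop options above can also be driven from Python. The following is a minimal sketch, not part of the original answer: the helper names `build_crop_command` and `run_crop`, and the page and margin sizes in the usage note, are assumptions for illustration. It assumes Poppler's `pdftotext` is on the PATH and that the crop values are pixels at the resolution passed with `-r` (72 DPI by default, so one pixel equals one PDF point).

```python
import subprocess

def build_crop_command(pdf_path, out_path, page_w, page_h, top, bottom, dpi=72):
    """Build a Poppler pdftotext command that keeps only the page body.

    page_w, page_h, top and bottom are given in PDF points (1/72 inch)
    and converted to pixels at `dpi`. All names here are hypothetical.
    """
    scale = dpi / 72
    return [
        "pdftotext", "-nopgbrk", "-r", str(dpi),
        "-x", "0",                                          # left edge of crop area
        "-y", str(round(top * scale)),                      # skip the header
        "-W", str(round(page_w * scale)),                   # full page width
        "-H", str(round((page_h - top - bottom) * scale)),  # stop above the footer
        pdf_path, out_path,
    ]

def run_crop(pdf_path, out_path, page_w, page_h, top, bottom, dpi=72):
    """Run the command; raises CalledProcessError if pdftotext fails."""
    cmd = build_crop_command(pdf_path, out_path, page_w, page_h, top, bottom, dpi)
    subprocess.run(cmd, check=True)
```

For a US Letter page (612 x 792 points) with an assumed one-inch header and footer, `run_crop("specification_file.pdf", "body.txt", 612, 792, 72, 72)` would crop to a 612 x 648 area starting 72 pixels from the top. The real margins have to be measured on the document itself.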

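Votes: 1

Stack Overflow user

Posted 2022-05-07 00:23:20

I had the same problem when converting an auto-generated project-planning PDF: I wanted to strip the page breaks from the text before sending the results on.

What I did was use a regular expression to match all the numbered page breaks and write out the parts of the input that do not match. Here is the complete code of a small utility script I threw together in 10 minutes: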

#!/usr/bin/env python

import sys
import re
import argparse

parser = argparse.ArgumentParser()

parser.add_argument("--infile", "-i", type=str, default=None,
                    help="input file (default: %(default)s).")
parser.add_argument("--outfile", "-o", type=str, default=None,
                    help="output file (default: %(default)s).")

parser.add_argument("--fmt", "-f", type=str, default=r"\d\n\n",
                    help="the footer search format (default: %(default)s).")

args = parser.parse_args()

try:
    # open an input file (use STDIN as default)
    fin = sys.stdin
    if args.infile:
        fin = open(args.infile,'r')

    # read in the entire file in one gulp, and close it.
    fstr = fin.read()
    fin.close()

    # open up the output file (use STDOUT as default)
    fout = sys.stdout
    if args.outfile:
        fout = open(args.outfile,'w')

    # spin through all the matches and copy the text between them
    last = 0
    for match in re.finditer(args.fmt, fstr, re.DOTALL):
        start,end = match.span()

        # write out everything before the matched string since last match.
        fout.write(fstr[last:start])
        last = end

    # write out remaining text at the end of the file and close.
    fout.write(fstr[last:])
    fout.close()

# simple exception handling for file not found, etc.
except Exception as er:
    print(er)

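I am sure others can suggest things to clean up here, such as documentation, but it works for me.

Note that, for simplicity, this script reads the input text file as a single string. That may not suit a 500-page file; you could rewrite the reader to process the input in chunks, but you would have to make sure a page break never falls on a chunk boundary. Other than that, the code provided should get you close.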
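One way to sidestep the chunk-boundary concern is to work one page at a time. The sketch below is not from the original answer: `split_pages` and `strip_footers` are hypothetical helpers, and the default pattern assumes the footer is a bare page number at the very end of each page. It relies on the pdftotext command-line tool separating pages with a form feed character (`\f`) when `-nopgbrk` is not used.

```python
import re

def split_pages(text):
    """Split pdftotext output into pages on form feeds, dropping empty pages."""
    return [page for page in text.split("\f") if page]

def strip_footers(pages, pattern=r"\d+\n*\Z"):
    """Remove a trailing footer match from each page.

    The default pattern (an assumption) matches a page number followed by
    optional newlines at the end of the page text.
    """
    footer = re.compile(pattern)
    return [footer.sub("", page) for page in pages]

sample = "Body one\n\n1\n\fBody two\n\n2\n\f"
cleaned = strip_footers(split_pages(sample))
```

Because every page is matched separately, a footer can never straddle two chunks. The default pattern will also remove a body line that happens to end the page with digits, so a real document likely needs a more specific expression.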

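Votes: 0

Stack Overflow user

Posted 2021-05-13 09:16:13

One way to solve your problem is to treat the PDF as images using the pdf2image module and extract the text from them with pytesseract. That way you can crop out the header and footer with OpenCV and keep only the core of the document. It may not be the perfect approach, though, because pdf2image's convert_from_path can take quite a long time to run.

In case you are interested, I will drop some code here.

First, make sure you have installed all the necessary dependencies, as well as Tesseract and ImageMagick. You can find installation information on their websites. If you are on Windows, there is a good Medium article on the subject.

Converting the PDF to images with pdf2image:

If you are working on Windows, do not forget to pass the poppler path. It should look like r'C:\<your_path>\poppler-21.02.0\Library\bin'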

import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder, poppler_path):
    # Render every page of the PDF to an image (a high DPI helps OCR but is slow).
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder,
                              thread_count=9, poppler_path=poppler_path)

    # Save each rendered page as page_<n>.jpg.
    for image_counter, page in enumerate(pages):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')

    # Remove the intermediate .ppm files left in output_folder.
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))

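Cropping the header and footer from the image:

I do not know the size of your header and footer, but by trying the crop a few times you should be able to find the right dimensions. Then, using OpenCV-style array slicing with new_head as the y-axis pixel just below the header and new_bottom as the y-axis pixel where the footer starts, you can crop the image to keep only the body of the PDF document.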

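import os
import cv2

def crop_img(fic, output_folder, new_head, new_bottom):
    # new_head: y pixel just below the header; new_bottom: y pixel where the footer starts.
    img = cv2.imread(fic)
    cropped = img[new_head:new_bottom, :]
    # Assumption: save under the input file's own name (the original left `name` undefined).
    name = os.path.basename(fic)
    cv2.imwrite(os.path.join(output_folder, name), cropped)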
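If the header and footer heights are roughly known in inches, the pixel rows can be estimated from the render DPI instead of being found purely by trial and error. This is a small sketch under that assumption; `crop_bounds` is a hypothetical helper, not part of OpenCV or of the answer above.

```python
def crop_bounds(page_height_px, dpi, header_in, footer_in):
    """Return (new_head, new_bottom) pixel rows for a page rendered at `dpi`.

    header_in and footer_in are the header and footer heights in inches
    (assumed values that must be measured on the real document).
    """
    new_head = round(header_in * dpi)
    new_bottom = page_height_px - round(footer_in * dpi)
    if not 0 <= new_head < new_bottom <= page_height_px:
        raise ValueError("header and footer leave no body to keep")
    return new_head, new_bottom

# An 11-inch page rendered at 500 DPI is 5500 pixels tall.
new_head, new_bottom = crop_bounds(5500, 500, header_in=1.0, footer_in=1.0)
```

The resulting values can be used as the slice bounds when cropping. The estimate is only a starting point, since headers and footers that vary in height between sections will need their own values.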

Extracting the text from the image:

Your tesseract path will look something like this: r'C:\Program Files\Tesseract-OCR\tesseract.exe'

import pytesseract
from PIL import Image

def imgtotext(img, tesseract_path):
    # Point pytesseract at the Tesseract executable, then OCR the image to a string.
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    text = str(pytesseract.image_to_string(Image.open(img)))
    # Rejoin words that were hyphenated across a line break.
    text = text.replace('-\n', '')

    return text

Votes: -1

Original content provided by Stack Overflow. Translation support provided by Tencent Cloud.

Original link:

https://stackoverflow.com/questions/67516273
