Working with PDFs Using pdfplumber

小田测测看

Published 2026-06-17 17:18:48


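Compared with similar libraries such as PyPDF2 and PyMuPDF, pdfplumber's main strength is its fine-grained handling of text and tables. Beyond extracting the text itself, it exposes metadata such as each piece of text's position coordinates and font attributes, which is essential when you need to analyze a PDF's layout structure.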

Installation

pip install pdfplumber

Text extraction

1. Full-page text extraction

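import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        # Extract the whole page's text, then split it into lines
        content = page.extract_text()
        if content:  # Skip pages with no extractable text (e.g. blank pages)
            for line in content.split("\n"):
                print(line.strip())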
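
When the lines are needed for later processing rather than printing, the same loop can be wrapped in a small generator. This is only a sketch: iter_clean_lines is a hypothetical helper name, and it takes a list of already-extracted page texts (one extract_text() result per page) so that it runs without a PDF at hand.

```python
from typing import Iterable, Iterator, Optional, Tuple


def iter_clean_lines(page_texts: Iterable[Optional[str]]) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, line) pairs, skipping blank pages and blank lines.

    page_texts is assumed to hold one extract_text() result per page;
    page numbers start at 1.
    """
    for page_number, content in enumerate(page_texts, start=1):
        if not content:  # None or "" means nothing extractable on this page
            continue
        for line in content.split("\n"):
            stripped = line.strip()
            if stripped:
                yield page_number, stripped


# Stand-in data; with a real file this would be
# [page.extract_text() for page in pdf.pages]
sample_texts = ["Title \n\n first line", None, "  second page line  "]
lines = list(iter_clean_lines(sample_texts))
```

Keeping the page number next to each line makes it easier to trace a value back to where it came from in the document.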

2. Text extraction with formatting information

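To get metadata such as text position and font, use the extract_words() method. By default each word carries only its text and position; pass extra_attrs to include the font name and size as well:

import pdfplumber

with pdfplumber.open("formatted_doc.pdf") as pdf:
    page = pdf.pages[0]
    # Get the list of words; extra_attrs adds font name and size to each word dict
    words = page.extract_words(extra_attrs=["fontname", "size"])
    for word in words:
        print(f"Text: {word['text']}")
        print(f"Position: left {word['x0']:.1f}, top {word['top']:.1f}, right {word['x1']:.1f}, bottom {word['bottom']:.1f}")
        print(f"Font: {word['fontname']}, size: {word['size']:.1f}pt\n")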

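The output can be used to identify structural elements such as headings (typically set in a bold, larger font) and body text, which provides a basis for automatic document classification.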
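
One possible way to turn that idea into code is to treat the most common font size on the page as the body size and flag anything noticeably larger. The sketch below rests on assumptions: find_heading_words and the 1.2 ratio are illustrative choices, not part of pdfplumber, and the word dicts are expected to carry a "size" key, which extract_words() only provides when called with extra_attrs=["fontname", "size"].

```python
from collections import Counter
from typing import Dict, List


def find_heading_words(words: List[Dict], ratio: float = 1.2) -> List[Dict]:
    """Return words whose font size is clearly larger than the body size.

    The body size is taken to be the most common (rounded) size on the page;
    the 1.2 ratio is an arbitrary threshold chosen for illustration.
    """
    if not words:
        return []
    sizes = Counter(round(w["size"], 1) for w in words)
    body_size = sizes.most_common(1)[0][0]
    return [w for w in words if w["size"] >= body_size * ratio]


# Stand-in for page.extract_words(extra_attrs=["fontname", "size"])
sample_words = [
    {"text": "Annual", "size": 18.0, "fontname": "Helvetica-Bold"},
    {"text": "Report", "size": 18.0, "fontname": "Helvetica-Bold"},
    {"text": "Revenue", "size": 10.5, "fontname": "Helvetica"},
    {"text": "grew", "size": 10.5, "fontname": "Helvetica"},
    {"text": "steadily", "size": 10.5, "fontname": "Helvetica"},
]
headings = [w["text"] for w in find_heading_words(sample_words)]
```

Font names vary between documents, so a size-based rule is usually a safer first pass than matching on "Bold" in the font name.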

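3. Region-based text extraction

To extract information from a specific region (for example, the amount field on an invoice), crop the page by coordinates: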

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    # Define the region as (x0, top, x1, bottom) in points (1 inch = 72 points)
    amount_area = (300, 400, 500, 420)  # Assume the amount lies within this rectangle
    # Crop the page and extract the text
    cropped = page.crop(amount_area)
    print("Invoice amount:", cropped.extract_text())
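
In recent pdfplumber versions, page.crop() by default raises an error when the bounding box extends beyond the page, so hand-measured coordinates are worth clamping first. The following is a minimal sketch under that assumption; inches_to_bbox and clamp_bbox are hypothetical helper names, and the page size is hard-coded where real code would read page.bbox.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom) in points

POINTS_PER_INCH = 72


def inches_to_bbox(x0: float, top: float, x1: float, bottom: float) -> BBox:
    """Convert a rectangle measured in inches to points."""
    return (
        x0 * POINTS_PER_INCH,
        top * POINTS_PER_INCH,
        x1 * POINTS_PER_INCH,
        bottom * POINTS_PER_INCH,
    )


def clamp_bbox(bbox: BBox, page_bbox: BBox) -> BBox:
    """Shrink bbox so that it lies within page_bbox."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


# A US Letter page is 612 x 792 points; with a real page use page.bbox
letter_page = (0, 0, 612, 792)
area = clamp_bbox(inches_to_bbox(7.0, 5.5, 9.0, 6.0), letter_page)
```

The clamped tuple can then be passed to page.crop() in place of the raw coordinates.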

Table extraction

1. Basic table extraction

For tables with clear borders, extract_table() can detect them directly:

import pdfplumber

with pdfplumber.open("financial_report.pdf") as pdf:
    page = pdf.pages[1]  # Second page (pages are zero-indexed)
    table = page.extract_table()
    # Print the header row and the first 3 data rows
    if table:
        print("Header:", table[0])
        for row in table[1:4]:
            print("Row:", row)

The result is a two-dimensional list (a list of rows), which can be converted directly into a DataFrame for analysis:

import pandas as pd
df = pd.DataFrame(table[1:], columns=table[0])  # First row becomes the column names; the rest is data
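
Cells in an extracted table can be None (empty cells) or contain embedded line breaks, which tends to cause trouble downstream. A minimal cleanup sketch follows; clean_cell and table_to_records are hypothetical helper names, and the sample table stands in for a real extract_table() result.

```python
from typing import Dict, List, Optional

Table = List[List[Optional[str]]]


def clean_cell(cell: Optional[str]) -> str:
    """Turn None into an empty string and collapse internal whitespace."""
    if cell is None:
        return ""
    return " ".join(cell.split())


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Convert a header-first table into a list of {column: value} dicts."""
    if not table:
        return []
    header = [clean_cell(c) for c in table[0]]
    return [dict(zip(header, (clean_cell(c) for c in row))) for row in table[1:]]


# Stand-in for the result of page.extract_table()
sample_table = [
    ["Item", "Unit\nprice", "Qty"],
    ["Widget", " 9.50 ", None],
    ["Gadget", "12.00", "3"],
]
records = table_to_records(sample_table)
```

The resulting list of dicts can be passed straight to pd.DataFrame(records) if pandas is available.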

2. Handling complex tables

Borderless tables, or tables with incomplete lines, need a custom splitting strategy:

import pdfplumber

table_settings = {
    "vertical_strategy": "text",  # Infer column boundaries from text alignment
    "horizontal_strategy": "text",  # Infer row boundaries from text alignment
    "intersection_y_tolerance": 5,  # Allow up to 5 points of vertical deviation
    "intersection_x_tolerance": 5   # Allow up to 5 points of horizontal deviation
}

with pdfplumber.open("complex_table.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table(table_settings)

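Parameter notes:

• vertical_strategy: "lines" (rely on drawn lines) or "text" (rely on text alignment); pdfplumber also accepts "lines_strict" and "explicit"
• horizontal_strategy: same options as above
• Tolerance parameters: compensate for imprecisely aligned lines or offset text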
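
When the column positions are already known, the "explicit" strategy together with the explicit_vertical_lines setting is an alternative to text-based inference. The sketch below only builds the settings dict; build_table_settings is a hypothetical helper, and the coordinates are made-up values for illustration.

```python
from typing import Dict, List, Optional


def build_table_settings(
    column_xs: Optional[List[float]] = None, tolerance: float = 5
) -> Dict:
    """Build a table_settings dict for extract_table().

    If column_xs (x coordinates of column boundaries) is given, use the
    "explicit" vertical strategy; otherwise infer columns from text alignment.
    """
    settings = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "intersection_x_tolerance": tolerance,
        "intersection_y_tolerance": tolerance,
    }
    if column_xs:
        settings["vertical_strategy"] = "explicit"
        settings["explicit_vertical_lines"] = sorted(column_xs)
    return settings


text_based = build_table_settings()
explicit = build_table_settings(column_xs=[300, 50, 180, 420])
```

Either dict can be passed as the argument to page.extract_table(); which one works better depends on the document, so it is worth comparing both against the visual debugger output.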

3. Merging tables that span pages

When a table continues across several pages, skip the repeated header on the later pages:

import pdfplumber

full_table = []
with pdfplumber.open("multi_page_table.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        if not table:
            continue
        # Keep the header of the first table found; skip it on later pages
        if not full_table:
            full_table.extend(table)
        else:
            full_table.extend(table[1:])  # Append from the second row onward
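
Dropping the first row of every later page assumes the header is repeated on each page. If some continuation pages carry no header, a data row would be lost. A slightly more defensive sketch compares each page's first row with the header before skipping it; merge_page_tables is a hypothetical helper name, and the sample list stands in for one extract_table() result per page.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def merge_page_tables(tables: List[Optional[Table]]) -> Table:
    """Merge per-page tables, dropping a page's first row only if it repeats the header."""
    merged: Table = []
    header = None
    for table in tables:
        if not table:  # page without a table
            continue
        if header is None:
            header = table[0]
            merged.extend(table)
        elif table[0] == header:
            merged.extend(table[1:])  # repeated header: skip it
        else:
            merged.extend(table)  # continuation without a header: keep every row
    return merged


# Stand-in for [page.extract_table() for page in pdf.pages]
pages = [
    None,
    [["Name", "Amount"], ["A", "1"]],
    [["Name", "Amount"], ["B", "2"]],
    [["C", "3"]],
]
merged = merge_page_tables(pages)
```

An exact comparison is strict; if headers differ slightly between pages (extra whitespace, for instance), normalizing the cells before comparing would be a reasonable extension.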

Visual debugging

Render the page as an image to verify what the table finder detected:

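import pdfplumber

with pdfplumber.open("debug_target.pdf") as pdf:
    page = pdf.pages[0]
    # Render the page as a pdfplumber PageImage object
    im = page.to_image()
    # Overlay the table lines detected by the table finder
    im.debug_tablefinder()
    # Save the image for inspection
    im.save("table_debug.png")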

This feature depends on the Pillow library. The visual feedback makes it easier to tune table-extraction parameters.

Extracting graphic elements

Extract lines, rectangles, and other graphic elements from a PDF:

import pdfplumber

with pdfplumber.open("engineering_drawing.pdf") as pdf:
    page = pdf.pages[0]
    # Get each kind of graphic element
    lines = page.lines       # straight lines
    rects = page.rects       # rectangles
    curves = page.curves     # curves

    print(f"Found {len(lines)} lines and {len(rects)} rectangles")

    # Keep only horizontal lines (negligible change in y)
    horizontal_lines = [line for line in lines
                        if abs(line["y0"] - line["y1"]) < 1]
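
The same tolerance test extends naturally to vertical lines, and the coordinates also give each line's length. A small sketch working on plain dicts follows; split_by_orientation and total_length are hypothetical helpers, and the sample data stands in for page.lines (only the x0, y0, x1, y1 keys are used).

```python
from typing import Dict, List, Tuple


def split_by_orientation(
    lines: List[Dict], tol: float = 1.0
) -> Tuple[List[Dict], List[Dict]]:
    """Split line dicts into (horizontal, vertical) using a coordinate tolerance."""
    horizontal = [ln for ln in lines if abs(ln["y0"] - ln["y1"]) < tol]
    vertical = [ln for ln in lines if abs(ln["x0"] - ln["x1"]) < tol]
    return horizontal, vertical


def total_length(lines: List[Dict]) -> float:
    """Sum the Euclidean lengths of the given lines, in points."""
    return sum(
        ((ln["x1"] - ln["x0"]) ** 2 + (ln["y1"] - ln["y0"]) ** 2) ** 0.5
        for ln in lines
    )


# Stand-in for page.lines; only the coordinate keys are used here
sample_lines = [
    {"x0": 50, "y0": 700, "x1": 550, "y1": 700.2},  # horizontal
    {"x0": 50, "y0": 100, "x1": 50, "y1": 700},     # vertical
    {"x0": 50, "y0": 100, "x1": 550, "y1": 700},    # diagonal
]
horizontal, vertical = split_by_orientation(sample_lines)
```

Diagonal lines fall into neither list, which is usually what you want when looking for table rules or frame borders.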

Custom extraction logic

For PDFs with unusual formatting, you can build an extraction method on top of character-level data:

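def custom_extractor(page):
    # Get every character along with its position information
    chars = page.chars
    # Group characters by vertical position (one group per line)
    lines = {}
    for char in chars:
        # Round the top coordinate and use it as the line key to absorb tiny offsets
        line_key = round(char["top"], 1)
        if line_key not in lines:
            lines[line_key] = []
        lines[line_key].append(char)

    # Sort by horizontal position and join the text
    result = []
    for y in sorted(lines.keys()):
        # Sort the characters by x coordinate
        line_chars = sorted(lines[y], key=lambda c: c["x0"])
        # Join them into the full line of text
        line_text = "".join([c["text"] for c in line_chars])
        result.append(line_text)
    return "\n".join(result)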
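
Rounding to one decimal place can still split a line when two characters straddle a rounding boundary (top values of 100.04 and 100.06 round to 100.0 and 100.1). One alternative, sketched here as an assumption rather than a recommended method, is to cluster characters whose top values lie within a tolerance of each other. group_chars_into_lines and the 2-point default are illustrative choices, and the sample data stands in for page.chars.

```python
from typing import Dict, List


def group_chars_into_lines(chars: List[Dict], y_tolerance: float = 2.0) -> List[str]:
    """Cluster chars into lines by proximity of their "top" value, then join left to right."""
    rows: List[List[Dict]] = []
    for char in sorted(chars, key=lambda c: c["top"]):
        # Stay in the current row if this char is within y_tolerance
        # of the row's first char; otherwise start a new row
        if rows and char["top"] - rows[-1][0]["top"] <= y_tolerance:
            rows[-1].append(char)
        else:
            rows.append([char])
    return [
        "".join(c["text"] for c in sorted(row, key=lambda c: c["x0"]))
        for row in rows
    ]


# Stand-in for page.chars; only "text", "x0" and "top" are used
sample_chars = [
    {"text": "i", "x0": 20, "top": 100.06},
    {"text": "H", "x0": 10, "top": 100.04},
    {"text": "k", "x0": 20, "top": 115.0},
    {"text": "O", "x0": 10, "top": 115.3},
]
lines = group_chars_into_lines(sample_chars)
```

The tolerance should stay well below the line spacing of the document, otherwise adjacent lines will be merged.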

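This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2025-07-07. In case of infringement, contact cloudcommunity@tencent.com to request removal.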
