
Q: python-docx: extracting text along with heading and sub-heading numbers

Stack Overflow user

Asked 2020-08-21 18:12:02

Answers: 1 · Views: 451 · Followers: 0 · Votes: 1

I have a Word document structured as follows:

1. Heading
    1.1. Sub-heading
        (a) Sub-sub-heading

When I load the document with docx using the following code:

import docx

def getText(filename):
    # Collect the text of every paragraph, one paragraph per output line.
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText("a.docx"))

I get the following output:

Heading
Sub-heading
Sub-sub-heading

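How can I extract the heading/sub-heading numbers along with the text? I tried simplify_docx, but it only works with the standard MS Word heading styles, not with custom heading styles.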

docx

python-docx

python

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-08-21 19:15:08

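Unfortunately, the numbers are not part of the text. Word generates them itself from the heading style (Heading i), and I don't think docx exposes any way to get them.

However, you can retrieve the style/level with para.style and then read through the document to recompute the numbering scheme. This is cumbersome, though, and it does not take into account any custom styles you may be using. There may be a way to access the numbering scheme in the styles.xml part of the document, but I don't know how.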

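import docx

# Map the built-in heading style names to their outline level (1-9).
level_from_style_name = {f'Heading {i}': i for i in range(1, 10)}

def format_levels(cur_lev):
    # Drop unused (zero) counters and join the rest, e.g. 1, 2, 0, ... -> '1.2'
    levs = [str(lev) for lev in cur_lev if lev != 0]
    return '.'.join(levs)  # Customize your format here

d = docx.Document('my_doc.docx')

current_levels = [0] * 10  # one counter per level; index 0 is unused
full_text = []

for p in d.paragraphs:
    if p.style.name not in level_from_style_name:
        full_text.append(p.text)  # body text: keep as-is
    else:
        level = level_from_style_name[p.style.name]
        current_levels[level] += 1
        # A new heading restarts the numbering of every deeper level.
        for deeper in range(level + 1, 10):
            current_levels[deeper] = 0
        full_text.append(format_levels(current_levels) + ' ' + p.text)

for line in full_text:
    print(line)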

On my test document, this gives me:

Hello world
1 H1 foo
1.1 H2 bar
1.1.1 H3 baz
Paragraph are really nice !
1.1.2 H3 bibou
Something else
2 H1 foofoo
You got the drill…
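
The approach above only recognises the built-in "Heading N" style names, which is the limitation the question runs into with custom heading styles. One possible workaround, sketched below, is to follow each style's base_style chain until a built-in heading style turns up. This assumes the custom styles were created "based on" Heading 1, Heading 2, and so on, which may not hold for every document. The helper names (heading_level, to_letters, format_number, number_headings, numbered_text) are made up for this sketch and are not part of python-docx, and the "1.", "1.1.", "(a)" format is simply copied from the question rather than read from the document's numbering definitions.

```python
import re

HEADING_RE = re.compile(r"^Heading ([1-9])$")


def heading_level(style, max_depth=20):
    """Return 1-9 if `style` is, or is based on, a built-in 'Heading N' style."""
    for _ in range(max_depth):  # depth cap guards against odd inheritance chains
        if style is None:
            return None
        match = HEADING_RE.match(style.name or "")
        if match:
            return int(match.group(1))
        # Assumption: custom heading styles inherit from a built-in one.
        style = getattr(style, "base_style", None)
    return None


def to_letters(n):
    """1 -> 'a', 26 -> 'z', 27 -> 'aa' (letter repeated; a choice made for this sketch)."""
    return chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_number(counters, level):
    """Format as in the question: '1.', '1.1.', then '(a)' from level 3 down."""
    if level >= 3:
        return "(%s)" % to_letters(counters[level])
    return ".".join(str(c) for c in counters[1:level + 1]) + "."


def number_headings(items):
    """items: iterable of (level or None, text) pairs; returns the numbered lines."""
    counters = [0] * 10  # index 0 is unused
    lines = []
    for level, text in items:
        if level is None:
            lines.append(text)  # body paragraph: no number
            continue
        counters[level] += 1
        counters[level + 1:] = [0] * (9 - level)  # restart deeper levels
        lines.append(format_number(counters, level) + " " + text)
    return lines


def numbered_text(path):
    """Read a .docx file and return its text with recomputed heading numbers."""
    import docx  # python-docx, imported here so the helpers above work without it

    doc = docx.Document(path)
    return "\n".join(
        number_headings((heading_level(p.style), p.text) for p in doc.paragraphs)
    )
```

With this, print(numbered_text("a.docx")) would be the replacement for the loop above. Like the original answer, it recomputes the numbers instead of reading them, so a document whose numbering restarts, skips values, or uses a different format will not be reproduced faithfully.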

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/63520875
