Q: Best way to replace sentences/paragraphs with a string in Python

Stack Overflow user

Asked 2018-06-09 23:53:14

Answers: 1, Views: 1.6K, Followers: 0, Votes: 2

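How can I replace every sentence and paragraph in a text file with a <string> token?

I want the spacing, tabs, and list markers in the text document to stay unchanged:

Example input: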

Clause 1:

  a) detail 1. some more about detail 1. Here is more information about this paragraph right here. There is more information that we think sometimes.

  b) detail 2. some more about detail 2. and some more..

Example output:

<string>

  a) <string>

  b) <string>

python

text

nlp

text-processing

text-parsing

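1 Answer

Stack Overflow user

Accepted answer

Posted 2018-06-12 08:39:03

I don't know if this is the best way, but it's fairly simple and easy to modify. It handles the example in the problem statement, and most of the examples in the comments.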

import sys, re

text = sys.stdin.read()

# A pattern expressing the parts of the input that we want to preserve:
keeper_pattern = r'''(?x)  # verbose format

    (   # We put parens around the whole pattern
        # (and use ?: for subgroups)
        # so that when we use it as the splitter-pattern for re.split(),
        # the result contains one string for each occurrence of the pattern
        # (in addition to the usual between-splitter strings).

                    # The main thing we want to keep is paragraph-separators,
                    # and the 'lead' of the line that follows a para-sep:
                    #
        \n{2,}      # two or more newlines, followed by
        \x20*       # optional indentation (zero or more spaces), followed by
        (?:         # an optional item-marker, which is
          (?:         #   either
            \d+ \.    #       digits followed by a dot,
            |         #   or
            [a-z] \)  #       a letter followed by a right-paren,
          )           #   followed by
          \x20+       #   one or more spaces.
        )?

        |
                    # The other thing we want to keep is
                    # item-markers within paragraphs:
                    #
        \( i+ \)    # a lower-case Roman numeral between parens
                    # (generalize as necessary)
    )
'''

for (i, chunk) in enumerate(re.split(keeper_pattern, text)):

    # In the result of re.split(),
    # the splitters (keepers) will be in the odd positions.
    is_keeper = (i % 2 == 1)

    if is_keeper:
        if chunk.startswith('\n'):
            # paragraph-separator etc
            replacement = chunk
        else:
            # within-para item-marker
            replacement = ' ' + chunk + ' '
    else:
        if chunk == '':
            # (happens if two keepers are adjacent)
            replacement = ''
        else:
            # everything else
            replacement = '<string>'

    sys.stdout.write(replacement)
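
For experimenting without stdin, here is a minimal sketch of the same idea wrapped in a function. The names `KEEPER`, `mask_text`, and `token` are my own and not part of the answer above, and the pattern is a condensed copy of `keeper_pattern`, so treat it as an illustration rather than a drop-in replacement. It relies on the documented behaviour of `re.split()`: when the splitter pattern has a capturing group, the captured text is returned between the ordinary pieces, so keepers land at the odd indices.

```python
import re

# Condensed copy of the answer's keeper pattern (assumed equivalent for these inputs).
KEEPER = re.compile(r'''(?x)
    (
        \n{2,} \x20*                           # paragraph separator plus indentation
        (?: (?: \d+ \. | [a-z] \) ) \x20+ )?   # optional item marker such as "1. " or "a) "
      |
        \( i+ \)                               # inline marker such as (i) or (ii)
    )
''')


def mask_text(text, token='<string>'):
    """Replace every non-keeper chunk of text with token (hypothetical helper)."""
    out = []
    for i, chunk in enumerate(KEEPER.split(text)):
        if i % 2 == 1:
            # Keeper: paragraph separators pass through unchanged,
            # inline markers get a space on each side.
            out.append(chunk if chunk.startswith('\n') else ' ' + chunk + ' ')
        elif chunk:
            # Non-empty ordinary text collapses to the token;
            # empty chunks (two adjacent keepers) are dropped.
            out.append(token)
    return ''.join(out)


# With a capturing group, the splitters are kept at the odd indices:
pieces = re.split(r'(,)', 'a,,b')

sample = ("Clause 1:\n\n  a) detail 1. some more about detail 1."
          "\n\n  b) detail 2. and some more..")
masked = mask_text(sample)
print(masked)
```

Note that any text that is not a keeper, including a trailing newline at the end of the input, becomes part of a `<string>` chunk, so you may want to strip or re-append the final newline yourself.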

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/50776019
