
Using regex to replace badly formatted questionnaires in documents

Code Review user
Asked 2022-04-27 23:14:58
Answers: 1 · Views: 50 · Followers: 0 · Votes: 3

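I have rather badly formatted questionnaires (that is, ordered lists) in a bunch of documents. I want to clean them up and replace the current versions with the cleaned-up ones.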

Example text

STUDY: A trial of Passy-Muir valve was completed to allow the patient to achieve hands-free voicing and also to improve his secretion management. A clinical swallow evaluation was not completed due to the severity of the patient's mucus and lack of saliva control.

The patient's laryngeal area was palpated during a dry swallow and he does have significantly reduced laryngeal elevation and radiation fibrosis. The further evaluate of his swallowing function is safety; a modified barium swallow study needs to be concluded to objectively evaluate his swallow safety, and to rule out aspiration. A trial of neuromuscular electrical stimulation therapy was completed to determine if this therapy protocol will be beneficial and improving the patient's swallowing function and safety.\n01 Do you have preexisting      No\nconditions?\n02 Within the past 12 months I worried about          Never True\nmy health would get worse.\n03 Within the past 12 months I have had         Never True\nhigh blood pressure.\n04 What is your housing situation today?   I have housing\n05 How many times have you moved in the past 12        Zero (I did not move)\nmonths?\n06 Are you worried that in the next 2 months, you may not    No\nhave your own housing to live in?\n07 Do you have trouble paying your heating or electricity    No\nbill?\n08 Do you have trouble paying for medicines?                 No\n09 Are you currently unemployed and looking for work?        No\n10 Are you interested in more education?                     Yes\n\nFor his neuromuscular electrical stimulation therapy, the type was BMR with a single mode cycle time is 4 seconds and 12 seconds off with frequency was 60 __________ with a ramp of 2 seconds, phase duration was 220 with an output of 99 milliamps. Electrodes were placed on the suprahyoid/submandibular triangle with an upright body position, trial length was 10 minutes.\n On a pain scale, the patient reported no pain with the electrical stimulation therapy.

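With some help from Amadan I wrote this implementation, which breaks the questionnaire into individual questions and processes each one separately. I have not yet worked out the code that identifies the whole questionnaire and then replaces it with the cleaned-up version: the last question is the one I don't know how to include, because I don't know what follows it. Also, a questionnaire may have a different number of questions, but here I simply hard-coded the number as 10.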

Code

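import re

# `text` is assumed to hold the example text shown above.
question_break_re = re.compile("\n(?=\\d{2} )")    # newline followed by "NN "
answer_re = re.compile("\\s{2,}([^\n]+)")          # run of 2+ whitespace, then the answer
whitespace_re = re.compile("\\s+")
end_of_question_mark_re = re.compile(r"(?:\?|\.)?$")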

def tidy_up_question(question):
    answer = None
    match = answer_re.search(question)
    if match:
        answer = match.group(1)
        question = question[:match.start(0)] + question[match.end(0):]
    question = whitespace_re.sub(' ', question).strip()
    if answer is not None:
        question = end_of_question_mark_re.sub(f": {answer}", question, count=1)
    return question+"\n"

q_n_a = re.findall(r"\n01[\s\S]*\n(?=10)", text)[0]
qlist = [
    tidy_up_question(question)
    for question in question_break_re.split(q_n_a)
    if question.strip()
]

print(text.replace(q_n_a, '\n'.join(qlist)))

Output

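STUDY: A trial of Passy-Muir valve was completed to allow the patient to achieve hands-free voicing and also to improve his secretion management. A clinical swallow evaluation was not completed due to the severity of the patient's mucus and lack of saliva control.

The patient's laryngeal area was palpated during a dry swallow and he does have significantly reduced laryngeal elevation and radiation fibrosis. The further evaluate of his swallowing function is safety; a modified barium swallow study needs to be concluded to objectively evaluate his swallow safety, and to rule out aspiration. A trial of neuromuscular electrical stimulation therapy was completed to determine if this therapy protocol will be beneficial and improving the patient's swallowing function and safety.01 Do you have preexisting conditions: No\n\n02 Within the past 12 months I worried about my health would get worse: Never True\n\n03 Within the past 12 months I have had high blood pressure: Never True\n\n04 What is your housing situation today: I have housing\n\n05 How many times have you moved in the past 12 months: Zero (I did not move)\n\n06 Are you worried that in the next 2 months, you may not have your own housing to live in: No\n\n07 Do you have trouble paying your heating or electricity bill: No\n\n08 Do you have trouble paying for medicines: No\n\n09 Are you currently unemployed and looking for work: No\n10 Are you interested in more education?                     Yes\n\nFor his neuromuscular electrical stimulation therapy, the type was BMR with a single mode cycle time is 4 seconds and 12 seconds off with frequency was 60 __________ with a ramp of 2 seconds, phase duration was 220 with an output of 99 milliamps. Electrodes were placed on the suprahyoid/submandibular triangle with an upright body position, trial length was 10 minutes.\n On a pain scale, the patient reported no pain with the electrical stimulation therapy.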

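It works! However, I feel the implementation has too many steps and is probably not efficient enough. I wonder whether I could use re.sub() to identify each questionnaire item/question and replace it with a cleaned-up version, something like re.sub(r"\\n(\d{2} ).*\\n(?=\d{2} )", lambda m: tidy_up_question(m.group()), text), but of course this does not work yet. Is this possible?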

Questions

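  1. Can I identify and replace each question (or rather, each item in an ordered list) with a single re.sub or some other function?
  2. Can I do this efficiently on a much larger text?
  3. Are there other possible improvements that would make it faster, and perhaps let it recognize questionnaires with a varying number of questions?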

1 Answer

Code Review user

Answered 2022-07-13 06:17:14

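Based on the example text, it looks like each question starts with a two-digit number at the beginning of a line and ends just before the next question or at a blank line. A regex pattern like this captures that: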

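import re

question_re = re.compile(r"""
    ^(?P<number>\d\d)      # two digits, but only at start of a line
    \s
    (?P<question>.*?)      # match anything until
    (?=\n\d\d|\n\n)        #   the next question or a blank line
    """,
    # re.MULTILINE is required so that ^ matches at the start of every line,
    # not only at the start of the string.
    re.VERBOSE | re.DOTALL | re.MULTILINE)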
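
A quick way to check a pattern like this is to run it over a small sample with finditer() and look at what each named group captured. The sketch below assumes the text contains real newline characters and that re.MULTILINE is set so ^ can match at the start of every line; the sample string is made up for illustration and is not the full text from the question.

```python
import re

# Same shape as the pattern above; re.MULTILINE is included here so that
# ^ matches at the start of every line, not only at the start of the string.
question_re = re.compile(r"""
    ^(?P<number>\d\d)   # two digits at the start of a line
    \s
    (?P<question>.*?)   # lazily match anything until
    (?=\n\d\d|\n\n)     #   the next question or a blank line
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE)

# Hypothetical sample, shortened from the questionnaire in the question.
sample = (
    "Some intro text.\n"
    "01 Do you have preexisting      No\n"
    "conditions?\n"
    "08 Do you have trouble paying for medicines?                 No\n"
    "\n"
    "Text after the questionnaire.\n"
)

# Collect (number, raw question text) for every item the pattern finds.
found = [(m["number"], m["question"]) for m in question_re.finditer(sample)]
for number, question in found:
    print(number, repr(question))
```

Each captured question still contains the answer and any wrapped continuation line, which is why a separate rearranging step is needed.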

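Then use Pattern.sub(repl, string), where repl is a function that returns the replacement string. In this case, repl would be a function that rearranges the text of the question. For example: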

def rearrange(match):
    """Presumes that the question and answer are on one or more lines. The first
    part of the question and the answer are on the first line separated by a run
    of 2+ spaces. The rest of the question, if any, follows in succeeding lines."""
    # Split off everything after the first run of 2+ whitespace characters.
    question, answer = re.split(r"\s{2,}", match['question'], maxsplit=1)
    # The first line of the remainder is the answer; the rest continues the question.
    answer, *rest = answer.split('\n')
    return f"\n{match['number']} {question} {' '.join(rest)}  {answer}."

Python 3.6 added a __getitem__() method to match objects, so you can write match['question'] instead of match.group('question').
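
As a small illustration of the two access styles, the sketch below runs a simplified one-line pattern (an assumption for this example, not the pattern from the answer) against a single question and shows that subscript access and group() return the same value.

```python
import re

# Simplified, single-line pattern used only for this illustration.
m = re.search(r"(?P<number>\d\d)\s(?P<question>.+)",
              "04 What is your housing situation today?")
assert m is not None

print(m["number"])            # subscript access, available since Python 3.6
print(m.group("question"))    # equivalent older spelling

# Both spellings return the same captured text.
same = m["question"] == m.group("question")
```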

With those two pieces in place, the replacement becomes a one-liner.

reformatted_text = question_re.sub(rearrange, text)
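
Putting the pieces together, a self-contained sketch might look like the following. It assumes real newline characters in the text and re.MULTILINE on the pattern. The sample text is made up and shortened. Two details are illustrative choices rather than part of the answer: items without an answer column are returned unchanged instead of raising an error, and the replacement does not start with an extra newline, so no blank lines are inserted between questions.

```python
import re

question_re = re.compile(r"""
    ^(?P<number>\d\d)   # two digits at the start of a line
    \s
    (?P<question>.*?)   # lazily match anything until
    (?=\n\d\d|\n\n)     #   the next question or a blank line
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE)


def rearrange(match):
    """Join a wrapped question and move its answer to the end of the line."""
    parts = re.split(r"\s{2,}", match["question"], maxsplit=1)
    if len(parts) < 2:
        # Assumption: an item without an answer column is left unchanged.
        return match.group(0)
    question, answer = parts
    # First line of the remainder is the answer; later lines continue the question.
    answer, *rest = answer.split("\n")
    full_question = " ".join([question, *rest])
    return f"{match['number']} {full_question}  {answer}."


# Hypothetical sample, shortened from the questionnaire in the question.
text = (
    "Intro paragraph.\n"
    "01 Do you have preexisting      No\n"
    "conditions?\n"
    "08 Do you have trouble paying for medicines?                 No\n"
    "10 Are you interested in more education?                     Yes\n"
    "\n"
    "Closing paragraph.\n"
)

reformatted_text = question_re.sub(rearrange, text)
print(reformatted_text)
```

Because the pattern stops at the next numbered line or at a blank line, it does not need to know how many questions the questionnaire contains, and the last item is picked up as long as a blank line follows it.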
Votes: 2
Original page content provided by Code Review.
Original link:

https://codereview.stackexchange.com/questions/276107
