Python regex - select the words after a pattern

Stack Overflow user
Asked 2021-10-28 18:47:20
2 answers, 55 views, 0 followers, 0 votes

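Text:

3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.

Question:

The text consists of sections 3, 5, 25 and 38 (each section begins with its index number). For each section, I want to extract all the text after " - Comments:" and before the index number that starts the next section.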

def comments(x):
    result = []
    for elem in df['Violations']:
        matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', elem)
        result.extend(matches)
    print(result)

The code above does exactly the opposite: it only extracts the words before " - Comments:". How can I change it?

Thank you very much

2 Answers

Stack Overflow user

Accepted answer

Posted 2021-10-28 19:31:32

If you want the text between "Comments:" and "|", then use those values in the regex.

'Comments: ([^\|]*) \|'

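It uses () to capture only the characters between "Comments:" and "|", and [^\|] restricts the captured characters to anything other than |.

Because | has a special meaning in regex, I use \| to match it as an ordinary character in the text.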
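
As a small illustration of the escaping point (the sample string is made up for this sketch and does not come from the question), an unescaped | splits the pattern into alternatives, while \| or [|] matches a literal pipe. This is also why the backslash in [^\|] is optional in Python: [^|] behaves the same.

```python
import re

sample = "x|y"  # hypothetical sample text

# Unescaped: '|' means "x OR y", so each letter matches on its own.
alternation = re.findall(r"x|y", sample)

# Escaped: '\|' is a literal pipe, so the whole string matches once.
escaped = re.findall(r"x\|y", sample)

# Inside a character class the pipe is already literal, so [|] works too.
in_class = re.findall(r"x[|]y", sample)

print(alternation)  # ['x', 'y']
print(escaped)      # ['x|y']
print(in_class)     # ['x|y']
```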

'Comments: (.*?) \|'

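Here the ? makes .* lazy (non-greedy), so the group stops at the first " |" instead of running on to the last one.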
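
To make the difference concrete, here is a minimal sketch comparing the lazy pattern, a greedy variant (not part of the answer, added only for contrast), and the negated character class. The shortened sample string is hypothetical but follows the same layout as the question's text.

```python
import re

# Hypothetical shortened sample in the "title - Comments: text | next" layout.
sample = "3. A - Comments: one. | 5. B - Comments: two. | 38."

# Lazy: stop at the first " |" after each "Comments: ".
lazy = re.findall(r"Comments: (.*?) \|", sample)

# Greedy (for contrast only): runs to the LAST " |" in the string.
greedy = re.findall(r"Comments: (.*) \|", sample)

# Negated class: cannot cross a "|" at all, so it behaves like the lazy form.
negated = re.findall(r"Comments: ([^|]*) \|", sample)

print(lazy)     # ['one.', 'two.']
print(greedy)   # ['one. | 5. B - Comments: two.']
print(negated)  # ['one.', 'two.']
```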

import re

elem = '''MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.'''

# Raw strings keep the backslash in \| from being read as a string escape.
#matches = re.findall(r'Comments: ([^\|]*) \|', elem)
matches = re.findall(r'Comments: (.*?) \|', elem)

#print(matches)

for item in matches:
    print(item)
    print('---')

Result:

NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.
---
NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.
---
MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.
---
Votes: 1

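Stack Overflow user

Posted 2021-10-28 21:47:52

Your pattern captures as little text as possible in the group, stopping before " - ", a newline, or the end of the string, and no part of it matches "Comments:".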
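
A quick check of that behaviour, as a sketch on a shortened hypothetical string in the same layout (titles and notes are placeholders): the question's pattern returns the section titles, not the comments.

```python
import re

# Hypothetical shortened sample; TITLE/NOTE are placeholders.
sample = "3. TITLE A - Comments: NOTE A. | 5. TITLE B - Comments: NOTE B. | 38."

# The pattern from the question: the group ends at the first " - ",
# which is exactly where "Comments:" begins.
original = re.findall(r"\d+\. (.*?)(?: - |\r?\n|$)", sample)

print(original)  # ['TITLE A', 'TITLE B']
```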

You can change it by matching "Comments:" and adding a capture group for the text that follows it.

\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)

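Regex demo

A more precise match could be to match the start of each text, which is digits, a dot and a space, and then match until the first occurrence of "- Comments:" without crossing the start of another text.

After "Comments:", you can use a capture group to capture up to the next occurrence of " |", or assert the end of the string (if it is the last text).

Using re.findall will return the values of capture group 1.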

\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)

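The pattern matches:

  • \b A word boundary to prevent a partial word match
  • \d+\. followed by a space: match 1+ digits, a dot and a space
  • (?:(?!\d+\. |- Comments:).)* Match any character, as long as what is directly to the right is neither the start of another text (\d+\. and a space) nor "- Comments:"
  • - Comments:\s* Match "- Comments:" followed by optional whitespace characters
  • (.*?) Capture group 1, match any character, as few as possible
  • (?: \||$) Match either " |" or assert the end of the string

Regex demo | Python demo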
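
One way to see what "without crossing the start of another text" buys you is the sketch below. It assumes a hypothetical input in which section 3 has no "- Comments:" part, and it adds an extra capture group around the digits (not in the patterns above) so the matched section number becomes visible.

```python
import re

# Hypothetical input: section 3 has no "- Comments:" part at all.
sample = "3. TITLE A | 5. TITLE B - Comments: NOTE B. | 38. "

# Both patterns from this answer, with an added group around the digits
# (an assumption of this sketch) to expose which section number matched.
simple = r"(\d+)\. .*?(?: - Comments:\s*)(.*?)(?: \||$)"
tempered = r"\b(\d+)\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"

simple_hits = re.findall(simple, sample)
tempered_hits = re.findall(tempered, sample)

# The simple pattern runs from "3. " across "5. " to reach the comment,
# so the comment is attributed to the wrong section.
print(simple_hits)    # [('3', 'NOTE B.')]

# The tempered pattern refuses to cross "5. ", so the match starts there.
print(tempered_hits)  # [('5', 'NOTE B.')]
```

When only group 1 is needed and every section is known to contain "- Comments:", both patterns return the same comments. The difference matters mainly when the match start or the section number is used.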

Example

import re

regex = r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"

s = "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.  | 38. "

print(re.findall(regex, s))

Output

[
'NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.', 
'NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.', 
'MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. '
]
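
To connect this back to the loop in the question, the pattern could be wrapped in a helper along these lines. This is only a sketch: extract_comments and COMMENT_RE are hypothetical names, the rows are shortened placeholders, and skipping non-string values is an assumption about how missing cells should be handled.

```python
import re

# Pattern from the answer above, compiled once because it is reused per row.
COMMENT_RE = re.compile(
    r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"
)

def extract_comments(violations):
    """Collect the comment text from every string in `violations`."""
    result = []
    for elem in violations:
        if not isinstance(elem, str):  # assumption: skip missing cells (None/NaN)
            continue
        # strip() drops stray whitespace left before the " |" separator.
        result.extend(comment.strip() for comment in COMMENT_RE.findall(elem))
    return result

# Hypothetical rows standing in for the question's df['Violations'] column.
rows = [
    "3. TITLE A - Comments: NOTE A. | 5. TITLE B - Comments: NOTE B.",
    None,
    "25. TITLE C - Comments: NOTE C. | 38. ",
]
comments = extract_comments(rows)
print(comments)  # ['NOTE A.', 'NOTE B.', 'NOTE C.']
```

Because the helper only iterates over its argument, passing the question's df['Violations'] column directly should work as well, assuming it holds strings in this layout.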
Votes: 1
Page content originally provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/69759396
