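Text:
Question:
The text contains sections 3, 5, 25 and 38 (each section begins with its index number). From each section, I want to extract all the text after " - Comments:" and before the index that starts the next section.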
def comments(x):
    result = []
    for elem in df['Violations']:
        matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', elem)
        result.extend(matches)
    print(result)
The code above does the exact opposite: it extracts only the words before " - Comments:". How can I change it?
Thanks a lot
Posted on 2021-10-28 19:31:32
If you want the text between Comments: and |, use those two delimiters in the regex.
'Comments: ([^\|]*) \|'
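It uses () to capture only the characters between Comments: and |, and [^\|] restricts those characters to anything other than |.
Because | has a special meaning in a regex (alternation), I use \| to match it as a literal character in the text.
Or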
'Comments: (.*?) \|'
It uses ? to make .* non-greedy, so the match stops at the first | instead of running on to the last one.
import re
elem = '''MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.'''
# Raw strings keep Python from treating \| as a string escape.
#matches = re.findall(r'Comments: ([^\|]*) \|', elem)
matches = re.findall(r'Comments: (.*?) \|', elem)
#print(matches)
for item in matches:
    print(item)
    print('---')
Result:
NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.
---
NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.
---
MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.
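Applied to the question's loop, the same idea might look like the sketch below. It assumes each element is one violation string; the `violations` parameter, the `COMMENT_RE` name, the `$` alternative for a final comment with no trailing ` |`, and the shortened sample strings are all illustrative assumptions, not part of the original code.

```python
import re

# Capture what follows "Comments:" up to the next " |" separator,
# or up to the end of the string for the last comment (assumption).
COMMENT_RE = re.compile(r"Comments:\s*(.*?)(?: \||$)")

def comments(violations):
    """Collect every comment from an iterable of violation strings."""
    result = []
    for elem in violations:
        # Skip missing values, which are not strings.
        if isinstance(elem, str):
            result.extend(COMMENT_RE.findall(elem))
    return result

# Hypothetical, shortened sample in the same layout as the real data.
sample = [
    "3. TITLE A - Comments: FIRST COMMENT. | 5. TITLE B - Comments: SECOND COMMENT.",
    None,
]
print(comments(sample))
```

Passing the column itself, for example `comments(df['Violations'])`, should work the same way, since the function only iterates over its argument.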
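Posted on 2021-10-28 21:47:52
Your pattern captures as little text as possible in the group before " - ", a newline, or the end of the string, and it never matches the Comments: part at all.
You can change it by matching Comments: explicitly and adding a capture group for the text that follows it.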
\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)
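A quick check of this first pattern, as a minimal sketch. The sample string is a shortened, made-up stand-in with the same layout as the data in the question.

```python
import re

# First pattern from this answer: skip the section title, then capture
# what follows " - Comments:" up to " |" or the end of the string.
pattern = r"\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)"

# Hypothetical, shortened sample (not the real data).
sample = "3. TITLE A - Comments: FIRST COMMENT. | 5. TITLE B - Comments: SECOND COMMENT."

found = re.findall(pattern, sample)
print(found)
```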
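A more precise match could anchor on the start of each section (digits, a dot and a space) and then match up to the first occurrence of - Comments: without crossing the start of another section.
After the comment marker, you can use a capture group to capture up to the next occurrence of " |", or assert the end of the string (if it is the last section).
Using re.findall will return the values of capture group 1.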
\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)
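The pattern matches:
\b : a word boundary that prevents a partial-word match
\d+\. : 1+ digits, a dot and a space
(?:(?!\d+\. |- Comments:).)* : any character, as long as what is directly to the right is neither the pattern \d+\. nor - Comments:
- Comments:\s* : the literal - Comments: followed by optional whitespace characters
(.*?) : capture group 1, matching as few characters as possible
(?: \||$) : either a space followed by |, or the end of the string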
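The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`. This is only a sketch of an alternative way to write it: in verbose mode unescaped whitespace is ignored, so each literal space is written as `[ ]`, and the sample string is a made-up stand-in.

```python
import re

pattern = re.compile(r"""
    \b\d+\.[ ]                        # section index: digits, a dot, a space
    (?:(?!\d+\.[ ]|-[ ]Comments:).)*  # title, stopping before an index or the marker
    -[ ]Comments:\s*                  # the literal marker plus optional whitespace
    (.*?)                             # group 1: the comment, as short as possible
    (?:[ ]\||$)                       # a space and a pipe, or the end of the string
""", re.VERBOSE)

# Hypothetical, shortened sample (not the real data).
sample = "3. TITLE A - Comments: FIRST COMMENT. | 5. TITLE B - Comments: SECOND COMMENT."
found = pattern.findall(sample)
print(found)
```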
Example
import re
regex = r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"
s = "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38. "
print(re.findall(regex, s))
Output
[
'NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.',
'NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.',
'MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.'
]
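If the section number is needed alongside each comment, one possible variation is to add a second capture group around the digits and iterate with `re.finditer`. This goes slightly beyond the answer above: the `comments_by_section` helper and the shortened sample string are hypothetical.

```python
import re

# Same pattern as above, with an extra group around the section number.
regex = re.compile(
    r"\b(\d+)\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"
)

def comments_by_section(text):
    """Map each section number to the comment text that follows it."""
    return {int(m.group(1)): m.group(2) for m in regex.finditer(text)}

# Hypothetical, shortened sample; the trailing "38. " has no comment.
s = "3. TITLE A - Comments: FIRST COMMENT. | 5. TITLE B - Comments: SECOND COMMENT. | 38. "
print(comments_by_section(s))
```

Sections without a "- Comments:" marker, such as the trailing 38 here, simply produce no entry.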
https://stackoverflow.com/questions/69759396