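Problem summary: I have written a generic regular expression to capture two groups from a sentence. In addition, I need to concatenate the third term of the second group onto the first group. I used the word "and" in the regex as the separator between the two groups of the sentence. For example:
Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'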
import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin."
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))
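The regex captures the groups, but the line that calls the substitute method (sub) raises TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with the problem above would be appreciated.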
Posted on 2021-06-08 05:23:31
Split solution
Although this is not a regex solution, it certainly works:
from string import punctuation
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))
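This outputs:
Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
This solution avoids inserting directly into the list, because that would cause indexing problems during iteration. Instead, we replace the first "and" in the list with "synthesis and" and the second with "skin and", then rejoin the split string.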
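The split approach can also be wrapped in a function. The following is only a sketch: the name repeat_shared_noun and the bounds check are illustrative additions, not part of the snippet above. The check skips any "and" that is not followed by at least two words, where x[idx + 2] would otherwise raise an IndexError.

```python
from string import punctuation


def repeat_shared_noun(sentence: str) -> str:
    """Copy the second word after each "and" to the position just before it."""
    words = sentence.split()
    for idx, word in enumerate(words):
        # Only rewrite when two more words follow "and"; otherwise leave it alone.
        if word == "and" and idx + 2 < len(words):
            shared = words[idx + 2].strip(punctuation)
            # Replace in place instead of inserting, so later indexes stay valid.
            words[idx] = f"{shared} and"
    return " ".join(words)


sentence = (
    "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused "
    "by WbC-2 of acnes in human face and animals skin."
)
print(repeat_shared_noun(sentence))
```

Note that split() followed by ' '.join() collapses any run of whitespace into a single space, which may or may not be what you want.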
Regex solution
If you insist on a regex solution, I suggest using re.findall with a single pattern, since that generalizes better for this problem:
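from string import punctuation
import re
# x was turned into a list by the split solution, so start again from the sentence
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string, so that \s reaches the regex engine unchanged
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)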
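Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
We use strip(punctuation) again because "skin." is captured together with its period: we do not want to lose the punctuation at the end of the sentence, but we do want to drop it from the copy placed in the middle of the sentence.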
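To make the role of strip(punctuation) concrete, the short sketch below prints what findall returns for the example sentence. The variable names (sentence, matches, last_word, stripped) are chosen here for illustration only.

```python
import re
from string import punctuation

sentence = (
    "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused "
    "by WbC-2 of acnes in human face and animals skin."
)
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")

# Each match is a tuple:
# (text before "and", first word after "and", second word after "and")
matches = pattern.findall(sentence)
for match in matches:
    print(match)

last_word = matches[-1][2]            # the period is part of the captured word
stripped = last_word.strip(punctuation)
print(repr(last_word), "->", repr(stripped))
```

The stripped copy is the one placed before "and", while the unstripped word stays at the end, so the sentence keeps its final period.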
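Here is a breakdown of our pattern:
(.*?)\sand\s(.*?)\s([^\s]+)
(.*?)\s: captures everything before "and"; the \s consumes the space in front of "and"
\s(.*?)\s: captures the word after "and"
([^\s]+): captures anything that is not whitespace up to the next whitespace (i.e. the second word after "and"). This also ensures we capture the punctuation.
Posted on 2021-06-08 21:39:13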
You do not need to import punctuation; a single regular expression is enough:
import re
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
pattern = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)
result = ''.join([f"{a} {c} and {b} {c}{d}" for a,b,c,d in pattern.findall(x)])
print(result)
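Result: Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
See the Python proof.
Using re.DOTALL allows the dot to match line feed characters.
\b at the end uses a word boundary to split off the trailing punctuation, and ([_\W]*) captures it as a separate group.
Using \s+ matches any amount of whitespace around "and", so extra whitespace is trimmed from the result.
[^\s] is the same as \S, which is shorter.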
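Since the question used re.sub with a callback, the same idea can be sketched that way too. This is an illustrative variant, not the code from this answer: the names PATTERN and expand_conjunctions are made up here, and the pattern drops the leading (.*?) group and the trailing punctuation group because re.sub leaves everything outside a match untouched. Both groups are mandatory, so match.group() never returns None inside the callback, which is what triggered the TypeError in the question (an optional group that does not take part in a match yields None).

```python
import re

# "and", then the first word after it (group 1) and the second word (group 2).
# \b makes the greedy \S+ give back trailing punctuation such as the final ".".
PATTERN = re.compile(r"\s+and\s+(\S+)\s+(\S+)\b")


def expand_conjunctions(text: str) -> str:
    """Repeat the shared word on both sides of each "and"."""
    def expand(match: re.Match) -> str:
        first, shared = match.group(1), match.group(2)
        return f" {shared} and {first} {shared}"

    return PATTERN.sub(expand, text)


sentence = (
    "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused "
    "by WbC-2 of acnes in human face and animals skin."
)
print(expand_conjunctions(sentence))
```

One practical difference: text that follows the last match (for example a second sentence without "and") is kept by re.sub, whereas joining findall results only reproduces the text covered by a match.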
See the regex proof.
Explanation
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  and                      'and'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    [_\W]*                   any character of: '_', non-word
                             characters (all but a-z, A-Z, 0-9, _) (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \4
https://stackoverflow.com/questions/67881649