Concatenating terms using the regex substitute method

Stack Overflow user

Asked 2021-06-08 04:57:48

Answers 2 · Views 83 · Followers 0 · Votes 3

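Problem summary: I have written a generic regex to capture two groups from a sentence. I also need to concatenate the third term of the second group to the first group. I used the word and as the separator in the regex to split the sentence into the two groups. For example:

Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'

Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'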

import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin." 
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))

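The regex is able to capture the groups, but the line that calls the substitute method raises TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with this problem would be much appreciated.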

python

regex

string

regex-group

python-re

2 answers

Stack Overflow user

Accepted answer

Answered 2021-06-08 05:23:31

Split solution

Although this is not a regex solution, it certainly works:

from string import punctuation

x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))

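This outputs:

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

This solution avoids inserting directly into the list, because that would cause index problems during iteration. Instead, we replace the first "and" in the list with "synthesis and" and the second with "skin and", then rejoin the split string.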
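The same idea can also be sketched by building a new list rather than overwriting entries in place. The function name distribute_after_and and the bounds check are assumptions added for illustration; they are not part of the answer above.

```python
from string import punctuation


def distribute_after_and(sentence):
    """Copy the second word after each "and" in front of that "and".

    Hypothetical helper: a sketch of the split-based approach that
    appends to a fresh list, so the list being iterated never changes.
    """
    words = sentence.split()
    out = []
    for idx, word in enumerate(words):
        # Only act when two more words actually follow "and"
        if word == "and" and idx + 2 < len(words):
            # Strip punctuation so "skin." is copied as "skin"
            out.append(words[idx + 2].strip(punctuation))
        out.append(word)
    return " ".join(out)


x = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
     "caused by WbC-2 of acnes in human face and animals skin.")
print(distribute_after_and(x))
```

With the bounds check, an input such as "salt and pepper" is returned unchanged instead of raising an IndexError.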

Regex solution

If you insist on a regex solution, I suggest using re.findall with a single pattern, because that generalizes better for this problem:

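from string import punctuation
import re

# x must be the original sentence string here, not the list produced by x.split() above
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string, so that \s reaches the regex engine without an invalid-escape warning
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)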

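Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

We use strip(punctuation) again because skin. is captured with its full stop: we do not want to lose the punctuation at the end of the sentence, but we do want to drop it from the copy placed mid-sentence.

Here is a breakdown of our pattern: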

(.*?)\sand\s(.*?)\s([^\s]+)

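(.*?)\s: captures everything before "and"; the whitespace in front of "and" is matched but not captured
\s(.*?)\s: captures the word that follows "and"
([^\s]+): captures everything that is not whitespace up to the next whitespace (i.e. the second word after "and"). This also ensures we capture the punctuation.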
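To see what each group holds, it may help to print the tuples that re.findall returns for the sample sentence. This is only an inspection sketch; the variable names sentence, matches, before, first_word and second_word are chosen here for illustration.

```python
import re

sentence = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
            "caused by WbC-2 of acnes in human face and animals skin.")
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")

# findall returns one 3-tuple per match: (text before "and", word 1, word 2)
matches = pattern.findall(sentence)
for before, first_word, second_word in matches:
    print(repr(before), repr(first_word), repr(second_word))
```

Note that the first group of the second match starts with a space, because the match resumes right after "synthesis". That leading space is what keeps the pieces separated when they are joined with ''.join.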

Votes 2

Stack Overflow user

Answered 2021-06-08 21:39:13

You do not need to import punctuation; a single regular expression does the job:

import re
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
pattern = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)
result = ''.join([f"{a} {c} and {b} {c}{d}" for a,b,c,d in pattern.findall(x)])
print(result)

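Result: Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

See the Python proof.

Using re.DOTALL allows the dot to match line feed characters.

The word boundary \b at the end strips the punctuation from the word, and ([_\W]*) captures that punctuation as a separate group.

Using \s+ trims any number of whitespace characters from the result.

[^\s] is the same as \S, which makes the pattern shorter.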
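The effect of \b followed by ([_\W]*) can be checked in isolation. The snippet below is a small sketch that applies only the tail of the pattern to single tokens; the name tail is an assumption made for this illustration.

```python
import re

# Tail of the answer's pattern: a word, a word boundary, then trailing
# non-word characters captured on their own
tail = re.compile(r"(\S+)\b([_\W]*)")

# \S+ first grabs "skin.", but \b cannot match after the final ".",
# so the engine backtracks to "skin" and the "." lands in group 2
print(tail.fullmatch("skin.").groups())      # ('skin', '.')

# With no trailing punctuation, group 2 is simply empty
print(tail.fullmatch("synthesis").groups())  # ('synthesis', '')
```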

See the regex proof.

Explanation

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  and                      'and'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    [_\W]*                   any character of: '_', non-word
                             characters (all but a-z, A-Z, 0-9, _) (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \4
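Since the question asks about the substitute method, the same rearrangement can also be sketched with re.sub and backreferences. The pattern below is an assumption written for this illustration, not the pattern from either answer: it assumes the word to copy consists of word characters only, so (\w+) stops before any trailing punctuation. Every group in it is mandatory, so no group can be None inside the replacement, which is what happens to an optional group such as (...)? when it does not take part in a match.

```python
import re

sentence = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
            "caused by WbC-2 of acnes in human face and animals skin.")

# Hypothetical pattern: "and", the word after it (\1), and the word
# after that (\2, word characters only so the final "." is left alone)
pattern = re.compile(r"\s+and\s+(\S+)\s+(\w+)")

# Put a copy of \2 in front of "and", then keep "\1 \2" after it
result = pattern.sub(r" \2 and \1 \2", sentence)
print(result)
```

Text outside the matches, including the final full stop, is left untouched by re.sub, so no join step is needed.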

Votes 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/67881649
