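I'm doing the following with a list of words. I read lines from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitution, and then print each word and punctuation tag on its own line for further processing later. I'm unsure how to replace every single quote with a tag while leaving apostrophes alone. My current method is to use a compiled regex: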
apo = re.compile("[A-Za-z]'[A-Za-z]")
and do the following:
if "'" in word and not apo.search(word):
    word = word.replace("'","\n<singlequote>")
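But this ignores cases where a single quote is used on either side of a word that also contains an apostrophe. It also doesn't tell me whether the single quote abuts the start of a word or the end of a word.
Sample input: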
don't
'George
ma'am
end.'
didn't.'
'Won't
Sample output (after processing and printing to a file):
don't
<opensingle>
George
ma'am
end
<period>
<closesingle>
didn't
<period>
<closesingle>
<opensingle>
Won't
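I do have a further question about this task: since distinguishing <opensingle> from <closesingle> seems rather difficult, would it be wiser to perform substitutions like
word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')
after performing the quote replacement?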
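Posted on 2018-06-10 04:58:36
I think this can benefit from lookahead and lookbehind assertions. The Python reference is https://docs.python.org/3/library/re.html, and a general regex site I often refer to is https://www.regular-expressions.info/lookaround.html.
Your data: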
words = ["don't",
"'George",
"ma'am",
"end.'",
"didn't.'",
"'Won't",]
Now I'll define a tuple of compiled regexes paired with their replacements.
In [230]: import re
     ...: from functools import reduce  # reduce is not a builtin in Python 3
     ...:
     ...: apo = (
     ...:     (re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),
     ...:     (re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),
     ...:     (re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),
     ...:     (re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), "<period>"),
     ...: )

In [231]: words = ["don't",
     ...:          "'George",
     ...:          "ma'am",
     ...:          "end.'",
     ...:          "didn't.'",
     ...:          "'Won't"]

In [232]: reduce(lambda w2, x: [x[0].sub(x[1], w) for w in w2], apo, words)
Out[232]:
['don<apostrophe>t',
'<opensingle>George',
'ma<apostrophe>am',
'end<period><closesingle>',
'didn<apostrophe>t<period><closesingle>',
'<opensingle>Won<apostrophe>t']
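The question asks for one token per line with in-word apostrophes left untouched. The following is a minimal sketch of one way to get there, assuming the same lookaround patterns minus the `<apostrophe>` rule. The helper name `tokens_per_line` and the `RULES` table are hypothetical, introduced only for illustration.

```python
import re

# Same lookaround patterns as above, minus the in-word apostrophe rule,
# because the question's expected output keeps "don't" intact.
RULES = (
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),
    (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),
    (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), "<period>"),
)


def tokens_per_line(word):
    """Hypothetical helper: split a word into itself plus its markers."""
    for pattern, marker in RULES:
        # Surround each marker with newlines so it can be split off below.
        word = pattern.sub("\n" + marker + "\n", word)
    # Drop the empty pieces left where two markers sit next to each other.
    return [piece for piece in word.split("\n") if piece]


words = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't"]
for word in words:
    print("\n".join(tokens_per_line(word)))
```

One limitation of this sketch: a trailing possessive such as `dogs'` would be tagged as `<closesingle>`, since nothing in these patterns distinguishes it from a closing quote.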
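Here is what is going on in the regexes:
(?<=[A-Za-z]) is a lookbehind: match (without consuming) only if the preceding character is a letter.
(?=[A-Za-z]) is a lookahead (still not consuming): match only if the following character is a letter.
(?<![A-Za-z]) is a negative lookbehind: if a letter precedes, there is no match.
(?![A-Za-z]) is a negative lookahead: if a letter follows, there is no match.
Note that I added a . check to the lookbehind for <closesingle>, and that the order within apo matters, because you might be replacing . with <period> ...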
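To make the two points above concrete, here is a small sketch. It shows that a lookaround match covers only the quote itself, and that swapping the `<closesingle>` and `<period>` rules changes the result. The helper `apply_in_order` is a hypothetical name used only for this illustration.

```python
import re

# Lookarounds are zero-width: the neighbouring letters are inspected
# but not consumed, so only the quote character is matched.
inner = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")
match = inner.search("don't")
print(repr(match.group()), match.span())

close = re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])")
period = re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])")


def apply_in_order(text, rules):
    """Hypothetical helper: apply (pattern, replacement) pairs in sequence."""
    for pattern, token in rules:
        text = pattern.sub(token, text)
    return text


# Closing quote first: its lookbehind still sees the literal ".".
good = apply_in_order("end.'", [(close, "<closesingle>"), (period, "<period>")])
# Period first: the quote now follows ">", so the closing rule no longer fires.
bad = apply_in_order("end.'", [(period, "<period>"), (close, "<closesingle>")])
print(good)
print(bad)
```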
This operates on individual words, but it should work on whole sentences as well.
In [233]: onelong = """
don't
'George
ma'am
end.'
didn't.'
'Won't
"""

In [235]: print(
     ...:     reduce(lambda sentence, x: x[0].sub(x[1], sentence), apo, onelong)
     ...: )
don<apostrophe>t
<opensingle>George
ma<apostrophe>am
end<period><closesingle>
didn<apostrophe>t<period><closesingle>
<opensingle>Won<apostrophe>t
(reduce is used here to apply one regex's .sub to the words/string, then feed that output into the next regex's .sub, and so on.)
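If `reduce` feels opaque, the same chaining can be written as an explicit loop. This is a sketch rather than part of the original answer: `apply_rules` is a hypothetical name, and the rule table simply repeats the `apo` tuple from above so the block runs on its own.

```python
import re
from functools import reduce

apo = (
    (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),
    (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),
    (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), "<period>"),
)


def apply_rules(text, rules=apo):
    """Hypothetical helper: feed each substitution's output into the next."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


words = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't"]

looped = [apply_rules(w) for w in words]
reduced = reduce(lambda w2, x: [x[0].sub(x[1], w) for w in w2], apo, words)

# Both formulations apply the rules in the same order.
print(looped == reduced)
```

The loop makes the rule order explicit, which is convenient when adding further rules such as one for commas.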
https://stackoverflow.com/questions/50777729