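I want to fully tokenize a sentence: "The element with the longest half-life is Uranium-234" said the professor.
I want this output:
['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', "isn't", 'Uranium-234', '"', 'said', 'the', 'professor', '.']
Here every punctuation mark is split off, but contractions such as "isn't" stay as a single token. Hyphenated words are also treated as one token, which is exactly what I want.
Currently I am using this to tokenize it: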
import re
p = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
p.findall(s)  # s holds the sentence to tokenize
This gives the output:
['"', 'The', 'element', 'with', 'the', 'longest', 'half', '-', 'life', "isn't", 'Uranium', '-', '234', '"', 'said', 'the', 'professor', '.']
So hyphenated words are split into three tokens instead of being kept as one.
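The behaviour described above can be reproduced with a minimal, self-contained sketch. The sentence below is an assumption reconstructed from the expected output (it uses "isn't"), and `s` and `p` simply follow the names in the snippet:

```python
import re

# Sentence assumed from the expected output in the question.
s = '"The element with the longest half-life isn\'t Uranium-234" said the professor.'

# The optional group only allows an apostrophe between two word runs,
# so a hyphen falls through to the single-punctuation branch.
p = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
tokens = p.findall(s)
print(tokens)
```

The contraction survives because `'` is allowed inside the optional group, while `-` is not, which is why `half-life` comes back as three tokens.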
Posted on 2021-02-28 21:46:21
Use a character class, and you forgot the underscore:
\w+(?:['-]\w+)?|[^\w\s]|_
Explanation
--------------------------------------------------------------------------------
\w+          word characters (a-z, A-Z, 0-9, _) (1 or more times
             (matching the most amount possible))
--------------------------------------------------------------------------------
(?:          group, but do not capture (optional (matching the most
             amount possible)):
--------------------------------------------------------------------------------
['-]         any character of: ''', '-'
--------------------------------------------------------------------------------
\w+          word characters (a-z, A-Z, 0-9, _) (1 or more times
             (matching the most amount possible))
--------------------------------------------------------------------------------
)?           end of grouping
--------------------------------------------------------------------------------
|            OR
--------------------------------------------------------------------------------
[^\w\s]      any character except: word characters (a-z, A-Z, 0-9, _),
             whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
|            OR
--------------------------------------------------------------------------------
_            '_'
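One consequence of the `)?` quantifier in the breakdown above is that only a single joined segment is allowed, so a word with two hyphens is still split. The sketch below compares the pattern with a hypothetical variant that uses `*` instead; the variant and the sample text are illustrations, not part of the original answer:

```python
import re

# Pattern from the answer: at most one apostrophe- or hyphen-joined segment.
single = re.compile(r"\w+(?:['-]\w+)?|[^\w\s]|_")
# Hypothetical variant: `*` allows any number of joined segments.
multi = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]|_")

text = "My mother-in-law isn't here."
single_tokens = single.findall(text)
multi_tokens = multi.findall(text)
print(single_tokens)  # 'mother-in', '-', 'law' are separate tokens
print(multi_tokens)   # 'mother-in-law' is one token
```

Which quantifier is preferable depends on the text: `?` is enough for the sentence in the question, while `*` may suit input with longer compounds.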
Python code
import re
regex = r"\w+(?:['-]\w+)?|[^\w\s]|_"
test_str = "\"The element with the longest half-life is Uranium-234\" said the professor."
print(re.findall(regex, test_str))
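Results:
['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', 'is', 'Uranium-234', '"', 'said', 'the', 'professor', '.']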
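The test string in the answer has no contraction, so as a further check here is a small sketch that runs the same pattern against the output requested in the question. The helper name `tokenize`, the constant `TOKEN_RE`, and the sentence with "isn't" are assumptions made for illustration:

```python
import re

# Same pattern as in the answer, compiled once for reuse.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)?|[^\w\s]|_")

def tokenize(text):
    """Return words (keeping one inner ' or -) and single punctuation marks."""
    return TOKEN_RE.findall(text)

# Sentence assumed from the expected output in the question.
sentence = '"The element with the longest half-life isn\'t Uranium-234" said the professor.'
expected = ['"', 'The', 'element', 'with', 'the', 'longest', 'half-life', "isn't",
            'Uranium-234', '"', 'said', 'the', 'professor', '.']
print(tokenize(sentence) == expected)
```

Note that a hyphen or apostrophe is only kept inside a token when a word character follows it, so a dangling hyphen as in `end- of` is still emitted as its own token.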
https://stackoverflow.com/questions/66414047