Have you tried <a href="https://github.com/mitsuhiko/ponyguruma/" rel="noreferrer">Ponyguruma</a>, a Python binding to the <a href="http://www.geocities.jp/kosako3/oniguruma/" rel="noreferrer">Oniguruma</a> regular expression engine? In that engine you can simply say <code>\p{Armenian}</code> to match Armenian characters. <code>\p{Ll}</code> or <code>\p{Zs}</code> work too.

The <a href="http://pypi.python.org/pypi/regex">regex</a> module (an alternative to the standard <code>re</code> module) supports Unicode codepoint properties with the <code>\p{}</code> syntax.
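As a rough illustration of the above, here is a minimal sketch using the third-party <code>regex</code> package (installed separately, e.g. with <code>pip install regex</code>). The helper name <code>find_by_property</code> is made up for this example, and the import is guarded so the snippet still loads when the package is missing.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None


def find_by_property(prop, text):
    """Return every character in `text` matching the Unicode property `prop`.

    `prop` is a property value such as 'Ll', 'Zs' or 'Armenian'.
    (Hypothetical helper, written only for this illustration.)
    """
    if regex is None:
        raise RuntimeError("the 'regex' package is not installed")
    return regex.findall(r'\p{%s}' % prop, text)


if regex is not None:
    print(find_by_property('Ll', 'Ab \u00e9'))        # lower-case letters
    print(find_by_property('Armenian', 'a \u0561'))   # Armenian script
```

The same patterns compiled with the standard <code>re</code> module are rejected or misread, because <code>re</code> has no <code>\p{}</code> escape.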

You can painstakingly use unicodedata on each character:

<pre><code>import unicodedata

def strip_accents(x):
    # Decompose to NFD, then drop nonspacing marks (category 'Mn').
    return u''.join(c for c in unicodedata.normalize('NFD', x)
                    if unicodedata.category(c) != 'Mn')
</code></pre>
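The same per-character approach can stand in for <code>\p{Ll}</code> or <code>\p{Zs}</code> when a plain filter is enough. This is only a sketch for Python 3; the function name <code>chars_with_category</code> is invented for the example.

```python
import unicodedata


def chars_with_category(text, prefix):
    """Return the characters of `text` whose general category starts with
    `prefix`, e.g. 'Ll' (lower-case letters) or 'L' (all letters).
    (Hypothetical helper name, for illustration only.)
    """
    return [c for c in text if unicodedata.category(c).startswith(prefix)]


sample = 'Ab \u00e9\u00a0Z'  # 'A', 'b', space, e-acute, no-break space, 'Z'
lower = chars_with_category(sample, 'Ll')   # roughly what \p{Ll} would find
spaces = chars_with_category(sample, 'Zs')  # roughly what \p{Zs} would find
print(lower, spaces)
```

This only tests characters one at a time; it does not give you a pattern you can combine with other regex syntax.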

Speaking of homegrown solutions, some time ago I wrote a small <a href="http://difnet.com.br/opensource/unicode_hack.py.txt" rel="nofollow noreferrer">program</a> to do just that: convert a Unicode category written as <code>\p{...}</code> into a range of values, extracted from the Unicode <a href="http://unicode.org/versions/Unicode5.0.0/" rel="nofollow noreferrer">specification</a> (v5.0.0). Only categories are supported (e.g. <code>L</code>, <code>Zs</code>), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems to be a better option).

Example usage:

<pre><code>&gt;&gt;&gt; from unicode_hack import regex
&gt;&gt;&gt; pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
&gt;&gt;&gt; print pattern.match(u'ÁñÇ_1+2').group(0)
ÁñÇ_1
&gt;&gt;&gt;
</code></pre>

Here's the <a href="http://difnet.com.br/opensource/unicode_hack.py" rel="nofollow noreferrer">source</a>. There is also a <a href="https://stackoverflow.com/a/8933546/520779">JavaScript version</a>, using the same data.

You're right that Unicode property classes are not supported by the Python regex parser.

If you wanted to do a nice hack that would be generally useful, you could create a preprocessor that scans a pattern string for such class tokens (<code>\p{M}</code> or whatever) and replaces them with the corresponding character sets, so that, for example, <code>\p{M}</code> would become <code>[\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]</code>, and <code>\P{M}</code> would become <code>[^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]</code>.
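A minimal sketch of such a preprocessor for Python 3 might look like the following. It is not an existing library: the names <code>category_class_body</code> and <code>expand_properties</code> are invented here, and the ranges are derived from <code>unicodedata</code> rather than hard-coded. It assumes only general categories (<code>L</code>, <code>Ll</code>, <code>Zs</code>, <code>M</code>, ...) are used, that the token does not sit inside an existing <code>[...]</code> class, and that the backslash before <code>p</code> is not itself escaped.

```python
import re
import sys
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def category_class_body(category):
    """Build the inside of a character class covering every code point whose
    general category starts with `category` (hypothetical helper)."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(category):
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    if not ranges:
        raise ValueError('unknown general category: %r' % category)
    # \UXXXXXXXX escapes are understood by the re module itself.
    return ''.join(
        '\\U%08X' % lo if lo == hi else '\\U%08X-\\U%08X' % (lo, hi)
        for lo, hi in ranges)


_PROPERTY_TOKEN = re.compile(r'\\([pP])\{(\w+)\}')


def expand_properties(pattern):
    """Rewrite \\p{X} as [...] and \\P{X} as [^...] so `re` can compile it."""
    def replace(match):
        negate = '^' if match.group(1) == 'P' else ''
        return '[%s%s]' % (negate, category_class_body(match.group(2)))
    return _PROPERTY_TOKEN.sub(replace, pattern)


identifier = re.compile(expand_properties(r'^\p{Lu}(\p{L}|\p{N}|_)*'))
found = identifier.match('\u00c1\u00f1\u00c7_1+2').group(0)
```

Each category is computed once by scanning the whole code point range and then cached, so the first use of a category is noticeably slower than later ones.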

People would thank you. :)

Note that while <code>\p{Ll}</code> has no equivalent in Python regular expressions, <code>\p{Zs}</code> should be covered by <code>'(?u)\s'</code>.
The <code>(?u)</code> flag, as the docs say, will “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” With it, <code>\s</code> matches any whitespace character, which includes everything in <code>\p{Zs}</code>.
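A quick way to check this claim, sketched for Python 3 (where <code>str</code> patterns are already Unicode-aware, so <code>(?u)</code> is redundant but harmless): collect every code point in category <code>Zs</code> and test each against <code>\s</code>. The variable names are just for this example.

```python
import re
import sys
import unicodedata

# Every code point whose general category is Zs (space separator).
zs_chars = [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == 'Zs']

whitespace = re.compile(r'(?u)\s')

# True if \s accepts every Zs character in this Python's Unicode database.
all_zs_match = all(whitespace.fullmatch(c) for c in zs_chars)

# \s also accepts characters outside Zs, such as ASCII control whitespace.
not_zs = [c for c in '\t\n\r\x0b\x0c' if whitespace.fullmatch(c)]

print(all_zs_match, [unicodedata.category(c) for c in not_zs])
```

So <code>\s</code> is a usable stand-in only when also matching tabs, newlines and similar characters is acceptable.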

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use <code>\p{Ll}</code> to match an arbitrary lower-case letter, or <code>\p{Zs}</code> for any space separator. I don't see support for this in either the 2.x or 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.

Python regex matching Unicode properties

