"but want to know about the max hex boundary in a regex":
* in all utf modes: 0x10ffff
* native 8-bit mode: 0xff
* native 16-bit mode: 0xffff
* native 32-bit mode: 0x1fffffff

I'm not sure about PHP, but regex engines really put no governor on code points, 
so it doesn't matter that only some 1.1 million of them are valid. 
That count is subject to change at any time, and it's not really up to engines 
to enforce it. There are reserved code points that are holes in the valid range, 
there are surrogates in the valid range, and so on; the reasons are endless for there 
to be no restriction other than the word size. 

For UTF-32, you can't go over 31 bits because bit 32 is the sign bit. 
<code>0x00000000 - 0x7FFFFFFF</code>

That makes sense, since <code>unsigned int</code> as a data type is the natural size of 32-bit hardware registers. 

For UTF-16 this is even more true: you can see the same limitation masked to 16 bits.
Bit 32 is still the sign bit, leaving <code>0x0000 - 0xFFFF</code> as the valid range. 

Usually, if you use an engine that supports ICU, you should be able to rely on it: 
ICU converts both the source and the regex into UTF-32. Boost Regex is one such engine.

edit: 

Regarding UTF-16 

I guess when Unicode outgrew 16 bits, they punched a hole in the 16-bit range for surrogate pairs. But that left only 20 usable bits in total between the pair. 

There are 10 bits in each surrogate, with the other 6 used to determine hi or lo. 
It looks like this left the Unicode folks with a limit of 20 bits on top of the original 0x0000 - 0xFFFF range, for a maximum code point of 0x10FFFF, with unusable holes. 
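
The arithmetic behind that limit can be sketched in a few lines of Python. This is only an illustration of the counting argument; the variable names are made up here and nothing in it is specific to any regex engine.

```python
# Each surrogate carries 10 payload bits; the other 6 bits mark it as hi or lo.
PAYLOAD_BITS_PER_SURROGATE = 10  # hypothetical name, for illustration only

pair_bits = 2 * PAYLOAD_BITS_PER_SURROGATE   # 20 bits across the pair
supplementary_count = 1 << pair_bits         # values reachable through pairs
bmp_count = 0x10000                          # 0x0000 - 0xFFFF, single 16-bit units
total_code_points = bmp_count + supplementary_count
max_code_point = total_code_points - 1

print(hex(supplementary_count))  # 0x100000
print(hex(total_code_points))    # 0x110000
print(hex(max_code_point))       # 0x10ffff
```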

To be able to convert to a different encoding (8/16/32), all the code points 
must actually be convertible. Thus the forever backward-compatible 20-bit limit is 
the trap they ran into early, but now must live with. 

Regardless, regex engines won't be enforcing this limit anytime soon, probably never. 
As for surrogates, they are the hole, and a malformed literal surrogate can't be converted between modes. That only pertains to a literal encoded character during conversion, not to a hex representation of one. For instance, it's easy to search a text in UTF-16 (only) mode for unpaired surrogates, or even paired ones. 
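
As a rough sketch of what such a search amounts to, the Python below scans a list of 16-bit code units and reports the positions of surrogates that have no partner. The function name and the sample data are assumptions made for illustration; a regex engine in 16-bit mode would do the equivalent with a pattern rather than a loop.

```python
def find_unpaired_surrogates(units):
    """Return the indexes of 16-bit code units that are surrogates without a partner."""
    unpaired = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_hi = (unit & 0xFC00) == 0xD800
        is_lo = (unit & 0xFC00) == 0xDC00
        if is_hi and i + 1 < len(units) and (units[i + 1] & 0xFC00) == 0xDC00:
            i += 2  # well-formed hi/lo pair, skip both units
            continue
        if is_hi or is_lo:
            unpaired.append(i)
        i += 1
    return unpaired


# 'A', a valid pair, a lone hi surrogate, 'B', a lone lo surrogate
sample = [0x0041, 0xD844, 0xDCC1, 0xD844, 0x0042, 0xDCC1]
print(find_unpaired_surrogates(sample))  # [3, 5]
```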

But I guess regex engines don't really care about holes or limits; they only care about what mode the subject string is in. No, the engine is not going to say: 
'Hey wait, the mode is UTF-16, I'd better convert <code>\x{210C1}</code> to <code>\x{D844}\x{DCC1}</code>. Wait, if I did that, what do I do if it's quantified, <code>\x{210C1}+</code>? Start injecting regex constructs around it? Worse yet, what if it's in a class, <code>[\x{210C1}]</code>? Nah.. better limit it to <code>\x{FFFF}</code>.'

Some handy-dandy pseudo-code surrogate conversions I use: 

<pre><code> Definitions:
 ====================
 10-bits
 3FF = 000000 1111111111

 Hi Surrogate
 D800 = 110110 0000000000
 DBFF = 110110 1111111111 

 Lo Surrogate
 DC00 = 110111 0000000000
 DFFF = 110111 1111111111


 Conversions:
 ====================
 UTF-16 Surrogates to UTF-32
 if ( TESTFOR_SURROGATE_PAIR(hi,lo) )
 {
     u32Out = 0x10000 + ( ((hi &amp; 0x3FF) &lt;&lt; 10) | (lo &amp; 0x3FF) );
 }

 UTF-32 to UTF-16 Surrogates
 if ( u32In &gt;= 0x10000)
 {
     u32In -= 0x10000;
     hi = (0xD800 + ((u32In &amp; 0xFFC00) &gt;&gt; 10));
     lo = (0xDC00 + (u32In &amp; 0x3FF));
 }

 Macros:
 ====================
 #define TESTFOR_SURROGATE_HI(hs) ( ((hs) &amp; 0xFC00) == 0xD800 )
 #define TESTFOR_SURROGATE_LO(ls) ( ((ls) &amp; 0xFC00) == 0xDC00 )
 #define TESTFOR_SURROGATE_PAIR(hs,ls) ( (((hs) &amp; 0xFC00) == 0xD800) &amp;&amp; (((ls) &amp; 0xFC00) == 0xDC00) )
 // Pointer variants: test the 16-bit code unit(s) that ptr points at
 #define PTR_TESTFOR_SURROGATE_HI(ptr) ( (*(ptr) &amp; 0xFC00) == 0xD800 )
 #define PTR_TESTFOR_SURROGATE_LO(ptr) ( (*(ptr) &amp; 0xFC00) == 0xDC00 )
 #define PTR_TESTFOR_SURROGATE_PAIR(ptr) ( ((*(ptr) &amp; 0xFC00) == 0xD800) &amp;&amp; ((*((ptr)+1) &amp; 0xFC00) == 0xDC00) )
</code></pre>
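
For anyone who wants to try the conversions without a C compiler, here is a minimal Python sketch of the same bit arithmetic. The function names are my own, and the range checks that raise `ValueError` are an addition that the pseudo-code does not have.

```python
def is_surrogate_pair(hi, lo):
    """Same test as TESTFOR_SURROGATE_PAIR: top 6 bits mark hi (110110) or lo (110111)."""
    return (hi & 0xFC00) == 0xD800 and (lo & 0xFC00) == 0xDC00


def surrogates_to_utf32(hi, lo):
    """Combine the 10 payload bits of each surrogate and add back the 0x10000 offset."""
    if not is_surrogate_pair(hi, lo):
        raise ValueError("not a hi/lo surrogate pair")
    return 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF))


def utf32_to_surrogates(cp):
    """Split a code point above 0xFFFF into a hi/lo surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the range that surrogate pairs encode")
    cp -= 0x10000
    hi = 0xD800 + ((cp & 0xFFC00) >> 10)
    lo = 0xDC00 + (cp & 0x3FF)
    return hi, lo


hi, lo = utf32_to_surrogates(0x210C1)
print(hex(hi), hex(lo))                  # 0xd844 0xdcc1
print(hex(surrogates_to_utf32(hi, lo)))  # 0x210c1
```

The round trip reproduces the <code>\x{D844}\x{DCC1}</code> pair mentioned earlier for <code>\x{210C1}</code>.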

As minitech suggests in the first comment, you have to use the code point; for this character, it's <code>\x{210C1}</code>. That's also the encoded form in UTF-32.
<code>F0 A1 83 81</code> is the UTF-8 encoded sequence (see <a href="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=210C1" rel="nofollow">http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=210C1</a>).
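
A quick way to see the difference between the code point and its encoded forms is to let Python do the encoding. This is just an illustration using Python's built-in codecs; it says nothing about how PCRE stores the subject string internally.

```python
ch = chr(0x210C1)  # the character, identified by its code point

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(utf8.hex())   # f0a18381  (four UTF-8 bytes)
print(utf16.hex())  # d844dcc1  (a hi/lo surrogate pair)
print(utf32.hex())  # 000210c1  (the code point itself)
```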

There are some versions of PCRE where you can use values up to <code>\x{7FFFFFFF}</code>, but I really don't know what could be matched with them.

To quote <a href="http://www.pcre.org/pcre.txt" rel="nofollow">http://www.pcre.org/pcre.txt</a>:

<blockquote>
 In UTF-16 mode, the character code is Unicode, in the range 0 to
 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
 because those are "surrogate" values that are used in pairs to encode
 values greater than 0xffff.
 [...] 
 In UTF-32 mode, the character code is Unicode, in the range 0 to
 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
 because those are "surrogate" values that are ill-formed in UTF-32.
</blockquote>

<code>0x10ffff</code> is the largest value you can use to match a character (that's what I take from this). <code>0x10ffff</code> is currently also the largest code point defined in the Unicode standard (see <a href="http://www.unicode.org/faq/utf_bom.html#gen6" rel="nofollow">What are some of the differences between the UTFs?</a>), so any value above it does not make sense (or I just don't get it)...

<blockquote>
 So I can't match a letter like 𡃁 with equivalent hex value of f0 a1
 83 81. The question is not how to match these letters, but how this
 range &amp; this boundary came from as u modifier should treat strings as
 UTF-16
</blockquote>

You are mixing two concepts, which is causing this confusion. 

<code>F0 A1 83 81</code> isn't the hex value of the character. It is the way
UTF-8 encodes the code point for that character in the byte stream.

It is correct that PHP supports Unicode code points in the <code>\x{}</code> pattern, but the value inside <code>{</code> and <code>}</code> represents a code point, not the actual bytes used to encode the given character in the byte stream.

So the largest possible value you can use with <code>\x{}</code> is actually <code>10FFFF</code>. 

And to match the character with PHP you need to use its code point, which, as suggested by @minitech in his comment, is <code>\x{0210c1}</code>.
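
The same distinction can be sketched with Python's `re` module. Note that this is an analogy rather than PCRE itself: Python spells the escape `\U000210C1` instead of `\x{210C1}`, and the sample text is made up for the example.

```python
import re

text = "abc" + chr(0x210C1) + "def"

# Code-point matching: the pattern names the code point, and the subject is text.
by_code_point = re.search(r"\U000210C1", text)
print(by_code_point.span())  # (3, 4): one character

# Byte matching: only meaningful against the UTF-8 encoded byte stream.
data = text.encode("utf-8")
by_bytes = re.search(rb"\xf0\xa1\x83\x81", data)
print(by_bytes.span())       # (3, 7): four bytes
```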

Further explanation quoted from section <a href="http://www.pcre.org/current/doc/html/pcre2unicode.html" rel="nofollow">"Validity of strings"</a> from the <a href="http://www.pcre.org/current/doc/html/pcre2unicode.html" rel="nofollow">PCRE documentation</a>.

<blockquote>
 The entire string is checked before any other processing takes place.
 In addition to checking the format of the string, there is a check to
 ensure that all code points lie in the range U+0 to U+10FFFF,
 excluding the surrogate area. The so-called "non-character" code
 points are not excluded because Unicode corrigendum #9 makes it clear
 that they should not be.
 
 Characters in the "Surrogate Area" of Unicode are reserved for use by
 UTF-16, where they are used in pairs to encode code points with values
 greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
 are available independently in the UTF-8 and UTF-32 encodings. (In
 other words, the whole surrogate thing is a fudge for UTF-16 which
 unfortunately messes up UTF-8 and UTF-32.)
</blockquote>
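
The range check described in that quote can be written out as a small predicate. This is my own sketch of the rule as quoted (in range, not a surrogate, non-characters allowed), not code taken from PCRE, and the function name is hypothetical.

```python
def is_valid_code_point(cp):
    """True if cp lies in U+0..U+10FFFF and outside the surrogate area."""
    in_range = 0 <= cp <= 0x10FFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    return in_range and not is_surrogate


for cp in (0x210C1, 0xFFFF, 0xD844, 0x10FFFF, 0x110000, 0x7FFFFFFF):
    print(hex(cp), is_valid_code_point(cp))
```

Only `0xD844` (a surrogate) and the two values above `0x10FFFF` are rejected; the non-character `0xFFFF` passes, in line with the quoted text.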

Without the <code>u</code> flag, the hex range that can be used is <code>[\x{00}-\x{ff}]</code>, but with the <code>u</code> flag it goes up to a 4-byte value, <code>\x{7fffffff}</code> (<code>[\x{00000000}-\x{7fffffff}]</code>).

So if I execute the code below:

<pre><code>preg_match("/[\x{00000000}-\x{80000000}]+/u", $str, $match);
</code></pre>

I get this error:

<pre><code>Warning: preg_match(): Compilation failed: character value in \x{...} sequence is too large
</code></pre>

So I can't match a letter like <code>𡃁</code> with an equivalent hex value of <code>f0 a1 83 81</code>. The question is not how to match these letters, but where this range &amp; this boundary come from, since the <code>u</code> modifier should treat strings as <code>UTF-16</code>.

<a href="http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines" rel="noreferrer">PCRE supports UTF-16 since v8.30</a>

<pre><code>echo PCRE_VERSION;
</code></pre>

PCRE version with PHP 5.3.24 - 5.3.28, 5.4.14 - 5.5.7:

<pre><code>8.32 2012-11-30
</code></pre>

PCRE version with PHP 5.3.19 - 5.3.23, 5.4.9 - 5.4.13:

<pre><code>8.31 2012-07-06
</code></pre>

<a href="http://3v4l.org/CrPZ8" rel="noreferrer">http://3v4l.org/CrPZ8</a>

Maximum Hex value in regex
