UTF-8 does not always use one byte per character; it uses 1 to 4 bytes.

<blockquote>
 The first 128 characters (US-ASCII) need one byte.
 
 The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.
 
 Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use[12] including most Chinese, Japanese and Korean [CJK] characters.
 
 Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
</blockquote>

source: <a href="http://en.wikipedia.org/wiki/UTF-8">Wikipedia</a>
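The four size classes in the quote can be checked with Python's built-in encoder. This is only a small sketch; the sample characters and the helper name `utf8_length` are my own choices for illustration, not part of the quoted text.

```python
# One sample character from each size class described in the quote.
samples = {
    "A": 1,    # U+0041, US-ASCII
    "é": 2,    # U+00E9, Latin-1 Supplement
    "中": 3,   # U+4E2D, CJK inside the Basic Multilingual Plane
    "😀": 4,   # U+1F600, emoji outside the Basic Multilingual Plane
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch, expected in samples.items():
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s), expected {expected}")
```

The boundary between two and three bytes sits at U+0800, which is 128 + 1,920 = 2,048 code points in.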

UTF-8 uses 1 to 4 bytes per character. ASCII characters take one byte (the first 128 Unicode values are the same as ASCII), and that needs only 7 bits. If the highest ("sign") bit is set, the byte belongs to a multi-byte sequence. In the first byte of such a sequence, the number of consecutive high bits set gives the number of bytes in the sequence; these are followed by a 0, and the remaining bits contribute to the value. In each of the following bytes, the highest two bits are 1 and 0, and the remaining 6 bits are for the value.

So a four-byte sequence begins with 11110... (where ... = three bits for the value), followed by three bytes with 6 bits each for the value, yielding a 21-bit value. 2^21 exceeds the number of Unicode code points, so all of Unicode can be expressed in UTF-8.
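To make the bit layout concrete, here is a hand-written encoder that follows the description above. It is a minimal sketch for illustration; the function name `encode_utf8` is my own, and real code should simply call `str.encode("utf-8")`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following the bit layout above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Print each byte in binary so the lead and continuation prefixes are visible.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = encode_utf8(cp)
    print(f"U+{cp:04X}:", " ".join(f"{b:08b}" for b in encoded))
```

In the printed output, every lead byte of a multi-byte sequence starts with `110`, `1110`, or `11110`, and every following byte starts with `10`.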

UTF-8 is a variable-length encoding with a minimum of 8 bits per character.
Characters with higher code points take up to 32 bits.

Quote from Wikipedia: "UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed 'octets' in the Unicode Standard)."
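The figure 1,112,064 in the quote is the full code space U+0000..U+10FFFF minus the surrogate range U+D800..U+DFFF, which is reserved for UTF-16 and cannot be encoded in UTF-8. A short sketch of that arithmetic (the variable and function names are mine):

```python
FIRST, LAST = 0x0000, 0x10FFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF

code_space = LAST - FIRST + 1                        # every code point
surrogates = SURROGATE_LAST - SURROGATE_FIRST + 1    # reserved for UTF-16
encodable = code_space - surrogates

print(f"{code_space:,} - {surrogates:,} = {encodable:,}")


def utf8_accepts(cp: int) -> bool:
    """Return True if Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(utf8_accepts(0xD7FF), utf8_accepts(0xD800), utf8_accepts(0xE000))
```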

Some links:

<ul>
<li><a href="http://www.utf-8.com/" rel="nofollow">http://www.utf-8.com/</a></li>
<li><a href="http://www.joelonsoftware.com/articles/Unicode.html" rel="nofollow">http://www.joelonsoftware.com/articles/Unicode.html</a></li>
<li><a href="http://www.icu-project.org/docs/papers/forms_of_unicode/" rel="nofollow">http://www.icu-project.org/docs/papers/forms_of_unicode/</a></li>
<li><a href="http://en.wikipedia.org/wiki/UTF-8" rel="nofollow">http://en.wikipedia.org/wiki/UTF-8</a></li>
</ul>

Check out the Unicode Standard and related information, such as their FAQ entry, <a href="http://unicode.org/faq/utf_bom.html" rel="nofollow">UTF-8, UTF-16, UTF-32 &amp; BOM</a>. It is not an easy read, but it is authoritative, and much of what you might read about UTF-8 elsewhere is questionable.

The “8” in “UTF-8” relates to the length of code units in bits. Code units are the entities used to encode characters, not necessarily as a simple one-to-one mapping. UTF-8 uses a variable number of code units to encode a character.

The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode coding space, which even includes noncharacters and unassigned code points.
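The difference between code units and characters can be seen by counting code units for the same character in each encoding form. This is a small sketch using Python's standard codecs; the helper name `code_units` is mine, and the little-endian codec variants are used only so that no byte order mark is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed for one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


for ch in ("A", "中", "😀"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

All three forms can represent the same characters; only the number and size of the code units change.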

While I agree with mpen on the current maximum number of UTF-8 codes (2,164,864) (his answer is listed below; I couldn't comment on it), he is off by two levels if you remove the two major restrictions of UTF-8: the 4-byte limit, and the rule that codes 254 and 255 cannot be used (he only removed the 4-byte limit).

Starting code 254 follows the basic arrangement of starting bits (multi-bit flag set to 1, a count of six 1s, a terminal 0, and no spare bits), giving you 6 additional bytes to work with (six 10xxxxxx groups, an additional 2^36 codes).

Starting code 255 doesn't exactly follow the basic setup: there is no terminal 0, but all bits are used, giving you 7 additional bytes (multi-bit flag set to 1, a count of seven 1s, and no terminal 0 because all bits are used; seven 10xxxxxx groups, an additional 2^42 codes).

Adding these in gives a final maximum presentable character set of 4,468,982,745,216. This is more than all characters in current use, old or dead languages, and any believed lost languages. Angelic or Celestial script anyone?
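The totals in this answer can be reproduced by summing the payload sizes of each sequence length. Note that this counts bit patterns in a hypothetical extended scheme, not valid standard UTF-8: only lengths 1 to 4 are allowed today, lengths 5 and 6 come from the original, now obsolete design, and lengths 7 and 8 are this answer's thought experiment. The dictionary name is mine.

```python
# Payload bits available for each sequence length (total bytes in the sequence).
payload_bits = {
    1: 7,    # 0xxxxxxx
    2: 11,   # 110xxxxx + 1 continuation byte
    3: 16,   # 1110xxxx + 2 continuation bytes
    4: 21,   # 11110xxx + 3 continuation bytes
    5: 26,   # 111110xx + 4 continuation bytes (obsolete)
    6: 31,   # 1111110x + 5 continuation bytes (obsolete)
    7: 36,   # lead byte 254 + 6 continuation bytes (hypothetical)
    8: 42,   # lead byte 255 + 7 continuation bytes (hypothetical)
}

up_to_four_bytes = sum(2 ** payload_bits[n] for n in range(1, 5))
all_lengths = sum(2 ** bits for bits in payload_bits.values())

print(f"{up_to_four_bytes:,}")  # 2,164,864
print(f"{all_lengths:,}")       # 4,468,982,745,216
```

Both numbers include overlong forms (for example, a small value padded into a longer sequence), which standard UTF-8 rejects, so they are upper bounds on bit patterns rather than counts of distinct characters.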

Also, besides 254 and 255, there are other single-byte codes that cannot stand alone or start a sequence in the UTF-8 standard: 128-191, and a few others. Codes 128-191 are continuation bytes, so they are only valid after a starting code. The other starting codes (and associated ranges) are invalid for one or more reasons (<a href="https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences" rel="nofollow">https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences</a>).
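Which byte values can never start a valid sequence can be found by brute force with Python's strict decoder. This is a sketch under the assumption that Python's built-in `utf-8` codec follows the current standard; the helper names are mine.

```python
def decodes(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8 according to Python's decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def can_start_sequence(b: int) -> bool:
    """Try b alone, then b followed by one to three continuation bytes."""
    if decodes(bytes([b])):
        return True
    return any(
        decodes(bytes([b, second]) + b"\x80" * extra)
        for second in range(0x80, 0xC0)   # every possible second byte
        for extra in range(3)             # zero, one or two more bytes
    )


invalid_leads = [b for b in range(256) if not can_start_sequence(b)]
print(len(invalid_leads), [hex(b) for b in invalid_leads])
```

The result is 0x80-0xBF (continuation bytes), 0xC0 and 0xC1 (they could only produce overlong forms), and 0xF5-0xFF (they would encode values above U+10FFFF or need more than four bytes).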

Unicode is firmly married to UTF-8. The Unicode code space fits in 21 bits (2^21 = 2,097,152 values), which is exactly the payload of a four-byte UTF-8 sequence; the code points actually defined are U+0000..U+10FFFF (1,114,112 of them). Both systems reserve the same 'dead' space and restricted zones for code points etc. <a href="https://en.wikipedia.org/wiki/Unicode" rel="nofollow noreferrer">...as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters</a>

From the Unicode FAQ: <a href="https://www.unicode.org/faq/utf_bom.html" rel="nofollow noreferrer">Unicode FAQ</a>

<blockquote>
 The Unicode Standard encodes characters in the range U+0000..U+10FFFF,
 which amounts to a 21-bit code space.
</blockquote>

From the UTF-8 Wikipedia page: <a href="https://en.wikipedia.org/wiki/UTF-8#Description" rel="nofollow noreferrer">UTF-8 Description</a>

<blockquote>
 Since the restriction of the Unicode code-space to 21-bit values in
 2003, UTF-8 is defined to encode code points in one to four bytes, ...
</blockquote>
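The two quotes can be checked against the top of the range: U+10FFFF needs 21 bits and exactly four bytes, and anything beyond it is rejected. A short sketch using only Python built-ins (the variable names are mine):

```python
top = 0x10FFFF

print(top.bit_length())                  # 21
print(chr(top).encode("utf-8").hex(" ")) # f4 8f bf bf

# One past the top of the code space is not a code point at all.
try:
    chr(top + 1)
    beyond_is_rejected = False
except ValueError:
    beyond_is_rejected = True

# The four-byte pattern that would encode U+110000 is invalid UTF-8.
try:
    b"\xf4\x90\x80\x80".decode("utf-8")
    overflow_is_rejected = False
except UnicodeDecodeError:
    overflow_is_rejected = True

print(beyond_is_rejected, overflow_is_rejected)
```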

If UTF-8 is 8 bits, doesn't that mean there can be at most 256 different characters?

The first 128 code points are the same as in ASCII. But it is said that UTF-8 can support over a million characters?

How does this work?

How many characters can UTF-8 encode?
