Well I just pulled up the Wikipedia page on it too, and in the intro portion I saw "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)"

As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually several encodings of Unicode, and, again in that quote, one of them (UTF-8) even uses 1 byte per character for ASCII text, just like what you are used to.

So the simple answer you want is: it varies.
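
To make "it varies" concrete, here is a minimal Java sketch that encodes a few sample characters and counts the resulting bytes. The class and method names are made up for illustration. It assumes the runtime provides a "UTF-32BE" charset, which common JDKs do although it is not one of the charsets every Java platform is required to support. The big-endian variants are used so that no byte order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    /** Number of bytes the text occupies once encoded with the given charset. */
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+0041 (letter A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji)
        String[] samples = {"A", "\u00E9", "\u20AC", "\uD83D\uDE00"};
        Charset utf32 = Charset.forName("UTF-32BE"); // assumption: available on this JDK
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    s.codePointAt(0),
                    byteCount(s, StandardCharsets.UTF_8),
                    byteCount(s, StandardCharsets.UTF_16BE),
                    byteCount(s, utf32));
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, while UTF-32 always reports 4.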

Simply speaking, <code>Unicode</code> is a standard that assigns one number (called a code point) to every character in the world (it is still a work in progress).

Now you need to represent these code points using bytes; that is called <code>character encoding</code>. <code>UTF-8, UTF-16, UTF-32</code> are ways of representing those characters.

<code>UTF-8</code> is a multibyte character encoding. Characters take 1 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current definition stops at 4, which is enough for every Unicode code point).

<code>UTF-32</code> uses 4 bytes for every character.

<code>UTF-16</code> uses one 16-bit code unit (2 bytes) for each character in the BMP (Basic Multilingual Plane), which is enough for most practical purposes, and two code units (4 bytes) for every other character. Java uses this encoding in its strings.
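
Because Java strings are UTF-16, the difference between a 16-bit code unit and a code point is visible directly in the String API. The following is a small sketch (class and method names are illustrative only) comparing <code>String.length()</code>, which counts code units, with <code>String.codePointCount</code>, which counts code points.

```java
class CodeUnitsVsCodePoints {
    /** Number of 16-bit UTF-16 code units (Java chars); this is what String.length() reports. */
    static int codeUnitCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points, counting each surrogate pair once. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4E2D";           // U+4E2D, a CJK ideograph inside the BMP
        String astral = "\uD83D\uDE00";  // U+1F600, outside the BMP, stored as two chars
        System.out.println(codeUnitCount(bmp) + " " + codePointCount(bmp));       // 1 1
        System.out.println(codeUnitCount(astral) + " " + codePointCount(astral)); // 2 1
        System.out.println(Character.charCount(0x1F600));                         // 2
    }
}
```

So a single user-visible character can have a Java string length of 2 when it lies outside the BMP.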

There is a great tool for calculating the bytes of any string in UTF-8: <a href="http://mothereff.in/byte-counter" rel="noreferrer">http://mothereff.in/byte-counter</a>

Update: @mathias has made the code public: <a href="https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js" rel="noreferrer">https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js</a>

Check out this <a href="http://r12a.github.io/apps/conversion/" rel="nofollow noreferrer">Unicode code converter</a>. For example, enter <code>0x2009</code>, where <a href="http://unicode-table.com/en/#2009" rel="nofollow noreferrer">2009 is the Unicode number for thin space</a>, in the "0x... notation" field, and click Convert. The hexadecimal number <code>E2 80 89</code> (3 bytes) appears in the "UTF-8 code units" field.
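
The same conversion can be reproduced locally. This is a minimal Java sketch, with a hypothetical helper name, that encodes one code point as UTF-8 and prints the bytes in hex. It is meant for real characters; if you pass a lone surrogate value (U+D800 to U+DFFF), Java's encoder substitutes a replacement byte instead.

```java
import java.nio.charset.StandardCharsets;

class Utf8Hex {
    /** Encodes one code point as UTF-8 and returns the bytes as space-separated hex. */
    static String utf8Hex(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to print the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x2009)); // thin space: E2 80 89
    }
}
```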

Strangely enough, nobody has pointed out how to calculate how many bytes one Unicode character takes. Here is the rule for UTF-8 encoded strings:

<pre><code>Binary    Hex          Comments
0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF   Continuation byte: one of 1-3 bytes following the first
110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding
</code></pre>

So the quick answer is: it takes 1 to 4 bytes, and the first byte indicates how many bytes the character takes up.
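
The lead-byte rule translates into a few bit masks. Below is a sketch in Java (the class name and the 0 / -1 return conventions are my own choices, not part of any standard API). It only classifies the bit patterns listed above; a strict validator would additionally reject lead bytes such as 0xC0, 0xC1 and 0xF5 to 0xF7, which can never occur in well-formed UTF-8.

```java
class Utf8LeadByte {
    /**
     * Returns the sequence length announced by a UTF-8 lead byte,
     * 0 for a continuation byte, or -1 for a byte matching none of the patterns.
     */
    static int sequenceLength(int b) {
        b &= 0xFF;                          // treat a signed Java byte as unsigned
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return 0;   // 10xxxxxx (continuation byte)
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
        return -1;                          // 11111xxx
    }

    public static void main(String[] args) {
        byte[] thinSpace = {(byte) 0xE2, (byte) 0x80, (byte) 0x89}; // U+2009 in UTF-8
        System.out.println(sequenceLength(thinSpace[0])); // 3
    }
}
```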

For UTF-16, a character needs four bytes (two code units) if its first code unit is in the range 0xD800 to 0xDBFF; such a pair of code units is called a "surrogate pair." More specifically, a surrogate pair has the form:

<pre><code>[0xD800 - 0xDBFF] [0xDC00 - 0xDFFF]
</code></pre>

where [...] indicates a two-byte code unit with the given range. Anything &lt;= 0xD7FF is one code unit (two bytes). Anything &gt;= 0xE000 is also one code unit (two bytes). A code unit in the range 0xDC00 to 0xDFFF is only valid as the second half of a pair.

See <a href="http://unicodebook.readthedocs.io/unicode_encodings.html" rel="nofollow">http://unicodebook.readthedocs.io/unicode_encodings.html</a>, section 7.5.
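
As an illustration of how a surrogate pair is built, here is a hedged Java sketch that splits a code point above U+FFFF into its two code units by hand. The class and method names are hypothetical; in real code <code>Character.toChars</code> already does this, and the sketch is only meant to show the arithmetic.

```java
class SurrogatePairs {
    /** Splits a supplementary code point (U+10000..U+10FFFF) into its two UTF-16 code units. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    /** Bytes needed in UTF-16 for the character that starts with this code unit. */
    static int utf16Bytes(char firstUnit) {
        return Character.isHighSurrogate(firstUnit) ? 4 : 2;
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE00
        System.out.println(utf16Bytes(pair[0]));                         // 4
    }
}
```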

blocks|key|754495|text|在UTF-8中：|type|unstyled|depth|inlineStyleRanges|entityRanges|data|754496|1+byte:+++++++0+-+++++7F+++++(ASCII)
2+bytes:+++++80+-++++7FF+++++(all+European+plus+some+Middle+Eastern)
3+bytes:++++800+-+++FFFF+++++(multilingual+plane+incl.+the+top+1792+and+private-use)
4+bytes:++10000+-+10FFFF|code-block|syntax|javascript|754497|在UTF-16中：|754498|2+bytes:++++++0+-+++D7FF+++++(multilingual+plane+except+the+top+1792+and+private-use+)
4+bytes:+++D800+-+10FFFF|754499|在UTF-32中：|754500|4+bytes:++++++0+-+10FFFF|754501|根据定义，10FFFF是最后一个unicode码点，之所以这样定义，是因为它是UTF-16的技术限制。|754502|它也是UTF-8可以用4字节编码的最大码点，但UTF-8编码背后的思想也适用于5字节和6字节编码，以覆盖直到7FFFFFFF的码点。只有UTF-32的一半。|754503|entityMap^0|0|0|0|0|0|0|0|0^^$0|@$1|2|3|4|5|6|7|U|8|@]|9|@]|A|$]]|$1|B|3|C|5|D|7|V|8|@]|9|@]|A|$E|F]]|$1|G|3|H|5|6|7|W|8|@]|9|@]|A|$]]|$1|I|3|J|5|D|7|X|8|@]|9|@]|A|$E|F]]|$1|K|3|L|5|6|7|Y|8|@]|9|@]|A|$]]|$1|M|3|N|5|D|7|Z|8|@]|9|@]|A|$E|F]]|$1|O|3|P|5|6|7|10|8|@]|9|@]|A|$]]|$1|Q|3|R|5|6|7|11|8|@]|9|@]|A|$]]|$1|S|3|-4|5|6|7|12|8|@]|9|@]|A|$]]]|T|$]]

In UTF-8:

<pre><code>1 byte:       0 -     7F     (ASCII)
2 bytes:     80 -    7FF     (all European plus some Middle Eastern)
3 bytes:    800 -   FFFF     (multilingual plane incl. the top 1792 and private-use)
4 bytes:  10000 - 10FFFF
</code></pre>

In UTF-16:

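<pre><code>2 bytes:      0 -   D7FF     (multilingual plane below the surrogate range)
2 bytes:   E000 -   FFFF     (private-use and the top 1792 of the multilingual plane)
4 bytes:  10000 - 10FFFF     (surrogate pairs; D800 - DFFF are reserved for them)
</code></pre>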

In UTF-32:

<pre><code>4 bytes: 0 - 10FFFF
</code></pre>

10FFFF is the last Unicode code point by definition, and it's defined that way because it's UTF-16's technical limit.

It also fits within what UTF-8 can encode in 4 bytes (the 4-byte pattern has room for 21 bits, i.e. up to 1FFFFF). The idea behind UTF-8's encoding also works for 5- and 6-byte sequences, which would cover code points up to 7FFFFFFF, i.e. half of what a 32-bit UTF-32 code unit can hold, but such sequences are not valid in UTF-8 as currently defined.
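
The range boundaries can be turned into a small lookup. The following Java sketch (hypothetical class and method names) returns the encoded size of a single code point without encoding anything. It assumes the input is a real character value: surrogate values D800 to DFFF and anything above 10FFFF are rejected, and E000 to FFFF is treated as a single 16-bit unit in UTF-16.

```java
class EncodedLength {
    /** Bytes needed to encode the code point in UTF-8. */
    static int utf8Length(int cp) {
        checkScalar(cp);
        if (cp <= 0x7F) return 1;
        if (cp <= 0x7FF) return 2;
        if (cp <= 0xFFFF) return 3;
        return 4;
    }

    /** Bytes needed in UTF-16: one code unit inside the BMP, two outside it. */
    static int utf16Length(int cp) {
        checkScalar(cp);
        return cp <= 0xFFFF ? 2 : 4;
    }

    /** Bytes needed in UTF-32: always one 32-bit code unit. */
    static int utf32Length(int cp) {
        checkScalar(cp);
        return 4;
    }

    /** Rejects values that are not encodable characters. */
    private static void checkScalar(int cp) {
        boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (cp < 0 || cp > 0x10FFFF || surrogate) {
            throw new IllegalArgumentException(
                    "not an encodable code point: " + Integer.toHexString(cp));
        }
    }

    public static void main(String[] args) {
        int euro = 0x20AC;
        System.out.println(utf8Length(euro) + " " + utf16Length(euro) + " " + utf32Length(euro)); // 3 2 4
    }
}
```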

<code>Unicode</code> is a <a href="https://unicode.org/standard/WhatIsUnicode.html" rel="nofollow noreferrer">standard</a> which provides a unique number for every character. These unique numbers are called <code>code point</code>s, and they are assigned to all characters existing in the world (some are still to be added).

For different purposes, you might need to represent these <code>code points</code> in bytes (most programming languages do so), and here's where <code>Character Encoding</code> kicks in.

<code>UTF-8</code>, <code>UTF-16</code>, <code>UTF-32</code> and so on are all <code>Character Encodings</code>, and Unicode's code points are represented in these encodings, in different ways.

 
<code>UTF-8</code> is a variable-width encoding: each encoded character occupies 1 to 4 bytes inclusive;

<code>UTF-16</code> is also variable-width: each encoded character takes either 2 or 4 bytes (one or two 16-bit code units). A single code unit covers only the part of Unicode called the BMP (Basic Multilingual Plane), which is enough for almost all cases; characters outside it take two code units. Java uses <code>UTF-16</code> encoding for its strings and characters;

<code>UTF-32</code> has fixed length and each character takes exactly 4 bytes (32 bits).
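
To see that the encodings are just different byte representations of the same code points, here is a small Java sketch (the class and method names are illustrative) that encodes a string, decodes the bytes again with the same charset, and compares the result with the original. The big-endian UTF-16 variant is used so that no byte order mark is included.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    /** Encodes the text with the given charset, then decodes the bytes back into a String. */
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "cafe" with an acute accent on the final e (U+00E9)
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 5
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 8
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
        System.out.println(roundTrip(text, StandardCharsets.UTF_16BE).equals(text));
    }
}
```

The byte counts differ between the two encodings, but both decode back to the same four code points.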

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require? 

I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character? 

And what do UTF-7, UTF-8, UTF-16 etc. mean? Are they different versions of Unicode?

I read the <a href="http://en.wikipedia.org/wiki/Unicode">Wikipedia article about Unicode</a> but it is quite difficult for me. I am looking forward to seeing a simple answer.

How many bytes does one Unicode character take?

Java
