If you compile with different encodings, these encodings only affect your source files. If you don't have any special characters inside your sources, there will be no difference in the resulting byte code.

For runtime, the default charset of the operating system is used. This is independent from the charset you used for compiling.

Erm, based on <a href="http://en.wikipedia.org/wiki/ISO-8859-1#ISO-8859-1" rel="nofollow noreferrer">this</a> and <a href="http://en.wikipedia.org/wiki/Windows-1252" rel="nofollow noreferrer">this</a>, the ACK control character is exactly the same in both encodings. The difference that the link you pointed out describes is that DOS/Windows actually has symbols for most of the control characters in Windows-1252 (like the heart/club/spade/diamond characters and smileys), while ISO-8859 does not.
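To check this claim from Java, here is a minimal sketch that encodes U+0006 (ACK) with each of the three charsets from the question. The class and method names are made up for illustration, and it assumes the running JDK ships the <code>windows-1252</code> charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AckBytes {
    /** Encodes U+0006 (ACK) with the given charset and returns the resulting bytes. */
    static byte[] encodeAck(Charset charset) {
        return "\u0006".getBytes(charset);
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1,
            Charset.forName("windows-1252"), // assumption: this charset is available in the JDK
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            byte[] bytes = encodeAck(cs);
            // Each charset should report a single byte with value 0x06.
            System.out.printf("%-12s -> %d byte(s), first = 0x%02x%n",
                    cs.name(), bytes.length, bytes[0]);
        }
    }
}
```

All three charsets agree on code points 0 to 127, so the encoded byte is the same; only the glyph a terminal or font chooses to draw for it can differ.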

<ol>
<li>Source files can be in any encoding</li>
<li>You need to tell the compiler the encoding of source files (e.g. <code>javac -encoding...</code>); otherwise, platform encoding is assumed</li>
<li>In class file binaries, string literals are stored as (modified) UTF-8, but unless you work with bytecode, this doesn't matter (see <a href="http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html#7963" rel="noreferrer">JVM spec</a>)</li>
<li>Strings in Java are UTF-16, always (see <a href="http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1" rel="noreferrer">Java language spec</a>)</li>
<li>The <code>System.out</code> <a href="http://java.sun.com/javase/6/docs/api/java/io/PrintStream.html" rel="noreferrer"><code>PrintStream</code></a> will transform your strings from UTF-16 to bytes in the system encoding prior to writing them to stdout</li>
</ol>
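The last three points can be observed directly. The sketch below is only an illustration (the class and helper names are hypothetical): it captures what a <code>PrintStream</code> writes for the same string under two charsets, and it uses <code>DataOutputStream.writeUTF</code>, which is documented to write modified UTF-8, to show one place where that format differs from standard UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingStages {
    /** Bytes a PrintStream emits for the text when built with an explicit charset name. */
    static byte[] printed(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (IOException e) { // UnsupportedEncodingException for an unknown charset name
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Modified UTF-8 as written by writeUTF, with the 2-byte length prefix stripped. */
    static byte[] modifiedUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = sink.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Lowercase hex dump, bytes separated by single spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // written as an escape so the source file encoding is irrelevant
        System.out.println(hex(printed(eAcute, "ISO-8859-1"))); // e9
        System.out.println(hex(printed(eAcute, "UTF-8")));      // c3 a9

        String nul = "\u0000";
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8))); // 00
        System.out.println(hex(modifiedUtf8(nul)));                    // c0 80
    }
}
```

The in-memory string is identical in every call; only the bytes produced at the output boundary change with the charset.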

Notes:

<ul>
<li><a href="http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html" rel="noreferrer">Blog post I wrote on Java encoding</a></li>
<li><a href="http://bugs.sun.com/view_bug.do?bug_id=4163515" rel="noreferrer">Don't use <code>-Dfile.encoding</code></a></li>
</ul>

A summary of "what to know" about string encodings in Java:

<ul>
<li>A <code>String</code> instance, in memory, is a sequence of 16-bit "code units", which Java handles as <code>char</code> values. Conceptually, those code units encode a sequence of "code points", where a code point is "the number attributed to a given character as per the Unicode standard". Code points range from 0 to a bit more than one million, although only a hundred thousand or so have been defined so far. Code points from 0 to 65535 are encoded into a single code unit, while other code points use two code units. This encoding is called UTF-16; it extends the older UCS-2, which could only represent the first 65536 code points. There are a few subtleties (some code points are invalid, e.g. 65535, and there is a range of 2048 code points in the first 65536 reserved precisely for the encoding of the other code points).</li>
<li>Code pages and the like do not impact how Java stores the strings in RAM. That's why "Unicode" starts with "Uni". As long as you do not perform I/O with your strings, you are in the world of Unicode where everybody uses the same mapping of characters to code points.</li>
<li>Charsets come into action when encoding strings into bytes, or decoding strings from bytes. Unless explicitly specified, Java will use a default charset which depends on the user "locale", a fuzzy aggregate notion of what makes a computer in Japan speak Japanese. When you print out a string with <code>System.out.println()</code>, the JVM will convert the string into something suitable for wherever those characters go, which often means converting them to bytes using a charset which depends on the current locale (or what the JVM guessed of the current locale).</li>
<li>One Java application is the Java compiler. The Java compiler needs to interpret the contents of source files, which are, at the system level, just a bunch of bytes. The Java compiler selects a default charset for that, and it does so depending on the current locale, just like any Java program would, because the Java compiler is itself written in Java. The Java compiler (<code>javac</code>) accepts a command-line flag (<code>-encoding</code>) which can be used to override that default choice.</li>
<li>The Java compiler produces class files which are locale-independent. String literals end up in those class files with (sort of) UTF-8 encoding, regardless of the charset which the Java compiler used to interpret the source files. The locale of the system on which the Java compiler runs affects how the source code is interpreted, but once the Java compiler has understood that your string contains code point number 6, that code point is what makes its way into the class files, and no other. Note that code points 0 to 127 have the same encoding in UTF-8, CP-1252 and ISO-8859-1, so the result you observed is no surprise.</li>
<li>Even though <code>String</code> instances do not depend on any kind of encoding as long as they remain in RAM, some of the operations you may want to perform on strings are locale-dependent. This is not a question of encoding; but a locale also defines a "language", and it so happens that the notions of uppercase and lowercase depend on the language which is used. The Usual Suspect is calling <code>"unicode".toUpperCase()</code>: this yields <code>"UNICODE"</code> except if the current locale is Turkish, in which case you get <code>"UNİCODE"</code> (the "<code>I</code>" has a dot). The basic assumption here is that if the current locale is Turkish then the data the application is managing is probably Turkish text; personally, I find this assumption at best questionable. But so it is.</li>
</ul>
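Two of the points above are easy to try out. This is a small sketch with hypothetical helper names: it counts code units versus code points for a character outside the first 65536 code points (U+1D11E, chosen arbitrarily), and it passes the locale to <code>toUpperCase</code> explicitly instead of relying on the default.

```java
import java.util.Locale;

class StringFacts {
    /** Number of 16-bit code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Uppercases with an explicit locale rather than the JVM default. */
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        // U+1D11E lies beyond 65535, so UTF-16 needs a surrogate pair for it.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef));                          // 2
        System.out.println(codePoints(clef));                         // 1
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true

        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(upper("unicode", Locale.ROOT));                         // UNICODE
        // Compared against an escape (U+0130, dotted capital I) to stay
        // independent of the console encoding.
        System.out.println(upper("unicode", turkish).equals("UN\u0130CODE"));      // true
    }
}
```

Passing <code>Locale.ROOT</code> is one common way to get language-neutral case conversion for identifiers, protocol keywords and similar non-linguistic text.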

In practical terms, you should specify encodings explicitly in your code, at least most of the time. Do not call <code>String.getBytes()</code>, call <code>String.getBytes("UTF-8")</code>. Use of the default, locale-dependent encoding is fine when it is applied to some data exchanged with the user, such as a configuration file or a message to display immediately; but elsewhere, avoid locale-dependent methods whenever possible.
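As an illustration of that advice, here is a minimal round trip with an explicit charset. It uses <code>StandardCharsets.UTF_8</code> (available since Java 7) rather than the string name <code>"UTF-8"</code>, which avoids the checked <code>UnsupportedEncodingException</code>; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    /** Encodes text to bytes with an explicit charset, never the platform default. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes with the same explicit charset that produced them. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "na\u00efve \u20ac"; // i-diaeresis and the euro sign, as escapes
        byte[] bytes = encode(original);

        // 5 ASCII chars + 1 space (1 byte each), U+00EF (2 bytes), U+20AC (3 bytes)
        System.out.println(bytes.length);                    // 10
        System.out.println(decode(bytes).equals(original));  // true

        // Decoding with a different charset silently yields different text.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original));          // false
    }
}
```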

Calendars are another locale-dependent part of Java. There is the whole time zone business, which depends on the "time zone", which should relate to the geographical position of the computer (and this is not part of the "locale" stricto sensu...). Also, countless Java applications mysteriously fail when run in Bangkok, because in a Thai locale, Java defaults to the Buddhist calendar, according to which the current year is 2553 (2010 in the Gregorian calendar).
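The calendar effect can be reproduced without changing the machine's settings by passing the locale explicitly. The following is a sketch with a hypothetical helper, and it assumes a JDK whose <code>Calendar.getInstance</code> selects the Buddhist calendar for the <code>th-TH</code> locale, which is what Oracle documents for its implementation.

```java
import java.time.Instant;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

class ThaiYear {
    /** Year of the given instant according to the calendar chosen for the locale (UTC). */
    static int yearIn(Locale locale, Instant instant) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"), locale);
        cal.setTimeInMillis(instant.toEpochMilli());
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        Instant mid2010 = Instant.parse("2010-06-15T00:00:00Z");
        System.out.println(yearIn(Locale.US, mid2010));                      // 2010
        // Assumption: the JDK returns a Buddhist calendar here (offset of 543 years).
        System.out.println(yearIn(Locale.forLanguageTag("th-TH"), mid2010)); // 2553
    }
}
```

Code that formats or parses machine-readable dates is safer when it names its calendar and locale explicitly, for the same reason that it should name its charset.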

As a rule of thumb, assume that the world is vast (it is!) and keep things generic: do not do anything which depends on a charset until the very last moment, when I/O must actually be performed.

I recently realized that I don't fully understand Java's string encoding process.

Consider the following code:

<pre><code>public class Main
{
    public static void main(String[] args)
    {
        System.out.println(java.nio.charset.Charset.defaultCharset().name());
        System.out.println("ack char: ^"); /* where ^ = 0x06, the ack char */
    }
}
</code></pre>

Since the control characters are <a href="http://en.wikipedia.org/wiki/Code_page" rel="noreferrer">interpreted differently between windows-1252 and ISO-8859-1</a>, I chose the <code>ack</code> char for testing.

I now compile it with different file encodings: UTF-8, <a href="http://en.wikipedia.org/wiki/Windows-1252" rel="noreferrer">windows-1252</a>, and <a href="http://en.wikipedia.org/wiki/ISO/IEC_8859-1" rel="noreferrer">ISO-8859-1</a>. They all compile to the exact same thing, byte for byte, as verified by <code>md5sum</code>.

I then run the program:

<pre><code>$ java Main | hexdump -C
00000000 55 54 46 2d 38 0a 61 63 6b 20 63 68 61 72 3a 20 |UTF-8.ack char: |
00000010 06 0a |..|
00000012

$ java -Dfile.encoding=iso-8859-1 Main | hexdump -C
00000000 49 53 4f 2d 38 38 35 39 2d 31 0a 61 63 6b 20 63 |ISO-8859-1.ack c|
00000010 68 61 72 3a 20 06 0a |har: ..|
00000017

$ java -Dfile.encoding=windows-1252 Main | hexdump -C
00000000 77 69 6e 64 6f 77 73 2d 31 32 35 32 0a 61 63 6b |windows-1252.ack|
00000010 20 63 68 61 72 3a 20 06 0a | char: ..|
00000019
</code></pre>

It correctly outputs the <code>0x06</code> no matter which encoding is being used.

Ok, it still outputs the same <code>0x06</code>, which would be interpreted as the printable [ACK] char by windows-1252 codepages.

That leads me to a few questions:

<ol>
<li>Is the codepage / charset of the Java file being compiled expected to be identical to the default charset of the system under which it's being compiled? Are the two always synonymous?</li>
<li>The compiled representation doesn't seem dependent on the compile-time charset, is this indeed the case?</li>
<li>Does this imply that strings within Java files may be interpreted differently at runtime if they don't use standard characters for the current charset/locale?</li>
<li>What else should I really know about string and character encoding in Java?</li>
</ol>

From compilation to runtime, how does Java String encoding really work?
