
It breaks JavaScript because string literals cannot contain raw newlines:

<pre><code>var myString = "

";

//SyntaxError: Unexpected token ILLEGAL
</code></pre>

Now, the UTF-8 sequence <code>"E2-80-A8"</code> decodes to Unicode code point <a href="http://www.fileformat.info/info/unicode/char/2028/index.htm" rel="noreferrer"><code>U+2028</code></a>, which JavaScript treats like a newline:

<pre><code> var myString = " ";

//Syntax Error
</code></pre>

It is, however, safe to write

<pre><code>var myString = "\u2028";
//you can now log myString in console and get real representation of this character
</code></pre>

which is what properly encoded JSON will have. I'd look into properly encoding the JSON instead of keeping a blacklist of unsafe characters (which are U+2028 and U+2029, AFAIK).

In PHP:

<pre><code>echo json_encode( chr(0xe2). chr(0x80).chr(0xA8 ) );
//"\u2028"
</code></pre>
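For comparison, here is a minimal JavaScript sketch of the same idea. Note that <code>JSON.stringify</code> leaves U+2028 and U+2029 as raw characters, so a serializer that feeds a JSONP response or an inline script may want to escape them itself. The helper name <code>toScriptSafeJson</code> is hypothetical and not part of any library; engines that implement ES2019 or later accept these characters raw inside string literals, so the escaping mainly protects older ones.

```javascript
// Hypothetical helper (not a library API): serialize a value as JSON and
// escape the two separators that older JavaScript engines treat as line
// terminators inside string literals.
function toScriptSafeJson(value) {
  return JSON.stringify(value)
    .replace(/\u2028/g, "\\u2028")  // LINE SEPARATOR
    .replace(/\u2029/g, "\\u2029"); // PARAGRAPH SEPARATOR
}

// The output contains only the six-character escape, never the raw character,
// and it still parses back to the original string.
const encoded = toScriptSafeJson("before\u2028after");
console.log(encoded.includes("\u2028"));               // false
console.log(JSON.parse(encoded) === "before\u2028after"); // true
```

The result is still valid JSON, so consumers do not need to change.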


A-Z, a-z and 0-9 are generally safe. Outside those 62 characters, you will run into problems with some system. There's no other answer anyone can give you.

For example, you mention domain names. The only way to handle Unicode domain names is to follow RFC 3454 and RFCs 5890-5893, and to process the data that way and only that way. Filenames on most Unix filesystems are arbitrary strings of bytes that don't include / or \0. Treating a Unix filename as a Unicode string without breaking anything is a question in itself. Note that Windows filenames are not even A-Z safe; names like NUL and PRN are reserved. Each domain is going to have its own little issues and quirks, and no simple summary is going to suffice everywhere.
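To illustrate the Windows point, a rough sketch of a reserved-name check might look like the following. The function name <code>isWindowsReservedName</code> is made up for this example, and the list covers only the classic device names (CON, PRN, AUX, NUL, COM1-9, LPT1-9), with or without an extension; treat it as an illustration rather than a complete filename validator.

```javascript
// Hypothetical helper: true when a file name collides with a classic Windows
// device name. The match is case-insensitive and ignores any extension,
// because "nul.txt" is reserved just like "NUL".
const WINDOWS_RESERVED = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])(\..*)?$/i;

function isWindowsReservedName(fileName) {
  return WINDOWS_RESERVED.test(fileName);
}

console.log(isWindowsReservedName("NUL"));        // true
console.log(isWindowsReservedName("prn.txt"));    // true
console.log(isWindowsReservedName("report.txt")); // false
```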


Look at the Unicode charts. There's a list of non-printing characters, and those are the potential troublemakers. Your friend U+2028 has a bunch of friends: <a href="http://www.unicode.org/charts/PDF/U2000.pdf" rel="nofollow">http://www.unicode.org/charts/PDF/U2000.pdf</a>. And it's not just in the 2000 range.

You could either nuke them all, or separate them into categories and handle each one differently (for example, SEP chars like U+2028 become \n or get escaped properly), etc.
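One way to sketch the "separate them into categories" option in JavaScript is with Unicode property escapes (available since ES2018). The function name <code>normalizeInvisible</code> and the choice to strip every format character are assumptions made for this example. Removing all of category Cf also removes joiners that some scripts and emoji sequences rely on, so a real filter would likely be more selective.

```javascript
// Hypothetical helper: turn line/paragraph separators into "\n" and drop
// invisible format characters (general category Cf, e.g. zero-width space
// and the bidi override controls).
function normalizeInvisible(text) {
  return text
    .replace(/[\p{Zl}\p{Zp}]/gu, "\n") // U+2028 (Zl) and U+2029 (Zp)
    .replace(/\p{Cf}/gu, "");          // format characters
}

const cleaned = normalizeInvisible("a\u2028b\u200Bc\u202Ed");
console.log(cleaned === "a\nbcd"); // true
```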

HTH


There's a database of character properties and a report describing it, the <a href="http://unicode.org/reports/tr44/#Properties" rel="nofollow">UNICODE CHARACTER DATABASE</a>, that gives a good idea of how browsers "should" treat a code point. I love that word, "should". The safest approach is going to be a whitelist; you could probably go with L|M|N|S (Letter, Mark, Number, or Symbol).

Have a look at the <a href="http://site.icu-project.org/" rel="nofollow">ICU project</a> for a library.
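As a rough illustration of the L|M|N|S whitelist, JavaScript's Unicode property escapes can express it directly without ICU. The names <code>isWhitelisted</code> and <code>keepWhitelisted</code> are hypothetical. Note that this exact whitelist rejects spaces and punctuation too, so in practice you would probably have to widen it for your own data.

```javascript
// Allowed general categories: Letter, Mark, Number, Symbol.
const ALLOWED_ONLY = /^[\p{L}\p{M}\p{N}\p{S}]*$/u;
const NOT_ALLOWED = /[^\p{L}\p{M}\p{N}\p{S}]/gu;

// Hypothetical helper: true when every code point is in the whitelist.
function isWhitelisted(text) {
  return ALLOWED_ONLY.test(text);
}

// Hypothetical helper: drop every code point outside the whitelist.
function keepWhitelisted(text) {
  return text.replace(NOT_ALLOWED, "");
}

console.log(isWhitelisted("abc123"));            // true
console.log(isWhitelisted("a\u2028b"));          // false
console.log(keepWhitelisted("a\u2028b") === "ab"); // true
```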

I recently hit a bug caused by how browsers handle certain data, and I am looking for a safe rule for escaping strings that does not double their size unless necessary.

The UTF-8 byte sequence "E2-80-A8" (U+2028, LINE SEPARATOR) is a perfectly valid character in the Unicode database. However, that sequence represents a line separator (yes, one other than "0A").

Worse, many browsers (including Chrome, Firefox, and Safari; I didn't test others) failed to process a JSONP callback whose string contained that Unicode character. The JSONP was included by a non-Unicode HTML page over which I had no control.

The browsers simply reported an INVALID CODE/syntax error for JavaScript that looked valid in debug tools and in every text editor. My guess is that the browser tried to convert "E2-80-A8" to BIG-5 and that this broke the JS syntax.

The above is only one example of how Unicode can break your system unexpectedly. As far as I know, attackers can exploit RTL and other control characters for their own benefit. The Unicode specification also contains many "quotes", "spaces", "symbols" and "controls".
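To make the RTL concern concrete, here is a small sketch that flags bidirectional control characters in a string, for example in an uploaded file name. The function name <code>containsBidiControl</code> is invented for this illustration, and the character list (U+061C, U+200E, U+200F, U+202A to U+202E, U+2066 to U+2069) covers only the explicit bidi controls, not every invisible character.

```javascript
// Explicit bidirectional formatting controls: ALM, LRM, RLM, the
// embedding/override controls, and the isolate controls.
const BIDI_CONTROLS = /[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/;

// Hypothetical helper: true when the text contains any bidi control.
function containsBidiControl(text) {
  return BIDI_CONTROLS.test(text);
}

// U+202E (RIGHT-TO-LEFT OVERRIDE) can visually reverse what follows it,
// which can disguise the real extension of a file name.
console.log(containsBidiControl("photo\u202Egnp.exe")); // true
console.log(containsBidiControl("photo.png"));          // false
```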

QUESTION:

Is there a list of Unicode characters that every programmer should know about, covering hidden features (and bugs) that we might not want to take effect in our applications? (For example, Windows disabling RTL in filenames.)

EDIT:

I am not asking about JSON or JavaScript specifically. I am asking for general best practices for Unicode handling in all programs.

List of Unicode characters that should be filtered in output?
