Chinese characters lie within certain Unicode ranges:

<ul>
<li>2F00-2FDF: Kangxi Radicals</li>
<li>4E00-9FFF: CJK Unified Ideographs</li>
<li>3400-4DBF: CJK Unified Ideographs Extension A</li>
</ul>

So all you basically need to do is check whether the character's code point lies within the known ranges. This example is a good starting point for writing a stack-based parser/splitter; you only need to extend it to separate digits from Latin letters, which should be straightforward (hint: <code>Character#isDigit()</code>):

<pre><code>import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Unicode blocks that are treated as "Chinese" for this check
Set&lt;UnicodeBlock&gt; chineseUnicodeBlocks = new HashSet&lt;UnicodeBlock&gt;(Arrays.asList(
    UnicodeBlock.CJK_COMPATIBILITY,
    UnicodeBlock.CJK_COMPATIBILITY_FORMS,
    UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS,
    UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT,
    UnicodeBlock.CJK_RADICALS_SUPPLEMENT,
    UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION,
    UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
    UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
    UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B,
    UnicodeBlock.KANGXI_RADICALS,
    UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS
));

String mixedChinese = "查詢促進民間參與公共建設法（210ＢＯＴ法）";

// Iterate by code point rather than by char, so that characters outside
// the BMP (e.g. Extension B, stored as surrogate pairs) are classified correctly
for (int i = 0; i &lt; mixedChinese.length(); ) {
    int cp = mixedChinese.codePointAt(i);
    String s = new String(Character.toChars(cp));
    if (chineseUnicodeBlocks.contains(UnicodeBlock.of(cp))) {
        System.out.println(s + " is chinese");
    } else {
        System.out.println(s + " is not chinese");
    }
    i += Character.charCount(cp);
}
</code></pre>
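The extension hinted at above could look roughly like the following. This is only a minimal sketch, not a definitive implementation: the class name <code>ChineseSplitter</code>, the <code>Kind</code> enum, and the methods <code>kindOf</code> and <code>split</code> are hypothetical names introduced for illustration, and the choice of which blocks count as "one token per character" is an assumption. It emits every ideograph and every punctuation mark as its own token, and groups consecutive digits and consecutive non-ideograph letters into runs.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChineseSplitter {

    // Assumption: characters in these blocks are emitted one per token
    private static final Set<UnicodeBlock> IDEOGRAPH_BLOCKS = new HashSet<UnicodeBlock>(Arrays.asList(
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B,
            UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS,
            UnicodeBlock.KANGXI_RADICALS));

    // DIGITS and LETTERS are grouped into runs; SINGLE is always its own token
    private enum Kind { DIGITS, LETTERS, SINGLE }

    private static Kind kindOf(int cp) {
        if (Character.isDigit(cp)) {
            return Kind.DIGITS;
        }
        // Ideographs are letters too, so exclude them explicitly
        if (Character.isLetter(cp) && !IDEOGRAPH_BLOCKS.contains(UnicodeBlock.of(cp))) {
            return Kind.LETTERS;
        }
        return Kind.SINGLE; // ideographs, punctuation, whitespace, symbols
    }

    public static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        Kind runKind = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Kind kind = kindOf(cp);
            // Close the pending token when the kind changes or a single-character token starts
            if (run.length() > 0 && (kind != runKind || kind == Kind.SINGLE)) {
                tokens.add(run.toString());
                run.setLength(0);
            }
            run.appendCodePoint(cp);
            runKind = kind;
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("查詢促進民間參與公共建設法（210ＢＯＴ法）."));
    }
}
```

For the string from the question, this sketch yields <code>[查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, （, 210, ＢＯＴ, 法, ）, .]</code>. Note that the fullwidth letters ＢＯＴ are grouped because <code>Character.isLetter</code> accepts them, not because they fall in a Latin block.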

Good luck.

Here's an approach I would take.

You can use <code>Character.codePointAt(char[] charArray, int index)</code> to get the Unicode code point of a character in your char array.

You will also need to know which code points are Latin characters.

If you look at the source of <code>Character.UnicodeBlock</code>, the Latin blocks (Basic Latin through Latin Extended-B) together span the interval [0x0000, 0x024F]. So basically you check whether your Unicode code point falls within that interval.

I suspect there is a way to just use a Character.Subset to check if it contains your char, but I haven't looked into that.
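The interval check described above might be sketched as follows. The class <code>LatinRangeCheck</code> and its methods are hypothetical names used only for illustration. Two caveats: the interval [0x0000, 0x024F] also contains ASCII digits, punctuation, and control characters, and the fullwidth letters in the question's example (such as Ｂ, U+FF22) lie outside it, in the Halfwidth and Fullwidth Forms block, so the sketch checks for that block separately via <code>UnicodeBlock.of</code>.

```java
import java.lang.Character.UnicodeBlock;

public class LatinRangeCheck {

    // Basic Latin starts at U+0000; Latin Extended-B ends at U+024F
    private static final int LATIN_START = 0x0000;
    private static final int LATIN_END = 0x024F;

    /** True if the code point lies in Basic Latin through Latin Extended-B. */
    public static boolean isInLatinInterval(int codePoint) {
        return codePoint >= LATIN_START && codePoint <= LATIN_END;
    }

    /** True if the code point is a halfwidth or fullwidth form such as U+FF22. */
    public static boolean isFullwidthForm(int codePoint) {
        return UnicodeBlock.of(codePoint) == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    public static void main(String[] args) {
        // An ideograph, an ASCII letter, and a fullwidth letter
        char[] chars = "\u6CD5B\uFF22".toCharArray();
        for (int i = 0; i < chars.length; ) {
            int cp = Character.codePointAt(chars, i);
            System.out.printf("U+%04X latin=%b fullwidth=%b%n",
                    cp, isInLatinInterval(cp), isFullwidthForm(cp));
            i += Character.charCount(cp);
        }
    }
}
```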

Disclaimer: I'm a complete Lucene newbie.

Using the latest version of Lucene (3.6.0 at the time of writing), I managed to get close to the result you require.

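<pre><code>import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// An empty stop-word set, so that no tokens are filtered out
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36, Collections.emptySet());

List&lt;String&gt; words = new ArrayList&lt;String&gt;();
// "original" is the input string to be split
TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(original));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);

try {
    tokenStream.reset(); // Resets this stream to the beginning. (Required)
    while (tokenStream.incrementToken()) {
        words.add(termAttribute.toString());
    }
    tokenStream.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    tokenStream.close(); // Release resources associated with this stream.
}
</code></pre>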

The result I get is:

<pre><code>[查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, 210ｂｏｔ, 法]
</code></pre>

I am writing a Java application, but I'm stuck on this point.

Basically, I have a string of Chinese characters that may also contain some Latin characters or digits, let's say:

<pre><code>查詢促進民間參與公共建設法（210ＢＯＴ法）.
</code></pre>

I want to split the string into individual Chinese characters, while keeping runs of Latin letters or digits (such as "ＢＯＴ" above) together. So, at the end I will have this kind of list:

<code>[ 查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, （, 210, ＢＯＴ, 法, ）, ., ]</code>

How can I solve this problem in Java?

Split only Chinese characters in Java
