It might be worth mentioning that your example is ambiguous. It could be:

<pre><code>\r
\n
\n
\r
\r
\n
\n
</code></pre>

(seven lines)

or:

<pre><code>\r\n
\n
\r
\r\n
\n
</code></pre>

(five lines)

The <code>?</code> quantifier you have used is greedy, which would probably make five the right answer. But because <code>Scanner</code> iterates over tokens (in your case individual characters, because of the delimiter pattern you chose), it effectively matches reluctantly, one character at a time, and arrives at the incorrect answer of seven.

That is, in fact, the expected behaviour of both. The scanner primarily cares about splitting its input into tokens using your delimiter. So it (lazily) takes your <code>sourceString</code> and sees it as the following sequence of tokens: <code>\r</code>, <code>\n</code>, <code>\n</code>, <code>\r</code>, <code>\r</code>, <code>\n</code>, and <code>\n</code>. When you then call <code>hasNext</code>, it checks whether the next token matches your pattern (each one trivially does, thanks to the <code>?</code> in <code>\r\n?</code>). The while loop therefore iterates over each of the 7 tokens.

On the other hand, the matcher matches the regex greedily against the whole string, so it bundles each <code>\r\n</code> pair together, as you expect.

One way to emphasise the behaviour of <code>Scanner</code> is to change your regexp to <code>(\\r\\n|\\n)</code>. This results in a count of 0, because the scanner reads the first token as <code>\r</code> (not <code>\r\n</code>), notices that it doesn't match your pattern, and so returns false when you call <code>hasNext</code>.

(Short version: the scanner tokenises using your delimiter before applying your token pattern; the matcher doesn't do any form of tokenising.)
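The difference described above can be checked side by side. The following is a minimal sketch rather than anyone's original code: the class name <code>ScannerVsMatcher</code> and the helpers <code>countWithScanner</code> and <code>countWithMatcher</code> are made up for illustration, and only the source string and the two regexes come from the thread.

```java
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScannerVsMatcher {
    // The source string from the question.
    static final String SOURCE = "\r\n\n\r\r\n\n";

    // Counts the tokens Scanner accepts when every character is its own token.
    static int countWithScanner(String regex) {
        Pattern pattern = Pattern.compile(regex);
        int count = 0;
        try (Scanner scan = new Scanner(SOURCE)) {
            scan.useDelimiter(""); // empty delimiter: one-character tokens
            while (scan.hasNext(pattern)) {
                scan.next(pattern);
                count++;
            }
        }
        return count;
    }

    // Counts matches found by scanning the whole string with a Matcher.
    static int countWithMatcher(String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(SOURCE);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"\\r\\n?|\\n", "\\r\\n|\\n"}) {
            System.out.println(regex
                    + " scanner=" + countWithScanner(regex)
                    + " matcher=" + countWithMatcher(regex));
        }
    }
}
```

With the original pattern the scanner loop should count 7 and the matcher 5. With <code>\r\n|\n</code> the scanner loop stops immediately, because the first one-character token <code>\r</code> does not match, while the matcher still finds the terminators it can.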

When you use the <code>Scanner</code> with a delimiter of <code>""</code>, it produces tokens that are each one character long. This happens before your newline regex is applied. It then matches each of these characters against the newline regex; each one matches, so it produces 7 tokens. However, because it has already split the string into 1-character tokens, it cannot group adjacent <code>\r\n</code> characters into one token.
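To see the tokens themselves, one can dump what the scanner returns with an empty delimiter. This is only an illustrative sketch: <code>EmptyDelimiterTokens</code>, <code>tokens</code> and <code>escape</code> are hypothetical names, and the input is the source string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class EmptyDelimiterTokens {
    // Returns the tokens Scanner produces for the input when the delimiter is "".
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scan = new Scanner(input)) {
            scan.useDelimiter("");
            while (scan.hasNext()) {
                result.add(scan.next());
            }
        }
        return result;
    }

    // Makes line terminators visible when printed.
    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        for (String token : tokens("\r\n\n\r\r\n\n")) {
            System.out.println("[" + escape(token) + "] length=" + token.length());
        }
    }
}
```

Each token has length 1, so no regex applied per token can ever see a <code>\r</code> and the <code>\n</code> that follows it at the same time.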

Your <code>useDelimiter()</code> and <code>next()</code> combo is faulty. <code>useDelimiter("")</code> makes <code>next()</code> return substrings of length 1, because an empty string does in fact sit between every two characters.

That is, because <code>"\r\n".equals("\r" + "" + "\n")</code>, <code>"\r\n"</code> is in fact two tokens, <code>"\r"</code> and <code>"\n"</code>, delimited by <code>""</code>.

To get the <code>Matcher</code>-like behavior, you need <code>findWithinHorizon</code>, which ignores delimiters.

<pre><code>Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
String sourceString = "\r\n\n\r\r\n\n";
Scanner scan = new Scanner(sourceString);
int count = 0;
// findWithinHorizon ignores delimiters; a horizon of 0 searches without bound
while (scan.findWithinHorizon(newLinePattern, 0) != null) {
    count++;
}
System.out.println("found " + count + " newlines"); // finds 5 newlines
</code></pre>
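Since the stated goal is to report line numbers, the same terminator pattern can also be applied with a plain <code>Matcher</code> to turn a character offset into a line number. This is a rough sketch under assumptions, not part of the original answer: <code>LineNumbers</code> and <code>lineNumberAt</code> are hypothetical names, lines are numbered from 1, and a character inside a terminator is counted as belonging to the line it ends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineNumbers {
    // Same terminators as in the question: \r\n, a lone \r, or a lone \n.
    static final Pattern NEWLINE = Pattern.compile("\\r\\n?|\\n");

    // Returns the 1-based line number of the character at the given offset.
    static int lineNumberAt(String source, int offset) {
        Matcher matcher = NEWLINE.matcher(source);
        int line = 1;
        // Every terminator that ends at or before the offset starts a new line.
        while (matcher.find() && matcher.end() <= offset) {
            line++;
        }
        return line;
    }

    public static void main(String[] args) {
        String source = "a\r\nb\nc";
        System.out.println("line of 'c': " + lineNumberAt(source, source.indexOf('c')));
    }
}
```

Rescanning from the start on every call is quadratic for many lookups; a real analyzer would more likely record the terminator offsets once and search them, but that is beyond what the thread covers.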

<h3>API links</h3>

<ul>
<li><a href="http://java.sun.com/javase/6/docs/api/java/util/Scanner.html#findWithinHorizon%28java.util.regex.Pattern,%20int%29" rel="nofollow noreferrer"><code>findWithinHorizon(Pattern pattern, int horizon)</code></a>

<blockquote>
 Attempts to find the next occurrence of the specified pattern [...] ignoring delimiters [...] If no such pattern is detected then the <code>null</code> is returned [...] If <code>horizon</code> is 0, then [...] this method continues to search through the input looking for the specified pattern without bound.
</blockquote></li>
</ul>

<h3>Related questions</h3>

<ul>
<li><a href="https://stackoverflow.com/questions/2597841/scanner-method-to-get-a-char">Scanner method to get a char</a>

<ul>
<li><code>useDelimiter("")</code> will tokenize into 1-length substrings</li>
</ul></li>
</ul>

I'm developing a syntax analyzer by hand in Java, and I'd like to use regexes to parse the various token types. The problem is that I'd also like to be able to accurately report the current line number if the input doesn't conform to the syntax.

Long story short, I've run into a problem when I try to actually match a newline with the Scanner class. To be specific, when I try to match a newline with a pattern using the Scanner class, it fails. Almost always. But when I perform the same matching using a Matcher and the same source string, it retrieves the newline exactly as you'd expect it to. Is there a reason for this that I can't seem to discover, or is this a bug, as I suspect?

FYI: I was unable to find a bug in the Sun database that describes this issue, so if it is a bug, it hasn't been reported.

Example Code:

<pre><code>Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
String sourceString = "\r\n\n\r\r\n\n";

Scanner scan = new Scanner(sourceString);
scan.useDelimiter("");
int count = 0;
while (scan.hasNext(newLinePattern)) {
    scan.next(newLinePattern);
    count++;
}
System.out.println("found " + count + " newlines"); // finds 7 newlines

Matcher match = newLinePattern.matcher(sourceString);
count = 0;
while (match.find()) {
    count++;
}
System.out.println("found " + count + " newlines"); // finds 5 newlines
</code></pre>

Java Scanner newline parsing with regex (Bug?)
