Q: What is a word boundary in a regular expression?

Stack Overflow user

Asked 2009-08-25 04:46:59

9 answers, 194.7K views, 0 followers, 174 votes

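I am trying to use regular expressions to match space-separated numbers. I cannot find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but it appears that this does not work. I would be glad to hear how you would approach this.

I am using Java regexes in Java 1.6.

Example: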

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());

String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());

pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

regex

word-boundary

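9 answers

Stack Overflow user

Accepted answer

Posted 2009-08-24 21:00:24

In most regex dialects, a word boundary is a position between \w and \W (a non-word character), or the beginning or end of the string if the string begins or ends (respectively) with a word character ([0-9A-Za-z_]).

So, in the string "-12", it matches before the 1 or after the 2. The dash is not a word character.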
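To make this concrete, here is a minimal Java sketch that lists the positions where \b matches in "-12" and shows one possible way to adjust the question's pattern: put the optional minus sign in front of the boundary instead of after it. The class and method names are made up for illustration, and the sketch assumes plain ASCII input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class BoundaryPositions {

    // Returns every index of the input at which \b produces a (zero-width) match.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    // True if the whole input is an optionally signed integer with optional surrounding whitespace.
    static boolean isSpacedInteger(String input) {
        // The sign comes before \b, so the boundary falls between '-' and the first digit.
        return Pattern.matches("\\s*-?\\b\\d+\\b\\s*", input);
    }

    public static void main(String[] args) {
        System.out.println(boundaries("-12"));        // [1, 3]: before '1' and after '2'
        System.out.println(isSpacedInteger(" 12 "));  // true
        System.out.println(isSpacedInteger(" -12 ")); // true
    }
}
```

With the boundary placed after the sign, both " 12 " and " -12 " match, while something like " 12a " is still rejected by the trailing \b.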

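140 votes

Stack Overflow user

Posted 2009-08-25 01:36:03

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.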
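That definition can be spelled out directly with lookarounds. The sketch below is only an illustration (the class name, constant, and sample string are assumptions, not from the answer); it compares the positions matched by \b with those matched by the lookaround form on an ASCII-only string, where the two should agree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compares \b with its lookaround spelling on ASCII text.
class BoundaryAsLookaround {

    // "Preceded by a word character and not followed by one,
    //  or followed by a word character and not preceded by one."
    static final String LOOKAROUND = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Collects the start index of every match of the (zero-width) regex in the input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    // True if \b and the lookaround form match at exactly the same positions.
    static boolean sameBoundaries(String input) {
        return positions("\\b", input).equals(positions(LOOKAROUND, input));
    }

    public static void main(String[] args) {
        System.out.println(sameBoundaries("-12 foo_bar, baz!"));
    }
}
```

The comparison is deliberately limited to ASCII input, because flavors differ in how \b and \w treat non-ASCII characters.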

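16 votes

Stack Overflow user

Posted 2013-12-17 00:54:42

I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to give a language a name that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): in most regex flavors, the characters matched by the shorthand character class \w are the characters that word boundaries treat as word characters. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.)

\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice that the underscore and digits are included (but not the dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore, and numeric symbols that are not digits, may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.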
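The ASCII-only behaviour of \w in Java can be checked with a small sketch. It also uses the Pattern.UNICODE_CHARACTER_CLASS flag, which was added in Java 7 and so is not available on the Java 1.6 mentioned in the question; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: default (ASCII) \w versus \w under UNICODE_CHARACTER_CLASS.
class WordCharFlavors {

    // True if the whole string is a single \w character under Java's default rules.
    static boolean isWordCharDefault(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    // Same check with the Unicode version of the predefined character classes (Java 7+).
    static boolean isWordCharUnicode(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWordCharDefault("a"));      // true
        System.out.println(isWordCharDefault("_"));      // true
        System.out.println(isWordCharDefault("-"));      // false
        System.out.println(isWordCharDefault("\u00e9")); // false: e-acute is not ASCII
        System.out.println(isWordCharUnicode("\u00e9")); // true
    }
}
```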

This is why Java-based regex searches for C++, C#, or .NET (even when you remember to escape the period and the pluses) get tripped up by \b.
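One way to see the effect: \b only matches where exactly one of the two neighbouring characters is a word character, so it cannot match between a space and '.', or between '#' and ','. The sketch below is illustrative only (class name, method name, and sample strings are assumptions) and shows that \b next to punctuation succeeds precisely when a word character sits on the other side.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: \b needs a word character on exactly one side.
class BoundaryNeedsWordChar {

    // True if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // '#' and ',' are both non-word characters, so the trailing \b cannot match.
        System.out.println(found("\\bC#\\b", "C#, Java"));      // false
        // Here the trailing \b matches between '#' and 'x', which is not what a search for "C#" wants.
        System.out.println(found("\\bC#\\b", "C#x"));           // true
        // ' ' and '.' are both non-word characters, so the leading \b cannot match.
        System.out.println(found("\\b\\.NET\\b", "and .NET.")); // false
        // The leading \b matches between 'P' and '.'.
        System.out.println(found("\\b\\.NET\\b", "ASP.NET."));  // true
    }
}
```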

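Note: I'm not sure what to do about mistakes in the text, such as someone leaving out the space after the period at the end of a sentence. I allowed for it, but I'm not sure that is necessarily the right thing to do.

Anyway, in Java, if you are searching text for those oddly named languages, you need to replace \b with explicit designators for the whitespace and punctuation that may come before and after the name. For example: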

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns every line of the input that contains a match for regexp, joined by newlines.
public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}

And then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
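A possible variant of the approach above, not from the original answer, is to put the same delimiter characters into lookarounds so that the match covers only the term itself and the delimiters are not consumed. The class name, the DELIMS constant, and the containsTerm helper are hypothetical; Pattern.quote is used so that names like C++ and .NET need no manual escaping.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: delimiter-based "boundaries" written as lookarounds.
class DelimitedTermSearch {

    // Same delimiter characters as beforeWord/afterWord above, written as a character class.
    private static final String DELIMS = "[\\s.,!?()'\"]";

    // True if the term occurs with a delimiter (or the string start/end) on both sides.
    static boolean containsTerm(String term, String text) {
        String regex = "(?<=" + DELIMS + "|^)" + Pattern.quote(term) + "(?=" + DELIMS + "|$)";
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println(containsTerm(".NET", text));
        System.out.println(containsTerm("C++", text));
        System.out.println(containsTerm("C#", text));
        System.out.println(containsTerm("C", text));
        // '&' is not in the delimiter list, so the C in "C&O" is not reported.
        System.out.println(containsTerm("C", "Worked on C&O (Chesapeake and Ohio) Canal."));
    }
}
```

The trade-off is the same as with beforeWord and afterWord: any punctuation missing from the delimiter list (for example ';' or ':') will cause a miss, so the list has to fit the text being searched.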

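Also, my thanks to http://regexpal.com/, without which the regex world would be a very miserable place!

7 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: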

https://stackoverflow.com/questions/1324676
