Q: Getting dialogue snippets from text using regular expressions

Stack Overflow user

Asked 2010-06-01 13:46:42

Answers: 1 | Views: 632 | Followers: 0 | Votes: 1

I'm trying to extract snippets of dialogue from the text of a book. For example, if I have the string

"What's the matter with the flag?" inquired Captain MacWhirr. "Seems all right to me."

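then I'd like to extract "What's the matter with the flag?" and "Seems all right to me.".

I found a regular expression to use for this, which is "[^"\\]*(\\.[^"\\]*)*". It works fine in Eclipse when I do a Ctrl+F regex find on my book's .txt file, but when I run the following code: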

String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

if (m.find())
    System.out.println(m.group(1));

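the only thing that gets printed is null. So am I not converting the regex into a Java string correctly? Do I need to account for the fact that a Java String writes a double quote as \"?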

regex

java

1 Answer

Stack Overflow user

Accepted answer

Answered 2010-06-01 13:49:21

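In natural-language text, a " is unlikely to be escaped by a preceding backslash, so you should be able to use just the pattern "([^"]*)".

As a Java string literal, this is "\"([^\"]*)\"".

Here it is in Java: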

String regex = "\"([^\"]*)\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

while (m.find()) {
    System.out.println(m.group(1));
}

The above prints:

What's the matter with the flag?
Seems all right to me.
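
If the snippets are needed for further processing rather than just printing, the same loop can collect them into a list. The following is a minimal sketch, not part of the original answer; the class name DialogueExtractor and the method name extract are hypothetical, and it assumes the book uses straight ASCII double quotes only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every double-quoted snippet in a text.
class DialogueExtractor {
    // A quote, then a captured run of non-quote characters, then a closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    static List<String> extract(String text) {
        List<String> snippets = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        // find() advances to the next match on each call, so loop until it fails.
        while (m.find()) {
            snippets.add(m.group(1)); // group 1 is the text between the quotes
        }
        return snippets;
    }

    public static void main(String[] args) {
        String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
        for (String snippet : extract(bookText)) {
            System.out.println(snippet);
        }
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every call, which may matter when scanning a whole book chapter by chapter.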

On escape sequences

Given this declaration:

String s = "\"";
System.out.println(s.length()); // prints "1"

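The string s contains only one character, ". The \" is an escape sequence that exists only at the Java source-code level; the string itself contains no backslash.

See also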
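
The same reasoning applies to the regex literals in this thread: printing them shows what the regex engine actually receives once the compiler has processed the source-level escapes. This is a small illustrative sketch; the class name EscapeLevels and its constant names are hypothetical.

```java
// Hypothetical demo: what a Java string literal holds after source-level escapes.
class EscapeLevels {
    // Source text \"([^\"]*)\" becomes the 9-character regex "([^"]*)"
    static final String SIMPLE = "\"([^\"]*)\"";

    // Source text [^\"\\\\] becomes the 6-character regex [^"\\]
    // (four backslashes in source are two in the string, which the regex
    // engine then reads as one literal backslash inside the character class).
    static final String BACKSLASH_CLASS = "[^\"\\\\]";

    public static void main(String[] args) {
        System.out.println(SIMPLE);                   // "([^"]*)"
        System.out.println(SIMPLE.length());          // 9
        System.out.println(BACKSLASH_CLASS);          // [^"\\]
        System.out.println(BACKSLASH_CLASS.length()); // 6
    }
}
```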

JLS 3.10.6 Escape Sequences for Character and String Literals

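The problem with the original code

There is actually nothing wrong with the pattern itself, but you're not capturing the right part. Group 1 (\1) does not capture the quoted text: it wraps only the optional escaped-character part, which never takes part in the match for this input, so m.group(1) returns null. Here is the pattern with the capturing group in the right place: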

String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
String bookText = "\"What's the matter?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

while (m.find()) {
    System.out.println(m.group(1));
}
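
To see why the question's code prints null, it can help to compare the whole match with group 1 for the original pattern. The sketch below is illustrative only; the class name NullGroupDemo and its method names are hypothetical. It relies on the documented behaviour of Matcher.group(int), which returns null when the match succeeded but the group took no part in it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: the original pattern matches, but its group 1 stays unset.
class NullGroupDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"");

    // Whole first match (group 0), quotes included, or null if nothing matched.
    static String firstWholeMatch(String text) {
        Matcher m = ORIGINAL.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Group 1 of the first match; null if the group never participated.
    static String firstGroupOne(String text) {
        Matcher m = ORIGINAL.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String bookText = "\"Seems all right to me.\"";
        System.out.println(firstWholeMatch(bookText)); // "Seems all right to me."
        System.out.println(firstGroupOne(bookText));   // null
    }
}
```

Group 1 is only set when the text contains a backslash-escaped character, and even then it holds just the last repetition of the group rather than the full quoted text.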

For a visual comparison, here is the original pattern as a Java string literal:

String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\""
                            ^^^^^^^^^^^^^^^^^
                           why capture this part?

And here is the modified pattern:

String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    we want to capture this part!

As mentioned earlier: this more complicated pattern is not necessary for natural-language text, which is unlikely to contain escaped quotes.

See also
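
For completeness, here is a rough sketch of the case where the two patterns differ: input that does contain backslash-escaped quotes, such as source code embedded in text. The class name QuotePatternComparison, its constants, and the sample input are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of the simple and the escape-aware pattern.
class QuotePatternComparison {
    static final Pattern SIMPLE = Pattern.compile("\"([^\"]*)\"");
    static final Pattern ESCAPE_AWARE =
            Pattern.compile("\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"");

    // Collects group 1 of every match of the given pattern.
    static List<String> extract(Pattern pattern, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // The actual text is: print("say \"hi\" now")
        String codeLike = "print(\"say \\\"hi\\\" now\")";

        // The simple pattern stops at the first escaped quote and splits the
        // string into two fragments.
        System.out.println(extract(SIMPLE, codeLike));

        // The escape-aware pattern treats \" as part of the content and
        // returns the string as a single snippet.
        System.out.println(extract(ESCAPE_AWARE, codeLike));
    }
}
```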

regular-expressions.info/Grouping and backreferences

Votes: 5

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/2947502
