Maybe a regular expression isn't the best way to parse this; if it isn't, please tell me. Anyway, here are some examples of the syntax trees:
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
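What I'm trying to do is pull out the connective (and, then, once, etc.) and its corresponding head (CC, IN, CC). I already know the head for each syntax tree, so it can act as an anchor. I also need to retrieve its parent (S in the first example, SBARTMP in the second, S in the third) and its siblings, if there are any (none in the first, one sibling to the right of the connective in the second, and a sibling on each side in the third). Anything higher than the parent should not be included.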
my $pos = "(\\\w|-)*";
my $sibling = qr{\s*(\((?:(?>[^()]+)|(?1))*\))\s*};
my $connective = "once";
my $re = qr{(\(\w*\s*$sibling*\s*\(IN\s$connective\)\s*$sibling*\s*\))};
This code works for the following:
my $test1 = "(X (SBAR-TMP (IN once) (S sdf) (S sdf)))";
my $test2 = "(X (SBAR-TMP (IN once))";
my $test3 = "(X (SBAR-TMP (IN once) (X as))";
my $test4 = "(X (SBAR-TMP (X adsf) (IN once))";
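It throws away the X at the top and keeps everything else. However, as soon as a sibling has anything embedded in it, the pattern no longer matches, because the regex doesn't go any deeper: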
my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";
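I'm not sure how to account for this. I'm fairly new to Perl's extended patterns and have only just started learning them. To clarify what the regex does: it looks for the connective inside a pair of parentheses together with its capital-letters/hyphen tag, then for a complete parent in the same format ending with two closing parentheses, and then for any number of siblings whose parentheses are all paired off.
Posted on 2010-12-27 00:31:59
To get only the "parent" closest to the anchored connective, you can either do it as a recursed parent with a FAIL, or do it directly. (For some reason I can't edit my other posts; my cookies must have been deleted.)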
use strict;
use warnings;
my $connective = qr/ \((?:IN|CC)\s(?:once|and|then)\)/x;
my $sibling = qr/
\s*
(
(?! $connective )
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s*
/x;
my $regex1 = qr/
\( ( [\w-]+ \s* $sibling* \s* $connective \s* $sibling* ) \) #1
/x;
my $regex2 = qr/
( #1
\( \s*
( #2
[\w-]+ \s*
(?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
| (?1)
)
)
\s*
\)
)
/x;
my $sample = qq/
(X (SBAR-TMP (IN once) (S sdf) (S sdf)))
(X (SBAR-TMP (IN once))
(X (SBAR-TMP (IN once) (X as))
(X (SBAR-TMP (X adsf) (IN once))
(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
/;
while ($sample =~ /$regex1/xg) {
print "Found: $1\n";
}
print '-' x 20, "\n";
while ($sample =~ /$regex2/xg) {
print "Found: $2\n";
}
__END__
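One detail worth spelling out: the `$sibling` pattern here recurses with `(?-1)`, whereas the question used `(?1)`. When a `qr//` object is interpolated into a larger pattern, the whole thing is recompiled and the groups are renumbered, so `(?1)` ends up pointing at the first group of the outer regex rather than at the sibling's own group. That is a likely reason the original stopped matching as soon as a sibling contained a nested group. The sketch below is only an illustration of that difference; the variable names and the simplified patterns are mine, not taken from the posts above.

```perl
use strict;
use warnings;

# Balanced-parenthesis group using an absolute recursion, (?1).
my $abs = qr/ ( \( (?: [^()]++ | (?1) )* \) ) /x;
# The same group using a relative recursion, (?-1).
my $rel = qr/ ( \( (?: [^()]++ | (?-1) )* \) ) /x;

my $text = '(IN once) (MORE stuff (MORE stuff))';

# Once embedded after another capture group, (?1) refers to that
# first group, "(IN word)", and no longer to the balanced group.
my $with_abs = $text =~ / ( \(IN\s\w+\) ) \s* $abs /x ? 1 : 0;
my $with_rel = $text =~ / ( \(IN\s\w+\) ) \s* $rel /x ? 1 : 0;

print "absolute: $with_abs\n";   # absolute: 0
print "relative: $with_rel\n";   # relative: 1
```

Used on its own, `$abs` still matches nested groups; it only goes wrong once other capture groups precede it in the combined pattern.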
Posted on 2010-12-23 21:26:36
Why did you give up on this? You almost had it. Try this:
use strict;
use warnings;
my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
\s*
(
(?!$connective)
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s*
/x;
my $regex = qr/
( #1
\(
\s* [\w-]+ \s*
(?> $sibling* \s* $connective \s* $sibling*
| (?1)
)
\s*
\)
)
/x;
my @tests = (
'(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
'(X (SBAR-TMP (IN once))',
'(X (SBAR-TMP (IN once) (X as))',
'(X (SBAR-TMP (X adsf) (IN once))',
);
for my $sample (@tests)
{
while ($sample =~ /$regex/xg) {
print "Found: $1\n";
}
}
my $another =<<EOS;
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S
(S
(NP blah
(VP blah)
)
(CC then)
(NP blah
(VP blah
(PP blah)
)
)
)
)
EOS
print "\n---------\n";
while ($another =~ /$regex/xg) {
print "\nFound:\n$1\n";
}
__END__
Posted on 2010-12-24 16:38:52
This should work as well:
use strict;
use warnings;
my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
(?: \s*
(
(?!$connective)
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s* )
/x;
my $regex = qr/
( #1
\( \s*
( #2
[\w-]+ \s*
(?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
| (?1)
)
)
\s*
\)
)
/x;
my @tests = (
'(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
'(X (SBAR-TMP (IN once))',
'(X (SBAR-TMP (IN once) (X as))',
'(X (SBAR-TMP (X adsf) (IN once))',
'(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))',
);
for my $sample (@tests)
{
while ($sample =~ /$regex/xg) {
print "Found: $2\n";
}
}
my $another = "
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
";
print "\n---------\n";
while ($another =~ /$regex/xg) {
print "\nFound:\n$2\n";
}
__END__
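Since the question leaves open whether a regex is the right tool at all, here is a rough non-regex alternative: tokenize the bracketed string, build a nested structure, and walk it to report each connective with its head, parent label, and left and right siblings. This is only a sketch under a few assumptions: a connective is always a two-element node of the form `(HEAD word)`, the set of head/word pairs is known in advance, and the helper names (`parse_tree`, `to_string`, `find_connectives`) are made up for this example.

```perl
use strict;
use warnings;

# Parse bracketed text into nested array refs: [label, child, ...].
# Plain words are kept as strings. Returns the top-level trees.
sub parse_tree {
    my ($text) = @_;
    my @tokens = $text =~ /[()]|[^\s()]+/g;
    my @stack  = ([]);                    # dummy root collects top-level trees
    for my $tok (@tokens) {
        if ($tok eq '(') {
            my $node = [];
            push @{ $stack[-1] }, $node;  # attach to parent immediately
            push @stack, $node;
        }
        elsif ($tok eq ')') {
            pop @stack if @stack > 1;     # tolerate a stray ")"
        }
        else {
            push @{ $stack[-1] }, $tok;
        }
    }
    return @{ $stack[0] };
}

# Turn a node back into its bracketed form.
sub to_string {
    my ($node) = @_;
    return $node unless ref $node;
    return '(' . join(' ', map { to_string($_) } @$node) . ')';
}

# Collect every child of the form (HEAD word) whose "HEAD word" pair is a
# known connective, along with its parent label and its siblings.
sub find_connectives {
    my ($node, $is_connective, $found) = @_;
    $found ||= [];
    my ($label, @kids) = @$node;
    for my $i (0 .. $#kids) {
        my $kid = $kids[$i];
        next unless ref $kid;
        if (@$kid == 2 && !ref $kid->[0] && !ref $kid->[1]
            && $is_connective->{"$kid->[0] $kid->[1]"}) {
            push @$found, {
                head   => $kid->[0],
                word   => $kid->[1],
                parent => $label,
                left   => [ map { to_string($_) } @kids[0 .. $i - 1] ],
                right  => [ map { to_string($_) } @kids[$i + 1 .. $#kids] ],
            };
        }
        else {
            find_connectives($kid, $is_connective, $found);
        }
    }
    return $found;
}

my %connective = map { $_ => 1 } ('IN once', 'CC and', 'CC then');
my @samples = (
    '(S (CC and))',
    '(SBARTMP (IN once) (NP otherstuff))',
    '(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))',
    '(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))',
);
for my $sample (@samples) {
    for my $tree (grep { ref } parse_tree($sample)) {
        for my $hit (@{ find_connectives($tree, \%connective) }) {
            printf "%s %s | parent %s | left [%s] | right [%s]\n",
                @$hit{qw(head word parent)},
                join(' ', @{ $hit->{left} }),
                join(' ', @{ $hit->{right} });
        }
    }
}
```

For `(SBARTMP (IN once) (NP otherstuff))` this should report head IN, parent SBARTMP, no left siblings, and `(NP otherstuff)` on the right. Because each node is attached to its parent as soon as it is opened, inputs with a missing closing parenthesis, like some of the test strings above, are still handled.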
https://stackoverflow.com/questions/4518981