I have the following code, which works fine:
foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
if ($page_title =~ /\p{Script_Extensions=$type}/i) {
print qq|TITLE: $_->{domain} is not english ($type), so lets ignore it...\n| if $DEBUG > 0;
last;
}
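}
All it does is look for characters from specific scripts, so that we can discard the titles we don't want. It works, but it is a bit slow (it runs one regex match per foreach() iteration). Is there a way to do this in a single regex? (And, if possible, extract which script matched.)
Update: as suggested, I am now trying: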
if ($page_title =~ /\p{Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan}/i) {
colored(qq|$page_title matches $1, so lets ignore... |, 'yellow on_magenta'), "\n";
}
and also:
if ($page_title =~ /\p{Script_Extensions=(Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan)}/i) {
colored(qq|$page_title matches $1, so lets ignore... |, 'yellow on_magenta'), "\n";
}
But I get an error:
Can't find Unicode property definition "Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan" in regex (reported at line 300); marked by <-- HERE in m/\p{Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Che
Posted on 2020-11-28 20:02:45
First, I'd guess that using /i here is pointless.
As for a solution, alternation is one option.
/
\p{Script_Extensions=Arabic}
| \p{Script_Extensions=Armenian}
| \p{Script_Extensions=Bengali}
| ...
| \p{Script_Extensions=Thaana}
| \p{Script_Extensions=Thai}
| \p{Script_Extensions=Tibetan}
/x
Alternation lets us use whitespace to lay the pattern out, but a character class is the faster solution.
/
[\p{Script_Extensions=Arabic}\p{Script_Extensions=Armenian}\p{Script_Extensions=Bengali}...\p{Script_Extensions=Thaana}\p{Script_Extensions=Thai}\p{Script_Extensions=Tibetan}]
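/x
That said, just 3 of the 41 properties already take up the whole width of the screen (whitespace inside a bracketed character class is still significant under /x, so the class cannot be wrapped the way the alternation was). As with the other solutions presented here, nothing stops you from building the pattern dynamically.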
my $class_body =
join '',
map "\\p{Script_Extensions=$_}",
qw(
Arabic Armenian Bengali
...
Thaana Thai Tibetan
);
/[$class_body]/
But there is another option: (?[...])
/
(?[ \p{Script_Extensions=Arabic}
+ \p{Script_Extensions=Armenian}
+ \p{Script_Extensions=Bengali}
+ ...
+ \p{Script_Extensions=Thaana}
+ \p{Script_Extensions=Thai}
+ \p{Script_Extensions=Tibetan}
])
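/x
This requires use experimental qw( regex_sets ); before Perl 5.36. It is safe to add that line and use the feature as far back as 5.18, when it was introduced as an experimental feature, because the feature has not changed since then.
Posted on 2020-11-28 10:47:54
The only way I could get it to work was: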
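For reference, here is a minimal runnable sketch of the (?[...]) form. It assumes a shortened three-script list and made-up sample titles, neither of which comes from the original post; the full list from the question would be written the same way.

```perl
use strict;
use warnings;
use experimental qw( regex_sets );   # needed before Perl 5.36, per the note above

# Union of three script sets; whitespace inside (?[ ... ]) is not significant.
my $non_latin_re = qr/
    (?[ \p{Script_Extensions=Arabic}
      + \p{Script_Extensions=Cyrillic}
      + \p{Script_Extensions=Han}
    ])
/x;

# Made-up sample titles: one plain ASCII, one containing two Han characters.
for my $title ("Hello world", "\x{4F60}\x{597D}") {
    my $verdict = $title =~ $non_latin_re ? "ignore" : "keep";
    print "$verdict\n";
}
```

Compiling the pattern once with qr// and reusing it keeps the property lookups out of the per-title work.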
if ($page_title =~ /\p{Script_Extensions=Han}|\p{Script_Extensions=Arabic}|\p{Script_Extensions=Armenian}|\p{Script_Extensions=Bengali}|\p{Script_Extensions=Bopomofo}|\p{Script_Extensions=Braille}|\p{Script_Extensions=Buhid}|\p{Script_Extensions=Canadian_Aboriginal}|\p{Script_Extensions=Cherokee}|\p{Script_Extensions=Cyrillic}|\p{Script_Extensions=Devanagari}|\p{Script_Extensions=Ethiopic}|\p{Script_Extensions=Georgian}|\p{Script_Extensions=Greek}|\p{Script_Extensions=Gujarati}|\p{Script_Extensions=Gurmukhi}|\p{Script_Extensions=Hangul}|\p{Script_Extensions=Hanunoo}|\p{Script_Extensions=Hebrew}|\p{Script_Extensions=Hiragana}|\p{Script_Extensions=Kannada}|\p{Script_Extensions=Katakana}|\p{Script_Extensions=Khmer}|\p{Script_Extensions=Lao}|\p{Script_Extensions=Limbu}|\p{Script_Extensions=Malayalam}|\p{Script_Extensions=Mongolian}|\p{Script_Extensions=Myanmar}|\p{Script_Extensions=Ogham}|\p{Script_Extensions=Oriya}|\p{Script_Extensions=Runic}|\p{Script_Extensions=Sinhala}|\p{Script_Extensions=Syriac}|\p{Script_Extensions=Tagalog}|\p{Script_Extensions=Tagbanwa}|\p{Script_Extensions=TaiLe}|\p{Script_Extensions=Tamil}|\p{Script_Extensions=Telugu}|\p{Script_Extensions=Thaana}|\p{Script_Extensions=Thai}|\p{Script_Extensions=Tibetan}/i) {
# ... match
}
It's a bit messy, but as far as I can tell it works.
An example benchmark:
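The long pattern does not have to be typed out by hand, and the question also asked which script matched. The following is only a sketch under stated assumptions: the script list is shortened, the helper name offending_scripts is hypothetical, and it uses a generated character class rather than the alternation above. It finds the first offending character with one precompiled regex, then checks that single character against one precompiled regex per script.

```perl
use strict;
use warnings;

# Shortened list for illustration; the full list from the question works the same way.
my @scripts = qw( Arabic Cyrillic Greek Han Hiragana Thai );

# Build the character class body once and compile the regex once.
my $class_body   = join '', map "\\p{Script_Extensions=$_}", @scripts;
my $non_latin_re = qr/([$class_body])/;

# One precompiled regex per script, consulted only after a hit.
my %script_re = map { $_ => qr/\p{Script_Extensions=$_}/ } @scripts;

# Hypothetical helper: returns the listed scripts that the first
# offending character belongs to, or an empty list if nothing matched.
sub offending_scripts {
    my ($title) = @_;
    return () unless $title =~ $non_latin_re;
    my $char = $1;
    return grep { $char =~ $script_re{$_} } @scripts;
}

# Made-up sample title starting with Cyrillic letters.
my @hits = offending_scripts("\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}, world");
print "matched: @hits\n" if @hits;
```

A character can belong to more than one script under Script_Extensions, which is why the helper returns a list rather than a single name.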
use Benchmark;
my $page = Common::get_html_file("bsscn2p.com","home");
my $page_title;
if ($page =~ /\<title\>(.+?)\<\/title\>/i) {
$page_title = $1;
}
timethese(100000, {
test1 => sub {
if ($page_title =~ /\p{Script_Extensions=Han}|\p{Script_Extensions=Arabic}|\p{Script_Extensions=Armenian}|\p{Script_Extensions=Bengali}|\p{Script_Extensions=Bopomofo}|\p{Script_Extensions=Braille}|\p{Script_Extensions=Buhid}|\p{Script_Extensions=Canadian_Aboriginal}|\p{Script_Extensions=Cherokee}|\p{Script_Extensions=Cyrillic}|\p{Script_Extensions=Devanagari}|\p{Script_Extensions=Ethiopic}|\p{Script_Extensions=Georgian}|\p{Script_Extensions=Greek}|\p{Script_Extensions=Gujarati}|\p{Script_Extensions=Gurmukhi}|\p{Script_Extensions=Hangul}|\p{Script_Extensions=Hanunoo}|\p{Script_Extensions=Hebrew}|\p{Script_Extensions=Hiragana}|\p{Script_Extensions=Kannada}|\p{Script_Extensions=Katakana}|\p{Script_Extensions=Khmer}|\p{Script_Extensions=Lao}|\p{Script_Extensions=Limbu}|\p{Script_Extensions=Malayalam}|\p{Script_Extensions=Mongolian}|\p{Script_Extensions=Myanmar}|\p{Script_Extensions=Ogham}|\p{Script_Extensions=Oriya}|\p{Script_Extensions=Runic}|\p{Script_Extensions=Sinhala}|\p{Script_Extensions=Syriac}|\p{Script_Extensions=Tagalog}|\p{Script_Extensions=Tagbanwa}|\p{Script_Extensions=TaiLe}|\p{Script_Extensions=Tamil}|\p{Script_Extensions=Telugu}|\p{Script_Extensions=Thaana}|\p{Script_Extensions=Thai}|\p{Script_Extensions=Tibetan}/i) {
#print qq|TITLE: $page_title matches, so lets ignore... $_->{domain} \n|;
}
},
test2 => sub {
foreach my $type (qw/Han Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Hangul Hanunoo Hebrew Hiragana Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
if ($page_title =~ /\p{Script_Extensions=$type}/i) {
#print "MATCH for $type! \n";
}
}
}
});
Benchmark: timing 100000 iterations of test1, test2...
test1: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 3333333.33/s (n=100000)
(warning: too few iterations for a reliable count)
test2: 117 wallclock secs (115.36 usr + 0.13 sys = 115.49 CPU) @ 865.88/s (n=100000)
So the new code is better and far more efficient :)
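The test1 figure above carries a "too few iterations" warning, so a rerun that lets Benchmark pick the iteration count may give a more reliable comparison. The sketch below is an assumption-laden variant, not the benchmark that produced the numbers above: it uses a shortened script list and a made-up title, and it adds a third case with per-script regexes compiled ahead of time. A likely reason the original loop is slow is that a pattern interpolating $type is recompiled whenever $type changes, which in that loop is on every iteration.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @scripts = qw( Arabic Cyrillic Greek Han Thai );   # shortened list (assumption)
my $title   = "An ordinary English page title";       # made-up sample input

# Single character class, compiled once.
my $class_body = join '', map "\\p{Script_Extensions=$_}", @scripts;
my $class_re   = qr/[$class_body]/;

# One regex per script, each compiled once.
my @script_res = map { qr/\p{Script_Extensions=$_}/ } @scripts;

# A negative count means "run each case for at least that many CPU seconds".
cmpthese(-1, {
    single_class => sub { my $hit = $title =~ $class_re },
    precompiled_loop => sub {
        for my $re (@script_res) { last if $title =~ $re }
    },
    interpolated_loop => sub {
        for my $type (@scripts) { last if $title =~ /\p{Script_Extensions=$type}/ }
    },
});
```

cmpthese prints a table of rates and relative percentages, which is easier to compare than the raw timethese lines.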
https://stackoverflow.com/questions/65047822