我有以下代码,运行良好:
foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
if ($page_title =~ /\p{Script_Extensions=$type}/i) {
print qq|TITLE: $_->{domain} is not english ($type), so lets ignore it...\n| if $DEBUG > 0;
last;
}
}它所做的就是寻找特定的字符,这样我们就可以摆脱那些我们不想要的字符。现在,当它工作的时候,它有点慢(就像它在每个foreach()上做一个)。有没有办法在单一的正则表达式中找到这个方法?(并在可能的情况下提取匹配集)
更新:我现在正按建议尝试使用:
if ($page_title =~ /\p{Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan}/i) {
colored(qq|$page_title matches $1, so lets ignore... |, 'yellow on_magenta'), "\n";
}还包括:
if ($page_title =~ /\p{Script_Extensions=(Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan)}/i) {
colored(qq|$page_title matches $1, so lets ignore... |, 'yellow on_magenta'), "\n";
}但我发现了一个错误:
在regex中无法找到Unicode属性定义"Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan“;在进程全域Can行300.中,
标记为<--在m/\p{Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Che中
发布于 2020-11-28 10:47:54
我唯一能让它发挥作用的方法是:
if ($page_title =~ /\p{Script_Extensions=Han}|\p{Script_Extensions=Arabic}|\p{Script_Extensions=Armenian}|\p{Script_Extensions=Bengali}|\p{Script_Extensions=Bopomofo}|\p{Script_Extensions=Braille}|\p{Script_Extensions=Buhid}|\p{Script_Extensions=Canadian_Aboriginal}|\p{Script_Extensions=Cherokee}|\p{Script_Extensions=Cyrillic}|\p{Script_Extensions=Devanagari}|\p{Script_Extensions=Ethiopic}|\p{Script_Extensions=Georgian}|\p{Script_Extensions=Greek}|\p{Script_Extensions=Gujarati}|\p{Script_Extensions=Gurmukhi}|\p{Script_Extensions=Hangul}|\p{Script_Extensions=Hanunoo}|\p{Script_Extensions=Hebrew}|\p{Script_Extensions=Hiragana}|\p{Script_Extensions=Kannada}|\p{Script_Extensions=Katakana}|\p{Script_Extensions=Khmer}|\p{Script_Extensions=Lao}|\p{Script_Extensions=Limbu}|\p{Script_Extensions=Malayalam}|\p{Script_Extensions=Mongolian}|\p{Script_Extensions=Myanmar}|\p{Script_Extensions=Ogham}|\p{Script_Extensions=Oriya}|\p{Script_Extensions=Runic}|\p{Script_Extensions=Sinhala}|\p{Script_Extensions=Syriac}|\p{Script_Extensions=Tagalog}|\p{Script_Extensions=Tagbanwa}|\p{Script_Extensions=TaiLe}|\p{Script_Extensions=Tamil}|\p{Script_Extensions=Telugu}|\p{Script_Extensions=Thaana}|\p{Script_Extensions=Thai}|\p{Script_Extensions=Tibetan}/i) {
# ... match
}有点乱七八糟,但据我所知,它正在起作用。
基准制定的一个例子:
use Benchmark;
my $page = Common::get_html_file("bsscn2p.com","home");
my $page_title;
if ($page =~ /\<title\>(.+?)\<\/title\>/i) {
$page_title = $1;
}
timethese(100000, {
test1 => sub {
if ($page_title =~ /\p{Script_Extensions=Han}|\p{Script_Extensions=Arabic}|\p{Script_Extensions=Armenian}|\p{Script_Extensions=Bengali}|\p{Script_Extensions=Bopomofo}|\p{Script_Extensions=Braille}|\p{Script_Extensions=Buhid}|\p{Script_Extensions=Canadian_Aboriginal}|\p{Script_Extensions=Cherokee}|\p{Script_Extensions=Cyrillic}|\p{Script_Extensions=Devanagari}|\p{Script_Extensions=Ethiopic}|\p{Script_Extensions=Georgian}|\p{Script_Extensions=Greek}|\p{Script_Extensions=Gujarati}|\p{Script_Extensions=Gurmukhi}|\p{Script_Extensions=Hangul}|\p{Script_Extensions=Hanunoo}|\p{Script_Extensions=Hebrew}|\p{Script_Extensions=Hiragana}|\p{Script_Extensions=Kannada}|\p{Script_Extensions=Katakana}|\p{Script_Extensions=Khmer}|\p{Script_Extensions=Lao}|\p{Script_Extensions=Limbu}|\p{Script_Extensions=Malayalam}|\p{Script_Extensions=Mongolian}|\p{Script_Extensions=Myanmar}|\p{Script_Extensions=Ogham}|\p{Script_Extensions=Oriya}|\p{Script_Extensions=Runic}|\p{Script_Extensions=Sinhala}|\p{Script_Extensions=Syriac}|\p{Script_Extensions=Tagalog}|\p{Script_Extensions=Tagbanwa}|\p{Script_Extensions=TaiLe}|\p{Script_Extensions=Tamil}|\p{Script_Extensions=Telugu}|\p{Script_Extensions=Thaana}|\p{Script_Extensions=Thai}|\p{Script_Extensions=Tibetan}/i) {
#print qq|TITLE: $page_title matches, so lets ignore... $_->{domain} \n|;
}
},
test2 => sub {
foreach my $type (qw/Han Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Hangul Hanunoo Hebrew Hiragana Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
if ($page_title =~ /\p{Script_Extensions=$type}/i) {
#print "MATCH for $type! \n";
}
}
}
});
Benchmark: timing 100000 iterations of test1, test2...
test1: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 3333333.33/s (n=100000)
(warning: too few iterations for a reliable count)
test2: 117 wallclock secs (115.36 usr + 0.13 sys = 115.49 CPU) @ 865.88/s (n=100000)因此,使用新代码更好,效率更高:)
https://stackoverflow.com/questions/65047822
复制相似问题