
A faster way to compare strings against foreign character sets in Perl

Stack Overflow user
Asked 2020-11-28 08:36:56
2 answers · 92 views · 0 followers · 1 vote

I have the following code, which works fine:

foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
    if ($page_title =~ /\p{Script_Extensions=$type}/i) {

        print qq|TITLE: $_->{domain} is not english ($type), so lets ignore it...\n| if $DEBUG > 0;

        last;
    }
}

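All it does is look for characters from specific scripts, so that we can discard the titles we don't want. It works, but it is a bit slow, because it runs a separate regex match on every iteration of the foreach() loop. Is there a way to do this in a single regex? (And, if possible, to extract which script matched?)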

Update: as suggested, I am now trying:

if ($page_title =~ /\p{Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan}/i) {
    colored(qq|$page_title matches $1, so lets ignore... |, 'yellow on_magenta'), "\n";
}

And also:

if ($page_title =~ /\p{Script_Extensions=(Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan)}/i) {
    colored(qq|$page_title matches $1, so lets ignore... |, 'yellow on_magenta'), "\n";
}

But I get an error:

Can't find Unicode property definition "Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Cherokee|Cyrillic|Devanagari|Ethiopic|Georgian|Greek|Gujarati|Gurmukhi|Hangul|Hanunoo|Hebrew|Hiragana|Kannada|Katakana|Khmer|Lao|Limbu|Malayalam|Mongolian|Myanmar|Ogham|Oriya|Runic|Sinhala|Syriac|Tagalog|Tagbanwa|TaiLe|Tamil|Telugu|Thaana|Thai|Tibetan" in regex; marked by <-- HERE in m/\p{Script_Extensions=Han|Arabic|Armenian|Bengali|Bopomofo|Braille|Buhid|Canadian_Aboriginal|Che

The error is reported at line 300 of the script.
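
The error is consistent with how `\p{...}` works: the braces take a single property/value pair, so a `|` inside them is read as part of the property name rather than as alternation. A minimal sketch of one way around this, assuming a shortened script list and the hypothetical names `@scripts`, `$foreign_re` and `has_foreign_script` (none of which come from the question), builds one `\p{...}` per script and combines them in a bracketed character class:

```perl
use strict;
use warnings;
use utf8;   # this file contains literal non-ASCII sample titles

# Shortened, illustrative list (assumption); the real list would name every script of interest.
my @scripts = qw(Han Arabic Cyrillic Greek Hebrew Thai);

# Each script gets its own \p{...}; the "or" lives outside the braces,
# here as members of a single bracketed character class.
my $class      = join '', map { "\\p{Script_Extensions=$_}" } @scripts;
my $foreign_re = qr/[$class]/;   # compiled once, reused for every title

# Returns 1 if the title contains a character from any listed script, else 0.
sub has_foreign_script {
    my ($title) = @_;
    return $title =~ $foreign_re ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $title ('Hello world', 'Привет мир', '你好') {
    printf "%s => %s\n", $title, has_foreign_script($title) ? 'skip' : 'keep';
}
```

Generating the pattern from the list keeps the script names in one place, so adding or removing a script does not mean editing a very long regex by hand.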


Stack Overflow user

Answered 2020-11-28 10:47:54

The only way I could get it to work was:

if ($page_title =~ /\p{Script_Extensions=Han}|\p{Script_Extensions=Arabic}|\p{Script_Extensions=Armenian}|\p{Script_Extensions=Bengali}|\p{Script_Extensions=Bopomofo}|\p{Script_Extensions=Braille}|\p{Script_Extensions=Buhid}|\p{Script_Extensions=Canadian_Aboriginal}|\p{Script_Extensions=Cherokee}|\p{Script_Extensions=Cyrillic}|\p{Script_Extensions=Devanagari}|\p{Script_Extensions=Ethiopic}|\p{Script_Extensions=Georgian}|\p{Script_Extensions=Greek}|\p{Script_Extensions=Gujarati}|\p{Script_Extensions=Gurmukhi}|\p{Script_Extensions=Hangul}|\p{Script_Extensions=Hanunoo}|\p{Script_Extensions=Hebrew}|\p{Script_Extensions=Hiragana}|\p{Script_Extensions=Kannada}|\p{Script_Extensions=Katakana}|\p{Script_Extensions=Khmer}|\p{Script_Extensions=Lao}|\p{Script_Extensions=Limbu}|\p{Script_Extensions=Malayalam}|\p{Script_Extensions=Mongolian}|\p{Script_Extensions=Myanmar}|\p{Script_Extensions=Ogham}|\p{Script_Extensions=Oriya}|\p{Script_Extensions=Runic}|\p{Script_Extensions=Sinhala}|\p{Script_Extensions=Syriac}|\p{Script_Extensions=Tagalog}|\p{Script_Extensions=Tagbanwa}|\p{Script_Extensions=TaiLe}|\p{Script_Extensions=Tamil}|\p{Script_Extensions=Telugu}|\p{Script_Extensions=Thaana}|\p{Script_Extensions=Thai}|\p{Script_Extensions=Tibetan}/i) {
    # ... match
}

It is a bit messy, but as far as I can tell it works.
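
The question also asked which script matched, which a plain alternation does not report. One possible approach, sketched here with a shortened script list and the hypothetical names `@scripts`, `$script_re` and `first_foreign_script`, wraps each alternative in a named capture group and reads the group name back from `%+`:

```perl
use strict;
use warnings;

# Shortened, illustrative list (assumption); extend it as needed.
my @scripts = qw(Han Arabic Cyrillic Greek Hebrew Thai);

# One named group per script:
#   (?<Han>\p{Script_Extensions=Han})|(?<Arabic>\p{Script_Extensions=Arabic})|...
my $alternation = join '|',
    map { sprintf '(?<%s>\p{Script_Extensions=%s})', $_, $_ } @scripts;
my $script_re = qr/$alternation/;

# Returns the script name for the leftmost matching character, or undef if
# nothing matched. If that character belongs to several listed scripts, the
# one that comes first in @scripts is reported.
sub first_foreign_script {
    my ($title) = @_;
    return undef unless $title =~ $script_re;
    my ($name) = keys %+;   # %+ only lists groups that took part in the match
    return $name;
}

my $script = first_foreign_script("Caf\x{E9} \x{3A9}mega");   # U+03A9 is a Greek capital omega
print defined $script ? "matched $script\n" : "no match\n";
```

Script names are valid group names only because they consist of letters and underscores, so a list entry containing other characters would need a separate lookup table.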

An example benchmark:

  use Benchmark;

  my $page = Common::get_html_file("bsscn2p.com","home");

  my $page_title;
  if ($page =~ /\<title\>(.+?)\<\/title\>/i) {
    $page_title = $1;
  }

  timethese(100000, {
    test1 => sub {
      if ($page_title =~ /\p{Script_Extensions=Han}|\p{Script_Extensions=Arabic}|\p{Script_Extensions=Armenian}|\p{Script_Extensions=Bengali}|\p{Script_Extensions=Bopomofo}|\p{Script_Extensions=Braille}|\p{Script_Extensions=Buhid}|\p{Script_Extensions=Canadian_Aboriginal}|\p{Script_Extensions=Cherokee}|\p{Script_Extensions=Cyrillic}|\p{Script_Extensions=Devanagari}|\p{Script_Extensions=Ethiopic}|\p{Script_Extensions=Georgian}|\p{Script_Extensions=Greek}|\p{Script_Extensions=Gujarati}|\p{Script_Extensions=Gurmukhi}|\p{Script_Extensions=Hangul}|\p{Script_Extensions=Hanunoo}|\p{Script_Extensions=Hebrew}|\p{Script_Extensions=Hiragana}|\p{Script_Extensions=Kannada}|\p{Script_Extensions=Katakana}|\p{Script_Extensions=Khmer}|\p{Script_Extensions=Lao}|\p{Script_Extensions=Limbu}|\p{Script_Extensions=Malayalam}|\p{Script_Extensions=Mongolian}|\p{Script_Extensions=Myanmar}|\p{Script_Extensions=Ogham}|\p{Script_Extensions=Oriya}|\p{Script_Extensions=Runic}|\p{Script_Extensions=Sinhala}|\p{Script_Extensions=Syriac}|\p{Script_Extensions=Tagalog}|\p{Script_Extensions=Tagbanwa}|\p{Script_Extensions=TaiLe}|\p{Script_Extensions=Tamil}|\p{Script_Extensions=Telugu}|\p{Script_Extensions=Thaana}|\p{Script_Extensions=Thai}|\p{Script_Extensions=Tibetan}/i) {
        #print qq|TITLE: $page_title matches, so lets ignore... $_->{domain} \n|;
      }
    },
    test2 => sub {

      foreach my $type (qw/Han Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Hangul Hanunoo Hebrew Hiragana Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
          if ($page_title =~ /\p{Script_Extensions=$type}/i) {
            #print "MATCH for $type! \n";
          }
      }

    }
  });


Benchmark: timing 100000 iterations of test1, test2...
     test1:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 3333333.33/s (n=100000)
            (warning: too few iterations for a reliable count)
     test2: 117 wallclock secs (115.36 usr +  0.13 sys = 115.49 CPU) @ 865.88/s (n=100000)

So the new code is better and much more efficient :)
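
The "too few iterations" warning means the test1 figure is not a reliable rate, and test2 both interpolates `$type` into the pattern on every pass and never leaves the loop early, so the two tests are not doing quite the same work. A hedged sketch of a more even comparison, assuming a synthetic title in place of the fetched page, a shortened script list, and the hypothetical names `$combined`, `%per_script`, `match_combined` and `match_loop`, precompiles every pattern and lets Benchmark choose the iteration count:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Shortened, illustrative list and a synthetic title (both assumptions).
my @scripts = qw(Han Arabic Cyrillic Greek Hebrew Thai);
my $title   = "Example \x{4F60}\x{597D}";

# Compile all patterns once, outside the timed code.
my $alternation = join '|', map { "\\p{Script_Extensions=$_}" } @scripts;
my $combined    = qr/$alternation/;
my %per_script  = map { $_ => qr/\p{Script_Extensions=$_}/ } @scripts;

# Single combined regex.
sub match_combined {
    return $title =~ $combined ? 1 : 0;
}

# One regex per script, stopping at the first hit like the original loop with `last`.
sub match_loop {
    for my $name (@scripts) {
        return 1 if $title =~ $per_script{$name};
    }
    return 0;
}

# A negative count asks Benchmark to run each test for at least that many CPU seconds.
cmpthese(-1, {
    combined => \&match_combined,
    loop     => \&match_loop,
});
```

Timings from this sketch will differ from the figures above, since the title, the list length and the Perl version all affect the result.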

Votes: 1
Original content from Stack Overflow.
Original link:

https://stackoverflow.com/questions/65047822
