Q: Printing patterns with Perl

Stack Overflow user

Asked on 2016-05-30 14:44:19

Answers: 1 | Views: 72 | Followers: 0 | Votes: 3

I am having trouble removing errors from a Unicode-encoded corpus. The entries have the following form:

രണവര്‍ഗ്ഗത്തിനകത്തു=ഭരണവര്‍ഗ്ഗത്തിന്:stemഅകത്തു|:suffix
ഭസ്മമാക്കിക്കളയുകയും=ഭസ്മം:stemആക്കിക്കളയുകയും|:suffix
ഭസ്മമാക്കി=ഭസ്മം:stemആക്കി|:suffix
ഭാഗത്തുനിന്നുണ്ടാകണം=ഭാഗത്ത്:stemനിന്ന്:stemഉണ്ടാകണം|:suffix,:
ഭാഗമായ=ഭാഗം:stemആയ|:suffix
ഭാര്യമാരില്‍നിന്നും=ഭാര്യമാരില്‍:stemനിന്നും|:suffix:suffix
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix
ഭാര്യയായി=ഭാര്യ:stemആയി|:suffix
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix
ഭിത്തികളൊക്കെ=ഭിത്തികള്‍:stemഒക്കെ|:suffix
ഭിന്നതയില്ലെന്നും=ഭിന്നത:stemഇല്ല:stemഎന്നും|:suffix,:suffix0
ഭൂപ്രഭുക്കളെന്ന്=ഭൂപ്രഭുക്കള്‍:stemഎന്ന്|:suffix0
ഭൂമിയില്‍നിന്ന്=ഭൂമിയില്‍:stemനിന്ന്|:suffix
ഭൂമിയിലുള്ള=ഭൂമിയില്‍:stemഉള്ള|:suffix
ഭൂമിയെപ്പോലൊരു=ഭൂമിയെ:stemപോലെ:stemഒരു|:suffix,:suffix0
ഭൂമുഖവീക്ഷണനായി=ഭൂമുഖവീക്ഷണന്‍:stemആയി|:suffix:suffix
ഭൂസഞ്ചാരംപോലെ=ഭൂസഞ്ചാരം:stemപോലെ|:suffix
ഭേദിക്കേണ്ടതായി=ഭേദിക്കേണ്ടതാ്:stemആയി|:suffix:suffix
ഭൗതികവാദികളാണ്=ഭൗതികവാദികള്‍:stemആണ്|:suffix0
മക്കളയച്ചു=മക്കള്‍:stemഅയച്ചു|:suffix
മക്കള്‍ക്കാണ്=മക്കള്‍ക്ക്:stemആണ്|:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix
മഞ്ചേശ്വരത്താണ്=മഞ്ചേശ്വരത്ത്:stemആണ്|:suffix:suffix
മഞ്ഞുവെള്ളത്തിലാഴ്ത്തി=മഞ്ഞുവെള്ളത്തില്‍:stemആഴ്ത്തി|:suffix:suffix
മടങ്ങാണിതിന്=മടങ്ങ്:stemആണ്:stemഇതിന്|:suffix,:suffix
മടിയനായിരുന്നു=മടിയന്‍:stemആയിരുന്നു|:suffix

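I need to clean up entries that contain two stems or two suffixes. Where there are two stems, I need to keep the first stem and convert the second stem into a suffix. Where there are two suffixes, such as :suffix:suffix or :suffix,:suffix0, only one suffix should be kept.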

use strict;
use warnings qw/ all FATAL /;

use List::Util 'reduce';

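# Read input line by line; each line is expected to hold two parenthesised groups.
while ( <> ) {

    # In list context with /g, this returns the contents of the first two (...) groups.
    my ($word, $ss) = / \( ( [^()]* ) \) /gx;

    # Split the second group on whitespace.
    my @ss = split ' ', $ss;

    # Fold the pieces left to right into nested "S (..) (..)" expressions.
    my $str = reduce { sprintf 'S (%s) (%s)', $a, $b } @ss;

    printf "%s (%s)\n", $word, $str;
}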

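This is the Perl code I am trying to adapt, but it is not enough to handle these more complex cases. Is there a way to deal with this kind of error?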

**Expected output**  

`ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix` to
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:suffixനിന്നു|:suffix

ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix to
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix

മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix to
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix

Would anyone be willing to help me?

Tags: regex, perl

1 Answer

Stack Overflow user

Accepted answer

Answered on 2016-05-30 21:55:20

Description

^([^:]+:stem[^:]+)(?::stem(?=.*?(:suffix))|)([^:]+?\|:suffix[^:]*)(?::suffix[^:]*)*$

Replace with: \1\2\3

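This regular expression does the following:

- Assumes every line contains a :suffix string; when a second :stem is present, a lookahead captures :suffix into capture group 2
- If there is a second :stem, replaces it with :suffix
- Removes all but the first :suffix entry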
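
The steps above can be sketched in Perl roughly as follows. This is a minimal sketch rather than code from the answer: the helper name `fix_entry` and the `$re` variable are hypothetical, it assumes one entry per line, and it needs Perl 5.10 or later for the `//` operator. Because group 2 is only set when a second `:stem` is found, the replacement falls back to an empty string instead of interpolating an undefined `$2`.

```perl
use strict;
use warnings;
use utf8;                               # this source contains literal Malayalam text
binmode STDOUT, ':encoding(UTF-8)';

# The pattern from the answer, unchanged.
my $re = qr/^([^:]+:stem[^:]+)(?::stem(?=.*?(:suffix))|)([^:]+?\|:suffix[^:]*)(?::suffix[^:]*)*$/;

# Hypothetical helper: returns the corrected form of one corpus entry.
sub fix_entry {
    my ($line) = @_;
    # $2 is defined only when a second ":stem" was matched, hence the "// ''".
    $line =~ s{$re}{ $1 . ($2 // '') . $3 }e;
    return $line;
}

# Entries taken from the question.
print fix_entry($_), "\n" for (
    'ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix',
    'ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix',
    'മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix',
);
```

To run it over a whole corpus file, the same helper could be called inside a `while (<>)` loop after `chomp`, with the input handle opened using the `:encoding(UTF-8)` layer.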

Example

Live demo

https://regex101.com/r/rJ9gW3/2

Sample text

ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix

After replacement

ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:suffixനിന്നു|:suffix,
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix
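
Note that the first line of this result still ends in a comma (`|:suffix,`), whereas the expected output in the question ends in `|:suffix`. If that comma is unwanted, one possible follow-up is a second substitution that strips it. The sketch below is not part of the original answer: the helper name `fix_entry_no_comma` is hypothetical, and it assumes that a comma at the very end of an entry never carries meaning.

```perl
use strict;
use warnings;

# The pattern from the answer, unchanged.
my $entry_re = qr/^([^:]+:stem[^:]+)(?::stem(?=.*?(:suffix))|)([^:]+?\|:suffix[^:]*)(?::suffix[^:]*)*$/;

# Hypothetical helper: applies the answer's replacement, then drops a trailing comma.
sub fix_entry_no_comma {
    my ($line) = @_;
    # Step 1: second ":stem" becomes ":suffix", extra ":suffix" entries are dropped.
    $line =~ s{$entry_re}{ $1 . ($2 // '') . $3 }e;
    # Step 2 (assumption): a comma left at the end of the entry is noise.
    $line =~ s/,$//;
    return $line;
}
```

Keeping the comma removal as a separate step leaves the answer's pattern untouched, so the two behaviours can be enabled or disabled independently.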

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of a "line"
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^:]+                    any character except: ':' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    :stem                    ':stem'
----------------------------------------------------------------------
    [^:]+                    any character except: ':' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    :stem                    ':stem'
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      .*?                      any character except \n (0 or more
                               times (matching the least amount
                               possible))
----------------------------------------------------------------------
      (                        group and capture to \2:
----------------------------------------------------------------------
        :suffix                   ':suffix'
----------------------------------------------------------------------
      )                        end of \2
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [^:]+?                   any character except: ':' (1 or more
                             times (matching the least amount
                             possible))
----------------------------------------------------------------------
    \|                       '|'
----------------------------------------------------------------------
    :suffix                  ':suffix'
----------------------------------------------------------------------
    [^:]*                    any character except: ':' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    :suffix                  ':suffix'
----------------------------------------------------------------------
    [^:]*                    any character except: ':' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of a
                           "line"
----------------------------------------------------------------------

Votes: 3

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/37528421
