I have a dataset with three patterns:
First:
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
Second:
inaccurate in:prefix<>accurate:stem
inactive in:prefix<>active:stem
Third:
incommunicable in:prefix<>communicate:stem<>able:suffix
incompatibility in:prefix<>compatible:stem<>ity:suffix
I need to convert the table above into the table below, with matching brackets in the style of the Penn Treebank (http://languagelog.ldc.upenn.edu/myl/PennTreebank1995.pdf):
First:
abrasion ((abrade:stem) ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
Second:
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
Third:
incommunicable (in:prefix ((communicate:stem)able:suffix))
incompatibility (in:prefix ((compatible:stem)ity:suffix))
My code (I am using awk):
{
n = gsub(/<>/,")",$2)
s = sprintf("%*s",n,"")
gsub(/ /,"(",s)
print "(" $1, s "((" $2 "))"
}
Edit
A more complex form:
nationalistic national:stem<>ism:suffix<>ist:suffix<>ic:suffix
to:
nationalistic ((((national:stem)ism:suffix)ist:suffix)ic:suffix)
The code above does not produce the expected output shown in the examples.
Posted on 2016-03-21 12:01:02
This should be general enough, since it matches on the :stem, :prefix and :suffix tags:
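awk 'BEGIN{FS=OFS="\n"}{
    # gensub is a GNU awk extension; first wrap the stem in its own parentheses
    a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
    # attach a suffix that directly follows the stem, dropping the <> separator
    b=gensub(/(\([a-zA-Z]*:stem\))<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
    # wrap a prefix together with everything after it
    c=gensub(/([a-zA-Z]*:prefix)<>(.*)/,"(\\1\\2)", "g", b);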
    print c;}' testfile
Demo here: https://ideone.com/U3ux91
Edit
This should handle multiple suffixes and prefixes:
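awk 'BEGIN{FS=OFS="\n"}{
    # wrap the stem in its own parentheses
    a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
    # each pass attaches one suffix, until no <> follows the stem group
    while ( a ~ /stem\)<>.*:suffix/) {
        a=gensub(/(\([a-zA-Z]*:stem\).*?)<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
    }
    # wrap a prefix together with everything after it, while any <> remains
    while ( a ~ /<>/) {
        a=gensub(/([a-zA-Z]*?:prefix)<>(.*)/,"(\\1\\2)", "g", a);
    }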
    print a;}' test
Demo: https://ideone.com/U7LYXi (Sorry if antinationalistic is not a word; it is only there for testing.)
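For awks without gensub, the same nesting can be sketched with split and two plain loops. This is only a sketch under a few assumptions: the function name bracket is made up for illustration, the second column contains no spaces, each line has exactly one :stem, and suffixes attach to the stem before prefixes do (as in the question's third pattern). How several prefixes should nest is not shown in the question, so the inner-to-outer order used here is a guess.

```shell
# bracket: hypothetical helper name, not part of the original answers.
# Reads "word morph:tag<>morph:tag..." lines from files or stdin.
bracket() {
    awk '{
        n = split($2, part, "<>")           # morphemes in surface order
        stem = 0
        for (i = 1; i <= n; i++)
            if (part[i] ~ /:stem$/) stem = i
        if (stem == 0) { print; next }      # no stem tag: leave the line unchanged
        tree = "(" part[stem] ")"
        for (i = stem + 1; i <= n; i++)     # suffixes first, nearest the stem first
            tree = "(" tree part[i] ")"
        for (i = stem - 1; i >= 1; i--)     # then prefixes, nearest the stem first
            tree = "(" part[i] tree ")"
        print $1, tree
    }' "$@"
}

# Example call on the more complex line from the question.
printf '%s\n' 'nationalistic national:stem<>ism:suffix<>ist:suffix<>ic:suffix' | bracket
```

Because the tree is built outward from the stem, the number of suffixes or prefixes does not matter, and nothing depends on how the word itself is spelled.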
Posted on 2016-03-21 10:31:39
The expected output for pattern 1 may be wrong: the parentheses are not paired. I guess it is a typo and should be:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
I wrote this awk script:
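awk -v d="<>" '{$2="("$2")"}               # outer parentheses for every line
$1~/^ab/{sub(d,")",$2);$2="(" $2}          # pattern 1 (words starting with ab): stem<>suffix
$1~/^ina/{sub(d,"(",$2);$2=$2")"}          # pattern 2 (words starting with ina): prefix<>stem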
$1~/^inc/{sub(d,"((",$2);sub(d,")",$2);$2=$2")"}7' file
For a single file containing the examples of all three patterns, it gives:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))
Posted on 2016-03-24 14:57:46
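awk -F'<>| ' -v OFS= '{
    $1 = $1 " "                      # OFS is empty, so keep the space after the word
    for (i=2; i<=NF; i++) {
        # a prefix opens a group that closes at the very end of the line
        if ($i ~ /prefix$/) { $i = "(" $i; $NF = $NF ")" }
        # the stem always gets its own pair of parentheses
        if ($i ~ /stem\)?$/) { stem = i; $i = "(" $i ")" }
        # each suffix closes a group that opens just before the stem
        if ($i ~ /suffix\)?$/) { $i = $i ")"; $stem = "(" $stem }
    }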
} { print }'
https://stackoverflow.com/questions/36128175
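Since unpaired parentheses came up in this thread, a small checker can help when comparing the output of the scripts above. The following is only a sketch: check_balance is a made-up name, and it does nothing more than count parentheses per line, so it says nothing about whether the nesting is the intended one.

```shell
# check_balance: hypothetical helper name, not part of the original answers.
# Prints every line whose parentheses do not pair up; exits non-zero if any is found.
check_balance() {
    awk '{
        depth = 0; bad = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "(") depth++
            else if (c == ")") { depth--; if (depth < 0) bad = 1 }   # closed before opened
        }
        if (bad || depth != 0) { print "unbalanced: " $0; status = 1 }
    }
    END { exit status + 0 }' "$@"
}

# Example call: a balanced line produces no output and exit status 0.
printf '%s\n' 'abrasion ((abrade:stem)ion:suffix)' | check_balance
```

It can be appended to any of the pipelines above, for example by piping the converted file into check_balance.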