I am trying to remove rows from a comma-separated file where the APPID is the same and the Category column holds the same category. Input:
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-1 ,,,,,,,, Cell ,
5002 , APP-1 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,
Desired output:
1,APPID,3,4,5,6,7,8,9,Category ,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,
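"APP-1" is removed because its rows share the same column 2 and their Category column is "Cell" in every row.
"APP-2" is kept because its Category column has one "Cell" and one "Enzyme".
"APP-3" is a similar case: its Category column contains different categories.
(Updated) "APP-4" is kept because its Category column contains heterogeneous categories. We want to keep the repeated "5002,APP-4……" rows; those will be handled in the next script. This step is meant to quickly remove the tens of thousands of data points whose Category column is homogeneous (when their APPID is the same), so that the arrays in the next script do not blow up.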
So far, this attempt does not seem to work (reference: removal of redundant lines based on value in last column):
awk -F " ," '!a[$1,$2,$3,$4,$5,$6,$7,$8,$9]++' input
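Each file is about a million lines, and there are about 400 files to process in total, so execution speed seems critical here. Can any guru enlighten me? Thanks!
Posted on 2014-08-07 10:10:47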
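def killDups(infilepath, outfilepath):
    data = {}     # appid -> {category: index of the only line seen so far}
    dups = set()  # (appid, category) pairs seen more than once
    with open(infilepath) as infile:
        infile.readline()  # skip the header
        for i, line in enumerate(infile):
            cols = [col.strip() for col in line.strip().split(',')]
            # Rows end with a trailing comma, so Category is the second-to-last field.
            appid, cat = cols[1], cols[-2]
            if (appid, cat) in dups:
                continue
            cats = data.setdefault(appid, {})
            if cat in cats:
                # Second occurrence: the pair is a duplicate, so drop it for good.
                cats.pop(cat)
                dups.add((appid, cat))
            else:
                cats[cat] = i

    # Line indices of the (APPID, Category) pairs that occurred exactly once.
    whitelist = set()
    for v in data.values():
        whitelist.update(v.values())

    with open(infilepath) as infile, open(outfilepath, 'w') as outfile:
        outfile.write(infile.readline())  # copy the header
        for i, line in enumerate(infile):
            if i in whitelist:
                outfile.write(line)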
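For the updated requirement in the question (keep every row of an APPID whose Category values differ, including repeated rows such as those of APP-4), a two-pass variant might look like the sketch below. The function names `appid_and_category`, `mixed_appids`, and `filter_file` are hypothetical, and the sketch assumes every row has the layout of the sample: APPID in column 2, a trailing comma after Category, and no blank lines.

```python
def appid_and_category(row):
    """Return the stripped APPID (column 2) and Category (second-to-last field)."""
    cols = [c.strip() for c in row.rstrip("\n").split(",")]
    # Assumption: rows end with a trailing comma, so the last field is empty.
    return cols[1], cols[-2]


def mixed_appids(rows):
    """Pass 1: return the APPIDs that appear with more than one category."""
    cats = {}
    for row in rows:
        appid, cat = appid_and_category(row)
        cats.setdefault(appid, set()).add(cat)
    return {appid for appid, seen in cats.items() if len(seen) > 1}


def filter_file(infilepath, outfilepath):
    """Copy the header, then every row whose APPID has mixed categories."""
    with open(infilepath) as infile:
        next(infile)  # skip the header
        keep = mixed_appids(infile)
    # Pass 2: re-read the file and write the surviving rows in input order.
    with open(infilepath) as infile, open(outfilepath, "w") as outfile:
        outfile.write(next(infile))  # copy the header
        for row in infile:
            if appid_and_category(row)[0] in keep:
                outfile.write(row)
```

Only one set of categories per APPID is held in memory, and the file is read twice rather than stored, which may matter for files of about a million lines.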
Posted on 2014-08-07 10:32:52
$ awk -F, '
{ key=$2 FS $(NF-1); nr2key[NR]=key; key2val[key]=$0; cnt[key]++ }
END {
    for (i=1;i<=NR;i++) {
        key=nr2key[i]
        if (cnt[key] == 1) {
            print key2val[key]
        }
    }
}
' file
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
Posted on 2014-08-07 10:46:22
Here is another way using awk:
awk -F, '
!patt[$2,$(NF-1)]++ { lines[$2,$(NF-1)] = $0 }
END {
    for (line in lines)
        if (patt[line] == 1)
            print lines[line]
}' file | sort -t, -nk1,2
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-3 ,,,,,,,, Cell ,
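The patt array counts how many times each combination of column 2 and the second-to-last column has been seen, and the lines array stores the line for the same key. In the END block, only the lines whose key was seen exactly once are printed; the output is piped to sort because the order of a for-in loop over an awk array is unspecified. Note: for a more elegant way to do this with plain awk, see Ed Morton's solution.
If you have GNU awk, then (similar logic, but using its built-in sorting):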
gawk -F, '
BEGIN { PROCINFO["sorted_in"] = "@ind_num_desc" }
!patt[$2,$(NF-1)]++ {
    lines[$2,$(NF-1)] = $0
}
END {
    for (line in lines)
        if (patt[line] == 1)
            print lines[line]
}' file
If you can use perl, then:
perl -F, -lane'
    print and next if $.==1;      # print the header
    $key = "@F[1,-1]";            # form the key using two columns
    $h{$key} or push @rec, $key;  # if key is not in hash push to array (for order)
    push @{$h{$key}}, $_          # create hash of arrays
}{                                # In the END block ...
    print @{$h{$_}} for grep { @{$h{$_}} == 1 } @rec  # print line whose array count is 1
' file
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
Update:
perl -F, -lane'
    print and next if $.==1;
    $seen{$F[1],$F[-1]}++ or push @rec, [$F[1], $F[-1]];
    push @{$h{$F[1]}{$F[-1]}}, $_
}{
    for (@rec) {
        next if keys %{$h{$_->[0]}} == 1;
        print join "\n", @{$h{$_->[0]}{$_->[1]}};
    }
' file
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,
https://stackoverflow.com/questions/25173049