I have the following file (2016.csv; the first few lines are shown below):
Zhichen Gong,Huanhuan Chen
Zhichuan Huang,Tiantian Xie,Ting Zhu,Jianwu Wang,Qingquan Zhang
Zhichuan Huang,Ting Zhu
Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi
I use the following awk loop to find every pair of names that appears together on a line of the file above.
awk -F, '{for(i=1;i<NF;i++){for(j=i+1;j<=NF;j++){if($i > $j){k[$i][$j]}else{k[$j][$i]}}}}END{for(n in k){for (l in k[n]){print n,",",l}}}' 2016.csv
The output of this awk loop looks like this:
Zhichen Gong , Huanhuan Chen
Zhichuan Huang , Tiantian Xie
Zhichuan Huang , Ting Zhu
Zhichuan Huang , Jianwu Wang
Zhichuan Huang , Qingquan Zhang
Zhifei Zhang,Yang Song
Zhifei Zhang,Wei Wang 0063
Zhifei Zhang,Hairong Qi
etc
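This loop works well and finds every pair of names that appears together on a line of the initial file. I just want to add a counter next to each line of the awk output, showing how many times that pair occurs in the initial file.
For example, for the awk output above, I would like something like: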
Zhichen Gong , Huanhuan Chen, 1
Zhichuan Huang , Tiantian Xie, 1
Zhichuan Huang , Ting Zhu, 2
Zhichuan Huang , Jianwu Wang, 1
Zhichuan Huang , Qingquan Zhang, 1
Zhifei Zhang,Yang Song, 1
Zhifei Zhang,Wei Wang 0063,1
Zhifei Zhang,Hairong Qi,1
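Here, the 1 on the first line (Zhichen Gong , Huanhuan Chen, 1) shows that this pair of names occurs once in the initial file.
I assume I just need to add a counter to the awk loop, but so far I have not been able to get it right.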
Posted on 2022-06-17 20:11:12
Using the OP's 11-line sample as input:
$ cat 2016.csv
Zhichen Gong,Huanhuan Chen
Zhichuan Huang,Tiantian Xie,Ting Zhu,Jianwu Wang,Qingquan Zhang
Zhichuan Huang,Ting Zhu
Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi
Zhihao Huang,Hui Li,Xin Li,Wei He
Zhijun Yin,You Chen,Daniel Fabbri,Jimeng Sun,Bradley A. Malin
Zhipeng Huang 0001,Bogdan Cautis,Reynold Cheng,Yudian Zheng
Zhipeng Huang 0001,Yudian Zheng,Reynold Cheng,Yizhou Sun,Nikos Mamoulis,Xiang Li 0067
Zhiqiang Tao,Hongfu Liu,Sheng Li 0001,Yun Fu 0001
Zhiqiang Xu,Yiping Ke
Zhiyuan Chen 0001,Estevam R. Hruschka Jr.,Bing Liu 0001
A few tweaks to the OP's current code to keep track of the counts, then sort the output by count and name:
awk '
BEGIN { FS=","; OFS=" , " }
      { for (i=1;i<NF;i++)
            for (j=i+1;j<=NF;j++)
                if ($i > $j) k[$i][$j]++              # increment counter
                else         k[$j][$i]++              # increment counter
      }
END   { # to sort by count we will create a new 3-dimensional array with the count as the 1st dimension
        for (i in k)
            for (j in k[i]) {
                arr[k[i][j]][i][j]                    # arr[count][i][j]
                delete k[i][j]                        # delete old array entry to limit memory usage
            }
        PROCINFO["sorted_in"]="@ind_num_desc"         # sort 1st index by count/descending
        for (cnt in arr) {
            PROCINFO["sorted_in"]="@ind_str_asc"      # sort 2nd/3rd indices by name/ascending
            for (i in arr[cnt])
                for (j in arr[cnt][i])
                    print i,j,cnt
        }
      }
' 2016.csv
Notes:
...
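[bob][smith] and [bob][jones] require bob to be stored in memory once, whereas [bob,smith] and [bob,jones] require bob to be stored in memory twice
The OP's expected output has mixed output delimiters; the use of OFS=" , " matches the OP's previous edit; the OP can modify OFS as needed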
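For comparison, the flat `[bob,smith]` style of index mentioned in the notes could be sketched as follows. This is only an illustration, not the answer's method: it assumes a small hypothetical input file `pairs.csv` and output file `pair_counts.txt`, keeps all counts in one array keyed by the two names joined with `SUBSEP`, and orders the result with a plain `sort` instead of `PROCINFO["sorted_in"]`. It should not need GNU-only array features.

```shell
# Hypothetical two-line sample; the file names are assumptions for this sketch.
printf '%s\n' 'Zhichuan Huang,Tiantian Xie,Ting Zhu' 'Zhichuan Huang,Ting Zhu' > pairs.csv

awk '
BEGIN { FS=","; OFS=" , " }
      { for (i=1; i<NF; i++)
            for (j=i+1; j<=NF; j++)
                if ($i > $j) k[$i,$j]++        # flat array; index is $i SUBSEP $j
                else         k[$j,$i]++
      }
END   { for (key in k) {
            split(key, name, SUBSEP)           # recover the two names from the flat index
            print name[1], name[2], k[key]
        }
      }
' pairs.csv | sort > pair_counts.txt           # sort only to get a stable line order

cat pair_counts.txt
```

With this sample, the pair Zhichuan Huang / Ting Zhu occurs on both input lines, so it is the only line reported with a count of 2.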
This generates the following 61 lines of output:
Yudian Zheng , Reynold Cheng , 2
Zhichuan Huang , Ting Zhu , 2
Zhipeng Huang 0001 , Reynold Cheng , 2
Zhipeng Huang 0001 , Yudian Zheng , 2
Daniel Fabbri , Bradley A. Malin , 1
Estevam R. Hruschka Jr. , Bing Liu 0001 , 1
Jimeng Sun , Bradley A. Malin , 1
Jimeng Sun , Daniel Fabbri , 1
Qingquan Zhang , Jianwu Wang , 1
Reynold Cheng , Bogdan Cautis , 1
Reynold Cheng , Nikos Mamoulis , 1
Sheng Li 0001 , Hongfu Liu , 1
Tiantian Xie , Jianwu Wang , 1
Tiantian Xie , Qingquan Zhang , 1
Ting Zhu , Jianwu Wang , 1
Ting Zhu , Qingquan Zhang , 1
Ting Zhu , Tiantian Xie , 1
Wei He , Hui Li , 1
Wei Wang 0063 , Hairong Qi , 1
Xiang Li 0067 , Nikos Mamoulis , 1
Xiang Li 0067 , Reynold Cheng , 1
Xin Li , Hui Li , 1
Xin Li , Wei He , 1
Yang Song , Hairong Qi , 1
Yang Song , Wei Wang 0063 , 1
Yizhou Sun , Nikos Mamoulis , 1
Yizhou Sun , Reynold Cheng , 1
Yizhou Sun , Xiang Li 0067 , 1
You Chen , Bradley A. Malin , 1
You Chen , Daniel Fabbri , 1
You Chen , Jimeng Sun , 1
Yudian Zheng , Bogdan Cautis , 1
Yudian Zheng , Nikos Mamoulis , 1
Yudian Zheng , Xiang Li 0067 , 1
Yudian Zheng , Yizhou Sun , 1
Yun Fu 0001 , Hongfu Liu , 1
Yun Fu 0001 , Sheng Li 0001 , 1
Zhichen Gong , Huanhuan Chen , 1
Zhichuan Huang , Jianwu Wang , 1
Zhichuan Huang , Qingquan Zhang , 1
Zhichuan Huang , Tiantian Xie , 1
Zhifei Zhang , Hairong Qi , 1
Zhifei Zhang , Wei Wang 0063 , 1
Zhifei Zhang , Yang Song , 1
Zhihao Huang , Hui Li , 1
Zhihao Huang , Wei He , 1
Zhihao Huang , Xin Li , 1
Zhijun Yin , Bradley A. Malin , 1
Zhijun Yin , Daniel Fabbri , 1
Zhijun Yin , Jimeng Sun , 1
Zhijun Yin , You Chen , 1
Zhipeng Huang 0001 , Bogdan Cautis , 1
Zhipeng Huang 0001 , Nikos Mamoulis , 1
Zhipeng Huang 0001 , Xiang Li 0067 , 1
Zhipeng Huang 0001 , Yizhou Sun , 1
Zhiqiang Tao , Hongfu Liu , 1
Zhiqiang Tao , Sheng Li 0001 , 1
Zhiqiang Tao , Yun Fu 0001 , 1
Zhiqiang Xu , Yiping Ke , 1
Zhiyuan Chen 0001 , Bing Liu 0001 , 1
Zhiyuan Chen 0001 , Estevam R. Hruschka Jr. , 1
If the order of the output does not matter, the END{...} block can be reduced to the following:
END { for (i in k)
          for (j in k[i])
              print i,j,k[i][j]
    }
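If some ordering is still wanted with the simplified END block, one option might be to pipe the unordered output through `sort`. The sketch below is an assumption-laden illustration: it does not run awk at all, but feeds three lines in the same " , "-separated format into `sort`, using the hypothetical file names `unsorted.txt` and `sorted.txt`.

```shell
# Three lines in the same " , "-separated format that the awk script prints
printf '%s\n' \
  'Zhichen Gong , Huanhuan Chen , 1' \
  'Zhichuan Huang , Ting Zhu , 2' \
  'Daniel Fabbri , Bradley A. Malin , 1' > unsorted.txt

# -t,      fields are separated by commas
# -k3,3nr  primary key: the count, compared numerically, descending
# -k1,1    ties broken by the first name, ascending
sort -t, -k3,3nr -k1,1 unsorted.txt > sorted.txt

cat sorted.txt
```

The numeric key tolerates the leading space in front of each count, so the line with count 2 comes first and the two count-1 lines follow in name order. This would break if a name itself contained a comma.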
https://stackoverflow.com/questions/72663578