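I have seen different versions of this question on Stack Overflow, but none that addresses this specific use case.
Goal
Find duplicate rows by comparing the entire row (not just one column) except for the value in the last column. Remove all but one row from each set of duplicates, but first sum the last-column values of every duplicate and show the total in the last column of the row that remains. I want to do this in Bash.
Use case
I have a table that lists every web page on a website, the number of views each page has received, and some other metadata. However, some rows in the table represent the same page and differ only in the view count. These views need to be added together to show the all-time views for each page.
Example
Original file: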
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",100
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",15
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",10
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",20
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",30
Desired file:
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
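What I wanted to do here was share every iteration of the script I tried and break down where each one succeeded and failed. This is over my head, and I struggled even to do that. My process was to take the answers to some similar Stack Overflow questions (all of them in awk, which makes sense to me) and change the columns being compared. But because some of those answers compare only one column, my modified versions produced inconsistent and odd results. The scripts are quite complex, and I have a hard time understanding why they behave the way they do.
Can anyone educate me on how to work out an answer, or give an example that points me in the right direction? If so, thank you.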
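Posted on 2020-06-20 01:03:12
This will work whether or not any of the quoted fields can contain a comma (e.g. if any of the fields with the "Page Title 1" placeholder text actually looked like "I, Robot - Page 1"):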
$ awk '
BEGIN { FS=OFS="," }
NR==1 { print; next }
{ num=$NF; sub(/,[^,]*$/,""); sum[$0]+=num }
END { for (key in sum) print key, sum[key] }
' file
url,title,tag,version,guide,views
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
Posted on 2020-06-20 00:27:53
One way, using GNU datamash:
$ echo "url,title,tag,version,guide,views" && datamash --header-in -st, -g1,2,3,4,5 sum 6 < input.txt
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
Or using awk:
$ awk -F, 'NR==1 { print; next }
{ groups[$1 "," $2 "," $3 "," $4 "," $5] += $6 }
END { PROCINFO["sorted_in"] = "@ind_str_asc" # Sorted output when using GNU awk
for (g in groups) print g "," groups[g]
}' input.txt
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
Posted on 2020-06-20 01:27:06
Another awk:
$ awk -F, -v OFS=, 'NR==1 {print; next}
{v=$NF; NF--; a[$0]+=v}
END {for(k in a) print k,a[k] | "sort"}' file
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
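Explanation
Print the header line; save the value (the last field), then decrement the field count so that the remaining record ($0) becomes the key, and add the value to an accumulator indexed by that key (this sums the values for equal keys). At the end, print each key with its total and sort the output.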
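The steps above can also be sketched without relying on NF-- to rebuild $0, which, as far as I know, not every awk implementation does. This variant is only an illustration under a few assumptions: the helper name sum_views is hypothetical, the key is rebuilt by joining fields 1 through NF-1 with commas, and rows are printed in first-seen order rather than piped through sort (for the sample data that happens to match the desired file in the question). Because splitting on commas and rejoining with commas reproduces the original text, commas inside quoted fields should not affect the key.

```shell
# sum_views is a hypothetical helper name used only for this sketch.
# Usage: sum_views file.csv
sum_views() {
    awk -F, '
    NR == 1 { print; next }              # print the header line unchanged
    {
        key = $1                         # rebuild the key from fields 1..NF-1
        for (i = 2; i < NF; i++) key = key "," $i
        if (!(key in total)) order[++n] = key   # remember first-seen order
        total[key] += $NF                # add the last field to the accumulator
    }
    END {
        # emit one row per key, in the order the keys first appeared
        for (i = 1; i <= n; i++) print order[i] "," total[order[i]]
    }
    ' "$1"
}
```

Keeping a separate order array costs a little memory but avoids depending on the unspecified iteration order of for (key in array) or on GNU awk's PROCINFO["sorted_in"].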
https://stackoverflow.com/questions/62480103