Here is some example data:
library(data.table)
mydat <- data.table(id1=rep(c("A","B","C"),each=3),
id2=c("D","E","G", "D","E","F","G","E","D"),
val=c(1,2,4,1,2,3, 4,2,1))
This gives us
   id1 id2 val
1:   A   D   1
2:   A   E   2
3:   A   G   4
4:   B   D   1
5:   B   E   2
6:   B   F   3
7:   C   G   4
8:   C   E   2
9:   C   D   1
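My goal is to take the unique values of id2 and val, then generate a variable that depends on those unique values (in the example below, each val divided by the sum of val over the unique observations). This variable should then be put into a column of the original data.table. I often find myself writing code like this: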
## This is the most obvious way
tmp <- unique(mydat[,.(id2,val)])
tmp[,weight:=val/sum(val)]
tmp[,val:=NULL]
mydat <- merge(mydat,tmp,by="id2",all.x=TRUE)
## A second option which doesn't require merging
mydat[,first:=FALSE]
mydat[mydat[,.I[1],by=.(id2)]$V1,first:=TRUE]
mydat[first==TRUE,weight2:=val/sum(val)]
mydat[,weight2:=max(weight2,na.rm = TRUE),by=.(id2)] # spread the first-row value to the whole id2 group
mydat[,first:=NULL]
This gives us
   id2 id1 val weight weight2
1:   D   A   1    0.1     0.1
2:   D   B   1    0.1     0.1
3:   D   C   1    0.1     0.1
4:   E   A   2    0.2     0.2
5:   E   B   2    0.2     0.2
6:   E   C   2    0.2     0.2
7:   F   B   3    0.3     0.3
8:   G   A   4    0.4     0.4
9:   G   C   4    0.4     0.4
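Purely out of curiosity, is there a cleaner (more data.table-like) way to do this? Perhaps using self joins? Performance matters, because the real data I work with tends to be very large.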
Answered on 2019-06-27 19:28:59
Here is an option that does not require a merge:
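# Just the scalar we are after: sum of val over unique (id2, val) pairs.
# duplicated(id2, val) would pass val as the `incomparables` argument, so deduplicate with `by` instead.
total_val <- unique(mydat, by = c("id2", "val"))[, sum(val)]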
mydat[, `:=`(val = val[1], weight = val[1] / total_val), by = id2]
# id1 id2 val weight
# 1: A D 1 0.1
# 2: B D 1 0.1
# 3: C D 1 0.1
# 4: A E 2 0.2
# 5: B E 2 0.2
# 6: C E 2 0.2
# 7: B F 3 0.3
# 8: A G 4 0.4
# 9: C G 4 0.4
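The self-join idea raised in the question could also be written as an update join, which adds the column by reference and keeps the original row order. The following is only a sketch under the assumption that val is fully determined by id2, as in the example data; the name `lookup` is made up for illustration and does not appear in the original post.

```r
library(data.table)

# Same example data as in the question
mydat <- data.table(id1 = rep(c("A", "B", "C"), each = 3),
                    id2 = c("D", "E", "G", "D", "E", "F", "G", "E", "D"),
                    val = c(1, 2, 4, 1, 2, 3, 4, 2, 1))

# Hypothetical helper table: one row per unique (id2, val) pair,
# with each val expressed as a share of the sum over those unique pairs
lookup <- unique(mydat, by = c("id2", "val"))[, .(id2, val, weight = val / sum(val))]

# Update join: match mydat rows to lookup on id2 and val, then assign
# lookup's weight (referred to as i.weight) into mydat by reference
mydat[lookup, on = .(id2, val), weight := i.weight]
```

Because `:=` inside a join modifies `mydat` in place, no merged copy of the large table is created, which may matter for the performance concern mentioned in the question. Whether this is actually faster than the other approaches would need to be benchmarked on the real data.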
https://stackoverflow.com/questions/56781660