我有一个简单的data.table,如下-
ID = c(rep("A", 1000), rep("B", 1000), rep("C", 1000), rep("D", 1000))
val = c("a", "a", "a", "b", "b", "c", "c","d","d","d","d","e","e","f","f","g","g","g","g","g")
dt = data.table(ID, val)我想在这个data.table中添加一个新的列,它将具有val按组ID的滞后。
这是预期的输出
> head(dt, 20)
ID val val_lag
1: A a <NA>
2: A a <NA>
3: A a <NA>
4: A b a
5: A b a
6: A c b
7: A c b
8: A d c
9: A d c
10: A d c
11: A d c
12: A e d
13: A e d
14: A f e
15: A f e
16: A g f
17: A g f
18: A g f
19: A g f
20: A g f我现在使用的解决方案是-
dt[, val_lag := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID]但是,此解决方案在实际数据集上非常慢,该数据集非常大,有数百万行。有没有更快的方法来解决这个问题?
以下是本文讨论的所有方法的性能结果-
microbenchmark::microbenchmark(rles = dt[, val_lag1 := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID],
chinsoon = dt[, val_lag := shift(val)[nafill(replace(seq.int(.N), rowid(rleid(val)) > 1L, NA_integer_), "locf")], by = ID],
TiC = dt[, val_lag3 := c(NA,rle(val)$values)[cumsum(c(0,head(val,-1)!=tail(val,-1)))+1], by = ID],
times = 1000
)
Unit: milliseconds
expr min lq mean median uq max neval cld
rles 1.549548 1.781014 2.750187 2.096805 2.743668 46.65326 1000 a
chinsoon 1.766827 2.060233 3.059109 2.379477 3.077080 67.16040 1000 a
TiC 1.986808 2.226933 3.472451 2.624236 3.397165 60.67802 1000 b谢谢!
发布于 2021-02-01 01:00:45
以下是另一种选择:
dt[, val_lag := shift(val)[nafill(replace(seq.int(.N), rowid(rleid(val)) > 1L, NA_integer_), "locf")]]计时码:
library(data.table)
set.seed(0L)
nr <- 1e6
ng <- 1e5
dt = data.table(ID=sample(ng, nr, TRUE), val=as.character(sample(nr, nr, TRUE)))
setorder(dt, ID, val)
microbenchmark::microbenchmark(times = 3L,
opt = dt[, val_lag := shift(val)[nafill(replace(seq.int(.N), rowid(rleid(val)) > 1L, NA_integer_), "locf")]],
rle = dt[, val_lag := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID]
)计时:
Unit: milliseconds
expr min lq mean median uq max neval
opt 133.8857 159.8922 265.2029 185.8987 330.8614 475.8242 3
rle 3097.6005 3123.5422 3193.2654 3149.4839 3241.0978 3332.7117 3编辑:添加了正在发生的事情的示例:
index | 1 2 3 4 5 6 7 8 9 10
value | a a a b b c c c d d
shifted (s) | NA a a a b b c c c d
rowid+rleid | 1 2 3 1 2 1 2 3 1 2
replace | 1 NA NA 4 NA 6 NA NA 9 NA <In ?nafill, Only double and integer data types are currently supported. Hence, nafill the indices before accessing>
nafill | 1 1 1 4 4 6 6 6 9 9
using s above | s[1] s[1] s[1] s[4] s[4] s[6] s[6] s[6] s[9] s[9]https://stackoverflow.com/questions/65982461
复制相似问题