In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. Here is a simple example using a vector (instead of a data.frame):
> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
I would like a function fill.NAs() that allows me to construct yy such that:
> yy
[1] NA  2  2  2  2  3  3  3  4  4
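I need to repeat this operation for many small data.frames (~30-50 MB each, ~1 TB in total), in which a row is either complete or has NA in all of its entries. What is a good way to approach the problem?
The ugly solution I came up with uses this function: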
last <- function (x){
  x[length(x)]
}
fill.NAs <- function(isNA){
  if (isNA[1] == 1) {
    isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs
                                              # can't be forward filled
  }
  isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
  isNA.pos[isNA.diff < 0] <- 0
  isNA.neg[isNA.diff > 0] <- 0
  which.isNA.neg <- which(as.logical(isNA.neg))
  if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
  which.isNA.pos <- which(as.logical(isNA.pos))
  which.isNA <- which(as.logical(isNA))
  if (length(which.isNA.neg)==length(which.isNA.pos)){
    replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)],
                       which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] -
                         which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
  } else {
    replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
  }
  replacement
}
The function fill.NAs is used as follows:
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
  which.isNA <- which(as.logical(isNA))
  to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
  y[to.replace] <- y[replacement]
}
Output
> y
[1] NA  2  2  2  2  3  3  3  4  4
... which seems to work. But, man, is it ugly! Any suggestions?
Posted on 2011-10-12 13:32:08
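You probably want to use the na.locf() function from the zoo package, which carries the last observation forward to replace your NA values.
Here is the beginning of its usage example from the help page: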
library(zoo)
az <- zoo(1:6)
bz <- zoo(c(2,NA,1,4,5,2))
na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2
na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2
cz <- zoo(c(NA,9,3,2,3,2))
na.locf(cz)
2 3 4 5 6
9 3 2 3 2
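As the last help-page example shows, na.locf() drops leading NAs by default. For the vector in the question, a minimal sketch would pass na.rm = FALSE so that the result keeps its original length. The data.frame `df` below is a hypothetical stand-in for the data described in the question, and filling column by column with lapply() is only one possible way to apply it.

```r
library(zoo)

# The vector from the question.
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# na.rm = FALSE keeps the leading NA, so the length is unchanged.
yy <- na.locf(y, na.rm = FALSE)
yy
# [1] NA  2  2  2  2  3  3  3  4  4

# Hypothetical data.frame in which a row is either complete or entirely NA.
df <- data.frame(a = c(NA, 1, NA, 2), b = c(NA, 10, NA, 20))
df[] <- lapply(df, na.locf, na.rm = FALSE)  # fill each column independently
```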
Posted on 2012-12-11 06:45:40
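Sorry for digging up an old question. I couldn't look up the function for this job on the train, so I wrote one myself.
I was proud to find out that it is a tiny bit faster.
It is less flexible, though.
But it plays nicely with ave, which is what I needed.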
repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
    ind = which(!is.na(x))      # get positions of nonmissing values
    if(is.na(x[1]))             # if it begins with a missing, add the
          ind = c(1,ind)        # first position to the indices
    rep(x[ind], times = diff(   # repeat the values at these indices
       c(ind, length(x) + 1) )) # diffing the indices + length yields how often
}                               # they need to be repeated
x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')
xx = rep(x, 1000000)
system.time({ yzoo = na.locf(xx,na.rm=F)})
## user system elapsed
## 2.754 0.667 3.406
system.time({ yrep = repeat.before(xx)})
## user system elapsed
## 0.597 0.199 0.793
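To illustrate the remark about ave, here is a small sketch of a per-group forward fill. The data.frame `df` and its columns `id` and `v` are made up for illustration. repeat.before is copied from above so that the snippet runs on its own.

```r
# Copied from the answer above so that this snippet is self-contained.
repeat.before = function(x) {
  ind = which(!is.na(x))
  if (is.na(x[1])) ind = c(1, ind)
  rep(x[ind], times = diff(c(ind, length(x) + 1)))
}

# Hypothetical data: two groups, the second one starting with an NA.
df <- data.frame(id = c("a", "a", "a", "b", "b", "b"),
                 v  = c(1, NA, NA, NA, 5, NA))

# ave() applies the function within each level of id, so values
# never leak from group "a" into group "b".
df$filled <- ave(df$v, df$id, FUN = repeat.before)
df$filled
# [1]  1  1  1 NA  5  5
```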
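Edit
As this became my most upvoted answer, I was often reminded that I don't use my own function, because I frequently need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates, which I couldn't debug, I came back to this today to improve my old function.
I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it is inflexible regarding the type of the input (he tested the edge cases incorrectly due to a misunderstanding of all.equal).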
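For reference, a minimal sketch of the tidyr::fill call mentioned above, applied to the vector from the question wrapped in a data.frame. The column name `y` is just the question's variable name, and the default .direction = "down" is what carries values forward.

```r
library(tidyr)

# The question's vector as a one-column data.frame.
df <- data.frame(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))

# fill() carries the last non-NA value downward; the leading NA is kept.
filled <- fill(df, y)
filled$y
# [1] NA  2  2  2  2  3  3  3  4  4
```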
If you need maxgap, my function below is faster than zoo (and doesn't have the weird problems with dates).
I put up the documentation of my tests.
New function
repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )        # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}
I have also put the function into my formr package (GitHub only).
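A short usage sketch may help show what the arguments do. The input vector `x` is made up for illustration, and the function body is repeated from above only so that the snippet runs on its own. The expected values in the comments follow from tracing the code by hand.

```r
# Copied from the answer above so that this snippet is self-contained.
repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
  if (!forward) x = rev(x)
  ind = which(!is.na(x))
  if (is.na(x[1]) && !na.rm) ind = c(1, ind)
  rep_times = diff(c(ind, length(x) + 1))
  if (maxgap < Inf) {
    exceed = rep_times - 1 > maxgap
    if (any(exceed)) {
      ind = sort(c(ind[exceed] + 1, ind))
      rep_times = diff(c(ind, length(x) + 1))
    }
  }
  x = rep(x[ind], times = rep_times)
  if (!forward) x = rev(x)
  x
}

x <- c(1, NA, NA, NA, 2, NA, 3)            # hypothetical input

filled_all  <- repeat_last(x)                   # 1  1  1  1  2  2  3
filled_gap  <- repeat_last(x, maxgap = 2)       # 1 NA NA NA  2  2  3 (run of 3 NAs is too long)
filled_back <- repeat_last(x, forward = FALSE)  # 1  2  2  2  2  3  3 (next value carried backward)
trimmed     <- repeat_last(c(NA, 1, NA), na.rm = TRUE)  # 1 1 (leading NA dropped)
```

Note that with na.rm = TRUE the result can be shorter than the input, so it is probably not suitable for filling a data.frame column in place.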
Posted on 2017-08-10 00:02:39
A data.table solution:
library(data.table)
dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))
dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))]
dt
y y_forward_fill
1: NA NA
2: 2 2
3: 2 2
4: NA 2
5: NA 2
6: 3 3
7: NA 3
8: 4 4
9: NA 4
10: NA 4
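To see why grouping by cumsum(!is.na(y)) works, it may help to look at the grouping key itself. The sketch below uses only base R. The name `run_id` is introduced here for illustration, and ave() stands in for the data.table grouping.

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# The counter increases at every non-NA value, so each non-NA value
# and the NAs that follow it share one id. Leading NAs get id 0.
run_id <- cumsum(!is.na(y))
run_id
# [1] 0 1 2 2 2 3 3 4 4 4

# Within each id, the first element is the value to carry forward
# (or NA for the leading run).
filled <- ave(y, run_id, FUN = function(v) v[1])
filled
# [1] NA  2  2  2  2  3  3  3  4  4
```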
This approach also works for forward filling zeros:
dt <- data.table(y = c(0, 2, -2, 0, 0, 3, 0, -4, 0, 0))
dt[, y_forward_fill := y[1], .(cumsum(y != 0))]
dt
y y_forward_fill
1: 0 0
2: 2 2
3: -2 -2
4: 0 -2
5: 0 -2
6: 3 3
7: 0 3
8: -4 -4
9: 0 -4
10: 0 -4
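This method becomes very useful on data at scale and when you want to perform a forward fill by group(s), which is trivial with data.table. Just add the group(s) to the by clause before the cumsum logic.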
dt <- data.table(group = sample(c('a', 'b'), 20, replace = TRUE), y = sample(c(1:4, rep(NA, 4)), 20 , replace = TRUE))
dt <- dt[order(group)]
dt[, y_forward_fill := y[1], .(group, cumsum(!is.na(y)))]
dt
group y y_forward_fill
1: a NA NA
2: a NA NA
3: a NA NA
4: a 2 2
5: a NA 2
6: a 1 1
7: a NA 1
8: a 3 3
9: a NA 3
10: a NA 3
11: a 4 4
12: a NA 4
13: a 1 1
14: a 4 4
15: a NA 4
16: a 3 3
17: b 4 4
18: b NA 4
19: b NA 4
20: b 2 2
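As an alternative sketch, more recent data.table releases (1.12.4 and later, as far as I know) ship nafill(), which can do the same per-group fill without the cumsum key. The example data and the column name `y_locf` are made up here, and nafill() is aimed at numeric columns, so the cumsum approach above may still be needed for other types.

```r
library(data.table)

# Hypothetical data: group "b" starts with NAs that must stay NA.
dt <- data.table(group = c("a", "a", "a", "b", "b", "b"),
                 y     = c(NA, 1, NA, NA, NA, 2))

# type = "locf" carries the last observation forward within each group.
dt[, y_locf := nafill(y, type = "locf"), by = group]
dt$y_locf
# [1] NA  1  1 NA NA  2
```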
https://stackoverflow.com/questions/7735647