Q: Replacing NAs with the latest non-NA value

Stack Overflow user

Asked 2011-10-12 13:27:21

19 answers, 113K views, 0 followers, 171 votes

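In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using a vector instead of a data.frame, is the following: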

> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

I would like a function fill.NAs() that allows me to construct yy such that:

> yy
[1] NA  2  2  2  2  3  3  4  4  4

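I need to repeat this operation for many (about 1 TB in total) small data.frames (about 30-50 MB each), in which a row that is NA has all of its entries NA. What is a good way to approach the problem?

The ugly solution I cooked up uses this function: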

last <- function (x){
    x[length(x)]
}    

fill.NAs <- function(isNA){
    if (isNA[1] == 1) {
        isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # leading NAs
                                                  # can't be forward filled
    }
    isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
    isNA.pos[isNA.diff < 0] <- 0
    isNA.neg[isNA.diff > 0] <- 0
    which.isNA.neg <- which(as.logical(isNA.neg))
    if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
    which.isNA.pos <- which(as.logical(isNA.pos))
    which.isNA <- which(as.logical(isNA))
    if (length(which.isNA.neg)==length(which.isNA.pos)){
        replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)],
                           which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] -
                           which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])
        replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
    } else {
        replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])
        replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
    }
    replacement
}

The function fill.NAs is used as follows:

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
    which.isNA <- which(as.logical(isNA))
    to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
    y[to.replace] <- y[replacement]
}

Output

> y
[1] NA  2  2  2  2  3  3  4  4  4

... which seems to work. But, man, is it ugly! Any suggestions?

data.table

zoo

r-faq

19 Answers

Stack Overflow user

Accepted answer

Answered 2011-10-12 13:32:08

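You probably want to use the na.locf() function from the zoo package to carry the last observation forward and replace your NA values.

Here is the beginning of its usage example from the help page: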

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6 
2 2 1 4 5 2 

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6 
2 1 1 4 5 2 

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6 
9 3 2 3 2
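
Applied to the vector from the question, a minimal sketch might look like the following. Note that na.locf() drops leading NAs by default (as in the cz output above); passing na.rm = FALSE keeps them, so the result has the same length as the input. The variable name yy is only illustrative.

```r
library(zoo)

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# na.rm = FALSE keeps the leading NA, which has no earlier value to carry forward
yy <- na.locf(y, na.rm = FALSE)
yy
# [1] NA  2  2  2  2  3  3  4  4  4
```

Keeping the length unchanged matters when the filled vector is assigned back into a data.frame column.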

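181 votes

Stack Overflow user

Answered 2012-12-11 06:45:40

Sorry for digging up an old question. I couldn't look up the function for this job while on the train, so I wrote one myself.

I was proud to find out that it's a tiny bit faster.

It's less flexible, though.

But it plays nicely with ave, which is what I needed.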

repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
    ind = which(!is.na(x))      # get positions of nonmissing values
    if(is.na(x[1]))             # if it begins with a missing, add the 
          ind = c(1,ind)        # first position to the indices
    rep(x[ind], times = diff(   # repeat the values at these indices
       c(ind, length(x) + 1) )) # diffing the indices + length yields how often 
}                               # they need to be repeated

library(zoo)   # for na.locf()

x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')
xx = rep(x, 1000000)
system.time({ yzoo = na.locf(xx, na.rm = FALSE)})
## user  system elapsed   
## 2.754   0.667   3.406   
system.time({ yrep = repeat.before(xx)})  
## user  system elapsed   
## 0.597   0.199   0.793
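
As a rough illustration of the "plays nicely with ave" remark, here is a small sketch of a per-group forward fill in base R. The data frame df and its columns g, v and filled are made up for the example; repeat.before is copied from above so the snippet runs on its own.

```r
# Copied from the answer above so this snippet is self-contained
repeat.before <- function(x) {
  ind <- which(!is.na(x))
  if (is.na(x[1])) ind <- c(1, ind)
  rep(x[ind], times = diff(c(ind, length(x) + 1)))
}

# Hypothetical example data: two groups, the second one starts with NA
df <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                 v = c(1, NA, NA, NA, 5, NA))

# ave() applies the function within each level of g and puts results back in place
df$filled <- ave(df$v, df$g, FUN = repeat.before)
df$filled
# [1]  1  1  1 NA  5  5
```

The leading NA in group b stays NA instead of inheriting the last value of group a.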

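Edit

As this became my most upvoted answer, I was often reminded that I don't use my own function, because I frequently need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates, which I couldn't debug, I came back to this today to improve my old function.

I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is the fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it is inflexible regarding the input's type (he tested the edge cases incorrectly because of a misunderstanding of all.equal).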
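
For reference, a minimal sketch of the tidyr::fill call on the vector from the question, wrapped in a one-column data frame (the names df and filled are assumptions for the example):

```r
library(tidyr)

df <- data.frame(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))

# fill() carries the last non-NA value downwards; .direction defaults to "down"
filled <- fill(df, y)
filled$y
# [1] NA  2  2  2  2  3  3  4  4  4
```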

If you need maxgap, my function below is faster than zoo (and doesn't have the weird problems with dates).

I put up the documentation of my tests.

New function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}
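
A short usage sketch of the arguments, assuming the function exactly as written above (it is repeated here, without comments, only so the snippet runs on its own; the input vector x is made up):

```r
repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)
    ind = which(!is.na(x))
    if (is.na(x[1]) && !na.rm)
        ind = c(1,ind)
    rep_times = diff(c(ind, length(x) + 1))
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap
        if (any(exceed)) {
            ind = sort(c(ind[exceed] + 1, ind))
            rep_times = diff(c(ind, length(x) + 1))
        }
    }
    x = rep(x[ind], times = rep_times)
    if (!forward) x = rev(x)
    x
}

x <- c(1, NA, NA, NA, 2, NA, 3)

fwd <- repeat_last(x)                    # plain forward fill
# [1] 1 1 1 1 2 2 3
gap <- repeat_last(x, maxgap = 2)        # the run of 3 NAs exceeds maxgap and stays NA
# [1]  1 NA NA NA  2  2  3
bwd <- repeat_last(x, forward = FALSE)   # carry the next value backward instead
# [1] 1 2 2 2 2 3 3
trimmed <- repeat_last(c(NA, 1, NA), na.rm = TRUE)  # leading NA is dropped
# [1] 1 1
```

Note that with na.rm = TRUE the result can be shorter than the input, so it is not suitable for assigning back into a data.frame column.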

I've also put the function in my formr package (GitHub only).

69 votes

Stack Overflow user

Answered 2017-08-10 00:02:39

A data.table solution:

library(data.table)

dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))
dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))]
dt
     y y_forward_fill
 1: NA             NA
 2:  2              2
 3:  2              2
 4: NA              2
 5: NA              2
 6:  3              3
 7: NA              3
 8:  4              4
 9: NA              4
10: NA              4
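
To see why this works, it may help to look at the grouping key on its own. The sketch below uses base R only; run_id, filled and filled2 are illustrative names. cumsum(!is.na(y)) increases by one at every non-NA value, so each non-NA value and the NAs that follow it share one id, and the first element of each run is the value to carry forward.

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# One id per run: it increments at each non-NA value; leading NAs get id 0
run_id <- cumsum(!is.na(y))
run_id
# [1] 0 1 2 2 2 3 3 4 4 4

# Same idea as y[1] by group: take the first element of every run
filled <- ave(y, run_id, FUN = function(v) v[1])
filled
# [1] NA  2  2  2  2  3  3  4  4  4

# Equivalent lookup without grouping: index into the non-NA values,
# with an NA prepended for run 0
filled2 <- c(NA, y[!is.na(y)])[run_id + 1]
```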

This method also works for forward filling zeros:

dt <- data.table(y = c(0, 2, -2, 0, 0, 3, 0, -4, 0, 0))
dt[, y_forward_fill := y[1], .(cumsum(y != 0))]
dt
     y y_forward_fill
 1:  0              0
 2:  2              2
 3: -2             -2
 4:  0             -2
 5:  0             -2
 6:  3              3
 7:  0              3
 8: -4             -4
 9:  0             -4
10:  0             -4

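This method becomes very useful on data at scale and when you want to perform a forward fill by group(s), which is trivial with data.table. Just add the group(s) to the by clause before the cumsum logic.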

dt <- data.table(group = sample(c('a', 'b'), 20, replace = TRUE), y = sample(c(1:4, rep(NA, 4)), 20 , replace = TRUE))
dt <- dt[order(group)]
dt[, y_forward_fill := y[1], .(group, cumsum(!is.na(y)))]
dt
    group  y y_forward_fill
 1:     a NA             NA
 2:     a NA             NA
 3:     a NA             NA
 4:     a  2              2
 5:     a NA              2
 6:     a  1              1
 7:     a NA              1
 8:     a  3              3
 9:     a NA              3
10:     a NA              3
11:     a  4              4
12:     a NA              4
13:     a  1              1
14:     a  4              4
15:     a NA              4
16:     a  3              3
17:     b  4              4
18:     b NA              4
19:     b NA              4
20:     b  2              2
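
Because the example above is random, here is a small deterministic sketch of why the group column belongs in the by clause. The column name leaky is made up for the comparison; everything else follows the code above. Without the group, the last value of group a is carried into the leading NA of group b.

```r
library(data.table)

dt <- data.table(group = c("a", "a", "a", "b", "b", "b"),
                 y     = c(1, NA, NA, NA, 5, NA))

# Grouping only by the running count lets values cross the group boundary
dt[, leaky := y[1], by = .(cumsum(!is.na(y)))]

# Adding group to the by clause keeps the fill inside each group
dt[, y_forward_fill := y[1], by = .(group, cumsum(!is.na(y)))]

dt$leaky
# [1] 1 1 1 1 5 5
dt$y_forward_fill
# [1]  1  1  1 NA  5  5
```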

37 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/7735647
