Q: R group by aggregate
Stack Overflow user

Asked 2015-01-13 02:46:01

Answers: 3, Views: 6.1K, Followers: 0, Votes: 2

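In R (which I am relatively new to), I have a data frame made up of several columns plus one numeric column, and I need to aggregate the numeric column by groups defined by another column.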

 SessionID   Price
 '1',       '624.99'
 '1',       '697.99'
 '1',       '649.00'
 '7',       '779.00'
 '7',       '710.00'
 '7',       '2679.50'

I need to group by SessionID and return the max and min of each group onto the original data frame, for example:

 SessionID   Price     Min     Max
 '1',       '624.99'   624.99  697.99
 '1',       '697.99'   624.99  697.99
 '1',       '649.00'   624.99  697.99
 '7',       '779.00'   710.00  2679.50
 '7',       '710.00'   710.00  2679.50
 '7',       '2679.50'  710.00  2679.50

Is there an efficient way to do this?

group-by

aggregate

3 Answers

Stack Overflow user

Accepted answer

Answered 2015-01-13 03:01:07

Here is my solution using aggregate.

First, load the data:

df <- read.table(text = 
"SessionID   Price
'1'       '624.99'
'1'       '697.99'
'1'       '649.00'
'7'       '779.00'
'7'       '710.00'
'7'       '2679.50'", header = TRUE)

Then aggregate, and use match to map the result back onto the original data.frame:

tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
df <- cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
print(df)
#  SessionID   Price    Min     Max
#1         1  624.99 624.99  697.99
#2         1  697.99 624.99  697.99
#3         1  649.00 624.99  697.99
#4         7  779.00 710.00 2679.50
#5         7  710.00 710.00 2679.50
#6         7 2679.50 710.00 2679.50

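Edit: As asked below, you may wonder why this works. It is indeed a bit odd. But remember that a data.frame is just a fancy list. Try calling str(tmp) and you will see that the Price column is itself a 2 by 2 numeric matrix. This is confusing because print.data.frame knows how to handle such a column, so print(tmp) looks as if there were 3 columns. In any case, tmp[2] simply accesses the second column/entry of the data.frame/list and returns it as a 1-column data.frame, whereas tmp[,2] accesses the second column and returns the stored data type (here, the matrix).

Votes: 1

Stack Overflow user

Answered 2015-01-13 02:53:31

Using base R: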
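The matrix-column behaviour described in the edit can be checked directly. This is a minimal sketch that rebuilds the example data with data.frame() (an assumption, made only so the snippet runs on its own) and inspects tmp with base R functions:

```r
# Rebuild the example data so this snippet is self-contained (assumption:
# same values as the read.table() call above).
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))

ncol(tmp)             # 2: SessionID and Price, although print(tmp) shows three
is.matrix(tmp$Price)  # TRUE: Price holds a 2 x 2 matrix, one row per group
colnames(tmp$Price)   # "Min" "Max"

class(tmp[2])         # "data.frame": list-style indexing keeps the data.frame
class(tmp[, 2])       # "matrix" "array" in R >= 4.0 (just "matrix" before)
```

Because tmp[, 2] is the matrix itself, cbind() spreads its Min and Max columns out as two ordinary columns of the result.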

df <- transform(df, Min = ave(Price, SessionID, FUN = min),
                    Max = ave(Price, SessionID, FUN = max))
df
#  SessionID   Price    Min     Max
#1         1  624.99 624.99  697.99
#2         1  697.99 624.99  697.99
#3         1  649.00 624.99  697.99
#4         7  779.00 710.00 2679.50
#5         7  710.00 710.00 2679.50
#6         7 2679.50 710.00 2679.50

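Since your desired result is not aggregated, but is just the original data with two extra columns, in base R you want ave rather than aggregate. You would typically use aggregate if you wanted to collapse the data to one row per SessionID. (Note: AEBilgrau shows that you can also do it with aggregate plus some additional matching.)

Similarly, for dplyr you want mutate rather than summarise, because you do not want to aggregate/summarise the data.

Using dplyr: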

library(dplyr)
df <- df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
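The distinction drawn in this answer (ave and mutate keep every row, aggregate and summarise collapse to one row per group) shows up in the size of the results. A small base R sketch, using a hand-built copy of the example data as an assumption:

```r
# Hand-built copy of the example data (assumption: same values as the question)
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

# ave() returns a vector as long as its input: every row gets its group's value
per_row <- ave(df$Price, df$SessionID, FUN = min)
length(per_row)    # 6, same as nrow(df)

# aggregate() collapses the data to one row per group
per_group <- aggregate(Price ~ SessionID, df, min)
nrow(per_group)    # 2, one row per SessionID
```

This is why the ave result can be attached to df directly, while the aggregate result needs the extra match step shown in the accepted answer.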

Votes: 4

Stack Overflow user

Answered 2015-01-13 02:56:26

Using the data.table package:

library(data.table)

dt = data.table(SessionID=c(1,1,1,7,7,7), Price=c(624,697,649,779,710,2679))

dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
dt
#   SessionID Price Min  Max
#1:         1   624 624  697
#2:         1   697 624  697
#3:         1   649 624  697
#4:         7   779 710 2679
#5:         7   710 710 2679
#6:         7  2679 710 2679

In your case, if you have a data.frame df, just run dt = as.data.table(df) and use the code above.

I was curious how the solutions benchmark on an average data.frame:
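As a rough sketch of that conversion, assuming the data.table package is installed and using a hand-built copy of the question's data: as.data.table() produces a separate data.table, so the `:=` assignment adds the columns to dt in place and leaves df as it was.

```r
library(data.table)

# Hand-built copy of the example data (assumption: same values as the question)
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

dt <- as.data.table(df)   # a data.table copy; df itself is not modified

# := adds the columns to dt by reference, computed per SessionID group
dt[, c("Min", "Max") := list(min(Price), max(Price)), by = SessionID]

names(dt)   # "SessionID" "Price" "Min" "Max"
names(df)   # "SessionID" "Price"
```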

df = data.frame(SessionID=rep(1:1000, each=100), Price=runif(100000, 1, 2000))
dt = as.data.table(df)

algo1 <- function() 
{
    df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
}

algo2 <- function()
{
    dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
}

algo3 <- function()
{
    tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
    cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
}

algo4 <- function()
{
    transform(df, Min = ave(Price, SessionID, FUN = min), Max = ave(Price, SessionID, FUN = max))
}

#> system.time(algo1())
#   user  system elapsed 
#   0.03    0.00    0.19 

#> system.time(algo2())
#   user  system elapsed 
#   0.01    0.00    0.01 

#> system.time(algo3())
#   user  system elapsed 
#   0.77    0.01    0.78 

#> system.time(algo4())
#   user  system elapsed 
#   0.02    0.01    0.03