我反复使用的设计模式之一是对数据帧执行"group by“或"split,apply,combine (SAC)”,然后将聚合数据连接回原始数据。例如,在包含多个州和县的数据框架中计算每个县与州平均值的偏差时,这是很有用的。我的聚合计算很少只是一个简单的平均值,但它是一个很好的例子。我经常用以下方法来解决这个问题:
require(plyr)
set.seed(1)
## set up some data
group1 <- rep(1:3, 4)
group2 <- sample(c("A","B","C"), 12, rep=TRUE)
values <- rnorm(12)
df <- data.frame(group1, group2, values)
## got some data, so let's aggregate
group1Mean <- ddply( df, "group1", function(x)
data.frame( meanValue = mean(x$values) ) )
df <- merge( df, group1Mean )
df
这会产生很好的聚合数据,如下所示:
> df
group1 group2 values meanValue
1 1 A 0.48743 -0.121033
2 1 A -0.04493 -0.121033
3 1 C -0.62124 -0.121033
4 1 C -0.30539 -0.121033
5 2 A 1.51178 0.004804
6 2 B 0.73832 0.004804
7 2 A -0.01619 0.004804
8 2 B -2.21470 0.004804
9 3 B 1.12493 0.758598
10 3 C 0.38984 0.758598
11 3 B 0.57578 0.758598
12 3 A 0.94384 0.758598
这是可行的,但是有没有其他方法可以提高可读性、性能等呢?
发布于 2011-02-17 23:53:33
只需一行代码即可完成此任务:
new <- ddply( df, "group1", transform, numcolwise(mean))
new
group1 group2 values meanValue
1 1 A 0.48742905 -0.121033381
2 1 A -0.04493361 -0.121033381
3 1 C -0.62124058 -0.121033381
4 1 C -0.30538839 -0.121033381
5 2 A 1.51178117 0.004803931
6 2 B 0.73832471 0.004803931
7 2 A -0.01619026 0.004803931
8 2 B -2.21469989 0.004803931
9 3 B 1.12493092 0.758597929
10 3 C 0.38984324 0.758597929
11 3 B 0.57578135 0.758597929
12 3 A 0.94383621 0.758597929
identical(df, new)
[1] TRUE
发布于 2011-02-18 00:00:31
我认为ave()
在这里比你展示的plyr调用更有用(我对plyr还不够熟悉,不知道你是否可以直接用plyr做你想做的事情,如果你不能的话,我会很惊讶!)或其他基础R备选方案(aggregate()
、tapply()
)。
> with(df, ave(values, group1, FUN = mean))
[1] -0.121033381 0.004803931 0.758597929 -0.121033381 0.004803931
[6] 0.758597929 -0.121033381 0.004803931 0.758597929 -0.121033381
[11] 0.004803931 0.758597929
您可以使用within()
或transform()
将此结果直接嵌入到df
中
> df2 <- within(df, meanValue <- ave(values, group1, FUN = mean))
> head(df2)
group1 group2 values meanValue
1 1 A 0.4874291 -0.121033381
2 2 B 0.7383247 0.004803931
3 3 B 0.5757814 0.758597929
4 1 C -0.3053884 -0.121033381
5 2 A 1.5117812 0.004803931
6 3 C 0.3898432 0.758597929
> df3 <- transform(df, meanValue = ave(values, group1, FUN = mean))
> all.equal(df2,df3)
[1] TRUE
如果排序很重要:
> head(df2[order(df2$group1, df2$group2), ])
group1 group2 values meanValue
1 1 A 0.48742905 -0.121033381
10 1 A -0.04493361 -0.121033381
4 1 C -0.30538839 -0.121033381
7 1 C -0.62124058 -0.121033381
5 2 A 1.51178117 0.004803931
11 2 A -0.01619026 0.004803931
发布于 2011-02-17 23:54:40
不能只将x
添加到传递给ddply
的函数中吗
df <- ddply( df, "group1", function(x)
data.frame( x, meanValue = mean(x$values) ) )
https://stackoverflow.com/questions/5031116
复制相似问题