Question: Big data aggregation in R

Stack Overflow user

Asked 2014-09-13 23:23:39

Answers: 2 | Views: 775 | Followers: 0 | Votes: 1

I have a dataset (dat) that looks like this:

Team    Person      Performance1    Performance2
 1      36465930         1              101
 1      37236856         1              101
 1      34940210         1              101
 1      29135524         1              101
 2      10318268         1              541
 2      641793           1              541
 2      32352593         1              541
 2      2139024          1              541
 3      35193922         2              790
 3      32645504         2              890
 3      32304024         2              790
 3      22696491         2              790

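I am trying to identify and remove every team that has any variance in Performance1 or Performance2. For example, team 3 in the sample has variance in Performance2, so I want to remove that team from the dataset. Here is the code I wrote: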

tda <- aggregate(dat, by=list(dat$Team), FUN=sd)
tda1 <- tda[ which(tda$Performance1 != 0 | tda$Performance2 != 0), ]

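The problem is that my dataset has more than 100,000 teams, so the first line of code takes a very long time, and I am not sure it will ever finish aggregating the dataset. How can I solve this more efficiently?

Thanks! :)

Sincerely, Amy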

aggregation

large-data

bigdata

aggregate

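Stack Overflow user

Accepted answer

Answered 2014-09-13 23:38:07

The dplyr package is usually very fast. Here is one way to select only those teams whose standard deviation is zero for both Performance1 and Performance2: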

library(dplyr)

datAggregated = dat %>%
  group_by(Team) %>%
  summarise(sdP1 = sd(Performance1),
            sdP2 = sd(Performance2)) %>%
  filter(sdP1==0 & sdP2==0)

datAggregated
  Team sdP1 sdP2
1    1    0    0
2    2    0    0
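
The summary above lists the teams to keep, but the question asks for the original rows with the varying teams removed. A minimal base R sketch of that last step follows, assuming `dat` has the four columns shown in the question. The helper `n_unique_by` and the names `keep` and `dat_constant` are hypothetical and not part of the original answer. It counts distinct values per team instead of computing `sd()`, which is a swapped-in technique rather than the answerer's method.

```r
# Sample data copied from the question.
dat <- data.frame(
  Team = rep(1:3, each = 4),
  Person = c(36465930, 37236856, 34940210, 29135524,
             10318268, 641793, 32352593, 2139024,
             35193922, 32645504, 32304024, 22696491),
  Performance1 = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  Performance2 = c(101, 101, 101, 101, 541, 541, 541, 541,
                   790, 890, 790, 790)
)

# Hypothetical helper: for each row, the number of distinct values of x
# within that row's group g (ave() repeats the per-group result per row).
n_unique_by <- function(x, g) {
  ave(x, g, FUN = function(v) length(unique(v)))
}

# A row is kept only if its team is constant on both performance columns.
keep <- n_unique_by(dat$Performance1, dat$Team) == 1 &
  n_unique_by(dat$Performance2, dat$Team) == 1

# Original rows, with every varying team dropped.
dat_constant <- dat[keep, ]
```

Counting distinct values also sidesteps `sd()` returning `NA` for a team that has only one row. With dplyr loaded, a comparable grouped filter would be `dat %>% group_by(Team) %>% filter(n_distinct(Performance1) == 1, n_distinct(Performance2) == 1)`. Neither variant has been timed on the 100,000-team dataset described in the question.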

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/25828695
