文章/答案/技术大牛

发布

问一个更快的条件子集
EN

Stack Overflow用户

提问于 2022-10-03 18:28:15

回答 1查看 60关注 0票数 1

我试图根据一列中的值和另一列中最常见的值来修改我的数据。

df <- data.frame(points=c(1, 2, 4, 3, 4, 8, 3, 3, 2),
                 assists=c(6, 6, 5, 6, 6, 9, 9, 1, 1),
                 team=c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))

  points assists team
1      1       6    A
2      2       6    A
3      4       5    A
4      3       6    A
5      4       6    A
6      8       9    C
7      3       9    C
8      3       1    C
9      2       1    C

像这样：

df2 <- data.frame(points=c(1, 2, 3, 4, 8, 3),
                 assists=c(6, 6, 6, 6, 1, 1),
                 team=c('A', 'A', 'A', 'A', 'C', 'C'))

  points assists team
1      1       6    A
2      2       6    A
3      3       6    A
4      4       6    A
5      8       1    C
6      3       1    C

目标是保持"team“列中有值A和C的所有行，只要在”协助“列中保留最常见的值("6”表示"A“)。如果有领带(如"9“和"1”表示"C")，则应保留最后一个最常见的值。

我是用for循环来完成这个任务的，但是我的dataframe有3,000,000行，而且处理非常慢。有谁知道更快的选择吗？

base

dataframe

subset

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-10-03 18:44:53

我们可以修改Mode函数，并通过filter进行分组。

library(dplyr)
 Mode <- function(x) {
   # get the unique elements
   ux <- unique(x)
   # convert to integer index with match and get the frequency
   # tabulate should be faster than table
   freq <- tabulate(match(x, ux))
   # use == on the max of the freq, get the corresponding ux
   # then get the last elements of ux
   last(ux[freq == max(freq)])
 }
df %>% 
   # grouped by team
   group_by(team) %>% 
   # filter only those assists that are returned from Mode function
   filter(assists %in% Mode(assists)) %>%
   ungroup

-output

# A tibble: 6 × 3
  points assists team 
   <dbl>   <dbl> <chr>
1      1       6 A    
2      2       6 A    
3      3       6 A    
4      4       6 A    
5      3       1 C    
6      2       1 C

也可以使用data.table方法来加快执行速度。

library(data.table)
# setDT - converts data.frame to data.table
# create a frequency column (`.N`) by assignment (`:=`)
# grouped by team, assists columns
setDT(df)[, N := .N, by = .(team, assists)]
# grouped by team, get the index of the max N from reverse (`.N:1`)
#subset the assists with that index
# create a logical vector with %in%
# get the row index -.I, which creates a default column V1
# extract the column ($V1) and use that to subset the data
df[df[, .I[assists %in% assists[.N - which.max(N[.N:1]) + 1]],
     by = team]$V1][, N := NULL][]
   points assists   team
    <num>   <num> <char>
1:      1       6      A
2:      2       6      A
3:      3       6      A
4:      4       6      A
5:      3       1      C
6:      2       1      C

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/73939415

复制

相似问题

问一个更快的条件子集
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问一个更快的条件子集EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问一个更快的条件子集
EN