我试图根据一列中的值和另一列中最常见的值来修改我的数据。
df <- data.frame(points=c(1, 2, 4, 3, 4, 8, 3, 3, 2),
assists=c(6, 6, 5, 6, 6, 9, 9, 1, 1),
team=c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 4 5 A
4 3 6 A
5 4 6 A
6 8 9 C
7 3 9 C
8 3 1 C
9 2 1 C
像这样:
df2 <- data.frame(points=c(1, 2, 3, 4, 8, 3),
assists=c(6, 6, 6, 6, 1, 1),
team=c('A', 'A', 'A', 'A', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 8 1 C
6 3 1 C
目标是保持"team“列中有值A和C的所有行,只要在”协助“列中保留最常见的值("6”表示"A“)。如果有领带(如"9“和"1”表示"C"),则应保留最后一个最常见的值。
我是用for循环来完成这个任务的,但是我的dataframe有3,000,000行,而且处理非常慢。有谁知道更快的选择吗?
发布于 2022-10-03 18:44:53
我们可以修改Mode函数,并通过filter
进行分组。
library(dplyr)
Mode <- function(x) {
# get the unique elements
ux <- unique(x)
# convert to integer index with match and get the frequency
# tabulate should be faster than table
freq <- tabulate(match(x, ux))
# use == on the max of the freq, get the corresponding ux
# then get the last elements of ux
last(ux[freq == max(freq)])
}
df %>%
# grouped by team
group_by(team) %>%
# filter only those assists that are returned from Mode function
filter(assists %in% Mode(assists)) %>%
ungroup
-output
# A tibble: 6 × 3
points assists team
<dbl> <dbl> <chr>
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 3 1 C
6 2 1 C
也可以使用data.table
方法来加快执行速度。
library(data.table)
# setDT - converts data.frame to data.table
# create a frequency column (`.N`) by assignment (`:=`)
# grouped by team, assists columns
setDT(df)[, N := .N, by = .(team, assists)]
# grouped by team, get the index of the max N from reverse (`.N:1`)
#subset the assists with that index
# create a logical vector with %in%
# get the row index -.I, which creates a default column V1
# extract the column ($V1) and use that to subset the data
df[df[, .I[assists %in% assists[.N - which.max(N[.N:1]) + 1]],
by = team]$V1][, N := NULL][]
points assists team
<num> <num> <char>
1: 1 6 A
2: 2 6 A
3: 3 6 A
4: 4 6 A
5: 3 1 C
6: 2 1 C
https://stackoverflow.com/questions/73939415
复制相似问题