我有像这个例子一样的数据集,但是每个输入有1000个输入和1000个单词,每个输入x x字组合有30个值(在cols Copy1..Copy30中)。
df = read.table(header=T, sep=",", text="
Input,Time,Word,Copy1,Copy2,Copy3,Copy30
ark,1,ark,0.00,0.00,0.00,0.00
ark,1,ad,0.00,0.00,0.00,0.00
ark,1,bark,0.00,0.00,0.00,0.00
ark,50,ark,0.00,0.10,0.05,0.00
ark,50,ad,0.00,0.05,0.03,0.00
ark,50,bark,0.07,0.06,0.00,0.00
ark,100,ark,0.00,0.17,0.55,0.00
ark,100,ad,0.00,0.03,0.11,0.00
ark,100,bark,0.05,0.20,0.00,0.00
bark,1,ark,0.00,0.00,0.00,0.00
bark,1,ad,0.00,0.00,0.00,0.00
bark,1,bark,0.00,0.00,0.00,0.00
bark,50,ark,0.00,0.03,0.09,0.00
bark,50,ad,0.00,0.05,0.03,0.00
bark,50,bark,0.2,0.75,0.00,0.00
bark,100,ark,0.00,0.08,0.32,0.00
bark,100,ad,0.00,0.03,0.11,0.00
bark,100,bark,0.21,0.60,0.00,0.00
") %>% arrange(Input,Time,Word)
df
# Input Time Word Copy1 Copy2 Copy3 Copy30
# 1 ark 1 ad 0.00 0.00 0.00 0
# 2 ark 1 ark 0.00 0.00 0.00 0
# 3 ark 1 bark 0.00 0.00 0.00 0
# 4 ark 50 ad 0.00 0.05 0.03 0
# 5 ark 50 ark 0.00 0.10 0.05 0
# 6 ark 50 bark 0.07 0.06 0.00 0
# 7 ark 100 ad 0.00 0.03 0.11 0
# 8 ark 100 ark 0.00 0.17 0.55 0
# 9 ark 100 bark 0.05 0.20 0.00 0
# 10 bark 1 ad 0.00 0.00 0.00 0
# 11 bark 1 ark 0.00 0.00 0.00 0
# 12 bark 1 bark 0.00 0.00 0.00 0
# 13 bark 50 ad 0.00 0.05 0.03 0
# 14 bark 50 ark 0.00 0.03 0.09 0
# 15 bark 50 bark 0.20 0.75 0.00 0
# 16 bark 100 ad 0.00 0.03 0.11 0
# 17 bark 100 ark 0.00 0.08 0.32 0
# 18 bark 100 bark 0.21 0.60 0.00 0我想按输入和Word进行分组,对于每个组合,确定哪个复制列对每个单词具有最大值,然后只为该单词保留该列用于该输入。对一个previous question的回应让我成为了其中的一部分。此代码标识每个单词的最大副本。
max_copy <- df %>%
pivot_longer(starts_with("Copy"), names_to="copy_name", values_to="copy_value") %>%
group_by(Input, Word) %>%
filter(rank(copy_value, ties.method="first") == n()) %>%
group_by(Input, Time)
max_copy
# A tibble: 6 x 5
# Groups: Input, Time [3]
# Input Time Word copy_name copy_value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 100 ad Copy3 0.11
# 2 ark 100 ark Copy3 0.55
# 3 ark 100 bark Copy2 0.2
# 4 bark 50 bark Copy2 0.75
# 5 bark 100 ad Copy3 0.11
# 6 bark 100 ark Copy3 0.32现在,我要做的是使用它将数据减少为每个输入的每个单词的标识副本,这样结果将是:
# A tibble: 18 x 5
# Groups: Input, Time [6]
# Input Time Word copy_name copy_value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 1 ad Copy3 0
# 2 ark 1 ark Copy3 0
# 3 ark 1 bark Copy2 0
# 4 ark 50 ad Copy3 0.03
# 5 ark 50 ark Copy3 0.05
# 6 ark 50 bark Copy2 0.06
# 7 ark 100 ad Copy3 0.11
# 8 ark 100 ark Copy3 0.55
# 9 ark 100 bark Copy2 0.2
# 10 bark 1 ad Copy3 0
# 11 bark 1 ark Copy3 0
# 12 bark 1 bark Copy2 0
# 13 bark 50 ad Copy3 0.03
# 14 bark 50 ark Copy3 0.09
# 15 bark 50 bark Copy2 0.75
# 16 bark 100 ad Copy3 0.11
# 17 bark 100 ark Copy3 0.32
# 18 bark 100 bark Copy2 0.6 有什么方法可以像这样使用max_copy数据来减少df呢?
编辑:下面有一些解决方案的问题。如果存在负值(易于处理)或,则@akrun的解决方案会中断,如果后面的副本中有正值,则为最大值的副本(我不知道如何解决这个问题)。@AnoushiravanR的解在这两种情况下似乎都是稳健的,@AnilGo差尔的解也是如此。下面是包含这些条件的更新数据集。
df2 = read.table(header=T, sep=",", text="
Input,Time,Word,Copy1,Copy2,Copy3,Copy30
ark,1,ark,0.00,0.00,0.00,-0.01
ark,1,ad,0.00,0.00,0.00,-0.01
ark,1,bark,0.00,0.00,0.00,-0.01
ark,1,bar,0.00,0.00,0.00,-0.01
ark,50,ark,0.00,0.10,0.05,-0.01
ark,50,ad,0.00,0.05,0.03,-0.01
ark,50,bark,0.07,0.06,0.01,-0.01
ark,50,bar,0.07,0.06,0.01,-0.01
ark,100,ark,0.00,0.17,0.55,-0.01
ark,100,ad,0.00,0.03,0.11,-0.01
ark,100,bark,0.05,0.20,0.01,-0.01
ark,100,bar,0.04,0.15,0.01,-0.01
bark,1,ark,0.00,0.00,0.00,-0.01
bark,1,ad,0.00,0.00,0.00,-0.01
bark,1,bark,0.00,0.00,0.00,-0.01
bark,1,bar,0.00,0.00,0.00,-0.01
bark,50,ark,0.00,0.03,0.09,-0.01
bark,50,ad,0.00,0.05,0.03,-0.01
bark,50,bark,0.2,0.75,0.01,0.01
bark,50,bar,0.2,0.7,0.00,-0.01
bark,100,ark,0.00,0.08,0.32,-0.01
bark,100,ad,0.00,0.03,0.11,-0.01
bark,100,bark,0.21,0.60,0.01,-0.01
bark,100,bar,0.15,0.4,0.01,-0.01
") %>% arrange(Input,Time,Word)Df2的期望输出:
# A tibble: 24 x 5
# Input Time Word copy_name Value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 1 ad Copy3 0
# 2 ark 1 ark Copy3 0
# 3 ark 1 bar Copy2 0
# 4 ark 1 bark Copy2 0
# 5 ark 50 ad Copy3 0.03
# 6 ark 50 ark Copy3 0.05
# 7 ark 50 bar Copy2 0.06
# 8 ark 50 bark Copy2 0.06
# 9 ark 100 ad Copy3 0.11
# 10 ark 100 ark Copy3 0.55
# 11 ark 100 bar Copy2 0.15
# 12 ark 100 bark Copy2 0.2
# 13 bark 1 ad Copy3 0
# 14 bark 1 ark Copy3 0
# 15 bark 1 bar Copy2 0
# 16 bark 1 bark Copy2 0
# 17 bark 50 ad Copy3 0.03
# 18 bark 50 ark Copy3 0.09
# 19 bark 50 bar Copy2 0.7
# 20 bark 50 bark Copy2 0.75
# 21 bark 100 ad Copy3 0.11
# 22 bark 100 ark Copy3 0.32
# 23 bark 100 bar Copy2 0.4
# 24 bark 100 bark Copy2 0.6 发布于 2021-05-01 22:02:16
这可以用summarise来完成。使用pivot_longer将其重新格式化为“long”格式后,按“输入”、“时间”单词进行分组,然后summarise根据if all值为0的条件创建“copy_value”,然后返回0或else返回last非零值“copy_value”。
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'copy_value') %>%
group_by(Input, Time, Word) %>%
summarise(copy_value = if(all(copy_value == 0)) 0
else last(copy_value[copy_value != 0]), .groups = 'drop')-output
# A tibble: 18 x 4
# Input Time Word copy_value
# * <chr> <int> <chr> <dbl>
# 1 ark 1 ad 0
# 2 ark 1 ark 0
# 3 ark 1 bark 0
# 4 ark 50 ad 0.03
# 5 ark 50 ark 0.05
# 6 ark 50 bark 0.06
# 7 ark 100 ad 0.11
# 8 ark 100 ark 0.55
# 9 ark 100 bark 0.2
#10 bark 1 ad 0
#11 bark 1 ark 0
#12 bark 1 bark 0
#13 bark 50 ad 0.03
#14 bark 50 ark 0.09
#15 bark 50 bark 0.75
#16 bark 100 ad 0.11
#17 bark 100 ark 0.32
#18 bark 100 bark 0.6 如果我们也需要“copy_name”,那么在slice中使用相同的逻辑表达式来返回符合条件的行,即if all值为0,返回最后一行(n() -无关紧要)或获得copy_value的last非零的索引。现在,我们使用“输入”、“单词”和“mutate”(Copy_name)进行分组,方法是用相应的“copy_name”替换“copy_value”,其中“copy_value”是max。
df %>%
pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'copy_value') %>%
group_by(Input, Time, Word) %>%
arrange(copy_value) %>%
slice(if(all(copy_value <= 0)) n()
else tail(which(copy_value > 0), 1))%>%
group_by(Input, Word) %>%
mutate(copy_name = copy_name[which.max(copy_value)]) %>%
ungroup-output
# A tibble: 18 x 5
# Input Time Word copy_name copy_value
# <chr> <int> <chr> <chr> <dbl>
# 1 ark 1 ad Copy3 0
# 2 ark 1 ark Copy3 0
# 3 ark 1 bark Copy2 0
# 4 ark 50 ad Copy3 0.03
# 5 ark 50 ark Copy3 0.05
# 6 ark 50 bark Copy2 0.06
# 7 ark 100 ad Copy3 0.11
# 8 ark 100 ark Copy3 0.55
# 9 ark 100 bark Copy2 0.2
#10 bark 1 ad Copy3 0
#11 bark 1 ark Copy3 0
#12 bark 1 bark Copy2 0
#13 bark 50 ad Copy3 0.03
#14 bark 50 ark Copy3 0.09
#15 bark 50 bark Copy2 0.75
#16 bark 100 ad Copy3 0.11
#17 bark 100 ark Copy3 0.32
#18 bark 100 bark Copy2 0.6 发布于 2021-05-01 22:16:02
更新解决方案
我已经用你的新数据更新了我的解决方案。我看不出输出有什么问题,但是如果有什么需要修改的地方,我很高兴知道。
library(dplyr)
library(tidyr)
library(purrr)
df2 %>%
mutate(Copy_value = pmap_dbl(df2 %>% select(Copy1:Copy30), ~ max(c(...))),
Copy_name = pmap(df2 %>% select(Copy1:Copy30), ~
names(c(...)[c(...) == max(c(...))]))) %>%
unnest(Copy_name) %>%
group_by(Input, Word) %>%
mutate(Copy_name = Copy_name[which.max(Copy_value)]) %>%
distinct() %>%
select(-c(Copy1:Copy_value)) %>%
right_join(df2, by = c("Input", "Time", "Word")) %>%
rowwise() %>%
mutate(Copy_value = map_dbl(Copy_name, ~ get({.x}))) %>%
select(-c(Copy1:Copy30))output 这是新提供的数据集的输出。
Input Time Word Copy_name Copy_value
1 ark 1 ad Copy3 0.00
2 ark 1 ark Copy3 0.00
3 ark 1 bar Copy2 0.00
4 ark 1 bark Copy2 0.00
5 ark 50 ad Copy3 0.03
6 ark 50 ark Copy3 0.05
7 ark 50 bar Copy2 0.06
8 ark 50 bark Copy2 0.06
9 ark 100 ad Copy3 0.11
10 ark 100 ark Copy3 0.55
11 ark 100 bar Copy2 0.15
12 ark 100 bark Copy2 0.20
13 bark 1 ad Copy3 0.00
14 bark 1 ark Copy3 0.00
15 bark 1 bar Copy2 0.00
16 bark 1 bark Copy2 0.00
17 bark 50 ad Copy3 0.03
18 bark 50 ark Copy3 0.09
19 bark 50 bar Copy2 0.70
20 bark 50 bark Copy2 0.75
21 bark 100 ad Copy3 0.11
22 bark 100 ark Copy3 0.32
23 bark 100 bar Copy2 0.40
24 bark 100 bark Copy2 0.60发布于 2021-05-02 15:42:33
通过purrr的另一种方法
df %>%
pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'Value') %>%
semi_join(df %>% nest(copy_name = !c(Input, Word)) %>%
mutate(copy_name = map_chr(copy_name,
~ names(.x)[1 + which(.x[-1] == max(.x[-1]), arr.ind = T)[2]])),
by = c("Input", "Word", "copy_name")
)
# A tibble: 18 x 5
Input Time Word copy_name Value
<chr> <int> <chr> <chr> <dbl>
1 ark 1 ad Copy3 0
2 ark 1 ark Copy3 0
3 ark 1 bark Copy2 0
4 ark 50 ad Copy3 0.03
5 ark 50 ark Copy3 0.05
6 ark 50 bark Copy2 0.06
7 ark 100 ad Copy3 0.11
8 ark 100 ark Copy3 0.55
9 ark 100 bark Copy2 0.2
10 bark 1 ad Copy3 0
11 bark 1 ark Copy3 0
12 bark 1 bark Copy2 0
13 bark 50 ad Copy3 0.03
14 bark 50 ark Copy3 0.09
15 bark 50 bark Copy2 0.75
16 bark 100 ad Copy3 0.11
17 bark 100 ark Copy3 0.32
18 bark 100 bark Copy2 0.6实际上,这个可以分为两部分-
purrr::map_chr找到这些副本的名称,其中副本的值对于任何时间值都是最大的。df %>% nest(copy_name = !c(Input, Word)) %>%
mutate(copy_name = map_chr(copy_name,
~ names(.x)[1 + which(.x[-1] == max(.x[-1]), arr.ind = T)[2]]))
# A tibble: 6 x 3
Input Word copy_name
<chr> <chr> <chr>
1 ark ad Copy3
2 ark ark Copy3
3 ark bark Copy2
4 bark ad Copy3
5 bark ark Copy3
6 bark bark Copy2semi_join将旋转数据与此数据连接起来,这实际上是一个过滤的连接.单管的另一种方法
df %>% nest(data = !c(Input, Word)) %>%
mutate(data = map(data, ~ .x %>%
select(Time, 1+which(.x[-1] == max(.x[-1]), arr.ind = T)[2]) %>%
mutate(copy = names(.)[2]) %>%
rename_with(~'value', 2)
)) %>%
unnest(data)
# A tibble: 18 x 5
Input Word Time value copy
<chr> <chr> <int> <dbl> <chr>
1 ark ad 1 0 Copy3
2 ark ad 50 0.03 Copy3
3 ark ad 100 0.11 Copy3
4 ark ark 1 0 Copy3
5 ark ark 50 0.05 Copy3
6 ark ark 100 0.55 Copy3
7 ark bark 1 0 Copy2
8 ark bark 50 0.06 Copy2
9 ark bark 100 0.2 Copy2
10 bark ad 1 0 Copy3
11 bark ad 50 0.03 Copy3
12 bark ad 100 0.11 Copy3
13 bark ark 1 0 Copy3
14 bark ark 50 0.09 Copy3
15 bark ark 100 0.32 Copy3
16 bark bark 1 0 Copy2
17 bark bark 50 0.75 Copy2
18 bark bark 100 0.6 Copy2https://stackoverflow.com/questions/67351185
复制相似问题