今天在使用dplyr数据分析时遇到一个问题,就是如何在分组汇总时自动生成多列。
下面的代码和数据源主要来自:https://stackoverflow.com/questions/51063842/create-multiple-columns-in-summarize,以计算分位数为例。
> library(dplyr)
> library(tidyr)
>
> Z <- data.frame(x = runif(1000, min = 0, max = 20)) %>%
+ mutate(y = rnorm(n(), mean = sin(x))) %>%
+ group_by(x.category = round(x))
> Z
# A tibble: 1,000 x 3
# Groups: x.category [21]
x y x.category
<dbl> <dbl> <dbl>
1 0.670 0.121 1
2 16.5 0.0702 16
3 15.0 -1.47 15
4 3.16 -0.595 3
5 12.7 -0.915 13
6 5.25 -0.540 5
7 3.82 -0.671 4
8 10.6 -2.33 11
9 18.3 1.15 18
10 1.53 0.205 2
# … with 990 more rows
首先生成想要计算的分位数,然后再summarize()
中用list()
将结果包起来。这个办法的聪明之处在于解决了汇总时每个分组只能返回一个值的问题。
> probs <- c(0.25, 0.5, 0.75)
>
> Z %>%
+ summarize(x = mean(x),
+ quantile = list(quantile(y, probs)),
+ prob = list(probs)) %>%
+ unnest(cols = c("quantile", "prob"))
# A tibble: 63 x 4
x.category x quantile prob
<dbl> <dbl> <dbl> <dbl>
1 0 0.162 0.120 0.25
2 0 0.162 0.576 0.5
3 0 0.162 0.954 0.75
4 1 1.00 0.119 0.25
5 1 1.00 0.818 0.5
6 1 1.00 1.51 0.75
7 2 2.02 0.119 0.25
8 2 2.02 0.556 0.5
9 2 2.02 1.38 0.75
10 3 2.89 -0.418 0.25
# … with 53 more rows
去掉 unnest()
部分我们就可以看到精妙之处。
> Z %>%
+ summarize(x = mean(x),
+ quantile = list(quantile(y, probs)),
+ prob = list(probs))
# A tibble: 21 x 4
x.category x quantile prob
<dbl> <dbl> <list> <list>
1 0 0.162 <dbl [3]> <dbl [3]>
2 1 1.00 <dbl [3]> <dbl [3]>
3 2 2.02 <dbl [3]> <dbl [3]>
4 3 2.89 <dbl [3]> <dbl [3]>
5 4 3.95 <dbl [3]> <dbl [3]>
6 5 5.00 <dbl [3]> <dbl [3]>
7 6 5.99 <dbl [3]> <dbl [3]>
8 7 7.04 <dbl [3]> <dbl [3]>
9 8 7.98 <dbl [3]> <dbl [3]>
10 9 8.96 <dbl [3]> <dbl [3]>
# … with 11 more rows
只要进一步转换为宽格式就可以完成处理了。
> Z %>%
+ summarize(x = mean(x),
+ quantile = list(quantile(y, probs)),
+ prob = list(probs)) %>%
+ unnest(cols = c("quantile", "prob")) %>%
+ pivot_wider(names_from = "prob", values_from = "quantile")
# A tibble: 21 x 5
x.category x `0.25` `0.5` `0.75`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0.162 0.120 0.576 0.954
2 1 1.00 0.119 0.818 1.51
3 2 2.02 0.119 0.556 1.38
4 3 2.89 -0.418 0.0492 1.01
5 4 3.95 -1.48 -1.16 -0.363
6 5 5.00 -1.26 -0.591 -0.0489
7 6 5.99 -1.19 -0.644 0.128
8 7 7.04 -0.111 0.494 1.19
9 8 7.98 0.302 1.10 1.70
10 9 8.96 -0.161 0.730 1.32
# … with 11 more rows
还有一种解法也非常巧妙,利用逗号分隔符先将结果拼起来。
> q = c(0.25, 0.5, 0.75)
> Z %>%
+ summarise(x = mean(x),
+ qtls = paste(quantile(y, q), collapse = ","))
# A tibble: 21 x 3
x.category x qtls
<dbl> <dbl> <chr>
1 0 0.162 0.120072116112535,0.575978614296194,0.95448088936774
2 1 1.00 0.119032678946747,0.817605136591999,1.51100977230941
3 2 2.02 0.118974463754045,0.55558427752219,1.37673188157502
4 3 2.89 -0.417873321719452,0.0491691391974207,1.01255080037855
5 4 3.95 -1.4814206722814,-1.16393051974191,-0.363080292142612
6 5 5.00 -1.26150842194084,-0.590699875806735,-0.0489240190965984
7 6 5.99 -1.19467973679591,-0.644474146111019,0.127766424109508
8 7 7.04 -0.111463505167144,0.494006133975402,1.18883677903866
9 8 7.98 0.301653284838848,1.0991386822569,1.70216573511269
10 9 8.96 -0.16083962000444,0.729929510376446,1.32344915438963
# … with 11 more rows
然后进一步使用 seperate()
函数解包:
> Z %>%
+ summarise(x = mean(x),
+ qtls = paste(quantile(y, q), collapse = ",")) %>% # get quantile values as a string
+ separate(qtls, paste0("y_", 100*q), sep = ",", convert = T) # separate quantile values and give corresponding names to columns
# A tibble: 21 x 5
x.category x y_25 y_50 y_75
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0.162 0.120 0.576 0.954
2 1 1.00 0.119 0.818 1.51
3 2 2.02 0.119 0.556 1.38
4 3 2.89 -0.418 0.0492 1.01
5 4 3.95 -1.48 -1.16 -0.363
6 5 5.00 -1.26 -0.591 -0.0489
7 6 5.99 -1.19 -0.644 0.128
8 7 7.04 -0.111 0.494 1.19
9 8 7.98 0.302 1.10 1.70
10 9 8.96 -0.161 0.730 1.32
# … with 11 more rows
这个解法在命名上看起来更有优势。
参考:https://stackoverflow.com/questions/51063842/create-multiple-columns-in-summarize