Series "Tidyverse 数据处理" (Tidyverse data processing), 01/08

A Tour of the Tidyverse | Filtering rows and selecting columns: select() and column operations

生信补给站

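In 2020 I am cracking open R for Data Science (《R 数据科学》) to study data processing in R systematically.
In a typical data science project, the tools needed look roughly like the figure below. (R for Data Science)

[Figure: the tool model of a typical data science project, from R for Data Science]

Data import and tidying are rather tedious and dull, and it is easy to go from getting started to giving up! Starting with data transformation and visualization shows results quickly and keeps the motivation to learn.

Earlier posts covered how to draw some plots common in bioinformatics (more will be added over time). From now on, posts will mainly follow R for Data Science and introduce the data analysis process step by step.

This post uses the msleep dataset to go through operations on columns.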

1 Loading the R package and the data

# Load the R package
# install.packages("tidyverse")
library("tidyverse")
# Look at the msleep dataset (it ships with ggplot2)
head(msleep,2)

# A tibble: 2 x 11
  name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Chee~ Acin~ carni Carn~ lc                  12.1      NA        NA      11.9
2 Owl ~ Aotus omni  Prim~ NA                  17         1.8      NA       7  
# ... with 2 more variables: brainwt <dbl>, bodywt <dbl>

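The dataset above has 11 columns (variables). In bioinformatics, clinical information and laboratory test indicators often run to hundreds of columns, and gene (mutation, expression) data to thousands or tens of thousands.

In that case, select() can quickly produce a useful subset of variables based on their names.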
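Before subsetting, it helps to see which column names select() can refer to. A minimal sketch using names(), glimpse() and ncol(); the variable n_cols is introduced here only for illustration:

```r
library(tidyverse)

# List every column name in msleep
names(msleep)

# glimpse() prints one line per column with its type and first few values
glimpse(msleep)

# Number of variables in the dataset
n_cols <- ncol(msleep)
n_cols
```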

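2 Selecting by column name

2.1 Selecting columns by their names

Pass the column names directly to select() to pick the corresponding columns.

# Select the three columns name, sleep_total and awake, with awake in the middle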
msleep %>%
  select(name, awake, sleep_total) %>% head()

Bonus: the columns are output in the order you list them.

2.2 Selecting a range of consecutive columns

Use the start_col:end_col syntax to select several consecutive columns.

msleep %>%
  select(name:vore, sleep_total:awake) %>% head(2)

# A tibble: 2 x 7
  name                       genus      vore  sleep_total sleep_rem sleep_cycle awake
  <chr>                      <chr>      <chr>       <dbl>     <dbl>       <dbl> <dbl>
1 Cheetah                    Acinonyx   carni        12.1      NA        NA      11.9
2 Owl monkey                 Aotus      omni         17         1.8      NA       7

As in base R, ":" selects a run of consecutive columns.
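Because the range is positional, the same selection can be written with column numbers. The sketch below assumes msleep's default column order (name is column 1, sleep_total is column 6); the names by_name and by_position are illustrative only:

```r
library(tidyverse)

# Ranges written with column names
by_name <- msleep %>% select(name:vore, sleep_total:awake)

# The same ranges written as column positions
by_position <- msleep %>% select(1:3, 6:9)

# Both calls pick the same columns in the same order
identical(names(by_name), names(by_position))
```

Names are usually the safer choice, since positions shift as soon as a column is added or removed upstream.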

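2.3 Selecting columns by partial name

If the column names share a similar structure, use starts_with(), ends_with() and contains() for partial matching.

1) starts_with() selects all columns whose names start with "XX"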

msleep %>%
  select(name, starts_with("sleep")) %>% head(2)
# A tibble: 2 x 4
  name       sleep_total sleep_rem sleep_cycle
  <chr>            <dbl>     <dbl>       <dbl>
1 Cheetah           12.1      NA            NA
2 Owl monkey        17         1.8          NA

2) ends_with() selects all columns whose names end with "XX"

msleep %>%
  select(ends_with("e")) %>% head(2)
# A tibble: 2 x 4
  name       vore  sleep_cycle awake
  <chr>      <chr>       <dbl> <dbl>
1 Cheetah    carni          NA  11.9
2 Owl monkey omni           NA   7

3) contains() selects all columns whose names contain "XX"

msleep %>%
  select(contains("leep")) %>% head(2)
# A tibble: 2 x 3
  sleep_total sleep_rem sleep_cycle
        <dbl>     <dbl>       <dbl>
1        12.1      NA            NA
2        17         1.8          NA

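4) matches() selects columns by regular expression

If the column names do not share a simple pattern, use matches() to select the columns whose names match a regular expression.

# Select any column whose name contains an "a", followed by one or more characters, and then an "e"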
msleep %>%
  select(matches("a.+e")) %>% head(2)
# A tibble: 2 x 2
  name       awake
  <chr>      <dbl>
1 Cheetah     11.9
2 Owl monkey   7
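The helpers above match on patterns. When the exact names are already held in a character vector, tidyselect's all_of() and any_of() can be used instead. This is a sketch assuming a recent dplyr (1.0.0 or later); the vector wanted and the name "not_a_column" are made up for illustration:

```r
library(tidyverse)

# Hypothetical vector of wanted columns, e.g. read from a config file
wanted <- c("name", "sleep_total", "awake")

# all_of() raises an error if any of the names is missing from the data
strict <- msleep %>% select(all_of(wanted))

# any_of() silently skips names that do not exist
lenient <- msleep %>% select(any_of(c(wanted, "not_a_column")))

names(strict)
names(lenient)
```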

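3 Selecting by logic

3.1 Selecting columns by data type

Use select_if() to pick columns by type: select_if(is.numeric) keeps all numeric columns. Other predicates work the same way, for example is.integer, is.double, is.logical and is.factor.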

msleep %>%
  select_if(is.numeric) %>% head(2)
# A tibble: 2 x 6
  sleep_total sleep_rem sleep_cycle awake brainwt bodywt
        <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
1        12.1      NA            NA  11.9 NA       50  
2        17         1.8          NA   7    0.0155   0.48
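In dplyr 1.0.0 and later, select_if() is superseded by select() combined with where(). The sketch below assumes such a version is installed and compares the two spellings; old_style and new_style are illustrative names:

```r
library(tidyverse)

# Scoped verb, as used in this post
old_style <- msleep %>% select_if(is.numeric)

# Equivalent selection with where(), available from dplyr 1.0.0
new_style <- msleep %>% select(where(is.numeric))

# Both keep the same six numeric columns
identical(names(old_style), names(new_style))
```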

3.2 Selecting columns by logical expression

msleep %>%
  select_if(is.numeric) %>% 
  select_if(~mean(., na.rm=TRUE) > 10) %>% head(2)
# A tibble: 2 x 3
  sleep_total awake bodywt
        <dbl> <dbl>  <dbl>
1        12.1  11.9  50  
2        17     7     0.48

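Note: select_all() and select_if() expect a function as their argument. An expression such as mean(x) > 10 is not itself a function, so prefix it with "~" to turn it into an anonymous function, or wrap it in a named function first, as below. (Older code wrapped it with funs(), which has been deprecated since dplyr 0.8.0.)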

more_than_10 <- function(x) {
  mean(x,na.rm=TRUE) > 10
}
msleep %>% select_if(is.numeric) %>% select_if(more_than_10) %>% head(2)
# A tibble: 2 x 3
  sleep_total awake bodywt
        <dbl> <dbl>  <dbl>
1        12.1  11.9  50  
2        17     7     0.48

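Same result as above.

Both conditions also fit into a single predicate. Using && short-circuits the test, so mean() is never called on non-numeric columns and no warnings are raised:

msleep %>%
  select_if(~is.numeric(.) && mean(., na.rm=TRUE) > 10) %>% head(2)

Same result again!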
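To see why exactly these three columns pass the filter, one can compute the mean of every numeric column directly. A small sketch; col_means and passing are names introduced only for this check:

```r
library(tidyverse)

# Mean of each numeric column, ignoring missing values
col_means <- msleep %>%
  select_if(is.numeric) %>%
  sapply(mean, na.rm = TRUE)

round(col_means, 2)

# Names of the columns whose mean exceeds 10
passing <- names(col_means)[col_means > 10]
passing
```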

3.3 Selecting columns by number of distinct values

Combine select_if() with n_distinct() to select the columns that have at least 20 distinct values.

msleep %>%
  select_if(~n_distinct(.) >= 20) %>% head(2)
# A tibble: 2 x 8
  name       genus    sleep_total sleep_rem sleep_cycle awake brainwt bodywt
  <chr>      <chr>          <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
1 Cheetah    Acinonyx        12.1      NA            NA  11.9 NA       50  
2 Owl monkey Aotus           17         1.8          NA   7    0.0155   0.48
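The per-column counts behind this selection can be inspected directly. A sketch with the illustrative names distinct_counts and kept; note that n_distinct() counts NA as a value of its own unless na.rm = TRUE is passed:

```r
library(tidyverse)

# Number of distinct values in every column (NA counts as one value)
distinct_counts <- sapply(msleep, n_distinct)
distinct_counts

# Columns with at least 20 distinct values
kept <- names(distinct_counts)[distinct_counts >= 20]
kept
```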

4 Reordering columns

4.1 Reordering directly while selecting

# Select the three columns name, sleep_total and awake, with awake in the middle
msleep %>%
  select(name, awake, sleep_total) %>% head(2)

4.2 `everything()` returns all columns not yet selected

When you only want to move a few columns to the front, use everything() for the rest. It saves a lot of typing.

msleep %>%
  select(conservation, everything()) %>% head(2)
# A tibble: 2 x 11
  conservation name  genus vore  order sleep_total sleep_rem sleep_cycle awake
  <chr>        <chr> <chr> <chr> <chr>       <dbl>     <dbl>       <dbl> <dbl>
1 lc           Chee~ Acin~ carni Carn~        12.1      NA            NA  11.9
2 NA           Owl ~ Aotus omni  Prim~        17         1.8          NA   7  
# ... with 2 more variables: brainwt <dbl>, bodywt <dbl>
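As an alternative to select(..., everything()), dplyr 1.0.0 and later provide relocate() for moving columns. This is a sketch assuming such a version; front and after_name are illustrative names:

```r
library(tidyverse)

# relocate() moves the named columns to the front by default
front <- msleep %>% relocate(conservation)

# .after (or .before) places the column relative to another one
after_name <- msleep %>% relocate(conservation, .after = name)

names(front)[1:3]
names(after_name)[1:3]
```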

5 Renaming columns

5.1 Renaming with `select`

msleep %>%
  select(animal = name, sleep_total) %>% head(2)
# A tibble: 2 x 2
  animal     sleep_total
  <chr>            <dbl>
1 Cheetah           12.1
2 Owl monkey        17

Note: renaming inside select() keeps only the selected columns.

5.2 Renaming with rename

msleep %>%
  rename(animal = name) %>% head(2)
# A tibble: 2 x 11
  animal genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Cheet~ Acin~ carni Carn~ lc                  12.1      NA            NA  11.9
2 Owl m~ Aotus omni  Prim~ NA                  17         1.8          NA   7
# ... with 2 more variables: brainwt <dbl>, bodywt <dbl>

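Note the difference between the two approaches!

5.3 Reformatting all column names

1) select_all() takes a function as its argument and applies it to every column name.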

msleep %>%
  select_all(toupper) %>% head(2)
# A tibble: 2 x 11
  NAME  GENUS VORE  ORDER CONSERVATION SLEEP_TOTAL SLEEP_REM SLEEP_CYCLE AWAKE
  <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Chee~ Acin~ carni Carn~ lc                  12.1      NA            NA  11.9
2 Owl ~ Aotus omni  Prim~ NA                  17         1.8          NA   7
# ... with 2 more variables: BRAINWT <dbl>, BODYWT <dbl>

toupper() converts all column names to upper case; tolower() converts them to lower case.
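In dplyr 1.0.0 and later, the same renaming can be written with rename_with(), which also accepts a column selection. A sketch assuming such a version; all_upper and sleep_upper are illustrative names:

```r
library(tidyverse)

# Upper-case every column name
all_upper <- msleep %>% rename_with(toupper)

# .cols restricts the renaming to a subset of columns
sleep_upper <- msleep %>% rename_with(toupper, .cols = starts_with("sleep"))

names(all_upper)
names(sleep_upper)
```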

2) Replacing with a custom function

If the column names of the input file are messy, clean them up step by step as needed.

msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("Q1 name", "Q2 sleep total", "Q3 brain weight")
msleep2[1:3,]
# A tibble: 3 x 3
  `Q1 name`       `Q2 sleep total` `Q3 brain weight`
  <chr>                      <dbl>             <dbl>
1 Cheetah                     12.1           NA
2 Owl monkey                  17              0.0155
3 Mountain beaver             14.4           NA

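Goal: change "Q1 name" to "name", "Q2 sleep total" to "sleep_total", and so on.

A: remove the leading Q1, Q2, Q3;

B: remove the space between Q1, Q2, Q3 and the name;

C: replace the space inside "sleep total" with an underscore.

msleep2 %>%
  select_all(~str_replace(., "Q[0-9]+", "")) %>%  # remove the Q1, Q2, Q3 prefix
  select_all(~str_replace(., "^ ", "")) %>%       # remove the leading space
  select_all(~str_replace(., " ", "_"))           # replace the space in "sleep total" with an underscore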
# A tibble: 83 x 3
   name                       sleep_total brain_weight
   <chr>                            <dbl>        <dbl>
 1 Cheetah                           12.1     NA
 2 Owl monkey                        17        0.0155

Done!
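The three steps can also be folded into one helper function. The sketch below is one possible variant, not the method used above: clean_question_names() is a hypothetical helper written for this example, and rename_with() assumes dplyr 1.0.0 or later. Using str_replace_all() also handles names with more than one inner space.

```r
library(tidyverse)

msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("Q1 name", "Q2 sleep total", "Q3 brain weight")

# Hypothetical helper, not part of any package
clean_question_names <- function(x) {
  x %>%
    str_remove("^Q[0-9]+ ") %>%  # drop the "Q1 " style prefix together with its space
    str_replace_all(" ", "_")    # replace every remaining space with an underscore
}

cleaned <- msleep2 %>% rename_with(clean_question_names)
names(cleaned)
```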

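6 Bonus: buy five, get two free

6.1 Dropping columns

Put "-" in front of the columns to drop. The helper functions work the same way as when selecting.

msleep %>%
  select(-(name:genus), -conservation, -(ends_with("e"))) %>% head(2)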
# A tibble: 2 x 5
  order     sleep_total sleep_rem brainwt bodywt
  <chr>           <dbl>     <dbl>   <dbl>  <dbl>
1 Carnivora        12.1      NA   NA       50
2 Primates         17         1.8  0.0155   0.48
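Recent versions of dplyr (1.0.0 and later, via tidyselect) also accept "!" to negate a whole selection at once. A sketch under that assumption, comparing both spellings; dropped_minus and dropped_bang are illustrative names:

```r
library(tidyverse)

# Dropping with "-" in front of each selection, as above
dropped_minus <- msleep %>%
  select(-(name:genus), -conservation, -ends_with("e"))

# Dropping by negating one combined selection with "!"
dropped_bang <- msleep %>%
  select(!c(name:genus, conservation, ends_with("e")))

identical(names(dropped_minus), names(dropped_bang))
```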

6.2 Turning row names into the first column

In some data frames the row names are not stored as a column, for example the mtcars dataset:

mtcars %>% head(2)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4

Use rownames_to_column() to turn the row names into a column. You can also specify the name of the new column.

mtcars %>%
  tibble::rownames_to_column("car_name") %>% head(2)
       car_name mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  21   6  160 110  3.9 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Trust me: when you get to joining tables later, you will be glad the row names are a proper column.
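A small sketch of why this matters for joins: once the car names live in a column, they can serve as the join key. The lookup table origin below is invented purely for illustration and is not part of mtcars:

```r
library(tidyverse)

# Row names become the car_name column
cars <- mtcars %>% rownames_to_column("car_name")

# Hypothetical lookup table, made up for this example
origin <- tibble(
  car_name = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"),
  country  = c("Japan", "Japan", "Japan")
)

# left_join() keeps all rows of cars and adds country where the key matches
joined <- cars %>% left_join(origin, by = "car_name")
joined %>% select(car_name, mpg, country) %>% head(3)
```

With the names left as row names, there would be no column to pass to the by argument.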

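Data processing is indeed less "pretty" than visualization, but the up-front processing is indispensable. Let's get over this hurdle together, one step at a time!

References

R for Data Science (《R数据科学》)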

https://r4ds.had.co.nz/introduction.html

https://suzanbaert.netlify.com/2018/01/dplyr-tutorial-1/
