
R Daily Notes (2): The distinct Function

Continued from the previous post: R Daily Notes (1): The filter Function

> library(dplyr)
> library(tidyverse)
> starwars %>%
+   head()
# A tibble: 6 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Luke~    172    77 blond      fair       blue            19   male   Tatooine  Human   <chr~
2 C-3PO    167    75 NA         gold       yellow         112   NA     Tatooine  Droid   <chr~
3 R2-D2     96    32 NA         white, bl~ red             33   NA     Naboo     Droid   <chr~
4 Dart~    202   136 none       white      yellow          41.9 male   Tatooine  Human   <chr~
5 Leia~    150    49 brown      light      brown           19   female Alderaan  Human   <chr~
6 Owen~    178   120 brown, gr~ light      blue            52   male   Tatooine  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> 
> 
> # Observations in starwars where both mass and height are greater than 0 (a quick way to drop NA values)
> mass <- 0
> height <- 0
> filter(starwars, mass > !!mass, height > !!height) %>%
+    head()
# A tibble: 6 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Luke~    172    77 blond      fair       blue            19   male   Tatooine  Human   <chr~
2 C-3PO    167    75 NA         gold       yellow         112   NA     Tatooine  Droid   <chr~
3 R2-D2     96    32 NA         white, bl~ red             33   NA     Naboo     Droid   <chr~
4 Dart~    202   136 none       white      yellow          41.9 male   Tatooine  Human   <chr~
5 Leia~    150    49 brown      light      brown           19   female Alderaan  Human   <chr~
6 Owen~    178   120 brown, gr~ light      blue            52   male   Tatooine  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
>  
>  
>  
> # Take the fifth row of starwars with slice
> slice(starwars, 5)
# A tibble: 1 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Leia~    150    49 brown      light      brown             19 female Alderaan  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Take the fifth row of starwars with filter and row_number
> filter(starwars, row_number() == 5)
# A tibble: 1 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Leia~    150    49 brown      light      brown             19 female Alderaan  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Take the first five rows of starwars
> slice(starwars, 1:5)
# A tibble: 5 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Luke~    172    77 blond      fair       blue            19   male   Tatooine  Human   <chr~
2 C-3PO    167    75 NA         gold       yellow         112   NA     Tatooine  Droid   <chr~
3 R2-D2     96    32 NA         white, bl~ red             33   NA     Naboo     Droid   <chr~
4 Dart~    202   136 none       white      yellow          41.9 male   Tatooine  Human   <chr~
5 Leia~    150    49 brown      light      brown           19   female Alderaan  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # The last six rows of starwars
> tail(starwars)
# A tibble: 6 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Finn      NA    NA black      dark       dark              NA male   NA        Human   <chr~
2 Rey       NA    NA brown      light      hazel             NA female NA        Human   <chr~
3 Poe ~     NA    NA brown      light      brown             NA male   NA        Human   <chr~
4 BB8       NA    NA none       none       black             NA none   NA        Droid   <chr~
5 Capt~     NA    NA unknown    unknown    unknown           NA female NA        NA      <chr~
6 Padm~    165    45 brown      light      brown             46 female Naboo     Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # The last row of starwars (n() is the number of rows, so slice returns a single row)
> slice(starwars, n())
# A tibble: 1 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Padm~    165    45 brown      light      brown             46 female Naboo     Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
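
The session above ends with slice(starwars, n()), which returns only the final row. A minimal sketch of taking the last five rows instead, assuming dplyr is installed; the names last_five and last_five_base are illustrative:

```r
library(dplyr)

# n() is the number of rows, so (n() - 4):n() covers the last five positions
last_five <- slice(starwars, (n() - 4):n())

# Base R equivalent: tail() with an explicit row count
last_five_base <- tail(starwars, 5)

last_five$name
```

Recent dplyr versions (1.0.0 and later) also provide slice_tail(starwars, n = 5) for the same purpose.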

The dataset used throughout is starwars, which ships with dplyr.

A tibble with 87 rows and 13 variables:

name
Name of the character

height
Height (cm)

mass
Weight (kg)

hair_color,skin_color,eye_color
Hair, skin, and eye colors

birth_year
Year born (BBY = Before Battle of Yavin)

gender
male, female, hermaphrodite, or none.

homeworld
Name of homeworld

species
Name of species

films
List of films the character appeared in

vehicles
List of vehicles the character has piloted

starships
List of starships the character has piloted

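This post covers a common data-frame task: removing duplicate values.

Goal: keep only the first observation for each gender (drop the rows whose gender value has already appeared).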

  • Method 1: the match function
> k <- match(unique(starwars$gender), starwars$gender)
> starwars[k,c('name','gender','skin_color', 'height', 'mass')]
# A tibble: 5 x 5
  name                  gender        skin_color       height  mass
  <chr>                 <chr>         <chr>             <int> <dbl>
1 Luke Skywalker        male          fair                172    77
2 C-3PO                 NA            gold                167    75
3 Leia Organa           female        light               150    49
4 Jabba Desilijic Tiure hermaphrodite green-tan, brown    175  1358
5 IG-88                 none          metal               200   140

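match finds the position of the first row for each unique gender in the dataset; those positions are then used to extract the rows together with the desired columns.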
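
To make the indexing step concrete, here is a small base R sketch on a toy vector. The vector is made up for illustration and only stands in for starwars$gender:

```r
# Toy vector standing in for starwars$gender (values are illustrative only)
gender <- c("male", NA, NA, "male", "female", "male", "none", "female")

# unique() keeps each distinct value in order of first appearance, NA included
u <- unique(gender)

# match() returns, for each unique value, the index of its first occurrence
k <- match(u, gender)
k
# [1] 1 2 5 7

# Indexing with k keeps one element (or one row of a data frame) per value
gender[k]
```

The same positions can be obtained with which(!duplicated(gender)), since duplicated() flags every occurrence after the first.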

  • Method 2: group_by and ungroup
> starwars %>%
+   as_tibble %>%
+   select(name,gender, skin_color, height, mass) %>%
+   group_by(gender) %>%
+   filter(row_number(gender)==1) %>%
+   ungroup
# A tibble: 4 x 5
  name                  gender        skin_color       height  mass
  <chr>                 <chr>         <chr>             <int> <dbl>
1 Luke Skywalker        male          fair                172    77
2 Leia Organa           female        light               150    49
3 Jabba Desilijic Tiure hermaphrodite green-tan, brown    175  1358
4 IG-88                 none          metal               200   140

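as_tibble first converts the data frame to a tibble (starwars already is one, so this step changes nothing here), select extracts the columns of interest, group_by groups the data by gender, filter takes the first row of each gender, and ungroup removes the grouping. Note that this result has only 4 rows: row_number(gender) ranks the values of gender and returns NA for missing values, so filter drops the whole NA group. Calling row_number() with no argument would keep the first row of the NA group as well.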
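
The output above has one row fewer than the match version because the rows with a missing gender are gone. The sketch below reproduces that on a toy tibble; the data and the names with_arg and no_arg are illustrative, and dplyr is assumed to be installed:

```r
library(dplyr)

# Toy data; column names and values are illustrative, not taken from starwars
df <- tibble(
  name   = c("a", "b", "c", "d", "e"),
  gender = c("male", NA, "female", NA, "male")
)

# row_number(gender) ranks the values of gender; missing values get rank NA,
# so the comparison is NA and filter() drops every row of the NA group
with_arg <- df %>%
  group_by(gender) %>%
  filter(row_number(gender) == 1) %>%
  ungroup()

# row_number() with no argument is the row position within the group,
# so the first row of the NA group is kept too
no_arg <- df %>%
  group_by(gender) %>%
  filter(row_number() == 1) %>%
  ungroup()

nrow(with_arg)  # 2
nrow(no_arg)    # 3
```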

  • Method 3: the summarize function
> starwars %>%
+     as_tibble %>%
+     select(name,gender, skin_color, height, mass) %>%
+     group_by(gender) %>%
+     summarize(name = first(name), skin_color=first(skin_color), 
+               height=first( height), mass=first(mass))
# A tibble: 5 x 5
  gender        name                  skin_color       height  mass
  <chr>         <chr>                 <chr>             <int> <dbl>
1 female        Leia Organa           light               150    49
2 hermaphrodite Jabba Desilijic Tiure green-tan, brown    175  1358
3 male          Luke Skywalker        fair                172    77
4 none          IG-88                 metal               200   140
5 NA            C-3PO                 gold                167    75
> 

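summarize avoids the ungroup step, but it requires the user to spell out every variable that is not in group_by. The result is also sorted by gender, with columns in a different order from the other methods.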
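
One way to avoid listing every column by hand is across(), sketched here on toy data. This assumes dplyr 1.0.0 or later, where across() is available; the tibble and the name first_rows are illustrative:

```r
library(dplyr)

# Toy data; column names and values are illustrative only
df <- tibble(
  gender = c("male", "female", "male", "female"),
  name   = c("a", "b", "c", "d"),
  height = c(172L, 150L, 202L, 165L)
)

# across(everything(), first) applies first() to every non-grouping column,
# so the columns do not have to be named one by one
first_rows <- df %>%
  group_by(gender) %>%
  summarize(across(everything(), first))

first_rows
```

As with the explicit version, the rows come back ordered by the grouping variable rather than by first appearance.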

  • Method 4: distinct
> starwars %>%
+     as_tibble %>%
+     select(name,gender, skin_color, height, mass) %>%
+     group_by(gender) %>%
+     distinct(gender,.keep_all = T)
# A tibble: 5 x 5
# Groups:   gender [5]
  name                  gender        skin_color       height  mass
  <chr>                 <chr>         <chr>             <int> <dbl>
1 Luke Skywalker        male          fair                172    77
2 C-3PO                 NA            gold                167    75
3 Leia Organa           female        light               150    49
4 Jabba Desilijic Tiure hermaphrodite green-tan, brown    175  1358
5 IG-88                 none          metal               200   140
>
> # Remove duplicate rows of the dataframe using skin_color and gender
> starwars %>%
+     as_tibble %>%
+     select(name,gender, skin_color, height, mass) %>%
+     group_by(gender) %>%
+     distinct(skin_color, gender, .keep_all = T)
# A tibble: 39 x 5
# Groups:   gender [5]
   name                  gender        skin_color       height  mass
   <chr>                 <chr>         <chr>             <int> <dbl>
 1 Luke Skywalker        male          fair                172    77
 2 C-3PO                 NA            gold                167    75
 3 R2-D2                 NA            white, blue          96    32
 4 Darth Vader           male          white               202   136
 5 Leia Organa           female        light               150    49
 6 Owen Lars             male          light               178   120
 7 R5-D4                 NA            white, red           97    32
 8 Chewbacca             male          unknown             228   112
 9 Greedo                male          green               173    74
10 Jabba Desilijic Tiure hermaphrodite green-tan, brown    175  1358
# ... with 29 more rows

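distinct looks much better: clean, short, and easy to understand. Rather than grabbing the first row of each group, it searches for duplicates and excludes them. The .keep_all argument (an argument of distinct, not a function) keeps all the other variables in the output data frame. Because group_by was applied first, the result is still grouped, as the "# Groups: gender [5]" line shows.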
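
The effect of .keep_all is easiest to see side by side. A minimal sketch on toy data, assuming dplyr is installed; the tibble and the names keys_only and kept are illustrative. distinct does not need a preceding group_by, so it is omitted here:

```r
library(dplyr)

# Toy data; column names and values are illustrative only
df <- tibble(
  name   = c("a", "b", "c", "d"),
  gender = c("male", NA, "male", NA),
  height = c(172L, 167L, 202L, 96L)
)

# Without .keep_all only the columns named in distinct() are returned
keys_only <- distinct(df, gender)

# With .keep_all = TRUE the first row for each distinct gender is kept whole
kept <- distinct(df, gender, .keep_all = TRUE)

names(keys_only)  # "gender"
kept$name         # "a" "b"
```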

Comparing the speed of the different methods

library(tidyverse)

d1 <- function()
{
  k <- match(unique(starwars$gender), starwars$gender)
  starwars[k,c('name','gender','skin_color', 'height', 'mass')]
}


d2 <- function()
{
  
  starwars %>%
    as_tibble %>%
    select(name,gender, skin_color, height, mass) %>%
    group_by(gender) %>%
    filter(row_number(gender)==1) %>%
    ungroup
  
}


d3 <- function()
{
  starwars %>%
    as_tibble %>%
    select(name,gender, skin_color, height, mass) %>%
    group_by(gender) %>%
    summarize(name = first(name), skin_color=first(skin_color), 
              height=first( height), mass=first(mass))
  
}


d4 <- function()
{
  
  
  starwars %>%
    as_tibble %>%
    select(name,gender, skin_color, height, mass) %>%
    group_by(gender) %>%
    distinct(gender,.keep_all = T)
  
}

library(microbenchmark)
set.seed(1234)
microbenchmark(d1(), d2(), d3(), d4(), times=9)
Unit: microseconds
 expr      min       lq      mean   median       uq       max neval
 d1()   74.668   84.870  105.6366   88.580  131.710   140.522     9
 d2() 5478.496 5563.829 5808.0292 5735.888 5974.264  6379.598     9
 d3() 4710.960 4761.510 5062.5474 4856.583 4876.989  7026.091     9
 d4() 6099.018 6241.395 9503.2321 6422.265 6641.627 32286.160     9

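The results show that d1, the match approach, is by far the fastest: its median is about 89 microseconds, against roughly 4,900 to 6,400 microseconds for the others. Among the tidyverse methods, d3 (summarize) has the lowest median (about 4.9 ms, versus 5.7 ms for d2 and 6.4 ms for d4), although with only 9 runs that gap is modest.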
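
With only 87 rows, these timings probably reflect mostly the fixed per-call overhead of each approach. The sketch below shows one way to repeat the comparison on a larger synthetic data frame. The size, the column names, and the helper names by_match and by_distinct are all assumptions made for illustration, and no timings are claimed here:

```r
library(dplyr)
library(microbenchmark)

# Synthetic data: the size and column names are arbitrary assumptions
set.seed(1234)
n <- 1e5
big <- tibble(
  id    = seq_len(n),
  group = sample(c("a", "b", "c", "d", NA), n, replace = TRUE),
  value = runif(n)
)

# Base R: first occurrence of each unique value, located with match()
by_match <- function() big[match(unique(big$group), big$group), ]

# dplyr: distinct() on the key column, keeping the other columns
by_distinct <- function() distinct(big, group, .keep_all = TRUE)

# Check that both approaches select the same rows before timing them
stopifnot(identical(by_match()$id, by_distinct()$id))

microbenchmark(by_match(), by_distinct(), times = 20)
```

Checking that the candidates return the same rows first matters here, because a method that silently drops a group is not doing the same work as the others.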
