TidyFriday: 5 Minutes a Day to Pick Up R with Ease (Part 4)

王诗翔呀 · Postdoctoral Fellow, Sun Yat-sen University Cancer Center

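In the previous post we did simple row filtering on numbers and strings. Today we continue with more advanced uses of filter().

We use msleep to demonstrate filter(). msleep is a dataset on mammalian sleep that ships with ggplot2.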

library(tidyverse)  # attaches dplyr and stringr, plus ggplot2, which provides msleep
glimpse(msleep)
## Observations: 83
## Variables: 11
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...

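Filtering by range

To filter on a range of values, you can use two logical conditions. For example, to select all animals whose total sleep time is between 16 and 18 hours, I could write filter(sleep_total >= 16, sleep_total <= 18), but between() is a little more concise.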


msleep %>%
  select(name, sleep_total) %>%
  filter(between(sleep_total, 16, 18))

## # A tibble: 4 x 2
##   name                   sleep_total
##   <chr>                        <dbl>
## 1 Owl monkey                    17.0
## 2 Long-nosed armadillo          17.4
## 3 North American Opossum        18.0
## 4 Arctic ground squirrel        16.6
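As a small sketch of what between() does, the snippet below checks it against the two explicit comparisons on a hypothetical vector (the values in hours are made up for illustration and are not taken from msleep). Both bounds are inclusive.

```r
library(dplyr)

# Hypothetical sleep durations in hours (illustrative, not from msleep)
hours <- c(15.9, 16, 17.4, 18, 18.1)

# between() includes both end points
in_range <- between(hours, 16, 18)

# The same test written as two explicit comparisons
in_range_manual <- hours >= 16 & hours <= 18

in_range
# FALSE TRUE TRUE TRUE FALSE
identical(in_range, in_range_manual)
# TRUE
```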

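To select values close to a given number, use near() with a tolerance. For example, to select animals whose total sleep time is within one standard deviation of 17 hours, we can write the following (for a fixed window of ±0.5 hours, pass tol = 0.5 instead)

msleep %>%
  select(name, sleep_total) %>%
  filter(near(sleep_total, 17, tol = sd(sleep_total)))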

## # A tibble: 26 x 2
##    name                       sleep_total
##    <chr>                            <dbl>
##  1 Owl monkey                        17.0
##  2 Mountain beaver                   14.4
##  3 Greater short-tailed shrew        14.9
##  4 Three-toed sloth                  14.4
##  5 Long-nosed armadillo              17.4
##  6 North American Opossum            18.0
##  7 Big brown bat                     19.7
##  8 Western american chipmunk         14.9
##  9 Thick-tailed opposum              19.4
## 10 Mongolian gerbil                  14.2
## # ... with 16 more rows
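To make the role of the tolerance concrete, here is a minimal sketch on a hypothetical vector (made-up values, not from msleep). near(x, y, tol) keeps the elements for which abs(x - y) < tol. With the default tolerance it is also the usual way to compare floating-point numbers.

```r
library(dplyr)

# Hypothetical sleep durations in hours (illustrative, not from msleep)
hours <- c(16.4, 16.6, 17, 17.4, 17.6)

# Keep values less than 0.5 away from 17
close_to_17 <- near(hours, 17, tol = 0.5)
close_to_17
# FALSE TRUE TRUE TRUE FALSE

# With the default tolerance, near() absorbs floating-point rounding error
exact_equal <- sqrt(2)^2 == 2      # FALSE because of rounding
approx_equal <- near(sqrt(2)^2, 2) # TRUE
```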

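To exclude observations that belong to certain categories, first assign the values to exclude to a vector (here called remove), then negate the %in% test.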

remove <- c("Rodentia", "Carnivora", "Primates")
msleep %>%
  select(order, name, sleep_total) %>%
  filter(!order %in% remove)

## # A tibble: 37 x 3
##    order           name                       sleep_total
##    <chr>           <chr>                            <dbl>
##  1 Soricomorpha    Greater short-tailed shrew       14.9
##  2 Artiodactyla    Cow                               4.00
##  3 Pilosa          Three-toed sloth                 14.4
##  4 Artiodactyla    Roe deer                          3.00
##  5 Artiodactyla    Goat                              5.30
##  6 Soricomorpha    Star-nosed mole                  10.3
##  7 Soricomorpha    Lesser short-tailed shrew         9.10
##  8 Cingulata       Long-nosed armadillo             17.4
##  9 Hyracoidea      Tree hyrax                        5.30
## 10 Didelphimorphia North American Opossum           18.0
## # ... with 27 more rows
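The negation works because %in% binds more tightly than !, so !order %in% remove is read as !(order %in% remove). A base R sketch with hypothetical vectors (the names orders and excluded are chosen here for illustration):

```r
# Hypothetical taxonomic orders and an exclusion list (illustrative values)
orders <- c("Rodentia", "Pilosa", "Primates", "Cingulata")
excluded <- c("Rodentia", "Carnivora", "Primates")

# %in% is evaluated first, then the result is negated
keep <- !orders %in% excluded
keep_explicit <- !(orders %in% excluded)

identical(keep, keep_explicit)
# TRUE
orders[keep]
# "Pilosa" "Cingulata"
```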

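Filtering by regular expression

Comparing a variable against a string only works for exact matches. When we need to filter on part of a value, we need a function that evaluates a regular expression against the string and returns a logical value; rows for which it returns TRUE are kept. There are two options: grepl() from base R, or str_detect() from the stringr package.

Note that R is case-sensitive! With filter(str_detect(name, pattern = "mouse")) we would miss any row containing Mouse. To match regardless of case, convert the variable to lower case first with tolower().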

msleep %>%
  select(name, sleep_total) %>%
  filter(str_detect(tolower(name), pattern = "mouse"))

## # A tibble: 5 x 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Vesper mouse                      7.00
## 2 House mouse                      12.5
## 3 Northern grasshopper mouse       14.5
## 4 Deer mouse                       11.5
## 5 African striped mouse             8.70
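Besides tolower(), both functions mentioned above can ignore case themselves. The sketch below compares the three approaches on a hypothetical character vector (the names are made up for illustration), using stringr's regex(ignore_case = TRUE) and the ignore.case argument of grepl().

```r
library(stringr)

# Hypothetical animal names with mixed capitalisation (illustrative)
animal_names <- c("House mouse", "Mouse lemur", "Cow")

# 1. Lower-case the input, then match a lower-case pattern
via_tolower <- str_detect(tolower(animal_names), pattern = "mouse")

# 2. Ask stringr to ignore case in the pattern itself
via_regex <- str_detect(animal_names, regex("mouse", ignore_case = TRUE))

# 3. Base R equivalent
via_grepl <- grepl("mouse", animal_names, ignore.case = TRUE)

via_tolower
# TRUE TRUE FALSE
```

Any of the three logical vectors can be used inside filter() in the same way.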

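Filtering on multiple conditions

Sometimes we need to filter on several conditions at once, which we can do by combining logical operators. Conditions separated by commas must all hold, | means OR, and parentheses group sub-conditions. For example, the code below keeps rows with a body weight above 100 that either sleep more than 15 hours or do not belong to the order Carnivora.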

msleep %>%
  select(name, order, sleep_total:bodywt) %>%
  filter(bodywt > 100, (sleep_total > 15 | order != "Carnivora"))

## # A tibble: 10 x 8
##    name      order  sleep_total sleep_rem sleep_cycle awake brainwt bodywt
##    <chr>     <chr>        <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
##  1 Cow       Artio~        4.00     0.700       0.667 20.0    0.423    600
##  2 Asian el~ Probo~        3.90    NA          NA     20.1    4.60    2547
##  3 Horse     Peris~        2.90     0.600       1.00  21.1    0.655    521
##  4 Donkey    Peris~        3.10     0.400      NA     20.9    0.419    187
##  5 Giraffe   Artio~        1.90     0.400      NA     22.1   NA        900
##  6 Pilot wh~ Cetac~        2.70     0.100      NA     21.4   NA        800
##  7 African ~ Probo~        3.30    NA          NA     20.7    5.71    6654
##  8 Tiger     Carni~       15.8     NA          NA      8.20  NA        163
##  9 Brazilia~ Peris~        4.40     1.00        0.900 19.6    0.169    208
## 10 Bottle-n~ Cetac~        5.20    NA          NA     18.8   NA        173
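How the conditions are combined changes the result. The sketch below uses a small hypothetical tibble (the animals and their values are invented for illustration) to contrast the grouped OR with requiring all three conditions at once.

```r
library(dplyr)

# Hypothetical toy data (illustrative, not taken from msleep)
animals <- tibble(
  name = c("A", "B", "C", "D"),
  order = c("Carnivora", "Carnivora", "Artiodactyla", "Rodentia"),
  sleep_total = c(15.8, 10, 4, 16),
  bodywt = c(163, 150, 600, 0.5)
)

# Heavy AND (long sleeper OR not a carnivore)
either <- animals %>%
  filter(bodywt > 100, (sleep_total > 15 | order != "Carnivora"))

# Heavy AND long sleeper AND not a carnivore
all_three <- animals %>%
  filter(bodywt > 100, sleep_total > 15, order != "Carnivora")

either$name
# "A" "C"
nrow(all_three)
# 0
```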

Filtering out missing values

To select name and the columns from conservation to sleep_cycle, and drop the rows where conservation is NA, we can test with is.na().

msleep %>%
  select(name, conservation:sleep_cycle) %>%
  filter(!is.na(conservation))

## # A tibble: 54 x 5
##    name                     conservation sleep_total sleep_rem sleep_cycle
##    <chr>                    <chr>              <dbl>     <dbl>       <dbl>
##  1 Cheetah                  lc                 12.1     NA          NA
##  2 Mountain beaver          nt                 14.4      2.40       NA
##  3 Greater short-tailed sh~ lc                 14.9      2.30        0.133
##  4 Cow                      domesticated        4.00     0.700       0.667
##  5 Northern fur seal        vu                  8.70     1.40        0.383
##  6 Dog                      domesticated       10.1      2.90        0.333
##  7 Roe deer                 lc                  3.00    NA          NA
##  8 Goat                     lc                  5.30     0.600      NA
##  9 Guinea pig               domesticated        9.40     0.800       0.217
## 10 Grivet                   lc                 10.0      0.700      NA
## # ... with 44 more rows
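The reason for using is.na() rather than a comparison with NA is that any comparison with NA itself yields NA, and filter() drops rows whose condition is NA. A base R sketch on a hypothetical vector (values are illustrative):

```r
# Hypothetical conservation statuses with one missing value (illustrative)
conservation <- c("lc", NA, "vu")

# Comparing with NA never gives TRUE or FALSE, only NA
compared <- conservation == NA
compared
# NA NA NA

# is.na() gives a usable logical vector
missing <- is.na(conservation)
missing
# FALSE TRUE FALSE

# Negate it to keep the non-missing values
conservation[!missing]
# "lc" "vu"
```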

Filtering across columns

dplyr also has several powerful functions that let us filter across multiple columns.

「filter_all」

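Suppose we want to keep an observation whenever any of its columns contains the letter combination Ca. We can combine any_vars() with str_detect(). In the output, a row is kept as soon as any one of the columns, for example genus or order, contains Ca.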


msleep %>%
  select(name:order, sleep_total, -vore) %>%
  filter_all(any_vars(str_detect(., pattern = "Ca")))

## # A tibble: 16 x 4
##    name              genus        order        sleep_total
##    <chr>             <chr>        <chr>              <dbl>
##  1 Cheetah           Acinonyx     Carnivora          12.1
##  2 Northern fur seal Callorhinus  Carnivora           8.70
##  3 Vesper mouse      Calomys      Rodentia            7.00
##  4 Dog               Canis        Carnivora          10.1
##  5 Roe deer          Capreolus    Artiodactyla        3.00
##  6 Goat              Capri        Artiodactyla        5.30
##  7 Guinea pig        Cavis        Rodentia            9.40
##  8 Domestic cat      Felis        Carnivora          12.5
##  9 Gray seal         Haliochoerus Carnivora           6.20
## 10 Tiger             Panthera     Carnivora          15.8
## 11 Jaguar            Panthera     Carnivora          10.4
## 12 Lion              Panthera     Carnivora          13.5
## 13 Caspian seal      Phoca        Carnivora           3.50
## 14 Genet             Genetta      Carnivora           6.30
## 15 Arctic fox        Vulpes       Carnivora          12.5
## 16 Red fox           Vulpes       Carnivora           9.80

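Similar to any_vars() is all_vars(), which keeps the rows in which every selected column satisfies a condition. For example, to keep the rows in which every numeric variable from sleep_total to bodywt (excluding awake) is greater than 1. Rows with an NA in any of these columns are dropped, because the comparison then yields NA.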

msleep %>%
  select(name, sleep_total:bodywt, -awake) %>%
  filter_all(all_vars(. > 1))

## # A tibble: 1 x 6
##   name  sleep_total sleep_rem sleep_cycle brainwt bodywt
##   <chr>       <dbl>     <dbl>       <dbl>   <dbl>  <dbl>
## 1 Human        8.00      1.90        1.50    1.32   62.0

「filter_if」

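Now suppose we want the observations in which a character variable is missing, regardless of whether the numeric variables are missing. filter_all() does not work well here: filter_all(any_vars(is.na(.))) would return every row that contains an NA in any column, which is not what we want. Instead, we can use is.character to restrict the test to character columns.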

msleep %>%
  select(name:order, sleep_total:sleep_rem) %>%
  filter_if(is.character, any_vars(is.na(.)))

## # A tibble: 7 x 6
##   name            genus       vore  order          sleep_total sleep_rem
##   <chr>           <chr>       <chr> <chr>                <dbl>     <dbl>
## 1 Vesper mouse    Calomys     <NA>  Rodentia              7.00    NA
## 2 Desert hedgehog Paraechinus <NA>  Erinaceomorpha       10.3      2.70
## 3 Deer mouse      Peromyscus  <NA>  Rodentia             11.5     NA
## 4 Phalanger       Phalanger   <NA>  Diprotodontia        13.7      1.80
## 5 Rock hyrax      Procavia    <NA>  Hyracoidea            5.40     0.500
## 6 Mole rat        Spalax      <NA>  Rodentia             10.6      2.40
## 7 Musk shrew      Suncus      <NA>  Soricomorpha         12.8      2.00

In the same way, we can use other type predicates such as is.numeric, is.integer, is.double, is.logical and is.factor, which gives us even more ways to filter.

「filter_at」

filter_at() filters on conditions applied to a given set of variables. In the example below, we keep the observations in which both sleep_total and sleep_rem are greater than 5.

msleep %>%
  select(name, sleep_total:sleep_rem, brainwt:bodywt) %>%
  filter_at(vars(sleep_total, sleep_rem), all_vars(. > 5))

## # A tibble: 2 x 5
##   name                 sleep_total sleep_rem brainwt bodywt
##   <chr>                      <dbl>     <dbl>   <dbl>  <dbl>
## 1 Thick-tailed opposum        19.4      6.60 NA       0.370
## 2 Giant armadillo             18.1      6.10  0.0810 60.0
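In recent dplyr releases the scoped verbs filter_all(), filter_if() and filter_at() are superseded, and the same filters can be written with if_any() and if_all() inside a plain filter(). The sketch below assumes dplyr 1.0.4 or later and uses a small hypothetical tibble (invented values, not msleep) to show the rewritten forms of the filter_if() and filter_at() calls.

```r
library(dplyr)

# Hypothetical toy data (illustrative, not taken from msleep)
toy <- tibble(
  name = c("a", "b", "c"),
  vore = c("herbi", NA, "carni"),
  sleep_total = c(6, 7, 5.5),
  sleep_rem = c(NA, 5.5, 6)
)

# Counterpart of filter_if(is.character, any_vars(is.na(.)))
chr_na <- toy %>%
  filter(if_any(where(is.character), is.na))

# Counterpart of filter_at(vars(sleep_total, sleep_rem), all_vars(. > 5))
both_gt5 <- toy %>%
  filter(if_all(c(sleep_total, sleep_rem), ~ .x > 5))

chr_na$name
# "b"
both_gt5$name
# "b" "c"
```

Row a is dropped from both_gt5 because its sleep_rem is NA, so the combined condition is NA rather than TRUE.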

Reference

https://suzan.rbind.io/2018/02/dplyr-tutorial-3