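In the last post we did some simple row filtering on numbers and strings. Today we continue with more advanced uses of filter().
We use msleep, a dataset about mammal sleep, to demonstrate. msleep ships with ggplot2, and the examples assume that dplyr and stringr are loaded.
glimpse(msleep)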
## Observations: 83
## Variables: 11
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
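To filter on a range of values, you can use two logical conditions. For example, to select all animals with a total sleep time between 16 and 18 hours, I could write filter(sleep_total >= 16, sleep_total <= 18), but between() is a little more concise.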
msleep %>%
select(name, sleep_total) %>%
filter(between(sleep_total, 16, 18))
## # A tibble: 4 x 2
## name sleep_total
## <chr> <dbl>
## 1 Owl monkey 17.0
## 2 Long-nosed armadillo 17.4
## 3 North American Opossum 18.0
## 4 Arctic ground squirrel 16.6
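As a quick check that between() is inclusive at both ends, here is a minimal sketch on a toy tibble. The names and values are hypothetical and are not taken from msleep.

```r
library(dplyr)

# Toy data (hypothetical values) placed just outside and exactly on the limits
sleep <- tibble(
  name = c("a", "b", "c", "d"),
  sleep_total = c(15.9, 16, 18, 18.1)
)

with_between <- sleep %>% filter(between(sleep_total, 16, 18))
with_two_conditions <- sleep %>% filter(sleep_total >= 16, sleep_total <= 18)

# Both forms keep the boundary values 16 and 18
with_between$name
with_two_conditions$name
```

Both calls should keep rows "b" and "c", so the choice between them is mostly about readability.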
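near() keeps the values that lie close to a target. near(sleep_total, 17, tol = 0.5), for example, keeps values that differ from 17 by less than 0.5. In the example below the tolerance is set to one standard deviation of sleep_total instead:
msleep %>%
select(name, sleep_total) %>%
filter(near(sleep_total, 17, tol = sd(sleep_total)))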
## # A tibble: 26 x 2
## name sleep_total
## <chr> <dbl>
## 1 Owl monkey 17.0
## 2 Mountain beaver 14.4
## 3 Greater short-tailed shrew 14.9
## 4 Three-toed sloth 14.4
## 5 Long-nosed armadillo 17.4
## 6 North American Opossum 18.0
## 7 Big brown bat 19.7
## 8 Western american chipmunk 14.9
## 9 Thick-tailed opposum 19.4
## 10 Mongolian gerbil 14.2
## # ... with 16 more rows
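To see how the tol argument changes the result, the sketch below applies near() to a small hypothetical vector. It also shows the floating-point comparison that near() is commonly used for.

```r
library(dplyr)

# Hypothetical sleep totals around the target value 17
x <- c(16.4, 16.6, 17, 17.4, 17.6)

# TRUE where the distance to 17 is smaller than the tolerance
close_to_17 <- near(x, 17, tol = 0.5)
close_to_17

# near() is also handy when == fails because of floating-point error
sqrt(2)^2 == 2      # FALSE
near(sqrt(2)^2, 2)  # TRUE
```

With a tolerance of 0.5 only 16.6, 17 and 17.4 qualify, so a wider tolerance such as a standard deviation will keep many more rows.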
To select observations that do not belong to certain categories, first assign the values to exclude to remove, then filter with a negated %in%.
remove <- c("Rodentia", "Carnivora", "Primates")
msleep %>%
select(order, name, sleep_total) %>%
filter(!order %in% remove)
## # A tibble: 37 x 3
## order name sleep_total
## <chr> <chr> <dbl>
## 1 Soricomorpha Greater short-tailed shrew 14.9
## 2 Artiodactyla Cow 4.00
## 3 Pilosa Three-toed sloth 14.4
## 4 Artiodactyla Roe deer 3.00
## 5 Artiodactyla Goat 5.30
## 6 Soricomorpha Star-nosed mole 10.3
## 7 Soricomorpha Lesser short-tailed shrew 9.10
## 8 Cingulata Long-nosed armadillo 17.4
## 9 Hyracoidea Tree hyrax 5.30
## 10 Didelphimorphia North American Opossum 18.0
## # ... with 27 more rows
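One detail worth knowing about the negated %in% is how it treats missing values. The sketch below uses a hypothetical four-row tibble to compare it with != on a row whose order is NA.

```r
library(dplyr)

# Toy data (hypothetical); the last row has a missing order
animals <- tibble(
  name  = c("Cow", "Dog", "House mouse", "Unknown"),
  order = c("Artiodactyla", "Carnivora", "Rodentia", NA)
)
remove <- c("Rodentia", "Carnivora", "Primates")

# NA %in% remove is FALSE, so the negation is TRUE and the NA row is kept
kept_in <- animals %>% filter(!order %in% remove)
kept_in$name

# NA != "Carnivora" is NA, and filter() drops rows whose condition is NA
kept_neq <- animals %>% filter(order != "Carnivora")
kept_neq$name
```

If rows with a missing category should be excluded as well, adding !is.na(order) as a further condition is one way to do it.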
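Filtering a character column with == only works on exact matches. Sometimes we need to filter on part of a string instead, which takes a function that evaluates a regular expression against the string and returns a logical value. Every row for which the expression is TRUE is kept. There are two options: base R's grepl(), or str_detect() from the stringr package.
Keep in mind that R is case sensitive! With filter(str_detect(name, pattern = "mouse")) we would skip the rows containing Mouse. To match regardless of case, convert the column to lower case first with tolower().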
msleep %>%
select(name, sleep_total) %>%
filter(str_detect(tolower(name), pattern = "mouse"))
## # A tibble: 5 x 2
## name sleep_total
## <chr> <dbl>
## 1 Vesper mouse 7.00
## 2 House mouse 12.5
## 3 Northern grasshopper mouse 14.5
## 4 Deer mouse 11.5
## 5 African striped mouse 8.70
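Besides tolower(), both options mentioned above have their own switch for case-insensitive matching. The sketch below compares them on a hypothetical three-row tibble; "Mouse lemur" is included only to have a capitalised match.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical)
animals <- tibble(name = c("House mouse", "Mouse lemur", "Cow"))

# Case-sensitive match: misses "Mouse lemur"
case_sensitive <- animals %>% filter(str_detect(name, "mouse"))

# Three case-insensitive alternatives
by_tolower <- animals %>% filter(str_detect(tolower(name), "mouse"))
by_regex   <- animals %>% filter(str_detect(name, regex("mouse", ignore_case = TRUE)))
by_grepl   <- animals %>% filter(grepl("mouse", name, ignore.case = TRUE))

case_sensitive$name
by_regex$name
```

Note the argument order: str_detect() takes the string first and the pattern second, while grepl() takes the pattern first.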
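Sometimes we need to filter on several conditions, which we can do by combining logical operators. For example, to select the rows with a body weight above 100 that either sleep more than 15 hours or do not belong to the order Carnivora, write: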
msleep %>%
select(name, order, sleep_total:bodywt) %>%
filter(bodywt > 100, (sleep_total > 15 | order != "Carnivora"))
## # A tibble: 10 x 8
## name order sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cow Artio~ 4.00 0.700 0.667 20.0 0.423 600
## 2 Asian el~ Probo~ 3.90 NA NA 20.1 4.60 2547
## 3 Horse Peris~ 2.90 0.600 1.00 21.1 0.655 521
## 4 Donkey Peris~ 3.10 0.400 NA 20.9 0.419 187
## 5 Giraffe Artio~ 1.90 0.400 NA 22.1 NA 900
## 6 Pilot wh~ Cetac~ 2.70 0.100 NA 21.4 NA 800
## 7 African ~ Probo~ 3.30 NA NA 20.7 5.71 6654
## 8 Tiger Carni~ 15.8 NA NA 8.20 NA 163
## 9 Brazilia~ Peris~ 4.40 1.00 0.900 19.6 0.169 208
## 10 Bottle-n~ Cetac~ 5.20 NA NA 18.8 NA 173
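Conditions separated by commas are combined with AND, while | means OR, so the placement of the parentheses matters. The sketch below uses four hypothetical animals to contrast the OR form with a version in which all three conditions must hold.

```r
library(dplyr)

# Toy data (hypothetical values)
animals <- tibble(
  name        = c("A", "B", "C", "D"),
  order       = c("Carnivora", "Carnivora", "Artiodactyla", "Artiodactyla"),
  sleep_total = c(16, 10, 4, 4),
  bodywt      = c(150, 150, 600, 50)
)

# Heavy AND (long sleeper OR not a carnivore)
either <- animals %>%
  filter(bodywt > 100, (sleep_total > 15 | order != "Carnivora"))
either$name

# Heavy AND long sleeper AND not a carnivore
all_three <- animals %>%
  filter(bodywt > 100, sleep_total > 15, order != "Carnivora")
nrow(all_three)
```

Here "A" is kept by the first filter because it sleeps long enough even though it is a carnivore, and no row satisfies all three conditions at once.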
To select name and the columns from conservation through sleep_cycle, and drop the rows in which conservation is NA, test with is.na():
msleep %>%
select(name, conservation:sleep_cycle) %>%
filter(!is.na(conservation))
## # A tibble: 54 x 5
## name conservation sleep_total sleep_rem sleep_cycle
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Cheetah lc 12.1 NA NA
## 2 Mountain beaver nt 14.4 2.40 NA
## 3 Greater short-tailed sh~ lc 14.9 2.30 0.133
## 4 Cow domesticated 4.00 0.700 0.667
## 5 Northern fur seal vu 8.70 1.40 0.383
## 6 Dog domesticated 10.1 2.90 0.333
## 7 Roe deer lc 3.00 NA NA
## 8 Goat lc 5.30 0.600 NA
## 9 Guinea pig domesticated 9.40 0.800 0.217
## 10 Grivet lc 10.0 0.700 NA
## # ... with 44 more rows
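It is tempting to test for missing values with == NA, but that comparison itself returns NA and filter() then drops every row. A minimal sketch on hypothetical data:

```r
library(dplyr)

# Toy data (hypothetical)
animals <- tibble(
  name         = c("A", "B", "C"),
  conservation = c("lc", NA, "vu")
)

# Comparing with NA yields NA for every row, so nothing is kept
wrong <- animals %>% filter(conservation == NA)
nrow(wrong)

# is.na() returns a proper TRUE/FALSE vector
missing_rows  <- animals %>% filter(is.na(conservation))
complete_rows <- animals %>% filter(!is.na(conservation))
missing_rows$name
complete_rows$name
```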
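dplyr also has several powerful filter() variants for filtering across columns: filter_all(), filter_if() and filter_at().
Suppose we want to keep every observation in which some column contains the letter combination Ca. We can combine any_vars() with str_detect(). A row is kept as soon as any one of the selected columns contains Ca, whether that is genus or order.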
msleep %>%
select(name:order, sleep_total, -vore) %>%
filter_all(any_vars(str_detect(., pattern = "Ca")))
## # A tibble: 16 x 4
## name genus order sleep_total
## <chr> <chr> <chr> <dbl>
## 1 Cheetah Acinonyx Carnivora 12.1
## 2 Northern fur seal Callorhinus Carnivora 8.70
## 3 Vesper mouse Calomys Rodentia 7.00
## 4 Dog Canis Carnivora 10.1
## 5 Roe deer Capreolus Artiodactyla 3.00
## 6 Goat Capri Artiodactyla 5.30
## 7 Guinea pig Cavis Rodentia 9.40
## 8 Domestic cat Felis Carnivora 12.5
## 9 Gray seal Haliochoerus Carnivora 6.20
## 10 Tiger Panthera Carnivora 15.8
## 11 Jaguar Panthera Carnivora 10.4
## 12 Lion Panthera Carnivora 13.5
## 13 Caspian seal Phoca Carnivora 3.50
## 14 Genet Genetta Carnivora 6.30
## 15 Arctic fox Vulpes Carnivora 12.5
## 16 Red fox Vulpes Carnivora 9.80
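In dplyr 1.0.4 and later, the scoped variants such as filter_all() are superseded by if_any() and if_all() used inside a plain filter(); the older functions still work. The sketch below, on a small hypothetical tibble of character columns, suggests how the two styles correspond.

```r
library(dplyr)
library(stringr)

# Toy data (a few hypothetical rows, character columns only)
animals <- tibble(
  name  = c("Cheetah", "Goat", "Cow"),
  genus = c("Acinonyx", "Capri", "Bos"),
  order = c("Carnivora", "Artiodactyla", "Artiodactyla")
)

# Scoped variant, as used in this post
old_style <- animals %>%
  filter_all(any_vars(str_detect(., pattern = "Ca")))

# Equivalent with if_any() (assumes dplyr >= 1.0.4)
new_style <- animals %>%
  filter(if_any(everything(), ~ str_detect(.x, "Ca")))

old_style$name
new_style$name
```

Both versions keep "Cheetah" (order Carnivora) and "Goat" (genus Capri) and drop "Cow".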
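Similar to any_vars() is all_vars(), which keeps only the rows in which every selected column meets the condition. For example, to keep the rows in which all selected values are greater than 1: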
msleep %>%
select(name, sleep_total:bodywt, -awake) %>%
filter_all(all_vars(. > 1))
## # A tibble: 1 x 6
## name sleep_total sleep_rem sleep_cycle brainwt bodywt
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Human 8.00 1.90 1.50 1.32 62.0
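Note that the example above also applies `. > 1` to the character column name. If the intent is to test only the numeric columns, one option is to restrict the check with where(is.numeric). This is a sketch on hypothetical rows and assumes dplyr >= 1.0.4 for if_all().

```r
library(dplyr)

# Toy data (hypothetical rows)
animals <- tibble(
  name      = c("Human", "Cow"),
  sleep_rem = c(1.9, 0.7),
  brainwt   = c(1.32, 0.423)
)

# Only the numeric columns take part in the comparison
numeric_only <- animals %>%
  filter(if_all(where(is.numeric), ~ .x > 1))
numeric_only$name
```

Rows with an NA in any of the tested columns are dropped as well, because the combined condition is then not TRUE.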
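Now suppose we want the observations that have a missing value in a character column, regardless of whether the numeric columns have missing values. filter_all() is not a good fit here: filter_all(any_vars(is.na(.))) returns every row that contains an NA in any column, which is not what we want. Instead, we can use is.character to restrict the test to the character columns.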
msleep %>%
select(name:order, sleep_total:sleep_rem) %>%
filter_if(is.character, any_vars(is.na(.)))
## # A tibble: 7 x 6
## name genus vore order sleep_total sleep_rem
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Vesper mouse Calomys <NA> Rodentia 7.00 NA
## 2 Desert hedgehog Paraechinus <NA> Erinaceomorpha 10.3 2.70
## 3 Deer mouse Peromyscus <NA> Rodentia 11.5 NA
## 4 Phalanger Phalanger <NA> Diprotodontia 13.7 1.80
## 5 Rock hyrax Procavia <NA> Hyracoidea 5.40 0.500
## 6 Mole rat Spalax <NA> Rodentia 10.6 2.40
## 7 Musk shrew Suncus <NA> Soricomorpha 12.8 2.00
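The difference between the two calls is easiest to see on a tiny example. In the hypothetical tibble below, one row has an NA in a character column and another has an NA only in a numeric column.

```r
library(dplyr)

# Toy data (hypothetical)
animals <- tibble(
  name      = c("A", "B", "C"),
  vore      = c(NA, "herbi", "omni"),
  sleep_rem = c(2.7, NA, 1.8)
)

# NA in any column: keeps A and B
any_column <- animals %>% filter_all(any_vars(is.na(.)))

# NA in a character column only: keeps A
character_only <- animals %>% filter_if(is.character, any_vars(is.na(.)))

any_column$name
character_only$name
```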
In the same way we can use other type predicates such as is.numeric, is.integer, is.double, is.logical and is.factor, which gives us even more ways to filter.
filter_at() filters on a condition applied to a given set of variables. In the example below, we keep the rows in which both sleep_total and sleep_rem are greater than 5.
msleep %>%
select(name, sleep_total:sleep_rem, brainwt:bodywt) %>%
filter_at(vars(sleep_total, sleep_rem), all_vars(.>5))
## # A tibble: 2 x 5
## name sleep_total sleep_rem brainwt bodywt
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Thick-tailed opposum 19.4 6.60 NA 0.370
## 2 Giant armadillo 18.1 6.10 0.0810 60.0
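Swapping all_vars() for any_vars() in filter_at() changes "both" into "at least one". The sketch below shows the two side by side on hypothetical values, together with an if_all() form that assumes dplyr >= 1.0.4.

```r
library(dplyr)

# Toy data (hypothetical values)
animals <- tibble(
  name        = c("A", "B", "C"),
  sleep_total = c(19.4, 18.1, 3),
  sleep_rem   = c(6.6, 2, 1)
)

# Both columns above 5
both <- animals %>%
  filter_at(vars(sleep_total, sleep_rem), all_vars(. > 5))

# At least one column above 5
either <- animals %>%
  filter_at(vars(sleep_total, sleep_rem), any_vars(. > 5))

# Same result as `both`, written with if_all()
both_new <- animals %>%
  filter(if_all(c(sleep_total, sleep_rem), ~ .x > 5))

both$name
either$name
```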
Source: https://suzan.rbind.io/2018/02/dplyr-tutorial-3