tidyverse: R's counterpart to pandas + matplotlib in Python

拴小林

Published 2021-01-12 11:24:12

Collected in the column: 数据驱动实践

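The tidyverse is Hadley Wickham's effort to organize the packages he wrote into one coherent toolkit for data processing. It includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. His book R for Data Science (Chinese edition: 《R数据科学》) explains how to use the tidyverse in detail.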

tidyverse website: https://www.tidyverse.org/

Book website: https://r4ds.had.co.nz/

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. Install the complete tidyverse with:

install.packages("tidyverse")

library(tidyverse) # loads the following core tidyverse packages:

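ggplot2: plotting and data visualization
dplyr: data manipulation (filtering, sorting, and so on)
tidyr: data tidying
readr: reading data from files
purrr: handy functional-programming tools
tibble: an upgraded data.frame
stringr: string handling (searching, replacing, and so on)
forcats: working with factors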

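install.packages("tidyverse")  # install; there are many dependencies, so this can take a while
library(tidyverse) # remember to load the package before use

The sections below cover readr (reading data), tibble (the data type), %>% (the pipe), dplyr (basic data operations), tidyr (pivoting and unpivoting), and ggplot2 (visualization).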

—

readr: importing/reading data

readr comes with five parsers for rectangular file formats:

read_csv() and read_csv2() for csv files (comma-separated files; read_csv2() handles the semicolon-separated variant; an Excel file can be saved as csv) [must learn]
read_tsv() for tab-separated files
read_fwf() for fixed-width files
read_log() for web log files

> df <- read_csv("df.csv")

-- Column specification -------------------------------------------------------------------
cols(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_character()
)


> df
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ... with 140 more rows
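
The column specification printed above is guessed by readr. As a minimal sketch, assuming you would rather declare the types yourself, the same data can be read with an explicit col_types argument. The temporary file created here is only a stand-in for df.csv so that the example is self-contained.

```r
library(readr)

# Write the built-in iris data to a temporary csv file (stand-in for df.csv).
path <- tempfile(fileext = ".csv")
write_csv(iris, path)

# Declare column types up front instead of relying on readr's guessing.
df <- read_csv(
  path,
  col_types = cols(
    Species = col_character(),  # keep the species label as text
    .default = col_double()     # every other column is numeric
  )
)

# Values that could not be parsed as the declared type would be listed here.
problems(df)
```

Declaring the types also silences the column specification message, which helps in scripts that run unattended.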

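Reading data in other formats:

readxl: read_xls(); read_xlsx(); (or read_excel(), which detects the format from the file)
haven: reads external data from SAS, SPSS, Stata, and similar tools.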

—

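tibble: an enhanced data frame (an upgraded data.frame)

Data (column) types are visible at a glance

A tibble is an extended data frame intended to replace data.frame in R. It inherits from data.frame, places no restriction on the types its columns hold, and shares the same syntax as data.frame, which makes it more convenient to use. The tibble package was also developed by Hadley.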

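tibble revises several data.frame behaviors:

tibble does not care about input types and can store any type, including list columns
tibble does not use row names (row.names)
tibble supports arbitrary column names
tibble adds column names automatically
tibble only recycles inputs of length 1
tibble evaluates its arguments lazily and in order
tibble objects have class tbl_df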
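
Several of the properties listed above can be checked with a short sketch. The column names and values here are made up purely for illustration.

```r
library(tibble)

tb <- tibble(
  x = 1:3,
  y = x * 2,                 # arguments are evaluated in order, so y can use x
  flag = TRUE,               # a length-1 input is recycled to every row
  `my col` = letters[1:3],   # a non-syntactic column name is allowed
  items = list(1, "a", 1:2)  # a list column can hold values of different types
)

class(tb)  # includes "tbl_df"

# Inputs of any other length are not recycled: this call raises an error.
bad <- try(tibble(a = 1:4, b = 1:2), silent = TRUE)
inherits(bad, "try-error")
```

A base data.frame would silently recycle b in the last call, which is one of the behaviors tibble deliberately tightens.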

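tibble is an evolution of data.frame with these advantages: each column keeps its original data type, and printing no longer overflows the screen (only part of the data is shown, as if head were built in).

There are two ways to create a tibble:

1. Create it directly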

> x <- c(1:3)
> y <- c(4:6)
> z <- letters[1:3]
> dft <- tibble(x,y,z) # 
> dft
# A tibble: 3 x 3
      x     y z    
  <int> <int> <chr>
1     1     4 a    
2     2     5 b    
3     3     6 c

2. Convert from another format with as_tibble

> dft_1 <- as_tibble(mtcars)
> dft_1
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ... with 22 more rows

More: http://blog.fens.me/r-tibble/

—

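%>%: the pipe operator

Passes the value on its left into the data argument position of the function on its right

In the tidyverse, the pipe is the workhorse of data wrangling. It chains many operations together, stays concise, and is easier to read than the equivalent base R code. For example, x %>% f(y) is equivalent to f(x, y).

RStudio shortcut: Ctrl+Shift+M

Using R's built-in iris data set as an example: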

> head(iris,n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> iris %>% head(n=3) # %>% passes iris to head(), the function after the pipe.
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
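
The equivalence between a piped call and a nested call can be sketched on a small vector. The variable names below are illustrative, and the `.` placeholder shown at the end is magrittr's way of sending the left-hand value to an argument other than the first.

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested form: read from the inside out.
nested <- sum(sqrt(x))

# Piped form: read from left to right; each step receives the previous result.
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)  # both are 6

# With the `.` placeholder the left-hand value goes where the dot is,
# so this is the same as seq(1, 3).
counted <- 3 %>% seq(1, .)
```

The piped form becomes more valuable as the chain grows, because the order of the steps on the page matches the order in which they run.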

—

dplyr: data manipulation

Usage of the following six dplyr functions

4.1 Filter: filter

4.2 Arrange: arrange

4.3 Select: select

4.4 Mutate: mutate

4.5 Summarise: summarise

4.6 Group: group_by

# install.packages("dplyr")
library(dplyr)

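4.1 Filter: filter()

# mtcars_df is the tibble version of the built-in mtcars data set
mtcars_df <- tibble::as_tibble(mtcars)

# keep only the rows that satisfy every given logical condition
filter(mtcars_df, mpg == 21, hp == 110)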
# A tibble: 2 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
2    21     6   160   110   3.9  2.88  17.0     0     1     4     4

4.2 Arrange: arrange()

arrange(mtcars_df, disp) # wrap the column in desc(), e.g. desc(disp), to sort in descending order

4.3 Select: select()

> select(mtcars_df, disp:wt) # select a subset of columns by name:
# A tibble: 32 x 4
    disp    hp  drat    wt
   <dbl> <dbl> <dbl> <dbl>
 1  160    110  3.9   2.62
 2  160    110  3.9   2.88
 3  108     93  3.85  2.32
 4  258    110  3.08  3.22
 5  360    175  3.15  3.44
 6  225    105  2.76  3.46
 7  360    245  3.21  3.57
 8  147.    62  3.69  3.19
 9  141.    95  3.92  3.15
10  168.   123  3.92  3.44
# ... with 22 more rows

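4.4 Mutate: mutate()

# add a row-number column
# NO runs from 1 to the number of rows in mtcars_df
mutate(mtcars_df, NO = 1:dim(mtcars_df)[1])

# redefine and assign values
# store the negated Ozone column in new, then recompute Temp as (Temp - 32) / 1.8 (Fahrenheit to Celsius)
mutate(airquality, new = -Ozone, Temp = (Temp - 32) / 1.8)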

4.5 Summarise: summarise()

# summarise a data frame by applying other functions to its columns
summarise(mtcars_df,mdisp = mean(disp, na.rm = TRUE))

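4.6 Group: group_by()

# once group_by() has added grouping information, mutate() and summarise() operate within each group
# (arrange() sorts within groups only when called with .by_group = TRUE)
cars <- group_by(mtcars_df, cyl)
countcars <- summarise(cars, count = n()) # count = n() counts the rows in each group


# with the %>% pipe, the data on the left becomes the source data set of the function on the right
countcars <- group_by(mtcars_df, cyl) %>% summarise(count = n())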
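
The six verbs are usually combined in one pipeline rather than called one at a time. The following is a rough sketch on the built-in mtcars data; the hp > 100 threshold and the output column names are arbitrary choices for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                      # keep the more powerful cars
  select(mpg, cyl, hp) %>%                  # keep only the columns needed below
  group_by(cyl) %>%                         # one group per cylinder count
  summarise(
    count = n(),                            # rows in each group
    mean_mpg = mean(mpg, na.rm = TRUE)      # average fuel economy per group
  ) %>%
  arrange(desc(mean_mpg))                   # best fuel economy first

result
```

Each line reads as one step, so adding or removing a step does not require rewriting the rest of the expression.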

—

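tidyr: data tidying and reshaping

tidyr's two main functions are gather() and spread(). They convert between the long data format and the wide data format (similar to the reshape package, but easier to use and compatible with the %>% pipe). Since tidyr 1.0.0 they have been superseded by pivot_longer() and pivot_wider(), although gather() and spread() still work.

Usage of the following four tidyr functions

5.1 Wide data to long data: gather (the reverse of an Excel pivot table)

5.2 Long data to wide data: spread (what an Excel pivot table does)

5.3 Combine several columns into one: unite

5.4 Split one column into several: separate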

#install.packages("tidyr") # install the tidyr package
library(tidyr)

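5.1 Wide data to long data: gather()

Similar to reversing an Excel pivot table

Illustrated explanation: https://www.zhihu.com/collection/467554113

#gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
#data: the wide table to convert
#key: name of the new column that stores the original column names
#value: name of the new column that stores the original values
#...: the columns to gather into one (prefix a column with - to leave it out)
#na.rm: whether to drop missing values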

widedata <- data.frame(person=c('Alex','Bob','Cathy'),grade=c(2,3,4),score=c(78,89,88))
#widedata
#  person grade score
#1   Alex     2    78
#2    Bob     3    89
#3  Cathy     4    88

longdata <- gather(widedata, variable, value, -person) # gather every column except person
#longdata
#  person variable value
#1   Alex    grade     2
#2    Bob    grade     3
#3  Cathy    grade     4
#4   Alex    score    78
#5    Bob    score    89
#6  Cathy    score    88

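5.2 Long data to wide data: spread()

Similar to building an Excel pivot table

Illustrated explanation: https://www.zhihu.com/collection/467554113

#spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
#data: the long table to convert
#key: the column whose values become the new column names
#value: the column whose values fill the new columns
#fill: the value used for combinations that are missing after reshaping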

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

stocksm <- stocks %>% gather(stock, price, -time)

#stocksm
#         time stock      price
#1  2009-01-01     X -1.6411394
#2  2009-01-02     X -0.2144050
#3  2009-01-03     X -1.0630161

stocksm %>% spread(stock, price)
#         time          X          Y          Z
#1  2009-01-01 -1.6411394 -5.2254532  7.5666852
#2  2009-01-02 -0.2144050  0.3570096  4.8142193
#3  2009-01-03 -1.0630161 -1.3085735  7.3624203

stocksm %>% spread(time, price)
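
For readers on tidyr 1.0.0 or later, the same wide-to-long and long-to-wide round trip can be sketched with pivot_longer() and pivot_wider(), the functions that supersede gather() and spread(). The data frame repeats the widedata example from section 5.1; the column names "variable" and "value" are kept only to match that example.

```r
library(tidyr)

widedata <- data.frame(
  person = c("Alex", "Bob", "Cathy"),
  grade = c(2, 3, 4),
  score = c(78, 89, 88)
)

# Wide to long: every column except person is stacked into name/value pairs.
longdata <- pivot_longer(
  widedata,
  cols = -person,
  names_to = "variable",
  values_to = "value"
)

# Long to wide: the entries of `variable` become column names again.
back <- pivot_wider(longdata, names_from = variable, values_from = value)
```

Note that pivot_longer() orders its output row by row (all of Alex's values first), whereas gather() stacks column by column, so the two results contain the same rows in a different order.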

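5.3 Combine several columns into one: unite()

#unite(data, col, ..., sep = "_", remove = TRUE)
#data: the data frame
#col: name of the new combined column
#...: the columns to combine
#sep: separator placed between the combined values, underscore by default
#remove: whether to drop the columns that were combined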

wideunite<-unite(widedata, col = information, person, grade, score, sep= "-")
wideunite
#  information
#1   Alex-2-78
#2    Bob-3-89
#3  Cathy-4-88

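5.4 Split one column into several: separate()

#separate() splits one column into several; it is typically used on log data or date-time data. Syntax:
#separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE,
#convert = FALSE, extra = "warn", fill = "warn", ...)
#data: the data frame
#col: the column to split
#into: names of the new columns, as a character vector
#sep: the separator to split on
#remove: whether to drop the column that was split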

widesep <- separate(wideunite, information,c("person","grade","score"), sep = "-")
widesep
#  person grade score
#1   Alex     2    78
#2    Bob     3    89
#3  Cathy     4    88
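
One detail the printed result hides is that separate() returns character columns by default. As a small sketch, assuming numeric columns are wanted back, the convert argument from the signature above can be set to TRUE; the data repeats the widedata example.

```r
library(tidyr)

widedata <- data.frame(
  person = c("Alex", "Bob", "Cathy"),
  grade = c(2, 3, 4),
  score = c(78, 89, 88)
)

# Combine the three columns into one string column.
wideunite <- unite(widedata, col = information, person, grade, score, sep = "-")

# Split again; convert = TRUE asks tidyr to turn "2", "78", ... back into numbers.
widesep <- separate(
  wideunite, information,
  into = c("person", "grade", "score"),
  sep = "-",
  convert = TRUE
)

sapply(widesep, class)  # inspect the column types after the round trip
```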

—

ggplot2: R's classic visualization package
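
As a minimal sketch of what a ggplot2 call looks like, the following draws a scatter plot of the built-in iris data used earlier. The choice of columns, colours, and labels is illustrative only.

```r
library(ggplot2)

# Map columns to aesthetics, then add layers with +.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +                              # one point per flower
  labs(
    title = "iris: sepal length vs petal length",
    x = "Sepal length",
    y = "Petal length"
  )

print(p)  # draws the plot on the active graphics device
```

The plot is built as an object first and only drawn when printed, which is why layers can be added step by step with +.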

Originally published: 2020-12-29