虽然R中存在许多基本的数据处理函数,但它们至今仍有一点混乱,并且缺乏一致的编码和容易地将流一起的能力。这导致很难记忆和操作。因此我们需要更有效的代码、更容易记住语法和易于阅读的语法。而tidyr
正是一个这样的包,它的唯一目的是简化创建[tidy data]的过程。本教程使您基本了解tidyr提供的数据整理的四个基本功能:
gather()
]宽数据转化成长数据spread()
]长数据转变成宽数据separate()
]将一列数据拆分为多列unite()
]将多列数据合并为一列install.packages("tidyr")
library(tidyr)
管道函数%>%
运算符将值或表达式的结果转发到下一个函数表达式中。当执行多个函数时,它的优势就变得显而易见。
Function: gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as: data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)
Arguments:
data: data frame
key: column name representing new variable
value: column name representing variable values
...: names of columns to gather (or not gather)
na.rm: option to remove observations with missing values (represented by NAs)
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate
## Source: local data frame [12 x 6]
##
## Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1 1 2006 15 16 19 17
## 2 1 2007 12 13 27 23
## 3 1 2008 22 22 24 20
## 4 1 2009 10 14 20 16
## 5 2 2006 12 13 25 18
## 6 2 2007 16 14 21 19
## 7 2 2008 13 11 29 15
## 8 2 2009 23 20 26 20
## 9 3 2006 11 12 22 16
## 10 3 2007 13 11 27 21
## 11 3 2008 17 12 23 19
## 12 3 2009 14 9 31 24
从上面的数据我们可以看出来四个季度的数据为 Qtr.1:Qtr.4。而这四个季度(四个变量)可以用 一个变量来表示,我们可以在一个列变量中来表示季度。
long_DF <- DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
head(long_DF, 24) # note, for brevity, I only show the data for the first two years
## Source: local data frame [24 x 4]
##
## Group Year Quarter Revenue
## 1 1 2006 Qtr.1 15
## 2 1 2007 Qtr.1 12
## 3 1 2008 Qtr.1 22
## 4 1 2009 Qtr.1 10
## 5 2 2006 Qtr.1 12
## 6 2 2007 Qtr.1 16
## 7 2 2008 Qtr.1 13
## 8 2 2009 Qtr.1 23
## 9 3 2006 Qtr.1 11
## 10 3 2007 Qtr.1 13
## .. ... ... ... ...
Quarter列代表观察值的某个季度(Q1或者Q2,3,4),而Revenue代表原数据的值。
#上述相同结果的四种不同写法
DF %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
DF %>% gather(Quarter, Revenue, -Group, -Year)
DF %>% gather(Quarter, Revenue, 3:6)
DF %>% gather(Quarter, Revenue, Qtr.1, Qtr.2, Qtr.3, Qtr.4)
还要注意,如果不为na.rm提供参数或不转换值,则使用默认值。
很多时候,一个列变量将包含多个变量,我们可以在每种情况下,我们的目标可能是在变量字符串中分隔字符。这可以使用separate()函数来实现,该函数将单个字符列分割为多个列。
## Grp_Ind Yr_Mo City_State First_Last Extra_variable
## 1 1.a 2006_Jan Dayton (OH) George Washington XX01person_1
## 2 1.b 2006_Feb Grand Forks (ND) John Adams XX02person_2
## 3 1.c 2006_Mar Fargo (ND) Thomas Jefferson XX03person_3
## 4 2.a 2007_Jan Rochester (MN) James Madison XX04person_4
## 5 2.b 2007_Feb Dubuque (IA) James Monroe XX05person_5
## 6 2.c 2007_Mar Ft. Collins (CO) John Adams XX06person_6
## 7 3.a 2008_Jan Lake City (MN) Andrew Jackson XX07person_7
## 8 3.b 2008_Feb Rushford (MN) Martin Van Buren XX08person_8
## 9 3.c 2008_Mar Unknown William Harrison XX09person_9
Function: separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
Same as: data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)
Arguments:
data: data frame
col: column name representing current variable
into: names of variables representing new variables
sep: how to separate current variable (char, num, or symbol)
remove: if TRUE, remove input column from output data frame
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or
factor as appropriate
通过使用 separate()
函数,我们将可以得到下列:
separate_DF <- long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
head(separate_DF, 10)
## Source: local data frame [10 x 5]
##
## Group Year Time_Interval Interval_ID Revenue
## 1 1 2006 Qtr 1 15
## 2 1 2007 Qtr 1 12
## 3 1 2008 Qtr 1 22
## 4 1 2009 Qtr 1 10
## 5 2 2006 Qtr 1 12
## 6 2 2007 Qtr 1 16
## 7 2 2008 Qtr 1 13
## 8 2 2009 Qtr 1 23
## 9 3 2006 Qtr 1 11
## 10 3 2007 Qtr 1 13
下列两种写法会得到相同的结果
long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
long_DF %>% separate(Quarter, c("Time_Interval", "Interval_ID"), sep = "\\.")
Function: unite(data, col, ..., sep = " ", remove = TRUE)
Same as: data %>% unite(col, ..., sep = " ", remove = TRUE)
Arguments:
data: data frame
col: column name of new "merged" column
...: names of columns to merge
sep: separator to use between merged values
remove: if TRUE, remove input column from output data frame
unite_DF <- separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".")
head(unite_DF, 10)
## Source: local data frame [10 x 4]
##
## Group Year Quarter Revenue
## 1 1 2006 Qtr.1 15
## 2 1 2007 Qtr.1 12
## 3 1 2008 Qtr.1 22
## 4 1 2009 Qtr.1 10
## 5 2 2006 Qtr.1 12
## 6 2 2007 Qtr.1 16
## 7 2 2008 Qtr.1 13
## 8 2 2009 Qtr.1 23
## 9 3 2006 Qtr.1 11
## 10 3 2007 Qtr.1 13
示例:
separate_DF %>% unite(Quarter, Time_Interval, Interval_ID, sep = "_")
separate_DF %>% unite(Quarter, Time_Interval, Interval_ID)
# If no spearator is identified, "_" will automatically be used
spread()
将长格式改为宽格式说明:有时我们需要将长格式数据转换为宽格式数据。函数spread()
将某一列数据值分布在多个列上。
Function: spread(data, key, value, fill = NA, convert = FALSE)
Same as: data %>% spread(key, value, fill = NA, convert = FALSE)
Arguments:
data: data frame
key: column values to convert to multiple columns
value: single column values to convert to multiple columns' values
fill: If there isn't a value for every combination of the other variables and the key
column, this value will be substituted
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or
factor as appropriate
wide_DF <- unite_DF %>% spread(Quarter, Revenue)
head(wide_DF, 24)
## Source: local data frame [12 x 6]
##
## Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
## 1 1 2006 15 16 19 17
## 2 1 2007 12 13 27 23
## 3 1 2008 22 22 24 20
## 4 1 2009 10 14 20 16
## 5 2 2006 12 13 25 18
## 6 2 2007 16 14 21 19
## 7 2 2008 13 11 29 15
## 8 2 2009 23 20 26 20
## 9 3 2006 11 12 22 16
## 10 3 2007 13 11 27 21
## 11 3 2008 17 12 23 19
## 12 3 2009 14 9 31 24