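HDF5 is a file format for storing large datasets. Besides the hdf5r package introduced here, the h5 package on CRAN and the rhdf5 package on Bioconductor offer similar functionality.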
library(hdf5r)
# Create a temporary HDF5 file path
test_filename <- tempfile(fileext = ".h5")
# Create the HDF5 file; mode "w" overwrites the file if it already exists
file.h5 <- H5File$new(test_filename, mode = "w")
file.h5
# Class: H5File
# Filename: C:\Users\cmusunqi\TMP\Rtmp2Vb8Pj\file29ac4e56549a.h5
# Access type: H5F_ACC_RDWR
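The mode argument decides how the file is opened. As a minimal sketch (the path and the group name demo are made up for illustration), the snippet below creates a file with mode "w", closes it, and reopens it read-only with mode "r", which is the safer choice when the goal is only to read existing data.

```r
library(hdf5r)

# Hypothetical temporary file used only for this illustration
path <- tempfile(fileext = ".h5")

# "w" creates the file, replacing any existing file at that path
f_new <- H5File$new(path, mode = "w")
demo_grp <- f_new$create_group("demo")
f_new$close_all()

# "r" opens an existing file read-only
f_ro <- H5File$new(path, mode = "r")
group_names <- names(f_ro)
f_ro$close_all()

group_names
```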
Create two groups: one to hold the mtcars data and one for the nycflights13 data.
mtcars.grp <- file.h5$create_group("mtcars")
flights.grp <- file.h5$create_group("flights")
Writing the data
library(datasets)
library(nycflights13)
library(reshape2)
# Add the data to its group
mtcars.grp[["mtcars"]] <- datasets::mtcars
# Put the weather data into the flights group
flights.grp[["weather"]] <- nycflights13::weather
# Put the flights data into the flights group
flights.grp[["flights"]] <- nycflights13::flights
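From the weather data, extract wind direction and wind speed for the EWR station and store each as a matrix, with hours as columns and dates as rows.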
# Take a subset with subset()
weather_wind_dir <- subset(
  # Rows to keep
  nycflights13::weather, origin == "EWR",
  # Columns to keep
  select = c("year", "month", "day", "hour", "wind_dir"))
# Drop rows with missing values
weather_wind_dir <- na.exclude(weather_wind_dir)
# Convert wind direction to integer
weather_wind_dir$wind_dir <- as.integer(weather_wind_dir$wind_dir)
# acast casts long data into a wide matrix (dcast returns a data frame instead)
weather_wind_dir <- acast(
  weather_wind_dir,
  year + month + day ~ hour, value.var = "wind_dir")
# Store wind direction in the flights group
flights.grp[["wind_dir"]] <- weather_wind_dir
# Same processing for wind speed
weather_wind_speed <- subset(
  nycflights13::weather, origin == "EWR",
  select = c("year", "month", "day", "hour", "wind_speed"))
weather_wind_speed <- na.exclude(weather_wind_speed)
# Reshape the long data into a wide matrix
weather_wind_speed <- acast(
  weather_wind_speed,
  year + month + day ~ hour, value.var = "wind_speed")
# Store wind speed in the flights group
flights.grp[["wind_speed"]] <- weather_wind_speed
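To make the long-to-wide step easier to follow, here is a small sketch of what acast does, using a made-up four-row data frame rather than the real weather data. The left side of the formula becomes the rows and the right side becomes the columns.

```r
library(reshape2)

# Toy long-format data: one row per (day, hour) observation
long <- data.frame(
  day = c(1, 1, 2, 2),
  hour = c(0, 1, 0, 1),
  wind_speed = c(10.4, 8.1, 11.5, 12.7)
)

# Rows come from the left side of the formula, columns from the right
wide <- acast(long, day ~ hour, value.var = "wind_speed")

dim(wide)        # 2 rows (days) by 2 columns (hours)
rownames(wide)   # "1" "2"
wide["2", "1"]   # 12.7, the value for day 2, hour 1
```

The row and column labels end up in the dimnames of the matrix, which is why they can later be saved separately as attributes.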
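Define attributes: store the row and column names of the wind direction and wind speed matrices as HDF5 attributes on the datasets.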
h5attr(flights.grp[["wind_dir"]], "colnames") <- colnames(weather_wind_dir)
h5attr(flights.grp[["wind_dir"]], "rownames") <- rownames(weather_wind_dir)
h5attr(flights.grp[["wind_speed"]], "colnames") <- colnames(weather_wind_speed)
h5attr(flights.grp[["wind_speed"]], "rownames") <- rownames(weather_wind_speed)
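The point of saving the names as attributes is that they can be reattached after the matrix is read back. The following is a rough sketch of that round trip on a tiny made-up matrix; the dataset name m and the temporary file are illustrative only.

```r
library(hdf5r)

# Hypothetical temporary file and toy matrix for illustration
path <- tempfile(fileext = ".h5")
f <- H5File$new(path, mode = "w")
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

# Write the matrix, then keep its row and column names as attributes
f[["m"]] <- m
h5attr(f[["m"]], "rownames") <- rownames(m)
h5attr(f[["m"]], "colnames") <- colnames(m)

# Read the values back and reattach the names from the attributes
ds <- f[["m"]]
restored <- ds[, ]
dimnames(restored) <- list(h5attr(ds, "rownames"), h5attr(ds, "colnames"))

f$close_all()
restored
```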
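This next part matters more to me: for now, what I actually need is reading data. As for building HDF5 files, I don't expect to be doing that any time soon.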
# List the groups in file.h5
names(file.h5)
# [1] "flights" "mtcars"
# List the datasets in the flights group
names(flights.grp)
## [1] "flights" "weather" "wind_dir" "wind_speed"
# ls() returns each object's name, link type, dimensions, and more
flights.grp$ls()
## name link.type obj_type num_attrs group.nlinks group.mounted
## 1 flights H5L_TYPE_HARD H5I_DATASET 0 NA NA
## 2 weather H5L_TYPE_HARD H5I_DATASET 0 NA NA
## 3 wind_dir H5L_TYPE_HARD H5I_DATASET 2 NA NA
## 4 wind_speed H5L_TYPE_HARD H5I_DATASET 2 NA NA
## dataset.rank dataset.dims dataset.maxdims dataset.type_class
## 1 1 336776 Inf H5T_COMPOUND
## 2 1 26115 Inf H5T_COMPOUND
## 3 2 364 x 24 Inf x Inf H5T_INTEGER
## 4 2 364 x 24 Inf x Inf H5T_INTEGER
## dataset.space_class committed_type
## 1 H5S_SIMPLE <NA>
## 2 H5S_SIMPLE <NA>
## 3 H5S_SIMPLE <NA>
## 4 H5S_SIMPLE <NA>
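An HDF5 file carries a lot of information. Knowing the names of the groups and datasets is not enough; we also need details about what each group contains. The ls method returns the object type, number of attributes, rank, dimensions, maximum dimensions, type class, and more.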
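When the goal is just to explore an unfamiliar file, it can help to list everything at once. The sketch below assumes that ls accepts a recursive argument, as I understand it from the hdf5r documentation, and uses a made-up file with one group and two datasets.

```r
library(hdf5r)

# Hypothetical temporary file with a small nested structure
path <- tempfile(fileext = ".h5")
f <- H5File$new(path, mode = "w")
grp <- f$create_group("grp")
grp[["x"]] <- 1:10
grp[["y"]] <- matrix(runif(6), nrow = 2)

# Walk the whole file instead of only the top level
listing <- f$ls(recursive = TRUE)

# Keep only the columns that are most useful for a quick overview
listing[, c("name", "obj_type", "dataset.dims")]
all_names <- listing$name

f$close_all()
```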
Data types
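# Get the weather dataset
weather_ds <- flights.grp[["weather"]]
# get_type() returns the dataset's datatype object (H5T_COMPOUND here)
weather_ds_type <- weather_ds$get_type()
# get_class() returns the class of that datatype (compound, integer, float, ...)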
weather_ds_type$get_class()
## [1] H5T_COMPOUND
## 13 Levels: H5T_NO_CLASS H5T_INTEGER H5T_FLOAT H5T_TIME ... H5T_NCLASSES
## 13 Values: -1 0 1 2 ... 11
Use cat to print the text description of the type returned by get_type:
cat(weather_ds_type$to_text())
## H5T_COMPOUND {
## H5T_STRING {
## STRSIZE H5T_VARIABLE;
## STRPAD H5T_STR_NULLTERM;
## CSET H5T_CSET_ASCII;
## CTYPE H5T_C_S1;
## } "origin" : 0;
## H5T_STD_I32LE "year" : 8;
## H5T_STD_I32LE "month" : 12;
## H5T_STD_I32LE "day" : 16;
## H5T_STD_I32LE "hour" : 20;
## H5T_IEEE_F64LE "temp" : 24;
## H5T_IEEE_F64LE "dewp" : 32;
## H5T_IEEE_F64LE "humid" : 40;
## H5T_IEEE_F64LE "wind_dir" : 48;
## H5T_IEEE_F64LE "wind_speed" : 56;
## H5T_IEEE_F64LE "wind_gust" : 64;
## H5T_IEEE_F64LE "precip" : 72;
## H5T_IEEE_F64LE "pressure" : 80;
## H5T_IEEE_F64LE "visib" : 88;
## H5T_IEEE_F64LE "time_hour" : 96;
## }
# Dimensions
weather_ds$dims
weather_ds$maxdims
weather_ds$chunk_dims
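These return the dimensions, the maximum dimensions, and the chunk dimensions (the size of one chunk, not the number of chunks):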
## [1] 26115
## [1] Inf
## [1] 78
List the attributes, then read a specific one by name.
# Which attributes does the wind direction dataset have?
h5attr_names(flights.grp[["wind_dir"]])
## [1] "colnames" "rownames"
# Read one attribute
h5attr(flights.grp[["wind_dir"]], "colnames")
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [16] "15" "16" "17" "18" "19" "20" "21" "22" "23"
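hdf5r offers many more ways to get detailed information about the objects in an HDF5 file.
We also want to read data, modify it, extend datasets, and delete datasets again. Reading works the same way as indexing ordinary R arrays and data frames. However, an HDF5 table (a compound dataset) has only one dimension, so columns cannot be read selectively: all columns of the selected rows are read at once.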
# Read rows 1 to 5
weather_ds[1:5]
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## 1 EWR 2013 1 1 1 39.02 26.06 59.37 270 10.35702 NA
## 2 EWR 2013 1 1 2 39.02 26.96 61.63 250 8.05546 NA
## 3 EWR 2013 1 1 3 39.02 28.04 64.43 240 11.50780 NA
## 4 EWR 2013 1 1 4 39.92 28.04 62.21 250 12.65858 NA
## 5 EWR 2013 1 1 5 39.02 28.04 64.43 260 12.65858 NA
## precip pressure visib time_hour
## 1 0 1012.0 10 1357020000
## 2 0 1012.3 10 1357023600
## 3 0 1012.5 10 1357027200
## 4 0 1012.2 10 1357030800
## 5 0 1011.9 10 1357034400
# Read the first 3 rows of wind direction
# wind_dir is a matrix, so rows and columns can both be indexed
wind_dir_ds <- flights.grp[["wind_dir"]]
wind_dir_ds[1:3, ]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0 1 1 1 1 1 1 1 1 1 1 1 0 1
## [2,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [3,] 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] 1 1 1 1 1 1 1 1 1 1
## [2,] 1 1 1 1 1 1 1 1 1 1
## [3,] 1 1 1 1 1 1 1 1 1 1
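Since a table dataset only has a row dimension, one workaround is to read the rows of interest and pick the columns afterwards in R. This is a small sketch on a made-up data frame (the dataset name tbl and its columns are illustrative), not the weather table itself.

```r
library(hdf5r)

# Hypothetical temporary file and toy table for illustration
path <- tempfile(fileext = ".h5")
f <- H5File$new(path, mode = "w")
df <- data.frame(
  id = 1:4,
  temp = c(39.0, 39.9, 41.0, 37.9),
  origin = c("EWR", "EWR", "JFK", "LGA"),
  stringsAsFactors = FALSE
)
f[["tbl"]] <- df
ds <- f[["tbl"]]

# Only rows can be selected on the HDF5 side; every column comes along
first_rows <- ds[1:3]

# Column selection happens afterwards on the ordinary data frame
temps <- first_rows[, c("id", "temp")]

f$close_all()
temps
```

For very wide tables this still reads every column of the selected rows from disk, so limiting the row range is the only way to reduce the amount of data read.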
# Replace the first row
wind_dir_ds[1, ] <- rep(1, 24)
wind_dir_ds[1, ]
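Data can also be written beyond a dataset's current dimensions, as long as it stays within maxdims. The dataset is extended to hold the new data, and any points left unassigned by the extension are filled with the default fill value, which is usually 0.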
wind_dir_ds$get_fill_value()
## [1] 0
# Set row 1, column 25 to 1; the rest of the new column is filled with 0
wind_dir_ds[1, 25] <- 1
wind_dir_ds[1:2, ]
# The dataset now has a 25th column, filled with 0s everywhere except the first row
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## [1,] 1 1 1 1 1 1 1 1 1 1 1
## [2,] 1 1 1 1 1 1 1 1 1 1 0
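The same behavior can be reproduced on a much smaller matrix, which makes the fill value easier to see. This sketch assumes, as in the listing earlier, that a dataset created by plain assignment gets unlimited maxdims; the dataset name m is made up.

```r
library(hdf5r)

# Hypothetical temporary file and a 2 x 3 integer matrix of ones
path <- tempfile(fileext = ".h5")
f <- H5File$new(path, mode = "w")
f[["m"]] <- matrix(1L, nrow = 2, ncol = 3)
ds <- f[["m"]]

# Writing to column 4 extends the dataset from 2 x 3 to 2 x 4
ds[1, 4] <- 5L
new_dims <- ds$dims

# The cell that was never assigned holds the fill value
filled <- ds[2, 4]

f$close_all()
new_dims
filled
```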
Deleting a dataset
# Delete the wind direction dataset
flights.grp$link_delete("wind_dir")
flights.grp$ls()
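Instead of scanning the ls output, the exists method can confirm that the link is gone. A minimal sketch with two made-up datasets, keep and drop:

```r
library(hdf5r)

# Hypothetical temporary file with two small datasets
path <- tempfile(fileext = ".h5")
f <- H5File$new(path, mode = "w")
f[["keep"]] <- 1:3
f[["drop"]] <- 4:6

# Check for the link before and after deleting it
before <- f$exists("drop")
f$link_delete("drop")
after <- f$exists("drop")
remaining <- names(f)

f$close_all()
c(before = before, after = after)
remaining
```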
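There are two ways to close a file: close closes only the file handle, while close_all also closes every object in the file that is still open (groups, datasets, attributes).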
file.h5$close_all()
That covers the basic functionality of the hdf5r package. There are also more advanced features that deal with file creation and data types.
love&peace