xarray | Data Structures (2)

bugsuse

Published on 2020-04-21 17:21:45



Included in the column: 气象杂货铺

Dataset

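xarray.Dataset is the multi-dimensional equivalent of a pandas DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions, designed as an in-memory representation of the data model of the NetCDF file format.

In addition to the dict-like interface of the Dataset itself, which can be used to access any variable, a Dataset has four key properties: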

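dims: a dictionary mapping each dimension name to its length, e.g. {'x': 6, 'y': 6, 'time': 8}
data_vars: a dict-like container of the DataArrays that correspond to variables
coords: a dict-like container of the DataArrays used to label points in data_vars, e.g. arrays of numbers, datetime objects or strings
attrs: an OrderedDict holding arbitrary metadata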

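Whether a variable counts as data or as a coordinate is mostly a semantic distinction, and you can ignore it if you like: dictionary-style access on a dataset returns variables from either category. xarray does, however, make use of the distinction for indexing and computation. Coordinates represent constant, fixed or independent quantities, whereas data variables represent varying, measured or dependent quantities.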

Below is an example of how a dataset for a weather forecast might be structured:
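The illustration that accompanied this sentence is not reproduced in this text. As a stand-in, here is a minimal sketch of a dataset with the structure that the next paragraphs describe. The array shapes, the zero-filled values and the `title` attribute are assumptions chosen only for illustration; the last lines inspect the four properties listed earlier.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative values only: a 2 x 2 grid with 3 forecast times.
example = xr.Dataset(
    data_vars={
        "temperature": (("x", "y", "time"), np.zeros((2, 2, 3))),
        "precipitation": (("x", "y", "time"), np.zeros((2, 2, 3))),
    },
    coords={
        # 2D coordinates, as on a projected grid
        "latitude": (("x", "y"), [[42.25, 42.21], [42.63, 42.59]]),
        "longitude": (("x", "y"), [[-99.83, -99.32], [-99.79, -99.23]]),
        # valid times of the forecast
        "time": pd.date_range("2014-09-06", periods=3),
        # scalar coordinate: when the forecast was made
        "reference_time": pd.Timestamp("2014-09-05"),
    },
    attrs={"title": "illustrative forecast"},  # hypothetical metadata
)

print(dict(example.sizes))      # dimension name -> length
print(list(example.data_vars))  # names of the data variables
print(list(example.coords))     # names of the coordinate variables
print(example.attrs)            # arbitrary metadata
```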

In the example above, temperature and precipitation are data variables. The other arrays are coordinate variables, because they label points along the dimensions.

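Note:

Because the dataset uses projected coordinates, latitude and longitude are 2D arrays. reference_time is the reference time at which the forecast was made, not the valid time (time) to which the forecast applies.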

Creating a Dataset

To create a Dataset from scratch, supply dictionaries for the variables (data_vars), the coordinates (coords) and the attributes (attrs).

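data_vars: a dictionary in which each key is a variable name and each value is one of:
- A DataArray or Variable
- A tuple of the form (dims, data[, attrs]), which is converted into arguments for Variable
- A pandas object, which is converted into a DataArray
- A 1D array or list, which is interpreted as the values of a one-dimensional coordinate variable along the dimension of the same name
coords: a dictionary of the same form as data_vars
attrs: a dictionary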
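The accepted value forms listed above can be sketched side by side. The variable names (`from_dataarray`, `from_tuple`, `from_series`) and the `units` attribute are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

forms = xr.Dataset(
    {
        # a DataArray used directly as the value
        "from_dataarray": xr.DataArray(np.arange(3.0), dims="x"),
        # a (dims, data, attrs) tuple
        "from_tuple": ("x", [10.0, 20.0, 30.0], {"units": "degC"}),
        # a pandas object: the index name becomes the dimension name
        "from_series": pd.Series(
            [1.0, 2.0, 3.0], index=pd.Index([0, 1, 2], name="x")
        ),
    },
    # a plain 1D list, read as a coordinate along the dimension "x"
    coords={"x": [0, 1, 2]},
)
print(forms)
```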

Let's create a Dataset:

>> import numpy as np
>> import pandas as pd
>> import xarray as xr

>> temp = 15 + 8 * np.random.randn(2, 2, 3)
>> precip = 10 * np.random.rand(2, 2, 3)
>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
>> lat = [[42.25, 42.21], [42.63, 42.59]]
# For real usage, you would also set other attributes, such as units
>> ds = xr.Dataset({'temperature': (['x', 'y', 'time'], temp),
                    'precipitation': (['x', 'y', 'time'], precip)},
                   coords={'lon': (['x', 'y'], lon),
                           'lat': (['x', 'y'], lat),
                           'time': pd.date_range('2014-09-06', periods=3),
                           'reference_time': pd.Timestamp('2014-09-05')})

>> ds
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    reference_time  datetime64[ns] 2014-09-05
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precipitation   (x, y, time) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...

A DataArray or a pandas object can also be used as a value (foo is a DataArray that is not defined in this section):

>> xr.Dataset({'bar': foo})
<xarray.Dataset>
Dimensions:  (dim_0: 4, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * dim_1    (dim_1) <U2 'IA' 'IL' 'IN'
Data variables:
    bar      (dim_0, dim_1) float64 0.7039 0.1457 0.6233 0.6067 0.6926 ...

>> xr.Dataset({'bar': foo.to_pandas()})
<xarray.Dataset>
Dimensions:  (dim_0: 4, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * dim_1    (dim_1) object 'IA' 'IL' 'IN'
Data variables:
    bar      (dim_0, dim_1) float64 0.7039 0.1457 0.6233 0.6067 0.6926 ...
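Since `foo` is not defined in this section, the following is a hypothetical reconstruction that matches the printed dimensions (four dates along `dim_0`, three labels along `dim_1`). The random values are an assumption; only the shape and the labels are taken from the output above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for `foo`: no dimension names are given,
# so xarray falls back to the default names dim_0 and dim_1.
foo = xr.DataArray(
    np.random.rand(4, 3),
    coords=[pd.date_range("2000-01-01", periods=4), ["IA", "IL", "IN"]],
)

ds_bar = xr.Dataset({"bar": foo})
print(ds_bar)
```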

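Note:

A Dataset can be converted to a DataArray, DataFrame, dict or netCDF file with the to_array, to_dataframe, to_dict and to_netcdf methods respectively. Likewise, the to_* methods of a DataArray convert it to a DataFrame, Dataset, Series, dict, netCDF file or masked_array.

When a pandas object is supplied as a value, the names of its indexes are used as dimension names, and its data is aligned with any existing variables.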
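A short sketch of three of the conversions named in the note, run on a small made-up dataset (the variable names `a` and `b` and their values are assumptions for illustration):

```python
import xarray as xr

small = xr.Dataset(
    {"a": ("x", [1.0, 2.0]), "b": ("x", [3.0, 4.0])},
    coords={"x": [10, 20]},
)

arr = small.to_array()        # DataArray with an extra leading "variable" dimension
frame = small.to_dataframe()  # pandas.DataFrame indexed by x, one column per variable
as_dict = small.to_dict()     # nested dictionary of plain Python objects

print(arr.dims)
print(frame)
print(sorted(as_dict))
```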

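You can also create a Dataset from:

A pandas.DataFrame or pandas.Panel passed directly to Dataset, along its columns or items respectively (note that pandas.Panel has been removed from recent pandas releases)
A pandas.DataFrame passed to Dataset.from_dataframe, which additionally handles MultiIndexes. See "Working with pandas".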
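As a minimal sketch of the second route, the DataFrame below (its column names and values are made up for illustration) is converted with `Dataset.from_dataframe`; each column becomes a data variable and the named index becomes a dimension.

```python
import pandas as pd
import xarray as xr

# Hypothetical input: two columns indexed by a named DatetimeIndex.
df = pd.DataFrame(
    {"temperature": [11.0, 12.5, 9.8], "precipitation": [0.0, 1.2, 0.4]},
    index=pd.date_range("2014-09-06", periods=3, name="time"),
)

ds_from_df = xr.Dataset.from_dataframe(df)
print(ds_from_df)
```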

Dataset contents

Dataset implements the Python dictionary (mapping) interface, with DataArray objects as values:

# Check whether a variable is in the Dataset
>> 'temperature' in ds
True
# Get the keys
>> ds.keys()
KeysView(<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    reference_time  datetime64[ns] 2014-09-05
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precipitation   (x, y, time) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...)
# Get a variable
>> ds['temperature']
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 28.622812,  11.512907,  18.141037],
        [  2.756045,  16.406521,  24.212536]],

       [[  6.691933,  19.13648 ,  15.801706],
        [ 13.673612,  24.580889,  20.556329]]])
Coordinates:
    reference_time  datetime64[ns] 2014-09-05
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y

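Both coordinate variables and data variables are valid keys.

Data variables and coordinate variables are also available separately through the dict-like attributes data_vars and coords: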

>> ds.data_vars
Data variables:
    temperature    (x, y, time) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precipitation  (x, y, time) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...

>> ds.coords
Coordinates:
    reference_time  datetime64[ns] 2014-09-05
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08

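Like a DataArray, a Dataset stores its metadata in the attrs attribute.

xarray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you use objects other than strings, numbers or numpy.ndarray objects.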
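A brief sketch of setting metadata on both the dataset and one of its variables; the attribute names and values (`title`, `units`) are assumptions for illustration, and both are plain strings so that they stay serializable.

```python
import xarray as xr

small = xr.Dataset({"temperature": ("x", [11.5, 12.0])})

# Dataset-level metadata
small.attrs["title"] = "example attribute"
# Variable-level metadata
small["temperature"].attrs["units"] = "degC"

print(small.attrs)
print(small["temperature"].attrs)
```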

As a convenience, you can read variables using attribute-style access, but you cannot assign variables this way:

>> ds.temperature
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 28.622812,  11.512907,  18.141037],
        [  2.756045,  16.406521,  24.212536]],

       [[  6.691933,  19.13648 ,  15.801706],
        [ 13.673612,  24.580889,  20.556329]]])
Coordinates:
    reference_time  datetime64[ns] 2014-09-05
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y

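This is particularly useful in interactive environments such as IPython, where variable names can be tab-completed just like methods and attributes.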
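The read-only nature of attribute access can be sketched as follows. The variable `pressure` and its values are hypothetical; recent xarray releases raise an AttributeError on the attribute-style assignment, which is why it is wrapped in try/except here.

```python
import xarray as xr

small = xr.Dataset({"temperature": ("x", [11.5, 12.0])})

# Reading: attribute access and item access give the same DataArray.
same = small.temperature.identical(small["temperature"])
print(same)

# Writing: attribute-style assignment does not create a variable.
try:
    small.pressure = ("x", [1000.0, 990.0])
except AttributeError as err:
    print("attribute assignment rejected:", err)

# Item assignment is the supported way to add a variable.
small["pressure"] = ("x", [1000.0, 990.0])
print(list(small.data_vars))
```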

Dictionary-like methods

A dataset can be updated in place using dictionary-like syntax:

>> ds = xr.Dataset()
>> ds['temperature'] = (('x', 'y', 'time'), temp)
>> ds['precipitation'] = (('x', 'y', 'time'), precip)
>> ds.coords['lat'] = (('x', 'y'), lat)
>> ds.coords['lon'] = (('x', 'y'), lon)
>> ds.coords['time'] = pd.date_range('2014-09-06', periods=3)
>> ds.coords['reference_time'] = pd.Timestamp('2014-09-05')

This is very similar to working with struct variables in MATLAB.

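You can also use the standard dictionary methods (such as values, items, __delitem__, get and update) to work with the variables in a Dataset. Note that when a DataArray or pandas object is assigned to a Dataset with __setitem__ or update, it is automatically aligned to the indexes of the original dataset.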
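A small sketch of these dictionary methods and of the automatic alignment. All names and values are made up for illustration; the array assigned to `c` covers only two of the three `x` labels, so the missing position is filled with NaN.

```python
import numpy as np
import xarray as xr

small = xr.Dataset({"a": ("x", [1.0, 2.0, 3.0])}, coords={"x": [0, 1, 2]})

# __setitem__ aligns the new array to the dataset's existing x index.
small["c"] = xr.DataArray([10.0, 20.0], coords={"x": [1, 2]}, dims="x")

# update adds or replaces variables in place.
small.update({"b": ("x", [4.0, 5.0, 6.0])})

# __delitem__ removes a variable; get returns None for a missing key.
del small["a"]
missing = small.get("not_there")

print(small)
```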

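The copy method creates a copy of a Dataset. By default the copy is shallow: only the container is copied, and the arrays are still stored in the same underlying numpy.ndarray objects. Call .copy(deep=True) to copy the data as well.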
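The difference between the two kinds of copy can be checked directly. This is a minimal sketch on a made-up one-variable dataset; it writes into the original array and looks at what each copy sees.

```python
import numpy as np
import xarray as xr

orig = xr.Dataset({"a": ("x", np.array([1.0, 2.0, 3.0]))})
shallow = orig.copy()          # new container, same underlying array
deep = orig.copy(deep=True)    # new container and new array

# Modify the original data in place.
orig["a"].values[0] = 99.0

print(shallow["a"].values)  # reflects the change
print(deep["a"].values)     # keeps the old values
```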

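Transforming datasets

In addition to the dictionary-like methods above, xarray has further methods that transform a dataset into a new object.

To remove variables, either index the dataset with a list of the names you want to keep, or use the drop method; both return a new Dataset (recent xarray releases recommend drop_vars for dropping variables):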

>> list(ds[['temperature']])
['temperature', 'time', 'reference_time', 'lat', 'lon']

>> list(ds[['x']])
['x', 'reference_time']

>> list(ds.drop('temperature'))
['precipitation', 'lat', 'lon', 'time', 'reference_time']

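If you pass the name of a dimension coordinate to drop, only that coordinate variable is removed. In the output below, dropping time leaves temperature and precipitation in place, and time is then listed among the dimensions without coordinates: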

>> ds
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precipitation   (x, y, time) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...

>> ds2 = ds.drop('time')
>> ds2
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: time, x, y
Data variables:
    temperature     (x, y, time) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precipitation   (x, y, time) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...
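For comparison, here is a sketch that contrasts removing only a coordinate with removing a whole dimension. It assumes an xarray release that provides `drop_vars` and `drop_dims` (neither method is used in the original text), and the dataset itself is made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

small = xr.Dataset(
    {
        "temperature": (("x", "time"), np.zeros((2, 3))),
        "elevation": ("x", [120.0, 340.0]),
    },
    coords={"time": pd.date_range("2014-09-06", periods=3)},
)

# Removes only the coordinate variable; temperature keeps its time dimension.
no_coord = small.drop_vars("time")

# Removes the dimension together with every variable that uses it.
no_dim = small.drop_dims("time")

print(no_coord)
print(no_dim)
```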

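As an alternative to dictionary-like modification, you can use assign and assign_coords. Both return a new dataset with the additional variables and leave the original unchanged: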

>> ds.assign(temperature2 = 2 * ds.temperature)
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precipitation   (x, y, time) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...
    temperature2    (x, y, time) float64 57.25 23.03 36.28 5.512 32.81 48.43 ...
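The output above only shows assign. As a complementary sketch on a made-up dataset, assign_coords adds a coordinate in the same non-mutating way, and assign also accepts a function that is evaluated on the dataset; the names `day` and `temperature_f` are hypothetical.

```python
import numpy as np
import xarray as xr

small = xr.Dataset({"temperature": ("time", [10.0, 12.0, 14.0])})

# Returns a new dataset with an extra coordinate; `small` is unchanged.
with_day = small.assign_coords(day=("time", [6, 7, 8]))

# A callable receives the dataset and returns the new variable.
with_f = with_day.assign(temperature_f=lambda d: d.temperature * 9 / 5 + 32)

print(with_f)
```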

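The pipe() method lets you call an external function as ds.pipe(func) instead of func(ds), so that the call fits into a method chain. This is similar to pipes in a Linux shell. The two calls below produce the same plot: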

>> import matplotlib.pyplot as plt
>> plt.plot((2 * ds.temperature.sel(x=0)).mean('y'))
[<matplotlib.lines.Line2D at 0x1d822965b38>]

>> (ds.temperature.sel(x=0).pipe(lambda x: 2 * x).mean('y').pipe(plt.plot))
[<matplotlib.lines.Line2D at 0x1d82296d5f8>]

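The pipe and assign methods mirror the pandas methods of the same name (DataFrame.pipe and DataFrame.assign).

With xarray, there is no performance penalty for creating new datasets, even when the variables are loaded from a file. Creating new objects instead of mutating existing variables tends to make code easier to understand.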
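The equivalence of pipe and a direct call can be checked without plotting. In this sketch, `scale` is a hypothetical helper and the array is made up; extra positional arguments given to pipe are forwarded to the function.

```python
import numpy as np
import xarray as xr


def scale(obj, factor):
    """Hypothetical helper: multiply an xarray object by a constant."""
    return obj * factor


temperature = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("y", "time"))

direct = scale(temperature, 2).mean("y")        # plain function call
chained = temperature.pipe(scale, 2).mean("y")  # same call inside a method chain

print(direct.identical(chained))
```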

Renaming variables

The rename method renames dataset variables:

>> ds.rename({'temperature': 'temp', 'precipitation': 'precip'})
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temp            (x, y, time) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precip          (x, y, time) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...

The swap_dims method swaps a dimension coordinate with a non-dimension coordinate and returns a new dataset:

>> ds.coords['day'] = ('time', [6, 7, 8])
>> ds
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int32 6 7 8
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precipitation   (x, y, time) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...

>> ds.swap_dims({'time': 'day'})
<xarray.Dataset>
Dimensions:         (day: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    time            (day) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
  * day             (day) int32 6 7 8
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, day) float64 28.62 11.51 18.14 2.756 16.41 24.21 ...
    precipitation   (x, y, day) float64 3.398 9.667 5.833 4.238 1.67 6.591 ...

Comparing the two outputs shows that the dimension coordinate has changed from time to day.
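One practical consequence, sketched on a small made-up dataset: once day is the dimension coordinate, it can be used for label-based selection, while time remains available as an ordinary coordinate.

```python
import pandas as pd
import xarray as xr

small = xr.Dataset(
    {"temperature": ("time", [10.0, 12.0, 14.0])},
    coords={
        "time": pd.date_range("2014-09-06", periods=3),
        "day": ("time", [6, 7, 8]),
    },
)

by_day = small.swap_dims({"time": "day"})

# Label-based selection along the new dimension coordinate.
selected = by_day.temperature.sel(day=7)
print(float(selected))
```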

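This article is part of the Tencent Cloud self-media sharing program and was shared from a WeChat official account.

Originally published: 2017-08-23. In case of infringement, contact cloudcommunity@tencent.com for removal.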
