数据分析篇 | Pandas 时间序列 - 日期时间索引

龙哥

发布于 2019-12-22 18:49:18

5K0

发布于 2019-12-22 18:49:18

文章被收录于专栏：Python绿色通道Python绿色通道

部字符串索引切片 vs. 精准匹配精确索引截断与花式索引日期/时间组件

DatetimeIndex 主要用作 Pandas 对象的索引。DatetimeIndex 类为时间序列做了很多优化：

预计算了各种偏移量的日期范围，并在后台缓存，让后台生成后续日期范围的速度非常快（仅需抓取切片）。
在 Pandas 对象上使用 shift 与 tshift 方法进行快速偏移。
合并具有相同频率的重叠 DatetimeIndex 对象的速度非常快（这点对快速数据对齐非常重要）。
通过 year、month 等属性快速访问日期字段。
snap 等正则函数与超快的 asof 逻辑。

DatetimeIndex 对象支持全部常规 Index 对象的基本用法，及一些列简化频率处理的高级时间序列专有方法。

参阅：重置索引注意：Pandas 不强制排序日期索引，但如果日期没有排序，可能会引发可控范围之外的或不正确的操作。

DatetimeIndex 可以当作常规索引，支持选择、切片等方法。

In [94]: rng = pd.date_range(start, end, freq='BM')

In [95]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [96]: ts.index
Out[96]: 
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')

In [97]: ts[:5].index
Out[97]: 
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31'],
              dtype='datetime64[ns]', freq='BM')

In [98]: ts[::2].index
Out[98]: 
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
               '2011-09-30', '2011-11-30'],
              dtype='datetime64[ns]', freq='2BM')

局部字符串索引

能解析为时间戳的日期与字符串可以作为索引的参数：

In [99]: ts['1/31/2011']
Out[99]: 0.11920871129693428

In [100]: ts[datetime.datetime(2011, 12, 25):]
Out[100]: 
2011-12-30    0.56702
Freq: BM, dtype: float64

In [101]: ts['10/31/2011':'12/31/2011']
Out[101]: 
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

Pandas 为访问较长的时间序列提供了便捷方法，年、年月字符串均可：

In [102]: ts['2011']
Out[102]: 
2011-01-31    0.119209
2011-02-28   -1.044236
2011-03-31   -0.861849
2011-04-29   -2.104569
2011-05-31   -0.494929
2011-06-30    1.071804
2011-07-29    0.721555
2011-08-31   -0.706771
2011-09-30   -1.039575
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

In [103]: ts['2011-6']
Out[103]: 
2011-06-30    1.071804
Freq: BM, dtype: float64

带 DatetimeIndex 的 DateFrame 也支持这种切片方式。局部字符串是标签切片的一种形式，这种切片也包含截止时点，即，与日期匹配的时间也会包含在内：

In [104]: dft = pd.DataFrame(np.random.randn(100000, 1), columns=['A'],
   .....:                    index=pd.date_range('20130101', periods=100000, freq='T'))
   .....: 

In [105]: dft
Out[105]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

In [106]: dft['2013']
Out[106]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

下列代码截取了自 1 月 1 日凌晨起，至 2 月 28 日午夜的日期与时间。

In [107]: dft['2013-1':'2013-2']
Out[107]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

下列代码截取了包含截止日期及其时间在内的日期与时间。

In [108]: dft['2013-1':'2013-2-28']
Out[108]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

下列代码指定了精准的截止时间，注意此处的结果与上述截取结果的区别：

In [109]: dft['2013-1':'2013-2-28 00:00:00']
Out[109]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-27 23:56:00  1.197749
2013-02-27 23:57:00  0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501

[83521 rows x 1 columns]

截止时间是索引的一部分，包含在截取的内容之内：

In [110]: dft['2013-1-15':'2013-1-15 12:30:00']
Out[110]: 
                            A
2013-01-15 00:00:00 -0.984810
2013-01-15 00:01:00  0.941451
2013-01-15 00:02:00  1.559365
2013-01-15 00:03:00  1.034374
2013-01-15 00:04:00 -1.480656
...                       ...
2013-01-15 12:26:00  0.371454
2013-01-15 12:27:00 -0.930806
2013-01-15 12:28:00 -0.069177
2013-01-15 12:29:00  0.066510
2013-01-15 12:30:00 -0.003945

[751 rows x 1 columns]

0.18.0 版新增。

DatetimeIndex 局部字符串索引还支持多层索引 DataFrame。

In [111]: dft2 = pd.DataFrame(np.random.randn(20, 1),
   .....:                     columns=['A'],
   .....:                     index=pd.MultiIndex.from_product(
   .....:                         [pd.date_range('20130101', periods=10, freq='12H'),
   .....:                          ['a', 'b']]))
   .....: 

In [112]: dft2
Out[112]: 
                              A
2013-01-01 00:00:00 a -0.298694
                    b  0.823553
2013-01-01 12:00:00 a  0.943285
                    b -1.479399
2013-01-02 00:00:00 a -1.643342
...                         ...
2013-01-04 12:00:00 b  0.069036
2013-01-05 00:00:00 a  0.122297
                    b  1.422060
2013-01-05 12:00:00 a  0.370079
                    b  1.016331

[20 rows x 1 columns]

In [113]: dft2.loc['2013-01-05']
Out[113]: 
                              A
2013-01-05 00:00:00 a  0.122297
                    b  1.422060
2013-01-05 12:00:00 a  0.370079
                    b  1.016331

In [114]: idx = pd.IndexSlice

In [115]: dft2 = dft2.swaplevel(0, 1).sort_index()

In [116]: dft2.loc[idx[:, '2013-01-05'], :]
Out[116]: 
                              A
a 2013-01-05 00:00:00  0.122297
  2013-01-05 12:00:00  0.370079
b 2013-01-05 00:00:00  1.422060
  2013-01-05 12:00:00  1.016331

0.25.0 版新增。

字符串索引切片支持 UTC 偏移。

In [117]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [118]: df
Out[118]: 
                           0
2019-01-01 00:00:00-08:00  0

In [119]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[119]: 
                           0
2019-01-01 00:00:00-08:00  0

切片 vs. 精准匹配

0.20.0 版新增。

基于索引的精度，字符串既可用于切片，也可用于精准匹配。字符串精度比索引精度低，就是切片，比索引精度高，则是精准匹配。

In [120]: series_minute = pd.Series([1, 2, 3],
   .....:                           pd.DatetimeIndex(['2011-12-31 23:59:00',
   .....:                                             '2012-01-01 00:00:00',
   .....:                                             '2012-01-01 00:02:00']))
   .....: 

In [121]: series_minute.index.resolution
Out[121]: 'minute'

下例中的时间戳字符串没有 Series 对象的精度高。series_minute 到秒，时间戳字符串只到分。

In [122]: series_minute['2011-12-31 23']
Out[122]: 
2011-12-31 23:59:00    1
dtype: int64

精度为分钟（或更高精度）的时间戳字符串，给出的是标量，不会被当作切片。

In [123]: series_minute['2011-12-31 23:59']
Out[123]: 1

In [124]: series_minute['2011-12-31 23:59:00']
Out[124]: 1

索引的精度为秒时，精度为分钟的时间戳返回的是 Series。

In [125]: series_second = pd.Series([1, 2, 3],
   .....:                           pd.DatetimeIndex(['2011-12-31 23:59:59',
   .....:                                             '2012-01-01 00:00:00',
   .....:                                             '2012-01-01 00:00:01']))
   .....: 

In [126]: series_second.index.resolution
Out[126]: 'second'

In [127]: series_second['2011-12-31 23:59']
Out[127]: 
2011-12-31 23:59:59    1
dtype: int64

用时间戳字符串切片时，还可以用 [] 索引 DataFrame。

In [128]: dft_minute = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
   .....:                           index=series_minute.index)
   .....: 

In [129]: dft_minute['2011-12-31 23']
Out[129]: 
                     a  b
2011-12-31 23:59:00  1  4

警告：字符串执行精确匹配时，用 [] 按列，而不是按行截取 DateFrame ，参阅索引基础。如，dft_minute ['2011-12-31 23:59'] 会触发 KeyError，这是因为 2012-12-31 23:59与索引的精度一样，但没有叫这个名字的列。

为了实现精准切片，要用 .loc 对行进行切片或选择。

In [130]: dft_minute.loc['2011-12-31 23:59']
Out[130]: 
a    1
b    4
Name: 2011-12-31 23:59:00, dtype: int64

注意，DatetimeIndex 精度不能低于日。

In [131]: series_monthly = pd.Series([1, 2, 3],
   .....:                            pd.DatetimeIndex(['2011-12', '2012-01', '2012-02']))
   .....: 

In [132]: series_monthly.index.resolution
Out[132]: 'day'

In [133]: series_monthly['2011-12']  # 返回的是 Series
Out[133]: 
2011-12-01    1
dtype: int64

精确索引

正如上节所述，局部字符串依靠时间段的精度索引 DatetimeIndex，即时间间隔与索引精度相关。反之，用 Timestamp 或 datetime 索引更精准，这些对象指定的时间更精确。注意，精确索引包含了起始时点。

就算没有显式指定，Timestamp 与datetime 也支持 hours、minutes、seconds，默认值为 0。

In [134]: dft[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 2, 28)]
Out[134]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-27 23:56:00  1.197749
2013-02-27 23:57:00  0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501

[83521 rows x 1 columns]

不用默认值。

In [135]: dft[datetime.datetime(2013, 1, 1, 10, 12, 0):
   .....:     datetime.datetime(2013, 2, 28, 10, 12, 0)]
   .....: 
Out[135]: 
                            A
2013-01-01 10:12:00  0.565375
2013-01-01 10:13:00  0.068184
2013-01-01 10:14:00  0.788871
2013-01-01 10:15:00 -0.280343
2013-01-01 10:16:00  0.931536
...                       ...
2013-02-28 10:08:00  0.148098
2013-02-28 10:09:00 -0.388138
2013-02-28 10:10:00  0.139348
2013-02-28 10:11:00  0.085288
2013-02-28 10:12:00  0.950146

[83521 rows x 1 columns]

截断与花式索引

truncate() 便捷函数与切片类似。注意，与切片返回的是部分匹配日期不同， truncate 假设 DatetimeIndex 里未标明时间组件的值为 0。

In [136]: rng2 = pd.date_range('2011-01-01', '2012-01-01', freq='W')

In [137]: ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)

In [138]: ts2.truncate(before='2011-11', after='2011-12')
Out[138]: 
2011-11-06    0.437823
2011-11-13   -0.293083
2011-11-20   -0.059881
2011-11-27    1.252450
Freq: W-SUN, dtype: float64

In [139]: ts2['2011-11':'2011-12']
Out[139]: 
2011-11-06    0.437823
2011-11-13   -0.293083
2011-11-20   -0.059881
2011-11-27    1.252450
2011-12-04    0.046611
2011-12-11    0.059478
2011-12-18   -0.286539
2011-12-25    0.841669
Freq: W-SUN, dtype: float64

花式索引返回 DatetimeIndex，但因为打乱了 DatetimeIndex 频率，丢弃了频率信息，见 freq=None：

In [140]: ts2[[0, 2, 6]].index
Out[140]: DatetimeIndex(['2011-01-02', '2011-01-16', '2011-02-13'], dtype='datetime64[ns]', freq=None)

日期/时间组件

以下日期/时间属性可以访问 Timestamp 或 DatetimeIndex。

属性	说明
year	datetime 的年
month	datetime 的月
day	datetime 的日
hour	datetime 的小时
minute	datetime 的分钟
second	datetime 的秒
microsecond	datetime 的微秒
nanosecond	datetime 的纳秒
date	返回 datetime.date（不包含时区信息）
time	返回 datetime.time（不包含时区信息）
timetz	返回带本地时区信息的 datetime.time
dayofyear	一年里的第几天
weekofyear	一年里的第几周
week	一年里的第几周
dayofweek	一周里的第几天，Monday=0, Sunday=6
weekday	一周里的第几天，Monday=0, Sunday=6
weekday_name	这一天是星期几（如，Friday）
quarter	日期所处的季节：Jan-Mar = 1 等
days_in_month	日期所在的月有多少天
is_month_start	逻辑判断是不是月初（由频率定义）
is_month_end	逻辑判断是不是月末（由频率定义）
is_quarter_start	逻辑判断是不是季初（由频率定义）
is_quarter_end	逻辑判断是不是季末（由频率定义）
is_year_start	逻辑判断是不是年初（由频率定义）
is_year_end	逻辑判断是不是年末（由频率定义）
is_leap_year	逻辑判断是不是日期所在年是不是闰年