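Time-series data is a sequence of observations of one indicator, recorded in chronological order. All values in the same series must be measured on the same basis so that they are comparable. A series can hold period values (amounts accumulated over an interval) or point-in-time values (levels observed at an instant).
The goal of time-series analysis is to identify the statistical properties and regularities of the in-sample series, build a time-series model from them, and forecast out of sample.
Now let's learn how to handle time-series data with Pandas.
Contents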
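1. Creating time series
1.1. Four types of time variables
1.2. Creating time points
1.3. DateOffset objects
2. Indexing and attributes of time series
2.1. Index slicing
2.2. Subset indexing
2.3. Attributes of time points
3. Resampling
3.1. Basic operations on resample objects
3.2. Aggregation after resampling
3.3. Iterating over resample groups
4. Window functions
4.1. Rolling
4.2. Expanding
5. Questions and exercises
5.1. Questions
5.2. Exercises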
import pandas as pd
import numpy as np
pd.to_datetime('2020.1.1')
pd.to_datetime('2020 1.1')
pd.to_datetime('2020 1 1')
pd.to_datetime('2020 1-1')
pd.to_datetime('2020-1 1')
pd.to_datetime('2020-1-1')
pd.to_datetime('2020/1/1')
pd.to_datetime('1.1.2020')
pd.to_datetime('1.1 2020')
pd.to_datetime('1 1 2020')
pd.to_datetime('1 1-2020')
pd.to_datetime('1-1 2020')
pd.to_datetime('1-1-2020')
pd.to_datetime('1/1/2020')
pd.to_datetime('20200101')
pd.to_datetime('2020.0101')
Timestamp('2020-01-01 00:00:00')
# These strings are not parsed unless an explicit format is supplied:
#pd.to_datetime('2020\\1\\1')
#pd.to_datetime('2020`1`1')
#pd.to_datetime('2020.1 1')
#pd.to_datetime('1 1.2020')
pd.to_datetime('2020\\1\\1',format='%Y\\%m\\%d')
pd.to_datetime('2020`1`1',format='%Y`%m`%d')
pd.to_datetime('2020.1 1',format='%Y.%m %d')
pd.to_datetime('1 1.2020',format='%d %m.%Y')
Timestamp('2020-01-01 00:00:00')
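The calls above rely on an explicit `format` string. As a minimal sketch of that idea, the snippet below parses one unusual separator and also shows `errors='coerce'`, which is not used in the original text and is included here only as an illustration of how unparseable input can be turned into `NaT`. The variable names `parsed` and `failed` are made up for this example.

```python
import pandas as pd

# An explicit format handles separators that automatic inference rejects.
parsed = pd.to_datetime('2020`1`1', format='%Y`%m`%d')
print(parsed)

# errors='coerce' yields NaT for input that cannot be parsed at all.
failed = pd.to_datetime('not a date', errors='coerce')
print(pd.isna(failed))
```

Using `errors='coerce'` is a design choice: it trades a loud failure for a missing value, so it is best paired with a later check for `NaT`.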
pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2']))
type(pd.to_datetime(['2020/1/1','2020/1/2']))
pandas.core.indexes.datetimes.DatetimeIndex
df = pd.DataFrame({'year': [2020, 2020],'month': [1, 1], 'day': [1, 2]})
pd.to_datetime(df)
pd.to_datetime('2020/1/1 00:00:00.123456789')
Timestamp('2020-01-01 00:00:00.123456789')
pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
pd.date_range(start='2020/1/1',end='2020/1/10',periods=3)
pd.date_range(start='2020/1/1',end='2020/1/10',freq='D')
pd.date_range(start='2020/1/1',periods=3,freq='D')
pd.date_range(end='2020/1/3',periods=3,freq='D')
pd.date_range(start='2020/1/1',periods=3,freq='min')
pd.date_range(start='2020/1/1',periods=3,freq='M')
pd.date_range(start='2020/1/1',periods=3,freq='BYS')
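Of the four arguments `start`, `end`, `periods` and `freq`, `date_range` expects exactly three and derives the fourth. The sketch below illustrates two of those combinations with small inputs so the result is easy to verify by hand. The names `even` and `back` are illustrative only.

```python
import pandas as pd

# start + end + periods: four evenly spaced points, no freq needed.
even = pd.date_range(start='2020-01-01', end='2020-01-10', periods=4)
print(even.day.tolist())   # [1, 4, 7, 10]

# end + periods + freq: counts backwards from the end point.
back = pd.date_range(end='2020-01-03', periods=3, freq='D')
print(back.day.tolist())   # [1, 2, 3]
```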
weekmask = 'Mon Tue Fri'
holidays = [pd.Timestamp('2020/1/%s'%i) for i in range(7,13)]
# Note the effect of holidays
pd.bdate_range(start='2020-1-1',end='2020-1-15',freq='C',weekmask=weekmask,holidays=holidays)
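To make the effect of `weekmask` and `holidays` concrete, here is a small self-contained sketch of the same call that prints the surviving dates. It assumes nothing beyond the arguments already shown above. The name `days` is illustrative.

```python
import pandas as pd

weekmask = 'Mon Tue Fri'                                         # only these weekdays count
holidays = [pd.Timestamp(2020, 1, day) for day in range(7, 13)]  # Jan 7 to Jan 12

days = pd.bdate_range(start='2020-01-01', end='2020-01-15', freq='C',
                      weekmask=weekmask, holidays=holidays)

# Jan 7 (Tuesday) and Jan 10 (Friday) match the weekmask but are holidays,
# so they are dropped from the result.
print(days.strftime('%Y-%m-%d').tolist())
# ['2020-01-03', '2020-01-06', '2020-01-13', '2020-01-14']
```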
ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')
ts + pd.Timedelta(days=1)
Timestamp('2020-03-30 02:00:00+0300', tz='Europe/Helsinki')
ts + pd.DateOffset(days=1)
Timestamp('2020-03-30 01:00:00+0300', tz='Europe/Helsinki')
ts = pd.Timestamp('2020-3-29 01:00:00')
ts + pd.Timedelta(days=1)
Timestamp('2020-03-30 01:00:00')
ts + pd.DateOffset(days=1)
Timestamp('2020-03-30 01:00:00')
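The outputs above differ only for the timezone-aware timestamp, because Helsinki switched to daylight saving time on 2020-03-29. A minimal sketch of the distinction: `Timedelta` adds absolute elapsed time, while `DateOffset` moves the calendar date and keeps the wall-clock time. The names `aware`, `abs_shift` and `cal_shift` are illustrative.

```python
import pandas as pd

aware = pd.Timestamp('2020-03-29 01:00:00', tz='Europe/Helsinki')

abs_shift = aware + pd.Timedelta(days=1)    # exactly 24 hours later
cal_shift = aware + pd.DateOffset(days=1)   # same wall-clock time on the next day

print(abs_shift - aware)   # 1 days 00:00:00
print(cal_shift - aware)   # 0 days 23:00:00
```

Without a timezone there is no clock change to cross, so both forms give the same answer.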
pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2)
Timestamp('2019-12-18 00:20:00')
pd.Timestamp('2020-01-01') + pd.offsets.Week(2)
Timestamp('2020-01-15 00:00:00')
pd.Timestamp('2020-01-01') + pd.offsets.BQuarterBegin(1)
Timestamp('2020-03-02 00:00:00')
pd.Series([i + pd.offsets.BYearBegin(3) for i in pd.date_range('20200101',periods=3,freq='Y')])
pd.date_range('20200101',periods=3,freq='Y') + pd.offsets.BYearBegin(3)
pd.Series([i + pd.offsets.CDay(3,weekmask='Wed Fri',holidays='2020010')
           for i in pd.date_range('20200105',periods=3,freq='D')])
rng = pd.date_range('2020','2021', freq='W')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.head()
ts['2020-01-26']
-0.47982974619679947
ts['2020-01-26':'20200726'].head()
ts['2020-7'].head()
ts['2011-1':'20200726'].head()
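The lookups above use partial string indexing: a string such as `'2020-7'` selects every entry in that month, and a string slice includes both endpoints. The sketch below repeats the idea on deterministic data so the result can be checked by hand. It uses `.loc`, and the names `ts_demo`, `july` and `window` are illustrative rather than part of the original code.

```python
import pandas as pd

rng = pd.date_range('2020-01-01', '2020-12-31', freq='W')  # weekly, anchored on Sundays
ts_demo = pd.Series(range(len(rng)), index=rng)

july = ts_demo.loc['2020-07']                     # every entry in July 2020
window = ts_demo.loc['2020-01-26':'2020-02-09']   # both endpoints are included

print(july.index.day.tolist())   # [5, 12, 19, 26]
print(len(window))               # 3
```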
pd.Series(ts.index).dt.isocalendar().week.head()
pd.Series(ts.index).dt.day.head()
pd.Series(ts.index).dt.strftime('%Y-sep1-%m-sep2-%d').head()
pd.date_range('2020','2021', freq='W').month
pd.date_range('2020','2021', freq='W').weekday
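A `DatetimeIndex` exposes its date attributes directly, while a Series of timestamps goes through the `.dt` accessor. As a small illustration on three known Sundays (the name `idx` is made up for this sketch):

```python
import pandas as pd

idx = pd.date_range('2020-01-05', periods=3, freq='W')

print(idx.weekday.tolist())                # [6, 6, 6]  (Monday=0, Sunday=6)
print(idx.day_name().tolist())             # ['Sunday', 'Sunday', 'Sunday']
print(idx.isocalendar()['week'].tolist())  # [1, 2, 3]
```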
df_r = pd.DataFrame(np.random.randn(1000, 3),index=pd.date_range('1/1/2020', freq='S', periods=1000),
columns=['A', 'B', 'C'])
r = df_r.resample('3min')
r
r.sum()
df_r2 = pd.DataFrame(np.random.randn(200, 3),index=pd.date_range('1/1/2020', freq='D', periods=200),
columns=['A', 'B', 'C'])
r = df_r2.resample('CBMS')
r.sum()
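Because the examples above use random data, their sums cannot be checked by eye. The following sketch applies the same `resample(...).sum()` pattern to nine one-minute rows with known values. The names `df_demo` and `out` are illustrative.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=9, freq='min')
df_demo = pd.DataFrame({'A': range(9)}, index=idx)

# Each 3-minute bin is labelled by its left edge and sums three rows.
out = df_demo.resample('3min').sum()
print(out['A'].tolist())   # [3, 12, 21]
```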
3.2. Aggregation after resampling
r = df_r.resample('3min')
r['A'].mean()
r['A'].agg(['sum', 'mean', 'std'])
Similarly, named functions and lambda expressions can be passed per column:
r.agg({'A': np.sum,'B': lambda x: max(x)-min(x)})
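The dictionary form of `agg` applies a different function to each column. Here is a minimal sketch of it on small fixed data, with a sum for one column and a range (max minus min) for the other. The names `df_demo` and `out` and the sample values are assumptions made for illustration.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=6, freq='min')
df_demo = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5],
                        'B': [1, 4, 2, 8, 5, 10]}, index=idx)

out = df_demo.resample('3min').agg({'A': 'sum',
                                    'B': lambda x: x.max() - x.min()})
print(out['A'].tolist())   # [3, 12]
print(out['B'].tolist())   # [3, 5]
```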
3.3. Iterating over resample groups
small = pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
, '2020-01-01 00:31:00','2020-01-01 01:00:00'
,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
resampled = small.resample('H')
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")
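The loop above visits one `(label, group)` pair per hourly bin. As a compact cross-check, the sketch below counts the rows per bin with `size()` and also collects group lengths by iteration. The data repeats the six timestamps used above, and the names `small_demo`, `sizes` and `counts` are illustrative.

```python
import pandas as pd

small_demo = pd.Series(range(6), index=pd.to_datetime([
    '2020-01-01 00:00:00', '2020-01-01 00:30:00', '2020-01-01 00:31:00',
    '2020-01-01 01:00:00', '2020-01-01 03:00:00', '2020-01-01 03:05:00']))

# size() reports the row count of every hourly bin, including the empty 02:00 bin.
sizes = small_demo.resample('h').size()
print(sizes.tolist())   # [3, 1, 0, 2]

# Iterating yields (bin label, sub-Series) pairs; collect each group's length.
counts = {label: len(group) for label, group in small_demo.resample('h')}
print(counts[pd.Timestamp('2020-01-01 00:00:00')])   # 3
```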
s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
s.head()
s.rolling(window=50)
Rolling [window=50,center=False,axis=0]
s.rolling(window=50).mean()
s.rolling(window=50,min_periods=3).mean().head()
s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head()
s.rolling('15D').mean().head()
s.rolling('15D', closed='right').sum().head()
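A rolling window can be sized by row count or by a time offset, and the two behave differently at the start of the series. The sketch below contrasts them on five daily values. The names `s_demo`, `fixed` and `by_time` are made up for this example.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                   index=pd.date_range('2020-01-01', periods=5, freq='D'))

fixed = s_demo.rolling(window=3).mean()   # needs 3 rows, so the first two are NaN
by_time = s_demo.rolling('3D').mean()     # offset windows default to min_periods=1

print(fixed.tolist())     # [nan, nan, 2.0, 3.0, 4.0]
print(by_time.tolist())   # [1.0, 1.5, 2.0, 3.0, 4.0]
```

An offset window is usually the safer choice for irregularly spaced data, since it always covers the same span of time regardless of how many rows fall inside it.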
s.rolling(window=len(s),min_periods=1).sum().head()
s.expanding().sum().head()
s.expanding().apply(lambda x:sum(x)).head()
s.cumsum().head()
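The last few calls compute the same running total in three ways: a rolling window as long as the series with `min_periods=1`, an expanding window, and `cumsum`. A small sketch on fixed values makes the equivalence explicit. The names below are illustrative.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0, 4.0])

via_rolling = s_demo.rolling(window=len(s_demo), min_periods=1).sum()
via_expanding = s_demo.expanding().sum()
via_cumsum = s_demo.cumsum()

print(via_expanding.tolist())   # [1.0, 3.0, 6.0, 10.0]
print(np.allclose(via_rolling, via_expanding)
      and np.allclose(via_expanding, via_cumsum))   # True
```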
s.shift(2).head()
s.diff(3).head()
s.pct_change(3).head()
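`shift`, `diff` and `pct_change` are closely related: `diff(k)` equals `s - s.shift(k)`, and `pct_change(k)` equals `s / s.shift(k) - 1`. The following sketch shows this on a series that doubles at every step. The name `s_demo` is illustrative.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([10.0, 20.0, 40.0, 80.0])

print(s_demo.shift(1).tolist())        # [nan, 10.0, 20.0, 40.0]
print(s_demo.diff(1).tolist())         # [nan, 10.0, 20.0, 40.0]
print(s_demo.pct_change(1).tolist())   # [nan, 1.0, 1.0, 1.0]
```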
5.1. Questions
5.2. Exercises
[Exercise 2] Continuing with the data from the previous exercise, complete the following tasks: