Python-for-data-重新采样和频率转换
重新采样指的是将时间序列从一个频率转换到另一个频率的过程。
但是也并不是所有的采样方式都是属于上面的两种
pandas中使用resample方法来实现频率转换
import pandas as pd
import numpy as np
rng = pd.date_range("2020-05-10",periods=100,freq="D")
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2020-05-10 0.239122
2020-05-11 0.847263
2020-05-12 0.394896
2020-05-13 1.556826
2020-05-14 -0.612460
...
2020-08-13 0.246714
2020-08-14 1.890153
2020-08-15 -2.090757
2020-08-16 -1.076017
2020-08-17 1.139343
Freq: D, Length: 100, dtype: float64
ts.resample("M").mean() # 相当于是先根据M月份进行分组,再求平均值
2020-05-31 0.147573
2020-06-30 -0.194357
2020-07-31 -0.027795
2020-08-31 -0.030770
Freq: M, dtype: float64
ts.resample("M",kind="period").mean()
2020-05 0.147573
2020-06 -0.194357
2020-07 -0.027795
2020-08 -0.030770
Freq: M, dtype: float64
将数据聚合到一个规则的低频上,例如将时间转换为每个月,“M"或者"BM”,将数据分成一个月的时间间隔。
每个间隔是半闭合的,一个数据只能属于一个时间间隔。时间间隔的并集必须是整个时间帧
rng = pd.date_range("2020-01-01", periods=12,freq="T") # T 表示的是分钟
ts = pd.Series(np.arange(12),index=rng)
ts
2020-01-01 00:00:00 0
2020-01-01 00:01:00 1
2020-01-01 00:02:00 2
2020-01-01 00:03:00 3
2020-01-01 00:04:00 4
2020-01-01 00:05:00 5
2020-01-01 00:06:00 6
2020-01-01 00:07:00 7
2020-01-01 00:08:00 8
2020-01-01 00:09:00 9
2020-01-01 00:10:00 10
2020-01-01 00:11:00 11
Freq: T, dtype: int64
默认情况下,左箱体边界是包含的。00:00的值是00:00到00:05间隔内的值
# 通过计算每一组的加和将这些数据聚合到五分钟的块或者柱内
ts.resample("5min",closed="right").sum()
2019-12-31 23:55:00 0
2020-01-01 00:00:00 15
2020-01-01 00:05:00 40
2020-01-01 00:10:00 11
Freq: 5T, dtype: int64
产生的时间序列按照每个箱体左边的时间戳被标记。
传递label="right"可以使用右箱体边界标记时间序列
ts.resample("5min",closed="right",label="right").sum()
2020-01-01 00:00:00 0
2020-01-01 00:05:00 15
2020-01-01 00:10:00 40
2020-01-01 00:15:00 11
Freq: 5T, dtype: int64
向loffset参数传递字符串或者日期偏置
ts.resample("5min",closed="right",
label="right",loffset="-2s").sum()
2019-12-31 23:59:58 0
2020-01-01 00:04:58 15
2020-01-01 00:09:58 40
2020-01-01 00:14:58 11
Freq: 5T, dtype: int64
ts.resample("5min",closed="right",label="right").sum().index
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:05:00',
'2020-01-01 00:10:00', '2020-01-01 00:15:00'],
dtype='datetime64[ns]', freq='5T')
在金融数据中,为每个数据桶计算4个值是常见的问题:
通过ohlc聚合函数能够得到四种聚合值列的DF数据
ts.resample("5min").ohlc()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
open | high | low | close | |
---|---|---|---|---|
2020-01-01 00:00:00 | 0 | 4 | 0 | 4 |
2020-01-01 00:05:00 | 5 | 9 | 5 | 9 |
2020-01-01 00:10:00 | 10 | 11 | 10 | 11 |
frame = pd.DataFrame(np.random.randn(2,4),
index=pd.date_range("5/1/2020",periods=2,freq="W-WED"),
columns=["Colorado","Texas","New York","Ohio"])
frame
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2020-05-06 | 0.639827 | 0.306684 | 0.458653 | 0.461327 |
2020-05-13 | 1.056361 | 0.815583 | 1.627846 | 0.326976 |
低频转到高频的时候会形成缺失值
# 采用asfreq方法在不聚合的情况下,转换到高频率
df_daily = frame.resample("D").asfreq() #
df_daily
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2020-05-06 | 0.639827 | 0.306684 | 0.458653 | 0.461327 |
2020-05-07 | NaN | NaN | NaN | NaN |
2020-05-08 | NaN | NaN | NaN | NaN |
2020-05-09 | NaN | NaN | NaN | NaN |
2020-05-10 | NaN | NaN | NaN | NaN |
2020-05-11 | NaN | NaN | NaN | NaN |
2020-05-12 | NaN | NaN | NaN | NaN |
2020-05-13 | 1.056361 | 0.815583 | 1.627846 | 0.326976 |
ffill():使用前面的值填充,limit限制填充的次数
frame.resample("D").ffill(limit=3) # ffill()使用前面的值填充
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2020-05-06 | 0.639827 | 0.306684 | 0.458653 | 0.461327 |
2020-05-07 | 0.639827 | 0.306684 | 0.458653 | 0.461327 |
2020-05-08 | 0.639827 | 0.306684 | 0.458653 | 0.461327 |
2020-05-09 | 0.639827 | 0.306684 | 0.458653 | 0.461327 |
2020-05-10 | NaN | NaN | NaN | NaN |
2020-05-11 | NaN | NaN | NaN | NaN |
2020-05-12 | NaN | NaN | NaN | NaN |
2020-05-13 | 1.056361 | 0.815583 | 1.627846 | 0.326976 |
frame = pd.DataFrame(np.random.randn(24,4),
index=pd.period_range("1-2019","12-2020",freq="M"),
columns=["Colorado","Texas","New York","Ohio"])
frame[:5]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2019-01 | -1.160706 | 0.309239 | 0.847304 | 0.610020 |
2019-02 | -0.860034 | 0.153525 | 0.481263 | -1.149621 |
2019-03 | -1.506958 | -0.822806 | 0.223697 | 0.364879 |
2019-04 | -1.245177 | 1.886646 | 0.011271 | 1.074032 |
2019-05 | -0.752537 | 0.788435 | 0.277268 | -0.551638 |
annual_frame = frame.resample("A-DEC").mean()
annual_frame
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2019 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
# Q-DEC:每季度、年末在12月份
annual_frame.resample("Q-DEC").ffill()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2019Q1 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2019Q2 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2019Q3 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2019Q4 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q1 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
2020Q2 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
2020Q3 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
2020Q4 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
annual_frame.resample("Q-DEC",convention="end").ffill()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2019Q4 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q1 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q2 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q3 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q4 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
annual_frame.resample("Q-MAR").ffill()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
Colorado | Texas | New York | Ohio | |
---|---|---|---|---|
2019Q4 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q1 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q2 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q3 | -0.520804 | 0.19733 | 0.341988 | -0.107696 |
2020Q4 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
2021Q1 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
2021Q2 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
2021Q3 | -0.481252 | -0.13397 | 0.424763 | -0.014648 |
扫码关注腾讯云开发者
领取腾讯云代金券
Copyright © 2013 - 2025 Tencent Cloud. All Rights Reserved. 腾讯云 版权所有
深圳市腾讯计算机系统有限公司 ICP备案/许可证号:粤B2-20090059 深公网安备号 44030502008569
腾讯云计算(北京)有限责任公司 京ICP证150476号 | 京ICP备11018762号 | 京公网安备号11010802020287
Copyright © 2013 - 2025 Tencent Cloud.
All Rights Reserved. 腾讯云 版权所有