Optimizing code to find the median of values from the past 4 to 6 days for each row in a DataFrame

Stack Overflow user
Asked on 2017-05-27 05:50:06
Answers: 3, Views: 309, Followers: 0, Votes: 3

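Given timestamped data, I want to compute the median of some variable over the past 4 to 6 days. The median over the past 1 to 3 days can be computed with pandas.DataFrame.rolling, but I cannot find a way to use rolling to compute the median over the past 4 to 6 days.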
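For reference, the part that already works can be sketched with a time-based rolling window. This is a minimal illustration on made-up daily data, not the asker's code; the names `toy` and `recent_median` are hypothetical. Note that a '3D' window ending at a row includes that row itself.

```python
import pandas as pd

# Hypothetical toy data: one value per day for five days
toy = pd.DataFrame({
    "timestamp": pd.date_range("2011-01-01", periods=5, freq="D"),
    "var": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Time-based window: each row sees the rows in (t - 3 days, t], itself included
recent_median = toy.set_index("timestamp")["var"].rolling("3D").median()
print(recent_median.tolist())
```

The window always ends at the current row, which is why the 4 to 6 day range needs something more than a plain call to rolling.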

import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')  # already datetime64, no cast needed
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

The data looks like this. In my real data there are gaps in time, and there may be more data points within a single day.

              timestamp       var
0   2011-01-01 00:00:00  1.624345
1   2011-01-01 06:00:00 -0.611756
2   2011-01-01 12:00:00 -0.528172
3   2011-01-01 18:00:00 -1.072969
4   2011-01-02 00:00:00  0.865408
5   2011-01-02 06:00:00 -2.301539
6   2011-01-02 12:00:00  1.744812
7   2011-01-02 18:00:00 -0.761207
8   2011-01-03 00:00:00  0.319039
9   2011-01-03 06:00:00 -0.249370
10  2011-01-03 12:00:00  1.462108

Expected output:

              timestamp       var  past4d-6d_var_median
0   2011-01-01 00:00:00  1.624345                   NaN # no data in past 4-6 days
1   2011-01-01 06:00:00 -0.611756                   NaN # no data in past 4-6 days
2   2011-01-01 12:00:00 -0.528172                   NaN # no data in past 4-6 days
3   2011-01-01 18:00:00 -1.072969                   NaN # no data in past 4-6 days
4   2011-01-02 00:00:00  0.865408                   NaN # no data in past 4-6 days
5   2011-01-02 06:00:00 -2.301539                   NaN # no data in past 4-6 days
6   2011-01-02 12:00:00  1.744812                   NaN # no data in past 4-6 days
7   2011-01-02 18:00:00 -0.761207                   NaN # no data in past 4-6 days
8   2011-01-03 00:00:00  0.319039                   NaN # no data in past 4-6 days
9   2011-01-03 06:00:00 -0.249370                   NaN # no data in past 4-6 days
10  2011-01-03 12:00:00  1.462108                   NaN # no data in past 4-6 days
11  2011-01-03 18:00:00 -2.060141                   NaN # no data in past 4-6 days
12  2011-01-04 00:00:00 -0.322417                   NaN # no data in past 4-6 days
13  2011-01-04 06:00:00 -0.384054                   NaN # no data in past 4-6 days
14  2011-01-04 12:00:00  1.133769                   NaN # no data in past 4-6 days
15  2011-01-04 18:00:00 -1.099891                   NaN # no data in past 4-6 days
16  2011-01-05 00:00:00 -0.172428                   NaN # only 4 data in past 4-6 days
17  2011-01-05 06:00:00 -0.877858             -0.528172
18  2011-01-05 12:00:00  0.042214             -0.569964
19  2011-01-05 18:00:00  0.582815             -0.528172
20  2011-01-06 00:00:00 -1.100619             -0.569964
21  2011-01-06 06:00:00  1.144724             -0.528172
22  2011-01-06 12:00:00  0.901591             -0.388771
23  2011-01-06 18:00:00  0.502494             -0.249370

My current code is:

def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days
    for i in range(len(df)):  # range replaces the Python 2 xrange
        # Time offset of every row relative to row i
        delta = df['timestamp'] - df['timestamp'].loc[i]
        # Keep rows more than `window` days and at most 2 * `window` days before row i
        mask = (delta < datetime.timedelta(days=-window)) & (delta >= datetime.timedelta(days=-window * 2))
        pastVar2 = df.loc[mask, var]
        if pastVar2.shape[0] >= 5:  # at least 5 data points
            if method == 'median':
                df.loc[i, 'past{}d-{}d_{}_median'.format(window + 1, window * 2, var)] = np.median(pastVar2.values)
    return df

Current speed:

In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop
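One way to avoid rebuilding a boolean mask over the whole frame for every row is to sort the timestamps once and locate each window with a binary search. The following is only a sketch under the same window bounds and 5-point minimum as the loop above; `past_window_median`, `min_points`, and `sample` are hypothetical names, not part of the original post.

```python
import numpy as np
import pandas as pd

def past_window_median(df, var="var", window=3, min_points=5):
    """Median of `var` over rows dated more than `window` days and at most
    2 * `window` days before each row (hypothetical helper, not from the post)."""
    ts = df["timestamp"].to_numpy()
    vals = df[var].to_numpy(dtype=float)
    order = np.argsort(ts, kind="stable")  # sort once so searchsorted is valid
    ts_sorted, vals_sorted = ts[order], vals[order]
    w = np.timedelta64(window, "D")
    # For row i the window is t_i - 2w <= t < t_i - w, the same bounds as the loop
    lo = np.searchsorted(ts_sorted, ts - 2 * w, side="left")
    hi = np.searchsorted(ts_sorted, ts - w, side="left")
    out = np.full(len(df), np.nan)
    for i in range(len(df)):
        if hi[i] - lo[i] >= min_points:  # same "at least 5 points" rule
            out[i] = np.median(vals_sorted[lo[i]:hi[i]])
    return pd.Series(out, index=df.index)

# Usage on data built like the question's sample (6-hourly, seed 1)
np.random.seed(1)
sample = pd.DataFrame({
    "timestamp": pd.Timestamp("2011-01-01") + pd.to_timedelta(np.arange(100) * 6, unit="h"),
})
sample["var"] = np.random.randn(len(sample))
sample["past4d-6d_var_median"] = past_window_median(sample)
```

There is still a Python loop, but each iteration only slices a contiguous range, and nothing here assumes evenly spaced timestamps.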

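I have edited this post to show the expected output clearly, requiring at least 5 data points. I have set the random seed so that everyone should get the same input and be able to show the same output. As far as I know, a simple rolling plus shift does not work when there are multiple data points in the same day.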
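The remark about rolling plus shift can be illustrated on a small made-up series: a positional shift counts rows, not days, so with several points per day it covers less time than intended. This is a sketch with hypothetical names (`s`, `by_rows`, `by_time`), contrasting it with a shift by a time offset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: eight points spaced 6 hours apart
idx = pd.Timestamp("2011-01-01") + pd.to_timedelta(np.arange(8) * 6, unit="h")
s = pd.Series(np.arange(8.0), index=idx)

by_rows = s.shift(3)                        # moves values down 3 rows, i.e. 18 hours here
by_time = s.shift(freq=pd.Timedelta("1D"))  # moves the index labels forward by 1 day

print(by_rows.index[3] - s.index[0])  # time gap actually covered by a 3-row shift
print(by_time.index[0])               # first label after the time-based shift
```

With irregular gaps, the amount of time a row-based shift covers also varies from row to row.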

3 Answers

Stack Overflow user

Accepted answer

Posted on 2017-06-05 03:11:22

Here we go:

df.set_index('timestamp', inplace = True)
df['var'] =df['var'].rolling('3D', min_periods = 3).median().shift(freq = pd.Timedelta('4d')).shift(-1)

df['var'] 
Out[55]: 
timestamp
2011-01-01 00:00:00         NaN
2011-01-01 06:00:00         NaN
2011-01-01 12:00:00         NaN
2011-01-01 18:00:00         NaN
2011-01-02 00:00:00         NaN
2011-01-02 06:00:00         NaN
2011-01-02 12:00:00         NaN
2011-01-02 18:00:00         NaN
2011-01-03 00:00:00         NaN
2011-01-03 06:00:00         NaN
2011-01-03 12:00:00         NaN
2011-01-03 18:00:00         NaN
2011-01-04 00:00:00         NaN
2011-01-04 06:00:00         NaN
2011-01-04 12:00:00         NaN
2011-01-04 18:00:00         NaN
2011-01-05 00:00:00         NaN
2011-01-05 06:00:00   -0.528172
2011-01-05 12:00:00   -0.569964
2011-01-05 18:00:00   -0.528172
2011-01-06 00:00:00   -0.569964
2011-01-06 06:00:00   -0.528172
2011-01-06 12:00:00   -0.569964
2011-01-06 18:00:00   -0.528172
2011-01-07 00:00:00   -0.388771
2011-01-07 06:00:00   -0.249370
2011-01-07 12:00:00   -0.388771
Votes: 2

Stack Overflow user

Posted on 2017-06-01 06:12:25

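The way this is set up, each row of an irregular time series will have a window of a different width, so it requires an iterative approach like the one you have already started. However, things get simpler if we make the timestamps the index: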

# setup the df:
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))

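In this case I chose a 12-hour interval, but it can be whatever is available, including irregular spacing. A modified function that takes a window for the median plus an offset (here, a positive Delta looks backward) gives you the flexibility you want: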

def GetMedian(df,var='var',window='2D',Delta='3D'):
    for Ti in df.index:
        Vals=df[(df.index < Ti-pd.Timedelta(Delta)) & \
                (df.index > Ti-pd.Timedelta(Delta)-pd.Timedelta(window))]
        df.loc[Ti,'Medians']=Vals[var].median()
    return df

This is considerably faster:

%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Votes: 1

Stack Overflow user

Posted on 2017-06-05 07:42:54

min_periods should be 2 rather than 5, because the window size should not be counted (5 - 3 = 2).

import pandas as pd
import numpy as np
import datetime
np.random.seed(1)  # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')  # already datetime64, no cast needed
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

def first():
    df['past4d-6d_var_median'] = [np.nan]*3 + df.rolling(window=3, min_periods=2).median()[:-3]['var'].tolist()
    return df

%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop

My first attempt did not use shift(), but then I saw Noobie's answer.

I wrote the following version using shift(), which is much faster than the previous one.

def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df

%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop

The second version is about 4 times as fast as the first.

Both functions produce the same result, shown below:

df2 = test()
df2
                  timestamp       var   past4d-6d_var_median
    0   2011-01-01 00:00:00  1.624345                    NaN
    1   2011-01-02 00:00:00 -0.611756                    NaN
    2   2011-01-03 00:00:00 -0.528172                    NaN
    3   2011-01-04 00:00:00 -1.072969                    NaN
    4   2011-01-05 00:00:00  0.865408               0.506294
    5   2011-01-06 00:00:00 -2.301539              -0.528172
    6   2011-01-07 00:00:00  1.744812              -0.611756
    ...         ...            ...             ...
    93  2011-04-04 00:00:00 -0.638730               1.129484
    94  2011-04-05 00:00:00  0.423494               1.129484
    95  2011-04-06 00:00:00  0.077340               0.185156
    96  2011-04-07 00:00:00 -0.343854              -0.375285
    97  2011-04-08 00:00:00  0.043597              -0.375285
    98  2011-04-09 00:00:00 -0.620001               0.077340
    99  2011-04-10 00:00:00  0.698032               0.077340
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/44213415
