Q: Extracting expanding windows of a DataFrame (NumPy strides)

Stack Overflow user

Asked 2018-02-16 08:25:08

1 answer, 1.3K views, 0 followers, 3 votes

(Related to this answer)

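Given a df, I would like to take the result of df.expanding() and use .apply() to perform some multivariate operations on it (operations that involve several df columns at once over the expanding window of rows). It turns out this is not possible.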

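So, as in the answer linked above, I need to use numpy.lib.stride_tricks.as_strided on df. Unlike the linked question, however, I want to use strides to get an expanding view of my df rather than a rolling one (an expanding window has a fixed left edge, and its right edge moves progressively to the right).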

Consider this df:

import numpy
import pandas


df = pandas.DataFrame(numpy.random.normal(0, 1, [100, 2]), columns=['size_A', 'size_B']).cumsum(axis=0)

Consider the following code, which extracts rolling windows of W rows (it comes from the answer above):

def get_sliding_window(df, W):
    a = df.values                 
    s0,s1 = a.strides
    m,n = a.shape
    return numpy.lib.stride_tricks\
               .as_strided(a,shape=(m-W+1,W,n),strides=(s0,s0,s1))

roll_window = get_sliding_window(df, W = 3)
roll_window[2]
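As a cross-check on the rolling version, NumPy 1.20 and later provide numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of view with bounds checking. The sketch below is not from the linked answer, and the helper name get_sliding_window_safe is made up for illustration. It assumes a 2-D array and reorders the axes so the result has the same (m-W+1, W, n) layout as get_sliding_window.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view


def get_sliding_window_safe(a, W):
    """Rolling windows of W rows over a 2-D array, shape (m-W+1, W, n)."""
    # sliding_window_view appends the window axis last: (m-W+1, n, W).
    windows = sliding_window_view(a, window_shape=W, axis=0)
    # Swap the last two axes to get (m-W+1, W, n); this is still a view.
    return windows.transpose(0, 2, 1)


a = np.arange(12, dtype=float).reshape(6, 2)
roll = get_sliding_window_safe(a, W=3)

# Reference result built with as_strided, as in the question.
s0, s1 = a.strides
ref = as_strided(a, shape=(a.shape[0] - 3 + 1, 3, a.shape[1]), strides=(s0, s0, s1))
```

The view returned by sliding_window_view is read-only by default, which guards against accidentally writing through overlapping windows.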

Now I would like to modify get_sliding_window so that it returns expanding windows of df (instead of rolling windows):

def get_expanding_window(df):
    a = df.values                 
    s0,s1 = a.strides
    m,n = a.shape
    out = numpy.lib.stride_tricks\
               .as_strided(a, shape=(m,m,n),strides=(s0,s0,s1))
    return out

expg_window = get_expanding_window(df)
expg_window[2]

But I am not using the arguments of as_strided correctly: I cannot seem to get the right matrix, which would be something like:

[df.iloc[0:1].values ,df.iloc[0:2].values, df.iloc[0:3].values,...]
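One reason as_strided is hard to apply here is that it returns a single array with one fixed shape, whereas the expanding windows listed above have different lengths (1 row, 2 rows, 3 rows, and so on). A minimal sketch of the target output, assuming plain slicing is acceptable: basic slices of a NumPy array are views, so no data is copied. The generator name iter_expanding_windows is hypothetical.

```python
import numpy as np
import pandas as pd


def iter_expanding_windows(df):
    """Yield a[0:1], a[0:2], ..., a[0:m] as views of the underlying array."""
    a = df.to_numpy()
    for stop in range(1, a.shape[0] + 1):
        # Basic slicing returns a view, so each window shares memory with `a`.
        yield a[:stop]


df = pd.DataFrame(
    np.random.normal(0, 1, [100, 2]), columns=["size_A", "size_B"]
).cumsum(axis=0)

windows = list(iter_expanding_windows(df))
```

Each window still has to be processed in a Python loop, so this only avoids copying; it does not vectorise the per-window function.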

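Edit:

In a comment, @ThomasKühn suggested using a list comprehension. That would solve the problem, but it is too slow. How much does it cost?

Using a vector-valued function, we can compare the cost of the list comprehension with that of .expanding(). It is not small: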

numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10000)), columns=['Value'])
%timeit method_1 = numpy.array([df.Value.iloc[range(j + 1)].sum() for j in range(df.shape[0])])

which gives:

6.37 s ± 219 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared with .expanding():

%timeit method_2 = df.expanding(0).apply(lambda x: x.sum())

which gives:

35.5 ms ± 356 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
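Part of the gap may come from how each window is selected rather than from the list comprehension itself: iloc[range(j + 1)] is positional fancy indexing, while iloc[:j + 1] is a slice. The sketch below makes no timing claims. On a smaller input (n = 500 is an arbitrary size chosen to keep it quick) it only checks that a slice-based comprehension, .expanding().apply(...) and, for this particular function, cumsum all agree.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
n = 500
df = pd.DataFrame(np.random.normal(0, 1, n), columns=["Value"])

# List comprehension over slices instead of range-based positional indexing.
method_1 = np.array([df.Value.iloc[: j + 1].sum() for j in range(n)])

# Built-in expanding window; raw=True passes each window as an ndarray.
method_2 = (
    df.Value.expanding(min_periods=1)
    .apply(lambda x: x.sum(), raw=True)
    .to_numpy()
)

# For a plain sum, the expanding result is just the cumulative sum.
method_3 = df.Value.cumsum().to_numpy()
```

The cumsum shortcut only exists because the function is a sum; it does not carry over to the multi-column operation the question is really about.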

Finally, the comments on this question contain more details about the problem I am trying to solve.

python

pandas

numpy

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-02-19 12:55:50

I wrote a few functions that should all accomplish the same task but take different amounts of time to do so:

import timeit
import numpy as np
import numba as nb

x = np.random.normal(0,1,(10000,2))
def f1():
    res = [np.sum(x[:i,0] > x[i,1]) for i in range(x.shape[0])]
    return res

def f2():
    buf = np.empty(x.shape[0])
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        buf[:i] = x[:i,0] > x[i,1]
        res[i] = np.sum(buf[:i])
    return res

def f3():
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        res[i] = np.sum(x[:i,0] > x[i,1])
    return res


@nb.jit(nopython=True)
def f2_nb():
    buf = np.empty(x.shape[0])
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        buf[:i] = x[:i,0] > x[i,1]
        res[i] = np.sum(buf[:i])
    return res

@nb.jit(nopython=True)
def f3_nb():
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        res[i] = np.sum(x[:i,0] > x[i,1])
    return res

##checking that all functions give the same result:
print('checking correctness')
print(np.all(f1()==f2()))
print(np.all(f1()==f3()))
print(np.all(f1()==f2_nb()))
print(np.all(f1()==f3_nb()))

print('+'*50)
print('performance tests')
print('f1()')        
print(min(timeit.Timer(
    'f1()',
    setup = 'from __main__ import f1,x',
).repeat(7,10)))

print('-'*50)
print('f2()')
print(min(timeit.Timer(
    'f2()',
    setup = 'from __main__ import f2,x',
).repeat(7,10)))

print('-'*50)
print('f3()')
print(min(timeit.Timer(
    'f3()',
    setup = 'from __main__ import f3,x',
).repeat(7,10)))

print('-'*50)
print('f2_nb()')
print(min(timeit.Timer(
    'f2_nb()',
    setup = 'from __main__ import f2_nb,x',
).repeat(7,10)))

print('-'*50)
print('f3_nb()')
print(min(timeit.Timer(
    'f3_nb()',
    setup = 'from __main__ import f3_nb,x',
).repeat(7,10)))

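The functions differ in how they do the work, and their run times differ accordingly. The last two are copies of f2 and f3 compiled with numba. The speed test gives the following results (each figure is the best of 7 repeats of 10 calls, in seconds):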

checking correctness
True
True
True
True
++++++++++++++++++++++++++++++++++++++++++++++++++
performance tests
f1()
2.02294262702344
--------------------------------------------------
f2()
3.0964318679762073
--------------------------------------------------
f3()
1.9573561699944548
--------------------------------------------------
f2_nb()
1.3796060049789958
--------------------------------------------------
f3_nb()
0.48667875200044364

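As you can see, the differences are not huge, but the fastest function (f3_nb, 0.49 s) is still about 6 times faster than the slowest (f2, 3.10 s). Hope this helps.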
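For moderate n, the same count can also be sketched without an explicit Python loop by comparing all pairs at once and masking out the entries that fall outside each expanding window. This is not one of the functions timed above, and no speed claim is made for it. The name f_broadcast is hypothetical, and the approach assumes an n-by-n boolean array fits in memory (roughly 100 MB for n = 10000).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (1000, 2))


def f3(x):
    """Loop version, same logic as f3 above but taking x as an argument."""
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        res[i] = np.sum(x[:i, 0] > x[i, 1])
    return res


def f_broadcast(x):
    """Hypothetical loop-free variant; needs O(n**2) memory."""
    # cmp[i, j] is True where x[j, 0] > x[i, 1].
    cmp = x[None, :, 0] > x[:, None, 1]
    # Keep only j < i (strictly below the diagonal), then count per row.
    return np.tril(cmp, k=-1).sum(axis=1)


res_loop = f3(x)
res_vec = f_broadcast(x)
```

Whether this beats the numba loop depends on n and available memory, so it is worth timing on the real data before adopting it.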

3 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/48822715
