文章/答案/技术大牛

发布

社区首页 >问答首页 >用于查找大于当前元素的第一个未来元素的Numpy矢量化算法

问用于查找大于当前元素的第一个未来元素的Numpy矢量化算法
EN

Stack Overflow用户

提问于 2011-10-26 06:52:31

回答 2查看 1.1K关注 0票数 4

我有一个时间序列A，我想生成另一个时间序列B，这样

Bi = j，其中j是大于i的第一个索引，使得Aj > Ai。

在numpy中有没有一种快速的方法来做到这一点？

谢谢。

编辑:最好只使用O(n)的空间。

python

numpy

vectorization

回答 2

Stack Overflow用户

回答已采纳

发布于 2011-10-26 07:11:46

测试不足，使用风险自负。

import numpy

a = numpy.random.random(100)

# a_by_a[i,j] = a[i] > a[j]
a_by_a = a[numpy.newaxis,:] > a[:,numpy.newaxis]
# by taking the upper triangular, we ignore all cases where i < j
a_by_a = numpy.triu(a_by_a)
# argmax will give the first index with the highest value (1 in this case)
print numpy.argmax(a_by_a, axis = 1)

较低的内存版本：

a = numpy.random.random(100)

@numpy.vectorize
def values(i):
    try:
        return (a[i:] > a[i]).nonzero()[0][0] + i
    except IndexError:
        return -1 # no valid values found

b = values(numpy.arange(100))

更快的版本：

@np.vectorize
def values(i):
    try:
        return next(idx for idx, value in enumerate(A[i+1:]) if value > A[i]) + i + 1
    except StopIteration:
        return -1 # no valid values found
return values(np.arange(len(A)))

更快的版本：

def future6(A):
    # list of tuples (index into A, value in A)
    # invariant: indexes and values in sorted order
    known = []
    result = []
    for idx in xrange(len(A) - 1, -1, -1):
        value = A[idx]
        # since known is sorted a binary search could be applied here
        # I haven't bothered

        # anything lower then the current value
        # cannot possibly be used again, since this value will be first index instead
        # of those
        known = [(x,y) for x,y in known if y > value]


        if known: 
            # all values in known are > current value
            # they reverse sorted by index               
            # the value with the lowest index is first
            result.append(known[-1][0])
        else:
            # no values exist this high, report -1
            result.append(-1)
        # add to end of the list to maintain invariant
        known.append( (idx, value) )

    # let numpy worry about reversing the array
    return np.array(result)[::-1]

这里使用的一些想法归功于半机械人。

算法差分

cyborg显示了不同算法之间的显着差异，这取决于输入的数据。我从运行的算法中收集了一些统计数据，看看是否可以弄清楚发生了什么。

随机数据：

Average distance between value and its target: 9
Average length of ascends list: 24
Average length of segment in ascends list: 1.33
Average length of known list: 9.1

由于列表非常短，因此递增算法大多衰减为线性搜索。它确实清除了不能在未来使用的上升，所以它仍然比线性搜索要好得多。

振荡：

Average distance between value and its target: 31.46
Average length of ascends list: 84
Average length of segment in ascends list: 1.70
Average length of known list: 57.98

这种振荡倾向于将不同的部分分开得更远。这自然会阻碍线性搜索算法。这两种“更聪明”的算法都必须跟踪额外的数据。我的算法衰减很大，因为它每次都会扫描数据。ascends算法涉及的数据较少，并且执行得更好。

升序：

Average distance between value and its target: 2.57
Average length of ascends list: 40
Average length of segment in ascends list: 3.27
Average length of known list: 3037.97

这就是为什么我的算法会有问题，它必须跟踪大量的升级值。值和目标之间的短距离解释了线性搜索的良好性能。Ascends仍然不能处理很长的片段。

更好的算法

我的算法没有理由必须对数据进行线性搜索。它是经过排序的，我们只需要从列表的末尾删除较小的值。

def future6(A):
    # list of tuples (index into A, value in A)
    # invariant: indexes and values in sorted order
    known = []
    result = []
    for idx in xrange(len(A) - 1, -1, -1):
        value = A[idx]
        # since known is sorted a binary search could be applied here
        # I haven't bothered

        # anything lower then the current value
        # cannot possibly be used again, since this value will be first index instead
        # of those
        while known and known[-1][1] < value:
            known.pop()


        if known: 
            # all values in known are > current value
            # they reverse sorted by index               
            # the value with the lowest index is first
            result.append(known[-1][0])
        else:
            # no values exist this high, report -1
            result.append(-1)
        # add to end of the list to maintain invariant
        known.append( (idx, value) )

    # let numpy worry about reversing the array
    return np.array(result)[::-1]

但在我看来，我们可以重用以前计算出的B的值，而不是构造新的数据结构。如果j> i，Ai > Aj，则Bi > Bj。

def future8(A):
    B = [-1] * len(A)
    for index in xrange(len(A)-2, -1, -1):
        target = index + 1
        value = A[index]
        while target != -1 and A[target] < value:
            target = B[target]
        B[index] = target
    return np.array(B)

我的基准测试结果：

Random series:
future2 ascends  : 0.242569923401
future6 full list: 0.0363488197327
future7 vectorize: 0.129994153976
future8 reuse: 0.0299410820007
Oscillating series:
future2 ascends  : 0.233623981476
future6 full list: 0.0360488891602
future7 vectorize: 1.19140791893
future8 reuse: 0.0297570228577
Ascending trend series:
future2 ascends  : 0.120707035065
future6 full list: 0.0314049720764
future7 vectorize: 0.0640320777893
future8 reuse: 0.0246520042419

升序部分

赛博格有一个非常有趣的想法，就是利用提升的部分。我不认为他的任何测试用例真正展示了他在那之后的行为。我不认为这些部分足够长，无法充分利用。但我认为真正的数据可能会有这样的部分，所以利用它将会非常有帮助。

但我不认为它会起作用。准备进行二进制搜索所需的数据需要O(n)时间。如果我们多次执行二进制搜索，这将是很好的，但是一旦我们在升序部分的中间找到一个值，我们就永远不会重新访问左边的任何东西。因此，即使使用二进制搜索，我们处理数据的时间也最多为O(n)。

如果构建必要的数据比稍后扫描上升部分的成本更低，那么它可以工作。但扫描相当便宜，你很难想出一种更便宜的方法来处理升降节。

票数 9

Stack Overflow用户

发布于 2011-10-27 05:03:17

@Winston Ewert的future8方法是O(n) (!)，比我们之前所有的建议都要好。为了证明它是O(n)，观察内部while循环对任何B[target]值最多执行一次。

我以前的回答是：

以下是三种方法的基准(在@Winston Ewert和我之间的乒乓球之后)：

使用二进制搜索的

升序列表。(future2)
Full list (future6，@Winston Ewert的future7，@Winston Ewert 's future5的增强版本)。

在不同的条件下，每种方法都比其他方法快得多。如果序列是随机的，那么“完整列表”(future6)是最快的。如果序列振荡，那么“升序列表”(future2)是最快的。如果序列趋于上升，那么矢量化(future7)是最快的。

如果数据是股票行情，我会选择“矢量化”(future7)，因为股票有上升的趋势，而且它很简单，在所有条件下都表现合理。

输出：

Random series:
future2 ascends  : 0.210215095646
future6 full list: 0.0920153693145
future7 vectorize: 0.138747922771
Oscillating series:
future2 ascends  : 0.208349650159
future6 full list: 0.940276050999
future7 vectorize: 0.597290143496
Ascending trend series:
future2 ascends  : 0.131106233627
future6 full list: 20.7201531342
future7 vectorize: 0.0540951244451

代码：

import numpy as np
import time 
import timeit

def future2(A):    
    def reverse_enum(L):
        for index in reversed(xrange(len(L))):
            yield len(L)-index-1, L[index]
    def findnext(x, A, ascends): # find index of first future number greater than x
        for idx, segment in reverse_enum(ascends):
            joff=A[segment[0]:segment[1]+1].searchsorted(x,side='right') # binary search
            if joff < (segment[1]-segment[0]+1):
                j=segment[0]+joff
                [ascends.pop() for _ in range(idx)] # delete previous segments
                segment[0]=j # cut beginning of segment 
                return j
        return -1
    B = np.arange(len(A))+1
    # Note: B[i]=-1 where there is no greater value in the future.
    B[-1] = -1 # put -1 at the end
    ascends = [] # a list of pairs of indexes, ascending segments of A
    maximum = True
    for i in xrange(len(A)-2,-1,-1): # scan backwards
        #print(ascends)
        if A[i] < A[i+1]:
            if maximum:
                ascends.append([i+1,i+1])
                maximum = False
            else:
                ascends[-1][0] = i+1
        else:# A[i] >= A[i+1]
            B[i] = findnext(A[i], A, ascends)
            maximum = True
    return B


def future6(A):
    # list of tuples (index into A, value in A)
    # invariant: indexes and values in sorted order
    known = []
    result = []
    for idx in xrange(len(A) - 1, -1, -1):
        value = A[idx]
        # since known is sorted a binary search could be applied here
        # I haven't bothered

        # anything lower then the current value
        # cannot possibly be used again, since this value will be first index instead
        # of those
        known = [(x,y) for x,y in known if y > value]


        if known: 
            # all values in known are > current value
            # they reverse sorted by index               
            # the value with the lowest index is first
            result.append(known[-1][0])
        else:
            # no values exist this high, report -1
            result.append(-1)
        # add to end of the list to maintain invariant
        known.append( (idx, value) )

    # let numpy worry about reversing the array
    return np.array(result)[::-1]


def future7(A):
    @np.vectorize
    def values(i):
        for idx, v in enumerate(A[i+1:]): # loop is faster than genexp with exception
            if A[i]<v:
                return idx+i+1
        return -1
    return values(np.arange(len(A)))

if __name__ == '__main__':
    print('Random series:')
    tsetup = """import future; import numpy; A = numpy.random.random(1e4)"""
    t = timeit.timeit('future.future2(A)', tsetup, number=3)
    print('future2 ascends  : '+str(t))
    t = timeit.timeit('future.future6(A)', tsetup, number=3)
    print('future6 full list: '+str(t))
    t = timeit.timeit('future.future7(A)', tsetup, number=3)
    print('future7 vectorize: '+str(t))

    print('Oscillating series:')
    tsetup = """import future; import numpy; A = numpy.random.randint(1e5,size=1e4)-5e4; A = A.cumsum()"""
    t = timeit.timeit('future.future2(A)', tsetup, number=3)
    print('future2 ascends  : '+str(t))
    t = timeit.timeit('future.future6(A)', tsetup, number=3)
    print('future6 full list: '+str(t))
    t = timeit.timeit('future.future7(A)', tsetup, number=3)
    print('future7 vectorize: '+str(t))

    print('Ascending trend series:')
    tsetup = """import future; import numpy; A = numpy.random.randint(1e5,size=1e4)-3e4; A = A.cumsum()"""
    t = timeit.timeit('future.future2(A)', tsetup, number=3)
    print('future2 ascends  : '+str(t))
    t = timeit.timeit('future.future6(A)', tsetup, number=3)
    print('future6 full list: '+str(t))
    t = timeit.timeit('future.future7(A)', tsetup, number=3)
    print('future7 vectorize: '+str(t))

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/7896812

复制

相似问题

问用于查找大于当前元素的第一个未来元素的Numpy矢量化算法
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用于查找大于当前元素的第一个未来元素的Numpy矢量化算法EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用于查找大于当前元素的第一个未来元素的Numpy矢量化算法
EN