问在Python中高效地获取直方图柱的索引
EN

Stack Overflow用户

提问于 2014-11-06 23:54:51

回答 5查看 13.3K关注 0票数 27

简短问题

我有一个很大的10000x10000个元素的图像，我把它放在几百个不同的扇区/框中。然后，我需要对每个bin中包含的值执行一些迭代计算。

如何提取每个bin的索引，以便使用bins值有效地执行计算？

我正在寻找的是一种解决方案，它可以避免每次都要从我的大型数组中选择ind == j的瓶颈。有没有一种方法可以一次性直接获得属于每个bin的元素的索引？

详细说明

1.简单的解决方案

实现我所需的一种方法是使用如下代码(参见THIS相关答案)，其中我将我的值数字化，然后使用j循环选择等于j的数字化索引，如下所示

import numpy as np

# This function func() is just a placemark for a much more complicated function.
# I am aware that my problem could be easily sped up in the specific case of
# of the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y

vals = np.random.random(1e8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

result = [func(vals[ind == j]) for j in range(1, nbins)]

这不是我想要的，因为它每次都从我的大型数组中选择ind == j。这使得这个解决方案非常低效和缓慢。

2.使用binned_statistics

对于用户定义函数的一般情况，上述方法与在scipy.stats.binned_statistic中实现的方法相同。直接使用Scipy可以获得相同的输出，如下所示

import numpy as np
from scipy.stats import binned_statistics

vals = np.random.random(1e8)
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]

3.使用labeled_comprehension

另一种Scipy替代方案是使用scipy.ndimage.measurements.labeled_comprehension。使用该函数，上面的示例将变成

import numpy as np
from scipy.ndimage import labeled_comprehension

vals = np.random.random(1e8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

result = labeled_comprehension(vals, ind, np.arange(1, nbins), func, float, 0)

不幸的是，这种形式的效率也很低，尤其是与我的原始示例相比，它没有速度优势。

4.与IDL语言的比较

为了进一步澄清，我正在寻找的功能相当于IDL语言HERE的HISTOGRAM函数中的REVERSE_INDICES关键字。这个非常有用的功能可以在Python中高效地复制吗？

具体地说，使用IDL语言，上面的例子可以写成

vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)

for j=0, nbins-1 do begin
    jbins = r[r[j]:r[j+1]-1]  ; Selects indices of bin j
    result[j] = func(vals[jbins])
endfor

上面的IDL实现比Numpy实现快大约10倍，这是因为不必为每个bin选择bin的索引。并且支持IDL实现的速度差异随着箱的数量的增加而增加。

python

numpy

scipy

回答 5

Stack Overflow用户

回答已采纳

发布于 2014-11-12 21:29:01

我发现一个特殊的稀疏矩阵构造器可以非常有效地实现所需的结果。它有点晦涩难懂，但我们可以滥用它来达到这个目的。下面的函数的使用方法与scipy.stats.binned_statistic几乎相同，但速度要快几个数量级

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic''' 

    N = len(values)
    r0, r1 = range

    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

我避免使用np.digitize，因为它没有使用所有bin都相同宽度的事实，因此速度很慢，但我使用的方法可能不能很好地处理所有边缘情况。

票数 9

Stack Overflow用户

发布于 2014-11-07 00:37:19

我假设不能更改本例中使用digitize完成的绑定。这是一种方法，您可以一劳永逸地进行排序。

vals = np.random.random(1e4)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

new_order = argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post)
# _,first_hit = unique(ind,return_index=True)
# faster way:
first_hit = searchsorted(ind,arange(1,nbins-1))
first_hit.sort()

#example of using the data:
for j in range(nbins-1):
    #I am using a plotting function for your f, to show that they cluster
    plot(ordered_vals[first_hit[j]:first_hit[j+1]],'o')

如图所示，bin实际上是预期的集群：

票数 4

Stack Overflow用户

发布于 2014-11-07 00:52:08

通过先对数组进行排序，然后使用np.searchsorted，可以将计算时间减半。

vals = np.random.random(1e8)
vals.sort()

nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

results = [func(vals[np.searchsorted(ind,j,side='left'):
                     np.searchsorted(ind,j,side='right')])
           for j in range(1,nbins)]

使用1e8作为我的测试用例，我的计算时间从34秒缩短到了17秒。

票数 3

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/26783719

复制

相似问题

问在Python中高效地获取直方图柱的索引
EN

回答 5

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在Python中高效地获取直方图柱的索引EN

回答 5

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在Python中高效地获取直方图柱的索引
EN