A typical optimization for convolution is to use the FFT of your signal. The reason is that convolution in real space is a product in Fourier space. It is often faster to compute the FFT, take the product, and apply the inverse FFT to the result than to convolve the usual way.
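To make the FFT route concrete, here is a minimal NumPy sketch. The function names `fft_convolve2d` and `direct_convolve2d` are made up for illustration and are not part of any library. It zero-pads both arrays to the full output size, multiplies the spectra, and inverts, then checks the result against a plain shift-and-add convolution.

```python
import numpy as np

def fft_convolve2d(image, kernel):
    """Full linear 2D convolution computed via the FFT."""
    # Pad both inputs to the full output size so that the circular
    # convolution implied by the FFT equals the linear one.
    out_shape = (image.shape[0] + kernel.shape[0] - 1,
                 image.shape[1] + kernel.shape[1] - 1)
    spectrum = np.fft.rfft2(image, s=out_shape) * np.fft.rfft2(kernel, s=out_shape)
    return np.fft.irfft2(spectrum, s=out_shape)

def direct_convolve2d(image, kernel):
    """Reference: full convolution by shift-and-add over the kernel taps."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h + kh - 1, w + kw - 1))
    for i in range(kh):
        for j in range(kw):
            out[i:i + h, j:j + w] += kernel[i, j] * image
    return out

rng = np.random.default_rng(0)
image = rng.random((64, 48))
kernel = rng.random((5, 5))
assert np.allclose(fft_convolve2d(image, kernel), direct_convolve2d(image, kernel))
```

For a kernel as small as 3x3 the direct route may well be faster; the FFT tends to pay off as the kernel grows.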

For the particular example 3x3 kernel, I'd observe that

<pre><code>1  1  1
1 -8  1
1  1  1

    1  1  1     0  0  0
 =  1  1  1  +  0 -9  0
    1  1  1     0  0  0
</code></pre>

and that the first of these is factorable: it can be applied by convolving with (1 1 1) along each row, and then again along each column. Then subtract nine times the original data. This may or may not be faster, depending on whether the scipy programmers made it smart enough to do this automatically. (I haven't checked in a while.)

You probably want to do more interesting convolutions, where factoring may or may not be possible.
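A small NumPy sketch of that factoring, for illustration only. The helper name `convolve_separable` is hypothetical, the edges are assumed to be zero-padded, and `np.apply_along_axis` is used for clarity rather than speed.

```python
import numpy as np

kernel = np.array([[1, 1, 1],
                   [1, -8, 1],
                   [1, 1, 1]])

# The decomposition: an all-ones box (an outer product of two 1D
# filters, hence separable) minus nine times a centred unit impulse.
ones = np.ones(3, dtype=int)
box = np.outer(ones, ones)
impulse = np.zeros((3, 3), dtype=int)
impulse[1, 1] = 1
assert np.array_equal(box - 9 * impulse, kernel)

def convolve_separable(a):
    """Apply the kernel as two 1D passes plus a scaled copy of the input.

    Edges are treated as zero-padded (np.convolve with mode="same").
    """
    a = np.asarray(a, dtype=float)
    rows = np.apply_along_axis(np.convolve, 1, a, ones, mode="same")
    both = np.apply_along_axis(np.convolve, 0, rows, ones, mode="same")
    return both - 9 * a
```

Each output pixel then costs roughly 3 + 3 additions for the two passes instead of 9 multiply-adds for the full 2D kernel, which is where any saving would come from.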

The code in scipy for doing 2d convolutions is a bit messy and unoptimized. See <a href="http://svn.scipy.org/svn/scipy/trunk/scipy/signal/firfilter.c" rel="noreferrer">http://svn.scipy.org/svn/scipy/trunk/scipy/signal/firfilter.c</a> if you want a glimpse into the low-level functioning of scipy. 

If all you want is to process with a small, constant kernel like the one you showed, a function like this might work:

<pre><code>def specialconvolve(a):
    # sorry, you must pad the input yourself
    rowconvol = a[1:-1,:] + a[:-2,:] + a[2:,:]
    colconvol = rowconvol[:,1:-1] + rowconvol[:,:-2] + rowconvol[:,2:] - 9*a[1:-1,1:-1]
    return colconvol
</code></pre>

This function takes advantage of the separability of the kernel like DarenW suggested above, as well as taking advantage of the more optimized numpy arithmetic routines. It's over 1000 times faster than the convolve2d function by my measurements.
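Since the function leaves padding to the caller, here is one way it might be wrapped. This is a sketch under assumptions: the wrapper name `specialconvolve_same`, the choice of `np.pad`, and the widening to `int32` are mine, not part of the original answer.

```python
import numpy as np

def specialconvolve(a):
    # Same slicing trick as above; the output is 2 smaller in each dimension.
    rowconvol = a[1:-1, :] + a[:-2, :] + a[2:, :]
    return (rowconvol[:, 1:-1] + rowconvol[:, :-2] + rowconvol[:, 2:]
            - 9 * a[1:-1, 1:-1])

def specialconvolve_same(a, pad_mode="constant"):
    """Pad by one pixel on every side so the output matches a.shape.

    pad_mode is passed to np.pad: "constant" pads with zeros,
    "edge" repeats the border pixels.
    """
    # Widen the dtype first: uint8 image data would wrap around in the sums.
    padded = np.pad(np.asarray(a, dtype=np.int32), 1, mode=pad_mode)
    return specialconvolve(padded)

img = np.full((5, 6), 200, dtype=np.uint8)
out = specialconvolve_same(img, pad_mode="edge")
print(out.shape)
```

Which padding mode is appropriate depends on how the image borders should be treated; zero padding produces strong responses along the edges of a bright image.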

Before going to, say, C with ctypes, I'd suggest running a standalone convolve in C, to see where the limit is.
The same goes for CUDA, cython, scipy.weave ...

Added 7feb: convolve33 on 8-bit data with clipping takes ~ 20 clock cycles per point,
2 clock cycles per mem access, on my mac g4 ppc with gcc 4.2. Your mileage will vary.

A couple of subtleties:

<ul>
<li>Do you care about correct clipping to 0..255? np.clip() is slow;
for cython etc., I don't know.</li>
<li>Numpy/scipy may need memory for temps the size of A (so keep 2*sizeof(A) &lt; cache size). 
If your C code, though, does a running update inplace, that's half the mem but a different algorithm.</li>
</ul>
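A short sketch of the clipping and temporary-memory points, assuming 8-bit input. The helper name `clip_to_uint8` is made up for illustration. Arithmetic done directly on `uint8` arrays wraps around silently, so the data is widened first, and `np.clip` writes into the existing buffer via `out=` to avoid one more image-sized temporary.

```python
import numpy as np

def clip_to_uint8(values):
    """Clamp to 0..255 in place, then convert to uint8."""
    np.clip(values, 0, 255, out=values)  # reuse the buffer, no extra temporary
    return values.astype(np.uint8)

pixels = np.array([[250, 3], [128, 255]], dtype=np.uint8)

# Arithmetic done directly in uint8 silently wraps modulo 256 ...
wrapped = pixels + pixels

# ... so widen first, compute, and only then clip back to 8 bits.
wide = pixels.astype(np.int32)
result = clip_to_uint8(wide + wide - 10)
print(wrapped)
print(result)
```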

By the way, google <a href="http://www.pylearn.org/theano/introduction.html" rel="nofollow noreferrer">theano</a> convolve =>
"A convolution op that should mimic scipy.signal.convolve2d, but faster! In development"

As of 2018, it seems the SciPy/NumPy combo has been sped up a lot. This is what I saw on my laptop (Dell Inspiron 13, i5).
OpenCV did the best, but it gives you no control over the modes.

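<pre><code>&gt;&gt;&gt; import time
&gt;&gt;&gt; import numpy as np, cv2
&gt;&gt;&gt; from scipy import signal
&gt;&gt;&gt; img = np.random.rand(1000,1000)
&gt;&gt;&gt; kernel = np.ones((3,3), dtype=float)/9.0  # np.float was removed in NumPy 1.24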
&gt;&gt;&gt; t1= time.time();dst1 = cv2.filter2D(img,-1,kernel);print(time.time()-t1)
0.0235188007355
&gt;&gt;&gt; t1= time.time();dst2 = signal.correlate(img,kernel,mode='valid',method='fft');print(time.time()-t1)
0.140458106995
&gt;&gt;&gt; t1= time.time();dst3 = signal.convolve2d(img,kernel,mode='valid');print(time.time()-t1)
0.0548939704895
&gt;&gt;&gt; t1= time.time();dst4 = signal.correlate2d(img,kernel,mode='valid');print(time.time()-t1)
0.0518119335175
&gt;&gt;&gt; t1= time.time();dst5 = signal.fftconvolve(img,kernel,mode='valid');print(time.time()-t1)
0.13204407692
</code></pre>
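Single `time.time()` measurements like these can be noisy, so a comparison along these lines may be more stable with `timeit`. The sketch below is only a harness: `best_time` is a hypothetical helper, and `box_mean_3x3` is a NumPy-only stand-in workload to be swapped for the OpenCV or SciPy call being measured.

```python
import timeit
import numpy as np

def best_time(func, repeat=5, number=3):
    """Smallest per-call time, in seconds, over several repeats."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number

img = np.random.rand(1000, 1000)

def box_mean_3x3():
    # Stand-in workload: 3x3 mean in "valid" mode via shifted slices.
    rows = img[:-2, :] + img[1:-1, :] + img[2:, :]
    return (rows[:, :-2] + rows[:, 1:-1] + rows[:, 2:]) / 9.0

print("%.4f s per call" % best_time(box_mean_3x3))
```

Taking the minimum over repeats is a common convention because slower runs mostly reflect interference from other processes.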

I'd like to improve the performance of convolution using python, and was hoping for some insight on how to best go about improving performance. 

I am currently using scipy to perform the convolution, using code somewhat like the snippet below:

<pre><code>import numpy
import scipy.signal
import timeit

# reshape returns a new array, so the result has to be kept
a = numpy.arange(1000000).reshape(1000, 1000)
filt = numpy.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]])

def convolve():
    return scipy.signal.convolve2d(a, filt, mode="same")

passes = 10
t = timeit.Timer("convolve()", "from __main__ import convolve")
print("%.2f sec/pass" % (t.timeit(number=passes) / passes))
</code></pre>

I am processing image data, using grayscale (integer values between 0 and 255), and I currently get about a quarter of a second per convolution. My thinking was to do one of the following:

<ul>
<li>Use corepy, preferably with some optimizations.</li>
<li>Recompile numpy with icc &amp; ikml.</li>
<li>Use python-cuda.</li>
</ul>

I was wondering if anyone had any experience with any of these approaches ( what sort of gain would be typical, and if it is worth the time ), or if anyone is aware of a better library to perform convolution with Numpy.

Thanks!

EDIT: I got a speedup of about 10x over using Numpy by rewriting the Python loop in C.

Improving Numpy Performance
