
It's because your test matrices are too small and too regular; the overhead of figuring out the fastest evaluation order may outweigh the potential performance gain.
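To see that per-call overhead outside a notebook, here is a rough sketch using the standard <code>timeit</code> module on 2x2 matrices. The helper names <code>chained</code> and <code>with_multi_dot</code> are made up for illustration, and the absolute timings will vary by machine and NumPy version, so treat the numbers as indicative only.

```python
import timeit

import numpy as np
from numpy.linalg import multi_dot

# Tiny 2x2 inputs, similar to the ones in the question.
rng = np.random.default_rng(0)
v1 = rng.random((2, 2))
v2 = rng.random((2, 2))


def chained():
    # Plain chained dot calls: no ordering analysis at all.
    return v1.dot(v2.dot(v1.dot(v2)))


def with_multi_dot():
    # multi_dot first works out an evaluation order, then multiplies.
    return multi_dot([v1, v2, v1, v2])


# Take the best of a few repeats to reduce timing noise.
t_chained = min(timeit.repeat(chained, number=2000, repeat=3))
t_multi = min(timeit.repeat(with_multi_dot, number=2000, repeat=3))

print(f"chained dot: {t_chained / 2000 * 1e6:.2f} us per call")
print(f"multi_dot:   {t_multi / 2000 * 1e6:.2f} us per call")
```

Both functions compute the same product; only the bookkeeping around the multiplications differs, which is what dominates at this size.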

Using the example from the documentation:

<pre><code>import numpy as np
from numpy.linalg import multi_dot

# Prepare some data
A = np.random.rand(10000, 100)
B = np.random.rand(100, 1000)
C = np.random.rand(1000, 5)
D = np.random.rand(5, 333)

%timeit -n 10 multi_dot([A, B, C, D])
%timeit -n 10 np.dot(np.dot(np.dot(A, B), C), D)
%timeit -n 10 A.dot(B).dot(C).dot(D)
</code></pre>

Result:

<pre><code>10 loops, best of 3: 12 ms per loop
10 loops, best of 3: 62.7 ms per loop
10 loops, best of 3: 59 ms per loop
</code></pre>

<code>multi_dot</code> improves performance by choosing the multiplication order that requires the fewest scalar multiplications.
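Choosing that order is the classic matrix-chain ordering problem. The following is a minimal sketch of the textbook dynamic program for it, not NumPy's actual implementation. The function name <code>matrix_chain_order</code> and the letter labels are assumptions made for illustration.

```python
def matrix_chain_order(dims):
    """Find the cheapest parenthesization of a matrix chain.

    Matrix i has shape (dims[i], dims[i + 1]).
    Returns (minimum scalar multiplications, parenthesized string).
    Illustrative only: labels run out after 26 matrices.
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]   # cost[i][j]: best cost for matrices i..j
    split = [[0] * n for _ in range(n)]  # split[i][j]: where to cut i..j

    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):
                # Cost of (i..k) times (k+1..j) plus the final product.
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k

    names = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def paren(i, j):
        if i == j:
            return names[i]
        k = split[i][j]
        return "(" + paren(i, k) + paren(k + 1, j) + ")"

    return cost[0][n - 1], paren(0, n - 1)


# Shapes from the example: A 10000x100, B 100x1000, C 1000x5, D 5x333.
best_cost, best_order = matrix_chain_order([10000, 100, 1000, 5, 333])
print(best_cost, best_order)
```

The search itself costs time that grows with the number of matrices in the chain, so it only pays off when the multiplications it saves are expensive.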

In the above case, the default left-to-right order <code>((AB)C)D</code> is replaced by a cheaper one such as <code>A((BC)D)</code>, so that the <code>10000x100 @ 100x1000</code> multiplication is reduced to <code>10000x100 @ 100x333</code>, cutting at least <code>2/3</code> of the scalar multiplications.
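The saving can be checked by counting scalar multiplications directly: multiplying an m x k matrix by a k x n matrix takes m*k*n of them. Below is a small sketch of that count for the example shapes. The helper <code>chain_cost</code> and the nested-tuple notation are assumptions made for illustration.

```python
# Shapes of the matrices in the example.
shapes = {"A": (10000, 100), "B": (100, 1000), "C": (1000, 5), "D": (5, 333)}


def chain_cost(expr):
    """Return (result shape, scalar multiplications) for a nested-tuple product."""
    if isinstance(expr, str):
        return shapes[expr], 0
    left, right = expr
    (m, k), cost_left = chain_cost(left)
    (k2, n), cost_right = chain_cost(right)
    assert k == k2, "inner dimensions must match"
    # An (m x k) @ (k x n) product needs m * k * n scalar multiplications.
    return (m, n), cost_left + cost_right + m * k * n


orders = {
    "((AB)C)D": ((("A", "B"), "C"), "D"),
    "A((BC)D)": ("A", (("B", "C"), "D")),
    "(A(BC))D": (("A", ("B", "C")), "D"),
}

for name, expr in orders.items():
    print(name, chain_cost(expr)[1])
```

By this count <code>((AB)C)D</code> needs 1,066,650,000 multiplications and <code>A((BC)D)</code> needs 333,666,500, which is roughly a two-thirds reduction. <code>(A(BC))D</code> needs only 22,150,000, which may explain why <code>multi_dot</code> is faster still than the manually reordered version.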

You can verify this by timing that order directly:

<pre><code>%timeit -n 10 np.dot(A, np.dot(np.dot(B, C), D))
10 loops, best of 3: 19.2 ms per loop
</code></pre>

I'm trying to optimize some code that performs lots of sequential matrix operations. 

I figured <code>numpy.linalg.multi_dot</code> (<a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.multi_dot.html#numpy.linalg.multi_dot" rel="nofollow noreferrer">docs here</a>) would perform all the operations in C or BLAS and thus would be way faster than doing something like <code>arr1.dot(arr2).dot(arr3)</code> and so on.

I was really surprised when I ran this code in a notebook:

<pre><code>import numpy as np

v1 = np.random.rand(2,2)
v2 = np.random.rand(2,2)

%%timeit
v1.dot(v2.dot(v1.dot(v2)))

The slowest run took 9.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.14 µs per loop

%%timeit
np.linalg.multi_dot([v1,v2,v1,v2])

The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.9 µs per loop
</code></pre>

It turns out that the same operation is about 10x slower using <code>multi_dot</code>.

My questions are:

<ul>
<li>Am I missing something? Does this make any sense?</li>
<li>Is there another way to optimize sequential matrix operations?</li>
<li>Should I expect the same behavior using Cython?</li>
</ul>

How is numpy multi_dot slower than numpy.dot?
