
Cache locality (often loosely called cache coherence). When you scan horizontally, the data you touch is closer together in memory, so you get fewer cache misses and the loop runs faster. For a small enough rectangle, this won't matter.
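As a rough illustration of the two scan orders, here is a minimal sketch in C. It assumes the rectangle is stored row by row in one flat `int` buffer; the function names `sum_horizontal` and `sum_vertical` are made up for this example and are not taken from the blog post.

```c
#include <stddef.h>

/* Sum a height x width grid stored row by row in one flat buffer.
   Hypothetical helper names, used only for illustration. */
long sum_horizontal(const int *grid, size_t width, size_t height)
{
    long total = 0;
    for (size_t y = 0; y < height; y++)       /* outer loop: rows */
        for (size_t x = 0; x < width; x++)    /* inner loop: adjacent addresses */
            total += grid[y * width + x];
    return total;
}

long sum_vertical(const int *grid, size_t width, size_t height)
{
    long total = 0;
    for (size_t x = 0; x < width; x++)        /* outer loop: columns */
        for (size_t y = 0; y < height; y++)   /* inner loop: jumps `width` elements */
            total += grid[y * width + x];
    return total;
}
```

Both functions return the same result; only the order in which memory is visited differs, and that order is what the cache cares about.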


An answer has been accepted, but I don't think it's the whole story.

Yes, cache is a big part of the reason. All those elements have to be stored in memory in some order, and if you index through them in the order they are stored, you are likely to have fewer cache misses. Likely.

The other issue (also mentioned by a lot of answers) is that pretty much every processor has a very fast integer increment instruction. They do not typically have a very fast "increment by some amount multiplied by this second arbitrary amount". That's what you are asking for when you index "against the grain".

A third issue is optimization. A lot of effort and research has been put into optimizing loops of this kind, and your compiler will be much more likely to be able to put one of those optimizations into effect if you index through the array in some reasonable kind of order.


Cache is indeed the reason, but if you want the meat of the argument, take a look at "What Every Programmer Should Know About Memory" by U. Drepper:

<a href="http://people.redhat.com/drepper/cpumemory.pdf" rel="noreferrer">http://people.redhat.com/drepper/cpumemory.pdf</a>


To expand on the previous answers a bit:

Usually, as programmers, we can think of our programs' addressable memory as a flat array of bytes, from 0x00000000 to 0xFFFFFFFF. The operating system will reserve some of those addresses (all the ones lower than 0x80000000, say) for its own use, but we can do what we like with the others. All those memory locations live in the computer's RAM, and when we want to read from them or write to them we issue the appropriate instructions.

But this isn't true! There are a bunch of complications tainting that simple model of process memory: virtual memory, swapping, and the cache.

Talking to RAM takes a fairly long time. It's much faster than going to the hard disk, as there aren't any spinning plates or magnets involved, but it's still pretty slow by the standards of a modern CPU. So, when you try to read from a particular location in memory, your CPU doesn't just read that one location into a register and call it good. Instead, it reads that location, /and a bunch of nearby locations/, into a processor cache that lives on the CPU and can be accessed much more quickly than main memory.

Now we have a more complicated, but more correct, view of the computer's behavior. When we try to read a location in memory, first we look in the processor cache to see if the value at that location is already stored there. If it is, we use the value in the cache. If it isn't, we take a longer trip into main memory, retrieve the value as well as several of its neighbors and stick them in the cache, kicking out some of what used to be there to make room.

Now we can see why the second code snippet is faster than the first. In the second example, we first access <code>a[0]</code>, <code>b[0]</code>, and <code>c[0]</code>. Each of those values is cached, along with its neighbors, say <code>a[1..7]</code>, <code>b[1..7]</code>, and <code>c[1..7]</code>. Then when we access <code>a[1]</code>, <code>b[1]</code>, and <code>c[1]</code>, they're already in the cache and we can read them quickly. Eventually we get to <code>a[8]</code> and have to go to RAM again, but seven times out of eight we're using nice fast cache memory instead of clunky slow RAM.

(So why don't accesses to <code>a</code>, <code>b</code>, and <code>c</code> kick each other out of the cache? It's a bit complicated, but essentially the processor decides where to store a given value in the cache by its address, so three objects that aren't near each other spatially are unlikely to be cached into the same location.)

By contrast, consider the first snippet from lbrandy's post. We first read <code>a[0]</code>, <code>b[0]</code>, and <code>c[0]</code>, caching <code>a[1..7]</code>, <code>b[1..7]</code>, and <code>c[1..7]</code>. Then we access <code>a[width]</code>, <code>b[width]</code>, and <code>c[width]</code>. Assuming width is >= 8 (which it probably is, or else we wouldn't care about this sort of low-level optimization), we have to go to RAM again, caching a new set of values. By the time we get to <code>a[1]</code>, it will probably have been kicked out of the cache to make room for something else. In the not-at-all-uncommon case of a trio of arrays that are larger than the processor cache, it's likely that /every single read/ will miss the cache, degrading performance enormously.
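The difference between the two access patterns can be put into numbers with a deliberately crude model. The sketch below assumes a cache line holds 8 array elements (matching the `a[1..7]` example above) and, pessimistically, that only the most recently used line stays cached. Real caches hold many lines, so treat the counts as an illustration of the trend rather than a prediction. The name `line_changes` and the constant `LINE_ELEMS` are invented for this example.

```c
#include <stddef.h>

/* Toy model: one cache line holds LINE_ELEMS array elements and only
   the most recently used line stays cached. Both are simplifying
   assumptions; real caches keep many lines at once. */
enum { LINE_ELEMS = 8 };

/* Count accesses that land on a different line than the previous access.
   vertical == 0 scans row by row, vertical != 0 scans column by column. */
size_t line_changes(size_t width, size_t height, int vertical)
{
    size_t changes = 0;
    size_t last_line = (size_t)-1;   /* sentinel: nothing cached yet */
    size_t outer = vertical ? width : height;
    size_t inner = vertical ? height : width;

    for (size_t i = 0; i < outer; i++) {
        for (size_t j = 0; j < inner; j++) {
            size_t x = vertical ? i : j;
            size_t y = vertical ? j : i;
            size_t line = (y * width + x) / LINE_ELEMS;
            if (line != last_line) {
                changes++;
                last_line = line;
            }
        }
    }
    return changes;
}
```

Under these assumptions, a 64 x 64 grid gives 512 line changes for the horizontal scan (one per 8 elements) and 4096 for the vertical scan (one per element), which is the "seven times out of eight" hit rate versus "every single read misses" described above.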

This has been a very high-level discussion of modern caching behavior. For something more in-depth and technical, <a href="http://www.cs.iastate.edu/~prabhu/Tutorial/CACHE/mem_title.html" rel="noreferrer">this</a> looks like a thorough yet readable treatment of the subject.


Yeah, 'cache coherence'... Of course it depends: you could optimize the memory layout for vertical scans. Traditionally, video memory is laid out left-to-right, top-to-bottom, going back, I'm sure, to the days of CRT screens, which drew scanlines the same way. In theory you could change this, though. All this is to say that there isn't anything intrinsic about the horizontal method.
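To make the "you could change this" point concrete, here is a small sketch of a column-major layout in C, where the elements of one column sit next to each other in memory. The helper names `column_major_index` and `sum_column` are hypothetical and only serve the illustration.

```c
#include <stddef.h>

/* Column-major layout: element (row, col) lives at col * num_rows + row,
   so moving down a column advances one element at a time. */
size_t column_major_index(size_t row, size_t col, size_t num_rows)
{
    return col * num_rows + row;
}

/* Sum one column; with the layout above this walks adjacent addresses. */
long sum_column(const int *grid, size_t col, size_t num_rows)
{
    long total = 0;
    for (size_t row = 0; row < num_rows; row++)
        total += grid[column_major_index(row, col, num_rows)];
    return total;
}
```

With this layout the vertical scan is the cache-friendly one and the horizontal scan becomes the strided one, so the advantage follows the storage order, not the direction as such.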


The reason is that there is really no such thing as a two-dimensional array once you get down to how memory is laid out at the hardware level. So when you scan 'vertically', getting to the next cell you need to visit takes an operation along the following lines.

For a 2D array indexed as (row, column), each access needs to be translated into a one-dimensional access of the form array[index], because memory in a computer is linear.

So if you're scanning vertically, the next index is calculated as:

<pre><code>index = row * numColumns + col;
</code></pre>

However, if you're scanning horizontally, the next index is simply:

<pre><code>index++;
</code></pre>

A single addition takes fewer opcodes for the CPU than a multiplication and an addition, and thus horizontal scanning is faster because of the architecture of computer memory.
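The index arithmetic can be sketched in C as follows. This assumes a row-major grid; the names `flat_index` and `sum_column_strided` are invented for the example. Note that a vertical step can also be written as a single addition of `num_columns`, so whether a multiplication is really executed per element depends on how the loop is written and on what the compiler makes of it.

```c
#include <stddef.h>

/* Flat index of (row, col) in a row-major grid with num_columns columns. */
size_t flat_index(size_t row, size_t col, size_t num_columns)
{
    return row * num_columns + col;
}

/* Walk one column by adding a fixed stride instead of recomputing
   row * num_columns + col on each step. */
long sum_column_strided(const int *grid, size_t col,
                        size_t num_rows, size_t num_columns)
{
    long total = 0;
    size_t index = col;                 /* flat index of (0, col) */
    for (size_t row = 0; row < num_rows; row++) {
        total += grid[index];
        index += num_columns;           /* one addition per step */
    }
    return total;
}
```

Stepping right changes the flat index by 1, and stepping down changes it by `num_columns`; both are one addition, but only the first keeps successive accesses adjacent in memory.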

Cache is not the answer because if this is the first time you're loading this data, every data access will be a cache miss. For the very first execution, horizontal is faster because there are fewer operations. Subsequent loops through the triangle will be made faster by cache, and vertical could be slower because of cache misses if the triangle is sufficiently large, but will always be slower than horizontal scanning because of the increased number of operations needed to access the next element.

I just stumbled upon <a href="http://lbrandy.com/blog/2009/03/more-cache-craziness/" rel="nofollow noreferrer">this blog post</a> about cache algorithms.
The author shows two code samples that loop through a rectangle and compute something (my guess is the computing code is just a placeholder).
In one of the examples he scans the rectangle vertically, and in the other horizontally. He then says the second is faster, and that every programmer should know why. Now I must not be a programmer, because to me they look exactly the same.
Can anyone explain why the latter is faster?

Fastest way to loop through a 2d array?
