DAY40: Reading Memory Fence Functions

GPUS Lady

Published 2018-08-01 14:45:05

This article is included in the column: GPUS开发者

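We are guiding everyone through the English-language "CUDA C Programming Guide". Today is day 40, and we are covering CUDA C syntax. We hope that over the next 60 days you will learn CUDA straight from the source and build a habit of reading in English.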

This article is 1,057 characters long; estimated reading time is 15 minutes.

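Previously:

DAY36: Reading the "execution space" extension qualifiers

DAY37: Reading the qualifiers for the different memory spaces

DAY38: Reading the memory qualifiers

DAY39: Reading the extended data types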

B.5. Memory Fence Functions

The CUDA programming model assumes a device with a weakly-ordered memory model, that is the order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread.

For example, if thread 1 executes writeXY() and thread 2 executes readXY() as defined in the following code sample

it is possible that B ends up equal to 20 and A equal to 1 for thread 2. In a strongly-ordered memory model, the only possibilities would be:

· A equal to 1 and B equal to 2,

· A equal to 10 and B equal to 2,

· A equal to 10 and B equal to 20.
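The code sample referred to above did not survive in this copy of the text. A minimal reconstruction, assuming the statements quoted later in the section (X = 10; Y = 20; int A = X; int B = Y;) and the initial values 1 and 2 implied by the list of outcomes, could look like the following. The out parameter, the raceDemo kernel, and the runRaceDemo host helper are hypothetical additions so that A and B can be inspected.

```cuda
#include <cuda_runtime.h>

// Shared by both threads; volatile so each access really goes to memory.
__device__ volatile int X = 1, Y = 2;

// Thread 1: two plain stores with no ordering enforced between them.
__device__ void writeXY()
{
    X = 10;
    Y = 20;
}

// Thread 2: two plain loads; `out` is a hypothetical addition for inspection.
__device__ void readXY(int* out)
{
    int A = X;
    int B = Y;
    out[0] = A;
    out[1] = B;
}

// Hypothetical driver: block 0 plays thread 1, block 1 plays thread 2.
__global__ void raceDemo(int* out)
{
    if (blockIdx.x == 0) writeXY();
    else                 readXY(out);
}

// Hypothetical host helper: launches the demo once and copies A and B back.
void runRaceDemo(int* A, int* B)
{
    int* dOut = nullptr;
    int hOut[2] = {0, 0};
    cudaMalloc(&dOut, sizeof(hOut));
    raceDemo<<<2, 1>>>(dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    *A = hOut[0];
    *B = hOut[1];
}
```

Because the two threads race, each of A and B may come back as either the old or the new value; the sketch only makes the scenario concrete.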

Memory fence functions can be used to enforce some ordering on memory accesses. The memory fence functions differ in the scope in which the orderings are enforced but they are independent of the accessed memory space (shared memory, global memory, page-locked host memory, and the memory of a peer device).

void __threadfence_block();

ensures that:

· All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to __threadfence_block();

· All reads from all memory made by the calling thread before the call to __threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to __threadfence_block().

void __threadfence();

acts as __threadfence_block() for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to __threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to __threadfence(). Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this is ensured by using the volatile keyword as detailed in Volatile Qualifier.

void __threadfence_system();

acts as __threadfence_block() for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to __threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to __threadfence_system().

__threadfence_system() is only supported by devices of compute capability 2.x and higher.

In the previous code sample, inserting a fence function call between X = 10; and Y = 20; and between int A = X; and int B = Y; would ensure that for thread 2, A will always be equal to 10 if B is equal to 20. If thread 1 and 2 belong to the same block, it is enough to use __threadfence_block(). If thread 1 and 2 do not belong to the same block, __threadfence() must be used if they are CUDA threads from the same device and __threadfence_system() must be used if they are CUDA threads from two different devices.
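As a sketch of the fenced version described above, the following adds __threadfence() to both sides, since the writer and the reader sit in different blocks of the same device. One caveat: if the reader loads X before Y, as the statements are quoted above, it can load X before the writer starts and Y after the writer finishes, and so see A == 1 with B == 20 no matter what fences are used. This sketch therefore loads Y first, which is the order I believe later editions of the guide use. The out parameter, the fenceDemo kernel, and the runFenceDemo helper are hypothetical.

```cuda
#include <cuda_runtime.h>

__device__ volatile int X = 1, Y = 2;

// Thread 1: the fence orders the store to X before the store to Y.
__device__ void writeXY()
{
    X = 10;
    __threadfence();
    Y = 20;
}

// Thread 2: load Y first, then fence, then load X.
// `out` is a hypothetical addition for inspection.
__device__ void readXY(int* out)
{
    int B = Y;
    __threadfence();  // the load of Y is ordered before the load of X
    int A = X;
    out[0] = A;
    out[1] = B;
}

// Hypothetical driver: block 0 writes, block 1 reads.
__global__ void fenceDemo(int* out)
{
    if (blockIdx.x == 0) writeXY();
    else                 readXY(out);
}

// Hypothetical host helper: launches the demo once and copies A and B back.
void runFenceDemo(int* A, int* B)
{
    int* dOut = nullptr;
    int hOut[2] = {0, 0};
    cudaMalloc(&dOut, sizeof(hOut));
    fenceDemo<<<2, 1>>>(dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    *A = hOut[0];
    *B = hOut[1];
}
```

With both fences in place, a reader that sees B == 20 should also see A == 10; the other outcomes (1 and 2, 10 and 2) remain possible depending on timing.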

A common use case is when threads consume some data produced by other threads as illustrated by the following code sample of a kernel that computes the sum of an array of N numbers in one call. Each block first sums a subset of the array and stores the result in global memory. When all blocks are done, the last block done reads each of these partial sums from global memory and sums them to obtain the final result. In order to determine which block is finished last, each block atomically increments a counter to signal that it is done with computing and storing its partial sum (see Atomic Functions about atomic functions). The last block is the one that receives the counter value equal to gridDim.x-1. If no fence is placed between storing the partial sum and incrementing the counter, the counter might increment before the partial sum is stored and therefore, might reach gridDim.x-1 and let the last block start reading partial sums before they have been actually updated in memory.

Memory fence functions only affect the ordering of memory operations by a thread; they do not ensure that these memory operations are visible to other threads (like __syncthreads() does for threads within a block (see Synchronization Functions)). In the code sample below, the visibility of memory operations on the result variable is ensured by declaring it as volatile (see Volatile Qualifier).
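The kernel that the two paragraphs above describe is also missing from this copy. The following is a sketch of it under stated assumptions: the names count, isLastBlockDone, sum, and result follow the description, while blockReduce, the THREADS block size, the grid-stride loops, and the sumOnDevice host wrapper are hypothetical fill-ins for details the text does not give.

```cuda
#include <cuda_runtime.h>

#define THREADS 256  // assumed block size; must be a power of two

__device__ unsigned int count = 0;  // how many blocks have stored their partial sum

// Hypothetical helper: tree reduction over one block; every thread gets the total.
__device__ float blockReduce(float v, float* buf)
{
    buf[threadIdx.x] = v;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    float total = buf[0];
    __syncthreads();  // buf may be reused by the next call
    return total;
}

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ float buf[THREADS];
    __shared__ bool isLastBlockDone;

    // Part 1: each block sums its share of the input (grid-stride loop).
    float v = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
         i += blockDim.x * gridDim.x)
        v += array[i];
    float partialSum = blockReduce(v, buf);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;  // ordinary store of the partial sum
        __threadfence();                  // order that store before the atomic below
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == gridDim.x - 1);  // true only in the last block
    }
    __syncthreads();  // broadcast isLastBlockDone to the whole block

    // Part 2: only the last block adds up the partial sums.
    if (isLastBlockDone) {
        float t = 0.0f;
        for (unsigned int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            t += result[i];  // volatile loads, so stale cached values are not used
        float totalSum = blockReduce(t, buf);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;  // reset so the kernel can be launched again
        }
    }
}

// Hypothetical host wrapper: returns the sum of h[0..N-1] using `blocks` blocks.
float sumOnDevice(const float* h, unsigned int N, unsigned int blocks)
{
    float *dIn = nullptr, *dRes = nullptr, out = 0.0f;
    cudaMalloc(&dIn, N * sizeof(float));
    cudaMalloc(&dRes, blocks * sizeof(float));
    cudaMemcpy(dIn, h, N * sizeof(float), cudaMemcpyHostToDevice);
    sum<<<blocks, THREADS>>>(dIn, N, dRes);
    cudaMemcpy(&out, dRes, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dRes);
    return out;
}
```

The fence sits between the store to result[blockIdx.x] and the atomicInc, which is exactly the ordering the text says is needed; error checking is omitted to keep the sketch short.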

Notes and practical experience:

This section mainly covers the threadfence family of functions and closes with an example of a single-pass reduction sum.

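A word about the example first. Of the three common ways to write a reduction sum, this one is the hardest to understand. The other two are the two-pass reduction (the most common) and the single-pass reduction that finishes with an atomic operation. On our forum I always recommend the two-pass reduction, because it is the best version in both performance and readability. The single-pass reduction presented in this section is mainly a demonstration: it shows how, within a single kernel execution, one or more threads can use data produced by one or more other threads (the producer-consumer model).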

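Now to threadfence itself. It has two main effects and three levels of scope. The first effect is to control when memory accesses (writes only) take effect. The second is to control the compiler: when generating instructions, it must not optimize too aggressively across the fence, that is, between the code above it and the code below it (this applies to both reads and writes). So this is really a combination of hardware and software: the first effect is mainly a hardware matter, the second mainly a software one (control of compiler behavior).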

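First, the control over memory writes. There is a detail the section does not spell out: the fence can stall the calling thread (warp) for a while, until the writes issued by that thread (warp) have completed to a certain level, and only then lets it continue. The level is what the section describes: whether the writes are complete as far as the other threads in the block are concerned, as far as other threads on the GPU are concerned, or globally (for example for the CPU or for other cards taking part in P2P access). This gives the three fence functions with different suffixes: __threadfence_block(), __threadfence(), and __threadfence_system(). Each level is stronger than the previous one, and the stall it causes is correspondingly longer. (A stall does not necessarily hurt performance, because the hardware can switch to other threads or warps in the meantime, so the latency can be hidden.)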

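From the hardware's point of view alone, once a thread has executed a store instruction (generated by the compiler), without these three functions (or rather the instructions generated for them) it simply keeps executing and does not wait for the store to complete. With them, the thread stalls until the write has progressed far enough to be "seen" by other threads at the corresponding level, and only then continues. This is why I stressed "writes" in parentheses above. For a read, the thread waits automatically when the loaded operand is actually used. A write has no such mechanism: once the data (from registers) and the store instruction have been handed to the LSU in the SM, that is, once the store has been issued successfully, the thread just carries on. The three functions effectively introduce three different levels of delay, so that writes in flight in the various hardware pipelines complete to a certain degree before execution continues. This is very useful in many situations (for example in the example at the end of this section, where threads in one block read data produced by threads in other blocks). How long the thread or warp is actually stalled is not documented at present. One interesting anecdote: on cards of one particular compute capability, the instruction for one of the levels had a timing bug, and the compiler patched around it whenever it generated that fence level for that compute capability (it emitted the instruction twice in a row to cover the worst-case timing and guarantee that the writes got as far as they needed to).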

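That is not all. Hardware is nothing without software; without it, it is a body with no soul. Besides what the hardware does for the threadfence functions (the corresponding instructions), the compiler, on encountering these three functions, not only generates the three corresponding instructions but also changes its own behavior. This is the change in optimization behavior mentioned earlier, and the section describes it explicitly. Put simply, the fence separates the memory accesses before it from those after it: the compiler will not reorder memory access statements across a threadfence boundary for the sake of performance, even where that would otherwise look safe and logically correct. This is why the section contains so many long, repetitive sentences.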

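The other part of those long sentences, roughly "the writes made by a thread ... are observed when, where, and by whom (three levels) ...", describes the combined result of the changed compiler behavior and the stall introduced by the three different hardware instructions. These are the two main points about threadfence. Readers should go through the three levels themselves. As noted above, the main difference is the length of the stall (this comes from practical observation, not from official documentation).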

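There is one more thing: the section also mentions the volatile keyword. It happens to have two effects as well, again one in software and one in hardware. In the example at the end of the section this keyword is indispensable. The software effect is similar to that of the threadfence functions: it controls and changes compiler behavior, and this is something most people already know.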

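(1) The compiler will not reuse a value it obtained earlier and that may be stale, for example a p[offset] it has already loaded; instead it generates a fresh memory access instruction to read it again. This avoids problems caused by using an old value that is no longer valid (for example, when another thread has just overwritten the same p[offset]). (2) The other effect is in hardware. Current cards (for example Pascal, the compute capability 6.1 GTX 1080) have a (logically) unified L2 cache plus a separate L1 cache in each SM (called the unified cache on this compute capability; it may differ on others, so I will simply say L1 cache here). In other words, this card has one L2 cache (in the GPU) and 20 independent L1 caches (one in each SM). Once SM 10 writes a value, changing the contents of its own L1 cache and/or the global L2 cache, an ordinary read on SM 9 may be served immediately from stale contents already sitting in SM 9's L1 cache, and so miss the new value just written by a thread on SM 10. With the volatile keyword, the compiler not only avoids the reuse described in (1) and issues a new memory access, it also controls the cache policy of the load instruction so that it bypasses any stale value in the current SM's L1 and gets the correct result. In other words, on some cards the multiple L1 caches and the unified L2 are, under certain conditions, not kept coherent. So much for volatile.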

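Now for the example. It is a simplified version with some details removed, but it is sufficient as an illustration. Fuller versions are easy to find online, from NVIDIA and from various other authors, and are worth consulting. Here I will focus on the key point: mixing ordinary memory accesses with atomic operations.

As said before, this is a single-pass reduction sum kernel: a single kernel launch computes, in parallel, the final sum of all the data in a buffer. (A reduction is an operation that takes a larger input and produces a smaller output; you can also think of it as a mapping from a large set to a small one. This example is just a simple sum.) Because the kernel aims to finish everything in one launch, the code is split into two parts. In the first part, each block computes its own partial sum and stores it to device memory. In the second part, the block that finishes last is elected to do the second step: it reads the partial sums of the other blocks and adds them up to get the final sum. That completes a global sum in a single kernel.

Because this involves reading results from other threads, it uses volatile plus threadfence as discussed above. But the threadfence here has another important role, which concerns the common problem of mixing an ordinary store (result[...] = ...) with an atomic access (atomicInc). There is no completion ordering between an ordinary memory instruction and an atomic instruction. By inserting a threadfence call between the two, the code makes every block that is about to finish stall long enough for its partial sum to become globally visible before it enters the election through the atomic operation. As a result, as soon as the elected block sees the result of its atomic (it has won the election), the values it needs to read are already in effect. This avoids the case where the later atomic has completed while the earlier ordinary store (saving the partial sum) is still in flight. That is the main thing to watch for here.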

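It should also be said that there are two other common versions of this reduction sum. Version 2 needs two kernel launches, but the code is much simpler. In version 2 each block only computes and stores its own partial sum (the code is not shown here; there are plenty of examples on the forum). In other words, version 2 needs only the first half of the version 1 kernel and drops the second half. A simple trick then finishes the job: launch the kernel twice in a row, with the second launch configured with a single block. Readers may want to think about why this works. This version has simpler code (only half of it), avoids the threadfence, avoids atomicInc, and also avoids the step where an if restricts the atomic operation to one thread per block and the result is then broadcast back through shared memory. It is far simpler and often performs better as well. I therefore strongly recommend version 2: the only price is launching the kernel twice, and in return the kernel itself is greatly simplified. That is version 2. If you are interested, see our forum: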

http://bbs.gpuworld.cn/thread-9426-1-1.htmls
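A minimal sketch of the two-pass version, assuming a block size of THREADS (a power of two) and a hypothetical kernel name blockSum and host wrapper twoPassSum; this is not the forum's code. The second launch works because a single block, striding over the per-block partial sums, reduces them to one value, and launches issued to the same stream run in order, so the partial sums are complete before the second launch reads them.

```cuda
#include <cuda_runtime.h>

#define THREADS 256  // assumed block size; must be a power of two

// Each block writes the sum of its share of in[0..n-1] to out[blockIdx.x].
__global__ void blockSum(const float* in, unsigned int n, float* out)
{
    __shared__ float buf[THREADS];

    // Grid-stride loop: each thread accumulates its own elements.
    float v = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        v += in[i];

    // Tree reduction in shared memory.
    buf[threadIdx.x] = v;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}

// Hypothetical host wrapper: returns the sum of h[0..N-1].
float twoPassSum(const float* h, unsigned int N, unsigned int blocks)
{
    float *dIn = nullptr, *dPartial = nullptr, *dTotal = nullptr, out = 0.0f;
    cudaMalloc(&dIn, N * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dIn, h, N * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS>>>(dIn, N, dPartial);     // pass 1: partial sums
    blockSum<<<1, THREADS>>>(dPartial, blocks, dTotal);  // pass 2: one block finishes

    cudaMemcpy(&out, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dTotal);
    return out;
}
```

No fence, atomic, or volatile appears anywhere: neither launch reads data produced within that same launch.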

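In the forum thread linked above, see my colleague's reply at #5. In essence he points out, quote: "In terms of method, it is possible to use some fairly complicated techniques to complete the whole reduction in a single kernel, but doing so has no obvious benefit, so it is not recommended." He put it rather diplomatically. In the same thread he also describes a method 3, which likewise needs only one kernel launch: replace the second launch of method 2, or the second half of method 1, with a single atomic add. Overall this may be the simplest version, but the final result has to be zeroed beforehand (not mentioned there). Readers can use it as a reference too.

Note that the most basic fence, __threadfence_block(), is not used very often, because you usually also need the threads within the block to stay in step (synchronize at some execution point), and for that you can consider the __syncthreads() family of functions. According to the current version of the manual, __syncthreads() also has the effects of __threadfence_block() (for example, the hardware control over when writes take effect, and the compiler's restriction on reordering statements across the fence). So in common cases, once a block uses __syncthreads(), __threadfence_block() is usually no longer needed. This is a familiar pattern, for example when exchanging data through an array in shared memory. It deserves special mention because this is parallel code and there is a lot to think about, especially when data is produced and consumed at the same time. This is why I recommend the two-pass reduction described earlier: with two kernel launches executed in order, neither launch consumes data produced within that same launch, which sidesteps the problem.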
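Method 3 above can be sketched as follows, assuming a block size of THREADS (a power of two) and hypothetical names atomicSum and atomicSumOnDevice. Each block reduces its share in shared memory and then adds its partial sum to a single accumulator with atomicAdd; the accumulator is zeroed with cudaMemset before the launch, which is the pre-clearing step the text mentions.

```cuda
#include <cuda_runtime.h>

#define THREADS 256  // assumed block size; must be a power of two

// Each block adds the sum of its share of in[0..n-1] to *total.
__global__ void atomicSum(const float* in, unsigned int n, float* total)
{
    __shared__ float buf[THREADS];

    // Grid-stride loop: each thread accumulates its own elements.
    float v = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        v += in[i];

    // Tree reduction in shared memory.
    buf[threadIdx.x] = v;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic add per block replaces the second pass.
    if (threadIdx.x == 0) atomicAdd(total, buf[0]);
}

// Hypothetical host wrapper: returns the sum of h[0..N-1].
float atomicSumOnDevice(const float* h, unsigned int N, unsigned int blocks)
{
    float *dIn = nullptr, *dTotal = nullptr, out = 0.0f;
    cudaMalloc(&dIn, N * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dIn, h, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(float));  // the result must start at zero

    atomicSum<<<blocks, THREADS>>>(dIn, N, dTotal);

    cudaMemcpy(&out, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTotal);
    return out;
}
```

One design consequence: the order in which blocks reach the atomicAdd is not fixed, so for general floating-point inputs the last bits of the result can vary from run to run.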

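It is therefore much more readable. This is also a general recommendation: unless necessary, do not consume data within the same kernel launch that produced it; it can always be left for the next launch. (The exception is data in shared memory, which a block produces and then uses shortly afterwards, often many times over. But that is the normal use of shared memory; remember the three main uses of shared memory we covered a few days ago? For this pattern you can use __syncthreads(), whose side effects beyond synchronization save you from writing a threadfence.)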
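The shared-memory case can be illustrated with a small sketch in which every thread stages one element and then reads an element staged by a different thread of the same block. The array length N and the names reverseInBlock and reverseOnDevice are assumptions made for this illustration.

```cuda
#include <cuda_runtime.h>

#define N 64  // assumed block size and array length

// Reverses d[0..N-1] in place using one block of N threads.
__global__ void reverseInBlock(int* d)
{
    __shared__ int s[N];
    unsigned int t = threadIdx.x;
    s[t] = d[t];          // produce: each thread stages its own element
    __syncthreads();      // barrier; the shared-memory writes are visible after it
    d[t] = s[N - 1 - t];  // consume: read the element staged by another thread
}

// Hypothetical host wrapper: reverses the N-element host array h in place.
void reverseOnDevice(int* h)
{
    int* d = nullptr;
    cudaMalloc(&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);
    reverseInBlock<<<1, N>>>(d);
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

No __threadfence_block() is written between the store to s[t] and the load from s[N - 1 - t]; the __syncthreads() call covers both the synchronization and the ordering.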

If anything is unclear, please leave a comment below this article or post on our technical forum at bbs.gpuworld.cn.

This article was shared from the GPUS开发者 WeChat official account. Originally published 2018-07-02.
