DAY 41: Reading Synchronization Functions

GPUS Lady

Published 2018-08-01 14:46:15

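This article is part of the column: GPUS开发者

We are leading everyone through the English-language CUDA C Programming Guide. Today is day 41, and we are covering CUDA C syntax. We hope that over the next 59 days you can learn CUDA in its original form while also building a habit of reading in English.

This article has 369 words; reading time is 10 minutes.

Previously:

DAY 36: Reading the "execution space" extension qualifiers

DAY 37: Reading the qualifiers for different memory spaces

DAY 38: Reading the memory qualifiers

DAY 39: Reading the extended data types

DAY 40: Reading Memory Fence Functions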

B.6. Synchronization Functions

void __syncthreads();

waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing threads in-between these accesses.

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
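
The read-after-write hazard described above can be illustrated with a minimal sketch. It assumes a single block of 256 threads; the names `reverse_block`, `run_reverse_block` and `kBlockSize` are made up for this example and do not come from the guide. Each thread writes one shared memory slot and then reads a slot written by a different thread, so the barrier between the two steps is what makes the read safe.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // block size assumed for this sketch

// Reverses one block-sized chunk of `data` in place through shared memory.
__global__ void reverse_block(int *data)
{
    __shared__ int tile[kBlockSize];
    int t = threadIdx.x;

    tile[t] = data[t];                   // each thread writes its own slot
    __syncthreads();                     // every write is visible before any read below
    data[t] = tile[kBlockSize - 1 - t];  // read a slot written by another thread
}

// Hypothetical host helper: returns true if the device produced the reversed input.
bool run_reverse_block()
{
    int h[kBlockSize];
    for (int i = 0; i < kBlockSize; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return false;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverse_block<<<1, kBlockSize>>>(d);
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < kBlockSize; ++i)
        if (h[i] != kBlockSize - 1 - i) return false;
    return true;
}
```

Note that the barrier sits outside any branch, so every thread of the block reaches it, which satisfies the rule about conditional code.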

Devices of compute capability 2.x and higher support three variations of __syncthreads() described below.

int __syncthreads_count(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

int __syncthreads_and(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

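int __syncthreads_or(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.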
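
As a rough illustration of the three predicate variants, the sketch below has one block of 256 threads vote on whether their input element is negative. The kernel name `block_votes`, the helper `run_block_votes` and the test data are assumptions made for this example. No shared memory is declared; the barrier itself carries the 1-bit information.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // block size assumed for this sketch

// out[0] = how many elements are negative, out[1] = are all negative, out[2] = is any negative
__global__ void block_votes(const int *in, int *out)
{
    int pred = in[threadIdx.x] < 0;

    // Every thread of the block executes all three barriers.
    int count = __syncthreads_count(pred);
    int all   = __syncthreads_and(pred);
    int any   = __syncthreads_or(pred);

    if (threadIdx.x == 0) {
        out[0] = count;
        out[1] = (all != 0);
        out[2] = (any != 0);
    }
}

// Hypothetical host helper: fills result[0..2] and returns true on success.
bool run_block_votes(int result[3])
{
    int h_in[kThreads];
    for (int i = 0; i < kThreads; ++i)
        h_in[i] = (i % 4 == 0) ? -1 : 1;  // 64 of the 256 inputs are negative

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, 3 * sizeof(int)) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    block_votes<<<1, kThreads>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(result, d_out, 3 * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```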

void __syncwarp(unsigned mask=0xffffffff);

will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask) before resuming execution. All non-exited threads named in mask must execute a corresponding __syncwarp() with the same mask, or the result is undefined.

Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute __syncwarp(), and then safely read values stored by other threads in the warp.

Note: For .target sm_6x or below, all threads in mask must execute the same __syncwarp() in convergence, and the union of all values in mask must be equal to the active mask. Otherwise, the behavior is undefined.
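
The store, __syncwarp(), load pattern described above might look like the following sketch. It assumes CUDA 9 or later and a block of 64 threads (two full warps); the names `rotate_within_warp`, `run_rotate_within_warp` and `kWarpDemoThreads` are invented for this example. Each lane publishes a value in shared memory and then reads the value published by the next lane of the same warp.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpDemoThreads = 64;  // two full warps, assumed for this sketch

__global__ void rotate_within_warp(const int *in, int *out)
{
    __shared__ int buf[kWarpDemoThreads];
    int t = threadIdx.x;
    int lane = t % warpSize;   // position inside the warp
    int warp_base = t - lane;  // first slot that belongs to this warp

    buf[t] = in[t];            // store
    __syncwarp();              // default mask 0xffffffff: all 32 lanes take part
    out[t] = buf[warp_base + (lane + 1) % warpSize];  // read the neighbour's store
}

// Hypothetical host helper: returns true if every lane received its neighbour's value.
bool run_rotate_within_warp()
{
    int h_in[kWarpDemoThreads], h_out[kWarpDemoThreads];
    for (int i = 0; i < kWarpDemoThreads; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    rotate_within_warp<<<1, kWarpDemoThreads>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < kWarpDemoThreads; ++t) {
        int lane = t % 32;
        int expected = (t - lane) + (lane + 1) % 32;
        if (h_out[t] != expected) return false;
    }
    return true;
}
```

Because the exchange never crosses a warp boundary, the block-wide __syncthreads() is not required here.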

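Notes and practical experience:

Today we mainly discuss the __syncthreads*() family of functions, plus __syncwarp(), which was added in CUDA 9.

The former works at the scope of a block: it synchronizes the point of execution and also acts as a memory fence (in hardware or in software; yesterday's post covered fences). __syncwarp(), by contrast, is new. CUDA 9 introduced it to suit the new cards, and its scope shrinks to a single warp.

Let's start with the traditional, classic, and much-loved __syncthreads() family. __syncthreads() itself has existed since the very first CUDA release. It performs local synchronization, that is, synchronization within a block, and it comes with a memory fence effect (see yesterday's post). It is most often used when exchanging data through shared memory. A few points deserve attention, though:

(1) For large amounts of data, shared memory may not be big enough (48 KB below compute capability 7.0, up to 96 KB at compute capability 7.0 and above). In that case you may need to exchange the data through global memory. Local memory cannot be used for this, because the same local memory address holds different contents in different threads (see the earlier chapter on local memory). That said, exchanging data within a block through global memory is fairly uncommon.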

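(2) For very small amounts of data, such as 1-bit exchanges, the common operations can be done directly by the extended versions of __syncthreads(), without any shared memory at all. These cover counting (in effect an atomic accumulation of 1-bit values) and the boolean AND and OR operations, that is, __syncthreads_count(), __syncthreads_and(), and __syncthreads_or(). If your 1-bit exchange fits one of these three uses, the barrier itself can do it and shared memory is not needed. Other, more specialized operations, for example summing across every group of 4 threads, still have to go through shared memory. Fortunately, these three cover most of the common cases.

(3) When the data is one or more 4-byte or 8-byte values and the exchange stays within a warp, consider warp shuffle directly. Shuffle has latency typical of shared memory but writes to no shared memory location, so it completes the exchange without disturbing any existing contents. For example, on some cards, even when exactly 48 KB of shared memory (or 2-3 allocations of 32 KB) is fully in use, you can still exchange data through warp shuffle in much the same way as through shared memory. Logically it is equivalent to writing to shared memory and immediately reading back with a permuted index, which achieves the exchange, but nothing in shared memory has to be overwritten or temporarily saved.

The count operation from point (2), __syncthreads_count(), is very useful in many situations. When the extended versions of __syncthreads() were first introduced (compute capability 2.0/2.1), the plain __syncthreads(), the version without any suffix, was actually compiled as int result = __syncthreads_count(1); with the result discarded immediately. So there is no need to worry that the extended members of the family, with their data exchange and counting abilities, will hurt performance. On those devices you were effectively always using the extended version anyway, so you can use them with confidence.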
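
Point (3), exchanging 4-byte values inside a warp without touching shared memory, can be sketched with `__shfl_down_sync()` (the CUDA 9+ form of warp shuffle). The sketch assumes a launch with exactly one full warp of 32 threads; the names `warp_sum` and `run_warp_sum` and the test data are made up for this example.

```cuda
#include <cuda_runtime.h>

// Sums the 32 values held by one warp using register-to-register shuffles only.
__global__ void warp_sum(const int *in, int *out)
{
    int v = in[threadIdx.x];

    // Halve the distance each round: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    if (threadIdx.x == 0) *out = v;  // lane 0 ends up with the full sum
}

// Hypothetical host helper: returns the sum of 1..32 computed on the device, or -1 on error.
int run_warp_sum()
{
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_sum<<<1, 32>>>(d_in, d_out);

    int result = -1;
    cudaError_t err = cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? result : -1;
}
```

The kernel declares no `__shared__` storage at all, so it still works when the block's shared memory is already fully occupied by other data.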

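What does discarding the result mean? NVIDIA GPUs have many instructions that are generated this way, purely to satisfy the logic of the code. Suppose you have uint32_t a, b, c; a = b + c; and then test for carry with carry = (a < b). If your code only uses the carry flag, the compiler may generate an add with carry-out whose sum is simply thrown away. This usually involves the RZ register. Remember it? We discussed it earlier, when explaining why a thread can use only 63 registers rather than 64, or only 255 registers rather than 256: one register is reserved to act as a black hole, a place to write results that are then discarded.

In fact, in the beginning both the extended __syncthreads_count() and the plain __syncthreads() generated the same kind of operation, something like BAR.RED.POPC RZ, RZ, RZ, PT. That one instruction performs the synchronization (and don't forget the memory fence effect) plus the counting. The result is just dropped into the black-hole register and never used.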
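
The carry test mentioned above can be written out as a tiny function. This is only a sketch of the source pattern; the name `add_carries` is made up, and whether the compiler really emits an add whose sum goes to RZ depends on the architecture and compiler version, so check the generated SASS (for example with `cuobjdump -sass`) rather than assume it.

```cuda
#include <cstdint>

// Returns true when b + c overflows 32 bits. Only the carry is used;
// the sum itself is never needed by the caller.
__host__ __device__ inline bool add_carries(uint32_t b, uint32_t c)
{
    uint32_t a = b + c;  // unsigned addition wraps modulo 2^32
    return a < b;        // the wrapped sum is smaller than an operand exactly on overflow
}
```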

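The point of all this is that the extended versions are very cheap, so use them whenever they fit; they will certainly be faster than a manual reduction in shared memory.

Things changed starting with Kepler (compute capability 3.0). Note that 3.0 is the lowest compute capability supported from CUDA 9.0 on; the lowest supported by CUDA 8.0 is 2.0, so retire cards like your C2050 as soon as you can. From that generation through compute capability 7.2 (the successor to the TX2), the plain __syncthreads() and the extended __syncthreads*() are compiled differently. The former generates only a single BAR.SYNC instruction. The latter generates two instructions: the barrier instruction plus a B2R.RESULT instruction that collects the result. This may be meant to reduce the time spent waiting at the barrier, or there may be other hardware design considerations. After all, the extended versions do two jobs, and code that has just synchronized does not always need the counting or data exchange result right away, in which case the newer design "may" perform better. So on the newer generations, too, you should use them when they apply.

Finally, there is __syncwarp() (with its optional mask parameter), introduced in CUDA 9 and designed for the new cards. As earlier chapters explained, from compute capability 7.0 on the SIMT architecture allows the threads of a warp to execute divergently in some situations. Here __syncwarp() has two effects: the usual memory fence effect, and bringing some or all threads of the warp back together briefly at a specific point. It is useful when you need to exchange data inside a warp without going through the block-wide __syncthreads(), or when warp shuffle does not fit.

Note that older code that performs implicit warp-level exchange (volatile pointers plus shared memory) may need to be rewritten so that it does not break on the new cards. Also, according to our experiments, on older cards (Pascal or lower compute capability) this function is compiled to nothing. So even if you are still on an older card, you can safely upgrade to CUDA 9.2 rather than stay on an old version such as 8.0, because:

(1) Moving to the new compiler promptly helps you learn the new style. You will buy a new card one day, and getting familiar now lowers the cost of upgrading then. Otherwise code written in the old style keeps piling up, and porting it all at once will be a disaster.

(2) Upgrading to CUDA 9.2 does not reduce performance on your older card. You can think of this function as compiling to a no-op there, or as being ignored by the compiler. So updating is always recommended (you will have to update sooner or later, unless you plan never to buy a new card).

On compute capability 7.0 cards, __syncwarp() is sometimes ignored by the compiler outright; sometimes the compiler inserts a SYNCWARP instruction on its own where the code never called __syncwarp(); and sometimes __syncwarp() is compiled into a somewhat more complex instruction sequence. Why this happens is not yet clear (readers are welcome to fill in the details).

One more note: the extended __syncthreads*() functions all take the same style of (int predicate) parameter, but on the actual hardware (2.0-7.2) the instruction takes a predicate register (a 1-bit value). You will often find the compiler converting back and forth, for no apparent reason, between a boolean held as an integer (for example int a = 1) and a P register. This is probably because C traditionally had no standard 1-bit boolean type, and not necessarily because the compiler is dumb or the design is flawed. It normally has no performance impact; it is just worth mentioning.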
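
One possible way to rewrite the legacy "volatile pointer plus shared memory" warp exchange mentioned above is sketched below. Old code of this kind typically ran the reduction steps back to back with no barrier and relied on the warp executing in lockstep; here each step is followed by an explicit __syncwarp(). The sketch assumes a launch with exactly one warp of 32 threads, and the names `warp_sum_shared` and `run_warp_sum_shared` are invented for this example.

```cuda
#include <cuda_runtime.h>

// Sums 32 values through shared memory with an explicit warp barrier per step.
__global__ void warp_sum_shared(const int *in, int *out)
{
    __shared__ int sdata[32];
    unsigned lane = threadIdx.x;  // one warp only, so the thread index is the lane

    sdata[lane] = in[lane];
    __syncwarp();                 // all stores are visible before the first round

    for (unsigned offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset)
            sdata[lane] += sdata[lane + offset];  // add the partner's partial sum
        __syncwarp();  // outside the branch, so all 32 lanes execute it
    }

    if (lane == 0) *out = sdata[0];
}

// Hypothetical host helper: returns the device sum of 0, 2, 4, ..., 62, or -1 on error.
int run_warp_sum_shared()
{
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = 2 * i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_sum_shared<<<1, 32>>>(d_in, d_out);

    int result = -1;
    cudaError_t err = cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? result : -1;
}
```

Keeping the barrier outside the `if` means every lane named in the default mask reaches the same __syncwarp(), which also matches the stricter convergence requirement for sm_6x and below.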

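If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

This article is part of the Tencent Cloud self-media sync exposure program and is shared from a WeChat official account.

Originally published: 2018-07-03. In case of infringement, contact cloudcommunity@tencent.com for removal.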
