DAY58:阅读Launch Bounds

GPUS Lady

发布于 2018-08-01 15:05:57

1.2K0

发布于 2018-08-01 15:05:57

文章被收录于专栏：GPUS开发者

我们正带领大家开始阅读英文的《CUDA C Programming Guide》,今天是第58天，我们正在讲解CUDA C语法，希望在接下来的42天里，您可以学习到原汁原味的CUDA，同时能养成英文阅读的习惯。

B.22. Launch Bounds

As discussed in detail in Multiprocessor Level, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.

Therefore, the compiler uses heuristics to minimize register usage while keeping register spilling (see Device Memory Accesses) and instruction count to a minimum. An application can optionally aid these heuristics by providing additional information to the compiler in the form of launch bounds that are specified using the __launch_bounds__() qualifier in the definition of a __global__function:

maxThreadsPerBlock specifies the maximum number of threads per block with which the application will ever launch MyKernel(); it compiles to the .maxntidPTX directive;
minBlocksPerMultiprocessor is optional and specifies the desired minimum number of resident blocks per multiprocessor; it compiles to the .minnctapersmPTX directive.

If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor (see Hardware Multithreading for the relationship between the number of registers used by a kernel and the number of registers allocated per block). The compiler then optimizes register usage in the following way:

If the initial register usage is higher than L, the compiler reduces it further until it becomes less or equal to L, usually at the expense of more local memory usage and/or higher number of instructions;
If the initial register usage is lower than L
- If maxThreadsPerBlock is specified and minBlocksPerMultiprocessor is not, the compiler uses maxThreadsPerBlock to determine the register usage thresholds for the transitions between n and n+1 resident blocks (i.e., when using one less register makes room for an additional resident block as in the example of Multiprocessor Level) and then applies similar heuristics as when no launch bounds are specified;
- If both minBlocksPerMultiprocessor and maxThreadsPerBlock are specified, the compiler may increase register usage as high as L to reduce the number of instructions and better hide single thread instruction latency.

A kernel will fail to launch if it is executed with more threads per block than its launch bound maxThreadsPerBlock.

Optimal launch bounds for a given kernel will usually differ across major architecture revisions. The sample code below shows how this is typically handled in device code using the __CUDA_ARCH__macro introduced in Application Compatibility

In the common case where MyKernel is invoked with the maximum number of threads per block (specified as the first parameter of __launch_bounds__()), it is tempting to use MY_KERNEL_MAX_THREADS as the number of threads per block in the execution configuration:

This will not work however since __CUDA_ARCH__ is undefined in host code as mentioned in Application Compatibility, so MyKernel will launch with 256 threads per block even when __CUDA_ARCH__is greater or equal to 200. Instead the number of threads per block should be determined:

Either at compile time using a macro that does not depend on __CUDA_ARCH__, for example

Or at runtime based on the compute capability

Register usage is reported by the --ptxas options=-v compiler option. The number of resident blocks can be derived from the occupancy reported by the CUDA profiler (see Device Memory Accessesfor a definition of occupancy).

Register usage can also be controlled for all __global__ functions in a file using the maxrregcount compiler option. The value of maxrregcount is ignored for functions with launch bounds.

本文备注/经验分享：

今天的内容是__launch_bounds__()修饰.该修饰和nvcc的-maxrregcount=N, 是目前控制每个线程的常规寄存器使用量的唯二的方式 (忘记常规寄存器是什么了? 请参考之前的章节里面的对常规寄存器和Predicate寄存器的描述. 后者注意是固定的7个, 改不了...能改变的只有常规寄存器(R寄存器). 所以这也是为何-maxrregcount中的r字母连续出现了两次的原因, 而不是打错)。我们都知道, GPU是靠TLP和ILP进行延迟掩盖, 从而发挥高性能的. 而前者, TLP, 需要海量的线程并行. SM上面能同时上去同时执行的线程数量实际上有限(受到SM上的多种资源限制, 主要是shared memory和寄存器使用量; 以及, 硬件本身的特性限制)。 shared memory的使用量控制好办, 这个是用户手工设定的.而寄存器的用量控制则要看刚才说过的唯2方式.本章节则介绍了launch bounds的控制方式.这种方式和maxrregcount控制各自有各自的优点.通过前者, 可以不需要手工计算出来每个线程能用多少寄存器, 可以直接要求特定的每个SM上的blocks数量和blocks里面的线程数量, 例如, 你可以直接对一个kernel进行修饰: __launch_bounds__(256, 1),即要求至少可以上256个线程的block 1个, 在每SM上.而__launch_bounds__(512, 2), 则等于说, 要求每个SM上至少能上去2个512线程的blocks.因为常见的Maxwell和Pascal卡(例如1080), 每个SM有64K个寄存器(一些资料写成256KB, 因为每个寄存器是32-bit的, 所以一样),这种要求等于说, 每个线程最多可以使用64个常规寄存器 (64K / 512 / 2 = 64),等于你编译的时候使用nvcc -maxrregcount = 64 ...., 或者在VS的项目属性中的最大寄存器数量设定为64.但是你不需要手工计算出来这个64(如何计算, 本章节也有说明).还是很方便的.其次, maxrregcount只能对一个.cu文件整体指定, 如果该文件中有多个kernel的源代码,则maxrregcount的方式对所有的kernel都应用.而如果你只需要限制一个kernel的时候, 则只能将.cu文件拆分成多个文件, 每个文件里面一个kernel的源代码.或者使用__launch_bounds__, 这种允许对单一kernel设定, 而不需要拆分成多个文件.但是maxrregcount用起来简单粗暴, 很多时候也是很令人喜欢的选择的.无论是那种寄存器使用数量的控制, 均会引起驻留线程数量的变化,精确的说, 因为线程是以blocks为组织单位上到SM的, 这实际上会引起驻留blocks数量的变化(resident blocks)——这里假设kernel不受到其他资源使用的影响, 例如shared memory 。这种变化会引起驻留的blocks或者线程数量上升(当每个线程使用的寄存器资源变少的时候), 或者下降(当每个线程使用的寄存器资源变多的时候),但是性能的变化则不一定.不一定是驻留的线程数量(或者说blocks数量)增加(TLP线程并行度高)性能就好.也不一定驻留的线程数量变少(TLP程度降低)性能就差.这个不一定的.需要用户反复手工自己试验才能知道.每个kernel都可能不同的.但一般总会是一个驼峰样子的曲线(请大脑想象一下).

一开始寄存器使用较多, 随着寄存器用量的减少(或者说线程数目增加), 性能假设可以逐渐提升，然后慢慢的提升到了一个峰值, 此时寄存器使用继续减少, 性能不但不再提升了, 反而再次开始下降.这是因为, 虽然使用寄存器较少(线程较多)能提高并行度, 从而可能提升性能;但寄存器的使用较少, 影响了一些数据的缓存或者使用(例如, 原本被缓冲在寄存器中的值, 现在需要被从local memory中重新读取),或者原本一些能直接用的值, 现在因为寄存器数量较少, 被覆盖了, 需要重新计算.从而引入了额外的代价, 能损失性能.因为寄存器的数量使用变化同时有对性能有利和对性能有害的2个方面存在,所以追寻它们两个的平衡点, 性能最好的方面, 需要用户的反复试验.这也是为了论坛经常说的, 很多东西无固定公式, 不是使用的越多(越少)就越好.等等。包括昨天论坛有人问, threadIdx.x的使用(和对它的计算, 例如threadIdx.x * 8 + 7), 是我每次使用的时候直接用threadIdx.x好呢?还是保存它的值到变量(或者对它的计算后的值到变量), 然后再用变量好呢?我的回答器其实总是, 还是就地使用的好.这是因为你的手工计算保存(特别是编译器无法知道后续能否再次计算出来的时候), 往往会手工的占据寄存器(或者local memory)，而直接写明式子, 很多时候编译器能在寄存器使用上, 和重新计算上(假设它能重新计算), 根据当前的情况, 例如用户对launch bounds的指定, 和当前区域的寄存器紧张程度, 自动作出均衡.比你总是假定手工缓冲的效果往往会好的.类似这样的. 请想想, 一个数据要么是强制保存在寄存器中或者local memory(或者你手工保存到shared memory), 然后以后重用; 要么则是能重新创造出来(例如重新读取, 或者重新计算出来),这两者哪个好, 其实很多时候是难以人脑均衡的.而编译器往往能考虑更多方面, 所以交给编译器往往更好(不是说人脑想不出来, 而是需要考虑的方面太多, 人脑思考不合算. 毕竟老板每天的工资是有限的, 不能浪费到这种方面). 这也是很多时候我们对CPU上的开发说, 用x86/x64汇编好? 还是交给编译器好?这实际上是20多年前，当年的著名的CPU上的汇编和C之争.现在20年过去了, 现在新一代的从业者可能感觉这个问题很无聊, 很好笑.因为现在的人会想, CPU的架构不同, 多种指令的execution ports不同, 缓存的配置不同, 我考虑这些干吗? 为何不直接交给编译器?同理现在GPU上也是如此.当年CPU上使用eax寄存器好? 还是直接C中使用一个p[id]好? 前者需要手工汇编控制寄存器, 后者全自动.这就是现在的GPU上的同样的问题几乎.所以还是直接交给编译器吧.注意我们不是说, 一定用CUDA C就好, 同样需要看到很多手工(不是这里的唯2的两种方式, 而是从最底层汇编(SASS)的角度的手工使用)写成的高质量代码.例如cublas.但是这种控制方式具体常规用户太遥远, 或者太困难.所以这里的唯2的方式则是你只能的选择. 需要注意的是, 本章的launch bounds最终的影响发生在PTX->CUBIN的工程中, 而不是发生在CUDA C -> PTX的过程中, 这是因为CUDA C层次没有寄存器的概念(只有私有变量---最终会映射成寄存器和local memory), 而PTX中虽然有虚拟寄存器, 但PTX采用单次寄存器赋值风格(Single Static Assignment---一种常规的优化需要), 虚拟寄存器是无限多的. 所以最终实际发生在CUBIN/SASS的生成中，这是为何反复本章节手册, 提到PTX中的对应的2个directive的原因，不使用PTX的用户, 可以直接暂时无视它(例如需要使用一些CUDA C中没有的功能, 但在PTX中有，例如高级版本的__syncthreads(), 能允许block中的部分线程同步, 而不是全部，此时可选在CUDA C代码中嵌入一些PTX)。此外, CUDA C中的launch bounds, 有一个很有意思的竞争规范(OpenCL)中的对应东西,OpenCL中的reqd_work_group_size具有类似效果，但两者并不等价.CUDA的本章节的launch bounds中的最大线程数量/blocks, 设定了一个上限.用户最终调用kernel的时候, 不超过即可(小于等于)，而OpenCL版本的, 则是固定的group中的线程数量, 不能超过, 也不能小于(必须等于).来自OpenCL的用户需要特别注意这点.主要是这影响了一些优化. 例如如果线程数量固定, 类似a = blockDim.x * blockIdx.x + threadIdx.x;中的这里的blockDim.x实际上是一个常数了.而不固定则不能使用这个优化.所以实际上说OpenCL上的这个版本往往会效果更好(其实看编译器---AMD的编译器比较弱智, 很大程度的抵消了这点好处).但是CUDA上的版本更宽泛, 约束更低, 用起来更自由, 不容易出错(例如用户指定了小于的数量),所以究竟哪个好只能说各有利弊. 此外, 对于一种特殊的用法:

_launch_bounds__(32) __global__ void your_kernel(....) { ... }

这种样子的kernel, 如果在内部写有代码: 写入shared memory __syncthreads(); 读取Shared memory 则launch bounds不仅仅具有限定寄存器使用量的作用.还具有暗示了blocks内部warp的排布的效果.AMD家的编译器长期利用这种副作用效果, 来拿掉对OpenCL中的barrier()的使用(直接编译成空)，提升性能.而NV的编译器, 必须使用新的CUDA 9+, 才具有这个效果.之前的老CUDA 8.0依然会生成一条bar.sync指令.虽然这是不必要的(例如在6.1的Pascal上编译).所以你看, 虽然是老卡, 但是更换成了新的编译器后(例如CUDA 9.2)，依然能享受到编译效果的提升.所以这也是为了我们之前总是建议用户升级到CUDA 9的原因, 即使你是老卡.代码质量和运行速度依然可以被提升的(当然, 可能有一些不兼容的改动, 不过你迟早得迎接CUDA 9的. 早使用老卡也能提速).

需要补充的是:本章节给出了最后的一个例子代码，是对__CUDA_ARCH__是否是小于计算能力2.0和大于等有计算能力2.0的判断，(即对Tesla架构(注意不是现在的Tesla卡, 当年的很早的一个计算架构), 和对Fermi+的架构的判断) 这里需要说一下.首先是手册应当直接拿掉这个例子的.这个例子实际上是CUDA 9.2中直接抄自老版本的手册.现在CUDA 9.2都不支持fermi了, 结果例子还在，而现在的新用户往往不需要考虑这个了.可惜了手册本章节最后的一大段解说.

有不明白的地方，请在本文后留言

或者在我们的技术论坛bbs.gpuworld.cn上发帖

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2018-07-30，如有侵权请联系 cloudcommunity@tencent.com 删除

其他

本文分享自 GPUS开发者微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

其他

登录后参与评论

0 条评论

热度