DAY64：阅读 Memory Model

GPUS Lady

发布于 2018-08-17 14:40:22

3790

发布于 2018-08-17 14:40:22

文章被收录于专栏：GPUS开发者

我们正带领大家开始阅读英文的《CUDA C Programming Guide》,今天是第64天，我们正在讲解CUDA C语法，希望在接下来的36天里，您可以学习到原汁原味的CUDA，同时能养成英文阅读的习惯。

关注微信公众号，查看历史信息，可以看到之前的阅读

本文共计864字，阅读时间20分钟

D.2.2. Memory Model

Parent and child grids share the same global and constant memory storage, but have distinct local and shared memory.

D.2.2.1. Coherence and Consistency

D.2.2.1.1. Global Memory

Parent and child grids have coherent access to global memory, with weak consistency guarantees between child and parent. There are two points in the execution of a child grid when its view of memory is fully consistent with the parent thread: when the child grid is invoked by the parent, and when the child grid completes as signaled by a synchronization API invocation in the parent thread.

All global memory operations in the parent thread prior to the child grid's invocation are visible to the child grid. All memory operations of the child grid are visible to the parent after the parent has synchronized on the child grid's completion.

In the following example, the child grid executing child_launch is only guaranteed to see the modifications to data made before the child grid was launched. Since thread 0 of the parent is performing the launch, the child will be consistent with the memory seen by thread 0 of the parent. Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, ..., data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child). When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call:

D.2.2.1.2. Zero Copy Memory

Zero-copy system memory has identical coherence and consistency guarantees to global memory, and follows the semantics detailed above. A kernel may not allocate or free zero-copy memory, but may use pointers to zero-copy passed in from the host program.

D.2.2.1.3. Constant Memory

Constants are immutable and may not be modified from the device, even between parent and child launches. That is to say, the value of all __constant__ variables must be set from the host prior to launch. Constant memory is inherited automatically by all child kernels from their respective parents.

Taking the address of a constant memory object from within a kernel thread has the same semantics as for all CUDA programs, and passing that pointer from parent to child or from a child to parent is naturally supported.

D.2.2.1.4. Shared and Local Memory

Shared and Local memory is private to a thread block or thread, respectively, and is not visible or coherent between parent and child. Behavior is undefined when an object in one of these locations is referenced outside of the scope within which it belongs, and may cause an error.

The NVIDIA compiler will attempt to warn if it can detect that a pointer to local or shared memory is being passed as an argument to a kernel launch. At runtime, the programmer may use the __isGlobal() intrinsic to determine whether a pointer references global memory and so may safely be passed to a child launch.

Note that calls to cudaMemcpy*Async() or cudaMemset*Async() may invoke new child kernels on the device in order to preserve stream semantics. As such, passing shared or local memory pointers to these APIs is illegal and will return an error.

D.2.2.1.5. Local Memory

Local memory is private storage for an executing thread, and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child will be undefined.

For example the following is illegal, with undefined behavior if x_array is accessed by child_launch:

It is sometimes difficult for a programmer to be aware of when a variable is placed into local memory by the compiler. As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap, either with cudaMalloc(), new() or by declaring __device__ storage at global scope. For example:

D.2.2.1.6. Texture Memory

Writes to the global memory region over which a texture is mapped are incoherent with respect to texture accesses. Coherence for texture memory is enforced at the invocation of a child grid and when a child grid completes. This means that writes to memory prior to a child kernel launch are reflected in texture memory accesses of the child. Similarly, writes to memory by a child will be reflected in the texture memory accesses by a parent, but only after the parent synchronizes on the child's completion. Concurrent accesses by parent and child may result in inconsistent data.

本文备注/经验分享：

实际上, 本章节说了这么多, 其实可以缩小成一个最小的范围:

只要你不要跨kernel使用local/texture/shared/constant, 然后global memory每次都启动子kernel前同步父kernel的block中的线程, 然后立刻暂停父kernel的block(同步等待子kernel),那么肯定是安全的.

这个基础可以给你一个能快速入手动态并行的时候访存的基础.然后再在这个基本的基础上, 如果感觉性能不行, 或者不方便, 再考虑逐渐利用本章节的详细说明,扩大使用范围.这样可以较快的没有挫折的入手, 很多时候一致性导致的问题, 难以精确重现或者调试,这样至少能让你入门的高兴一点.或者干脆如同之前的章节说过, CUDA很多东西, 不使用也可以写出代码来,你也可以干脆放弃计算能力3.5+上的并行计算能力,特别是对于新人来说, 与其挣扎半个月, 一事无成,不如先写出一点渣代码来, 这样还能有点进度, 给入门带来点信心.或者再干脆一点，连CUDA都不要用, 直接全程CPU完成计算, 这样可以带来更大的信心.或者再再干脆一点, 连计算机都不要用, 直接手写数学式子, 在A4纸上完成(好吧, 最后两个是怕你阅读累了, 用来消遣逗乐的)。

今天的章节, 主要是说在使用DP(动态并行)的情况下, 各个存储器的使用注意事项, 特别是一致性的问题.本章节将对常用的几乎所有的存储器类型(global, local, texture, constant等等)进行说明.按照动态并行章节的惯例, 没有说明的, 就是和Host一样的.首先说一下global memory. 本章节拆开了普通global memory(显存), 和映射的global memory(也叫zero-copy memory, 或者内存映射映射的global memory).实际上这两者是完全一样的.父kernel和子kernel之间的对global memory上的内容的修改,后者可以看到前者在自己启动前的所有修改, 前者也可以看到后者(子kernel)结束后, 之前所作的左右更改.但是存在另外一种情况, 就是资源允许的时候, 父kernel和子kernel可能在同时运行,此时互相对方通过普通访存作出的内容修改, 则不能保证一致性.任何一方读取正在可能被另外一方更改中的内容, 得到的值是不定的. 注意手册本章节没有对原子操作作出说明, 根据我们之前的试验,原子操作是可以在同时运行的父kernel和子kernel之间, 保持正确的一致性的.但前提是你得用的小心, 不要假设一些情况, 从而造成死锁.例如你假设子kernel会同时在运行, 然后通过原子操作设定标志, 通知父kernel它执行到了特定的位置或者完成了特定的操作,但是实际上, 子kernel完全没有资源运行, 此时父kernel通过原子操作轮询(例如某种CAS操作),将永远的卡死，这点需要注意，此外, 关于本章节没有提到的另外一种存储器类型---unified memory。这个需要说明, 从kernel的角度(例如做为kernel的参数: int *p), 用起来和另外两种global memory(显存后备的, 和内存后备的), 并无区别.所以实际上, 刚才的段落的内容(一致性), 对三种global memory的情况, 都是相同的(从kernel的角度).注意, 因为在支持的平台下(例如64-bit Linux. Windows下很多时候并不能支持真正的unified memory),和Pascal+的卡, unified memory支持同时host上的原子操作访问, 和device上的原子操作一起进行.这实际上会导致, 在使用动态并行的时候, 同时存在CPU端, 父kernel, 子kernel, 三者同时的视角(因为很可能, 这三者在同时执行),根据推算依然应当使用原子操作(CPU上的原子操作+GPU上的原子操作)在这三者间能保持一致性.但是我们没有测试过这点. 建议用户自行测试反馈. 此外, 关于父kernel和子kernel同时使用原子操作保持一致性的问题(也包括子kernel单独运行, 或者父kernel单独运行, 另外一个暂停的时候),我们只做过简单测试, 用户使用在生产环境的时候, 应当作出详尽测试. 同时这点也欢迎反馈.手册本章节并未对这两点作出任何说明, 手册只说明, 只有在子kernel结束和死亡的时候才能保持普通访存的一致性. 这点需要注意了.

关于为何是能在子kernel启动和结束前后, 能保持一致性, 这是因为根据已有的资料, 这两个时刻会清一下L1 cache数据(invalidate all L1 lines), 还记得我们之前的章节的L2是统一的, L1是分布式的, 各个L1并不维持一致性的说法吗?这是global memory.注意手册这个段落还说明了, cudaDeviceSynchronize()---动态并行并不能在设备端使用流同步, 只能一次性得等待本block启动的*所有*子kernel们,只是会等子kernel, 并不具有让父kernel进行同步的功能,这点虽然看上去是显然的, 但是很多用户容易想歪, 因此这里需要作出特别说明. 其次是关于local memory, 这个任何时候都不能将指向local memory中的内容传递给子kernel的,实际上不仅仅是子kernel, 连传递给父kernel自己的block里面的其他线程, 也是不允许的(之前的章节有说过这个),使用仅限于单体线程, 不能做任何传递的。这点一定要注意, 实际上, 目前的CUDA编译器会尝试检测到这点,然后发现问题会告诉你, 过不了编译的.但是很多用户会挖空心思尝试越过这个限制(赞扬这种想象力),例如, 最简单的可以外面用struct套一下, 就可以越过去了.但是此时这样做, 虽然能正常的编译, 但是运行的时候会导致未定义的结果(你可以总是传递local memory上的值, 传值而不是地址, 是可以的，例如有一个local memory上的a[], 读取里面任何一个元素的值, 然后做为参数或者其他方式传递给子kernel, 这是可以的). 总之, 使用local memory一定要注意.因为实际中, local memory很多时候看起来像你的以前的CPU函数上面的局部变量之类的东西, 可以简单的记住, 只要以前CPU上操作看上去不安全的情况, 例如CPU上你传递了一个局部数组或者变量的地址,那么在GPU就同样不要使用.实际上, 这么做CPU上也存在挂的风险的, 只是GPU上几乎必挂而已。说完local memory, 该说一下shared memory了,这个东西因为是block私有的, 理论上说, 只应当在一次kernel启动(grid)中的1个block的内部的线程之间互相传递.其他的所有操作都是未定义的.因为连之前无动态并行的时候, 同一个kernel之间都不能跨block传递(参考之前的shared memory章节), 你就更不用说, 能跨父子kernel传递了.任何试图这样做的均将导致未定义的结果. 我们知道总是存在一些技巧的, 可以让shared memory上的内容保存下来, 能够跨block,或者跨下一次kernel启动能使用它们，例如有的时候, 填充或者说初始化shared memory非常耗时, 例如需要从global memory非常随机的访存, 得到数据来填充, 或者需要经过很长的计算过程才能初始化好.此时一些技巧利用了硬件, 并不在多个上到SM的blocks之间, 或者kernel启动之间清空shared memory,很可能下一次的启动或者下一个的block能遭遇到之前的shared memory中的值,此时可以重复使用.这种用法是手册未定义的. 如果真的需要这样做(例如可以提升20%的性能),则建议总是在固定的驱动版本, 和计算能力下进行. 好在如果是做为软硬件一体的解决方案, 这种是可以的.(例如给用户提供硬件, 外加和上面执行的软件的时候)，做为纯软件提供的方案则不建议这样做.特别是当卡带有显示输出的时候,经常会发现这样做是不安全的(即, shared memory中的内容可能被来自第三方的shader---例如来自桌面环境绘制, 所破坏).所以一定要小心这种未定义的行为.用户可以考虑进行一次快速内容校验---例如某种快速的并行CRC之类, 再考虑是否能重用shared memory内容.然后关于texture, 这个很多时候, 后备的普通线性内存, 可以通过普通指针进行改写,然后这种改写再没有动态并行的时候, 必须等到下一次kernel启动才能生效(参考我们之前的host上的texture章节),现在再使用动态并行的时候, 结束kernel的时候, 因为总是有保证, 会有L1 texture cache被清空一次, 因此此时能保证安全.但是需要注意的是, 不要一个kernel在更新texture, 另外一个kernel在同时试图读取(采样),这种是不安全的. 你必须完全暂停一个, 才可以. 例如父kernel(中的block)可以直接用cudaDeviceSynchronize()暂停自己的执行.以及, 不同的计算能力上, L1/Shared/Texture有不同的组合方式, 有的是L1/Shared硬件上是一体的, 有的是L1/Texture是一体的, 还有的是L1/Texture/Shared是一体的, 用户需要注意不同计算能力上,texutre cache究竟指的是什么的问题.然后是constant memory,这个建议总是在动态并行使用的时候, 不要更新它.因为总是有一些可以从GPU上直接更新constant memory的技巧, 而不是用的host上的cudaMemcpyToSymbol(后者是标准做法)来更新.我建议总是在一次父kernel启动和子kernel(们)的所有启动之间, 不要碰它, 或者说碰了它就不要读取了, 然后再下一次从Host上启动的kernel中, 再尝试读取. 这主要是涉及2个问题: (1)目前并未知道, 动态并行启动kernel的时候, 是否会清constant cache(可能会) (2)就算第(1)点后, 编译器再目前遇到对constant的使用的时候, 很可能不会每次都访问constant memory/cache, 而是使用寄存器中之前保存的值, 这样无论动态并行是否清空它, 都很危险. 因为这2点, 所以建议总是在下一次, 从*Host*上启动的kernel中, 再使用. 以避免不一致的危险.

有不明白的地方，请在本文后留言

或者在我们的技术论坛bbs.gpuworld.cn上发帖

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2018-08-07，如有侵权请联系 cloudcommunity@tencent.com 删除

其他

本文分享自 GPUS开发者微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

其他

登录后参与评论

0 条评论

热度