DAY93: Reading Coherency and Concurrency

GPUS Lady

Published 2018-12-25 16:06:56

Included in the column: GPUS开发者 (GPUS Developers)

K.2.2. Coherency and Concurrency

Simultaneous access to managed memory on devices of compute capability lower than 6.x is not possible, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active. However, devices of compute capability 6.x on supporting operating systems allow the CPUs and GPUs to access Unified Memory allocations simultaneously via the new page faulting mechanism. A program can query whether a device supports concurrent access to managed memory by checking a new concurrentManagedAccess property. Note, as with any parallel application, developers need to ensure correct synchronization to avoid data hazards between processors.
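As a minimal sketch of the query described above, the following program lists every visible device and prints its concurrentManagedAccess value. It assumes a CUDA 8 or later toolkit (where the cudaDeviceProp field exists); the output format is only illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Device ordinals start at 0, so valid values are 0 .. deviceCount - 1.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::printf("device %d: query failed: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        // 1: CPU and GPU may access managed memory at the same time.
        // 0: the GPU has exclusive access while any kernel is active.
        std::printf("device %d (%s): concurrentManagedAccess = %d\n",
                    dev, prop.name, prop.concurrentManagedAccess);
    }
    return 0;
}
```

If only this one value is needed, cudaDeviceGetAttribute() with cudaDevAttrConcurrentManagedAccess should return the same information without filling the whole property structure.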

K.2.2.1. GPU Exclusive Access To Managed Memory

To ensure coherency on pre-6.x GPU architectures, the Unified Memory programming model puts constraints on data accesses while both the CPU and GPU are executing concurrently. In effect, the GPU has exclusive access to all managed data while any kernel operation is executing, regardless of whether the specific kernel is actively using the data. When managed data is used with cudaMemcpy*() or cudaMemset*(), the system may choose to access the source or destination from the host or the device, which will put constraints on concurrent CPU access to that data while the cudaMemcpy*() or cudaMemset*() is executing. See Memcpy()/Memset() Behavior With Managed Memory for further details.

It is not permitted for the CPU to access any managed allocations or variables while the GPU is active for devices with concurrentManagedAccess property set to 0. On these systems concurrent CPU/GPU accesses, even to different managed memory allocations, will cause a segmentation fault because the page is considered inaccessible to the CPU.

In the manual's example (reproduced in the notes below), the GPU kernel is still active when the CPU touches y. (Note how the access occurs before cudaDeviceSynchronize().) The code runs successfully on devices of compute capability 6.x due to the GPU page faulting capability, which lifts all restrictions on simultaneous access. However, such memory access is invalid on pre-6.x architectures even though the CPU is accessing different data than the GPU. The program must explicitly synchronize with the GPU before accessing y.
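A hedged sketch of that synchronization, built on the manual's x/y example: the program checks concurrentManagedAccess for device 0 (an assumption that the kernel runs on device 0) and synchronizes before the CPU touches y only when the device needs it. Calling cudaDeviceSynchronize() unconditionally would also be correct and is the simpler choice when the overlap is not needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ __managed__ int x, y = 2;

__global__ void kernel()
{
    x = 10;   // the kernel only touches x
}

int main()
{
    // Assumption for this sketch: the kernel runs on device 0.
    int concurrent = 0;
    if (cudaDeviceGetAttribute(&concurrent,
                               cudaDevAttrConcurrentManagedAccess, 0) != cudaSuccess) {
        std::printf("attribute query failed\n");
        return 1;
    }

    kernel<<<1, 1>>>();

    if (!concurrent) {
        // Without concurrent access, the CPU must not touch ANY managed data
        // (not even y) until the GPU is idle again.
        cudaDeviceSynchronize();
    }
    y = 20;

    // Synchronize before reading x, which the kernel writes.
    cudaDeviceSynchronize();
    std::printf("x = %d, y = %d\n", x, y);
    return 0;
}
```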

As this example shows, on systems with pre-6.x GPU architectures, a CPU thread may not access any managed data in between performing a kernel launch and a subsequent synchronization call, regardless of whether the GPU kernel actually touches that same data (or any managed data at all). The mere potential for concurrent CPU and GPU access is sufficient for a process-level exception to be raised.

Note that if memory is dynamically allocated with cudaMallocManaged() or cuMemAllocManaged() while the GPU is active, the behavior of the memory is unspecified until additional work is launched or the GPU is synchronized. Attempting to access the memory on the CPU during this time may or may not cause a segmentation fault. This does not apply to memory allocated using the flag cudaMemAttachHost or CU_MEM_ATTACH_HOST.
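The cudaMemAttachHost exception can be sketched as follows. The kernel names busy and increment, the CHECK macro, and the buffer size are hypothetical. The step that re-attaches the allocation with cudaStreamAttachMemAsync() before the GPU uses it reflects my understanding of the cudaMallocManaged() documentation for devices whose concurrentManagedAccess is 0; treat it as an assumption to verify rather than something this section states.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(e_));                       \
            std::exit(1);                                               \
        }                                                               \
    } while (0)

__global__ void busy(int *flag) { *flag = 1; }   // stands in for "GPU is active"

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 256;
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    int *flag = nullptr;
    CHECK(cudaMalloc(&flag, sizeof(int)));       // ordinary device memory
    busy<<<1, 1, 0, stream>>>(flag);             // may still be running below

    // Allocate while the GPU may be active, attached to the host.
    int *data = nullptr;
    CHECK(cudaMallocManaged(&data, n * sizeof(int), cudaMemAttachHost));
    for (int i = 0; i < n; ++i) data[i] = i;     // CPU access to host-attached memory

    // Assumption: make the allocation visible to the GPU before a kernel uses it.
    CHECK(cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachGlobal));
    CHECK(cudaStreamSynchronize(stream));

    increment<<<1, n, 0, stream>>>(data, n);
    CHECK(cudaDeviceSynchronize());              // GPU idle: CPU may read again

    std::printf("data[0] = %d, data[%d] = %d\n", data[0], n - 1, data[n - 1]);

    CHECK(cudaFree(data));
    CHECK(cudaFree(flag));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```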

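Notes and practical experience:

This section mainly describes the principal shortcoming of first-generation Unified Memory and the precautions for using it. First-generation Unified Memory here means the Kepler and Maxwell cards. Starting with Pascal there is an enhanced, second-generation Unified Memory that does not have these problems. There are two exceptions to keep in mind: on Windows, Unified Memory on Pascal falls back to the Maxwell mode, and the TX2, although its integrated GPU is also Pascal, supports only the Maxwell-style first-generation Unified Memory.

In practice you do not need to memorize these two exceptions. The first paragraph of this section says you can simply query the device's concurrentManagedAccess property. Its value is 0 with first-generation Unified Memory and 1 with second-generation Unified Memory (apart from the exceptions above). That removes the burden of memorizing anything: just query it when you need it.

We covered how to query device properties in an earlier chapter, so here is only a brief recap for this Unified Memory property. First declare cudaDeviceProp prop; then call cudaGetDeviceProperties(&prop, n); and then check whether prop.concurrentManagedAccess is 0 or 1. Note that CUDA numbers devices starting from 0: if I have 4 cards, they are 0, 1, 2, 3, not 1, 2, 3, 4. While we are on the subject, many Unified Memory functions accept a special device number of -1, which usually stands for the CPU rather than any GPU in the system. We will come back to this when we encounter it.

Back to this section. A main feature of second-generation Unified Memory is that the CPU and the GPU can access the same Unified Memory buffer at the same time, which first-generation Unified Memory does not support. That is the subject of the second part of this section: when using the older Unified Memory, remember that the GPU has exclusive access to it. The manual stresses that as long as a kernel is executing on the GPU, the CPU cannot use any Unified Memory at all, whether or not that kernel uses a particular piece of it. Pay attention to the relationship between "a particular piece" and "any at all" here. The manual gives a concrete example: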

__device__ __managed__ int x, y = 2;

__global__ void kernel()
{
    x = 10;
}

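There are two variables in Unified Memory, x and y (for the __managed__ keyword, see the previous chapter), and the kernel here uses only x, setting it to 10. The CPU code that follows is:

kernel<<< 1, 1 >>>();
y = 20; // Error on GPUs not supporting concurrent access

Even though the CPU code, after launching the kernel and while the kernel is still executing, accesses the other variable y rather than the x used by the kernel, it still fails. That is why "all" was emphasized earlier.

The section then gives the correct fix for this code on the old Unified Memory (the second of the two similar listings in the section). The CPU calls cudaDeviceSynchronize() to wait for the GPU to finish using Unified Memory (even though what the GPU used was not y), after which the CPU can safely access Unified Memory:

cudaDeviceSynchronize();
y = 20; // Success on GPUs not supporting concurrent access

Note the added synchronization call. The inability of the CPU and GPU to access Unified Memory at the same time is a defining trait of the old first-generation Unified Memory. In particular, as this section says, by default any executing kernel locks all Unified Memory so that the CPU cannot use it, whether or not the kernel actually uses all of it. This point is easy to overlook and deserves attention.

Seen this way, the old cards are heavily restricted. Fortunately, the old Unified Memory on old cards can be partitioned: you split Unified Memory into several blocks and specify which blocks a given kernel may use and which blocks it will not use. This allows the CPU and GPU to access Unified Memory simultaneously to some degree (though still not as freely as with the second generation, that is, Pascal and later). We will cover this later.

The section also raises further precautions. One concerns allocation: if a kernel is still in progress on the card when you call cudaMallocManaged, it is best to wait for one GPU synchronization before touching the new memory (there are other ways to use it, but this is the safest). Otherwise, as in the first of the two examples above, the host may well crash when it accesses the newly allocated Unified Memory.

The section ends with an exception: the above does not apply if the attach-to-host flag was specified at allocation time. (Remember that cudaMallocManaged has an optional last parameter? You can set it to cudaMemAttachHost.) In that case the Unified Memory starts out on the host side, in system memory, which sidesteps the old cards' ban on simultaneous CPU and GPU access.

Two points should be added to the manual here:

(1) The default is attach to global, so you must specify the host flag explicitly.

(2) CU_MEM_ATTACH_HOST is the Driver API spelling, for use with cuMemAllocManaged; the manual lists both spellings side by side. With the Runtime API's cudaMallocManaged, the correct flag is cudaMemAttachHost. Note the different prefix and the different capitalization.

In addition, this section mentions the cudaMemcpy and cudaMemset families of functions and the question of which side accesses the data. Because Unified Memory can be accessed by both the CPU and the GPU, a single cudaMemcpy can often be regarded either as an access from the host or as an access from the device, since Unified Memory can be thought of as virtual memory shared by the CPU and the GPU. As a result, there are slight differences in practice in how you write things such as the transfer direction. We will describe this in a separate chapter.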
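To make the remark about copy direction a little more concrete, here is a small sketch that copies a host array into a managed allocation with cudaMemcpyDefault, which lets the runtime infer the direction from the pointers. It assumes a platform with unified virtual addressing (typical 64-bit systems); the array contents are arbitrary. It synchronizes before the CPU reads the managed buffer so that it also stays within the rules for devices without concurrent access. It does not cover the detailed placement rules from the manual's Memcpy()/Memset() section.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 4;
    int host[n] = {1, 2, 3, 4};

    int *managed = nullptr;
    if (cudaMallocManaged(&managed, n * sizeof(int)) != cudaSuccess) {
        std::printf("cudaMallocManaged failed\n");
        return 1;
    }

    // cudaMemcpyDefault: the runtime works out source and destination kinds
    // from the pointer values, so no explicit direction is spelled out.
    cudaError_t err = cudaMemcpy(managed, host, n * sizeof(int), cudaMemcpyDefault);
    if (err != cudaSuccess) {
        std::printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        cudaFree(managed);
        return 1;
    }

    // Make sure the GPU is idle before the CPU touches managed memory.
    cudaDeviceSynchronize();
    std::printf("managed[0] = %d, managed[%d] = %d\n",
                managed[0], n - 1, managed[n - 1]);

    cudaFree(managed);
    return 0;
}
```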

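If anything is unclear, please leave a comment below this article

or post on our technical forum, bbs.gpuworld.cn

This article is part of the Tencent Cloud self-media sync exposure program and is shared from a WeChat official account.

Originally published: 2018-12-07. In case of infringement, please contact cloudcommunity@tencent.com for removal.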
