DAY 6: Reading the CUDA C Programming Interface: CUDA C Runtime

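We are guiding readers through the English-language CUDA C Programming Guide. Today is day 6. We are spending several days on the CUDA programming interface, whose most important part is the CUDA C runtime. We hope that over the next 95 days you will learn CUDA from the original source and build a habit of reading technical English along the way.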

This article runs to about 845 words; estimated reading time is 15 minutes.

3.2.4. Page-Locked Host Memory

The runtime provides functions to allow the use of page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()):

· cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;

· cudaHostRegister() page-locks a range of memory allocated by malloc() (see reference manual for limitations).
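The two routes listed above can be sketched as follows. This is a minimal illustration, not code from the Programming Guide: the helper names ok() and pinned_roundtrip() are made up for this example, and it assumes a single CUDA-capable device is present. One buffer is allocated directly as page-locked memory with cudaHostAlloc(); a second buffer comes from malloc() and is page-locked afterwards with cudaHostRegister().

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool ok(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical demo: sends data from a cudaHostAlloc() buffer to the device
// and reads it back into a malloc() buffer pinned with cudaHostRegister().
bool pinned_roundtrip(size_t n) {
    const size_t bytes = n * sizeof(float);
    float *pinned = nullptr;
    float *dev = nullptr;
    float *registered = static_cast<float *>(std::malloc(bytes));
    if (registered == nullptr) return false;
    bool is_registered = false;

    // Route 1: allocate page-locked memory directly.
    bool good = ok(cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault),
                   "cudaHostAlloc");
    // Route 2: page-lock a range that malloc() already allocated.
    if (good) {
        good = ok(cudaHostRegister(registered, bytes, cudaHostRegisterDefault),
                  "cudaHostRegister");
        is_registered = good;
    }
    good = good && ok(cudaMalloc((void **)&dev, bytes), "cudaMalloc");

    if (good) {
        for (size_t i = 0; i < n; ++i) pinned[i] = static_cast<float>(i);
        good = ok(cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice), "H2D copy") &&
               ok(cudaMemcpy(registered, dev, bytes, cudaMemcpyDeviceToHost), "D2H copy");
    }
    if (good) {
        for (size_t i = 0; i < n; ++i) {
            if (registered[i] != pinned[i]) { good = false; break; }
        }
    }

    // Undo each step with its matching call.
    if (dev) cudaFree(dev);
    if (is_registered) cudaHostUnregister(registered);
    std::free(registered);
    if (pinned) cudaFreeHost(pinned);
    return good;
}

int main() {
    return pinned_roundtrip(1 << 20) ? 0 : 1;
}
```

Note the pairing: memory from cudaHostAlloc() is released with cudaFreeHost(), while a registered malloc() range is first unregistered with cudaHostUnregister() and then released with free().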

Using page-locked host memory has several benefits:

· Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices as mentioned in Asynchronous Concurrent Execution.

· On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory as detailed in Mapped Memory.

· On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining as described in Write-Combining Memory.

Page-locked host memory is a scarce resource however, so allocations in page-locked memory will start failing long before allocations in pageable memory. In addition, by reducing the amount of physical memory available to the operating system for paging, consuming too much page-locked memory reduces overall system performance.
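Because page-locked allocations can fail well before pageable ones, an application may want a fallback. The sketch below is one possible pattern and is not taken from the guide: the HostBuffer struct and the alloc_host() and free_host() helpers are hypothetical names. It tries cudaHostAlloc() first and falls back to malloc() if that fails, remembering which allocator was used so the buffer can be released correctly.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical wrapper that records how the buffer was allocated.
struct HostBuffer {
    void *ptr;
    bool pinned;
};

// Tries page-locked memory first; falls back to pageable memory.
HostBuffer alloc_host(size_t bytes) {
    HostBuffer buf = {nullptr, false};
    if (cudaHostAlloc(&buf.ptr, bytes, cudaHostAllocDefault) == cudaSuccess) {
        buf.pinned = true;
    } else {
        cudaGetLastError();            // clear the recorded error before continuing
        buf.ptr = std::malloc(bytes);  // pageable fallback; may itself be NULL
        buf.pinned = false;
    }
    return buf;
}

// Releases the buffer with the call that matches its allocator.
void free_host(HostBuffer buf) {
    if (buf.ptr == nullptr) return;
    if (buf.pinned) {
        cudaFreeHost(buf.ptr);
    } else {
        std::free(buf.ptr);
    }
}

int main() {
    HostBuffer buf = alloc_host(size_t(64) << 20);  // 64 MiB
    if (buf.ptr == nullptr) return 1;
    std::printf("got %s host memory\n", buf.pinned ? "page-locked" : "pageable");
    free_host(buf);
    return 0;
}
```

A pageable fallback still works with cudaMemcpy(); what is lost is the benefits listed above, such as overlapping the copy with kernel execution.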

The simple zero-copy CUDA sample comes with a detailed document on the page-locked memory APIs.

3.2.4.1. Portable Memory

A block of page-locked memory can be used in conjunction with any device in the system (see Multi-Device System for more details on multi-device systems), but by default, the benefits of using page-locked memory described above are only available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same unified address space, if any, as described in Unified Virtual Address Space). To make these advantages available to all devices, the block needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister().
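As a rough illustration of the portable flag, the sketch below allocates one buffer with cudaHostAllocPortable and copies it to every visible device in turn. The function name copy_to_all_devices() is invented for this example, and the code runs with one GPU just as well as with several.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical demo: copies one portable pinned buffer to every visible device.
// Returns the number of devices that received the data, or -1 on any error.
int copy_to_all_devices(size_t bytes) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    unsigned char *host = nullptr;
    // cudaHostAllocPortable: the block counts as pinned for all CUDA contexts,
    // not only for the device that is current at allocation time.
    if (cudaHostAlloc((void **)&host, bytes, cudaHostAllocPortable) != cudaSuccess) {
        return -1;
    }
    for (size_t i = 0; i < bytes; ++i) host[i] = static_cast<unsigned char>(i & 0xff);

    int done = 0;
    for (int d = 0; d < count; ++d) {
        void *dev = nullptr;
        if (cudaSetDevice(d) != cudaSuccess) break;
        if (cudaMalloc(&dev, bytes) == cudaSuccess &&
            cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, 0) == cudaSuccess &&
            cudaDeviceSynchronize() == cudaSuccess) {
            ++done;
        }
        cudaFree(dev);  // cudaFree accepts a null pointer
    }

    cudaFreeHost(host);
    return done == count ? done : -1;
}

int main() {
    int n = copy_to_all_devices(1 << 20);
    std::printf("copied to %d device(s)\n", n);
    return n > 0 ? 0 : 1;
}
```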

3.2.4.2. Write-Combining Memory

By default page-locked host memory is allocated as cacheable. It can optionally be allocated as write-combining instead by passing flag cudaHostAllocWriteCombined to cudaHostAlloc(). Write-combining memory frees up the host's L1 and L2 cache resources, making more cache available to the rest of the application. In addition, write-combining memory is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%.

Reading from write-combining memory from the host is prohibitively slow, so write-combining memory should in general be used for memory that the host only writes to.
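A minimal sketch of the "host only writes" usage pattern follows. The function name upload_from_write_combined() is hypothetical. The host fills the write-combining buffer sequentially and never reads from it; to check the upload, the data is copied back into a separate, ordinary buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical demo: the host only writes to the write-combining buffer.
bool upload_from_write_combined(size_t n) {
    const size_t bytes = n * sizeof(float);
    float *wc = nullptr;
    float *dev = nullptr;
    float *check = static_cast<float *>(std::malloc(bytes));
    if (check == nullptr) return false;

    bool good =
        cudaHostAlloc((void **)&wc, bytes, cudaHostAllocWriteCombined) == cudaSuccess &&
        cudaMalloc((void **)&dev, bytes) == cudaSuccess;

    if (good) {
        // Sequential stores only; no loads from wc anywhere on the host side.
        for (size_t i = 0; i < n; ++i) wc[i] = 1.0f;
        good = cudaMemcpy(dev, wc, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
               // Verify through a regular cacheable buffer instead of reading wc.
               cudaMemcpy(check, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (good) {
        for (size_t i = 0; i < n; ++i) {
            if (check[i] != 1.0f) { good = false; break; }
        }
    }

    if (dev) cudaFree(dev);
    if (wc) cudaFreeHost(wc);
    std::free(check);
    return good;
}

int main() {
    return upload_from_write_combined(1 << 20) ? 0 : 1;
}
```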

3.2.4.3. Mapped Memory

A block of page-locked host memory can also be mapped into the address space of the device by passing flag cudaHostAllocMapped to cudaHostAlloc() or by passing flag cudaHostRegisterMapped to cudaHostRegister(). Such a block has therefore in general two addresses: one in host memory that is returned by cudaHostAlloc() or malloc(), and one in device memory that can be retrieved using cudaHostGetDevicePointer() and then used to access the block from within a kernel. The only exception is for pointers allocated with cudaHostAlloc() and when a unified address space is used for the host and the device as mentioned in Unified Virtual Address Space.

Accessing host memory directly from within a kernel has several advantages:

· There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel;

· There is no need to use streams (see Concurrent Data Transfers) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution.

Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses using streams or events (see Asynchronous Concurrent Execution) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.

To be able to retrieve the device pointer to any mapped page-locked memory, page-locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA call is performed. Otherwise, cudaHostGetDevicePointer() will return an error.

cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory. Applications may query this capability by checking the canMapHostMemory device property (see Device Enumeration), which is equal to 1 for devices that support mapped page-locked host memory.
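The steps described in this section can be put together roughly as shown below. This is a sketch under stated assumptions: it uses device 0, and the kernel scale() and the function mapped_scale() are names made up for the example. It sets the cudaDeviceMapHost flag first, checks canMapHostMemory, allocates a mapped block, retrieves its device pointer, and lets a kernel modify host memory directly with no cudaMemcpy().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Multiplies every element in place; data points at mapped host memory.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical demo of the zero-copy path on device 0.
bool mapped_scale(int n) {
    // Enable mapping before the runtime creates a context on the device.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return false;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    if (!prop.canMapHostMemory) return false;  // device cannot map host memory

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *host = nullptr;      // host-side address of the block
    float *dev_view = nullptr;  // device-side address of the same block
    if (cudaHostAlloc((void **)&host, bytes, cudaHostAllocMapped) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    bool good =
        cudaHostGetDevicePointer((void **)&dev_view, host, 0) == cudaSuccess;
    if (good) {
        scale<<<(n + 255) / 256, 256>>>(dev_view, n, 2.0f);
        // The host must not touch the block until the kernel has finished,
        // otherwise the hazards described above can occur.
        good = cudaGetLastError() == cudaSuccess &&
               cudaDeviceSynchronize() == cudaSuccess;
    }
    if (good) {
        for (int i = 0; i < n; ++i) {
            if (host[i] != 2.0f) { good = false; break; }
        }
    }

    cudaFreeHost(host);
    return good;
}

int main() {
    bool good = mapped_scale(1 << 16);
    std::printf("zero-copy scale %s\n", good ? "succeeded" : "failed");
    return good ? 0 : 1;
}
```

Here cudaDeviceSynchronize() is the simplest way to order the kernel's writes before the host's reads; a stream or an event would serve the same purpose in a larger program.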

Note that atomic functions (see Atomic Functions) operating on mapped page-locked memory are not atomic from the point of view of the host or other devices.

Also note that CUDA runtime requires that 1-byte, 2-byte, 4-byte, and 8-byte naturally aligned loads and stores to host memory initiated from the device are preserved as single accesses from the point of view of the host and other devices. On some platforms, atomics to memory may be broken by the hardware into separate load and store operations. These component load and store operations have the same requirements on preservation of naturally aligned accesses. As an example, the CUDA runtime does not support a PCI Express bus topology where a PCI Express bridge splits 8-byte naturally aligned writes into two 4-byte writes between the device and the host.

Notes and practical experience:

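The runtime provides functions to allow the use of page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()):

Host memory comes in two kinds. One is ordinary pageable memory, which can be paged out to disk. The other is page-locked memory, whose pages stay pinned in physical RAM (the memory modules you plug into the machine). malloc() allocates the ordinary kind. Some of the runtime's allocation functions allocate device memory and some allocate host memory. Unlike ordinary C allocation functions such as malloc(), the runtime can allocate page-locked memory directly, and it can also turn ordinary, already-allocated memory into page-locked memory through a lock/register function that performs no allocation itself.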

Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices.

On some devices (compute capability 2.0 and later), copies between page-locked host memory and device memory can run while a kernel is executing.

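This relates to the familiar sequence: (1) prepare the data; (2) transfer the data to device memory; (3) the kernel computes on the data in device memory; (4) transfer the results back. With ordinary pageable memory, these steps run one after another. With page-locked memory, the transfer of the data needed by the next kernel launch can proceed while the current kernel is still running, which saves time and improves performance.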
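The overlap described above can be sketched with two streams. This is an illustrative example rather than code from the guide: the kernel add_one(), the function pipelined_add_one(), and the choice of two streams are all assumptions made for the demo. The input is split into chunks; each chunk is copied in, processed, and copied back in its own stream, so the copy for one chunk can overlap with the kernel for another on devices that support it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Hypothetical demo: processes `chunks` chunks through two streams.
bool pipelined_add_one(int chunks, int chunk_elems) {
    const int kStreams = 2;
    const size_t chunk_bytes = static_cast<size_t>(chunk_elems) * sizeof(float);
    const size_t total = static_cast<size_t>(chunks) * chunk_elems;
    float *host = nullptr;
    float *dev = nullptr;
    cudaStream_t streams[kStreams] = {};

    // Asynchronous copies need page-locked host memory to overlap with kernels.
    bool good =
        cudaHostAlloc((void **)&host, total * sizeof(float), cudaHostAllocDefault) == cudaSuccess &&
        cudaMalloc((void **)&dev, kStreams * chunk_bytes) == cudaSuccess;
    for (int s = 0; good && s < kStreams; ++s) {
        good = cudaStreamCreate(&streams[s]) == cudaSuccess;
    }

    if (good) {
        for (size_t i = 0; i < total; ++i) host[i] = static_cast<float>(i);
        for (int c = 0; good && c < chunks; ++c) {
            int s = c % kStreams;
            float *h = host + static_cast<size_t>(c) * chunk_elems;
            float *d = dev + static_cast<size_t>(s) * chunk_elems;
            // Work within one stream runs in order, so reusing the device
            // slot for chunk c + kStreams is safe.
            good = cudaMemcpyAsync(d, h, chunk_bytes, cudaMemcpyHostToDevice, streams[s]) == cudaSuccess;
            add_one<<<(chunk_elems + 255) / 256, 256, 0, streams[s]>>>(d, chunk_elems);
            good = good && cudaGetLastError() == cudaSuccess &&
                   cudaMemcpyAsync(h, d, chunk_bytes, cudaMemcpyDeviceToHost, streams[s]) == cudaSuccess;
        }
        good = (cudaDeviceSynchronize() == cudaSuccess) && good;
    }
    if (good) {
        for (size_t i = 0; i < total; ++i) {
            if (host[i] != static_cast<float>(i) + 1.0f) { good = false; break; }
        }
    }

    for (int s = 0; s < kStreams; ++s) {
        if (streams[s]) cudaStreamDestroy(streams[s]);
    }
    if (dev) cudaFree(dev);
    if (host) cudaFreeHost(host);
    return good;
}

int main() {
    return pipelined_add_one(8, 1 << 16) ? 0 : 1;
}
```

If the host buffer came from plain malloc() instead, the cudaMemcpyAsync() calls would still work, but the copies would generally not overlap with kernel execution.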

On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory

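On some devices, page-locked memory can be mapped into the device's address space, which removes the need to copy it to device memory. GPUs of compute capability 2.0 and later all support this now. To be precise, though, only the manual copy is eliminated: the GPU automatically transfers the data across PCIe when it reads it, so the feature just reduces the user's manual work. It also lets host memory stand in temporarily when device memory runs short (although buying a card with more device memory is the proper solution). The evolved version of this feature is called unified memory; on Pascal and later GPUs, on supported platforms (for example, 64-bit systems and Linux), it performs better than this zero-copy approach.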

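On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining.

On a system with a front-side bus (FSB), allocating host memory as page-locked raises transfer bandwidth, and allocating it as write-combining (WC) raises it further still. It should be said that today's machines no longer have an FSB, or a northbridge; those are things of the past. (We strongly suggest that NVIDIA delete this paragraph.)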

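Portable Memory means memory that multiple GPUs can use at the same time. For a very long time now, on 64-bit systems, all page-locked memory has been portable automatically, with no need to pass a flag. This passage mainly targets the old Tesla (1.x) architecture; 2.0+ (Fermi and later) has long been automatically portable. The flag is still occasionally useful, though. Many Windows platforms do not support Unified Memory well, and with multiple GPUs it automatically degrades to ordinary mapped page-locked memory. In that situation the flag still has some relevance.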

By default page-locked host memory is allocated as cacheable. It can optionally be allocated as write-combining instead by passing flag cudaHostAllocWriteCombined to cudaHostAlloc().

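Ordinary page-locked memory is cached. Passing the cudaHostAllocWriteCombined flag to cudaHostAlloc() makes it WC memory, which bypasses every level of cache. On current machines this does not necessarily speed anything up, so we do not recommend it.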

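Note: do not use WC memory under normal circumstances, because reading from it is extremely slow. Many people switch to WC memory and see no transfer speedup, then accidentally read the data back with the CPU on the host and lose performance by a large factor. Unless your machine is unusual, you really should not use it.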

If anything is unclear, please leave a comment below this article or post on our technical forum, bbs.gpuworld.cn.

Originally published on the WeChat official account 吉浦迅科技 (gpusolution)

Original publication date: 2018-05-08
