DAY 6: Reading the CUDA C programming interface (the CUDA C runtime)

We are guiding readers through the English-language CUDA C Programming Guide. Today is day six, and we will spend the next few days on CUDA's programming interface, the most important part of which is the CUDA C runtime. We hope that over the coming 95 days you can learn CUDA from the original source while building the habit of reading English.

This article is about 845 words; reading time roughly 15 minutes.

3.2.4. Page-Locked Host Memory

The runtime provides functions to allow the use of page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()):

· cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;

· cudaHostRegister() page-locks a range of memory allocated by malloc() (see reference manual for limitations).
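To make the two paths concrete, here is a minimal sketch of our own (not from the guide) that allocates pinned memory directly and also pins an ordinary malloc() buffer in place; the 1 MiB size is arbitrary:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB; the size is arbitrary for this sketch

    // Path 1: allocate page-locked (pinned) host memory directly.
    float* pinned = nullptr;
    if (cudaHostAlloc((void**)&pinned, bytes, cudaHostAllocDefault) != cudaSuccess) {
        std::printf("cudaHostAlloc failed\n");
        return 1;
    }

    // Path 2: pin an ordinary malloc() buffer in place after the fact.
    void* pageable = std::malloc(bytes);
    if (cudaHostRegister(pageable, bytes, cudaHostRegisterDefault) != cudaSuccess) {
        std::printf("cudaHostRegister failed\n");
        return 1;
    }

    // ... either buffer can now serve as the host side of cudaMemcpy(Async) ...

    cudaHostUnregister(pageable);  // unpin before freeing the normal way
    std::free(pageable);
    cudaFreeHost(pinned);          // cudaHostAlloc memory must go through cudaFreeHost
    return 0;
}
```

Note the asymmetric cleanup: registered memory is unpinned with cudaHostUnregister() and then freed with free(), while cudaHostAlloc() memory must be released with cudaFreeHost().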

Using page-locked host memory has several benefits:

· Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices as mentioned in Asynchronous Concurrent Execution.

· On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory as detailed in Mapped Memory.

· On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining as described in Write-Combining Memory.

Page-locked host memory is a scarce resource however, so allocations in page-locked memory will start failing long before allocations in pageable memory. In addition, by reducing the amount of physical memory available to the operating system for paging, consuming too much page-locked memory reduces overall system performance.

The simple zero-copy CUDA sample comes with a detailed document on the page-locked memory APIs.

3.2.4.1. Portable Memory

A block of page-locked memory can be used in conjunction with any device in the system (see Multi-Device System for more details on multi-device systems), but by default, the benefits of using page-locked memory described above are only available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same unified address space, if any, as described in Unified Virtual Address Space). To make these advantages available to all devices, the block needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister().
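A minimal sketch of the portable variant; the helper name alloc_portable_pinned is ours, not the guide's:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: pinned memory whose page-locked benefits apply on
// every device in the system, not just the one current at allocation time.
float* alloc_portable_pinned(size_t n) {
    float* p = nullptr;
    cudaHostAlloc((void**)&p, n * sizeof(float), cudaHostAllocPortable);
    return p;  // release later with cudaFreeHost(p)
}
```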

3.2.4.2. Write-Combining Memory

By default page-locked host memory is allocated as cacheable. It can optionally be allocated as write-combining instead by passing flag cudaHostAllocWriteCombined to cudaHostAlloc(). Write-combining memory frees up the host's L1 and L2 cache resources, making more cache available to the rest of the application. In addition, write-combining memory is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%.

Reading from write-combining memory from the host is prohibitively slow, so write-combining memory should in general be used for memory that the host only writes to.
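A short sketch under the stated constraint that the host only writes to the buffer, e.g. as a staging area for uploads (the helper name is ours):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: a write-combining staging buffer that the host only
// fills before uploads; host reads from it would be prohibitively slow.
float* alloc_wc_upload_buffer(size_t n) {
    float* p = nullptr;
    // cudaHostAllocWriteCombined bypasses the host's L1/L2 caches.
    cudaHostAlloc((void**)&p, n * sizeof(float), cudaHostAllocWriteCombined);
    return p;  // release later with cudaFreeHost(p)
}
```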

3.2.4.3. Mapped Memory

A block of page-locked host memory can also be mapped into the address space of the device by passing flag cudaHostAllocMapped to cudaHostAlloc() or by passing flag cudaHostRegisterMapped to cudaHostRegister(). Such a block has therefore in general two addresses: one in host memory that is returned by cudaHostAlloc() or malloc(), and one in device memory that can be retrieved using cudaHostGetDevicePointer() and then used to access the block from within a kernel. The only exception is for pointers allocated with cudaHostAlloc() and when a unified address space is used for the host and the device as mentioned in Unified Virtual Address Space.

Accessing host memory directly from within a kernel has several advantages:

· There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel;

· There is no need to use streams (see Concurrent Data Transfers) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution.

Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses using streams or events (see Asynchronous Concurrent Execution) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.

To be able to retrieve the device pointer to any mapped page-locked memory, page-locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA call is performed. Otherwise, cudaHostGetDevicePointer() will return an error.

cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory. Applications may query this capability by checking the canMapHostMemory device property (see Device Enumeration), which is equal to 1 for devices that support mapped page-locked host memory.
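Putting this section together, here is a minimal zero-copy sketch of our own: enable mapping before any other CUDA call, check canMapHostMemory, allocate a mapped buffer, fetch its device alias, and let a trivial kernel access host memory directly:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;  // reads/writes cross PCI Express on demand
}

int main() {
    // Must be called before any other CUDA call that creates a context.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        std::printf("mapped page-locked memory not supported\n");
        return 1;
    }

    const int n = 1024;
    float *h_ptr = nullptr, *d_ptr = nullptr;
    cudaHostAlloc((void**)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_ptr[i] = 1.0f;

    // Retrieve the device-side alias of the same physical memory.
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

    scale<<<(n + 255) / 256, 256>>>(d_ptr, n, 2.0f);
    cudaDeviceSynchronize();  // synchronize before the host reads the results

    std::printf("h_ptr[0] = %f\n", h_ptr[0]);  // expect 2.0
    cudaFreeHost(h_ptr);
    return 0;
}
```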

Note that atomic functions (see Atomic Functions) operating on mapped page-locked memory are not atomic from the point of view of the host or other devices.

Also note that CUDA runtime requires that 1-byte, 2-byte, 4-byte, and 8-byte naturally aligned loads and stores to host memory initiated from the device are preserved as single accesses from the point of view of the host and other devices. On some platforms, atomics to memory may be broken by the hardware into separate load and store operations. These component load and store operations have the same requirements on preservation of naturally aligned accesses. As an example, the CUDA runtime does not support a PCI Express bus topology where a PCI Express bridge splits 8-byte naturally aligned writes into two 4-byte writes between the device and the host.

Notes / experience sharing:

The runtime provides functions to allow the use of page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()): Host memory comes in two kinds. One is ordinary memory, whose pages the OS may swap out to disk; the other is memory locked into physical pages (that is, resident in the RAM modules you actually see plugged into the machine). malloc() allocates the ordinary kind. Among the runtime's allocation functions, some allocate device memory and some allocate host memory; unlike an ordinary C allocator such as malloc(), the runtime can allocate page-locked memory directly, or use the non-allocating lock/register function to turn ordinary memory into page-locked memory.

Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices. On some devices (compute capability 2.0+), copies between page-locked host memory and device memory can run at the same time as kernel execution.

That is, the familiar pipeline: (1) prepare the data; (2) transfer it to device memory; (3) the kernel computes on the data in device memory; (4) transfer the results back. This is the pattern with ordinary memory. With page-locked memory, the transfer of the data needed by the next kernel launch can proceed concurrently with the current kernel launch, which effectively saves time and improves performance. A sketch of this overlap appears below.
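Here is a minimal sketch of the overlap pattern with pinned memory and one stream; the kernel named process is a stand-in for real work:

```cuda
#include <cuda_runtime.h>

__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder for real work
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    // Pinned host buffers: required for truly asynchronous copies.
    cudaHostAlloc((void**)&h_in,  bytes, cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_in,  bytes);
    cudaMalloc((void**)&d_out, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three operations are enqueued on the stream and return immediately;
    // copies issued in other streams can overlap this stream's kernel.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait for the whole chain to finish
    cudaStreamDestroy(stream);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```

With pageable host memory, cudaMemcpyAsync generally cannot run fully asynchronously, which is why the pinned allocations matter here.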

On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory

On some devices, page-locked host memory can be mapped into the device's address space, eliminating the need to copy it to or from device memory. All GPUs of compute capability 2.0+ now support this. To be precise, though, only the manual copy is eliminated: the GPU still transfers the data across PCI-E automatically as it reads it, so the feature merely saves the user the manual work. It also lets you stretch host memory as a stopgap when device memory runs short (though buying a card with more memory is the proper fix). The evolved version of this feature is called unified memory; on Pascal and later, on supported platforms (for example 64-bit systems and Linux), it performs better than this zero-copy approach. A managed-memory sketch follows.
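For comparison, a minimal managed-memory (unified memory) sketch, assuming Pascal or newer hardware on a supported 64-bit platform:

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1024;
    float* data = nullptr;
    // One allocation, one pointer, valid on both host and device; the driver
    // migrates pages on demand instead of crossing the bus on every access.
    cudaMallocManaged((void**)&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = float(i);  // host writes directly
    // ... launch kernels on `data` directly, then cudaDeviceSynchronize() ...
    cudaFree(data);  // managed memory is freed with cudaFree
    return 0;
}
```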

On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining. On a system with an FSB, allocating host memory as page-locked raises the transfer bandwidth, and allocating it as write-combining (WC) raises it higher still. Note, though, that today's machines no longer have an FSB, or a northbridge; those days are over. (We strongly suggest NVIDIA delete this passage.)

Portable Memory is memory that multiple GPUs can use at the same time. For a very long time now, on 64-bit systems, every page-locked allocation has automatically been portable; there is no need to specify the flag. This passage is aimed mainly at the old Tesla (compute capability 1.x) architecture; on 2.0+ (Fermi and later) allocations have long been portable automatically. The flag is still occasionally useful: many Windows platforms do not support Unified Memory well, and with multiple GPUs it automatically degrades to ordinary mapped page-locked memory, in which case the flag still has some meaning.

By default page-locked host memory is allocated as cacheable. It can optionally be allocated as write-combining instead by passing flag cudaHostAllocWriteCombined to cudaHostAlloc().

Ordinary page-locked memory is cached. If you pass the cudaHostAllocWriteCombined flag to cudaHostAlloc(), the allocation becomes WC memory instead, which bypasses the cache hierarchy. On today's machines this does not necessarily yield any speedup, so we do not recommend it.

Note: do not use WC memory under normal circumstances, because reading from it is extremely slow. Many people switch to WC memory, see no transfer speedup, then accidentally have the CPU read data from it on the host side, and lose several times the performance. Unless the machine is special, you really should not use it.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat public account 吉浦迅科技 (gpusolution).

Originally published on 2018-05-08.
