DAY 7: Reading the CUDA C Programming Interface, the CUDA C Runtime

We are leading readers through the English edition of the CUDA C Programming Guide. Today is Day 7. We will spend several days studying the CUDA programming interface, whose most important part is the CUDA C runtime. We hope that over the next 93 days you will learn CUDA from the original source, and at the same time build the habit of reading English documentation.

This article is about 566 characters; estimated reading time is 15 minutes.

These chapters all cover the CUDA C Runtime. We have already covered initialization, device memory, shared memory, and page-locked memory. Today we cover asynchronous concurrent execution. This part is also quite long, so we will spread it over another three days.

3.2.5. Asynchronous Concurrent Execution

CUDA exposes the following operations as independent tasks that can operate concurrently with one another:

· Computation on the host;

· Computation on the device;

· Memory transfers from the host to the device;

· Memory transfers from the device to the host;

· Memory transfers within the memory of a given device;

· Memory transfers among devices.

The level of concurrency achieved between these operations will depend on the feature set and compute capability of the device as described below.

3.2.5.1. Concurrent Execution between Host and Device

Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task. Using asynchronous calls, many device operations can be queued up together to be executed by the CUDA driver when appropriate device resources are available. This relieves the host thread of much of the responsibility to manage the device, leaving it free for other tasks. The following device operations are asynchronous with respect to the host:

· Kernel launches;

· Memory copies within a single device's memory;

· Memory copies from host to device of a memory block of 64 KB or less;

· Memory copies performed by functions that are suffixed with Async;

· Memory set function calls.

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
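As a minimal sketch of the asynchronous behavior described in this section (the kernel scale, the buffer size, and the stream setup are all illustrative, and error checking is omitted): the kernel launch and both Async copies below return control to the host immediately, so the host can keep working until it synchronizes.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *h_data, *d_data;
    // Page-locked host memory, so the Async copies are truly asynchronous.
    cudaMallocHost((void**)&h_data, n * sizeof(float));
    cudaMalloc((void**)&d_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three calls below return control to the host immediately.
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 2.0f);
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // The host is free here: prepare the next task, allocate buffers, etc.
    printf("host keeps running while the device works\n");

    cudaStreamSynchronize(stream);  // wait for the queued work to finish
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFreeHost(h_data);
    cudaFree(d_data);
    cudaStreamDestroy(stream);
    return 0;
}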

3.2.5.2. Concurrent Kernel Execution

Some devices of compute capability 2.x and higher can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property (see Device Enumeration), which is equal to 1 for devices that support it.

The maximum number of kernel launches that a device can execute concurrently depends on its compute capability and is listed in Table 14.

A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.

Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
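A short sketch of the property query described above, using the runtime API (device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("concurrentKernels = %d\n", prop.concurrentKernels);
    if (prop.concurrentKernels) {
        // Kernels launched into different non-default streams
        // may execute concurrently on this device.
    }
    return 0;
}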

3.2.5.3. Overlap of Data Transfer and Kernel Execution

Some devices can perform an asynchronous memory copy to or from the GPU concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is greater than zero for devices that support it. If host memory is involved in the copy, it must be page-locked.

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses residing on the same device.
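A hedged sketch of copy/kernel overlap under those conditions (the kernel work is hypothetical, and error checking is omitted): the kernel runs in one stream while a host-to-device copy of a different, page-locked buffer proceeds in another.

#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.asyncEngineCount == 0) return 0;  // no copy/compute overlap

    const int n = 1 << 20;
    float *h_buf, *d_a, *d_b;
    cudaMallocHost((void**)&h_buf, n * sizeof(float));  // must be page-locked
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMalloc((void**)&d_b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // The kernel in s0 and the host-to-device copy in s1 can overlap:
    // they are in different streams and the copy engine is otherwise idle.
    work<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);
    cudaMemcpyAsync(d_b, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, s1);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h_buf);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}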

3.2.5.4. Concurrent Data Transfers

Some devices of compute capability 2.x and higher can overlap copies to and from the device. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is equal to 2 for devices that support it. In order to be overlapped, any host memory involved in the transfers must be page-locked.
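A sketch of the query plus two opposite-direction transfers issued into separate streams (buffer sizes are illustrative, error checking is omitted): on a device reporting asyncEngineCount == 2, the two copies below may proceed at the same time.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    if (prop.asyncEngineCount == 2) {
        const int n = 1 << 20;
        float *h_in, *h_out, *d_in, *d_out;
        cudaMallocHost((void**)&h_in,  n * sizeof(float));  // page-locked, required
        cudaMallocHost((void**)&h_out, n * sizeof(float));
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));

        cudaStream_t up, down;
        cudaStreamCreate(&up);
        cudaStreamCreate(&down);

        // With two copy engines, these transfers in opposite
        // directions can run at the same time.
        cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                        cudaMemcpyHostToDevice, up);
        cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost, down);
        cudaDeviceSynchronize();

        cudaStreamDestroy(up); cudaStreamDestroy(down);
        cudaFreeHost(h_in); cudaFreeHost(h_out);
        cudaFree(d_in); cudaFree(d_out);
    }
    return 0;
}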

Notes and experience sharing:

Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task: through asynchronous library calls, control returns to the host thread without waiting for the requested task to finish. Many functions behave this way, for example cudaMemcpyAsync* and kernel launches via <<<>>>: they return control to the host immediately, before the work completes. Through this mechanism the host can continue executing the code that follows while the device works on the task it was just given, so host and device appear to run simultaneously. The host can use this time to prepare the next task, for example allocating a buffer to hold results that are about to be computed, then retrieving them with cudaMemcpy*.

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1: by setting CUDA_LAUNCH_BLOCKING=1, a programmer can globally disable asynchronous kernel launches for every CUDA application running on the system (<<<>>> becomes synchronous). Remember how <<<>>> was described as asynchronous above? This setting costs performance, but it makes bugs easier to debug.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked: while a profiler (Nsight or Visual Profiler) is collecting hardware counters, kernel launches are synchronous, unless you enable concurrent kernel profiling in the profiler itself. The second sentence says that if the arguments you pass to cudaMemcpyAsync*() do not meet the requirements (page-locked memory is required, but you passed ordinary memory), the call degrades to the plain synchronous cudaMemcpy*() behavior. This tolerance was added in later CUDA versions; earlier, wrong arguments caused an error rather than a fallback. Because so many people made this mistake, NVIDIA added the fallback: with wrong arguments the call no longer fails, it merely loses the asynchronous feature and behaves like the sibling function without the Async suffix.
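A small sketch contrasting the two cases (names and sizes are illustrative, error checking omitted): the first Async copy uses a page-locked buffer and can be truly asynchronous; the second uses an ordinary malloc'd pageable buffer, so the call remains legal but, as the note above explains, loses its asynchrony.

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;
    float *pinned, *pageable, *dev;

    cudaMallocHost((void**)&pinned, bytes);   // page-locked host memory
    pageable = (float*)malloc(bytes);         // ordinary pageable memory
    cudaMalloc((void**)&dev, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Truly asynchronous: the host buffer is page-locked.
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, s);

    // Legal, but degrades to synchronous behavior: the source is
    // pageable, so the runtime cannot DMA directly from it.
    cudaMemcpyAsync(dev, pageable, bytes, cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}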

Overlap of Data Transfer: kernel execution and data transfers overlap (on the time axis) and run simultaneously.

Remember how we said yesterday that, in some cases, data transfer and kernel execution can proceed at the same time? That saves time. Otherwise the usual sequence looks like this:

transfer data to the GPU ----> SMs compute -----> transfer data back ----> transfer the next batch over -----> SMs compute the next round -----> retrieve again

This wastes device capability. Why?

transfer data to the GPU (DMA busy, SMs/SPs idle) ----> SMs compute (SMs/SPs busy, DMA idle) -----> transfer data back (DMA busy, SMs idle again) ......

You can see that whenever data is in transit, the SMs sit idle with zero computation, and whenever the SMs are busy, zero data is being transferred. That is awkward. Consider this arrangement instead:

transfer data over ------> SMs compute -------> transfer data back

transfer next batch over -----> SMs compute next round -----> retrieve again

transfer next batch over -----> SMs compute next round -----> retrieve again ....

This way neither the card's compute capability nor its transfer capability is wasted: the overlapping portions of the three rows can run at the same time (this requires compute capability 2.0+ and copy engines > 1). In practice, wasting transfer capacity is no disgrace, but wasting compute capacity is unacceptable. Even if you do not care about transfer throughput, you should consider overlapping purely to avoid wasting compute. Overlap is achieved with multiple streams and asynchronous copies; a sketch follows below. If you use a profiler, you will see a colored timeline that looks much like my three rows of text above. Other combinations can also overlap: an intra-device copy with kernel execution, or an intra-device copy with an ordinary transfer across PCI-E; the former requires concurrent kernel execution, the latter requires copy engines >= 1.
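A minimal multi-stream pipeline sketch matching the three rows above (the kernel process, the chunk size, and the stream count are all hypothetical; error checking is omitted). Each stream queues its own upload, compute, and download, so the upload of one chunk can overlap the compute of another.

#include <cuda_runtime.h>

// Hypothetical per-chunk kernel, for illustration only.
__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int nStreams = 3, chunk = 1 << 20;
    float *h, *d;
    cudaMallocHost((void**)&h, nStreams * chunk * sizeof(float));  // page-locked
    cudaMalloc((void**)&d, nStreams * chunk * sizeof(float));

    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

    // Each stream gets its own upload -> compute -> download chain;
    // chains in different streams can overlap on the timeline.
    for (int i = 0; i < nStreams; ++i) {
        float *hp = h + (size_t)i * chunk;
        float *dp = d + (size_t)i * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}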

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property): an intra-device copy is unusual in that it is carried out by launching an implicit kernel rather than by the DMA engine. That is why an intra-device copy running concurrently with kernel execution needs concurrent kernel execution capability (a device property). Note that both parenthesized conditions are obsolete: they require at least Fermi, and CUDA 9.0 long ago dropped support for compute capability 1.x (the Tesla architecture), so the parentheses should be deleted rather than documenting devices that no longer exist and are no longer supported.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat public account Jipuxun Technology (gpusolution).

Original publication date: 2018-05-09.
