
DAY7: Reading the CUDA C programming interface: the CUDA C runtime

Author: GPUS Lady
Published: 2018-06-25 15:50:42
From the column: GPUS开发者 (GPUS Developers)

We are leading everyone through the English-language "CUDA C Programming Guide". Today is day 7. We will spend several days studying CUDA's programming interfaces, the most important of which is the CUDA C runtime. We hope that over the next 93 days you can learn CUDA from the original source while building the habit of reading in English.

This article is about 566 characters; estimated reading time: 15 minutes.

These sections all cover the CUDA C Runtime. We have already covered initialization, device memory, shared memory, and page-locked memory; today we cover asynchronous concurrent execution. This is also a lot of material, so we will spread it over another three days.

3.2.5. Asynchronous Concurrent Execution

CUDA exposes the following operations as independent tasks that can operate concurrently with one another:

· Computation on the host;

· Computation on the device;

· Memory transfers from the host to the device;

· Memory transfers from the device to the host;

· Memory transfers within the memory of a given device;

· Memory transfers among devices.

The level of concurrency achieved between these operations will depend on the feature set and compute capability of the device as described below.

3.2.5.1. Concurrent Execution between Host and Device

Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task. Using asynchronous calls, many device operations can be queued up together to be executed by the CUDA driver when appropriate device resources are available. This relieves the host thread of much of the responsibility to manage the device, leaving it free for other tasks. The following device operations are asynchronous with respect to the host:

· Kernel launches;

· Memory copies within a single device's memory;

· Memory copies from host to device of a memory block of 64 KB or less;

· Memory copies performed by functions that are suffixed with Async;

· Memory set function calls.
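
To make the mechanism concrete, here is a minimal sketch of my own (not from the guide; the kernel vecScale and all sizes are hypothetical). The kernel launch and the Async copies are queued and return immediately, so the host printf runs while the device may still be working:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales every element of a device array.
__global__ void vecScale(float *d, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, n * sizeof(float));   // page-locked host buffer
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // Both calls below are asynchronous with respect to the host:
    // they are queued for the device and return right away.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, 0);
    vecScale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    // The host is free to do other work at this point.
    printf("host continues while the device works...\n");

    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, 0);
    cudaDeviceSynchronize();                 // wait for all queued work
    printf("h[0] = %f\n", h[0]);             // prints 2.0

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```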

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.

3.2.5.2. Concurrent Kernel Execution

Some devices of compute capability 2.x and higher can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property (see Device Enumeration), which is equal to 1 for devices that support it.

The maximum number of kernel launches that a device can execute concurrently depends on its compute capability and is listed in Table 14.

A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.

Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
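
As a quick illustration, a hedged sketch of such a query (querying device 0 is an arbitrary choice of mine):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0 (arbitrary)
    printf("%s: concurrentKernels = %d\n", prop.name, prop.concurrentKernels);
    return 0;
}
```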

3.2.5.3. Overlap of Data Transfer and Kernel Execution

Some devices can perform an asynchronous memory copy to or from the GPU concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is greater than zero for devices that support it. If host memory is involved in the copy, it must be page-locked.

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses residing on the same device.

3.2.5.4. Concurrent Data Transfers

Some devices of compute capability 2.x and higher can overlap copies to and from the device. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is equal to 2 for devices that support it. In order to be overlapped, any host memory involved in the transfers must be page-locked.
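
Along the same lines, a small sketch (again my own, device 0 arbitrary) that checks asyncEngineCount, the property governing both the copy/kernel overlap of 3.2.5.3 and the copy/copy overlap of 3.2.5.4:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // > 0: copies can overlap kernel execution;
    // == 2: host-to-device and device-to-host copies can also overlap each other.
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return 0;
}
```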

Notes and experience sharing for this article:

Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task: through asynchronous library calls, control returns to the host thread without waiting for the requested work to finish. Many functions behave this way, for example the cudaMemcpyAsync* family and kernel launches with <<<>>>: control returns to the host immediately, before the task completes. With this mechanism the host can keep executing the code that follows while the device works on the task it was just given, so host and device appear to run at the same time. The host can use that time to prepare the next task, for example allocating a buffer to hold results that are about to finish computing, then fetching them with cudaMemcpy*.

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1: by setting the environment variable CUDA_LAUNCH_BLOCKING to 1, a programmer can globally disable asynchronous kernel launches for every CUDA application running on the system (<<<>>> becomes synchronous). Remember that <<<>>> is normally asynchronous, as described above. This feature costs performance, but it makes bugs easier to track down.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked: while a profiler (Nsight or Visual Profiler) is collecting hardware counters, kernel launches are synchronous, unless you enable concurrent kernel profiling in the profiler. The second half says that if cudaMemcpyAsync*() is given arguments that do not meet its requirements (page-locked memory is required, but you passed ordinary memory), it degrades to the plain synchronous cudaMemcpy*() behavior. This tolerance was added in later CUDA versions; originally, wrong arguments did not cause a fallback, they caused an error. Because so many people got this wrong, NVIDIA added error tolerance: nowadays, wrong arguments no longer raise an error, the call simply loses its asynchrony and behaves like its sister function without the Async suffix.
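
To make the degradation concrete, a hedged sketch (buffer names are hypothetical) contrasting a truly asynchronous copy from page-locked memory with the same call on ordinary pageable memory, which, per the note above, silently falls back to synchronous behavior instead of failing:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;
    float *pageable = (float *)malloc(bytes);   // ordinary pageable memory
    float *pinned = nullptr;
    cudaMallocHost(&pinned, bytes);             // page-locked memory
    float *d = nullptr;
    cudaMalloc(&d, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Truly asynchronous: the source is page-locked.
    cudaMemcpyAsync(d, pinned, bytes, cudaMemcpyHostToDevice, s);

    // Does not error out, but degrades to a copy that is synchronous
    // with respect to the host, because the source is pageable.
    cudaMemcpyAsync(d, pageable, bytes, cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```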

Overlap of Data Transfer: executing kernels and data transfers at the same time (overlapped on the timeline).

Remember how we said yesterday that in some cases data transfers and kernel execution can run simultaneously? That saves time. Otherwise, the usual pattern looks like this:

transfer data to the GPU ----> SMs compute -----> transfer data back ----> transfer the next batch over -----> SMs compute the next round -----> fetch results again

This wastes device performance. Why?

transfer data to the GPU (DMA busy, SMs/SPs idle) ----> SMs compute (SMs/SPs busy, DMA idle) -----> transfer data back (DMA busy, SMs idle again) ......

You see: whenever a transfer is in progress, the SMs just sit there idle, doing zero computation; and whenever the SMs are busy, zero data gets transferred. Rather awkward. Now suppose you arrange things like this instead:

transfer data to the GPU ------> SMs compute -------> transfer data back

transfer the next batch over -----> SMs compute the next round -----> fetch results again

transfer the next batch over -----> SMs compute the next round -----> fetch results again ....

This way neither the card's compute capability nor its transfer capability is wasted: the overlapping parts of the three rows run simultaneously (this requires compute capability 2.0+ and Copy Engines > 1). In practice, wasting transfer capacity is no disgrace, but wasting compute performance is unacceptable. Even if you do not care about transfer throughput, you should consider overlapping purely to avoid wasting compute. Overlap is achieved with multiple streams, asynchronous copies, and so on. If you use the profiler, you will see a colored timeline that looks much like my three rows of text above. Other combinations that can overlap: an intra-device copy with kernel execution, or an intra-device transfer with an ordinary transfer across PCI-E; the former requires support for concurrent kernel execution, the latter requires copy engines >= 1.
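
Those three staggered rows map naturally onto CUDA streams. Below is a hedged sketch of that pipeline (the kernel process, the chunk size, and the stream count are all hypothetical): each chunk's copy-in, kernel, and copy-out go into their own stream, so one chunk's transfers can overlap another chunk's computation. The host buffer must be page-locked for the copies to be truly asynchronous:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-chunk kernel.
__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int nStreams = 3, chunk = 1 << 20;
    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, nStreams * chunk * sizeof(float)); // must be pinned
    cudaMalloc(&d, nStreams * chunk * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < nStreams; ++i) {
        float *hp = h + i * chunk, *dp = d + i * chunk;
        size_t bytes = chunk * sizeof(float);
        // Within one stream these three run in order;
        // across streams they are free to overlap.
        cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, streams[i]);
        process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, streams[i]);
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```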

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property): intra-device copies are unusual compared with the others in that they are carried out by launching an invisible kernel (rather than by a DMA engine). That is why overlapping an intra-device copy with kernel execution requires the concurrent-kernel-execution capability (a device property). Note that neither parenthetical qualifier is needed any more: the requirement amounts to Fermi or newer, and CUDA 9.0 long ago dropped support for compute capability 1.x (the Tesla architecture), so the parentheses should simply be deleted; there is no point documenting devices that no longer exist and are no longer supported.
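
For completeness, a minimal hedged sketch of an intra-device copy (buffer names hypothetical): destination and source both reside on the same device, so the standard copy function with cudaMemcpyDeviceToDevice performs the copy on-device:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that fills the source buffer.
__global__ void produce(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = (float)i;
}

int main() {
    const int n = 1 << 20;
    float *dSrc = nullptr, *dDst = nullptr;
    cudaMalloc(&dSrc, n * sizeof(float));
    cudaMalloc(&dDst, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    produce<<<(n + 255) / 256, 256, 0, s>>>(dSrc, n);
    // Device-to-device copy; on devices with concurrent kernel execution
    // it can overlap kernels running in other streams.
    cudaMemcpyAsync(dDst, dSrc, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(dSrc);
    cudaFree(dDst);
    return 0;
}
```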

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.
