DAY 7: Reading the CUDA C Programming Interface: the CUDA C Runtime

We are leading readers through the English-language CUDA C Programming Guide. Today is day 7. We will spend several days on CUDA's programming interface, the most important part of which is the CUDA C runtime. We hope that over the next 93 days you can learn CUDA from the original source while building a habit of reading English.

This article runs about 566 words; estimated reading time 15 minutes.

These sections all cover the CUDA C runtime. We have already covered initialization, device memory, shared memory, and page-locked memory; today we move on to asynchronous concurrent execution. There is quite a lot of material here, so we will again spread it over three days.

3.2.5. Asynchronous Concurrent Execution

CUDA exposes the following operations as independent tasks that can operate concurrently with one another:

· Computation on the host;

· Computation on the device;

· Memory transfers from the host to the device;

· Memory transfers from the device to the host;

· Memory transfers within the memory of a given device;

· Memory transfers among devices.

The level of concurrency achieved between these operations will depend on the feature set and compute capability of the device as described below.

3.2.5.1. Concurrent Execution between Host and Device

Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task. Using asynchronous calls, many device operations can be queued up together to be executed by the CUDA driver when appropriate device resources are available. This relieves the host thread of much of the responsibility to manage the device, leaving it free for other tasks. The following device operations are asynchronous with respect to the host:

· Kernel launches;

· Memory copies within a single device's memory;

· Memory copies from host to device of a memory block of 64 KB or less;

· Memory copies performed by functions that are suffixed with Async;

· Memory set function calls.

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
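As a minimal sketch of this host-side asynchrony (the kernel `scale` and all buffer names below are illustrative placeholders, not code from the guide):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales a buffer in place.
__global__ void scale(float *d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    // The launch returns control to the host immediately...
    scale<<<(n + 255) / 256, 256>>>(d_buf, 2.0f, n);

    // ...so the host is free to do unrelated work here, e.g. allocate
    // the buffer that will later receive the results.
    float *h_result = (float *)malloc(n * sizeof(float));

    // cudaMemcpy is synchronous with respect to the host; issued in the
    // same (default) stream, it also waits for the kernel to finish.
    cudaMemcpy(h_result, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("done\n");
    free(h_result);
    cudaFree(d_buf);
    return 0;
}
```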

3.2.5.2. Concurrent Kernel Execution

Some devices of compute capability 2.x and higher can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property (see Device Enumeration), which is equal to 1 for devices that support it.

The maximum number of kernel launches that a device can execute concurrently depends on its compute capability and is listed in Table 14.

A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.

Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
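The property query described above can be sketched in a few lines (an illustrative snippet, not code from the guide):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("concurrentKernels = %d\n", prop.concurrentKernels);
    // On devices reporting 1 here, kernels launched into different
    // non-default streams are candidates for concurrent execution.
    return 0;
}
```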

3.2.5.3. Overlap of Data Transfer and Kernel Execution

Some devices can perform an asynchronous memory copy to or from the GPU concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is greater than zero for devices that support it. If host memory is involved in the copy, it must be page-locked.

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses residing on the same device.

3.2.5.4. Concurrent Data Transfers

Some devices of compute capability 2.x and higher can overlap copies to and from the device. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is equal to 2 for devices that support it. In order to be overlapped, any host memory involved in the transfers must be page-locked.
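Both requirements named above, the asyncEngineCount query and page-locked host memory, can be sketched together (an illustrative snippet, not code from the guide):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 0: no copy/kernel overlap; 1: copies in one direction can overlap
    // kernel execution; 2: H2D and D2H copies can also overlap each other.
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    // Host memory taking part in overlapped copies must be page-locked,
    // e.g. allocated with cudaMallocHost rather than malloc:
    float *h_pinned;
    cudaMallocHost(&h_pinned, 1024 * sizeof(float));
    cudaFreeHost(h_pinned);
    return 0;
}
```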

Notes and commentary on this installment:

Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task: with asynchronous library calls, control returns to the host thread without waiting for the requested task to finish. Many functions behave this way, e.g. cudaMemcpyAsync* and kernel launches via <<<>>>; they hand control back to the host immediately. Through this mechanism the host can continue executing (in terms of code flow) while the device works on the task it was just given, so the host and device appear to run simultaneously. The host can use this time to prepare the next task, for example allocating a buffer to hold results that are about to be computed, and later fetching them with cudaMemcpy*.

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1: by setting the environment variable CUDA_LAUNCH_BLOCKING to 1, a programmer can globally disable asynchronous kernel launches for every CUDA application running on the system (<<<>>> becomes synchronous). Recall from above that <<<>>> is normally asynchronous. This feature costs performance, but it makes bugs easier to debug.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked. While a profiler (Nsight or Visual Profiler) is collecting hardware counter information, kernel launches become synchronous, unless you enable concurrent kernel profiling in the profiler (the setting name is kept in English because that is how it appears in the profiler itself). The second sentence says that if cudaMemcpyAsync*() is given arguments that do not meet its requirements (page-locked memory is required, but you passed ordinary pageable memory), it degrades into the ordinary synchronous cudaMemcpy*(). This behavior was added in later CUDA versions; previously, invalid arguments produced an error instead of a fallback. Because so many people got this wrong, NVIDIA added this tolerance: today, invalid arguments no longer raise an error; you simply lose the asynchronous behavior, and the call behaves like its sibling function without the Async suffix.

Overlap of Data Transfer and Kernel Execution: running kernel execution and data transfers at the same time (overlapped on the timeline).

Remember how we said yesterday that, in some cases, data transfer and kernel execution can run simultaneously? That saves time. Otherwise, the usual pattern looks like this:

transfer data to the GPU ----> SMs compute -----> transfer data back ----> transfer the next batch over -----> SMs compute the next round -----> fetch results again

Arranged this way, the device's capacity is wasted. Why?

transfer data to the GPU (DMA busy, SMs/SPs idle) ----> SMs compute (SMs/SPs busy, DMA idle) -----> transfer data back (DMA busy, SMs idle again) ......

As you can see, whenever a transfer is in progress, the SMs sit idle with zero compute; and whenever the SMs are busy, there is zero transfer. That is awkward. Now arrange it like this: transfer data to the GPU ------> SMs compute -------> transfer data back

transfer the next batch over -----> SMs compute the next round -----> fetch results

transfer the next batch over -----> SMs compute the next round -----> fetch results.... This way, neither the GPU's compute capacity nor its transfer capacity is wasted. The overlapping portions of the three rows can run simultaneously (requires compute capability 2.0+ and Copy Engines > 1). In practice, wasting transfer capacity is no disgrace, but wasting compute capacity is unacceptable. Even if you do not care about transfer capacity, you should consider overlapping purely to avoid wasting compute. Overlap can be achieved with multiple streams, asynchronous copies, and so on. If you use a profiler, you will see a colored timeline that looks much like my three rows of text above. Other overlappable cases: an intra-device copy with kernel execution, or an intra-device copy with an ordinary transfer across PCI-E; the former requires support for concurrent kernel execution, the latter requires copy engines >= 1.
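The three-row pipeline above can be sketched with two streams. The kernel `process`, the buffer names, and the chunking scheme are illustrative assumptions, and `h_in`/`h_out` must be page-locked (e.g. allocated with cudaMallocHost) for the copies to actually be asynchronous:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-chunk kernel (placeholder work).
__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// h_in/h_out: page-locked host buffers of n floats;
// d_buf[0], d_buf[1]: device buffers of at least `chunk` floats each.
void pipeline(float *h_in, float *h_out, float *d_buf[2], int n, int chunk) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    for (int off = 0, i = 0; off < n; off += chunk, i ^= 1) {
        int len = (off + chunk < n) ? chunk : n - off;
        // Alternating between two streams lets the copies of one chunk
        // overlap the kernel of the previous chunk, as in the diagram.
        cudaMemcpyAsync(d_buf[i], h_in + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(len + 255) / 256, 256, 0, s[i]>>>(d_buf[i], len);
        cudaMemcpyAsync(h_out + off, d_buf[i], len * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

On a profiler timeline, this scheme shows the copy rows and the kernel row overlapping, much like the three rows of text above.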

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property): intra-device copies are unusual compared with the others; they are carried out by launching an invisible kernel rather than by the DMA engine. That is why, for an intra-device copy to run concurrently with kernel execution, the device needs the concurrent kernel execution capability (a device property). Note that neither parenthesized condition is needed anymore: they require at least Fermi (compute capability 2.0+), and CUDA 9.0 dropped support for 1.x (the Tesla architecture) long ago, so the parentheses should be deleted rather than documenting devices that no longer exist or are unsupported.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat public account of 吉浦迅科技 (gpusolution)

Original publication date: 2018-05-09
