DAY 9: Reading about Streams in CUDA Asynchronous Concurrent Execution

Today we continue with Streams under asynchronous concurrent execution:

3.2.5.5.4. Implicit Synchronization

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

· a page-locked host memory allocation,

· a device memory allocation,

· a device memory set, [i.e. cudaMemset, the CUDA counterpart of the ordinary memset() function, generally used to initialize or zero device memory]

· a memory copy between two addresses to the same device memory,

· any CUDA command to the NULL stream, [any command issued to the default stream]

· a switch between the L1/shared memory configurations described in Compute Capability 3.x and Compute Capability 7.x.
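As an illustration of the list above (a hypothetical kernel and buffers; error checking omitted), issuing one of these operations between two async commands serializes work that could otherwise overlap:

```cuda
__global__ void MyKernel(float *p) { p[threadIdx.x] += 1.0f; }

// Sketch only: d_a/d_b are assumed to be valid device buffers and
// h_a/h_b page-locked host buffers of `bytes` bytes each.
void sketch(float *h_a, float *h_b, float *d_a, float *d_b, size_t bytes) {
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);

    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream[0]);
    MyKernel<<<1, 256, 0, stream[0]>>>(d_a);

    float *d_tmp;
    cudaMalloc(&d_tmp, bytes);  // device memory allocation in between:
                                // an implicit synchronization point

    // This copy can no longer overlap the work already queued in stream[0]:
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream[1]);
}
```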

For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:

· Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;

· Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.

Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:

· All independent operations should be issued before dependent operations,

· Synchronization of any kind should be delayed as long as possible.

3.2.5.5.5. Overlapping Behavior

The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution, concurrent kernel execution, and/or concurrent data transfers.

For example, on devices that do not support concurrent data transfers, the two streams of the code sample of Creation and Destruction do not overlap at all because the memory copy from host to device is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it can only start once the memory copy from device to host issued to stream[0] has completed. If the code is rewritten the following way (and assuming the device supports overlap of data transfer and kernel execution)
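The guide's code block is missing from this excerpt; its breadth-first rewrite looks roughly like this (the names hostPtr, inputDevPtr, outputDevPtr, size, and MyKernel follow the guide's Creation and Destruction sample):

```cuda
// Breadth-first issue order: all H->D copies first, then all kernel
// launches, then all D->H copies, so no stream's copy is queued behind
// another stream's copy on the same copy engine.
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
```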

then the memory copy from host to device issued to stream[1] overlaps with the kernel launch issued to stream[0].

On devices that do support concurrent data transfers, the two streams of the code sample of Creation and Destruction do overlap: The memory copy from host to device issued to stream[1] overlaps with the memory copy from device to host issued to stream[0] and even with the kernel launch issued to stream[0] (assuming the device supports overlap of data transfer and kernel execution). However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization. If the code is rewritten as above, the kernel executions overlap (assuming the device supports concurrent kernel execution) since the second kernel launch is issued to stream[1] before the memory copy from device to host is issued to stream[0]. In that case however, the memory copy from device to host issued to stream[0] only overlaps with the last thread blocks of the kernel launch issued to stream[1] as per Implicit Synchronization, which can represent only a small portion of the total execution time of the kernel.

Notes and hands-on commentary:

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

  1. A few notes on the items listed above: "a device memory set" means cudaMemset, the CUDA version of the ordinary memset() function. It can zero a region of device memory or fill it with a specific byte pattern (for example 0xff), and is generally used for initialization. The point is that configuration-style or initialization work generally cannot run concurrently with other operations. But such operations usually sit at the start of a program, so in practice this has little impact.
  2. As for the default stream: by default it cannot run concurrently with other streams, unless you explicitly create your streams with the flag that allows them to execute asynchronously alongside it.
  3. As for the 3.x and 7.x item: 2.x (support dropped as of CUDA 9.0), 3.x, and 7.x all have an adjustable L1 cache / shared memory split, while Maxwell made it fixed and Pascal kept it fixed, and then 7.x made it adjustable again. That is: 2.x adjustable -> 3.x adjustable -> 5.x fixed -> 6.x fixed -> 7.x adjustable.
  4. "a memory copy between two addresses to the same device memory" is probably not a CUDA limitation as such; rather, if copies a -> c and b -> c ran at the same time, the result could be a logical mess. Say you copy 10 MB from a to c and then 10 MB from b to c: done one after the other, b overwrites a's result; done concurrently, the outcome is undefined (part of c might come from a and another part from b). From that angle the restriction makes sense, but surely nobody writes code like this; anyone who did would already know it is unsafe, so it is unclear why CUDA calls it out separately. It is like a restaurant guest who may drink tea first and wine later, but nobody alternates a sip of wine with a sip of tea. The restaurant has simply posted a notice: we do not support drinking wine and tea at the same time; you may order only one at a time.
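Points 1 through 3 above map to concrete runtime calls. A minimal sketch (error checking omitted; MyKernel is a hypothetical kernel used only to anchor the per-kernel call):

```cuda
__global__ void MyKernel(float *p) { p[threadIdx.x] = 0.0f; }

void setup_sketch(void) {
    // Point 1: cudaMemset fills device memory byte by byte.
    float *d_buf;
    cudaMalloc(&d_buf, 1024 * sizeof(float));
    cudaMemset(d_buf, 0, 1024 * sizeof(float));     // clear to zero
    cudaMemset(d_buf, 0xff, 1024 * sizeof(float));  // fill every byte with 0xff

    // Point 2: a stream created with cudaStreamNonBlocking does not
    // implicitly synchronize with the legacy NULL stream.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // Point 3: choosing the L1/shared split (where it is configurable),
    // either device-wide or per kernel.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);
}
```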

For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete: Literally: on devices that support concurrent kernel execution and have compute capability 3.0 or lower (i.e. Fermi and first-generation Kepler; note that CUDA 9 had already dropped Fermi support by this point, so this should really say first-generation Kepler (3.0) only), any operation that needs to query or wait on (i.e. depends on) the completion status of an earlier kernel in some stream:

(1) must wait until all prior operations in all streams of the CUDA context have started executing before it can start;

(2) blocks all later operations in all streams of the current context until, as described above, the kernel it depends on has finished executing, or the query has returned its result (operation not yet complete).

In practice, though, point (2) is not accurate on old cards. The main issue is that old cards have only one physical kernel execution queue and two DMA queues (device -> host and host -> device), which in many cases prevents operations that could logically have run concurrently from doing so. But which operations actually "need to query or wait on (depend on) the completion status of an earlier kernel in some stream"?

Obviously, the common ones are the cudaMemcpy*() functions ending in Async,

plus cudaStreamQuery().

More broadly, the synchronous cudaMemcpy*() variants (without Async) and the various allocation functions also count, but that casts a very wide net.

Contrary to what this English passage says, practical experience shows that on old cards (Fermi and compute capability 3.0), using cudaStreamQuery() not only fails to block concurrency the way the manual suggests here, it can actually improve concurrent execution (this is easy to see on the profiler timeline on old cards). Fortunately, from compute capability 3.5 onward (e.g. the K40), and on Maxwell and Pascal, Hyper-Q is available, so none of these restrictions exist and users no longer need to learn command-issuing tricks. On any newer card, whatever issue order you use (depth-first, breadth-first, or anything you like), as long as the operations can logically run in parallel and resources allow, the card will run them in parallel, which is very convenient.
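For reference, cudaStreamQuery() itself returns without blocking the host; a typical polling pattern looks like this (the stream is assumed to have been created earlier):

```cuda
// Non-blocking completion check on a stream:
cudaError_t st = cudaStreamQuery(stream);  // returns immediately
if (st == cudaSuccess) {
    // everything issued to this stream so far has completed
} else if (st == cudaErrorNotReady) {
    // work is still pending; the host is free to do other things meanwhile
}
```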

The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution, concurrent kernel execution, and/or concurrent data transfers. Here, overlap means that operations overlap in time (execute simultaneously), as in this figure:

(Figure: a profiler timeline with substantial overlap; six kernels launched in total.)

The amount of execution overlap between two streams depends on the order in which commands are issued to each stream (this matters especially on cards without Hyper-Q; see the depth-first and breadth-first orders the manual discusses), on whether the device supports overlapping data transfer with kernel execution, on whether it supports concurrent kernel execution, and/or on whether it supports concurrent data transfers. (On compute capability 5.0 and below, bidirectional transfer required a professional card; from compute capability 5.2, consumer cards also support bidirectional transfer (dual copy engines). Bidirectional transfer used to be a professional-card feature but is now open to everyone. Similarly, using CUDA over Remote Desktop or from a service used to require TCC, and now works on consumer cards too; many features that once required professional cards have been opened up. Likewise, NVENC used to require a purchased license, while NVIDIA consumer cards now have encoding enabled, limited to two simultaneous encode sessions. The common remaining professional-card exclusives are full-rate double precision, ECC, encoding, and virtualization. The Titan series counts as semi-professional, and even the Jetson TX2 has ECC.)

However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization. In other words, on these devices the kernel launch in stream[1] is issued after the D->H copy in stream[0], so implicit synchronization blocks it until the first kernel in stream[0] has finished. Old cards have many restrictions like this: when issuing commands to multiple streams you must mind the order, among other pitfalls. Cards of compute capability 3.5+ have no such problems; with multiple streams, anything that can logically run concurrently will run concurrently when resources allow, regardless of these obscure conditions. And today's GPUs are all at least compute capability 5.0 anyway.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat official account 吉浦迅科技 (gpusolution)

Original publication date: 2018-05-11
