DAY 9: Reading about Streams in CUDA Asynchronous Concurrent Execution

Today we continue with Streams under asynchronous concurrent execution:

3.2.5.5.4. Implicit Synchronization

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

· a page-locked host memory allocation,

· a device memory allocation,

· a device memory set, [i.e. cudaMemset, the CUDA counterpart of the ordinary memset() function, generally used to initialize or zero device memory]

· a memory copy between two addresses to the same device memory,

· any CUDA command to the NULL stream, [any command issued to the default stream]

· a switch between the L1/shared memory configurations described in Compute Capability 3.x and Compute Capability 7.x.
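As an illustration of the list above (a hypothetical kernel and buffers; error checking omitted), issuing one of these operations between two async commands serializes work that could otherwise overlap:

```cuda
__global__ void MyKernel(float *p) { p[threadIdx.x] += 1.0f; }

// Sketch only: d_a/d_b are assumed to be valid device buffers and
// h_a/h_b page-locked host buffers of `bytes` bytes each.
void sketch(float *h_a, float *h_b, float *d_a, float *d_b, size_t bytes) {
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);

    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream[0]);
    MyKernel<<<1, 256, 0, stream[0]>>>(d_a);

    float *d_tmp;
    cudaMalloc(&d_tmp, bytes);  // device memory allocation in between:
                                // an implicit synchronization point

    // This copy can no longer overlap the work already queued in stream[0]:
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream[1]);
}
```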

For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:

· Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;

· Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.

Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:

· All independent operations should be issued before dependent operations,

· Synchronization of any kind should be delayed as long as possible.

3.2.5.5.5. Overlapping Behavior

The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution, concurrent kernel execution, and/or concurrent data transfers.

For example, on devices that do not support concurrent data transfers, the two streams of the code sample of Creation and Destruction do not overlap at all because the memory copy from host to device is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it can only start once the memory copy from device to host issued to stream[0] has completed. If the code is rewritten the following way (and assuming the device supports overlap of data transfer and kernel execution)
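The guide's code block is missing from this excerpt; its breadth-first rewrite looks roughly like this (the names hostPtr, inputDevPtr, outputDevPtr, size, and MyKernel follow the guide's Creation and Destruction sample):

```cuda
// Breadth-first issue order: all H->D copies first, then all kernel
// launches, then all D->H copies, so no stream's copy is queued behind
// another stream's copy on the same copy engine.
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
```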

then the memory copy from host to device issued to stream[1] overlaps with the kernel launch issued to stream[0].

On devices that do support concurrent data transfers, the two streams of the code sample of Creation and Destruction do overlap: The memory copy from host to device issued to stream[1] overlaps with the memory copy from device to host issued to stream[0] and even with the kernel launch issued to stream[0] (assuming the device supports overlap of data transfer and kernel execution). However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization. If the code is rewritten as above, the kernel executions overlap (assuming the device supports concurrent kernel execution) since the second kernel launch is issued to stream[1] before the memory copy from device to host is issued to stream[0]. In that case however, the memory copy from device to host issued to stream[0] only overlaps with the last thread blocks of the kernel launch issued to stream[1] as per Implicit Synchronization, which can represent only a small portion of the total execution time of the kernel.

Notes and hands-on commentary:

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

  1. A few notes on the items listed above: "a device memory set" means cudaMemset, the CUDA version of the ordinary memset() function. It can zero a region of device memory or fill it with a specific byte pattern (for example 0xff), and is generally used for initialization. The point is that configuration-style or initialization work generally cannot run concurrently with other operations. But such operations usually sit at the start of a program, so in practice this has little impact.
  2. As for the default stream: by default it cannot run concurrently with other streams, unless you explicitly create your streams with the flag that allows them to execute asynchronously alongside it.
  3. As for the 3.x and 7.x item: 2.x (support dropped as of CUDA 9.0), 3.x, and 7.x all have an adjustable L1 cache / shared memory split, while Maxwell made it fixed and Pascal kept it fixed, and then 7.x made it adjustable again. That is: 2.x adjustable -> 3.x adjustable -> 5.x fixed -> 6.x fixed -> 7.x adjustable.
  4. "a memory copy between two addresses to the same device memory" is probably not a CUDA limitation as such; rather, if copies a -> c and b -> c ran at the same time, the result could be a logical mess. Say you copy 10 MB from a to c and then 10 MB from b to c: done one after the other, b overwrites a's result; done concurrently, the outcome is undefined (part of c might come from a and another part from b). From that angle the restriction makes sense, but surely nobody writes code like this; anyone who did would already know it is unsafe, so it is unclear why CUDA calls it out separately. It is like a restaurant guest who may drink tea first and wine later, but nobody alternates a sip of wine with a sip of tea. The restaurant has simply posted a notice: we do not support drinking wine and tea at the same time; you may order only one at a time.
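Points 1 through 3 above map to concrete runtime calls. A minimal sketch (error checking omitted; MyKernel is a hypothetical kernel used only to anchor the per-kernel call):

```cuda
__global__ void MyKernel(float *p) { p[threadIdx.x] = 0.0f; }

void setup_sketch(void) {
    // Point 1: cudaMemset fills device memory byte by byte.
    float *d_buf;
    cudaMalloc(&d_buf, 1024 * sizeof(float));
    cudaMemset(d_buf, 0, 1024 * sizeof(float));     // clear to zero
    cudaMemset(d_buf, 0xff, 1024 * sizeof(float));  // fill every byte with 0xff

    // Point 2: a stream created with cudaStreamNonBlocking does not
    // implicitly synchronize with the legacy NULL stream.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // Point 3: choosing the L1/shared split (where it is configurable),
    // either device-wide or per kernel.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);
}
```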

For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete: Literally: on devices that support concurrent kernel execution and have compute capability 3.0 or lower (i.e. Fermi and first-generation Kepler; note that CUDA 9 had already dropped Fermi support by this point, so this should really say first-generation Kepler (3.0) only), any operation that needs to query or wait on (i.e. depends on) the completion status of an earlier kernel in some stream:

(1) must wait until all prior operations in all streams of the CUDA context have started executing before it can start;

(2) blocks all later operations in all streams of the current context until, as described above, the kernel it depends on has finished executing, or the query has returned its result (operation not yet complete).

In practice, though, point (2) is not accurate on old cards. The main issue is that old cards have only one physical kernel execution queue and two DMA queues (device -> host and host -> device), which in many cases prevents operations that could logically have run concurrently from doing so. But which operations actually "need to query or wait on (depend on) the completion status of an earlier kernel in some stream"?

Obviously, the common ones are the cudaMemcpy*() functions ending in Async,

plus cudaStreamQuery().

More broadly, the synchronous cudaMemcpy*() variants (without Async) and the various allocation functions also count, but that casts a very wide net.

Contrary to what this English passage says, practical experience shows that on old cards (Fermi and compute capability 3.0), using cudaStreamQuery() not only fails to block concurrency the way the manual suggests here, it can actually improve concurrent execution (this is easy to see on the profiler timeline on old cards). Fortunately, from compute capability 3.5 onward (e.g. the K40), and on Maxwell and Pascal, Hyper-Q is available, so none of these restrictions exist and users no longer need to learn command-issuing tricks. On any newer card, whatever issue order you use (depth-first, breadth-first, or anything you like), as long as the operations can logically run in parallel and resources allow, the card will run them in parallel, which is very convenient.
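For reference, cudaStreamQuery() itself returns without blocking the host; a typical polling pattern looks like this (the stream is assumed to have been created earlier):

```cuda
// Non-blocking completion check on a stream:
cudaError_t st = cudaStreamQuery(stream);  // returns immediately
if (st == cudaSuccess) {
    // everything issued to this stream so far has completed
} else if (st == cudaErrorNotReady) {
    // work is still pending; the host is free to do other things meanwhile
}
```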

The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution, concurrent kernel execution, and/or concurrent data transfers. Here, overlap means that operations overlap in time (execute simultaneously), as in this figure:

(Figure: a profiler timeline with substantial overlap; six kernels launched in total.)

The amount of execution overlap between two streams depends on the order in which commands are issued to each stream (this matters especially on cards without Hyper-Q; see the depth-first and breadth-first orders the manual discusses), on whether the device supports overlapping data transfer with kernel execution, on whether it supports concurrent kernel execution, and/or on whether it supports concurrent data transfers. (On compute capability 5.0 and below, bidirectional transfer required a professional card; from compute capability 5.2, consumer cards also support bidirectional transfer (dual copy engines). Bidirectional transfer used to be a professional-card feature but is now open to everyone. Similarly, using CUDA over Remote Desktop or from a service used to require TCC, and now works on consumer cards too; many features that once required professional cards have been opened up. Likewise, NVENC used to require a purchased license, while NVIDIA consumer cards now have encoding enabled, limited to two simultaneous encode sessions. The common remaining professional-card exclusives are full-rate double precision, ECC, encoding, and virtualization. The Titan series counts as semi-professional, and even the Jetson TX2 has ECC.)

However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization. In other words, on these devices the kernel launch in stream[1] is issued after the D->H copy in stream[0], so implicit synchronization blocks it until the first kernel in stream[0] has finished. Old cards have many restrictions like this: when issuing commands to multiple streams you must mind the order, among other pitfalls. Cards of compute capability 3.5+ have no such problems; with multiple streams, anything that can logically run concurrently will run concurrently when resources allow, regardless of these obscure conditions. And today's GPUs are all at least compute capability 5.0 anyway.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat official account 吉浦迅科技 (gpusolution)

Original publication date: 2018-05-11
