DAY 8: Reading "Streams" in CUDA Asynchronous Concurrent Execution

We are guiding readers through the English-language CUDA C Programming Guide. Today is Day 8. Over the next few days we will study the CUDA programming interface, the most important part of which is the CUDA C runtime. We hope that over the remaining 92 days you will learn CUDA from the original source, while also building the habit of reading English documentation.

This article is about 581 words; estimated reading time 15 minutes.

Today's topic is still the CUDA C runtime. We have already covered initialization, device memory, shared memory, and page-locked memory, and yesterday we began asynchronous concurrent execution. Today we cover Streams:

3.2.5.5. Streams

Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).

3.2.5.5.1. Creation and Destruction

A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.
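The sample in question, reproduced from the guide (here `size` is assumed to be defined earlier as the number of bytes each stream processes):

```cuda
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);  // page-locked host memory for both streams
```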

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:
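The per-stream sequence from the guide looks like this (`inputDevPtr`, `outputDevPtr`, and `MyKernel` are assumed to be defined elsewhere):

```cuda
for (int i = 0; i < 2; ++i) {
    // copy this stream's half of the input to the device
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    // process it on the device
    MyKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
    // copy this stream's half of the result back to the host
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
```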

Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Overlapping Behavior describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.

Streams are released by calling cudaStreamDestroy().

In case the device is still doing work in the stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with the stream will be released automatically once the device has completed all work in the stream.
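Releasing the two streams created earlier:

```cuda
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);
```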

3.2.5.5.2. Default Stream

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.

For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and each host thread has its own default stream.

For code that is compiled using the --default-stream legacy compilation flag, the default stream is a special stream called the NULL stream and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization as described in Implicit Synchronization.

For code that is compiled without specifying a --default-stream compilation flag, --default-stream legacy is assumed as the default.
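For example, the three compilation modes correspond to these nvcc invocations (`app.cu` is a placeholder file name):

```shell
# Each host thread gets its own regular default stream
nvcc --default-stream per-thread -o app app.cu

# Equivalent via the macro, which must take effect before the CUDA headers are included
nvcc -DCUDA_API_PER_THREAD_DEFAULT_STREAM -o app app.cu

# Legacy NULL-stream behavior; this is also what you get with no flag at all
nvcc --default-stream legacy -o app app.cu
```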

3.2.5.5.3. Explicit Synchronization

There are various ways to explicitly synchronize streams with each other.

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.

cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.
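A minimal sketch of ordering two streams with an event (kernel and variable names here are illustrative, not from the guide):

```cuda
cudaEvent_t done;
cudaEventCreate(&done);

// producer work in stream[0], then record the event after it
ProducerKernel<<<grid, block, 0, stream[0]>>>(devBuf);
cudaEventRecord(done, stream[0]);

// commands enqueued in stream[1] after this call will not start
// until `done` has completed in stream[0]
cudaStreamWaitEvent(stream[1], done, 0);
ConsumerKernel<<<grid, block, 0, stream[1]>>>(devBuf);
```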

cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
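Unlike the functions above, this one does not block; a sketch of a polling pattern:

```cuda
// cudaSuccess if all preceding work in the stream has finished,
// cudaErrorNotReady otherwise
if (cudaStreamQuery(stream[0]) == cudaSuccess) {
    // safe to consume results produced in stream[0]
} else {
    // do other host-side work and poll again later
}
```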

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.

Notes / experience sharing:

Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined). That is, under normal circumstances no ordering whatsoever is guaranteed between streams (put differently, they are unordered), and you must not assume any ordering relationship between them (although you can enforce one through events, explained later).

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.

A default stream always exists; it does not need to be created, and not specifying a stream is the same as using the default stream. Because the default stream is itself a stream, and operations within a stream always run in order, operations issued without specifying any stream execute one after another. (This means you get no overlap, e.g. between transfer and compute, or between two computations; it is also why, if you want some operations to run concurrently, you must create and use additional streams.)

Explicit Synchronization contrasts with implicit synchronization. For example, the plain (synchronous) cudaMemcpy waits for preceding operations in the synchronizing NULL stream, and in any other streams that keep legacy synchronization behavior with respect to it; that is implicit synchronization. Explicit means you visibly wrote a line of synchronization code; implicit means you wrote none, yet the operations still waited and synchronized.

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed. That is, cudaDeviceSynchronize() blocks (does not return) until all preceding commands, in all streams of all host threads, have finished. Note that this sentence is no longer accurate: it was true in older CUDA versions, but the manual was never updated. More precisely, the call waits for the preceding operations in all streams, from all host threads, whose current device is the same as the device set in the host thread calling cudaDeviceSynchronize(). At some point CUDA was extended to let multiple host threads drive different GPUs, and only host threads whose device matches that of the calling thread are covered. As an extreme case, with 4 GPUs and 4 host threads, each thread driving one GPU, cudaDeviceSynchronize() blocks only the calling host thread and ignores the other three. That is a counterexample to the manual's wording.

cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event. In other words, this pushes a special marker into the stream that waits for the event to complete (to be recorded); only then may commands subsequently enqueued in that stream execute. If 0 or NULL is passed as the stream parameter, then all commands subsequently added to every other ordinary stream must wait for the event before proceeding. This can be used for cross-GPU synchronization. In OpenCL the equivalent construct is called a barrier.

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing. In other words, synchronize sparingly to avoid gratuitous slowdowns. The second use, isolating failures, need not be done by hand these days: a tool such as Nsight can locate errors quickly without the user manually bisecting the code section by section, and can pinpoint an out-of-bounds memory access in one pass. In practice people rarely get cudaMemcpy*-style calls wrong; the vast majority of memory-access errors are bugs inside kernels, where Nsight is the more convenient tool.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat public account 吉浦迅科技 (gpusolution).

Original publication date: 2018-05-10.
