
DAY 8: Reading Streams in CUDA Asynchronous Concurrent Execution

Author: GPUS Lady
Published: 2018-06-25 16:09:33
Column: GPUS开发者

We are leading readers through the English-language CUDA C Programming Guide. Today is Day 8. We are spending several days on the CUDA programming interface, the most important part of which is the CUDA C runtime. We hope that over the next 92 days you will learn CUDA straight from the source while building the habit of reading English.

This article is about 581 words; estimated reading time 15 minutes.

Today's topic is still the CUDA C Runtime. We have already covered initialization, device memory, shared memory, and page-locked memory, and yesterday we began asynchronous concurrent execution. Today we cover Streams within asynchronous concurrent execution:

3.2.5.5. Streams

Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).

3.2.5.5.1. Creation and Destruction

A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.
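(The sample itself is not reproduced in this post. Below is a minimal sketch matching the guide's description, with size assumed to be the number of float elements each stream processes.)

```cuda
// Create two streams and allocate an array of 2*size floats in
// page-locked (pinned) host memory; size = elements per stream.
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size * sizeof(float));
```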

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:
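(Again the sample is missing here; the sketch below is consistent with the description, assuming MyKernel, inputDevPtr, and outputDevPtr are defined elsewhere and size is the per-stream element count as above.)

```cuda
// Per stream: async copy H2D, kernel launch, async copy D2H.
// The three commands queue up in stream[i] and run in order within it,
// but the two streams may overlap with each other.
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>
            (outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
}
```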

Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Overlapping Behavior describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.

Streams are released by calling cudaStreamDestroy().
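(A sketch of releasing the two streams created above:)

```cuda
// Destroy both streams; this is safe even if the device is still
// working in them, as explained next.
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);
```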

In case the device is still doing work in the stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with the stream will be released automatically once the device has completed all work in the stream.

3.2.5.5.2. Default Stream

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.

For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and each host thread has its own default stream.

For code that is compiled using the --default-stream legacy compilation flag, the default stream is a special stream called the NULL stream and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization as described in Implicit Synchronization.

For code that is compiled without specifying a --default-stream compilation flag, --default-stream legacy is assumed as the default.
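As a sketch of the two opt-in routes just described (the flag names are as given above; the macro route applies where you control header inclusion, e.g. host-compiler translation units):

```cuda
// Route 1: compilation flag, applied to the whole translation unit:
//   nvcc --default-stream per-thread app.cu
//
// Route 2: define the macro before any CUDA header is included:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1
#include <cuda_runtime.h>
// Either way, each host thread now gets its own regular default stream
// instead of the shared, implicitly synchronizing NULL stream.
```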

3.2.5.5.3. Explicit Synchronization

There are various ways to explicitly synchronize streams with each other.

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.

cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.

cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.
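A minimal sketch pulling the host-side calls together (using the stream[2] array from the earlier example; error handling omitted):

```cuda
// Non-blocking check: has stream[0] finished all queued work?
if (cudaStreamQuery(stream[0]) == cudaSuccess) {
    // all preceding commands in stream[0] have completed
}

// Block the host until stream[0] alone drains; stream[1] may keep
// executing on the device in the meantime.
cudaStreamSynchronize(stream[0]);

// Block the host until every stream on the current device drains.
cudaDeviceSynchronize();
```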

Author's notes / experience sharing:

Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).

Under normal circumstances, no ordering whatsoever is guaranteed between streams (put differently, they are mutually unordered), and you must not assume any ordering relationship between them (you can, however, enforce one through events, which we explain later).

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.

A default stream always exists; you never create it, and issuing work without specifying a stream is the same as using the default stream. Since the default stream is itself a stream, and operations within one stream always run in order, commands issued without any stream execute strictly one after another. (That means you get none of the possible overlap, for example between a transfer and a kernel, or between two kernels. This is exactly why, if you want some operations to run concurrently, you must create and use multiple extra streams; see the sketch below.)
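A sketch of the serialization just described, with hypothetical buffers hIn/hOut/dIn/dOut and element count n:

```cuda
// No stream argument anywhere, so every command lands in the default
// stream and runs strictly in order: copy, then kernel, then copy.
// Nothing here can overlap with anything else.
cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
MyKernel<<<100, 512>>>(dOut, dIn, n);
cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
```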

Explicit Synchronization stands in contrast to implicit synchronization. For example, the plain (blocking) cudaMemcpy waits for preceding work in the synchronizing stream, and for operations in any other stream that keeps the legacy synchronization behavior relative to it; that is implicit synchronization. Explicit means you visibly wrote a line of synchronizing code; implicit means you wrote none, yet the operations still wait on each other and synchronize.

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.

That is, cudaDeviceSynchronize() blocks (does not return) until all preceding commands in all streams controlled by all host threads have completed, and only then returns. Note, however, that this sentence is no longer accurate as written. It was true in older CUDA versions, but the manual was never updated. Precisely speaking, the call waits for the completion of all preceding operations in all host threads' streams whose associated device is the same as the device currently set in the host thread that calls cudaDeviceSynchronize(). At some point CUDA extended its functionality to allow multiple host threads to drive multiple GPUs separately, and only those host threads whose current device matches the calling thread's device are included. In the extreme case of 4 GPUs plus 4 host threads, with each host thread driving exactly one card, cudaDeviceSynchronize() only blocks the calling host thread (makes it wait) and leaves the other 3 host threads alone. That is the counterexample.
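A sketch of that per-device scope (hypothetical kernels workA/workB; each host thread has set a different current device, and grid/block are assumed defined):

```cuda
// In host thread A:
cudaSetDevice(0);
workA<<<grid, block>>>();        // queued on GPU 0
cudaDeviceSynchronize();         // waits only for GPU 0's streams

// In host thread B, running concurrently and unaffected by A's call:
cudaSetDevice(1);
workB<<<grid, block>>>();        // queued on GPU 1
```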

cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.

In effect, a special marker is pushed into the stream; it waits for the event to complete (to be recorded), and only then may commands enqueued into that stream afterwards proceed. If 0 or NULL is passed as the stream argument, then all commands subsequently added to every other ordinary stream must wait for the event to complete before continuing. This can also be used for cross-device synchronization. In OpenCL, this construct is called a barrier.
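A minimal sketch of that barrier-like pattern within one device (hypothetical kernels producer/consumer; the final argument of cudaStreamWaitEvent() is a flags word, 0 by default):

```cuda
cudaEvent_t ev;
cudaEventCreate(&ev);

producer<<<grid, block, 0, stream[0]>>>(devBuf);  // fills devBuf
cudaEventRecord(ev, stream[0]);                   // mark this point

// Commands pushed into stream[1] after this call wait on ev:
cudaStreamWaitEvent(stream[1], ev, 0);
consumer<<<grid, block, 0, stream[1]>>>(devBuf);  // sees producer's writes

cudaEventDestroy(ev);
```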

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.

In other words, it is best to use these synchronization functions only for timing, or for isolating a failing piece of code such as a kernel launch or a memory copy. Synchronize sparingly and you avoid innocent slowdowns. The second use need not be done by hand, though: a tool such as Nsight can usually pinpoint the error quickly, with no need to manually bisect the code section by section. If some kernel has an out-of-bounds memory access, it can locate the faulty access in one pass. In practice hardly anyone gets cudaMemcpy*-style calls wrong, and the vast majority of memory-access errors are bugs inside kernels, so Nsight is the more convenient route.

If anything is unclear, please leave a comment below this article, or post on our technical forum at bbs.gpuworld.cn.
