DAY 8: Reading "Streams" in CUDA Asynchronous Concurrent Execution

We are leading readers through the English-language CUDA C Programming Guide. Today is Day 8. Over the next few days we will study CUDA's programming interface, the most important part of which is the CUDA C runtime. We hope that over the remaining 92 days you can learn CUDA from the original source, and build a habit of reading English along the way.

This article runs about 581 words; estimated reading time 15 minutes.

Today's topic is still the CUDA C runtime. We have already covered initialization, device memory, shared memory, and page-locked memory, and yesterday we began asynchronous concurrent execution. Today we cover Streams:

3.2.5.5. Streams

Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).

3.2.5.5.1. Creation and Destruction

A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.
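The code sample referenced here was lost in extraction; the following sketch matches the guide's description (assuming `size` holds the per-stream allocation size in bytes):

```cuda
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

// Allocate page-locked (pinned) host memory so that
// host <-> device copies can overlap with kernel execution.
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);
```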

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:
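That per-stream sequence can be sketched as follows (assuming `inputDevPtr`, `outputDevPtr`, and `MyKernel` have been defined elsewhere, as in the guide's surrounding example):

```cuda
for (int i = 0; i < 2; ++i) {
    // Async copy of this stream's slice of pinned host memory to the device
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    // Launch the kernel in the same stream (the 4th launch-config parameter)
    MyKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
    // Async copy of the result back to the same slice of hostPtr
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
```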

Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Overlapping Behavior describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.

Streams are released by calling cudaStreamDestroy().

In case the device is still doing work in the stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with the stream will be released automatically once the device has completed all work in the stream.
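Continuing the two-stream example, destruction is a single call per stream:

```cuda
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);
// If work is still pending in a stream, the call returns immediately
// and the stream's resources are freed once that work completes.
```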

3.2.5.5.2. Default Stream

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.

For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and each host thread has its own default stream.

For code that is compiled using the --default-stream legacy compilation flag, the default stream is a special stream called the NULL stream and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization as described in Implicit Synchronization.

For code that is compiled without specifying a --default-stream compilation flag, --default-stream legacy is assumed as the default.

3.2.5.5.3. Explicit Synchronization

There are various ways to explicitly synchronize streams with each other.

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.

cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.

cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.
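A minimal sketch of the three synchronization APIs above, assuming the two streams `stream[0]` and `stream[1]` from the earlier creation example:

```cuda
// Block the host until all preceding work on the current device is done
cudaDeviceSynchronize();

// Block the host only on stream[0]; stream[1] may keep running on the device
cudaStreamSynchronize(stream[0]);

// Non-blocking query: returns cudaSuccess if stream[1] has drained,
// cudaErrorNotReady if commands are still in flight
if (cudaStreamQuery(stream[1]) == cudaSuccess) {
    // safe to read results produced in stream[1]
}
```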

Notes and commentary on this article:

Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined). Commentary: under normal circumstances no ordering whatsoever is guaranteed between streams (in other words, they may run out of order), and you must not assume any ordering relationship between them. (You can, however, enforce an ordering with events, explained later.)

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.

Commentary: a default stream always exists; it does not need to be created, and not specifying a stream is equivalent to using it. Since the default stream is itself a stream, and operations within a stream always execute in order, operations issued without specifying any stream execute one after another in sequence. (This means you get no overlap between, say, transfer and computation, or between two computations. That is exactly why, if you want some operations to run concurrently, you must create and use additional streams.)

Explicit synchronization contrasts with implicit synchronization. For example, the plain (non-async) cudaMemcpy waits for prior operations in the synchronizing stream, and for operations in other streams that keep the legacy synchronous behavior relative to it; that is implicit synchronization. "Explicit" means you visibly wrote a line of code to synchronize; "implicit" means you wrote nothing, yet the operations still waited and synchronized.

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed. Commentary: cudaDeviceSynchronize() blocks (does not return) until all preceding commands in all streams of all host threads have completed. Note that as stated this sentence is no longer accurate: it was true in older CUDA versions, but the manual was never updated. More precisely, the call waits for the completion of all preceding operations in all streams of all host threads whose current device is the same as the device set in the host thread calling cudaDeviceSynchronize(). Starting from a certain CUDA version, the API was extended to allow multiple host threads to drive different GPUs, and only the streams of host threads whose current device matches the calling thread's device are waited on. As an extreme example, with 4 GPUs and 4 host threads, each thread driving one GPU, cudaDeviceSynchronize() blocks only the calling host thread and ignores the other three. That is a counterexample to the manual's wording.
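A hypothetical sketch of the per-device behavior described above (`kernelA`, `grid`, and `block` are placeholder names, not from the guide):

```cuda
// One host thread per GPU: cudaDeviceSynchronize() only waits for
// the device that is current on the calling thread.
cudaSetDevice(0);              // this host thread drives device 0
kernelA<<<grid, block>>>();    // enqueue work on device 0
cudaDeviceSynchronize();       // waits for device 0 only; devices driven
                               // by other host threads are unaffected
```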

cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event. Commentary: this pushes a special marker into the stream; the marker waits for the event to complete (to be recorded), and only then can commands subsequently pushed into that stream execute. If the function is called with 0 or NULL as the stream argument, then all commands subsequently added to all other regular streams must wait for the event to complete before proceeding. This can be used for cross-GPU synchronization. The equivalent concept in OpenCL is called a barrier.
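A minimal producer/consumer sketch of event-based cross-stream ordering (assuming the two streams from the earlier example; `producer`, `consumer`, `grid`, `block`, and `buf` are hypothetical names):

```cuda
cudaEvent_t event;
cudaEventCreate(&event);

// Producer: enqueue work in stream[0], then record an event after it
producer<<<grid, block, 0, stream[0]>>>(buf);
cudaEventRecord(event, stream[0]);

// Consumer: make stream[1] wait for the event before its later commands run
cudaStreamWaitEvent(stream[1], event, 0);
consumer<<<grid, block, 0, stream[1]>>>(buf);
```

The host thread never blocks here; the ordering is enforced entirely on the device.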

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing. Commentary: synchronize sparingly and you avoid needless slowdowns. As for isolating failing code, you usually do not need to do it by hand: tools such as Nsight can quickly pinpoint errors, for instance locating an out-of-bounds memory access in a kernel in one pass, without the user manually bisecting the code segment by segment. In practice people rarely get cudaMemcpy* calls wrong; the vast majority of memory-access errors are bugs inside kernels, for which Nsight is far more convenient.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat public account 吉浦迅科技 (gpusolution).

Original publication date: 2018-05-10.
