
DAY12: Reading CUDA C Runtime: Multi-GPU Programming

Author: GPUS Lady
Published: 2018-06-22
Column: GPUS开发者

Today we cover multi-GPU programming in a single article.

3.2.6. Multi-Device System

3.2.6.1. Device Enumeration

A host system can have multiple devices. The following code sample shows how to enumerate these devices, query their properties, and determine the number of CUDA-enabled devices.
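The guide's code sample is not reproduced in this post; below is a minimal sketch of the same enumeration pattern (the printed fields are illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);  // number of CUDA-enabled devices
    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device);  // query per-device properties
        printf("Device %d has compute capability %d.%d.\n",
               device, deviceProp.major, deviceProp.minor);
    }
    return 0;
}
```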

3.2.6.2. Device Selection

A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.

The following code sample illustrates how setting the current device affects memory allocation and kernel execution.
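The original sample is omitted here; the following is a minimal sketch of the pattern, where MyKernel is a hypothetical kernel used only to show where each allocation and launch lands:

```cpp
__global__ void MyKernel(float* p) { /* hypothetical kernel body */ }

void selectDevices() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);             // device 0 becomes the current device
    float* p0;
    cudaMalloc(&p0, size);        // allocation lands on device 0
    MyKernel<<<1000, 128>>>(p0);  // launch executes on device 0

    cudaSetDevice(1);             // device 1 becomes the current device
    float* p1;
    cudaMalloc(&p1, size);        // allocation lands on device 1
    MyKernel<<<1000, 128>>>(p1);  // launch executes on device 1
}
```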

3.2.6.3. Stream and Event Behavior

A kernel launch will fail if it is issued to a stream that is not associated to the current device as illustrated in the following code sample.
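A minimal sketch of this rule (MyKernel and the launch configurations are illustrative assumptions):

```cpp
__global__ void MyKernel() { /* hypothetical empty kernel */ }

void streamBehavior() {
    cudaSetDevice(0);                // device 0 is current
    cudaStream_t s0;
    cudaStreamCreate(&s0);           // s0 is associated with device 0
    MyKernel<<<100, 64, 0, s0>>>();  // OK: s0 belongs to the current device

    cudaSetDevice(1);                // device 1 is current
    cudaStream_t s1;
    cudaStreamCreate(&s1);           // s1 is associated with device 1
    MyKernel<<<100, 64, 0, s1>>>();  // OK

    // This launch fails: s0 is associated with device 0, but device 1 is current.
    MyKernel<<<100, 64, 0, s0>>>();
}
```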

A memory copy will succeed even if it is issued to a stream that is not associated to the current device.

cudaEventRecord() will fail if the input event and input stream are associated to different devices.

cudaEventElapsedTime() will fail if the two input events are associated to different devices.

cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.

cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.
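A minimal sketch of such cross-device synchronization with cudaStreamWaitEvent() (the names and launch configurations are assumptions):

```cpp
__global__ void MyKernel() { /* hypothetical empty kernel */ }

void crossDeviceSync() {
    cudaSetDevice(0);
    cudaStream_t s0;  cudaStreamCreate(&s0);
    cudaEvent_t  e0;  cudaEventCreate(&e0);
    MyKernel<<<100, 64, 0, s0>>>();  // work on device 0
    cudaEventRecord(e0, s0);         // event and stream both belong to device 0: OK

    cudaSetDevice(1);
    cudaStream_t s1;  cudaStreamCreate(&s1);
    cudaStreamWaitEvent(s1, e0, 0);  // succeeds even though e0 belongs to device 0
    MyKernel<<<100, 64, 0, s1>>>();  // starts only after e0 has completed on device 0
}
```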

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

3.2.6.4. Peer-to-Peer Memory Access

When the application is run as a 64-bit process, devices of compute capability 2.0 and higher from the Tesla series may address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections.

A unified address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices as shown in the code sample below.
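A minimal sketch of enabling and using peer access (MyKernel is a hypothetical kernel):

```cpp
__global__ void MyKernel(float* p) { /* hypothetical kernel body */ }

void peerAccess() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);                  // device 0 is current
    float* p0;
    cudaMalloc(&p0, size);             // allocate on device 0
    MyKernel<<<1000, 128>>>(p0);       // kernel on device 0 uses p0

    cudaSetDevice(1);                  // device 1 is current
    cudaDeviceEnablePeerAccess(0, 0);  // enable access from device 1 to device 0

    // Thanks to the unified address space, the same pointer p0 can be
    // dereferenced by a kernel running on device 1.
    MyKernel<<<1000, 128>>>(p0);
}
```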

3.2.6.5. Peer-to-Peer Memory Copy

Memory copies can be performed between the memories of two different devices.

When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory.

Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() as illustrated in the following code sample.
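A minimal sketch of a peer-to-peer copy with cudaMemcpyPeer() (MyKernel is a hypothetical kernel):

```cpp
__global__ void MyKernel(float* p) { /* hypothetical kernel body */ }

void peerCopy() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);
    float* p0;
    cudaMalloc(&p0, size);               // allocate on device 0
    cudaSetDevice(1);
    float* p1;
    cudaMalloc(&p1, size);               // allocate on device 1

    cudaSetDevice(0);
    MyKernel<<<1000, 128>>>(p0);         // produce data on device 0
    cudaSetDevice(1);
    cudaMemcpyPeer(p1, 1, p0, 0, size);  // copy from device 0 to device 1
    MyKernel<<<1000, 128>>>(p1);         // consume the copy on device 1
}
```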

A copy (in the implicit NULL stream) between the memories of two different devices:

· does not start until all commands previously issued to either device have completed and

· runs to completion before any commands (see Asynchronous Concurrent Execution) issued after the copy to either device can start.

Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with copies or kernels in another stream.

Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess() as described in Peer-to-Peer Memory Access, peer-to-peer memory copy between these two devices no longer needs to be staged through the host and is therefore faster.

Notes and commentary on this article:

streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0

Once you set a device, for example cudaSetDevice(3) to select card 3, all subsequent device memory allocations (cudaMalloc), stream creations, and kernel launches take place on that card. In other words, if you have 4 cards and need 1 GB of device memory on each of them, you must call cudaSetDevice(0), (1), (2), (3) in turn and do a cudaMalloc after each call. Likewise, to run your kernel on all 4 cards, you must perform 4 separate launches, one after each cudaSetDevice, rather than launching once and having it run on all 4 cards at the same time. In short, multi-GPU programming is manual, not automatic. If you never set a device, everything runs on device 0 by default, and the remaining cards are simply wasted. A minimal sketch of this manual pattern follows.

Note that CUDA 9 introduced cooperative groups, which include a special API that can launch a kernel on multiple cards at once, but it comes with many restrictions and is limited to C++; we will cover it another time. Also note that many libraries (for example, the bundled cuBLAS) can use multiple cards automatically, which we will also cover later. Keep in mind, though, that libraries like cuBLAS that appear to be automatically multi-GPU still drive the cards manually in exactly this way internally; they merely hide it from the user, so it only looks automatic.
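A minimal sketch of that manual pattern, assuming a 4-GPU system and a hypothetical kernel myKernel:

```cpp
__global__ void myKernel(float* p) { /* hypothetical kernel body */ }

void driveAllDevices() {
    const int numDevices = 4;          // assumption: 4 cards in the system
    float*    buffers[numDevices];
    size_t    size = size_t(1) << 30;  // 1 GB per card

    for (int dev = 0; dev < numDevices; ++dev) {
        cudaSetDevice(dev);                    // make card `dev` current
        cudaMalloc(&buffers[dev], size);       // allocation lands on card `dev`
        myKernel<<<256, 256>>>(buffers[dev]);  // launch runs on card `dev`
    }
    // Nothing runs on the other cards unless each one is selected and
    // driven explicitly, as in this loop.
}
```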

A kernel launch will fail if it is issued to a stream that is not associated to the current device. The stream must match the current card: if you switch to another card (for example, cudaSetDevice to card 2) but try to use a stream that belongs to the previous card (say, a stream created on card 1), no kernel can be launched with that combination. In other words, you cannot launch a kernel on card 2 while using a stream that does not belong to that card. (What is the relationship between a stream and a kernel? A kernel must be launched into a stream, and all operations in a stream execute in order; the equivalent concept in OpenCL is the CommandQueue.)

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

In a multi-GPU environment, each card has its own default stream, so commands issued to the default streams of different cards may run out of order with respect to each other. Frankly, this sentence states the obvious: "may execute out of order" already says it all. Kernel 1 may run before kernel 2, kernel 2 may run before kernel 1, or they may start and finish at the same time; all of these are possible.

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections. That is, P2P memory access between two devices has to be enabled with cudaDeviceEnablePeerAccess(), and within one system each card can establish P2P access with at most 8 other cards.

What is the difference between Peer-to-Peer Memory Access and Peer-to-Peer Memory Copy? With the former, card B can use card A's device memory directly, as if it were its own. The latter is a P2P copy: card B must first copy the contents of card A's memory into its own memory, and only then can card B (that is, a kernel on it) use the data. The former is used in place; the latter requires a copy. Whenever the former is available, it is recommended, unless:

(1) The motherboard does not support it (for example, the two cards are plugged into PCI-E slots hanging off the two different CPUs of a dual-socket system). (2) The system does not support it (for example, you try to use it on Windows with consumer cards, which do not support TCC). (3) Everything supports it and direct access would work, but the buffer will be used repeatedly, and always reaching across PCI-E into the other card's memory is inefficient, so you copy it over manually and use the local replica on your own card instead (see the decision sketch below).
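A hedged sketch of that decision: check cudaDeviceCanAccessPeer() and either enable direct access or fall back to an explicit peer copy (MyKernel, the device numbers, and size are illustrative assumptions):

```cpp
__global__ void MyKernel(float* p) { /* hypothetical kernel body */ }

void accessOrCopy(float* p0 /* memory owned by device 0 */, size_t size) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);  // can device 1 access device 0's memory?

    cudaSetDevice(1);
    if (canAccess) {
        cudaDeviceEnablePeerAccess(0, 0);    // use device 0's memory in place
        MyKernel<<<1000, 128>>>(p0);
    } else {
        float* p1;
        cudaMalloc(&p1, size);               // local replica on device 1
        cudaMemcpyPeer(p1, 1, p0, 0, size);  // staged copy (possibly via host memory)
        MyKernel<<<1000, 128>>>(p1);
    }
}
```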

Note that on Windows, P2P Copy is fully available, whereas P2P Access requires a professional card plus TCC. When a direct copy is impossible (for example, case (1) above), a P2P Copy automatically falls back to staging through host memory, while P2P Access simply fails. P2P Access has the advantage that a single card can use the device memory of 2, 3, 4, or even 8 cards, which is very convenient for applications that need a lot of device memory. There is also a super-charged version of P2P Access: the one on DGX. There, P2P Access between cards can run not only over PCI-E but also over NVLink, with very high bandwidth, so the device memory of all the cards in a DGX can essentially be pooled, which suits applications with huge memory footprints. The ordinary version of P2P Access, when the motherboard, the OS, and the cards all support it, is slower than on a DGX but still solves the problem of not having enough device memory. P2P Copy, by contrast, copies one card's memory into your own card's memory, so it cannot enlarge the effective memory capacity and does not help with that problem.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.
