
DAY12: Reading the CUDA C Runtime - Multi-GPU Programming

Today we cover multi-GPU programming in a single article.

3.2.6. Multi-Device System

3.2.6.1. Device Enumeration

A host system can have multiple devices. The following code sample shows how to enumerate these devices, query their properties, and determine the number of CUDA-enabled devices.
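A minimal sketch of such enumeration, built on cudaGetDeviceCount() and cudaGetDeviceProperties() (a sketch, not the guide's verbatim listing), might look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);                 // number of CUDA-enabled devices
    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device); // query this device's properties
        printf("Device %d (%s) has compute capability %d.%d.\n",
               device, deviceProp.name, deviceProp.major, deviceProp.minor);
    }
    return 0;
}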

3.2.6.2. Device Selection

A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.

The following code sample illustrates how setting the current device affects memory allocation and kernel execution.
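A sketch of the idea, assuming a hypothetical kernel MyKernel (not the guide's exact listing):

__global__ void MyKernel(float* p) { /* ... */ }  // hypothetical kernel

void selectAndLaunch() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);                 // device 0 becomes the current device
    float* p0;
    cudaMalloc(&p0, size);            // allocated on device 0
    MyKernel<<<1000, 128>>>(p0);      // launched on device 0

    cudaSetDevice(1);                 // device 1 becomes the current device
    float* p1;
    cudaMalloc(&p1, size);            // allocated on device 1
    MyKernel<<<1000, 128>>>(p1);      // launched on device 1
}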

3.2.6.3. Stream and Event Behavior

A kernel launch will fail if it is issued to a stream that is not associated to the current device as illustrated in the following code sample.
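A sketch of this failure mode, again with a hypothetical kernel MyKernel:

__global__ void MyKernel() { /* ... */ }   // hypothetical kernel

void launchIntoStreams() {
    cudaSetDevice(0);                      // device 0 is current
    cudaStream_t s0;
    cudaStreamCreate(&s0);                 // s0 is associated with device 0
    MyKernel<<<100, 64, 0, s0>>>();        // OK: device 0 kernel in a device 0 stream

    cudaSetDevice(1);                      // device 1 is current
    cudaStream_t s1;
    cudaStreamCreate(&s1);                 // s1 is associated with device 1
    MyKernel<<<100, 64, 0, s1>>>();        // OK: device 1 kernel in a device 1 stream
    MyKernel<<<100, 64, 0, s0>>>();        // fails: s0 belongs to device 0, not the current device
}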

A memory copy will succeed even if it is issued to a stream that is not associated to the current device.

cudaEventRecord() will fail if the input event and input stream are associated to different devices.

cudaEventElapsedTime() will fail if the two input events are associated to different devices.

cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.

cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.
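A minimal sketch of this cross-device synchronization, assuming two hypothetical kernels ProducerKernel and ConsumerKernel:

__global__ void ProducerKernel() { /* ... */ }   // hypothetical work on device 0
__global__ void ConsumerKernel() { /* ... */ }   // hypothetical work on device 1

void crossDeviceSync() {
    cudaSetDevice(0);
    cudaStream_t s0;
    cudaEvent_t e0;
    cudaStreamCreate(&s0);
    cudaEventCreate(&e0);
    ProducerKernel<<<256, 256, 0, s0>>>();       // work queued on device 0
    cudaEventRecord(e0, s0);                     // e0 completes when that work is done

    cudaSetDevice(1);
    cudaStream_t s1;
    cudaStreamCreate(&s1);
    cudaStreamWaitEvent(s1, e0, 0);              // a device 1 stream waits on a device 0 event
    ConsumerKernel<<<256, 256, 0, s1>>>();       // starts only after device 0's work completes
}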

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

3.2.6.4. Peer-to-Peer Memory Access

When the application is run as a 64-bit process, devices of compute capability 2.0 and higher from the Tesla series may address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections.

A unified address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices as shown in the code sample below.
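A sketch of enabling and using peer access, assuming a hypothetical kernel MyKernel that dereferences the pointer it receives:

__global__ void MyKernel(float* p) { /* ... */ }   // hypothetical kernel

void peerAccess() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);
    float* p0;
    cudaMalloc(&p0, size);                     // this memory lives on device 0
    MyKernel<<<1000, 128>>>(p0);               // device 0 uses it directly

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0); // can device 1 access device 0's memory?
    if (canAccess) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);      // enable access from device 1 to device 0
        // Thanks to the unified address space, the same pointer p0 is valid here,
        // so a kernel running on device 1 can dereference device 0's memory directly.
        MyKernel<<<1000, 128>>>(p0);
    }
}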

3.2.6.5. Peer-to-Peer Memory Copy

Memory copies can be performed between the memories of two different devices.

When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory.

Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() as illustrated in the following code sample.
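A minimal sketch of such an explicit peer copy, moving a buffer from device 0 to device 1 with cudaMemcpyPeer():

void peerCopy() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);
    float* p0;
    cudaMalloc(&p0, size);                 // source buffer on device 0

    cudaSetDevice(1);
    float* p1;
    cudaMalloc(&p1, size);                 // destination buffer on device 1

    // cudaMemcpyPeer(dst, dstDevice, src, srcDevice, byteCount)
    cudaMemcpyPeer(p1, 1, p0, 0, size);    // copy device 0 -> device 1
}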

A copy (in the implicit NULL stream) between the memories of two different devices:

· does not start until all commands previously issued to either device have completed and

· runs to completion before any commands (see Asynchronous Concurrent Execution) issued after the copy to either device can start.

Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with copies or kernels in another stream.

Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess() as described in Peer-to-Peer Memory Access, peer-to-peer memory copy between these two devices no longer needs to be staged through the host and is therefore faster.

Notes and experience sharing for this article:

streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0

Once you have set a device, e.g. cudaSetDevice(3) to select card 3, every subsequent device-memory allocation (cudaMalloc), stream creation, and kernel launch happens on that card. In other words, if you have 4 cards and want 1 GB of device memory on each of them, you have to call cudaSetDevice with 0, 1, 2, 3 in turn and do a cudaMalloc after each. Likewise, still with 4 cards, you have to cudaSetDevice to 0, 1, 2, 3 and perform 4 separate launches to run your kernel on all 4 cards; a single launch does not run it on all 4 cards at once. Put simply, multi-GPU programming is manual, not automatic. If you never set a device, everything defaults to device 0 and the remaining cards are wasted. It is also worth noting that CUDA 9 introduced cooperative groups, which include a rather special API that can launch a kernel on several cards at once, but it comes with many restrictions and is limited to C++; we will cover it later. Similarly, many libraries (for example the bundled cuBLAS) can use multiple cards automatically, but that is also a topic for later. Just know that libraries like cuBLAS that appear to use multiple cards automatically are doing this same manual multi-card work internally; it is merely hidden from the user, which is why it looks automatic. A sketch of the manual pattern is shown below.
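A sketch of that manual pattern, assuming 4 cards and a hypothetical kernel MyKernel:

__global__ void MyKernel(float* buf) { /* ... */ }    // hypothetical kernel

void runOnFourGPUs() {
    const int numDevices = 4;                         // assuming 4 cards are installed
    const size_t oneGB = 1024ull * 1024ull * 1024ull; // 1 GB per card
    float* buffers[numDevices];

    for (int dev = 0; dev < numDevices; ++dev) {
        cudaSetDevice(dev);                    // make card 'dev' the current device
        cudaMalloc(&buffers[dev], oneGB);      // this allocation lands on card 'dev'
    }
    for (int dev = 0; dev < numDevices; ++dev) {
        cudaSetDevice(dev);                    // switch again before each launch
        MyKernel<<<1024, 256>>>(buffers[dev]); // one separate launch per card
    }
}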

A kernel launch will fail if it is issued to a stream that is not associated to the current device. The stream must belong to the current card: if you switch to another card (say, cudaSetDevice to card 2) and try to use a stream from the previous card (say, a stream created on card 1), a kernel cannot be launched with that combination. That is, you cannot launch a kernel on card 2 while using a stream that does not belong to that card. (What is the relationship between a stream and a kernel? A kernel can only be launched into a stream, and all operations within a stream execute in order; the equivalent concept in OpenCL is the CommandQueue.)

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

In a multi-GPU setup, each card has its own default stream, so commands issued to the default streams of different cards execute out of order with respect to one another. Frankly, this sentence states the obvious: "out of order" already says it all. Kernel 1 may run before kernel 2, kernel 2 may run before kernel 1, or the two may start and finish at the same time; all of these are possible.

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections. That is, P2P memory access between two devices has to be switched on with cudaDeviceEnablePeerAccess(), and within one system each card can establish P2P access with at most eight other cards.

What is the difference between Peer-to-Peer Memory Access and Peer-to-Peer Memory Copy? With the former, card B can use card A's memory directly, as if it were its own; with the latter, a P2P copy, card B must first copy the contents of card A's memory into its own memory before (a kernel on) card B can use it. The former is used directly; the latter requires copying over first. When the former is available, it is generally recommended to always use it, unless:

(1) The motherboard does not support it (for example, the two cards sit in PCI-E slots controlled by the two different CPUs of a dual-socket system).
(2) The operating system does not support it (for example, you try it on Windows with consumer cards, which do not support TCC).
(3) Everything is supported and you could use direct access, but the buffer will be reused over and over, and always reaching across PCI-E into the other card's memory is slow, so you copy it over by hand and work on the local copy instead.

Note that on Windows, P2P Copy is always available, whereas P2P Access requires a professional card running in TCC mode. When a direct copy is not possible (for example case (1) above), P2P Copy automatically stages the transfer through host memory, while P2P Access simply fails. P2P Access has a real advantage: one card can use the memory of two, three, four, or even eight cards, which is very handy for applications that need a lot of device memory. There is also a supercharged version of P2P Access, the one on DGX systems: P2P access between cards can go not only over PCI-E but also over NVLink with its very high bandwidth, so the memory of all the cards in a DGX can effectively be pooled, which suits applications with enormous memory footprints. The ordinary version of P2P Access, when the motherboard, OS, and cards all support it, is slower (not as fast as on a DGX) but still solves the problem of not having enough device memory. P2P Copy, by contrast, copies another card's memory into your own, so it cannot enlarge the effective memory capacity and does not help in that respect.

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.

This article is shared from the WeChat public account 吉浦迅科技 (gpusolution); author: GPU世界论坛.


Originally published: 2018-05-15
