DAY 12: Reading the CUDA C Runtime - Multi-GPU Programming

Today we cover multi-GPU programming in a single article.

3.2.6. Multi-Device System

3.2.6.1. Device Enumeration

A host system can have multiple devices. The following code sample shows how to enumerate these devices, query their properties, and determine the number of CUDA-enabled devices.
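The sample itself is not reproduced in this post; a minimal sketch of what it does (error checking omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);        // number of CUDA-enabled devices
    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device);
        printf("Device %d (%s) has compute capability %d.%d.\n",
               device, deviceProp.name, deviceProp.major, deviceProp.minor);
    }
    return 0;
}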

3.2.6.2. Device Selection

A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.

The following code sample illustrates how setting the current device affects memory allocation and kernel execution.
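Again the sample is omitted from the post; a sketch close to the guide's example, with MyKernel standing in for any kernel that takes a float pointer:

size_t size = 1024 * sizeof(float);
cudaSetDevice(0);                // set device 0 as current
float* p0;
cudaMalloc(&p0, size);           // allocation lands on device 0
MyKernel<<<1000, 128>>>(p0);     // launch executes on device 0
cudaSetDevice(1);                // set device 1 as current
float* p1;
cudaMalloc(&p1, size);           // allocation lands on device 1
MyKernel<<<1000, 128>>>(p1);     // launch executes on device 1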

3.2.6.3. Stream and Event Behavior

A kernel launch will fail if it is issued to a stream that is not associated to the current device as illustrated in the following code sample.
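A sketch of the failing combination (MyKernel is again a placeholder):

cudaSetDevice(0);                // set device 0 as current
cudaStream_t s0;
cudaStreamCreate(&s0);           // s0 is associated with device 0
MyKernel<<<100, 64, 0, s0>>>();  // OK: launched on device 0 into s0
cudaSetDevice(1);                // set device 1 as current
cudaStream_t s1;
cudaStreamCreate(&s1);           // s1 is associated with device 1
MyKernel<<<100, 64, 0, s1>>>();  // OK: launched on device 1 into s1

// This launch fails: s0 belongs to device 0, but device 1 is current.
MyKernel<<<100, 64, 0, s0>>>();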

A memory copy will succeed even if it is issued to a stream that is not associated to the current device.

cudaEventRecord() will fail if the input event and input stream are associated to different devices.

cudaEventElapsedTime() will fail if the two input events are associated to different devices.

cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.

cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.
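A hypothetical sketch of cross-device synchronization, reusing the streams s0 and s1 from the previous sketch (ProducerKernel and ConsumerKernel are made-up names):

cudaSetDevice(0);
cudaEvent_t e0;
cudaEventCreate(&e0);                  // e0 is associated with device 0
ProducerKernel<<<100, 64, 0, s0>>>();  // work on device 0
cudaEventRecord(e0, s0);               // OK: e0 and s0 are both on device 0

cudaSetDevice(1);
cudaStreamWaitEvent(s1, e0, 0);        // OK even across devices
ConsumerKernel<<<100, 64, 0, s1>>>();  // starts only after ProducerKernel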

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

3.2.6.4. Peer-to-Peer Memory Access

When the application is run as a 64-bit process, devices of compute capability 2.0 and higher from the Tesla series may address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections.

A unified address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices as shown in the code sample below.
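The post omits the guide's sample here as well; a sketch of enabling and using peer access (MyKernel is a placeholder):

cudaSetDevice(0);                    // set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);               // allocate memory on device 0
MyKernel<<<1000, 128>>>(p0);         // launch kernel on device 0
cudaSetDevice(1);                    // set device 1 as current
cudaDeviceEnablePeerAccess(0, 0);    // enable peer access from device 1 to device 0

// Thanks to the unified address space, this kernel running on device 1
// can dereference p0, which points to memory on device 0.
MyKernel<<<1000, 128>>>(p0);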

3.2.6.5. Peer-to-Peer Memory Copy

Memory copies can be performed between the memories of two different devices.

When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory.

Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() as illustrated in the following code sample.
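A sketch of an explicit peer-to-peer copy (MyKernel is a placeholder):

cudaSetDevice(0);                    // set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);               // allocate memory on device 0
cudaSetDevice(1);                    // set device 1 as current
float* p1;
cudaMalloc(&p1, size);               // allocate memory on device 1
cudaSetDevice(0);
MyKernel<<<1000, 128>>>(p0);         // produce data on device 0
cudaSetDevice(1);
cudaMemcpyPeer(p1, 1, p0, 0, size);  // copy from device 0 to device 1
MyKernel<<<1000, 128>>>(p1);         // consume the copy on device 1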

A copy (in the implicit NULL stream) between the memories of two different devices:

· does not start until all commands previously issued to either device have completed and

· runs to completion before any commands (see Asynchronous Concurrent Execution) issued after the copy to either device can start.

Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with copies or kernels in another stream.

Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess() as described in Peer-to-Peer Memory Access, peer-to-peer memory copy between these two devices no longer needs to be staged through the host and is therefore faster.

Notes and experience sharing:

streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0

Once you have selected a device, say card 3 via cudaSetDevice(3), every subsequent device-memory allocation (cudaMalloc), stream creation, and kernel launch happens on that card. In other words, if you have 4 cards and want 1 GB of device memory on each of them, you must call cudaSetDevice() for 0, 1, 2, 3 in turn and do a cudaMalloc() after each call. Likewise, to run your kernel on all 4 cards, you need 4 separate launches, one after each cudaSetDevice(); a single launch will not spread across all 4 cards. Multi-GPU programming is manual, not automatic. If you never set a device, everything runs on device 0 by default and the remaining cards sit idle. Two side notes: CUDA 9 introduced cooperative groups, including a rather special API that can launch a kernel on multiple cards at once, but it comes with many restrictions (including restrictions on C++ usage); we will cover it later. Also, many libraries (such as the bundled cuBLAS) can use multiple cards automatically, which we will also cover later. Be aware that libraries like cuBLAS that appear to use multiple cards automatically still use the cards manually in exactly this way internally; they merely hide it from the user so that it looks automatic.
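A minimal sketch of this manual pattern for 4 cards (MyKernel is a stand-in for any kernel; error checking omitted):

const size_t size = 1ULL << 30;          // 1 GB per card
float* buf[4];
for (int dev = 0; dev < 4; ++dev) {
    cudaSetDevice(dev);                  // make card 'dev' current
    cudaMalloc(&buf[dev], size);         // this allocation lands on card 'dev'
    MyKernel<<<1000, 128>>>(buf[dev]);   // this launch runs on card 'dev'
}
// Launches are asynchronous, so the four kernels run concurrently,
// one per card; synchronize each device before using the results.
for (int dev = 0; dev < 4; ++dev) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
}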

A kernel launch will fail if it is issued to a stream that is not associated to the current device. The stream and the current card must match: if you switch to another card (say, cudaSetDevice() to card 2) but pass a stream created on the previous card (say, a stream on card 1), no kernel can be launched with that combination. You cannot launch a kernel on card 2 through a stream that does not belong to that card. (What is the relationship between streams and kernels? A kernel must be launched into a stream, and all operations within a stream execute in order; the equivalent concept to a stream in OpenCL is the CommandQueue.)

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

In a multi-GPU environment, each card has its own default stream, so commands issued to the default streams of different cards execute out of order with respect to each other. Frankly, this sentence is redundant; it is obvious, since "out of order" already says it all: kernel 1 may run before kernel 2, kernel 2 may run before kernel 1, or the two may start and finish at the same time. All of these are possible.

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections. That is, within one system, each card can establish P2P access with at most eight other cards.

What is the difference between Peer-to-Peer Memory Access and Peer-to-Peer Memory Copy? With the former, card B can use card A's device memory directly, as if it were its own; with the latter, card B must first copy the contents of card A's memory into its own memory before (kernels on) card B can use it. The former is used in place; the latter requires copying the data over. Prefer the former whenever you can, unless:

(1) The motherboard does not support it (for example, the two cards sit on PCI-E lanes belonging to different CPUs in a 2-socket system).
(2) The operating system does not support it (for example, on Windows with consumer cards, which cannot run in TCC mode).
(3) Everything is supported and direct access would work, but the buffer is used over and over, and always reaching across PCI-E into the other card's memory is slow, so you manually copy the data over once and work on the local replica. (See the sketch below.)
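A hypothetical sketch of that decision, reusing p0 and size from the earlier peer-access sketch:

int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can device 1 access device 0?
cudaSetDevice(1);
if (canAccess) {
    cudaDeviceEnablePeerAccess(0, 0);        // P2P Access: use p0 in place
    MyKernel<<<1000, 128>>>(p0);
} else {
    float* replica;
    cudaMalloc(&replica, size);              // local copy on device 1
    cudaMemcpyPeer(replica, 1, p0, 0, size); // P2P Copy (staged via host if needed)
    MyKernel<<<1000, 128>>>(replica);
}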

Note that on Windows, P2P Copy is fully available, while P2P Access requires a professional card running in TCC mode. When a direct copy is impossible (for example, case (1) above), P2P Copy automatically stages the transfer through host memory, whereas P2P Access simply fails. P2P Access has one big advantage: a single card can use the memory of 2, 3, 4, even 8 cards, which is very convenient for applications that need a lot of device memory. There is also a super-charged version of P2P Access, the one on DGX: inter-card peer access runs not only over PCI-E but also over NVLink, with far higher bandwidth, so the memory of all cards in a DGX can effectively be aggregated, which suits applications with enormous memory footprints. The ordinary version of P2P Access, when the motherboard, OS, and cards all support it, is slower than on DGX but still solves the problem of insufficient device memory. P2P Copy, on the other hand, duplicates one card's data into your own card's memory, so it cannot enlarge the effective memory capacity and does not help there.

If anything is unclear, please leave a comment below this post,

or start a thread on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat public account 吉浦迅科技 (gpusolution).

Original publication date: 2018-05-15
