DAY37：阅读不同存储器的修饰符

GPUS Lady

发布于 2018-06-25 16:40:33

7150

发布于 2018-06-25 16:40:33

文章被收录于专栏：GPUS开发者GPUS开发者

我们正带领大家开始阅读英文的《CUDA C Programming Guide》,今天是第367天，我们正在讲解CUDA C语法，希望在接下来的63天里，您可以学习到原汁原味的CUDA，同时能养成英文阅读的习惯。

本文共计468字，阅读时间15分钟

B.2. Variable Memory Space Specifiers

Variable memory space specifiers denote the memory location on the device of a variable.

An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ memory space specifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory, which can have adverse performance consequences as detailed in Device Memory Accesses.

B.2.1. device

The __device__ memory space specifier declares a variable that resides on the device.

At most one of the other memory space specifiers defined in the next two sections may be used together with __device__ to further denote which memory space the variable belongs to. If none of them is present, the variable:

· Resides in global memory space,

· Has the lifetime of the CUDA context in which it is created,

· Has a distinct object per device,

· Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

B.2.2. constant

The __constant__ memory space specifier, optionally used together with __device__, declares a variable that：

· Resides in constant memory space,

· Has the lifetime of the CUDA context in which it is created,

· Has a distinct object per device,

· Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

B.2.3. shared

The __shared__ memory space specifier, optionally used together with __device__, declares a variable that:

· Resides in the shared memory space of a thread block,

· Has the lifetime of the block,

· Has a distinct object per block,

· Is only accessible from all the threads within the block,

· Does not have a constant address.

When declaring a variable in shared memory as an external array such as

extern __shared__ float shared[];

the size of the array is determined at launch time (see Execution Configuration). All variables declared in this fashion, start at the same address in memory, so that the layout of the variables in the array must be explicitly managed through offsets. For example, if one wants the equivalent of

short array0[128];
float array1[64];
int   array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays the following way:

Note that pointers need to be aligned to the type they point to, so the following code, for example, does not work since array1 is not aligned to 4 bytes.

Alignment requirements for the built-in vector types are listed in Table 3.

Type	Alignment
char1, uchar1	1
char2, uchar2	2
char3, uchar3	1
char4, uchar4	4
short1, ushort1	2
short2, ushort2	4
short3, ushort3	2
short4, ushort4	8
int1, uint1	4
int2, uint2	8
int3, uint3	4
int4, uint4	16
long1, ulong1	4 if sizeof(long) is equal to sizeof(int) 8, otherwise
long2, ulong2	8 if sizeof(long) is equal to sizeof(int), 16, otherwise
long3, ulong3	4 if sizeof(long) is equal to sizeof(int), 8, otherwise
long4, ulong4	16
longlong1, ulonglong1	8
longlong2, ulonglong2	16
longlong3, ulonglong3	8
longlong4, ulonglong4	16
float1	4
float2	8
float3	4
float4	16
double1	8
double2	16
double3	8
double4	16

本文备注/经验分享：

今天这个章节主要是说了GPU上不同种类的存储器, 在使用它们的时候, 一定情况下所需要的修饰符。还记得前几天说过, 一张GPU能有不同的存储器种类吗? 今天这章节则是说, 如何在程序中使用它们。传统上, 在GPU上运行的kernel里,直接定义或者访问的各种存储器上的变量, 数组之类的, 需要加上特定的前缀: (1)Global memory: __device__前缀 (2)Constant Memory: __constant__前缀 (3)Shared memory: __shared__前缀 (4)Local memory: (不需要任何前缀, 编译器全自动处理) 注意这里的local memory比较奇特, 因为前几日说过, 它(local memory)是和寄存器一起使用的, 编译器能使用寄存器就使用寄存器, 不能则使用local memory(较慢)，正因为是编译器自动处理的, 所以local memory不需要任何前缀。那么本章节就剩下前3种存储器类型需要说明了。根据以前的章节, 聪明的你已经知道, global memory是基于显存(或者映射的内存, 或者映射的其他的伙伴卡的显存---但这里为了简单统称显存)。显存是在GPU设备上的, 静态的定义可以直接使用:

__device__ your_type your_variable[...];

__global__ your_kernel()

{

//use your variable here

}

则是一种典型的静态定义。另外一种则是手工的动态分配global memory, 例如通过cudaMalloc*()之类的函数.如果是动态定义的, 则需要单独将kernel做一个修改: //删除 __device__ your_type your_variable[...];

__global__ your_kernel(your_type *your_variable)

{

//use your variable here

}

请注意这个动态分配的global memory和静态分配的global memory的区别: (1)只有静态的才有__device__和具体定义 (2)动态的不出现直接的定义, 则是在Host Code中分配, 当成一个指针传递给kernel(注意*号)。静态定义的一般比较适合懒人. Kernel运行的时候, 需要的Global memory(显存)就已经准备好了.而动态的有更大的控制力, 适合需要更细微的代码控制能力的人使用.(此外, 静态的全局定义, 不符合很多现在编程模式所需要的, 只有输入和输出, 而没有全局状态的标准要求)。但是无论怎么说, 至少你已经知道了, 可以直接来个__device__定义显存上的变量或者数组了。第二点则是: constant memory. 请注意在很早之前, 它的标准写法是:__device__ __constant__ your_type your_variable[...]; 不过现在已经被大家缩写为__constant__这一个前缀了.不过本章节依然为你指出了, 同时使用2个前缀是可以的. 所以遇到老代码不要惊讶。 constant memory实际上在现在的卡种, 分成多个部分, (1)你手工静态分配的constant memory, 最大64KB (2)编译器自动搜集来的一些常量, 从你的代码中, 例如: int c = a * 888 + 999; 编译器很可能将这里的888或者999或者全部都放入constant memory, 并全自动的使用它(注意这个只是可能. 编译器还有其他更好的位置放置它, 例如编译成立即数. 但是这里只提一下自动放入constant memory) (3)你的kernel的参数, 在现在的卡中(2.0+)也将自动放入constant memory。请注意如果你一旦适合在程序里面进行对kernel的参数进行修改, 例如:

__global__ your_kernel(int *p.....)

{

//...

p ++;

}

类似这种代码, 则p将被自动生成一个同名的副本, 享受普通变量的待遇(自动放入寄存器或者local memory)。 (4) 普通的global memory, 但用户要求通过constant cache进行读取, 例如用户知道一些非常小的常数数组, 而且warp内部的线程非常一致的访问同一个下标的时候(Load Uniform, LDU操作), 这个手册后面有说明. 大致这4种是constant memory的使用. 请注意, kernel参数这里是很多人经常疑惑的, 很多人担心, 访问kernel的参数代价非常高昂吗? 答案是否定的. 和你的普通__constant__一样代价很低。很多用CUDA的人整天疑神疑鬼. 用这个会慢吗?用那个会慢吗?类似这种的——不会啊. 别乱想，真要感觉慢, 你先去考虑换一个好卡再说，很多时候不是你代码写的渣, 而是你的卡太烂。然后请注意的另外一点是, 正常使用的, 你只有(1)中的手工__constant__静态分配的才能用到. 其他均不常用(要么不常用, 要么是编译器自动的, 你控制不了) 而__constant__的内容实质上是可以改变的, 只是在一个kernel运行的期间, 不能改变.在没有kernel运行的时候, 可以通过cudaMemcpyToSymbol之类的改变它里面的值. 给下个kernel用.所谓常数, 只是在一个kernel的运行期间常数罢了。这点需要注意.(cudaMemcpyToSymbol等于普通的cudaGetSymbolAddress得到地址后 + 一个普通的cudaMemcpy而已. 直接cudaMemcpyToSymbol能简单一点)。这也是很多人经常在使用cudaMemcpyToSymbol时候的疑惑.特别是因为CUDA历史原因, Symbol的使用, 在不同时期的CUDA上, 有两种用法：一种是将你的变量名在Host中进行cudaMemcpyToSymbol的时候, 必须加引号: 例如: 有: __constant__ int dog_parameters[64]; 那么在cudaMemcpyToSymbol的时候, 历史原因, 有两种用法:cudaMemcpyToSymbol("dog_parameters", .....);和cudaMemcpyToSymbol(dog_paramemters, .....); 这个问题是新手经常遇到的问题.特别是你看到了老书的时候(很多人手里头都是老书. 我们已经替无数本市面上的各家出版社的各本书进行debug了.....)，现在的新版本CUDA只有没有引号的用法(下面那行) 维护老代码的人员, 或者手头还有老书的人员一定要注意这点. 这个是说的__constant__

关于__shared__, 这个是大家喜闻乐见, 耳熟能详的。前几天说的它的3大作用大家应该还记得吧. 忘记了? 不妨往前翻阅章节。而这里需要说的则是, 如同global memory一样, 这个也有动态分配的和静态分配的两种: (1)静态分配的是: __global__ void your_kernel(...)

{

__shared__ your_type your_variable[....]; //变量或者数组, 一行或者多行, 累计不得超过48KB

}

因为48KB = 48 * 1024 = 49152，也就是0xC000 经常有人在编译的时候看到报错, 说shared memory大小超过了0xC000, 则说明你超了48KB了.需要降低它(除了7.0计算能力，7.0计算能力的卡能用到96KB, 但需要动态分配)

(2)动态分配则是:

__global__ void your_kernel()

{

extern __shared__ your_type your_variable[空的]; //请注意空的是指[]

}

和静态的有两点形式上的区别: (1)前面多加了一个extern (2)后面的方括号内没有东西. 而在使用上则具有多种区别: 使用区别(1): 静态分配的多个变量或者数组, 它们的地址会不同. 例如你有8个1KB的float数组, 会得到地址分别是0, 1K, 2K, 3K.... 而动态分配的虽然也可以写成多个extern __shared__的行在里面, 但是它们所定义的所有数组的起始地址都是一样的, 这就需要你额外进行shared memory上的缓冲区拼接, 手工计算偏移量或者指针. 是不是很眼熟? 没错, 这就和前几天说的, 手工将几个小的global memory缓冲区拼接起来, 能一次性都传输完, 从而提高性能, 所需要使用的技术或者说技巧是一样的.请也需要注意一下元素类型, 和偏移量, 对齐方面的要求.你是知道的, 作为线程的访存, 元素不对齐会挂掉kernel的. 一定要注意了. 请注意手册这里是将extern那行写在kernel外面的, 我建议总是写在里面. 不过这不是重点, 用户可以随心的选择喜欢的风格.有人可能会问, 我可否两种分配方式同时使用? 答案是可以的,这就如同你同时可以使用2种风格的显存分配一样——__device__的静态分配 + cudaMalloc*()的动态分配也就是最终会形成这样的代码: __global__ void your_kernel()

{

__shared__ int dog[256]; //1KB

__shared__ int wolf[256]; //1KB

extern __shared__ int cat[]; //size unknown

}

请注意如何混合使用了, dog和wolf的地址是明确的, 可以直接使用. 连接在后面的cat[]如果再后面还有东西, 例如还有一个extern __shared__ int donkey[];, 则需要你像刚才说的那样好好计算地址. 我建议用户不写多个extern，而是只有1个. 如果有多个extern的shared memory分配要求, 建议只写一个, 然后手工推导指针. 例如: int *p_ass = (int *)(cat + 888); //假设的. 这种不容易出错. 最后再来到一点, 就是计算能力7.0允许支持大于48KB的shared memory分配,但需要使用动态分配的方式. 目前尚未知道在这种情况下, 是否允许静态+动态的混合使用, 以超过48KB，还是必须全部是动态的, 才能超过48KB.(但纯静态的是不能超过48KB的，建议用户自行试验一下(一试即可, 我还没有7.0的卡) 但需要补充说明的是, 你如果发现了一些计算能力的卡上, shared memory上不那么对齐(例如一个float4, 你手工对齐到4B而不是要求的16B),结果一切正常. 这说明恭喜你发现了一些计算能力的小秘密(或者说更先进性? 毕竟更宽松的对齐要求是一个进步)，但目前手册说的是要求你严格对齐. 那么请按照手册的来. undocumented的内容将来随时可能随着NV的新卡的问世而改变.

有不明白的地方，请在本文后留言

或者在我们的技术论坛bbs.gpuworld.cn上发帖

本文参与腾讯云自媒体分享计划，分享自微信公众号。

原始发表：2018-06-22，如有侵权请联系 cloudcommunity@tencent.com 删除

kernel

深度学习

ide

本文分享自 GPUS开发者微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体分享计划，欢迎热爱写作的你一起参与！

kernel

深度学习

ide

登录后参与评论

0 条评论

热度