We are guiding readers through the English edition of the CUDA C Programming Guide; today is day five. Over the next few days we will study CUDA's programming interface, the most important part of which is the CUDA C runtime. We hope that over the coming 95 days you will learn CUDA from the original source while building a habit of reading English.
The runtime is implemented in the cudart library, which is linked to the application, either statically via cudart.lib or libcudart.a, or dynamically via cudart.dll or libcudart.so. Applications that require cudart.dll and/or cudart.so for dynamic linking typically include them as part of the application installation package.
All its entry points are prefixed with cuda.
As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Device Memory gives an overview of the runtime functions used to manage device memory.
Shared Memory illustrates the use of shared memory, introduced in Thread Hierarchy, to maximize performance.
Page-Locked Host Memory introduces page-locked host memory that is required to overlap kernel execution with data transfers between host and device memory.
Asynchronous Concurrent Execution describes the concepts and API used to enable asynchronous concurrent execution at various levels in the system.
Multi-Device System shows how the programming model extends to a system with multiple devices attached to the same host.
Error Checking describes how to properly check the errors generated by the runtime.
Call Stack mentions the runtime functions used to manage the CUDA C call stack.
Texture and Surface Memory presents the texture and surface memory spaces that provide another way to access device memory; they also expose a subset of the GPU texturing hardware.
Graphics Interoperability introduces the various functions the runtime provides to interoperate with the two main graphics APIs, OpenGL and Direct3D.
There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the device and version management sections of the reference manual). One needs to keep this in mind when timing runtime function calls and when interpreting the error code from the first call into the runtime.
During initialization, the runtime creates a CUDA context for each device in the system (see Context for more details on CUDA contexts). This context is the primary context for this device and it is shared among all the host threads of the application. As part of this context creation, the device code is just-in-time compiled if necessary (see Just-in-Time Compilation) and loaded into device memory. This all happens under the hood and the runtime does not expose the primary context to the application.
When a host thread calls cudaDeviceReset(), this destroys the primary context of the device the host thread currently operates on (i.e., the current device as defined in Device Selection). The next runtime function call made by any host thread that has this device as current will create a new primary context for this device.
As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Kernels operate out of device memory, so the runtime provides functions to allocate, deallocate, and copy device memory, as well as transfer data between host memory and device memory.
Device memory can be allocated either as linear memory or as CUDA arrays.
CUDA arrays are opaque memory layouts optimized for texture fetching.
Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree.
Linear memory is typically allocated using cudaMalloc() and freed using cudaFree(), and data transfers between host memory and device memory are typically done using cudaMemcpy(). In the vector addition code sample of Kernels, the vectors need to be copied from host memory to device memory:
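The sample itself is missing from this excerpt; below is a reconstruction close to the guide's vector-addition example. The vector length and launch configuration are chosen here for illustration.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Device code: each thread adds one pair of elements
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = 1 << 20;                      // illustrative size
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Allocate vectors in device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device and host memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```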
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays as they make sure that the allocation is appropriately padded to meet the alignment requirements described in Device Memory Accesses, therefore ensuring best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The returned pitch (or stride) must be used to access array elements. The following code sample allocates a width x height 2D array of floating-point values and shows how to loop over the array elements in device code:
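This sample is also missing from the excerpt; the sketch below reconstructs it. Array dimensions and launch configuration are illustrative.

```cuda
#include <cuda_runtime.h>

// Device code: loop over a pitched 2D array of floats
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // Each row starts pitch bytes (not width*sizeof(float)) after the previous one
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
            (void)element;  // a real kernel would use the value
        }
    }
}

// Host code
int main()
{
    int width = 64, height = 64;
    float* devPtr;
    size_t pitch;
    // The runtime chooses a pitch that satisfies the alignment requirements
    cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
    MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
    cudaDeviceSynchronize();
    cudaFree(devPtr);
    return 0;
}
```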
The following code sample allocates a width x height x depth 3D array of floating-point values and shows how to loop over the array elements in device code:
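As above, the 3D sample is not reproduced in this excerpt; here is a reconstruction. A cudaPitchedPtr carries both the base pointer and the pitch chosen by the runtime.

```cuda
#include <cuda_runtime.h>

// Device code: loop over a pitched 3D array of floats
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr, int width, int height, int depth)
{
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;       // bytes per row, including padding
    size_t slicePitch = pitch * height;       // bytes per 2D slice
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
                (void)element;  // a real kernel would use the value
            }
        }
    }
}

// Host code
int main()
{
    int width = 64, height = 64, depth = 64;
    cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);
    cudaPitchedPtr devPitchedPtr;
    cudaMalloc3D(&devPitchedPtr, extent);
    MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);
    cudaDeviceSynchronize();
    cudaFree(devPitchedPtr.ptr);
    return 0;
}
```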
The reference manual lists all the various functions used to copy memory between linear memory allocated with cudaMalloc(), linear memory allocated with cudaMallocPitch() or cudaMalloc3D(), CUDA arrays, and memory allocated for variables declared in global or constant memory space.
The following code sample illustrates various ways of accessing global variables via the runtime API:
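The sample is missing here; the reconstruction below shows copies to and from __constant__ and __device__ variables, plus retrieving a symbol's address and size. Sizes and values are illustrative.

```cuda
#include <cuda_runtime.h>

__constant__ float constData[256];
__device__ float devData;
__device__ float* devPointer;

int main()
{
    // Copy to/from a __constant__ array
    float data[256];
    cudaMemcpyToSymbol(constData, data, sizeof(data));
    cudaMemcpyFromSymbol(data, constData, sizeof(data));

    // Copy a scalar into a __device__ variable
    float value = 3.14f;
    cudaMemcpyToSymbol(devData, &value, sizeof(float));

    // Store a device pointer in a __device__ pointer variable
    float* ptr;
    cudaMalloc(&ptr, 256 * sizeof(float));
    cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));

    // Address and size of the memory allocated for a symbol
    void* addr;
    size_t sz;
    cudaGetSymbolAddress(&addr, devData);
    cudaGetSymbolSize(&sz, devData);

    cudaFree(ptr);
    return 0;
}
```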
cudaGetSymbolAddress() is used to retrieve the address pointing to the memory allocated for a variable declared in global memory space. The size of the allocated memory is obtained through cudaGetSymbolSize().
As detailed in Variable Memory Space Specifiers, shared memory is allocated using the __shared__ memory space specifier.
Shared memory is expected to be much faster than global memory as mentioned in Thread Hierarchy and detailed in Shared Memory. Any opportunity to replace global memory accesses by shared memory accesses should therefore be exploited as illustrated by the following matrix multiplication example.
The following code sample is a straightforward implementation of matrix multiplication that does not take advantage of shared memory. Each thread reads one row of A and one column of B and computes the corresponding element of C as illustrated in Figure 9. A is therefore read B.width times from global memory and B is read A.height times.
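The sample itself is not included in this excerpt; below is a reconstruction close to the guide's non-shared-memory version. Matrix dimensions are assumed to be multiples of BLOCK_SIZE (chosen here as 16), and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

#define BLOCK_SIZE 16

// Each thread computes one element of C
// by accumulating results into Cvalue
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

// Host code: C = A * B (dimensions assumed multiples of BLOCK_SIZE)
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A = {A.width, A.height, NULL};
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

    Matrix d_B = {B.width, B.height, NULL};
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C = {C.width, C.height, NULL};
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
```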
Figure 9. Matrix Multiplication without Shared Memory
The following code sample is an implementation of matrix multiplication that does take advantage of shared memory. In this implementation, each thread block is responsible for computing one square sub-matrix Csub of C and each thread within the block is responsible for computing one element of Csub. As illustrated in Figure 10, Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension (A.width, block_size) that has the same row indices as Csub, and the sub-matrix of B of dimension (block_size, A.width) that has the same column indices as Csub. In order to fit into the device's resources, these two rectangular matrices are divided into as many square matrices of dimension block_size as necessary and Csub is computed as the sum of the products of these square matrices. Each of these products is performed by first loading the two corresponding square matrices from global memory to shared memory with one thread loading one element of each matrix, and then by having each thread compute one element of the product. Each thread accumulates the result of each of these products into a register and once done writes the result to global memory.
By blocking the computation this way, we take advantage of fast shared memory and save a lot of global memory bandwidth since A is only read (B.width / block_size) times from global memory and B is read (A.height / block_size) times.
The Matrix type from the previous code sample is augmented with a stride field, so that sub-matrices can be efficiently represented with the same type. __device__ functions are used to get and set elements and build any sub-matrix from a matrix.
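The shared-memory sample is not reproduced in this excerpt either; the sketch below reconstructs the device side: the Matrix type augmented with a stride, the __device__ accessors, and the tiled kernel. The host-side wrapper is essentially the same as in the non-shared version, except that it also sets the stride field.

```cuda
#include <cuda_runtime.h>

// Matrix with a stride field so sub-matrices share the same type
typedef struct {
    int width;
    int height;
    int stride;        // elements between the starts of consecutive rows
    float* elements;
} Matrix;

#define BLOCK_SIZE 16

// Get one matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set one matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Locate the BLOCK_SIZE x BLOCK_SIZE sub-matrix Asub of A at block (row, col);
// it shares A's stride, so no data is copied
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

// Each thread block computes one sub-matrix Csub of C;
// each thread computes one element of Csub
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    int blockRow = blockIdx.y, blockCol = blockIdx.x;
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    float Cvalue = 0;                 // accumulated in a register
    int row = threadIdx.y, col = threadIdx.x;

    // Loop over the sub-matrices of A and B needed to compute Csub
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Stage the two tiles in shared memory, one element per thread
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        __syncthreads();              // both tiles must be fully loaded

        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        __syncthreads();              // wait before overwriting the tiles
    }
    SetElement(Csub, row, col, Cvalue);
}
```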
Figure 10. Matrix Multiplication with Shared Memory
All its entry points are prefixed with cuda. All entry points (also called exported functions) carry the cuda prefix (for example, the familiar cudaMemcpy). CUDA is split into two APIs: runtime API functions are prefixed with cuda, and driver API functions are prefixed with cu (the extension libraries have yet other prefixes). Note that the driver API prefix is just the first two letters of cuda (cu). So when you see a function starting with cu, you know it is a driver API function; when it starts with cuda, it is a runtime API function.
There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the device and version management sections of the reference manual). One needs to keep this in mind when timing runtime function calls and when interpreting the error code from the first call into the runtime. For the (CUDA) runtime there is no explicit initialization function; the first time you call a runtime API function, it initializes automatically. Keep this firmly in mind when you time runtime calls and when you interpret the error code returned by the first call into the runtime: the first timing measurement should be discarded. Why? Because the first call triggers the automatic initialization, so it is bound to take longer. It also means the error returned by the first call to almost any API function (except a few, such as version- and device-management functions) may come from this hidden initialization rather than from the function itself, because initialization can fail. For example, if your binary contains only a cubin that was never compiled for the current GPU and no PTX is present, or if it declares a huge device-side array that fit on the 24 GB card used for development but not on a customer's card with only 1 GB of video memory, then the first call to (almost) any function will fail. That failure comes from the automatic initialization, not from the function itself. If you forget this, it can be extremely confusing: you will never find the bug, because the line you are staring at looks perfectly fine. Just remember that the error comes from initialization, not from the function on the current line.
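One way to make this concrete: force the hidden initialization yourself at program start, so any initialization failure surfaces at a known line. The cudaFree(0) call below is a common idiom for this; this is a sketch of the pattern, not an official initialization API.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    // Force runtime initialization (primary context creation) up front,
    // so later timings and error codes reflect the calls themselves.
    // cudaFree(0) is a harmless no-op often used for this purpose.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        // This failure comes from initialization, not from cudaFree itself.
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```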
During initialization, the runtime creates a CUDA context for each device in the system. The word "context" here may be translated or left as-is; it is usually just called a context (in Chinese, 上下文 or 上下文环境). The point is that the runtime creates the context automatically (with the driver API you must create it by hand), which is one reason the runtime API is considered simple: many things are done for you.
This all happens under the hood and the runtime does not expose the primary context to the application. That is, all of this is done privately by the runtime, and the primary context is not exposed to the user's application code. In other words, application code does not even know the context exists; the runtime manages it all internally. You can simply read "under the hood" as "internally". This, too, is a contrast with the driver API, where user code must explicitly create a context and use it, so the code has to know the context exists. As you can see, the runtime is quite a bit simpler.
Device memory can be allocated either as linear memory or as CUDA arrays. CUDA arrays are opaque memory layouts optimized for texture fetching. That is, device memory can be allocated as linear memory (an ordinary memory buffer) or as a CUDA array, whose internal layout is opaque to the user and optimized for texture fetching (texture reads). The former is ordinary memory that you can access directly through pointers; the latter cannot be accessed that way and must go through special functions (the texture-fetch functions mentioned earlier). Ordinary memory is laid out byte by byte; a CUDA array uses NVIDIA's private, undocumented internal data layout, hence "opaque to the user". In practice, however, this layout has effectively been revealed by third-party material: various third-party sources (for example, AMD documentation) and books such as "The CUDA Handbook" describe in detail how the internal texture data layout is constructed.
Any opportunity to replace global memory accesses by shared memory accesses should therefore be exploited. In other words, replace global memory accesses with shared memory accesses whenever you can.
Originally published on the WeChat public account 吉浦迅科技 (gpusolution).