We are reading the English edition of the CUDA C Programming Guide together; today is day two. Over the next 98 days, we hope you will learn CUDA from the original source while building a habit of reading English.
This chapter introduces the main concepts behind the CUDA programming model by outlining how they are exposed in C.
Full code for the vector addition example used in this chapter and the next can be found in the vectorAdd CUDA sample.
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
As an illustration, the following sample code adds two vectors A and B of size N and stores the result into vector C:
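The kernel being referred to, reproduced here from the guide's vectorAdd sample (N is assumed to be defined elsewhere, and the elided host code is omitted as in the original):

```cuda
// Kernel definition: each of the N threads adds one pair of elements.
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads in a single block.
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
```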
Here, each of the N threads that execute VecAdd() performs one pair-wise addition.
For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.
The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).
As an example, the following code adds two matrices A and B of size NxN and stores the result into matrix C:
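The matrix-addition kernel being referred to, reproduced from the guide's example (N is assumed to be defined elsewhere):

```cuda
// Kernel definition: one 2-D thread per matrix element.
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads.
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
```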
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks as illustrated by Figure 6. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
Figure 6. Grid of Thread Blocks
The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.
Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows.
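The extended version, reproduced from the guide's example (N is again assumed to be defined elsewhere):

```cuda
// Kernel definition: blockIdx and blockDim select which element
// each thread handles; the if guards against out-of-range threads.
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation: 16x16 blocks, enough of them to cover N x N.
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
```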
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. The grid is created with enough blocks to have one thread per matrix element as before. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.
Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 5, enabling programmers to write code that scales with the number of cores.
Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory. In addition to __syncthreads(), the Cooperative Groups API provides a rich set of thread-synchronization primitives.
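As one illustration of cooperating through shared memory with __syncthreads() barriers, here is a minimal block-wide sum sketch (not from the guide; it assumes a single block of exactly 256 threads, a power of two):

```cuda
// Sketch: block-wide tree reduction in shared memory.
// Assumes blockDim.x == 256 and a single block; illustrative only.
__global__ void BlockSum(const float* in, float* out)
{
    __shared__ float buf[256];          // one slot per thread
    int t = threadIdx.x;
    buf[t] = in[t];
    __syncthreads();                    // all loads finish before any reads

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            buf[t] += buf[t + stride];  // pair-wise partial sums
        __syncthreads();                // barrier between reduction steps
    }
    if (t == 0)
        out[0] = buf[0];                // thread 0 writes the block's sum
}
```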
For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.
CUDA threads may access data from multiple memory spaces during their execution as illustrated by Figure 7. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory.
There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats.
The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
Figure 7. Memory Hierarchy
As illustrated by Figure 8, the CUDA programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. This is the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU.
The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime. This includes device memory allocation and deallocation as well as data transfer between host and device memory.
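The typical allocate / copy / launch / copy-back pattern looks like this (a host-side sketch using the CUDA runtime API; error checking omitted for brevity):

```cuda
// size is the buffer size in bytes; h_A is host memory, d_A device memory.
float *h_A = (float*)malloc(size);
float *d_A;
cudaMalloc(&d_A, size);                              // allocate device memory
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // host -> device
// ... launch kernels that operate on d_A ...
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_A);                                       // release device memory
free(h_A);
```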
Unified Memory provides managed memory to bridge the host and device memory spaces. Managed memory is accessible from all CPUs and GPUs in the system as a single, coherent memory image with a common address space. This capability enables oversubscription of device memory and can greatly simplify the task of porting applications by eliminating the need to explicitly mirror data on host and device.
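With managed memory, the explicit copies above disappear; a single pointer is valid on both sides (a sketch; SomeKernel, blocks, and threads are hypothetical placeholders):

```cuda
float* data;
cudaMallocManaged(&data, N * sizeof(float));  // visible to CPU and GPU
for (int i = 0; i < N; ++i)
    data[i] = 1.0f;                           // initialize on the host
SomeKernel<<<blocks, threads>>>(data);        // same pointer on the device
cudaDeviceSynchronize();                      // wait before the host reads
cudaFree(data);
```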
Figure 8. Heterogeneous Programming
Note: Serial code executes on the host while parallel code executes on the device.
The compute capability of a device is represented by a version number, also sometimes called its "SM version". This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.
The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.
Devices with the same major revision number are of the same core architecture. The major revision number is 7 for devices based on the Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based on the Maxwell architecture, 3 for devices based on the Kepler architecture, 2 for devices based on the Fermi architecture, and 1 for devices based on the Tesla architecture.
The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features. Compute Capabilities gives the technical specifications of each compute capability.
Note: The compute capability version of a particular GPU should not be confused with the CUDA version (e.g., CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. The CUDA platform is used by application developers to create applications that run on many generations of GPU architectures, including future GPU architectures yet to be invented. While new versions of the CUDA platform often add native support for a new GPU architecture by supporting the compute capability version of that architecture, new versions of the CUDA platform typically also include software features that are independent of hardware generation.
The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and CUDA 9.0, respectively.
each of the N threads that execute VecAdd() performs one pair-wise addition

Here "pair-wise addition" simply means that each thread adds one pair of values. Note, though, that "pair-wise addition" has a second meaning: log2(N)-style summation. Given 16 floats, one way to parallelize is to split them into the first 8 and the last 8; each group of 8 splits into 4 + 4, each 4 into 2 + 2, and so on. This pair-wise accumulation is done to preserve precision. The familiar shared-memory reduction is exactly this kind of accumulation, which is why it is also called log2 reduction; it improves precision and reduces error. In the VecAdd example, however, it just means adding two ordinary numbers.
a kernel can be executed by multiple equally-shaped thread blocks

"Equally-shaped" means all the blocks have the same shape: the standard runtime launch syntax is <<<A, B>>>, which launches A blocks, each shaped as B threads. CUDA does not support mixing blocks of different shapes within a single launch; if you need differently shaped blocks, launch the kernel multiple times or reshape the work yourself in code.
A thread block size of 16x16 (256 threads), although arbitrary in this case

"Arbitrary" means we may launch a kernel with any block shape, but 128, 256 (as in this example: 16 * 16 = 256), and 512 are common choices. A given kernel often has a best block shape under which it performs best, and there is no formula that tells you in advance what it is; you have to find it by repeated experiment. Also note that when choosing a shape we often need an if (or while) guard, because in many cases the problem size is not evenly divisible by the "equally shaped" block size you just chose. You then launch slightly more blocks than needed and use an if to mask out the out-of-bounds threads. This is a direct consequence of the "equally shaped" requirement: the blocks on the boundary cannot take a smaller, different shape. OpenCL makes a different choice and allows boundary groups (its equivalent of CUDA blocks) to have different shapes, but that adds complexity for the kernel writer. CUDA favors ease of use and simply does not let you do this, so we handle the boundary with if, while, or for conditions.
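The boundary handling just described, sketched as a guarded 1-D kernel (names are illustrative):

```cuda
// Guarded 1-D kernel: extra threads in the last block simply do nothing.
__global__ void VecAddGuarded(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // mask out threads past the end of the data
        C[i] = A[i] + B[i];
}

// Host side: round the block count up so every element is covered.
// int threadsPerBlock = 256;
// int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
// VecAddGuarded<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, n);
```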
Originally published on the WeChat public account 吉浦迅科技 (gpusolution).