We are guiding readers through the English-language CUDA C Programming Guide. Today is day 3, and over three days we will study the CUDA programming interface. We hope that over the coming 97 days you will learn CUDA from the original source while building the habit of reading in English.
CUDA C provides a simple path for users familiar with the C programming language to easily write programs for execution by the device.
It consists of a minimal set of extensions to the C language and a runtime library.
The core language extensions were introduced in DAY2 (Reading the CUDA C Programming Guide: Programming Model). They allow programmers to define a kernel as a C function and to use new syntax to specify the grid and block dimensions each time the function is called. Any source file that contains any of these extensions must be compiled with nvcc.
The runtime is introduced in Compilation Workflow. It provides C functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc. A complete description of the runtime can be found in the CUDA reference manual.
The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application. The driver API provides an additional level of control by exposing lower-level concepts such as CUDA contexts - the analogue of host processes for the device - and CUDA modules - the analogue of dynamically loaded libraries for the device. Most applications do not use the driver API as they do not need this additional level of control and when using the runtime, context and module management are implicit, resulting in more concise code. The driver API is introduced in Driver API and fully described in the reference manual.
Kernels can be written using the CUDA instruction set architecture, called PTX, which is described in the PTX reference manual. It is however usually more effective to use a high-level programming language such as C. In both cases, kernels must be compiled into binary code by nvcc to execute on the device.
nvcc is a compiler driver that simplifies the process of compiling C or PTX code: it provides simple and familiar command-line options and executes them by invoking the collection of tools that implement the different compilation stages. This section gives an overview of the nvcc workflow and command options. A complete description can be found in the nvcc user manual.
Source files compiled with nvcc can include a mix of host code (i.e., code that executes on the host) and device code (i.e., code that executes on the device). nvcc's basic workflow consists in separating device code from host code and then:
· compiling the device code into an assembly form (PTX code) and/or binary form (cubin object),
· and modifying the host code by replacing the <<<...>>> syntax introduced in Kernels (and described in more detail in Execution Configuration) with the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.
The modified host code is output either as C code that is left to be compiled using another tool or as object code directly by letting nvcc invoke the host compiler during the last compilation stage.
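As a rough sketch of this workflow, the intermediate forms can also be produced directly (the file name x.cu below is a placeholder, not from the text):

```shell
# Emit only the assembly form (PTX) of the device code
nvcc -ptx x.cu -o x.ptx

# Emit only the binary form (a cubin object) for one specific architecture
nvcc -cubin -arch=sm_35 x.cu -o x.cubin

# Default: run the full workflow, letting nvcc invoke the host compiler
# in the last stage to produce a linked executable
nvcc x.cu -o x
```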
Applications can then:
· Either link to the compiled host code (this is the most common case),
· Or ignore the modified host code (if any) and use the CUDA driver API (see Driver API) to load and execute the PTX code or cubin object.
Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver. This is called just-in-time compilation. Just-in-time compilation increases application load time, but allows the application to benefit from any new compiler improvements coming with each new device driver. It is also the only way for applications to run on devices that did not exist at the time the application was compiled, as detailed in Application Compatibility.
When the device driver just-in-time compiles some PTX code for some application, it automatically caches a copy of the generated binary code in order to avoid repeating the compilation in subsequent invocations of the application. The cache - referred to as compute cache - is automatically invalidated when the device driver is upgraded, so that applications can benefit from the improvements in the new just-in-time compiler built into the device driver.
Environment variables are available to control just-in-time compilation as described in CUDA Environment Variables.
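For instance, the following environment variables influence just-in-time compilation and the compute cache (the cache path shown is a hypothetical example):

```shell
export CUDA_CACHE_DISABLE=1          # disable caching of JIT-compiled binary code
export CUDA_CACHE_PATH=/tmp/cucache  # directory used for the compute cache (hypothetical path)
export CUDA_CACHE_MAXSIZE=268435456  # cache size limit in bytes
export CUDA_FORCE_PTX_JIT=1          # force JIT from embedded PTX, ignoring embedded binaries
```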
Binary code is architecture-specific. A cubin object is generated using the compiler option -code that specifies the targeted architecture: For example, compiling with -code=sm_35 produces binary code for devices of compute capability 3.5. Binary compatibility is guaranteed from one minor revision to the next one, but not from one minor revision to the previous one or across major revisions. In other words, a cubin object generated for compute capability X.y will only execute on devices of compute capability X.z where z≥y.
Some PTX instructions are only supported on devices of higher compute capabilities. For example, Warp Shuffle Functions are only supported on devices of compute capability 3.0 and above. The -arch compiler option specifies the compute capability that is assumed when compiling C to PTX code. So, code that contains warp shuffle, for example, must be compiled with -arch=compute_30 (or higher).
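As a minimal sketch (kernel and variable names are illustrative, not from the guide), a kernel using a warp shuffle intrinsic, which requires compiling with -arch=compute_30 or higher:

```cuda
// Warp-level sum reduction using __shfl_down (compute capability >= 3.0).
// Launch with one warp (32 threads); illustrative sketch only.
__global__ void warpReduce(int *data)
{
    int val = data[threadIdx.x];
    // Each step adds the value held by the lane 'offset' positions higher.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    if (threadIdx.x == 0)
        data[0] = val;  // lane 0 now holds the warp-wide sum
}
```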
PTX code produced for some specific compute capability can always be compiled to binary code of greater or equal compute capability. Note that a binary compiled from an earlier PTX version may not make use of some hardware features. For example, a binary targeting devices of compute capability 7.0 (Volta) compiled from PTX generated for compute capability 6.0 (Pascal) will not make use of Tensor Core instructions, since these were not available on Pascal. As a result, the final binary may perform worse than would be possible if the binary were generated using the latest version of PTX.
To execute code on devices of specific compute capability, an application must load binary or PTX code that is compatible with this compute capability as described in Binary Compatibility and PTX Compatibility. In particular, to be able to execute code on future architectures with higher compute capability (for which no binary code can be generated yet), an application must load PTX code that will be just-in-time compiled for these devices (see Just-in-Time Compilation).
Which PTX and binary code gets embedded in a CUDA C application is controlled by the -arch and -code compiler options or the -gencode compiler option as detailed in the nvcc user manual. For example,
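the following nvcc invocation (reconstructed here to match the description that follows; x.cu is a placeholder source file):

```shell
nvcc x.cu \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_60,code=\'compute_60,sm_60\'
```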
embeds binary code compatible with compute capability 3.5 and 5.0 (first and second -gencode options) and PTX and binary code compatible with compute capability 6.0 (third -gencode option).
Host code is generated to automatically select at runtime the most appropriate code to load and execute, which, in the above example, will be:
· 3.5 binary code for devices with compute capability 3.5 and 3.7,
· 5.0 binary code for devices with compute capability 5.0 and 5.2,
· 6.0 binary code for devices with compute capability 6.0 and 6.1,
· PTX code which is compiled to binary code at runtime for devices with compute capability 7.0 and higher.
x.cu can have an optimized code path that uses warp shuffle operations, for example, which are only supported in devices of compute capability 3.0 and higher. The __CUDA_ARCH__ macro can be used to differentiate various code paths based on compute capability. It is only defined for device code. When compiling with -arch=compute_35 for example, __CUDA_ARCH__ is equal to 350.
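A sketch of such a compile-time code-path selection (the kernel name and messages are illustrative, not from the guide):

```cuda
#include <cstdio>

// __CUDA_ARCH__ selects a code path at compile time. It is only defined
// during device-code compilation passes, not in host code.
__global__ void whichPath(void)
{
#if __CUDA_ARCH__ >= 350
    // Taken when compiling with -arch=compute_35 or higher
    // (__CUDA_ARCH__ equals 350 for -arch=compute_35)
    printf("optimized path, __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#else
    // Taken when targeting lower architectures
    printf("fallback path, __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#endif
}
```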
Applications using the driver API must compile code to separate files and explicitly load and execute the most appropriate file at runtime.
The Volta architecture introduces Independent Thread Scheduling, which changes the way threads are scheduled on the GPU. For code relying on specific behavior of SIMT scheduling in previous architectures, Independent Thread Scheduling may alter the set of participating threads, leading to incorrect results. To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt in to Pascal's thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.
The nvcc user manual lists various shorthands for the -arch, -code, and -gencode compiler options. For example, -arch=sm_35 is a shorthand for -arch=compute_35 -code=compute_35,sm_35 (which is the same as -gencode arch=compute_35,code=\'compute_35,sm_35\').
The front end of the compiler processes CUDA source files according to C++ syntax rules. Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code, as described in C/C++ Language Support.
The 64-bit version of nvcc compiles device code in 64-bit mode (i.e., pointers are 64-bit). Device code compiled in 64-bit mode is only supported with host code compiled in 64-bit mode.
Similarly, the 32-bit version of nvcc compiles device code in 32-bit mode and device code compiled in 32-bit mode is only supported with host code compiled in 32-bit mode.
The 32-bit version of nvcc can also compile device code in 64-bit mode using the -m64 compiler option.
The 64-bit version of nvcc can also compile device code in 32-bit mode using the -m32 compiler option.
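For example (a sketch; x.cu is a placeholder source file):

```shell
# Force 32-bit host and device code, regardless of the nvcc build
nvcc -m32 x.cu -o x32

# Force 64-bit host and device code
nvcc -m64 x.cu -o x64
```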
Just-in-time compilation is abbreviated JIT. Concretely, the code is compiled at the instant just before it is executed (the opposite is AOT, Ahead Of Time). From the developer's point of view, ordinary compilation happens now, on the developer's own machine; JIT compilation happens later, on the user's machine, after the software has shipped. Consider a future point in time, such as the release of a new GPU generation: during development the new card does not exist yet, so the compiler on the developer's machine cannot know how to generate code for it. With JIT this is not a problem: the future compiler is built into the future card's driver, and compilation simply happens then, resolving the mismatch in time. Moreover, if compiler technology advances, JIT-compiled code can benefit from those advances many years after development, even if the developers are long gone (for example, the team has disbanded). Because compilation is not completed once and for all, as in ordinary compilation, but is finished later on the user's machine at the moment of use, it is called "just-in-time" compilation.
"Binary code is architecture-specific": this refers to SASS (short for Shader ASSembly), which is fixed per GPU architecture. SASS compiled for one architecture (e.g. a cubin object) can only run on cards of that architecture; it is not portable the way PTX is. (Binary compatibility on a CPU works differently: an exe built ten years ago can still run on a CPU released this year.) A GPU maintains compatibility only at the PTX level; ordinary SASS code does not carry over, except between cards of the same architecture generation. It is as if a v5 CPU could only run exes compiled for v5, neither earlier nor later ones.
PTX Compatibility: PTX comes in several versions, and newer drivers and cards support higher PTX versions. Code written against a lower PTX version can run under a higher one, which preserves compatibility with old code. This is unlike SASS binaries, which run only on their own architecture generation, neither on an older one nor on a newer one; that is the biggest drawback of shipping SASS, i.e. binaries. PTX can keep running on future new cards (via JIT). You can think of PTX as a virtual machine together with the virtual instruction set that runs on it.
"Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code": in host code (that is, on the ordinary CPU side), C++ is fully supported; in device code (that is, on the GPU side), only a subset of C++ features is fully supported.
Device code compiled in 64-bit mode is only supported with host code compiled in 64-bit mode.
If the GPU side is 64-bit, the CPU side must be too. This looks obvious, so why call it out specially? Because CUDA 3.2 and earlier supported a mixed mode, allowing one side to be 64-bit and the other 32-bit. This turned out to confuse many people, so later versions simply required both sides to match. This is also part of CUDA's ease of use: OpenCL, for example, does not impose this requirement, which is why CUDA can easily pass structures containing word-size-dependent (32-bit or 64-bit) members between the GPU and the CPU, while OpenCL can hardly do the same.
Originally published on the WeChat official account 吉浦迅科技 (gpusolution).