DAY15: Reading the CUDA C Runtime: Texture Memory

GPUS Lady

Published 2018-06-25 15:46:48

Collected in the column: GPUS开发者

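We are guiding readers through the English-language CUDA C Programming Guide. Today is day 15. We are spending several days on CUDA's programming interface, the most important part of which is the CUDA C runtime. We hope that over the next 85 days you will learn CUDA straight from the source and build the habit of reading in English.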

This article is 976 words long; estimated reading time is 20 minutes.

Starting today, we will spend a few days on Texture and Surface Memory.

3.2.11. Texture and Surface Memory

CUDA supports a subset of the texturing hardware that the GPU uses for graphics to access texture and surface memory. Reading data from texture or surface memory instead of global memory can have several performance benefits as described in Device Memory Accesses.

There are two different APIs to access texture and surface memory:

· The texture reference API that is supported on all devices,

· The texture object API that is only supported on devices of compute capability 3.x.

The texture reference API has limitations that the texture object API does not have. They are mentioned in Texture Reference API.

3.2.11.1. Texture Memory

Texture memory is read from kernels using the device functions described in Texture Functions. The process of reading a texture calling one of these functions is called a texture fetch. Each texture fetch specifies a parameter called a texture object for the texture object API or a texture reference for the texture reference API.

The texture object or the texture reference specifies:

· The texture, which is the piece of texture memory that is fetched. Texture objects are created at runtime and the texture is specified when creating the texture object as described in Texture Object API. Texture references are created at compile time and the texture is specified at runtime by binding the texture reference to the texture through runtime functions as described in Texture Reference API; several distinct texture references might be bound to the same texture or to textures that overlap in memory. A texture can be any region of linear memory or a CUDA array (described in CUDA Arrays).

· Its dimensionality that specifies whether the texture is addressed as a one dimensional array using one texture coordinate, a two-dimensional array using two texture coordinates, or a three-dimensional array using three texture coordinates. Elements of the array are called texels, short for texture elements. The texture width, height, and depth refer to the size of the array in each dimension. Table 14 lists the maximum texture width, height, and depth depending on the compute capability of the device.

· The type of a texel, which is restricted to the basic integer and single-precision floating-point types and any of the 1-, 2-, and 4-component vector types defined in char, short, int, long, longlong, float, double that are derived from the basic integer and single-precision floating-point types.

· The read mode, which is equal to cudaReadModeNormalizedFloat or cudaReadModeElementType. If it is cudaReadModeNormalizedFloat and the type of the texel is a 16-bit or 8-bit integer type, the value returned by the texture fetch is actually returned as floating-point type and the full range of the integer type is mapped to [0.0, 1.0] for unsigned integer type and [-1.0, 1.0] for signed integer type; for example, an unsigned 8-bit texture element with the value 0xff reads as 1. If it is cudaReadModeElementType, no conversion is performed.

· Whether texture coordinates are normalized or not. By default, textures are referenced (by the functions of Texture Functions) using floating-point coordinates in the range [0, N-1] where N is the size of the texture in the dimension corresponding to the coordinate. For example, a texture that is 64x32 in size will be referenced with coordinates in the range [0, 63] and [0, 31] for the x and y dimensions, respectively. Normalized texture coordinates cause the coordinates to be specified in the range [0.0, 1.0-1/N] instead of [0, N-1], so the same 64x32 texture would be addressed by normalized coordinates in the range [0, 1-1/N] in both the x and y dimensions. Normalized texture coordinates are a natural fit to some applications' requirements, if it is preferable for the texture coordinates to be independent of the texture size.

· The addressing mode. It is valid to call the device functions of Section B.8 with coordinates that are out of range. The addressing mode defines what happens in that case. The default addressing mode is to clamp the coordinates to the valid range: [0, N) for non-normalized coordinates and [0.0, 1.0) for normalized coordinates. If the border mode is specified instead, texture fetches with out-of-range texture coordinates return zero. For normalized coordinates, the wrap mode and the mirror mode are also available. When using the wrap mode, each coordinate x is converted to frac(x) = x - floor(x) where floor(x) is the largest integer not greater than x. When using the mirror mode, each coordinate x is converted to frac(x) if floor(x) is even and 1-frac(x) if floor(x) is odd. The addressing mode is specified as an array of size three whose first, second, and third elements specify the addressing mode for the first, second, and third texture coordinates, respectively; the addressing modes are cudaAddressModeBorder, cudaAddressModeClamp, cudaAddressModeWrap, and cudaAddressModeMirror; cudaAddressModeWrap and cudaAddressModeMirror are only supported for normalized texture coordinates.

· The filtering mode which specifies how the value returned when fetching the texture is computed based on the input texture coordinates. Linear texture filtering may be done only for textures that are configured to return floating-point data. It performs low-precision interpolation between neighboring texels. When enabled, the texels surrounding a texture fetch location are read and the return value of the texture fetch is interpolated based on where the texture coordinates fell between the texels. Simple linear interpolation is performed for one-dimensional textures, bilinear interpolation for two-dimensional textures, and trilinear interpolation for three-dimensional textures. Texture Fetching gives more details on texture fetching. The filtering mode is equal to cudaFilterModePoint or cudaFilterModeLinear. If it is cudaFilterModePoint, the returned value is the texel whose texture coordinates are the closest to the input texture coordinates. If it is cudaFilterModeLinear, the returned value is the linear interpolation of the two (for a one-dimensional texture), four (for a two-dimensional texture), or eight (for a three-dimensional texture) texels whose texture coordinates are the closest to the input texture coordinates. cudaFilterModeLinear is only valid for returned values of floating-point type.
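The read-mode bullet above can be made concrete with a small host-side sketch. This is plain C++ that imitates the value conversion of cudaReadModeNormalizedFloat; it is not the CUDA API, and the function names normalize_unsigned and normalize_signed are made up for illustration. The unsigned mapping follows the guide's example (0xff reads as 1). The signed mapping, in particular clamping the most negative value to -1.0, is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Unsigned 8- or 16-bit texel: the full integer range maps to [0.0, 1.0].
float normalize_unsigned(std::uint32_t v, int bits) {
    assert(bits == 8 || bits == 16);
    const float max_value = static_cast<float>((1u << bits) - 1u);
    return static_cast<float>(v) / max_value;
}

// Signed 8- or 16-bit texel: the range maps to [-1.0, 1.0].
// Assumption: the most negative value (e.g. -128) is clamped to -1.0.
float normalize_signed(std::int32_t v, int bits) {
    assert(bits == 8 || bits == 16);
    const float max_value = static_cast<float>((1 << (bits - 1)) - 1);
    return std::max(static_cast<float>(v) / max_value, -1.0f);
}
```

With cudaReadModeElementType none of this happens and the fetch returns the stored integer unchanged.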

Notes and practical experience:

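The process of reading a texture calling one of these functions is called a texture fetch. Each texture fetch specifies a parameter called a texture object for the texture object API or a texture reference for the texture reference API. In other words, every texture fetch takes a texture object (with the texture object API) or a texture reference (with the texture reference API). Texture references are the old API and texture objects are the new one. OpenCL, by comparison, has only the equivalent of texture objects. Texture references have many drawbacks: for example, they must be global variables and cannot be passed as arguments. Texture objects were added in a later CUDA version, but old code remains compatible and existing texture-reference code still works. You can use either; the guide documents both. To deal with old code, such as legacy code inherited inside a company, you still need to read up on texture references, otherwise you cannot maintain it. Texture objects are supported only on cards of compute capability 3.0 and above, and CUDA 9+ no longer supports devices below 3.0.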

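several distinct texture references might be bound to the same texture or to textures that overlap in memory. A texture can be any region of linear memory or a CUDA array. A texture reference and its backing store are separate things: the texture reference is defined as a special global variable, while the backing store is ordinary device memory or a CUDA array (a special kind of device memory whose layout is hidden from the user) that the user allocates separately. The two must be bound together before use. A texture object, by contrast, is tied to its backing store when it is created, so no separate binding step is needed. This passage says that several texture references can be bound to the same backing store, or to backing stores that overlap one another. The backing store can be ordinary device memory or the special kind (a CUDA array). The special backing store often performs better because of its unusual internal storage format (AMD's documentation describes the details; yes, you read that right).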

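The special storage may be slower to populate but faster to use. It places nearby elements together, and since much graphics code needs to access neighboring elements right away, performance is better. In 2D, for example, the point (x, y) (a texel, or texture element; "point" is used here for short) and the points (x+1, y), (x, y+1), and (x+1, y+1) are quite likely stored together. Once one point has been read, the others are probably already in the cache, so the next reads are fast. Ordinary device memory is cached too, but its elements are generally laid out sequentially, one row after another. After reading (x, y), perhaps only (x+1, y) is in the cache, so reading (x, y+1) or (x+1, y+1) may be somewhat slower. That is why a CUDA array may have a performance advantage over ordinary device memory.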
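To illustrate the locality argument, here is a sketch comparing a row-major layout with a purely hypothetical 2x2-tiled layout. The real CUDA array layout is hidden from the user and is not claimed to look like this; tiled_2x2_offset is an invented example that only shows how a layout can keep the four neighbors of a 2x2 block adjacent in memory.

```cpp
#include <cassert>
#include <cstddef>

// Ordinary linear memory: rows are stored one after another.
std::size_t row_major_offset(int x, int y, int width) {
    assert(x >= 0 && x < width && y >= 0);
    return static_cast<std::size_t>(y) * static_cast<std::size_t>(width)
         + static_cast<std::size_t>(x);
}

// Hypothetical layout (NOT the actual CUDA array format): the image is cut
// into 2x2 tiles and the four texels of each tile are stored contiguously.
std::size_t tiled_2x2_offset(int x, int y, int width) {
    assert(width % 2 == 0 && x >= 0 && x < width && y >= 0);
    const std::size_t tiles_per_row = static_cast<std::size_t>(width / 2);
    const std::size_t tile_index =
        static_cast<std::size_t>(y / 2) * tiles_per_row + static_cast<std::size_t>(x / 2);
    const std::size_t within_tile = static_cast<std::size_t>((y % 2) * 2 + (x % 2));
    return tile_index * 4 + within_tile;
}
```

For a 64-wide image, row-major puts (0,0) and (0,1) 64 elements apart, while the tiled sketch puts (0,0), (1,0), (0,1), and (1,1) at offsets 0 to 3, so a single cache line fetch is likely to cover all four.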

The type of a texel, which is restricted to the basic integer and single-precision floating-point types and any of the 1-, 2-, and 4-component vector types defined in char, short, int, long, longlong, float, double that are derived from the basic integer and single-precision floating-point types. The type of each point (texel, texture element) is restricted to the basic integer types or float, in scalar or vector form. double is not supported. (You can read an int2 and reinterpret it as a double: the former is 2 × 4 B and the latter is 8 B, so both are 64 bits.) If this were not spelled out, many people would try to use double.
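The int2 trick can be sketched on the host as follows. Int2 is a stand-in for CUDA's int2, and split_double, join_double, and check_round_trip are illustrative names rather than CUDA functions. The sketch assumes the low 32 bits travel in x and the high 32 bits in y; in device code the intrinsic __hiloint2double(hi, lo) performs the same reassembly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits");

// Stand-in for CUDA's int2: two 32-bit integers, 8 bytes in total.
struct Int2 { std::int32_t x; std::int32_t y; };

// Splits the 64 bits of a double into low (x) and high (y) halves.
Int2 split_double(double d) {
    std::uint64_t bits = 0;
    std::memcpy(&bits, &d, sizeof bits);
    const std::uint32_t lo = static_cast<std::uint32_t>(bits & 0xffffffffu);
    const std::uint32_t hi = static_cast<std::uint32_t>(bits >> 32);
    Int2 v;
    std::memcpy(&v.x, &lo, sizeof lo);
    std::memcpy(&v.y, &hi, sizeof hi);
    return v;
}

// Reassembles the double from the two halves.
double join_double(Int2 v) {
    std::uint32_t lo = 0, hi = 0;
    std::memcpy(&lo, &v.x, sizeof lo);
    std::memcpy(&hi, &v.y, sizeof hi);
    const std::uint64_t bits = (static_cast<std::uint64_t>(hi) << 32) | lo;
    double d = 0.0;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Usage check for any non-NaN value: splitting and joining is lossless.
void check_round_trip(double d) { assert(join_double(split_double(d)) == d); }
```

The texture only ever sees the int2, so filtering and value normalization cannot be used on such data; only plain element reads make sense.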

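Whether texture coordinates are normalized or not. By default, textures are referenced using floating-point coordinates in the range [0, N-1] where N is the size of the texture in the dimension corresponding to the coordinate. This is a very useful feature in image processing: normalized coordinates. A 64x64 image and a 32x32 image can both be accessed with the same (normalized) coordinates. Instead of the original X: [0, 63], Y: [0, 63], both become X: [0.0, 1.0), Y: [0.0, 1.0).

This way, exactly the same image-processing code handles images of different resolution or level of detail without any changes (the processed result will of course differ). In other words, whatever the original W * H, each dimension is rescaled to 1x1 and all coordinates become floating-point fractions.

For example, what used to be 0, 1, 2, 3, ..., 63 is now 0, 1/64, 2/64, ... and so on. Textures also have a built-in sampling mode, so it does not matter if a coordinate does not land exactly on a point: the hardware can snap to the nearest point or interpolate between nearby points. This lets one piece of code handle images of different sizes, which is a nice feature. Here the coordinate is converted automatically, at read time, into the coordinates of one or more actual points in the backing store. A texture read can also normalize the returned value automatically. Similarly, textures provide out-of-range and boundary handling: for example, coordinates are automatically clamped to the edge value or reads return zero, which avoids out-of-bounds accesses without the user having to worry about them. With these features, you can think of a texture as an advanced way of reading memory that comes with some free processing.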
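The coordinate mapping can be sketched in host-side C++. This is a simplified model of point sampling with normalized coordinates, not the CUDA API: texel_index and texel_center are illustrative names, and the model assumes the selected texel is floor(u * N).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Maps a normalized coordinate u in [0, 1) to the texel index that point
// sampling would select in a dimension of n texels (simplified model).
int texel_index(float u, int n) {
    assert(n > 0 && u >= 0.0f && u < 1.0f);
    const int i = static_cast<int>(std::floor(u * static_cast<float>(n)));
    return std::min(i, n - 1);  // guards against rounding up to n
}

// Normalized coordinate of the center of texel i.
float texel_center(int i, int n) {
    assert(n > 0 && i >= 0 && i < n);
    return (static_cast<float>(i) + 0.5f) / static_cast<float>(n);
}
```

The same coordinate u = 0.5 selects texel 32 in a 64-wide image and texel 16 in a 32-wide one, which is why code written against normalized coordinates does not need to know the image size.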

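(Also, a long, long time ago people used textures because they came with their own texture cache: compute capability 1.x hardware had no cache, such as an internal L1, for ordinary accesses, but it did have a texture cache, so textures were faster.) That no longer matters, especially on newer cards (Maxwell+), where the texture cache has been merged with the cache used for ordinary memory accesses. On newer cards it is called the Unified Cache, which is effectively the L1 data cache and the L1 texture cache combined. So whether you read directly or through a texture, you now get the benefit of that cache. With direct reads, however, you do not get the other texture features: the speedup from the element arrangement of CUDA arrays, automatic boundary handling, automatic coordinate mapping, and automatic value mapping. The programmer can add these by hand in code, though.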

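One very useful feature worth pointing out is border mode, in which out-of-range reads automatically return 0. With ordinary memory accesses, many people have to check for out-of-range indices, which is tedious. For a very large image or matrix in particular, very few threads sit at the edges, yet every thread must check that it is within the valid range to avoid going out of bounds. With texture reads you can just brute-force it: an out-of-range read will not crash, it simply returns 0 (provided that reading 0 is an acceptable fallback).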
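A host-side sketch of what border mode saves you from writing: fetch_border below does in software what the texture unit does for free, and neighbor_sum shows a caller that needs no range checks of its own. Both names are illustrative; this is plain C++ over a std::vector, not the CUDA texture API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns image(x, y), or 0 when (x, y) lies outside the image,
// imitating a texture fetch in border mode.
float fetch_border(const std::vector<float>& image, int width, int height,
                   int x, int y) {
    assert(width > 0 && height > 0);
    assert(image.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    if (x < 0 || x >= width || y < 0 || y >= height) return 0.0f;
    return image[static_cast<std::size_t>(y) * static_cast<std::size_t>(width)
                 + static_cast<std::size_t>(x)];
}

// Sum of the four direct neighbors of (x, y); edge pixels need no special case.
float neighbor_sum(const std::vector<float>& image, int width, int height,
                   int x, int y) {
    return fetch_border(image, width, height, x - 1, y)
         + fetch_border(image, width, height, x + 1, y)
         + fetch_border(image, width, height, x, y - 1)
         + fetch_border(image, width, height, x, y + 1);
}
```

In a kernel the branch inside fetch_border disappears entirely, because the texture hardware resolves the out-of-range case during the fetch.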

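The addressing mode. It is valid to call the device functions of Section B.8 with coordinates that are out of range. Strictly speaking this term should be rendered as "addressing mode" (寻址模式), but here I usually call it "address mode" (地址模式), because it differs from the usual addressing mode: this one comes from graphics rather than from computer science. There are several address modes, for example: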

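This is border mode (out-of-range reads return 0). The pink here represents the zero value (or, say, a transparent value). Coordinates are normalized. Within the valid coordinates, X: [0, 1), Y: [0, 1), the read returns the normal 2D data. The data here is an image, namely the mountains and water you see.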

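This is clamp mode: out-of-range reads automatically return the value of the nearest edge point. That is why the edges appear stretched into repeated streaks (the repeated streaks are the edge values).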

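Wrap mode: out-of-range coordinates automatically wrap around to the start, and that value is returned. The effect is a repeated image. Imagine moving right from the valid coordinates in the middle and crossing the edge: the values returned start again from the left side.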

Mirror mode: similar to wrap, but flipped.

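Texture reads never go out of bounds, but they produce various automatic effects instead. The default is clamp mode, which keeps returning the last valid edge value once out of range; that is the stretching effect you saw. Suppose you have a 2x2 matrix with rows 1 2 and 3 4, addressed with non-normalized coordinates. Reading (0,0) gives 1 and reading (1,0) gives 2; reading (2,0), (3,0), (4,0), ... then keeps giving 2. Switch to border mode and out-of-range reads always give 0. Switch to wrap mode (which requires normalized coordinates) and going off the right of the first row gives the repeating sequence 1, 2, 1, 2, 1, 2, and so on. With ordinary direct memory accesses, going out of range often crashes outright or makes the program misbehave. We can always guard against it with an if, but with textures you do not have to handle it yourself.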
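The four modes can be compared side by side with a host-side sketch that point-samples one row of texels at a normalized coordinate. This is an emulation under stated assumptions, not the CUDA API: AddressMode and sample_row are made-up names, and the wrap and mirror branches follow the frac(x) formulas quoted from the guide.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class AddressMode { Clamp, Border, Wrap, Mirror };

// Point-samples a 1D row of texels at normalized coordinate u.
float sample_row(const std::vector<float>& row, float u, AddressMode mode) {
    assert(!row.empty());
    const int n = static_cast<int>(row.size());
    const float fl = std::floor(u);
    const float frac = u - fl;  // frac(u) = u - floor(u)
    switch (mode) {
        case AddressMode::Border:  // out of range reads as zero
            if (u < 0.0f || u >= 1.0f) return 0.0f;
            break;
        case AddressMode::Clamp:   // snap to the nearest valid edge
            u = std::min(std::max(u, 0.0f), 1.0f);
            break;
        case AddressMode::Wrap:    // repeat the row
            u = frac;
            break;
        case AddressMode::Mirror:  // repeat, flipping every other copy
            u = (static_cast<long>(fl) % 2 == 0) ? frac : 1.0f - frac;
            break;
    }
    // u is now in [0, 1]; the min() keeps u == 1 on the last texel.
    const int i = std::min(static_cast<int>(u * static_cast<float>(n)), n - 1);
    return row[static_cast<std::size_t>(i)];
}
```

For the row {1, 2}, whose texel centers sit at 0.25 and 0.75, sampling at 0.25, 0.75, 1.25, 1.75 yields 1, 2, 2, 2 with clamp; 1, 2, 0, 0 with border; 1, 2, 1, 2 with wrap; and 1, 2, 2, 1 with mirror.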

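Another important automatic feature of textures is free interpolation. If you have non-normalized coordinates 0, 1, 2, 3, 4 and you read at coordinate 2.3, the hardware automatically uses the values of the neighboring points to linearly interpolate the value at position 2.3.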
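A host-side sketch of 1D linear filtering with non-normalized coordinates follows. It uses the formula as I understand it from the guide's Texture Fetching appendix: tex(x) = (1 - α)·T[i] + α·T[i+1], with xB = x - 0.5, i = floor(xB), and α = frac(xB). Two things are assumptions of this sketch: out-of-range indices are clamped, and α is kept in full float precision, whereas the guide says the hardware interpolates at low precision. fetch_linear_1d is an illustrative name, not a CUDA function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linearly filtered 1D fetch at non-normalized coordinate x (clamp addressing).
float fetch_linear_1d(const std::vector<float>& texels, float x) {
    assert(!texels.empty());
    const int n = static_cast<int>(texels.size());
    const float xb = x - 0.5f;        // texel i is centered at i + 0.5
    const float fl = std::floor(xb);
    const float alpha = xb - fl;      // weight of the right-hand neighbor
    const int i = static_cast<int>(fl);
    const int i0 = std::min(std::max(i, 0), n - 1);
    const int i1 = std::min(std::max(i + 1, 0), n - 1);
    return (1.0f - alpha) * texels[static_cast<std::size_t>(i0)]
         + alpha * texels[static_cast<std::size_t>(i1)];
}
```

Note the half-texel offset: with texels {0, 10, 20, 30, 40}, the exact value 20 is returned at x = 2.5, and x = 2.3 blends texels 1 and 2 to give roughly 18 rather than 23. This is one of the easy details to get wrong.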

Textures have many small details like this; overlook them and you will get things wrong.

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2018-05-19. In case of infringement, contact cloudcommunity@tencent.com for removal.
