DAY15: Reading the CUDA C Runtime, Texture Memory

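We are guiding readers through the English-language CUDA C Programming Guide. Today is day 15. We are spending several days on the CUDA programming interface, the most important part of which is the CUDA C runtime. We hope that over the next 85 days you will learn CUDA straight from the source and also build the habit of reading in English.

This article is 976 words long; estimated reading time is 20 minutes.

Starting today, we will spend a few days on Texture and Surface Memory.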

3.2.11. Texture and Surface Memory

CUDA supports a subset of the texturing hardware that the GPU uses for graphics to access texture and surface memory. Reading data from texture or surface memory instead of global memory can have several performance benefits as described in Device Memory Accesses.

There are two different APIs to access texture and surface memory:

· The texture reference API that is supported on all devices,

· The texture object API that is only supported on devices of compute capability 3.x.

The texture reference API has limitations that the texture object API does not have. They are mentioned in Texture Reference API.

3.2.11.1. Texture Memory

Texture memory is read from kernels using the device functions described in Texture Functions. The process of reading a texture calling one of these functions is called a texture fetch. Each texture fetch specifies a parameter called a texture object for the texture object API or a texture reference for the texture reference API.

The texture object or the texture reference specifies:

· The texture, which is the piece of texture memory that is fetched. Texture objects are created at runtime and the texture is specified when creating the texture object as described in Texture Object API. Texture references are created at compile time and the texture is specified at runtime by binding the texture reference to the texture through runtime functions as described in Texture Reference API; several distinct texture references might be bound to the same texture or to textures that overlap in memory. A texture can be any region of linear memory or a CUDA array (described in CUDA Arrays).

· Its dimensionality that specifies whether the texture is addressed as a one dimensional array using one texture coordinate, a two-dimensional array using two texture coordinates, or a three-dimensional array using three texture coordinates. Elements of the array are called texels, short for texture elements. The texture width, height, and depth refer to the size of the array in each dimension. Table 14 lists the maximum texture width, height, and depth depending on the compute capability of the device.

· The type of a texel, which is restricted to the basic integer and single-precision floating-point types and any of the 1-, 2-, and 4-component vector types defined in char, short, int, long, longlong, float, double that are derived from the basic integer and single-precision floating-point types.

· The read mode, which is equal to cudaReadModeNormalizedFloat or cudaReadModeElementType. If it is cudaReadModeNormalizedFloat and the type of the texel is a 16-bit or 8-bit integer type, the value returned by the texture fetch is actually returned as floating-point type and the full range of the integer type is mapped to [0.0, 1.0] for unsigned integer type and [-1.0, 1.0] for signed integer type; for example, an unsigned 8-bit texture element with the value 0xff reads as 1. If it is cudaReadModeElementType, no conversion is performed.

· Whether texture coordinates are normalized or not. By default, textures are referenced (by the functions of Texture Functions) using floating-point coordinates in the range [0, N-1] where N is the size of the texture in the dimension corresponding to the coordinate. For example, a texture that is 64x32 in size will be referenced with coordinates in the range [0, 63] and [0, 31] for the x and y dimensions, respectively. Normalized texture coordinates cause the coordinates to be specified in the range [0.0, 1.0-1/N] instead of [0, N-1], so the same 64x32 texture would be addressed by normalized coordinates in the range [0, 1-1/N] in both the x and y dimensions. Normalized texture coordinates are a natural fit to some applications' requirements, if it is preferable for the texture coordinates to be independent of the texture size.

· The addressing mode. It is valid to call the device functions of Section B.8 with coordinates that are out of range. The addressing mode defines what happens in that case. The default addressing mode is to clamp the coordinates to the valid range: [0, N) for non-normalized coordinates and [0.0, 1.0) for normalized coordinates. If the border mode is specified instead, texture fetches with out-of-range texture coordinates return zero. For normalized coordinates, the wrap mode and the mirror mode are also available. When using the wrap mode, each coordinate x is converted to frac(x) = x - floor(x) where floor(x) is the largest integer not greater than x. When using the mirror mode, each coordinate x is converted to frac(x) if floor(x) is even and 1-frac(x) if floor(x) is odd. The addressing mode is specified as an array of size three whose first, second, and third elements specify the addressing mode for the first, second, and third texture coordinates, respectively; the addressing modes are cudaAddressModeBorder, cudaAddressModeClamp, cudaAddressModeWrap, and cudaAddressModeMirror; cudaAddressModeWrap and cudaAddressModeMirror are only supported for normalized texture coordinates.

· The filtering mode which specifies how the value returned when fetching the texture is computed based on the input texture coordinates. Linear texture filtering may be done only for textures that are configured to return floating-point data. It performs low-precision interpolation between neighboring texels. When enabled, the texels surrounding a texture fetch location are read and the return value of the texture fetch is interpolated based on where the texture coordinates fell between the texels. Simple linear interpolation is performed for one-dimensional textures, bilinear interpolation for two-dimensional textures, and trilinear interpolation for three-dimensional textures. Texture Fetching gives more details on texture fetching. The filtering mode is equal to cudaFilterModePoint or cudaFilterModeLinear. If it is cudaFilterModePoint, the returned value is the texel whose texture coordinates are the closest to the input texture coordinates. If it is cudaFilterModeLinear, the returned value is the linear interpolation of the two (for a one-dimensional texture), four (for a two-dimensional texture), or eight (for a three-dimensional texture) texels whose texture coordinates are the closest to the input texture coordinates. cudaFilterModeLinear is only valid for returned values of floating-point type.
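To make the read-mode bullet above concrete, here is a minimal host-side C++ sketch of the unsigned case of cudaReadModeNormalizedFloat. It is not device code and not part of the CUDA API: the helper names are invented for illustration, and it only models the mapping of the full unsigned range onto [0.0, 1.0] that the guide describes. The signed case maps onto [-1.0, 1.0] and is not modeled here.

```cpp
#include <cstdint>

// CPU-side model of cudaReadModeNormalizedFloat for unsigned texels:
// the full integer range is mapped linearly onto [0.0, 1.0].
// (Illustrative helpers with made-up names, not CUDA API functions.)
inline float normalized_from_u8(std::uint8_t v) {
    return static_cast<float>(v) / 255.0f;    // 0xff -> 1.0
}

inline float normalized_from_u16(std::uint16_t v) {
    return static_cast<float>(v) / 65535.0f;  // 0xffff -> 1.0
}
```

With these helpers, normalized_from_u8(0xff) gives 1.0, matching the guide's example of an unsigned 8-bit texel with the value 0xff.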

Notes and practical experience:

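The process of reading a texture calling one of these functions is called a texture fetch. Each texture fetch specifies a parameter called a texture object for the texture object API or a texture reference for the texture reference API. Every texture fetch takes either a texture object (texture object API) or a texture reference (texture reference API). Texture references are the old API and texture objects are the new one. OpenCL, by comparison, only has the equivalent of texture objects. Texture references have many drawbacks: for example, they must be global variables and cannot be passed as parameters. Texture objects were added in a later CUDA version, but old code stays compatible and texture references still work. You can use either, and the manual documents both. For legacy code, such as code your company inherited, you still need to understand texture references, otherwise you cannot maintain it. Texture objects are only supported on cards of compute capability 3.0 and above. CUDA 9+ no longer supports devices below 3.0.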

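several distinct texture references might be bound to the same texture or to textures that overlap in memory. A texture can be any region of linear memory or a CUDA array. A texture reference and its backing store are separate things. The texture reference is defined as a special global variable, while the backing store is ordinary device memory or a CUDA Array (a special kind of device memory whose layout is hidden from the user) that the user allocates separately. The two must be bound together before use. A texture object, by contrast, is tied to its backing store when it is created, so no separate binding step is needed. The quoted passage says that several texture references can be bound to the same backing store, or to several backing stores that overlap one another. The backing store can be ordinary device memory or the special kind (a CUDA Array). The special backing store often performs better, because its internal storage format is unusual (the details are described by AMD, and yes, you read that right).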

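The special storage may be slower to fill with data, but it may be faster to use. It places nearby elements together, and a lot of graphics code needs to access neighboring elements right away, so performance is better. In 2D, for example, the point (x, y) (a texel, short for texture element; "point" is used here for short), the point (x+1, y), the point (x, y+1), and the point (x+1, y+1) are quite likely stored together. Once one of them has been read, the others are probably already in the cache, so the next read is fast. Ordinary device memory is also cached, but its elements are generally laid out in order, one row after another. After reading (x, y), perhaps only (x+1, y) is in the cache, and reading (x, y+1) or (x+1, y+1) may be slower. This is why a CUDA Array can have a performance advantage over ordinary device memory.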
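The locality argument above can be illustrated with a small host-side C++ sketch. The real CUDA Array layout is opaque and undocumented, so the Z-order (Morton) indexing below is purely an assumption chosen for illustration, not a description of what the hardware does. It only shows how a layout that interleaves x and y bits keeps a 2x2 neighborhood contiguous, whereas row-major storage puts vertical neighbors a full row apart.

```cpp
#include <cstdint>

// Row-major (pitch-linear) offset: vertical neighbors are `width` elements apart.
inline std::uint32_t row_major_index(std::uint32_t x, std::uint32_t y,
                                     std::uint32_t width) {
    return y * width + x;
}

// Z-order (Morton) offset: interleave the bits of x (even bit positions)
// and y (odd bit positions). This is a hypothetical stand-in for a
// locality-preserving layout; the actual CUDA Array layout is not documented.
inline std::uint32_t morton_index(std::uint16_t x, std::uint16_t y) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 16; ++bit) {
        result |= ((static_cast<std::uint32_t>(x) >> bit) & 1u) << (2 * bit);
        result |= ((static_cast<std::uint32_t>(y) >> bit) & 1u) << (2 * bit + 1);
    }
    return result;
}
```

In this model the texels (0,0), (1,0), (0,1), (1,1) land at offsets 0, 1, 2, 3, while in a row-major image of width 1024 the texel (0,1) sits 1024 elements away from (0,0).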

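The type of a texel, which is restricted to the basic integer and single-precision floating-point types and any of the 1-, 2-, and 4-component vector types defined in char, short, int, long, longlong, float, double that are derived from the basic integer and single-precision floating-point types. The type of each point (texel, texture element) is limited to scalar or vector forms of the basic integer types or float. double is not supported. (You can fetch an int2 and reinterpret it as a double: the former is 2 x 4 bytes and the latter is 8 bytes, so both are 64 bits.) If the guide did not spell this out, many people would try to use double.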
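The int2 workaround mentioned above can be sketched on the host in plain C++. Int2 here is a made-up stand-in for CUDA's built-in int2 type, and the function names are invented for illustration. The sketch assumes the low 32 bits go in x and the high 32 bits in y. On the device, the same recombination is commonly written with the intrinsic __hiloint2double(v.y, v.x) after fetching the int2 from the texture.

```cpp
#include <cstdint>
#include <cstring>

// Host-side stand-in for CUDA's built-in int2 vector type (hypothetical name).
struct Int2 {
    std::int32_t x;  // low 32 bits of the double
    std::int32_t y;  // high 32 bits of the double
};

// Split a 64-bit double into two 32-bit halves so it can be stored as an int2 texel.
inline Int2 pack_double(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    Int2 v;
    v.x = static_cast<std::int32_t>(static_cast<std::uint32_t>(bits & 0xffffffffu));
    v.y = static_cast<std::int32_t>(static_cast<std::uint32_t>(bits >> 32));
    return v;
}

// Reassemble the double from the two halves after the fetch.
inline double unpack_double(Int2 v) {
    const std::uint64_t bits =
        (static_cast<std::uint64_t>(static_cast<std::uint32_t>(v.y)) << 32) |
        static_cast<std::uint64_t>(static_cast<std::uint32_t>(v.x));
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

Because the bits are copied unchanged, unpack_double(pack_double(d)) returns d exactly. This only works with cudaReadModeElementType and point sampling, since any conversion or filtering of the two halves would destroy the bit pattern.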

Whether texture coordinates are normalized or not. By default, textures are referenced using floating-point coordinates in the range [0, N-1] where N is the size of the texture in the dimension corresponding to the coordinate. This is a very useful feature in image processing: normalized coordinates. A 64x64 image and a 32x32 image can both be accessed with the same (normalized) coordinates. Instead of the native X: [0, 63], Y: [0, 63] of the 64x64 image, everything becomes X: [0.0, 1.0), Y: [0.0, 1.0).

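This way exactly the same image-processing code works on images of different resolution or level of detail without modification (the final result will of course differ). In other words, an image that was originally W x H is rescaled to 1x1 in each dimension, and all coordinates become floating-point fractions.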

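For example, what used to be 0, 1, 2, 3, ..., 63 becomes fractions such as 0, 1/64, 2/64, and so on. Because textures have built-in sampling, it does not matter if a coordinate does not land exactly on a texel: the hardware can pick the nearest texel or interpolate between nearby ones. This lets one piece of code handle images of different sizes, which is a nice feature. All of this concerns coordinates: at fetch time they are automatically converted to the actual coordinates of one or more texels in the backing store. A texture fetch can also normalize the returned values automatically. Similarly, textures provide out-of-range and boundary handling, such as clamping to the edge value or returning zero, so out-of-bounds accesses are avoided without the user having to take care of them. With these features, you can think of a texture as an advanced way of reading memory that comes with some free processing.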
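The coordinate mapping described above can be sketched on the host in C++. This is a simplified model of point sampling with normalized coordinates, under the assumption that a normalized coordinate u in [0, 1) selects texel floor(u * N). The function names are invented for illustration and are not CUDA API calls.

```cpp
#include <cmath>

// Point-sampling model: a normalized coordinate u in [0, 1) selects
// texel floor(u * n) in a dimension that holds n texels.
inline int texel_from_normalized(float u, int n) {
    return static_cast<int>(std::floor(u * static_cast<float>(n)));
}

// Normalized coordinate of the center of texel i in a dimension of size n.
inline float normalized_center(int i, int n) {
    return (static_cast<float>(i) + 0.5f) / static_cast<float>(n);
}
```

The same coordinate 0.5 selects texel 32 in a 64-wide image and texel 16 in a 32-wide image, which is what lets one piece of code serve both sizes.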

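(Also, a long time ago people used textures because they came with a texture cache: 1.x hardware had no general-purpose cache such as an L1, but it did have a texture cache, so texture reads performed better.) That no longer matters, especially on newer cards (Maxwell and later), where the texture cache has been merged with the cache used for ordinary memory accesses. On those cards it is called the Unified Cache, which is effectively the L1 data cache and the L1 texture cache combined. So whether you read memory directly or through a texture, you benefit from that cache. What plain reads do not give you are the other texture features: the speedup from the CUDA Array element layout, automatic boundary handling, automatic coordinate mapping, and automatic value mapping. The programmer can, however, implement these by hand in code.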

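One feature worth singling out is border mode, which returns 0 for out-of-range reads. With ordinary memory accesses many people have to check for out-of-range indices, which is tedious. For a very large image or matrix, very few threads actually sit at the edges, yet every thread has to check whether it is within the valid range just to be safe. With texture reads you can skip the check entirely: an out-of-range read does not crash, it simply returns 0 (provided that reading 0 is an acceptable fallback in your algorithm).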

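The addressing mode. It is valid to call the device functions of Section B.8 with coordinates that are out of range. Note that this is the graphics notion of an addressing mode, not the addressing modes familiar from computer science (I deliberately use a different Chinese term for it for that reason). There are several addressing modes, for example: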

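· Border mode (zero outside the valid range). In the illustration, the pink area represents the zero value (or, say, a transparent value). Coordinates are normalized. Within the valid range, X: [0, 1), Y: [0, 1), the fetch returns the normal two-dimensional data, which here is an image of mountains and water.

· Clamp mode. Out-of-range fetches return the value of the nearest edge texel. The edge appears to be stretched outward, forming repeated streaks (the streaks are the edge values).

· Wrap mode. Out-of-range coordinates wrap around to the start, so the result is a tiled, repeating image. Picture moving right from inside the valid range until you cross the edge: the values returned start again from the left side.

· Mirror mode. Similar to wrap mode, but each repetition is flipped.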

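Texture reads never go out of bounds; instead the addressing mode produces one of these automatic effects. The default is clamp mode, which keeps returning the last valid edge value past the boundary, giving the stretching effect. Suppose you have the 2x2 matrix with rows (1, 2) and (3, 4), addressed with non-normalized coordinates. Reading (0, 0) gives 1 and reading (1, 0) gives 2; reading (2, 0), (3, 0), (4, 0), ... keeps giving 2. In border mode, every out-of-range read gives 0. In wrap mode (which requires normalized coordinates), going off the right edge of the first row repeats 1, 2, 1, 2, 1, 2, and so on. An ordinary direct memory access that goes out of bounds will often crash or make the program misbehave. We can always guard it with an if, but with textures you do not have to.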
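The behavior of the four addressing modes on the first row (1, 2) of that matrix can be modeled with a short host-side C++ sketch. It works at the level of integer texel indices rather than floating-point coordinates, which is a simplification: on real hardware wrap and mirror require normalized coordinates. The enum and function names are invented for illustration and only mirror the cudaAddressMode* names from the guide.

```cpp
#include <cstddef>
#include <vector>

// Illustrative counterparts of cudaAddressModeClamp/Border/Wrap/Mirror.
enum class AddressMode { Clamp, Border, Wrap, Mirror };

// Map a possibly out-of-range texel index i onto [0, n), or return -1
// to signal "outside the texture" in border mode.
inline int resolve_index(int i, int n, AddressMode mode) {
    switch (mode) {
    case AddressMode::Clamp:
        return i < 0 ? 0 : (i >= n ? n - 1 : i);
    case AddressMode::Border:
        return (i < 0 || i >= n) ? -1 : i;
    case AddressMode::Wrap:
        return ((i % n) + n) % n;                    // repeats 0..n-1
    case AddressMode::Mirror: {
        const int m = ((i % (2 * n)) + 2 * n) % (2 * n);
        return m < n ? m : 2 * n - 1 - m;            // every other copy is flipped
    }
    }
    return -1;
}

// Fetch from a one-dimensional row of texels; border mode yields 0.
inline float fetch_row(const std::vector<float>& row, int i, AddressMode mode) {
    const int idx = resolve_index(i, static_cast<int>(row.size()), mode);
    return idx < 0 ? 0.0f : row[static_cast<std::size_t>(idx)];
}
```

For the row (1, 2), index 2 reads as 2 in clamp mode, 0 in border mode, 1 in wrap mode, and 2 in mirror mode.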

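Another important automatic feature of textures is free interpolation. If you have texels at non-normalized coordinates 0, 1, 2, 3, 4 and you fetch at coordinate 2.3 with cudaFilterModeLinear, the hardware linearly interpolates the value at 2.3 from the neighboring texels.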
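A host-side C++ sketch of one-dimensional linear filtering makes one subtlety visible. As I read the Texture Fetching appendix of the guide, the fetch at coordinate x uses xB = x - 0.5, i = floor(xB), alpha = frac(xB), and returns (1 - alpha) * T[i] + alpha * T[i+1], so texel centers sit at half-integer coordinates. The sketch below follows that formula with clamp addressing. It uses full float precision for alpha, whereas the hardware interpolates at low precision, and the function name is invented for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU model of a 1D fetch with cudaFilterModeLinear, non-normalized
// coordinates and clamp addressing. Assumes `texels` is non-empty.
inline float linear_fetch_1d(const std::vector<float>& texels, float x) {
    const int n = static_cast<int>(texels.size());
    const float xb = x - 0.5f;            // shift so texel centers are at i + 0.5
    const float fl = std::floor(xb);
    const int i = static_cast<int>(fl);   // left neighbor
    const float alpha = xb - fl;          // weight of the right neighbor
    auto clamped = [&](int k) {
        const int c = k < 0 ? 0 : (k >= n ? n - 1 : k);
        return texels[static_cast<std::size_t>(c)];
    };
    return (1.0f - alpha) * clamped(i) + alpha * clamped(i + 1);
}
```

With texels 0, 10, 20, 30, 40, a fetch at 2.5 returns exactly texel 2 (the value 20), while a fetch at 2.3 blends texels 1 and 2 with weights 0.2 and 0.8, giving about 18. To sample "between texel 2 and texel 3" you would therefore pass a coordinate between 2.5 and 3.5.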

Textures have many small details like this. Overlook them and you will get wrong results.

Originally published on the WeChat official account 吉浦迅科技 (gpusolution)

Original publication date: 2018-05-19
