DAY55：阅读 Formatted Output

GPUS Lady

发布于 2018-08-01 15:03:10

4470

发布于 2018-08-01 15:03:10

文章被收录于专栏：GPUS开发者

我们正带领大家开始阅读英文的《CUDA C Programming Guide》,今天是第55天，我们正在讲解CUDA C语法，希望在接下来的45天里，您可以学习到原汁原味的CUDA，同时能养成英文阅读的习惯。

B.19. Formatted Output

Formatted output is only supported by devices of compute capability 2.x and higher.

int printf(const char *format[, arg, ...]);

prints formatted output from a kernel to a host-side output stream.

The in-kernel printf() function behaves in a similar way to the standard C-library printf() function, and the user is referred to the host system's manual pages for a complete description of printf()behavior. In essence, the string passed in as format is output to a stream on the host, with substitutions made from the argument list wherever a format specifier is encountered. Supported format specifiers are listed below.

The printf() command is executed as any other device-side function: per-thread, and in the context of the calling thread. From a multi-threaded kernel, this means that a straightforward call to printf() will be executed by every thread, using that thread's data as specified. Multiple versions of the output string will then appear at the host stream, once for each thread which encountered the printf().

It is up to the programmer to limit the output to a single thread if only a single output string is desired (see Examples for an illustrative example).

Unlike the C-standard printf(), which returns the number of characters printed, CUDA's printf() returns the number of arguments parsed. If no arguments follow the format string, 0 is returned. If the format string is NULL, -1 is returned. If an internal error occurs, -2 is returned.

B.19.1. Format Specifiers

As for standard printf(), format specifiers take the form: %[flags][width][.precision][size]type

The following fields are supported (see widely-available documentation for a complete description of all behaviors):

Flags: `#' ` ' `0' `+' `-'
Width: `*' `0-9'
Precision: `0-9'
Size: `h' `l' `ll'
Type: `%cdiouxXpeEfgGaAs'

Note that CUDA's printf()will accept any combination of flag, width, precision, size and type, whether or not overall they form a valid format specifier. In other words, "%hd" will be accepted and printf will expect a double-precision variable in the corresponding location in the argument list.

B.19.2. Limitations

Final formatting of the printf() output takes place on the host system. This means that the format string must be understood by the host-system's compiler and C library. Every effort has been made to ensure that the format specifiers supported by CUDA's printf function form a universal subset from the most common host compilers, but exact behavior will be host-OS-dependent.

As described in Format Specifiers, printf() will accept all combinations of valid flags and types. This is because it cannot determine what will and will not be valid on the host system where the final output is formatted. The effect of this is that output may be undefined if the program emits a format string which contains invalid combinations.

The printf() command can accept at most 32 arguments in addition to the format string. Additional arguments beyond this will be ignored, and the format specifier output as-is.

Owing to the differing size of the long type on 64-bit Windows platforms (four bytes on 64-bit Windows platforms, eight bytes on other 64-bit platforms), a kernel which is compiled on a non-Windows 64-bit machine but then run on a win64 machine will see corrupted output for all format strings which include "%ld". It is recommended that the compilation platform matches the execution platform to ensure safety.

The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:

Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
Context destruction via cudaDeviceReset() or cuCtxDestroy().
Prior to executing a stream callback added by cudaStreamAddCallback or cuStreamAddCallback.

Note that the buffer is not flushed automatically when the program exits. The user must call cudaDeviceReset() or cuCtxDestroy() explicitly, as shown in the examples below.

Internally printf() uses a shared data structure and so it is possible that calling printf() might change the order of execution of threads. In particular, a thread which calls printf() might take a longer execution path than one which does not call printf(), and that path length is dependent upon the parameters of the printf(). Note, however, that CUDA makes no guarantees of thread execution order except at explicit __syncthreads() barriers, so it is impossible to tell whether execution order has been modified by printf() or by other scheduling behaviour in the hardware.

B.19.3. Associated Host-Side API

The following API functions get and set the size of the buffer used to transfer the printf() arguments and internal metadata to the host (default is 1 megabyte):

cudaDeviceGetLimit(size_t* size,cudaLimitPrintfFifoSize)
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size_t size)

B.19.4. Examples

The following code sample:

will output:

Notice how each thread encounters the printf() command, so there are as many lines of output as there were threads launched in the grid. As expected, global values (i.e., float f) are common between all threads, and local values (i.e., threadIdx.x) are distinct per-thread.

The following code sample:

will output:

Hello thread 0, f=1.2345

Self-evidently, the if() statement limits which threads will call printf, so that only a single line of output is seen.

本文备注/经验分享：

今天的章节是著名的printf()。如同每个人写程序都会从printf("Hello, World\n");开始一样, 这个是每个人几乎都熟悉的东西, 但是在CUDA中, 并不是一开始就提供的.而且它经历了流行到不流行的变化.如同上次说的, 它和assert()一样, 基本上主要用途是进行辅助调试的，并非像你在CPU上编程那样, 进行主力输出之类的用途.(因为它的代价比较昂贵, IO本身就不是一个快速的操作, 而在目前的CUDA中, 它被实现被设备端的系统调用, 性能不好)。 printf()最初在CUDA中并没有提供, 而当时大家对它的使用要求的呼声很高.于是NV对部分注册用户, 提供了一个叫cuprintf的非标准扩展.后来发现不错, 于是从Fermi起(计算能力2.0), 正式提供了printf，所以你看到一些书上, 写的是cuprintf()请不要感到惊讶.这只是以前的老代码的残留而已. 直接当成printf去理解这些书即可. 而不要在论坛发帖询问为何. 类似的, 以前的一些老书或者项目还有一种情况, 是对printf这样用的:

#if __CUDA_ARCH >= 200 printf(......); #endif

这是因为以前的CUDA版本中(8.0之前), 需要这样写,以便能取得在当时的CUDA版本中, 能对所有的计算能力都通过编译(计算能力1.X不支持printf, 因此可以被#if屏蔽掉, 从而不影响编译). 现在已经开始普及CUDA 9.2+了(9.2这个版本, 加上额外的TensorCore补丁(在9.2的下载页面就有), 可以充分发挥现在的卡的能力,例如可以有效的针对需要FP16的图像处理, 或者深度学习之类的用途因此大家基本上没有使用老版本的机会了. 因此可以不用#if保护起来了. 但需要注意的是, 在CUDA 8.0中(这个版本还是有很多人用, 因为这个是最后一个和传统CUDA保持最大兼容性的版本了)，printf依然能发挥能力.因为CUDA 8.0是最后支持计算能力2.X的开发的版本, 而此版本的NSight却不支持2.0的调试.也就是说, 在CUDA 8.0 + Fermi的组合下, NSight 5.2(自带的版本)用不起来的时候,今天说的printf(), 和上次说的assert(), 依然是你调试用户的好帮手.实际上, 用户应当知道printf当初的主要由来, 当年NSight还叫Nexus的时候,调试是需要双卡专业卡才能调试的.当年很多人无法使用几万元的卡, 因此printf是当时必须的选择(你也可以选择将数据复制回来然后查看, 但毕竟不方便).后来NSight逐渐开放了, 变成双卡, 其中有一张卡是专业卡即可.再后来编程普通双卡即可.变成*，再再最后, 变成了普通一张卡即可.也就是目前现在的状态.因此现在大部分的时候, 使用printf不再是必须的了, 因为NSight往往更方便(但刚才的CUDA 8 + Fermi的情况还是偶而需要用的).类似的, 很多OpenCL的实现中, 并不能使用调试器.例如NV的OpenCL, 不能通过NSight调试, 此时printf依然非常重要.再例如, AMD的OpenCL Kernel, 一旦使用了原子操作, 就会被CodeXL调试器拒绝调试, 此时使用printf也成了必须的选择. 既然知道了printf主要是为了辅助调试的, 那么它的2个主要缺点, 往往可以无视

1)printf并不提供跨平台的兼容性. 很多时候, 我们往往给客户建议, 可以在Windows上调试好kernel, 然后直接移动到Linux下即可使用.只要注意类似sizeof(long), uint64_t这种类型的使用, 基本上, CUDA C的跨平台性是非常好的. 　

但是很遗憾, printf是为数不多的例外.因为为了代价和性能上的考虑, 以及和Host端的编译器的兼容性的考虑.实际上并没有一个真正的GPU设备端的printf()的，目前的实现总是只是将需要显示的数据, 例如printf("Its %d\n", 123)里面的123, 直接在kernel完成后, 复制到host端上, 然后在CPU端调用普通的printf, 进行显示的.而并非在GPU上, 就地拼接出来, 例如上面的"Its 123"的字符串的. 这会带来一个主要问题：不同的C编译器的printf格式符的描述是不一样的.因此同样的一段代码, 在不同的平台上, 会有不同的显示效果. 甚至有的平台上的特定格式字符串会给出错误的显示结果.这点需要注意.手册也说了, 主要需要注意long的区别.因为历史和兼容性的原因, 不同的64-bit平台下, I(Integer), L(Long), P(Pointer)有的全部是64-bit, 有的只有部分是, 其他保留32-bit.这个其实在之前的章节中稍微提到一点，也是为何之前我们总是建议使用明确大小的类型的原因.例如uint32_t, uint64_t这些, 而不是short, int, long，后者的大小是不定的, 标准只约定了short小于int, int小于long, 而没有具体约定大小.用uint32_t这些, 明确的你能知道就是4B(32-bit)大小.但是很遗憾的, 目前的GPU端的printf, 并不能接受,类似I32, I64这种明确的大小, 而只能使用ll, l, h这种.ll代表long long或者longlong*向量类型中的分量(请参考我们之前的向量类型章节)：l则代表long，h则代表short，你需要分别知道每种对应的大小的.好在目前只有l(long)会造成迷惑, 这个在Windows上是4B, Linux上是8B的(64-bit下).有人会说, 你这个是NVCC + VC(Windows), 和NVCC + GCC(Linux)吧.万一我别出心裁的想在Windows上使用NVCC + GCC的组合(例如Cygwin下)怎么办?幸运的是, NVCC在Windows下并不支持GCC(包括Cygwin和MingW). 这样不存在混合的情况的考虑了.这是第一点. (2)点则是, printf本身性能缓慢, 同时有最大输出大小限制.这点杜绝了它做为除了调试用途外的, 普通输出用途的考虑.首先你既然使用了GPU, 往往是为了提速, 你不可能大量使用一个慢速调用的.其次, 如果真需要有大量的输出用来观察的场合(例如, 一张1920x1080的图片,这大约有200万个点. 你不可能逐个观察每个点的值的. 而往往是直接cudaMemcpy回来, 当作图片显示来看的.此时也无需使用printf(). 而默认的输出大小虽然有限制, 但这个默认的限制其实就很大了. printf既然是输出字符串给人看的. 你顶多一次能看到几屏的滚动内容，心力好的, 可能能看个几十屏的文字. 但再多就没有意义了.因此这往往不是一个问题. 但如果真有人需要看很多, 也可以看我们论坛的这个主题的讨论的例子:http://bbs.gpuworld.cn/thread-10517-1-1.htmls 和http://bbs.gpuworld.cn/thread-10517-1-1.html 这个例子同时遭遇了这(2)点中的两个方面.一个是运行缓慢导致超时(不用printf会更快一些)，一个是最大的printf输出大小有限制. 大致这些主要内容. 需要注意的是: printf()辅助调试, 并非是"非侵入式"的, (一般认为cuda-gdb, 或者nsight这种是, 不会造成额外干扰)，用它进行调试, 会引入额外的代码执行变化. 也就是说, 如果一段代码, 使用printf和没有使用printf, 例如造成了变量生存期的改变，你会发现一种很尴尬的场景, 使用printf的时候, 好不容易调试正确了, 然后准备正式运行, 准备去掉printf,发现代码挂了.这就很令人郁闷了.这是因为printf本身也是代码的一部分. 而不像NSight那样, 不会对代码本身造成改动.所以用户应当注意.这也是条件使用NSight而不是printf之类的原因. 还需要注意的是，GPU端的printf, 支持的格式描述有限, 我们常用的对size_t的输出(例如一个Sizeof(结构体)),常用的%zu格式(GCC或者较新版本的VC(例如2015+)), 就不能在GPU端使用. printf做为经典调试手段, 肯定会长期存在的.无论是CPU端, GPU端, 甚至是MCU的调试协议上的printf overlay, 还是硬件的串口printf,这些在日常的生活中处处可见, 不会消失，不过做为具体的CUDA用户, 可以有选择的使用它.用的时候小心一点即可(注意之前说的副作用, 干扰之类的字样). 调试是一个需要时间经验积累的活，调试个几年出来, 慢慢也就习惯了, 也不会遇到这些常见的干扰导致的问题了.所以要有耐心.新人建议总是NSight的.

有不明白的地方，请在本文后留言

或者在我们的技术论坛bbs.gpuworld.cn上发帖、

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2018-07-25，如有侵权请联系 cloudcommunity@tencent.com 删除

其他

本文分享自 GPUS开发者微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

其他

登录后参与评论

0 条评论

热度