# DAY34: Reading Arithmetic Instructions

### 5.4.1. Arithmetic Instructions

Table 2 gives the throughputs of the arithmetic instructions that are natively supported in hardware for devices of various compute capabilities.

(Table 2, which lists these throughputs by compute capability, is not reproduced here.)

Other instructions and functions are implemented on top of the native instructions. The implementation may be different for devices of different compute capabilities, and the number of native instructions after compilation may fluctuate with every compiler version. For complicated functions, there can be multiple code paths depending on input. cuobjdump can be used to inspect a particular implementation in a cubin object.

The implementations of some functions are readily available in the CUDA header files (math_functions.h, device_functions.h, ...).

In general, code compiled with -ftz=true (denormalized numbers are flushed to zero) tends to have higher performance than code compiled with -ftz=false. Similarly, code compiled with -prec-div=false (less precise division) tends to have higher performance than code compiled with -prec-div=true, and code compiled with -prec-sqrt=false (less precise square root) tends to have higher performance than code compiled with -prec-sqrt=true. The nvcc user manual describes these compilation flags in more detail.
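As a concrete illustration, these flags are passed directly on the nvcc command line; --use_fast_math implies all three (among others) at the cost of precision. The file names here are just placeholders:

```
nvcc -ftz=true -prec-div=false -prec-sqrt=false -o app app.cu
```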

#### Single-Precision Floating-Point Division

__fdividef(x, y) (see Intrinsic Functions) provides faster single-precision floating-point division than the division operator.
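A minimal kernel sketch (the kernel and parameter names are hypothetical) showing the intrinsic as a drop-in replacement for the / operator:

```cuda
__global__ void fast_div(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Faster than x[i] / y[i], but not IEEE-compliant: it loses
        // accuracy in corner cases (e.g., very large |y|).
        out[i] = __fdividef(x[i], y[i]);
    }
}
```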

#### Single-Precision Floating-Point Reciprocal Square Root

To preserve IEEE-754 semantics, the compiler can optimize 1.0/sqrtf() into rsqrtf() only when both the reciprocal and the square root are approximate (i.e., with -prec-div=false and -prec-sqrt=false). It is therefore recommended to invoke rsqrtf() directly where desired.
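For example, a hypothetical kernel that calls the intrinsic directly, so the fast path does not depend on compiler flags:

```cuda
__global__ void inv_sqrt(const float* v, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // 1.0f / sqrtf(v[i]) only compiles to the fast rsqrt instruction
        // under -prec-div=false and -prec-sqrt=false; rsqrtf() always does.
        out[i] = rsqrtf(v[i]);
    }
}
```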

#### Single-Precision Floating-Point Square Root

Single-precision floating-point square root is implemented as a reciprocal square root followed by a reciprocal instead of a reciprocal square root followed by a multiplication so that it gives correct results for 0 and infinity.

#### Sine and Cosine

sinf(x), cosf(x), tanf(x), sincosf(x), and the corresponding double-precision functions are much more expensive, and even more so when the argument x is large in magnitude.

More precisely, the argument reduction code (see Mathematical Functions for implementation) comprises two code paths referred to as the fast path and the slow path, respectively.

The fast path is used for arguments sufficiently small in magnitude and essentially consists of a few multiply-add operations. The slow path is used for arguments large in magnitude and consists of lengthy computations required to achieve correct results over the entire argument range.

At present, the argument reduction code for the trigonometric functions selects the fast path for arguments whose magnitude is less than 105615.0f for the single-precision functions, and less than 2147483648.0 for the double-precision functions.

As the slow path requires more registers than the fast path, an attempt has been made to reduce register pressure in the slow path by storing some intermediate variables in local memory, which may affect performance because of the high latency and limited bandwidth of local memory (see Device Memory Accesses). At present, 28 bytes of local memory are used by the single-precision functions and 44 bytes by the double-precision functions. However, the exact amount is subject to change.

Due to the lengthy computations and use of local memory in the slow path, the throughput of these trigonometric functions is lower by one order of magnitude when the slow path reduction is required as opposed to the fast path reduction.
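To make the path selection concrete, here is a hypothetical kernel sketch; which path each thread takes depends only on the magnitude of its argument:

```cuda
__global__ void sin_cos(const float* theta, float* s, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // |theta[i]| < 105615.0f: fast path (a few multiply-adds).
        // Larger magnitudes: slow path, with local-memory spills.
        sincosf(theta[i], &s[i], &c[i]);
    }
}
```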

#### Integer Arithmetic

Integer division and modulo operations are costly as they compile to up to 20 instructions. They can be replaced with bitwise operations in some cases: if n is a power of 2, (i/n) is equivalent to (i >> log2(n)) and (i%n) is equivalent to (i & (n-1)) for non-negative i; the compiler will perform these conversions if n is a literal.
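A sketch of the rewrite for a literal power of 2 (the function name is hypothetical):

```cuda
__device__ void divmod16(int i, int* q, int* r) {
    // n = 16 is a power of 2 with log2(16) == 4.
    *q = i >> 4;       // i / 16, valid as written for i >= 0
    *r = i & (16 - 1); // i % 16, valid as written for i >= 0
}
```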

__brev and __popc map to a single instruction and __brevll and __popcll to a few instructions.
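For instance (a hypothetical device function):

```cuda
__device__ unsigned int bit_stats(unsigned int x) {
    unsigned int rev = __brev(x); // 32-bit bit reversal: one instruction
    int ones = __popc(x);         // 32-bit population count: one instruction
    return rev ^ (unsigned int)ones;
}
```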

__[u]mul24 are legacy intrinsic functions that no longer have any reason to be used.

#### Half Precision Arithmetic

In order to achieve good half-precision floating-point add, multiply, or multiply-add throughput, it is recommended to use the half2 datatype. Vector intrinsics (e.g., __hadd2, __hsub2, __hmul2, __hfma2) can then be used to perform two operations in a single instruction. Using half2 in place of two calls using half may also improve the performance of other intrinsics, such as warp shuffles.

The intrinsic __halves2half2 is provided to convert two half precision values to the half2 datatype.
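A minimal sketch, assuming the data is already laid out as half2 pairs and a GPU that supports half2 arithmetic (compute capability 5.3 or higher); the kernel and function names are hypothetical:

```cuda
#include <cuda_fp16.h>

__global__ void fma_half2(const half2* a, const half2* b,
                          const half2* c, half2* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // One __hfma2 performs two half-precision fused multiply-adds.
        out[i] = __hfma2(a[i], b[i], c[i]);
    }
}

__device__ half2 pack_pair(half lo, half hi) {
    // __halves2half2 packs two half values into one half2.
    return __halves2half2(lo, hi);
}
```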

#### Type Conversion

Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for:

- Functions operating on variables of type char or short whose operands generally need to be converted to int,
- Double-precision floating-point constants (i.e., those constants defined without any type suffix) used as input to single-precision floating-point computations (as mandated by C/C++ standards).

This last case can be avoided by using single-precision floating-point constants, defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f.
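For example (a hypothetical kernel), the commented-out line forces the multiply into double precision with a conversion back to float, while the suffixed constants keep the whole expression in single precision:

```cuda
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // data[i] = data[i] * 0.5 + 1.0;  // double-precision constants
        data[i] = data[i] * 0.5f + 1.0f;   // single-precision constants
    }
}
```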
