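For kernels written in the single work-item form, the most important optimization strategy is to get every loop to pipeline, with an II (initiation interval) as close to 1 as possible.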
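To see why the II matters, a common first-order model (an approximation, not a number reported by the compiler) puts the cycle count of a pipelined loop with n iterations and a per-iteration latency of L cycles at:

$$T \approx L + \mathrm{II} \cdot (n - 1)$$

With n = 128 and an assumed L = 10, II = 1 gives about 137 cycles. A loop that is not pipelined starts each iteration only after the previous one has finished, which behaves like II = L and takes about 1280 cycles.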
This section summarizes the situations in which a loop cannot be pipelined. They fall roughly into three categories: unresolvable loop exit condition, nonlinear execution, and out-of-order loop iterations.
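Unresolvable loop exit condition: the loop's exit condition is a memory access or some other complex operation, so the compiler cannot determine the exit bound when the loop starts.
In the example below, the exit condition of the outer loop involves a memory access. The compiler cannot determine the exit bound when an iteration is initiated, so pipelining fails.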
#define N 128

__kernel void exitcond( __global unsigned* restrict input,
                        __global unsigned* restrict result )
{
    unsigned i = 0;
    unsigned sum = 0;
    // The exit test reads global memory on every iteration.
    while( input[ i++ ] < N ) {
        for ( unsigned j = 0; j < N; j++ )
            sum += input[i+j];
    }
    *result = sum;
}
Loop Report:
-+ Loop "block1"
| NOT pipelined due to:
| Loop exit condition unresolvable at iteration initiation.
|
|-+ Loop "block2"
Pipelined well. Successive iterations are launched every cycle.
Fix: restructure the code so that the loop exit bound does not depend on a memory access or other complex operation.
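As a rough sketch of such a restructuring, assume the host can work out ahead of time how many outer iterations are needed and pass that number in as a kernel argument. The `count` parameter and the kernel name `exitcond_counted` are hypothetical and not part of the original example. The `#ifndef` shim is only there so the same source can also be compiled as plain C99 to check the arithmetic on the host.

```c
#ifndef __OPENCL_VERSION__   /* shim: lets this source build as plain C99 on the host */
#define __kernel
#define __global
#endif

#define N 128

/* Hypothetical rewrite of exitcond: the trip count arrives as an argument,
   so the outer exit test is a register compare, not a global memory read.
   count is assumed to be the index of the first element with input[k] >= N. */
__kernel void exitcond_counted( __global const unsigned* restrict input,
                                __global unsigned* restrict result,
                                unsigned count )
{
    unsigned sum = 0;
    for ( unsigned i = 0; i < count; i++ ) {
        /* The original post-increments i before the body runs, hence i + 1. */
        for ( unsigned j = 0; j < N; j++ )
            sum += input[i + 1 + j];
    }
    *result = sum;
}
```

Whether the compiler then pipelines the outer loop still has to be confirmed in the loop report; the sketch only removes the memory read from the exit test.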
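Nonlinear execution: once you understand how nested for loops are executed, it becomes clear that in this case the compiler cannot insert outer-loop iterations into the inner loop's pipeline, so the outer loop cannot be pipelined.
In the example below, each outer-loop iteration executes only one of the two inner for loops, depending on a condition, so outer-loop iterations cannot be inserted.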
kernel void structure (global unsigned* restrict output1,
                       global unsigned* restrict output2,
                       int N)
{
    for (unsigned i = 0; i < N; i++) {
        if ((i & 3) == 0) {
            for (unsigned j = 0; j < N; j++) {
                output1[i+j] = i * j;
            }
        } else {
            for (unsigned j = 0; j < N; j++) {
                output2[i+j] = i * j;
            }
        }
    }
}
Loop Report:
+ Loop "Block2" (file test.cl line 5)
| NOT pipelined due to:
|
| Loop structure: loop contains divergent inner loops.
| Making all inner loops unconditional should fix this problem.
| See "Loop Structure Does Not Support Linear Execution" section of the Best Practices Guide for more information.
| Not pipelining this loop will most likely lead to poor performance.
|
|-+ Loop "Block3" (file test.cl line 7)
| Pipelined well. Successive iterations are launched every cycle.
|
|-+ Loop "Block4" (file test.cl line 11)
Pipelined well. Successive iterations are launched every cycle.
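Fix: make all inner loops unconditional, as the loop report suggests, and move the condition out of the inner loops. For example:
for (unsigned i = 0; i < N; i++)
{
    for (unsigned j = 0; j < N; j++) {
        ...
        output1=...
    }
    for (unsigned j = 0; j < N; j++) {
        ...
        output2=...
    }
    if ((i & 3) == 0) {
        output = output1;
    } else {
        output = output2;
    }
}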
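One concrete way to make the inner loop of the `structure` kernel unconditional is to keep a single inner loop and move the `(i & 3) == 0` test inside it, so that the test only selects which buffer is written. This is a sketch rather than the original author's code: the kernel name `structure_linear` is made up, and the `#ifndef` shim exists only so the source also builds as plain C99 for a host-side check.

```c
#ifndef __OPENCL_VERSION__   /* shim: lets this source build as plain C99 on the host */
#define __kernel
#define __global
#endif

/* Hypothetical rewrite of structure: one inner loop that always runs N times. */
__kernel void structure_linear( __global unsigned* restrict output1,
                                __global unsigned* restrict output2,
                                int N )
{
    for (unsigned i = 0; i < (unsigned)N; i++) {
        for (unsigned j = 0; j < (unsigned)N; j++) {
            unsigned value = i * j;
            /* The branch now lives inside the loop body and only picks the target. */
            if ((i & 3) == 0)
                output1[i + j] = value;
            else
                output2[i + j] = value;
        }
    }
}
```

The writes are the same as in the divergent version, in the same order. Whether both loops end up pipelined should still be checked in the loop report.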
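Out-of-order loop iterations are a fairly common case. They usually occur in nested loops, when the inner loop's trip count changes from one outer-loop iteration to the next. As a result, none of the enclosing outer loops can be pipelined.
There are many ways the inner loop's trip count can vary, for example: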
for(unsigned i=0; i<N; i++){
    for(unsigned j=0; j<i; j++){
    }
}

for(unsigned i=0; i<N; i++){
    if(i>3){
        for(unsigned j=0; j<i; j++){
        }
    }
}

for(unsigned i=0; i<N; i++){
    for(unsigned j=0; j<i; j++){
        ...
        if(){
            break; // note: avoid break where possible
        }
    }
}
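All of these cases make the outer loop fail to pipeline, because the compiler cannot insert outer-loop iterations while the inner loop is being pipelined.
Under normal circumstances, the pipeline should look like this:
[Figure: pipeline-31]
When out-of-order loop iterations occur, however, the loop executes as shown below, and performance drops sharply:
[Figure: pipeline-32]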
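The following example belongs to the first case above: the loop bound is a variable.
The bound of the inner loop is i. When i=0 the inner loop runs 0 times, when i=1 it runs once, when i=2 it runs twice, and so on, so the trip count is different every time. As a result, the outer loop cannot be pipelined.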
kernel void order( global unsigned* restrict input,
                   global unsigned* restrict output,
                   int N )
{
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < i; j++) {
            sum += input[i+j];
        }
    }
    output[0] = sum;
}
Loop Report:
-+ Loop "block1"
| NOT pipelined due to:
| Loop iteration ordering, iterations may get out of order with respect to:
|
| Loop "block2"
|
|-+ Loop "block2"
Pipelined well. Successive iterations are launched every cycle.
Fix: change the algorithm and reorganize the code, for example:
for(unsigned i=0; i<N; i++){
    for(unsigned j=0; j<N; j++){
        if(j<i){
            ...
        }else{
            // empty
        }
    }
}
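Applied to the `order` kernel from the example, this pattern could look like the sketch below: the inner loop always runs `N` iterations, and the `j < i` test only decides whether an element is accumulated. The kernel name `order_fixed` is an assumption for illustration, and the `#ifndef` shim is only there so the source also compiles as plain C99 for a host-side check.

```c
#ifndef __OPENCL_VERSION__   /* shim: lets this source build as plain C99 on the host */
#define __kernel
#define __global
#endif

/* Hypothetical rewrite of order: the inner trip count no longer depends on i. */
__kernel void order_fixed( __global const unsigned* restrict input,
                           __global unsigned* restrict output,
                           int N )
{
    unsigned sum = 0;
    for (unsigned i = 0; i < (unsigned)N; i++) {
        for (unsigned j = 0; j < (unsigned)N; j++) {
            /* Only the first i iterations contribute; the rest do nothing. */
            if (j < i)
                sum += input[i + j];
        }
    }
    output[0] = sum;
}
```

The trade-off is that the inner loop now runs N * N times in total instead of N * (N - 1) / 2, roughly twice as many iterations, in exchange for a structure in which the outer loop has a chance to pipeline. The loop report remains the place to confirm that it does.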
for(unsigned i=0; i<N; i++){
    for(unsigned j=0; j<N; j++){
        if(){
            ...
        }
    }
}
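This last pattern can also stand in for an early `break`. As a sketch, suppose each row of the input should be summed up to its first zero element; the kernel `sum_until_zero`, its arguments, and the stop condition are all invented for illustration. The `#ifndef` shim again only serves to let the source build as plain C99 on the host.

```c
#ifndef __OPENCL_VERSION__   /* shim: lets this source build as plain C99 on the host */
#define __kernel
#define __global
#endif

/* Hypothetical kernel: input holds N rows of N elements; output[i] receives the
   sum of row i up to, but not including, its first zero element. */
__kernel void sum_until_zero( __global const unsigned* restrict input,
                              __global unsigned* restrict output,
                              int N )
{
    for (unsigned i = 0; i < (unsigned)N; i++) {
        unsigned sum = 0;
        int done = 0;                 /* takes the place of break */
        for (unsigned j = 0; j < (unsigned)N; j++) {
            unsigned v = input[i * (unsigned)N + j];
            if (v == 0)
                done = 1;             /* remember that the row has ended */
            if (!done)
                sum += v;             /* later iterations become no-ops */
        }
        output[i] = sum;
    }
}
```

The inner loop keeps a fixed trip count of N, so its iteration count no longer varies between outer iterations, at the cost of some idle iterations after the stop condition is met.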
[Intel FPGA SDK for OpenCL Best Practices Guide]