## Why is a simple loop optimized when the limit is 959 but not 960? Content sourced from Stack Overflow, translated and used under the CC BY-SA 3.0 license


```
float f(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 959; i++)
        p += 1;
    return p;
}
```

```
.LCPI0_0:
        .long   1148190720              # float 960
f:                                      # @f
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        ret
```

```
float f(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 960; i++)
        p += 1;
    return p;
}
```

```
.LCPI0_0:
        .long   1065353216              # float 1
.LCPI0_1:
        .long   1086324736              # float 6
f:                                      # @f
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        vxorps  ymm1, ymm1, ymm1
        mov     eax, 960
        vbroadcastss    ymm2, dword ptr [rip + .LCPI0_1]
        vxorps  ymm3, ymm3, ymm3
        vxorps  ymm4, ymm4, ymm4
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vaddps  ymm0, ymm0, ymm2
        vaddps  ymm1, ymm1, ymm2
        vaddps  ymm3, ymm3, ymm2
        vaddps  ymm4, ymm4, ymm2
        add     eax, -192
        jne     .LBB0_1
        vaddps  ymm0, ymm1, ymm0
        vaddps  ymm0, ymm3, ymm0
        vaddps  ymm0, ymm4, ymm0
        vextractf128    xmm1, ymm0, 1
        vaddps  ymm0, ymm0, ymm1
        vpermilpd       xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
        vaddps  ymm0, ymm0, ymm1
        vhaddps ymm0, ymm0, ymm0
        vzeroupper
        ret
```
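The arithmetic in the vectorized listing above can be followed with a scalar model. This is only a sketch of what the assembly appears to compute, not compiler output: the function name `simulate_vectorized_loop` and the array layout are illustrative assumptions. It models four accumulators of eight float lanes each, adds 6.0 to every lane per trip, and counts down from 960 in steps of 192.

```c
#include <assert.h>

/* Illustrative names: 4 accumulators stand in for ymm0, ymm1, ymm3, ymm4,
   each with 8 single-precision lanes. */
enum { NUM_ACCUMULATORS = 4, LANES = 8 };

float simulate_vectorized_loop(void) {
    float acc[NUM_ACCUMULATORS][LANES] = {{0}};
    acc[0][0] = 1.0f;            /* vmovss: lane 0 of ymm0 starts as p = 1.0 */

    int counter = 960;           /* mov eax, 960 */
    int trips = 0;
    do {
        /* four vaddps instructions, each adding the broadcast 6.0 (ymm2) */
        for (int a = 0; a < NUM_ACCUMULATORS; a++)
            for (int lane = 0; lane < LANES; lane++)
                acc[a][lane] += 6.0f;
        counter -= 192;          /* add eax, -192 */
        trips++;
    } while (counter != 0);      /* jne .LBB0_1 */

    assert(trips == 5);          /* 960 / 192 trips through the loop body */

    /* Horizontal reduction: the tail of the listing sums all lanes. */
    float total = 0.0f;
    for (int a = 0; a < NUM_ACCUMULATORS; a++)
        for (int lane = 0; lane < LANES; lane++)
            total += acc[a][lane];
    return total;                /* 1 + 32 lanes * 5 trips * 6 */
}
```

Each trip covers 192 scalar iterations (32 lanes times 6), so five trips account for all 960 additions, and the reduction yields the same value the scalar loop would produce.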

## GCC version <= 6.3.0

The relevant GCC optimization option is `-fpeel-loops`, which is enabled indirectly along with the `-Ofast` flag.

```
$ head test.c.151t.cunroll

;; Function f (f, funcdef_no=0, decl_uid=1919, cgraph_uid=0, symbol_order=0)

Not peeling: upper bound is known so can unroll completely
```

That message appears to come from this check in the GCC source:

```
if (maxiter >= 0 && maxiter <= npeel)
  {
    if (dump_file)
      fprintf (dump_file, "Not peeling: upper bound is known so can "
               "unroll completely\n");
    return false;
  }
```

```
Loop 1 iterates 959 times.
Loop 1 iterates at most 959 times.
Not unrolling loop 1 (--param max-completely-peeled-times limit reached).
Not peeling: upper bound is known so can unroll completely
```

`-march=core-avx2 -Ofast --param max-completely-peeled-insns=1000 --param max-completely-peel-times=1000`

```
f:
        vmovss  xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1148207104
```

```
#pragma unroll
for (int i = 0; i < 960; i++)
    p++;
```

```
.LCPI0_0:
        .long   1148207104              # float 961
f:                                      # @f
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        ret
```

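1. If the loop count is a constant (and not too high), the compiler fully unrolls the loop.
2. Once the loop is unrolled, the compiler sees that the additions can be folded into a single operation.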
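
The two steps above can be checked against the constants in the assembly listings with a small sketch. The helper names `add_ones` and `float_bits` are illustrative, and the code assumes `float` is an IEEE 754 32-bit value, which is what the `.long` constants in the listings imply. It runs the loop from the question at run time and compares the bit pattern of the result with the constant the compiler emitted after folding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The loop from the question, with the trip count as a parameter
   (hypothetical helper, not part of the original code). */
float add_ones(int n) {
    float p = 1.0f;
    for (int i = 0; i < n; i++)
        p += 1;
    return p;
}

/* Returns the raw bit pattern of a float, i.e. the integer that would
   appear in a `.long` directive. Assumes a 32-bit IEEE 754 float. */
uint32_t float_bits(float value) {
    uint32_t bits;
    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);   /* well-defined type punning */
    return bits;
}
```

On such a platform, `float_bits(add_ones(959))` matches the `1148190720` constant (960.0) from the first listing, and `float_bits(add_ones(960))` matches `1148207104` (961.0), the value produced once full unrolling is allowed.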