At least as of version 14, the Intel compiler does not generate aliasing checks for <code>threshold2</code> in the code you linked, which indicates that your approach should work. The gcc auto-vectorizer, however, misses this optimization opportunity: it still generates vectorized code, but along with tests for proper alignment, tests for aliasing, and non-vectorized fall-back/clean-up code.

If you are using the Intel compiler, you can try adding the line:

<pre><code>#pragma ivdep 
</code></pre>

The following paragraph is quoted from the Intel compiler user manual:

<blockquote>
 The ivdep pragma instructs the compiler to ignore assumed vector
 dependencies. To ensure correct code, the compiler treats an assumed
 dependence as a proven dependence, which prevents vectorization. This
 pragma overrides that decision. Use this pragma only when you know
 that the assumed loop dependencies are safe to ignore.
</blockquote>

In gcc, one should add the line: 

<pre><code>#pragma GCC ivdep
</code></pre>

inside the function, right before the loop you want to vectorize (see the <a href="https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html" rel="nofollow">documentation</a>). This is only supported starting from gcc 4.9 and, incidentally, makes the use of <code>__restrict__</code> redundant for that loop.
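As a minimal sketch of where the pragma goes, assuming plain byte buffers rather than <code>cv::Mat</code>: the names <code>threshold_row</code> and <code>ranges_disjoint</code> are hypothetical and not part of the question's code. The pragma is a promise to the compiler, so the sketch also checks that promise in debug builds; the preprocessor guard is only there so that other compilers do not warn about an unknown pragma.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical debug helper (not from the question): true when the two
// n-byte ranges starting at a and b do not overlap.
bool ranges_disjoint(const void* a, const void* b, std::size_t n) {
    const auto pa = reinterpret_cast<std::uintptr_t>(a);
    const auto pb = reinterpret_cast<std::uintptr_t>(b);
    return pa + n <= pb || pb + n <= pa;
}

// Hypothetical helper: thresholds one row of `width` pixels.
// The caller must guarantee that `in` and `out` do not overlap.
void threshold_row(const std::uint8_t* in, std::uint8_t* out,
                   int width, std::uint8_t th) {
    // Debug-only check of the no-overlap promise made by the pragma below.
    assert(width >= 0 &&
           ranges_disjoint(in, out, static_cast<std::size_t>(width)));

    // The pragma must sit directly in front of the loop it applies to.
#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER)
#pragma GCC ivdep
#endif
    for (int i = 0; i < width; i++) {
        out[i] = (in[i] < th) ? 255 : 0;
    }
}
```

Calling <code>threshold_row(in, out, width, th)</code> once per image row keeps the pragma attached to the innermost loop, which is the loop the vectorizer works on.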

Another approach to this specific issue, one that is standardised and fully portable across (reasonably modern) compilers, is to use the <a href="http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf#G4.1507435" rel="nofollow">OpenMP <code>simd</code> directive</a>, which has been part of the standard since version 4.0. The code then becomes:

<pre><code>void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd
    for (int i = 0; i &lt; width; i++) {
        outputRoi[i] = (inputRoi[i] &lt; valueTh) ? 255 : 0;
    }
}
</code></pre>

When compiled with OpenMP support enabled (either full support or support for <code>simd</code> only, as with <code>-qopenmp-simd</code> for the Intel compiler or <code>-fopenmp-simd</code> for gcc), the code is fully vectorised.

In addition, this gives you the opportunity to declare the alignment of the arrays, which can come in handy in some circumstances. For example, had your input and output arrays been allocated with an alignment-aware memory allocator, such as <code>posix_memalign()</code> with an alignment requirement of 256 bits (32 bytes), then the code could become:

<pre><code>void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd aligned(inputRoi, outputRoi : 32)
    for (int i = 0; i &lt; width; i++) {
        outputRoi[i] = (inputRoi[i] &lt; valueTh) ? 255 : 0;
    }
}
</code></pre>

This should allow the compiler to generate an even faster binary. This feature is not readily available with the <code>ivdep</code> directives, which is one more reason to use the OpenMP <code>simd</code> directive.
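A minimal sketch of the allocation side, assuming a POSIX system where <code>posix_memalign()</code> is available: the names <code>alloc_aligned32</code> and <code>threshold_aligned</code> are hypothetical. The <code>aligned</code> clause is a promise the compiler does not verify, so the sketch checks it in debug builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical helper: allocates n bytes aligned to 32 bytes (256 bits).
// Returns nullptr on failure; release the memory with free().
std::uint8_t* alloc_aligned32(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, 32, n) != 0) {
        return nullptr;
    }
    return static_cast<std::uint8_t*>(p);
}

// Hypothetical variant of the function above, for one 32-byte-aligned row.
void threshold_aligned(const std::uint8_t* in, std::uint8_t* out,
                       int width, std::uint8_t th) {
    // Debug-only check of the alignment promised by the aligned clause.
    assert(reinterpret_cast<std::uintptr_t>(in) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 32 == 0);

    #pragma omp simd aligned(in, out : 32)
    for (int i = 0; i < width; i++) {
        out[i] = (in[i] < th) ? 255 : 0;
    }
}
```

For a multi-row image, the promise has to hold for every row pointer passed in, so the row stride would also need to be a multiple of 32 bytes.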

I am doing some image processing, for which I benefit from vectorization.
I have a function that vectorizes ok, but for which I am not able to convince the compiler that the input and output buffer have no overlap, and so no alias checking is necessary.
I should be able to do so using <code>__restrict__</code>, but if the buffers are not defined as <code>__restrict__</code> when arriving as function argument, there is no way to convince the compiler that I am absolutely sure that 2 buffers will never overlap.

This is the function:

<pre><code>__attribute__((optimize("tree-vectorize","tree-vectorizer-verbose=6")))
void threshold(const cv::Mat&amp; inputRoi, cv::Mat&amp; outputRoi, const unsigned char th) {

 const int height = inputRoi.rows;
 const int width = inputRoi.cols;

 for (int j = 0; j &lt; height; j++) {
 const uint8_t* __restrict in = (const uint8_t* __restrict) inputRoi.ptr(j);
 uint8_t* __restrict out = (uint8_t* __restrict) outputRoi.ptr(j);
 for (int i = 0; i &lt; width; i++) {
 out[i] = (in[i] &lt; valueTh) ? 255 : 0;
 }
 }
}
</code></pre>

The only way I can convince the compiler to not perform the alias checking is if I put the inner loop in a separate function, in which the pointers are defined as <code>__restrict__</code> arguments. If I declare this inner function as inlined, again the alias checking is activated.
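A minimal sketch of that workaround, with plain buffers standing in for <code>cv::Mat</code>: the names <code>threshold_row_restrict</code> and <code>threshold_image</code> are hypothetical, and the <code>noinline</code> attribute is an assumption about how to keep the helper out of line with gcc, not something the compiler documentation promises will suppress the alias check.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, deliberately kept out of line (gcc/clang attribute).
// The __restrict__ qualifiers on the parameters state that in and out
// never refer to overlapping memory.
__attribute__((noinline))
void threshold_row_restrict(const std::uint8_t* __restrict__ in,
                            std::uint8_t* __restrict__ out,
                            int width, std::uint8_t th) {
    for (int i = 0; i < width; i++) {
        out[i] = (in[i] < th) ? 255 : 0;
    }
}

// Hypothetical outer function: walks the rows of two images that share the
// same width, height and row stride (in bytes), one helper call per row.
void threshold_image(const std::uint8_t* in, std::uint8_t* out,
                     int width, int height, std::size_t stride,
                     std::uint8_t th) {
    for (int j = 0; j < height; j++) {
        threshold_row_restrict(in + j * stride, out + j * stride, width, th);
    }
}
```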

You can see the effect also with this example, which I think is consistent: <a href="http://goo.gl/7HK5p7" rel="nofollow noreferrer">http://goo.gl/7HK5p7</a>

(Note: I know there might be better ways of writing the same function, but in this case I am just trying to understand how to avoid alias check)

Edit: 
Problem is solved!! (See <a href="https://stackoverflow.com/a/27996733/2436175">answer below</a>) 
Using gcc 4.9.2, <a href="http://goo.gl/nBjzPC" rel="nofollow noreferrer">here is the complete example</a>. Note the use of the compiler flag <a href="https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#index-fopt-info-818" rel="nofollow noreferrer"><code>-fopt-info-vec-optimized</code></a> in place of the superseded <code>-ftree-vectorizer-verbose=N</code>. 
So, for gcc, use <code>#pragma GCC ivdep</code> and enjoy! :)

Auto-vectorizing: Convincing the compiler that alias check is not necessary
