首页
学习
活动
专区
工具
TVP
发布
社区首页 >问答首页 >为什么std::accumulate这么慢?

为什么std::accumulate这么慢?
EN

Stack Overflow用户
提问于 2014-02-11 02:47:42
回答 2查看 4.2K关注 0票数 17

我试着用一个简单的for循环,一个std::accumulate和一个手动展开的for循环来计算数组元素的和。正如我所期望的那样,手动展开的循环是最快的,但更有趣的是std::accumulate比简单循环慢得多。这是我的代码,我用带有-O3标志的gcc 4.7编译的。Visual Studio需要不同的rdtsc函数实现。

代码语言:javascript
复制
#include <iostream>
#include <algorithm>
#include <numeric>
#include <stdint.h>


using namespace std;

__inline__ uint64_t rdtsc() {
  uint64_t a, d;
  __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
  return (d<<32) | a;
}

class mytimer
{
 public:
  mytimer() { _start_time = rdtsc(); }
  void   restart() { _start_time = rdtsc(); }
  uint64_t elapsed() const
  { return  rdtsc() - _start_time; }

 private:
  uint64_t _start_time;
}; // timer

int main()
{
    const int num_samples = 1000;
    float* samples = new float[num_samples];
    mytimer timer;
    for (int i = 0; i < num_samples; i++) {
        samples[i] = 1.f;
    }
    double result = timer.elapsed();
    std::cout << "rewrite of " << (num_samples*sizeof(float)/(1024*1024)) << " Mb takes " << result << std::endl;

    timer.restart();
    float sum = 0;
    for (int i = 0; i < num_samples; i++) {
        sum += samples[i];
    }
    result = timer.elapsed();
    std::cout << "naive:\t\t" << result << ", sum = " << sum << std::endl;

    timer.restart();
    float* end = samples + num_samples;
    sum = 0;
    for(float* i = samples; i < end; i++) {
        sum += *i;
    }
    result = timer.elapsed();
    std::cout << "pointers:\t\t" << result << ", sum = " << sum << std::endl;

    timer.restart();
    sum = 0;
    sum = std::accumulate(samples, end, 0);
    result = timer.elapsed();
    std::cout << "algorithm:\t" << result << ", sum = " << sum << std::endl;

    // With ILP
    timer.restart();
    float sum0 = 0, sum1 = 0;
    sum = 0;
    for (int i = 0; i < num_samples; i+=2) {
        sum0 += samples[i];
        sum1 += samples[i+1];
    }
    sum = sum0 + sum1;
    result = timer.elapsed();
    std::cout << "ILP:\t\t" << result << ", sum = " << sum << std::endl;
}
EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2014-02-11 02:54:38

对于初学者来说,使用std::accumulate就是对整数求和。因此,在添加之前,您可能要为将每个浮点数转换为整数而付出代价。尝试:

代码语言:javascript
复制
sum = std::accumulate( samples, end, 0.f );

看看这会不会有什么不同。

票数 34
EN

Stack Overflow用户

发布于 2014-02-11 05:05:37

由于您(显然)关心快速完成此任务,您可能还会考虑尝试多线程计算,以利用所有可用内核。为了使用OpenMP,我对您的简单循环进行了相当简单的重写,给出了如下内容:

代码语言:javascript
复制
timer.restart();
sum = 0;

// only real change is adding the following line:
#pragma omp parallel for schedule(dynamic, 4096), reduction(+:sum)
for (int i = 0; i < num_samples; i++) {
    sum += samples[i];
}
result = timer.elapsed();
std::cout << "OMP:\t\t" << result << ", sum = " << sum << std::endl;

为了笑一笑,我还对展开的循环做了一些重写,以允许半任意展开,并添加了OpenMP:

代码语言:javascript
复制
static const int unroll = 32;
real total = real();
timer.restart();
double sum[unroll] = { 0.0f };
#pragma omp parallel for reduction(+:total) schedule(dynamic, 4096)
for (int i = 0; i < num_samples; i += unroll) {
    for (int j = 0; j < unroll; j++)
        total += samples[i + j];
}
result = timer.elapsed();
std::cout << "ILP+OMP:\t" << result << ", sum = " << total << std::endl;

我还(大幅度)增加了数组的大小,以获得更有意义的数字。研究结果如下:首先是双核AMD:

代码语言:javascript
复制
rewrite of 4096 Mb takes 8269023193
naive:          3336194526, sum = 536870912
pointers:       3348790101, sum = 536870912
algorithm:      3293786903, sum = 536870912
ILP:            2713824079, sum = 536870912
OMP:            1885895124, sum = 536870912
ILP+OMP:        1618134382, sum = 536870912

然后对于四核(英特尔i7):

代码语言:javascript
复制
rewrite of 4096 Mb takes 2415836465
naive:          1382962075, sum = 536870912
pointers:       1675826109, sum = 536870912
algorithm:      1748990122, sum = 536870912
ILP:            751649497, sum = 536870912
OMP:            575595251, sum = 536870912
ILP+OMP:        450832023, sum = 536870912

从外观上看,OpenMP版本可能遇到了内存带宽的限制-- OpenMP版本比非线程版本使用了更多的CPU,但仍然只达到了70%左右,这表明CPU以外的其他因素正在成为瓶颈。

票数 4
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/21685426

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档