文章/答案/技术大牛

发布

社区首页 >问答首页 >当95%的情况下的值为0或1时，对非常大的数组上的随机访问有任何优化吗？

问当95%的情况下的值为0或1时，对非常大的数组上的随机访问有任何优化吗？
EN

Stack Overflow用户

提问于 2018-05-14 05:23:23

回答 13查看 15.6K关注 0票数 134

对一个非常大的数组的随机访问是否有任何可能的优化(我目前使用的是uint8_t，我在问什么更好)

uint8_t MyArray[10000000];

当数组中任何位置的值为

或1占95%，
2占4%，
在3和255之间的其他1%的病例？

那么，是否有比uint8_t数组更好的方法呢？应该尽可能快地以随机顺序遍历整个数组，这对RAM带宽来说是非常沉重的，因此当有超过几个线程同时对不同的数组执行时，目前整个RAM带宽很快就饱和了。

我的问题是，如果有这么大的数组(10 MB)，而实际上知道除了5%之外，几乎所有的值都是0或1，那么当数组中95%的值实际上只需要1位而不是8位时，这将减少几乎一个数量级的内存使用量，这让人觉得效率很低。它觉得必须有一个更有效的内存解决方案，这将大大减少所需的RAM带宽，并因此也大大加快了随机访问。

performance

optimization

memory-bandwidth

c++

arrays

回答 13

Stack Overflow用户

回答已采纳

发布于 2018-05-14 06:06:23

我们想到的一个简单的可能性是，对于普通情况，保留一个压缩数组，每个值为2位，而对于每个值，则保留一个分隔的4字节数组(原始元素索引为24位，实际值为8位，因此(idx << 8) | value))对其他数组进行排序。

当您查找一个值时，首先在2bpp数组(O(1))中进行查找；如果您找到0、1或2，这是您想要的值；如果您找到3，就意味着您必须在辅助数组中查找它。在这里，您将执行一个二进制搜索，查找您感兴趣的索引左移8 (O(log(n)和一个小n，因为这应该是1%)，并从4字节的细化中提取值。

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

对于您建议的数组，第一个数组需要10000000 /4= 2500000字节，第二个数组需要10000000 * 1% *4B= 400000字节；因此2900000字节，即不到原始数组的三分之一，使用最多的部分都保存在内存中，这对于缓存应该很好(它甚至适合L3)。

如果您需要超过24位的寻址，您将不得不调整“二级存储”；扩展它的一个简单方法是有一个256个元素指针数组来切换索引的前8位，并转发到如上所述的24位索引排序数组。

快速基准

#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>

using namespace std::chrono;

/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than LCG but fail some test suites
struct XorShift32 {
    /// This stuff allows to use this class wherever a library function
    /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
    typedef uint32_t result_type;
    static uint32_t min() { return 1; }
    static uint32_t max() { return uint32_t(-1); }

    /// PRNG state
    uint32_t y;

    /// Initializes with seed
    XorShift32(uint32_t seed = 0) : y(seed) {
        if(y == 0) y = 2463534242UL;
    }

    /// Returns a value in the range [1, 1<<32)
    uint32_t operator()() {
        y ^= (y<<13);
        y ^= (y>>17);
        y ^= (y<<15);
        return y;
    }

    /// Returns a value in the range [0, limit); this conforms to the RandomFunc
    /// requirements for std::random_shuffle
    uint32_t operator()(uint32_t limit) {
        return (*this)()%limit;
    }
};

struct mean_variance {
    double rmean = 0.;
    double rvariance = 0.;
    int count = 0;

    void operator()(double x) {
        ++count;
        double ormean = rmean;
        rmean     += (x-rmean)/count;
        rvariance += (x-ormean)*(x-rmean);
    }

    double mean()     const { return rmean; }
    double variance() const { return rvariance/(count-1); }
    double stddev()   const { return std::sqrt(variance()); }
};

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

volatile unsigned out;

int main() {
    XorShift32 xs;
    std::vector<uint8_t> vec;
    int size = 10000000;
    for(int i = 0; i<size; ++i) {
        uint32_t v = xs();
        if(v < 1825361101)      v = 0; // 42.5%
        else if(v < 4080218931) v = 1; // 95.0%
        else if(v < 4252017623) v = 2; // 99.0%
        else {
            while((v & 0xff) < 3) v = xs();
        }
        vec.push_back(v);
    }
    populate(vec.data(), vec.size());
    mean_variance lk_t, arr_t;
    for(int i = 0; i<50; ++i) {
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += lookup(xs() % size);
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "lookup: %10d µs\n", dur);
            lk_t(dur);
        }
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += vec[xs() % size];
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "array:  %10d µs\n", dur);
            arr_t(dur);
        }
    }

    fprintf(stderr, " lookup |   ±  |  array  |   ±  | speedup\n");
    printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
            lk_t.mean(), lk_t.stddev(),
            arr_t.mean(), arr_t.stddev(),
            arr_t.mean()/lk_t.mean());
    return 0;
}

(代码和数据总是在我的Bitbucket中更新)

上面的代码用随机数据填充一个10M元素数组，这些数据分布在它们的文章中指定的OP中，初始化我的数据结构，然后：

使用我的数据结构执行10M元素的随机查找。
通过原始数组执行同样的操作。

(注意，在顺序查找的情况下，数组总是以一个很大的尺度获胜，因为这是您可以做的最有利于缓存的查找)

最后两个块被重复了50次并计时；最后，计算并打印了每种类型查找的平均和标准偏差，以及加速比(查找均值/数组平均值)。

我在Ubuntu16.04上用g++ 5.4.0 (-O3 -static，加上一些警告)编译了上面的代码，并在一些机器上运行它；大多数机器运行Ubuntu16.04，一些旧的Linux，一些较新的Linux。在这种情况下，我不认为操作系统应该是相关的。

            CPU           |  cache   |  lookup (µs)   |     array (µs)  | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz | 15360 KB |  60011 ±  3667 |   29313 ±  2137 | 0.49
Xeon E5-2697 v3 @ 2.60GHz | 35840 KB |  66571 ±  7477 |   33197 ±  3619 | 0.50
Celeron G1610T  @ 2.30GHz |  2048 KB | 172090 ±   629 |  162328 ±   326 | 0.94
Core i3-3220T   @ 2.80GHz |  3072 KB | 111025 ±  5507 |  114415 ±  2528 | 1.03
Core i5-7200U   @ 2.50GHz |  3072 KB |  92447 ±  1494 |   95249 ±  1134 | 1.03
Xeon X3430      @ 2.40GHz |  8192 KB | 111303 ±   936 |  127647 ±  1503 | 1.15
Core i7 920     @ 2.67GHz |  8192 KB | 123161 ± 35113 |  156068 ± 45355 | 1.27
Xeon X5650      @ 2.67GHz | 12288 KB | 106015 ±  5364 |  140335 ±  6739 | 1.32
Core i7 870     @ 2.93GHz |  8192 KB |  77986 ±   429 |  106040 ±  1043 | 1.36
Core i7-6700    @ 3.40GHz |  8192 KB |  47854 ±   573 |   66893 ±  1367 | 1.40
Core i3-4150    @ 3.50GHz |  3072 KB |  76162 ±   983 |  113265 ±   239 | 1.49
Xeon X5650      @ 2.67GHz | 12288 KB | 101384 ±   796 |  152720 ±  2440 | 1.51
Core i7-3770T   @ 2.50GHz |  8192 KB |  69551 ±  1961 |  128929 ±  2631 | 1.85

结果是..。混合！

一般来说，在大多数这样的机器上都有某种加速，或者至少它们是一样的。
数组真正优于“智能结构”查找的两种情况是在具有大量缓存而不是特别繁忙的机器上:上面的Xeon E5-1650 (15 MB缓存)是一台夜间构建机器，目前相当空闲；Xeon E5-2697 (35 MB缓存)也是一台用于高性能计算的机器。这是有意义的，原来的数组完全符合他们巨大的缓存，所以紧凑的数据结构只会增加复杂性。
在“性能谱”的另一边--但在阵列速度稍快的地方，还有为我的NAS提供动力的卑微的Celeron；它的缓存太少了，以至于数组和“智能结构”都不适合它。其他缓存足够小的机器也会执行类似的操作。
必须谨慎对待Xeon X5650 --它们是一个非常繁忙的双套接字虚拟机服务器上的虚拟机；很可能，尽管名义上它有相当数量的缓存，但在测试期间，它会被完全无关的虚拟机抢占好几次。

票数 155

Stack Overflow用户

发布于 2018-05-14 06:57:51

另一种选择是

检查结果是0，1还是2
如果不定期查找

换句话说，类似于：

unsigned char lookup(int index) {
    int code = (bmap[index>>2]>>(2*(index&3)))&3;
    if (code != 3) return code;
    return full_array[index];
}

其中，bmap每个元素使用2位，值为3，意思是"other“。

这个结构是微不足道的更新，使用25%的内存，但大部分是查找仅在5%的情况。当然，和往常一样，如果这是一个好主意或不取决于许多其他条件，所以唯一的答案是尝试真正的使用。

票数 33

Stack Overflow用户

发布于 2018-05-14 06:23:14

这与其说是一个具体的答案，不如说是一个“长话”。

除非您的数据是众所周知的，否则我怀疑任何人都不能直接回答您的问题(我也不知道有什么与您的描述相匹配的，但我并不了解所有类型用例的所有数据模式)。稀疏数据是高性能计算中常见的问题，但它通常是“我们有一个非常大的数组，但只有一些值是非零的”。

对于像我认为你们这样的不为人熟知的模式，没有人会直接知道哪个更好，这取决于细节:随机访问有多大的随机性--系统是访问数据项的集群，还是完全随机的，就像来自统一随机数生成器的那样。表数据是完全随机的，还是有0的序列，然后是1的序列，还有其他值的散射？如果您有相当长的0和1序列，那么运行长度编码将工作得很好，但是如果您有“0/1的棋盘”，则不会工作。此外，你还必须保存一张“起点”的表格，这样你就可以合理地快速到达相关的地方。

我很久以前就知道，一些大型数据库只是RAM (本例中的电话交换机用户数据)中的一个大表，其中一个问题是处理器中的缓存和页表优化是非常无用的。打电话的人很少和最近打电话的人一样，所以没有任何形式的预装数据，这纯粹是随机的。对于这种访问，大页表是最好的优化。

在很多情况下，“速度和小尺寸”之间的妥协是你必须在其他工程的软件工程中选择的事情之一--这不一定是一种妥协。因此，“为更简单的代码浪费内存”通常是首选。从这个意义上说，“简单”解决方案在速度上很可能更好，但是如果您对RAM有“更好”的使用，那么优化表的大小将给您提供足够的性能和更好的大小改进。实现这个目标有很多不同的方法--正如注释中所建议的那样，在一个2位字段中存储两个或三个最常见的值，然后为其他值提供一些替代的数据格式--哈希表将是我的第一种方法，但列表或二叉树也可能工作--这同样取决于“不是0、1或2”的模式。同样，这取决于这些值是如何在表中“分散”的--它们是在集群中还是在更多的分布模式中？

但问题是，您仍在读取RAM中的数据。然后，您将花费更多的代码来处理数据，包括一些用于处理“这不是一个公共值”的代码。

大多数常见的压缩算法的问题是，它们是基于解包序列的，所以不能随机访问它们。如果一次将大数据分成256个条目，然后将256个数据解压缩到一个uint8_t数组中，获取想要的数据，然后丢弃未压缩的数据，这种开销很难给您带来好的性能--当然，假设这一点很重要。

最后，您可能需要在注释/答案中实现一个或几个想法来测试，看看它是否有助于解决您的问题，或者内存总线是否仍然是主要的限制因素。

票数 23

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/50323522

复制

相似问题

问当95%的情况下的值为0或1时，对非常大的数组上的随机访问有任何优化吗？
EN

回答 13

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问当95%的情况下的值为0或1时，对非常大的数组上的随机访问有任何优化吗？EN

回答 13

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问当95%的情况下的值为0或1时，对非常大的数组上的随机访问有任何优化吗？
EN