Multithreaded array processing, then writing to a result array in a C-Python extension

Stack Overflow user

Asked 2019-02-19 17:22:48

Answers 1 | Views 77 | Followers 0 | Votes 0

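The code below is a C-Python extension. It takes an input buffer of contiguous raw bytes (for my application, "blocks" of raw bytes, where 1 block = 128 bytes), processes those bytes into two-byte "samples", and puts the results into items. The returned structure is simply the buffer converted into Python integers.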

These are the two main functions:

unpack_block(items, items_offset, buffer, buffer_offset, samples_per_block, sample_bits);

After that, I loop over every sample in items and convert each one to a Python int:

PyList_SET_ITEM(result, index, PyInt_FromLong(items[index]));

    unsigned int num_blocks_per_thread, num_samples_per_thread, num_bytes_per_thread;
    unsigned int thread_id, p;
    unsigned int n_threads, start_index_bytes, start_index_blocks, start_index_samples;

    items = malloc(num_samples*sizeof(unsigned long));
    assert(items);

    #pragma omp parallel\
    default(none)\
    private(num_blocks_per_thread, num_samples_per_thread, num_bytes_per_thread, d, j, thread_id, n_threads, start_index_bytes, start_index_blocks, start_index_samples)\
    shared(samples_per_block, num_blocks, buffer, bytes_per_block, sample_bits, result, num_samples, items)
      {

        n_threads = omp_get_num_threads();
        num_blocks_per_thread = num_blocks/n_threads;
        num_samples_per_thread = num_samples/n_threads; 
        num_bytes_per_thread = num_blocks_per_thread*samples_per_block*2/n_threads;

        thread_id = omp_get_thread_num();
        start_index_bytes = num_bytes_per_thread*thread_id;
        start_index_blocks = num_blocks_per_thread*thread_id;  
        start_index_samples = num_samples_per_thread*thread_id;

        for (d=0; d<num_blocks_per_thread; d++) {
          unpack_block(items, start_index_samples+d*samples_per_block, buffer, start_index_blocks + d*bytes_per_block, samples_per_block, sample_bits);
        }

      }

     result = PyList_New(num_samples);
     assert(result);

     //*THIS WOULD ALSO SEEM RIPE FOR MULTITHREADING*
     for (p=0; p<num_samples; p++) {
        PyList_SET_ITEM(result, p, PyInt_FromLong( items[p] ));
      }

    free(items);
    free(buffer);

  return result;
}

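The speed is terrible, far below what I expected from multithreading. Each thread only handles a mutually exclusive chunk of the same array, but since the threads write to different chunks of the items array, I may be running into false sharing.

The basic question for me is: how do I correctly multithread the processing of every element of a single array and then write each element's result into a second "result" array? I do this twice, once with each of my two functions.

Any ideas, solutions, or optimizations would be great. Thanks!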

multithreading

openmp

python

1 Answer

Stack Overflow user

Answered 2019-02-19 18:10:52

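You already mentioned false sharing. To avoid it, you have to allocate the memory accordingly (with posix_memalign or another aligned allocation function) and choose the chunk size so that the data of one chunk is an exact multiple of the cache line size.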
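A minimal sketch of that allocation advice, under stated assumptions: the cache line is taken to be 64 bytes (common on x86-64, but not guaranteed), and the names `CACHE_LINE`, `alloc_items_aligned`, and `round_chunk_to_cache_line` are made up for illustration. They are not part of the question's code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* exposes posix_memalign in strict modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache-line size in bytes; check the target CPU before relying on it. */
#define CACHE_LINE 64

/* Allocate n unsigned longs starting on a cache-line boundary.
 * Returns NULL on failure; release the memory with free(). */
static unsigned long *alloc_items_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(unsigned long)) != 0)
        return NULL;
    return (unsigned long *)p;
}

/* Round a per-thread chunk length (in samples) up to a whole number of
 * cache lines, so two threads never write into the same line. */
static size_t round_chunk_to_cache_line(size_t samples_per_chunk)
{
    size_t per_line = CACHE_LINE / sizeof(unsigned long);
    assert(per_line > 0);
    return (samples_per_chunk + per_line - 1) / per_line * per_line;
}
```

With the numbers from the question (128-byte blocks of two-byte samples, so 64 samples per block), and assuming an 8-byte unsigned long, one block occupies 512 bytes of items, which is already a multiple of 64. In that case only the base address of items needs aligning.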

In general, measure the execution time with $N$ threads and compute the speedup. Can you share the speedup curve with us?
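One way to collect such a curve is to time the same work for several thread counts and report the speedup T(1)/T(N). The sketch below is only an illustration: `work` is a hypothetical stand-in for the unpacking loop, and `now_seconds`, `time_work`, `speedup`, and `efficiency` are names invented here. It falls back to serial execution and `clock()` when compiled without OpenMP.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

/* Wall-clock time with OpenMP, CPU time as a fallback without it. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical stand-in for the real per-sample work. */
static void work(unsigned long *items, size_t n, int n_threads)
{
    (void)n_threads;  /* unused when compiled without OpenMP */
    #pragma omp parallel for num_threads(n_threads) schedule(static)
    for (long i = 0; i < (long)n; i++)
        items[i] = (unsigned long)i * 2ul;
}

/* Time one run of work() with the requested number of threads. */
static double time_work(unsigned long *items, size_t n, int n_threads)
{
    double t0 = now_seconds();
    work(items, n, n_threads);
    return now_seconds() - t0;
}

/* Speedup S(N) = T(1) / T(N); parallel efficiency E(N) = S(N) / N. */
static double speedup(double t1, double tn) { return t1 / tn; }
static double efficiency(double t1, double tn, int n) { return t1 / (tn * n); }
```

Calling `time_work` for 1, 2, 4, ... threads on a realistic input size and plotting `speedup` against the thread count gives the curve asked for. Repeating each measurement a few times and keeping the minimum usually reduces noise.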

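As for the comment "this would also seem ripe for multithreading": expectations are often too high (take this as a caution, to avoid disappointment). Consider how many threads you use, how many elements each thread handles, and the workload per thread (that is, how much computation each item needs). The workload may be so small that OpenMP overhead dominates. Also, how many instructions are executed per memory load? Code with many instructions per memory load is usually a reasonable candidate for parallelization; a low ratio indicates that the program is memory-bound.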
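If the overhead does dominate for small inputs, OpenMP's `if` clause can keep such calls on a single thread. The following is a rough sketch only: `scale_samples` and its `threshold` parameter are hypothetical, and a sensible threshold has to come from measurement rather than from this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-sample work standing in for the unpacking step.
 * threshold is an assumed, measured break-even size: at or below it the
 * loop runs on one thread, so no thread team has to be started. */
static void scale_samples(const unsigned short *in, unsigned long *out,
                          size_t n, size_t threshold)
{
    long i;
    /* Without OpenMP the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static) if(n > threshold)
    for (i = 0; i < (long)n; i++)
        out[i] = (unsigned long)in[i] + 1ul;
}
```

The result is the same on either path; only the number of threads used differs.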

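Speaking of memory access, are you on a multi-socket system with separate NUMA domains? If so, you will have to deal with thread and memory affinity.

Votes: 0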
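One common way to handle placement on such systems is "first touch" initialization. This sketch assumes a Linux-style default policy in which a page is placed on the NUMA node of the thread that first writes to it, that threads are pinned (for example with the OpenMP environment variable OMP_PROC_BIND=true), and that the later processing loop uses the same static schedule. `alloc_first_touch` is a hypothetical helper, not something from the question.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n items and let each thread write the part it will later
 * process, so those pages end up near that thread (assumed first-touch
 * policy). Returns NULL on allocation failure; release with free(). */
static unsigned long *alloc_first_touch(size_t n)
{
    unsigned long *items = malloc(n * sizeof *items);
    long i;
    if (items == NULL)
        return NULL;
    /* Same static schedule as the later processing loop is assumed. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < (long)n; i++)
        items[i] = 0;
    return items;
}
```

On a single-socket machine this changes nothing measurable, so it is only worth trying if the speedup flattens once threads spill onto a second socket.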

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/54762680
