下面的代码使用作为CUDA工具包一部分的推力库对GPU执行所需的操作。如果这个操作完成得足够频繁，以至于您关心如何获得最好的性能，那么您可以直接使用由Thrust使用的CUB库作为后端。这样，您应该能够避免多个不必要的复制操作。对此感兴趣的API将是cub::DeviceMergeSort (可能还可以使用cub::DeviceRadixSort和类型双关转换迭代器)和cub::DeviceSegmentedReduce。

避免临时向量不必要初始化的一种方法可以在uninitialized_vector.cu推力示例中找到。

另一种在不放弃推力的情况下可能提高性能的方法是使用thrust::cuda::par_nosync执行策略。它还没有实现在CUDA工具包11.18版本的推力(1.15)，所以你将需要一个更近期的版本(1.16或更高版本) 来自吉突布。这可以帮助隐藏内核启动开销，因为它使推力算法在可能的情况下是异步的(分段缩减必须是同步的，因为它返回新的结束迭代器)。您还可以向API中添加显式CUDA流和池内存资源，以更好地处理异步行为，并分别避免重复分配开销。有关更多信息，请参见cuda/explicit_cuda_stream.cu、cuda/custom_temporary_allocation.cu和mr_basic.cu推力示例。

#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

// Transforms triplets into COO sparse marix format in-place and
// returns the number of non-zero entries (nnz).
auto triplets2coo(int *d_rows,
                  int *d_cols,
                  float *d_vals,
                  int num_triplets) {
    // thrust::device_ptr is a lightweight wrapper that lets Thrust algorithms
    // know which memory the pointer points to. thrust::device_pointer_cast is
    // the preferred way of creating them. While an API directly taking
    // wrapped pointers as arguments would be preferable, I chose it this way 
    // to show how to wrap the pointers.
    auto const coord_iter = thrust::make_zip_iterator(
        thrust::make_tuple(
            thrust::device_pointer_cast(d_rows),
            thrust::device_pointer_cast(d_cols)
        ));
    auto const vals = thrust::device_pointer_cast(d_vals);

    // Sort in-place.
    thrust::sort_by_key(coord_iter, coord_iter + num_triplets,
                        vals);
    // Segmented reduction can only be done out-of-place,
    // so allocate more memory.
    thrust::device_vector<int> tmp_rows(num_triplets);
    thrust::device_vector<int> tmp_cols(num_triplets);
    thrust::device_vector<float> tmp_vals(num_triplets);
    // Caution: These vectors are unnecessarily initialized to zero!

    auto const tmp_coord_iter = thrust::make_zip_iterator(
        thrust::make_tuple(
            tmp_rows.begin(),
            tmp_cols.begin()
        ));

    auto const new_ends =
        thrust::reduce_by_key(coord_iter, coord_iter + num_triplets,
                              vals,
                              tmp_coord_iter,
                              tmp_vals.begin());
    auto const nnz = thrust::distance(tmp_vals.begin(), new_ends.second);

    thrust::copy_n(tmp_coord_iter, nnz, coord_iter);
    thrust::copy_n(tmp_vals.cbegin(), nnz, vals);

    return nnz;
}

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/74349633

复制

相似问题

问从CUDA中可能重复的三重奏列表中组装COO矩阵
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问从CUDA中可能重复的三重奏列表中组装COO矩阵EN