Q: CUDA shared memory array variable

Stack Overflow user

Asked 2012-02-08 12:36:23

3 answers, 35.6K views, 0 followers, 16 votes

I am trying to declare a variable for matrix multiplication as follows:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

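I want the user to be able to enter the size of the matrix to compute, but that means changing BLOCK_SIZE at run time. When I changed it, I got a compiler error: "error: constant value is not known". I looked into it, and it is similar to this thread. So I tried: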

__shared__ int buf [];

But then I get: "error: incomplete type is not allowed"

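Thanks, Dan

Update with code (this mostly follows this guide and a "starting with CUDA" guide): the block size is passed in by asking the user for the size of the matrix. They enter x and y. The block size currently uses only x, so for now x and y must be the same size.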

__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB,size_t block_size)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * block_size * by;

    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = block_size;

    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = block_size * bx;

    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = block_size * wB;
    float Csub=0;
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        // Declaration of the shared memory array As 
        // used to store the sub-matrix of A

        extern __shared__ float As[];

        // Declaration of the shared memory array Bs 
        // used to store the sub-matrix of B
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        smem[ty*block_size+tx] = A[a + wA * ty + tx];
        //cuPrintf("\n\nWhat are the memory locations?\n");
        //cuPrintf("The shared memory(A) is: %.2f\n",smem[ty*block_size+tx]);
        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];
        //cuPrintf("The shared memory(B) is: %.2f\n",smem[block_size*block_size+ty*block_size+tx]);
        // Synchronize to make sure the matrices 
        // are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < block_size; ++k)
        {

            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;
            //cuPrintf("Csub is currently: %.2f\n",Csub);
        }
        //cuPrintf("\n\n\n");
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        //cuPrintf("the results are csub: %.2f\n",Csub);
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}

cuda

3 Answers

Stack Overflow user

Accepted answer

Posted 2012-02-08 17:50:29

extern __shared__ int buf[];

When you launch the kernel, pass the shared memory size in bytes as the third launch parameter:

kernel<<<blocks,threads,numbytes_for_shared>>>(...);

If you have multiple extern __shared__ declarations:

extern __shared__ float As[];

extern __shared__ float Bs[];

This will cause As to point to the same address as Bs.
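The aliasing described above can be observed with a small probe. This is a minimal sketch, not part of the original answer: the names aliasProbe and probeAlias are hypothetical, and it assumes a CUDA-capable device is available. It writes a value through As and reads it back through Bs; if the two declarations share a base address as described, the helper should return the value that was written.

```cuda
#include <cuda_runtime.h>

// Writes through one extern __shared__ array and reads through the other.
__global__ void aliasProbe(float *out)
{
    extern __shared__ float As[];
    extern __shared__ float Bs[];

    if (threadIdx.x == 0) As[0] = 42.0f;   // store via As
    __syncthreads();                        // make the store visible
    if (threadIdx.x == 0) *out = Bs[0];     // load via Bs
}

// Hypothetical host helper: returns the value read through Bs,
// or -1.0f if any CUDA call fails.
float probeAlias()
{
    float *dOut = nullptr;
    float hOut = -1.0f;
    if (cudaMalloc(&dOut, sizeof(float)) != cudaSuccess) return -1.0f;

    // One float of dynamic shared memory is enough for the probe.
    aliasProbe<<<1, 32, sizeof(float)>>>(dOut);

    cudaError_t err = cudaMemcpy(&hOut, dOut, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return err == cudaSuccess ? hOut : -1.0f;
}
```

Because both names refer to the same dynamic allocation, storing a tile of A in As and a tile of B in Bs would overwrite one with the other.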

You will need to keep As and Bs inside a single 1D array.

extern __shared__ float smem[];

When calling the kernel, launch it with 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) bytes of shared memory.

To index into As, use smem[y*BLOCK_SIZE+x]; to index into Bs, use smem[BLOCK_SIZE*BLOCK_SIZE+y*BLOCK_SIZE+x].
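Putting the pieces of this answer together, a complete version might look like the sketch below. It is an illustration under stated assumptions rather than the asker's final code: the names matMulDynamic and matMulHost are hypothetical, the matrices are assumed square (n x n, row-major), and n is assumed to be a multiple of the runtime tile size.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Tiled C = A * B for row-major n x n matrices. The tile size is a
// runtime value, so both tiles live in one dynamic shared buffer.
__global__ void matMulDynamic(float *C, const float *A, const float *B,
                              int n, int tile)
{
    extern __shared__ float smem[];
    float *As = smem;                 // first tile*tile floats
    float *Bs = smem + tile * tile;   // next tile*tile floats

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * tile + ty;
    int col = blockIdx.x * tile + tx;
    float acc = 0.0f;

    for (int t = 0; t < n / tile; ++t) {
        // Each thread loads one element of each tile.
        As[ty * tile + tx] = A[row * n + t * tile + tx];
        Bs[ty * tile + tx] = B[(t * tile + ty) * n + col];
        __syncthreads();

        for (int k = 0; k < tile; ++k)
            acc += As[ty * tile + k] * Bs[k * tile + tx];
        __syncthreads();   // finish reading before the next load
    }
    C[row * n + col] = acc;
}

// Hypothetical host wrapper; returns false on bad sizes or CUDA errors.
bool matMulHost(const std::vector<float> &A, const std::vector<float> &B,
                std::vector<float> &C, int n, int tile)
{
    if (n <= 0 || tile <= 0 || n % tile != 0 || tile * tile > 1024)
        return false;
    size_t count = size_t(n) * n;
    if (A.size() != count || B.size() != count) return false;
    C.resize(count);

    size_t bytes = count * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess &&
              cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(tile, tile), grid(n / tile, n / tile);
        // Two tiles of tile*tile floats, as described above.
        size_t shmem = 2 * size_t(tile) * tile * sizeof(float);
        matMulDynamic<<<grid, block, shmem>>>(dC, dA, dB, n, tile);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ok;
}
```

A call such as matMulHost(A, B, C, 64, 16) would then use a 16x16 tile chosen at run time. The tile*tile > 1024 check reflects the usual limit of 1024 threads per block.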

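32 votes

Stack Overflow user

Posted 2012-02-08 17:42:27

You have two choices for declaring shared memory inside a kernel: static or dynamic. I presume what you are doing at the moment looks something like this: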

#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

}

and you would like to be able to change BLOCK_SIZE easily.

One possibility is to keep using static shared memory allocation, but make the allocation size a template parameter, like this:

template<int blocksize=16>
__global__ void sgemm1(const float *A, const float *B, float *C)
{
    __shared__ float As[blocksize][blocksize];

}
template void sgemm1<16>(const float *, const float *, float *C);

You can then instantiate as many different block size variants as you need at compile time.
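Since the asker wants the size to come from user input, the template variants still have to be selected at run time. One way to do that is a switch over the instantiated sizes, sketched below. The names matMulStatic, launchTile and launchMatMul are hypothetical, the set of sizes (8, 16, 32) is an arbitrary choice for illustration, and the matrices are assumed square with n a multiple of the tile size.

```cuda
#include <cuda_runtime.h>

// Tiled multiply where the tile size is a compile-time constant,
// so static 2D shared arrays can be used.
template <int TILE>
__global__ void matMulStatic(float *C, const float *A, const float *B, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[ty][tx] = A[row * n + t * TILE + tx];
        Bs[ty][tx] = B[(t * TILE + ty) * n + col];
        __syncthreads();

        #pragma unroll   // trip count is known at compile time
        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Launches one instantiation; device pointers are assumed valid.
template <int TILE>
cudaError_t launchTile(float *dC, const float *dA, const float *dB, int n)
{
    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    matMulStatic<TILE><<<grid, block>>>(dC, dA, dB, n);
    return cudaGetLastError();
}

// Hypothetical dispatcher: maps a runtime tile size to one of the
// pre-instantiated kernels, rejecting sizes that were not compiled in.
cudaError_t launchMatMul(float *dC, const float *dA, const float *dB,
                         int n, int tile)
{
    if (n <= 0 || tile <= 0 || n % tile != 0) return cudaErrorInvalidValue;
    switch (tile) {
        case 8:  return launchTile<8>(dC, dA, dB, n);
        case 16: return launchTile<16>(dC, dA, dB, n);
        case 32: return launchTile<32>(dC, dA, dB, n);
        default: return cudaErrorInvalidValue;
    }
}
```

The trade-off is that only the listed sizes are available, and each one adds a compiled copy of the kernel.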

If you would rather allocate the memory dynamically, define it like this:

__global__ void sgemm2(const float *A, const float *B, float *C)
{
    extern __shared__ float As[];

}

and then add the size of the allocation as an argument to the kernel call:

size_t blocksize = BLOCK_SIZE * BLOCK_SIZE;
sgemm2<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

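If you have multiple statically declared arrays that you wish to replace with dynamically allocated shared memory, be aware that there is only ever one dynamic shared memory allocation per kernel, so multiple items must share that one memory segment. So if you had something like: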

#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

}

you could replace it with:

#define BLOCK_SIZE (16)

__global__ void sgemm3(const float *A, const float *B, float *C)
{
    extern __shared__ float buffer[];

    float *As = &buffer[0];
    float *Bs = &buffer[BLOCK_SIZE*BLOCK_SIZE];

}

and launch the kernel like this:

size_t blocksize = 2 * BLOCK_SIZE * BLOCK_SIZE;
sgemm3<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

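All of these are equally valid, although I personally favour the template version because it allows other compiler optimisations, such as automatic loop unrolling, that the dynamic version cannot get without extra work.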

30 votes

Stack Overflow user

Posted 2012-02-08 16:26:54

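That sounds right.

Usually in this situation you need to malloc something.

There are two things going on here: first, C has no real notion of 2D arrays (it is just an array of arrays); second, array sizes need to be compile-time constants (or something the compiler can calculate at compile time).

If you are using C99, you can declare the array size using a parameter of the function, but C99 support is... spotty at best.

0 votes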

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/9187899
