I am using inline PTX ld.shared to load data from shared memory:
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory
float Csub = 0;
As[TY][TX] = A[a + wA * TY + TX]; //load data from global memory to shared memory
__syncthreads();
float t;
asm("ld.shared.f32 %0, [%1];" :"=f"(
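I am trying to use shared memory to cache data with OpenACC.
Basically, what I am doing is a matrix multiplication, and this is what I have: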
typedef float ff;
// Multiplies two square row-major matrices a and b, puts the result in c.
void mmul(const ff* restrict a,
const ff* restrict b,
ff* restrict c,
const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n])