首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >CUDA矩阵类

CUDA矩阵类
EN

Code Review用户
提问于 2019-06-15 03:34:33
回答 1查看 445关注 0票数 6

我开始在这个CUDA C矩阵类上工作,学习C++中的面向对象编程和学习CUDA。这个项目的最初目标是创建一个矩阵类,它可以具有与MATLAB几乎相似的语法。然而,在这个过程中,我遇到了一个很大的问题,那就是主机到设备的编号和承载数据传输调用的设备。我能够通过分离lvaluervalue引用并移动语义来改善这种情况。目前,我正处于一个需要优化的工作代码的阶段。

我正在寻找以下问题的答案:

  • 作为一名开发人员或计算机专业的学生,我需要知道这段代码写得有多好?
  • 我的错误和异常处理方法是否正确?有更好的办法吗?(看看imp_includes.hcucu_error_list.cuerror_check.cu)
  • 我知道在block_dim.cu中计算线程块的尺寸的方法是非常天真的。我想知道是否有更好的方法来自动识别线程块的尺寸。
  • 我想出的减少复制构造函数调用数量的解决方案现在要求我有多个重载来分离lavluervalue引用。有没有更简单的方法来解决这个问题?
  • 为了进一步提高数据传输性能,我正在考虑使用CUDA流。我想就它在这一具体情况下的有用性发表一些意见。
  • 我不指望有人来看完整的图书馆。但是,如果有任何代码可以改进,请告诉我。

以下是文件的最小版本,我认为这些版本足以回答我的问题。完整的代码可以找到这里

cu_mat.hcu ## - Class header

代码语言:javascript
运行
复制
#ifndef CU_MAT_HCU_
#define CU_MAT_HCU_

#include "imp_includes.hcu"

class cu_mat {
protected:
    size_t n_rows=0, n_cols=0;
    double *p=NULL;
    cu_mat(){}                                                                                              // Default constructor
    cu_mat(const size_t &r, const size_t &c, const double &n);                                              // Two argument constructor with initialization
    void init(const size_t &r, const size_t &c);                                                            // Two argument memory allocation with initialization

public:
    /***** Constructors *****/
    cu_mat(const cu_mat &to_b_copied);                                                                      // Copy constructor
    cu_mat(const std::initializer_list<std::initializer_list<double>> &mat);                                // Single argument constructor with 'double' values
    cu_mat(const std::initializer_list<std::initializer_list<cu_mat>> &mat);                                        // Single argument constructor with 'cu_mat' values
    cu_mat(const double &n);                                                                                // Single value constructor
    cu_mat(cu_mat&& to_be_moved);                                                                           // Move constructor



    /***** Operators *****/ // Add an ultimate '()' operator.
//  cu_mat operator()(const cu_mat rows, const cu_mat cols);                                                // Sub-matrix access with 'cu_mat'
    cu_mat operator()(const size_t &idx) const;                                                             // Matrix element access based on index
    cu_mat operator()(const size_t &r, const size_t &c) const;                                                  // Matrix element access
    cu_mat operator()(const size_t &r_begin, const size_t &r_end, const size_t &c_begin, const size_t &c_end) const;    // Sub-matrix access

    cu_mat& operator=(const cu_mat &b);                                                                     // Assignment operator to copy 'cu_mat'
    cu_mat& operator=(cu_mat &&b);                                                                  // Assignment operator to move 'cu_mat'                                                                     // Assignment operator for 'double' to avoid implicit type casting

    cu_mat operator*(const cu_mat &b) const &;                                                                      // Matrix multiplication operator
    cu_mat operator*(cu_mat &&b) const &;                                                                       // Matrix multiplication operator
    cu_mat operator*(const cu_mat &b) &&;                                                                       // Matrix multiplication operator
    cu_mat operator*(cu_mat &&b) &&;                                                                    // Matrix multiplication operator

    cu_mat operator>(const cu_mat &b) const &;                                                                      // Greater than operator
    cu_mat operator>(cu_mat &&b) const &;                                                                       // Greater than operator
    cu_mat operator>(const cu_mat &b) &&;                                                                       // Greater than operator
    cu_mat operator>(cu_mat &&b) &&;                                                                        // Greater than operator

    explicit operator double();                                                                             // Type conversion from cu_mat to double



    /***** Member functions *****/ // Add an ultimate replace function
    cu_mat div(const cu_mat &b) const &;                                                                            // Element wise division
    cu_mat div(cu_mat &&b) const &;                                                                         // Element wise division
    cu_mat div(const cu_mat &b) &&;                                                                         // Element wise division
    cu_mat div(cu_mat &&b) &&;                                                                          // Element wise division

    cu_mat mult(const cu_mat &b) const &;                                                                           // Element wise multiplication
    cu_mat mult(cu_mat &&b) const &;                                                                            // Element wise multiplication
    cu_mat mult(const cu_mat &b) &&;                                                                            // Element wise multiplication
    cu_mat mult(cu_mat &&b) &&;                                                                         // Element wise multiplication

    cu_mat pow(const double &n) const &;                                                                            // Element wise power
    cu_mat pow(const double &n) &&;                                                                         // Element wise power

    void replace(const size_t &r, const size_t &c, const cu_mat &n);                                        // Replace an element with a 'cu_mat' value

    void replace(const size_t &r_begin, const size_t &r_end, const size_t &c_begin, const size_t &c_end, const cu_mat &n);  // Replace submatrix with a 'cu_mat' matrix

    void get();                                                                                 // Print matrix data

    void print(std::ofstream &print);                                                                       // Print matrix to a file

    size_t rows();                                                                                          // Get number of rows
    size_t rows() const;                                                                                            // Get number of rows

    size_t cols();                                                                                          // Get number of columns
    size_t cols() const ;                                                                                           // Get number of columns

    double* pointer();                                                                                      // Get GPU memory pointer
    double* pointer() const;                                                                                        // Get GPU memory pointer



    /***** Supported external (friend) functions *****/
    friend cu_mat randn(const size_t &r, const size_t &c);                                                  // Generate a matrix with normalized random numbers
    friend cu_mat randn(const size_t &n);

    friend cu_mat mld(const cu_mat &a, const cu_mat &b);                                                                // Matrix left divide operator
    friend cu_mat mld(const cu_mat &a, cu_mat &&b);                                                             // Matrix left divide operator
    friend cu_mat mld(cu_mat &&a, const cu_mat &b);                                                             // Matrix left divide operator
    friend cu_mat mld(cu_mat &&a, cu_mat &&b);                                                              // Matrix left divide operator

    friend cu_mat eye(const size_t &r, const size_t &c);                                                    // Generate a non-square identity matrix
    friend cu_mat eye(const size_t &n);

    friend cu_mat ones(const size_t &r, const size_t &c);                                                   // Matrix with all values 1
    friend cu_mat ones(const size_t &n);

    friend cu_mat zeros(const size_t &r, const size_t &c);                                                  // Matrix with all values 0
    friend cu_mat zeros(const size_t &n);

    friend cu_mat trans(cu_mat &a);                                                                 // Transpose of the matrix

    friend cu_mat horzcat(cu_mat &a, cu_mat &b);                                                // Horizontal concatenation of two matrices

    friend cu_mat vertcat(cu_mat &a, cu_mat &b);                                                // Vertical concatenation of two matrices

    friend cu_mat stepspace(const double &i, const double &step, const double &f);                          // MATLAB colon operator

    friend bool isscalar(const cu_mat &a);                                                                  // Check if 'cu_mat' object is scalar

    /***** Destructor *****/
    virtual ~cu_mat();
};

/***** Supported external (friend) functions *****/
cu_mat randn(const size_t &r, const size_t &c);                                                 // Generate a matrix with normalized random numbers
cu_mat randn(const size_t &n);

cu_mat mld(const cu_mat &a, const cu_mat &b);                                                               // Matrix left divide operator
cu_mat mld(const cu_mat &a, cu_mat &&b);                                                                // Matrix left divide operator
cu_mat mld(cu_mat &&a, const cu_mat &b);                                                                // Matrix left divide operator
cu_mat mld(cu_mat &&a, cu_mat &&b);                                                             // Matrix left divide operator

cu_mat eye(const size_t &r, const size_t &c);                                                   // Generate a non-square identity matrix
cu_mat eye(const size_t &n);

cu_mat ones(const size_t &r, const size_t &c);                                                  // Matrix with all values 1
cu_mat ones(const size_t &n);

cu_mat zeros(const size_t &r, const size_t &c);                                                 // Matrix with all values 0
cu_mat zeros(const size_t &n);

cu_mat trans(cu_mat &a);                                                                    // Transpose of the matrix

cu_mat horzcat(cu_mat &a, cu_mat &b);                                               // Horizontal concatenation of two matrices

cu_mat vertcat(cu_mat &a, cu_mat &b);                                               // Vertical concatenation of two matrices

cu_mat stepspace(const double &i, const double &step, const double &f);                         // MATLAB colon operator

bool isscalar(const cu_mat &a);                                                                 // Check if 'cu_mat' object is scalar

#endif /* CU_MAT_HCU_ */

cu_mat.cu ## -成员函数、构造函数、析构函数、运算符和朋友函数

代码语言:javascript
运行
复制
#include "cu_mat.hcu"
//  Constructors
/**************************************   Single argument constructor with 'double' values   *******************************************/
cu_mat::cu_mat(const std::initializer_list<std::initializer_list<double>> &mat) : n_rows(mat.size()), n_cols(mat.begin()->size())
{
    // ' -> ' Means:  pointer to an object -> member function. Essentially accessing a member function with the help of a pointer to that object.
    // Define number of rows from the array input. Define number of columns from first row of array input
    // Check if the number of elements in each row are same.
    for(int i = 0; i<n_rows; ++i)
    {
        confirm((mat.begin()+i)->size()==n_cols,"Error: Object initialization failed. Number of elements in each row must be same.");
    }

    // Copy input initializer-list to a new 2D array while making it column major.
    double *m = new double[n_rows*n_cols]();    // Allocate space on CPU memory.
    confirm(m,"Error: Memory allocation failed while initializing the object."); // Check proper allocation.

    for(int i = 0; i<n_rows; ++i)
    {
        for(int j = 0; j<n_cols; ++j)
        {
            m[j*n_rows+i] = *((mat.begin()+i)->begin()+j);
        }
    }

    HANDLE_ERROR( cudaMalloc((void**)&p, n_rows*n_cols*sizeof(double)) ); // Allocate memory on GPU.
    HANDLE_ERROR( cudaMemcpy(p,m,n_rows*n_cols*sizeof(double),cudaMemcpyHostToDevice) ); // Copy array from CPU to GPU
    delete[] m;
}
/***********************************************************************************************************************/


/**************************************   Single argument constructor with 'cu_mat' values   *******************************************/
__global__ void copymat(double* dest, double* src, size_t bias, size_t src_rows, size_t main_rows_bias, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    dest[bias+idx+idx/src_rows*main_rows_bias] = src[idx];
}
cu_mat::cu_mat(const std::initializer_list<std::initializer_list<cu_mat>> &mat)
{
    // Calculate total number of columns
    for(int i = 0; i<mat.begin()->size(); ++i)
        n_cols += ((mat.begin())->begin()+i)->n_cols;

    // Check equal number of rows for horizontal concatenation and calculate total number of rows.
    for(int i = 0; i<mat.size(); ++i)
    {
        size_t cols = ((mat.begin()+i)->begin())->n_cols;
        for(int j = 0; j<(mat.begin()+i)->size()-1; ++j)
        {
            confirm(((mat.begin()+i)->begin()+j)->n_rows==((mat.begin()+i)->begin()+j+1)->n_rows,"Error: Dimensions of arrays being horizontally concatenated are not consistent.");
            cols += ((mat.begin()+i)->begin()+j+1)->n_cols;
        }
        confirm(cols == n_cols,"Error: Dimensions of arrays being vertically concatenated are not consistent.")
        n_rows += ((mat.begin()+i)->begin())->n_rows;
    }

    // Allocate memory and copy data
    if ((n_rows>0)&&(n_cols>0))
    {
        HANDLE_ERROR( cudaMalloc((void**)&p, n_rows*n_cols*sizeof(double)) ); // Allocate memory on GPU.
        size_t bias, src_rows, src_cols;
        size_t main_rows_bias, n_ele, n_threads;
        size_t r_sum = 0, c_sum = 0;
        for(int i = 0; i<mat.size(); ++i){
            for(int j = 0; j<(mat.begin()+i)->size(); ++j){
                bias = c_sum*n_rows+r_sum; src_rows = ((mat.begin()+i)->begin()+j)->n_rows; src_cols = ((mat.begin()+i)->begin()+j)->n_cols;
                main_rows_bias = n_rows-src_rows; n_ele = src_rows*src_cols; n_threads = block_dim(n_ele); c_sum += src_cols;
                copymat<<<n_ele/n_threads,n_threads>>>(p,((mat.begin()+i)->begin()+j)->p,bias,src_rows,main_rows_bias,n_ele);
                HANDLE_ERROR( cudaPeekAtLastError() );
            }
            r_sum += src_rows; c_sum = 0;
        }
    }
}
/***********************************************************************************************************************/


/************************************   Single value constructor   ***********************************************/
cu_mat::cu_mat(const double &n) : n_rows(1), n_cols(1)
{
    HANDLE_ERROR( cudaMalloc((void**)&p, n_rows*n_cols*sizeof(double)) ); // Allocate memory on GPU.
//    std::cout << p << std::endl;
    HANDLE_ERROR( cudaMemcpy(p,&n,n_rows*n_cols*sizeof(double),cudaMemcpyHostToDevice) ); // Copy array from CPU to GPU
}
/***********************************************************************************************************************/


/************************************   Copy constructor   ***********************************************/
cu_mat::cu_mat(const cu_mat &to_b_copied) : n_rows(to_b_copied.n_rows), n_cols(to_b_copied.n_cols)
{
//    std::cout << "Copy constructor called." << std::endl;
    if ((n_rows>0)&&(n_cols>0))
    {
        HANDLE_ERROR( cudaFree(p) );
        HANDLE_ERROR( cudaMalloc((void**)&p,n_rows*n_cols*sizeof(double)) ); // Allocate memory on GPU.
        HANDLE_ERROR( cudaMemcpy(p,to_b_copied.p,n_rows*n_cols*sizeof(double),cudaMemcpyDeviceToDevice) ); // Copy array from CPU to GPU
    }
}
/***********************************************************************************************************************/


/************************************   Two argument constructor with initialization   ***********************************************/
__global__ void set_data(double* p, const double n, const double n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    p[idx] = n;
}
cu_mat::cu_mat(const size_t &r, const size_t &c, const double &n=0) : n_rows(r), n_cols(c)
{
    if ((n_rows>0)&&(n_cols>0))
    {
        HANDLE_ERROR( cudaMalloc((void**)&p, n_rows*n_cols*sizeof(double)) );
        if (n!=0)
        {
            size_t n_ele = n_rows*n_cols;
            size_t n_threads = block_dim(n_ele);
            set_data<<<n_ele/n_threads,n_threads>>>(p,n,n_ele);
            HANDLE_ERROR( cudaPeekAtLastError() );
        }
        else
        {
            HANDLE_ERROR( cudaMemset(p,0,n_rows*n_cols*sizeof(double)) );
        }
    }
}
/***********************************************************************************************************************/


/************************************   Move constructor   ***********************************************/
cu_mat::cu_mat(cu_mat&& to_b_moved)
{
//    std::cout << "Move constructor called." << std::endl;
    size_t nrows = to_b_moved.n_rows, ncols = to_b_moved.n_cols; double *ptr = to_b_moved.p;
    to_b_moved.n_rows = 0; to_b_moved.n_cols = 0; to_b_moved.p = NULL;
    n_rows = nrows; n_cols = ncols; p = ptr;
}
/***********************************************************************************************************************/










//  Operators
/**************************************   Sub-matrix access with 'cu_mat'   *******************************************/
// __global__ void check_integer()
// cu_mat cu_mat::operator()(const cu_mat rows, const cu_mat cols)
// {
//     confirm((rows.n_rows==1) || (rows.n_cols==1), "Error: 'rows' has to be a vector.");
//     confirm((cols.n_rows==1) || (cols.n_cols==1), "Error: 'rows' has to be a vector.");
//     confirm(idx > 0, "Indexing starts from 1 for this library.")
//     confirm(idx <= n_rows*n_cols,"Error: Index exceeds matrix bounds. Matrix has " << n_rows*n_cols << "elements in it.");
//     cu_mat temp(1,1);
//     HANDLE_ERROR( cudaMemcpy(temp.p,p+(idx-1),sizeof(double),cudaMemcpyDeviceToDevice) ); // Copy value from GPU to GPU
//     return temp;
// }
/***********************************************************************************************************************/


/**************************************   Matrix element access based on index   *******************************************/
cu_mat cu_mat::operator()(const size_t &idx) const
{
    confirm(idx > 0, "Indexing starts from 1 for this library.")
    confirm(idx <= n_rows*n_cols,"Error: Index exceeds matrix bounds. Matrix has " << n_rows*n_cols << " elements in it.");
    cu_mat temp(1,1);
    HANDLE_ERROR( cudaMemcpy(temp.p,p+(idx-1),sizeof(double),cudaMemcpyDeviceToDevice) ); // Copy value from GPU to GPU
    return std::move(temp);
}
/***********************************************************************************************************************/


/**************************************   Access single element of the matrix   *******************************************/
cu_mat cu_mat::operator()(const size_t &r, const size_t &c) const
{
    confirm((r>0)&&(c>0), "Indexing starts from 1 for this library.")
    confirm((r<=n_rows)&&(c<=n_cols),"Error: Index exceeds matrix bounds. The size of the matrix is " << n_rows << "x" << n_cols << ".");
    cu_mat temp(1,1);
    HANDLE_ERROR( cudaMemcpy(temp.p,p+(c-1)*n_rows+r-1,sizeof(double),cudaMemcpyDeviceToDevice) ); // Copy value from GPU to GPU
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(temp);
}
/***********************************************************************************************************************/


/**************************************   Access sub-matrix   *******************************************/
__global__ void submat(double* dest, double* src, size_t bias, size_t dest_rows, size_t main_rows_bias, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    dest[idx] = src[bias+idx+idx/dest_rows*main_rows_bias];
}
cu_mat cu_mat::operator()(const size_t &r_begin, const size_t &r_end, const size_t &c_begin, const size_t &c_end) const
{
    confirm((r_begin>0)&&(c_begin>0), "Indexing starts from 1 for this library.")
    confirm((r_end<=n_rows)&&(c_end<=n_cols),"Error: Index exceeds matrix bounds. The size of the matrix is " << n_rows << "x" << n_cols << ".")
    cu_mat temp(r_end-r_begin+1,c_end-c_begin+1);
    size_t bias = (c_begin-1)*n_rows+r_begin-1;
    size_t main_rows_bias = n_rows-temp.n_rows;
    size_t n_ele = temp.n_rows*temp.n_cols;
    size_t n_threads = block_dim(n_ele);
    submat<<<n_ele/n_threads,n_threads>>>(temp.p,p,bias,temp.n_rows,main_rows_bias,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(temp);
}
/***********************************************************************************************************************/


/***************************************   Assignment operator to copy 'cu_mat'   **************************************/
cu_mat& cu_mat::operator=(const cu_mat &b)
{
//  std::cout << "Copy assignment operator called." << std::endl;
    if ((n_rows*n_cols)!=(b.n_rows*b.n_cols))
    {
        HANDLE_ERROR( cudaFree(p) );
        HANDLE_ERROR( cudaMalloc((void**)&p, b.n_rows*b.n_cols*sizeof(double)) ); // Allocate memory on GPU.
    }
    n_rows = b.n_rows; n_cols = b.n_cols;
    HANDLE_ERROR( cudaMemcpy(p,b.p,n_rows*n_cols*sizeof(double),cudaMemcpyDeviceToDevice) ); // Copy array from GPU to GPU
    return *this;
}
/***********************************************************************************************************************/


/***************************************   Assignment operator to move 'cu_mat'   **************************************/
cu_mat& cu_mat::operator=(cu_mat &&b)
{
//    std::cout << "Move assignment operator called." << std::endl;
    n_rows = b.n_rows; b.n_rows = 0;
    n_cols = b.n_cols; b.n_cols = 0;
    HANDLE_ERROR( cudaFree(p) );
    p = b.p; b.p = NULL;
    return *this;
}
/***********************************************************************************************************************/


/***************************************   Matrix multiplication   **************************************/
__global__ void const_mat_mult(double *dest, double *src, double *n, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    dest[idx] = (*n)*src[idx];
}
cu_mat cu_mat::operator*(const cu_mat &b) const &
{
    if (isscalar(*this))
    {
        cu_mat c(b.n_rows,b.n_cols);
        size_t n_ele = c.n_rows*c.n_cols, n_threads = block_dim(n_ele);
        const_mat_mult<<<n_ele/n_threads,n_threads>>>(c.p,b.p,p,n_ele);
        return std::move(c);
    }
    else if (isscalar(b))
    {
        cu_mat c(n_rows,n_cols);
        size_t n_ele = c.n_rows*c.n_cols, n_threads = block_dim(n_ele);
        const_mat_mult<<<n_ele/n_threads,n_threads>>>(c.p,p,b.p,n_ele);
        return std::move(c);
    }
    else
    {
        confirm(n_cols == b.n_rows,"Error : Matrix multiplication is not possible. Inner matrix dimensions must agree.");
        cu_mat c(n_rows,b.n_cols);
        HANDLE_ERROR( cudaMalloc((void**)&c.p,c.n_rows*c.n_cols*sizeof(double)) ); // Allocate memory on GPU.
        double alf = 1.0, bet = 0;
        cublasHandle_t handle;
        HANDLE_ERROR( cublasCreate(&handle) );
        HANDLE_ERROR( cublasDgemm(handle,CUBLAS_OP_N,CUBLAS_OP_N,n_rows,b.n_cols,n_cols,&alf,p,n_rows,b.p,n_cols,&bet,c.p,n_rows) );
        HANDLE_ERROR( cublasDestroy(handle) );
        return std::move(c);
    }
}
cu_mat cu_mat::operator*(cu_mat &&b) const &
{
    if (isscalar(*this))
    {
        cu_mat c = std::move(b);
        size_t n_ele = c.n_rows*c.n_cols, n_threads = block_dim(n_ele);
        const_mat_mult<<<n_ele/n_threads,n_threads>>>(c.p,c.p,p,n_ele);
        return std::move(c);
    }
    else if (isscalar(b))
    {
        cu_mat c(n_rows,n_cols);
        size_t n_ele = c.n_rows*c.n_cols, n_threads = block_dim(n_ele);
        const_mat_mult<<<n_ele/n_threads,n_threads>>>(c.p,p,b.p,n_ele);
        return std::move(c);
    }
    else
    {
        confirm(n_cols == b.n_rows,"Error : Matrix multiplication is not possible. Inner matrix dimensions must agree.");
        cu_mat c(n_rows,b.n_cols);
        HANDLE_ERROR( cudaMalloc((void**)&c.p,c.n_rows*c.n_cols*sizeof(double)) ); // Allocate memory on GPU.
        double alf = 1.0, bet = 0;
        cublasHandle_t handle;
        HANDLE_ERROR( cublasCreate(&handle) );
        HANDLE_ERROR( cublasDgemm(handle,CUBLAS_OP_N,CUBLAS_OP_N,n_rows,b.n_cols,n_cols,&alf,p,n_rows,b.p,n_cols,&bet,c.p,n_rows) );
        HANDLE_ERROR( cublasDestroy(handle) );
        return std::move(c);
    }
}
cu_mat cu_mat::operator*(const cu_mat &b)&&
{
    if (isscalar(*this))
    {
        cu_mat c(b.n_rows,b.n_cols);
        size_t n_ele = c.n_rows*c.n_cols, n_threads = block_dim(n_ele);
        const_mat_mult<<<n_ele/n_threads,n_threads>>>(c.p,b.p,p,n_ele);
        return std::move(c);
    }
    else if (isscalar(b))
    {
        cu_mat c = std::move(*this);
        size_t n_ele = c.n_rows*c.n_cols, n_threads = block_dim(n_ele);
        const_mat_mult<<<n_ele/n_threads,n_threads>>>(c.p,c.p,b.p,n_ele);
        return std::move(c);
    }
    else
    {
        confirm(n_cols == b.n_rows,"Error : Matrix multiplication is not possible. Inner matrix dimensions must agree.");
        cu_mat c(n_rows,b.n_cols);
        HANDLE_ERROR( cudaMalloc((void**)&c.p,c.n_rows*c.n_cols*sizeof(double)) ); // Allocate memory on GPU.
        double alf = 1.0, bet = 0;
        cublasHandle_t handle;
        HANDLE_ERROR( cublasCreate(&handle) );
        HANDLE_ERROR( cublasDgemm(handle,CUBLAS_OP_N,CUBLAS_OP_N,n_rows,b.n_cols,n_cols,&alf,p,n_rows,b.p,n_cols,&bet,c.p,n_rows) );
        HANDLE_ERROR( cublasDestroy(handle) );
        return std::move(c);
    }
}
cu_mat cu_mat::operator*(cu_mat &&b)&&
{
    if (isscalar(*this))
    {
        cu_mat c = std::move(b);
        size_t n_ele = c.n_rows*c.n_cols, n_threads = block_dim(n_ele);
        const_mat_mult<<<n_ele/n_threads,n_threads>>>(c.p,c.p,p,n_ele);
        return std::move(c);
    }
    else if (isscalar(b))
    {
        cu_mat c = std::move(*this);
        size_t n_ele = c.n_rows*c.n_cols, n_threads = block_dim(n_ele);
        const_mat_mult<<<n_ele/n_threads,n_threads>>>(c.p,c.p,b.p,n_ele);
        return std::move(c);
    }
    else
    {
        confirm(n_cols == b.n_rows,"Error : Matrix multiplication is not possible. Inner matrix dimensions must agree.");
        cu_mat c(n_rows,b.n_cols);
        HANDLE_ERROR( cudaMalloc((void**)&c.p,c.n_rows*c.n_cols*sizeof(double)) ); // Allocate memory on GPU.
        double alf = 1.0, bet = 0;
        cublasHandle_t handle;
        HANDLE_ERROR( cublasCreate(&handle) );
        HANDLE_ERROR( cublasDgemm(handle,CUBLAS_OP_N,CUBLAS_OP_N,n_rows,b.n_cols,n_cols,&alf,p,n_rows,b.p,n_cols,&bet,c.p,n_rows) );
        HANDLE_ERROR( cublasDestroy(handle) );
        return std::move(c);
    }
}
/***********************************************************************************************************************/


/************************************   Greater than operator   ***********************************************/
__global__ void elem_greater(double* a, double* b, double* c, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    c[idx] = (a[idx] > b[idx]);
}
cu_mat cu_mat::operator>(const cu_mat &b) const &
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Boolean check is not possible. Matrices must have same dimensions.");
    cu_mat c(n_rows,n_cols);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_greater<<<n_ele/n_threads,n_threads>>>(p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::operator>(cu_mat &&b) const &
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Boolean check is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(b);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_greater<<<n_ele/n_threads,n_threads>>>(p,c.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::operator>(const cu_mat &b) &&
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Boolean check is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(*this);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_greater<<<n_ele/n_threads,n_threads>>>(c.p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::operator>(cu_mat &&b) &&
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Boolean check is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(*this);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_greater<<<n_ele/n_threads,n_threads>>>(c.p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
/***********************************************************************************************************************/


/***************************************   Type conversion from cu_mat to double   **************************************/
cu_mat::operator double()
{
    confirm((n_rows==1) && (n_cols==1), "Error: Type conversion is only possible in the case of 1x1 matrix.");
    double val;
    // Copy data from GPU to CPU.
    HANDLE_ERROR( cudaMemcpy(&val,p,sizeof(double),cudaMemcpyDeviceToHost) );
    return std::move(val);
}
/***********************************************************************************************************************/









//  Member Functions
/************************************   Two argument memory allocation with initialization   ***********************************************/
void cu_mat::init(const size_t &r, const size_t &c)
{
    n_rows = r; n_cols = c;
    if ((n_rows>0)&&(n_cols>0))
    {
        HANDLE_ERROR( cudaMalloc((void**)&p, n_rows*n_cols*sizeof(double)) );
        HANDLE_ERROR( cudaMemset(p,0,n_rows*n_cols*sizeof(double)) );
    }
}
/***********************************************************************************************************************/


/************************************   Element wise division   ***********************************************/
__global__ void elem_div(double* a, double* b, double* c, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    c[idx] = a[idx] / b[idx];
}
cu_mat cu_mat::div(const cu_mat &b) const &
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Matrix multiplication is not possible. Matrices must have same dimensions.");
    cu_mat c(n_rows,n_cols);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_div<<<n_ele/n_threads,n_threads>>>(p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::div(cu_mat &&b) const &
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Matrix multiplication is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(b);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_div<<<n_ele/n_threads,n_threads>>>(p,c.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::div(const cu_mat &b) &&
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Matrix multiplication is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(*this);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_div<<<n_ele/n_threads,n_threads>>>(c.p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::div(cu_mat &&b) &&
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Matrix multiplication is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(*this);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_div<<<n_ele/n_threads,n_threads>>>(c.p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
/***********************************************************************************************************************/


/************************************   Element wise multiplication   ***********************************************/
__global__ void elem_mult(double* a, double* b, double* c, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    c[idx] = a[idx] * b[idx];
}
cu_mat cu_mat::mult(const cu_mat &b) const &
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Matrix multiplication is not possible. Matrices must have same dimensions.");
    cu_mat c(n_rows,n_cols);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_mult<<<n_ele/n_threads,n_threads>>>(p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::mult(cu_mat &&b) const &
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Matrix multiplication is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(b);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_mult<<<n_ele/n_threads,n_threads>>>(p,c.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::mult(const cu_mat &b) &&
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Matrix multiplication is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(*this);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_mult<<<n_ele/n_threads,n_threads>>>(c.p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::mult(cu_mat &&b) &&
{
    confirm((n_rows == b.n_rows) && (n_cols == b.n_cols),"Error : Matrix multiplication is not possible. Matrices must have same dimensions.");
    cu_mat c = std::move(*this);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_mult<<<n_ele/n_threads,n_threads>>>(c.p,b.p,c.p,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
/***********************************************************************************************************************/


/************************************   Element wise power   ***********************************************/
__global__ void elem_power(double* dest, double* src, double n, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    dest[idx] = pow(src[idx],n);
}
cu_mat cu_mat::pow(const double &n) const &
{
    cu_mat c(n_rows,n_cols);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_power<<<n_ele/n_threads,n_threads>>>(c.p,p,n,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
cu_mat cu_mat::pow(const double &n) &&
{
    cu_mat c = std::move(*this);
    size_t n_ele = c.n_rows*c.n_cols;
    size_t n_threads = block_dim(n_ele);
    elem_power<<<n_ele/n_threads,n_threads>>>(c.p,c.p,n,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(c);
}
/***********************************************************************************************************************/


/************************************   Replace an element with a 'cu_mat' value   ***********************************************/
void cu_mat::replace(const size_t &r, const size_t &c, const cu_mat &n)
{
    confirm((n.n_rows==1) && (n.n_cols==1),"Error: Value being replaced with has to be scalar.");
    size_t bias = c*n_rows+r, src_rows = 1, src_cols = 1;
    size_t main_rows_bias = n_rows-src_rows, n_ele = src_rows*src_cols, n_threads = block_dim(n_ele);
    copymat<<<n_ele/n_threads,n_threads>>>(p,n.p,bias,src_rows,main_rows_bias,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
}
/***********************************************************************************************************************/


/************************************   Replace submatrix with a 'cu_mat' matrix   ***********************************************/
void cu_mat::replace(const size_t &r_begin, const size_t &r_end, const size_t &c_begin, const size_t &c_end, const cu_mat &n)
{
    confirm((r_end<=n_rows) && (c_end<=n_cols),"Error: Index exceeds matrix bounds. The size of the matrix is " << n_rows << "x" << n_cols << ".");
    confirm((n.n_rows==r_end-r_begin+1) && (n.n_cols==c_end-c_begin+1),"Error: Unable to replace the data due to size mismatch.");
    size_t bias = (c_begin-1)*n_rows+r_begin-1, src_rows = n.n_rows, src_cols = n.n_cols;
    size_t main_rows_bias = n_rows-src_rows, n_ele = src_rows*src_cols, n_threads = block_dim(n_ele);
    copymat<<<n_ele/n_threads,n_threads>>>(p,n.p,bias,src_rows,main_rows_bias,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
}
/***********************************************************************************************************************/


/************************************   Print matrix data   ***********************************************/
void cu_mat::get()
{
    double *m = new double[n_rows*n_cols]();    // Allocate space on CPU memory.
    confirm(m,"Error: Memory allocation failed in 'get()'.") // Check proper allocation.

    // Copy data from GPU to CPU.
    HANDLE_ERROR( cudaMemcpy(m,p,n_rows*n_cols*sizeof(double),cudaMemcpyDeviceToHost) );

    std::cout << std::scientific << std::setprecision(4);
    for(int i = 0; i<n_rows; ++i)
    {
        for(int j = 0; j<n_cols; ++j)
        {
            std::cout<<m[j*n_rows+i]<<" ";
        }
        std::cout<<std::endl;
    }
    delete[] m;
}
/***********************************************************************************************************************/


/************************************   Print matrix to a file   ***********************************************/
void cu_mat::print(std::ofstream &print)
{
    double *m = new double[n_rows*n_cols]();    // Allocate space on CPU memory.
    confirm(m,"Error: Memory allocation failed in 'print()'.") // Check proper allocation.

    // Copy data from GPU to CPU.
    HANDLE_ERROR( cudaMemcpy(m,p,n_rows*n_cols*sizeof(double),cudaMemcpyDeviceToHost) );

    // Print the matrix
    print << std::scientific << std::setprecision(8);
    for(int i = 0; i<n_rows; ++i)
    {
        print << " ";
        for(int j = 0; j<n_cols; ++j)
        {
            print << " " << m[j*n_rows+i] << " ";
        }
        print << std::endl;
    }
    delete[] m;
}
/***********************************************************************************************************************/


/***************************************   Get number of rows   *****************************************/
size_t cu_mat::rows(){return n_rows;}
/***********************************************************************************************************************/


/***************************************   Get number of columns   *****************************************/
size_t cu_mat::cols(){return n_cols;}
/***********************************************************************************************************************/


/***************************************   Get GPU memory pointer   *****************************************/
double* cu_mat::pointer(){return p;}
/***********************************************************************************************************************/


/***************************************   Get number of rows   *****************************************/
size_t cu_mat::rows() const {return n_rows;}
/***********************************************************************************************************************/


/***************************************   Get number of columns   *****************************************/
size_t cu_mat::cols() const {return n_cols;}
/***********************************************************************************************************************/


/***************************************   Get GPU memory pointer   *****************************************/
double* cu_mat::pointer() const {return p;}
/***********************************************************************************************************************/













//  Friend Functions
/**************************************   Matrix with random numbers   ***********************************************/
cu_mat randn(const size_t &r, const size_t &c)
{
    size_t r_new = r, c_new = c;
    if ((r%2!=0)&&(c%2!=0))
    {
        r_new = r+1; c_new = c+1;
    }
    cu_mat a(r_new,c_new);
    curandGenerator_t prng;
    HANDLE_ERROR( curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_XORWOW) );
    HANDLE_ERROR( curandSetPseudoRandomGeneratorSeed(prng,(unsigned long long) clock()) );
    HANDLE_ERROR( curandGenerateNormalDouble(prng,a.p,a.n_rows*a.n_cols,0.0,1.0) ); //The number of values requested has to be multiple of 2.
    HANDLE_ERROR( curandDestroyGenerator(prng) );
    return std::move(a(1,r,1,c));
}
cu_mat randn(const size_t &n=1){return std::move(randn(n,n));}
/***************************************************************************************************************************/


/****************************************   Identity matrix   *******************************************/
__global__ void eye_mat(double* p, const int r, const int n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx<n_ele)
    {
        p[idx*r+idx] = 1.0;
    }
}
cu_mat eye(const size_t &r, const size_t &c)
{
    cu_mat temp(r,c);
    size_t n_ele = min(r,c);
    size_t n_threads = block_dim(n_ele);
    eye_mat<<<n_ele/n_threads,n_threads>>>(temp.p,r,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(temp);
}
cu_mat eye(const size_t &n){return std::move(eye(n,n));}
/***************************************************************************************************************************/


/*****************************************   Matrix left divide   *****************************************/
cu_mat mld(const cu_mat &a, const cu_mat &b)
{
    return std::move(mld(cu_mat(a),cu_mat(b)));
}
cu_mat mld(const cu_mat &a, cu_mat &&b)
{
    return std::move(mld(cu_mat(a),std::move(b)));
}
cu_mat mld(cu_mat &&a, const cu_mat &b)
{
    return std::move(mld(std::move(a),cu_mat(b)));
}
cu_mat mld(cu_mat &&a, cu_mat &&b) // Adapted from CUSOLVER_Library.pdf QR examples
{
    confirm(a.n_rows == b.n_rows,"Error: 'mld()' operation cannot be performed. Matrix dimensions must agree.")
    cu_mat A = std::move(a), B = std::move(b); // Copy current matrix to a new matrix for calculations.

    double *d_tau = NULL;
    double *d_work = NULL, alf = 1.0;
    int *devInfo = NULL, lwork = 0, info_gpu = 0;
    cusolverDnHandle_t cusolver_handle = NULL;
    cublasHandle_t cublas_handle = NULL;

    // step 1: create cusolver/cublas handle
    HANDLE_ERROR( cusolverDnCreate(&cusolver_handle) );
    HANDLE_ERROR( cublasCreate(&cublas_handle) );

    // step 2: allocate required extra memory on GPU.
    HANDLE_ERROR( cudaMalloc((void**)&d_tau,sizeof(double)*A.n_cols) );
    HANDLE_ERROR( cudaMalloc((void**)&devInfo,sizeof(int)) );

    // step 3: query working space of geqrf and ormqr
    HANDLE_ERROR( cusolverDnDgeqrf_bufferSize(cusolver_handle,A.n_rows,A.n_cols,A.p,A.n_rows,&lwork) );
    HANDLE_ERROR( cudaMalloc((void**)&d_work, sizeof(double)*lwork) );

    // step 4: compute QR factorization
    HANDLE_ERROR( cusolverDnDgeqrf(cusolver_handle,A.n_rows,A.n_cols,A.p,A.n_rows,d_tau,d_work,lwork,devInfo) );
    HANDLE_ERROR( cudaDeviceSynchronize() );
    // check if QR is good or not
    HANDLE_ERROR( cudaMemcpy(&info_gpu, devInfo, sizeof(int),cudaMemcpyDeviceToHost) );
    confirm(info_gpu == 0,"Error: 'mld()' operation cannot be performed. QR decomposition failed.");

    // step 5: compute Q^T*B (CUSOLVER documentation has typos. Follow LAPACK documentation.)
    HANDLE_ERROR( cusolverDnDormqr(cusolver_handle,CUBLAS_SIDE_LEFT,CUBLAS_OP_T,B.n_rows,B.n_cols,A.n_cols,A.p,A.n_rows,d_tau,B.p,B.n_rows,d_work,lwork,devInfo) );
    HANDLE_ERROR( cudaDeviceSynchronize() );
    // check if QR is good or not
    HANDLE_ERROR( cudaMemcpy(&info_gpu, devInfo, sizeof(int),cudaMemcpyDeviceToHost) );
    confirm(info_gpu == 0,"Error: 'mld()' operation cannot be performed. QR decomposition failed.");

    // step 6: compute x = R \ (Q^T*B)
    HANDLE_ERROR( cublasDtrsm(cublas_handle,CUBLAS_SIDE_LEFT,CUBLAS_FILL_MODE_UPPER,CUBLAS_OP_N,CUBLAS_DIAG_NON_UNIT,A.n_cols,B.n_cols,&alf,A.p,A.n_rows,B.p,A.n_cols) );
    HANDLE_ERROR( cudaDeviceSynchronize() );

    // Free resources
    HANDLE_ERROR( cudaFree(d_tau) );
    HANDLE_ERROR( cudaFree(devInfo) );
    HANDLE_ERROR( cudaFree(d_work) );

    HANDLE_ERROR( cublasDestroy(cublas_handle) );
    HANDLE_ERROR( cusolverDnDestroy(cusolver_handle) );

    return std::move(B);
}
/***************************************************************************************************************************/


/*****************************************   Matrix with all values 1   *****************************************/
cu_mat ones(const size_t &r, const size_t &c)
{
    cu_mat tmp(r,c,1); //tmp.del = 0;
    return std::move(tmp);
}
cu_mat ones(const size_t &n){return std::move(ones(n,n));}
/***************************************************************************************************************************/

/*****************************************   Matrix with all values 0   *****************************************/
cu_mat zeros(const size_t &r, const size_t &c)
{
    cu_mat tmp(r,c); //tmp.del = 0;
    return std::move(tmp);
}
cu_mat zeros(const size_t &n){return std::move(zeros(n,n));}
/***************************************************************************************************************************/


/***************************************   Transpose current matrix   *****************************************/
__global__ void mat_trans(double* a, double* b, size_t rows, size_t cols, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    size_t r = idx%rows, c = idx/rows;
    if (idx<n_ele)
    a[c+r*cols] = b[idx];
}
cu_mat trans(cu_mat &a)
{
    cu_mat tmp(a.n_cols,a.n_rows);
    size_t n_ele = tmp.n_rows*tmp.n_cols;
    size_t n_threads = block_dim(n_ele);
    mat_trans<<<n_ele/n_threads,n_threads>>>(tmp.p,a.p,a.n_rows,a.n_cols,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(tmp);
}
/***********************************************************************************************************************/


/***************************************   Horizontal concatenation of two matrices   *****************************************/
cu_mat horzcat(cu_mat &a, cu_mat &b)
{
    confirm(a.n_rows==b.n_rows,"Error: Dimensions of arrays being horizontally concatenated are not consistent.");
    cu_mat tmp(a.n_rows,a.n_cols+b.n_cols);
    HANDLE_ERROR( cudaMemcpy(tmp.p,a.p,a.n_rows*a.n_cols*sizeof(double),cudaMemcpyDeviceToDevice) );
    size_t n_ele = b.n_rows*b.n_cols, n_threads = block_dim(n_ele);
    copymat<<<n_ele/n_threads,n_threads>>>(tmp.p,b.p,a.n_cols*tmp.n_rows,tmp.n_rows,0,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(tmp);
}
/***************************************************************************************************************************/


/***************************************   Vertical concatenation of two matrices   *****************************************/
cu_mat vertcat(cu_mat &a, cu_mat &b)
{
    confirm(a.n_cols==b.n_cols,"Error: Dimensions of arrays being vertically concatenated are not consistent.");
    cu_mat tmp(a.n_rows+b.n_rows,a.n_cols);
    size_t n_ele = a.n_rows*a.n_cols, n_threads = block_dim(n_ele);
    copymat<<<n_ele/n_threads,n_threads>>>(tmp.p,a.p,0,a.n_rows,tmp.n_rows-a.n_rows,n_ele);
    n_ele = b.n_rows*b.n_cols; n_threads = block_dim(n_ele);
    copymat<<<n_ele/n_threads,n_threads>>>(tmp.p,b.p,a.n_rows,b.n_rows,tmp.n_rows-b.n_rows,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(tmp);
}
/***************************************************************************************************************************/


/***************************************   MATLAB colon operator   *****************************************/
__global__ void ss_mat_fill(double* dest, double i, double step, size_t n_ele)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx<n_ele)
    {
        dest[idx] = i+step*idx;
    }
}
cu_mat stepspace(const double &i, const double &f, const double &step=1)
{
    size_t n;
    if (((f-i)/step)>=0)
        n = (f-i)/step+1;
    else
        n = 0;
    cu_mat tmp(n,1);
    size_t n_ele = tmp.n_rows*tmp.n_cols, n_threads = block_dim(n_ele);
    ss_mat_fill<<<n_ele/n_threads,n_threads>>>(tmp.p,i,step,n_ele);
    HANDLE_ERROR( cudaPeekAtLastError() );
    return std::move(tmp);
}
/***************************************************************************************************************************/


/************************************   Check if 'cu_mat' object is scalar   ***********************************************/
bool isscalar(const cu_mat &a)
{
    return ((a.n_rows*a.n_cols)==1);
}
/***********************************************************************************************************************/



//  Destructor
cu_mat::~cu_mat() {
    HANDLE_ERROR( cudaFree(p) );
    p = NULL;
}

block_dim.cu ## -用于计算内核调用

的块维度的函数

代码语言:javascript
运行
复制
#include "imp_includes.hcu"

size_t block_dim(size_t n_ele)
{
    size_t tc;
    if (n_ele<=1024) return(n_ele);

    else if (n_ele%2==0){
        for (tc=1024;tc>1;tc=tc-2){
            if (n_ele%tc==0) return(tc);}}

    else{
        for (tc=1023;tc>1;tc=tc-2){
            if (n_ele%tc==0) return(tc);}}

    std::cout << "Warning: Calculations cannot be divided equally. Be careful when making the CUDA kernel." << std::endl;
    return(256);
}

cu_error_list.cu ## -可能的cuBLAS、cuRAND和cuSOLVER api错误列表.

代码语言:javascript
运行
复制
#include "imp_includes.hcu"

// cuBLAS API errors
const char *cublasGetErrorString(cublasStatus_t err)
{
    switch (err) {
        case CUBLAS_STATUS_NOT_INITIALIZED:
          return "CUBLAS_STATUS_NOT_INITIALIZED";
        case CUBLAS_STATUS_ALLOC_FAILED:
          return "CUBLAS_STATUS_ALLOC_FAILED";
        case CUBLAS_STATUS_INVALID_VALUE:
          return "CUBLAS_STATUS_INVALID_VALUE";
        case CUBLAS_STATUS_ARCH_MISMATCH:
          return "CUBLAS_STATUS_ARCH_MISMATCH";
        // case CUBLAS_STATUS_MAPPING_err:
        //   return "CUBLAS_STATUS_MAPPING_err";
        case CUBLAS_STATUS_EXECUTION_FAILED:
          return "CUBLAS_STATUS_EXECUTION_FAILED";
        // case CUBLAS_STATUS_INTERNAL_err:
        //   return "CUBLAS_STATUS_INTERNAL_err";
        case CUBLAS_STATUS_NOT_SUPPORTED:
          return "CUBLAS_STATUS_NOT_SUPPORTED";
        // case CUBLAS_STATUS_LICENSE_err:
        //   return "CUBLAS_STATUS_LICENSE_err";
      }
      return "<unknown>";
}

  // cuSOLVER API errors
const char *cusolverGetErrorString(cusolverStatus_t err)
{
    switch (err) {
      case CUSOLVER_STATUS_NOT_INITIALIZED:
        return "CUSOLVER_STATUS_NOT_INITIALIZED";
      case CUSOLVER_STATUS_ALLOC_FAILED:
        return "CUSOLVER_STATUS_ALLOC_FAILED";
      case CUSOLVER_STATUS_INVALID_VALUE:
        return "CUSOLVER_STATUS_INVALID_VALUE";
      case CUSOLVER_STATUS_ARCH_MISMATCH:
        return "CUSOLVER_STATUS_ARCH_MISMATCH";
    //   case CUSOLVER_STATUS_MAPPING_err:
    //     return "CUSOLVER_STATUS_MAPPING_err";
      case CUSOLVER_STATUS_EXECUTION_FAILED:
        return "CUSOLVER_STATUS_EXECUTION_FAILED";
    //   case CUSOLVER_STATUS_INTERNAL_err:
    //     return "CUSOLVER_STATUS_INTERNAL_err";
      case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
        return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
      case CUSOLVER_STATUS_NOT_SUPPORTED:
        return "CUSOLVER_STATUS_NOT_SUPPORTED ";
      case CUSOLVER_STATUS_ZERO_PIVOT:
        return "CUSOLVER_STATUS_ZERO_PIVOT";
      case CUSOLVER_STATUS_INVALID_LICENSE:
        return "CUSOLVER_STATUS_INVALID_LICENSE";
    }
    return "<unknown>";
}

  // cuRAND API errors
const char *curandGetErrorString(curandStatus_t err)
{
    switch (err) {
      case CURAND_STATUS_VERSION_MISMATCH:
        return "CURAND_STATUS_VERSION_MISMATCH";
      case CURAND_STATUS_NOT_INITIALIZED:
        return "CURAND_STATUS_NOT_INITIALIZED";
      case CURAND_STATUS_ALLOCATION_FAILED:
        return "CURAND_STATUS_ALLOCATION_FAILED";
    //   case CURAND_STATUS_TYPE_err:
    //     return "CURAND_STATUS_TYPE_err";
      case CURAND_STATUS_OUT_OF_RANGE:
        return "CURAND_STATUS_OUT_OF_RANGE";
      case CURAND_STATUS_LENGTH_NOT_MULTIPLE:
        return "CURAND_STATUS_LENGTH_NOT_MULTIPLE";
      case CURAND_STATUS_DOUBLE_PRECISION_REQUIRED:
        return "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED";
      case CURAND_STATUS_LAUNCH_FAILURE:
        return "CURAND_STATUS_LAUNCH_FAILURE";
      case CURAND_STATUS_PREEXISTING_FAILURE:
        return "CURAND_STATUS_PREEXISTING_FAILURE";
      case CURAND_STATUS_INITIALIZATION_FAILED:
        return "CURAND_STATUS_INITIALIZATION_FAILED";
      case CURAND_STATUS_ARCH_MISMATCH:
        return "CURAND_STATUS_ARCH_MISMATCH";
    //   case CURAND_STATUS_INTERNAL_err:
    //     return "CURAND_STATUS_INTERNAL_err";
    }
    return "<unknown>";
}

error_check.cu ## -重载函数将错误转换为字符串

代码语言:javascript
运行
复制
#include "imp_includes.hcu"

//#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

void HandleError( cudaError_t err,const char *file,int line )
{
    confirm(err == cudaSuccess,cudaGetErrorString( err )<<" in "<<file<<" at "<<line<<".");
}

void HandleError( cublasStatus_t err,const char *file,int line )
{
    confirm(err == CUBLAS_STATUS_SUCCESS,cublasGetErrorString( err )<<" in "<<file<<" at "<<line<<".");
}

void HandleError( cusolverStatus_t err,const char *file,int line )
{
    confirm(err == CUSOLVER_STATUS_SUCCESS,cusolverGetErrorString( err )<<" in "<<file<<" at "<<line<<".");
}

void HandleError( curandStatus_t err,const char *file,int line )
{
    confirm(err == CURAND_STATUS_SUCCESS,curandGetErrorString( err )<<" in "<<file<<" at "<<line<<".");
}

cu_mat_test.cu ## -测试类

的主要功能

代码语言:javascript
运行
复制
#include "cu_mat.hcu"
#include <ctime>

int main()
{
    clock_t begin = clock();
    look_for_errors;

    cu_mat A = randn(5), b = randn(5,1);
    cu_mat c = mld(A,b);
    A.get(); b.get(); c.get();

    report_errors;
    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << elapsed_secs << std::endl;
    return 0;
}

imp_includes.hcu ## -用于异常处理的宏定义,并包括其他必需的库(

)

代码语言:javascript
运行
复制
#ifndef IMP_INCLUDES_H_
#define IMP_INCLUDES_H_

#include <iostream>
#include <cuda_runtime.h>
#include <curand.h>
#include <curand_kernel.h>
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <fstream>
#include <iomanip>
#include <functional>

// Macro definitions
#define look_for_errors try{

#define report_errors }catch(int n){}

#define confirm(cond,err)                   \
if(!(cond))                                 \
{                                           \
    std::cout << "\a" << err << std::endl;  \
    throw 1;                                \
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

// Extra functions
size_t block_dim(size_t n_ele);
const char *cublasGetErrorString(cublasStatus_t err);               // cuBLAS API errors
const char *cusolverGetErrorString(cusolverStatus_t err);           // cuSOLVER API errors
const char *curandGetErrorString(curandStatus_t err);               // cuRAND API errors
void HandleError( cudaError_t err,const char *file,int line );
void HandleError( cublasStatus_t err,const char *file,int line );
void HandleError( cusolverStatus_t err,const char *file,int line );
void HandleError( curandStatus_t err,const char *file,int line );
__global__ void copymat(double* dest, double* src, size_t bias, size_t src_rows, size_t main_rows_bias, size_t n_ele);

#endif /* IMP_INCLUDES_HCU_ */
EN

回答 1

Code Review用户

发布于 2019-06-16 07:44:24

以下是一些简短的评论:

  1. 使用值的编译时列表(初始化程序列表)初始化矩阵似乎不太有用,因为CUDA用于处理大量的数据,而不是在编译时提供的少量元素。
  2. 将数学/数据结构抽象的代码与错误处理实用程序代码混合到同一个类中不是一个好主意。
  3. 您的函数和方法名称通常是不必要的缩短(如mld)、模糊(confirm)或涉及冗余(比如cu_mat而不是适当的名称空间中的matrix )。
  4. 看起来它们应该是位于主机内存中的矩阵,但是在每次操作中都会来回复制到设备内存中。为什么这是一个有用的想法?

PS -如果您想要使用更多的C++ish错误处理,您可以考虑我的CUDA包装器,它就是这样做的(显然与任何特定于应用程序的逻辑都是分开的)。

票数 2
EN
页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://codereview.stackexchange.com/questions/222337

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档