
SDAccel Matrix Multiplication Optimization (Part 1)

Author: AI异构 · First published 2018-01-21

Starting from a matrix multiplication example, this series walks step by step through functional design and performance optimization.

mmult Implementation and Optimization Steps

Matrix multiplication optimization steps:

| Step | Function implemented | Key concepts / Keywords |
| --- | --- | --- |
| 1. CPU implementation | A simple matrix multiply on the host, used to verify results and as a performance baseline | --- |
| 2. OpenCL implementation | An OpenCL-based FPGA matrix-multiply hardware design on the device | OpenCL API functions |
| 3. Adding local memory | Use local memory to reduce the number of memory accesses | Kernel optimization; Local memory |
| 4. Burst read/write transfers | Use burst transfers for more efficient reads and writes between DDR and local memory | Kernel optimization; Burst read/write |
| 5. Array partitioning | Use loop unrolling and array partitioning to achieve better compute performance | Array partitioning; Loop unrolling; Pipelining |

CPU-Side mmult Implementation

void mmult_cpu( int *in1,   // Input matrix 1
                int *in2,   // Input matrix 2
                int *out,   // Output matrix (out = A x B)
                int dim     // Matrix size of one dimension
              )
{
    //Performs matrix multiplication out = in1 x in2
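    //Note: assumes 'out' has been zero-initialized by the caller, since the
    //inner loop accumulates with +=.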
    for (int i = 0; i < dim; i++){
        for (int j = 0; j < dim; j++){
            for (int k = 0; k < dim; k++){
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim  + j];
            }
        }
    }
}
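
As a quick sanity check, a hypothetical call site might look like the sketch below (the matrix size and input values are made up for illustration). The output buffer must start at zero because mmult_cpu accumulates with +=.

#include <cassert>
#include <vector>

int main() {
    const int dim = 4;                       // hypothetical matrix size
    std::vector<int> in1(dim * dim, 1);      // all-ones matrix
    std::vector<int> in2(dim * dim, 2);      // all-twos matrix
    std::vector<int> out(dim * dim, 0);      // must be zero-initialized
    mmult_cpu(in1.data(), in2.data(), out.data(), dim);
    assert(out[0] == 2 * dim);               // each entry sums dim products of 1*2
    return 0;
}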

FPGA-Side mmult Implementation

OpenCL Host-Side Initialization Flow
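As the code below shows, the host follows the standard OpenCL bring-up sequence: find a Xilinx device, create a context and a profiling-enabled command queue, load the precompiled xclbin binary and build a cl::Program from it, extract the mmult kernel, wrap the host vectors in cl::Buffer objects, migrate the input buffers to the device DDR, set the kernel arguments, enqueue the kernel, and finally migrate the output buffer back to the host.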
Host-side code
//OpenCL utility layer include
#include "xcl2.hpp"
#include <vector>

//Array Size to access
#define DATA_SIZE 64

uint64_t get_duration_ns (const cl::Event &event) {
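    //Profiling timestamps are reported in nanoseconds; this requires the
    //command queue to be created with CL_QUEUE_PROFILING_ENABLE.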
    uint64_t nstimestart, nstimeend;
    event.getProfilingInfo<uint64_t>(CL_PROFILING_COMMAND_START,&nstimestart);
    event.getProfilingInfo<uint64_t>(CL_PROFILING_COMMAND_END,&nstimeend);
    return(nstimeend-nstimestart);
}

//CPU implementation of Matrix Multiplication
//The inputs are of the size (DATA_SIZE x DATA_SIZE)
void mmult_cpu (
    int *in1,   //Input Matrix 1
    int *in2,   //Input Matrix 2
    int *out,   //Output Matrix
    int dim     //One dimension of matrix
)
{
    //Performs Matrix multiply Out = In1 x In2
    for(int i = 0; i < dim; i++) {
        for(int j = 0; j < dim; j++) {
            for(int k = 0; k < dim; k++) {
                out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}

//Functionality to set up the OpenCL context and trigger the kernel
uint64_t mmult_fpga (
    std::vector<int,aligned_allocator<int>>& source_in1,   //Input Matrix 1
    std::vector<int,aligned_allocator<int>>& source_in2,   //Input Matrix 2
    std::vector<int,aligned_allocator<int>>& source_fpga_results,    //Output Matrix
    int dim                                                //One dimension of matrix
)
{
    int size = dim;
    size_t matrix_size_bytes = sizeof(int) * size * size;

    //get_xil_devices() returns a vector of Xilinx devices
    std::vector<cl::Device> devices = xcl::get_xil_devices();
    cl::Device device = devices[0];

    //Creating Context and Command Queue for selected Device
    cl::Context context(device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
    std::string device_name = device.getInfo<CL_DEVICE_NAME>();

    //find_binary_file() locates the OpenCL binary (xclbin) created with the
    //xocc compiler, and import_binary_file() loads it as a cl::Program::Binaries
    //object. A single binary can contain many kernels executable on the device.
    std::string binaryFile = xcl::find_binary_file(device_name,"mmult");
    cl::Program::Binaries bins = xcl::import_binary_file(binaryFile);
    devices.resize(1);
    cl::Program program(context, devices, bins);

    //This call will extract a kernel out of the program we loaded in the
    //previous line. A kernel is an OpenCL function that is executed on the
    //FPGA. This function is defined in the src/mmult.cl file.
    cl::Kernel kernel(program,"mmult");

    //These commands will allocate memory on the FPGA. The cl::Buffer
    //objects can be used to reference the memory locations on the device.
    //The cl::Buffer object cannot be referenced directly and must be passed
    //to other OpenCL functions.
    cl::Buffer buffer_in1(context,CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
            matrix_size_bytes,source_in1.data());
    cl::Buffer buffer_in2(context,CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
            matrix_size_bytes,source_in2.data());
    cl::Buffer buffer_output(context,CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
            matrix_size_bytes,source_fpga_results.data());

    //These commands load the source_in1 and source_in2 vectors from the host
    //application into the buffer_in1 and buffer_in2 cl::Buffer objects. The
    //data will be transferred from system memory over PCIe to the FPGA
    //on-board DDR memory.
    q.enqueueMigrateMemObjects({buffer_in1, buffer_in2},0/* 0 means from host*/);

    //Set the kernel arguments
    int narg = 0;
    kernel.setArg(narg++, buffer_in1);
    kernel.setArg(narg++, buffer_in2);
    kernel.setArg(narg++, buffer_output);
    kernel.setArg(narg++, size);

    cl::Event event;
    uint64_t kernel_duration = 0;

    //Launch the kernel
    q.enqueueTask(kernel, NULL, &event);

    //The result of the previous kernel execution will need to be retrieved in
    //order to view the results. This call will write the data from the
    //buffer_output cl_mem object to the source_fpga_results vector
    q.enqueueMigrateMemObjects({buffer_output},CL_MIGRATE_MEM_OBJECT_HOST);
    q.finish();

    kernel_duration = get_duration_ns(event);

    return kernel_duration;
}

int main(int argc, char** argv)
{
    //Allocate Memory in Host Memory
    int size = DATA_SIZE;
    size_t matrix_size_bytes = sizeof(int) * size * size;

    //When a buffer is created with a user pointer, the runtime uses that
    //pointer directly if and only if it is properly (page) aligned. If it is
    //not aligned, the runtime must allocate its own host-side buffer to back
    //the user pointer, and every transfer to or from the device then incurs
    //an extra memcpy between the runtime's buffer and the user pointer. Using
    //aligned_allocator keeps these vectors page aligned, so the user buffers
    //are used directly when the cl::Buffer objects are created.
    std::vector<int,aligned_allocator<int>> source_in1(size * size);
    std::vector<int,aligned_allocator<int>> source_in2(size * size);
    std::vector<int,aligned_allocator<int>> source_fpga_results(size * size);
    std::vector<int,aligned_allocator<int>> source_cpu_results(size * size);

    //Create the test data
    for(int i = 0 ; i < DATA_SIZE * DATA_SIZE ; i++){
        source_in1[i] = i;
        source_in2[i] = i * i;
        source_cpu_results[i] = 0;
        source_fpga_results[i] = 0;
    }

    uint64_t kernel_duration = 0;

    //Compute CPU Results
    mmult_cpu(source_in1.data(), source_in2.data(), source_cpu_results.data(), size);

    //Compute FPGA Results
    kernel_duration = mmult_fpga(source_in1, source_in2, source_fpga_results, size);

    //Compare the results of FPGA to CPU
    bool match = true;
    for (int i = 0 ; i < size * size; i++){
        if (source_fpga_results[i] != source_cpu_results[i]){
            std::cout << "Error: Result mismatch" << std::endl;
            std::cout << "i = " << i << " CPU result = " << source_cpu_results[i]
                << " FPGA result = " << source_fpga_results[i] << std::endl;
            match = false;
            break;
        }
    }

    std::cout << "TEST " << (match ? "PASSED" : "FAILED") << std::endl;

    std::cout << "Wall Clock Time (Kernel execution): " << kernel_duration << std::endl;
    std::cout << "Note: Wall Clock Time is meaningful for real hardware execution only,"
            << "not for emulation." << std::endl;

    return (match ? EXIT_SUCCESS :  EXIT_FAILURE);
}
Device-side code (a straightforward mmult implementation)
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1,  //Read-only input matrix1
            __global int* in2,  //Read-only input matrix2
            __global int* out,  //Output matrix
            int dim             //One dimension of the matrix
          )
{
    //Reads the data from DDR, performs the computation
    //and writes back the result to DDR.
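    //Note: out[i * dim + j] is accumulated directly in global memory, so each
    //LOOP3 iteration performs a read-modify-write over the gmem AXI port; the
    //HLS log below flags this as a carried dependence.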
    LOOP1:for (int i = 0 ; i < dim ; i++){
        LOOP2:for(int j = 0; j < dim; j++){
                   out[i * dim + j] = 0;
            LOOP3:for(int k = 0; k < dim; k++){
                       out[i * dim + j] += in1[i * dim + k] * in2[k * dim + j];
            }
        }
    }
}
Analysis of Experimental Results
  • Vivado HLS log analysis (pay particular attention to the WARNING lines)
WARNING: [XFORM 203-542] Cannot flatten a loop nest 'LOOP2' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:47:44) in function 'mmult' :
WARNING: [XFORM 203-542] the outer loop is not a perfect loop.
INFO: [XFORM 203-541] Flattening a loop nest 'LOOP1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:45:43) in function 'mmult'.
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:00.77 ; elapsed = 00:00:00.88 . Memory (MB): peak = 494.320 ; gain = 156.758 ; free physical = 19872 ; free virtual = 45217
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'LOOP3'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 2, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 3, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 4, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 130, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 193, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 225, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 241, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 249, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 253, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 255, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 256, distance = 1, offset = 0)
   between 'add' operation ('tmp_13', /home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51) and bus write on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:51).
INFO: [SCHED 204-61] Unable to satisfy pipeline directive: Unable to pipeline the region.
INFO: [SCHED 204-11] Finished scheduling.
  • HLS Report
  • Synthesis result analysis

How to analyze the synthesis results:
* First, check whether each optimization directive you added was actually implemented by synthesis; if not, find out why.
* Then examine how the code is pipelined. For nested for loops, SDAccel fully unrolls all loops inside a pipelined loop, and attempts to flatten the loops outside it; if flattening succeeds, they are merged into a single pipeline.
* For each pipelined loop, check the achieved II value and how far it could theoretically be reduced.

From the log above, the synthesized hardware has several problems:
* First, the kernel code contains no optimization directives, so there are no directives to check.
* Second, of the three nested for loops, only the innermost loop LOOP3 gets a pipeline attempt. The middle loop could not be flattened because "the outer loop is not a perfect loop", while LOOP1 and LOOP2 were flattened successfully into LOOP1_LOOP2. In general, flattening fails for one of two reasons: either the outer loop body contains statements besides the inner loop, or the inner loop's bound is a variable. Here LOOP2 cannot be flattened with LOOP3 for the first reason: the LOOP2 body contains the statement out[i * dim + j] = 0;, and the out array is used again inside LOOP3. Put the other way around, if the compiler did flatten LOOP2 and LOOP3, it would have no way to merge the out[i * dim + j] = 0; initialization into the inner loop body.

* Finally, looking at the II values for LOOP3, which the tool tried to pipeline: the log shows the II keeps growing until pipelining is abandoned, because of a carried dependence on the gmem interface. So in the end none of the loops is actually pipelined. For more on the gmem carried dependence problem, see my other article, gmem carry dependency 分析.
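For reference, here is a minimal sketch of one common workaround (my own illustration, not the step this series takes next): accumulate into a private register so that out[] is written exactly once per element. This removes the read-modify-write of out[] on the gmem port that the scheduler complains about; note that the two reads of in1 and in2 through the same port can still limit the achievable II.

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1,
            __global int* in2,
            __global int* out,
            int dim
          )
{
    LOOP1:for (int i = 0; i < dim; i++){
        LOOP2:for (int j = 0; j < dim; j++){
            int acc = 0;  //private accumulator replaces out[i * dim + j] = 0;
            LOOP3:for (int k = 0; k < dim; k++){
                acc += in1[i * dim + k] * in2[k * dim + j];
            }
            out[i * dim + j] = acc;  //single gmem write per output element
        }
    }
}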

  • Hardware emulation results
  • Hardware implementation results

References

Xilinx GitHub: Xilinx/SDAccel_Examples, cpu_to_fpga
UG1253: SDx Pragma Reference Guide, v2017.2
UG1207: SDAccel Environment Optimization Guide
