DAY 1: Studying the CUDA C Programming Guide

Starting today, we will walk you through the English edition of the CUDA C Programming Guide. Over the next 100 days, we hope you will learn CUDA from the original source and, along the way, build the habit of reading technical English.

This article runs about 899 words; estimated reading time: 15 minutes.

Changes from Version 9.0

· Documented the restriction that operator overloads cannot be __global__ functions, in Operator Function.

· Removed guidance to break 8-byte shuffles into two 4-byte instructions; 8-byte shuffle variants have been provided since CUDA 9.0. See Warp Shuffle Functions (a minimal sketch follows this list).

· Passing __restrict__ references to __global__ functions is now supported. Updated comment in __global__ functions and function templates.

· Documented CUDA_ENABLE_CRC_CHECK in CUDA Environment Variables.

· Warp matrix functions [PREVIEW FEATURE] now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16.
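To make the 8-byte shuffle change concrete, here is a minimal sketch of a warp-level sum over double values using __shfl_down_sync, whose 8-byte variants have been available since CUDA 9.0. The kernel and names (warpReduceSum, sumKernel) are our own illustration, not code from the guide:

#include <cstdio>

// Warp-level reduction on 8-byte double values via __shfl_down_sync
// (8-byte shuffle variants exist since CUDA 9.0; before that, a double
// had to be split into two 4-byte shuffles by hand).
__inline__ __device__ double warpReduceSum(double val) {
    // 0xffffffff: all 32 lanes of the warp participate.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the whole warp's sum
}

__global__ void sumKernel(const double* data, double* out) {
    double s = warpReduceSum(data[threadIdx.x]);  // launched with one warp
    if (threadIdx.x == 0) *out = s;
}

int main() {
    double h[32], result = 0.0, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h[i] = 1.0;
    cudaMalloc(&d_in, sizeof(h));
    cudaMalloc(&d_out, sizeof(double));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
    sumKernel<<<1, 32>>>(d_in, d_out);            // one block of one warp
    cudaMemcpy(&result, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", result);                 // expect 32.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}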

1. Introduction

1.1. From Graphics Processing to General Purpose Parallel Computing

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processing Unit, or GPU, has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.

Figure 1. Floating-Point Operations per Second for the CPU and GPU

Figure 2. Memory Bandwidth for the CPU and GPU

The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.

Figure 3. The GPU Devotes More Transistors to Data Processing

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
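As a concrete sketch of "the same program executed on many data elements in parallel", consider a SAXPY-style kernel; SAXPY is our own choice of illustration here, not an example drawn from this passage:

// Data-parallel sketch: every thread executes the same program on a
// different element, computing y[i] = a * x[i] + y[i] (SAXPY).
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard the final partial block
        y[i] = a * x[i] + y[i];
}

// A typical launch covers all n elements with 256-thread blocks:
//   int blocks = (n + 255) / 256;
//   saxpy<<<blocks, 256>>>(n, 2.0f, d_x, d_y);

Note the low flow-control burden: apart from the bounds check, every thread follows the same path, which is exactly the property the paragraph above describes.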

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model

In November 2006, NVIDIA introduced CUDA®, a general-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.

CUDA comes with a software environment that allows developers to use C as a high-level programming language. As illustrated by Figure 4, other languages, application programming interfaces, or directives-based approaches are supported, such as FORTRAN, DirectCompute, OpenACC.

Figure 4. GPU Computing Applications. CUDA is designed to support various languages and application programming interfaces.

1.3. A Scalable Programming Model

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

At its core are three key abstractions - a hierarchy of thread groups, shared memories, and barrier synchronization - that are simply exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.
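All three abstractions appear together in even a tiny kernel. The following block-level sum is our own minimal sketch, assuming 256-thread blocks and an input size that is a multiple of 256:

// Hierarchy: a grid of independent blocks, each owning one coarse sub-problem.
// Shared memory + barriers: threads within a block cooperate on the fine pieces.
__global__ void blockSum(const float* in, float* blockResults) {
    __shared__ float tile[256];                    // visible to this block only
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid]; // each block loads its slice
    __syncthreads();                               // barrier: slice fully loaded

    // Tree reduction: cooperative fine-grained parallelism inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                           // keep every step in lockstep
    }
    if (tid == 0)
        blockResults[blockIdx.x] = tile[0];        // independent per-block output
}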

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 5, and only the runtime system needs to know the physical multiprocessor count.
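A common idiom that exploits this property is the grid-stride loop: because no thread assumes anything about how many blocks exist or where they run, the same binary is correct on one multiprocessor or on eighty. This is a standard CUDA pattern, not text from the guide:

// Grid-stride loop: the grid may be smaller than the data; each thread
// strides across elements, so any block count (hence any SM count) works.
__global__ void scale(int n, float s, float* data) {
    int stride = blockDim.x * gridDim.x;           // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= s;
}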

This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions: from the high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs (see CUDA-Enabled GPUs for a list of all CUDA-enabled GPUs).

Figure 5. Automatic Scalability

Note: A GPU is built around an array of Streaming Multiprocessors (SMs) (see Hardware Implementation for more details). A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.

Notes and experience sharing:

CUDA's support for C++ is not complete; there are various restrictions (the __global__ operator-overload rule in the change list above is one example).

The ratio of arithmetic to memory operations measures how a card's compute capability compares with its memory-access capability. It comes in two flavors of unit: instructions per byte (or per 4 bytes), and instructions per instruction. In practice, you seldom need to remember this ratio yourself, because the profiler will usually tell you whether a kernel is compute-bound or memory-bound.
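A quick worked example (our own numbers, purely illustrative): single-precision vector addition, c[i] = a[i] + b[i], performs 1 floating-point operation per element while moving 12 bytes (two 4-byte loads and one 4-byte store), an arithmetic intensity of 1/12 ≈ 0.083 FLOP/byte. Modern GPUs can sustain far more FLOPs per byte of memory bandwidth than that, so a profiler would flag such a kernel as memory-bound, which is exactly the verdict described above.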

If anything is unclear, please leave a comment below this article,

or start a thread on our technical forum at bbs.gpuworld.cn.

Originally published on the WeChat public account 吉浦迅科技 (gpusolution).

Original publication date: 2018-05-02
