手把手教你如何用Julia做GPU编程（附代码）

新智元

发布于 2018-11-22 17:59:32

2.1K0

发布于 2018-11-22 17:59:32

文章被收录于专栏：新智元

新智元报道

来源：nextjournal

编辑：肖琴、三石

【新智元导读】本文旨在快速介绍GPU的工作原理，详细介绍当前的Julia GPU生态系统，并让读者了解简单的GPU编程是多么的容易。

GPU是如何工作的？

首先，什么是GPU？

GPU是一个大规模并行处理器，具有几千个并行处理单元。例如，本文中使用的Tesla k80提供4992个并行CUDA内核。 GPU在频率，延迟和硬件功能方面与CPU完全不同，但有点类似于拥有4992个内核的慢速CPU！

“Tesla K80”

可启用并行线程的数量可以大幅提高GPU速度，但也让它的使用性变得更加困难。让我们来详细看看在使用这种原始动力时，你会遇到哪些缺点：

GPU是一个独立的硬件，具有自己的内存空间和不同的架构。因此，从RAM到GPU存储器（VRAM）的传输时间很长。即使在GPU上启动内核（换句话说，调度函数调用）也会带来较大的延迟。 GPU的时间约为10us，而CPU的时间则为几纳秒。
在没有高级包装器的情况下，设置内核会很快变得复杂
较低的精度是默认值，而较高的精度计算可以轻松地消除所有性能增益
GPU函数(内核)本质上是并行的，所以编写GPU内核至少和编写并行CPU代码一样困难，但是硬件上的差异增加了相当多的复杂性
与上述相关，许多算法都不能很好地移植到GPU上。
内核通常是用C/ C++编写的，这并不是写算法的最佳语言。
CUDA和OpenCL之间存在分歧，OpenCL是用于编写低级GPU代码的主要框架。虽然CUDA只支持英伟达硬件，但OpenCL支持所有硬件，但有些粗糙。

Julia的诞生是个好消息！它是一种高级脚本语言，允许你在Julia本身编写内核和周围的代码，同时在大多数GPU硬件上运行！

GPUArrays

大多数高度并行的算法需要通过相当多的数据来克服所有线程和延迟开销。因此，大多数算法都需要数组来管理所有数据，这需要一个好的GPU数组库（array library）作为关键基础。

GPUArrays.jl是Julia的基础。它提供了一个抽象数组实现，专门用于使用高度并行硬件的原始功能。它包含设置GPU所需的所有功能，启动Julia GPU函数并提供一些基本的数组算法。

抽象意味着它需要以CuArrays和CLArrays形式的具体实现。由于继承了GPUArrays的所有功能，它们都提供完全相同的接口。唯一的区别出现在分配数组时，这会强制你决定数组是否位于CUDA或OpenCL设备上。关于这一点的更多信息，请参阅内存部分。

GPUArrays有助于减少代码重复，因为它允许编写独立于硬件的GPU内核，可以通过CuArrays或CLArrays将其编译为本机GPU代码。因此，许多通用内核可以在继承自GPUArrays的所有packages之间共享。

一点选择建议：CuArrays仅适用于Nvidia GPU，而CLArrays适用于大多数可用的GPU。CuArrays比CLArrays更稳定，并且已经可以在Julia 0.7上运行。速度上差异不明显。我建议两者都试一下，看看哪个效果最好。

对于本文，我将选择CuArrays，因为本文是为Julia 0.7 / 1.0而写的，CLArrays仍然不支持。

性能

让我们用一个简单的交互式代码示例来快速说明为什么要将计算转移到GPU上，这个示例计算julia set：

 1using CuArrays, FileIO, Colors, GPUArrays, BenchmarkTools
 2using CuArrays: CuArray
 3"""
 4The function calculating the Julia set
 5"""
 6function juliaset(z0, maxiter)
 7    c = ComplexF32(-0.5, 0.75)
 8    z = z0
 9    for i in 1:maxiter
10        abs2(z) > 4f0 && return (i - 1) % UInt8
11        z = z * z + c
12    end
13    return maxiter % UInt8 # % is used to convert without overflow check
14end
15range = 100:50:2^12
16cutimes, jltimes = Float64[], Float64[]
17function run_bench(in, out)
18  # use dot syntax to apply `juliaset` to each elemt of q_converted 
19  # and write the output to result
20  out .= juliaset.(in, 16)
21  # all calls to the GPU are scheduled asynchronous, 
22  # so we need to synchronize
23  GPUArrays.synchronize(out)
24end
25# store a reference to the last results for plotting
26last_jl, last_cu = nothing, nothing
27for N in range
28  w, h = N, N
29  q = [ComplexF32(r, i) for i=1:-(2.0/w):-1, r=-1.5:(3.0/h):1.5]
30  for (times, Typ) in ((cutimes, CuArray), (jltimes, Array))
31    # convert to Array or CuArray - moving the calculation to CPU/GPU
32    q_converted = Typ(q)
33    result = Typ(zeros(UInt8, size(q)))
34    for i in 1:10 # 5 samples per size
35      # benchmarking macro, all variables need to be prefixed with $
36      t = Base.@elapsed begin
37                run_bench(q_converted, result)
38      end
39      global last_jl, last_cu # we're in local scope
40      if result isa CuArray
41        last_cu = result
42      else
43          last_jl = result
44      end
45      push!(times, t)
46    end
47  end
48end
49
50cu_jl = hcat(Array(last_cu), last_jl)
51cmap = colormap("Blues", 16 + 1)
52color_lookup(val, cmap) = cmap[val + 1]
53save("results/juliaset.png", color_lookup.(cu_jl, (cmap,)))

1using Plots; plotly()
2x = repeat(range, inner = 10)
3speedup = jltimes ./ cutimes
4Plots.scatter(
5  log2.(x), [speedup, fill(1.0, length(speedup))], 
6  label = ["cuda" "cpu"], markersize = 2, markerstrokewidth = 0,
7  legend = :right, xlabel = "2^N", ylabel = "speedup"
8)

如你所见，对于大型数组，通过将计算移动到GPU可以获得稳定的60-80倍的加速。而且非常简单，只需将Julia array转换为GPUArray。

有人可能认为GPU的性能受到像Julia这样的动态语言的影响，但Julia的GPU性能应该与CUDA或OpenCL的原始性能相当。Tim Besard在集成LLVM Nvidia编译pipeline方面做得非常出色，达到了与纯CUDA C代码相同（有时甚至更好）的性能。Tim发表了一篇非常详细的博文，里面进一步解释了这一点[1]。CLArrays方法有点不同，它直接从Julia生成OpenCL C代码，具有与OpenCL C相同的性能！

为了更好地了解性能并查看与多线程CPU代码的比较，我收集了一些基准测试[2]。

内存（Memory）

GPU具有自己的存储空间，包括视频存储器（VRAM），不同的高速缓存和寄存器。无论你做什么，任何Julia对象都必须先转移到GPU才能使用。并非Julia中的所有类型都可以在GPU上工作。

首先让我们看一下Julia的类型：

 1struct Test # an immutable struct
 2# that only contains other immutable, which makes 
 3# isbitstype(Test) == true
 4    x::Float32 
 5end
 6
 7# the isbits property is important, since those types can be used
 8# without constraints on the GPU!
 9@assert isbitstype(Test) == true
10x = (2, 2)
11isa(x, Tuple{Int, Int}) # tuples are also immutable
12mutable struct Test2 #-> mutable, isbits(Test2) == false
13    x::Float32
14end
15struct Test3
16    # contains a heap allocation/ reference, not isbits
17    x::Vector{Float32}
18    y::Test2 # Test2 is mutable and also heap allocated / a reference
19end
20Vector{Test} # <-  An Array with isbits elements is contigious in memory
21Vector{Test2} # <- An Array with mutable elements is basically an array of heap pointers. Since it just contains cpu heap pointers, it won't work on the GPU.

"Array{Test2,1}"

所有这些Julia类型在转移到GPU或在GPU上创建时表现都不同。下表概述了预期结果：

创建位置描述了对象是否在CPU上创建然后传输到GPU内核，或者是否在内核的GPU上创建。这个表显示了是否可以创建类型的实例，并且对于从CPU到GPU的传输，该表还指示对象是否通过引用复制或传递。

Garbage Collection

使用GPU时的一个很大的区别是GPU上没有垃圾回收（ garbage collector, GC）。这不是什么大问题，因为为GPU编写的高性能内核不应该一开始就创建任何GC-tracked memory。

为GPU实现GC是可能的，但请记住，每个执行的内核都是大规模并行的。在~1000 GPU线程中的每一个线程创建和跟踪大量堆内存将很快破坏性能增益，因此这实际上是不值得的。

作为内核中堆分配数组的替代方法，你可以使用GPUArrays。GPUArray构造函数将创建GPU缓冲区并将数据传输到VRAM。如果调用Array(gpu_array)，数组将被转移回RAM，表示为普通的Julia数组。这些GPU数组的Julia句柄由Julia的GC跟踪，如果它不再使用，GPU内存将被释放。

因此，只能在设备上使用堆栈分配，并且对其余的预先分配的GPU缓冲区使用。由于传输非常昂贵的，因此在编程GPU时尽可能多地重用和预分配是很常见的。

The GPUArray Constructors

 1using CuArrays, LinearAlgebra
 2
 3# GPU Arrays can be constructed from all Julia arrays containing isbits types!
 4A1D = cu([1, 2, 3]) # cl for CLArrays
 5A1D = fill(CuArray{Int}, 0, (100,)) # CLArray for CLArrays
 6# Float32 array - Float32 is usually preferred and can be up to 30x faster on most GPUs than Float64
 7diagonal_matrix = CuArray{Float32}(I, 100, 100)
 8filled = fill(CuArray, 77f0, (4, 4, 4)) # 3D array filled with Float32 77
 9randy = rand(CuArray, Float32, 42, 42) # random numbers generated on the GPU
10# The array constructor also accepts isbits iterators with a known size
11# Note, that since you can also pass isbits types to a gpu kernel directly, in most cases you won't need to materialize them as an gpu array
12from_iter = CuArray(1:10)
13# let's create a point type to further illustrate what can be done:
14struct Point
15    x::Float32
16    y::Float32
17end
18Base.convert(::Type{Point}, x::NTuple{2, Any}) = Point(x[1], x[2])
19# because we defined the above convert from a tuple to a point
20# [Point(2, 2)] can be written as Point[(2,2)] since all array 
21# elements will get converted to Point
22custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])
23typeof(custom_types)

"CuArray{Point, 1}"

Array Operations

许多操作是已经定义好的。最重要的是，GPUArrays支持Julia的fusing dot broadcasting notation。这种标记法允许你将函数应用于数组的每个元素，并使用f的返回值创建一个新数组。这个功能通常称为映射（map）。 broadcast 指的是具有不同形状的数组被散布到相同的形状。

它的工作方式如下：

1x = zeros(4, 4) # 4x4 array of zeros
2y = zeros(4) # 4 element array
3z = 2 # a scalar
4# y's 1st dimension gets repeated for the 2nd dimension in x
5# and the scalar z get's repeated for all dimensions
6# the below is equal to `broadcast(+, broadcast(+, xx, y), z)`
7x .+ y .+ z

关于broadcasting如何工作的更多解释，可以看看这个指南：

julia.guide/broadcasting

这意味着在不分配堆内存（仅创建isbits类型）的情况下运行的任何Julia函数都可以应用于GPUArray的每个元素，并且多个dot调用将融合到一个内核调用中。由于内核调用延迟很高，这种融合是一个非常重要的优化。

 1using CuArrays
 2A = cu([1, 2, 3])
 3B = cu([1, 2, 3])
 4C = rand(CuArray, Float32, 3)
 5result = A .+ B .- C
 6test(a::T) where T = a * convert(T, 2) # convert to same type as `a`
 7
 8# inplace broadcast, writes directly into `result`
 9result .= test.(A) # custom function work
10
11# The cool thing is that this composes well with custom types and custom functions.
12# Let's go back to our Point type and define addition for it
13Base.:(+)(p1::Point, p2::Point) = Point(p1.x + p2.x, p1.y + p2.y)
14
15# now this works:
16custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])
17
18# This particular example also shows the power of broadcasting: 
19# Non array types are broadcasted and repeated for the whole length
20result = custom_types .+ Ref(Point(2, 2))
21
22# So the above is equal to (minus all the allocations):
23# this allocates a new array on the gpu, which we can avoid with the above broadcast
24broadcasted = fill(CuArray, Point(2, 2), (3,))
25
26result == custom_types .+ broadcasted

ture

现实世界中的GPUArrays

让我们直接看看一些很酷的用例。

如下面的视频所示，这个GPU加速烟雾模拟是使用GPUArrays + CLArrays创建的，可在GPU或CPU上运行，GPU版本的速度提高了15倍：

还有更多的用例，包括求解微分方程，有限元模拟和求解偏微分方程。

让我们来看一个简单的机器学习示例，看看如何使用GPUArrays：

 1using Flux, Flux.Data.MNIST, Statistics
 2using Flux: onehotbatch, onecold, crossentropy, throttle
 3using Base.Iterators: repeated, partition
 4using CuArrays
 5
 6# Classify MNIST digits with a convolutional network
 7
 8imgs = MNIST.images()
 9
10labels = onehotbatch(MNIST.labels(), 0:9)
11
12# Partition into batches of size 1,000
13train = [(cat(float.(imgs[i])..., dims = 4), labels[:,i])
14         for i in partition(1:60_000, 1000)]
15
16use_gpu = true # helper to easily switch between gpu/cpu
17
18todevice(x) = use_gpu ? gpu(x) : x
19
20train = todevice.(train)
21
22# Prepare test set (first 1,000 images)
23tX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> todevice
24tY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> todevice
25
26m = Chain(
27  Conv((2,2), 1=>16, relu),
28  x -> maxpool(x, (2,2)),
29  Conv((2,2), 16=>8, relu),
30  x -> maxpool(x, (2,2)),
31  x -> reshape(x, :, size(x, 4)),
32  Dense(288, 10), softmax) |> todevice
33
34m(train[1][1])
35
36loss(x, y) = crossentropy(m(x), y)
37
38accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
39
40evalcb = throttle(() -> @show(accuracy(tX, tY)), 10)
41opt = ADAM(Flux.params(m));

1# train
2for i = 1:10
3    Flux.train!(loss, train, opt, cb = evalcb)
4end

accuracy(tX, tY) = 0.101

accuracy(tX, tY) = 0.888

accuracy(tX, tY) = 0.919

1using Colors, FileIO, ImageShow
2N = 22
3img = tX[:, :, 1:1, N:N]
4println("Predicted: ", Flux.onecold(m(img)) .- 1)
5Gray.(collect(tX[:, :, 1, N]))

只需将数组转换为GPUArrays（使用gpu(array)），我们就可以将整个计算转移到GPU并获得相当不错的速度提升。这要归功于Julia复杂的AbstractArray基础架构，GPUArray可以无缝地集成到其中。接着，如果你省略了对转换为GPUArray，代码也将使用普通的Julia数组运行——但当然这是在CPU上运行。你可以通过将use_gpu = true更改为use_gpu = false并重试初始化和训练单元格来尝试这个操作。对比GPU和CPU，CPU运行时间为975秒，GPU运行时间为29秒 ——加速了约33倍！

另一个值得关注的好处是，GPUArrays不需显式地实现自动微分以有效地支持神经网络的反向传播。这是因为Julia的自动微分库适用于任意函数，并发出可在GPU上高效运行的代码。这有助于帮助Flux以最少的开发人员在GPU上工作，并使Flux GPU能够有效地支持用户定义的函数。在没有GPUArrays + Flux之间协调的情况下开箱即用是Julia的一个非常独特的特性，详细解释见[3].

编写GPU内核

只需使用GPUArrays的通用抽象数组接口，而不用编写任何GPU内核，就可以做很多事了。但是，在某些时候，可能需要实现一个需要在GPU上运行的算法，并且不能用通用数组算法的组合来表示。

好的一点是，GPUArrays通过一种分层方法减少了大量的工作，这种方法允许你从高级代码开始编写低级内核，类似于大多数OpenCL / CUDA示例里的。它还允许你在OpenCL或CUDA设备上执行内核，从而抽象出这些框架中的任何差异。

使这成为可能的函数名为gpu_call。它可以被称为 gpu_call(kernel, A::GPUArray, args)，并将在GPU上使用参数 (state, args...) 调用内核。State是一个后端特定对象，用于实现获取线程索引之类的功能。GPUArray需要作为第二个参数传递，一遍分派到正确的后端并提供启动参数的缺省值。

让我们使用gpu_call来实现一个简单的map kernel：

 1using GPUArrays, CuArrays
 2# Overloading the Julia Base map! function for GPUArrays
 3function Base.map!(f::Function, A::GPUArray, B::GPUArray)
 4    # our function that will run on the gpu
 5    function kernel(state, f, A, B)
 6        # If launch parameters aren't specified, linear_index gets the index
 7        # into the Array passed as second argument to gpu_call (`A`)
 8        i = linear_index(state)
 9            if i <= length(A)
10          @inbounds A[i] = f(B[i])
11        end
12        return
13    end
14    # call kernel on the gpu
15    gpu_call(kernel, A, (f, A, B))
16end

简单来说，上面的代码将在GPU上并行调用julia函数内核length(A) 次。内核的每个并行调用都有一个线程索引，我们可以使用它来安全地索引到数组A和B。如果我们计算自己的索引，而不是使用linear_index，我们需要确保没有多个线程读写同一个数组位置。因此，如果我们使用线程在纯Julia中编写，其对应版本如下：

 1using BenchmarkTools
 2function threadded_map!(f::Function, A::Array, B::Array)
 3    Threads.@threads for i in 1:length(A)
 4        A[i] = f(B[i])
 5    end
 6  A
 7end
 8x, y = rand(10^7), rand(10^7)
 9kernel(y) = (y / 33f0) * (732.f0/y)
10# on the cpu without threads:
11single_t = @belapsed map!($kernel, $x, $y)
12
13# "on the CPU with 4 threads (2 real cores):
14thread_t = @belapsed threadded_map!($kernel, $x, $y)
15
16# on the GPU:
17xgpu, ygpu = cu(x), cu(y)
18gpu_t = @belapsed begin
19  map!($kernel, $xgpu, $ygpu)
20  GPUArrays.synchronize($xgpu)
21end
22times = [single_t, thread_t, gpu_t]
23speedup = maximum(times) ./ times
24println("speedup: $speedup")
25bar(["1 core", "2 cores", "gpu"], speedup, legend = false, fillcolor = :grey, ylabel = "speedup")

因为这个函数没有做很多工作，我们看不到完美的扩展，但线程和GPU版本仍然提供了很大的加速。

GPU比线程示例展示的要复杂得多，因为硬件线程是在线程块中布局的——gpu_call在简单版本中抽象出来，但它也可以用于更复杂的启动配置：

 1using CuArrays
 2
 3threads = (2, 2)
 4blocks = (2, 2)
 5T = fill(CuArray, (0, 0), (4, 4))
 6B = fill(CuArray, (0, 0), (4, 4))
 7gpu_call(T, (B, T), (blocks, threads)) do state, A, B
 8  # those names pretty much refer to the cuda names
 9    b = (blockidx_x(state), blockidx_y(state))
10    bdim = (blockdim_x(state), blockdim_y(state))
11    t = (threadidx_x(state), threadidx_y(state))
12    idx = (bdim .* (b .- 1)) .+ t
13    A[idx...] = b
14    B[idx...] = t
15    return
16end
17println("Threads index: \n", T)
18println("Block index: \n", B)

在上面的示例中，你可以看到更复杂的启动配置的迭代顺序。确定正确的迭代+启动配置对于达到GPU的最佳性能至关重要。

结论

在将可组合的高级编程引入高性能世界方面，Julia取得了长足的进步。现在是时候对GPU做同样的事情了。

希望Julia降低开始在GPU上编程的标准，并且我们可以为开源GPU计算发展可扩展的平台。第一个成功案例是通过Julia packages实现自动微分，这些软件包甚至不是为GPU编写，因此这给了我们很多理由相信Julia在GPU计算领域的可扩展和通用设计是成功的。

[1]https://devblogs.nvidia.com/gpu-computing-julia-programming-language/

[2]https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/results/results.md

[3]www.stochasticlifestyle.com/why-numba-and-cython-are-not-substitutes-for-julia/

原文地址：

https://nextjournal.com/sdanisch/julia-gpu-programming

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2018-10-22，如有侵权请联系 cloudcommunity@tencent.com 删除

其他

本文分享自新智元微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

其他

登录后参与评论

0 条评论

热度