Within-Method GPU Parallelism of Ordinary Differential Equation Solves
Within-Method GPU Parallelism for ODE solvers is a method for accelerating large ODE solves whose computations are regular, i.e., expressed only with array-based "vectorized" operations such as linear algebra, maps, and broadcast statements. In these cases, the solve can be GPU-accelerated simply by placing the initial condition array on the GPU. As a quick example:
using OrdinaryDiffEq, CUDA, LinearAlgebra

# Linear ODE du/dt = A*u, written with in-place, array-based operations
function f(du, u, p, t)
    mul!(du, A, u)
end

A = cu(-rand(3, 3))                 # cu moves the matrix to the GPU (as Float32)
u0 = cu([1.0; 0.0; 0.0])            # a GPU initial condition keeps the solver state on the GPU
tspan = (0.0f0, 100.0f0)            # Float32 time span to match the GPU element type
prob = ODEProblem(f, u0, tspan)
sol = solve(prob, Tsit5())          # non-stiff solver
sol = solve(prob, Rosenbrock23())   # stiff solver
retcode: Success
Interpolation: specialized 2nd order "free" stiffness-aware interpolation
t: 229-element Vector{Float32}:
0.0
9.27572f-5
0.0010203292
0.005445473
0.011240605
0.02369273
0.039545782
0.06525832
0.09650515
0.13952377
⋮
95.27473
95.91403
96.57579
97.23755
97.89931
98.561066
99.23637
99.91167
100.0
u: 229-element Vector{CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}:
Float32[1.0, 0.0, 0.0]
Float32[0.9999191, -4.136551f-5, -8.718866f-5]
Float32[0.9991111, -0.00045436196, -0.00095838745]
Float32[0.9952734, -0.002408212, -0.005097443]
Float32[0.99028957, -0.004926172, -0.0104752965]
Float32[0.9797392, -0.010182449, -0.021869581]
Float32[0.9666134, -0.016576296, -0.0360636]
Float32[0.9460261, -0.026261317, -0.05836517]
Float32[0.9221204, -0.036940157, -0.08431637]
Float32[0.8910716, -0.04980573, -0.11809613]
⋮
Float32[0.00049719884, -0.0022435721, 0.0021882595]
Float32[0.0010859583, -0.0024246967, 0.0015704404]
Float32[0.0015605923, -0.0023972206, 0.00084748876]
Float32[0.0018686663, -0.0021694193, 0.000106063744]
Float32[0.0019964012, -0.0017752235, -0.0005898974]
Float32[0.0019462514, -0.0012601186, -0.0011849877]
Float32[0.001729764, -0.00066427403, -0.0016438589]
Float32[0.0013776367, -5.4696073f-5, -0.0019241067]
Float32[0.001323388, 2.3389219f-5, -0.0019467665]
Notice that both stiff and non-stiff ODE solvers were used here.
The time span was changed to Float32 types, as GPUs generally have very slow Float64 operations, usually around 1/32 of the speed of Float32. cu(x) on an array automatically changes an Array{Float64} to a CuArray{Float32}. If this is not intended, use the CuArray constructor directly. For more information on GPU Float64 performance issues, search online for discussions of GPU Float64 throughput.
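As a minimal sketch of this conversion behavior (the variable x below is purely illustrative and assumes a CUDA-capable GPU is available):

using CUDA

x = [1.0, 0.0, 0.0]   # Array{Float64} on the CPU
cu(x)                 # becomes a CuArray{Float32}: cu converts the element type
CuArray(x)            # stays Float64: the constructor preserves the element type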
Float32 precision is sometimes not enough to accurately solve a stiff ODE. Make sure that reduced precision is acceptable by investigating the condition number of the Jacobian. If this value is well above 1e8, use Float32 with caution!
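For the linear example above, the Jacobian of f is simply A, so one rough check (a sketch, computed on the CPU for convenience; J is an illustrative variable) is:

using LinearAlgebra, CUDA

J = Array(A)   # copy the (small) Jacobian back to the CPU
cond(J)        # if this is well above 1e8, Float32 is likely too coarse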
Restrictions of CuArrays
Note that all the rules of CUDA.jl apply when CuArrays are used in the solver. For most of the AbstractArray interface they act similarly to Arrays: broadcasting operations (x .* y), for example, are defined and run on the GPU. However, operations that fall back to scalar indexing will either error or be extremely slow. For more information on the rules and restrictions of CuArrays, see the CUDA.jl documentation.
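As a small sketch of these restrictions (x and y are illustrative variables; CUDA.allowscalar and CUDA.@allowscalar are CUDA.jl's tools for controlling scalar indexing):

using CUDA

x = cu(rand(3))
y = cu(rand(3))

x .* y                     # broadcasting runs as a fused GPU kernel
CUDA.allowscalar(false)    # forbid slow element-by-element access
# x[1]                     # would now throw a scalar indexing error
CUDA.@allowscalar x[1]     # explicitly opt in when scalar access is truly needed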