Within-Method GPU Parallelism of Ordinary Differential Equation Solves

Within-Method GPU Parallelism for ODE solvers accelerates large ODE solves whose right-hand side has regular structure, i.e., it is written entirely in terms of array-based “vectorized” operations like linear algebra, maps, and broadcast statements. In these cases, the solve can be GPU-accelerated simply by placing the initial condition array on the GPU. As a quick example:

using OrdinaryDiffEq, CUDA, LinearAlgebra

# In-place right-hand side: du = A*u, computed on the GPU via mul!
function f(du, u, p, t)
    mul!(du, A, u)
end

A = cu(-rand(3, 3))        # cu moves the matrix to the GPU (and converts to Float32)
u0 = cu([1.0; 0.0; 0.0])   # GPU initial condition
tspan = (0.0f0, 100.0f0)   # Float32 time span to match the GPU precision

prob = ODEProblem(f, u0, tspan)
sol = solve(prob, Tsit5())         # non-stiff solver
sol = solve(prob, Rosenbrock23())  # stiff solver
retcode: Success
Interpolation: specialized 2nd order "free" stiffness-aware interpolation
t: 229-element Vector{Float32}:
   0.0
   9.27572f-5
   0.0010203292
   0.005445473
   0.011240605
   0.02369273
   0.039545782
   0.06525832
   0.09650515
   0.13952377
   ⋮
  95.27473
  95.91403
  96.57579
  97.23755
  97.89931
  98.561066
  99.23637
  99.91167
 100.0
u: 229-element Vector{CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}:
 Float32[1.0, 0.0, 0.0]
 Float32[0.9999191, -4.136551f-5, -8.718866f-5]
 Float32[0.9991111, -0.00045436196, -0.00095838745]
 Float32[0.9952734, -0.002408212, -0.005097443]
 Float32[0.99028957, -0.004926172, -0.0104752965]
 Float32[0.9797392, -0.010182449, -0.021869581]
 Float32[0.9666134, -0.016576296, -0.0360636]
 Float32[0.9460261, -0.026261317, -0.05836517]
 Float32[0.9221204, -0.036940157, -0.08431637]
 Float32[0.8910716, -0.04980573, -0.11809613]
 ⋮
 Float32[0.00049719884, -0.0022435721, 0.0021882595]
 Float32[0.0010859583, -0.0024246967, 0.0015704404]
 Float32[0.0015605923, -0.0023972206, 0.00084748876]
 Float32[0.0018686663, -0.0021694193, 0.000106063744]
 Float32[0.0019964012, -0.0017752235, -0.0005898974]
 Float32[0.0019462514, -0.0012601186, -0.0011849877]
 Float32[0.001729764, -0.00066427403, -0.0016438589]
 Float32[0.0013776367, -5.4696073f-5, -0.0019241067]
 Float32[0.001323388, 2.3389219f-5, -0.0019467665]

Notice that both a non-stiff solver (Tsit5) and a stiff solver (Rosenbrock23) were used here, and both operate directly on the GPU arrays.
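
Since the solution states are stored as CuArrays, inspecting or post-processing individual values on the CPU requires copying them back to host memory. A minimal sketch, using the sol object from the example above:

u_end = Array(sol.u[end])   # copy the final state back to the host as an Array{Float32}
t_end = sol.t[end]          # the time points are plain Float32 scalars already on the host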

Note

The time span was specified with Float32 values, as GPUs generally have very slow Float64 operations, usually around 1/32 the speed of Float32. Calling cu(x) on an array automatically converts an Array{Float64} to a CuArray{Float32}. If this is not intended, use the CuArray constructor directly. For more information on GPU Float64 performance issues, search online for discussions of GPU Float64 throughput.
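
A minimal sketch of the difference (x here is just an illustrative host array, not part of the example above):

x = rand(3)          # Array{Float64} on the CPU
cu(x)                # CuArray{Float32}: precision is lowered to Float32
CuArray(x)           # CuArray{Float64}: the element type is preserved
CuArray{Float32}(x)  # explicit conversion, equivalent to cu(x) here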

Warn

Float32 is sometimes not enough precision to accurately solve a stiff ODE. Make sure that the precision is sufficient by investigating the condition number of the Jacobian. If this value is well above 1e8, use Float32 with caution!
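
As a rough check for the linear example above, where the Jacobian of f is simply A, the condition number can be computed on the CPU. This is only a sketch; for a nonlinear right-hand side the Jacobian would first have to be obtained, e.g., via automatic differentiation.

using LinearAlgebra
κ = cond(Array(A))  # move A to the host and compute its condition number
κ < 1e8             # if false, consider CuArray{Float64} despite the performance cost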

Restrictions of CuArrays

Note that all the rules of CUDA.jl apply when CuArrays are used in the solver. For most of the AbstractArray interface they behave like regular Arrays, e.g. broadcasting operations (x .* y) are defined and execute as GPU kernels, but operations that fall outside this interface, such as scalar indexing of individual elements, are restricted. For more information on the rules and restrictions of CuArrays, see the CUDA.jl documentation.
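
A short sketch of the distinction (x and y are illustrative CuArrays, not from the example above): broadcasting compiles to a GPU kernel, while reading a single element falls back to slow scalar indexing and must be explicitly allowed.

using CUDA
x = cu(rand(3)); y = cu(rand(3))
z = x .* y              # broadcasting runs as a GPU kernel: fast and fully supported
# x[1]                  # scalar indexing is disallowed (or warned about) by default
CUDA.@allowscalar x[1]  # explicit opt-in for a single-element read when truly needed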