This follows from an observation in issue #1323.
Given this kernel, which sums the rows of `Y` into `X` (one thread per row):

```julia
using BenchmarkTools
using CUDA

function mysum(X, Y, n)
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > n && return
    cY = CUDA.Const(Y)   # read-only hint, allows loads through the read-only cache
    @inbounds for j in 1:n
        X[I] += cY[I, j]
    end
    return
end
```
```julia
# MYSUM PERF. TEST.
nth = 512
N = 500;   A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 1000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 2000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 4000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 6000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 8000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 10000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 12000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)

# BUILTIN PERF. TEST.
N = 500;   A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 1000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 2000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 4000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 6000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 8000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 10000; A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 12000; A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
```
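As a sanity check (not part of the benchmark above), the kernel can be verified against the builtin reduction; note that `mysum` accumulates with `+=`, so the output has to start from zeros, and only `N` threads are actually needed:

```julia
using CUDA

N = 1000
A = CUDA.randn(Float32, N, N)
B = CUDA.zeros(Float32, N, 1)   # must be zeroed: mysum accumulates into it
@cuda threads=512 blocks=cld(N, 512) mysum(B, A, N)
@assert Array(B) ≈ Array(sum(A, dims=2))
```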
This gives the following results:
| N | `mysum` | `sum(A, dims=2)` |
|---|---|---|
| 500 | 27.702 μs (44 allocations: 1.84 KiB) | 61.396 μs (134 allocations: 4.84 KiB) |
| 1000 | 80.011 μs (34 allocations: 1.44 KiB) | 144.424 μs (266 allocations: 8.95 KiB) |
| 2000 | 155.594 μs (168 allocations: 5.62 KiB) | 333.410 μs (321 allocations: 10.67 KiB) |
| 4000 | 323.131 μs (40 allocations: 1.62 KiB) | 968.642 μs (1165 allocations: 37.05 KiB) |
| 6000 | 576.200 μs (314 allocations: 10.19 KiB) | 2.979 ms (1567 allocations: 49.61 KiB) |
| 8000 | 885.445 μs (630 allocations: 20.06 KiB) | 5.656 ms (2234 allocations: 70.45 KiB) |
| 10000 | 1.408 ms (988 allocations: 31.25 KiB) | 8.874 ms (10531 allocations: 329.73 KiB) |
| 12000 | 1.965 ms (1230 allocations: 38.81 KiB) | 13.106 ms (19971 allocations: 624.73 KiB) |
That is roughly a 2–7× speedup over the builtin, growing with N.
Why is `mysum` faster? Is it the `CUDA.Const`? The fact that the handwritten kernel is specialised for `dims=2` while the builtin reduction is generic? Or are we overlooking something?
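One way to narrow this down might be to benchmark an otherwise identical kernel without the `CUDA.Const` wrapper (a sketch, not measured here; `mysum_noconst` is a name introduced for this experiment):

```julia
using BenchmarkTools
using CUDA

# Same kernel as mysum, but indexing Y directly instead of through CUDA.Const.
function mysum_noconst(X, Y, n)
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > n && return
    @inbounds for j in 1:n
        X[I] += Y[I, j]   # plain global-memory load, no read-only hint
    end
    return
end

nth = 512
N = 4000
A = CUDA.randn(Float32, N, N)
B = CUDA.randn(Float32, N, 1)
@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum_noconst($B, $A, $N)
```

If its timings stay close to `mysum`, the gap would point at the generic reduction machinery rather than at `CUDA.Const`.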