This follows from an observation in issue #1323.
Given this kernel, which sums the rows of `Y` into `X` (one thread per row):

```julia
using BenchmarkTools
using CUDA

function mysum(X, Y, n)
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > n && return
    cY = CUDA.Const(Y)   # read-only hint, allows loads through the read-only cache
    @inbounds for j in 1:n
        X[I] += cY[I, j]
    end
    return
end
```
```julia
# MYSUM PERF. TEST.
nth = 512
N = 500;   A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 1000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 2000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 4000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 6000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 8000;  A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 10000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)
N = 12000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); @btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B, $A, $N)

# BUILTIN PERF. TEST.
N = 500;   A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 1000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 2000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 4000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 6000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 8000;  A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 10000; A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
N = 12000; A = CUDA.randn(Float32, N, N); @btime CUDA.@sync sum($A, dims=2)
```
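As a sanity check (not part of the benchmark above), the kernel can be verified against the builtin reduction; note that `mysum` accumulates with `+=`, so the output has to start from zeros, and only `N` threads are actually needed:

```julia
using CUDA

N = 1000
A = CUDA.randn(Float32, N, N)
B = CUDA.zeros(Float32, N, 1)   # must be zeroed: mysum accumulates into it
@cuda threads=512 blocks=cld(N, 512) mysum(B, A, N)
@assert Array(B) ≈ Array(sum(A, dims=2))
```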
This gives the following results:
| N | `mysum` | `sum(A, dims=2)` |
|---|---|---|
| 500 | 27.702 μs (44 allocations: 1.84 KiB) | 61.396 μs (134 allocations: 4.84 KiB) |
| 1000 | 80.011 μs (34 allocations: 1.44 KiB) | 144.424 μs (266 allocations: 8.95 KiB) |
| 2000 | 155.594 μs (168 allocations: 5.62 KiB) | 333.410 μs (321 allocations: 10.67 KiB) |
| 4000 | 323.131 μs (40 allocations: 1.62 KiB) | 968.642 μs (1165 allocations: 37.05 KiB) |
| 6000 | 576.200 μs (314 allocations: 10.19 KiB) | 2.979 ms (1567 allocations: 49.61 KiB) |
| 8000 | 885.445 μs (630 allocations: 20.06 KiB) | 5.656 ms (2234 allocations: 70.45 KiB) |
| 10000 | 1.408 ms (988 allocations: 31.25 KiB) | 8.874 ms (10531 allocations: 329.73 KiB) |
| 12000 | 1.965 ms (1230 allocations: 38.81 KiB) | 13.106 ms (19971 allocations: 624.73 KiB) |
That is roughly a 2–7× speedup over the builtin, growing with N.
Why is `mysum` faster? Is it the `CUDA.Const`? The fact that the handwritten kernel is specialised for `dims=2` while the builtin reduction is generic? Or are we overlooking something?
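One way to narrow this down might be to benchmark an otherwise identical kernel without the `CUDA.Const` wrapper (a sketch, not measured here; `mysum_noconst` is a name introduced for this experiment):

```julia
using BenchmarkTools
using CUDA

# Same kernel as mysum, but indexing Y directly instead of through CUDA.Const.
function mysum_noconst(X, Y, n)
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > n && return
    @inbounds for j in 1:n
        X[I] += Y[I, j]   # plain global-memory load, no read-only hint
    end
    return
end

nth = 512
N = 4000
A = CUDA.randn(Float32, N, N)
B = CUDA.randn(Float32, N, 1)
@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum_noconst($B, $A, $N)
```

If its timings stay close to `mysum`, the gap would point at the generic reduction machinery rather than at `CUDA.Const`.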