Description
I was trying to use `@avx` in the following subspace matrix multiplication code, but to my surprise, LoopVectorization is significantly slower than StaticArrays. I'm not sure if I'm doing it correctly. The subspace matrix multiplication can be defined as the following Julia code:
```julia
function subspace_smm!(st::AbstractMatrix, indices, U::AbstractMatrix, subspace)
    @inbounds @views for k in subspace
        idx = k .+ indices
        for b in 1:size(st, 2)
            st[idx, b] = U * st[idx, b]
        end
    end
    return st
end
```
where `subspace` and `indices` are sets of integers describing the subspace, and `U` is a matrix in that subspace; the matrix multiplication happens inside the subspace. An example of how to use the above function:
```julia
using StaticArrays
using BenchmarkTools
using LoopVectorization # needed for @avx below

n = 3
st = rand(ComplexF64, 1 << 8, 100)
y = similar(st, (1 << n, 100))
U = rand(ComplexF64, 1 << n, 1 << n)
SU = @SMatrix rand(ComplexF64, 1 << n, 1 << n)
indices = [1, 2, 5, 6, 9, 10, 13, 14]
subspace = [0, 2, 16, 18, 32, 34, 48, 50, 64, 66, 80, 82, 96, 98, 112, 114, 128, 130, 144, 146, 160, 162, 176, 178, 192, 194, 208, 210, 224, 226, 240, 242]
```
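To make the indexing concrete, here is what one subspace block looks like, taking `k = 16` from `subspace` above as an example:

```julia
# For a given k in subspace, the rows of `st` that participate in the
# multiplication are k .+ indices.
indices = [1, 2, 5, 6, 9, 10, 13, 14]
k = 16
idx = k .+ indices
# idx == [17, 18, 21, 22, 25, 26, 29, 30]
```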
```julia
julia> @benchmark subspace_smm!($(copy(st)), $indices, $SU, $subspace)
BenchmarkTools.Trial:
  memory estimate:  4.50 KiB
  allocs estimate:  32
  --------------
  minimum time:     89.259 μs (0.00% GC)
  median time:      115.608 μs (0.00% GC)
  mean time:        115.988 μs (0.10% GC)
  maximum time:     1.334 ms (91.20% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
Now if we implement the above matrix multiplication with plain `for` loops and `@avx`, it can be written as follows:
```julia
function subspace_mm!(st::AbstractMatrix, indices, U::AbstractMatrix{T}, y::AbstractMatrix, subspace) where T
    @avx for k in subspace
        for b in 1:size(st, 2)
            for i in 1:size(U, 1)
                y[i, b] = zero(T)
                for j in 1:size(U, 2)
                    y[i, b] += U[i, j] * st[k + indices[j], b]
                end
            end
            for i in 1:size(U, 1)
                st[k + indices[i], b] = y[i, b]
            end
        end
    end
    return st
end
```
However, this turns out to be significantly slower than the previous naive version using StaticArrays:
```julia
julia> y = similar(st, (1 << n, 100));

julia> @benchmark subspace_mm!($(copy(st)), $indices, $SU, $y, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     203.286 μs (0.00% GC)
  median time:      213.306 μs (0.00% GC)
  mean time:        213.945 μs (0.00% GC)
  maximum time:     245.486 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
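As a quick sanity check (a sketch, assuming the two function definitions and the setup variables above are already in scope), both implementations produce the same result, so the difference is purely performance:

```julia
# Sanity check (sketch): both versions should transform `st` identically.
st1 = copy(st)
st2 = copy(st)
subspace_smm!(st1, indices, SU, subspace)
subspace_mm!(st2, indices, SU, y, subspace)
@assert st1 ≈ st2  # agreement up to floating-point roundoff
```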
`y` above is a temporary buffer that stores the result of each subspace matrix multiplication, so we can assign it back to `st`. I'm testing this on a Ryzen machine:
```julia
julia> versioninfo()
Julia Version 1.6.0-DEV.1699
Commit 8a5ac94e96* (2020-12-08 03:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, znver2)
Environment:
  JULIA_EDITOR = "/home/roger/.vscode-server/bin/e5a624b788d92b8d34d1392e4c4d9789406efe8f/bin/code"
  JULIA_NUM_THREADS =
```
I'm not sure whether it would be different on a machine with AVX512, but I guess it's probably not an AVX512 issue, since StaticArrays is much faster on the same machine?