Description
I was trying to use `@avx` in the following subspace matrix multiplication code, but to my surprise, LoopVectorization is significantly slower than StaticArrays. I'm not sure if I'm doing it correctly. The subspace matrix multiplication can be defined as the following Julia code:
```julia
function subspace_smm!(st::AbstractMatrix, indices, U::AbstractMatrix, subspace)
    @inbounds @views for k in subspace
        idx = k .+ indices
        for b in 1:size(st, 2)
            st[idx, b] = U * st[idx, b]
        end
    end
    return st
end
```
where `subspace` and `indices` are sets of integers describing the subspace, and `U` is a matrix in that subspace; the matrix multiplication happens inside the subspace. An example of how to use the above function:
```julia
using StaticArrays
using BenchmarkTools
using LoopVectorization # needed for @avx below

n = 3
st = rand(ComplexF64, 1 << 8, 100)
y = similar(st, (1 << n, 100))
U = rand(ComplexF64, 1 << n, 1 << n)
SU = @SMatrix rand(ComplexF64, 1 << n, 1 << n)
indices = [1, 2, 5, 6, 9, 10, 13, 14]
subspace = [0, 2, 16, 18, 32, 34, 48, 50, 64, 66, 80, 82, 96, 98, 112, 114, 128, 130, 144, 146, 160, 162, 176, 178, 192, 194, 208, 210, 224, 226, 240, 242]
```
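To make the indexing concrete, here is what one subspace block looks like, taking `k = 16` from `subspace` above as an example:

```julia
# For a given k in subspace, the rows of `st` that participate in the
# multiplication are k .+ indices.
indices = [1, 2, 5, 6, 9, 10, 13, 14]
k = 16
idx = k .+ indices
# idx == [17, 18, 21, 22, 25, 26, 29, 30]
```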
```julia
julia> @benchmark subspace_smm!($(copy(st)), $indices, $SU, $subspace)
BenchmarkTools.Trial:
  memory estimate:  4.50 KiB
  allocs estimate:  32
  --------------
  minimum time:     89.259 μs (0.00% GC)
  median time:      115.608 μs (0.00% GC)
  mean time:        115.988 μs (0.10% GC)
  maximum time:     1.334 ms (91.20% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
Now if we implement the above matrix multiplication with plain `for` loops and `@avx`, it can be written as follows:
```julia
function subspace_mm!(st::AbstractMatrix, indices, U::AbstractMatrix{T}, y::AbstractMatrix, subspace) where T
    @avx for k in subspace
        for b in 1:size(st, 2)
            for i in 1:size(U, 1)
                y[i, b] = zero(T)
                for j in 1:size(U, 2)
                    y[i, b] += U[i, j] * st[k + indices[j], b]
                end
            end
            for i in 1:size(U, 1)
                st[k + indices[i], b] = y[i, b]
            end
        end
    end
    return st
end
```
However, this turns out to be significantly slower than the previous naive version using StaticArrays:
```julia
julia> y = similar(st, (1 << n, 100));

julia> @benchmark subspace_mm!($(copy(st)), $indices, $SU, $y, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     203.286 μs (0.00% GC)
  median time:      213.306 μs (0.00% GC)
  mean time:        213.945 μs (0.00% GC)
  maximum time:     245.486 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
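As a quick sanity check (a sketch, assuming the two function definitions and the setup variables above are already in scope), both implementations produce the same result, so the difference is purely performance:

```julia
# Sanity check (sketch): both versions should transform `st` identically.
st1 = copy(st)
st2 = copy(st)
subspace_smm!(st1, indices, SU, subspace)
subspace_mm!(st2, indices, SU, y, subspace)
@assert st1 ≈ st2  # agreement up to floating-point roundoff
```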
`y` above is a temporary buffer that stores the result of each subspace matrix multiplication, so we can assign it back to `st`. I'm testing this on a Ryzen machine:
```julia
julia> versioninfo()
Julia Version 1.6.0-DEV.1699
Commit 8a5ac94e96* (2020-12-08 03:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, znver2)
Environment:
  JULIA_EDITOR = "/home/roger/.vscode-server/bin/e5a624b788d92b8d34d1392e4c4d9789406efe8f/bin/code"
  JULIA_NUM_THREADS =
```
I'm not sure whether it would be different on a machine with AVX512, but I guess it's probably not an AVX512 issue, since StaticArrays is much faster on the same machine?