
@avx in subspace matrix multiplication is much slower than StaticArrays #169

Open
@Roger-luo

Description


I was trying to use @avx in the following subspace matrix multiplication code, but to my surprise, LoopVectorization is significantly slower than StaticArrays. I'm not sure if I'm using it correctly.

The subspace matrix multiplication can be defined with the following Julia code:

function subspace_smm!(st::AbstractMatrix, indices, U::AbstractMatrix, subspace)
    @inbounds @views for k in subspace
        idx = k .+ indices  # rows of the current subspace block
        for b in 1:size(st, 2)
            st[idx, b] = U * st[idx, b]  # per-column block matmul
        end
    end
    return st
end

where subspace and indices are sets of integers that describe the subspace, and U is a matrix acting within that subspace; the matrix multiplication happens inside the subspace. An example of how to use the above function:

using StaticArrays
using BenchmarkTools

n = 3
st = rand(ComplexF64, 1 << 8, 100)
y = similar(st, (1 << n, 100))
U = rand(ComplexF64, 1 << n, 1 << n)
SU = @SMatrix rand(ComplexF64, 1 << n, 1 << n)
indices = [1, 2, 5, 6, 9, 10, 13, 14]
subspace = [0, 2, 16, 18, 32, 34, 48, 50, 64, 66, 80, 82, 96, 98, 112, 114, 128, 130, 144, 146, 160, 162, 176, 178, 192, 194, 208, 210, 224, 226, 240, 242]

julia> @benchmark subspace_smm!($(copy(st)), $indices, $SU, $subspace)
BenchmarkTools.Trial: 
  memory estimate:  4.50 KiB
  allocs estimate:  32
  --------------
  minimum time:     89.259 μs (0.00% GC)
  median time:      115.608 μs (0.00% GC)
  mean time:        115.988 μs (0.10% GC)
  maximum time:     1.334 ms (91.20% GC)
  --------------
  samples:          10000
  evals/sample:     1
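To make the index arithmetic concrete, here is a minimal sketch (Base Julia only; the variable values are a shortened version of the ones above) of which rows each subspace block touches. Each k in subspace selects a disjoint block of 8 rows via k .+ indices:

```julia
# Illustrative sketch: for each k in subspace, the rows touched are
# k .+ indices. Using a few subspace entries for brevity.
indices = [1, 2, 5, 6, 9, 10, 13, 14]
subspace = [0, 2, 16, 18]

rows = [k .+ indices for k in subspace]
# Each block is 8 rows, and the blocks are pairwise disjoint, so every
# row of st is updated by exactly one small matmul per call.
@assert rows[1] == [1, 2, 5, 6, 9, 10, 13, 14]
@assert allunique(reduce(vcat, rows))
```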

Now, if we implement the above matrix multiplication with plain for loops and use @avx, it can be written as follows:

function subspace_mm!(st::AbstractMatrix, indices, U::AbstractMatrix{T}, y::AbstractMatrix, subspace) where T
    @avx for k in subspace
        for b in 1:size(st, 2)
            # accumulate U * st[k .+ indices, b] into the temp buffer y
            for i in 1:size(U, 1)
                y[i, b] = zero(T)
                for j in 1:size(U, 2)
                    y[i, b] += U[i, j] * st[k + indices[j], b]
                end
            end

            # write the result back into the subspace rows of st
            for i in 1:size(U, 1)
                st[k + indices[i], b] = y[i, b]
            end
        end
    end
    return st
end
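For reference, a plain @inbounds version with the same loop structure can be checked against the slice-based formulation on a small case. This sketch uses only Base Julia (no LoopVectorization or StaticArrays), and subspace_mm_plain! is a name I made up for it; here y is a per-column vector buffer rather than a matrix:

```julia
# Package-free reference with the same loop structure as the @avx version,
# useful to sanity-check both implementations against each other.
function subspace_mm_plain!(st, indices, U, y, subspace)
    @inbounds for k in subspace, b in 1:size(st, 2)
        for i in 1:size(U, 1)
            acc = zero(eltype(st))
            for j in 1:size(U, 2)
                acc += U[i, j] * st[k + indices[j], b]
            end
            y[i] = acc  # buffer the result before overwriting st
        end
        for i in 1:size(U, 1)
            st[k + indices[i], b] = y[i]
        end
    end
    return st
end

# Compare against the slice formulation on a tiny case.
U = rand(ComplexF64, 8, 8)
st = rand(ComplexF64, 32, 4)
indices = [1, 2, 5, 6, 9, 10, 13, 14]
subspace = [0, 2, 16, 18]
y = similar(st, 8)

ref = copy(st)
for k in subspace, b in 1:size(ref, 2)
    idx = k .+ indices
    ref[idx, b] = U * ref[idx, b]
end
@assert subspace_mm_plain!(copy(st), indices, U, y, subspace) ≈ ref
```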

However, this becomes significantly slower than the previous naive version using StaticArrays:

julia> y = similar(st, (1 << n, 100))

julia> @benchmark subspace_mm!($(copy(st)), $indices, $SU, $y, $subspace)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     203.286 μs (0.00% GC)
  median time:      213.306 μs (0.00% GC)
  mean time:        213.945 μs (0.00% GC)
  maximum time:     245.486 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The y above is a temporary buffer that stores the result of each subspace matrix multiplication, so we can assign it back to st. I'm testing this on a Ryzen machine:

julia> versioninfo()
Julia Version 1.6.0-DEV.1699
Commit 8a5ac94e96* (2020-12-08 03:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, znver2)
Environment:
  JULIA_EDITOR = "/home/roger/.vscode-server/bin/e5a624b788d92b8d34d1392e4c4d9789406efe8f/bin/code"
  JULIA_NUM_THREADS = 

I'm not sure if it would be different on a machine with AVX512 available, but I guess it's probably not an AVX512 issue, since StaticArrays is much faster on the same machine?
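One thing that might be worth ruling out (purely an assumption on my part, not something established above) is the ComplexF64 element type itself: if the vectorizer prefers machine-number arrays, a split real/imaginary ("struct of arrays") layout keeps every load and store on Float64. A Base-Julia sketch of the inner kernel in that form, with the complex product expanded into real arithmetic:

```julia
# Hedged sketch: split-complex form of the inner matvec kernel. This only
# demonstrates the layout transform; wiring it into @avx is untested here.
function kernel_split!(yr, yi, Ur, Ui, xr, xi)
    for i in eachindex(yr)
        accr = 0.0
        acci = 0.0
        for j in eachindex(xr)
            # (Ur + im*Ui) * (xr + im*xi), expanded into real arithmetic
            accr += Ur[i, j] * xr[j] - Ui[i, j] * xi[j]
            acci += Ur[i, j] * xi[j] + Ui[i, j] * xr[j]
        end
        yr[i] = accr
        yi[i] = acci
    end
    return yr, yi
end

# Check the split kernel against the complex matvec it mimics.
U = rand(ComplexF64, 8, 8)
x = rand(ComplexF64, 8)
yr = zeros(8)
yi = zeros(8)
kernel_split!(yr, yi, real(U), imag(U), real(x), imag(x))
@assert yr .+ im .* yi ≈ U * x
```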
