Speeding up my logsumexp function

The awkward thing about broadcasting is that types and syntax together still do not fully specify the behavior: axis n is broadcast only if isone(size(A, n)), which is a runtime property.
Currently LoopVectorization handles broadcasting by setting the stride of every size-1 dimension to 0. The linear index of A[n...] is (up to the 1-based offset) dot(n, strides(A)), so a zero stride makes the index along that axis irrelevant, and the load "broadcasts" along it.
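A toy sketch of that stride trick, written out by hand (only an illustration, not how LoopVectorization actually lowers the loop; at and stride_A are made-up names):

A = [10.0 20.0 30.0]        # 1×3, so axis 1 is the axis being broadcast
B = rand(4, 3)
stride_A = (0, 1)           # pretend the size-1 axis has stride 0
at(i, j) = A[1 + (i - 1) * stride_A[1] + (j - 1) * stride_A[2]]  # linear index via dot(n, strides)
C = [B[i, j] - at(i, j) for i in 1:4, j in 1:3]
C ≈ B .- A                  # true: the zero stride reuses A's single row for every i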

However, this doesn’t work well when loading data along a contiguous axis.
For loads along that axis to be efficient, we need the vmovup* instructions, which load contiguous elements, so the stride = 0 trick won’t work there.
It would work with gather instructions, available with AVX2 and AVX512, but those are many times slower; while better than nothing, they cripple performance compared to contiguous loads.
This means that to be efficient we have to use the contiguous loads (and stores, but stores aren’t a problem when broadcasting).

Obviously base Julia + LLVM don’t have a problem with this. I’m guessing they use a few runtime checks to switch between different versions of the loop, and I should probably follow that approach.
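Roughly what I mean by runtime checks, as a hypothetical hand-written version (not what LLVM actually emits; sub_bcast! is a made-up name): the branch on size(max_, 1) picks between a loop that reuses one row of max_ and a plain elementwise loop, and both inner loops read mat contiguously.

function sub_bcast!(out, mat, max_)
    if isone(size(max_, 1))                 # broadcast max_ along axis 1
        @inbounds for j in axes(mat, 2), i in axes(mat, 1)
            out[i, j] = mat[i, j] - max_[1, j]
        end
    else                                    # same shape: plain elementwise loop
        @inbounds for j in axes(mat, 2), i in axes(mat, 1)
            out[i, j] = mat[i, j] - max_[i, j]
        end
    end
    return out
end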

But for now, a workaround (incompatible with the dims argument) is to make it known at compile time that isone(size(max_,1)):

function lsexp_mat4(mat; dims=1) # @avx broadcasting, with the row-vector workaround
    @assert dims == 1
    max_ = vec(maximum(mat, dims=1))' # row vector, so size(max_, 1) == 1 is known from the type
    # subtracting (mat .== max_) cancels the exp(0) == 1 term, so log1p can be used below
    exp_mat = @avx exp.(mat .- max_) .- (mat .== max_)
    sum_exp_ = sum(exp_mat, dims=dims)
    @avx sum_exp_ .= log1p.(sum_exp_) .+ max_
end
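
The point of the adjoint is that the row-vector wrapper encodes the length-1 first axis in the type, so no runtime check is needed:

mat = rand(3, 4)
max_mat = maximum(mat, dims=1)   # 1×4 Matrix{Float64}: the size-1 axis is only a runtime property
max_row = vec(max_mat)'          # Adjoint of a Vector: axis 1 has length 1 by construction
size(max_mat, 1) == size(max_row, 1) == 1   # true for both, but only the second is visible from the type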

So now I get:

julia> n = 1_000; A = rand(n,n);

julia> lsexp_mat(A) ≈ lsexp_mat1(A) ≈ lsexp_mat2(A)
true

julia> lsexp_mat(A) ≈ lsexp_mat4(A)
true

julia> lsexp_mat(A) ≈ lsexp_mat3(A) ≈ lsexp_mat5(A)
true

julia> @btime lsexp_mat($A);
  10.844 ms (13 allocations: 15.41 MiB)

julia> @btime lsexp_mat2($A);
  10.726 ms (6 allocations: 7.65 MiB)

julia> @btime lsexp_mat3($A);
  8.750 ms (21 allocations: 7.65 MiB)

julia> @btime lsexp_mat4($A);
  2.402 ms (17 allocations: 7.65 MiB)

julia> @btime lsexp_mat5($A);
  2.557 ms (19 allocations: 7.65 MiB)

julia> using Tracker, Zygote #, ForwardDiff

julia> Zygote.@nograd safeeq

julia> gA = Tracker.gradient(sum∘lsexp_mat, A)[1];

julia> Zygote.gradient(sum∘lsexp_mat1, A)[1] ≈ gA
true

julia> Zygote.gradient(sum∘lsexp_mat3, A)[1] ≈ gA
true

julia> Zygote.gradient(sum∘lsexp_mat5, A)[1] ≈ gA
true

julia> @btime Zygote.gradient(sum∘lsexp_mat1, $A);
  67.391 ms (3003130 allocations: 137.61 MiB)

julia> @btime Zygote.gradient(sum∘lsexp_mat3, $A);
  22.924 ms (133 allocations: 38.22 MiB)

julia> @btime Zygote.gradient(sum∘lsexp_mat5, $A);
  10.519 ms (128 allocations: 38.22 MiB)

Overall, our performance numbers are really similar except for lsexp_mat3, where my computer is much slower.
