Surprising slowdown from using array comprehension

Hello! I recently ran into a performance issue in my code that was quite difficult to debug, and it still leaves me somewhat puzzled. An algorithm I wrote experiences severe slowdowns when an array comprehension is used to populate a local variable that is accessed repeatedly within an inner loop. More specifically, it seems to happen when variables instead of number literals are used for the range bounds in the comprehension, but I don’t really understand why. More details with the actual code are below. I would be grateful for any help understanding where this slowdown comes from, and what would be the best way to avoid it.

Julia version:

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b73 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 6 × Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 6 default, 0 interactive, 3 GC (on 6 virtual cores)

Consider the following minimal working example, in which I have marked a few important lines with comments Line B1, Line B2, etc.

function solve()
    pieces = UInt8[0x0b 0x0b 0x01 0x02; 0x0b 0x0b 0x02 0x01; 0x0b 0x0b 0x02 0x03; 0x0b 0x0b 0x03 0x02; 0x0a 0x01 0x04 0x03; 0x0a 0x01 0x05 0x01; 0x0a 0x01 0x05 0x02; 0x0a 0x01 0x08 0x01; 0x0a 0x01 0x08 0x03; 0x0a 0x01 0x09 0x02; 0x0a 0x01 0x09 0x03; 0x0a 0x02 0x04 0x02; 0x0a 0x02 0x05 0x02; 0x0a 0x02 0x07 0x01; 0x0a 0x02 0x07 0x03; 0x0a 0x02 0x08 0x03; 0x0a 0x02 0x09 0x01; 0x0a 0x03 0x04 0x01; 0x0a 0x03 0x06 0x01; 0x0a 0x03 0x06 0x02; 0x0a 0x03 0x06 0x03; 0x0a 0x03 0x07 0x02; 0x0a 0x03 0x08 0x01; 0x0a 0x03 0x08 0x03; 0x04 0x04 0x08 0x06; 0x04 0x04 0x08 0x09; 0x04 0x05 0x05 0x09; 0x04 0x05 0x07 0x05; 0x04 0x05 0x08 0x06; 0x04 0x05 0x09 0x07; 0x04 0x06 0x04 0x08; 0x04 0x06 0x04 0x09; 0x04 0x06 0x06 0x09; 0x04 0x07 0x05 0x08; 0x04 0x07 0x06 0x06; 0x04 0x08 0x08 0x07; 0x04 0x09 0x07 0x06; 0x05 0x05 0x06 0x07; 0x05 0x06 0x05 0x07; 0x05 0x06 0x06 0x07; 0x05 0x07 0x08 0x06; 0x05 0x07 0x09 0x09; 0x05 0x08 0x09 0x07; 0x05 0x09 0x07 0x08; 0x05 0x09 0x07 0x09; 0x06 0x07 0x08 0x08; 0x06 0x09 0x09 0x07; 0x06 0x09 0x09 0x09; 0x07 0x08 0x08 0x08]
    npieces = size(pieces, 1)
    ncolors = length(unique(pieces))

    colors = Matrix{UInt8}(undef, npieces << 2 | 3, 2)
    colors[0x0001, :] .= 0x0a
    colors[0x0002, :] .= 0x0b

    _candidates = [UInt16[] for _ in 1:ncolors, _ in 1:ncolors]
    for (piece, piece_colors) in enumerate(eachrow(pieces)), rotation = 0:3
        colors[piece << 2 | rotation, 1] = piece_colors[mod1(1 - rotation, 4)]
        colors[piece << 2 | rotation, 2] = piece_colors[mod1(2 - rotation, 4)]
        left = piece_colors[4 - rotation]
        bottom = piece_colors[mod1(3 - rotation, 4)]
        push!(_candidates[left, bottom], piece << 2 | rotation)
    end
    candidates = Vector{UInt16}(undef, mapreduce(length, +, _candidates))
    index_table = Matrix{UnitRange{Int}}(undef, ncolors, ncolors)
    idx = 1
    for left = 1:ncolors, bottom = 1:ncolors
        start_idx = idx
        for candidate in _candidates[left, bottom]
            candidates[idx] = candidate
            idx += 1
        end
        end_idx = idx - 1
        index_table[left, bottom] = start_idx:end_idx
    end

    nrows, ncols = 7, 7                   # Line A1
    # nrows::Int, ncols::Int = 7, 7         # Line A2

    @assert npieces >= nrows * ncols

    rowcol = [(row, col + 1) for row = 7:-1:1 for col = 1:7]                        # Line B1
    # rowcol = [(row, col + 1) for row = nrows:-1:1 for col = 1:ncols]                # Line B2
    # rowcol = NTuple{2, Int}[(row, col + 1) for row = nrows:-1:1 for col = 1:ncols]  # Line B3

    @assert typeof(rowcol) == Vector{NTuple{2, Int}}

    board = fill(0x0002, nrows + 1, ncols + 1)
    board[2:end-2, 1] .= 0x0001
    board[end, 3:end-1] .= 0x0001
    board[1:end-1, 2:end] .= 0x0000


    available = fill(true, npieces)
    idx_range = Vector{UnitRange{Int}}(undef, nrows*ncols)

    depth = 1
    row, col = rowcol[1]
    left = colors[board[row, col - 1], 2]
    bottom = colors[board[row + 1, col], 1]
    _idx_range = index_table[left, bottom]

    iters = 0

    @inbounds while true  # main loop
        @label next_iter
        for idx in _idx_range
            candidate = candidates[idx]
            available[candidate >> 2] || continue
            board[row, col] = candidate
            iters += 1
            if depth == 49
                println("Solution found after $iters iterations")
                return
            end
            false && nrows * ncols     # Line C
            available[candidate >> 2] = false
            idx_range[depth] = idx+1:_idx_range.stop
            depth += 1
            row, col = rowcol[depth]
            left = colors[board[row, col - 1], 2]
            bottom = colors[board[row + 1, col], 1]
            _idx_range = index_table[left, bottom]
            @goto next_iter
        end
        depth -= 1
        if depth == 0
            @warn "Search finished with no valid solution found"
            return
        end
        row, col = rowcol[depth]
        available[board[row, col] >> 2] = true
        _idx_range = idx_range[depth]
    end
end

@time solve()

Running this code should output something like this:

PS D:\> julia mwe.jl
Solution found after 16330466 iterations
  0.426470 seconds (6.75 k allocations: 370.647 KiB, 4.74% compilation time)

So far, so good. Now comment out line B1 and uncomment line B2 instead, then run the code again.

    # rowcol = [(row, col + 1) for row = 7:-1:1 for col = 1:7]                        # Line B1
    rowcol = [(row, col + 1) for row = nrows:-1:1 for col = 1:ncols]                # Line B2

Suddenly the runtime is more than an order of magnitude slower, and there are lots of allocations:

PS D:\> julia mwe.jl
Solution found after 16330466 iterations
 17.586502 seconds (193.57 M allocations: 5.772 GiB, 2.13% gc time, 0.93% compilation time)

The only thing that changed between lines B1 and B2 is the bounds of the ranges in the array comprehension, which now use the nrows and ncols variables instead of number literals. But nrows and ncols were just set to the number literal 7 in line A1 right above. In my real code, nrows and ncols depend on the size of the input data, so I obviously cannot use number literals directly inside the array comprehension; I need something like line B2 instead.

My vague guess is that the array comprehension somehow becomes type unstable. So in order to give the compiler a hint, I tried annotating the type of the resulting array explicitly, as described in the manual section about comprehensions.

    rowcol = NTuple{2, Int}[(row, col + 1) for row = nrows:-1:1 for col = 1:ncols]  # Line B3

This makes the code around 50% faster than the untyped comprehension, but it’s still a lot slower than using number literals for the range bounds.

PS D:\> julia mwe.jl
Solution found after 16330466 iterations
  9.287622 seconds (95.43 M allocations: 2.845 GiB, 2.15% gc time, 1.02% compilation time)

To add even more confusion, try to comment out line C in the main loop at the bottom of the function:

            # false && nrows * ncols     # Line C

Without this line, the algorithm runs quickly again and without allocations in the main loop, regardless of which version (line B1, B2, or B3) of the array comprehension is used. I also wonder what even happens (in the sense of what is actually computed) in this line when it is not commented out. The line doesn’t have any external effects, and there is a literal false on the left side of the short-circuit &&, so the multiplication shouldn’t happen at all. I would have guessed that the compiler would eliminate the entire line in the first place. In this example the line obviously doesn’t make any sense, but in my real code I’d like to occasionally read from the nrows and ncols variables within the main loop.
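As a side note, the short-circuit itself can be verified independently of this algorithm: the right-hand side of `false && ...` is never evaluated at runtime, so any effect of line C has to come from how the mere syntactic mention of nrows and ncols influences lowering and variable capture, not from the multiplication. A tiny standalone check (not part of the solver):

```julia
# Hedged standalone check: && short-circuits, so the right-hand side of
# `false && ...` is never evaluated at runtime.
calls = Ref(0)
count_and_true() = (calls[] += 1; true)

false && count_and_true()   # right-hand side is skipped entirely
@assert calls[] == 0

true && count_and_true()    # right-hand side runs once
@assert calls[] == 1
```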

I also noticed that if the types of nrows and ncols are explicitly annotated (line A2), the slowdown does not occur either. This is what I currently use as a workaround in my real code. But why is this necessary? nrows and ncols are local variables that only live inside the solve function, so the compiler should know that the number literal 7 is an Int, no? Notably, these two variables are effectively constants that are never written to in the rest of the code. Furthermore, it feels quite unsatisfying to have to annotate local variables this way.

Another possible workaround (that I haven’t tested yet) might be to move the main loop into a separate function and pass all of the relevant local variables as arguments. I’m aware that this approach is described in the performance tips in the Julia manual (function barriers). However, the reason given there is to let the compiler specialize code for variables whose types are not known at compile time. As far as I can tell, that shouldn’t apply to my example, because the types of all variables should be well defined and stable. So I’d like to understand where exactly the slowdown and the allocations come from when certain forms of the array comprehension are used.
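If it helps, here is roughly how I would sketch the function-barrier variant (a simplified toy, not the real solver; `setup` and `hotloop` are made-up names). Inside the helper, nrows and ncols are plain Int arguments, so no closure capture in the outer function can affect the hot loop:

```julia
# Hedged sketch of the function-barrier idea: the hot loop lives in its own
# function and receives every variable it needs as a concretely typed argument.
function hotloop(rowcol::Vector{NTuple{2, Int}}, nrows::Int, ncols::Int)
    s = 0
    for (row, col) in rowcol
        s += row * ncols + col   # ncols is an ordinary Int here, never boxed
    end
    s
end

function setup()
    nrows, ncols = 7, 7
    rowcol = [(row, col + 1) for row = nrows:-1:1 for col = 1:ncols]
    # Even if the comprehension boxes nrows/ncols inside `setup`, the call
    # boundary re-establishes concrete argument types for the inner loop.
    hotloop(rowcol, nrows, ncols)
end
```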

Thanks for reading this far and thanks in advance for any help or suggestions.


Is it better with “add static size information for hvcat and hvncat” by johnnychen94 · Pull Request #58422 · JuliaLang/julia · GitHub? (You can test with juliaup add pr58422 and then julia +pr58422.)

I’ve tested with the PR and unfortunately the code runs even slower in that case:

Line B2 uncommented:

PS D:\> julia +pr58422 mwe.jl
Solution found after 16330466 iterations
 25.104451 seconds (291.58 M allocations: 7.721 GiB, 3.16% gc time, 0.55% compilation time)

(was 17.5 sec / 5.7 GB before)

Line B3 uncommented:

PS D:\> julia +pr58422 mwe.jl
Solution found after 16330466 iterations
 13.689603 seconds (144.42 M allocations: 3.818 GiB, 3.02% gc time, 0.56% compilation time)

(was 9.3 sec / 2.8 GB before)

Haven’t looked very hard, but the following change here takes me from 270ms (initial code above) to 100ms:

rowcol = [(row, col + 1) for row = nrows:-1:1, col = 1:ncols] |> vec

The first form (nested for ... for ...) is built on Iterators.flatten and allows one range to depend on the other. The second (for ..., ...) is built on Iterators.product and does not.

I also tried moving const pieces = UInt8[0x0b 0x0b 0x01 ... outside the function, which should avoid hvcat. But I think this isn’t the bottleneck.

Thanks, but in this case the speedup happens because the content of the rowcol vector is no longer the same as before. In your version the vector contains grid positions traversed column by column, while the original version contains them row by row. This results in a different number of iterations for the main loop (5.8 M in your example versus 16.3 M originally) and therefore a different runtime. In general, the setup costs for this algorithm are negligible and all of the computation time is spent in the main loop.
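For readers wondering about the two forms, the ordering difference is easy to see on a tiny grid (illustrative snippet, not from the solver):

```julia
# Nested-for comprehensions (Iterators.flatten) traverse row by row;
# 2D comprehensions (Iterators.product) produce a matrix, so vec() of one
# yields column-major order instead.
nested  = [(r, c) for r = 3:-1:1 for c = 1:2]
columns = vec([(r, c) for r = 3:-1:1, c = 1:2])

@assert nested  == [(3, 1), (3, 2), (2, 1), (2, 2), (1, 1), (1, 2)]
@assert columns == [(3, 1), (2, 1), (1, 1), (3, 2), (2, 2), (1, 2)]
```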

Checking @code_warntype, the root difference seems to be ncols getting boxed instead of inferred as Int, while the paired nrows is inferred as Int. Comprehensions are built on closures, which risk that sort of thing, but normally this wouldn’t happen if you never reassign the captured variables. As you’ve pointed out, the boxing doesn’t happen if you comment out false && nrows * ncols. More tweaks:

  1. switching the assignment order to ncols, nrows = 7, 7 or separating them into 2 lines doesn’t change which gets boxed
  2. switching the order to false && ncols * nrows doesn’t change which gets boxed
  3. switching the comprehension loop levels for col = 1:ncols for row = nrows:-1:1 (which also changes the problem so I didn’t bother running it) does switch to boxing nrows.
  4. switching the comprehension to 2D for row = nrows:-1:1, col = 1:ncols infers both captured variables but is such a drastic change it makes the whole thing fail

So it’s just a hole in the lowerer’s ability to identify when captured variables in nested comprehensions don’t need boxing; I have no idea why the false && nrows * ncols is necessary, because it’s not a reassignment, but that’s probably a factor in why this doesn’t happen often. ncols::Int (and that alone) indeed restores good type inference in the rest of the algorithm, even though it doesn’t remove the box. You should open an issue.
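As a side note, the textbook way boxing arises is reassignment of a captured variable, and it can be spotted without eyeballing @code_warntype by scanning the lowered code for Core.Box (sketch; the string match is only a heuristic, not an official API):

```julia
# Reassigning a captured variable after the closure is created forces the
# variable into a Core.Box, which shows up in the lowered code.
function reassigned_capture()
    n = 1
    f = () -> n    # captures the variable n itself, not its current value
    n = 2          # reassignment after capture forces boxing of n
    f()
end

ci = first(code_lowered(reassigned_capture, Tuple{}))
has_box = occursin("Box", string(ci.code))   # heuristic check, typically true here
println("lowered code mentions Box: ", has_box)
@assert reassigned_capture() == 2   # the closure observes the reassignment
```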

EDIT: Okay, I did manage to make an MWE, and it suggests @label is the more critical factor. Commenting the label (and its paired @goto) out in the original example also undoes the boxing (but ruins the algorithm, so don’t run it). Merely mentioning the captured variable after the label is enough to cause boxing. Comprehension for-headers also seem to fare better than the body: the body (or a more typical closure () -> (x, y)) boxes all captured variables.

# Reduced example doesn't do anything, only for @code_warntype
function solvemwe()
    nrows, ncols, x, y = 7, 7, 1, 1
              # body boxes x,y     # for boxes latter variable, swap to check
    rowcol = [(x, y, row, col + 1) for row = nrows:-1:1 for col = 1:ncols]
    @label next_iter # comment out for good inference
    nrows, ncols, x, y # comment out boxed variables for good inference
end

Thank you very much for pinpointing the relevant factors and creating the minimized example.

After re-reading the performance tips in the manual, I discovered the section on the performance of captured variables. As an alternative to type annotations, it suggests that the boxing can also be avoided with a let block around the part with the captured variables. This alternative does indeed restore good performance in my code:

    rowcol = let nrows = nrows, ncols = ncols
        [(row, col + 1) for row = nrows:-1:1 for col = 1:ncols]
    end

I’m still surprised that the compiler/lowerer doesn’t treat it like this automatically for the comprehension in my code. Would you consider this a bug (in Julia)?


Performance footguns would generally not be called a “bug”, to use that term specifically, but this is definitely an infamous problem that many users have run into (and one that is unfortunately very hard to fix).

No, adding the let block changes the scope and the variables being captured, and if it were made automatic, it would actually be a bug that violates Julia’s scoping rules and the more general concept of variable capture; for example, behavior would diverge significantly if you had reassigned nrows or ncols outside the block. Neither version’s semantics prescribes any optimizations; in fact, an implementation that doesn’t infer types at all can be perfectly compliant with the language.
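To make the semantic difference concrete, here is a small standalone illustration (made-up functions, not from the thread’s code): a plain closure captures the variable itself, while the let block makes the closure capture a fresh binding holding the current value.

```julia
function captured()
    n = 1
    f = () -> n        # captures the variable n itself
    n = 2              # the closure observes this reassignment
    f()
end

function letbound()
    n = 1
    f = let n = n      # fresh binding: the closure sees a snapshot of n
        () -> n
    end
    n = 2              # does not affect the let-bound copy
    f()
end

@assert captured() == 2
@assert letbound() == 1
```

This is why inserting the let block automatically would change observable behavior whenever the captured variable is reassigned later.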

@label causing boxing is a separate issue, probably should be mentioned in issue #53367. Whether an eager comprehension can prove the closure is never used afterward and the captured variables’ values don’t change in the meantime is another issue.