Hello! I recently came across a (for me) quite difficult to debug performance issue in my code, and it still leaves me somewhat puzzled. An algorithm that I wrote seems to experience severe slowdowns when using an array comprehension to populate a local variable which is accessed repeatedly within an inner loop. More particular, it seems to happen when using variables instead of number literals for the range bounds in the comprehension. But I don’t really understand why that happens. More details with the actual code are explained below. I would be grateful for any help to understand where this slowdown comes from, and what would be the best way to avoid it.
Julia version:
julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b73 (2025-04-14 06:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 6 × Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 6 default, 0 interactive, 3 GC (on 6 virtual cores)
Consider the following minimal working example, in which I have marked a few important lines with comments Line B1
, Line B2
, etc.
function solve()
pieces = UInt8[0x0b 0x0b 0x01 0x02; 0x0b 0x0b 0x02 0x01; 0x0b 0x0b 0x02 0x03; 0x0b 0x0b 0x03 0x02; 0x0a 0x01 0x04 0x03; 0x0a 0x01 0x05 0x01; 0x0a 0x01 0x05 0x02; 0x0a 0x01 0x08 0x01; 0x0a 0x01 0x08 0x03; 0x0a 0x01 0x09 0x02; 0x0a 0x01 0x09 0x03; 0x0a 0x02 0x04 0x02; 0x0a 0x02 0x05 0x02; 0x0a 0x02 0x07 0x01; 0x0a 0x02 0x07 0x03; 0x0a 0x02 0x08 0x03; 0x0a 0x02 0x09 0x01; 0x0a 0x03 0x04 0x01; 0x0a 0x03 0x06 0x01; 0x0a 0x03 0x06 0x02; 0x0a 0x03 0x06 0x03; 0x0a 0x03 0x07 0x02; 0x0a 0x03 0x08 0x01; 0x0a 0x03 0x08 0x03; 0x04 0x04 0x08 0x06; 0x04 0x04 0x08 0x09; 0x04 0x05 0x05 0x09; 0x04 0x05 0x07 0x05; 0x04 0x05 0x08 0x06; 0x04 0x05 0x09 0x07; 0x04 0x06 0x04 0x08; 0x04 0x06 0x04 0x09; 0x04 0x06 0x06 0x09; 0x04 0x07 0x05 0x08; 0x04 0x07 0x06 0x06; 0x04 0x08 0x08 0x07; 0x04 0x09 0x07 0x06; 0x05 0x05 0x06 0x07; 0x05 0x06 0x05 0x07; 0x05 0x06 0x06 0x07; 0x05 0x07 0x08 0x06; 0x05 0x07 0x09 0x09; 0x05 0x08 0x09 0x07; 0x05 0x09 0x07 0x08; 0x05 0x09 0x07 0x09; 0x06 0x07 0x08 0x08; 0x06 0x09 0x09 0x07; 0x06 0x09 0x09 0x09; 0x07 0x08 0x08 0x08]
npieces = size(pieces, 1)
ncolors = length(unique(pieces))
colors = Matrix{UInt8}(undef, npieces << 2 | 3, 2)
colors[0x0001, :] .= 0x0a
colors[0x0002, :] .= 0x0b
_candidates = [UInt16[] for _ in 1:ncolors, _ in 1:ncolors]
for (piece, piece_colors) in enumerate(eachrow(pieces)), rotation = 0:3
colors[piece << 2 | rotation, 1] = piece_colors[mod1(1 - rotation, 4)]
colors[piece << 2 | rotation, 2] = piece_colors[mod1(2 - rotation, 4)]
left = piece_colors[4 - rotation]
bottom = piece_colors[mod1(3 - rotation, 4)]
push!(_candidates[left, bottom], piece << 2 | rotation)
end
candidates = Vector{UInt16}(undef, mapreduce(length, +, _candidates))
index_table = Matrix{UnitRange{Int}}(undef, ncolors, ncolors)
idx = 1
for left = 1:ncolors, bottom = 1:ncolors
start_idx = idx
for candidate in _candidates[left, bottom]
candidates[idx] = candidate
idx += 1
end
end_idx = idx - 1
index_table[left, bottom] = start_idx:end_idx
end
nrows, ncols = 7, 7 # Line A1
# nrows::Int, ncols::Int = 7, 7 # Line A2
@assert npieces >= nrows * ncols
rowcol = [(row, col + 1) for row = 7:-1:1 for col = 1:7] # Line B1
# rowcol = [(row, col + 1) for row = nrows:-1:1 for col = 1:ncols] # Line B2
# rowcol = NTuple{2, Int}[(row, col + 1) for row = nrows:-1:1 for col = 1:ncols] # Line B3
@assert typeof(rowcol) == Vector{NTuple{2, Int}}
board = fill(0x0002, nrows + 1, ncols + 1)
board[2:end-2, 1] .= 0x0001
board[end, 3:end-1] .= 0x0001
board[1:end-1, 2:end] .= 0x0000
available = fill(true, npieces)
idx_range = Vector{UnitRange{Int}}(undef, nrows*ncols)
depth = 1
row, col = rowcol[1]
left = colors[board[row, col - 1], 2]
bottom = colors[board[row + 1, col], 1]
_idx_range = index_table[left, bottom]
iters = 0
@inbounds while true # main loop
@label next_iter
for idx in _idx_range
candidate = candidates[idx]
available[candidate >> 2] || continue
board[row, col] = candidate
iters += 1
if depth == 49
println("Solution found after $iters iterations")
return
end
false && nrows * ncols # Line C
available[candidate >> 2] = false
idx_range[depth] = idx+1:_idx_range.stop
depth += 1
row, col = rowcol[depth]
left = colors[board[row, col - 1], 2]
bottom = colors[board[row + 1, col], 1]
_idx_range = index_table[left, bottom]
@goto next_iter
end
depth -= 1
if depth == 0
@warn "Search finished with no valid solution found"
return
end
row, col = rowcol[depth]
available[board[row, col] >> 2] = true
_idx_range = idx_range[depth]
end
end
@time solve()
Running this code should output something like this:
PS D:\> julia mwe.jl
Solution found after 16330466 iterations
0.426470 seconds (6.75 k allocations: 370.647 KiB, 4.74% compilation time)
So far, so good. Now comment out line B1 and uncomment line B2 instead, then run the code again.
# rowcol = [(row, col + 1) for row = 7:-1:1 for col = 1:7] # Line B1
rowcol = [(row, col + 1) for row = nrows:-1:1 for col = 1:ncols] # Line B2
Suddenly the runtime is more than an order of magnitude slower, and there are lots of allocations:
PS D:\> julia mwe.jl
Solution found after 16330466 iterations
17.586502 seconds (193.57 M allocations: 5.772 GiB, 2.13% gc time, 0.93% compilation time)
The only thing that changed between line B1 and B2 are the bounds of the ranges in the array comprehension, which use the nrows
and ncols
variables instead of number literals now. But nrows
and ncols
were just set to the number literal 7
in the line A1 right above. In my real code, the variables nrows
and ncols
depend on the size of the input data, therefore I obviously cannot use number literals directly inside the array comprehension, but I need to use something similar to line B2 instead.
My vague guess is that the array comprehension somehow becomes type unstable. So in order to give a hint to the compiler, I tried to annotate the type for the resulting array explicitly, as described in the manual section about comprehensions.
rowcol = NTuple{2, Int}[(row, col + 1) for row = nrows:-1:1 for col = 1:ncols] # Line B3
This makes the code around 50% faster than the untyped comprehension, but it’s still a lot slower than using number literals for the range bounds.
PS D:\> julia mwe.jl
Solution found after 16330466 iterations
9.287622 seconds (95.43 M allocations: 2.845 GiB, 2.15% gc time, 1.02% compilation time)
To add even more confusion, try to comment out line C in the main loop at the bottom of the function:
# false && nrows * ncols # Line C
Without this line, the algorithm runs quickly again and without allocations for the main loop, regardless which version (line B1, B2, B3) for the array comprehension is used. I also wonder what does even happen (in the sense of what exactly is actually computed) in this line when it is not commented out. The line doesn’t have any external effects, and there is a literal false
at the left side of the short-circuit &&
, so the multiplication in this line shouldn’t even happen at all. I would have guessed that the compiler would even eliminate the entire line in the first place. In this example, this line obviously doesn’t make any sense, however in my real code I’d like to occasionally read from the nrows
and ncols
variables within the main loop.
I also noticed that if the types for nrows
and ncols
are explicitly annotated in the code (line A2), the slowdown also does not occur. This is what I currently use as a workaround in my real code. But why is this necessary here? nrows
and ncols
are local variables that only live inside the solve
function, so the compiler should know that the number literal 7
is an Int
, no? Notably these two variables are constants that are never written to in the remaining code. Furthermore, it feels quite unsatisfying if I had to annotate local variables this way.
Another possible workaround (that I haven’t tested yet) might be to move the main loop into a separate function and pass all of the relevant local variables as arguments. I’m aware that this approach is described in the performance tips in the Julia manual (function barriers). However, the reason explained therein is to let the compiler allow to specialize code for variables whose types are not known during compilation. But I think this shouldn’t really apply to my example code, because as far as I can tell, the types for all variables should be well defined and stable. So I’d like to understand where exactly the slowdown and the allocations in my code come from when using certain forms of the array comprehensions.
Thanks for reading this far and thanks in advance for any help or suggestions.