Hello to all,
I would like to know the most efficient way to chunk a multi-dimensional array in Julia.
arr = rand(Float64, (10, 100))
How can I split it into chunks of size 30 so that I get three (10, 30) arrays and one (10, 10) array?
I have this function:
function chunk_array(arr, N)
    chunked = []
    n_columns = size(arr)[2]
    for i in 1:N:n_columns
        push!(chunked, arr[:, i:min(i+N-1, n_columns)])
    end
    return chunked
end
Is there something more performant?
Best Regards
You can replace size(arr)[2] by size(arr, 2) (this is more of a style improvement). You can also pre-allocate the vector for the chunks.
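function chunk_array_2(arr::AbstractMatrix, N::Integer)
    n_cols = size(arr, 2)
    n_chunks = ceil(Int, n_cols / N)
    # typeof(arr) is the right element type when slicing returns the same
    # type as arr, as it does for a plain Matrix
    chunks = Vector{typeof(arr)}(undef, n_chunks)
    for i in 1:n_chunks
        from = N * (i - 1) + 1
        to = min(i * N, n_cols)   # the last chunk may be shorter
        chunks[i] = arr[:, from:to]   # slicing copies the data
    end
    return chunks
end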
If it suits your needs, you can use a view when slicing the matrix.
function chunk_array_3(arr::AbstractMatrix, N::Integer)
    n_cols = size(arr, 2)
    n_chunks = ceil(Int, n_cols / N)
    # the element type is abstract here because the views are SubArrays
    chunks = Vector{AbstractMatrix}(undef, n_chunks)
    for i in 1:n_chunks
        from = N * (i - 1) + 1
        to = min(i * N, n_cols)   # the last chunk may be shorter
        chunks[i] = @view arr[:, from:to]   # no copy, shares memory with arr
    end
    return chunks
end
The timings are:
using BenchmarkTools
arr = rand(Float64, (10, 100))
@btime chunk_array($arr, 30);
2.271 μs (6 allocations: 8.50 KiB)
@btime chunk_array_2($arr, 30);
2.168 μs (5 allocations: 8.45 KiB)
@btime chunk_array_3($arr, 30);
94.155 ns (5 allocations: 336 bytes)
Loops in Julia are fine, and what you wrote is fine except for one or two things:
- chunked isa Vector{Any} and will lead to poor downstream performance every time it is accessed, due to type instability. Note that Vector{AbstractMatrix} is basically just as bad. You want the element type to be declared or inferred concretely.
- You are making copies of the data when you write arr[:, i:min(i+N-1, n_columns)]. If you want copies, this is fine. If it’s okay to alias the input data, consider using @view arr[:, i:min(i+N-1, n_columns)] instead. When aliased, changes to arr will be reflected in chunked and vice versa, as they share memory.
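To make the two points above concrete, here is a minimal sketch. The small arr and the variable names copied and aliased are made up for illustration; only Base functions are used.

```julia
arr = zeros(Float64, 2, 4)

copied  = arr[:, 1:2]         # indexing with a range allocates a new Matrix
aliased = @view arr[:, 1:2]   # a SubArray that shares memory with arr

arr[1, 1] = 1.0               # mutate the parent array

copied[1, 1]                  # still 0.0, the copy is independent
aliased[1, 1]                 # 1.0, the view sees the change

aliased[2, 2] = 5.0           # writing through the view ...
arr[2, 2]                     # ... also changes arr (5.0)

# Element types of the containers discussed above
isconcretetype(eltype([]))                  # false, this is a Vector{Any}
isconcretetype(eltype(AbstractMatrix[]))    # false, abstract element type
isconcretetype(eltype(Matrix{Float64}[]))   # true
```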
I would probably write this function like this:
function chunk_array(arr::AbstractVecOrMat, N)
    chunked = map(Iterators.partition(axes(arr,2),N)) do cols
        @view arr[:,cols] # remove @view if you want copies that do not alias `arr`
    end
    return chunked
end
I actually want a copy of the data.
Is it sufficient to declare the type of the variable for efficiency?
Yes, if you declared the type via chunked = THE_TYPE[] or chunked = Vector{THE_TYPE}(undef, num_chunks) it would be fine. The annoying part is that the type can be a little complicated at times. But in this case, since you want the data copied, it’s pretty easy and Matrix{eltype(arr)} in place of THE_TYPE will work.
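As a rough sketch of that suggestion, the original loop could look like the following with a concretely typed container. The name chunk_array_typed is hypothetical, chosen only to avoid clashing with the functions above.

```julia
# Hypothetical name; same loop as the original, but with a concrete element type.
function chunk_array_typed(arr::AbstractMatrix, N::Integer)
    chunked = Matrix{eltype(arr)}[]   # e.g. Vector{Matrix{Float64}} for Float64 input
    n_columns = size(arr, 2)
    for i in 1:N:n_columns
        # slicing with a range copies the selected columns
        push!(chunked, arr[:, i:min(i + N - 1, n_columns)])
    end
    return chunked
end

arr = rand(Float64, (10, 100))
chunks = chunk_array_typed(arr, 30)
typeof(chunks)   # Vector{Matrix{Float64}}
size.(chunks)    # [(10, 30), (10, 30), (10, 30), (10, 10)]
```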
But I would still recommend you use my suggested solution with @view deleted. It will make the copies and let the compiler determine the type for you (thanks to map). Declaring types manually can be tedious (and sometimes very difficult) to do correctly.
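For reference, a sketch of the map-based version with @view removed might look like this. The name chunk_array_copy is made up here so it does not overwrite the view-based function.

```julia
# Hypothetical name; the map-based approach without @view, so chunks are copies.
function chunk_array_copy(arr::AbstractVecOrMat, N)
    return map(Iterators.partition(axes(arr, 2), N)) do cols
        arr[:, cols]   # no @view, so each chunk is an independent copy
    end
end

arr = rand(Float64, (10, 100))
chunks = chunk_array_copy(arr, 30)
eltype(chunks)         # Matrix{Float64}, determined by map without a manual annotation
size(chunks[end])      # (10, 10), the shorter final chunk

chunks[1][1, 1] = -1.0 # modifying a chunk leaves arr untouched
```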