Background
MLIR modelling of basic loops
The fir
dialect model both variants of do
loops: increment and concurrent using a single MLIR operation: fir.do_loop
. For example, a simple increment loop such as the following:
do i=1,10
end do
is emitted as:
%5:2 = fir.do_loop %arg0 = %2 to %3 step %c1 iter_args(%arg1 = %4) -> (index, i32) {
fir.store %arg1 to %1#1 : !fir.ref<i32>
%6 = arith.addi %arg0, %c1 overflow<nsw> : index
%7 = fir.convert %c1 : (index) -> i32
%8 = fir.load %1#1 : !fir.ref<i32>
%9 = arith.addi %8, %7 overflow<nsw> : i32
fir.result %6, %9 : index, i32
}
do concurrent
loops look a bit different. For example, given the following single-range loop:
do concurrent (i=1:10)
end do
The MLIR op looks like:
fir.do_loop %arg0 = %4 to %5 step %c1 unordered {
%6 = fir.convert %arg0 : (index) -> i32
fir.store %6 to %1#1 : !fir.ref<i32>
}
The most important difference here is the unordered
attribute attached to the concurrent loop. This models the fact that the loop iterations may be executed in any order. There are other differences between increment and concurrent loops which are detailed below.
MLIR modelling of multi-range loops
One major difference between increment and concurrent loops is that increment loops can only have a single ineration range. Concurrent loops, on the other hand, can have multiple iteration ranges. For example the following is invalid Fortran code:
! Invalid syntax
do i=1,10, j=1,10
end do
While the following is valid:
do concurrent (i=1:10, j=1:10)
end do
The fir.do_loop
is currently restricted to single-range loops. To overcome this limitation, flang emits multi-range do concurrent
loops as a nest of fir.do_loop
ops:
1. fir.do_loop %arg0 = %8 to %9 step %c1 unordered {
2. %12 = fir.convert %arg0 : (index) -> i32
3. fir.store %12 to %3#1 : !fir.ref<i32>
4. fir.do_loop %arg1 = %10 to %11 step %c1_2 unordered {
5. %13 = fir.convert %arg1 : (index) -> i32
6. fir.store %13 to %1#1 : !fir.ref<i32>
7. }
8. }
Locality specifiers and data environment control
Another difference is that do concurrent
loops need to somehow model locality of data used inside the loop so that the iterations of the loop can be executed out-of-order (or concurrently) without introducing any data races. The fir.do_loop
op is now able to model these locality specifiers in case the fir.do_loop
is marked with the unordered
attribute.
Problem statement
There are few consequences for modeling both increment and concurrent loops using the same MLIR op. The most obvious one is that multi-range loops are not directly detectable. To detect a multi-range do concurrent
loop, a pass would need to walk the fir.do_loop
nest structure (see above example) and make sure that the only ops between 2 nested fir.do_loop
ops are the ops needed to update the iteration variable of the inner loop (lines 2 & 3 in the above snippet). This is not ideal for a number of reasons:
- We cannot know for sure whether the loop nest originated from a
do concurrent
loop or was emitted by flang as part of some other expression or op conversion. - The nest detection logic introduces complexity that can be alleviated if the multi-range is directly detectable on the MLIR level.
Another consequence of using the same MLIR op is that the fir.do_loop
op is overloaded with the semantics of both increment and concurrent loops. While this can enable code sharing (e.g. we use the same functions to emit the MLIR ops), it also makes things more difficult in later stages of the compiler pipeline because we have to analyze the op to understand what it represents.
We recently came across these issues while working on a pass to map do concurrent
loops to OpenMP constructs so that they can be automatically parallelized. See: [flang][OpenMP] Upstream do concurrent
loop-nest detection and other PRs in the stack for more info.
Proposal
Introducing a separate MLIR op for do concurrent
loops would help making the IR more representative of the original source at least during the earlier stages of the compiler pipeline. It would also help limiting the scope of the current fir.do_loop
op to only model increment loops.
With the new op, the following loop:
do concurrent (i=1:10, j=1:10)
end do
would be emitted in MLIR as (using same SSA value numbering as the above snippet for comparison):
fir.do_concurrent (%arg0, %arg1) = (%8, %10) to (%9 to %11) step (%c1, %c1_2) {
%12 = fir.convert %arg0 : (index) -> i32
fir.store %12 to %3#1 : !fir.ref<i32>
%13 = fir.convert %arg1 : (index) -> i32
fir.store %13 to %1#1 : !fir.ref<i32>
}
The above op comes with changes to the current fir.do_loop
op as well. In particular, we would:
- Remove the
unordered
attribute since there will no more need for it. - Move the locality specifier from
fir.do_loop
tofir.do_concurrent
.
Considered alternatives
Loop nest detection
This is what is currently being implemented in the do concurrent
to OpenMP conversion pass. However, as detailed above this is not ideal and therefore should be replaced by an explicit do concurrent
MLIR op.
Adding a loop_nest(n)
attribute to fir.do_loop
This would help us directly detect the depth of the nest and therefore avoid having to implement a loop nest detection algorithm. However, this is also not ideal since there is no guaranteee this attribute will be correctly maintained and carried forward by all passes of the pipeline.
Creating a shared dialect to model parallel/concurrent loops
The above proposal is very similar to what the scf.parallel
op looks like. Possibly there are other similar ops downstream to model these kinds of loops. This is similar to the current effort to have a shared ptr
dielact. This also falls inline with the road ahead proposed by a recent dev meeting talk: 2024 LLVM Dev Mtg - Making upstream MLIR more friendly to programming languages. I think this should be still on our radar for future improvement.
Final remarks
The RFC proposes a new op for do concurrent
in the FIR dialect in the short and medium terms. In the longer term, I think it would be beneficial to have an MLIR control-flow dialect that unifies ops such as fir.do_concurrent
and scf.concurrent
in one generic op.