Modeling `do concurrent` loops in the `fir` dialect

Background

MLIR modelling of basic loops

The fir dialect models both variants of do loops, increment and concurrent, using a single MLIR operation: fir.do_loop. For example, a simple increment loop such as the following:

  do i=1,10
  end do 

is emitted as:

    %5:2 = fir.do_loop %arg0 = %2 to %3 step %c1 iter_args(%arg1 = %4) -> (index, i32) {
      fir.store %arg1 to %1#1 : !fir.ref<i32>
      %6 = arith.addi %arg0, %c1 overflow<nsw> : index
      %7 = fir.convert %c1 : (index) -> i32
      %8 = fir.load %1#1 : !fir.ref<i32>
      %9 = arith.addi %8, %7 overflow<nsw> : i32
      fir.result %6, %9 : index, i32
    }

do concurrent loops look a bit different. For example, given the following single-range loop:

  do concurrent (i=1:10)
  end do 

The MLIR op looks like:

    fir.do_loop %arg0 = %4 to %5 step %c1 unordered {
      %6 = fir.convert %arg0 : (index) -> i32
      fir.store %6 to %1#1 : !fir.ref<i32>
    }

The most important difference here is the unordered attribute attached to the concurrent loop. This models the fact that the loop iterations may be executed in any order. There are other differences between increment and concurrent loops which are detailed below.

MLIR modelling of multi-range loops

One major difference between increment and concurrent loops is that increment loops can only have a single iteration range, while concurrent loops can have multiple iteration ranges. For example, the following is invalid Fortran code:

  ! Invalid syntax
  do i=1,10, j=1,10
  end do

While the following is valid:

  do concurrent (i=1:10, j=1:10)
  end do

The fir.do_loop is currently restricted to single-range loops. To overcome this limitation, flang emits multi-range do concurrent loops as a nest of fir.do_loop ops:

1.    fir.do_loop %arg0 = %8 to %9 step %c1 unordered {
2.      %12 = fir.convert %arg0 : (index) -> i32
3.      fir.store %12 to %3#1 : !fir.ref<i32>
4.      fir.do_loop %arg1 = %10 to %11 step %c1_2 unordered {
5.        %13 = fir.convert %arg1 : (index) -> i32
6.        fir.store %13 to %1#1 : !fir.ref<i32>
7.      }
8.    }

Locality specifiers and data environment control

Another difference is that do concurrent loops need to somehow model the locality of data used inside the loop so that the loop's iterations can be executed out-of-order (or concurrently) without introducing any data races. The fir.do_loop op is now able to model these locality specifiers when it is marked with the unordered attribute.
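As a purely illustrative sketch (the local(...) specifier syntax shown here is hypothetical, not the actual fir.do_loop assembly format), a loop such as do concurrent (i=1:10) local(x) could be represented along these lines:

    fir.do_loop %i = %lb to %ub step %c1 unordered local(@privatizer_x %x -> %x_priv : !fir.ref<i32>) {
      // Each iteration operates on its own private copy of x, so no
      // data race is introduced when iterations run out of order.
      ...
    }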

Problem statement

There are a few consequences of modeling both increment and concurrent loops using the same MLIR op. The most obvious one is that multi-range loops are not directly detectable. To detect a multi-range do concurrent loop, a pass would need to walk the fir.do_loop nest structure (see the above example) and make sure that the only ops between two nested fir.do_loop ops are the ops needed to update the iteration variable of the inner loop (lines 2 & 3 in the above snippet). This is not ideal for a number of reasons:

  1. We cannot know for sure whether the loop nest originated from a do concurrent loop or was emitted by flang as part of some other expression or op conversion.
  2. The nest detection logic introduces complexity that could be avoided if the multi-range nature of the loop were directly detectable at the MLIR level.

Another consequence of using the same MLIR op is that the fir.do_loop op is overloaded with the semantics of both increment and concurrent loops. While this can enable code sharing (e.g. we use the same functions to emit the MLIR ops), it also makes things more difficult in later stages of the compiler pipeline because we have to analyze the op to understand what it represents.

We recently came across these issues while working on a pass to map do concurrent loops to OpenMP constructs so that they can be automatically parallelized. See: [flang][OpenMP] Upstream do concurrent loop-nest detection and other PRs in the stack for more info.

Proposal

Introducing a separate MLIR op for do concurrent loops would help make the IR more representative of the original source, at least during the earlier stages of the compiler pipeline. It would also help limit the scope of the current fir.do_loop op to modeling only increment loops.

With the new op, the following loop:

  do concurrent (i=1:10, j=1:10)
  end do 

would be emitted in MLIR as (using same SSA value numbering as the above snippet for comparison):

    fir.do_concurrent (%arg0, %arg1) = (%8, %10) to (%9, %11) step (%c1, %c1_2) {
      %12 = fir.convert %arg0 : (index) -> i32
      fir.store %12 to %3#1 : !fir.ref<i32>
      %13 = fir.convert %arg1 : (index) -> i32
      fir.store %13 to %1#1 : !fir.ref<i32>
    }

The above op comes with changes to the current fir.do_loop op as well. In particular, we would:

  1. Remove the unordered attribute since there will be no more need for it.
  2. Move the locality specifier from fir.do_loop to fir.do_concurrent.

Considered alternatives

Loop nest detection

This is what is currently being implemented in the do concurrent to OpenMP conversion pass. However, as detailed above, this is not ideal and should therefore be replaced by an explicit do concurrent MLIR op.

Adding a loop_nest(n) attribute to fir.do_loop

This would help us directly detect the depth of the nest and therefore avoid having to implement a loop nest detection algorithm. However, this is also not ideal since there is no guarantee that this attribute would be correctly maintained and carried forward by all passes in the pipeline.
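For illustration, this rejected alternative could look like the following (the loop_nest attribute name and syntax are hypothetical):

    fir.do_loop %arg0 = %8 to %9 step %c1 unordered attributes {loop_nest = 2 : i64} {
      ...
      fir.do_loop %arg1 = %10 to %11 step %c1_2 unordered {
        ...
      }
    }

A pass could then read the attribute instead of pattern-matching the nest, but nothing forces intermediate transformations to keep the attribute consistent with the actual nest structure.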

Creating a shared dialect to model parallel/concurrent loops

The above proposal is very similar to what the scf.parallel op looks like, and there are possibly other similar downstream ops that model these kinds of loops. This is similar to the current effort to have a shared ptr dialect. It also falls in line with the road ahead proposed by a recent dev meeting talk: 2024 LLVM Dev Mtg - Making upstream MLIR more friendly to programming languages. I think this should still be on our radar for future improvement.

Final remarks

The RFC proposes a new op for do concurrent in the FIR dialect in the short and medium term. In the longer term, I think it would be beneficial to have an MLIR control-flow dialect that unifies ops such as fir.do_concurrent and scf.parallel in one generic op.

Thank you for the writeup.

I agree that creating a new operation is the best option. Sticking with fir.do_loop and detecting loop nests is unnecessarily fragile, and the unordered attribute can mean things other than that this was a do-concurrent.

If it works as a drop-in replacement, I would be happy to see scf.parallel used here, but I don’t think it is worth it if any non-trivial wrapping would be needed to make it fit the Fortran semantics.

One small detail is that you don’t need the stores to the loop iteration variables (from the surrounding scope) inside the do_concurrent loop. do concurrent index names have the scope of the do concurrent construct (Fortran 2023 draft, section 19.4 “Statement and construct entities”).
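In other words, if the index names are treated as construct entities, the proposed op could in principle carry the indices purely as block arguments; a sketch:

    fir.do_concurrent (%arg0, %arg1) = (%8, %10) to (%9, %11) step (%c1, %c1_2) {
      // %arg0 and %arg1 are scoped to the construct; no fir.store back
      // to variables in the surrounding scope is needed.
      ...
    }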

I think it is not possible to do this directly in lowering because, as mentioned in this comment, the rules of DO CONCURRENT do not allow it to be run in parallel without some analysis (or blanket user approval).

Thanks Tom and Jean for your replies. The sharing between fir.do_concurrent and scf.parallel is something I just wanted to keep an eye on as a possible future improvement, much like the efforts described in the talk I shared. So I will definitely focus on fir.do_concurrent as a separate op for now.

Thanks for the write-up! I agree with the direction.

A few points:

  • I think there is a use case for allowing a multiblock body (or maybe using scf.execute_region when that occurs, instead of going into degraded mode as we currently do when there is an “internal” goto inside the do concurrent body).

For instance, we currently directly lower this to block CFG:

subroutine test()
  integer :: x(10), y(10), z(10)
  do concurrent (i=1:10)
    if (i.lt.3) goto 1
    if (i.lt.5) goto 2
    x(i) = i
1   y(i) = i
2   z(i) = i
  end do
end subroutine
  • The loop iteration variables are local to the loop, so maybe the allocas should be inside the iteration to help analysis/offload, with fir.do_concurrent codegen deleting useless allocas or hoisting them (or they could be special block arguments, but that may be hard for alias analysis to understand). Note that we still need to create storage for them because they are variables and could appear in contexts where lowering expects memory to be mapped to these symbols, even though they are read-only inside the iteration.

  • Quickly looking at LoopLikeInterface, it looks like it supports multiple induction variables, and it would not prevent attaching an extra “non-body” region if that would help model the data specifiers (just noting that to open the design space).

  1. Remove the unordered attribute since there will be no more need for it.

I think we should keep it for array expressions. It is currently set when generating loops for array = array_1 + array_2, and I think we should keep it as an indication that we do not care about the order of evaluation. Generating fir.do_concurrent for all array expressions would probably have the downside of creating suboptimal parallel regions (too small, too many data transfers) when the user expects their source DO CONCURRENT loops to be parallelized.


I think using scf.execute_region makes perfect sense. Thanks for the suggestion! I will keep multiblock bodies in mind from the start.

I am leaning towards keeping the stores and allocating the iteration variables outside the loop. This will be closer to how OpenMP, for example, expects worksharing loops (and maybe OpenACC as well) where we “privatize” the iteration variable with a delayed privatizer. Once we have locality specifiers for do concurrent, we can model these variables using init just like we use private in the OpenMP case. My main points are:

  1. To more easily bridge the gap between do concurrent and the target parallelization models we care about: OpenMP and OpenACC.
  2. In the case of sequentializing do concurrent loops, make code-gen easier to handle.

Let me know if you disagree.

Thanks, I will take a look. For locality specifiers, I hacked what we have in OpenMP to be reusable for do concurrent loops. See this PoC. This is just for experimentation; I will write an RFC for a shared “Data Management” dialect later (after implementing the proper fir.do_concurrent op :crossed_fingers:)

Makes sense, thanks for pointing that out!

A few suggestions/comments.

I was hoping that the new concurrent operation would be in hlfir and lower to the existing unordered do loop in the default flow.

Modelling it as a wrapper can provide a place to keep allocas:

do_concurrent {
  ; allocas can go here
  loop (v1, v2) : (v1_start, v2_start) to (v1_end, v2_end) {
  }
}

You might need a fir.execute_region if you want to keep the concurrent operation a single block and want to control when that is inlined or exposed to the containing region only in flang/lib/Optimizer/Transforms/ControlFlowConverter.cpp.

I like this idea. Particularly if you can convert the Flang concurrent operation to scf.parallel when it is allowed (by the standard). You can reuse the existing conversion from scf.parallel to omp.parallel + omp.wsloop. Any additional work that strengthens the SCFToOpenMP Conversion would also benefit the MLIR users.

Thanks for your suggestions Kiran.

I think the complexity with that is having to model locality specifiers for both ops. Or do you think we can still remove locality specifiers from fir.do_loop, and when lowering do_concurrent to fir.do_loop ... unordered, handle the specifiers similar to what we currently do with delayed privatization in OpenMP during LLVM translation? That is, we would model locality specifiers only for do_concurrent and “inline” them when translating to fir.do_loop ... unordered?

I think that would be a good middle-ground solution. I think the do_concurrent can have 2 regions:

  1. One region for the allocas.
  2. Another one for the loop body itself.

And it would be pretty-printed as you suggested.
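A sketch of how such a two-region op could be pretty-printed (the region layout and nested loop syntax here are illustrative, not a settled assembly format):

    fir.do_concurrent {
      // Region 1: storage for the construct-local iteration variables.
      %i = fir.alloca i32
      %j = fir.alloca i32
      // Region 2: the loop itself, with the indices as block arguments.
      loop (%arg0, %arg1) = (%lb0, %lb1) to (%ub0, %ub1) step (%st0, %st1) {
        ...
      }
    }

Keeping the allocas in a dedicated region would make it easy for codegen to decide whether to hoist them or sink them into each iteration.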

Makes sense. In general, I think there are at least a few places where we can reduce fragmentation across dialects. This might end up being one example, and locality specifiers (data environment clauses) might be another.

Yes.

Alright, thanks all for your comments. If there are no other suggestions/comments, I will go with the wrapper op as Kiran suggested and do that in the HLFIR dialect as well.

Opened a PR for the new MLIR representation: [flang][hlfir] Add MLIR op for `do concurrent` by ergawy · Pull Request #130893 · llvm/llvm-project · GitHub. I added all the people who commented here and on the original PR where the idea came up; feel free to add yourself if you are interested in reviewing it.

Thanks for the review and approval of 130893.

I am now going to work on lowering fir.do_concurrent, starting with the serialized case (i.e. no mapping to OpenMP or other parallel models), and I wanted to make sure we are aligned.

I think we have 2 obvious options:

  1. Convert fir.do_concurrent to fir.do_loop unordered and let the rest of the pipeline deal with fir.do_loop unordered.
  2. Lower fir.do_concurrent directly to CFG.

I think option 1 makes more sense, but let me know if you disagree.
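To make option 1 concrete, the conversion would rewrite the multi-range op from the proposal back into the nest form that the rest of the pipeline already understands; a sketch:

    // Before:
    fir.do_concurrent (%arg0, %arg1) = (%8, %10) to (%9, %11) step (%c1, %c1_2) {
      ...
    }

    // After:
    fir.do_loop %arg0 = %8 to %9 step %c1 unordered {
      fir.do_loop %arg1 = %10 to %11 step %c1_2 unordered {
        ...
      }
    }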

Going with option 1, I am not aware of any pass(es) that do fir-to-fir transformations. If there are such passes, can you point me to them? If not, do you have any suggestions for the best location for such a transformation?

Not giving an opinion here. But FYI, such transforms are located in flang/lib/Optimizer/Transforms like CharacterConversion.cpp, ArrayValueCopy.cpp, LoopVersioning.cpp, SimplifyIntrinsics.cpp etc.