Skip to content

[flang][OpenMP] Upstream do concurrent loop-nest detection. #127595

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 2, 2025

Conversation

ergawy
Copy link
Member

@ergawy ergawy commented Feb 18, 2025

Upstreams the next part of do concurrent to OpenMP mapping pass (from
AMD's ROCm implementation). See #126026 for more context.

This PR add loop nest detection logic. This enables us to discover
muli-range do concurrent loops and then map them as "collapsed" loop
nests to OpenMP.

This is a follow up for #126026, only the latest commit is relevant.

This is a replacement for #127478 using a /user/<username>/<branchname> branch.

PR stack:

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' flang:driver flang Flang issues not falling into any other category flang:fir-hlfir labels Feb 18, 2025
@llvmbot
Copy link
Member

llvmbot commented Feb 18, 2025

@llvm/pr-subscribers-flang-driver

@llvm/pr-subscribers-clang

Author: Kareem Ergawy (ergawy)

Changes

Upstreams the next part of do concurrent to OpenMP mapping pass (from
AMD's ROCm implementation). See #126026 for more context.

This PR add loop nest detection logic. This enables us to discover
muli-range do concurrent loops and then map them as "collapsed" loop
nests to OpenMP.

This is a follow up for #126026, only the latest commit is relevant.

This is a replacement for #127478 using a /user/&lt;username&gt;/&lt;branchname&gt; branch.


Patch is 39.64 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127595.diff

19 Files Affected:

  • (modified) clang/include/clang/Driver/Options.td (+4)
  • (modified) clang/lib/Driver/ToolChains/Flang.cpp (+2-1)
  • (added) flang/docs/DoConcurrentConversionToOpenMP.md (+229)
  • (modified) flang/docs/index.md (+1)
  • (modified) flang/include/flang/Frontend/CodeGenOptions.def (+2)
  • (modified) flang/include/flang/Frontend/CodeGenOptions.h (+5)
  • (modified) flang/include/flang/Optimizer/OpenMP/Passes.h (+2)
  • (modified) flang/include/flang/Optimizer/OpenMP/Passes.td (+30)
  • (added) flang/include/flang/Optimizer/OpenMP/Utils.h (+26)
  • (modified) flang/include/flang/Optimizer/Passes/Pipelines.h (+15-3)
  • (modified) flang/lib/Frontend/CompilerInvocation.cpp (+28)
  • (modified) flang/lib/Frontend/FrontendActions.cpp (+27-5)
  • (modified) flang/lib/Optimizer/OpenMP/CMakeLists.txt (+1)
  • (added) flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp (+209)
  • (modified) flang/lib/Optimizer/Passes/Pipelines.cpp (+10-2)
  • (added) flang/test/Driver/do_concurrent_to_omp_cli.f90 (+20)
  • (added) flang/test/Transforms/DoConcurrent/basic_host.f90 (+53)
  • (added) flang/test/Transforms/DoConcurrent/loop_nest_test.f90 (+89)
  • (modified) flang/tools/bbc/bbc.cpp (+19-1)
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 5ad187926e710..0cd3dfd3fb29d 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -6927,6 +6927,10 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri
 
 def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
   HelpText<"Emit hermetic module files (no nested USE association)">;
+
+def fdo_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
+  HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
+      Values<"none, host, device">;
 } // let Visibility = [FC1Option, FlangOption]
 
 def J : JoinedOrSeparate<["-"], "J">,
diff --git a/clang/lib/Driver/ToolChains/Flang.cpp b/clang/lib/Driver/ToolChains/Flang.cpp
index 9ad795edd724d..cb0b00a2fd699 100644
--- a/clang/lib/Driver/ToolChains/Flang.cpp
+++ b/clang/lib/Driver/ToolChains/Flang.cpp
@@ -153,7 +153,8 @@ void Flang::addCodegenOptions(const ArgList &Args,
     CmdArgs.push_back("-fversion-loops-for-stride");
 
   Args.addAllArgs(CmdArgs,
-                  {options::OPT_flang_experimental_hlfir,
+                  {options::OPT_fdo_concurrent_to_openmp_EQ,
+                   options::OPT_flang_experimental_hlfir,
                    options::OPT_flang_deprecated_no_hlfir,
                    options::OPT_fno_ppc_native_vec_elem_order,
                    options::OPT_fppc_native_vec_elem_order,
diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
new file mode 100644
index 0000000000000..de2525dd8b57d
--- /dev/null
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -0,0 +1,229 @@
+<!--===- docs/DoConcurrentMappingToOpenMP.md
+
+   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+   See https://llvm.org/LICENSE.txt for license information.
+   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+-->
+
+# `DO CONCURRENT` mapping to OpenMP
+
+```{contents}
+---
+local:
+---
+```
+
+This document seeks to describe the effort to parallelize `do concurrent` loops
+by mapping them to OpenMP worksharing constructs. The goals of this document
+are:
+* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
+  constructs.
+* Tracking the current status of such mapping.
+* Describing the limitations of the current implementation.
+* Describing next steps.
+* Tracking the current upstreaming status (from the AMD ROCm fork).
+
+## Usage
+
+In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
+compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
+1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
+   This maps such loops to the equivalent of `omp parallel do`.
+2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
+   This maps such loops to the equivalent of
+   `omp target teams distribute parallel do`.
+3. `none`: this disables `do concurrent` mapping altogether. In that case, such
+   loops are emitted as sequential loops.
+
+The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
+OpenMP is also enabled. So you need to provide the following options to flang in
+order to enable it:
+```
+flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
+```
+For mapping to device, the target device architecture must be specified as well.
+See `-fopenmp-targets` and `--offload-arch` for more info.
+
+## Current status
+
+Under the hood, `do concurrent` mapping is implemented in the
+`DoConcurrentConversionPass`. This is still an experimental pass which means
+that:
+* It has been tested in a very limited way so far.
+* It has been tested mostly on simple synthetic inputs.
+
+### Loop nest detection
+
+On the `FIR` dialect level, the following loop:
+```fortran
+  do concurrent(i=1:n, j=1:m, k=1:o)
+    a(i,j,k) = i + j + k
+  end do
+```
+is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
+contains **only** the following:
+  1. The operations needed to assign/update the outer loop's induction variable.
+  1. The inner loop itself.
+
+So the MLIR structure for the above example looks similar to the following:
+```
+  fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
+    %i_idx_2 = fir.convert %i_idx : (index) -> i32
+    fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>
+
+    fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
+      %j_idx_2 = fir.convert %j_idx : (index) -> i32
+      fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>
+
+      fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
+        %k_idx_2 = fir.convert %k_idx : (index) -> i32
+        fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>
+
+        ... loop nest body goes here ...
+      }
+    }
+  }
+```
+This applies to multi-range loops in general; they are represented in the IR as
+a nest of `fir.do_loop` ops with the above nesting structure.
+
+Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
+loops and map them as "collapsed" loops in OpenMP.
+
+#### Further info regarding loop nest detection
+
+Loop nest detection is currently limited to the scenario described in the previous
+section. However, this is quite limited and can be extended in the future to cover
+more cases. For example, for the following loop nest, even though, both loops are
+perfectly nested; at the moment, only the outer loop is parallelized:
+```fortran
+do concurrent(i=1:n)
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+Similarly, for the following loop nest, even though the intervening statement `x = 41`
+does not have any memory effects that would affect parallelization, this nest is
+not parallelized as well (only the outer loop is).
+
+```fortran
+do concurrent(i=1:n)
+  x = 41
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+The above also has the consequence that the `j` variable will **not** be
+privatized in the OpenMP parallel/target region. In other words, it will be
+treated as if it was a `shared` variable. For more details about privatization,
+see the "Data environment" section below.
+
+See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
+of what is and is not detected as a perfect loop nest.
+
+<!--
+More details about current status will be added along with relevant parts of the
+implementation in later upstreaming patches.
+-->
+
+## Next steps
+
+This section describes some of the open questions/issues that are not tackled yet
+even in the downstream implementation.
+
+### Delayed privatization
+
+So far, we emit the privatization logic for IVs inline in the parallel/target
+region. This is enough for our purposes right now since we don't
+localize/privatize any sophisticated types of variables yet. Once we have need
+for more advanced localization through `do concurrent`'s locality specifiers
+(see below), delayed privatization will enable us to have a much cleaner IR.
+Once delayed privatization's implementation upstream is supported for the
+required constructs by the pass, we will move to it rather than inlined/early
+privatization.
+
+### Locality specifiers for `do concurrent`
+
+Locality specifiers will enable the user to control the data environment of the
+loop nest in a more fine-grained way. Implementing these specifiers on the
+`FIR` dialect level is needed in order to support this in the
+`DoConcurrentConversionPass`.
+
+Such specifiers will also unlock a potential solution to the
+non-perfectly-nested loops' IVs issue described above. In particular, for a
+non-perfectly nested loop, one middle-ground proposal/solution would be to:
+* Emit the loop's IV as shared/mapped just like we do currently.
+* Emit a warning that the IV of the loop is emitted as shared/mapped.
+* Given support for `LOCAL`, we can recommend the user to explicitly
+  localize/privatize the loop's IV if they choose to.
+
+#### Sharing TableGen clause records from the OpenMP dialect
+
+At the moment, the FIR dialect does not have a way to model locality specifiers
+on the IR level. Instead, something similar to early/eager privatization in OpenMP
+is done for the locality specifiers in `fir.do_loop` ops. Having locality specifier
+modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
+reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
+to OpenMP (and other parallel programming models) much easier.
+
+Therefore, one way to approach this problem is to extract the TableGen records
+for relevant OpenMP clauses in a shared dialect for "data environment management"
+and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
+as well.
+
+#### Supporting reductions
+
+Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
+is also still an open TODO. We can potentially extend the MLIR infrastructure
+proposed in the previous section to share reduction records among the different 
+relevant dialects as well.
+
+### More advanced detection of loop nests
+
+As pointed out earlier, any intervening code between the headers of 2 nested
+`do concurrent` loops prevents us from detecting this as a loop nest. In some
+cases this is overly conservative. Therefore, a more flexible detection logic
+of loop nests needs to be implemented.
+
+### Data-dependence analysis
+
+Right now, we map loop nests without analysing whether such mapping is safe to
+do or not. We probably need to at least warn the user of unsafe loop nests due
+to loop-carried dependencies.
+
+### Non-rectangular loop nests
+
+So far, we did not need to use the pass for non-rectangular loop nests. For
+example:
+```fortran
+do concurrent(i=1:n)
+  do concurrent(j=i:n)
+    ...
+  end do
+end do
+```
+We defer this to the (hopefully) near future when we get the conversion in a
+good share for the samples/projects at hand.
+
+### Generalizing the pass to other parallel programming models
+
+Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
+this in a more generalized direction and allow the pass to target other models;
+e.g. OpenACC. This goal should be kept in mind from the get-go even while only
+targeting OpenMP.
+
+
+## Upstreaming status
+
+- [x] Command line options for `flang` and `bbc`.
+- [x] Conversion pass skeleton (no transormations happen yet).
+- [x] Status description and tracking document (this document).
+- [x] Loop nest detection to identify multi-range loops.
+- [ ] Basic host/CPU mapping support.
+- [ ] Basic device/GPU mapping support.
+- [ ] More advanced host and device support (expaned to multiple items as needed).
diff --git a/flang/docs/index.md b/flang/docs/index.md
index c35f634746e68..913e53d4cfed9 100644
--- a/flang/docs/index.md
+++ b/flang/docs/index.md
@@ -50,6 +50,7 @@ on how to get in touch with us and to learn more about the current status.
    DebugGeneration
    Directives
    DoConcurrent
+   DoConcurrentConversionToOpenMP
    Extensions
    F202X
    FIRArrayOperations
diff --git a/flang/include/flang/Frontend/CodeGenOptions.def b/flang/include/flang/Frontend/CodeGenOptions.def
index deb8d1aede518..13cda965600b5 100644
--- a/flang/include/flang/Frontend/CodeGenOptions.def
+++ b/flang/include/flang/Frontend/CodeGenOptions.def
@@ -41,5 +41,7 @@ ENUM_CODEGENOPT(DebugInfo,  llvm::codegenoptions::DebugInfoKind, 4,  llvm::codeg
 ENUM_CODEGENOPT(VecLib, llvm::driver::VectorLibrary, 3, llvm::driver::VectorLibrary::NoLibrary) ///< Vector functions library to use
 ENUM_CODEGENOPT(FramePointer, llvm::FramePointerKind, 2, llvm::FramePointerKind::None) ///< Enable the usage of frame pointers
 
+ENUM_CODEGENOPT(DoConcurrentMapping, DoConcurrentMappingKind, 2, DoConcurrentMappingKind::DCMK_None) ///< Map `do concurrent` to OpenMP
+
 #undef CODEGENOPT
 #undef ENUM_CODEGENOPT
diff --git a/flang/include/flang/Frontend/CodeGenOptions.h b/flang/include/flang/Frontend/CodeGenOptions.h
index f19943335737b..23d99e1f0897a 100644
--- a/flang/include/flang/Frontend/CodeGenOptions.h
+++ b/flang/include/flang/Frontend/CodeGenOptions.h
@@ -15,6 +15,7 @@
 #ifndef FORTRAN_FRONTEND_CODEGENOPTIONS_H
 #define FORTRAN_FRONTEND_CODEGENOPTIONS_H
 
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "llvm/Frontend/Debug/Options.h"
 #include "llvm/Frontend/Driver/CodeGenOptions.h"
 #include "llvm/Support/CodeGen.h"
@@ -143,6 +144,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// (-mlarge-data-threshold).
   uint64_t LargeDataThreshold;
 
+  /// Optionally map `do concurrent` loops to OpenMP. This is only valid of
+  /// OpenMP is enabled.
+  using DoConcurrentMappingKind = flangomp::DoConcurrentMappingKind;
+
   // Define accessors/mutators for code generation options of enumeration type.
 #define CODEGENOPT(Name, Bits, Default)
 #define ENUM_CODEGENOPT(Name, Type, Bits, Default)                             \
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.h b/flang/include/flang/Optimizer/OpenMP/Passes.h
index feb395f1a12db..c67bddbcd2704 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.h
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.h
@@ -13,6 +13,7 @@
 #ifndef FORTRAN_OPTIMIZER_OPENMP_PASSES_H
 #define FORTRAN_OPTIMIZER_OPENMP_PASSES_H
 
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
 #include "mlir/IR/BuiltinOps.h"
 #include "mlir/Pass/Pass.h"
@@ -30,6 +31,7 @@ namespace flangomp {
 /// divided into units of work.
 bool shouldUseWorkshareLowering(mlir::Operation *op);
 
+std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass(bool mapToDevice);
 } // namespace flangomp
 
 #endif // FORTRAN_OPTIMIZER_OPENMP_PASSES_H
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.td b/flang/include/flang/Optimizer/OpenMP/Passes.td
index 3add0c560f88d..fcc7a4ca31fef 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.td
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.td
@@ -50,6 +50,36 @@ def FunctionFilteringPass : Pass<"omp-function-filtering"> {
   ];
 }
 
+def DoConcurrentConversionPass : Pass<"omp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+  let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
+
+  let description = [{ This is an experimental pass to map `DO CONCURRENT` loops
+     to their correspnding equivalent OpenMP worksharing constructs.
+
+     For now the following is supported:
+       - Mapping simple loops to `parallel do`.
+
+     Still TODO:
+       - More extensive testing.
+  }];
+
+  let dependentDialects = ["mlir::omp::OpenMPDialect"];
+
+  let options = [
+    Option<"mapTo", "map-to",
+           "flangomp::DoConcurrentMappingKind",
+           /*default=*/"flangomp::DoConcurrentMappingKind::DCMK_None",
+           "Try to map `do concurrent` loops to OpenMP [none|host|device]",
+           [{::llvm::cl::values(
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_None,
+                          "none", "Do not lower `do concurrent` to OpenMP"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Host,
+                          "host", "Lower to run in parallel on the CPU"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Device,
+                          "device", "Lower to run in parallel on the GPU")
+           )}]>,
+  ];
+}
 
 // Needs to be scheduled on Module as we create functions in it
 def LowerWorkshare : Pass<"lower-workshare", "::mlir::ModuleOp"> {
diff --git a/flang/include/flang/Optimizer/OpenMP/Utils.h b/flang/include/flang/Optimizer/OpenMP/Utils.h
new file mode 100644
index 0000000000000..636c768b016b7
--- /dev/null
+++ b/flang/include/flang/Optimizer/OpenMP/Utils.h
@@ -0,0 +1,26 @@
+//===-- Optimizer/OpenMP/Utils.h --------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Coding style: https://mlir.llvm.org/getting_started/DeveloperGuide/
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef FORTRAN_OPTIMIZER_OPENMP_UTILS_H
+#define FORTRAN_OPTIMIZER_OPENMP_UTILS_H
+
+namespace flangomp {
+
+enum class DoConcurrentMappingKind {
+  DCMK_None,  ///< Do not lower `do concurrent` to OpenMP.
+  DCMK_Host,  ///< Lower to run in parallel on the CPU.
+  DCMK_Device ///< Lower to run in parallel on the GPU.
+};
+
+} // namespace flangomp
+
+#endif // FORTRAN_OPTIMIZER_OPENMP_UTILS_H
diff --git a/flang/include/flang/Optimizer/Passes/Pipelines.h b/flang/include/flang/Optimizer/Passes/Pipelines.h
index ef5d44ded706c..a3f59ee8dd013 100644
--- a/flang/include/flang/Optimizer/Passes/Pipelines.h
+++ b/flang/include/flang/Optimizer/Passes/Pipelines.h
@@ -128,6 +128,17 @@ void createHLFIRToFIRPassPipeline(
     mlir::PassManager &pm, bool enableOpenMP,
     llvm::OptimizationLevel optLevel = defaultOptLevel);
 
+struct OpenMPFIRPassPipelineOpts {
+  /// Whether code is being generated for a target device rather than the host
+  /// device
+  bool isTargetDevice;
+
+  /// Controls how to map `do concurrent` loops; to device, host, or none at
+  /// all.
+  Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind
+      doConcurrentMappingKind;
+};
+
 /// Create a pass pipeline for handling certain OpenMP transformations needed
 /// prior to FIR lowering.
 ///
@@ -135,9 +146,10 @@ void createHLFIRToFIRPassPipeline(
 /// that the FIR is correct with respect to OpenMP operations/attributes.
 ///
 /// \param pm - MLIR pass manager that will hold the pipeline definition.
-/// \param isTargetDevice - Whether code is being generated for a target device
-/// rather than the host device.
-void createOpenMPFIRPassPipeline(mlir::PassManager &pm, bool isTargetDevice);
+/// \param opts - options to control OpenMP code-gen; see struct docs for more
+/// details.
+void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
+                                 OpenMPFIRPassPipelineOpts opts);
 
 #if !defined(FLANG_EXCLUDE_CODEGEN)
 void createDebugPasses(mlir::PassManager &pm,
diff --git a/flang/lib/Frontend/CompilerInvocation.cpp b/flang/lib/Frontend/CompilerInvocation.cpp
index f3d9432c62d3b..809e423f5aae9 100644
--- a/flang/lib/Frontend/CompilerInvocation.cpp
+++ b/flang/lib/Frontend/CompilerInvocation.cpp
@@ -157,6 +157,32 @@ static bool parseDebugArgs(Fortran::frontend::CodeGenOptions &opts,
   return true;
 }
 
+static void parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
+                                     llvm::opt::ArgList &args,
+                                     clang::DiagnosticsEngine &diags) {
+  llvm::opt::Arg *arg =
+      args.getLastArg(clang::driver::options::OPT_fdo_concurrent_to_openmp_EQ);
+  if (!arg)
+    return;
+
+  using DoConcurrentMappingKind =
+      Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+  std::optional<DoConcurrentMappingKind> val =
+      llvm::StringSwitch<std::optional<DoConcurrentMappingKind>>(
+          arg->getValue())
+          .Case("none", DoConcurrentMappingKind::DCMK_None)
+          .Case("host", DoConcurrentMappingKind::DCMK_Host)
+          .Case("device", DoConcurrentMappingKind::DCMK_Device)
+          .Default(std::nullopt);
+
+  if (!val.has_value()) {
+    diags.Report(clang::diag::err_drv_invalid_value)
+        << arg->getAsString(args) << arg->getValue();
+  }
+
+  opts.setDoConcurrentMapping(val.value());
+}
+
 static bool parseVectorLibArg(Fortran::frontend::CodeGenOptions &opts,
                               llvm::opt::ArgList &args,
                               clang::DiagnosticsEngine &diags) {
@@ -426,6 +452,8 @@ static void parseCodeGenArgs(Fortran::frontend::CodeGenOptions &opts,
                    clang::driver::options::OPT_funderscoring, false)) {
     opts.Underscoring = 0;
   }
+
+  parseDoConcurrentMapping(opts, args, diags);
 }
 
 /// Parses all target input arguments and populates the target
diff --git a/flang/lib/Frontend/FrontendActions.cpp b/flang/lib/Frontend/FrontendActions.cpp
index 763c810ace0eb..ccc8c7d96135f 100644
--- a/flang/lib/Frontend/FrontendActions.cpp
+++ b/flang/lib/Frontend/FrontendActions.cp...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Feb 18, 2025

@llvm/pr-subscribers-clang-driver

Author: Kareem Ergawy (ergawy)

Changes

Upstreams the next part of do concurrent to OpenMP mapping pass (from
AMD's ROCm implementation). See #126026 for more context.

This PR add loop nest detection logic. This enables us to discover
muli-range do concurrent loops and then map them as "collapsed" loop
nests to OpenMP.

This is a follow up for #126026, only the latest commit is relevant.

This is a replacement for #127478 using a /user/&lt;username&gt;/&lt;branchname&gt; branch.


Patch is 39.64 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127595.diff

19 Files Affected:

  • (modified) clang/include/clang/Driver/Options.td (+4)
  • (modified) clang/lib/Driver/ToolChains/Flang.cpp (+2-1)
  • (added) flang/docs/DoConcurrentConversionToOpenMP.md (+229)
  • (modified) flang/docs/index.md (+1)
  • (modified) flang/include/flang/Frontend/CodeGenOptions.def (+2)
  • (modified) flang/include/flang/Frontend/CodeGenOptions.h (+5)
  • (modified) flang/include/flang/Optimizer/OpenMP/Passes.h (+2)
  • (modified) flang/include/flang/Optimizer/OpenMP/Passes.td (+30)
  • (added) flang/include/flang/Optimizer/OpenMP/Utils.h (+26)
  • (modified) flang/include/flang/Optimizer/Passes/Pipelines.h (+15-3)
  • (modified) flang/lib/Frontend/CompilerInvocation.cpp (+28)
  • (modified) flang/lib/Frontend/FrontendActions.cpp (+27-5)
  • (modified) flang/lib/Optimizer/OpenMP/CMakeLists.txt (+1)
  • (added) flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp (+209)
  • (modified) flang/lib/Optimizer/Passes/Pipelines.cpp (+10-2)
  • (added) flang/test/Driver/do_concurrent_to_omp_cli.f90 (+20)
  • (added) flang/test/Transforms/DoConcurrent/basic_host.f90 (+53)
  • (added) flang/test/Transforms/DoConcurrent/loop_nest_test.f90 (+89)
  • (modified) flang/tools/bbc/bbc.cpp (+19-1)
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 5ad187926e710..0cd3dfd3fb29d 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -6927,6 +6927,10 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri
 
 def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
   HelpText<"Emit hermetic module files (no nested USE association)">;
+
+def fdo_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
+  HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
+      Values<"none, host, device">;
 } // let Visibility = [FC1Option, FlangOption]
 
 def J : JoinedOrSeparate<["-"], "J">,
diff --git a/clang/lib/Driver/ToolChains/Flang.cpp b/clang/lib/Driver/ToolChains/Flang.cpp
index 9ad795edd724d..cb0b00a2fd699 100644
--- a/clang/lib/Driver/ToolChains/Flang.cpp
+++ b/clang/lib/Driver/ToolChains/Flang.cpp
@@ -153,7 +153,8 @@ void Flang::addCodegenOptions(const ArgList &Args,
     CmdArgs.push_back("-fversion-loops-for-stride");
 
   Args.addAllArgs(CmdArgs,
-                  {options::OPT_flang_experimental_hlfir,
+                  {options::OPT_fdo_concurrent_to_openmp_EQ,
+                   options::OPT_flang_experimental_hlfir,
                    options::OPT_flang_deprecated_no_hlfir,
                    options::OPT_fno_ppc_native_vec_elem_order,
                    options::OPT_fppc_native_vec_elem_order,
diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
new file mode 100644
index 0000000000000..de2525dd8b57d
--- /dev/null
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -0,0 +1,229 @@
+<!--===- docs/DoConcurrentMappingToOpenMP.md
+
+   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+   See https://llvm.org/LICENSE.txt for license information.
+   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+-->
+
+# `DO CONCURRENT` mapping to OpenMP
+
+```{contents}
+---
+local:
+---
+```
+
+This document seeks to describe the effort to parallelize `do concurrent` loops
+by mapping them to OpenMP worksharing constructs. The goals of this document
+are:
+* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
+  constructs.
+* Tracking the current status of such mapping.
+* Describing the limitations of the current implementation.
+* Describing next steps.
+* Tracking the current upstreaming status (from the AMD ROCm fork).
+
+## Usage
+
+In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
+compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
+1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
+   This maps such loops to the equivalent of `omp parallel do`.
+2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
+   This maps such loops to the equivalent of
+   `omp target teams distribute parallel do`.
+3. `none`: this disables `do concurrent` mapping altogether. In that case, such
+   loops are emitted as sequential loops.
+
+The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
+OpenMP is also enabled. So you need to provide the following options to flang in
+order to enable it:
+```
+flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
+```
+For mapping to device, the target device architecture must be specified as well.
+See `-fopenmp-targets` and `--offload-arch` for more info.
+
+## Current status
+
+Under the hood, `do concurrent` mapping is implemented in the
+`DoConcurrentConversionPass`. This is still an experimental pass which means
+that:
+* It has been tested in a very limited way so far.
+* It has been tested mostly on simple synthetic inputs.
+
+### Loop nest detection
+
+On the `FIR` dialect level, the following loop:
+```fortran
+  do concurrent(i=1:n, j=1:m, k=1:o)
+    a(i,j,k) = i + j + k
+  end do
+```
+is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
+contains **only** the following:
+  1. The operations needed to assign/update the outer loop's induction variable.
+  1. The inner loop itself.
+
+So the MLIR structure for the above example looks similar to the following:
+```
+  fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
+    %i_idx_2 = fir.convert %i_idx : (index) -> i32
+    fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>
+
+    fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
+      %j_idx_2 = fir.convert %j_idx : (index) -> i32
+      fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>
+
+      fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
+        %k_idx_2 = fir.convert %k_idx : (index) -> i32
+        fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>
+
+        ... loop nest body goes here ...
+      }
+    }
+  }
+```
+This applies to multi-range loops in general; they are represented in the IR as
+a nest of `fir.do_loop` ops with the above nesting structure.
+
+Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
+loops and map them as "collapsed" loops in OpenMP.
+
+#### Further info regarding loop nest detection
+
+Loop nest detection is currently limited to the scenario described in the previous
+section. However, this is quite limited and can be extended in the future to cover
+more cases. For example, for the following loop nest, even though, both loops are
+perfectly nested; at the moment, only the outer loop is parallelized:
+```fortran
+do concurrent(i=1:n)
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+Similarly, for the following loop nest, even though the intervening statement `x = 41`
+does not have any memory effects that would affect parallelization, this nest is
+not parallelized as well (only the outer loop is).
+
+```fortran
+do concurrent(i=1:n)
+  x = 41
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+The above also has the consequence that the `j` variable will **not** be
+privatized in the OpenMP parallel/target region. In other words, it will be
+treated as if it was a `shared` variable. For more details about privatization,
+see the "Data environment" section below.
+
+See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
+of what is and is not detected as a perfect loop nest.
+
+<!--
+More details about current status will be added along with relevant parts of the
+implementation in later upstreaming patches.
+-->
+
+## Next steps
+
+This section describes some of the open questions/issues that are not tackled yet
+even in the downstream implementation.
+
+### Delayed privatization
+
+So far, we emit the privatization logic for IVs inline in the parallel/target
+region. This is enough for our purposes right now since we don't
+localize/privatize any sophisticated types of variables yet. Once we have need
+for more advanced localization through `do concurrent`'s locality specifiers
+(see below), delayed privatization will enable us to have a much cleaner IR.
+Once delayed privatization's implementation upstream is supported for the
+required constructs by the pass, we will move to it rather than inlined/early
+privatization.
+
+### Locality specifiers for `do concurrent`
+
+Locality specifiers will enable the user to control the data environment of the
+loop nest in a more fine-grained way. Implementing these specifiers on the
+`FIR` dialect level is needed in order to support this in the
+`DoConcurrentConversionPass`.
+
+Such specifiers will also unlock a potential solution to the
+non-perfectly-nested loops' IVs issue described above. In particular, for a
+non-perfectly nested loop, one middle-ground proposal/solution would be to:
+* Emit the loop's IV as shared/mapped just like we do currently.
+* Emit a warning that the IV of the loop is emitted as shared/mapped.
+* Given support for `LOCAL`, we can recommend the user to explicitly
+  localize/privatize the loop's IV if they choose to.
+
+#### Sharing TableGen clause records from the OpenMP dialect
+
+At the moment, the FIR dialect does not have a way to model locality specifiers
+on the IR level. Instead, something similar to early/eager privatization in OpenMP
+is done for the locality specifiers in `fir.do_loop` ops. Having locality specifier
+modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
+reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
+to OpenMP (and other parallel programming models) much easier.
+
+Therefore, one way to approach this problem is to extract the TableGen records
+for relevant OpenMP clauses in a shared dialect for "data environment management"
+and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
+as well.
+
+#### Supporting reductions
+
+Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
+is also still an open TODO. We can potentially extend the MLIR infrastructure
+proposed in the previous section to share reduction records among the different 
+relevant dialects as well.
+
+### More advanced detection of loop nests
+
+As pointed out earlier, any intervening code between the headers of 2 nested
+`do concurrent` loops prevents us from detecting this as a loop nest. In some
+cases this is overly conservative. Therefore, a more flexible detection logic
+of loop nests needs to be implemented.
+
+### Data-dependence analysis
+
+Right now, we map loop nests without analysing whether such mapping is safe to
+do or not. We probably need to at least warn the user of unsafe loop nests due
+to loop-carried dependencies.
+
+### Non-rectangular loop nests
+
+So far, we did not need to use the pass for non-rectangular loop nests. For
+example:
+```fortran
+do concurrent(i=1:n)
+  do concurrent(j=i:n)
+    ...
+  end do
+end do
+```
+We defer this to the (hopefully) near future when we get the conversion in a
+good share for the samples/projects at hand.
+
+### Generalizing the pass to other parallel programming models
+
+Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
+this in a more generalized direction and allow the pass to target other models;
+e.g. OpenACC. This goal should be kept in mind from the get-go even while only
+targeting OpenMP.
+
+
+## Upstreaming status
+
+- [x] Command line options for `flang` and `bbc`.
+- [x] Conversion pass skeleton (no transormations happen yet).
+- [x] Status description and tracking document (this document).
+- [x] Loop nest detection to identify multi-range loops.
+- [ ] Basic host/CPU mapping support.
+- [ ] Basic device/GPU mapping support.
+- [ ] More advanced host and device support (expaned to multiple items as needed).
diff --git a/flang/docs/index.md b/flang/docs/index.md
index c35f634746e68..913e53d4cfed9 100644
--- a/flang/docs/index.md
+++ b/flang/docs/index.md
@@ -50,6 +50,7 @@ on how to get in touch with us and to learn more about the current status.
    DebugGeneration
    Directives
    DoConcurrent
+   DoConcurrentConversionToOpenMP
    Extensions
    F202X
    FIRArrayOperations
diff --git a/flang/include/flang/Frontend/CodeGenOptions.def b/flang/include/flang/Frontend/CodeGenOptions.def
index deb8d1aede518..13cda965600b5 100644
--- a/flang/include/flang/Frontend/CodeGenOptions.def
+++ b/flang/include/flang/Frontend/CodeGenOptions.def
@@ -41,5 +41,7 @@ ENUM_CODEGENOPT(DebugInfo,  llvm::codegenoptions::DebugInfoKind, 4,  llvm::codeg
 ENUM_CODEGENOPT(VecLib, llvm::driver::VectorLibrary, 3, llvm::driver::VectorLibrary::NoLibrary) ///< Vector functions library to use
 ENUM_CODEGENOPT(FramePointer, llvm::FramePointerKind, 2, llvm::FramePointerKind::None) ///< Enable the usage of frame pointers
 
+ENUM_CODEGENOPT(DoConcurrentMapping, DoConcurrentMappingKind, 2, DoConcurrentMappingKind::DCMK_None) ///< Map `do concurrent` to OpenMP
+
 #undef CODEGENOPT
 #undef ENUM_CODEGENOPT
diff --git a/flang/include/flang/Frontend/CodeGenOptions.h b/flang/include/flang/Frontend/CodeGenOptions.h
index f19943335737b..23d99e1f0897a 100644
--- a/flang/include/flang/Frontend/CodeGenOptions.h
+++ b/flang/include/flang/Frontend/CodeGenOptions.h
@@ -15,6 +15,7 @@
 #ifndef FORTRAN_FRONTEND_CODEGENOPTIONS_H
 #define FORTRAN_FRONTEND_CODEGENOPTIONS_H
 
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "llvm/Frontend/Debug/Options.h"
 #include "llvm/Frontend/Driver/CodeGenOptions.h"
 #include "llvm/Support/CodeGen.h"
@@ -143,6 +144,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// (-mlarge-data-threshold).
   uint64_t LargeDataThreshold;
 
+  /// Optionally map `do concurrent` loops to OpenMP. This is only valid of
+  /// OpenMP is enabled.
+  using DoConcurrentMappingKind = flangomp::DoConcurrentMappingKind;
+
   // Define accessors/mutators for code generation options of enumeration type.
 #define CODEGENOPT(Name, Bits, Default)
 #define ENUM_CODEGENOPT(Name, Type, Bits, Default)                             \
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.h b/flang/include/flang/Optimizer/OpenMP/Passes.h
index feb395f1a12db..c67bddbcd2704 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.h
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.h
@@ -13,6 +13,7 @@
 #ifndef FORTRAN_OPTIMIZER_OPENMP_PASSES_H
 #define FORTRAN_OPTIMIZER_OPENMP_PASSES_H
 
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
 #include "mlir/IR/BuiltinOps.h"
 #include "mlir/Pass/Pass.h"
@@ -30,6 +31,7 @@ namespace flangomp {
 /// divided into units of work.
 bool shouldUseWorkshareLowering(mlir::Operation *op);
 
+std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass(bool mapToDevice);
 } // namespace flangomp
 
 #endif // FORTRAN_OPTIMIZER_OPENMP_PASSES_H
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.td b/flang/include/flang/Optimizer/OpenMP/Passes.td
index 3add0c560f88d..fcc7a4ca31fef 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.td
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.td
@@ -50,6 +50,36 @@ def FunctionFilteringPass : Pass<"omp-function-filtering"> {
   ];
 }
 
+def DoConcurrentConversionPass : Pass<"omp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+  let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
+
+  let description = [{ This is an experimental pass to map `DO CONCURRENT` loops
+     to their correspnding equivalent OpenMP worksharing constructs.
+
+     For now the following is supported:
+       - Mapping simple loops to `parallel do`.
+
+     Still TODO:
+       - More extensive testing.
+  }];
+
+  let dependentDialects = ["mlir::omp::OpenMPDialect"];
+
+  let options = [
+    Option<"mapTo", "map-to",
+           "flangomp::DoConcurrentMappingKind",
+           /*default=*/"flangomp::DoConcurrentMappingKind::DCMK_None",
+           "Try to map `do concurrent` loops to OpenMP [none|host|device]",
+           [{::llvm::cl::values(
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_None,
+                          "none", "Do not lower `do concurrent` to OpenMP"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Host,
+                          "host", "Lower to run in parallel on the CPU"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Device,
+                          "device", "Lower to run in parallel on the GPU")
+           )}]>,
+  ];
+}
 
 // Needs to be scheduled on Module as we create functions in it
 def LowerWorkshare : Pass<"lower-workshare", "::mlir::ModuleOp"> {
diff --git a/flang/include/flang/Optimizer/OpenMP/Utils.h b/flang/include/flang/Optimizer/OpenMP/Utils.h
new file mode 100644
index 0000000000000..636c768b016b7
--- /dev/null
+++ b/flang/include/flang/Optimizer/OpenMP/Utils.h
@@ -0,0 +1,26 @@
+//===-- Optimizer/OpenMP/Utils.h --------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Coding style: https://mlir.llvm.org/getting_started/DeveloperGuide/
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef FORTRAN_OPTIMIZER_OPENMP_UTILS_H
+#define FORTRAN_OPTIMIZER_OPENMP_UTILS_H
+
+namespace flangomp {
+
+enum class DoConcurrentMappingKind {
+  DCMK_None,  ///< Do not lower `do concurrent` to OpenMP.
+  DCMK_Host,  ///< Lower to run in parallel on the CPU.
+  DCMK_Device ///< Lower to run in parallel on the GPU.
+};
+
+} // namespace flangomp
+
+#endif // FORTRAN_OPTIMIZER_OPENMP_UTILS_H
diff --git a/flang/include/flang/Optimizer/Passes/Pipelines.h b/flang/include/flang/Optimizer/Passes/Pipelines.h
index ef5d44ded706c..a3f59ee8dd013 100644
--- a/flang/include/flang/Optimizer/Passes/Pipelines.h
+++ b/flang/include/flang/Optimizer/Passes/Pipelines.h
@@ -128,6 +128,17 @@ void createHLFIRToFIRPassPipeline(
     mlir::PassManager &pm, bool enableOpenMP,
     llvm::OptimizationLevel optLevel = defaultOptLevel);
 
+struct OpenMPFIRPassPipelineOpts {
+  /// Whether code is being generated for a target device rather than the host
+  /// device
+  bool isTargetDevice;
+
+  /// Controls how to map `do concurrent` loops; to device, host, or none at
+  /// all.
+  Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind
+      doConcurrentMappingKind;
+};
+
 /// Create a pass pipeline for handling certain OpenMP transformations needed
 /// prior to FIR lowering.
 ///
@@ -135,9 +146,10 @@ void createHLFIRToFIRPassPipeline(
 /// that the FIR is correct with respect to OpenMP operations/attributes.
 ///
 /// \param pm - MLIR pass manager that will hold the pipeline definition.
-/// \param isTargetDevice - Whether code is being generated for a target device
-/// rather than the host device.
-void createOpenMPFIRPassPipeline(mlir::PassManager &pm, bool isTargetDevice);
+/// \param opts - options to control OpenMP code-gen; see struct docs for more
+/// details.
+void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
+                                 OpenMPFIRPassPipelineOpts opts);
 
 #if !defined(FLANG_EXCLUDE_CODEGEN)
 void createDebugPasses(mlir::PassManager &pm,
diff --git a/flang/lib/Frontend/CompilerInvocation.cpp b/flang/lib/Frontend/CompilerInvocation.cpp
index f3d9432c62d3b..809e423f5aae9 100644
--- a/flang/lib/Frontend/CompilerInvocation.cpp
+++ b/flang/lib/Frontend/CompilerInvocation.cpp
@@ -157,6 +157,32 @@ static bool parseDebugArgs(Fortran::frontend::CodeGenOptions &opts,
   return true;
 }
 
+static void parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
+                                     llvm::opt::ArgList &args,
+                                     clang::DiagnosticsEngine &diags) {
+  llvm::opt::Arg *arg =
+      args.getLastArg(clang::driver::options::OPT_fdo_concurrent_to_openmp_EQ);
+  if (!arg)
+    return;
+
+  using DoConcurrentMappingKind =
+      Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+  std::optional<DoConcurrentMappingKind> val =
+      llvm::StringSwitch<std::optional<DoConcurrentMappingKind>>(
+          arg->getValue())
+          .Case("none", DoConcurrentMappingKind::DCMK_None)
+          .Case("host", DoConcurrentMappingKind::DCMK_Host)
+          .Case("device", DoConcurrentMappingKind::DCMK_Device)
+          .Default(std::nullopt);
+
+  if (!val.has_value()) {
+    diags.Report(clang::diag::err_drv_invalid_value)
+        << arg->getAsString(args) << arg->getValue();
+  }
+
+  opts.setDoConcurrentMapping(val.value());
+}
+
 static bool parseVectorLibArg(Fortran::frontend::CodeGenOptions &opts,
                               llvm::opt::ArgList &args,
                               clang::DiagnosticsEngine &diags) {
@@ -426,6 +452,8 @@ static void parseCodeGenArgs(Fortran::frontend::CodeGenOptions &opts,
                    clang::driver::options::OPT_funderscoring, false)) {
     opts.Underscoring = 0;
   }
+
+  parseDoConcurrentMapping(opts, args, diags);
 }
 
 /// Parses all target input arguments and populates the target
diff --git a/flang/lib/Frontend/FrontendActions.cpp b/flang/lib/Frontend/FrontendActions.cpp
index 763c810ace0eb..ccc8c7d96135f 100644
--- a/flang/lib/Frontend/FrontendActions.cpp
+++ b/flang/lib/Frontend/FrontendActions.cp...
[truncated]

@kiranchandramohan
Copy link
Contributor

It is slightly unfortunate to rediscover the loops so early in the flow when we had it in source. Have you considered changing the representation of do_concurrent in the IR for multi-range do concurrent loops? Representation could in HLFIR capture all the loops, so there is no need to re-discover the loops. It could be modelled on the old workshare loop hlfir.do_concurrent (i,j) : (start_i,start_j) (end_i,end_j) (step_i,step_j) or the current representation

hlfir.do_concurrent {
fir.do
fir.do
}

Alternatively would adding attributes to denote a multi-range loop help this?

fir.do_loop unordered multi-range
fir.do_loop unordered multi-range

Might be worth a discussion with @clementval and @jeanPerier to see whether there is appetite for it.

@ergawy ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from 57e10a4 to 41f77da Compare February 18, 2025 09:22
@ergawy
Copy link
Member Author

ergawy commented Feb 18, 2025

It is slightly unfortunate to rediscover the loops so early in the flow when we had it in source.

Totally agree, it should be more trivial than this. And it actually was slight worse, see: #114020.

Have you considered changing the representation of do_concurrent in the IR for multi-range do concurrent loops?

I am all for it. I can add that to the future work part of the document and look into it once we have a more fleshed out pass. The nest detection algorithm is luckily not that complicated and helps us move forward with a working implementation faster. WDYT?

Copy link
Contributor

@bhandarkar-pranav bhandarkar-pranav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly fine. @ergawy, it would be nice if you included a motivating example in the Pass file and used that to document your algorithm for the transformation or even the expected result of the transformation.


while (true) {
loopNest.insert(currentLoop);
auto directlyNestedLoops = currentLoop.getRegion().getOps<fir::DoLoopOp>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please specify the type since it is not obvious from the right hand side of the assignment (Note: Since the MLIR developer guide is silent on this, I am deferring to the LLVM style guide)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an iterator range so the type will be quie ugly. I used it directly in the loop instead to iterate the result. Hopefully that's better.

@ergawy ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from 41f77da to 2561e21 Compare February 19, 2025 12:34
@ergawy
Copy link
Member Author

ergawy commented Feb 20, 2025

@clementval @jeanPerier can you please take a look at the PR and @kiranchandramohan's comment above? 🙏

Copy link
Member

@skatrak skatrak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@ergawy ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from 2561e21 to 6ef2fae Compare February 20, 2025 15:34
@clementval
Copy link
Contributor

clementval commented Feb 20, 2025

@clementval @jeanPerier can you please take a look at the PR and @kiranchandramohan's comment above? 🙏

Yes, I think we should add a proper operation to represent the do concurrent. It would also allow use to carry the locality mapping into the IR representation and apply different semantic based on the target.

Without a proper representation, how do you differentiate a do concurrent loop nest to an array notation like:

subroutine sub1(a)
  integer :: a(10,10,10)

  a = 1
end subroutine
  func.func @_QPsub1(%arg0: !fir.ref<!fir.array<10x10x10xi32>> {fir.bindc_name = "a"}) {
    %c1 = arith.constant 1 : index
    %c1_i32 = arith.constant 1 : i32
    %c10 = arith.constant 10 : index
    %0 = fir.dummy_scope : !fir.dscope
    %1 = fir.shape %c10, %c10, %c10 : (index, index, index) -> !fir.shape<3>
    %2 = fir.declare %arg0(%1) dummy_scope %0 {uniq_name = "_QFsub1Ea"} : (!fir.ref<!fir.array<10x10x10xi32>>, !fir.shape<3>, !fir.dscope) -> !fir.ref<!fir.array<10x10x10xi32>>
    fir.do_loop %arg1 = %c1 to %c10 step %c1 unordered {
      fir.do_loop %arg2 = %c1 to %c10 step %c1 unordered {
        fir.do_loop %arg3 = %c1 to %c10 step %c1 unordered {
          %3 = fir.array_coor %2(%1) %arg3, %arg2, %arg1 : (!fir.ref<!fir.array<10x10x10xi32>>, !fir.shape<3>, index, index, index) -> !fir.ref<i32>
          fir.store %c1_i32 to %3 : !fir.ref<i32>
        }
      }
    }
    return
  }

Copy link
Contributor

@bhandarkar-pranav bhandarkar-pranav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microminor nit. Ignore, if no other reviewers have requested changes.

break;

// Having more than one unordered loop means that we are not dealing with a
// perfect loop nest (i.e. a mulit-range `do concurrent` loop); which is the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/mulit/multi

@ergawy
Copy link
Member Author

ergawy commented Feb 21, 2025

Thanks @skatrak and @bhandarkar-pranav for the approval.

@kiranchandramohan @clementval I think there is a pretty simple solution that enables us to mark multi-range loop nests. I think we can add an optional attribute to the fir::DoLoopOp to store the loop nest depth: nest_depth(n). So for the following input:

do concurrent (i=1:10, j=1:10)
end do

The MLIR would look like this:

    fir.do_loop %arg0 = %10 to %11 step %c1 unordered nest_depth(2) {
      %14 = fir.convert %arg0 : (index) -> i32
      fir.store %14 to %3#1 : !fir.ref<i32>
      fir.do_loop %arg1 = %12 to %13 step %c1_2 unordered {
        %15 = fir.convert %arg1 : (index) -> i32
        fir.store %15 to %1#1 : !fir.ref<i32>
        ....
      }
    }

So the new attribute would be attached only to the outermost loop op in the nest. I think this is a non-distruptive change to the op that enables us to model loop nests more easily. So this is similar to what @kiranchandramohan suggested above but I think more self-contained: one attribute to tie the whole nest together. WDYT?

@clementval as for the locality specifiers, I am working on a PoC (based-on which I will write an RFC) to have a shared "Data Environemt" dialect that can be used across do concurrent, OpenMP, and OpenACC. I published what I have so far in #128148. Please take a look at the PR for more info.

@clementval
Copy link
Contributor

Thanks @skatrak and @bhandarkar-pranav for the approval.

@kiranchandramohan @clementval I think there is a pretty simple solution that enables us to mark multi-range loop nests. I think we can add an optional attribute to the fir::DoLoopOp to store the loop nest depth: nest_depth(n). So for the following input:

do concurrent (i=1:10, j=1:10)
end do

The MLIR would look like this:

    fir.do_loop %arg0 = %10 to %11 step %c1 unordered nest_depth(2) {
      %14 = fir.convert %arg0 : (index) -> i32
      fir.store %14 to %3#1 : !fir.ref<i32>
      fir.do_loop %arg1 = %12 to %13 step %c1_2 unordered {
        %15 = fir.convert %arg1 : (index) -> i32
        fir.store %15 to %1#1 : !fir.ref<i32>
        ....
      }
    }

So the new attribute would be attached only to the outermost loop op in the nest. I think this is a non-distruptive change to the op that enables us to model loop nests more easily. So this is similar to what @kiranchandramohan suggested above but I think more self-contained: one attribute to tie the whole nest together. WDYT?

I would prefer a proper operation than trying to patch the current one. We have plenty of example of loop operation with multiple ranges so let's model it the proper way.

@ergawy ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch 2 times, most recently from 92d976d to ca40210 Compare March 4, 2025 07:18
@ergawy
Copy link
Member Author

ergawy commented Mar 4, 2025

@kiranchandramohan @jeanPerier @clementval as we agreed in the last community call; we will introduce a separate op for do concurrent loops to model multi-range ops. I agree that just having an attribute to carry the nest depth won't be a good idea since there are no guarantees any random pass will drop these attributes and hence losing the information we wanted to maintain.

We also agreed that this change can be postponed for a future improvement because:

  1. What we have now should be easy to refactor, we "just" get rid of the loop nest detection logic.
  2. The change of introducing a new op might be pervasive and require quite a bit of effort to propagate through the whole flang pipeline.

In the design/tracking document, I added an item to the top of the Next steps section so that we tackle that problem even before handling locality specifiers.

@ergawy
Copy link
Member Author

ergawy commented Mar 4, 2025

I wrote an RFC for adding a new do concurrent op in fir: https://discourse.llvm.org/t/modeling-do-concurrent-loops-in-the-fir-dialect/84950. Please take a look and let me know if you have any comments. (this is for future work as we agreed)

@ergawy ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from ca40210 to 9339d07 Compare March 10, 2025 04:52
@ergawy ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch 2 times, most recently from 4ca6023 to 2c3ed0e Compare March 21, 2025 05:46
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 27, 2025
…126026)

This PR starts the effort to upstream AMD's internal implementation of `do concurrent` to OpenMP mapping. This replaces llvm#77285 since we extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current status downstream, the upstreaming status, and next steps to make this pass much more useful.

In addition to this document, this PR also contains the skeleton of the pass (no useful transformations are done yet) and some testing for the added command line options.

This looks like a huge PR but a lot of the added stuff is documentation.

It is also worth noting that the downstream pass has been validated on https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived performance speed-ups that match pure OpenMP, for GPU mapping we are still working on extending our support for implicit memory mapping and locality specifiers.

PR stack:
- llvm#126026 (this PR)
- llvm#127595
- llvm#127633
- llvm#127634
- llvm#127635
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 27, 2025
…27595)

Upstreams the next part of do concurrent to OpenMP mapping pass (from
AMD's ROCm implementation). See llvm#126026 for more context.

This PR add loop nest detection logic. This enables us to discover
muli-range do concurrent loops and then map them as "collapsed" loop
nests to OpenMP.

This is a follow up for llvm#126026, only the latest commit is relevant.

This is a replacement for llvm#127478 using a `/user/<username>/<branchname>` branch.

PR stack:
- llvm#126026
- llvm#127595 (this PR)
- llvm#127633
- llvm#127634
- llvm#127635
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 27, 2025
…ructs (llvm#127633)

Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping pass. This PR add support for converting simple loops to the equivalent OpenMP constructs on the host: `omp parallel do`. Towards that end, we have to collect more information about loop nests for which we add new utils in the `looputils` name space.

PR stack:
- llvm#126026
- llvm#127595
- llvm#127633 (this PR)
- llvm#127634
- llvm#127635
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 27, 2025
…lvm#127634)

Adds support for converting mulit-range loops to OpenMP (on the host only for now). The changes here "prepare" a loop nest for collapsing by sinking iteration variables to the innermost `fir.do_loop` op in the nest.

PR stack:
- llvm#126026
- llvm#127595
- llvm#127633
- llvm#127634 (this PR)
- llvm#127635
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 27, 2025
…lvm#127635)

Extends `do concurrent` mapping to handle "loop-local values". A loop-local value is one that is used exclusively inside the loop but allocated outside of it. This usually corresponds to temporary values that are used inside the loop body for initialzing other variables for example. After collecting these values, the pass localizes them to the loop nest by moving their allocations.

PR stack:
- llvm#126026
- llvm#127595
- llvm#127633
- llvm#127634
- llvm#127635 (this PR)
ergawy added a commit that referenced this pull request Apr 2, 2025
This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces #77285 since we
extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.

In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.

This looks like a huge PR but a lot of the added stuff is documentation.

It is also worth noting that the downstream pass has been validated on
https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived
performance speed-ups that match pure OpenMP, for GPU mapping we are
still working on extending our support for implicit memory mapping and
locality specifiers.

PR stack:
- #126026 (this PR)
- #127595
- #127633
- #127634
- #127635
Upstreams the next part of `do concurrent` to OpenMP mapping pass (from
AMD's ROCm implementation). See #126026 for more context.

This PR add loop nest detection logic. This enables us to discover
muli-range `do concurrent` loops and then map them as "collapsed" loop
nests to OpenMP.
@ergawy ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from 2c3ed0e to e15842d Compare April 2, 2025 07:26
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 2, 2025
…ping (#126026)

This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces #77285 since we
extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.

In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.

This looks like a huge PR but a lot of the added stuff is documentation.

It is also worth noting that the downstream pass has been validated on
https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived
performance speed-ups that match pure OpenMP, for GPU mapping we are
still working on extending our support for implicit memory mapping and
locality specifiers.

PR stack:
- llvm/llvm-project#126026 (this PR)
- llvm/llvm-project#127595
- llvm/llvm-project#127633
- llvm/llvm-project#127634
- llvm/llvm-project#127635
@ergawy ergawy merged commit 41d718b into main Apr 2, 2025
12 checks passed
@ergawy ergawy deleted the users/ergawy/upstream_do_concurrent_2 branch April 2, 2025 08:12
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 2, 2025
…on. (#127595)

Upstreams the next part of do concurrent to OpenMP mapping pass (from
AMD's ROCm implementation). See
llvm/llvm-project#126026 for more context.

This PR add loop nest detection logic. This enables us to discover
muli-range do concurrent loops and then map them as "collapsed" loop
nests to OpenMP.

This is a follow up for
llvm/llvm-project#126026, only the latest commit
is relevant.

This is a replacement for
llvm/llvm-project#127478 using a
`/user/<username>/<branchname>` branch.

PR stack:
- llvm/llvm-project#126026
- llvm/llvm-project#127595 (this PR)
- llvm/llvm-project#127633
- llvm/llvm-project#127634
- llvm/llvm-project#127635
ergawy added a commit that referenced this pull request Apr 2, 2025
…ructs (#127633)

Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping
pass. This PR add support for converting simple loops to the equivalent
OpenMP constructs on the host: `omp parallel do`. Towards that end, we
have to collect more information about loop nests for which we add new
utils in the `looputils` name space.

PR stack:
- #126026
- #127595
- #127633 (this PR)
- #127634
- #127635
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 2, 2025
… host constructs (#127633)

Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping
pass. This PR add support for converting simple loops to the equivalent
OpenMP constructs on the host: `omp parallel do`. Towards that end, we
have to collect more information about loop nests for which we add new
utils in the `looputils` name space.

PR stack:
- llvm/llvm-project#126026
- llvm/llvm-project#127595
- llvm/llvm-project#127633 (this PR)
- llvm/llvm-project#127634
- llvm/llvm-project#127635
ergawy added a commit that referenced this pull request Apr 2, 2025
…127634)

Adds support for converting mulit-range loops to OpenMP (on the host
only for now). The changes here "prepare" a loop nest for collapsing by
sinking iteration variables to the innermost `fir.do_loop` op in the
nest.

PR stack:
- #126026
- #127595
- #127633
- #127634 (this PR)
- #127635
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 2, 2025
…nge loops (#127634)

Adds support for converting mulit-range loops to OpenMP (on the host
only for now). The changes here "prepare" a loop nest for collapsing by
sinking iteration variables to the innermost `fir.do_loop` op in the
nest.

PR stack:
- llvm/llvm-project#126026
- llvm/llvm-project#127595
- llvm/llvm-project#127633
- llvm/llvm-project#127634 (this PR)
- llvm/llvm-project#127635
ergawy added a commit that referenced this pull request Apr 2, 2025
…127635)

Extends `do concurrent` mapping to handle "loop-local values". A
loop-local value is one that is used exclusively inside the loop but
allocated outside of it. This usually corresponds to temporary values
that are used inside the loop body for initialzing other variables for
example. After collecting these values, the pass localizes them to the
loop nest by moving their allocations.

PR stack:
- #126026
- #127595
- #127633
- #127634
- #127635 (this PR)
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 2, 2025
…nt` nests (#127635)

Extends `do concurrent` mapping to handle "loop-local values". A
loop-local value is one that is used exclusively inside the loop but
allocated outside of it. This usually corresponds to temporary values
that are used inside the loop body for initialzing other variables for
example. After collecting these values, the pass localizes them to the
loop nest by moving their allocations.

PR stack:
- llvm/llvm-project#126026
- llvm/llvm-project#127595
- llvm/llvm-project#127633
- llvm/llvm-project#127634
- llvm/llvm-project#127635 (this PR)
@kazutakahirata
Copy link
Contributor

@ergawy I've landed aa33c09 to fix a warning from this PR. Thanks!

@DavidSpickett
Copy link
Collaborator

DavidSpickett commented Apr 7, 2025

This new test was flakey on Linaro's bots, should be fixed by #134608.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category flang:driver flang:fir-hlfir flang Flang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants