Yep, that one. We’ll still discuss this topic though
For now, I think the next step is a “fresh” PR with a standalone GPU backend, or at least the first parts of it, plus tests, etc., so we can have a proper review. While lambdas and objects with a call operator already work, in the sense that we auto-magically compile them for the GPU target together with their visible transitive dependencies, we still need to look at syntax/APIs for users that “dislike OpenMP”.
It should not be much of a problem: we simply hide the pragmas and such behind agnostic llvm/offload APIs. Note that the below is only needed for complex cases, or when you want to improve performance by reducing memory movement on non-unified-shared-memory systems.
To move an array to the GPU and keep it there to avoid moves when std::par algorithms are used in sequence:
#pragma omp target enter data map(to: Array[0:N]) device(DevNo)
can be hidden in a function like
void llvm_offload_host_to_device(void * ptr, size_t bytes, int device_no);
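A minimal sketch of what such a wrapper could look like on top of the existing OpenMP runtime entry points (`omp_target_alloc`, `omp_target_memcpy`, `omp_target_associate_ptr`); the `llvm_offload_*` name is the placeholder from above, and the real llvm/offload API may well end up looking different:

```cpp
#include <cstddef>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical wrapper hiding the OpenMP mapping machinery: copy `bytes`
// bytes starting at `ptr` to device `device_no` and register the
// host<->device association, so later offloaded code reuses the device
// copy instead of moving the data again.
void llvm_offload_host_to_device(void *ptr, std::size_t bytes, int device_no) {
#ifdef _OPENMP
  if (device_no < 0 || device_no >= omp_get_num_devices())
    return; // No such device; leave the data on the host.
  void *dev = omp_target_alloc(bytes, device_no);
  if (!dev)
    return;
  omp_target_memcpy(dev, ptr, bytes, /*dst_offset=*/0, /*src_offset=*/0,
                    device_no, omp_get_initial_device());
  omp_target_associate_ptr(ptr, dev, bytes, /*device_offset=*/0, device_no);
#else
  // Built without an OpenMP offload runtime: nothing to do.
  (void)ptr; (void)bytes; (void)device_no;
#endif
}
```

A matching `llvm_offload_device_to_host` would do the reverse (`omp_target_disassociate_ptr` plus a copy back) when the user is done with the data on the device.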
The runtime will notice the existing mapping, skip the data movement, and automatically use the array’s device address in the GPU code.
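Put together, the pattern looks roughly like the sketch below. Plain `target` loops stand in here for the offloaded std::par algorithms (so the snippet stays self-contained), and without an offload-enabled toolchain the pragmas are simply ignored and everything runs on the host:

```cpp
#include <cstddef>
#include <vector>

// Keep `Array` resident on device `DevNo` across a sequence of kernels, so
// the data is moved once rather than once per algorithm.
void scale_then_offset(std::vector<float> &v, int DevNo) {
  float *Array = v.data();
  std::size_t N = v.size();

  // One upfront copy to the device (the pragma a wrapper API would hide):
  #pragma omp target enter data map(to: Array[0:N]) device(DevNo)

  // Stand-in for the first std::par algorithm in the sequence:
  #pragma omp target teams distribute parallel for device(DevNo)
  for (std::size_t i = 0; i < N; ++i)
    Array[i] *= 2.0f;

  // Stand-in for the second one; no data movement happens in between:
  #pragma omp target teams distribute parallel for device(DevNo)
  for (std::size_t i = 0; i < N; ++i)
    Array[i] += 1.0f;

  // One copy back at the end:
  #pragma omp target exit data map(from: Array[0:N]) device(DevNo)
}
```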
Similarly, to compile a function for the GPU explicitly, e.g., if it is passed via a function pointer or in a different translation unit:
#pragma omp declare target(fn_foo) [indirect]
could be replaced by an attribute. We have `__device__`, we have `[[omp::declare_target]]`, and we could have something for the llvm offload API; all of them do effectively the same thing.
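For illustration, the different spellings side by side; only the OpenMP pragma works today, the attribute and llvm/offload variants are the hypothetical part:

```cpp
// fn_foo may be defined in a different translation unit or only ever be
// called through a function pointer, so the compiler cannot prove it is
// needed on the GPU; it has to be marked explicitly.
float fn_foo(float x) { return x * x; }

// OpenMP spelling (add the `indirect` clause if fn_foo is invoked through
// a function pointer inside device code):
#pragma omp declare target(fn_foo)

// Other possible spellings of the same request, shown as comments:
//   __device__ float fn_foo(float);              // CUDA/HIP
//   [[omp::declare_target]] float fn_foo(float); // attribute form
//   llvm_offload_declare_target(fn_foo);         // hypothetical llvm/offload API
```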
(Please ignore syntax errors, the idea should be sound.)
Wrt. tests:
We have ~4 GPU buildbots right now; I’ll check with the owners whether they build libc++ and whether we can test the GPU backend there. We are also in the process of setting up CI capabilities on a multi-GPU system, but that might need some more time.
If the libcxx buildbots could be migrated to a GPU system, or if we want to test them via CPU offload, that would be great too. All of the required code, runtimes, etc. are in upstream LLVM, so the only outside dependencies are the GPU “drivers” (ROCm for AMD and CUDA for NVIDIA).