OpenMP GPU Target Offload in Clang

Hi, OpenMP dev community!

Recently I tried setting up the OpenMP benchmarks from SPEC ACCEL and testing them with clang, but I ran into several difficulties.

The core of the issue is that I was not able to get the workload onto the GPUs. I wrote the following small test:

//////////////////////////////////////////////////////////////////////////////////////////////////////////////

#define DATATYPE unsigned long long

/*gpu offload openmp*/
DATATYPE reduce_gpu_omp(DATATYPE *arr, size_t size) {
    DATATYPE result = IDENTITY;
#pragma omp target data map(tofrom:arr[:size]) map(tofrom:result)
    {
#pragma omp target teams distribute parallel for reduction(+:result) schedule(static, 1)
        for (size_t i = 0; i < size; i++) {
            result += arr[i];
        }
    }
    return result;
}

//////////////////////////////////////////////////////////////////////////////////////////////////////////////

I compiled it with clang trunk using the following commands:

clang -O3 -fopenmp -omptargets=nvptx64sm_35-nvidia-linux -Wall -o reduce reduce.c

clang -O3 -fopenmp -omptargets=nvptx64sm_35-nvidia-linux-cuda -Wall -o reduce reduce.c

clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall -o reduce reduce.c

Offloading to the GPU was unsuccessful for all of these commands. That said, the CPU load did go up when the kernel above was run, so the target region did execute, but on the CPU rather than the GPU.

My speculation is that I missed some steps setting up the compiler/libraries and the offloading did not happen correctly. Or it could be that reductions are not supported across teams (as stated here: https://clang.llvm.org/docs/OpenMPSupport.html).

In the end, I would like to ask two questions:

  1. What is a good LLVM-based compiler candidate for testing OpenMP GPU offloading? Should clang-ykt be used instead of clang trunk?

  2. What is the recommended procedure for compiler and linker flags to build programs with GPU offloading? Maybe I am not searching correctly, but I was not able to find any documentation on how this is supposed to be done. Additionally, will the compiler show a warning if offloading to the GPU is unsuccessful?

Thanks for your help!

Sincerely,
Qiongsi

Hi Qiongsi,

> Hi, OpenMP dev community!
>
> Recently I tried setting up the OpenMP benchmarks from SPEC ACCEL and testing them with clang, but I ran into several difficulties.
>
> The core of the issue is that I was not able to get the workload onto the GPUs. I wrote the following small test:
>
> //////////////////////////////////////////////////////////////////////////////////////////////////////////////
>
> #define DATATYPE unsigned long long
>
> /*gpu offload openmp*/
> DATATYPE reduce_gpu_omp(DATATYPE *arr, size_t size) {
>     DATATYPE result = IDENTITY;
> #pragma omp target data map(tofrom:arr[:size]) map(tofrom:result)
>     {
> #pragma omp target teams distribute parallel for reduction(+:result) schedule(static, 1)
>         for (size_t i = 0; i < size; i++) {
>             result += arr[i];
>         }
>     }
>     return result;
> }
>
> //////////////////////////////////////////////////////////////////////////////////////////////////////////////
When compiling your code but leaving out the reduction, I can execute it on a GPU. With the reduction, the code seems to hang for me. (Posting a full compile-able example next time would be preferred!)

This is how I compiled:

clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o reduce reduce.c

To see whether the code is actually executed on the device, you can add this to the loop for debugging:

if (i==0) printf("omp_is_initial_device=%i\n", omp_is_initial_device());

> I compiled it with clang trunk using the following commands:
>
> clang -O3 -fopenmp -omptargets=nvptx64sm_35-nvidia-linux -Wall -o reduce reduce.c
> clang -O3 -fopenmp -omptargets=nvptx64sm_35-nvidia-linux-cuda -Wall -o reduce reduce.c
> clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall -o reduce reduce.c
>
> Offloading to the GPU was unsuccessful for all of these commands. That said, the CPU load did go up when the kernel above was run, so the target region did execute, but on the CPU rather than the GPU.
>
> My speculation is that I missed some steps setting up the compiler/libraries and the offloading did not happen correctly. Or it could be that reductions are not supported across teams (as stated here: https://clang.llvm.org/docs/OpenMPSupport.html).
>
> In the end, I would like to ask two questions:
>
> 1. What is a good LLVM-based compiler candidate for testing OpenMP GPU
>    offloading? Should clang-ykt be used instead of clang trunk?

I used clang trunk, compiled like:

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$INSTALL \
       -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 \
       -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60 \
       $SRC

> 2. What is the recommended procedure for compiler and linker flags to
>    build programs with GPU offloading? Maybe I am not searching
>    correctly, but I was not able to find any documentation on how this
>    is supposed to be done. Additionally, will the compiler show a
>    warning if offloading to the GPU is unsuccessful?

Successful vs. unsuccessful offloading is a runtime decision. Once this patch lands, you will get an error and execution will abort:

https://reviews.llvm.org/D50522

Best
Joachim

Hi Alexey!

Thanks for the clarification! I am moving on to testing stencil computations instead.

Sincerely,

Qiongsi

Hi Joachim!

Thanks for your help! I had missed the cmake flags for OpenMP targets when building. Additionally, I found that my libelf was not installed properly. After rebuilding clang and removing the reduction code, offloading was successful.

The debugging code you suggested works like a charm!

Sincerely,

Qiongsi