I have the following example:
#include <omp.h>
#include <cstdio>   // printf
#include <cstdlib>  // malloc, free
#include <iostream>
#include <algorithm>
int main() {
// Initialize
int npts = 4096;
float* PA = (float*)malloc(npts * sizeof(float));
for (int i = 0; i < npts; i++) {
PA[i] = 0.0f;
}
#pragma omp target teams map(to: npts) reduction(+:PA[0:npts])
{
int iteam = omp_get_team_num();
if (iteam == 0) {
int nteams = omp_get_num_teams();
printf("Running %d teams\n", nteams);
}
// Without this pragma line the program works
#pragma omp distribute parallel for
for (int i = 0; i < npts; ++i) {
PA[i] = 1.0f;
}
}
// Print content of 10 first elements of array
std::cout << std::endl;
std::cout << "PA: " << std::endl;
for (int i = 0; i < std::min(npts, 10); i++) {
std::cout << PA[i] << std::endl;
}
std::cout << std::endl;
free(PA);
return 0;
}
It compiles just fine (nvc++ -g -mp=gpu example.cpp), but when executed I get the following error:
Running 80 teams
Failing in Thread:1
Accelerator Fatal Error: call to cuMemcpyDtoHAsync returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
File: /path/to/example.cpp
Function: main:6
Line: 29
I am new to OpenMP device/GPU programming, so maybe I am approaching this the wrong way. Let me explain the motivation for the example:
I have an algorithm which is, in practice, a big reduction onto an array (here: PA). Each array element receives contributions from a huge number of places, i.e. PA[i] += something.
I am trying to implement this with the outer omp target teams construct, which seems to start 80 teams on my GPU. I then want each of these 80 teams to distribute some work (computing the elements PA[i]) among a number of threads via the inner omp distribute parallel for, and when this is done, there should be a big reduction of each team's private PA onto the global host array PA.
The global reduction back to the host seems to work: when the inner omp distribute parallel for is removed, each team sets 1.0 in its PA array, and the global array ends up containing 80.0 at the end of the program (with 80 teams). However, when I use omp distribute parallel for, I get the error above.
I had the impression that each thread inside the omp distribute parallel for region could access the memory of its team; is this wrong?
I also tried map(tofrom: PA) on the outer omp target teams construct, but that did not change anything.
Can any of you help me understand this problem? Thanks.
My compiler version:
$ nvc++ --version
nvc++ 24.3-0 64-bit target on x86-64 Linux -tp alderlake
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.