I have the following example:
#include <omp.h>
#include <cstdio>   // printf
#include <cstdlib>  // malloc, free
#include <iostream>
#include <algorithm>
int main() {
// Initialize
int npts = 4096;
float* PA = (float*)malloc(npts * sizeof(float));
for (int i = 0; i < npts; i++) {
PA[i] = 0.0f;
}
#pragma omp target teams map(to: npts) reduction(+:PA[0:npts])
{
int iteam = omp_get_team_num();
if (iteam == 0) {
int nteams = omp_get_num_teams();
printf("Running %d teams\n", nteams);
}
// Without this pragma line the program works
#pragma omp distribute parallel for
for (int i = 0; i < npts; ++i) {
PA[i] = 1.0f;
}
}
// Print content of 10 first elements of array
std::cout << std::endl;
std::cout << "PA: " << std::endl;
for (int i = 0; i < std::min(npts, 10); i++) {
std::cout << PA[i] << std::endl;
}
std::cout << std::endl;
free(PA);
return 0;
}
It compiles just fine (nvc++ -g -mp=gpu example.cpp), but when executed I get the following error:
Running 80 teams
Failing in Thread:1
Accelerator Fatal Error: call to cuMemcpyDtoHAsync returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
File: /path/to/example.cpp
Function: main:6
Line: 29
I am new to OpenMP device/GPU programming, so maybe I am approaching this the wrong way. Let me explain the motivation for the example:
I have an algorithm which is, in practice, a big reduction onto an array (here: PA). Each array element receives contributions from a huge number of places, i.e. PA[i] += something.
I am trying to implement this with the outer omp target teams construct, which seems to start 80 teams on my GPU. I then want each of these 80 teams to distribute some work (computing the elements PA[i]) among a number of threads via the inner omp distribute parallel for, and when this is done, there should be a big reduction of each team's private PA onto the global host array PA.
The global reduction back to the host seems to work: when the inner omp distribute parallel for is removed, each team sets 1.0 in its PA array, and the global array ends up containing 80.0 at the end of the program (with 80 teams). However, when I use omp distribute parallel for, I get the error above.
I had the impression that each thread inside the omp distribute parallel for region could access the memory of its team; is this wrong?
I also tried map(tofrom: PA) on the outer omp target teams construct, but that did not change anything.
Can any of you help me understand this problem? Thanks.
My compiler version:
$ nvc++ --version
nvc++ 24.3-0 64-bit target on x86-64 Linux -tp alderlake
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.