FATAL ERROR at run time

PLindes · December 18, 2014, 1:26am

After much effort I finally found a machine here that has an NDVIDIA GPU and a PGI compiler on it that I can use. I’ve gotten my code set up with a parallel loop and that seems to compile OK. The compiler output looks like this:

bash-4.1$ pgCC -acc -Minfo=accel -ta=nvidia,time placc.cpp
Task::ProcessData(double *, double *, double *, int *, int):
217, Generating copy(modelXs[:15])
Generating copy(modelYs[:15])
Generating copy(dataMap[:20000][:15])
Generating copyin(dataXs[:20000])
Generating copyin(dataYs[:20000])
232, Generating present_or_create(this[:])
Generating Tesla code
236, Loop is parallelizable
Accelerator kernel generated
236, #pragma acc loop gang, vector(128) / blockIdx.x threadIdx.x */
244, Accelerator scalar kernel generated
bash-4.1$

However, when I try to run the output file, it goes through all the prelimary code just fine and then gives me:

FATAL ERROR: data in PRESENT clause was not found on device 1: name=_T24849248
file:/afs/umich.edu/user/p/l/plindes/Desktop/acc/Taskacc.h ProcessData__4TaskFPdN21PPii line:232

Accelerator Kernel Timing data
/afs/umich.edu/user/p/l/plindes/Desktop/acc/Taskacc.h
ProcessData__4TaskFPdN21PPii NVIDIA devicenum=0
time(us): 54,946
217: data region reached 1 time
30: data copyin transfers: 20000
device time(us): total=29,312 max=62 min= avg=
30: kernel launched 20000 times
grid: [1] block: [128]
device time(us): total=25,617 max=30 min= avg=
elapsed time(us): total=168,461 max=315 min=6 avg=8
217: data copyin transfers: 4
device time(us): total=17 max=8 min= avg=4
232: data region reached 1 time
bash-4.1$

I’ve tried searching on the web for what this error means, but the only thing I found was that supposedly it was fixed in version 12.4 of the compiler. I’m running version 14.4.

Does anyone have an idea of what the problem might be?

MatColgrove · December 18, 2014, 2:05am

Hi PLindes,

It’s probably the “this” pointer. I typically put the following in my constructors to get the this pointer over on the device:

#pragma acc enter data create(this)

Although you reference class members by their variable name, they’re implicitly referenced as “this->member”. So the this pointer needs to be created before the member can be accessed.

If you can post or send to PGI Customer Service ([email protected]) a reproducing example, I should be able to help you fix the issue.

Hope this helps,
Mat

PLindes · December 18, 2014, 4:35am

Thanks for the suggestion, Mat. I tried adding that and now I don’t get the FATAL ERROR, but I do get another run time error that’s about as bad. Here is the section of code I’m working on:

void ProcessData(double *restrict dataXs, double *restrict dataYs,
double *restrict modelYs, int **restrict dataMap,
int nToAssign) {
#pragma acc data copy(modelXs[0:N_MODELS], modelYs[0:N_MODELS], dataMap[0:N_INSTANCES][0:N_MODELS])
copyin(dataXs[0:N_INSTANCES], dataYs[0:N_INSTANCES])
#pragma acc enter data create(this)
{
//printf(“Assign instances 0 to %d.\r\n”, nToAssign);
// Assign all the instances for this pass
for (int i = 0; i < nToAssign; i++) {
// Get the details of this instance
// DON’T get the ID, this is unsupervised learning
double X = dataXs_;
double Y = dataYs;
// Loop through all the models (except the reference)
for (int j = 1; j < N_MODELS; j++) {
// Find the cluster centroid for this model
// that is closest to this data instance
int K = K_MIN + j - 1;
int base = j * K_MAX;
#pragma acc kernels
{
double minDistance2 = MAX_SQR_DISTANCE;
int minID = -1;
#pragma acc for private(minDistance2, minID)
for (int k = 0; k < K; k++) {
double deltaX = X - modelXs[base + k];
double deltaY = Y - modelYs[base + k];
double distance2 = deltaX * deltaX + deltaY * deltaY;
if (distance2 < minDistance2) {
minDistance2 = distance2;
minID = k + 1;
}
}
// Assign the instance to the closest cluster
dataMap[j] = minID;
}
}
}
// Now update the clusters
for (int i = 0; i < nToAssign; i++) {
}
}
}

Here is what the compiler says for this:

bash-4.1$ pgCC -acc -Minfo=accel -ta=nvidia,time placc.cpp
Task::ProcessData(double *, double *, double *, int *, int):
218, Generating copy(modelXs[:15])
Generating copy(modelYs[:15])
Generating copy(dataMap[:20000][:15])
Generating copyin(dataXs[:20000])
Generating copyin(dataYs[:20000])
Generating enter data create(this[:1])
233, Generating present_or_create(this[:])
Generating Tesla code
237, Loop is parallelizable
Accelerator kernel generated
237, #pragma acc loop gang, vector(128) / blockIdx.x threadIdx.x */
245, Accelerator scalar kernel generated

And when I run it I get:

bash-4.1$ ./a.out
… a bunch of irrelevant stuff
call to cuStreamSynchronize returned error 716: Misaligned address

Accelerator Kernel Timing data
/afs/umich.edu/user/p/l/plindes/Desktop/acc/Taskacc.h
ProcessData__4TaskFPdN21PPii NVIDIA devicenum=0
time(us): 57,409
218: data region reached 2 times
30: data copyin transfers: 20000
device time(us): total=30,562 max=36 min= avg=
30: kernel launched 20000 times
grid: [1] block: [128]
device time(us): total=26,815 max=29 min= avg=
elapsed time(us): total=170,969 max=305 min=6 avg=8
218: data copyin transfers: 4
device time(us): total=17 max=8 min= avg=4
233: data region reached 1 time
233: compute region reached 1 time
237: kernel launched 1 time
grid: [1] block: [128]
device time(us): total=15 max=15 min=15 avg=15
elapsed time(us): total=21 max=21 min=21 avg=21
bash-4.1$

Does this give you any more clues?

Thanks a lot,

Peter_

MatColgrove · December 18, 2014, 4:47pm

Hi Peter,

What happens is when the “this” pointer is created, space for the data members are created but not initialized. You can do a “pcopy” instead of “create” then the values of the scalar data members will be copied over, however in the case of pointers, the value will be that of the host pointer.

Next when you create (or copy) a data member that’s an array, space is created for the array and then “attached” to the this pointer (i.e. the data member pointer value is set to the device pointer’s address). Now when the data member is accessed via the implicit dereference, i.e. “this->data_”, the address is correct. Hence, order matters.

Try the following:_

void ProcessData(double *restrict dataXs, double *restrict dataYs, 
 double *restrict modelYs, int **restrict dataMap, 
 int nToAssign) { 

#pragma acc enter data copy(this) 
#pragma acc data copy(modelXs[0:N_MODELS], modelYs[0:N_MODELS], dataMap[0:N_INSTANCES][0:N_MODELS]) \ 
 copyin(dataXs[0:N_INSTANCES], dataYs[0:N_INSTANCES]) 

{ 
...
// Assign the instance to the closest cluster 
	dataMap[i][j] = minID; 
} 
	} 
	} 
#pragma acc exit data delete(this)
...

Note that in this example, you’re creating the data on the device every time this routine is call. I would suggest instead move the create/copy of the “this” pointer into the constructor and the corresponding delete into the destructor. I would also use an unstructured data region to create/delete the data member arrays at same time they are allocated on the host. Then finally, change the structured data region’s copy you have in the code to instead use the update directive.

This way as you expand the use of compute regions, you can leave the data on the device and not have to copy it back and forth. You would then replace the “updates” with “present” clauses, and then only use the “updates” when you need to synchronize with the host. I will typically create a “acc_update_host” and “acc_update_device” methods in my classes where I encapsulate the update directives.

Here’s a simple example to help illustrate this:

% cat simple1.cpp
 #include <iostream>
 class foo {
  protected:
  int * data;
  int size;
  int factor;
  public:
  foo() {
   size = 32;
   factor=1;
   data = new int[size];
   #pragma acc enter data create(this)
   #pragma acc update device(this)
// In 14.7 or later the above two directives can be replaced by
// #pragma acc enter data copy(this)
   #pragma acc enter data create(data[0:size])
  }
  ~foo() {
    #pragma acc exit data delete(data[0:size])
    #pragma acc exit data delete(this)
  }

  void setfactor(int fac) {
    factor=fac;
    #pragma acc update device(factor)
  }
  int getfactor() {
    return factor;
  }

  void setdata() {
   #pragma acc data present(data)
   {
    #pragma acc kernels loop independent
    for (int i=0; i < size; ++i) {
     data[i] = i+getfactor();
    }
   }
  }
  void printdata() {
   for (int i=0; i < size; ++i) {
    std::cout << data[i] << " ";
   }
   std::cout << std::endl;
  }
#ifdef _OPENACC
  void acc_update_host() {
    #pragma acc update host(data[0:size])
  }
  void acc_update_device() {
    #pragma acc update device(data[0:size])
  }
#endif

 };

 int main () {
  foo A;
  A.setfactor(2);
  A.setdata();
#ifdef _OPENACC
  A.acc_update_host();
#endif
  A.printdata();
  return 0;
 }

% pgc++ simple1.cpp  -V14.4 -o cpu.out; cpu.out
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
% pgc++ simple1.cpp -acc -Minfo=accel -V14.4 -o acc.out ; acc.out
foo::foo():
     17, Generating enter data create(this[:1])
         Generating update device(this[:1])
         Generating enter data create(data[:size])
foo::setfactor(int):
     26, Generating update device(factor)
foo::getfactor():
     27, Generating implicit acc routine seq
         Generating Tesla code
foo::setdata(int):
     33, Generating present(data[:])
         Generating present_or_copy(this[:])
         Generating Tesla code
     35, Conditional loop will be executed in scalar mode
         Accelerator scalar kernel generated
foo::acc_update_host():
     49, Generating update host(data[:size])
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

Note that 14.4 was the first release that included support for the “this” pointer and unstructured data regions. There were a few bugs in this initial efforts, including the one I note in the comments, but later releases have hardened support.

If you get a chance to attend the NVIDIA GTC conference next March (http://www.gputechconf.com/), I have session on OpenACC and C++ which will discuss these issues.

Also, if you scroll down to the tutorials section on our OpenACC page (PGI Compilers with OpenACC | PGI), you can find an hour long presentation by Michael Wolfe on using C++ and OpenACC.

Hope this helps,
Mat

PLindes · December 18, 2014, 9:50pm

Hi Mat,

Thanks a lot for all this good instruction, it is very helpful.

I have applied your first idea about this. I put:

#pragma acc enter data create(this)

before the #pragma acc data copy … and a

#pragma acc exit data delete(this)

at the end and now it runs with no errors! However, in your post it said to do

#pragma acc enter data [u]copy/u

and this does not work. The compiler seems to like create there but not copy.

The second part of your post with the long example looks like it has a lot of good ideas. I haven’t had a chance to try it yet, and I’m not sure when I will since I already turned in the report for my class. But I will work on getting some time to take this further.

One more question: with the -ta=nvidia,time compiler option I get a big report of run-time statistics that looks very interesting. Is there a manual somewhere that tells how to interpret all that?

Thanks a lot for your help, you saved my life! I got the first run that worked an hour before I had to turn in my final report, so I was barely able to squeeze it in. What a relief!

Thanks,

Peter

MatColgrove · December 18, 2014, 10:12pm

at the end and now it runs with no errors! However, in your post it said to do
#pragma acc enter data copy(this)
and this does not work. The compiler seems to like create there but not copy.

Yes, this is one of the bugs I mentioned that was fixed in 14.7. For 14.4, you need to use create (and then update if you have any scalars that need initializing).

One more question: with the -ta=nvidia,time compiler option I get a big report of run-time statistics that looks very interesting. Is there a manual somewhere that tells how to interpret all that?

Section 7.14 of the PGI User’s Guide: PGI Documentation Archive for Versions Prior to 17.7

Note that the “time” option is deprecated. Instead set the environment variable “PGI_ACC_TIME=1” to get the profile information. That way you don’t need to recompile if you want to disable profiling.

Thanks a lot for your help, you saved my life! I got the first run that worked an hour before I had to turn in my final report, so I was barely able to squeeze it in. What a relief!

You’re welcome. I’m glad I could help.

Mat

Topic		Replies	Views
Accelerator Fatal Error: No NVIDIA/CUDA version... Legacy PGI Compilers	12	14666	May 15, 2017
PGF90-W-0155-Compiler failed ... with PGI 12.4 Legacy PGI Compilers	17	11288	August 30, 2012
OpenACC (Fortran): Descriptor partially present? Legacy PGI Compilers	8	5900	February 23, 2018
compiler ask acc routine information for internal function Legacy PGI Compilers	12	20316	October 25, 2017
Launch of the kernel Legacy PGI Compilers	4	2865	October 18, 2017
cuModuleLoadData error 209 Legacy PGI Compilers	7	16296	February 10, 2015
Unknown 8GB memory getting allocated on GPU Legacy PGI Compilers	12	9671	December 7, 2020
Fatal Error in ACC kernels Legacy PGI Compilers	12	11064	September 12, 2013
Code crashing with 18.7 - worked with 18.4 Legacy PGI Compilers	6	4439	February 28, 2019
Vector array assignments within a $acc parallel region Legacy PGI Compilers	13	10950	November 27, 2013

FATAL ERROR at run time

Related topics