prevent parallelization

I am using OpenACC to test matrix multiplication. The code is:
void gputest(float *a,float *b,float *c)
{
int i,j,k;
#pragma acc data copy(c[:N*N]) copyin(a[:N*N],b[:N*N])
{
#pragma acc parallel for
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
for(k=0;k<N;k++)
{
c[i*N+j]+=a[i*N+k]*b[k*N+j];
}
}
}
}
}
I compile the code with "pgcc -o matric-2 matric-2.c -ta=nvidia,cc2.0 -Minfo", and the output is:
     52, Generating copyin(b[:9999])
         Generating copyin(a[:9999])
         Generating copy(c[:9999])
     54, Generating compute capability 2.0 binary
     56, Loop carried dependence of '*(c)' prevents parallelization
         Loop carried backward dependence of '*(c)' prevents vectorization
     58, Loop is parallelizable
         #pragma acc loop gang, vector(96) /* blockIdx.x threadIdx.x */
     61, Complex loop carried dependence of '*(c)' prevents parallelization
         Loop carried dependence of '*(c)' prevents parallelization
         Loop carried backward dependence of '*(c)' prevents vectorization
         Inner sequential loop scheduled on accelerator
         #pragma acc loop seq(96)
         Cached references to size [195] block of 'a'
         CC 2.0 : 23 registers; 788 shared, 44 constant, 0 local memory bytes; 50% occupancy.

What should I do to solve this problem?
Thanks!

Hi wcj0626,

What should I do to solve this problem?

There are two problems here. First, since you don’t use the “restrict” keyword, the compiler must assume that the a, b, and c pointers could point at the same location in memory. This will prevent parallelization. To fix, either add the restrict keyword or use the flag “-Msafeptr”.
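To illustrate why the compiler has to be conservative without "restrict", here is a minimal sketch in plain C. The function name add_one and the test values are made up for illustration and are not from the original code. If the two pointers overlap, each iteration reads a value written by the previous one, so the loop only gives the serial result when run in order.

```c
#include <assert.h>

/* Hypothetical helper (not from the original post): dst[i] = src[i] + 1.
 * The pointers are deliberately NOT restrict-qualified, so the compiler
 * must allow for dst and src overlapping in memory. */
void add_one(float *dst, const float *src, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        /* If dst == src + 1, this write is read back by iteration i+1:
         * a loop-carried dependence that rules out parallel execution. */
        dst[i] = src[i] + 1.0f;
    }
}
```

Calling add_one(buf + 1, buf, 3) on a buffer {1, 0, 0, 0} leaves {1, 2, 3, 4} when the loop runs in order, whereas iterations that all read the original values would produce {1, 2, 1, 1}. Adding "restrict" (or -Msafeptr) is a promise that such overlapping calls never happen.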

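The second issue is the use of a computed index such as i*N+j. When the index is computed, the compiler cannot always prove that different loop iterations write to different elements, so you need to add the “independent” clause to tell it that the iterations do not depend on each other.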

Hope this helps,
Mat

$ cat test2.c 
#define N 1024

void gputest(float * restrict a,float * restrict b,float * restrict c)
{
int i,j,k;
#pragma acc data copy(c[:N*N]) copyin(a[:N*N],b[:N*N])
{
#pragma acc region for independent 
for(i=0;i<N;i++)
{
#pragma acc for independent
for(j=0;j<N;j++)
{
#pragma acc for independent
for(k=0;k<N;k++)
{
c[i*N+j]+=a[i*N+k]*b[k*N+j];
}
}
}
}
} 
$ pgcc -ta=nvidia test2.c  -c -Minfo=accel
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (test2.c: 8)
gputest:
      6, Generating copyin(b[:1048575])
         Generating copyin(a[:1048575])
         Generating copy(c[:1048575])
      9, Loop is parallelizable
     12, Loop is parallelizable
     14, Complex loop carried dependence of '*(c)' prevents parallelization
         Loop carried dependence of '*(c)' prevents parallelization
         Loop carried backward dependence of '*(c)' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
          9, #pragma acc for parallel, vector(16) /* blockIdx.y threadIdx.y */
         12, #pragma acc for parallel, vector(16) /* blockIdx.x threadIdx.x */
         14, #pragma acc for seq(16)
             Cached references to size [16399] block of 'a'
             Cached references to size [16399] block of 'b'
PGC/x86-64 Linux 12.3-0: compilation completed with warnings

As I see it, after you compile the code there is still a “prevents parallelization” message:

      6, Generating copyin(b[:1048575])
         Generating copyin(a[:1048575])
         Generating copy(c[:1048575])
      9, Loop is parallelizable
     12, Loop is parallelizable
     14, Complex loop carried dependence of '*(c)' prevents parallelization
         Loop carried dependence of '*(c)' prevents parallelization
         Loop carried backward dependence of '*(c)' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated

Hi wcj0626,

My mistake in putting the “independent” clause on the innermost loop. The compiler is correct to make the innermost (k) loop sequential, since the same element of c is updated on every iteration of that loop. Only the outer two loops are parallelizable.

- Mat
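Putting the points from this thread together, a version of the routine might look like the sketch below. This is only an illustration under some assumptions: the function name matmul_acc and the size parameter n are made up (the thread uses a fixed N), and it uses the OpenACC "kernels loop" and "loop" directives rather than the "region for" spelling in the reply above. The "independent" clause is only on the two outer loops, and the k loop accumulates into a local scalar so each element of c is read and written once per (i, j) pair.

```c
#include <assert.h>

/* Hypothetical sketch: c += a * b for n x n row-major matrices.
 * restrict promises the compiler that a, b and c do not overlap. */
void matmul_acc(const float *restrict a, const float *restrict b,
                float *restrict c, int n)
{
    assert(n > 0);
    #pragma acc data copy(c[0:n*n]) copyin(a[0:n*n], b[0:n*n])
    {
        /* Outer two loops: each (i, j) pair writes a distinct c element. */
        #pragma acc kernels loop independent
        for (int i = 0; i < n; i++) {
            #pragma acc loop independent
            for (int j = 0; j < n; j++) {
                float sum = c[i*n + j];
                /* Inner loop stays sequential: every k iteration
                 * updates the same running sum. */
                #pragma acc loop seq
                for (int k = 0; k < n; k++) {
                    sum += a[i*n + k] * b[k*n + j];
                }
                c[i*n + j] = sum;
            }
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the loops serially, which makes it easy to check the results on the host before building for the accelerator.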