prevent parallelization

wcj06261 · March 20, 2012, 3:21am

I am using OpenACC to test matrix multiplication. The code is:
void gputest(float *a,float *b,float *c)
{
int i,j,k;
#pragma acc data copy(c[:N*N]) copyin(a[:N*N],b[:N*N])
{
#pragma acc parallel for
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
for(k=0;k<N;k++)
{
c[i*N+j]+=a[i*N+k]*b[k*N+j];
}
}
}
}
}
I compile the code with "pgcc -o matric-2 matric-2.c -ta=nvidia,cc2.0 -Minfo ", the output is :
52, Generating copyin(b[:9999])
Generating copyin(a[:9999])
Generating copy(c[:9999])
54, Generating compute capability 2.0 binary
56, Loop carried dependence of '*(c)' prevents parallelization
Loop carried backward dependence of '*(c)' prevents vectorization
58, Loop is parallelizable
#pragma acc loop gang, vector(96) /* blockIdx.x threadIdx.x */
61, Complex loop carried dependence of '*(c)' prevents parallelization
Loop carried dependence of '*(c)' prevents parallelization
Loop carried backward dependence of '*(c)' prevents vectorization
Inner sequential loop scheduled on accelerator
#pragma acc loop seq(96)
Cached references to size [195] block of ‘a’
CC 2.0 : 23 registers; 788 shared, 44 constant, 0 local memory bytes; 50% occupancy.

What should I do to solve this problem?
Thanks!

MatColgrove · March 20, 2012, 4:32pm

Hi wcj0626,

What should I do to solve this problem?

There are two problems here. First, since you don’t use the “restrict” keyword, the compiler must assume that the a, b, and c pointers could point at the same location in memory. This will prevent parallelization. To fix, either add the restrict keyword or use the flag “-Msafeptr”.
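
To illustrate why possible aliasing forces the compiler to be conservative, here is a small plain-C sketch. The function names `add_one` and `add_one_restrict` are hypothetical and not part of the original code. When `dst` and `src` overlap, each iteration can read a value written by the previous one, so reordering the iterations would change the result. The `restrict` qualifier is the programmer's promise that no such overlap exists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dst[i] = src[i] + 1 for i in [0, n).
 * Without restrict, the compiler must assume dst and src may overlap,
 * so it has to preserve the sequential order of the iterations. */
void add_one(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1.0f;
}

/* Same loop, but restrict promises that dst and src never overlap,
 * so no iteration can depend on another one. Calling this with
 * overlapping pointers is undefined behavior. */
void add_one_restrict(float *restrict dst, const float *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1.0f;
}
```

For example, with `float buf[4] = {0, 0, 0, 0}`, the call `add_one(buf + 1, buf, 3)` leaves `buf` as `{0, 1, 2, 3}` because every iteration consumes the previous iteration's output. That is exactly the kind of loop-carried dependence the compiler has to assume in `gputest` when the pointers are not marked `restrict`.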

The second issue is the use of a computed index such as i*N+j. The compiler cannot always prove that a computed index refers to a different element in every iteration, so you need to add the “independent” clause to tell it that the iterations do not conflict.
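
As a sketch of what “independent” asserts, consider scaling an n x n matrix stored row-major in a flat array. The function name `scale_matrix` and its parameters are hypothetical. This version uses the standard OpenACC `kernels` and `loop independent` directives rather than the PGI `region for` form; a compiler without OpenACC enabled should simply ignore the pragmas and run the loops sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiply every element of an n x n row-major
 * matrix by s. The index i*n + j is computed at run time, so the
 * compiler may not be able to prove that two iterations never touch
 * the same element. Because 0 <= j < n, distinct (i, j) pairs always
 * give distinct indices, which is what "independent" promises. */
void scale_matrix(float *restrict m, int n, float s)
{
    assert(m != NULL && n >= 0);
    #pragma acc kernels copy(m[0:n*n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; i++) {
            #pragma acc loop independent
            for (int j = 0; j < n; j++)
                m[i*n + j] *= s;
        }
    }
}
```

The clause is a promise, not a check: if two iterations really did write the same element, adding “independent” would silence the diagnostic but produce a race.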

Hope this helps,
Mat

$ cat test2.c 
#define N 1024

void gputest(float * restrict a,float * restrict b,float * restrict c)
{
int i,j,k;
#pragma acc data copy(c[:N*N]) copyin(a[:N*N],b[:N*N])
{
#pragma acc region for independent 
for(i=0;i<N;i++)
{
#pragma for independent
for(j=0;j<N;j++)
{
#pragma for independent
for(k=0;k<N;k++)
{
c[i*N+j]+=a[i*N+k]*b[k*N+j];
}
}
}
}
} 
$ pgcc -ta=nvidia test2.c  -c -Minfo=accel
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (test2.c: 8)
gputest:
      6, Generating copyin(b[:1048575])
         Generating copyin(a[:1048575])
         Generating copy(c[:1048575])
      9, Loop is parallelizable
     12, Loop is parallelizable
     14, Complex loop carried dependence of '*(c)' prevents parallelization
         Loop carried dependence of '*(c)' prevents parallelization
         Loop carried backward dependence of '*(c)' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
          9, #pragma acc for parallel, vector(16) /* blockIdx.y threadIdx.y */
         12, #pragma acc for parallel, vector(16) /* blockIdx.x threadIdx.x */
         14, #pragma acc for seq(16)
             Cached references to size [16399] block of 'a'
             Cached references to size [16399] block of 'b'
PGC/x86-64 Linux 12.3-0: compilation completed with warnings

wcj06261 · March 21, 2012, 2:08am

As I see it, after you compile the code there is still a “prevents parallelization” problem.
…
6, Generating copyin(b[:1048575])
Generating copyin(a[:1048575])
Generating copy(c[:1048575])
9, Loop is parallelizable
12, Loop is parallelizable
14, Complex loop carried dependence of '*(c)' prevents parallelization
Loop carried dependence of '*(c)' prevents parallelization
Loop carried backward dependence of '*(c)' prevents vectorization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
…

MatColgrove · March 22, 2012, 5:26pm

Hi wcj0626,

My mistake in putting the “independent” clause on the innermost loop. The compiler is correct in making the innermost loop sequential, since the same element of c is updated on every iteration of k. Only the outer two loops are parallelizable.
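
A minimal sketch of that arrangement, assuming the standard OpenACC `kernels`, `loop independent`, and `loop seq` directives instead of the PGI `region for` syntax. The function name `matmul_acc`, the run-time size parameter `n`, and the scalar accumulator `sum` are illustrative choices, not taken from the code in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: c += a * b for n x n row-major matrices.
 * Assumption: a, b, and c do not overlap (hence restrict). */
void matmul_acc(const float *restrict a, const float *restrict b,
                float *restrict c, int n)
{
    assert(a != NULL && b != NULL && c != NULL && n >= 0);
    #pragma acc data copy(c[0:n*n]) copyin(a[0:n*n], b[0:n*n])
    {
        #pragma acc kernels
        {
            /* Each (i, j) pair owns exactly one element of c,
             * so the two outer loops can run in parallel. */
            #pragma acc loop independent
            for (int i = 0; i < n; i++) {
                #pragma acc loop independent
                for (int j = 0; j < n; j++) {
                    float sum = c[i*n + j];
                    /* Every k iteration updates the same sum,
                     * so this loop stays sequential. */
                    #pragma acc loop seq
                    for (int k = 0; k < n; k++)
                        sum += a[i*n + k] * b[k*n + j];
                    c[i*n + j] = sum;
                }
            }
        }
    }
}
```

Accumulating into a local scalar and storing once per (i, j) keeps the dependence on c out of the innermost loop body. Without OpenACC enabled the pragmas should be ignored, and the function behaves as an ordinary sequential matrix multiply.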

Mat
