I am using OpenACC to test matrix multiplication. The code is:
void gputest(float *a,float *b,float *c)
{
int i,j,k;
#pragma acc data copy(c[:N*N]) copyin(a[:N*N],b[:N*N])
{
#pragma acc parallel for
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
for(k=0;k<N;k++)
{
c[i*N+j]+=a[i*N+k]*b[k*N+j];
}
}
}
}
}
I compiled the code with "pgcc -o matric-2 matric-2.c -ta=nvidia,cc2.0 -Minfo" and got the following output:
52, Generating copyin(b[:9999])
Generating copyin(a[:9999])
Generating copy(c[:9999])
54, Generating compute capability 2.0 binary
56, Loop carried dependence of '*(c)' prevents parallelization
Loop carried backward dependence of '*(c)' prevents vectorization
58, Loop is parallelizable
#pragma acc loop gang, vector(96) /* blockIdx.x threadIdx.x */
61, Complex loop carried dependence of '*(c)' prevents parallelization
Loop carried dependence of '*(c)' prevents parallelization
Loop carried backward dependence of '*(c)' prevents vectorization
Inner sequential loop scheduled on accelerator
#pragma acc loop seq(96)
Cached references to size [195] block of 'a'
CC 2.0 : 23 registers; 788 shared, 44 constant, 0 local memory bytes; 50% occupancy.
What should I do to solve this problem?
Thanks!
Hi wcj0626,
What should I do to solve this problem?
There are two problems here. First, since you don’t use the “restrict” keyword, the compiler must assume that the a, b, and c pointers could point at the same location in memory. This will prevent parallelization. To fix, either add the restrict keyword or use the flag “-Msafeptr”.
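The second issue is the use of a computed index. The compiler cannot always prove that a computed index such as i*N+j refers to a different element on every iteration, so you need to add the “independent” clause to tell it that the loop iterations do not depend on each other.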
Hope this helps,
Mat
$ cat test2.c
#define N 1024
void gputest(float * restrict a,float * restrict b,float * restrict c)
{
int i,j,k;
#pragma acc data copy(c[:N*N]) copyin(a[:N*N],b[:N*N])
{
#pragma acc region for independent
for(i=0;i<N;i++)
{
#pragma for independent
for(j=0;j<N;j++)
{
#pragma for independent
for(k=0;k<N;k++)
{
c[i*N+j]+=a[i*N+k]*b[k*N+j];
}
}
}
}
}
$ pgcc -ta=nvidia test2.c -c -Minfo=accel
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (test2.c: 8)
gputest:
6, Generating copyin(b[:1048575])
Generating copyin(a[:1048575])
Generating copy(c[:1048575])
9, Loop is parallelizable
12, Loop is parallelizable
14, Complex loop carried dependence of '*(c)' prevents parallelization
Loop carried dependence of '*(c)' prevents parallelization
Loop carried backward dependence of '*(c)' prevents vectorization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
9, #pragma acc for parallel, vector(16) /* blockIdx.y threadIdx.y */
12, #pragma acc for parallel, vector(16) /* blockIdx.x threadIdx.x */
14, #pragma acc for seq(16)
Cached references to size [16399] block of 'a'
Cached references to size [16399] block of 'b'
PGC/x86-64 Linux 12.3-0: compilation completed with warnings
As I see it, after you compile the code there is still a “prevents parallelization” message:
…
6, Generating copyin(b[:1048575])
Generating copyin(a[:1048575])
Generating copy(c[:1048575])
9, Loop is parallelizable
12, Loop is parallelizable
14, Complex loop carried dependence of '*(c)' prevents parallelization
Loop carried dependence of '*(c)' prevents parallelization
Loop carried backward dependence of '*(c)' prevents vectorization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
…
Hi wcj0626,
It was my mistake to put the “independent” clause on the innermost loop. The compiler is correct to make the innermost loop sequential, since the same element of c is updated on every iteration of k. Only the outer two loops are parallelizable.
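For reference, here is a minimal sketch of how that conclusion could be written down. It is not taken from the posts above: it assumes OpenACC-style directives ("kernels loop", "loop independent", "loop seq") and the OpenACC start:length form for the data clauses, uses a small N so it can be checked quickly, and introduces a local accumulator named sum purely for illustration. The outer two loops are marked independent, while the k loop is left sequential because every k iteration updates the same element of c.

```c
#define N 64  /* assumption: small size for a quick check; the thread uses 1024 */

/* restrict promises the compiler that a, b, and c never overlap. */
void gputest(const float *restrict a, const float *restrict b, float *restrict c)
{
    int i, j, k;
#pragma acc data copy(c[0:N*N]) copyin(a[0:N*N], b[0:N*N])
    {
#pragma acc kernels loop independent
        for (i = 0; i < N; i++) {
#pragma acc loop independent
            for (j = 0; j < N; j++) {
                /* Each (i, j) pair owns exactly one element of c, so the
                   i and j iterations do not interfere with each other. */
                float sum = c[i * N + j];
#pragma acc loop seq
                for (k = 0; k < N; k++) {
                    /* All k iterations accumulate into the same value,
                       so this loop stays sequential. */
                    sum += a[i * N + k] * b[k * N + j];
                }
                c[i * N + j] = sum;
            }
        }
    }
}
```

Accumulating into a local scalar and storing it once also means c is not rewritten on every k iteration. A compiler without OpenACC support should simply ignore the pragmas and run the loops serially, which is a convenient way to check the result on the host.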