Hi,
I’m new to PGI directives and may be doing something stupid.
The Fortran test code below populates a 3D array with the number 5 using nested loops. I accelerate it in three ways, putting “!$acc region” round the inner loop (Case 1), the middle loop (Case 2) or the outer loop (Case 3).
The inner and outer cases work correctly. When I put “!$acc region” round the middle loop, however, only b(:,1,:) seems to be populated rather than b(:,:,:). What is wrong?
I am using: “pgf90 10.5-0 64-bit target on x86-64 Linux -tp nehalem-64”
Thanks for any help,
Alistair.
PROGRAM test
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

!!$ CASE 1 ****************************************
  b(:,:,:) = 0
  DO k = 1,N
    DO j = 1,N
!$acc region
      DO i = 1,N
        b(i,j,k) = 5
      ENDDO
!$acc end region
    ENDDO
  ENDDO
  PRINT '(/,"Case ",I1)',1
  DO i = 1,N
    PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

!!$ CASE 2 ****************************************
  b(:,:,:) = 0
  DO k = 1,N
!$acc region
    DO j = 1,N
      DO i = 1,N
        b(i,j,k) = 5
      ENDDO
    ENDDO
!$acc end region
  ENDDO
  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
    PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

!!$ CASE 3 ****************************************
  b(:,:,:) = 0
!$acc region
  DO k = 1,N
    DO j = 1,N
      DO i = 1,N
        b(i,j,k) = 5
      ENDDO
    ENDDO
  ENDDO
!$acc end region
  PRINT '(/,"Case ",I1)',3
  DO i = 1,N
    PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO
END PROGRAM test
The compiler report is:
pgf90 test.F90 -ta=nvidia -Minfo=accel
test:
     17, Generating copyout(b(1:10,j,k))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, !$acc do parallel, vector(10)
             CC 1.0 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     35, Generating copyout(b(1:10,1:10,k))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     36, Loop is parallelizable
     37, Loop is parallelizable
         Accelerator kernel generated
         36, !$acc do parallel, vector(10)
         37, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
     53, Generating copyout(b(1:10,1:10,1:10))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     54, Loop is parallelizable
     55, Loop is parallelizable
     56, Loop is parallelizable
         Accelerator kernel generated
         54, !$acc do parallel, vector(4)
         55, !$acc do parallel, vector(4)
         56, !$acc do vector(10)
             CC 1.0 : 7 registers; 24 shared, 20 constant, 0 local memory bytes; 83 occupancy
             CC 1.3 : 7 registers; 24 shared, 20 constant, 0 local memory bytes; 93 occupancy
and the output is:
./a.out
Case 1
1 5 5
2 5 5
3 5 5
4 5 5
5 5 5
6 5 5
7 5 5
8 5 5
9 5 5
10 5 5
Case 2
1 5 0
2 5 0
3 5 0
4 5 0
5 5 0
6 5 0
7 5 0
8 5 0
9 5 0
10 5 0
Case 3
1 5 5
2 5 5
3 5 5
4 5 5
5 5 5
6 5 5
7 5 5
8 5 5
9 5 5
10 5 5
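
The printout above only samples b(i,1,1) and b(i,2,1), so it cannot show exactly which planes were written. A fuller check could be sketched as below. This is a standalone host-only sketch, not something I have run through the accelerator: the program name check_planes is made up, and the array contents are hard-coded to mimic the Case 2 symptom (only j = 1 populated) as an assumption. In the real test the same COUNT lines would go after each case instead of the hard-coded assignment.

```fortran
PROGRAM check_planes
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N), j

  ! Assumption: stand-in for the Case 2 result, with only the j = 1 plane set.
  b(:,:,:) = 0
  b(:,1,:) = 5

  ! For each j-plane, count the elements that did not receive the value 5.
  DO j = 1,N
    PRINT '("j = ",I2,"  unset elements: ",I4)', j, COUNT(b(:,j,:) /= 5)
  ENDDO

  ! Count over the whole array; zero would mean every element was populated.
  PRINT '("total unset: ",I5)', COUNT(b /= 5)
END PROGRAM check_planes
```

Counting per plane distinguishes "only j = 1 is written" from other partial-copy patterns, which the two-column printout cannot do.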