Problem accelerating nested arrays

Hi,

I’m new to PGI directives and may be doing something stupid.

The Fortran test code below populates a 3d array with the number 5 using nested loops. I accelerate it in three ways, placing the accelerator region around the inner loop (case 1), the middle loop (case 2) or the outer loop (case 3).

The outer and inner cases work correctly. When I put “!$acc region” round the middle loop, however, I only seem to populate b(:,1,:) rather than b(:,:,:). What is wrong?

I am using: “pgf90 10.5-0 64-bit target on x86-64 Linux -tp nehalem-64”

Thanks for any help,

Alistair.

PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

!!$ CASE 1 ****************************************                            

  b(:,:,:) = 0

  DO k = 1,N
     DO j = 1,N
!$acc region                                                                   
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
!$acc end region                                                               
     ENDDO
  ENDDO

  PRINT '(/,"Case ",I1)',1
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

!!$ CASE 2 ****************************************                            

  b(:,:,:) = 0

  DO k = 1,N
!$acc region                                                                   
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region                                                               
  ENDDO

  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

!!$ CASE 3 ****************************************                            

  b(:,:,:) = 0

!$acc region                                                                   
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
  ENDDO
!$acc end region                                                               

  PRINT '(/,"Case ",I1)',3
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

END PROGRAM test

The compiler report is:

 pgf90 test.F90 -ta=nvidia -Minfo=accel
test:
     17, Generating copyout(b(1:10,j,k))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, !$acc do parallel, vector(10)
             CC 1.0 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     35, Generating copyout(b(1:10,1:10,k))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     36, Loop is parallelizable
     37, Loop is parallelizable
         Accelerator kernel generated
         36, !$acc do parallel, vector(10)
         37, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
     53, Generating copyout(b(1:10,1:10,1:10))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     54, Loop is parallelizable
     55, Loop is parallelizable
     56, Loop is parallelizable
         Accelerator kernel generated
         54, !$acc do parallel, vector(4)
         55, !$acc do parallel, vector(4)
         56, !$acc do vector(10)
             CC 1.0 : 7 registers; 24 shared, 20 constant, 0 local memory bytes; 83 occupancy
             CC 1.3 : 7 registers; 24 shared, 20 constant, 0 local memory bytes; 93 occupancy

and the output is

./a.out 

Case 1
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5

Case 2
            1            5            0
            2            5            0
            3            5            0
            4            5            0
            5            5            0
            6            5            0
            7            5            0
            8            5            0
            9            5            0
           10            5            0

Case 3
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5

Hi Alistair,

For the second case, add “copy(b)” to “!$acc region”.
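
As a minimal sketch, this is case 2 from your test program with the clause added. The program name test_copy is only for illustration; everything else is taken from your code. The copy(b) clause asks for the whole of b to be copied to the GPU on entry to the region and back to the host on exit:

```fortran
PROGRAM test_copy

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

  b(:,:,:) = 0

  DO k = 1,N
! copy(b): copy all of b to the GPU on entry and back to the host on exit
!$acc region copy(b)
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region
  ENDDO

  ! Same check as in the original test
  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

END PROGRAM test_copy
```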

Hope this helps,
Mat

Thanks. This works.

Will future versions of the compiler recognise the need for this clause automatically?

Cheers,

Alistair.

Hi Alistair,

“Will future versions of the compiler recognise the need for this clause automatically?”

I submitted a problem report (TPR#17096).

Note that, for performance reasons, the first and second cases are poor choices. On every iteration of the enclosing loops, data for the b array has to be copied to and from the GPU. Copying data is very slow, so it should be avoided whenever possible.
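
To put rough numbers on this, here is a back-of-the-envelope sketch for the N = 10 test. It assumes that the default INTEGER is 4 bytes and that every entry to a region performs the transfers listed in the -Minfo output, with case 2 taken with the copy(b) clause added. The program and variable names are only for illustration, and the fixed overhead of each transfer and kernel launch is not modelled:

```fortran
PROGRAM transfer_estimate

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER, PARAMETER :: INT_BYTES = 4   ! assumption: 4-byte default INTEGER
  INTEGER :: regions, bytes

  ! Case 1: region inside the k and j loops, copyout(b(1:N,j,k)) each time
  regions = N*N
  bytes   = regions * N * INT_BYTES
  PRINT *, 'Case 1: regions =', regions, ' bytes moved =', bytes

  ! Case 2 with copy(b): region inside the k loop, all of b in and out each time
  regions = N
  bytes   = regions * 2 * N*N*N * INT_BYTES
  PRINT *, 'Case 2: regions =', regions, ' bytes moved =', bytes

  ! Case 3: one region around all the loops, copyout of all of b once
  regions = 1
  bytes   = regions * N*N*N * INT_BYTES
  PRINT *, 'Case 3: regions =', regions, ' bytes moved =', bytes

END PROGRAM transfer_estimate
```

Under these assumptions, case 1 moves the same 4000 bytes as case 3 but splits them across 100 separate regions, while case 2 with copy(b) moves 80000 bytes, 20 times as much.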

In scenarios where you do need to put an accelerator region within a loop, try to use the ‘data region’ directives to move the copies outside the loop. For example:

% cat b.f90
PROGRAM test

  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

!!$ CASE 2 ****************************************
  b(:,:,:) = 0

!$acc data region copyout(b)
  DO k = 1,N
!$acc region
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region
  ENDDO
!$acc end data region

  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO


END PROGRAM test
% pgf90 -ta=nvidia b.f90 -V10.6 -Minfo=accel -fast
test:
     10, Generating copyout(b(:,:,:))
     12, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     13, Loop is parallelizable
     14, Loop is parallelizable
         Accelerator kernel generated
         13, !$acc do parallel, vector(10)
         14, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
% a.out

Case 2
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5
- Mat

Thanks for the prompt and informative reply.

I did find that it was most efficient to put the acc region around all the loops, but I was interested to see what was possible if I wanted to place it elsewhere.

Cheers,

Alistair.

FYI TPR#17096 will be fixed in version 10.8.

- Mat