Problem accelerating nested arrays

Alistair_Hart · July 6, 2010, 2:53pm

Hi,

I’m new to PGI directives and may be doing something stupid.

The Fortran test code below fills a 3D array with the value 5 using nested loops. I accelerate it in three ways, placing the region around the inner loop (Case 1), the middle loop (Case 2) or the outer loop (Case 3).

The outer and inner cases work correctly. When I put “!$acc region” around the middle loop, however, only b(:,1,:) seems to be populated rather than b(:,:,:). What is wrong?

I am using: “pgf90 10.5-0 64-bit target on x86-64 Linux -tp nehalem-64”

Thanks for any help,

Alistair.

PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

!!$ CASE 1 ****************************************                            

  b(:,:,:) = 0

  DO k = 1,N
     DO j = 1,N
!$acc region                                                                   
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
!$acc end region                                                               
     ENDDO
  ENDDO

  PRINT '(/,"Case ",I1)',1
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

!!$ CASE 2 ****************************************                            

  b(:,:,:) = 0

  DO k = 1,N
!$acc region                                                                   
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region                                                               
  ENDDO

  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

!!$ CASE 3 ****************************************                            

  b(:,:,:) = 0

!$acc region                                                                   
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
  ENDDO
!$acc end region                                                               

  PRINT '(/,"Case ",I1)',3
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

END PROGRAM test

The compiler report is:

 pgf90 test.F90 -ta=nvidia -Minfo=accel
test:
     17, Generating copyout(b(1:10,j,k))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, !$acc do parallel, vector(10)
             CC 1.0 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     35, Generating copyout(b(1:10,1:10,k))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     36, Loop is parallelizable
     37, Loop is parallelizable
         Accelerator kernel generated
         36, !$acc do parallel, vector(10)
         37, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
     53, Generating copyout(b(1:10,1:10,1:10))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     54, Loop is parallelizable
     55, Loop is parallelizable
     56, Loop is parallelizable
         Accelerator kernel generated
         54, !$acc do parallel, vector(4)
         55, !$acc do parallel, vector(4)
         56, !$acc do vector(10)
             CC 1.0 : 7 registers; 24 shared, 20 constant, 0 local memory bytes; 83 occupancy
             CC 1.3 : 7 registers; 24 shared, 20 constant, 0 local memory bytes; 93 occupancy

and the output is

./a.out 

Case 1
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5

Case 2
            1            5            0
            2            5            0
            3            5            0
            4            5            0
            5            5            0
            6            5            0
            7            5            0
            8            5            0
            9            5            0
           10            5            0

Case 3
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5

MatColgrove · July 7, 2010, 4:02pm

Hi Alistair,

For the second case, add “copy(b)” to “!$acc region”.
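A minimal sketch of that change applied to Case 2, assuming the rest of the test program stays as posted. The program name test_copy and the final check are illustrative additions, not part of the original code:

```fortran
PROGRAM test_copy

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

  b(:,:,:) = 0

  DO k = 1,N
! copy(b) asks for the whole array to be moved to the device before
! the region and back to the host after it, on every k iteration
!$acc region copy(b)
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region
  ENDDO

  ! Illustrative check: count the elements that were left unset
  PRINT '(A,I0)', 'elements not equal to 5: ', COUNT(b /= 5)
  IF (ANY(b /= 5)) STOP 1

END PROGRAM test_copy
```

Built the same way as the original (pgf90 -ta=nvidia -Minfo=accel), the count should come out as zero if the clause has the intended effect. Without -ta the directives are treated as comments and the loops simply run on the host.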

Hope this helps,
Mat

Alistair_Hart · July 14, 2010, 11:10am

Thanks. This works.

Will future versions of the compiler recognise the need for this clause automatically?

Cheers,

Alistair.

MatColgrove · July 14, 2010, 4:59pm

Hi Alistair,

“Will future versions of the compiler recognise the need for this clause automatically?”

I submitted a problem report (TPR#17096).

Note that, for performance reasons, the first and second cases are poor choices. On each iteration of the outer loops, the b array has to be copied to and from the GPU. Copying data between the host and the GPU is very slow, so it should be avoided whenever possible.
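As a rough back-of-envelope illustration of that cost, the sketch below compares the bytes moved when the whole of b is copied in and out on every k iteration (a region with copy(b) inside the k loop) against a single copy back after all iterations. It assumes the default INTEGER size reported by BIT_SIZE and ignores per-transfer latency and kernel launch overhead. The program and variable names are hypothetical:

```fortran
PROGRAM transfer_estimate

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER :: bytes_b, per_iter_copy, single_copyout

  ! Size of the whole array b(N,N,N) in bytes for default INTEGER
  bytes_b = N**3 * (BIT_SIZE(0) / 8)

  ! Region inside the k loop with copy(b):
  ! b goes to the device and back on each of the N iterations
  per_iter_copy = N * 2 * bytes_b

  ! Copies hoisted outside the k loop with copyout(b):
  ! b comes back to the host once
  single_copyout = bytes_b

  PRINT '(A,I0)', 'copy(b) inside k loop, bytes moved: ', per_iter_copy
  PRINT '(A,I0)', 'single copyout(b), bytes moved: ', single_copyout

END PROGRAM transfer_estimate
```

With a 4-byte default INTEGER and N = 10 this gives 80000 bytes against 4000, a factor of 2N. The gap grows with N, and the fixed cost of each separate transfer comes on top of it.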

In scenarios where you do need to put an accelerator region within a loop, try to use the ‘data region’ directives to move the copies outside the loop. For example:

% cat b.f90
PROGRAM test

  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

!!$ CASE 2 ****************************************
  b(:,:,:) = 0

!$acc data region copyout(b)
  DO k = 1,N
!$acc region
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region
  ENDDO
!$acc end data region

  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO


END PROGRAM test
% pgf90 -ta=nvidia b.f90 -V10.6 -Minfo=accel -fast
test:
     10, Generating copyout(b(:,:,:))
     12, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     13, Loop is parallelizable
     14, Loop is parallelizable
         Accelerator kernel generated
         13, !$acc do parallel, vector(10)
         14, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
% a.out

Case 2
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5

Mat

Alistair_Hart · July 15, 2010, 8:27am

Thanks for the prompt and informative reply.

I did find that it was most efficient to put the acc region around all the loops, but I was interested to see what was possible if I wanted to place it elsewhere.

Cheers,

Alistair.

MatColgrove · August 4, 2010, 6:56pm

FYI TPR#17096 will be fixed in version 10.8.

Mat
