Hi aturk,
The compiler is correct that the outer “K” loop can’t be parallelized: each iteration’s computation depends on the results of the previous iteration.
The compiler is correctly parallelizing the inner implicit loops from the array syntax, and the program does run to completion, though it takes a while given the size of NT (1,000,000,000).
% pgfortran -acc -Minfo=accel -ta=tesla:cc60 test.f90 -V17.4
diffusion:
     36, Generating copyout(cz(:,:))
         Generating copyin(d,dx,dt,c(:,:))
     38, Loop carried dependence due to exposed use of c(2:499,:),c(:498,2:99),c(3:,2:99) prevents parallelization
         Parallelization would require privatization of array cz(2:499,i2+2)
         Sequential loop scheduled on host
     40, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         40, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     44, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         44, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
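To make the "implicit loops" in the messages above more concrete, here is a minimal sketch of what the two array-syntax assignments look like when written as explicit loop nests. This is only an illustration, not your original code: the program name and the scalar R (standing in for the constant factor 0.5*((D*DT)/(DX*DX))) are names I made up, and NT is fixed at 10 to keep the run short.

```fortran
! Sketch only: explicit-loop form of the two array-syntax assignments.
! R is a hypothetical name for the constant factor 0.5*((D*DT)/(DX*DX)).
PROGRAM EXPLICIT_LOOPS
  IMPLICIT NONE
  INTEGER, PARAMETER :: NX = 500, NY = 100, NT = 10
  INTEGER :: I, J, K
  DOUBLE PRECISION :: D, DX, DT, R
  DOUBLE PRECISION :: C(NX,NY), CZ(NX,NY)
  D  = 1.D-5
  DX = 1.D-4
  DT = 0.5D0*DX*DX/D
  R  = 0.5D0*((D*DT)/(DX*DX))
  C  = 300.D0
  CZ = 0.D0
  !$ACC DATA COPY(C) CREATE(CZ)
  DO K = 1, NT                 ! sequential: step K reads the C written by step K-1
    !$ACC KERNELS
    DO J = 2, NY-1             ! independent: each CZ(I,J) only reads C
      DO I = 2, NX-1
        CZ(I,J) = C(I,J) + R*((C(I-1,J) - 2.D0*C(I,J) + C(I+1,J)) + &
                              (C(I,J-1) - 2.D0*C(I,J) + C(I,J+1)))
      ENDDO
    ENDDO
    DO J = 2, NY-1             ! independent: element-wise copy back into C
      DO I = 2, NX-1
        C(I,J) = CZ(I,J)
      ENDDO
    ENDDO
    !$ACC END KERNELS
  ENDDO
  !$ACC END DATA
  PRINT *, C(2,2)
END PROGRAM EXPLICIT_LOOPS
```

Within a single time step every CZ(I,J) depends only on C, so the I and J loops can run in parallel. The K loop has to stay sequential because each step reads the C produced by the step before it.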
While your example runs to completion, it will give incorrect answers because you’re only copying “C” to the device. Change “copyin” to “copy” so that “C” is copied to the device at the start of the region and back to the host at the end.
Since “CZ” is only used on the device, you can put it in a “create” clause so the data is not copied. If you do need CZ’s values back on the host, keep in mind that only the interior of the array is set on the device, so copying back the entire array (as “copyout” does) will give you garbage values in the halo. In that case I’d suggest putting CZ in a “copy” clause as well, so the halo values are initialized on the device from the host copy and contain valid values when copied back to the host.
Note that, for convenience, I set NT=10. I also moved the kernels directive so that it surrounds the inner implicit loops, and used a data directive to manage the data. You do not need to copyin the scalars in this case since scalars are firstprivate by default.
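If you do want CZ back on the host, the “copy” variant could look roughly like the sketch below. It is a cut-down illustration rather than your full program: the program name is made up, there is no time loop, and the kernel simply copies the interior of C into CZ so the effect on the halo is easy to see.

```fortran
! Sketch only: CZ in a COPY clause so its halo holds valid values on return.
PROGRAM CZ_HALO
  IMPLICIT NONE
  INTEGER, PARAMETER :: NX = 500, NY = 100
  DOUBLE PRECISION :: C(NX,NY), CZ(NX,NY)
  C  = 300.D0
  CZ = 0.D0                    ! host initialization, including the halo
  ! COPY (rather than CREATE or COPYOUT) sends the host's CZ, halo included,
  ! to the device on entry and brings the whole array back on exit.
  !$ACC DATA COPYIN(C) COPY(CZ)
  !$ACC KERNELS
  CZ(2:NX-1,2:NY-1) = C(2:NX-1,2:NY-1)   ! only the interior is written
  !$ACC END KERNELS
  !$ACC END DATA
  PRINT *, CZ(1,1), CZ(2,2)    ! one halo element, one interior element
END PROGRAM CZ_HALO
```

With “copyout” in place of “copy”, the device copy of CZ would start out uninitialized, so the halo element printed here would be whatever happened to be in device memory.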
% cat test.f90
PROGRAM DIFFUSION
IMPLICIT NONE
!**********************************************************************
INTEGER :: I,J,K,NT
INTEGER, PARAMETER :: NX = 500, NY = 100
DOUBLE PRECISION :: INIC,BC,D,T
DOUBLE PRECISION :: DX,DT
DOUBLE PRECISION :: TMAX,XMAX,YMAX
DOUBLE PRECISION :: C(NX,NY)
DOUBLE PRECISION :: CZ(NX,NY) !PLACEHOLDER FOR PARALLELISATION
D = 1.D-5
DX = 1.D-4
INIC = 300 ! INITIAL VALUE
XMAX = NX*DX
YMAX = NY*DX
TMAX = 5E5 ! TIME
BC = 600
T = 300
!**********************************************************************
! DEFINE DT
DT = 0.5*DX*DX/D
NT = TMAX/DT
C = INIC
CZ = 0
print *, NT
NT=10
! ***************************************************************
! TIME LOOP
!$ACC DATA COPY(C(1:NX,1:NY)) CREATE(CZ(1:NX,1:NY))
DO K=1, NT
!$ACC KERNELS
CZ(2:NX-1,2:NY-1) = C(2:NX-1,2:NY-1)+0.5*((D*DT)/(DX*DX))* &
((C(1:NX-2,2:NY-1)-2.D0*C(2:NX-1,2:NY-1)+C(3:NX,2:NY-1))+ &
(C(2:NX-1,1:NY-2)-2.D0*C(2:NX-1,2:NY-1)+C(2:NX-1,3:NY)))
C(2:NX-1,2:NY-1) = CZ(2:NX-1,2:NY-1)
!$ACC END KERNELS
ENDDO
!$ACC END DATA
print *, C(2:5,2:5)
END
% export PGI_ACC_TIME=1
% pgfortran -acc -Minfo=accel -ta=tesla:cc60 test.f90 -V17.4; a.out
diffusion:
     36, Generating copy(c(:,:))
         Generating create(cz(:,:))
     39, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         39, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     42, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         42, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
1000000000
300.0000000000000 300.0000000000000 300.0000000000000
300.0000000000000 300.0000000000000 300.0000000000000
300.0000000000000 300.0000000000000 300.0000000000000
300.0000000000000 300.0000000000000 300.0000000000000
300.0000000000000 300.0000000000000 300.0000000000000
300.0000000000000
Accelerator Kernel Timing data
test.f90
diffusion NVIDIA devicenum=0
time(us): 249
36: data region reached 2 times
36: data copyin transfers: 2
device time(us): total=31 max=18 min=13 avg=15
45: data copyout transfers: 1
device time(us): total=218 max=218 min=218 avg=218
38: compute region reached 10 times
39: kernel launched 10 times
grid: [16x25] block: [32x4]
elapsed time(us): total=2,929 max=317 min=290 avg=292
42: kernel launched 10 times
grid: [16x25] block: [32x4]
elapsed time(us): total=2,924 max=305 min=290 avg=292
Hope this helps,
Mat