Dear all,
I’m a new user of OpenACC and I ran into some confusion while accelerating my code. The code was originally written in Fortran 90 + MPI (Open MPI). To test the behavior of OpenACC, I added OpenACC directives to just one loop. Here is the relevant part of my code:
!$acc data copyin(Betaexp,DeltaT,gravity,phi1,xsize,xstart) &
!$acc copy(tb1)
!$acc kernels
do k=1,xsize(3)
   do j=1,xsize(2)
      do i=1,xsize(1)
         tb1(i,j,k) = tb1(i,j,k) + Betaexp*DeltaT*gravity* &
                      phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
      enddo
   enddo
enddo
!$acc end kernels
!$acc end data
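For context, this region sits inside the main time-step loop of the solver. Roughly, the structure is as follows (the loop variable and the comments below are placeholders, not my actual code):

do itime = 1, 1000     ! time-step loop, executed on the host
   ! ... other solver work on the CPU ...
   ! the OpenACC region shown above runs here, once per step
   ! ... remaining CPU work ...
enddo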
I run the code for 1000 time steps, so this loop is executed 1000 times. I used the PGI compiler; here is the compilation output:
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c decomp_2d.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c glassman.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c fft_generic.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c module_param.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c io.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c variables.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c poisson.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c schemes.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c convdiff.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c incompact3d.f90
PGF90-W-0155-Constant or Parameter used in data clause - betaexp (incompact3d.f90: 149)
PGF90-W-0155-Constant or Parameter used in data clause - deltat (incompact3d.f90: 149)
PGF90-W-0155-Constant or Parameter used in data clause - gravity (incompact3d.f90: 149)
0 inform, 3 warnings, 0 severes, 0 fatal for incompact3d
incompact3d:
149, Generating copy(tb1(:,:,:))
Generating copyin(xstart(:),xsize(:),phi1(:,:,:))
152, Loop is parallelizable
153, Loop is parallelizable
154, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
152, !$acc loop gang ! blockidx%y
153, !$acc loop gang, vector(4) ! blockidx%z threadidx%y
154, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c navier.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c filter.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c derive.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c parameters.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c tools.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c visu.f90
ftn -O3 -acc -ta=tesla:cuda8.0 -o incompact3d decomp_2d.o glassman.o fft_generic.o module_param.o io.o variables.o poisson.o schemes.o convdiff.o incompact3d.o navier.o filter.o derive.o parameters.o tools.o visu.o
When I submit the job to the cluster, I set PGI_ACC_TIME=1 so that the OpenACC runtime prints a timing summary at exit. In the job script this looks roughly as follows (assuming a bash script; the launch command is a placeholder for whatever launcher the cluster actually uses):
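export PGI_ACC_TIME=1
srun ./incompact3d     # placeholder: the cluster's actual launch command

Here is a part of the resulting summary: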
Accelerator Kernel Timing data
/scratch/snx3000/guow/Incompact3d_GPU/test3_copyin/source_code/incompact3d.f90
incompact3d NVIDIA devicenum=0
time(us): 1,000,835
149: data region reached 6000 times
149: data copyin transfers: 18000
device time(us): total=677,027 max=401 min=4 avg=37
160: data copyout transfers: 3000
device time(us): total=323,808 max=743 min=61 avg=107
151: compute region reached 3000 times
154: kernel launched 3000 times
grid: [4x21x8] block: [32x4]
elapsed time(us): total=410,046 max=2,449 min=47 avg=136
So my questions are:
- Is this loop actually computed on the GPU? In the runtime summary I don’t see a device time after the line “154: kernel launched 3000 times”; I only see an elapsed time for the kernel, plus the data being copied in and out. So I suspect the loop is not even executed on the device.
- In the data directive I added (!$acc data copyin(Betaexp,DeltaT,gravity,phi1,xsize,xstart)), I also copy in Betaexp, DeltaT, and gravity, which are constants I defined. However, the compiler seems to have ignored those constants and only generates copies for xstart(:), xsize(:), and phi1(:,:,:) (see the compilation output: “Generating copyin(xstart(:),xsize(:),phi1(:,:,:))”). Is it simply unnecessary to copy constants to the GPU?
- If I don’t use OpenACC directives and the code runs on CPUs only, each time step takes 0.28 seconds. With the OpenACC directives added to the loop shown above, each time step takes 0.61 seconds. Since the time loop itself cannot be parallelized on the GPU, the loop above is executed on the GPU once per time step, and the data movement costs much more time than the acceleration gains. So parallelizing such a simple loop in isolation is actually a waste of time; if I want to optimize my code, I should use a profiler to find the most computationally heavy loops and parallelize those (one restructuring I am considering is sketched below). Do I understand correctly?
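Regarding the restructuring mentioned in the last question: I am considering hoisting the data region outside the time-step loop, so the arrays stay resident on the GPU and the transfers are paid only once instead of every step. A minimal sketch, assuming tb1 and phi1 are not touched on the host between steps (itime is a placeholder for my actual loop variable):

!$acc data copyin(phi1,xsize,xstart) copy(tb1)
do itime = 1, 1000
   !$acc kernels
   do k=1,xsize(3)
      do j=1,xsize(2)
         do i=1,xsize(1)
            ! scalars such as Betaexp are, as I understand it, passed
            ! to the kernel automatically, so they need no data clause
            tb1(i,j,k) = tb1(i,j,k) + Betaexp*DeltaT*gravity* &
                         phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
         enddo
      enddo
   enddo
   !$acc end kernels
enddo
!$acc end data

Would this be the right direction?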
Any suggestions would be appreciated! Thanks in advance!
Best regards,
Wentao Guo