OpenACC for code acceleration

Dear all,
My code is written in Fortran and OpenMPI. I want to use OpenACC to accelerate my code and run it on GPU machines. In my code there is an outer time step loop, and the calculation in each time step needs the data from the previous time step, so the time steps are not independent. I think I can therefore only add directives to the child loops inside the time step loop. Because I use OpenMPI for parallel computation, the calculation time spent on each time step is not very large, only two or three hundred milliseconds, but I need to compute millions of time steps. I did a test simulation: I added OpenACC directives to one of my subroutines and the average calculation time per time step is 0.738 seconds. If I don't add OpenACC directives and only run my code on the CPU, the average calculation time per time step is 0.598 seconds. So you see my code is slowed down rather than accelerated by adding OpenACC directives. I want to ask whether my code is appropriate for porting to the GPU using OpenACC, because in every time step I need to transfer data between GPU and CPU, and the time for running each time step is small, so I feel it is difficult to accelerate my code. Is that correct? Any suggestion will be appreciated. Thanks in advance!
Best regards,
Wentao Guo

Hi Wentao Guo,

The GPU is a computational beast, but you do need to feed it a lot of computation in order to see good acceleration. Though even with small computations, some speed-up can usually be achieved.

Of the 0.738 seconds, how much of that time is spent copying data? (From your Stack Overflow post, were you able to get the profiler working?)

If there is any data movement, I'd try to remove as much of it as possible. It sounds like you make MPI calls within the timestep loop, so you could use CUDA-aware OpenMPI to pass device data directly between ranks, eliminating the need to bring data back to the host. I don't have the details on how to build CUDA-aware support into OpenMPI but can get them if needed. You'd then pass the device data to the MPI calls by wrapping them in an OpenACC "host_data" region. You may find NVIDIA's 2016 OpenACC online training, Class #2, helpful since it goes into depth on using OpenACC with MPI.
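
As a rough sketch (the array names, counts, and ranks here are just placeholders, and it assumes your OpenMPI was built with CUDA support), passing device data straight to an MPI call looks something like:

!$acc data copyin(sendbuf(1:n)) copyout(recvbuf(1:n))
!  ... compute regions that fill sendbuf on the device ...
!$acc host_data use_device(sendbuf, recvbuf)
   ! inside host_data the device addresses are passed to MPI, so a
   ! CUDA-aware MPI can move the data directly between GPUs
   call MPI_SENDRECV(sendbuf, n, MPI_DOUBLE_PRECISION, dest, 0, &
                     recvbuf, n, MPI_DOUBLE_PRECISION, src,  0, &
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!$acc end host_data
!$acc end data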

Next, I'd review the PGI compiler feedback messages (-Minfo=accel) to see how the loops are being scheduled on the device. You might need to adjust things if the compiler is making a poor choice. It's rare that it doesn't find the optimal schedule, but it's something to look at.
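
For example, with the Cray ftn wrapper driving the PGI compiler, compiling one file with

ftn -O3 -acc -ta=tesla -Minfo=accel -c convdiff.f90

will print, for each compute region, how every loop was scheduled (gang, vector, seq) as well as any implicit data movement the compiler generated.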

I'd also make sure that your data layout has the "vector" loop index accessing the stride-1 dimension of your arrays (the first dimension in Fortran, the last dimension in C).
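
For example, for an array a(nx,ny,nz) in Fortran you want the innermost loop, which the compiler will normally schedule as the vector loop, running over the first index (a minimal sketch with placeholder arrays a, b, c):

!$acc kernels
do k=1,nz
   do j=1,ny
      do i=1,nx          ! i indexes the stride-1 (first) dimension in Fortran
         a(i,j,k) = b(i,j,k) + c(i,j,k)
      enddo
   enddo
enddo
!$acc end kernels

That way consecutive threads in a vector access consecutive memory locations.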

If you can share the code, I’d be happy to take a look and offer more specific suggestions.

-Mat

Hi Mat,
Thanks for your help. I followed the suggestions you gave me in my previous post named "Launch of the kernel", so I rewrote my directives and paid more attention to data movement. Here is the main program of my code:

PROGRAM incompact3d

USE decomp_2d
USE decomp_2d_poisson
use decomp_2d_io
USE variables
USE param
USE var
USE MPI
USE IBM
USE derivX
USE derivZ

implicit none

integer :: code,nlock,i,j,k,ijk,ii,bcx,bcy,bcz,fh,ierror,nvect1
real(mytype) :: x,y,z,tmp1,phimax,phimin
double precision :: t1,t2
character(len=20) :: filename

TYPE(DECOMP_INFO) :: phG,ph1,ph2,ph3,ph4

CALL MPI_INIT(code)
call decomp_2d_init(nx,ny,nz,p_row,p_col)
!start from 1 == true
call init_coarser_mesh_statS(nstat,nstat,nstat,.true.)
call init_coarser_mesh_statV(nvisu,nvisu,nvisu,.true.)
call parameter()

call init_variables

call schemes()

if (nclx==0) then
   bcx=0
else
   bcx=1
endif
if (ncly==0) then
   bcy=0
else
   bcy=1
endif
if (nclz==0) then
   bcz=0
else
   bcz=1
endif

call decomp_2d_poisson_init(bcx,bcy,bcz)

call decomp_info_init(nxm,nym,nzm,phG)

!if you want to collect 100 snapshots randomly on XXXXX time steps
!call collect_data() !it will generate 100 random time steps

if (ilit==0) call init(ux1,uy1,uz1,ep1,phi1,gx1,gy1,gz1,phis1,hx1,hy1,hz1,phiss1)  
if (ilit==1) call restart(ux1,uy1,uz1,ep1,pp3,phi1,gx1,gy1,gz1,&
        px1,py1,pz1,phis1,hx1,hy1,hz1,phiss1,phG,0)
call test_speed_min_max(ux1,uy1,uz1)
if (iscalar==1) call test_scalar_min_max(phi1)

!array for stat to zero
umean=0.;vmean=0.;wmean=0.
uumean=0.;vvmean=0.;wwmean=0.
uvmean=0.;uwmean=0.;vwmean=0.
phimean=0.;phiphimean=0.
utmean=0.;vtmean=0.;wtmean=0.


t1 = MPI_WTIME()

!div: nx ny nz --> nxm ny nz --> nxm nym nz --> nxm nym nzm
call decomp_info_init(nxm, nym, nzm, ph1)
call decomp_info_init(nxm, ny, nz, ph4)

!gradp: nxm nym nzm -> nxm nym nz --> nxm ny nz --> nx ny nz
call decomp_info_init(nxm, ny, nz, ph2)  
call decomp_info_init(nxm, nym, nz, ph3) 

!$acc data copy(ux1,uy1,uz1,phi1) create(ta1,tb1,tc1,td1,te1,tf1,tg1,th1, &
!$acc ti1,di1,ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2, &
!$acc ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3,pp2y,pp4y) &
!$acc copyin(xsize,ysize,zsize,xstart)

do itime=ifirst,ilast
   t=(itime-1)*dt
   if (nrank==0) then
      write(*,1001) itime,t
1001  format('Time step =',i7,', Time unit =',F9.3)
   endif
   
   do itr=1,iadvance_time

      if (nclx.eq.2) then
         call inflow (ux1,uy1,uz1,phi1) !X PENCILS
         call outflow(ux1,uy1,uz1,phi1) !X PENCILS 
      endif

!if (itr.eq.1)  then
!      print *, 'nrank, xstart(1:3): ', nrank, xstart(1), xstart(2), xstart(3)    
!      phimax=-1.0d6
!      phimin= 1.0d6
!      do k=1,xsize(3)
!      do j=1,xsize(2)
!      do i=1,xsize(1)
!        if (phi1(i,j,k).gt.phimax) phimax=phi1(i,j,k)
!        if (phi1(i,j,k).lt.phimin) phimin=phi1(i,j,k)
!      enddo
!      enddo
!      enddo
!      print *,'b-nrank:',nrank, phimax, phimin
!endif


     !X-->Y-->Z-->Y-->X
      call convdiff(ux1,uy1,uz1,ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1,&
           ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2,&
           ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3)

!$acc update device(tb1,phi1)
!$acc kernels present(tb1,phi1,xstart) 
      do k=1,xsize(3)
      do j=1,xsize(2)
      do i=1,xsize(1)
         tb1(i,j,k)=tb1(i,j,k)+Betaexp*DeltaT*gravity*phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
      enddo
      enddo
      enddo
!$acc end kernels
!$acc update self(tb1)

!if (itr.eq.1)  then
!      phimax=-1.0d6
!      phimin= 1.0d6
!      do k=1,xsize(3)
!      do j=1,xsize(2)
!      do i=1,xsize(1)
!        if (phi1(i,j,k).gt.phimax) phimax=phi1(i,j,k)
!        if (phi1(i,j,k).lt.phimin) phimin=phi1(i,j,k)
!      enddo
!      enddo
!      enddo
!      print *,'a-nrank:',nrank, phimax, phimin
!endif

           
      if (iscalar==1) then
         call scalar(ux1,uy1,uz1,phi1,phis1,phiss1,di1,tg1,th1,ti1,td1,&
              uy2,uz2,phi2,di2,ta2,tb2,tc2,td2,uz3,phi3,di3,ta3,tb3,ep1) 
      endif

      !X PENCILS
      call intt (ux1,uy1,uz1,gx1,gy1,gz1,hx1,hy1,hz1,ta1,tb1,tc1) 


      call pre_correc(ux1,uy1,uz1)

      if (ivirt==1) then !solid body old school
         !we are in X-pencil
         call corgp_IBM(ux1,uy1,uz1,px1,py1,pz1,1)
         call body(ux1,uy1,uz1,ep1)
         call corgp_IBM(ux1,uy1,uz1,px1,py1,pz1,2)
      endif

      !X-->Y-->Z
      call divergence (ux1,uy1,uz1,ep1,ta1,tb1,tc1,di1,td1,te1,tf1,&
           td2,te2,tf2,di2,ta2,tb2,tc2,ta3,tb3,tc3,di3,td3,te3,tf3,pp3,&
           nxmsize,nymsize,nzmsize,ph1,ph3,ph4,1)       

      !POISSON Z-->Z 
      call decomp_2d_poisson_stg(pp3,bcx,bcy,bcz)

      !Z-->Y-->X
      call gradp(px1,py1,pz1,di1,td2,tf2,ta2,tb2,tc2,di2,&
           ta3,tc3,di3,pp3,nxmsize,nymsize,nzmsize,ph2,ph3)

      !X PENCILS
      call corgp(ux1,ux2,uy1,uz1,px1,py1,pz1) 
      
     !does not matter -->output=DIV U=0 (in dv3)
      call divergence (ux1,uy1,uz1,ep1,ta1,tb1,tc1,di1,td1,te1,tf1,&
           td2,te2,tf2,di2,ta2,tb2,tc2,ta3,tb3,tc3,di3,td3,te3,tf3,dv3,&
           nxmsize,nymsize,nzmsize,ph1,ph3,ph4,2)

!      if (nrank==0) ux1(12,64,:)=2.
!      if (nrank==1) ux1(12,64,:)=-1.
     !if (nrank==0) ux(12,64,42)=1.5
     !print *,nrank,xstart(2),xend(2),xstart(3),xend(3)

      call test_speed_min_max(ux1,uy1,uz1)
      if (iscalar==1) call test_scalar_min_max(phi1)

   enddo

   if(itime.gt.100000)then
   call STATISTIC(ux1,uy1,uz1,phi1,ta1,umean,vmean,wmean,phimean,uumean,vvmean,wwmean,&
        uvmean,uwmean,vwmean,phiphimean,tmean,utmean,vtmean,wtmean)
   endif

   if (mod(itime,isave)==0) call restart(ux1,uy1,uz1,ep1,pp3,phi1,gx1,gy1,gz1,&
        px1,py1,pz1,phis1,hx1,hy1,hz1,phiss1,phG,1)
     
   if (mod(itime,imodulo)==0) then
      call VISU_INSTA(ux1,uy1,uz1,phi1,ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1,&
           ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2,&
           ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3,phG,uvisu)
      call VISU_PRE (pp3,ta1,tb1,di1,ta2,tb2,di2,&
           ta3,di3,nxmsize,nymsize,nzmsize,phG,ph2,ph3,uvisu)
   endif
enddo
!$acc end data

t2=MPI_WTIME()-t1
call MPI_ALLREDUCE(t2,t1,1,MPI_REAL8,MPI_SUM, &
                   MPI_COMM_WORLD,code)
if (nrank==0) print *,'time per time_step: ', &
     t1/float(nproc)/(ilast-ifirst+1),' seconds'
if (nrank==0) print *,'simulation with nx*ny*nz=',nx,ny,nz,'mesh nodes'
if (nrank==0) print *,'Mapping p_row*p_col=',p_row,p_col


!call decomp_2d_poisson_finalize
call decomp_2d_finalize
CALL MPI_FINALIZE(code)

end PROGRAM incompact3d

Here you can see before my outer time step loop (do itime=ifirst,ilast) I added some OpenACC directives to copy or create data on device.

!$acc data copy(ux1,uy1,uz1,phi1) create(ta1,tb1,tc1,td1,te1,tf1,tg1,th1, &
!$acc ti1,di1,ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2, &
!$acc ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3,pp2y,pp4y) &
!$acc copyin(xsize,ysize,zsize,xstart)

do itime=ifirst,ilast

Then inside the time loop, I added some directives to the subroutine named convdiff and also one loop shown below:

      call convdiff(ux1,uy1,uz1,ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1,&
           ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2,&
           ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3)

!$acc update device(tb1,phi1)
!$acc kernels present(tb1,phi1,xstart) 
      do k=1,xsize(3)
      do j=1,xsize(2)
      do i=1,xsize(1)
         tb1(i,j,k)=tb1(i,j,k)+Betaexp*DeltaT*gravity*phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
      enddo
      enddo
      enddo
!$acc end kernels
!$acc update self(tb1)

Here is the subroutine convdiff:

!********************************************************************
!
subroutine convdiff(ux1,uy1,uz1,ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1,&
     ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2,&
     ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3)
! 
!********************************************************************
USE param
USE variables
USE decomp_2d


implicit none

real(mytype),dimension(xsize(1),xsize(2),xsize(3)) :: ux1,uy1,uz1
real(mytype),dimension(xsize(1),xsize(2),xsize(3)) :: ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1
real(mytype),dimension(ysize(1),ysize(2),ysize(3)) :: ux2,uy2,uz2 
real(mytype),dimension(ysize(1),ysize(2),ysize(3)) :: ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2
real(mytype),dimension(zsize(1),zsize(2),zsize(3)) :: ux3,uy3,uz3
real(mytype),dimension(zsize(1),zsize(2),zsize(3)) :: ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3

integer :: ijk,nvect1,nvect2,nvect3,i,j,k
real(mytype) :: x,y,z



nvect1=xsize(1)*xsize(2)*xsize(3)
nvect2=ysize(1)*ysize(2)*ysize(3)
nvect3=zsize(1)*zsize(2)*zsize(3)
 
if (iskew==0) then !UROTU!
!WORK X-PENCILS
   call derx (ta1,uy1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
   call derx (tb1,uz1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
   call transpose_x_to_y(ux1,ux2)
   call transpose_x_to_y(uy1,uy2)
   call transpose_x_to_y(uz1,uz2)
   call transpose_x_to_y(ta1,ta2)
   call transpose_x_to_y(tb1,tb2)
!WORK Y-PENCILS
   call dery (tc2,ux2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1) 
   call dery (td2,uz2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1) 
   call transpose_y_to_z(ux2,ux3)
   call transpose_y_to_z(uy2,uy3)
   call transpose_y_to_z(uz2,uz3)
   call transpose_y_to_z(ta2,ta3)
   call transpose_y_to_z(tb2,tb3)
   call transpose_y_to_z(tc2,tc3)
   call transpose_y_to_z(td2,td3)
!WORK Z-PENCILS
   call derz (te3,ux3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
   call derz (tf3,uy3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
   do ijk=1,nvect3
      ta3(ijk,1,1)=uz3(ijk,1,1)*(te3(ijk,1,1)-tb3(ijk,1,1))-&
           uy3(ijk,1,1)*(ta3(ijk,1,1)-tc3(ijk,1,1))
      tb3(ijk,1,1)=ux3(ijk,1,1)*(ta3(ijk,1,1)-tc3(ijk,1,1))-&
           uz3(ijk,1,1)*(td3(ijk,1,1)-tf3(ijk,1,1))
      tc3(ijk,1,1)=uy3(ijk,1,1)*(td3(ijk,1,1)-tf3(ijk,1,1))-&
           ux3(ijk,1,1)*(te3(ijk,1,1)-tb3(ijk,1,1))
   enddo
else !SKEW!


!WORK X-PENCILS

!$acc update device(ux1,uy1,uz1)
!$acc kernels present(ux1,uy1,uz1)
   do ijk=1,nvect1
      ta1(ijk,1,1)=ux1(ijk,1,1)*ux1(ijk,1,1)
      tb1(ijk,1,1)=ux1(ijk,1,1)*uy1(ijk,1,1)
      tc1(ijk,1,1)=ux1(ijk,1,1)*uz1(ijk,1,1)
   enddo
!$acc end kernels
!$acc update self(ta1,tb1,tc1)

   call derx (td1,ta1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
   call derx (te1,tb1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (tf1,tc1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (ta1,ux1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (tb1,uy1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
   call derx (tc1,uz1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)

!$acc update device(ta1,tb1,tc1,td1,te1,tf1)
!$acc kernels present(td1,te1,tf1,ux1)
   do ijk=1,nvect1
      ta1(ijk,1,1)=0.5*td1(ijk,1,1)+0.5*ux1(ijk,1,1)*ta1(ijk,1,1)
      tb1(ijk,1,1)=0.5*te1(ijk,1,1)+0.5*ux1(ijk,1,1)*tb1(ijk,1,1)
      tc1(ijk,1,1)=0.5*tf1(ijk,1,1)+0.5*ux1(ijk,1,1)*tc1(ijk,1,1)      
   enddo
!$acc end kernels
!$acc update self(ta1,tb1,tc1)

   call transpose_x_to_y(ux1,ux2)
   call transpose_x_to_y(uy1,uy2)
   call transpose_x_to_y(uz1,uz2)
   call transpose_x_to_y(ta1,ta2)
   call transpose_x_to_y(tb1,tb2)
   call transpose_x_to_y(tc1,tc2)
!WORK Y-PENCILS

!$acc update device(ux2,uy2,uz2)
!$acc kernels present(ux2,uy2,uz2)
   do ijk=1,nvect2
      td2(ijk,1,1)=ux2(ijk,1,1)*uy2(ijk,1,1)
      te2(ijk,1,1)=uy2(ijk,1,1)*uy2(ijk,1,1)
      tf2(ijk,1,1)=uz2(ijk,1,1)*uy2(ijk,1,1)
   enddo
!$acc end kernels
!$acc update self(td2,te2,tf2)

   call dery (tg2,td2,di2,sy,ffy,fsy,fwy,ppy,ysize(1),ysize(2),ysize(3),0) 
   call dery (th2,te2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)
   call dery (ti2,tf2,di2,sy,ffy,fsy,fwy,ppy,ysize(1),ysize(2),ysize(3),0) 
   call dery (td2,ux2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1) 
   call dery (te2,uy2,di2,sy,ffy,fsy,fwy,ppy,ysize(1),ysize(2),ysize(3),0)
   call dery (tf2,uz2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1) 

!$acc update device(ta2,tb2,tc2,tg2,th2,ti2,td2,te2,tf2)
!$acc kernels present(ta2,tb2,tc2,tg2,th2,ti2,td2,te2,tf2,uy2)
   do ijk=1,nvect2
      ta2(ijk,1,1)=ta2(ijk,1,1)+0.5*tg2(ijk,1,1)+0.5*uy2(ijk,1,1)*td2(ijk,1,1)
      tb2(ijk,1,1)=tb2(ijk,1,1)+0.5*th2(ijk,1,1)+0.5*uy2(ijk,1,1)*te2(ijk,1,1)
      tc2(ijk,1,1)=tc2(ijk,1,1)+0.5*ti2(ijk,1,1)+0.5*uy2(ijk,1,1)*tf2(ijk,1,1)      
   enddo
!$acc end kernels
!$acc update self(ta2,tb2,tc2)

   call transpose_y_to_z(ux2,ux3)
   call transpose_y_to_z(uy2,uy3)
   call transpose_y_to_z(uz2,uz3)
   call transpose_y_to_z(ta2,ta3)
   call transpose_y_to_z(tb2,tb3)
   call transpose_y_to_z(tc2,tc3)

!WORK Z-PENCILS

!$acc update device(ux3,uy3,uz3)
!$acc kernels present(ux3,uy3,uz3)
   do ijk=1,nvect3
      td3(ijk,1,1)=ux3(ijk,1,1)*uz3(ijk,1,1)
      te3(ijk,1,1)=uy3(ijk,1,1)*uz3(ijk,1,1)
      tf3(ijk,1,1)=uz3(ijk,1,1)*uz3(ijk,1,1)
   enddo
!$acc end kernels
!$acc update self(td3,te3,tf3)

   call derz (tg3,td3,di3,sz,ffz,fsz,fwz,zsize(1),zsize(2),zsize(3),0)
   call derz (th3,te3,di3,sz,ffz,fsz,fwz,zsize(1),zsize(2),zsize(3),0)
   call derz (ti3,tf3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
   call derz (td3,ux3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
   call derz (te3,uy3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
   call derz (tf3,uz3,di3,sz,ffz,fsz,fwz,zsize(1),zsize(2),zsize(3),0)

!$acc update device(ta3,tb3,tc3,tg3,th3,ti3,td3,te3,tf3)
!$acc kernels present(ta3,tb3,tc3,tg3,th3,ti3,td3,te3,tf3)
   do ijk=1,nvect3
      ta3(ijk,1,1)=ta3(ijk,1,1)+0.5*tg3(ijk,1,1)+0.5*uz3(ijk,1,1)*td3(ijk,1,1)
      tb3(ijk,1,1)=tb3(ijk,1,1)+0.5*th3(ijk,1,1)+0.5*uz3(ijk,1,1)*te3(ijk,1,1)
      tc3(ijk,1,1)=tc3(ijk,1,1)+0.5*ti3(ijk,1,1)+0.5*uz3(ijk,1,1)*tf3(ijk,1,1)   
   enddo
!$acc end kernels
!$acc update self(ta3,tb3,tc3)

endif
!ALL THE CONVECTIVE TERMS ARE IN TA3, TB3 and TC3

td3(:,:,:)=ta3(:,:,:)
te3(:,:,:)=tb3(:,:,:)
tf3(:,:,:)=tc3(:,:,:)

!DIFFUSIVE TERMS IN Z
call derzz (ta3,ux3,di3,sz,sfzp,sszp,swzp,zsize(1),zsize(2),zsize(3),1)
call derzz (tb3,uy3,di3,sz,sfzp,sszp,swzp,zsize(1),zsize(2),zsize(3),1)
call derzz (tc3,uz3,di3,sz,sfz ,ssz ,swz ,zsize(1),zsize(2),zsize(3),0)


!WORK Y-PENCILS
call transpose_z_to_y(ta3,ta2)
call transpose_z_to_y(tb3,tb2)
call transpose_z_to_y(tc3,tc2)
call transpose_z_to_y(td3,td2)
call transpose_z_to_y(te3,te2)
call transpose_z_to_y(tf3,tf2)

tg2(:,:,:)=td2(:,:,:)
th2(:,:,:)=te2(:,:,:)
ti2(:,:,:)=tf2(:,:,:)

!DIFFUSIVE TERMS IN Y
!-->for ux
if (istret.ne.0) then 
   call deryy (td2,ux2,di2,sy,sfyp,ssyp,swyp,ysize(1),ysize(2),ysize(3),1)
   call dery (te2,ux2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)

!$acc update device(td2,pp2y,pp4y,te2)
!$acc kernels present(td2,pp2y,pp4y,te2)
   do k=1,ysize(3)
   do j=1,ysize(2)
   do i=1,ysize(1)
      td2(i,j,k)=td2(i,j,k)*pp2y(j)-pp4y(j)*te2(i,j,k)
   enddo
   enddo
   enddo
!$acc end kernels
!$acc update self(td2)
else
   call deryy (td2,ux2,di2,sy,sfyp,ssyp,swyp,ysize(1),ysize(2),ysize(3),1) 
endif
!-->for uy
if (istret.ne.0) then 
   call deryy (te2,uy2,di2,sy,sfy,ssy,swy,ysize(1),ysize(2),ysize(3),0)
   call dery (tf2,uy2,di2,sy,ffy,fsy,fwy,ppy,ysize(1),ysize(2),ysize(3),0)
!$acc update device(te2,tf2)
!$acc kernels present(te2,pp2y,pp4y,tf2)
   do k=1,ysize(3)
   do j=1,ysize(2)
   do i=1,ysize(1)
      te2(i,j,k)=te2(i,j,k)*pp2y(j)-pp4y(j)*tf2(i,j,k)
   enddo
   enddo
   enddo
!$acc end kernels
!$acc update self(te2)
else
   call deryy (te2,uy2,di2,sy,sfy,ssy,swy,ysize(1),ysize(2),ysize(3),0) 
endif
!-->for uz
if (istret.ne.0) then 
   call deryy (tf2,uz2,di2,sy,sfyp,ssyp,swyp,ysize(1),ysize(2),ysize(3),1)
   call dery (tj2,uz2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)

!$acc update device(tf2,tj2)
!$acc kernels present(tf2,pp2y,pp4y,tj2)
   do k=1,ysize(3)
   do j=1,ysize(2)
   do i=1,ysize(1)
      tf2(i,j,k)=tf2(i,j,k)*pp2y(j)-pp4y(j)*tj2(i,j,k)
   enddo
   enddo
   enddo
!$acc end kernels
!$acc update self(tf2)
else
   call deryy (tf2,uz2,di2,sy,sfyp,ssyp,swyp,ysize(1),ysize(2),ysize(3),1) 
endif

ta2(:,:,:)=ta2(:,:,:)+td2(:,:,:)
tb2(:,:,:)=tb2(:,:,:)+te2(:,:,:)
tc2(:,:,:)=tc2(:,:,:)+tf2(:,:,:)

!WORK X-PENCILS
call transpose_y_to_x(ta2,ta1)
call transpose_y_to_x(tb2,tb1)
call transpose_y_to_x(tc2,tc1) !diff
call transpose_y_to_x(tg2,td1)
call transpose_y_to_x(th2,te1)
call transpose_y_to_x(ti2,tf1) !conv

tg1(:,:,:)=td1(:,:,:)
th1(:,:,:)=te1(:,:,:)
ti1(:,:,:)=tf1(:,:,:)

!DIFFUSIVE TERMS IN X
call derxx (td1,ux1,di1,sx,sfx ,ssx ,swx ,xsize(1),xsize(2),xsize(3),0)
call derxx (te1,uy1,di1,sx,sfxp,ssxp,swxp,xsize(1),xsize(2),xsize(3),1)
call derxx (tf1,uz1,di1,sx,sfxp,ssxp,swxp,xsize(1),xsize(2),xsize(3),1)

ta1(:,:,:)=ta1(:,:,:)+td1(:,:,:)
tb1(:,:,:)=tb1(:,:,:)+te1(:,:,:)
tc1(:,:,:)=tc1(:,:,:)+tf1(:,:,:)

if(itime.lt.20000)then
if (nrank==1) print *,'WARNING rotating channel',itime
tg1(:,:,:)=tg1(:,:,:)-2./18.*uy1(:,:,:)
th1(:,:,:)=th1(:,:,:)-2./18.*ux1(:,:,:)
endif


!FINAL SUM: DIFF TERMS + CONV TERMS
ta1(:,:,:)=xnu*ta1(:,:,:)-tg1(:,:,:)
tb1(:,:,:)=xnu*tb1(:,:,:)-th1(:,:,:)
tc1(:,:,:)=xnu*tc1(:,:,:)-ti1(:,:,:)


end subroutine convdiff

The subroutine convdiff also calls other subroutines such as derx, dery, derxx, etc. I haven't added any directives to the subroutines called inside convdiff. The subroutine derx is shown below; the other subroutines are similar:

!********************************************************************
!
subroutine derx(tx,ux,rx,sx,ffx,fsx,fwx,nx,ny,nz,npaire) 
!
!********************************************************************

USE param
USE derivX

implicit none

integer :: nx,ny,nz,npaire,i,j,k 
real(mytype), dimension(nx,ny,nz) :: tx,ux,rx 
real(mytype), dimension(ny,nz):: sx
real(mytype), dimension(nx):: ffx,fsx,fwx 

if (nclx==0) then 
   do k=1,nz 
   do j=1,ny 
      tx(1,j,k)=afix*(ux(2,j,k)-ux(nx,j,k))& 
           +bfix*(ux(3,j,k)-ux(nx-1,j,k)) 
      rx(1,j,k)=-1. 
      tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
           +bfix*(ux(4,j,k)-ux(nx,j,k)) 
      rx(2,j,k)=0. 
      do i=3,nx-2
         tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
              +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
         rx(i,j,k)=0. 
      enddo
      tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
           +bfix*(ux(1,j,k)-ux(nx-3,j,k)) 
      rx(nx-1,j,k)=0. 
      tx(nx,j,k)=afix*(ux(1,j,k)-ux(nx-1,j,k))&
           +bfix*(ux(2,j,k)-ux(nx-2,j,k)) 
      rx(nx,j,k)=alfaix           
      do i=2, nx
         tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
         rx(i,j,k)=rx(i,j,k)-rx(i-1,j,k)*fsx(i) 
      enddo
      tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
      rx(nx,j,k)=rx(nx,j,k)*fwx(nx) 
      do i=nx-1,1,-1
         tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
         rx(i,j,k)=(rx(i,j,k)-ffx(i)*rx(i+1,j,k))*fwx(i) 
      enddo
      sx(j,k)=(tx(1,j,k)-alfaix*tx(nx,j,k))&
           /(1.+rx(1,j,k)-alfaix*rx(nx,j,k)) 
      do i=1,nx 
         tx(i,j,k)=tx(i,j,k)-sx(j,k)*rx(i,j,k) 
      enddo
   enddo
   enddo
endif

if (nclx==1) then 
   if (npaire==1) then 
      do k=1,nz 
      do j=1,ny 
         tx(1,j,k)=0. 
         tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
              +bfix*(ux(4,j,k)-ux(2,j,k)) 
         do i=3,nx-2 
            tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
                 +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
         enddo
         tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
              +bfix*(ux(nx-1,j,k)-ux(nx-3,j,k)) 
         tx(nx,j,k)=0. 
         do i=2,nx
            tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
         enddo      
         tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
         do i=nx-1,1,-1 
            tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
         enddo
      enddo
      enddo
   endif
   if (npaire==0) then 
      do k=1,nz 
      do j=1,ny 
         tx(1,j,k)=afix*(ux(2,j,k)+ux(2,j,k))&
              +bfix*(ux(3,j,k)+ux(3,j,k)) 
         tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
              +bfix*(ux(4,j,k)+ux(2,j,k))
         do i=3,nx-2  
            tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
                 +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
         enddo
         tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
              +bfix*((-ux(nx-1,j,k))-ux(nx-3,j,k)) 
         tx(nx,j,k)=afix*((-ux(nx-1,j,k))-ux(nx-1,j,k))&
              +bfix*((-ux(nx-2,j,k))-ux(nx-2,j,k))
         do i=2,nx 
            tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
         enddo
         tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
         do i=nx-1,1,-1 
            tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
         enddo
      enddo
      enddo
   endif
endif

if (nclx==2) then 
   do k=1,nz
   do j=1,ny 
      tx(1,j,k)=af1x*ux(1,j,k)+bf1x*ux(2,j,k)+cf1x*ux(3,j,k) 
      tx(2,j,k)=af2x*(ux(3,j,k)-ux(1,j,k)) 
      do i=3,nx-2
         tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
              +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
      enddo
      tx(nx-1,j,k)=afmx*(ux(nx,j,k)-ux(nx-2,j,k)) 
      tx(nx,j,k)=(-afnx*ux(nx,j,k))-bfnx*ux(nx-1,j,k)-cfnx*ux(nx-2,j,k)
      do i=2,nx 
         tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
      enddo
      tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
      do i=nx-1,1,-1
         tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
      enddo
   enddo 
   enddo 
endif

return
end subroutine derx

I compiled my code and submitted it to the cluster, a Cray XC50 which has both CPUs and GPUs. I only use one core to compute.

Here is my script for submitting the code:

#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=12
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=1
#SBATCH --constraint=gpu
#SBATCH --time=24:00:00
#SBATCH --account=s807

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CRAY_CUDA_MPS=1
export TMPDIR=/scratch/snx3000/guow/Incompact3d_GPU/test7_convdiff_1_core
export PGI_ACC_TIME=1

set -ex

# set some parameters
OUT=log.txt
#WMIN=1400
NP=1


# tasks: $SLURM_NTASKS 
# tasks-per-node: $SLURM_NTASKS_PER_NODE 
# cpus-per-task: $SLURM_CPUS_PER_TASK
# srun nvprof --cpu-profiling on  -o nvprof.%p.out ./incompact3d $WMIN >> $OUT
srun ./incompact3d $WMIN >> $OUT
# srun nvprof -o nvprof.%p.out ./incompact3d $WMIN >> $OUT

Here is the profiling information from PGI_ACC_TIME:

ModuleCmd_Switch.c(179):ERROR:152: Module 'PrgEnv-cray/6.0.4' is currently not loaded
+ OUT=log.txt
+ NP=1
+ srun ./incompact3d

Accelerator Kernel Timing data
/scratch/snx3000/guow/Incompact3d_GPU/test9_convdiff_optimize/source_code/incompact3d.f90
  incompact3d  NVIDIA  devicenum=0
    time(us): 1,045,364
    113: data region reached 2 times
        113: data copyin transfers: 81
             device time(us): total=44,437 max=1,351 min=4 avg=548
        244: data copyout transfers: 36
             device time(us): total=41,177 max=1,294 min=46 avg=1,143
    153: update directive reached 30 times
        153: data copyin transfers: 540
             device time(us): total=651,492 max=1,358 min=46 avg=1,206
    154: compute region reached 30 times
        157: kernel launched 30 times
            grid: [8x256x65]  block: [32x4]
            elapsed time(us): total=23,532 max=793 min=780 avg=784
    154: data region reached 60 times
    163: update directive reached 30 times
        163: data copyout transfers: 270
             device time(us): total=308,258 max=1,311 min=46 avg=1,141
/scratch/snx3000/guow/Incompact3d_GPU/test9_convdiff_optimize/source_code/convdiff.f90
  convdiff  NVIDIA  devicenum=0
    time(us): 19,179,300
    98: update directive reached 30 times
        98: data copyin transfers: 810
             device time(us): total=977,150 max=1,356 min=46 avg=1,206
    99: compute region reached 30 times
        100: kernel launched 30 times
            grid: [65535]  block: [128]
            elapsed time(us): total=47,765 max=2,281 min=1,558 avg=1,592
    99: data region reached 120 times
    106: update directive reached 30 times
        106: data copyout transfers: 810
             device time(us): total=925,810 max=1,325 min=46 avg=1,142
    115: update directive reached 30 times
        115: data copyin transfers: 1620
             device time(us): total=1,954,015 max=1,357 min=46 avg=1,206
    116: compute region reached 30 times
        117: kernel launched 30 times
            grid: [65535]  block: [128]
            elapsed time(us): total=73,801 max=2,474 min=2,457 avg=2,460
    116: data region reached 120 times
    123: update directive reached 30 times
        123: data copyout transfers: 810
             device time(us): total=925,485 max=1,322 min=46 avg=1,142
    133: update directive reached 30 times
        133: data copyin transfers: 810
             device time(us): total=977,091 max=1,357 min=46 avg=1,206
    134: compute region reached 30 times
        135: kernel launched 30 times
            grid: [65535]  block: [128]
            elapsed time(us): total=47,015 max=1,572 min=1,560 avg=1,567
    134: data region reached 120 times
    141: update directive reached 30 times
        141: data copyout transfers: 810
             device time(us): total=925,293 max=1,329 min=46 avg=1,142
    150: update directive reached 30 times
        150: data copyin transfers: 2430
             device time(us): total=2,930,724 max=1,362 min=46 avg=1,206
    151: compute region reached 30 times
        152: kernel launched 30 times
            grid: [65535]  block: [128]
            elapsed time(us): total=96,829 max=3,239 min=3,212 avg=3,227
    151: data region reached 60 times
    158: update directive reached 30 times
        158: data copyout transfers: 810
             device time(us): total=925,619 max=1,308 min=46 avg=1,142
    169: update directive reached 30 times
        169: data copyin transfers: 810
             device time(us): total=977,057 max=1,360 min=46 avg=1,206
    170: compute region reached 30 times
        171: kernel launched 30 times
            grid: [65535]  block: [128]
            elapsed time(us): total=47,122 max=1,577 min=1,565 avg=1,570
    170: data region reached 120 times
    177: update directive reached 30 times
        177: data copyout transfers: 810
             device time(us): total=925,148 max=1,322 min=46 avg=1,142
    186: update directive reached 30 times
        186: data copyin transfers: 2430
             device time(us): total=2,930,691 max=1,380 min=46 avg=1,206
    187: compute region reached 30 times
        188: kernel launched 30 times
            grid: [65535]  block: [128]
            elapsed time(us): total=96,743 max=3,231 min=3,221 avg=3,224
    187: data region reached 120 times
    194: update directive reached 30 times
        194: data copyout transfers: 810
             device time(us): total=925,410 max=1,316 min=46 avg=1,142
    227: update directive reached 30 times
        227: data copyin transfers: 600
             device time(us): total=651,771 max=1,358 min=4 avg=1,086
    228: compute region reached 30 times
        231: kernel launched 30 times
            grid: [8x257x64]  block: [32x4]
            elapsed time(us): total=23,524 max=789 min=781 avg=784
    228: data region reached 60 times
    237: update directive reached 30 times
        237: data copyout transfers: 270
             device time(us): total=308,323 max=1,304 min=46 avg=1,141
    245: update directive reached 30 times
        245: data copyin transfers: 540
             device time(us): total=651,436 max=1,356 min=46 avg=1,206
    246: compute region reached 30 times
        249: kernel launched 30 times
            grid: [8x257x64]  block: [32x4]
            elapsed time(us): total=23,754 max=794 min=790 avg=791
    246: data region reached 60 times
    255: update directive reached 30 times
        255: data copyout transfers: 270
             device time(us): total=308,324 max=1,308 min=46 avg=1,141
    264: update directive reached 30 times
        264: data copyin transfers: 540
             device time(us): total=651,422 max=1,358 min=46 avg=1,206
    265: compute region reached 30 times
        268: kernel launched 30 times
            grid: [8x257x64]  block: [32x4]
            elapsed time(us): total=23,603 max=793 min=776 avg=786
    265: data region reached 60 times
    274: update directive reached 30 times
        274: data copyout transfers: 270
             device time(us): total=308,531 max=1,308 min=46 avg=1,142

The mesh I used is 256x257x256. I only ran ten time steps and each time step took 54.26679129600525 seconds. If I don't add any directives, it only needs 51.9558027982711 seconds per step. So the code is still slowed down a little rather than accelerated. I haven't found a way to optimize the data movement. Could you show me the right way to add directives in order to achieve acceleration? Thanks in advance!
Best regards,
Wentao Guo

I'm sorry for the bad formatting, but when you click post reply you can see the full text in the topic review.

So it looks like the code is spending about 20 seconds transferring data back and forth. You could probably improve that by using Pinned memory (-ta=tesla:pinned), but a better direction would be to offload the loops in the der and transpose routines, so you don’t need to transfer data at all.

Looking at derx, I don't see any reason why you can't parallelize across the "j" and "k" loops in each case. Ideally you'd be able to parallelize across the "i" loops as well since this is the stride-1 dimension, but I see a few of the "i" loops have dependencies so they can't be parallelized. You're not going to get great performance due to the data layout, but it should be better than having to copy the data back and forth between the host and GPU.

Hi Mat,
I tried to use the pinned memory. Here is the make file I used:

FC = ftn
OPTFC = -acc -Minfo=accel -ta=tesla:pinned -cpp
CC = cc
CFLAGS = -O3

all: incompact3d

alloc_shm.o: alloc_shm.c
        $(CC) $(CFLAGS) -c $<

FreeIPC_c.o: FreeIPC_c.c
        $(CC) $(CFLAGS) -c $<

incompact3d : $(OBJ)
        $(FC) -O3 -acc -ta=tesla:pinned -o $@ $(OBJ) $(LIBFFT)

%.o : %.f90
        $(FC) $(OPTFC) $(OPTIONS) $(INC) -c $<

.PHONY: clean
clean:
        rm -f *.o *.mod incompact3d

I can compile the code. But when I submit the job to the cluster, I received the following error:

ModuleCmd_Switch.c(179):ERROR:152: Module 'PrgEnv-cray/6.0.4' is currently not loaded
+ OUT=log.txt
+ NP=1
+ srun ./incompact3d
malloc: cuMemHostAlloc returns error code 201
0: ALLOCATE: 4 bytes requested; not enough memory
srun: error: nid03072: task 0: Exited with exit code 127
srun: Terminating job step 4065377.0

Could you tell me how to fix it?
And about offloading the loops in the der and transpose routines, do you mean I don't add any directives to these subroutines and just let them run on the CPU? Thanks in advance!
Best regards,
Wentao Guo

Could you tell me how to fix it?

Unfortunately, you can't. Pinned allocates memory in the CPU's physical memory, but the error indicates that the system doesn't have (or won't let you have) enough physical memory for your program. So unless you can add more DIMMs, you won't be able to use pinned. Alternatively, you might check if the Cray environment is setting a limit on memory use.
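
One quick diagnostic you could run from inside the batch job (the exact limits depend on how your site has things configured) is:

ulimit -a    # check "max locked memory" and "max memory size"

If the locked-memory limit is very small, page-locked (pinned) allocations can fail.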

And about offloading the loops in der and transpose routines, do you mean I don’t add any directives to these subroutines and just let them run on CPU?

No, the opposite. You will want to add OpenACC compute regions around your parallel loops in the der and transpose routines. The more work you can offload to the device, the fewer times the host and device data need to be synchronized. In other words, once all of your compute is offloaded to the device, you can remove most if not all of the "update" directives, getting you back at least the 20 seconds plus the parallel speed-up from offloading the remaining loops.
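
Schematically, once derx has its own compute regions, the calling code in convdiff would reduce to something like this (just a sketch; the present clause assumes the arrays are already on the device from the data region in your main program):

!$acc kernels present(ux1,uy1,uz1,ta1,tb1,tc1)
   do ijk=1,nvect1
      ta1(ijk,1,1)=ux1(ijk,1,1)*ux1(ijk,1,1)
      tb1(ijk,1,1)=ux1(ijk,1,1)*uy1(ijk,1,1)
      tc1(ijk,1,1)=ux1(ijk,1,1)*uz1(ijk,1,1)
   enddo
!$acc end kernels

! derx now has its own compute regions working on device-resident data,
! so the "update device"/"update self" pairs around these calls go away
   call derx (td1,ta1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)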

Hi Mat,

You said I can accelerate the code by offloading the loops in the der and transpose routines. But then you said:

You're not going to get great performance due to the data layout, but it should be better than having to copy the data back and forth between the host and GPU.

Maybe I'm wrong, but I think here you mean it is better not to add any directives to der because of the data layout. As you said, in the der subroutine there are several loops in the i direction that cannot be parallelized on the GPU because of the data dependency, so I have to compute these loops on the CPU. If I add directives to the other loops I cannot avoid data transfer between the GPU and CPU. So should I not offload the der and transpose routines? Is pinned memory the only way to accelerate my code?
Best regards,
Wentao Guo

Hi Wentao Guo,

Maybe I’m wrong, but I think here you mean it is better not to add any directives to der because of the data layout

Apologies if I’m not clear. I’m saying that you should try to offload the loops in the “der” routines and if possible the “transpose” routines.

While offloading the “der” routines to the GPU device will incur some performance penalty due to memory divergence (because of the data layout), it will still most likely be faster to offload them to the GPU.

Even if the times of these routines are equivalent or slightly slower than the CPU, the savings of not having to copy the data should give you an overall win.

Granted, I don't know for sure what the performance will be, but at least the "derx" routine looks very straightforward to add compute directives to, so it should be simple to try.

-Mat

Hi Mat,
According to your reply, here is what I understood. I took a part of my code (the subroutine convdiff) as an example:

!$acc update device(ux1,uy1,uz1)
!$acc kernels present(ux1,uy1,uz1)
   do ijk=1,nvect1
      ta1(ijk,1,1)=ux1(ijk,1,1)*ux1(ijk,1,1)
      tb1(ijk,1,1)=ux1(ijk,1,1)*uy1(ijk,1,1)
      tc1(ijk,1,1)=ux1(ijk,1,1)*uz1(ijk,1,1)
   enddo
!$acc end kernels
!$acc update self(ta1,tb1,tc1)

   call derx (td1,ta1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
   call derx (te1,tb1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (tf1,tc1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (ta1,ux1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (tb1,uy1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
   call derx (tc1,uz1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)

I think you want me to reduce the data movement between GPU and CPU, so I shouldn’t update ta1,tb1,tc1 to the host immediately after the loop is finished. I should use them in the subroutine derx and try to update the data to the host close to the end of this subroutine named convdiff. And here is a part of my subroutine derx:

if (nclx==0) then 
   do k=1,nz 
   do j=1,ny 
      tx(1,j,k)=afix*(ux(2,j,k)-ux(nx,j,k))& 
           +bfix*(ux(3,j,k)-ux(nx-1,j,k)) 
      rx(1,j,k)=-1. 
      tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
           +bfix*(ux(4,j,k)-ux(nx,j,k)) 
      rx(2,j,k)=0. 
      do i=3,nx-2
         tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
              +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
         rx(i,j,k)=0. 
      enddo
      tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
           +bfix*(ux(1,j,k)-ux(nx-3,j,k)) 
      rx(nx-1,j,k)=0. 
      tx(nx,j,k)=afix*(ux(1,j,k)-ux(nx-1,j,k))&
           +bfix*(ux(2,j,k)-ux(nx-2,j,k)) 
      rx(nx,j,k)=alfaix           
      do i=2, nx
         tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
         rx(i,j,k)=rx(i,j,k)-rx(i-1,j,k)*fsx(i) 
      enddo
      tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
      rx(nx,j,k)=rx(nx,j,k)*fwx(nx) 
      do i=nx-1,1,-1
         tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
         rx(i,j,k)=(rx(i,j,k)-ffx(i)*rx(i+1,j,k))*fwx(i) 
      enddo
      sx(j,k)=(tx(1,j,k)-alfaix*tx(nx,j,k))&
           /(1.+rx(1,j,k)-alfaix*rx(nx,j,k)) 
      do i=1,nx 
         tx(i,j,k)=tx(i,j,k)-sx(j,k)*rx(i,j,k) 
      enddo
   enddo
   enddo
endif

I think I can add some compute directives to offload the first loop in i direction:

do i=3,nx-2 
         tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))& 
              +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
         rx(i,j,k)=0. 
      enddo

But when it comes to the second loop, I run into trouble:

do i=2, nx 
         tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
         rx(i,j,k)=rx(i,j,k)-rx(i-1,j,k)*fsx(i) 
      enddo

I cannot compute this loop on the GPU because of the data dependency. To my current understanding, after the first loop I have to update tx and rx back to the CPU because the second loop needs to use them. So compared with using the CPU alone, I need extra data transfers. You said:

but a better direction would be to offload the loops in the der and transpose routines, so you don’t need to transfer data at all.

I don't know how to avoid the data transfer. Is there a way to run loops which have data dependencies on the GPU without parallelizing them? If what I understood is not right, please correct me.
Best regards,
Wentao Guo

I think you want me to reduce the data movement between GPU and CPU, so I shouldn’t update ta1,tb1,tc1 to the host immediately after the loop is finished. I should use them in the subroutine derx and try to update the data to the host close to the end of this subroutine named convdiff.

Correct.

I cannot compute this loop on GPU because the data is dependent.

Correct, you won't be able to parallelize the "i" loops because of this dependency. However, you should be able to parallelize the "j" and "k" loops. Something like:

if (nclx==0) then 
!$acc kernels loop
   do k=1,nz 
   do j=1,ny 
      tx(1,j,k)=afix*(ux(2,j,k)-ux(nx,j,k))& 
           +bfix*(ux(3,j,k)-ux(nx-1,j,k)) 
      rx(1,j,k)=-1. 
....cut....
      do i=1,nx 
         tx(i,j,k)=tx(i,j,k)-sx(j,k)*rx(i,j,k) 
      enddo 
   enddo 
   enddo 
endif

Then repeat this for each of your if cases.

-Mat

Hi Mat,
Thanks for your reply! I tried your suggestion and added "!$acc kernels loop" in the derx subroutine:

!********************************************************************
!
subroutine derx(tx,ux,rx,sx,ffx,fsx,fwx,nx,ny,nz,npaire) 
!
!********************************************************************

USE param
USE derivX

implicit none

integer :: nx,ny,nz,npaire,i,j,k 
real(mytype), dimension(nx,ny,nz) :: tx,ux,rx 
real(mytype), dimension(ny,nz):: sx
real(mytype), dimension(nx):: ffx,fsx,fwx 

if (nclx==0) then 
!$acc kernels loop
   do k=1,nz 
   do j=1,ny 
      tx(1,j,k)=afix*(ux(2,j,k)-ux(nx,j,k))& 
           +bfix*(ux(3,j,k)-ux(nx-1,j,k)) 
      rx(1,j,k)=-1. 
      tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
           +bfix*(ux(4,j,k)-ux(nx,j,k)) 
      rx(2,j,k)=0. 
      do i=3,nx-2
         tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
              +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
         rx(i,j,k)=0. 
      enddo
      tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
           +bfix*(ux(1,j,k)-ux(nx-3,j,k)) 
      rx(nx-1,j,k)=0. 
      tx(nx,j,k)=afix*(ux(1,j,k)-ux(nx-1,j,k))&
           +bfix*(ux(2,j,k)-ux(nx-2,j,k)) 
      rx(nx,j,k)=alfaix           
      do i=2, nx
         tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
         rx(i,j,k)=rx(i,j,k)-rx(i-1,j,k)*fsx(i) 
      enddo
      tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
      rx(nx,j,k)=rx(nx,j,k)*fwx(nx) 
      do i=nx-1,1,-1
         tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
         rx(i,j,k)=(rx(i,j,k)-ffx(i)*rx(i+1,j,k))*fwx(i) 
      enddo
      sx(j,k)=(tx(1,j,k)-alfaix*tx(nx,j,k))&
           /(1.+rx(1,j,k)-alfaix*rx(nx,j,k)) 
      do i=1,nx 
         tx(i,j,k)=tx(i,j,k)-sx(j,k)*rx(i,j,k) 
      enddo
   enddo
   enddo
endif

if (nclx==1) then 
   if (npaire==1) then 
!$acc kernels loop
      do k=1,nz 
      do j=1,ny 
         tx(1,j,k)=0. 
         tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
              +bfix*(ux(4,j,k)-ux(2,j,k)) 
         do i=3,nx-2 
            tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
                 +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
         enddo
         tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
              +bfix*(ux(nx-1,j,k)-ux(nx-3,j,k)) 
         tx(nx,j,k)=0. 
         do i=2,nx
            tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
         enddo      
         tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
         do i=nx-1,1,-1 
            tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
         enddo
      enddo
      enddo
   endif
   if (npaire==0) then 
!$acc kernels loop
      do k=1,nz 
      do j=1,ny 
         tx(1,j,k)=afix*(ux(2,j,k)+ux(2,j,k))&
              +bfix*(ux(3,j,k)+ux(3,j,k)) 
         tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
              +bfix*(ux(4,j,k)+ux(2,j,k))
         do i=3,nx-2  
            tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
                 +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
         enddo
         tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
              +bfix*((-ux(nx-1,j,k))-ux(nx-3,j,k)) 
         tx(nx,j,k)=afix*((-ux(nx-1,j,k))-ux(nx-1,j,k))&
              +bfix*((-ux(nx-2,j,k))-ux(nx-2,j,k))
         do i=2,nx 
            tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
         enddo
         tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
         do i=nx-1,1,-1 
            tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
         enddo
      enddo
      enddo
   endif
endif

if (nclx==2) then 
!$acc kernels loop
   do k=1,nz
   do j=1,ny 
      tx(1,j,k)=af1x*ux(1,j,k)+bf1x*ux(2,j,k)+cf1x*ux(3,j,k) 
      tx(2,j,k)=af2x*(ux(3,j,k)-ux(1,j,k)) 
      do i=3,nx-2
         tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
              +bfix*(ux(i+2,j,k)-ux(i-2,j,k)) 
      enddo
      tx(nx-1,j,k)=afmx*(ux(nx,j,k)-ux(nx-2,j,k)) 
      tx(nx,j,k)=(-afnx*ux(nx,j,k))-bfnx*ux(nx-1,j,k)-cfnx*ux(nx-2,j,k)
      do i=2,nx 
         tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i) 
      enddo
      tx(nx,j,k)=tx(nx,j,k)*fwx(nx) 
      do i=nx-1,1,-1
         tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i) 
      enddo
   enddo 
   enddo 
endif

return
end subroutine derx

I also tried to reduce data movement in the subroutine (convdiff) which calls derx. Here is a part of convdiff:

!$acc update device(ux1,uy1,uz1)
!$acc kernels present(ux1,uy1,uz1)
   do ijk=1,nvect1
      ta1(ijk,1,1)=ux1(ijk,1,1)*ux1(ijk,1,1)
      tb1(ijk,1,1)=ux1(ijk,1,1)*uy1(ijk,1,1)
      tc1(ijk,1,1)=ux1(ijk,1,1)*uz1(ijk,1,1)
   enddo
!$acc end kernels

   call derx (td1,ta1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
   call derx (te1,tb1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (tf1,tc1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (ta1,ux1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
   call derx (tb1,uy1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
   call derx (tc1,uz1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)

!$acc kernels present(ta1,tb1,tc1,td1,te1,tf1,ux1)
   do ijk=1,nvect1
      ta1(ijk,1,1)=0.5*td1(ijk,1,1)+0.5*ux1(ijk,1,1)*ta1(ijk,1,1)
      tb1(ijk,1,1)=0.5*te1(ijk,1,1)+0.5*ux1(ijk,1,1)*tb1(ijk,1,1)
      tc1(ijk,1,1)=0.5*tf1(ijk,1,1)+0.5*ux1(ijk,1,1)*tc1(ijk,1,1)
   enddo
!$acc end kernels
!$acc update self(ta1,tb1,tc1)

The code compiles, but when I run it the output is not correct. I checked the compiler feedback:

derx:
     50, Generating implicit copyin(fsx(2:nx),ffx(:nx-1),ux(:,:ny,:nz))
         Generating implicit copyout(sx(:ny,:nz))
         Generating implicit allocate(tx(:,:ny,:nz),rx(:,:ny,:nz))
         Generating implicit copy(rx(:2,:ny,:nz),tx(:2,:ny,:nz))
         Generating implicit copyin(fwx(:nx))
     51, Loop is parallelizable
     52, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         51, !$acc loop gang ! blockidx%y
         52, !$acc loop gang ! blockidx%x
         59, !$acc loop vector(128) ! threadidx%x
         70, !$acc loop seq
         76, !$acc loop seq
         82, !$acc loop vector(128) ! threadidx%x
     59, Loop is parallelizable
     70, Loop carried dependence of tx,rx prevents parallelization
         Loop carried backward dependence of tx,rx prevents vectorization
         Inner sequential loop scheduled on accelerator
     76, Loop carried dependence of tx,rx prevents parallelization
         Loop carried backward dependence of tx,rx prevents vectorization
         Inner sequential loop scheduled on accelerator
     82, Loop is parallelizable
     91, Generating implicit allocate(tx(:,:ny,:nz))
         Generating implicit copyin(tx(:2,:ny,:nz))
         Generating implicit copyout(tx(:nx,:ny,:nz))
         Generating implicit copyin(fwx(:nx),ux(:,:ny,:nz),ffx(:nx-1),fsx(2:nx))
     92, Loop is parallelizable
     93, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         92, !$acc loop gang ! blockidx%y
         93, !$acc loop gang ! blockidx%x
         97, !$acc loop vector(128) ! threadidx%x
        104, !$acc loop seq
        108, !$acc loop seq
     97, Loop is parallelizable
    104, Loop carried dependence of tx prevents parallelization
         Loop carried backward dependence of tx prevents vectorization
         Inner sequential loop scheduled on accelerator
    108, Loop carried dependence of tx prevents parallelization
         Loop carried backward dependence of tx prevents vectorization
         Inner sequential loop scheduled on accelerator
    115, Generating implicit copyin(ux(:,:ny,:nz),fwx(:nx))
         Generating implicit allocate(tx(:,:ny,:nz))
         Generating implicit copyin(tx(:2,:ny,:nz))
         Generating implicit copyout(tx(:nx,:ny,:nz))
         Generating implicit copyin(ffx(:nx-1),fsx(2:nx))
    116, Loop is parallelizable
    117, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        116, !$acc loop gang ! blockidx%y
        117, !$acc loop gang ! blockidx%x
        122, !$acc loop vector(128) ! threadidx%x
        130, !$acc loop seq
        134, !$acc loop seq
    122, Loop is parallelizable
    130, Loop carried dependence of tx prevents parallelization
         Loop carried backward dependence of tx prevents vectorization
         Inner sequential loop scheduled on accelerator
    134, Loop carried dependence of tx prevents parallelization
         Loop carried backward dependence of tx prevents vectorization
         Inner sequential loop scheduled on accelerator
    143, Generating implicit copyin(ux(:,:ny,:nz),fwx(:nx))
         Generating implicit allocate(tx(:,:ny,:nz))
         Generating implicit copyin(tx(:2,:ny,:nz))
         Generating implicit copyout(tx(:nx,:ny,:nz))
         Generating implicit copyin(ffx(:nx-1),fsx(2:nx))
    144, Loop is parallelizable
    145, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        144, !$acc loop gang ! blockidx%y
        145, !$acc loop gang ! blockidx%x
        148, !$acc loop vector(128) ! threadidx%x
        154, !$acc loop seq
        158, !$acc loop seq
    148, Loop is parallelizable
    154, Loop carried dependence of tx prevents parallelization
         Loop carried backward dependence of tx prevents vectorization
         Inner sequential loop scheduled on accelerator
    158, Loop carried dependence of tx prevents parallelization
         Loop carried backward dependence of tx prevents vectorization
         Inner sequential loop scheduled on accelerator

Here you can see that for the first IF statement (if (nclx==0)), it returns:

50, Generating implicit copyin(fsx(2:nx),ffx(:nx-1),ux(:,:ny,:nz))
         Generating implicit copyout(sx(:ny,:nz))
         Generating implicit allocate(tx(:,:ny,:nz),rx(:,:ny,:nz))
         Generating implicit copy(rx(:2,:ny,:nz),tx(:2,:ny,:nz))
         Generating implicit copyin(fwx(:nx))

It only copied rx(:2,:ny,:nz) and tx(:2,:ny,:nz) back to the host rather than the whole arrays. However, for the other IF statements (nclx==1 or nclx==2),
the whole tx array is copied out:

Generating implicit copyout(tx(:nx,:ny,:nz))

I only run the case where nclx==0, so I think that's why I got the wrong output. But I'm confused why only a part of the array is copied out for nclx==0, since the other two cases have a similar data layout and structure. I also tried to use "!$acc kernels present", "!$acc end kernels" and "!$acc update self(tx,rx)" directives to replace "!$acc kernels loop" and copy the data back to the host after the loop finishes, but that failed too. Could you give me some suggestions?
Best regards,
Wentao Guo

Can you send the full source and any input files to PGI Customer Service ([email protected])?

I think it would be easier for me to determine the issue if I could see the whole code as well as run it.

Thanks!
Mat

Hi Mat,
I have sent the code to [email protected] and asked them to redirect my email to you. Thanks in advance!
Best regards,
Wentao Guo