Dear all,
My code is written in Fortran and uses OpenMPI. I want to use OpenACC to accelerate it and run it on GPU machines. The code has an outer time step loop, and the calculation in each time step needs the data from the previous time step, so the iterations are not independent. I think I can only add directives to the child loops inside the time step loop. Because I use OpenMPI for parallel computation, the calculation time spent on each time step is not much, only two or three hundred milliseconds, but I need to compute millions of time steps. I did a test simulation: I added OpenACC directives to one of my subroutines and the average calculation time per time step was 0.738 seconds. If I don't add OpenACC directives and only run the code on the CPU, the average time per time step is 0.598 seconds. So my code is slowed down rather than accelerated by adding OpenACC directives. I want to ask whether my code is appropriate for porting to the GPU with OpenACC, because every time step I need to transfer data between the GPU and CPU, and the time for running each time step is small, so I feel it is difficult to accelerate my code. Is that correct? Any suggestion will be appreciated. Thanks in advance!
Best regards,
Wentao Guo
Hi Wentao Guo,
The GPU is a computational beast, but you do need to feed it a lot of computation in order to see good acceleration. Though even with small computations, some speed-up can usually be achieved.
Of the 0.738 seconds, how much of that time is spent copying data? (From your Stack Overflow post, were you able to get the profiler working?)
If there is any data movement, I'd try to remove it as much as possible. It sounds like you use MPI calls within the time step loop, so you could use CUDA-aware OpenMPI to pass device data between ranks, eliminating the need to bring data back to the host. I don't have the details on how to build CUDA-aware support into OpenMPI but can get them if needed. You'd then pass the device data to the MPI calls by using an OpenACC "host_data" region. You may find NVIDIA's 2016 OpenACC online training, Class #2, helpful since it goes into depth on using OpenACC with MPI.
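As a rough sketch of the host_data idea (the routine name, array, and count below are just placeholders, not from your code, and it assumes a CUDA-aware OpenMPI build), passing a device-resident array straight to MPI would look something like:
subroutine send_field(ux1, n, dest, comm)
   use mpi
   implicit none
   integer, intent(in) :: n, dest, comm
   real(8), dimension(n) :: ux1
   integer :: ierr
   ! host_data exposes the device copy of ux1 to the MPI call,
   ! so no update back to the host is needed before communicating.
   ! (Assumes ux1 is already present on the device via a data region.)
   !$acc host_data use_device(ux1)
   call MPI_SEND(ux1, n, MPI_DOUBLE_PRECISION, dest, 0, comm, ierr)
   !$acc end host_data
end subroutine send_field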
Next, I'd review the PGI compiler feedback messages (-Minfo=accel) to see how the loops are being scheduled on the device. You might need to adjust things if the compiler is making a poor choice. It's rare that it doesn't find the optimal schedule, but it's something to look at.
I'd also make sure that your data layout has the "vector" loop index accessing the stride-1 dimension of your arrays (the first dimension in Fortran, the last dimension in C).
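For example (illustrative only, with placeholder array names and bounds), in Fortran you'd want the innermost, stride-1 index on the vector lanes so neighbouring threads touch contiguous memory:
!$acc parallel loop gang collapse(2) present(tx,ux)
do k=1,nz
   do j=1,ny
      !$acc loop vector
      do i=1,nx   ! i is the stride-1 (first) dimension in Fortran
         tx(i,j,k)=2.0*ux(i,j,k)
      enddo
   enddo
enddo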
If you can share the code, I’d be happy to take a look and offer more specific suggestions.
-Mat
Hi Mat,
Thanks for your help. I followed the suggestions you gave me in my previous post named "Launch of the kernel", so I rewrote my directives and paid more attention to data movement. Here is the main program of my code:
PROGRAM incompact3d
USE decomp_2d
USE decomp_2d_poisson
use decomp_2d_io
USE variables
USE param
USE var
USE MPI
USE IBM
USE derivX
USE derivZ
implicit none
integer :: code,nlock,i,j,k,ijk,ii,bcx,bcy,bcz,fh,ierror,nvect1
real(mytype) :: x,y,z,tmp1,phimax,phimin
double precision :: t1,t2
character(len=20) :: filename
TYPE(DECOMP_INFO) :: phG,ph1,ph2,ph3,ph4
CALL MPI_INIT(code)
call decomp_2d_init(nx,ny,nz,p_row,p_col)
!start from 1 == true
call init_coarser_mesh_statS(nstat,nstat,nstat,.true.)
call init_coarser_mesh_statV(nvisu,nvisu,nvisu,.true.)
call parameter()
call init_variables
call schemes()
if (nclx==0) then
bcx=0
else
bcx=1
endif
if (ncly==0) then
bcy=0
else
bcy=1
endif
if (nclz==0) then
bcz=0
else
bcz=1
endif
call decomp_2d_poisson_init(bcx,bcy,bcz)
call decomp_info_init(nxm,nym,nzm,phG)
!if you want to collect 100 snapshots randomly on XXXXX time steps
!call collect_data() !it will generate 100 random time steps
if (ilit==0) call init(ux1,uy1,uz1,ep1,phi1,gx1,gy1,gz1,phis1,hx1,hy1,hz1,phiss1)
if (ilit==1) call restart(ux1,uy1,uz1,ep1,pp3,phi1,gx1,gy1,gz1,&
px1,py1,pz1,phis1,hx1,hy1,hz1,phiss1,phG,0)
call test_speed_min_max(ux1,uy1,uz1)
if (iscalar==1) call test_scalar_min_max(phi1)
!array for stat to zero
umean=0.;vmean=0.;wmean=0.
uumean=0.;vvmean=0.;wwmean=0.
uvmean=0.;uwmean=0.;vwmean=0.
phimean=0.;phiphimean=0.
utmean=0.;vtmean=0.;wtmean=0.
t1 = MPI_WTIME()
!div: nx ny nz --> nxm ny nz --> nxm nym nz --> nxm nym nzm
call decomp_info_init(nxm, nym, nzm, ph1)
call decomp_info_init(nxm, ny, nz, ph4)
!gradp: nxm nym nzm -> nxm nym nz --> nxm ny nz --> nx ny nz
call decomp_info_init(nxm, ny, nz, ph2)
call decomp_info_init(nxm, nym, nz, ph3)
!$acc data copy(ux1,uy1,uz1,phi1) create(ta1,tb1,tc1,td1,te1,tf1,tg1,th1, &
!$acc ti1,di1,ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2, &
!$acc ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3,pp2y,pp4y) &
!$acc copyin(xsize,ysize,zsize,xstart)
do itime=ifirst,ilast
t=(itime-1)*dt
if (nrank==0) then
write(*,1001) itime,t
1001 format('Time step =',i7,', Time unit =',F9.3)
endif
do itr=1,iadvance_time
if (nclx.eq.2) then
call inflow (ux1,uy1,uz1,phi1) !X PENCILS
call outflow(ux1,uy1,uz1,phi1) !X PENCILS
endif
!if (itr.eq.1) then
! print *, 'nrank, xstart(1:3): ', nrank, xstart(1), xstart(2), xstart(3)
! phimax=-1.0d6
! phimin= 1.0d6
! do k=1,xsize(3)
! do j=1,xsize(2)
! do i=1,xsize(1)
! if (phi1(i,j,k).gt.phimax) phimax=phi1(i,j,k)
! if (phi1(i,j,k).lt.phimin) phimin=phi1(i,j,k)
! enddo
! enddo
! enddo
! print *,'b-nrank:',nrank, phimax, phimin
!endif
!X-->Y-->Z-->Y-->X
call convdiff(ux1,uy1,uz1,ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1,&
ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2,&
ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3)
!$acc update device(tb1,phi1)
!$acc kernels present(tb1,phi1,xstart)
do k=1,xsize(3)
do j=1,xsize(2)
do i=1,xsize(1)
tb1(i,j,k)=tb1(i,j,k)+Betaexp*DeltaT*gravity*phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
enddo
enddo
enddo
!$acc end kernels
!$acc update self(tb1)
!if (itr.eq.1) then
! phimax=-1.0d6
! phimin= 1.0d6
! do k=1,xsize(3)
! do j=1,xsize(2)
! do i=1,xsize(1)
! if (phi1(i,j,k).gt.phimax) phimax=phi1(i,j,k)
! if (phi1(i,j,k).lt.phimin) phimin=phi1(i,j,k)
! enddo
! enddo
! enddo
! print *,'a-nrank:',nrank, phimax, phimin
!endif
if (iscalar==1) then
call scalar(ux1,uy1,uz1,phi1,phis1,phiss1,di1,tg1,th1,ti1,td1,&
uy2,uz2,phi2,di2,ta2,tb2,tc2,td2,uz3,phi3,di3,ta3,tb3,ep1)
endif
!X PENCILS
call intt (ux1,uy1,uz1,gx1,gy1,gz1,hx1,hy1,hz1,ta1,tb1,tc1)
call pre_correc(ux1,uy1,uz1)
if (ivirt==1) then !solid body old school
!we are in X-pencil
call corgp_IBM(ux1,uy1,uz1,px1,py1,pz1,1)
call body(ux1,uy1,uz1,ep1)
call corgp_IBM(ux1,uy1,uz1,px1,py1,pz1,2)
endif
!X-->Y-->Z
call divergence (ux1,uy1,uz1,ep1,ta1,tb1,tc1,di1,td1,te1,tf1,&
td2,te2,tf2,di2,ta2,tb2,tc2,ta3,tb3,tc3,di3,td3,te3,tf3,pp3,&
nxmsize,nymsize,nzmsize,ph1,ph3,ph4,1)
!POISSON Z-->Z
call decomp_2d_poisson_stg(pp3,bcx,bcy,bcz)
!Z-->Y-->X
call gradp(px1,py1,pz1,di1,td2,tf2,ta2,tb2,tc2,di2,&
ta3,tc3,di3,pp3,nxmsize,nymsize,nzmsize,ph2,ph3)
!X PENCILS
call corgp(ux1,ux2,uy1,uz1,px1,py1,pz1)
!does not matter -->output=DIV U=0 (in dv3)
call divergence (ux1,uy1,uz1,ep1,ta1,tb1,tc1,di1,td1,te1,tf1,&
td2,te2,tf2,di2,ta2,tb2,tc2,ta3,tb3,tc3,di3,td3,te3,tf3,dv3,&
nxmsize,nymsize,nzmsize,ph1,ph3,ph4,2)
! if (nrank==0) ux1(12,64,:)=2.
! if (nrank==1) ux1(12,64,:)=-1.
!if (nrank==0) ux(12,64,42)=1.5
!print *,nrank,xstart(2),xend(2),xstart(3),xend(3)
call test_speed_min_max(ux1,uy1,uz1)
if (iscalar==1) call test_scalar_min_max(phi1)
enddo
if(itime.gt.100000)then
call STATISTIC(ux1,uy1,uz1,phi1,ta1,umean,vmean,wmean,phimean,uumean,vvmean,wwmean,&
uvmean,uwmean,vwmean,phiphimean,tmean,utmean,vtmean,wtmean)
endif
if (mod(itime,isave)==0) call restart(ux1,uy1,uz1,ep1,pp3,phi1,gx1,gy1,gz1,&
px1,py1,pz1,phis1,hx1,hy1,hz1,phiss1,phG,1)
if (mod(itime,imodulo)==0) then
call VISU_INSTA(ux1,uy1,uz1,phi1,ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1,&
ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2,&
ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3,phG,uvisu)
call VISU_PRE (pp3,ta1,tb1,di1,ta2,tb2,di2,&
ta3,di3,nxmsize,nymsize,nzmsize,phG,ph2,ph3,uvisu)
endif
enddo
!$acc end data
t2=MPI_WTIME()-t1
call MPI_ALLREDUCE(t2,t1,1,MPI_REAL8,MPI_SUM, &
MPI_COMM_WORLD,code)
if (nrank==0) print *,'time per time_step: ', &
t1/float(nproc)/(ilast-ifirst+1),' seconds'
if (nrank==0) print *,'simulation with nx*ny*nz=',nx,ny,nz,'mesh nodes'
if (nrank==0) print *,'Mapping p_row*p_col=',p_row,p_col
!call decomp_2d_poisson_finalize
call decomp_2d_finalize
CALL MPI_FINALIZE(code)
end PROGRAM incompact3d
Here you can see that before my outer time step loop (do itime=ifirst,ilast) I added some OpenACC directives to copy or create data on the device.
!$acc data copy(ux1,uy1,uz1,phi1) create(ta1,tb1,tc1,td1,te1,tf1,tg1,th1, &
!$acc ti1,di1,ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2, &
!$acc ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3,pp2y,pp4y) &
!$acc copyin(xsize,ysize,zsize,xstart)
do itime=ifirst,ilast
Then inside the time loop, I added some directives to the subroutine named convdiff and also to one loop, shown below:
call convdiff(ux1,uy1,uz1,ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1,&
ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2,&
ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3)
!$acc update device(tb1,phi1)
!$acc kernels present(tb1,phi1,xstart)
do k=1,xsize(3)
do j=1,xsize(2)
do i=1,xsize(1)
tb1(i,j,k)=tb1(i,j,k)+Betaexp*DeltaT*gravity*phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
enddo
enddo
enddo
!$acc end kernels
!$acc update self(tb1)
Here is the subroutine convdiff:
!********************************************************************
!
subroutine convdiff(ux1,uy1,uz1,ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1,&
ux2,uy2,uz2,ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2,&
ux3,uy3,uz3,ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3)
!
!********************************************************************
USE param
USE variables
USE decomp_2d
implicit none
real(mytype),dimension(xsize(1),xsize(2),xsize(3)) :: ux1,uy1,uz1
real(mytype),dimension(xsize(1),xsize(2),xsize(3)) :: ta1,tb1,tc1,td1,te1,tf1,tg1,th1,ti1,di1
real(mytype),dimension(ysize(1),ysize(2),ysize(3)) :: ux2,uy2,uz2
real(mytype),dimension(ysize(1),ysize(2),ysize(3)) :: ta2,tb2,tc2,td2,te2,tf2,tg2,th2,ti2,tj2,di2
real(mytype),dimension(zsize(1),zsize(2),zsize(3)) :: ux3,uy3,uz3
real(mytype),dimension(zsize(1),zsize(2),zsize(3)) :: ta3,tb3,tc3,td3,te3,tf3,tg3,th3,ti3,di3
integer :: ijk,nvect1,nvect2,nvect3,i,j,k
real(mytype) :: x,y,z
nvect1=xsize(1)*xsize(2)*xsize(3)
nvect2=ysize(1)*ysize(2)*ysize(3)
nvect3=zsize(1)*zsize(2)*zsize(3)
if (iskew==0) then !UROTU!
!WORK X-PENCILS
call derx (ta1,uy1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
call derx (tb1,uz1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
call transpose_x_to_y(ux1,ux2)
call transpose_x_to_y(uy1,uy2)
call transpose_x_to_y(uz1,uz2)
call transpose_x_to_y(ta1,ta2)
call transpose_x_to_y(tb1,tb2)
!WORK Y-PENCILS
call dery (tc2,ux2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)
call dery (td2,uz2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)
call transpose_y_to_z(ux2,ux3)
call transpose_y_to_z(uy2,uy3)
call transpose_y_to_z(uz2,uz3)
call transpose_y_to_z(ta2,ta3)
call transpose_y_to_z(tb2,tb3)
call transpose_y_to_z(tc2,tc3)
call transpose_y_to_z(td2,td3)
!WORK Z-PENCILS
call derz (te3,ux3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
call derz (tf3,uy3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
do ijk=1,nvect3
ta3(ijk,1,1)=uz3(ijk,1,1)*(te3(ijk,1,1)-tb3(ijk,1,1))-&
uy3(ijk,1,1)*(ta3(ijk,1,1)-tc3(ijk,1,1))
tb3(ijk,1,1)=ux3(ijk,1,1)*(ta3(ijk,1,1)-tc3(ijk,1,1))-&
uz3(ijk,1,1)*(td3(ijk,1,1)-tf3(ijk,1,1))
tc3(ijk,1,1)=uy3(ijk,1,1)*(td3(ijk,1,1)-tf3(ijk,1,1))-&
ux3(ijk,1,1)*(te3(ijk,1,1)-tb3(ijk,1,1))
enddo
else !SKEW!
!WORK X-PENCILS
!$acc update device(ux1,uy1,uz1)
!$acc kernels present(ux1,uy1,uz1)
do ijk=1,nvect1
ta1(ijk,1,1)=ux1(ijk,1,1)*ux1(ijk,1,1)
tb1(ijk,1,1)=ux1(ijk,1,1)*uy1(ijk,1,1)
tc1(ijk,1,1)=ux1(ijk,1,1)*uz1(ijk,1,1)
enddo
!$acc end kernels
!$acc update self(ta1,tb1,tc1)
call derx (td1,ta1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
call derx (te1,tb1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (tf1,tc1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (ta1,ux1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (tb1,uy1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
call derx (tc1,uz1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
!$acc update device(ta1,tb1,tc1,td1,te1,tf1)
!$acc kernels present(td1,te1,tf1,ux1)
do ijk=1,nvect1
ta1(ijk,1,1)=0.5*td1(ijk,1,1)+0.5*ux1(ijk,1,1)*ta1(ijk,1,1)
tb1(ijk,1,1)=0.5*te1(ijk,1,1)+0.5*ux1(ijk,1,1)*tb1(ijk,1,1)
tc1(ijk,1,1)=0.5*tf1(ijk,1,1)+0.5*ux1(ijk,1,1)*tc1(ijk,1,1)
enddo
!$acc end kernels
!$acc update self(ta1,tb1,tc1)
call transpose_x_to_y(ux1,ux2)
call transpose_x_to_y(uy1,uy2)
call transpose_x_to_y(uz1,uz2)
call transpose_x_to_y(ta1,ta2)
call transpose_x_to_y(tb1,tb2)
call transpose_x_to_y(tc1,tc2)
!WORK Y-PENCILS
!$acc update device(ux2,uy2,uz2)
!$acc kernels present(ux2,uy2,uz2)
do ijk=1,nvect2
td2(ijk,1,1)=ux2(ijk,1,1)*uy2(ijk,1,1)
te2(ijk,1,1)=uy2(ijk,1,1)*uy2(ijk,1,1)
tf2(ijk,1,1)=uz2(ijk,1,1)*uy2(ijk,1,1)
enddo
!$acc end kernels
!$acc update self(td2,te2,tf2)
call dery (tg2,td2,di2,sy,ffy,fsy,fwy,ppy,ysize(1),ysize(2),ysize(3),0)
call dery (th2,te2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)
call dery (ti2,tf2,di2,sy,ffy,fsy,fwy,ppy,ysize(1),ysize(2),ysize(3),0)
call dery (td2,ux2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)
call dery (te2,uy2,di2,sy,ffy,fsy,fwy,ppy,ysize(1),ysize(2),ysize(3),0)
call dery (tf2,uz2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)
!$acc update device(ta2,tb2,tc2,tg2,th2,ti2,td2,te2,tf2)
!$acc kernels present(ta2,tb2,tc2,tg2,th2,ti2,td2,te2,tf2,uy2)
do ijk=1,nvect2
ta2(ijk,1,1)=ta2(ijk,1,1)+0.5*tg2(ijk,1,1)+0.5*uy2(ijk,1,1)*td2(ijk,1,1)
tb2(ijk,1,1)=tb2(ijk,1,1)+0.5*th2(ijk,1,1)+0.5*uy2(ijk,1,1)*te2(ijk,1,1)
tc2(ijk,1,1)=tc2(ijk,1,1)+0.5*ti2(ijk,1,1)+0.5*uy2(ijk,1,1)*tf2(ijk,1,1)
enddo
!$acc end kernels
!$acc update self(ta2,tb2,tc2)
call transpose_y_to_z(ux2,ux3)
call transpose_y_to_z(uy2,uy3)
call transpose_y_to_z(uz2,uz3)
call transpose_y_to_z(ta2,ta3)
call transpose_y_to_z(tb2,tb3)
call transpose_y_to_z(tc2,tc3)
!WORK Z-PENCILS
!$acc update device(ux3,uy3,uz3)
!$acc kernels present(ux3,uy3,uz3)
do ijk=1,nvect3
td3(ijk,1,1)=ux3(ijk,1,1)*uz3(ijk,1,1)
te3(ijk,1,1)=uy3(ijk,1,1)*uz3(ijk,1,1)
tf3(ijk,1,1)=uz3(ijk,1,1)*uz3(ijk,1,1)
enddo
!$acc end kernels
!$acc update self(td3,te3,tf3)
call derz (tg3,td3,di3,sz,ffz,fsz,fwz,zsize(1),zsize(2),zsize(3),0)
call derz (th3,te3,di3,sz,ffz,fsz,fwz,zsize(1),zsize(2),zsize(3),0)
call derz (ti3,tf3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
call derz (td3,ux3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
call derz (te3,uy3,di3,sz,ffzp,fszp,fwzp,zsize(1),zsize(2),zsize(3),1)
call derz (tf3,uz3,di3,sz,ffz,fsz,fwz,zsize(1),zsize(2),zsize(3),0)
!$acc update device(ta3,tb3,tc3,tg3,th3,ti3,td3,te3,tf3)
!$acc kernels present(ta3,tb3,tc3,tg3,th3,ti3,td3,te3,tf3)
do ijk=1,nvect3
ta3(ijk,1,1)=ta3(ijk,1,1)+0.5*tg3(ijk,1,1)+0.5*uz3(ijk,1,1)*td3(ijk,1,1)
tb3(ijk,1,1)=tb3(ijk,1,1)+0.5*th3(ijk,1,1)+0.5*uz3(ijk,1,1)*te3(ijk,1,1)
tc3(ijk,1,1)=tc3(ijk,1,1)+0.5*ti3(ijk,1,1)+0.5*uz3(ijk,1,1)*tf3(ijk,1,1)
enddo
!$acc end kernels
!$acc update self(ta3,tb3,tc3)
endif
!ALL THE CONVECTIVE TERMS ARE IN TA3, TB3 and TC3
td3(:,:,:)=ta3(:,:,:)
te3(:,:,:)=tb3(:,:,:)
tf3(:,:,:)=tc3(:,:,:)
!DIFFUSIVE TERMS IN Z
call derzz (ta3,ux3,di3,sz,sfzp,sszp,swzp,zsize(1),zsize(2),zsize(3),1)
call derzz (tb3,uy3,di3,sz,sfzp,sszp,swzp,zsize(1),zsize(2),zsize(3),1)
call derzz (tc3,uz3,di3,sz,sfz ,ssz ,swz ,zsize(1),zsize(2),zsize(3),0)
!WORK Y-PENCILS
call transpose_z_to_y(ta3,ta2)
call transpose_z_to_y(tb3,tb2)
call transpose_z_to_y(tc3,tc2)
call transpose_z_to_y(td3,td2)
call transpose_z_to_y(te3,te2)
call transpose_z_to_y(tf3,tf2)
tg2(:,:,:)=td2(:,:,:)
th2(:,:,:)=te2(:,:,:)
ti2(:,:,:)=tf2(:,:,:)
!DIFFUSIVE TERMS IN Y
!-->for ux
if (istret.ne.0) then
call deryy (td2,ux2,di2,sy,sfyp,ssyp,swyp,ysize(1),ysize(2),ysize(3),1)
call dery (te2,ux2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)
!$acc update device(td2,pp2y,pp4y,te2)
!$acc kernels present(td2,pp2y,pp4y,te2)
do k=1,ysize(3)
do j=1,ysize(2)
do i=1,ysize(1)
td2(i,j,k)=td2(i,j,k)*pp2y(j)-pp4y(j)*te2(i,j,k)
enddo
enddo
enddo
!$acc end kernels
!$acc update self(td2)
else
call deryy (td2,ux2,di2,sy,sfyp,ssyp,swyp,ysize(1),ysize(2),ysize(3),1)
endif
!-->for uy
if (istret.ne.0) then
call deryy (te2,uy2,di2,sy,sfy,ssy,swy,ysize(1),ysize(2),ysize(3),0)
call dery (tf2,uy2,di2,sy,ffy,fsy,fwy,ppy,ysize(1),ysize(2),ysize(3),0)
!$acc update device(te2,tf2)
!$acc kernels present(te2,pp2y,pp4y,tf2)
do k=1,ysize(3)
do j=1,ysize(2)
do i=1,ysize(1)
te2(i,j,k)=te2(i,j,k)*pp2y(j)-pp4y(j)*tf2(i,j,k)
enddo
enddo
enddo
!$acc end kernels
!$acc update self(te2)
else
call deryy (te2,uy2,di2,sy,sfy,ssy,swy,ysize(1),ysize(2),ysize(3),0)
endif
!-->for uz
if (istret.ne.0) then
call deryy (tf2,uz2,di2,sy,sfyp,ssyp,swyp,ysize(1),ysize(2),ysize(3),1)
call dery (tj2,uz2,di2,sy,ffyp,fsyp,fwyp,ppy,ysize(1),ysize(2),ysize(3),1)
!$acc update device(tf2,tj2)
!$acc kernels present(tf2,pp2y,pp4y,tj2)
do k=1,ysize(3)
do j=1,ysize(2)
do i=1,ysize(1)
tf2(i,j,k)=tf2(i,j,k)*pp2y(j)-pp4y(j)*tj2(i,j,k)
enddo
enddo
enddo
!$acc end kernels
!$acc update self(tf2)
else
call deryy (tf2,uz2,di2,sy,sfyp,ssyp,swyp,ysize(1),ysize(2),ysize(3),1)
endif
ta2(:,:,:)=ta2(:,:,:)+td2(:,:,:)
tb2(:,:,:)=tb2(:,:,:)+te2(:,:,:)
tc2(:,:,:)=tc2(:,:,:)+tf2(:,:,:)
!WORK X-PENCILS
call transpose_y_to_x(ta2,ta1)
call transpose_y_to_x(tb2,tb1)
call transpose_y_to_x(tc2,tc1) !diff
call transpose_y_to_x(tg2,td1)
call transpose_y_to_x(th2,te1)
call transpose_y_to_x(ti2,tf1) !conv
tg1(:,:,:)=td1(:,:,:)
th1(:,:,:)=te1(:,:,:)
ti1(:,:,:)=tf1(:,:,:)
!DIFFUSIVE TERMS IN X
call derxx (td1,ux1,di1,sx,sfx ,ssx ,swx ,xsize(1),xsize(2),xsize(3),0)
call derxx (te1,uy1,di1,sx,sfxp,ssxp,swxp,xsize(1),xsize(2),xsize(3),1)
call derxx (tf1,uz1,di1,sx,sfxp,ssxp,swxp,xsize(1),xsize(2),xsize(3),1)
ta1(:,:,:)=ta1(:,:,:)+td1(:,:,:)
tb1(:,:,:)=tb1(:,:,:)+te1(:,:,:)
tc1(:,:,:)=tc1(:,:,:)+tf1(:,:,:)
if(itime.lt.20000)then
if (nrank==1) print *,'WARNING rotating channel',itime
tg1(:,:,:)=tg1(:,:,:)-2./18.*uy1(:,:,:)
th1(:,:,:)=th1(:,:,:)-2./18.*ux1(:,:,:)
endif
!FINAL SUM: DIFF TERMS + CONV TERMS
ta1(:,:,:)=xnu*ta1(:,:,:)-tg1(:,:,:)
tb1(:,:,:)=xnu*tb1(:,:,:)-th1(:,:,:)
tc1(:,:,:)=xnu*tc1(:,:,:)-ti1(:,:,:)
end subroutine convdiff
The subroutine convdiff also calls other subroutines such as derx, dery, derxx, etc. I haven't added any directives to the subroutines called inside convdiff. The subroutine derx is shown below; the other subroutines are similar:
!********************************************************************
!
subroutine derx(tx,ux,rx,sx,ffx,fsx,fwx,nx,ny,nz,npaire)
!
!********************************************************************
USE param
USE derivX
implicit none
integer :: nx,ny,nz,npaire,i,j,k
real(mytype), dimension(nx,ny,nz) :: tx,ux,rx
real(mytype), dimension(ny,nz):: sx
real(mytype), dimension(nx):: ffx,fsx,fwx
if (nclx==0) then
do k=1,nz
do j=1,ny
tx(1,j,k)=afix*(ux(2,j,k)-ux(nx,j,k))&
+bfix*(ux(3,j,k)-ux(nx-1,j,k))
rx(1,j,k)=-1.
tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
+bfix*(ux(4,j,k)-ux(nx,j,k))
rx(2,j,k)=0.
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
rx(i,j,k)=0.
enddo
tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
+bfix*(ux(1,j,k)-ux(nx-3,j,k))
rx(nx-1,j,k)=0.
tx(nx,j,k)=afix*(ux(1,j,k)-ux(nx-1,j,k))&
+bfix*(ux(2,j,k)-ux(nx-2,j,k))
rx(nx,j,k)=alfaix
do i=2, nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
rx(i,j,k)=rx(i,j,k)-rx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
rx(nx,j,k)=rx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
rx(i,j,k)=(rx(i,j,k)-ffx(i)*rx(i+1,j,k))*fwx(i)
enddo
sx(j,k)=(tx(1,j,k)-alfaix*tx(nx,j,k))&
/(1.+rx(1,j,k)-alfaix*rx(nx,j,k))
do i=1,nx
tx(i,j,k)=tx(i,j,k)-sx(j,k)*rx(i,j,k)
enddo
enddo
enddo
endif
if (nclx==1) then
if (npaire==1) then
do k=1,nz
do j=1,ny
tx(1,j,k)=0.
tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
+bfix*(ux(4,j,k)-ux(2,j,k))
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
enddo
tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
+bfix*(ux(nx-1,j,k)-ux(nx-3,j,k))
tx(nx,j,k)=0.
do i=2,nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
enddo
enddo
enddo
endif
if (npaire==0) then
do k=1,nz
do j=1,ny
tx(1,j,k)=afix*(ux(2,j,k)+ux(2,j,k))&
+bfix*(ux(3,j,k)+ux(3,j,k))
tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
+bfix*(ux(4,j,k)+ux(2,j,k))
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
enddo
tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
+bfix*((-ux(nx-1,j,k))-ux(nx-3,j,k))
tx(nx,j,k)=afix*((-ux(nx-1,j,k))-ux(nx-1,j,k))&
+bfix*((-ux(nx-2,j,k))-ux(nx-2,j,k))
do i=2,nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
enddo
enddo
enddo
endif
endif
if (nclx==2) then
do k=1,nz
do j=1,ny
tx(1,j,k)=af1x*ux(1,j,k)+bf1x*ux(2,j,k)+cf1x*ux(3,j,k)
tx(2,j,k)=af2x*(ux(3,j,k)-ux(1,j,k))
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
enddo
tx(nx-1,j,k)=afmx*(ux(nx,j,k)-ux(nx-2,j,k))
tx(nx,j,k)=(-afnx*ux(nx,j,k))-bfnx*ux(nx-1,j,k)-cfnx*ux(nx-2,j,k)
do i=2,nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
enddo
enddo
enddo
endif
return
end subroutine derx
I compiled my code and submitted it to the Cray XC50 cluster, which has both CPUs and GPUs. I only use one core for the computation.
Here is my script for submitting the code:
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=12
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=1
#SBATCH --constraint=gpu
#SBATCH --time=24:00:00
#SBATCH --account=s807
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CRAY_CUDA_MPS=1
export TMPDIR=/scratch/snx3000/guow/Incompact3d_GPU/test7_convdiff_1_core
export PGI_ACC_TIME=1
set -ex
# set some parameters
OUT=log.txt
#WMIN=1400
NP=1
# tasks: $SLURM_NTASKS
# tasks-per-node: $SLURM_NTASKS_PER_NODE
# cpus-per-task: $SLURM_CPUS_PER_TASK
# srun nvprof --cpu-profiling on -o nvprof.%p.out ./incompact3d $WMIN >> $OUT
srun ./incompact3d $WMIN >> $OUT
# srun nvprof -o nvprof.%p.out ./incompact3d $WMIN >> $OUT
Here is the profiling information from PGI_ACC_TIME:
ModuleCmd_Switch.c(179):ERROR:152: Module 'PrgEnv-cray/6.0.4' is currently not loaded
+ OUT=log.txt
+ NP=1
+ srun ./incompact3d
Accelerator Kernel Timing data
/scratch/snx3000/guow/Incompact3d_GPU/test9_convdiff_optimize/source_code/incompact3d.f90
incompact3d NVIDIA devicenum=0
time(us): 1,045,364
113: data region reached 2 times
113: data copyin transfers: 81
device time(us): total=44,437 max=1,351 min=4 avg=548
244: data copyout transfers: 36
device time(us): total=41,177 max=1,294 min=46 avg=1,143
153: update directive reached 30 times
153: data copyin transfers: 540
device time(us): total=651,492 max=1,358 min=46 avg=1,206
154: compute region reached 30 times
157: kernel launched 30 times
grid: [8x256x65] block: [32x4]
elapsed time(us): total=23,532 max=793 min=780 avg=784
154: data region reached 60 times
163: update directive reached 30 times
163: data copyout transfers: 270
device time(us): total=308,258 max=1,311 min=46 avg=1,141
/scratch/snx3000/guow/Incompact3d_GPU/test9_convdiff_optimize/source_code/convdiff.f90
convdiff NVIDIA devicenum=0
time(us): 19,179,300
98: update directive reached 30 times
98: data copyin transfers: 810
device time(us): total=977,150 max=1,356 min=46 avg=1,206
99: compute region reached 30 times
100: kernel launched 30 times
grid: [65535] block: [128]
elapsed time(us): total=47,765 max=2,281 min=1,558 avg=1,592
99: data region reached 120 times
106: update directive reached 30 times
106: data copyout transfers: 810
device time(us): total=925,810 max=1,325 min=46 avg=1,142
115: update directive reached 30 times
115: data copyin transfers: 1620
device time(us): total=1,954,015 max=1,357 min=46 avg=1,206
116: compute region reached 30 times
117: kernel launched 30 times
grid: [65535] block: [128]
elapsed time(us): total=73,801 max=2,474 min=2,457 avg=2,460
116: data region reached 120 times
123: update directive reached 30 times
123: data copyout transfers: 810
device time(us): total=925,485 max=1,322 min=46 avg=1,142
133: update directive reached 30 times
133: data copyin transfers: 810
device time(us): total=977,091 max=1,357 min=46 avg=1,206
134: compute region reached 30 times
135: kernel launched 30 times
grid: [65535] block: [128]
elapsed time(us): total=47,015 max=1,572 min=1,560 avg=1,567
134: data region reached 120 times
141: update directive reached 30 times
141: data copyout transfers: 810
device time(us): total=925,293 max=1,329 min=46 avg=1,142
150: update directive reached 30 times
150: data copyin transfers: 2430
device time(us): total=2,930,724 max=1,362 min=46 avg=1,206
151: compute region reached 30 times
152: kernel launched 30 times
grid: [65535] block: [128]
elapsed time(us): total=96,829 max=3,239 min=3,212 avg=3,227
151: data region reached 60 times
158: update directive reached 30 times
158: data copyout transfers: 810
device time(us): total=925,619 max=1,308 min=46 avg=1,142
169: update directive reached 30 times
169: data copyin transfers: 810
device time(us): total=977,057 max=1,360 min=46 avg=1,206
170: compute region reached 30 times
171: kernel launched 30 times
grid: [65535] block: [128]
elapsed time(us): total=47,122 max=1,577 min=1,565 avg=1,570
170: data region reached 120 times
177: update directive reached 30 times
177: data copyout transfers: 810
device time(us): total=925,148 max=1,322 min=46 avg=1,142
186: update directive reached 30 times
186: data copyin transfers: 2430
device time(us): total=2,930,691 max=1,380 min=46 avg=1,206
187: compute region reached 30 times
188: kernel launched 30 times
grid: [65535] block: [128]
elapsed time(us): total=96,743 max=3,231 min=3,221 avg=3,224
187: data region reached 120 times
194: update directive reached 30 times
194: data copyout transfers: 810
device time(us): total=925,410 max=1,316 min=46 avg=1,142
227: update directive reached 30 times
227: data copyin transfers: 600
device time(us): total=651,771 max=1,358 min=4 avg=1,086
228: compute region reached 30 times
231: kernel launched 30 times
grid: [8x257x64] block: [32x4]
elapsed time(us): total=23,524 max=789 min=781 avg=784
228: data region reached 60 times
237: update directive reached 30 times
237: data copyout transfers: 270
device time(us): total=308,323 max=1,304 min=46 avg=1,141
245: update directive reached 30 times
245: data copyin transfers: 540
device time(us): total=651,436 max=1,356 min=46 avg=1,206
246: compute region reached 30 times
249: kernel launched 30 times
grid: [8x257x64] block: [32x4]
elapsed time(us): total=23,754 max=794 min=790 avg=791
246: data region reached 60 times
255: update directive reached 30 times
255: data copyout transfers: 270
device time(us): total=308,324 max=1,308 min=46 avg=1,141
264: update directive reached 30 times
264: data copyin transfers: 540
device time(us): total=651,422 max=1,358 min=46 avg=1,206
265: compute region reached 30 times
268: kernel launched 30 times
grid: [8x257x64] block: [32x4]
elapsed time(us): total=23,603 max=793 min=776 avg=786
265: data region reached 60 times
274: update directive reached 30 times
274: data copyout transfers: 270
device time(us): total=308,531 max=1,308 min=46 avg=1,142
The mesh I used is 256 x 257 x 256. I only ran ten time steps, and each time step took 54.26679129600525 seconds. If I don't add any directives, each step only needs 51.9558027982711 seconds, so the code is still slowed down a little rather than accelerated. I haven't found a way to optimize the data movement. Could you show me the right way to add directives in order to achieve acceleration? Thanks in advance!
Best regards,
Wentao Guo
I'm sorry for the bad formatting, but if you click "post reply" you can see the full text in the topic review.
So it looks like the code is spending about 20 seconds transferring data back and forth. You could probably improve that by using Pinned memory (-ta=tesla:pinned), but a better direction would be to offload the loops in the der and transpose routines, so you don’t need to transfer data at all.
Looking at derx, I don't see any reason why you can't parallelize across the "j" and "k" loops in each case. Ideally you'd be able to parallelize across the "i" loops as well, since this is the stride-1 dimension, but I see a few of the "i" loops have dependencies so they can't be parallelized. You're not going to get great performance due to the data layout, but it should be better than having to copy the data back and forth between the host and GPU.
Hi Mat,
I tried to use pinned memory. Here is the makefile I used:
FC = ftn
OPTFC = -acc -Minfo=accel -ta=tesla:pinned -cpp
CC = cc
CFLAGS = -O3
all: incompact3d
alloc_shm.o: alloc_shm.c
$(CC) $(CFLAGS) -c $<
FreeIPC_c.o: FreeIPC_c.c
$(CC) $(CFLAGS) -c $<
incompact3d : $(OBJ)
$(FC) -O3 -acc -ta=tesla:pinned -o $@ $(OBJ) $(LIBFFT)
%.o : %.f90
$(FC) $(OPTFC) $(OPTIONS) $(INC) -c $<
.PHONY: clean
clean:
rm -f *.o *.mod incompact3d
I can compile the code, but when I submit the job to the cluster, I receive the following error:
ModuleCmd_Switch.c(179):ERROR:152: Module 'PrgEnv-cray/6.0.4' is currently not loaded
+ OUT=log.txt
+ NP=1
+ srun ./incompact3d
malloc: cuMemHostAlloc returns error code 201
0: ALLOCATE: 4 bytes requested; not enough memory
srun: error: nid03072: task 0: Exited with exit code 127
srun: Terminating job step 4065377.0
Could you tell me how to fix it?
And about offloading the loops in der and transpose routines, do you mean I don't add any directives to these subroutines and just let them run on CPU? Thanks in advance!
Best regards,
Wentao Guo
Could you tell me how to fix it?
Unfortunately, you can't. Pinned allocates memory in the CPU's physical memory, but the error indicates that the system doesn't have (or won't let you have) enough physical memory for your program. So unless you can add more DIMMs, you won't be able to use pinned. Alternatively, you might check if the Cray environment is setting a limit on memory use.
And about offloading the loops in der and transpose routines, do you mean I don’t add any directives to these subroutines and just let them run on CPU?
No, the opposite. You will want to add OpenACC compute regions around your parallel loops in the der and transpose routines. The more work you can offload to the device, the fewer times the host and device data need to be synchronized. In other words, once all of your compute is offloaded to the device, you can remove most if not all of the "update" directives, getting back at least the 20 seconds plus the parallel speed-up of offloading the remaining loops.
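For instance (just a sketch, assuming ta3 and td3 are already present on the device from the data region in your main program), one of the plain array copies in convdiff could run on the device with no "update" at all:
!$acc kernels present(ta3,td3)
td3(:,:,:)=ta3(:,:,:)
!$acc end kernels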
Hi Mat,
You said I can accelerate the code by offloading the loops in the der and transpose routines. But then you said:
You're not going to get great performance due to the data layout
Maybe I'm wrong, but I think here you mean it is better not to add any directives to der because of the data layout. As you said, in the der subroutine there are several loops in the i direction that cannot be parallelized on the GPU because of data dependencies, so I have to compute these loops on the CPU. If I add directives to the other loops I cannot avoid data transfer between the GPU and CPU. So should I not offload the der and transpose routines? Is pinned memory the only way to accelerate my code?
Best regards,
Wentao Guo
Hi Wentao Guo,
Maybe I’m wrong, but I think here you mean it is better not to add any directives to der because of the data layout
Apologies if I’m not clear. I’m saying that you should try to offload the loops in the “der” routines and if possible the “transpose” routines.
While offloading the “der” routines to the GPU device will incur some performance penalty due to memory divergence (because of the data layout), it will still most likely be faster to offload them to the GPU.
Even if the times of these routines are equivalent or slightly slower than the CPU, the savings of not having to copy the data should give you an overall win.
Granted, I don't know for sure what the performance will be, but at least the "derx" routine looks very straightforward to add compute directives to, so it should be simple to try.
-Mat
Hi Mat,
According to your reply, here is what I understood. I took a part of my code (subroutine convdiff) as an example:
!$acc update device(ux1,uy1,uz1)
!$acc kernels present(ux1,uy1,uz1)
do ijk=1,nvect1
ta1(ijk,1,1)=ux1(ijk,1,1)*ux1(ijk,1,1)
tb1(ijk,1,1)=ux1(ijk,1,1)*uy1(ijk,1,1)
tc1(ijk,1,1)=ux1(ijk,1,1)*uz1(ijk,1,1)
enddo
!$acc end kernels
!$acc update self(ta1,tb1,tc1)
call derx (td1,ta1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
call derx (te1,tb1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (tf1,tc1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (ta1,ux1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (tb1,uy1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
call derx (tc1,uz1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
I think you want me to reduce the data movement between GPU and CPU, so I shouldn’t update ta1,tb1,tc1 to the host immediately after the loop is finished. I should use them in the subroutine derx and try to update the data to the host close to the end of this subroutine named convdiff. And here is a part of my subroutine derx:
if (nclx==0) then
do k=1,nz
do j=1,ny
tx(1,j,k)=afix*(ux(2,j,k)-ux(nx,j,k))&
+bfix*(ux(3,j,k)-ux(nx-1,j,k))
rx(1,j,k)=-1.
tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
+bfix*(ux(4,j,k)-ux(nx,j,k))
rx(2,j,k)=0.
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
rx(i,j,k)=0.
enddo
tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
+bfix*(ux(1,j,k)-ux(nx-3,j,k))
rx(nx-1,j,k)=0.
tx(nx,j,k)=afix*(ux(1,j,k)-ux(nx-1,j,k))&
+bfix*(ux(2,j,k)-ux(nx-2,j,k))
rx(nx,j,k)=alfaix
do i=2, nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
rx(i,j,k)=rx(i,j,k)-rx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
rx(nx,j,k)=rx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
rx(i,j,k)=(rx(i,j,k)-ffx(i)*rx(i+1,j,k))*fwx(i)
enddo
sx(j,k)=(tx(1,j,k)-alfaix*tx(nx,j,k))&
/(1.+rx(1,j,k)-alfaix*rx(nx,j,k))
do i=1,nx
tx(i,j,k)=tx(i,j,k)-sx(j,k)*rx(i,j,k)
enddo
enddo
enddo
endif
I think I can add some compute directives to offload the first loop in the i direction:
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
rx(i,j,k)=0.
enddo
But when it comes to the second loop, I run into trouble:
do i=2, nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
rx(i,j,k)=rx(i,j,k)-rx(i-1,j,k)*fsx(i)
enddo
I cannot compute this loop on GPU because the data is dependent. To my current understanding, after the first loop I have to update tx and rx back to the CPU because the second loop needs them. So compared with using the CPU only, I need extra data transfer. You said:
but a better direction would be to offload the loops in the der and transpose routines, so you don’t need to transfer data at all.
I don't know how to avoid the data transfer. Is there a way to run loops which have data dependencies on the GPU without parallelizing them? If what I understood is not right, please correct me.
Best regards,
Wentao Guo
I think you want me to reduce the data movement between GPU and CPU, so I shouldn’t update ta1,tb1,tc1 to the host immediately after the loop is finished. I should use them in the subroutine derx and try to update the data to the host close to the end of this subroutine named convdiff.
Correct.
I cannot compute this loop on GPU because the data is dependent.
Correct, you won't be able to parallelize the "i" loops because of this dependency. However, you should be able to parallelize the "j" and "k" loops. Something like:
if (nclx==0) then
!$acc kernels loop
do k=1,nz
do j=1,ny
tx(1,j,k)=afix*(ux(2,j,k)-ux(nx,j,k))&
+bfix*(ux(3,j,k)-ux(nx-1,j,k))
rx(1,j,k)=-1.
....cut....
do i=1,nx
tx(i,j,k)=tx(i,j,k)-sx(j,k)*rx(i,j,k)
enddo
enddo
enddo
endif
Then repeat this for each of your if cases.
-Mat
Hi Mat,
Thanks for your reply! I tried your suggestion and added "!$acc kernels loop" to the derx subroutine:
!********************************************************************
!
subroutine derx(tx,ux,rx,sx,ffx,fsx,fwx,nx,ny,nz,npaire)
!
!********************************************************************
USE param
USE derivX
implicit none
integer :: nx,ny,nz,npaire,i,j,k
real(mytype), dimension(nx,ny,nz) :: tx,ux,rx
real(mytype), dimension(ny,nz):: sx
real(mytype), dimension(nx):: ffx,fsx,fwx
if (nclx==0) then
!$acc kernels loop
do k=1,nz
do j=1,ny
tx(1,j,k)=afix*(ux(2,j,k)-ux(nx,j,k))&
+bfix*(ux(3,j,k)-ux(nx-1,j,k))
rx(1,j,k)=-1.
tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
+bfix*(ux(4,j,k)-ux(nx,j,k))
rx(2,j,k)=0.
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
rx(i,j,k)=0.
enddo
tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
+bfix*(ux(1,j,k)-ux(nx-3,j,k))
rx(nx-1,j,k)=0.
tx(nx,j,k)=afix*(ux(1,j,k)-ux(nx-1,j,k))&
+bfix*(ux(2,j,k)-ux(nx-2,j,k))
rx(nx,j,k)=alfaix
do i=2, nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
rx(i,j,k)=rx(i,j,k)-rx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
rx(nx,j,k)=rx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
rx(i,j,k)=(rx(i,j,k)-ffx(i)*rx(i+1,j,k))*fwx(i)
enddo
sx(j,k)=(tx(1,j,k)-alfaix*tx(nx,j,k))&
/(1.+rx(1,j,k)-alfaix*rx(nx,j,k))
do i=1,nx
tx(i,j,k)=tx(i,j,k)-sx(j,k)*rx(i,j,k)
enddo
enddo
enddo
endif
if (nclx==1) then
if (npaire==1) then
!$acc kernels loop
do k=1,nz
do j=1,ny
tx(1,j,k)=0.
tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
+bfix*(ux(4,j,k)-ux(2,j,k))
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
enddo
tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
+bfix*(ux(nx-1,j,k)-ux(nx-3,j,k))
tx(nx,j,k)=0.
do i=2,nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
enddo
enddo
enddo
endif
if (npaire==0) then
!$acc kernels loop
do k=1,nz
do j=1,ny
tx(1,j,k)=afix*(ux(2,j,k)+ux(2,j,k))&
+bfix*(ux(3,j,k)+ux(3,j,k))
tx(2,j,k)=afix*(ux(3,j,k)-ux(1,j,k))&
+bfix*(ux(4,j,k)+ux(2,j,k))
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
enddo
tx(nx-1,j,k)=afix*(ux(nx,j,k)-ux(nx-2,j,k))&
+bfix*((-ux(nx-1,j,k))-ux(nx-3,j,k))
tx(nx,j,k)=afix*((-ux(nx-1,j,k))-ux(nx-1,j,k))&
+bfix*((-ux(nx-2,j,k))-ux(nx-2,j,k))
do i=2,nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
enddo
enddo
enddo
endif
endif
if (nclx==2) then
!$acc kernels loop
do k=1,nz
do j=1,ny
tx(1,j,k)=af1x*ux(1,j,k)+bf1x*ux(2,j,k)+cf1x*ux(3,j,k)
tx(2,j,k)=af2x*(ux(3,j,k)-ux(1,j,k))
do i=3,nx-2
tx(i,j,k)=afix*(ux(i+1,j,k)-ux(i-1,j,k))&
+bfix*(ux(i+2,j,k)-ux(i-2,j,k))
enddo
tx(nx-1,j,k)=afmx*(ux(nx,j,k)-ux(nx-2,j,k))
tx(nx,j,k)=(-afnx*ux(nx,j,k))-bfnx*ux(nx-1,j,k)-cfnx*ux(nx-2,j,k)
do i=2,nx
tx(i,j,k)=tx(i,j,k)-tx(i-1,j,k)*fsx(i)
enddo
tx(nx,j,k)=tx(nx,j,k)*fwx(nx)
do i=nx-1,1,-1
tx(i,j,k)=(tx(i,j,k)-ffx(i)*tx(i+1,j,k))*fwx(i)
enddo
enddo
enddo
endif
return
end subroutine derx
I also tried to reduce data movement in the subroutine (convdiff) which calls derx. Here is a part of convdiff:
!$acc update device(ux1,uy1,uz1)
!$acc kernels present(ux1,uy1,uz1)
do ijk=1,nvect1
ta1(ijk,1,1)=ux1(ijk,1,1)*ux1(ijk,1,1)
tb1(ijk,1,1)=ux1(ijk,1,1)*uy1(ijk,1,1)
tc1(ijk,1,1)=ux1(ijk,1,1)*uz1(ijk,1,1)
enddo
!$acc end kernels
call derx (td1,ta1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
call derx (te1,tb1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (tf1,tc1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (ta1,ux1,di1,sx,ffx,fsx,fwx,xsize(1),xsize(2),xsize(3),0)
call derx (tb1,uy1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
call derx (tc1,uz1,di1,sx,ffxp,fsxp,fwxp,xsize(1),xsize(2),xsize(3),1)
!$acc kernels present(ta1,tb1,tc1,td1,te1,tf1,ux1)
do ijk=1,nvect1
ta1(ijk,1,1)=0.5*td1(ijk,1,1)+0.5*ux1(ijk,1,1)*ta1(ijk,1,1)
tb1(ijk,1,1)=0.5*te1(ijk,1,1)+0.5*ux1(ijk,1,1)*tb1(ijk,1,1)
tc1(ijk,1,1)=0.5*tf1(ijk,1,1)+0.5*ux1(ijk,1,1)*tc1(ijk,1,1)
enddo
!$acc end kernels
!$acc update self(ta1,tb1,tc1)
The code compiles, but when I run it the output is not correct. I checked the compiler feedback:
derx:
50, Generating implicit copyin(fsx(2:nx),ffx(:nx-1),ux(:,:ny,:nz))
Generating implicit copyout(sx(:ny,:nz))
Generating implicit allocate(tx(:,:ny,:nz),rx(:,:ny,:nz))
Generating implicit copy(rx(:2,:ny,:nz),tx(:2,:ny,:nz))
Generating implicit copyin(fwx(:nx))
51, Loop is parallelizable
52, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
51, !$acc loop gang ! blockidx%y
52, !$acc loop gang ! blockidx%x
59, !$acc loop vector(128) ! threadidx%x
70, !$acc loop seq
76, !$acc loop seq
82, !$acc loop vector(128) ! threadidx%x
59, Loop is parallelizable
70, Loop carried dependence of tx,rx prevents parallelization
Loop carried backward dependence of tx,rx prevents vectorization
Inner sequential loop scheduled on accelerator
76, Loop carried dependence of tx,rx prevents parallelization
Loop carried backward dependence of tx,rx prevents vectorization
Inner sequential loop scheduled on accelerator
82, Loop is parallelizable
91, Generating implicit allocate(tx(:,:ny,:nz))
Generating implicit copyin(tx(:2,:ny,:nz))
Generating implicit copyout(tx(:nx,:ny,:nz))
Generating implicit copyin(fwx(:nx),ux(:,:ny,:nz),ffx(:nx-1),fsx(2:nx))
92, Loop is parallelizable
93, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
92, !$acc loop gang ! blockidx%y
93, !$acc loop gang ! blockidx%x
97, !$acc loop vector(128) ! threadidx%x
104, !$acc loop seq
108, !$acc loop seq
97, Loop is parallelizable
104, Loop carried dependence of tx prevents parallelization
Loop carried backward dependence of tx prevents vectorization
Inner sequential loop scheduled on accelerator
108, Loop carried dependence of tx prevents parallelization
Loop carried backward dependence of tx prevents vectorization
Inner sequential loop scheduled on accelerator
115, Generating implicit copyin(ux(:,:ny,:nz),fwx(:nx))
Generating implicit allocate(tx(:,:ny,:nz))
Generating implicit copyin(tx(:2,:ny,:nz))
Generating implicit copyout(tx(:nx,:ny,:nz))
Generating implicit copyin(ffx(:nx-1),fsx(2:nx))
116, Loop is parallelizable
117, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
116, !$acc loop gang ! blockidx%y
117, !$acc loop gang ! blockidx%x
122, !$acc loop vector(128) ! threadidx%x
130, !$acc loop seq
134, !$acc loop seq
122, Loop is parallelizable
130, Loop carried dependence of tx prevents parallelization
Loop carried backward dependence of tx prevents vectorization
Inner sequential loop scheduled on accelerator
134, Loop carried dependence of tx prevents parallelization
Loop carried backward dependence of tx prevents vectorization
Inner sequential loop scheduled on accelerator
143, Generating implicit copyin(ux(:,:ny,:nz),fwx(:nx))
Generating implicit allocate(tx(:,:ny,:nz))
Generating implicit copyin(tx(:2,:ny,:nz))
Generating implicit copyout(tx(:nx,:ny,:nz))
Generating implicit copyin(ffx(:nx-1),fsx(2:nx))
144, Loop is parallelizable
145, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
144, !$acc loop gang ! blockidx%y
145, !$acc loop gang ! blockidx%x
148, !$acc loop vector(128) ! threadidx%x
154, !$acc loop seq
158, !$acc loop seq
148, Loop is parallelizable
154, Loop carried dependence of tx prevents parallelization
Loop carried backward dependence of tx prevents vectorization
Inner sequential loop scheduled on accelerator
158, Loop carried dependence of tx prevents parallelization
Loop carried backward dependence of tx prevents vectorization
Inner sequential loop scheduled on accelerator
Here you can see that for the first IF statement (if (nclx==0)), it reports:
50, Generating implicit copyin(fsx(2:nx),ffx(:nx-1),ux(:,:ny,:nz))
Generating implicit copyout(sx(:ny,:nz))
Generating implicit allocate(tx(:,:ny,:nz),rx(:,:ny,:nz))
Generating implicit copy(rx(:2,:ny,:nz),tx(:2,:ny,:nz))
Generating implicit copyin(fwx(:nx))
It only copies rx(:2,:ny,:nz) and tx(:2,:ny,:nz) back to the host rather than the whole arrays. However, for the other IF statements (nclx==1 or nclx==2),
the whole tx array is copied out:
Generating implicit copyout(tx(:nx,:ny,:nz))
Because I only use the nclx==0 case, I think that is why I got the wrong output. But I'm confused why only a part of the array is copied out for nclx==0, since the other two cases have a similar data layout and structure. I also tried to replace "!$acc kernels loop" with "!$acc kernels present", "!$acc end kernels" and "!$acc update self(tx,rx)" directives to copy the data back to the host after the loop is finished, but I failed. Could you give me some suggestions?
Best regards,
Wentao Guo
Can you send the full source and any input files to PGI Customer Service ([email protected])?
I think it would be easier for me to determine the issue if I could see the whole code as well as run it.
Thanks!
Mat
Hi Mat,
I have sent the code to [email protected] and asked them to redirect my email to you. Thanks in advance!
Best regards,
Wentao Guo