CUDA-aware MPI on 1 GPU transferring data to host?

Hi,

I am using CUDA-aware MPI with PGI 17.7 and the OpenMPI build that ships with the compiler.

The code works, but when I profile it, the MPI routines take longer than they did in the CPU-only version.

Using pgprof, I can see several asynchronous memory transfers between device and host (DtoH and HtoD) around the MPI calls.

I am running the code on only 1 GPU (a GeForce GTX 970), so I do not understand why it is making these transfers.

(I have learned that GeForce cards do not support GPUDirect RDMA, but even so, if the MPI destination is the same card, shouldn't the compiler/library use a device-to-device copy instead of staging through the host?)

Thanks

Hi Sumseq,

Yes, CUDA-aware MPI should make device-to-device memory transfers even on a GeForce card. GPUDirect RDMA isn't necessary since the transfer should be done over CUDA IPC.

While not a GeForce 970, I was able to test an MPI code on a GTX 690. In my example, I see no extra HtoD or DtoH transfers, only a DtoD.

With your code, do you use the OpenACC “host_data” construct around your MPI calls so that the device data is passed to MPI? Or if using CUDA Fortran, are you passing in device arrays?
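
For reference, the pattern I was testing is roughly along these lines (a simplified sketch, not my exact test code; the names and sizes are made up). Running it with a single rank under pgprof should show only a DtoD memcpy for the message:

program test_acc_mpi
  use mpi
  implicit none
  integer, parameter :: n = 1024*1024
  real(8), allocatable :: a(:), b(:)
  integer :: ierr, rank, req(2)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  allocate(a(n), b(n))
  a = 1.0d0
  b = 0.0d0

! put both buffers on the device
!$acc enter data copyin(a, b)

! host_data hands the device addresses of a and b to MPI, so a
! CUDA-aware MPI can move the data with a device-to-device copy
!$acc host_data use_device(a, b)
  call MPI_Irecv(b, n, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(a, n, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(2), ierr)
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
!$acc end host_data

!$acc exit data copyout(b) delete(a)
  print *, 'b(1) = ', b(1)

  call MPI_Finalize(ierr)
end program test_acc_mpi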

-Mat

Hi,

I am using “host_data” as follows:

!$acc host_data use_device(v%r)
      call MPI_Irecv (v%r(:,:,  1),lbuf3r,ntype_real,iproc_pm,tagr,
     &                comm_all,req(1),ierr)
      call MPI_Irecv (v%r(:,:,n3r),lbuf3r,ntype_real,iproc_pp,tagr,
     &                comm_all,req(2),ierr)
c
      call MPI_Isend (v%r(:,:,n3r-1),lbuf3r,ntype_real,iproc_pp,tagr,
     &                comm_all,req(3),ierr)
      call MPI_Isend (v%r(:,:,    2),lbuf3r,ntype_real,iproc_pm,tagr,
     &                comm_all,req(4),ierr)

      call MPI_Waitall (4,req,MPI_STATUSES_IGNORE,ierr)
!$acc end host_data

where v is a derived type that I copied to the device using deepcopy:

!$acc enter data copyin(v)

I also have “managed” turned on (i.e. “-ta=tesla:managed”), as I have been having problems with some arrays when it is turned off.

Hi sumseq,

By “deepcopy” do you mean the new PGI 17.7 implicit deep copy beta feature, i.e. “-ta=tesla:deepcopy”, or are you manually deep copying the structure? For example:

!$acc enter data copyin(v)
!$acc enter data copyin(v%r)
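
(If you are deep copying manually, the matching cleanup goes in reverse order, member first and then the parent:)

!$acc exit data delete(v%r)
!$acc exit data delete(v)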

In my example I wasn't using a user-defined type, but I'll try to make a reproducing example similar to yours. If you're using implicit deep copy, it's possible that this is a bug in the beta feature when it interacts with the host_data construct. I'll investigate.

-Mat

Hi,

Yes, I mean the beta deepcopy feature in 17.7.

Hi sumseq,

I updated my OpenMPI+OpenACC test program to pass allocatable array data members of a UDT to MPI_Send/MPI_Recv wrapped in a host_data directive. As before, I only see the DtoD transfers associated with the MPI calls.
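
In outline, the updated test looks something like the sketch below (simplified, with made-up names and sizes rather than the exact code; nonblocking calls are used so the single-rank self-exchange can't deadlock):

program test_udt_mpi
  use mpi
  implicit none
  type :: grid_t
     real(8), allocatable :: r(:,:,:)
  end type grid_t
  type(grid_t) :: v
  integer, parameter :: n1 = 64, n2 = 64, n3 = 16
  integer :: ierr, rank, req(2)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  allocate(v%r(n1,n2,n3))
  v%r = 1.0d0

! manual deep copy: parent first, then the allocatable member
!$acc enter data copyin(v)
!$acc enter data copyin(v%r)

! host_data hands MPI the device addresses of the v%r sections
!$acc host_data use_device(v%r)
  call MPI_Irecv(v%r(:,:,n3), n1*n2, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(v%r(:,:,1),  n1*n2, MPI_DOUBLE_PRECISION, rank, 0, &
                 MPI_COMM_WORLD, req(2), ierr)
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
!$acc end host_data

!$acc exit data copyout(v%r)
!$acc exit data delete(v)
  print *, 'v%r(1,1,n3) = ', v%r(1,1,n3)

  call MPI_Finalize(ierr)
end program test_udt_mpi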

The only HtoD transfer I have is for a Fortran descriptor being passed to the compute region just after the MPI calls.

Does your profile show any DtoD transfers? Could the HtoD and/or DtoH transfers be accounted for by some other data movement?

If you want, you can send a copy of the code to PGI Customer Service and I can take a look and try to determine where the extra copies are coming from.

-Mat

Hi,

Could it have to do with my arrays being declared as pointers instead of allocatable?

I highly doubt it. Pointers are treated the same as allocatables here; in both cases, “host_data” passes the device pointer of the array to your MPI call.
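
For example, the only difference would be in the declaration of the member (illustrative only; the kind and extents are whatever you already have):

type :: vec_t                      ! hypothetical name
   real(8), pointer :: r(:,:,:)    ! pointer component instead of allocatable
end type vec_t

The enter data copyin(v) / copyin(v%r) and the host_data use_device(v%r) usage stays exactly the same.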