
Derived types with CUDA-Aware MPI #8720

Open

@ShatrovOA

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 4.0.5, shipped with the NVIDIA HPC SDK 21.2

Please describe the system on which you are running

  • Operating system/version: CentOS Linux 7
  • Computer hardware: 2 x Intel Xeon Gold 6142 v4, 4 x NVIDIA Volta GV100GL, 768 GB RAM
  • Network type: InfiniBand

Details of the problem

I am developing a library that uses MPI derived datatypes to send and receive aligned data. The derived datatypes are built as a combination of vector, hvector, contiguous, and resized types (a minimal sketch of this kind of construction follows).
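For illustration only, here is a minimal sketch of the kind of datatype combination described above (strided blocks with a resized extent). This is not the library's actual construction; the counts, block length, stride, and element type are placeholder values.

```c
#include <mpi.h>

/* Build a strided datatype and shrink its extent so consecutive
 * elements of the new type interleave in memory. Placeholder values;
 * the real library combines vector, hvector, contiguous and resized. */
MPI_Datatype make_strided_type(int count, int blocklen, int stride)
{
    MPI_Datatype vec, resized;

    /* count blocks of blocklen doubles, separated by stride doubles */
    MPI_Type_vector(count, blocklen, stride, MPI_DOUBLE, &vec);

    /* resize the extent to one block so the type can be tiled densely */
    MPI_Type_create_resized(vec, 0, blocklen * (MPI_Aint)sizeof(double),
                            &resized);
    MPI_Type_commit(&resized);

    MPI_Type_free(&vec);
    return resized;
}
```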

The code runs fine on the CPU. I then tried to run it on the GPU using the CUDA-aware MPI shipped with the NVIDIA HPC SDK. I noticed that when I call MPI_Alltoall with GPU buffers, MPI starts copying data from host to device; a single MPI_Alltoall call triggers more than one million such copies. Unsurprisingly, the code runs very slowly. The call pattern is sketched below.
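A minimal sketch of the call pattern, assuming a CUDA-aware Open MPI build: the send/receive buffers are allocated with cudaMalloc and passed directly to MPI_Alltoall together with the derived datatype. The buffer sizes, counts, and the datatype argument are placeholders, not the library's real parameters.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* All-to-all exchange of a non-contiguous derived type between device
 * buffers. With such types, the MPI library has to pack/unpack the data
 * internally; this is where the reported host/device copies appear. */
void alltoall_gpu(MPI_Datatype dtype, int count_per_rank,
                  size_t bytes_per_rank, MPI_Comm comm)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);

    void *d_send = NULL, *d_recv = NULL;
    cudaMalloc(&d_send, bytes_per_rank * nranks);
    cudaMalloc(&d_recv, bytes_per_rank * nranks);

    MPI_Alltoall(d_send, count_per_rank, dtype,
                 d_recv, count_per_rank, dtype, comm);

    cudaFree(d_send);
    cudaFree(d_recv);
}
```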

Can you please explain how this works? Are you aware of this behaviour?

[Screenshot attachment: cuda_mpi_alltoall_issue]

Best regards,
Oleg
