Host_data with MPI and datatype members

Hello, I’m trying to port a simulation code to GPUs using OpenACC, and I’m running into trouble combining MPI with the OpenACC directives. The code uses derived MPI datatypes to extract slices of 3D arrays and communicate just the boundary data that I need. The problem is that the 3D arrays are members of a structure, and I’m having trouble making that work with the host_data use_device directive.

Here is a reproducible example:

module data
use mpi
implicit none
type box
  integer                                :: m,n
  integer, dimension(:,:,:), allocatable :: a
end type box
!
contains
!
subroutine create_subarray(b,i1,BoxDataType)
implicit none
!
type(box), intent(in)  :: b
integer,   intent(in)  :: i1
integer,   intent(out) :: BoxDataType
!
integer                                      :: ierr, TypeMPI, n
integer, dimension(1)                        :: a_BlockLengths, a_Types
integer(kind=MPI_ADDRESS_KIND), dimension(1) :: a_Displacements, dummy
!
n = b%n - b%m + 1
call MPI_Type_Create_Subarray(3, &
                              (/ n,  n, n /), &
                              (/ 1,  n, n /), &
                              (/ i1, 0, 0 /), &
                               MPI_ORDER_FORTRAN, &
                               MPI_INTEGER, &
                               TypeMPI, &
                               ierr)
!
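! displacement of a(0,0,0) relative to the first member of the structure (b%m),
! so the MPI calls can pass b2 itself as the send/recv buffer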
a_BlockLengths(1) = 1
call MPI_Get_Address(b%a(0,0,0), a_Displacements(1), ierr)
call MPI_Get_Address(b%m,        dummy(1),           ierr)
a_Displacements(1) =  a_Displacements(1)-dummy(1)
a_Types(1) = TypeMPI
!
call MPI_Type_Create_Struct(1, a_BlockLengths(1), a_Displacements(1), a_Types(1), &
                            BoxDataType, ierr)
call MPI_Type_Commit(BoxDataType, ierr)
!
end subroutine create_subarray
!
end module data
!
program main
use data
implicit none
!
integer                                 :: myRank, ierr
integer, parameter                      :: n=3
type(box)                               :: b1, b2
integer                                 :: i1,i2,i3
integer, dimension(2)                   :: Requests, BoxGhostTypes
integer, dimension(MPI_STATUS_SIZE,1:2) :: Status
!
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, ierr)
!
b1%m = 1
b1%n = n
allocate(b1%a(1:n,1:n,1:n))
b2%m = 0
b2%n = n+1
allocate(b2%a(0:n+1,0:n+1,0:n+1))
!
b1%a = myRank
b2%a = -1
!
!$acc data copyin(b1,b1%a,b2) copy(b2%a)
!$acc parallel loop collapse(3)
do i3=1,n
  do i2=1,n
    do i1=1,n
      b2%a(i1,i2,i3) = b1%a(i1,i2,i3)*100 + i1
    end do
  end do
end do
!
!$acc update self(b2%a)
!!$acc host_data use_device(b2%a)
if (myRank .eq. 0) then
  call create_subarray(b2,0,BoxGhostTypes(1)) ! recv
  call create_subarray(b2,1,BoxGhostTypes(2)) ! send
  !
  call MPI_IRecv(b2, 1, boxGhostTypes(1), 1, 0, MPI_COMM_WORLD, Requests(1), ierr)
  call MPI_ISend(b2, 1, boxGhostTypes(2), 1, 1, MPI_COMM_WORLD, Requests(2), ierr)
else
  call create_subarray(b2,n+1,BoxGhostTypes(1)) ! recv
  call create_subarray(b2,n  ,BoxGhostTypes(2)) ! send
  !
  call MPI_IRecv(b2, 1, boxGhostTypes(1), 0, 1, MPI_COMM_WORLD, Requests(1), ierr)
  call MPI_ISend(b2, 1, boxGhostTypes(2), 0, 0, MPI_COMM_WORLD, Requests(2), ierr)
end if
!
call MPI_WaitAll(2, Requests, Status, ierr)
!$acc update device(b2%a)
!!$acc end host_data
!
!$acc end data
print *, "#", myRank, ":", b2%a(:,2,2)
!
deallocate(b1%a,b2%a)
!
call MPI_FINALIZE(ierr)
!
end program main

I’m compiling it with nvfortran 24.11. Running with 2 MPI tasks (each with a GPU), it returns:

#            0 :          103            1            2            3
           -1
 #            1 :           -1          101          102          103
            1

However, if I use host_data use_device(b2%a) instead of the update self / update device pair, I get the wrong result (the boundary data is communicated from the host, not from the device). I also tried making the member a a pointer instead of an allocatable, but that doesn’t solve the problem. If, on the other hand, I rewrite the code with a being a plain array rather than a member of a Fortran derived type, everything works correctly.

Is there a way to MPI-communicate arrays that are members of a Fortran derived type directly from the device, or should I create a buffer and communicate the buffer instead?

Best,
Fabio

Hi Fabio,

A host_data region tells the compiler to use the device address within this section of host code. However, here you’re taking the device address of “b2%a” but passing “b2” to the MPI routines. “b2%a” isn’t used in this section, and there’s no way to convey your intent to use the device pointer of “a” since the MPI calls just see “b2”.

Also, the compiler does have a known limitation where it doesn’t support derived type members in host_data regions. So, while that isn’t the issue here, even if you were passing “b2%a” to the MPI calls, you might still have issues.
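
For illustration, the matched pattern would look something like the sketch below, assuming “create_subarray” committed and returned the plain subarray type (TypeMPI) instead of wrapping it in the struct, so the displacements are relative to the start of “a” rather than to “b%m”:

!$acc host_data use_device(b2%a)
! sketch only: assumes BoxGhostTypes(:) now hold the committed subarray types
call MPI_IRecv(b2%a, 1, BoxGhostTypes(1), 1, 0, MPI_COMM_WORLD, Requests(1), ierr)
call MPI_ISend(b2%a, 1, BoxGhostTypes(2), 1, 1, MPI_COMM_WORLD, Requests(2), ierr)
!$acc end host_data

But given the limitation above, this may still not work as expected.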

I looked into putting “b2” in a host_data region just around the MPI calls themselves, since passing a device variable to “create_subarray” would be problematic. That caused a seg fault, though. While I didn’t dig into it, my guess is that MPI does some extra work on the host to handle the subarrays, which causes the segv when “b2” is a device address.

-Mat

Hi Mat,

thank you very much for your quick answer. I thought this could be the issue, but I wasn’t sure.

As you mentioned, I also tried using b2 directly in the host_data directive, but it returns a segmentation fault. I even tried using b2%m, replacing ‘b2’ with ‘b2%m’ in all the MPI calls (in any case, I compute the displacement in MPI_Type_Create_Struct with respect to b%m), but it still crashes with a segmentation fault (Error: segmentation violation, invalid permissions for mapped object). Given your statement that the compiler doesn’t support derived type members in host_data regions, I guess there is not much hope of solving the problem with derived datatypes and MPI_Type_Create_Subarray.

One workaround would be to create a contiguous buffer on the device for each derived type member that I have to communicate, communicate each buffer individually, and then copy the data back into the type members. My main problem is that I would need an individual MPI call for each derived type member, which is not ideal (to my knowledge, it is generally better to pack data and communicate it together, in order to improve MPI speed). In principle, I could even use a buffer whose last dimension is the number of derived type members (to pack the data together), but only if all members have the same dimensions (which, unfortunately, is not always the case).

Do you see a better workaround? Imagine that I have a derived datatype with a list of array members that live on the device:

xc1(m1:n1), xc2(m2:n2), xc3(m3:n3), xb1(m1:n1+1), xb2(m2:n2+1), xb3(m3:n3+1)
data1(m1:n1,m2:n2,m3:n3), data2(m1:n1+1,m2:n2,m3:n3), data3(m1:n1,m2:n2+1,m3:n3), data4(m1:n1,m2:n2,m3:n3+1),...

and I have to MPI-communicate the boundary planes of data1, data2, etc. Is there a better way than creating a list of buffers buffer1(m2:n2+1,m3:n3+1,1:Nvars), buffer2(m1:n1+1,m3:n3+1,1:Nvars), etc., and communicating them?

Best,
Fabio

I think I found a possible workaround: put the host_data directive around the MPI_Get_Address call, so that the displacement that goes into MPI_Type_Create_Struct is computed directly with respect to the device address.
Here is my code that seems to work as expected:

module data
use mpi
implicit none
type box
  integer                                :: m,n
  integer, dimension(:,:,:), allocatable :: a
end type box
!
contains
!
subroutine create_subarray(b,i1,BoxDataType)
implicit none
!
type(box), intent(in)  :: b
integer,   intent(in)  :: i1
integer,   intent(out) :: BoxDataType
!
integer                                      :: ierr, TypeMPI, n
integer, dimension(1)                        :: a_BlockLengths, a_Types
integer(kind=MPI_ADDRESS_KIND), dimension(1) :: a_Displacements, dummy
!
n = b%n - b%m + 1
call MPI_Type_Create_Subarray(3, &
                              (/ n,  n, n /), &
                              (/ 1,  n, n /), &
                              (/ i1, 0, 0 /), &
                               MPI_ORDER_FORTRAN, &
                               MPI_INTEGER, &
                               TypeMPI, &
                               ierr)
!
a_BlockLengths(1) = 1
a_Types(       1) = TypeMPI
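! take the device address of b%a, so the displacement below is
! (device address of a) - (host address of b%m)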
!$acc host_data use_device(b%a)
call MPI_Get_Address(b%a, a_Displacements(1), ierr)
!$acc end host_data
call MPI_Get_Address(b%m,        dummy(1),           ierr)
a_Displacements(1) =  a_Displacements(1)-dummy(1)
!
call MPI_Type_Create_Struct(1, a_BlockLengths(1), a_Displacements(1), a_Types(1), &
                            BoxDataType, ierr)
call MPI_Type_Commit(BoxDataType, ierr)
!
end subroutine create_subarray
!
end module data
!
program main
use data
implicit none
!
integer                                 :: myRank, ierr
integer, parameter                      :: n=3
type(box)                               :: b1, b2
integer                                 :: i1,i2,i3
integer, dimension(2)                   :: Requests, BoxGhostTypes
integer, dimension(MPI_STATUS_SIZE,1:2) :: Status
!
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, ierr)
!
b1%m = 1
b1%n = n
allocate(b1%a(1:n,1:n,1:n))
b2%m = 0
b2%n = n+1
allocate(b2%a(0:n+1,0:n+1,0:n+1))
!
b1%a = myRank
b2%a = -1
!
!$acc data copyin(b1,b1%a,b2) copy(b2%a)
!$acc parallel loop collapse(3)
do i3=1,n
  do i2=1,n
    do i1=1,n
      b2%a(i1,i2,i3) = b1%a(i1,i2,i3)*100 + i1
    end do
  end do
end do
!
if (myRank .eq. 0) then
  call create_subarray(b2,0,BoxGhostTypes(1)) ! recv
  call create_subarray(b2,1,BoxGhostTypes(2)) ! send
  !
  call MPI_IRecv(b2, 1, boxGhostTypes(1), 1, 0, MPI_COMM_WORLD, Requests(1), ierr)
  call MPI_ISend(b2, 1, boxGhostTypes(2), 1, 1, MPI_COMM_WORLD, Requests(2), ierr)
else
  call create_subarray(b2,n+1,BoxGhostTypes(1)) ! recv
  call create_subarray(b2,n  ,BoxGhostTypes(2)) ! send
  !
  call MPI_IRecv(b2, 1, boxGhostTypes(1), 0, 1, MPI_COMM_WORLD, Requests(1), ierr)
  call MPI_ISend(b2, 1, boxGhostTypes(2), 0, 0, MPI_COMM_WORLD, Requests(2), ierr)
end if
!
call MPI_WaitAll(2, Requests, Status, ierr)
!
!$acc end data
print *, "#", myRank, ":", b2%a(:,2,2)
!
deallocate(b1%a,b2%a)
!
call MPI_FINALIZE(ierr)
!
end program main

This way, I don’t need a host_data directive around the MPI send/recv calls: the datatype already refers to the address of b%a on the device, because the displacement is computed between b%m (on the host) and b%a (on the device). Maybe it is not the cleanest way, but it seems to work.

Fabio

Are you using unified memory?

The code seg faults for me without “-gpu=mem:managed”. Mixing a host and a device address to compute the displacement doesn’t seem correct and would only work if both are unified memory addresses. Though if the memory is unified, then there’s no need for the host_data at all, since the device and host addresses are the same.

Mhm, this is an interesting point. Here is what I have:

I simply compile the code with mpif90 -acc -Minfo=accel main.f90, so no managed or unified memory (the compiler is from the NVHPC suite, version 24.11).

Also enforcing mpif90 -acc -Minfo=accel -gpu=mem:separate main.f90 produces the same, correct result.

On the other hand, if I comment out the !$acc host_data use_device(b%a) directive in the subroutine create_subarray, I simply get -1 in all boundary data, as the code then MPI-communicates the data from the host, not from the device.

Could it be that the NVIDIA GH200 Grace Hopper nodes on which I am running the code enable such behaviour, since the CPU and GPU are built into the same superchip? If so, it would mean that my solution is not portable to other machines and I have to find another one for better portability (e.g. based on the buffer solution, which I can pack together with the MPI_Type_Create_Struct routine).

Fabio

It still fails for me with “separate” on a GH system here, so I still don’t know why it’s working for you. Also, the default for OpenACC is “separate”; it only changes to “unified” when “-stdpar” is added, and even then explicitly adding “-gpu=mem:separate” should override that default.

Maybe your mpif90 is implicitly adding some flag?

The CPU and GPU on a GH system are separate, but connected via the NVLink interconnect, so they have very fast access to each other’s memory. Also, it has HMM enabled, so the GPU can directly access CPU memory.

All that to say, I don’t think this will be portable. Also, I’m not sure there even is a way to get CUDA-aware MPI to work when passing device UDTs. Granted, this is the first time I’ve seen this, so there might be some way, but from what I’ve tried with your code, I’m not seeing it.

Does the full UDT need to be communicated or just the member array, “a”?

I was reading a bit more about the HPC system, and here is what I found: “CPU and GPU memory is unified into a single address space, with both CPU and GPU memory that are allocatable and accessible from the CPU.” I don’t know whether a specific flag is implicitly added to the mpif90 command, or whether it is some system configuration, but I think this is why both the CPU and the GPU can see the members of derived datatypes that are stored on the device.

You are right that my solution is absolutely not portable. Moreover, after some tests on the machine, I realized that it is also very slow (slower than copying data to the host and MPI-communicating from the host), so it is not good in terms of performance either.

To clarify what I’m trying to do: Let’s assume that I have a derived datatype box with several 3D members (either allocatable or pointers), such as

type box
  real, dimension(:,:,:), pointer :: a,b,c,d,e
end type box

Here, the members of the datatype box have similar, but not exactly the same, sizes, and they are allocated and computed on the device. I want to MPI-communicate one slice of each member (e.g. a(1,:,:), b(1,:,:), c(1,:,:), etc.). Unfortunately, these slices are not contiguous in memory.

On the CPU, it is possible to use MPI_Type_Create_Subarray to select each non-contiguous slice, and then MPI_Type_Create_Struct to pack all the slices into a single MPI datatype. This way, it is not necessary to create a send/recv buffer for each member, or to invoke an MPI send/recv for each member: we do it in one go, MPI-communicating the structure we created (MPI is probably creating memory-contiguous buffers under the hood).
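
For example, with the box type above and just two of its members, a and b, the host-side packing I have in mind looks something like this sketch (send_first_planes and the dest/tag/comm arguments are made up for illustration; it uses absolute addresses and MPI_BOTTOM instead of displacements relative to the structure):

subroutine send_first_planes(bx, dest, tag, comm)
use mpi
implicit none
type(box), intent(in) :: bx   ! the box type sketched above; only members a and b are used
integer,   intent(in) :: dest, tag, comm
!
integer                                      :: ierr, SliceA, SliceB, PackedType
integer, dimension(2)                        :: BlockLengths, Types
integer(kind=MPI_ADDRESS_KIND), dimension(2) :: Displacements
!
! describe one non-contiguous i1-plane of each member as a subarray
call MPI_Type_Create_Subarray(3, shape(bx%a), (/ 1, size(bx%a,2), size(bx%a,3) /), &
                              (/ 0, 0, 0 /), MPI_ORDER_FORTRAN, MPI_REAL, SliceA, ierr)
call MPI_Type_Create_Subarray(3, shape(bx%b), (/ 1, size(bx%b,2), size(bx%b,3) /), &
                              (/ 0, 0, 0 /), MPI_ORDER_FORTRAN, MPI_REAL, SliceB, ierr)
!
! pack both slices with absolute addresses, so the buffer argument is MPI_BOTTOM
call MPI_Get_Address(bx%a, Displacements(1), ierr)
call MPI_Get_Address(bx%b, Displacements(2), ierr)
BlockLengths = 1
Types        = (/ SliceA, SliceB /)
call MPI_Type_Create_Struct(2, BlockLengths, Displacements, Types, PackedType, ierr)
call MPI_Type_Commit(PackedType, ierr)
!
! both planes travel in a single message
call MPI_Send(MPI_BOTTOM, 1, PackedType, dest, tag, comm, ierr)
call MPI_Type_Free(PackedType, ierr)
end subroutine send_first_planes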

When the data are on the device, on the other hand, the situation is more complex. One possibility would be to copy a(1,:,:), b(1,:,:), c(1,:,:), etc. from the device to the host, pack them with MPI_Type_Create_Struct, and then communicate from the host with a single MPI send/recv (not optimal, but on the GH200 it should not be too slow, thanks to the NVLink connection between the CPU and GPU).

The other option is to create 2D memory-contiguous buffers on the device (one for each member of the derived datatype), and then communicate each buffer with an individual MPI send/recv. As an alternative, do you know if it is possible at all to use MPI_Type_Create_Struct combined with the host_data directive, to pack data from the device and send everything with one single MPI call?

Fabio

As an alternative, do you know if it is possible at all to use MPI_Type_Create_Struct combined with the host_data directive, to pack data from the device and send everything with one single MPI call?

It’s not in the list of Open MPI APIs that support CUDA-aware MPI, nor in the list of those that don’t, so it’s unclear. Though I would think not: if the implementation creates contiguous buffers, it would need to do that on the device, which seems problematic.

I’ve never used MPI_Type_Create_Subarray myself; I typically create my own send and receive buffers and then use gather and scatter loops to map the non-contiguous data into these contiguous buffers.

Thank you very much Mat, all of this makes a lot of sense. I will then work with contiguous-memory buffers, which should ensure portability and good performance.

In reality, on the HPC system I’m working on, it seems even more efficient to copy data from the device to the host and then MPI-communicate it, rather than performing the MPI communication directly from the device. For example, consider the following code:

program main
use MPI
implicit none
!
integer                                 :: myRank, sendRank, ierr
integer, parameter                      :: n=100
integer                                 :: i1,i2,i3, is,ir
integer, dimension(2)                   :: Requests
integer, dimension(MPI_STATUS_SIZE,1:2) :: Status
integer                                 :: init, end, r
real, dimension(:,:,:), allocatable     :: a
real, dimension(:,:),   allocatable     :: buff_s, buff_r

!
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, ierr)
!
allocate(a(0:n+1,0:n+1,0:n+1))
!
a = -1.
!
!$acc data create(a)
!$acc update device(a)
!$acc parallel loop collapse(3)
do i3=1,n
  do i2=1,n
    do i1=1,n
      a(i1,i2,i3) =   myRank*100 + i1 + 0.01*myRank
    end do
  end do
end do
!$acc update self(a)
!
if (myRank .eq. 0) then
  sendRank = 1
  is = 1
  ir = 0
  print *, myRank, ":", myRank*100 + 1 + 0.01*myRank
else
  sendRank = 0
  is = n
  ir = n+1
  print *, myRank, ":", myRank*100 + n + 0.01*myRank
end if
!
call MPI_Barrier(MPI_COMM_WORLD, ierr)
call system_clock(init, r)
!
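! Variant 1: gather the slice into a device buffer, update it to the host,
! and communicate the host buffers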
allocate(buff_s(1:n,1:n),buff_r(1:n,1:n))
!$acc data create(buff_s)
!$acc parallel loop collapse(2)
do i3=1,n
  do i2=1,n
    buff_s(i2,i3) = a(is,i2,i3)
  end do
end do
!$acc update self(buff_s)
!$acc end data
!
call MPI_IRecv(buff_r, n*n, MPI_REAL, sendRank,   myRank, MPI_COMM_WORLD, Requests(1), ierr)
call MPI_ISend(buff_s, n*n, MPI_REAL, sendRank, sendRank, MPI_COMM_WORLD, Requests(2), ierr)
call MPI_WaitAll(2, Requests(1:2), Status(:,1:2), ierr)
!
!$acc data create(buff_r)
!$acc update device(buff_r)
!$acc parallel loop collapse(2)
do i3=1,n
  do i2=1,n
    a(ir,i2,i3) = buff_r(i2,i3)
  end do
end do
!$acc end data
!
deallocate(buff_s,buff_r)
call system_clock(end)
!
!$acc update self(a(ir,1,1))
print *, "#", myRank, ":", a(ir,1,1), "elapsed time(1): ", real(end - init)/real(r)
!
a(ir,:,:) = -1.
!$acc update device(a)
!
call MPI_Barrier(MPI_COMM_WORLD, ierr)
call system_clock(init, r)
!
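! Variant 2: gather the slice into a device buffer and communicate it
! directly from the device (host_data use_device)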
allocate(buff_s(1:n,1:n),buff_r(1:n,1:n))
!$acc data create(buff_s,buff_r)
!$acc parallel loop collapse(2)
do i3=1,n
  do i2=1,n
    buff_s(i2,i3) = a(is,i2,i3)
  end do
end do
!
!$acc host_data use_device(buff_r,buff_s)
call MPI_IRecv(buff_r, n*n, MPI_REAL, sendRank,   myRank, MPI_COMM_WORLD, Requests(1), ierr)
call MPI_ISend(buff_s, n*n, MPI_REAL, sendRank, sendRank, MPI_COMM_WORLD, Requests(2), ierr)
call MPI_WaitAll(2, Requests(1:2), Status(:,1:2), ierr)
!$acc end host_data
!
!$acc parallel loop collapse(2)
do i3=1,n
  do i2=1,n
    a(ir,i2,i3) = buff_r(i2,i3)
  end do
end do
!$acc end data
!
deallocate(buff_s,buff_r)
call system_clock(end)
!
!$acc update self(a(ir,1,1))
print *, "#", myRank, ":", a(ir,1,1), "elapsed time(1): ", real(end - init)/real(r)
!
!$acc end data
deallocate(a)
call MPI_FINALIZE(ierr)
!
end program main

Running it with two MPI tasks on the same node of a machine with 4 GH200s per node (srun -N1 -n2 --gpus-per-task=1 ./a.out), I get:

            1 :    200.0100
 #            1 :    1.000000     elapsed time(1):    1.2530000E-03
 #            1 :    1.000000     elapsed time(1):    9.6028000E-02
            0 :    1.000000
 #            0 :    200.0100     elapsed time(1):    1.5430000E-03
 #            0 :    200.0100     elapsed time(1):    9.6027002E-02

that is, copying the data to the host and communicating from the host is about a factor of 50-70 faster than communicating directly from the device. I will check with the HPC support why MPI communication directly from the device is so slow.

Best,
Fabio