Hello, I’m trying to port a simulation code to GPUs using OpenACC, and I’m having trouble combining MPI with the OpenACC directives. The code uses derived MPI datatypes to extract slices of 3D arrays and communicate just the boundary data that I need. The problem is that the 3D arrays are members of a derived type, and I’m having trouble making this work with the host_data use_device directive.
Here is a reproducible example:
module data
  use mpi
  implicit none
  type box
    integer :: m,n
    integer, dimension(:,:,:), allocatable :: a
  end type box
  !
contains
  !
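  ! Build an MPI datatype that selects one x-slice of b%a, with its displacement
  ! taken relative to the address of member m (the start of the box structure)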
  subroutine create_subarray(b,i1,BoxDataType)
    implicit none
    !
    type(box), intent(in) :: b
    integer, intent(in) :: i1
    integer, intent(out) :: BoxDataType
    !
    integer :: ierr, TypeMPI, n
    integer, dimension(1) :: a_BlockLengths, a_Types
    integer(kind=MPI_ADDRESS_KIND), dimension(1) :: a_Displacements, dummy
    !
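    ! Extent n of the (cubic) array, and a subarray type selecting the single
    ! 1 x n x n x-slice at zero-based offset i1 along the first dimension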
    n = b%n - b%m + 1
    call MPI_Type_Create_Subarray(3, &
                                  (/ n, n, n /), &
                                  (/ 1, n, n /), &
                                  (/ i1, 0, 0 /), &
                                  MPI_ORDER_FORTRAN, &
                                  MPI_INTEGER, &
                                  TypeMPI, &
                                  ierr)
    !
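    ! Wrap the subarray in a struct type whose displacement is the offset of a
    ! relative to the first member m, so that the box variable itself can be
    ! passed as the MPI buffer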
    a_BlockLengths(1) = 1
    call MPI_Get_Address(b%a(0,0,0), a_Displacements(1), ierr)
    call MPI_Get_Address(b%m, dummy(1), ierr)
    a_Displacements(1) = a_Displacements(1)-dummy(1)
    a_Types(1) = TypeMPI
    !
    call MPI_Type_Create_Struct(1, a_BlockLengths(1), a_Displacements(1), a_Types(1), &
                                BoxDataType, ierr)
    call MPI_Type_Commit(BoxDataType, ierr)
    !
  end subroutine create_subarray
  !
end module data
!
program main
  use data
  implicit none
  !
  integer :: myRank, ierr
  integer, parameter :: n=3
  type(box) :: b1, b2
  integer :: i1,i2,i3
  integer, dimension(2) :: Requests, BoxGhostTypes
  integer, dimension(MPI_STATUS_SIZE,1:2) :: Status
  !
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, ierr)
  !
  b1%m = 1
  b1%n = n
  allocate(b1%a(1:n,1:n,1:n))
  b2%m = 0
  b2%n = n+1
  allocate(b2%a(0:n+1,0:n+1,0:n+1))
  !
  b1%a = myRank
  b2%a = -1
  !
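  ! Fill the interior of b2%a from b1%a on the device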
  !$acc data copyin(b1,b1%a,b2) copy(b2%a)
  !$acc parallel loop collapse(3)
  do i3=1,n
    do i2=1,n
      do i1=1,n
        b2%a(i1,i2,i3) = b1%a(i1,i2,i3)*100 + i1
      end do
    end do
  end do
  !
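  ! Exchange one x-slice with the other rank, staging through the host with
  ! update self / update device; the commented host_data lines are the variant
  ! that does not work for me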
  !$acc update self(b2%a)
  !!$acc host_data use_device(b2%a)
  if (myRank .eq. 0) then
    call create_subarray(b2,0,BoxGhostTypes(1)) ! recv
    call create_subarray(b2,1,BoxGhostTypes(2)) ! send
    !
    call MPI_IRecv(b2, 1, BoxGhostTypes(1), 1, 0, MPI_COMM_WORLD, Requests(1), ierr)
    call MPI_ISend(b2, 1, BoxGhostTypes(2), 1, 1, MPI_COMM_WORLD, Requests(2), ierr)
  else
    call create_subarray(b2,n+1,BoxGhostTypes(1)) ! recv
    call create_subarray(b2,n  ,BoxGhostTypes(2)) ! send
    !
    call MPI_IRecv(b2, 1, BoxGhostTypes(1), 0, 1, MPI_COMM_WORLD, Requests(1), ierr)
    call MPI_ISend(b2, 1, BoxGhostTypes(2), 0, 0, MPI_COMM_WORLD, Requests(2), ierr)
  end if
  !
  call MPI_WaitAll(2, Requests, Status, ierr)
  !$acc update device(b2%a)
  !!$acc end host_data
  !
  !$acc end data
  print *, "#", myRank, ":", b2%a(:,2,2)
  !
  deallocate(b1%a,b2%a)
  !
  call MPI_FINALIZE(ierr)
  !
end program main
I’m compiling it with nvfortran 24.11. Running with 2 MPI tasks (each with its own GPU), it returns:
# 0 : 103 1 2 3 -1
# 1 : -1 101 102 103 1
However, if I use the host_data use_device(b2%a) directive instead of the update self / update device directives, I get the wrong result: the boundary data is communicated from the host copy of the array, not from the device copy. I also tried making the member a a pointer rather than an allocatable, but that does not solve the problem. If, on the other hand, I rewrite the code with a being a plain array rather than a member of a Fortran derived type, everything works correctly.
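For reference, when a is a plain array the pattern that works for me looks roughly like this (just a sketch; SliceRecvType/SliceSendType are subarray types built directly on a, and otherRank, recvTag, sendTag are placeholders):

!$acc host_data use_device(a)
call MPI_IRecv(a, 1, SliceRecvType, otherRank, recvTag, MPI_COMM_WORLD, Requests(1), ierr)
call MPI_ISend(a, 1, SliceSendType, otherRank, sendTag, MPI_COMM_WORLD, Requests(2), ierr)
!$acc end host_data
call MPI_WaitAll(2, Requests, Status, ierr)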
Is there a way to MPI-communicate arrays that are members of a Fortran derived type directly from the device, or should I copy the slices into a separate buffer and communicate that buffer instead?
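In case it matters for the answer, the buffer workaround I have in mind would replace the update self / update device block with something roughly like this (untested sketch; sendBuf/recvBuf, iSend/iRecv, otherRank, sendTag and recvTag are placeholders, with iSend/iRecv set per rank as in the if/else above):

integer, dimension(0:n+1,0:n+1) :: sendBuf, recvBuf   ! one full x-slice of b2%a (declared with the other locals)

!$acc data create(sendBuf, recvBuf)
! pack the slice to send, on the device
!$acc parallel loop collapse(2)
do i3=0,n+1
  do i2=0,n+1
    sendBuf(i2,i3) = b2%a(iSend,i2,i3)
  end do
end do
! communicate the contiguous buffers directly from device memory
!$acc host_data use_device(sendBuf, recvBuf)
call MPI_IRecv(recvBuf, (n+2)*(n+2), MPI_INTEGER, otherRank, recvTag, MPI_COMM_WORLD, Requests(1), ierr)
call MPI_ISend(sendBuf, (n+2)*(n+2), MPI_INTEGER, otherRank, sendTag, MPI_COMM_WORLD, Requests(2), ierr)
!$acc end host_data
call MPI_WaitAll(2, Requests, Status, ierr)
! unpack the received slice, on the device
!$acc parallel loop collapse(2)
do i3=0,n+1
  do i2=0,n+1
    b2%a(iRecv,i2,i3) = recvBuf(i2,i3)
  end do
end do
!$acc end data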
Best,
Fabio