Hello, I’ve been trying to use the unified binary for accelerators in the 13.9 release with OpenACC, and I’m running into some confusing results. The GPU code is generated and works fine, but the resulting CPU code is serial. Is that intentional? It seems odd that the compiler wouldn’t also generate the corresponding OpenMP code so the CPU runs in parallel, especially since it currently prints an error when both “acc parallel” and “omp parallel <for/do>” appear on the same construct.
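To be concrete, this is the kind of combination I mean. It is a minimal hypothetical example, not my actual code, but pgfortran gives the error when the two directives sit on the same loop:

program combo_test
   implicit none
   integer, parameter :: n = 1024
   integer :: i
   real :: a(n), b(n)
   b = 1.0
   ! Putting both directives on the same construct is what gets rejected:
   !$acc parallel loop
   !$omp parallel do
   do i = 1, n
      a(i) = 2.0 * b(i)
   end do
   print *, a(1)
end program combo_test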
If not, perhaps it’s the way I’m building our codes; the make output is below with the full commands and output.
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline -c set_precision.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline -c set_constants.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline -c setup.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline -c fileio.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline -c matrix_manip.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline -c solvers.f90
ldc_explicit_iter:
    117, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator kernel generated
        119, !$acc loop gang ! blockidx%x
        121, !$acc loop vector(256) ! threadidx%x
    117, Generating present_or_copyin(soln(:x_nodes,:y_nodes,:))
         Generating present_or_copyout(soln_new(3:x_nodes-2,3:y_nodes-2,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    121, Loop is parallelizable
    167, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator kernel generated
        170, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
    167, Generating copyin(soln_new(:,3:y_nodes-2,:))
         Generating copyout(soln_new(x_nodes-2:x_nodes,3:y_nodes-2,:))
         Generating present_or_copyin(soln(:,1:y_nodes,:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    263, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Generating copyin(soln_new(3:x_nodes-2,:,:))
         Generating copyout(soln_new(3:x_nodes-2,y_nodes-2:y_nodes,:))
         Generating present_or_copyin(soln(:x_nodes,:,:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    265, Loop is parallelizable
         Accelerator kernel generated
        265, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    358, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator kernel generated
        360, !$acc loop gang ! blockidx%x
        362, !$acc loop vector(256) ! threadidx%x
    358, Generating present_or_copyin(soln(:x_nodes,:y_nodes,:))
         Generating present_or_copyout(soln_new(3:x_nodes-2,3:y_nodes-2,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    362, Loop is parallelizable
    411, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator kernel generated
        414, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
    411, Generating copyin(soln_new(:,3:y_nodes-2,:))
         Generating copyout(soln_new(x_nodes-2:x_nodes,3:y_nodes-2,:))
         Generating present_or_copyin(soln(:,1:y_nodes,:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    514, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Generating copyin(soln_new(3:x_nodes-2,:,:))
         Generating copyout(soln_new(3:x_nodes-2,y_nodes-2:y_nodes,:))
         Generating present_or_copyin(soln(:x_nodes,:,:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    516, Loop is parallelizable
         Accelerator kernel generated
        516, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    619, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator scalar kernel generated
         Generating present_or_copyin(soln(:,:,1:3))
         Generating copyin(soln_new(:,:y_nodes,1:3))
         Generating copyout(soln_new(1:3,1:3,1:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    816, Loop is parallelizable
         Accelerator kernel generated
        816, !$acc loop gang ! blockidx%y
             !$acc loop gang, vector(128) ! blockidx%x threadidx%x
ldc_explicit:
    874, Generating copy(soln_new(:,:,:))
         Generating copy(soln(:,:,:))
ldc_implicit:
   1172, Parallel region activated
   1175, Parallel loop activated with static block schedule
   1189, Barrier
         Parallel region terminated
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline -c ldc.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline set_precision.o set_constants.o setup.o fileio.o matrix_manip.o solvers.o ldc.o -o ldc
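For context, the compute regions in solvers.f90 all follow roughly the pattern below. This is a simplified sketch: soln, soln_new, x_nodes, and y_nodes match the -Minfo output above, but the subroutine name and loop body are placeholders, not my real code.

subroutine smooth(soln, soln_new, x_nodes, y_nodes)
   implicit none
   integer, intent(in) :: x_nodes, y_nodes
   real, intent(in)  :: soln(x_nodes, y_nodes, 3)
   real, intent(out) :: soln_new(x_nodes, y_nodes, 3)
   integer :: i, j
   ! Interior update; the present clause corresponds to the
   ! "Generating present" messages in the -Minfo output above.
   !$acc parallel loop gang present(soln, soln_new)
   do j = 3, y_nodes - 2
      !$acc loop vector(256)
      do i = 3, x_nodes - 2
         ! Placeholder body; the real code applies an explicit stencil.
         soln_new(i, j, 1) = soln(i, j, 1)
      end do
   end do
end subroutine smooth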
The user guide example shows that when a unified binary is produced, -Minfo prints two sets of messages for each compute region, one for the GPU and one for the host CPU. Since I only see the GPU messages above, I’m thinking it’s just not generating what I want. Is there an option I’m missing, perhaps?