Unified binary for accelerators, serial?

Hello, I’ve been trying to use the unified binary for accelerators in the 13.9 release with OpenACC and running into some confusing results. While the GPU code is generated and works fine, the resulting CPU code is serial. Is that intentional? Specifically it seems odd that it wouldn’t generate the necessary OpenMP result for the CPU to run in parallel as well since the compiler currently prints an error when including both “acc parallel” and “omp parallel <for/do>” on the same construct.

If not, perhaps it’s the way I’m building our codes, make output is below with full commands and ouput.

/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline  -c set_precision.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline  -c set_constants.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline  -c setup.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline  -c fileio.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline  -c matrix_manip.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline  -c solvers.f90
ldc_explicit_iter:
    117, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator kernel generated
        119, !$acc loop gang ! blockidx%x
        121, !$acc loop vector(256) ! threadidx%x
    117, Generating present_or_copyin(soln(:x_nodes,:y_nodes,:))
         Generating present_or_copyout(soln_new(3:x_nodes-2,3:y_nodes-2,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    121, Loop is parallelizable
    167, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator kernel generated
        170, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
    167, Generating copyin(soln_new(:,3:y_nodes-2,:))
         Generating copyout(soln_new(x_nodes-2:x_nodes,3:y_nodes-2,:))
         Generating present_or_copyin(soln(:,1:y_nodes,:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    263, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Generating copyin(soln_new(3:x_nodes-2,:,:))
         Generating copyout(soln_new(3:x_nodes-2,y_nodes-2:y_nodes,:))
         Generating present_or_copyin(soln(:x_nodes,:,:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    265, Loop is parallelizable
         Accelerator kernel generated
        265, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    358, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator kernel generated
        360, !$acc loop gang ! blockidx%x
        362, !$acc loop vector(256) ! threadidx%x
    358, Generating present_or_copyin(soln(:x_nodes,:y_nodes,:))
         Generating present_or_copyout(soln_new(3:x_nodes-2,3:y_nodes-2,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    362, Loop is parallelizable
    411, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator kernel generated
        414, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
    411, Generating copyin(soln_new(:,3:y_nodes-2,:))
         Generating copyout(soln_new(x_nodes-2:x_nodes,3:y_nodes-2,:))
         Generating present_or_copyin(soln(:,1:y_nodes,:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    514, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Generating copyin(soln_new(3:x_nodes-2,:,:))
         Generating copyout(soln_new(3:x_nodes-2,y_nodes-2:y_nodes,:))
         Generating present_or_copyin(soln(:x_nodes,:,:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    516, Loop is parallelizable
         Accelerator kernel generated
        516, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    619, Generating present(soln_new(:,:,:))
         Generating present(soln(:,:,:))
         Accelerator scalar kernel generated
         Generating present_or_copyin(soln(:,:,1:3))
         Generating copyin(soln_new(:,:y_nodes,1:3))
         Generating copyout(soln_new(1:3,1:3,1:3))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    816, Loop is parallelizable
         Accelerator kernel generated
        816, !$acc loop gang ! blockidx%y
             !$acc loop gang, vector(128) ! blockidx%x threadidx%x
ldc_explicit:
    874, Generating copy(soln_new(:,:,:))
         Generating copy(soln(:,:,:))
ldc_implicit:
   1172, Parallel region activated
   1175, Parallel loop activated with static block schedule
   1189, Barrier
         Parallel region terminated
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline  -c ldc.f90
/opt/pgi/linux86-64/13.5/bin/pgfortran -I functions/ -O4 -acc -ta=nvidia,host -tp=amd64 -Minfo=accel,mp,unified,par -mp=allcores -Minline  set_precision.o set_constants.o setup.o fileio.o matrix_manip.o solvers.o ldc.o -o ldc

Based on the user guide example, showing that when the unified binary is produced two statements are printed by -Minfo one for each of the GPU and the CPU devices, I’m thinking it’s just not generating what I want. Is there an option missing perhaps?

While the GPU code is generated and works fine, the resulting CPU code is serial. Is that intentional?

“-ta=host” targets a serial host and is meant for portability. We discussed internally about targeting a multi-core system as if it were an accelerator, but it hasn’t been added to our road-map as of yet. Once support for AMD is out (Open beta will be in 13.10), we’ll decide on our next target. I’ll let management know you’re looking for multi-core CPU support.

  • Mat

Thank you for the quick reply.

In the meantime is there some way to use both omp and acc parallel constructs in conjunction with it to get that effect? I mean just supplying both such that the host version uses the OMP directives and the NVIDIA version uses the ACC directives. My attempts resulted in errors, but it seems like it would be a simple way to at least let users do it when necessary.

In the meantime is there some way to use both omp and acc parallel constructs in conjunction with it to get that effect?

You can use OMP and ACC in combination to utilize multiple GPUs (or a single GPU if your device supports Hyper-Q). Basically, you’re adding an additional layer of parallelism above OpenACC. So if your algorithm can take advantage of this extra layer (or needs it due to memory limits on a single GPU), then it’s a way to go.

In my opinion, what you don’t want to do is try to have part of the work done by the GPU and part by a Multi-core CPU via OpenMP. It’s certainly possible to do, but makes things very complex as you then need to balance work performed by different resources (it becomes a hard scheduling problem unless you know your exact workload and compute resources)

Personally, I much prefer using MPI over OpenMP for multi-GPU programming. It’s much easier to write since the domain decomposition is natural and data is already discrete between MPI processes. In OpenMP, typically domain decomposition is done by the compiler and global data is often shared. When putting OpenACC under OpenMP, you now need to have each OpenMP thread mange it’s GPU and the GPU’s data. Very possible to do, but it isn’t how OpenMP is typically programmed.

Also, if you did decide to distribute work across both multi-core and GPUs, I’d still recommend using MPI at a higher level. Each MPI process would then either run it’s work on the GPU using OpenACC or multi-core via OpenMP. You’d still have a scheduling problem, but it would be a bit easier to manage.

In my article 5x in 5 Hours: Porting a 3D Elastic Wave Simulator to GPUs Using OpenACC, I show how multiple GPUs can be used with MPI.

  • Mat

Hi,

my question goes into the same direction. I want to ship my application as a single binary to my customer, but don’t know whether he is going to execute it on a multicore host with or without an attached GPU. Previously I was using an MPI-OpenMP hybrid implementation, i.e. domain decomposition and for each domain OpenMP parallel for loop constructs to parallelize work within a domain.
Now, if the customer would run the application on a cluster where each node has a GPU attached to it, I would like disable OpenMP parallel for and instead offload the work to the GPU.

In a simple example (Jacobi iteration) this could look like:

#ifdef _OPENACC
#pragma acc data copyin(Anew[m*n]), copy(A[m*n]) if(acc_get_device_type()!=acc_device_host)
#endif
 while(error > tol && iter < iter_max){
   error = 0.0;
   
#ifdef _OPENMP
#pragma omp parallel for shared(A,Anew,m,n) reduction(max:error) if(acc_get_device_type()==acc_device_host)
#endif

#ifdef _OPENACC
#pragma acc kernels if(acc_get_device_type()!=acc_device_host)
#pragma acc loop independent collapse(2) 
#endif
     for(int i=1;i<m-1;i++){
       for(int j=1;j<n-1;j++){
	 Anew[Idx(i,j)] = 0.25 * ( A[Idx(i,j+1)] + A[Idx(i,j-1)]
				   + A[Idx(i-1,j)] + A[Idx(i+1,j)] );
	 error = fmax( error, fabs(Anew[Idx(i,j)] - A[Idx(i,j)]));
       }
     }
     
#ifdef _OPENMP
#pragma omp parallel for shared(A,Anew,m,n) if(acc_get_device_type()==acc_device_host)
#endif
#ifdef _OPENACC
#pragma acc kernels if(acc_get_device_type()!=acc_device_host)
#endif
#ifdef _OPENACC
#pragma acc loop independent collapse(2) 
#endif
     for( int i = 1; i < m-1; i++){
       for( int j = 1; j < n-1; j++ ){
	 A[Idx(i,j)] = Anew[Idx(i,j)];
       }
     } 
   
   if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
   
   iter++;
 }

Compiling this with:
pgcpp -mp -acc -Minfo=accel,mp -ta=nvidia,host -fast main.C
gives the following output:

pgcpp-Fatal-/opt/pgi/linux86-64/13.8/bin/pgcpp2 TERMINATED by signal 11
Arguments to /opt/pgi/linux86-64/13.8/bin/pgcpp2
/opt/pgi/linux86-64/13.8/bin/pgcpp2 main.C -opt 2 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -vect 56 -y 34 16 -x 34 0x8 -x 32 20971520 -y 19 8 -y 35 0 -x 42 0x30 -x 39 0x40 -x 199 10 -x 39 0x80 -x 34 0x400000 -x 149 1 -x 150 1 -x 59 4 -x 59 4 -tp nehalem -x 120 0x1000 -astype 0 -x 121 1 -fn main.C -il /tmp/pgcpp4p7O155wEJz.il -x 117 0x600 -x 123 0x80000000 -x 123 4 -x 2 0x400 -x 119 0x20 -def __pgnu_vsn=40102 -alwaysinline /opt/pgi/linux86-64/13.8/lib/libintrinsics.il 4 -autoinl 10 -x 168 100 -x 174 8000 -x 14 0x200000 -x 120 0x200000 -x 70 0x40000000 -accel nvidia -accel host -mp -x 69 0x200 -x 69 0x400 -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -y 70 0x40000000 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16512 -x 162 16512 -x 163 0x1 -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -x 189 8 -x 176 0x140000 -x 177 0x0202007f -x 176 0x100 -x 186 0x10000 -x 176 0x100 -x 186 0x20000 -x 176 0x100 -x 176 0x100 -x 189 4 -y 70 0x40000000 -x 9 1 -x 42 0x14200000 -x 72 0x1 -x 136 0x11 -quad -x 119 0x10000000 -x 129 0x40000000 -x 129 2 -x 164 0x1000 -gnuvsn 40102 -x 69 0x200 -cmdline '+pgcpp /tmp/pgcpp4p7O155wEJz.il -mp -acc -Minfo=accel,mp -ta=nvidia,host -fast -Mvect=sse -Mcache_align -Mflushz -Mpre' -asm /tmp/pgcpp4p7OIiLHvxh.sm

I understand that the if-clauses don’t help too much at compile time and the compiler probably cannot figure out that I want either OpenMP or OpenACC. As you mentioned before there are good reasons to use both at the same time e.g. for handling multiple GPUs.

Do you know of any workaround to achieve my goal or will I always have to produce two binaries, i.e. one compiled with -mp only and the other compiled with -acc only?

Thanks,
LS

Do you know of any workaround to achieve my goal or will I always have to produce two binaries, i.e. one compiled with -mp only and the other compiled with -acc only?

You don’t need two binaries, but would need two different code paths, one OpenMP enabled and one OpenACC enabled. You then set a flag in your code to determine which code path to take.

Since you’re using MPI, I could imagine scenarios where some processes run OpenMP and others run OpenACC depending upon the resources available and how the domain is decomposed.

  • Mat

You don’t need two binaries, but would need two different code paths, one OpenMP enabled and one OpenACC enabled. You then set a flag in your code to determine which code path to take.

OK, that would be a solution, however, it would come at the expense of code duplication if I understand you correctly:

if(accelerator_attachedd){
#pragma acc kernels ....
  /*
   loop nests
 */

}else{
#pragma omp parallel ...
  /*
   loop nests
 */
}

For larger kernel regions keeping the two paths in sync could be error prone.

Since you’re using MPI, I could imagine scenarios where some processes run OpenMP and others run OpenACC depending upon the resources available and how the domain is decomposed.

I would imagine that the load balancing could be tricky especially if you don’t know the cluster composition of your customers.

Anyway, thanks a lot for your quick feedback. Really appreciated.

Best,
LS

OK, that would be a solution, however, it would come at the expense of code duplication if I understand you correctly:

Correct. Not ideal, but will work.

I would imagine that the load balancing could be tricky especially if you don’t know the cluster composition of your customers.

Absolutely! There’s a PhD candidate, Tom Scogland, at Virginia Tech who’s looking at this problem as part of his thesis. (http://tom.scogland.com/).

  • Mat