I am attempting to write multi-GPU code using OpenMP. Every example I have seen so far has been a small token example with a single parallel region in the OpenMP code and a single token kernel launch on the GPUs inside it. I'm writing a much larger code, one with many kernel launches and synchronizations between GPUs between series of kernel launches. I want to have multiple parallel regions and keep the thread-to-GPU binding across them, but I'm not sure whether this is possible (or formally correct according to the CUDA programming specs). If I have to, I can wrap the entire code in one parallel region with OMP master pragmas doing the serial work, but that would be a major modification to most of the code base!
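For reference, the fallback I'd like to avoid would look roughly like the sketch below; stepOne/stepTwo are just placeholder kernels standing in for my real launch series:
[codebox]
#include <stdio.h>
#include <omp.h>

__global__ void stepOne(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void stepTwo(float *d) { d[threadIdx.x] *= 2.0f; }

int main() {
    int num_gpus;
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        cudaSetDevice(id);                 // bind this thread to one GPU, once

        float *d;
        cudaMalloc((void**)&d, 32 * sizeof(float));

        stepOne<<<1, 32>>>(d);             // first series of kernels
        cudaDeviceSynchronize();

        #pragma omp barrier                // make sure every GPU is done
        #pragma omp master
        {
            printf("serial host work between kernel series\n");
        }
        #pragma omp barrier

        stepTwo<<<1, 32>>>(d);             // next series of kernels
        cudaDeviceSynchronize();
        cudaFree(d);
    }
    return 0;
}
[/codebox]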
I know I cannot call cudaSetDevice(…) more than once per GPU within my program without getting an error. So I have written up the following code:
[codebox]
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    int id, gpuId, num_gpus;

    // one OpenMP thread per GPU
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // first parallel region: bind each thread to "its" GPU
    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        cudaSetDevice(id);
    }

    printf("printing something in between parallel regions\n");

    // second parallel region: check which GPU each thread is attached to
    #pragma omp parallel private(id, gpuId)
    {
        id = omp_get_thread_num();
        cudaGetDevice(&gpuId);
        printf("CPU ID: %d,\t, GPU ID: %d\n", id, gpuId);
    }

    return 0;
}
[/codebox]
This code when run gives the following output:
[codebox]
[********@********** ~]$ nvcc -Xcompiler -fopenmp ompcuda.cu
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 0, , GPU ID: 0
CPU ID: 1, , GPU ID: 1
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 1, , GPU ID: 1
CPU ID: 0, , GPU ID: 0
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 0, , GPU ID: 0
CPU ID: 1, , GPU ID: 1
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 1, , GPU ID: 1
CPU ID: 0, , GPU ID: 0
[********@********** ~]$
[/codebox]
So it seems to work just fine in my token example: even though the CPU threads are destroyed at the end of a parallel region, when they are re-created the link to the correct GPU appears to be intact. My question is: can I expect this in general? Is it "correct" according to the specs, and therefore portable and reliable? If so, I can get my code working with much less effort.
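For what it's worth, here is a rough sketch of the structure I'm hoping is legal, with dummyKernel standing in for my real kernel series: bind each thread to a GPU in one parallel region, then launch from later regions and count on the binding persisting.
[codebox]
#include <stdio.h>
#include <omp.h>

__global__ void dummyKernel(int *flag) { *flag = 1; }

int main() {
    int num_gpus;
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // first region: bind each thread to its GPU, once
    #pragma omp parallel
    {
        cudaSetDevice(omp_get_thread_num());
    }

    printf("ordinary serial host code here\n");

    // later region: launch kernels, relying on the earlier binding
    #pragma omp parallel
    {
        int gpuId, h_flag = 0;
        int *d_flag;
        cudaGetDevice(&gpuId);
        cudaMalloc((void**)&d_flag, sizeof(int));
        dummyKernel<<<1, 1>>>(d_flag);
        cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
        printf("thread %d ran a kernel on GPU %d (flag=%d)\n",
               omp_get_thread_num(), gpuId, h_flag);
        cudaFree(d_flag);
    }
    return 0;
}
[/codebox]
Thanks.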