OpenMP + CUDA Multiple Parallel Sections: Does GPU-to-thread linking persist across multiple parallel regions?

I am attempting to write a multi-GPU code using OpenMP. Every example I have seen thus far has been a small token example with only one parallel region in the OpenMP code and a single token kernel launch on the GPUs inside it. I am writing a much larger code, with many kernel launches and synchronizations between GPUs interleaved between series of kernel launches. I want to have multiple parallel regions and keep the thread-to-GPU linking across them, but I am not sure whether this is possible (or formally proper according to the CUDA programming specs). If I have to, I can put the entire code inside one parallel region with OMP master pragmas doing the serial work, but that would be a major modification to most of the code base!
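For reference, the restructuring I would like to avoid looks roughly like this (just a sketch; launch_kernels_phase1/do_serial_work/launch_kernels_phase2 are made-up stand-ins for my real code):

[codebox]
#include <stdio.h>
#include <omp.h>

// Hypothetical stand-ins for the real per-GPU and serial work.
void launch_kernels_phase1(int gpu) { printf("phase 1 on GPU %d\n", gpu); }
void do_serial_work(void)           { printf("serial work on master\n"); }
void launch_kernels_phase2(int gpu) { printf("phase 2 on GPU %d\n", gpu); }

int main() {
    int id, num_gpus;
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // Everything lives inside one parallel region; serial sections are
    // funneled through "omp master" (which has no implied barrier, hence
    // the explicit barriers around it).
    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        cudaSetDevice(id);

        launch_kernels_phase1(id);

        #pragma omp barrier
        #pragma omp master
        do_serial_work();
        #pragma omp barrier

        launch_kernels_phase2(id);
    }
    return 0;
}
[/codebox]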

I know I cannot call cudaSetDevice(…) more than once per GPU within my program without getting an error. So I have written up the following code:

[codebox]
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    int id, gpuId, num_gpus;

    // One OpenMP thread per GPU
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // First parallel region: bind each thread to "its" GPU
    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        cudaSetDevice(id);
    }

    printf("printing something in between parallel regions\n");

    // Second parallel region: check which GPU each thread is currently bound to
    #pragma omp parallel private(id, gpuId)
    {
        id = omp_get_thread_num();
        cudaGetDevice(&gpuId);
        printf("CPU ID: %d,\t, GPU ID: %d\n", id, gpuId);
    }
}
[/codebox]

When run, this code gives the following output:

[codebox]
[********@********** ~]$ nvcc -Xcompiler -fopenmp ompcuda.cu
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 0, , GPU ID: 0
CPU ID: 1, , GPU ID: 1
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 1, , GPU ID: 1
CPU ID: 0, , GPU ID: 0
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 0, , GPU ID: 0
CPU ID: 1, , GPU ID: 1
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 1, , GPU ID: 1
CPU ID: 0, , GPU ID: 0
[********@********** ~]$
[/codebox]

So, it seems to be working just fine in my token example. Though the CPU threads are destroyed after the parallel region, when they are re-created, the link to the correct GPU seems to be intact. My question is: Can I expect this in general? Is this “correct” according to specs and thus portable and reliable? If this is correct, then I can get my code working with much less effort. Thanks.

Actually, I suspect this works because the OpenMP implementation does not destroy threads between parallel regions, but rather keeps them around in a thread-pool. Thread creation can be somewhat slow, so reusing threads between parallel regions is a performance win.

That said, I don’t think anything in the OpenMP standard requires that thread pools are used, so if your program works, it does so by accident. I think the only robust way to run CUDA from OpenMP threads would be to use the CUDA Driver API and push/pop contexts at the beginning and end of each parallel section.
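Something along these lines, roughly (an untested sketch with error checking omitted; compile with nvcc and link against the driver library, -lcuda):

[codebox]
#include <cuda.h>
#include <omp.h>

int main() {
    int num_gpus;
    CUdevice dev;
    CUcontext ctx[16];                 // assumes at most 16 GPUs for the sketch

    cuInit(0);
    cuDeviceGetCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // Create one context per device, then pop it so it is not left bound
    // to the creating (main) thread.
    for (int i = 0; i < num_gpus; i++) {
        cuDeviceGet(&dev, i);
        cuCtxCreate(&ctx[i], 0, dev);
        cuCtxPopCurrent(&ctx[i]);
    }

    // Each parallel region pushes "its" context on entry and pops it on
    // exit, so it does not matter which OS thread the OpenMP runtime
    // happens to hand you this time around.
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        cuCtxPushCurrent(ctx[id]);
        /* driver-API kernel launches, memcpys, etc. for device id */
        cuCtxPopCurrent(&ctx[id]);
    }

    for (int i = 0; i < num_gpus; i++)
        cuCtxDestroy(ctx[i]);
    return 0;
}
[/codebox]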

Thanks for the great reply. I had a feeling that my correct result was not robust. If that’s the case, I cannot afford that much data transfer. And since the whole program must then be parallel anyway, it seems more worthwhile from a software development standpoint to go ahead and write a strictly MPI version, linking separate processes to separate GPUs. I wouldn’t anticipate the data transfer overhead being much higher between MPI processes than between threads, since the DMA cost probably dominates both. Plus, the code could seamlessly link GPUs on the same node, GPUs on different nodes, or a heterogeneous mixture of both from within the same code.
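What I am picturing is roughly the following (a sketch; it assumes one MPI rank per GPU on each node, and the work itself is just a placeholder comment):

[codebox]
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, num_gpus;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind each rank to one GPU; with one rank per GPU per node this is
    // just the rank modulo the local device count.
    cudaGetDeviceCount(&num_gpus);
    cudaSetDevice(rank % num_gpus);

    /* kernel launches for this rank's GPU go here; MPI point-to-point or
       collective calls handle the synchronizations between GPUs, whether
       they sit on the same node or on different nodes */

    printf("rank %d using GPU %d\n", rank, rank % num_gpus);

    MPI_Finalize();
    return 0;
}
[/codebox]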

This will get easier in a future release.

Is it possible to get any update on this?

That future release is now less in the future! But yeah, I know this particular case was really bad, so I explicitly set out to solve it.

Any update on this? I have a similar issue: I based my code on the Driver API to avoid any context creation overhead, but I am now working with Thrust, which makes it impossible to use the Driver API. I rewrote my code to use the runtime API, and am now in a situation where the context creation inside cudaSetDevice(i) (which I call every time I need to run my functions) is really slow.

Yeah, I was talking about CUDA 4.0 in this thread. If you use 4.0 and call cudaSetDevice inside your parallel region, things should be pretty reasonable.
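In other words, with 4.0 something like this should behave itself (just a sketch, error checking omitted):

[codebox]
#include <stdio.h>
#include <omp.h>

int main() {
    int num_gpus;
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // With CUDA 4.0+ you can simply call cudaSetDevice at the top of every
    // parallel region: it selects the device for the calling thread and
    // does not recreate a context each time.
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        cudaSetDevice(id);
        /* kernel launches, cudaMemcpy, etc. on device id */
    }

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        cudaSetDevice(id);   // repeated in the next region; no special handling needed
        /* more work on device id */
    }
    return 0;
}
[/codebox]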