OpenMP + CUDA Multiple Parallel Sections: Does GPU-to-thread linking persist across multiple parallel regions?

I am attempting to write a multi-GPU code using OpenMP. Every example I have seen thus far has been a small token example with only one parallel region in the OpenMP code and a single token kernel launch on the GPUs inside it. I am writing a much larger code, with many kernel launches and synchronizations between GPUs interleaved between series of kernel launches. I want to have multiple parallel regions and keep the thread-to-GPU linking across them, but I am not sure whether this is possible (or formally proper according to the CUDA programming specs). If I have to, I can put the entire code inside one parallel region with OMP master pragmas doing the serial work, but that would be a major modification to most of the code base!
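For reference, the restructuring I would like to avoid looks roughly like this (just a sketch; launch_kernels_phase1/do_serial_work/launch_kernels_phase2 are made-up stand-ins for my real code):

[codebox]
#include <stdio.h>
#include <omp.h>

// Hypothetical stand-ins for the real per-GPU and serial work.
void launch_kernels_phase1(int gpu) { printf("phase 1 on GPU %d\n", gpu); }
void do_serial_work(void)           { printf("serial work on master\n"); }
void launch_kernels_phase2(int gpu) { printf("phase 2 on GPU %d\n", gpu); }

int main() {
    int id, num_gpus;
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // Everything lives inside one parallel region; serial sections are
    // funneled through "omp master" (which has no implied barrier, hence
    // the explicit barriers around it).
    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        cudaSetDevice(id);

        launch_kernels_phase1(id);

        #pragma omp barrier
        #pragma omp master
        do_serial_work();
        #pragma omp barrier

        launch_kernels_phase2(id);
    }
    return 0;
}
[/codebox]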

I know I cannot call cudaSetDevice(…) more than once per GPU within my program without getting an error. So I have written up the following code:

[codebox]
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    int id, gpuId, num_gpus;

    // One OpenMP thread per GPU
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // First parallel region: bind each thread to "its" GPU
    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        cudaSetDevice(id);
    }

    printf("printing something in between parallel regions\n");

    // Second parallel region: check which GPU each thread is currently bound to
    #pragma omp parallel private(id, gpuId)
    {
        id = omp_get_thread_num();
        cudaGetDevice(&gpuId);
        printf("CPU ID: %d,\t, GPU ID: %d\n", id, gpuId);
    }
}
[/codebox]

When run, this code gives the following output:

[codebox]
[********@********** ~]$ nvcc -Xcompiler -fopenmp ompcuda.cu
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 0, , GPU ID: 0
CPU ID: 1, , GPU ID: 1
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 1, , GPU ID: 1
CPU ID: 0, , GPU ID: 0
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 0, , GPU ID: 0
CPU ID: 1, , GPU ID: 1
[********@********** ~]$ ./a.out
printing something in between parallel regions
CPU ID: 1, , GPU ID: 1
CPU ID: 0, , GPU ID: 0
[********@********** ~]$
[/codebox]

So, it seems to be working just fine in my token example. Though the CPU threads are destroyed after the parallel region, when they are re-created, the link to the correct GPU seems to be intact. My question is: Can I expect this in general? Is this “correct” according to specs and thus portable and reliable? If this is correct, then I can get my code working with much less effort. Thanks.

Actually, I suspect this works because the OpenMP implementation does not destroy threads between parallel regions, but rather keeps them around in a thread-pool. Thread creation can be somewhat slow, so reusing threads between parallel regions is a performance win.

That said, I don’t think anything in the OpenMP standard requires that thread pools are used, so if your program works, it does so by accident. I think the only robust way to run CUDA from OpenMP threads would be to use the CUDA Driver API and push/pop contexts at the beginning and end of each parallel section.
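Something along these lines, roughly (an untested sketch with error checking omitted; compile with nvcc and link against the driver library, -lcuda):

[codebox]
#include <cuda.h>
#include <omp.h>

int main() {
    int num_gpus;
    CUdevice dev;
    CUcontext ctx[16];                 // assumes at most 16 GPUs for the sketch

    cuInit(0);
    cuDeviceGetCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // Create one context per device, then pop it so it is not left bound
    // to the creating (main) thread.
    for (int i = 0; i < num_gpus; i++) {
        cuDeviceGet(&dev, i);
        cuCtxCreate(&ctx[i], 0, dev);
        cuCtxPopCurrent(&ctx[i]);
    }

    // Each parallel region pushes "its" context on entry and pops it on
    // exit, so it does not matter which OS thread the OpenMP runtime
    // happens to hand you this time around.
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        cuCtxPushCurrent(ctx[id]);
        /* driver-API kernel launches, memcpys, etc. for device id */
        cuCtxPopCurrent(&ctx[id]);
    }

    for (int i = 0; i < num_gpus; i++)
        cuCtxDestroy(ctx[i]);
    return 0;
}
[/codebox]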

Thanks for the great reply. I had a feeling that my correct result was not robust. If that’s the case, I cannot afford that much data transfer. And since the whole program must then be parallel anyway, it seems more worthwhile from a software development standpoint to go ahead and write a strictly MPI version, linking separate processes to separate GPUs. I wouldn’t anticipate the data transfer overhead being much higher between MPI processes than between threads, since the DMA cost probably dominates both. Plus, the code could seamlessly link GPUs on the same node, GPUs on different nodes, or a heterogeneous mixture of both from within the same code.
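What I am picturing is roughly the following (a sketch; it assumes one MPI rank per GPU on each node, and the work itself is just a placeholder comment):

[codebox]
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, num_gpus;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind each rank to one GPU; with one rank per GPU per node this is
    // just the rank modulo the local device count.
    cudaGetDeviceCount(&num_gpus);
    cudaSetDevice(rank % num_gpus);

    /* kernel launches for this rank's GPU go here; MPI point-to-point or
       collective calls handle the synchronizations between GPUs, whether
       they sit on the same node or on different nodes */

    printf("rank %d using GPU %d\n", rank, rank % num_gpus);

    MPI_Finalize();
    return 0;
}
[/codebox]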

This will get easier in a future release.

Is it possible to get any update on this?

That future release is now less in the future! But yeah, I know this particular case was really bad, so I explicitly set out to solve it.

Any update on this? I have a similar issue: I based my code on the Driver API to avoid any context creation overhead, but I am now working with Thrust, which makes it impossible to use the Driver API. I rewrote my code to use the runtime API, and am now in a situation where the context creation inside cudaSetDevice(i) (which I call every time I need to run my functions) is really slow.

Yeah, I was talking about CUDA 4.0 in this thread. If you use 4.0 and call cudaSetDevice inside your parallel region, things should be pretty reasonable.
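In other words, with 4.0 something like this should behave itself (just a sketch, error checking omitted):

[codebox]
#include <stdio.h>
#include <omp.h>

int main() {
    int num_gpus;
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);

    // With CUDA 4.0+ you can simply call cudaSetDevice at the top of every
    // parallel region: it selects the device for the calling thread and
    // does not recreate a context each time.
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        cudaSetDevice(id);
        /* kernel launches, cudaMemcpy, etc. on device id */
    }

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        cudaSetDevice(id);   // repeated in the next region; no special handling needed
        /* more work on device id */
    }
    return 0;
}
[/codebox]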