OpenMP + CUDA, need help!

Does anyone have experience with two-GPU programming using CUDA + OpenMP?
I'm having trouble with my code:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

static void HandleError(cudaError_t err, const char *file, int line){
	if(err!=cudaSuccess){
		printf("%s in %s at line %d
", cudaGetErrorString(err), file, line);
		exit(EXIT_FAILURE);
	}
}

#define HANDLE_ERROR(err)(HandleError(err, __FILE__, __LINE__)) 

__global__ void Test(int *x){
	x[threadIdx.x] = 1;
}

int main(){
	int *x;
	int iam;

	omp_set_num_threads(2);

	#pragma omp parallel private(x, iam)
	{
		iam = omp_get_thread_num();
		cudaSetDevice(iam);

		HANDLE_ERROR(cudaMalloc((void**)&x, 100*sizeof(int)));

		#pragma omp barrier

		if(iam==0){
			Test<<<1,100>>>(x);
		}
		#pragma omp barrier
		cudaDeviceSynchronize();
		if(iam==1){
			HANDLE_ERROR(cudaMemcpyPeer(x, 1, x, 0, 100*sizeof(int)));
		}

		#pragma omp barrier
	}
	cudaFree(x);

	return 0;
}

Above is a simple test case I wrote. It just creates two threads and grants control of one GPU to each thread, respectively. GPU 1 then writes something into its memory through a kernel launch, and GPU 2 copies it into its own memory using cudaMemcpyPeer. My platform is a Tesla C2050 + GTS 430, both Fermi GPUs. I have to use cudaMemcpyPeer to reduce data-copy overhead. When I run the program, it shows me "invalid argument", and cuda-gdb shows "invalid device ordinal". Anybody who has a clue on this will be much appreciated. Thank you!

Hi snider524,

Before doing any P2P transfer of any kind, you first have to check that both devices are capable of communicating between each other, then allow them to communicate.
You can check if communication is possible with:

int gpuid0=0;
int gpuid1=1;
int can_access_peer_0_1=0, can_access_peer_1_0=0;

cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid0, gpuid1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid1, gpuid0);

If both can_access_peer_0_1 and can_access_peer_1_0 are not equal to 0, then the GPUs can communicate with each other.

You then have to enable P2P access both ways:

cudaSetDevice(gpuid0);
cudaDeviceEnablePeerAccess(gpuid1, 0);
cudaSetDevice(gpuid1);
cudaDeviceEnablePeerAccess(gpuid0, 0);

It is then possible to do P2P memory copies.
Finally, you have to disable P2P access with

cudaDeviceDisablePeerAccess(i);

where i is the target GPU with which you no longer want to communicate.
Note that all the above code snippets assume serial execution.
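Putting the snippets together, a minimal serial sketch (d0 and d1 are hypothetical buffers already cudaMalloc'd on devices 0 and 1):

int can01 = 0, can10 = 0;
cudaDeviceCanAccessPeer(&can01, 0, 1);
cudaDeviceCanAccessPeer(&can10, 1, 0);

if(can01 && can10){
	cudaSetDevice(0);
	cudaDeviceEnablePeerAccess(1, 0);	// the flags argument must be 0
	cudaSetDevice(1);
	cudaDeviceEnablePeerAccess(0, 0);

	// copy 100 ints from device 0's buffer into device 1's buffer
	cudaMemcpyPeer(d1, 1, d0, 0, 100*sizeof(int));

	// disabling is per-device, so switch contexts again
	cudaSetDevice(0);
	cudaDeviceDisablePeerAccess(1);
	cudaSetDevice(1);
	cudaDeviceDisablePeerAccess(0);
}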

For a parallel execution, I adapted your OpenMP code; it ran successfully on two M2070s. In your case, you first want to declare two different arrays (x0, x1) instead of one (x) so that the addresses resolve correctly (see the code below). You can then use cudaMemcpy to transfer data across GPUs.

int main(){
	int iam;
	int canaccess=0;
	int np=0;
	omp_set_num_threads(2);

	int *x0,*x1;
	#pragma omp parallel default(shared)  private(iam, canaccess)
	{

		iam=omp_get_thread_num();
		np = omp_get_num_threads();

		cudaSetDevice(iam);

		if(iam == 0) cudaMalloc((void**)&x0,100*sizeof(int));
		else if(iam == 1) cudaMalloc((void**)&x1,100*sizeof(int));

		// Enable peer access for everyone
		for(int i = 0;i<np;i++)
		{
			if( i != iam)
			{
				cudaDeviceCanAccessPeer(&canaccess,iam,i); 
				if(canaccess ==0)
					printf("%d cant access %d
",iam,i);

				else
				{
					printf("%d can access %d
",iam,i);
					cudaDeviceEnablePeerAccess(i,0);
				}	
			}		
		}		

		if(iam==0){
			Test<<<1,100>>>(x0);
			cudaDeviceSynchronize();
		}
		#pragma omp barrier

		if(iam == 0)	HANDLE_ERROR(cudaMemcpy(x1,x0,100*sizeof(int),cudaMemcpyDefault));

		// cudaMemcpy is a blocking call, no need for cudaDeviceSynchronize here
		#pragma omp barrier
		// disable P2P accesses
		for(int i = 0;i<np;i++)
			if(i!=iam)
				cudaDeviceDisablePeerAccess(i);

		// Free the memory corresponding to the correct context
		#pragma omp barrier
		if(iam == 0) cudaFree(x0);
		else if(iam == 1) cudaFree(x1);
	}
	return 0;
}

Pyt-

Yes, and apparently peer-to-peer capability is only bestowed upon Tesla cards. Since one of your cards is not a Tesla, this won't work.

“When the application is run as a 64-bit process on Windows Vista/7 in TCC mode (see Section 3.6), on Windows XP, or on Linux, devices of compute capability 2.0 and higher from the Tesla series may address each other’s memory (i.e. a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.
Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample.”

See the CUDA C Programming Guide, Section 3.2.6.4, "Peer-to-Peer Memory Access".

Thank you very much, Pyt, for your reply.

I've tried checking the access; both calls return 0, so I'm assuming peer-to-peer memory copy is not realizable between these two GPUs? Is there a way this can be enabled, e.g. by changing PCI-E slots? As described by NVIDIA, peer-to-peer memory copy should work between two Fermi GPUs.
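In case the bus topology matters, here is a quick diagnostic sketch (my own, using the standard cudaDeviceProp fields) that prints where each card sits on the PCIe bus and whether it reports unified addressing, which peer access requires:

#include <stdio.h>
#include <cuda_runtime.h>

int main(){
	int n = 0;
	cudaGetDeviceCount(&n);
	for(int d = 0; d < n; d++){
		cudaDeviceProp p;
		cudaGetDeviceProperties(&p, d);
		// pciDomainID/pciBusID show the slot; unifiedAddressing must be
		// non-zero on both devices for peer access to be possible
		printf("GPU %d: %s, PCI domain %d bus %d, unifiedAddressing=%d\n",
			d, p.name, p.pciDomainID, p.pciBusID, p.unifiedAddressing);
	}
	return 0;
}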

Thanks again for your reply!

Hi Jimmy,

Thanks for your reply.

I know peer-to-peer memory access has to be enabled between two Tesla GPUs. However, peer-to-peer memory copy is a different mechanism and should work between two Fermi GPUs.

I'm not quite sure about this point, as I read in another paper that cudaMemcpyPeer works on a GTX 480 as well.
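As I understand it, cudaMemcpyPeer does not require peer access to be enabled; when it is not, the runtime stages the copy through host memory instead of going directly over PCIe. A sketch of the distinction (d0/d1 stand in for buffers already allocated on devices 0 and 1):

// works with or without peer access enabled; falls back to staging
// through the host when direct access is unavailable
cudaMemcpyPeer(d1, 1, d0, 0, 100*sizeof(int));

// dereferencing device 0's pointer from a kernel running on device 1,
// by contrast, is peer *access* and needs cudaDeviceEnablePeerAccess()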

Thanks

Hi Pyt,

I just tried this program on another machine with four GTX 465s, and it works fine. However, when I ran my project code, it gave me an "invalid argument" error on the peer-to-peer cudaMemcpy line. What I'm doing is similar to this, and it worked well with a normal cudaMemcpy staged through the CPU, so the pointers should be correct and the memory should be fine. I don't know what the problem could be.
Do you have any insight on this?

Thanks

Hi Jimmy,

I've checked: peer-to-peer memory copy works on two GTX 465 GPUs; it doesn't have to be two Tesla chips.

Thanks

Hi snider524,

It will be hard to tell without any code snippet; here are some general questions though:

  • Have you tried using "cudaMemcpyDeviceToDevice" instead of "cudaMemcpyDefault" for the P2P copies (though the runtime should understand where the memory resides)?
  • Did you correctly initialize the P2P transfers with cudaDeviceEnablePeerAccess?
  • Are you sure the error comes from the cudaMemcpy and not from a previous kernel launch or another asynchronous call? I have had instances where an out-of-bounds access in a kernel would fail silently and then show up as a cudaMemcpyToSymbol error… (one way to check is sketched below)
  • Is the memory allocated before any transfer is done?
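One way to flush a stale asynchronous error before the copy (a sketch reusing the HANDLE_ERROR macro and the x0/x1 buffers from the code above):

// cudaGetLastError() picks up failures from earlier kernel launches;
// cudaDeviceSynchronize() forces outstanding work to finish so its errors
// surface here rather than at the next memory copy
HANDLE_ERROR(cudaGetLastError());
HANDLE_ERROR(cudaDeviceSynchronize());
HANDLE_ERROR(cudaMemcpy(x1, x0, 100*sizeof(int), cudaMemcpyDefault));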

Pyt-