OpenMP + CUDA, need help!

Does anyone have experience with two-GPU programming using CUDA + OpenMP?
I'm having trouble with my code:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

static void HandleError(cudaError_t err, const char *file, int line){
	if(err!=cudaSuccess){
		printf("%s in %s at line %d
", cudaGetErrorString(err), file, line);
		exit(EXIT_FAILURE);
	}
}

#define HANDLE_ERROR(err)(HandleError(err, __FILE__, __LINE__)) 

__global__ void Test(int *x){
	x[threadIdx.x] = 1;
}

int main(){
	int *x;
	int iam;

	omp_set_num_threads(2);

	#pragma omp parallel private(x, iam)
	{
		iam = omp_get_thread_num();
		cudaSetDevice(iam);

		HANDLE_ERROR(cudaMalloc((void**)&x, 100*sizeof(int)));

		#pragma omp barrier

		if(iam==0){
			Test<<<1,100>>>(x);
		}
		#pragma omp barrier
		cudaDeviceSynchronize();
		if(iam==1){
			HANDLE_ERROR(cudaMemcpyPeer(x, 1, x, 0, 100*sizeof(int)));
		}

		#pragma omp barrier
	}
	cudaFree(x);

	return 0;
}

Above is a simple test case I wrote. It just creates two threads and grants control of one GPU to each thread, respectively. GPU 1 then writes something into its memory through a kernel launch, and GPU 2 copies it into its own memory using cudaMemcpyPeer. My platform is a Tesla C2050 + GTS 430, both Fermi GPUs. I have to use cudaMemcpyPeer to reduce data-copy overhead. When I run the program, it shows me "invalid argument", and cuda-gdb shows "invalid device ordinal". Anybody who has a clue on this will be much appreciated. Thank you!

Hi snider524,

Before doing any P2P transfer of any kind, you first have to check that both devices are capable of communicating between each other, then allow them to communicate.
You can check if communication is possible with:

int gpuid0=0;
int gpuid1=1;
int can_access_peer_0_1=0, can_access_peer_1_0=0;

cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid0, gpuid1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid1, gpuid0);

If both can_access_peer_0_1 and can_access_peer_1_0 are not equal to 0, then the GPUs can communicate with each other.

You then have to enable P2P access both ways:

cudaSetDevice(gpuid0);
cudaDeviceEnablePeerAccess(gpuid1, 0);
cudaSetDevice(gpuid1);
cudaDeviceEnablePeerAccess(gpuid0, 0);

It is then possible to do P2P memory copies.
Finally, you have to disable P2P access with

cudaDeviceDisablePeerAccess(i);

where i is the target GPU with which you no longer want to communicate.
Note that all the above code snippets assume serial execution.
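Putting the snippets together, a minimal serial sketch (d0 and d1 are hypothetical buffers already cudaMalloc'd on devices 0 and 1):

int can01 = 0, can10 = 0;
cudaDeviceCanAccessPeer(&can01, 0, 1);
cudaDeviceCanAccessPeer(&can10, 1, 0);

if(can01 && can10){
	cudaSetDevice(0);
	cudaDeviceEnablePeerAccess(1, 0);	// the flags argument must be 0
	cudaSetDevice(1);
	cudaDeviceEnablePeerAccess(0, 0);

	// copy 100 ints from device 0's buffer into device 1's buffer
	cudaMemcpyPeer(d1, 1, d0, 0, 100*sizeof(int));

	// disabling is per-device, so switch contexts again
	cudaSetDevice(0);
	cudaDeviceDisablePeerAccess(1);
	cudaSetDevice(1);
	cudaDeviceDisablePeerAccess(0);
}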

For a parallel execution, I adapted your OpenMP code; it ran successfully on two M2070s. In your case, you first want to declare two different arrays (x0, x1) instead of one (x) so that the addresses resolve correctly (see the code below). You can then use cudaMemcpy to transfer data across GPUs.

int main(){
	int iam;
	int canaccess=0;
	int np=0;
	omp_set_num_threads(2);

	int *x0,*x1;
	#pragma omp parallel default(shared)  private(iam, canaccess)
	{

		iam=omp_get_thread_num();
		np = omp_get_num_threads();

		cudaSetDevice(iam);

		if(iam == 0) cudaMalloc((void**)&x0,100*sizeof(int));
		else if(iam == 1) cudaMalloc((void**)&x1,100*sizeof(int));

		// Enable peer access for everyone
		for(int i = 0;i<np;i++)
		{
			if( i != iam)
			{
				cudaDeviceCanAccessPeer(&canaccess,iam,i); 
				if(canaccess ==0)
					printf("%d cant access %d
",iam,i);

				else
				{
					printf("%d can access %d
",iam,i);
					cudaDeviceEnablePeerAccess(i,0);
				}	
			}		
		}		

		if(iam==0){
			Test<<<1,100>>>(x0);
			cudaDeviceSynchronize();
		}
		#pragma omp barrier

		if(iam == 0)	HANDLE_ERROR(cudaMemcpy(x1,x0,100*sizeof(int),cudaMemcpyDefault));

		// cudaMemcpy is a blocking call, no need for cudaDeviceSynchronize here
		#pragma omp barrier
		// disable P2P accesses
		for(int i = 0;i<np;i++)
			if(i!=iam)
				cudaDeviceDisablePeerAccess(i);

		// Free the memory corresponding to the correct context
		#pragma omp barrier
		if(iam == 0) cudaFree(x0);
		else if(iam == 1) cudaFree(x1);
	}
	return 0;
}

Pyt-

Yes, and apparently peer-to-peer capability is only bestowed upon Tesla cards. Since one of your cards is not a Tesla, this won't work.

“When the application is run as a 64-bit process on Windows Vista/7 in TCC mode (see Section 3.6), on Windows XP, or on Linux, devices of compute capability 2.0 and higher from the Tesla series may address each other’s memory (i.e. a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.
Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample.”

See the CUDA C Programming Guide, Section 3.2.6.4, "Peer-to-Peer Memory Access".

Thank you very much, Pyt, for your reply.

I've tried checking the access; both calls return 0, so I'm assuming peer-to-peer memory copy is not realizable between these two GPUs? Is there a way this can be enabled, e.g. by changing PCI-E slots? As described by NVIDIA, peer-to-peer memory copy should work between two Fermi GPUs.
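In case the bus topology matters, here is a quick diagnostic sketch (my own, using the standard cudaDeviceProp fields) that prints where each card sits on the PCIe bus and whether it reports unified addressing, which peer access requires:

#include <stdio.h>
#include <cuda_runtime.h>

int main(){
	int n = 0;
	cudaGetDeviceCount(&n);
	for(int d = 0; d < n; d++){
		cudaDeviceProp p;
		cudaGetDeviceProperties(&p, d);
		// pciDomainID/pciBusID show the slot; unifiedAddressing must be
		// non-zero on both devices for peer access to be possible
		printf("GPU %d: %s, PCI domain %d bus %d, unifiedAddressing=%d\n",
			d, p.name, p.pciDomainID, p.pciBusID, p.unifiedAddressing);
	}
	return 0;
}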

Thanks again for your reply!

Hi Jimmy,

Thanks for your reply.

I know peer-to-peer memory access has to be enabled between two Tesla GPUs. However, peer-to-peer memory copy is a different mechanism and should work between two Fermi GPUs.

I'm not quite sure about this point, as I read in another paper that cudaMemcpyPeer works on a GTX 480 as well.
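As I understand it, cudaMemcpyPeer does not require peer access to be enabled; when it is not, the runtime stages the copy through host memory instead of going directly over PCIe. A sketch of the distinction (d0/d1 stand in for buffers already allocated on devices 0 and 1):

// works with or without peer access enabled; falls back to staging
// through the host when direct access is unavailable
cudaMemcpyPeer(d1, 1, d0, 0, 100*sizeof(int));

// dereferencing device 0's pointer from a kernel running on device 1,
// by contrast, is peer *access* and needs cudaDeviceEnablePeerAccess()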

Thanks

Hi Pyt,

I just tried this program on another machine with four GTX 465s, and it works fine. However, when I ran my project code, it gave me an "invalid argument" error on the peer-to-peer cudaMemcpy line. What I'm doing is similar to this, and it worked well with a normal cudaMemcpy staged through the CPU, so the pointers should be correct and the memory should be fine. I don't know what the problem could be.
Do you have any insight on this?

Thanks

Hi Jimmy,

I've checked: peer-to-peer memory copy works on two GTX 465 GPUs; it doesn't have to be two Tesla chips.

Thanks

Hi snider524,

It will be hard to tell without any code snippet; here are some general questions though:

  • Have you tried using "cudaMemcpyDeviceToDevice" instead of "cudaMemcpyDefault" for the P2P copies (though the runtime should understand where the memory resides)?
  • Did you correctly initialize the P2P transfers with cudaDeviceEnablePeerAccess?
  • Are you sure the error comes from the cudaMemcpy and not from a previous kernel launch or another asynchronous call? I have had instances where an out-of-bounds access in a kernel would fail silently and then show up as a cudaMemcpyToSymbol error… (one way to check is sketched below)
  • Is the memory allocated before any transfer is done?
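One way to flush a stale asynchronous error before the copy (a sketch reusing the HANDLE_ERROR macro and the x0/x1 buffers from the code above):

// cudaGetLastError() picks up failures from earlier kernel launches;
// cudaDeviceSynchronize() forces outstanding work to finish so its errors
// surface here rather than at the next memory copy
HANDLE_ERROR(cudaGetLastError());
HANDLE_ERROR(cudaDeviceSynchronize());
HANDLE_ERROR(cudaMemcpy(x1, x0, 100*sizeof(int), cudaMemcpyDefault));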

Pyt-