CUDA code runs on one card, not another

Hi. I am relatively new to CUDA and have written a program that runs well on my workstation with a Quadro FX5800 card. On an identical workstation with a GTX 480 card instead of the Quadro, the program clearly does not run properly: on the Quadro it takes 5 minutes, whereas on the GTX it takes 0 seconds. I've done a bit of debugging and am guessing that the issue is caused by something in the kernel. Perhaps I'm using too much memory for the GTX, which has 1.5 GB as opposed to the Quadro's 4 GB, and that causes the kernel to abort so that the program finishes quickly? When I launch the kernel, I am using a pretty huge number of threads:

grid.x = 32768;
grid.y = 1;
block.x = 512;
block.y = 1;
numThreads = 16777216;

Ramsey_Kernel<<<grid, block>>>(mySourceGraph_device, mySolutionGraph_device, foundSolution_device, N, numThreads, answer_device);
Any thoughts or suggestions would be greatly appreciated. Thanks!
template_kernel.cu (12 KB)

Check the status returned by every CUDA API call and every kernel invoked. There is a high probability that this will pin-point the problem right away.
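A minimal sketch of what such checking could look like. The CUDA_CHECK macro and dummyKernel are hypothetical names made up for illustration; they are not taken from the attached code. One point worth keeping in mind: cudaGetLastError() called right after a launch only reports launch failures. Kernel launches are asynchronous, so an error that occurs while the kernel is executing only surfaces at a later synchronizing call such as cudaDeviceSynchronize() (available from CUDA 4.0; older toolkits use cudaThreadSynchronize()).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: evaluates a CUDA runtime call and aborts with
// the file, line, and error string if the call did not return cudaSuccess.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s failed: %s\n",                  \
                    __FILE__, __LINE__, #call,                         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Placeholder kernel standing in for the real one: writes each thread's
// global index into the output array.
__global__ void dummyKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main()
{
    const int n = 1 << 20;
    const int threadsPerBlock = 512;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    int *d_out = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_out, n * sizeof(int)));

    dummyKernel<<<blocks, threadsPerBlock>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // launch errors (e.g. invalid configuration)
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel was running

    int last = -1;
    CUDA_CHECK(cudaMemcpy(&last, d_out + (n - 1), sizeof(int),
                          cudaMemcpyDeviceToHost));
    printf("last element = %d\n", last);

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Printing cudaGetErrorString() rather than just "failed" usually narrows things down faster, since the message distinguishes, for example, an invalid launch configuration from an out-of-memory condition.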

The kernel might be compiled for compute capability 1.x only (which would also be indicated by the return status Norbert just mentioned).

Thanks for the suggestion. I have wrapped the CUDA API calls in code like if (cudaMemcpy(…) != cudaSuccess) {printf("failed");} else {printf("success");} and added the line if (cudaGetLastError() != cudaSuccess) {printf("failed");} else {printf("success");} immediately after the kernel call. I only call the one kernel.

The printfs are indicating success for all the API calls and the kernel. Any other thoughts? Thanks.

Thanks for the suggestion. The output in the build window from Visual Studio 2008 includes the line

1>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\nvcc.exe" -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" …

Does that mean it’s being compiled for both the 1.x and 2.x capability cards?

A couple more data points… I replaced the GTX 480 with a GTS 450 (1 GB of memory and 2.1 compute capability) and that didn’t work either (same behavior) and then replaced that with a GT 240 (512 MB of memory and compute capability 1.2) and that DOES work. So this is sounding more and more like tera’s suggestion about something with the compute capability of the GPUs and how the code is compiled… I’m not quite sure how to proceed from here.

Yes, that should allow the kernels to run on all current and future cards. And an error code would have been returned if the binary contained no suitable code for the GPU.

I see the #include <stdio.h> at the top of the kernel and wonder what happens if you leave that out (it doesn't seem to be needed by this kernel?). It would more likely explain why the kernel doesn't run on a 1.x card, but at least it might provide a reason why 1.x and 2.x cards behave differently.

I had intended to try printf from the kernel for some debugging and forgot to remove that. I just removed that and no luck. Thanks for all the help so far. Any other suggestions would be welcome, too.

You could try to drop the -gencode=arch=compute_20,code="sm_20,compute_20" from the build, so that the same codepath is executed on both devices. Just to see what happens.

After that, just as a test, you could drop the ,compute_10 as well. And if it still doesn't return an error, check the error checking or find out which build rule actually covers the build.

I took a quick look at the code and the following struck me as potentially troublesome (regardless of compute capability), although it is probably orthogonal to the issue you are chasing:

// Translates offset into binary and stores the bits into the offset array
	for(int x = 39; x >= 0; x--)
	{
		long long powResult = pow(2.0f, x);

I would recommend changing this to:

long long powResult = 1LL << x;
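To illustrate why the shift is the safer choice: pow() with a float argument goes through floating-point arithmetic, and the device implementation is not guaranteed to return an exact power of two, so truncating the result to long long can end up off by one. The integer shift is exact. Below is a small sketch of the bit extraction done purely with integer operations. The names pow2, offsetToBits, and NUM_BITS are hypothetical and chosen for illustration; the body of the original loop is not shown in the quoted snippet, so the bit layout (bit x stored at index x) is an assumption.

```cuda
// Lets the same file build with a plain C++ compiler as well as with nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Matches the loop bound (x = 39 down to 0) in the quoted snippet.
#define NUM_BITS 40

// Hypothetical helper: exact 2^x for 0 <= x < 63, integer arithmetic only.
__host__ __device__ inline long long pow2(int x)
{
    return 1LL << x;
}

// Hypothetical helper: stores bit x of offset in bits[x].
// Assumes offset is non-negative and fits in NUM_BITS bits.
__host__ __device__ inline void offsetToBits(long long offset, int bits[NUM_BITS])
{
    for (int x = NUM_BITS - 1; x >= 0; x--)
    {
        bits[x] = (int)((offset >> x) & 1LL);
    }
}
```

Calling offsetToBits(5, bits) would set bits[0] and bits[2] to 1 and every other entry to 0. Because no floating-point call is involved, the result does not depend on the compute capability of the card or on the accuracy of the math library.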

If you can provide a self-contained, runnable version of the code, I could take it for a quick spin on a Quadro FX5800 and a C2050. I assume there are differences in output data between running on the FX5800 and the GTX480? The nature of the differences may point in a particular direction as to the root cause of your observations. I assume you have checked for possible race conditions or use of warp-synchronous programming constructs?

Thank you both for your suggestions and for offering to try the code on your cards. Another friend suggested this morning that I try removing my second video card from the systems and see if that helped (I was running an NVIDIA GeForce 6200 in the systems to handle the video and running the code on the other card). It turns out that my 3 compute capability 1.x cards worked ok with the 6200 as the display card, but the 2 compute capability 2.x cards did not. After removing the 6200, all 5 of my cards run my code now! Thanks again.