CUDA code runs on one card, not another

DMT · July 18, 2011, 8:35pm

Hi. I am relatively new to CUDA and have written a program that runs well on my workstation with a Quadro FX5800 card. On an identical workstation with a GTX 480 card instead of the Quadro, the program clearly does not run properly. On the Quadro, it takes 5 minutes, whereas on the GTX, it takes 0 seconds. I’ve done a bit of debugging and am guessing that the issue is caused by something in the kernel (perhaps I’m using too much memory for the GTX which has 1.5 GB as opposed to the Quadro which has 4 GB?) and maybe that’s causing the kernel to abort and thus the program finishes quickly? When I launch the kernel, I am using a pretty huge number of threads:

    grid.x = 32768;
    grid.y = 1;
    block.x = 512;
    block.y = 1;
    numThreads = 16777216;

    Ramsey_Kernel<<<grid, block>>>(mySourceGraph_device, mySolutionGraph_device, foundSolution_device, N, numThreads, answer_device);

Any thoughts or suggestions would be greatly appreciated. Thanks!
template_kernel.cu (12 KB)
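
For reference, a launch shape like the one in this post can be compared against the limits the runtime reports for a card. The sketch below is not taken from the attached file: `launchFits` is a hypothetical helper, and device 0 is assumed to be the card under test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the launch shape is within the device's reported limits.
bool launchFits(const cudaDeviceProp &prop, dim3 grid, dim3 block)
{
    return block.x <= (unsigned)prop.maxThreadsDim[0] &&
           block.y <= (unsigned)prop.maxThreadsDim[1] &&
           block.z <= (unsigned)prop.maxThreadsDim[2] &&
           block.x * block.y * block.z <= (unsigned)prop.maxThreadsPerBlock &&
           grid.x <= (unsigned)prop.maxGridSize[0] &&
           grid.y <= (unsigned)prop.maxGridSize[1] &&
           grid.z <= (unsigned)prop.maxGridSize[2];
}

int main()
{
    dim3 grid(32768, 1);
    dim3 block(512, 1);

    // Total thread count implied by the launch shape (32768 * 512).
    unsigned long long numThreads = (unsigned long long)grid.x * block.x;
    printf("total threads: %llu\n", numThreads);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query device 0\n");
        return 1;
    }
    printf("%s: launch %s\n", prop.name,
           launchFits(prop, grid, block) ? "fits the reported limits" : "exceeds the reported limits");
    return 0;
}
```

Note that 32768 × 512 = 16777216, which matches numThreads. A launch shape that fits says nothing about memory use, which would have to be checked separately.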

njuffa · July 18, 2011, 8:38pm

Check the status returned by every CUDA API call and every kernel invoked. There is a high probability that this will pin-point the problem right away.
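
A minimal sketch of what such checking can look like follows. `CUDA_CHECK`, `fillKernel` and `runFill` are hypothetical names used only for illustration; they are not from the poster's code. Because kernel launches are asynchronous, the sketch checks twice after the launch: `cudaGetLastError()` reports launch-time errors, and `cudaDeviceSynchronize()` reports errors raised while the kernel was running.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: prints the failing call with file and line, then exits.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, \
                    #call, cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Placeholder kernel standing in for the real one: writes each index into out.
__global__ void fillKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Runs the placeholder kernel with every step checked; returns the last element written.
int runFill(int n)
{
    int *d_out = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_out, n * sizeof(int)));

    fillKernel<<<(n + 511) / 512, 512>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch time
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during kernel execution

    int last = -1;
    CUDA_CHECK(cudaMemcpy(&last, d_out + n - 1, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_out));
    return last;
}

int main()
{
    printf("last element: %d\n", runFill(1 << 20));
    return 0;
}
```

Printing `cudaGetErrorString()` rather than a bare "failed" makes the output point at the specific error.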

tera · July 18, 2011, 9:10pm

The kernel might be compiled for compute capability 1.x only (which would also be indicated by the return status Norbert just mentioned).

DMT · July 19, 2011, 12:19pm

Thanks for the suggestion. I have wrapped the CUDA API calls in code like

    if (cudaMemcpy(…) != cudaSuccess) {printf("failed");} else {printf("success");}

and added the line

    if (cudaGetLastError() != cudaSuccess) {printf("failed");} else {printf("success");}

immediately after the kernel call. I only call the one kernel.

The printfs are indicating success for all the API calls and the kernel. Any other thoughts? Thanks.

DMT · July 19, 2011, 12:28pm

Thanks for the suggestion. The output in the build window from Visual Studio 2008 includes the line

1>“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\nvcc.exe” -gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin “C:\Program Files\Microsoft Visual Studio 9.0\VC\bin” …

Does that mean it’s being compiled for both the 1.x and 2.x capability cards?

DMT · July 19, 2011, 1:34pm

A couple more data points… I replaced the GTX 480 with a GTS 450 (1 GB of memory and 2.1 compute capability) and that didn’t work either (same behavior) and then replaced that with a GT 240 (512 MB of memory and compute capability 1.2) and that DOES work. So this is sounding more and more like tera’s suggestion about something with the compute capability of the GPUs and how the code is compiled… I’m not quite sure how to proceed from here.

tera · July 19, 2011, 3:19pm

Yes, that should allow the kernels to run on all current and future cards. And an error code would have been returned if the binary contained no suitable code for the GPU.

I see the #include <stdio.h> at the top of the kernel and wonder what happens if you leave that out (it doesn’t seem to be needed by this kernel?). It would more likely explain why a kernel doesn’t run on a 1.x card, but at least it might provide a reason why 1.x and 2.x cards behave differently.
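
On the question of which embedded code version a given card actually runs, one option is to ask the runtime with `cudaFuncGetAttributes`. This is a sketch only: `probeKernel` and `kernelBinaryVersion` are hypothetical names, and device 0 is assumed to be the card of interest. The `binaryVersion` and `ptxVersion` fields are encoded as 10 × major + minor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; in practice this would be the kernel under investigation.
__global__ void probeKernel(int *out)
{
    if (out) *out = 1;
}

// Returns the binary architecture version selected for probeKernel, or -1 on failure.
int kernelBinaryVersion()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, probeKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("binaryVersion: %d, ptxVersion: %d\n", attr.binaryVersion, attr.ptxVersion);
    return attr.binaryVersion;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query device 0\n");
        return 1;
    }
    printf("device compute capability: %d.%d\n", prop.major, prop.minor);
    return kernelBinaryVersion() < 0;
}
```

Comparing the reported versions on the 1.x and 2.x cards would show whether the two cards are running different compiled versions of the same kernel.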

DMT · July 19, 2011, 4:29pm

I had intended to try printf from the kernel for some debugging and forgot to remove that. I just removed that and no luck. Thanks for all the help so far. Any other suggestions would be welcome, too.

tera · July 19, 2011, 5:20pm

You could try to drop the -gencode=arch=compute_20,code="sm_20,compute_20" from the build, so that the same code path is executed on both devices. Just to see what happens.

After that, just as a test, you could drop the ,compute_10 as well. And if it still doesn’t return an error, double-check the error checking or find out which build rule actually covers the build.

njuffa · July 19, 2011, 5:44pm

I took a quick look at the code and the following struck me as potentially troublesome (regardless of compute capability), although it is probably orthogonal to the issue you are chasing:

    // Translates offset into binary and stores the bits into the offset array
    for(int x = 39; x >= 0; x--)
    {
        long long powResult = pow(2.0f, x);

I would recommend changing this to:

long long powResult = 1LL << x;

If you can provide a self-contained, runnable version of the code, I could take it for a quick spin on a Quadro FX5800 and a C2050. I assume there are differences in output data between running on the FX5800 and the GTX480? The nature of the differences may point in a particular direction as to the root cause of your observations. I assume you have checked for possible race conditions or use of warp-synchronous programming constructs?
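
A sketch of the shift-based approach follows, presumably preferable because a floating-point pow result is not guaranteed to be exact before it is truncated to an integer. The names `offsetToBits` and `bitsKernel`, and the layout in which bit x is stored at index x, are assumptions made for illustration; the original kernel's layout is not shown in the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_BITS 40

// Stores bit x of offset in bits[x], using integer operations only.
__host__ __device__ void offsetToBits(long long offset, int *bits)
{
    for (int x = NUM_BITS - 1; x >= 0; x--) {
        long long mask = 1LL << x;  // exact power of two, no floating point involved
        bits[x] = (offset & mask) ? 1 : 0;
    }
}

// Single-thread kernel that runs the same routine on the device.
__global__ void bitsKernel(long long offset, int *bits)
{
    offsetToBits(offset, bits);
}

int main()
{
    const long long offset = (1LL << 39) + 5;  // bits 39, 2 and 0 are set
    int hostBits[NUM_BITS], devBits[NUM_BITS];
    offsetToBits(offset, hostBits);

    int *d_bits = NULL;
    if (cudaMalloc((void **)&d_bits, sizeof(devBits)) != cudaSuccess) return 1;
    bitsKernel<<<1, 1>>>(offset, d_bits);
    // cudaMemcpy waits for the kernel and also reports any error it raised.
    if (cudaMemcpy(devBits, d_bits, sizeof(devBits), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d_bits);

    int mismatches = 0;
    for (int x = 0; x < NUM_BITS; x++)
        if (hostBits[x] != devBits[x]) mismatches++;
    printf("mismatching bits between host and device: %d\n", mismatches);
    return mismatches != 0;
}
```

Running the same routine on host and device and comparing the results is also a cheap way to look for the output differences asked about above.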

DMT · July 21, 2011, 7:35pm

Thank you both for your suggestions and for offering to try the code on your cards. Another friend suggested this morning that I try removing my second video card from the systems and see if that helped (I was running an NVIDIA GeForce 6200 in the systems to handle the video and running the code on the other card). It turns out that my 3 compute capability 1.x cards worked ok with the 6200 as the display card, but the 2 compute capability 2.x cards did not. After removing the 6200, all 5 of my cards run my code now! Thanks again.
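
The thread does not establish why removing the display card helped. For anyone debugging a similar multi-card setup, a sketch like the following shows which devices the CUDA runtime sees and selects one explicitly instead of relying on the default. `pickDevice` is a hypothetical helper, and choosing by compute capability is only one possible criterion.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: lists every device the runtime sees and returns the index
// of the first one whose major compute capability is at least minMajor, or -1.
int pickDevice(int minMajor)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    int chosen = -1;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("device %d: %s, compute capability %d.%d, %.0f MB\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024.0 * 1024.0));
        if (chosen < 0 && prop.major >= minMajor) chosen = dev;
    }
    return chosen;
}

int main()
{
    int dev = pickDevice(2);  // assumption: a 2.x or newer card is wanted
    if (dev < 0) {
        printf("no matching device found\n");
        return 1;
    }
    cudaError_t err = cudaSetDevice(dev);
    if (err != cudaSuccess) {
        printf("cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("using device %d\n", dev);
    return 0;
}
```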
