When the -maxrregcount option is used, the kernel fails to run

Hi everyone,

When I compile my kernel with the command:

nvcc.exe -Xptxas=-v -cubin kernel.cu -o test.cubin
kernel.cu
tmpxft_000005a0_00000000-3_kernel.cudafe1.gpu
tmpxft_000005a0_00000000-8_kernel.cudafe2.gpu
ptxas info : Compiling entry function ‘kernel’
ptxas info : Used 28 registers, 28+24 bytes smem, 456 bytes cmem[0], 64 bytes cmem[1]

It uses 28 registers, and the kernel runs successfully with correct results.

Then I compile the same source code with the “-maxrregcount=16” option:

nvcc.exe -Xptxas=-v -cubin -maxrregcount=16 kernel.cu -o test.cubin
kernel.cu
tmpxft_00000844_00000000-3_kernel.cudafe1.gpu
tmpxft_00000844_00000000-8_kernel.cudafe2.gpu
ptxas info : Compiling entry function ‘kernel’
ptxas info : Used 16 registers, 20+0 bytes lmem, 28+24 bytes smem, 456 bytes cmem[0], 64 bytes cmem[1]

Now only 16 registers are used, but when the kernel is run it outputs incorrect results.

Can anyone help figure out what the problem is? :blink:

I think you’ve run into a lower limit on how many registers your particular algorithm needs. If you look at the other numbers, you’d expect local memory usage to increase since you are forcing registers to spill into it, but instead they get lower, which could indicate some kind of internal error.

I also agree that reducing the register count from 28 to 16 is not easy, but nvcc simply completes the compilation without reporting any error, and the runtime reports no error either.

I just tried the nvcc in the 2.1 toolkit; same problem.

CUDA is really powerful, but imperfect… :rolleyes:

You should make a small test case and report it to NVidia (either on this forum or by submitting a bug report); this is certainly not ‘expected behaviour’. Kernels can become slow if you restrict them to 16 registers, but they should not produce incorrect results.
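If it helps, here is a rough sketch of the shape such a small test case could take. The kernel body, problem size, and names below are placeholders (not anyone’s actual code); the idea is simply a self-contained .cu file that can be compiled once with and once without -maxrregcount, and whose device output is compared against a CPU reference:

#include <cstdio>
#include <cstdlib>

// Placeholder kernel standing in for the real computation.
__global__ void testKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i] + 1.0f;   // exact in float, so CPU and GPU must agree
}

int main(void)
{
    const int n = 1 << 16;
    size_t bytes = n * sizeof(float);

    float *h_in  = (float*)malloc(bytes);
    float *h_out = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    testKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Compare against a CPU reference and report the first mismatch.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float ref = 2.0f * h_in[i] + 1.0f;
        if (h_out[i] != ref) {
            if (mismatches == 0)
                printf("first mismatch at %d: gpu=%f cpu=%f\n", i, h_out[i], ref);
            ++mismatches;
        }
    }
    printf("%d mismatches out of %d\n", mismatches, n);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}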

I too have had an issue with this. My kernel requires 52 registers. If I force this number to 32 so I can have 50% occupancy, no errors are reported and the kernel appears to run properly, and no additional shared memory is used, but the results are drastically different: some of the numerical values are off by 75%. I am using this for scientific research, and with an error of 75% I cannot use CUDA.
I would post my code, but it is too long to do so: my kernel is 730 lines, and the code requires an extensive database to run.

The results are incorrect if I don’t specify maxrregcount; if I do specify it, the results seem to be OK. Any ideas about this behaviour?

Are you using the latest version of the SDK?

I have a similar problem.
I am compiling for a Tesla C2050 using arch=sm_20, and if I set maxrregcount too low (32) or too high (64), or leave it out, some of my answers are completely wrong.
If I use arch=sm_13, the answers are all correct whether or not I specify maxrregcount.
I’m using the latest CUDA toolkit and SDK (Jan 2011).

Is this a known problem?

Assuming your code contains no architecture-specific code paths, the fact that the code compiled for sm_13 runs fine but fails when compiled for sm_20 suggests (but doesn’t demonstrate conclusively) that there might be a compiler issue. It is not clear whether the code compiled for sm_13 is run on an sm_13 platform while the code compiled for sm_20 is run on an sm_20 platform. If so, please note that sm_2x has much tighter checking for out-of-bounds accesses. Also, are all CUDA API calls properly checked for error returns? There is always a possibility that it’s not the kernel code itself that gives rise to the unexpected results.
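For the error-checking point, here is a minimal sketch of the kind of wrapper being suggested; the CUDA_CHECK name is just a local convention, not part of the CUDA API:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: wrap every runtime call, and also check the launch itself, e.g.
//   CUDA_CHECK(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
//   myKernel<<<grid, block>>>(d_in, d_out, n);
//   CUDA_CHECK(cudaGetLastError());          // catches launch/configuration errors
//   CUDA_CHECK(cudaThreadSynchronize());     // catches errors during kernel execution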

If proper API error checking is in place, and the code compiled for sm_13 passes while the code compiled for sm_20 fails when running on the same sm_2x GPU, that would be a strong hint that something may be amiss in the compiler. In this case I would encourage you to file a bug report, with a self-contained repro case attached.
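On the question of which platform the code is actually running on, a small sketch like the following can confirm the device and its compute capability at run time (the output format is arbitrary):

#include <cstdio>

int main(void)
{
    int dev = 0;
    cudaGetDevice(&dev);                      // currently selected device

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("Running on device %d: %s (compute capability %d.%d)\n",
           dev, prop.name, prop.major, prop.minor);
    return 0;
}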