I think you’ve run into the lower limit of how many registers your particular algorithm needs. Looking at the other numbers, you would expect local memory usage to increase, since you are forcing registers to spill into it - but instead those numbers get lower, which could indicate some kind of internal error.
I also agree that reducing the register count from 28 to 16 is not easy, but nvcc simply completes the compilation without reporting any error, and the runtime reports no error either.
You should make a small test case and report it to NVidia (either on this forum or by submitting a bug report), this is certainly not ‘expected behaviour’. Kernels can become slow if you restrict them to 16 registers, but they should not behave incorrectly.
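Something along these lines is usually enough as a self-contained repro (everything below is just a sketch: the names and the arithmetic are placeholders, so substitute enough of your real computation that the uncapped build actually needs more than 16 registers). Build it once with and once without the register cap and compare the output:

    // repro.cu - hypothetical minimal test case
    //   nvcc -arch=sm_20 repro.cu -o repro && ./repro
    //   nvcc -arch=sm_20 -maxrregcount=16 repro.cu -o repro && ./repro
    #include <cstdio>
    #include <cmath>
    #include <cuda_runtime.h>

    __global__ void testKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float a = in[i];
            float b = a * a;
            out[i] = a + b + sinf(a) + cosf(b);   // placeholder arithmetic
        }
    }

    int main()
    {
        const int n = 1 << 20;
        float *h_in = new float[n], *h_out = new float[n];
        for (int i = 0; i < n; ++i) h_in[i] = i * 1e-4f;

        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        testKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

        // CPU reference: any significant mismatch in the capped build is the bug.
        double maxErr = 0.0;
        for (int i = 0; i < n; ++i) {
            float a = h_in[i], b = a * a;
            double err = fabs((double)h_out[i] - (double)(a + b + sinf(a) + cosf(b)));
            if (err > maxErr) maxErr = err;
        }
        printf("max abs error vs CPU reference: %g\n", maxErr);

        cudaFree(d_in); cudaFree(d_out);
        delete[] h_in; delete[] h_out;
        return 0;
    }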
I too have had an issue with this… My kernel requires 52 registers. If I force this number down to 32 so I can get 50% occupancy, no errors are reported, it appears to run properly, and no additional shared memory is used, but the results are drastically different: some of the numerical values are off by 75%. I am using this for scientific research, and with a 75% error I cannot use CUDA.
I would post my code, but it is too long to do so… my kernel is 730 lines, and the code requires an extensive database to run.
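For reference, a cap like this can be applied either globally with nvcc’s -maxrregcount=32 flag or per kernel with the __launch_bounds__ qualifier; a sketch of the per-kernel form (with a placeholder kernel standing in for mine) looks like this:

    // Hypothetical kernel; __launch_bounds__(256, 4) asks the compiler to fit
    // 4 blocks of 256 threads per SM, which on sm_20 (32768 registers per SM)
    // caps register usage at 32768 / (256 * 4) = 32 registers per thread.
    __global__ void __launch_bounds__(256, 4)
    myKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;   // placeholder body
    }

Compiling with --ptxas-options=-v shows how many registers ptxas actually ends up using, so you can confirm the cap took effect.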
I have a similar problem.
I’m compiling for a Tesla C2050 with arch=sm_20, and if I set maxrregcount too low (32) or too high (64), or leave it out entirely, some of my answers are completely wrong.
If I compile with arch=sm_13, the answers are all correct whether or not I specify maxrregcount.
I’m using the latest CUDA toolkit and SDK (Jan 2011).
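The build lines I’m comparing boil down to the following (file names are placeholders):

    nvcc -arch=sm_20 -maxrregcount=32 mycode.cu -o app    # wrong answers
    nvcc -arch=sm_20 -maxrregcount=64 mycode.cu -o app    # wrong answers
    nvcc -arch=sm_20                  mycode.cu -o app    # wrong answers
    nvcc -arch=sm_13                  mycode.cu -o app    # correct, with or without -maxrregcount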
Assuming your code contains no architecture specific code paths, the fact that the code compiled for sm_13 runs fine but fails when compiled for sm_20 suggests (but doesn’t demonstrate conclusively) that there might be a compiler issue. It is not clear whether the code compiled for sm_13 is run on an sm_13 platform while the code compiled for sm_20 is run on an sm_20 platform. If so, please note that sm_2x has much tighter checking for out of bounds accesses. Also, are all CUDA API calls properly checked for error returns? There is always a possibility that it’s not the kernel code itself that gives rise to the unexpected results.
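A minimal checking sketch along the following lines (an illustration only, with placeholder kernel names in the usage comment) wraps every runtime API call and also picks up asynchronous errors after kernel launches:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wrap each runtime API call in CHECK_CUDA, and call checkLastError()
    // right after a kernel launch to catch asynchronous launch failures.
    #define CHECK_CUDA(call)                                              \
        do {                                                              \
            cudaError_t err_ = (call);                                    \
            if (err_ != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                        cudaGetErrorString(err_), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    static void checkLastError(const char *msg)
    {
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error after %s: %s\n",
                    msg, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    // Typical usage (kernel name and arguments are placeholders):
    //   CHECK_CUDA(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
    //   myKernel<<<grid, block>>>(d_in, d_out, n);
    //   checkLastError("myKernel launch");
    //   CHECK_CUDA(cudaThreadSynchronize());  // cudaDeviceSynchronize on newer toolkits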
If proper API error checking is in place, and if the code compiled for sm_13 passes, but compiled for sm_20 fails, when running on the same sm_2x GPU, this would be a strong hint that something may be amiss in the compiler. In this case I would encourage you to file a bug report, with a self-contained repro case attached.