I think you’ve run into the lower limit of how many registers your particular algorithm needs. Looking at the other numbers, you would expect local memory usage to increase, since you are forcing registers to spill into it - but instead those numbers get lower, which could indicate some kind of internal error.
I also agree that reducing the register count from 28 to 16 is not easy, but nvcc simply completes the compilation without reporting any error, and the runtime reports no error either.
You should make a small test case and report it to NVidia (either on this forum or by submitting a bug report), this is certainly not ‘expected behaviour’. Kernels can become slow if you restrict them to 16 registers, but they should not behave incorrectly.
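Something along these lines is usually enough as a self-contained repro (everything below is just a sketch: the names and the arithmetic are placeholders, so substitute enough of your real computation that the uncapped build actually needs more than 16 registers). Build it once with and once without the register cap and compare the output:

    // repro.cu - hypothetical minimal test case
    //   nvcc -arch=sm_20 repro.cu -o repro && ./repro
    //   nvcc -arch=sm_20 -maxrregcount=16 repro.cu -o repro && ./repro
    #include <cstdio>
    #include <cmath>
    #include <cuda_runtime.h>

    __global__ void testKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float a = in[i];
            float b = a * a;
            out[i] = a + b + sinf(a) + cosf(b);   // placeholder arithmetic
        }
    }

    int main()
    {
        const int n = 1 << 20;
        float *h_in = new float[n], *h_out = new float[n];
        for (int i = 0; i < n; ++i) h_in[i] = i * 1e-4f;

        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        testKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

        // CPU reference: any significant mismatch in the capped build is the bug.
        double maxErr = 0.0;
        for (int i = 0; i < n; ++i) {
            float a = h_in[i], b = a * a;
            double err = fabs((double)h_out[i] - (double)(a + b + sinf(a) + cosf(b)));
            if (err > maxErr) maxErr = err;
        }
        printf("max abs error vs CPU reference: %g\n", maxErr);

        cudaFree(d_in); cudaFree(d_out);
        delete[] h_in; delete[] h_out;
        return 0;
    }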
I too have had an issue with this… My kernel requires 52 registers. If I force this number down to 32 so I can get 50% occupancy, no errors are reported, it appears to run properly, and no additional shared memory is used, but the results are drastically different: some of the numerical values are off by 75%. I am using this for scientific research, and with a 75% error I cannot use CUDA.
I would post my code, but it is too long to do so… my kernel is 730 lines, and the code requires an extensive database to run.
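For reference, a cap like this can be applied either globally with nvcc’s -maxrregcount=32 flag or per kernel with the __launch_bounds__ qualifier; a sketch of the per-kernel form (with a placeholder kernel standing in for mine) looks like this:

    // Hypothetical kernel; __launch_bounds__(256, 4) asks the compiler to fit
    // 4 blocks of 256 threads per SM, which on sm_20 (32768 registers per SM)
    // caps register usage at 32768 / (256 * 4) = 32 registers per thread.
    __global__ void __launch_bounds__(256, 4)
    myKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;   // placeholder body
    }

Compiling with --ptxas-options=-v shows how many registers ptxas actually ends up using, so you can confirm the cap took effect.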
I have a similar problem.
I’m compiling for a Tesla C2050 with arch=sm_20, and if I set maxrregcount too low (32) or too high (64), or leave it out entirely, some of my answers are completely wrong.
If I compile with arch=sm_13, the answers are all correct whether or not I specify maxrregcount.
I’m using the latest CUDA toolkit and SDK (Jan 2011).
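The build lines I’m comparing boil down to the following (file names are placeholders):

    nvcc -arch=sm_20 -maxrregcount=32 mycode.cu -o app    # wrong answers
    nvcc -arch=sm_20 -maxrregcount=64 mycode.cu -o app    # wrong answers
    nvcc -arch=sm_20                  mycode.cu -o app    # wrong answers
    nvcc -arch=sm_13                  mycode.cu -o app    # correct, with or without -maxrregcount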
Assuming your code contains no architecture specific code paths, the fact that the code compiled for sm_13 runs fine but fails when compiled for sm_20 suggests (but doesn’t demonstrate conclusively) that there might be a compiler issue. It is not clear whether the code compiled for sm_13 is run on an sm_13 platform while the code compiled for sm_20 is run on an sm_20 platform. If so, please note that sm_2x has much tighter checking for out of bounds accesses. Also, are all CUDA API calls properly checked for error returns? There is always a possibility that it’s not the kernel code itself that gives rise to the unexpected results.
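A minimal checking sketch along the following lines (an illustration only, with placeholder kernel names in the usage comment) wraps every runtime API call and also picks up asynchronous errors after kernel launches:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wrap each runtime API call in CHECK_CUDA, and call checkLastError()
    // right after a kernel launch to catch asynchronous launch failures.
    #define CHECK_CUDA(call)                                              \
        do {                                                              \
            cudaError_t err_ = (call);                                    \
            if (err_ != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                        cudaGetErrorString(err_), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    static void checkLastError(const char *msg)
    {
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error after %s: %s\n",
                    msg, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    // Typical usage (kernel name and arguments are placeholders):
    //   CHECK_CUDA(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
    //   myKernel<<<grid, block>>>(d_in, d_out, n);
    //   checkLastError("myKernel launch");
    //   CHECK_CUDA(cudaThreadSynchronize());  // cudaDeviceSynchronize on newer toolkits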
If proper API error checking is in place, and if the code compiled for sm_13 passes, but compiled for sm_20 fails, when running on the same sm_2x GPU, this would be a strong hint that something may be amiss in the compiler. In this case I would encourage you to file a bug report, with a self-contained repro case attached.