I have 64 threads per block; more threads per block does not make sense for my problem. I use __syncthreads heavily. My kernel uses about 16 registers. I can use either 2000 bytes of shared memory or the full 16 KB (which gives faster performance).
Can somebody tell me the maximum number of blocks I can invoke? Is it limited by the total number of registers? That would mean 8192/16 = 512??
The maximum number of blocks is 65535 in each grid dimension (pg 74 of the programming guide). The CUDA driver will run as many of them simultaneously as possible, and how many that is depends on register and shared memory usage. So not all blocks are guaranteed to be running at the same time; some will be queued, waiting for other blocks to finish.
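For example, a launch like the following sketch is perfectly legal even though only a handful of those blocks can be resident at once (kernel name and sizes here are purely illustrative, not taken from the question):

```cpp
#include <cuda_runtime.h>

// Dummy kernel: each thread doubles one element.
__global__ void myKernel(float *data)
{
    int idx = blockIdx.y * gridDim.x * blockDim.x
            + blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = 2.0f * data[idx];
}

int main()
{
    dim3 block(64);        // 64 threads per block, as in the question
    dim3 grid(65535, 4);   // ~262k blocks total, far more than run concurrently
    float *d_data;
    cudaMalloc(&d_data, 65535u * 4u * 64u * sizeof(float));
    myKernel<<<grid, block>>>(d_data);
    cudaDeviceSynchronize();   // blocks are scheduled in waves until all finish
    cudaFree(d_data);
    return 0;
}
```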
I understand that the number of registers has an influence. But what about shared memory? Does that mean shared memory will be swapped, or will some of it be left unused? Remember, I use 64 threads per block.
Only the occupancy depends on the number of registers, block size, and shared memory usage. None of these influence the total number of blocks you can execute, which is 65535*65535.
Shared memory is not swapped. Each block gets its own exclusive section of shared memory from the start of the block’s execution to the end. With your 2000 bytes of shared memory usage, there may be some left unused: use the occupancy calculator spreadsheet to determine this.
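To put the arithmetic from this thread together, here is a rough back-of-the-envelope sketch (assuming an older SM with 8192 registers and 16 KB of shared memory, and ignoring allocation granularity and the hardware cap on resident blocks per SM, which is exactly what the occupancy spreadsheet accounts for):

```cpp
#include <stdio.h>

int main()
{
    // Assumed per-SM resources (compute 1.x class hardware, as implied by
    // the 8192-register / 16 KB figures in this thread).
    const int regsPerSM       = 8192;
    const int smemPerSM       = 16384;  // bytes
    // Numbers from the question.
    const int threadsPerBlock = 64;
    const int regsPerThread   = 16;
    const int smemPerBlock    = 2000;   // bytes (the 2000-byte variant)

    int byRegs = regsPerSM / (threadsPerBlock * regsPerThread); // 8192 / 1024 = 8
    int bySmem = smemPerSM / smemPerBlock;                      // 16384 / 2000 = 8
    int blocksPerSM = byRegs < bySmem ? byRegs : bySmem;

    printf("blocks per SM limited by registers:     %d\n", byRegs);
    printf("blocks per SM limited by shared memory: %d\n", bySmem);
    printf("concurrent blocks per SM (estimate):    %d\n", blocksPerSM);

    // With the full-16 KB-per-block variant, bySmem drops to 1, so only one
    // block is resident per SM and most of that SM's registers sit idle.
    return 0;
}
```

Note this is only an estimate: real hardware also limits the number of resident blocks and threads per SM, and register/shared memory allocations are rounded up to a granularity, so the spreadsheet remains the authoritative answer.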