Using shared memory where a variable number of threads share some data.

Consider a scenario in which an algorithm parallelizes very well across a lot of threads, and some of these threads share data. To make this efficient, the shared data should be fetched from global memory only once and then kept in shared memory. This works well when the number of threads that share data is constant, as in the matrix multiplication example.
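For the fixed-size case, the usual pattern looks roughly like the sketch below. TILE and process_tile are placeholder names, and the kernel is assumed to be launched with TILE threads per block:

```cuda
#define TILE 128

// Fixed-size case: each block cooperatively loads one tile of global data
// into shared memory once, then every thread of the block reuses it.
__global__ void process_tile(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];

    int base = blockIdx.x * TILE;
    int tid  = threadIdx.x;

    if (base + tid < n)
        tile[tid] = in[base + tid];         // one global read per element
    __syncthreads();                        // tile[] is now visible to the whole block

    if (base + tid < n)
        out[base + tid] = 2.0f * tile[tid]; // placeholder for the real per-thread work
}
```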

However, what if the number of threads that share data varies, and possibly also the size of the data they share? Since the block size is the same for every block of a kernel launch, how can such an algorithm be implemented efficiently?

Launch with the maximum number of threads and the maximum shared memory size, and then terminate the unused threads?
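A minimal sketch of that idea, assuming the per-block thread counts are known on the host and passed in as a hypothetical group_sizes array; the kernel is launched with the worst-case block size and worst-case dynamic shared memory size:

```cuda
// Launch for the worst case, idle the extra threads per block.
// group_sizes[blockIdx.x] says how many threads this block actually needs.
__global__ void variable_group_kernel(const float *in, float *out,
                                      const int *group_sizes)
{
    extern __shared__ float sdata[];        // sized for the largest group at launch

    int needed = group_sizes[blockIdx.x];   // threads this block really uses
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard the work instead of returning early, so that every thread
    // (including the unused ones) still reaches the __syncthreads() calls.
    if (threadIdx.x < needed)
        sdata[threadIdx.x] = in[i];
    __syncthreads();

    if (threadIdx.x < needed)
        out[i] = sdata[threadIdx.x];        // placeholder for the real shared-data work
}
```

It would be launched as, e.g., `variable_group_kernel<<<numBlocks, MAX_THREADS, MAX_SHARED_BYTES>>>(in, out, group_sizes)`, where MAX_THREADS and MAX_SHARED_BYTES cover the largest group.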

If you want to take this further, you could even dynamically partition the threads and shared memory and treat these partitions as independent “subblocks”. On compute capability 2.x the bar.sync instruction has an optional parameter that lets you specify the number of warps (×32 threads) participating in the barrier, which would allow truly independent operation. You’ll have to use inline assembly for this, though, as it doesn’t seem to be exposed in CUDA C yet. On 1.x devices you would have some needless inter-subblock synchronization, though.
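For reference, a sketch of what such a subblock barrier could look like with inline PTX. barrier_id and num_threads are placeholders, num_threads must be a multiple of the warp size (32), and register operands for bar.sync require compute capability 2.x:

```cuda
// Synchronize only the threads of one "subblock": a named barrier with an
// explicit thread count, expressed via PTX bar.sync.
__device__ __forceinline__ void subblock_sync(unsigned barrier_id,
                                              unsigned num_threads)
{
    asm volatile("bar.sync %0, %1;"
                 :
                 : "r"(barrier_id), "r"(num_threads)
                 : "memory");
}
```

Each subblock would call subblock_sync() with its own barrier id (0–15) and its own thread count, so different subblocks never wait on each other.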

While this approach certainly isn’t optimal, I think it’s about the best you can do with the hardware if you don’t want to go through global memory (or the L2 cache).

I guess this works for smaller variations, but not if you have a few extreme cases and a small average difference of maybe ~5 threads… Have you implemented anything like this so far?

No, I haven’t.
You have the warp granularity of 32 threads anyway. If your “variable block size” varies by less than that, I don’t see any way to optimize for it.

Another option on 2.x devices would be to launch, in parallel, different kernels that are each optimized for a different block size.
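A minimal sketch of that, assuming two hypothetical kernel variants (small_kernel, large_kernel) tuned for different block sizes and launched into separate streams so they can run concurrently on a 2.x device:

```cuda
#include <cuda_runtime.h>

__global__ void small_kernel(const float *in, float *out)   // variant tuned for 64-thread blocks
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                                          // placeholder work
}

__global__ void large_kernel(const float *in, float *out)   // variant tuned for 256-thread blocks
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                                          // placeholder work
}

void launch_both(const float *d_in_small, float *d_out_small, int small_blocks,
                 const float *d_in_large, float *d_out_large, int large_blocks)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each stream gets the kernel variant matching its workload's group size;
    // on compute capability 2.x the two launches can overlap.
    small_kernel<<<small_blocks,  64, 0, s0>>>(d_in_small, d_out_small);
    large_kernel<<<large_blocks, 256, 0, s1>>>(d_in_large, d_out_large);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```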