Using shared memory where a variable number of threads share some data.

Consider a scenario in which an algorithm parallelizes very well across a lot of threads, and some of these threads share data. To make this efficient, the shared data should be fetched from global memory only once and then kept in shared memory. This works well when the number of threads that share data is constant, as in the matrix multiplication example.
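For the fixed-size case, the usual pattern looks roughly like the sketch below. TILE and process_tile are placeholder names, and the kernel is assumed to be launched with TILE threads per block:

```cuda
#define TILE 128

// Fixed-size case: each block cooperatively loads one tile of global data
// into shared memory once, then every thread of the block reuses it.
__global__ void process_tile(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];

    int base = blockIdx.x * TILE;
    int tid  = threadIdx.x;

    if (base + tid < n)
        tile[tid] = in[base + tid];         // one global read per element
    __syncthreads();                        // tile[] is now visible to the whole block

    if (base + tid < n)
        out[base + tid] = 2.0f * tile[tid]; // placeholder for the real per-thread work
}
```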

However, what if the number of threads that share data varies, and possibly also the size of the data they share? Since the block size is the same for every block of a kernel launch, how can such an algorithm be implemented efficiently?

Launch with the maximum number of threads and the maximum shared memory size, and then terminate the unused threads?
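A minimal sketch of that idea, assuming the per-block thread counts are known on the host and passed in as a hypothetical group_sizes array; the kernel is launched with the worst-case block size and worst-case dynamic shared memory size:

```cuda
// Launch for the worst case, idle the extra threads per block.
// group_sizes[blockIdx.x] says how many threads this block actually needs.
__global__ void variable_group_kernel(const float *in, float *out,
                                      const int *group_sizes)
{
    extern __shared__ float sdata[];        // sized for the largest group at launch

    int needed = group_sizes[blockIdx.x];   // threads this block really uses
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard the work instead of returning early, so that every thread
    // (including the unused ones) still reaches the __syncthreads() calls.
    if (threadIdx.x < needed)
        sdata[threadIdx.x] = in[i];
    __syncthreads();

    if (threadIdx.x < needed)
        out[i] = sdata[threadIdx.x];        // placeholder for the real shared-data work
}
```

It would be launched as, e.g., `variable_group_kernel<<<numBlocks, MAX_THREADS, MAX_SHARED_BYTES>>>(in, out, group_sizes)`, where MAX_THREADS and MAX_SHARED_BYTES cover the largest group.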

If you want to take this further, you could even dynamically partition the threads and shared memory and treat these partitions as independent “subblocks”. On compute capability 2.x the bar.sync instruction has an optional parameter that lets you specify the number of warps (×32 threads) participating in the barrier, which would allow truly independent operation. You’ll have to use inline assembly for this, though, as it doesn’t seem to be exposed in CUDA C yet. On 1.x devices you would have some needless inter-subblock synchronization, though.
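For reference, a sketch of what such a subblock barrier could look like with inline PTX. barrier_id and num_threads are placeholders, num_threads must be a multiple of the warp size (32), and register operands for bar.sync require compute capability 2.x:

```cuda
// Synchronize only the threads of one "subblock": a named barrier with an
// explicit thread count, expressed via PTX bar.sync.
__device__ __forceinline__ void subblock_sync(unsigned barrier_id,
                                              unsigned num_threads)
{
    asm volatile("bar.sync %0, %1;"
                 :
                 : "r"(barrier_id), "r"(num_threads)
                 : "memory");
}
```

Each subblock would call subblock_sync() with its own barrier id (0–15) and its own thread count, so different subblocks never wait on each other.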

While this approach certainly isn’t optimal, I think it’s about the best you can do with the hardware if you don’t want to go through global memory (or the L2 cache).

I guess this works for smaller variations, but not if you have a few extreme cases and a small average difference of maybe ~5 threads… Have you implemented anything like this so far?

No, I haven’t.
You have the warp granularity of 32 threads anyway. If your “variable block size” varies by less than that, I don’t see any way to optimize for it.

Another option on 2.x devices would be to launch, in parallel, different kernels that are each optimized for a different block size.
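A minimal sketch of that, assuming two hypothetical kernel variants (small_kernel, large_kernel) tuned for different block sizes and launched into separate streams so they can run concurrently on a 2.x device:

```cuda
#include <cuda_runtime.h>

__global__ void small_kernel(const float *in, float *out)   // variant tuned for 64-thread blocks
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                                          // placeholder work
}

__global__ void large_kernel(const float *in, float *out)   // variant tuned for 256-thread blocks
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                                          // placeholder work
}

void launch_both(const float *d_in_small, float *d_out_small, int small_blocks,
                 const float *d_in_large, float *d_out_large, int large_blocks)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each stream gets the kernel variant matching its workload's group size;
    // on compute capability 2.x the two launches can overlap.
    small_kernel<<<small_blocks,  64, 0, s0>>>(d_in_small, d_out_small);
    large_kernel<<<large_blocks, 256, 0, s1>>>(d_in_large, d_out_large);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```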