I’d like to use all of the 16KB shared memory available on the GPU in my CUDA application.
Some of the shared memory is used for the device function parameter block [nvcc_2.2.pdf, p. 37], how much can be displayed with nvcc -Xptxas=-v (the number following the +).
By not passing any parameters to the kernel the device function parameter block can be reduced to 16 bytes:
16k is certainly a nice round number. I had an application where I wanted each thread to maintain a histogram requiring 256 bytes. I couldn’t use 64 threads per block, and 32 threads per block would allocate only a single block per SM. With 63 threads per block, at least I can get more than one warp per SM, but coalescing is no longer simple, because with the usual way of implementing tid = blockIdx.x*blockDim.x + threadIdx.x the first thread of the second block is no longer aligned properly.
Those few bytes matter. It’s not just bad algorithm design.
Unfortuntely, I don’t know of a way to use all 16k.
The problem I’m working can be divided in two steps:
keyderivation();
rc4encryption();
Each thread performs both steps on independent input data.
The first step (keyderivation) is computationally intensive, but needs little shared memory.
The second step (rc4encryption) does not require much computation, but needs 256 bytes addressable memory, which is accessed bytewise in a random pattern.
I can run the kernel with 63 threads per block using 256*63 = 16128 bytes of shared memory. A blocks size of 64 is not possible as described in the initial post.
In fact I’d like to use 128 (or 127) threads per block if possible.
I’ve tried this:
keyderivation();
__syncthreads():
if ((threadIdx.x & warpSize) == 0)
rc4encryption();
__syncthreads():
if ((threadIdx.x & warpSize) == 1)
rc4encryption();
__syncthreads():
The idea is that two threads in different warps share their state and to make sure that those threads execute serially.
I can get a 42% speedup with this method if I assume the full 16KB are available for a test. Having a few bytes less make allocation of shared memory to threads so ugly that I don’t think it is worth while…
Anyway, I believe there are many problems where a few bytes less that a power of 2 can hurt.
It would be great if somebody could explain why the device function parameter block eats shared memory and if there is a way to avoid this.
Long ago someone did some exploration with test kernels accessing shared arrays with negative indices to see what was stored in those initial bytes. I think it may have included blockDim and gridDim (perhaps warpSize now as well?). No idea how to get that memory back without doing something unsupported and dangerous, like deliberately indexing a shared array of size (16 kB - 16 bytes) with a negative offset. While it would be interesting to know if that actually works, I wouldn’t base any real code on it. :)
Where warpid and laneid are stored is an interesting question.
I suspect they are currently turned into constants by ptxas. I need to experiment with CUDA 2.2…
Edit: warpSize is the WARP_SZ constant in PTX, translated into the constant 32 in cubin.
The PTX variables %laneid, %warpid and %smid are extracted from the i0 special register (which contains the whole “physical thread ID”) using bit masking. Trying to use %nsmid in PTX causes a compilation error.
RC4 basically computes a random permutation of 0 … 255. Each step swaps two bytes, which are selected based on the value of a key byte and the permutation so far. I don’t see how that can be split in two.
I think textures are read only, so they won’t help in my case as RC4 needs write access.
dump all the data you need out of your parameters and vars blockIdx, blockDim, gridDim to registers (using the ‘volatile’ trick)
syncthreads before writing to sm
subtract 16 + sumoflocalvars bytes from your sm pointer and trash all your shared memory
__global__ void func(float4* _o, int _n)
{
extern __shared__ float _sd[];
float* sd = _sd - 4 - 2/*num of args*/;
// remember input args in registers
float4 * volatile o = _o;
int volatile n = _n;
// wait for all threads to finish dumping args to registers
__syncthreads();
// trash your shared memory and be happy
memset(sd, 0, 16384);
...
}
No, I don’t restore anything. I assume that it gets restored by the scheduler, and of course I can be horribly horribly wrong. So far things work alright, but it could be pure luck for me with my kernel. If your kernel crashes or produces unexpected results for unknown reason, then one should consider such behaviour and try moving input parameters to some other memory or try restoring them.
Yes, I’ll also remember to post here if I find out something else. Btw, keep in mind that there is currently a bug in nvcc when you save pointers to registers with volatile trick, compiler looses track of pointer alignment, and following loads or stores from that pointer end up being a sequence of 4byte reads/writes, instead of 1 8 byte for float2 or 16 byte for float4 read or write.
If the 256 bytes are used as a lookup-table, or something similar, I don’t see how this helps. Can you elaborate on how this can work?
Another thing you want to watch out for is bank conflicts. If the indices are “random” then there will be some unavoidable conflicts, but it looks like one of the indices is sequential, so you may want to use a 1024 byte block to store tables of 4 threads interleaved. For a memory-intense algorithm, bank conflicts can make a big difference.
My point was that if someone cant squeeze his data into 16K, those extra few bytes (overriding block indexes, grid dims etc…)
are probably won’t help too much either. Why does an algorithm need exact 16K and not for example 32K? what would one
do then?
I suggested to break the 16K (or whatever amount of data you need) into chunks that will fit into the smem, while giving
you good occupancy on the way. The chunks could overlap, something like this:
// lets say we need to load 16K data
__shared__ int iPos;
__shared__ float sdata[ 256 * 4 ];
if ( 0 == threadIdx.x ) iPos = 0; __syncthreads();
while ( iPos < 16 * 1024 ) // Or some other boolean condition...
{
// Load current chunk of 256 elements. Note that the kernel was called with 256 threads.
sdata[ threadIdx.x ] = pGMEMData[ .... + iPos + threadIdx.x ];
__syncthreads();
// Do whatever you need with the data...
...
// Move to the next chunk - if overlapping don't increment by 256 but less and write the data to pGMEMData
// so next iteration will re-read the data calculated on the current iteration.
if ( 0 == threadIdx.x )
{
iPos += 256;
}
__syncthreads():
}
Bank conflicts are surely an issue, I usually didnt bother with them too much since I got good speedups. I guess