Practical rules for coalesced memory access?

Hi,

I’ve read the CUDA programming guide’s section on coalesced memory access, but I don’t quite get it. Maybe someone can help me with this practical problem:

My kernel threads read their input from a struct consisting of four members, each of which is made up of five 32-bit long integers. That’s 20 longs == 80 bytes of input per thread. You can think of this struct as…

    typedef struct {
        unsigned long v0, v1, v2, v3, v4;
    } BUFFER_STRUCT;

    typedef struct {
        BUFFER_STRUCT top_buffer;
        BUFFER_STRUCT bot_buffer;
        BUFFER_STRUCT left_buffer;
        BUFFER_STRUCT right_buffer;
    } gpu_inbuffer;

There is exactly one gpu_inbuffer per thread and every thread will access the members of gpu_inbuffer (and BUFFER_STRUCT) in the same way - there is no divergence.
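Roughly, every thread does something along these lines (the kernel name and the actual computation are just placeholders for this post):

    __global__ void my_kernel(const gpu_inbuffer *in, unsigned long *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // each thread reads its own 80-byte gpu_inbuffer element
        unsigned long a = in[tid].top_buffer.v0;
        unsigned long b = in[tid].bot_buffer.v2;
        // ... the other members are read the same way ...

        out[tid] = a + b;   // stand-in for the real computation
    }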

The Visual Profiler tells me that I get tons of uncoalesced memory accesses - and this is the only global-memory struct to blame. Could someone be so kind as to guide me on how to modify this struct so the GPU can access its members in a more coalesced, less latency-bound way? I don’t get how to apply the rules of the programming guide to my case :-(

Make arrays of longs instead of arrays of structs. Let’s see if I can make a really dumb ASCII diagram…

Right now, if 1 marks the regions of memory you’re reading in any given cycle and 0 marks the regions you’re not reading, your access pattern is probably something like

100000000000000000001000000000…

It might actually be 3 zeroes for every one (if the entire structure can be read as one operation) - I’d have to double-check how structures are accessed. Regardless, it’s very uncoalesced right now. You could do shared-memory trickery to try to deal with some of this, but it’s silly.

If you make arrays of longs so that you have v0_top for all threads contiguous, etc., your access patterns will be perfectly coalesced (assuming you start on a correctly aligned boundary, etc).
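In other words, something like this (the names are made up - adapt them to your code):

    // one plain array of longs per struct member, with one entry per thread
    unsigned long *v0_top;   // v0_top[tid] replaces in[tid].top_buffer.v0
    unsigned long *v1_top;   // v1_top[tid] replaces in[tid].top_buffer.v1
    // ... v2_top through v4_top, plus the same five arrays for bot/left/right ...

    // a half-warp then reads 16 consecutive longs from one array, which the
    // hardware can coalesce into a single transaction instead of 16 reads
    // that are 80 bytes apart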

Well, the struct is only there for convenience (named addressing) - it’s made up of a single data type, so it should compile to the same code as an array.

The layout in memory now is something like this:

[Thread1.bottom;Thread1.top;Thread1.left;Thread1.right][Thread2.bottom;Thread2.top…]

Do I get you right that the layout should be

[Thread1.bottom;Thread2.bottom;Thread3.bottom…][Thread1.top;Thread2.top;Thread3.top]…

Sure, it will compile as an array, but you’re not reading contiguous memory in a given cycle; the structs themselves are contiguous, but the words each thread reads are relatively far apart in memory and won’t be coalesced as a result.

I’m fairly sure Thread1.bottom, Thread2.bottom, Thread3.bottom won’t be coalesced either, since each struct is larger than 16 bytes. Organizing your data as Thread1.bottom.v0, Thread2.bottom.v0, … Thread1.bottom.v1, Thread2.bottom.v1, etc. will get you perfect coalescing, though.
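A sketch of what I mean (the names and the separate-pointer approach are just one way to do it):

    // every field gets its own array; the index is the thread id
    typedef struct {
        unsigned long *bot_v0, *bot_v1, *bot_v2, *bot_v3, *bot_v4;
        // ... the same five pointers for top, left and right ...
    } gpu_inbuffer_soa;

    __global__ void my_kernel_soa(gpu_inbuffer_soa in, unsigned long *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // thread 0 reads bot_v0[0], thread 1 reads bot_v0[1], ...
        // -> consecutive words in memory, one coalesced transaction per half-warp
        unsigned long a = in.bot_v0[tid];
        unsigned long b = in.bot_v1[tid];

        out[tid] = a + b;   // stand-in for the real computation
    }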

ebfe, just my 2 cents: given the nature of the kernel you’re optimizing (WPA-PSK, if I get it right), I would not pay much attention to uncoalesced access. You may gain a few microseconds, but you won’t even notice it compared to the overall running time.

As tmurray said, [Thread1.bottom;Thread2.bottom;Thread3.bottom…][Thread1.top;Thread2.top;Thread3.top]… won’t be coalesced. Reading Thread.bottom will most likely be split into two reads: v0-v3 and v4. Re-organizing the data layout to [Thread1.bottom.v0;Thread2.bottom.v0;Thread3.bottom.v0…][Thread1.bottom.v1;Thread2.bottom.v1;Thread3.bottom.v1]… will ensure coalescing. But again, I’m pretty sure that for your particular kernel the cost of re-organizing the data on the host side will be higher than the penalty for uncoalesced access.
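For completeness, by re-organizing on the host side I mean a transpose along these lines (sketch only, and only the "bottom" part is shown; the names match the structs above):

    // copy from the current array-of-structs into per-field arrays before
    // uploading them with cudaMemcpy; top/left/right work the same way
    void reorganize(const gpu_inbuffer *aos,
                    unsigned long *bot_v0, unsigned long *bot_v1,
                    unsigned long *bot_v2, unsigned long *bot_v3,
                    unsigned long *bot_v4, int n_threads)
    {
        for (int i = 0; i < n_threads; ++i) {
            bot_v0[i] = aos[i].bot_buffer.v0;
            bot_v1[i] = aos[i].bot_buffer.v1;
            bot_v2[i] = aos[i].bot_buffer.v2;
            bot_v3[i] = aos[i].bot_buffer.v3;
            bot_v4[i] = aos[i].bot_buffer.v4;
        }
    }

That extra pass over the data costs host time and bandwidth, which is why I doubt it pays off for a compute-bound kernel like yours.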