Practical rules for coalesced memory access?

Hi,

I’ve read the CUDA programming guide’s section on coalesced memory access, but I don’t quite get it. Maybe someone can help me with this practical problem:

My kernel threads read their input from a struct consisting of four members, each of which is made up of five 32-bit long integers. That’s 20 longs == 80 bytes of input per thread. You can think of this struct as…

    typedef struct {
        unsigned long v0, v1, v2, v3, v4;
    } BUFFER_STRUCT;

    typedef struct {
        BUFFER_STRUCT top_buffer;
        BUFFER_STRUCT bot_buffer;
        BUFFER_STRUCT left_buffer;
        BUFFER_STRUCT right_buffer;
    } gpu_inbuffer;

There is exactly one gpu_inbuffer per thread and every thread will access the members of gpu_inbuffer (and BUFFER_STRUCT) in the same way - there is no divergence.
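Roughly, every thread does something along these lines (the kernel name and the actual computation are just placeholders for this post):

    __global__ void my_kernel(const gpu_inbuffer *in, unsigned long *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // each thread reads its own 80-byte gpu_inbuffer element
        unsigned long a = in[tid].top_buffer.v0;
        unsigned long b = in[tid].bot_buffer.v2;
        // ... the other members are read the same way ...

        out[tid] = a + b;   // stand-in for the real computation
    }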

The Visual Profiler tells me that I get tons of uncoalesced memory accesses - and this is the only global-memory struct to blame. Could someone be so kind as to guide me on how to modify this struct so the GPU can access its members in a more coalesced, less latency-bound way? I don’t get how to apply the rules of the programming guide to my case :-(

Make arrays of longs instead of arrays of structs. Let’s see if I can make a really dumb ASCII diagram…

Right now, if 1 marks the regions of memory you’re reading in any given cycle and 0 marks the regions you’re not reading, your access pattern is probably something like

100000000000000000001000000000…

It might actually be 3 zeroes for every one (if the entire structure can be read as one operation) - I’d have to double-check how structures are accessed. Regardless, it’s very uncoalesced right now. You could do shared-memory trickery to try to deal with some of this, but it’s silly.

If you make arrays of longs so that you have v0_top for all threads contiguous, etc., your access patterns will be perfectly coalesced (assuming you start on a correctly aligned boundary, etc).
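In other words, something like this (the names are made up - adapt them to your code):

    // one plain array of longs per struct member, with one entry per thread
    unsigned long *v0_top;   // v0_top[tid] replaces in[tid].top_buffer.v0
    unsigned long *v1_top;   // v1_top[tid] replaces in[tid].top_buffer.v1
    // ... v2_top through v4_top, plus the same five arrays for bot/left/right ...

    // a half-warp then reads 16 consecutive longs from one array, which the
    // hardware can coalesce into a single transaction instead of 16 reads
    // that are 80 bytes apart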

Well, the struct is only there for convenience (named addressing) - it’s made up of a single data type, so it should compile to the same code as an array.

The layout in memory now is something like this:

[Thread1.bottom;Thread1.top;Thread1.left;Thread1.right][Thread2.bottom;Thread2.top…]

Do I get you right that the layout should be

[Thread1.bottom;Thread2.bottom;Thread3.bottom…][Thread1.top;Thread2.top;Thread3.top]…

Sure, it will compile as an array, but you’re not reading contiguous memory in a given cycle; the structs themselves are contiguous, but the words each thread reads are relatively far apart in memory and won’t be coalesced as a result.

I’m fairly sure Thread1.bottom, Thread2.bottom, Thread3.bottom won’t be coalesced either, since each struct is larger than 16 bytes. Organizing your data as Thread1.bottom.v0, Thread2.bottom.v0, … Thread1.bottom.v1, Thread2.bottom.v1, etc. will get you perfect coalescing, though.
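A sketch of what I mean (the names and the separate-pointer approach are just one way to do it):

    // every field gets its own array; the index is the thread id
    typedef struct {
        unsigned long *bot_v0, *bot_v1, *bot_v2, *bot_v3, *bot_v4;
        // ... the same five pointers for top, left and right ...
    } gpu_inbuffer_soa;

    __global__ void my_kernel_soa(gpu_inbuffer_soa in, unsigned long *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // thread 0 reads bot_v0[0], thread 1 reads bot_v0[1], ...
        // -> consecutive words in memory, one coalesced transaction per half-warp
        unsigned long a = in.bot_v0[tid];
        unsigned long b = in.bot_v1[tid];

        out[tid] = a + b;   // stand-in for the real computation
    }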

ebfe, just my 2 cents: given the nature of the kernel you’re optimizing (WPA-PSK, if I get it right), I would not pay much attention to uncoalesced access. You may gain a few microseconds, but you won’t even notice it compared to the overall running time.

As tmurray said, [Thread1.bottom;Thread2.bottom;Thread3.bottom…][Thread1.top;Thread2.top;Thread3.top]… won’t be coalesced. Reading Thread.bottom will most likely be split into two reads: v0-v3 and v4. Re-organizing the data layout to [Thread1.bottom.v0;Thread2.bottom.v0;Thread3.bottom.v0…][Thread1.bottom.v1;Thread2.bottom.v1;Thread3.bottom.v1]… will ensure coalescing. But again, I’m pretty sure that for your particular kernel the cost of re-organizing the data on the host side will be higher than the penalty for uncoalesced access.
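For completeness, by re-organizing on the host side I mean a transpose along these lines (sketch only, and only the "bottom" part is shown; the names match the structs above):

    // copy from the current array-of-structs into per-field arrays before
    // uploading them with cudaMemcpy; top/left/right work the same way
    void reorganize(const gpu_inbuffer *aos,
                    unsigned long *bot_v0, unsigned long *bot_v1,
                    unsigned long *bot_v2, unsigned long *bot_v3,
                    unsigned long *bot_v4, int n_threads)
    {
        for (int i = 0; i < n_threads; ++i) {
            bot_v0[i] = aos[i].bot_buffer.v0;
            bot_v1[i] = aos[i].bot_buffer.v1;
            bot_v2[i] = aos[i].bot_buffer.v2;
            bot_v3[i] = aos[i].bot_buffer.v3;
            bot_v4[i] = aos[i].bot_buffer.v4;
        }
    }

That extra pass over the data costs host time and bandwidth, which is why I doubt it pays off for a compute-bound kernel like yours.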