Any alternative to using arrays in __global__ or __device__ GPU functions?

In some of my global or device functions, I have used arrays. In fact, I wish I could use dynamic arrays whose sizes are determined by an input argument of the function, but that seems impossible. :angry:

Back to static-size arrays: I saw in older posts that these arrays are actually placed in device memory. This seems to be the reason why my kernels are running slow. These arrays are quite convenient for my algorithms, though. Is there no better method than redesigning my algorithms? :unsure:

(I need to change the values of these arrays. So they cannot sit in the constant memory.)

Thank you,

What is wrong with cudaMalloc?

To get the most performance out of device memory arrays, make sure you follow all the rules for performing coalesced reads: the performance difference can be a factor of 20 or more.
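As a minimal sketch of what a coalesced access pattern looks like (kernel name and sizes are illustrative): consecutive threads in a warp touch consecutive addresses, so each warp's loads and stores combine into a small number of memory transactions.

```cuda
// Coalesced pattern: thread i reads and writes element i, so threads
// 0..31 of a warp access 32 consecutive floats in one transaction.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;   // contiguous, aligned per-warp access
}
```

A strided pattern such as `data[i * stride]` breaks this and can cost the factor of 20 mentioned above.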

Thank you. I meant that I had wanted to dynamically allocate an array inside a kernel function, which seems to be impossible. Currently in my kernel functions, I use something like the following to define an array:

float myArray[MAX_SIZE];

or a 2D array:

float myArray2[MAX_SIZE][MAX_SIZE];

This is inefficient because often I don’t need MAX_SIZE for myArray. The memory is thus wasted.

And these arrays end up residing in local device memory. When I want to set the value of myArray, I have to read and write device memory, which is slow.

You asked how you could specify an array size as an argument to the function. Isn't that exactly what cudaMalloc() plus a size parameter to the kernel accomplishes? Then you can make the allocation only as large as it needs to be for the current dataset, so not too much memory is wasted.
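A minimal sketch of that suggestion (kernel and function names are hypothetical): allocate exactly the needed number of elements on the host with cudaMalloc(), then pass the size to the kernel instead of relying on a compile-time MAX_SIZE.

```cuda
// Each thread handles its own element of a runtime-sized array.
__global__ void initArray(float *arr, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)                // size comes in as a kernel argument
        arr[i] = 0.0f;
}

void runWithSize(int size)       // host-side launcher (illustrative)
{
    float *d_arr;
    cudaMalloc((void **)&d_arr, size * sizeof(float));  // only 'size' elements
    int threads = 256;
    int blocks = (size + threads - 1) / threads;
    initArray<<<blocks, threads>>>(d_arr, size);
    cudaFree(d_arr);
}
```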

And how big is MAX_SIZE? Do you really need one of these arrays in PER THREAD local memory? Maybe you can fit it into shared memory and then dump it to global memory when you are done. And what are your memory access patterns? Depending on those, a texture may be a better option.
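One sketch of the shared-memory route, assuming the per-block working set fits in shared memory (names are illustrative): CUDA lets you size a shared array at launch time via the third execution-configuration parameter, work on it on-chip, and dump it to global memory once at the end.

```cuda
// Per-block scratch array sized at launch time, not compile time.
extern __shared__ float scratch[];   // size set by the <<<>>> launch config

__global__ void processBlock(float *out, int perBlock)
{
    int t = threadIdx.x;
    if (t < perBlock)
        scratch[t] = t * 2.0f;       // fast on-chip reads/writes while working
    __syncthreads();
    if (t < perBlock)                // one coalesced store at the end
        out[blockIdx.x * perBlock + t] = scratch[t];
}

// Launch with the dynamic shared-memory size as the third parameter:
// processBlock<<<blocks, threads, perBlock * sizeof(float)>>>(d_out, perBlock);
```

This also answers the original wish for arrays "sized by an input argument", as long as the size fits within the shared memory available per block.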

If anything, don’t trust the compiler with the local memory: global memory bandwidth is very precious on the device and you should take absolute care to only perform as many reads and writes as you really need and make them all coalesced (or texture reads with good data locality). Trusting the compiler to handle it with local memory is asking for trouble.