No, you can’t have each thread add to a value in global mem. You can’t even sum into the same shared memory location. The threads will step all over each other.
I’d have one kernel where all threads compute their independent outcomes, then run a second that performs a reduction. Look at the reduction sample in the SDK and Mark Harris’s presentation from SC07 for a fast, safe parallel sum.
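Roughly what that two-kernel structure could look like (just a stripped-down sketch, not the SDK sample itself; the names, the per-element work, and the 256-thread block size are placeholders):

```cuda
// Kernel 1: every thread computes its own outcome and writes it to global memory.
__global__ void computeKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];   // stand-in for whatever per-element work you do
}

// Kernel 2: each block sums its chunk in shared memory and writes one partial sum.
// Call it again on the partial sums until a single value remains.
__global__ void reduceKernel(const float *in, float *out, int n)
{
    __shared__ float sdata[256];          // one slot per thread, assumes 256-thread blocks
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (the SDK sample shows the optimized version).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];       // one partial sum per block
}
```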
After the kernel that computes the individual outcomes finishes, you mentioned a second kernel that performs the reduction.
Does that mean the first kernel has to copy its values back to the CPU before it terminates, and then, before the second kernel is launched, those values have to be copied to the GPU again?
Am I headed in the right direction, or is there another method?
Depending a bit on the size and nature of your problem, it might even be smart to merge the two kernels: write the output of your kernel to shared memory (possibly already adding, if you need to calculate more than one output per thread) and then perform the reduction in the same kernel.
The only problem is that the reduction requires multiple kernel calls if the data is larger than a single block can handle. If you merge the first reduction step into the first kernel, you’d be duplicating code, since you would still need a “reduction-only” kernel for the rest of the reduction. (I hate duplicated code. My maintenance-hating software-engineer side is showing.)
But depending on the original problem, as you said, it might make sense. Templates can be used to make the reduction kernel extremely flexible without duplicating code (rough sketch below). Mich, what kind of work is done on each element before the sum?
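Something like this is what I have in mind; the functor names and the launch parameters are made up for illustration, and this is only a sketch of the idea, not a drop-in implementation:

```cuda
// One templated kernel handles both the "work + first reduction" pass and the
// later "reduction-only" passes, so nothing is duplicated.
struct Identity { __device__ float operator()(float x) const { return x; } };
struct MyWork   { __device__ float operator()(float x) const { return x * x; } };

template <typename Op, int BLOCK_SIZE>
__global__ void transformReduce(const float *in, float *out, int n, Op op)
{
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i   = blockIdx.x * BLOCK_SIZE + tid;

    sdata[tid] = (i < n) ? op(in[i]) : 0.0f;   // per-element work folded in here
    __syncthreads();

    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// First pass does the per-element work, later passes just sum:
//   transformReduce<MyWork,   256><<<blocks, 256>>>(d_in,      d_partial, n,      MyWork());
//   transformReduce<Identity, 256><<<1,      256>>>(d_partial, d_sum,     blocks, Identity());
```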
Well, you can have each thread process more than one element. What I usually do in my kernels is have 256 threads, 2048 elements to process:
thread tid processes entries tid + N*256 (for N = 0..7) and adds them into shared memory. So I am left with a shared-memory array of size 256, which I then reduce to the sum. In other words, each thread processes 8 elements and does a shared_mem[tid] += …
This means each sum only uses one block (so on its own it is not really efficient), but I compute lots of these sums in parallel, with each block calculating a different sum.
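In code, that pattern looks roughly like this (a sketch of what I described above; the names and the launch layout are just an example):

```cuda
// One block of 256 threads sums 2048 elements; each thread accumulates 8 of
// them into its own shared-memory slot first. Each block computes an
// independent sum.
#define THREADS       256
#define ELEMS_PER_SUM 2048

__global__ void manySums(const float *in, float *out)
{
    __shared__ float sdata[THREADS];
    int tid = threadIdx.x;
    const float *my_in = in + blockIdx.x * ELEMS_PER_SUM;   // this block's 2048 entries

    // Thread tid adds entries tid + N*256 for N = 0..7 into its slot.
    sdata[tid] = 0.0f;
    for (int N = 0; N < ELEMS_PER_SUM / THREADS; ++N)
        sdata[tid] += my_in[tid + N * THREADS];
    __syncthreads();

    // Reduce the 256 partial sums down to one.
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];   // one independent sum per block
}

// launched as: manySums<<<numSums, THREADS>>>(d_in, d_sums);
```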
Since the original question was about combining partial results from threads within a block, the approach described in the first post is the way to go. To see how to do a scan within a block efficiently, check out Mark Harris’ slides from SC07, to which jimh referred.
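For completeness, the naive version of a block-level scan looks like the sketch below (Hillis-Steele style with double buffering, assuming a single block of up to 256 threads); the SC07 slides cover the work-efficient, bank-conflict-free version you’d actually want to use.

```cuda
// Naive inclusive scan within one block -- purely illustrative.
__global__ void blockScan(const float *in, float *out, int n)
{
    __shared__ float temp[2][256];
    int tid  = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout][tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        pout = 1 - pout;            // swap read/write buffers
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout][tid] = temp[pin][tid] + temp[pin][tid - offset];
        else
            temp[pout][tid] = temp[pin][tid];
        __syncthreads();
    }
    if (tid < n)
        out[tid] = temp[pout][tid]; // inclusive prefix sums
}
```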