No, you can’t have each thread add to a value in global mem. You can’t even sum into the same shared memory location. The threads will step all over each other.
I’d have one kernel where all threads compute their independent outcomes, then run a second that performs a reduction. Look at the reduction sample in the SDK and Mark Harris’s presentation from SC07 for a fast, safe parallel sum.
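Roughly what that two-kernel structure could look like (just a stripped-down sketch, not the SDK sample itself; the names, the per-element work, and the 256-thread block size are placeholders):

```cuda
// Kernel 1: every thread computes its own outcome and writes it to global memory.
__global__ void computeKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];   // stand-in for whatever per-element work you do
}

// Kernel 2: each block sums its chunk in shared memory and writes one partial sum.
// Call it again on the partial sums until a single value remains.
__global__ void reduceKernel(const float *in, float *out, int n)
{
    __shared__ float sdata[256];          // one slot per thread, assumes 256-thread blocks
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (the SDK sample shows the optimized version).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];       // one partial sum per block
}
```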
After the kernel that computes the individual outcomes finishes, you mentioned a second kernel that performs the reduction.
Does that mean the first kernel has to copy its values back to the CPU before it terminates, and then, before the second kernel is launched, those values have to be copied to the GPU again?
Am I headed in the right direction, or is there another method?
Depending a bit on the size and nature of your problem, it might even be smart to merge the two kernels: write the output of your kernel to shared memory (possibly already adding, if you need to calculate more than one output per thread) and then perform the reduction in the same kernel.
The only problem is that the reduction requires multiple kernel calls if the data is larger than a single block can handle. If you merge the first reduction step into the first kernel, you’d be duplicating code, since you would still need a “reduction-only” kernel for the rest of the reduction. (I hate duplicated code. My maintenance-hating software-engineer side is showing.)
But depending on the original problem, as you said, it might make sense. Templates can be used to make the reduction kernel extremely flexible without duplicating code (rough sketch below). Mich, what kind of work is done on each element before the sum?
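Something like this is what I have in mind; the functor names and the launch parameters are made up for illustration, and this is only a sketch of the idea, not a drop-in implementation:

```cuda
// One templated kernel handles both the "work + first reduction" pass and the
// later "reduction-only" passes, so nothing is duplicated.
struct Identity { __device__ float operator()(float x) const { return x; } };
struct MyWork   { __device__ float operator()(float x) const { return x * x; } };

template <typename Op, int BLOCK_SIZE>
__global__ void transformReduce(const float *in, float *out, int n, Op op)
{
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i   = blockIdx.x * BLOCK_SIZE + tid;

    sdata[tid] = (i < n) ? op(in[i]) : 0.0f;   // per-element work folded in here
    __syncthreads();

    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// First pass does the per-element work, later passes just sum:
//   transformReduce<MyWork,   256><<<blocks, 256>>>(d_in,      d_partial, n,      MyWork());
//   transformReduce<Identity, 256><<<1,      256>>>(d_partial, d_sum,     blocks, Identity());
```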
Well, you can have each thread process more than one element. What I usually do in my kernels is have 256 threads, 2048 elements to process:
thread tid processes entries tid + N*256 (for N = 0..7) and adds them into shared memory. So I am left with a shared-memory array of size 256, which I then reduce to the sum. In other words, each thread processes 8 elements and does a shared_mem[tid] += …
This means each sum only uses one block (so on its own it is not really efficient), but I compute lots of these sums in parallel, with each block calculating a different sum.
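In code, that pattern looks roughly like this (a sketch of what I described above; the names and the launch layout are just an example):

```cuda
// One block of 256 threads sums 2048 elements; each thread accumulates 8 of
// them into its own shared-memory slot first. Each block computes an
// independent sum.
#define THREADS       256
#define ELEMS_PER_SUM 2048

__global__ void manySums(const float *in, float *out)
{
    __shared__ float sdata[THREADS];
    int tid = threadIdx.x;
    const float *my_in = in + blockIdx.x * ELEMS_PER_SUM;   // this block's 2048 entries

    // Thread tid adds entries tid + N*256 for N = 0..7 into its slot.
    sdata[tid] = 0.0f;
    for (int N = 0; N < ELEMS_PER_SUM / THREADS; ++N)
        sdata[tid] += my_in[tid + N * THREADS];
    __syncthreads();

    // Reduce the 256 partial sums down to one.
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];   // one independent sum per block
}

// launched as: manySums<<<numSums, THREADS>>>(d_in, d_sums);
```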
Since the original question was about combining partial results from threads within a block, the approach described in the first post is the way to go. To see how to do a scan within a block efficiently, check out Mark Harris’ slides from SC07, to which jimh referred.
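For completeness, the naive version of a block-level scan looks like the sketch below (Hillis-Steele style with double buffering, assuming a single block of up to 256 threads); the SC07 slides cover the work-efficient, bank-conflict-free version you’d actually want to use.

```cuda
// Naive inclusive scan within one block -- purely illustrative.
__global__ void blockScan(const float *in, float *out, int n)
{
    __shared__ float temp[2][256];
    int tid  = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout][tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        pout = 1 - pout;            // swap read/write buffers
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout][tid] = temp[pin][tid] + temp[pin][tid - offset];
        else
            temp[pout][tid] = temp[pin][tid];
        __syncthreads();
    }
    if (tid < n)
        out[tid] = temp[pout][tid]; // inclusive prefix sums
}
```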