Conditions for Overlapping Kernel Execution and memcpy

tnwilly · May 24, 2025, 6:22am

As shown in the figure, I have confirmed that all memory allocations are pinned memory, and I am using four streams to partition my workflow.

The pseudo-code is as follows:

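int n_batch = (_m + M_TILE_SIZE - 1) / M_TILE_SIZE;  // ceiling division: number of tiles

for (int i_batch = 0; i_batch < n_batch; ++i_batch) {
    int i_stream = i_batch % NUM_STREAMS;  // round-robin over the streams

    // memcpy async H2D (on stream i_stream)

    // computations (launched on stream i_stream)
    kernel1<<<...>>>(...);
    kernel2<<<...>>>(...);
    kernel3<<<...>>>(...);
    kernel4<<<...>>>(...);

    // memcpy async D2H (on stream i_stream)
}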

From my understanding, memcpy and kernel computation should be able to overlap.
However, based on the figure, it appears that only D2H and H2D transfers across different streams are overlapping.
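
For reference, a minimal sketch of the pattern described above, under stated assumptions: the kernel `scale` is a hypothetical stand-in for kernel1 through kernel4, the tile size and the helper names `CHECK` and `run_pipeline` are invented for illustration, and each stream gets its own device tile. The points it tries to show are that the host buffer is pinned with `cudaMallocHost`, and that every copy and every launch names the same non-default stream. A launch without a stream argument goes to the default stream, which by default serializes against the other streams.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

constexpr int NUM_STREAMS = 4;
constexpr int M_TILE_SIZE = 1 << 20;  // elements per tile (assumed value)

// Hypothetical stand-in for kernel1..kernel4.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs the tiled pipeline over m floats; returns the number of wrong results.
int run_pipeline(int m) {
    const int n_batch = (m + M_TILE_SIZE - 1) / M_TILE_SIZE;

    float* h_data = nullptr;  // pinned host buffer, required for async copies
    CHECK(cudaMallocHost((void**)&h_data, m * sizeof(float)));
    for (int i = 0; i < m; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[NUM_STREAMS];
    float* d_buf[NUM_STREAMS];  // one device tile per stream
    for (int s = 0; s < NUM_STREAMS; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc((void**)&d_buf[s], M_TILE_SIZE * sizeof(float)));
    }

    for (int i_batch = 0; i_batch < n_batch; ++i_batch) {
        const int i_stream = i_batch % NUM_STREAMS;
        const int offset = i_batch * M_TILE_SIZE;
        const int n = std::min(M_TILE_SIZE, m - offset);
        const size_t bytes = n * sizeof(float);
        cudaStream_t st = streams[i_stream];

        // H2D copy, kernel, and D2H copy are all ordered within stream st.
        CHECK(cudaMemcpyAsync(d_buf[i_stream], h_data + offset, bytes,
                              cudaMemcpyHostToDevice, st));
        scale<<<(n + 255) / 256, 256, 0, st>>>(d_buf[i_stream], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_data + offset, d_buf[i_stream], bytes,
                              cudaMemcpyDeviceToHost, st));
    }
    CHECK(cudaDeviceSynchronize());

    int mismatches = 0;
    for (int i = 0; i < m; ++i) {
        if (h_data[i] != 2.0f) ++mismatches;
    }

    for (int s = 0; s < NUM_STREAMS; ++s) {
        CHECK(cudaFree(d_buf[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    CHECK(cudaFreeHost(h_data));
    return mismatches;
}

int main() {
    const int m = 8 * M_TILE_SIZE + 123;  // last tile is partial
    std::printf("mismatches: %d\n", run_pipeline(m));
    return 0;
}
```

Reusing `d_buf[i_stream]` for a later batch is safe here because the next H2D copy on that stream is ordered after the previous D2H copy on the same stream. Whether the profiler timeline then shows copy/compute overlap still depends on the device and on how long each kernel and copy runs.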

Am I missing something? Should GPU utilization also be considered—for example, if the utilization is too high, would that prevent overlap?

Thanks!

Curefab · May 24, 2025, 8:26am

Yes, of course. A new kernel can only start executing if there are free resources. “Most” kernels are optimized to run for a very short time and to take all the resources they can get during that time.
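
One way to get a rough feel for the "free resources" point is to ask the runtime how many blocks of a kernel can be resident at once, and what the device reports about copy engines and concurrent kernels. The sketch below is only illustrative: `dummy_kernel`, `max_resident_blocks`, and the block size of 256 are assumptions, not taken from the original code. It uses `cudaGetDeviceProperties` and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` from the CUDA runtime.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical kernel; substitute the real kernel to get meaningful numbers.
__global__ void dummy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Upper bound on simultaneously resident blocks of dummy_kernel on device 0.
int max_resident_blocks(int block_size) {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));

    int blocks_per_sm = 0;
    CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, dummy_kernel, block_size, 0 /* dynamic shared mem */));
    return blocks_per_sm * prop.multiProcessorCount;
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));

    // asyncEngineCount: 1 = copy can overlap a kernel, 2 = both directions.
    std::printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    // concurrentKernels: nonzero if kernels can run concurrently at all.
    std::printf("concurrentKernels: %d\n", prop.concurrentKernels);
    std::printf("max resident blocks (block size 256): %d\n",
                max_resident_blocks(256));
    return 0;
}
```

If a single launch uses at least as many blocks as the reported maximum, a kernel from another stream has no free slots until some of those blocks retire, which is consistent with the explanation above.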
