Conditions for Overlapping Kernel Execution and memcpy

As shown in the figure, I have confirmed that all host allocations use pinned memory, and I am using four streams to partition my workflow.

The pseudo-code is as follows:

int n_batch = (_m + M_TILE_SIZE - 1) / M_TILE_SIZE;

for (int i_batch = 0; i_batch < n_batch; ++i_batch) {
    int i_stream = i_batch % NUM_STREAMS;

    // memcpy async H2D

    // computations
    kernel1<<<>>>;
    kernel2<<<>>>;
    kernel3<<<>>>;
    kernel4<<<>>>;

    // memcpy async D2H
}

From my understanding, memcpy and kernel computation should be able to overlap.
However, based on the figure, it appears that only D2H and H2D transfers across different streams are overlapping.
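
For reference, here is a minimal sketch of the pattern I would expect to overlap. It is not my real code: the `scale` kernel, the `run_pipeline` helper, and the buffer sizes are hypothetical stand-ins, and it assumes each batch gets its own pinned host slice plus one device tile per stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_STREAMS 4

// Hypothetical stand-in for kernel1..kernel4 in the pseudo-code above.
__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Runs n_batch tiles through H2D -> kernel -> D2H, round-robin over streams.
// Returns true if every output element equals 2 * input.
bool run_pipeline(int n_batch, int tile) {
    size_t bytes = (size_t)tile * sizeof(float);
    size_t total = (size_t)tile * n_batch;
    float *h = nullptr, *d = nullptr;

    // Pinned host memory: needed for cudaMemcpyAsync to be truly asynchronous.
    if (cudaMallocHost(&h, bytes * n_batch) != cudaSuccess) return false;
    // One device tile per stream, so in-flight batches never share a buffer.
    if (cudaMalloc(&d, bytes * NUM_STREAMS) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (size_t i = 0; i < total; ++i) h[i] = 1.0f;

    cudaStream_t s[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; ++i)
        cudaStreamCreateWithFlags(&s[i], cudaStreamNonBlocking);

    for (int b = 0; b < n_batch; ++b) {
        int k = b % NUM_STREAMS;
        float* hb = h + (size_t)b * tile;  // host slice for this batch
        float* db = d + (size_t)k * tile;  // device tile owned by stream k
        // Copy in, compute, copy out: all issued on the same non-default stream.
        cudaMemcpyAsync(db, hb, bytes, cudaMemcpyHostToDevice, s[k]);
        scale<<<(tile + 255) / 256, 256, 0, s[k]>>>(db, tile);
        cudaMemcpyAsync(hb, db, bytes, cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();

    bool ok = (cudaGetLastError() == cudaSuccess);
    for (size_t i = 0; ok && i < total; ++i) ok = (h[i] == 2.0f);

    for (int i = 0; i < NUM_STREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}

int main() {
    bool ok = run_pipeline(8, 1 << 20);
    printf("pipeline %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Whether the copies and kernels in this sketch actually overlap on a given GPU still has to be checked in a profiler timeline; the code only removes the usual blockers (pageable memory, the default stream, shared buffers).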

Am I missing something? Should GPU utilization also be considered? For example, if the utilization is too high, would that prevent overlap?

Thanks!

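Yes, of course. A new kernel can only start executing if there are free resources on the GPU. Most kernels (though not all) are optimized to run for a very short time and to take all the resources they can get during that time, so a kernel launched from another stream often has nothing left to run on until the current one finishes.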
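
One way to check this is to ask the runtime how many blocks of a kernel can be resident at once and compare that with your grid size. The following is only a rough sketch: `dummy_kernel` and `max_resident_blocks` are hypothetical names standing in for your own kernel and helper, and the block size of 256 is an assumption. It also prints `asyncEngineCount`, the number of copy engines the device reports, and `concurrentKernels`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; substitute one of your own to get meaningful numbers.
__global__ void dummy_kernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Upper bound on blocks of dummy_kernel resident on the current device at once.
// Returns -1 if any runtime query fails.
int max_resident_blocks(int block_size) {
    int dev = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&dev) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;

    int per_sm = 0;
    // Resident blocks per SM for this kernel, with no dynamic shared memory.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &per_sm, dummy_kernel, block_size, 0) != cudaSuccess)
        return -1;
    return per_sm * prop.multiProcessorCount;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    printf("SM count:          %d\n", prop.multiProcessorCount);

    int block_size = 256;  // assumed launch configuration
    printf("max resident blocks at %d threads/block: %d\n",
           block_size, max_resident_blocks(block_size));
    return 0;
}
```

If a launch uses at least that many blocks, the kernel can occupy every SM by itself, and kernels from other streams will tend to queue behind it rather than run alongside it.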