As shown in the figure, I have confirmed that all host buffers are allocated as pinned memory, and I am using four streams to partition my workflow.
The pseudo-code is as follows:
int n_batch = (_m + M_TILE_SIZE - 1) / M_TILE_SIZE;
for (int i_batch = 0; i_batch < n_batch; ++i_batch) {
    int i_stream = i_batch % NUM_STREAMS;
    // memcpy async H2D
    // computations
    kernel1<<<>>>;
    kernel2<<<>>>;
    kernel3<<<>>>;
    kernel4<<<>>>;
    // memcpy async D2H
}
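To make the pattern concrete, here is a minimal sketch of the loop I am aiming for. It is not my real code: scale_kernel is a hypothetical stand-in for kernel1 to kernel4, run_tiled_pipeline is a made-up wrapper name, and the buffers are plain float arrays. The parts that matter are that every copy and every launch receives the stream explicitly, and that each tile touches its own device region.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_STREAMS = 4;

// Hypothetical stand-in for kernel1..kernel4: doubles each element.
__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Processes m floats in tiles of tile_size, assigning tiles to streams
// round-robin. Returns the number of wrong results, or -1 on a CUDA error.
int run_tiled_pipeline(int m, int tile_size) {
    float* h_buf = nullptr;  // pinned host buffer (required for async copies)
    float* d_buf = nullptr;  // device buffer, one disjoint region per tile
    size_t total_bytes = static_cast<size_t>(m) * sizeof(float);

    if (cudaMallocHost(reinterpret_cast<void**>(&h_buf), total_bytes) != cudaSuccess)
        return -1;
    if (cudaMalloc(reinterpret_cast<void**>(&d_buf), total_bytes) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamCreate(&streams[s]);
    for (int i = 0; i < m; ++i) h_buf[i] = 1.0f;

    int n_batch = (m + tile_size - 1) / tile_size;
    for (int i_batch = 0; i_batch < n_batch; ++i_batch) {
        cudaStream_t st = streams[i_batch % NUM_STREAMS];
        int offset = i_batch * tile_size;
        int count = (m - offset < tile_size) ? (m - offset) : tile_size;
        size_t bytes = static_cast<size_t>(count) * sizeof(float);

        // H2D copy, kernel, and D2H copy are all issued into the same stream,
        // so they are ordered within a tile but free to overlap across tiles.
        cudaMemcpyAsync(d_buf + offset, h_buf + offset, bytes,
                        cudaMemcpyHostToDevice, st);
        int block = 256;
        int grid = (count + block - 1) / block;
        // The 4th launch parameter is the stream; leaving it out would place
        // the kernel in the default stream.
        scale_kernel<<<grid, block, 0, st>>>(d_buf + offset, count);
        cudaMemcpyAsync(h_buf + offset, d_buf + offset, bytes,
                        cudaMemcpyDeviceToHost, st);
    }

    int wrong = 0;
    if (cudaDeviceSynchronize() != cudaSuccess) {
        wrong = -1;
    } else {
        for (int i = 0; i < m; ++i)
            if (h_buf[i] != 2.0f) ++wrong;
    }

    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return wrong;
}
```

Calling run_tiled_pipeline(1 << 20, 1 << 18) gives four tiles, one per stream. Whether the timeline actually shows copy/kernel overlap for this sketch will depend on the device and on how long each kernel runs relative to the copies.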
From my understanding, memcpy and kernel computation should be able to overlap.
However, based on the figure, it appears that only D2H and H2D transfers across different streams are overlapping.
Am I missing something? Should GPU utilization also be considered? For example, if the utilization is too high, would that prevent overlap?
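In case the hardware capabilities are relevant, this is a small sketch of how the device can be queried for them. The function name report_overlap_support is made up for illustration; asyncEngineCount and concurrentKernels are fields of cudaDeviceProp in the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the overlap-related device properties and returns the number of
// asynchronous copy engines, or -1 if the query fails.
int report_overlap_support(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    // asyncEngineCount: 0 means copies cannot overlap kernels,
    // 1 means one copy direction at a time, 2 or more means H2D and D2H
    // copies can run concurrently with each other and with kernels.
    printf("device %d (%s): asyncEngineCount=%d, concurrentKernels=%d\n",
           device, prop.name, prop.asyncEngineCount, prop.concurrentKernels);
    return prop.asyncEngineCount;
}
```

A nonzero return value only says the device is capable of overlapping copies with kernels; it does not show whether a particular workload actually achieves it.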
Thanks!