CUDA stream concurrency problem

Hi,

I want to convert the following code, which uses the default stream, into a two-stream version for concurrency.

  1. H2D MemcpyAsync
  2. kernel
  3. D2H MemcpyAsync
  4. kernel
  5. kernel

The data dependencies among the five steps above are as follows:

1
|
2
|
+-+
| |
3 4
|
5

In short, 3 and 4 can execute concurrently, but both must be ordered after 2.

For concurrency, 3 and 4 should execute in different streams from each other.
So at least one of 3 and 4 has to run in a stream other than the one where 1 and 2 run.
But putting 3 or 4 into a different stream breaks the ordering and can change the result: 3 or 4 could then run concurrently with 1 or 2.
So I wonder whether there is a way to order 3 or 4 after 2 even though it executes in a different stream from the one where 1 and 2 execute.

Record an event (cudaEventRecord) in the stream that kernel 2 is launched into, right after the kernel 2 launch.

For items 3 and 4, create two separate streams, and in each of those streams first call cudaStreamWaitEvent on the aforementioned event. Then issue the work associated with 3 and 4 in their respective streams.
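
Here is a rough sketch of how that could look in code. The kernels (kernel2, kernel4, kernel5), the buffer names, and the function run_pipeline are hypothetical stand-ins, since the original code was not posted. The diagram does not make it entirely clear whether 5 depends on 3, on 4, or on both, so this sketch conservatively orders 5 after both. Pinned host memory is used so that the async copies can actually overlap with kernel work.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// On error this returns early without freeing anything; fine for a sketch only.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",           \
                         cudaGetErrorString(err_), __LINE__);            \
            return -1;                                                   \
        }                                                                \
    } while (0)

// Hypothetical stand-ins for the poster's kernels 2, 4 and 5.
__global__ void kernel2(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}
__global__ void kernel4(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i] + 1.0f;
}
__global__ void kernel5(float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] *= 3.0f;
}

// Runs steps 1..5 and returns the number of mismatching elements (-1 on CUDA error).
int run_pipeline(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *h_in, *h_a, *h_b, *d_a, *d_b;
    CHECK(cudaMallocHost(&h_in, bytes));  // pinned host buffers
    CHECK(cudaMallocHost(&h_a, bytes));
    CHECK(cudaMallocHost(&h_b, bytes));
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaMalloc(&d_b, bytes));
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    cudaStream_t s12, s3, s4;  // one stream for 1 and 2, one each for 3 and 4
    cudaEvent_t done2, done3;
    CHECK(cudaStreamCreate(&s12));
    CHECK(cudaStreamCreate(&s3));
    CHECK(cudaStreamCreate(&s4));
    CHECK(cudaEventCreateWithFlags(&done2, cudaEventDisableTiming));
    CHECK(cudaEventCreateWithFlags(&done3, cudaEventDisableTiming));

    int threads = 256, blocks = (n + threads - 1) / threads;

    // Steps 1 and 2 are ordered by being in the same stream.
    CHECK(cudaMemcpyAsync(d_a, h_in, bytes, cudaMemcpyHostToDevice, s12));
    kernel2<<<blocks, threads, 0, s12>>>(d_a, n);
    CHECK(cudaEventRecord(done2, s12));  // marks "kernel 2 finished"

    // Step 3: D2H copy in its own stream, held back until kernel 2 is done.
    CHECK(cudaStreamWaitEvent(s3, done2, 0));
    CHECK(cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s3));
    CHECK(cudaEventRecord(done3, s3));

    // Step 4: kernel in its own stream, also held back until kernel 2 is done.
    CHECK(cudaStreamWaitEvent(s4, done2, 0));
    kernel4<<<blocks, threads, 0, s4>>>(d_a, d_b, n);

    // Step 5: follows 4 by stream order and 3 via the second event (assumption).
    CHECK(cudaStreamWaitEvent(s4, done3, 0));
    kernel5<<<blocks, threads, 0, s4>>>(d_b, n);
    CHECK(cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s4));

    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(s4));  // s4 already waited on done2 and done3

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float a = 2.0f * h_in[i];
        if (h_a[i] != a || h_b[i] != (a + 1.0f) * 3.0f) ++mismatches;
    }

    cudaEventDestroy(done2); cudaEventDestroy(done3);
    cudaStreamDestroy(s12); cudaStreamDestroy(s3); cudaStreamDestroy(s4);
    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_in); cudaFreeHost(h_a); cudaFreeHost(h_b);
    return mismatches;
}
```

Calling run_pipeline(1 << 20) from main and checking that it returns 0 is enough to try it out. Note that cudaStreamWaitEvent does not block the host; it only makes later work in that stream wait on the device side, so all five steps are still enqueued without stalling the CPU thread.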