Distinct Kernels on Concurrent Streams?

Does CUDA allow, and is it normal practice, to run distinct kernels concurrently on separate streams on a single GPU?

The Stream Management section (4.5.2.4) of the NVIDIA CUDA Programming Guide, Version 2.1, shows an example similar to:

kernel<<<100, 512, 0, stream[0]>>>(out + 0 * size, in + 0 * size);
kernel<<<100, 512, 0, stream[1]>>>(out + 1 * size, in + 1 * size);

with identical kernels. But is it reasonable to do:

kernel_1<<<100, 512, 0, stream[0]>>>(out + 0 * size, in + 0 * size);
kernel_2<<<100, 512, 0, stream[1]>>>(out + 1 * size, in + 1 * size);

where kernel_1 and kernel_2 are different?
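For context, here is a minimal sketch of what I have in mind, with the stream setup included. The kernel bodies, buffer names, and `size` are placeholders, not taken from the programming guide:

```cuda
__global__ void kernel_1(float *out, const float *in) { /* ... */ }
__global__ void kernel_2(float *out, const float *in) { /* ... */ }

void launch(float *out, float *in, int size)
{
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    // Each launch is asynchronous with respect to the host, and the two
    // launches go to different streams, so neither is ordered after the
    // other as far as the API is concerned.
    kernel_1<<<100, 512, 0, stream[0]>>>(out + 0 * size, in + 0 * size);
    kernel_2<<<100, 512, 0, stream[1]>>>(out + 1 * size, in + 1 * size);

    cudaThreadSynchronize();  // wait for all streams (CUDA 2.x runtime API)

    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(stream[i]);
}
```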

Thanks.

Sure, you can launch any kernel in any stream. But the two kernels won't run concurrently; they will just overlap with memory copies (if you have a device that supports that feature).
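To illustrate the overlap: a sketch (names are placeholders, and the host buffer must be page-locked, i.e. allocated with `cudaMallocHost`, for asynchronous copies to actually overlap):

```cuda
// While the kernel in stream[0] executes, the copy engine can bring in
// the next batch of input on stream[1] -- only on devices that report
// deviceOverlap in their cudaDeviceProp.
kernel_1<<<100, 512, 0, stream[0]>>>(d_out, d_in0);
cudaMemcpyAsync(d_in1, h_in + size, size * sizeof(float),
                cudaMemcpyHostToDevice, stream[1]);
```

Kernel execution itself is still serialized on such devices; only the copy and the compute overlap.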

Thanks. I am beginning to understand.
Is it correct to say that:

  1. At any moment, all Streaming Multiprocessors (SMs) on a device
     are running the same kernel.

  2. Once a kernel is started on a device, it runs to completion.

Thanks.

Correct.