Since the CUDA cores and tensor cores are separate hardware units, I'm trying to overlap these two kinds of computation. Currently my implementation follows the methodology shown in the figure below.
However, this approach only overlaps 20-30% of the total compute time. Are there any methods to overlap CUDA core and tensor core computation further?
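In case the figure doesn't render, here is a stripped-down sketch of the idea — a CUDA-core kernel and a tensor-core kernel launched on separate streams so their blocks can be co-resident on the SMs. All kernel names, bodies, and sizes are placeholders, not my actual code:

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Plain CUDA-core work: FP32 FMAs on the FMA pipe.
__global__ void fp32_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaf(x[i], 1.0001f, 0.5f);
}

// Tensor-core work: one warp computes a 16x16x16 WMMA tile.
__global__ void tc_kernel(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

// Launch both on separate streams so the scheduler can overlap them.
void launch_overlapped(float *x, int n, const half *a, const half *b, float *c) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    fp32_kernel<<<(n + 255) / 256, 256, 0, s0>>>(x, n);
    tc_kernel<<<1, 32, 0, s1>>>(a, b, c);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```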
This takes 23.5 ms.
So in total, (7.5 + 18.0 − 23.5) / (7.5 + 18.0) ≈ 0.08, meaning only about 8% of the time is overlapped.
What problems does it have that prevent more overlap?
I've attached my code here: hybrid.zip (6.6 KB)
You can run it with:
Quickly glancing at the code, there are two observations:
1. The __syncthreads() in both kernels is likely to cause false stalls. The code should only synchronize the warps in each code path (see the named-barrier sketch after this list).
2. The TC path may be doing a fair number of IMAD operations, which use the FMA (CC 7.0-8.0) or FMAHeavy (CC > 8.0) pipe, blocking FP32 throughput. On CC 7.0-8.0, FP16x2 is issued on the same dispatch pipe as the tensor cores; on CC > 8.0, FP16x2 executes on the FMA* pipes.
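On point 1: one way to do path-local synchronization is PTX named barriers (`bar.sync id, count`). A minimal sketch, assuming a block of 8 warps split evenly between the two paths — the warp counts and barrier ids are illustrative, not taken from the attached code:

```cuda
// Sync only num_threads threads at named barrier `id`. Barrier 0 is what
// __syncthreads() uses, so pick ids 1..15; num_threads must be a multiple
// of the warp size, and every participating thread must reach the barrier.
__device__ __forceinline__ void named_barrier(int id, int num_threads) {
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(num_threads));
}

// Assumed layout: 8 warps per block, split evenly between the two paths.
__global__ void hybrid_kernel(/* ... */) {
    const int warp_id = threadIdx.x / 32;
    if (warp_id < 4) {
        // FP32 (CUDA-core) path: warps 0-3 = 128 threads.
        // ... produce/consume FP32 data ...
        named_barrier(1, 128);  // waits only for the FP32 warps
    } else {
        // Tensor-core path: warps 4-7 = 128 threads.
        // ... load fragments, mma_sync ...
        named_barrier(2, 128);  // waits only for the TC warps
    }
    // Note: if the two paths hand data to each other through shared memory,
    // a block-wide __syncthreads() is still needed at that exchange point.
}
```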
The next step would be to run all 3 kernels through Nsight Compute (NCU) and determine the limiter. Diagnosing issue 1 is tricky: look at the barrier stall reason in the Details and Source views. Issue 2, as well as any memory limiter, can be observed in the Speed Of Light (SOL) section at the top of the Details page.
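For example (the binary name is a placeholder):

```
# Collect a full report covering all three kernels
ncu --set full -o hybrid_report ./hybrid

# Review the Details page: check SOL for pipe/memory limiters and the
# Warp State Statistics / Source views for barrier stalls
ncu --import hybrid_report.ncu-rep --page details
```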