How to overlap CUDA core and tensor core computing

Since CUDA cores and tensor cores are separate hardware components, I'm trying to overlap these two parts of the computation. Currently my implementation follows the approach shown in the pseudo code below.


pseudo code:

dim3 threadsPerBlock(512);
dim3 blocksPerGrid((N + L - 1) / L, (M + L - 1) / L); // one L x L output tile per block
__device__ void matmul_tensor_core();
__device__ void matmul_cuda_core();
__global__ void matmul_hybrid(){
    if (threadIdx.x < 256) matmul_cuda_core();   // first half of the block: CUDA cores
    else                   matmul_tensor_core(); // second half: tensor cores
}

However, this approach only overlaps 20-30% of the total compute time. Is there any way to overlap CUDA core and tensor core computing further?

Each CUDA SM comprises 4 SM sub-partitions.

Each sub-partition has hardware units for both tensor core and conventional arithmetic operations.

So you should make sure that the warps (or threads) doing tensor core work and those doing conventional arithmetic are distributed over the SM sub-partitions.

Also, either path could be the faster one, so a 50:50 distribution is possibly not the optimum.
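For example, interleaving the two work types by warp index, rather than splitting the block into two contiguous halves, is one way to spread both over all 4 sub-partitions. A minimal sketch, assuming round-robin warp-to-sub-partition assignment and reusing the function names from the question:

__global__ void matmul_hybrid(){
    int warpId = threadIdx.x / warpSize;
    // Even warps run the FP32 path, odd warps the tensor core path, so
    // every sub-partition receives both kinds of work. The 1:1 split is
    // only a starting point; tune the ratio, since one path is faster.
    if (warpId % 2 == 0) matmul_cuda_core();
    else                 matmul_tensor_core();
}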

Your computations could also be limited by memory bandwidth.

How did you measure the 20-30%? What is the speed of each kernel alone, and what is the combined speed?
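For reference, a minimal sketch of timing a single kernel with CUDA events, using the launch configuration from your post:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matmul_hybrid<<<blocksPerGrid, threadsPerBlock>>>();
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds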

I first tried this:

dim3 threadsPerBlock(256); 
dim3 blocksPerGrid((N + L - 1 ) / L, (M + L - 1 ) / L);
__device__ void matmul_tensor_core();
__device__ void matmul_cuda_core();
__global__ void matmul_hybrid(){
    matmul_cuda_core();   // CUDA core path only
}

which takes 18.0 ms.
Then I tried:

dim3 threadsPerBlock(256); 
dim3 blocksPerGrid((N + L - 1 ) / L, (M + L - 1 ) / L);
__device__ void matmul_tensor_core();
__device__ void matmul_cuda_core();
__global__ void matmul_hybrid(){
    matmul_tensor_core(); // tensor core path only
}

which takes 7.5 ms.
Finally, I combined the two and ran:

dim3 threadsPerBlock(512); 
dim3 blocksPerGrid((N + L - 1 ) / L, (M + L - 1 ) / L);
__device__ void matmul_tensor_core();
__device__ void matmul_cuda_core();
__global__ void matmul_hybrid(){
    if (threadIdx.x < 256) matmul_cuda_core();   // first half: CUDA cores
    else                   matmul_tensor_core(); // second half: tensor cores
}

This takes 23.5 ms.
So in total, (7.5 + 18.0 - 23.5) / (7.5 + 18.0) ≈ 0.08, i.e. only about 8% of the time is overlapped.
What problems prevent it from overlapping more?
My code is attached here:
hybrid.zip (6.6 KB)
You can run it with:

nvcc -arch sm_80 -o hybrid runner_hybrid.cu -Xcompiler -fopenmp

Thanks very much for your help.

Quickly glancing at the code, there are two observations:

  1. The __syncthreads() in both code paths is likely to cause false stalls. The code should only synchronize the warps in each code path, not the whole thread block (see the sketch after this list).
  2. The TC path may be doing a fair number of IMAD operations, which use the FMA pipe (CC 7.0-8.0) or the FMAHeavy pipe (CC > 8.0) and block FP32 work. On CC 7.0-8.0, FP16x2 is issued on the same dispatch pipe as the tensor cores; on CC > 8.0, FP16x2 is executed on the FMA* pipes.
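For issue 1, one way to synchronize only the warps of one code path is a named barrier per path via inline PTX. A sketch, assuming each path uses exactly 256 threads (barrier 0 is left alone because __syncthreads() uses it):

// Replace __syncthreads() inside each path with the matching helper.
// bar.sync <id>, <count> only waits for the <count> threads that arrive
// at that barrier id, so the two halves no longer stall each other.
__device__ void sync_cuda_core_half()   { asm volatile("bar.sync 1, 256;" ::: "memory"); }
__device__ void sync_tensor_core_half() { asm volatile("bar.sync 2, 256;" ::: "memory"); }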

The next step would be to run all 3 kernels in Nsight Compute (NCU) and determine the limiter. Diagnosing issue 1 is tricky: look at the barrier stall reason on the Details and Source pages. Issue 2, as well as a memory bandwidth limiter, can be observed in the Speed of Light (SOL) section at the top of the Details page.
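For example, assuming the binary is named hybrid as in the compile command above:

ncu --set full -o hybrid_report ./hybrid

and then open hybrid_report.ncu-rep in the Nsight Compute UI.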