How to overlap CUDA core and tensor core computing

Since CUDA cores and tensor cores are separate hardware components, I'm trying to overlap these two parts of the computation. Currently my implementation follows the approach shown in the pseudo code below.


pseudo code:

dim3 threadsPerBlock(512);
dim3 blocksPerGrid((N + L - 1) / L, (M + L - 1) / L); // one L x L output tile per block
__device__ void matmul_tensor_core();
__device__ void matmul_cuda_core();
__global__ void matmul_hybrid(){
    if (threadIdx.x < 256) matmul_cuda_core();   // first half of the block: CUDA cores
    else                   matmul_tensor_core(); // second half: tensor cores
}

However, this approach only overlaps 20-30% of the total compute time. Is there any way to overlap CUDA core and tensor core computing further?

Each CUDA SM comprises 4 SM sub-partitions.

Each sub-partition has hardware units for both tensor core and conventional arithmetic operations.

So you should make sure that the warps (or threads) doing tensor core work and those doing conventional arithmetic are distributed over the SM sub-partitions.

Also, either path could be the faster one, so a 50:50 distribution is possibly not the optimum.
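For example, interleaving the two work types by warp index, rather than splitting the block into two contiguous halves, is one way to spread both over all 4 sub-partitions. A minimal sketch, assuming round-robin warp-to-sub-partition assignment and reusing the function names from the question:

__global__ void matmul_hybrid(){
    int warpId = threadIdx.x / warpSize;
    // Even warps run the FP32 path, odd warps the tensor core path, so
    // every sub-partition receives both kinds of work. The 1:1 split is
    // only a starting point; tune the ratio, since one path is faster.
    if (warpId % 2 == 0) matmul_cuda_core();
    else                 matmul_tensor_core();
}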

Your computations could also be limited by memory bandwidth.

How did you measure the 20-30%? What is the speed of each kernel alone, and what is the combined speed?
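For reference, a minimal sketch of timing a single kernel with CUDA events, using the launch configuration from your post:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matmul_hybrid<<<blocksPerGrid, threadsPerBlock>>>();
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds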

I first tried this:

dim3 threadsPerBlock(256); 
dim3 blocksPerGrid((N + L - 1 ) / L, (M + L - 1 ) / L);
__device__ void matmul_tensor_core();
__device__ void matmul_cuda_core();
__global__ void matmul_hybrid(){
    matmul_cuda_core();   // CUDA core path only
}

which takes 18.0 ms.
Then I tried:

dim3 threadsPerBlock(256); 
dim3 blocksPerGrid((N + L - 1 ) / L, (M + L - 1 ) / L);
__device__ void matmul_tensor_core();
__device__ void matmul_cuda_core();
__global__ void matmul_hybrid(){
    matmul_tensor_core(); // tensor core path only
}

which takes 7.5 ms.
Finally, I combined the two and ran:

dim3 threadsPerBlock(512); 
dim3 blocksPerGrid((N + L - 1 ) / L, (M + L - 1 ) / L);
__device__ void matmul_tensor_core();
__device__ void matmul_cuda_core();
__global__ void matmul_hybrid(){
    if (threadIdx.x < 256) matmul_cuda_core();   // first half: CUDA cores
    else                   matmul_tensor_core(); // second half: tensor cores
}

This takes 23.5 ms.
So in total, (7.5 + 18.0 - 23.5) / (7.5 + 18.0) ≈ 0.08, i.e. only about 8% of the time is overlapped.
What problems prevent it from overlapping more?
My code is attached here:
hybrid.zip (6.6 KB)
You can run it with:

nvcc -arch sm_80 -o hybrid runner_hybrid.cu -Xcompiler -fopenmp

Thanks very much for your help.

Quickly glancing at the code, there are two observations:

  1. The __syncthreads() in both code paths is likely to cause false stalls. The code should only synchronize the warps in each code path, not the whole thread block (see the sketch after this list).
  2. The TC path may be doing a fair number of IMAD operations, which use the FMA pipe (CC 7.0-8.0) or the FMAHeavy pipe (CC > 8.0) and block FP32 work. On CC 7.0-8.0, FP16x2 is issued on the same dispatch pipe as the tensor cores; on CC > 8.0, FP16x2 is executed on the FMA* pipes.
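For issue 1, one way to synchronize only the warps of one code path is a named barrier per path via inline PTX. A sketch, assuming each path uses exactly 256 threads (barrier 0 is left alone because __syncthreads() uses it):

// Replace __syncthreads() inside each path with the matching helper.
// bar.sync <id>, <count> only waits for the <count> threads that arrive
// at that barrier id, so the two halves no longer stall each other.
__device__ void sync_cuda_core_half()   { asm volatile("bar.sync 1, 256;" ::: "memory"); }
__device__ void sync_tensor_core_half() { asm volatile("bar.sync 2, 256;" ::: "memory"); }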

The next step would be to run all 3 kernels in Nsight Compute (NCU) and determine the limiter. Diagnosing issue 1 is tricky: look at the barrier stall reason on the Details and Source pages. Issue 2, as well as a memory bandwidth limiter, can be observed in the Speed of Light (SOL) section at the top of the Details page.
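For example, assuming the binary is named hybrid as in the compile command above:

ncu --set full -o hybrid_report ./hybrid

and then open hybrid_report.ncu-rep in the Nsight Compute UI.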