What is the access speed of tensor memory compared to shared memory?

In CC 10.0 (Blackwell), tensor cores can load their inputs from either shared memory or tensor memory. The latter is a new type of on-chip memory (see "1. Introduction", PTX ISA 8.8 documentation).

Registers can be stored to and loaded from tensor memory via PTX (subject to specific access patterns).
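For reference, here is a minimal sketch of what such an access could look like in CUDA inline PTX, based on my reading of the tcgen05 section of the PTX ISA 8.8 docs. This is untested: load_tmem_32x32b and store_tmem_32x32b are names I made up, taddr is assumed to be a tensor memory address obtained from a prior tcgen05.alloc, and the exact shape/num qualifiers should be checked against the docs:

#include <cstdint>

// Sketch only (untested): warp-wide 32-bit load/store between registers
// and tensor memory, per the tcgen05.ld / tcgen05.st forms in PTX ISA 8.8.
// taddr must be a tensor memory address from a prior tcgen05.alloc, and
// all 32 threads of the warp must execute these instructions together.
__device__ uint32_t load_tmem_32x32b(uint32_t taddr) {
    uint32_t v;
    asm volatile(
        "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n\t"
        "tcgen05.wait::ld.sync.aligned;"   // wait until the loaded value is usable
        : "=r"(v) : "r"(taddr));
    return v;
}

__device__ void store_tmem_32x32b(uint32_t taddr, uint32_t v) {
    asm volatile(
        "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n\t"
        "tcgen05.wait::st.sync.aligned;"   // wait until the store has completed
        :: "r"(taddr), "r"(v));
}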

How fast are tensor memory accesses compared to shared memory accesses?
Can non-tensor-core code that is limited by shared memory speed also be improved by using tensor memory instead, assuming the access patterns fit the restrictions of the load/store instructions?

Reading through this GTC talk, it seems reasonable to think of TMEM as a less flexible counterpart to the register file, with performance at a similar level:

"New memory on each SM; same size as the Register File: 256 KB."
"TMEM addresses can NOT be dereferenced!"

Not sure if this comment affects what you have in mind:
"Used for Tensor Core (TC) ops. SIMT operations not supported on TMEM."


Only tensor core instructions can operate on tensor memory, yes. But I was asking more about substituting tensor memory for shared memory, as in the following code, where funcA's performance is limited by loading from shared memory because the amount of work per load is small (i.e. non-tensor-core code limited by smem speed).

funcA() {
    loadGmemToSmem()                    // stage data in shared memory
    loop {
        loadSomeSmemToRegisters()       // bottleneck: repeated smem loads
        workWithRegisters()             // small amount of work per load
    }
}

funcB() {
    loadGmemToTensormem()               // stage data in tensor memory instead
    loop {
        loadSomeTensormemToRegisters()  // would this be faster than smem?
        workWithRegisters()
    }
}
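For what it's worth, here is how I picture funcB in CUDA terms. This is a sketch under two assumptions from the PTX ISA 8.8 docs: tensor memory must first be allocated with tcgen05.alloc (which writes the TMEM base address into shared memory), and as far as I can tell there is no direct global-to-tensor-memory copy, so staging would go global -> shared -> tcgen05.cp -> TMEM. workWithRegisters and the staging steps are placeholders, and load_tmem_32x32b is the wrapper sketched earlier in the thread:

__device__ void workWithRegisters(uint32_t v);   // placeholder for the actual work

__global__ void funcB_sketch(int n) {
    // tcgen05.alloc deposits the TMEM base address into a shared memory
    // slot (PTX ISA 8.8); the full protocol (alloc, relinquish_alloc_permit,
    // dealloc) is elided here and only indicated as comments.
    __shared__ uint32_t tmem_base;
    // ... tcgen05.alloc writes the TMEM base address into tmem_base ...

    // No direct gmem->tmem path (as far as I can tell): stage the data in
    // shared memory first, then move it with tcgen05.cp (elided).

    for (int i = 0; i < n; ++i) {
        // Step through TMEM columns; the column addressing is schematic here.
        uint32_t v = load_tmem_32x32b(tmem_base + i);
        workWithRegisters(v);
    }

    // ... tcgen05.dealloc to release the allocated columns ...
}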