What is the access speed of tensor memory compared to shared memory?

In CC 10.0 (Blackwell), tensor cores can load their inputs from either shared memory or tensor memory. The latter is a new type of on-chip memory (see "1. Introduction", PTX ISA 8.8 documentation).

Registers can be stored to and loaded from tensor memory via PTX (subject to specific access patterns).
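For reference, here is a minimal sketch of what such an access could look like in CUDA inline PTX, based on my reading of the tcgen05 section of the PTX ISA 8.8 docs. This is untested: load_tmem_32x32b and store_tmem_32x32b are names I made up, taddr is assumed to be a tensor memory address obtained from a prior tcgen05.alloc, and the exact shape/num qualifiers should be checked against the docs:

#include <cstdint>

// Sketch only (untested): warp-wide 32-bit load/store between registers
// and tensor memory, per the tcgen05.ld / tcgen05.st forms in PTX ISA 8.8.
// taddr must be a tensor memory address from a prior tcgen05.alloc, and
// all 32 threads of the warp must execute these instructions together.
__device__ uint32_t load_tmem_32x32b(uint32_t taddr) {
    uint32_t v;
    asm volatile(
        "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n\t"
        "tcgen05.wait::ld.sync.aligned;"   // wait until the loaded value is usable
        : "=r"(v) : "r"(taddr));
    return v;
}

__device__ void store_tmem_32x32b(uint32_t taddr, uint32_t v) {
    asm volatile(
        "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n\t"
        "tcgen05.wait::st.sync.aligned;"   // wait until the store has completed
        :: "r"(taddr), "r"(v));
}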

How fast are tensor memory accesses compared to shared memory accesses?
Can non-tensor-core code that is limited by shared memory speed also be improved by using tensor memory instead, assuming the access patterns fit the restrictions of the load/store instructions?

Reading through this GTC talk, it seems reasonable to think of TMEM as a less flexible counterpart to the register file, with performance at a similar level:

"New memory on each SM; same size as the Register File: 256 KB."
"TMEM addresses can NOT be dereferenced!"

Not sure if this comment affects what you have in mind:
"Used for Tensor Core (TC) ops. SIMT operations not supported on TMEM."


Only tensor core instructions can operate on tensor memory, yes. But I was asking more about substituting tensor memory for shared memory, as in the following code, where funcA's performance is limited by loading from shared memory because the amount of work per load is small (i.e. non-tensor-core code limited by smem speed).

funcA() {
    loadGmemToSmem()                    // stage data in shared memory
    loop {
        loadSomeSmemToRegisters()       // bottleneck: repeated smem loads
        workWithRegisters()             // small amount of work per load
    }
}

funcB() {
    loadGmemToTensormem()               // stage data in tensor memory instead
    loop {
        loadSomeTensormemToRegisters()  // would this be faster than smem?
        workWithRegisters()
    }
}
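For what it's worth, here is how I picture funcB in CUDA terms. This is a sketch under two assumptions from the PTX ISA 8.8 docs: tensor memory must first be allocated with tcgen05.alloc (which writes the TMEM base address into shared memory), and as far as I can tell there is no direct global-to-tensor-memory copy, so staging would go global -> shared -> tcgen05.cp -> TMEM. workWithRegisters and the staging steps are placeholders, and load_tmem_32x32b is the wrapper sketched earlier in the thread:

__device__ void workWithRegisters(uint32_t v);   // placeholder for the actual work

__global__ void funcB_sketch(int n) {
    // tcgen05.alloc deposits the TMEM base address into a shared memory
    // slot (PTX ISA 8.8); the full protocol (alloc, relinquish_alloc_permit,
    // dealloc) is elided here and only indicated as comments.
    __shared__ uint32_t tmem_base;
    // ... tcgen05.alloc writes the TMEM base address into tmem_base ...

    // No direct gmem->tmem path (as far as I can tell): stage the data in
    // shared memory first, then move it with tcgen05.cp (elided).

    for (int i = 0; i < n; ++i) {
        // Step through TMEM columns; the column addressing is schematic here.
        uint32_t v = load_tmem_32x32b(tmem_base + i);
        workWithRegisters(v);
    }

    // ... tcgen05.dealloc to release the allocated columns ...
}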