How to utilize CUDA, Tensor, and RT cores in one program

We know that all three types of cores can be used for data acceleration. Is it possible to use all three cores in one program, e.g., do pipeline stage A on Tensor core, stage B on RT core, and stage C on CUDA core?

Thanks.

Yes, it's possible. Many games do this. CUDA cores are used for general shader programs, RT cores are used for ray-tracing acceleration, and Tensor cores are used e.g. for DLSS (along with the RT core output). RT cores have nothing to do with CUDA, so this isn't the correct forum to ask about them. Regarding the other two, here is a recent similar question.

Can CUDA event callbacks use RT cores on the same GPU (I'm guessing RT somehow shares some resources with CUDA on the same GPU)? The documentation says CUDA calls are not allowed inside a callback. But what about RT? Especially if it allows interop between RT buffers and CUDA buffers?

Why not just use multithreading and send the RT commands from another thread?
If you need synchronization, you can handle it in the callbacks.

There isn’t a way to directly access RT cores. You use RT cores through an abstraction such as OptiX or DXR. Or you may wish to explain what you mean by “use RT cores”.

The general method NVIDIA provides for using RT cores is OptiX. OptiX may use CUDA under the hood, so as a general matter, calling OptiX from a CUDA callback could result in the same violation you already mentioned. So again it might be necessary to identify exactly what you plan to do with OptiX in a CUDA callback. And I would restate my suggestion: ask OptiX questions on the OptiX forum.

For some unofficial, but public, insights into the ray-tracing unit, see for example Fig. 17 of US11928772B2, “Method for forward progress and programmable timeouts of tree traversal mechanisms in hardware” (Google Patents); you will find the TTU (Tree Traversal Unit) there.

(It does not have to be implemented in actual hardware that way.)

But quoting the history from the patent description:

Then, in 2010, NVIDIA took advantage of the high degree of parallelism of NVIDIA GPUs and other highly parallel architectures to develop the OptiX™ ray tracing engine. See Parker et al., “OptiX: A General Purpose Ray Tracing Engine” (ACM Transactions on Graphics, Vol. 29, No. 4, Article 66, July 2010). In addition to improvements in API’s (application programming interfaces), one of the advances provided by OptiX™ was improving the acceleration data structures used for finding an intersection between a ray and the scene geometry. Such acceleration data structures are usually spatial or object hierarchies used by the ray tracing traversal algorithm to efficiently search for primitives that potentially intersect a given ray. OptiX™ provides a number of different acceleration structure types that the application can choose from. Each acceleration structure in the node graph can be a different type, allowing combinations of high-quality static structures with dynamically updated ones.

The OptiX™ programmable ray tracing pipeline provided significant advances, but was still generally unable by itself to provide real time interactive response to user input on relatively inexpensive computing platforms for complex 3D scenes. Since then, NVIDIA has been developing hardware acceleration capabilities for ray tracing. See e.g., U.S. Pat. Nos. 9,582,607; 9,569,559; US20160070820; and US20160070767.

Given the great potential of a truly interactive real time ray tracing graphics processing system for rendering high quality images of arbitrary complexity in response for example to user input, further work is possible and desirable.

Some Ray Processes can Take Too Long or May Need to be Interrupted

Ray tracing generally involves executing a ray intersection query against a pre-built Acceleration Structure (AS), sometimes referred to more specifically as a Bounding Volume Hierarchy (BVH). Depending on the build of the AS, the number of primitives in the scene and the orientation of a ray, the traversal can take anywhere from a few to hundreds to even thousands of cycles by specialized traversal hardware. Additionally, if a cycle or loop is inadvertently (or even intentionally in the case of a bad actor) encoded into the BVH, it is possible for a traversal to become infinite. For example, it is possible for a BVH to define a traversal that results in an “endless loop.”

To prevent any long-running query from hanging the GPU, the example implementation Tree Traversal Unit (TTU) 700 provides a mechanism for preemption that will allow rays to timeout early. The example non-limiting implementations described herein provide such a preemption mechanism, including a forward progress guarantee, and additional programmable timeout options that build upon that. Those programmable options provide a means for quality of service timing guarantees for applications such as virtual reality (VR) that have strict timing requirements.

According to “GTC 2019: Chief Scientist Bill Dally Provides Glimpse into Nvidia Research Engine”, the TTU is just the internal name for the RT cores.

At least the Turing and Ampere generations had
TTUOPEN, TTUCLOSE, TTUCCTL (Cache Control), TTULD (Load), TTUST (Store), TTUGO, TTUMACROFUSE
instructions, theoretically usable from CUDA and shaders.

But they are sadly undocumented, probably because the interface is intended to be much less stable (meaning stability of the interface specification, since the next generation will probably have a different architecture, not stability when running it).

That means the TTU probably has the exact same restrictions or allowances as any other CUDA code.

Currently the most low-level way to use the RT cores is with ray queries from Vulkan or DXR (instead of OptiX, which is more high-level).
