How to utilize CUDA, Tensor, and RT cores in one program

We know that all three types of cores can be used for data acceleration. Is it possible to use all three cores in one program, e.g., do pipeline stage A on Tensor core, stage B on RT core, and stage C on CUDA core?

Thanks.

Yes, it's possible. Many games do this. CUDA cores are used for general shader programs, RT cores are used for ray-tracing acceleration, and Tensor cores are used e.g. for DLSS (along with the RT core output). RT cores have nothing to do with CUDA, so this isn't the correct forum to ask about them. Regarding the other two, here is a recent similar question.

Can CUDA event callbacks use RT cores on the same GPU (I'm guessing RT somehow shares some resources with CUDA on the same GPU)? The documentation says CUDA calls are not allowed inside a callback. But what about RT? Especially if it allows interop between RT buffers and CUDA buffers?

Why not just use multithreading and send the RT commands from another thread?
If you need synchronization, you can handle it in the callbacks.

There isn’t a way to directly access RT cores. You use RT cores through an abstraction such as OptiX or DXR. Or you may wish to explain what you mean by “use RT cores”.

The general method NVIDIA provides for using RT cores is OptiX. OptiX may use CUDA under the hood, so as a general matter, calling OptiX from a CUDA callback could result in the same violation you already mentioned. So again it might be necessary to identify exactly what you plan to do with OptiX in a CUDA callback. And I would restate my suggestion: ask OptiX questions on the OptiX forum.

For some unofficial, but public, insights into the ray-tracing unit, see for example Fig. 17 of US11928772B2, “Method for forward progress and programmable timeouts of tree traversal mechanisms in hardware” (Google Patents); you will find the TTU (Tree Traversal Unit) there.

(It does not have to be implemented in actual hardware that way.)

But quoting the history from the patent description:

Then, in 2010, NVIDIA took advantage of the high degree of parallelism of NVIDIA GPUs and other highly parallel architectures to develop the OptiX™ ray tracing engine. See Parker et al., “OptiX: A General Purpose Ray Tracing Engine” (ACM Transactions on Graphics, Vol. 29, No. 4, Article 66, July 2010). In addition to improvements in API’s (application programming interfaces), one of the advances provided by OptiX™ was improving the acceleration data structures used for finding an intersection between a ray and the scene geometry. Such acceleration data structures are usually spatial or object hierarchies used by the ray tracing traversal algorithm to efficiently search for primitives that potentially intersect a given ray. OptiX™ provides a number of different acceleration structure types that the application can choose from. Each acceleration structure in the node graph can be a different type, allowing combinations of high-quality static structures with dynamically updated ones.

The OptiX™ programmable ray tracing pipeline provided significant advances, but was still generally unable by itself to provide real time interactive response to user input on relatively inexpensive computing platforms for complex 3D scenes. Since then, NVIDIA has been developing hardware acceleration capabilities for ray tracing. See e.g., U.S. Pat. Nos. 9,582,607; 9,569,559; US20160070820; and US20160070767.

Given the great potential of a truly interactive real time ray tracing graphics processing system for rendering high quality images of arbitrary complexity in response for example to user input, further work is possible and desirable.

Some Ray Processes can Take Too Long or May Need to be Interrupted

Ray tracing generally involves executing a ray intersection query against a pre-built Acceleration Structure (AS), sometimes referred to more specifically as a Bounding Volume Hierarchy (BVH). Depending on the build of the AS, the number of primitives in the scene and the orientation of a ray, the traversal can take anywhere from a few to hundreds to even thousands of cycles by specialized traversal hardware. Additionally, if a cycle or loop is inadvertently (or even intentionally in the case of a bad actor) encoded into the BVH, it is possible for a traversal to become infinite. For example, it is possible for a BVH to define a traversal that results in an “endless loop.”

To prevent any long-running query from hanging the GPU, the example implementation Tree Traversal Unit (TTU) 700 provides a mechanism for preemption that will allow rays to timeout early. The example non-limiting implementations described herein provide such a preemption mechanism, including a forward progress guarantee, and additional programmable timeout options that build upon that. Those programmable options provide a means for quality of service timing guarantees for applications such as virtual reality (VR) that have strict timing requirements.

According to “GTC 2019: Chief Scientist Bill Dally Provides Glimpse into Nvidia Research Engine”, the TTU is just the internal name for the RT cores.

At least the Turing and Ampere generations had
TTUOPEN, TTUCLOSE, TTUCCTL (Cache Control), TTULD (Load), TTUST (Store), TTUGO, TTUMACROFUSE
instructions, theoretically usable from CUDA and shaders.

But they are sadly undocumented, probably because the interface is intended to be much less stable (meaning stability of the interface specification, since the next generation will probably have a different architecture, not stability when running it).

That means the TTU probably has the exact same restrictions or allowances as any other CUDA code.

Currently the most low-level way to use the RT cores is with ray queries from Vulkan or DXR (instead of OptiX, which is more high-level).
