Difference in error handling between driver API and runtime API

Coming from How to clear cuda errors?, but that conversation is locked, so I opened a new one.

I checked the resource https://www.olcf.ornl.gov/wp-content/uploads/2021/06/cuda_training_series_cuda_debugging.pdf, but found that it does not cover the driver API.

It seems the driver API error handling section (CUDA Driver API :: CUDA Toolkit Documentation) only has cuGetErrorName and cuGetErrorString, which are clearly stateless functions that just implement a look-up table.

It seems we only have the concept of “error clearing” in the runtime API, via cudaGetLastError?

My mental model is:

Each CUDA context has a flag that tracks whether the current context is corrupted. When a kernel runs into issues (illegal memory access, illegal instruction, etc.), that flag is set and the context cannot be used anymore.

For the driver API: if that flag is set, return the error; otherwise, just return the execution result of the driver API call.

For the runtime API (including kernel launches): it additionally tracks a flag for persistent (i.e., persistent across runtime API calls) but clearable errors, notably kernel launch errors like an invalid shared memory size. If either flag is set, return the error; otherwise, return the execution result of the runtime API call.

Only certain runtime APIs set the persistent error flag; simple calls like cudaMalloc will not set it. So a cudaMalloc failure will not affect the following kernel launch, but a failed kernel launch will affect the following cudaMalloc. Of course, an illegal memory access inside a kernel will fail both of them.
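
For concreteness, the model above can be sketched as a small, GPU-free C++ simulation. All names here (`SimContext`, `sim_runtime_call`, etc.) are hypothetical scaffolding for the proposed semantics, not actual CUDA internals; whether it matches real CUDA is exactly the question:

```cpp
#include <cassert>

// Hypothetical error codes standing in for cudaError_t values.
enum SimError { simSuccess, simInvalidConfig, simIllegalAccess };

struct SimContext {
    bool corrupted = false;            // sticky: set by illegal memory access, etc.
    SimError persistent = simSuccess;  // clearable: set by e.g. a bad launch config
};

// Runtime-style call in the proposed model: fails if either flag is set,
// otherwise returns the call's own result. `sets_persistent_flag` marks
// calls (like kernel launches) whose failures are recorded; simple calls
// like cudaMalloc would pass false.
SimError sim_runtime_call(SimContext& ctx, SimError result,
                          bool sets_persistent_flag) {
    if (ctx.corrupted) return simIllegalAccess;
    if (ctx.persistent != simSuccess) return ctx.persistent;
    if (result != simSuccess && sets_persistent_flag) ctx.persistent = result;
    return result;
}
```

Under this model, a failed "cudaMalloc" (which does not set the flag) leaves a later launch unaffected, while a failed launch (which sets the flag) poisons the next "cudaMalloc".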

Is it correct?


After digging for a while, I think we can treat the CUDA driver API as the following:

CUresult some_driver_api(some_args) {
    // check if context is corrupted
    if (context_is_corrupted) {
        return corresponding_error_code;
    }
    // execute the corresponding driver API implementation
    return some_driver_api_implementation(some_args);
}

And treat the CUDA runtime API as the following:

cudaError_t some_runtime_api(some_args) {
    // check if context is corrupted
    if (context_is_corrupted) {
        return corresponding_error_code;
    }
    // execute the corresponding runtime API implementation
    // (internally built on the driver API)
    cudaError_t value = some_runtime_api_implementation(some_args);
    // if the call is not successful, update the global variable
    if (value != cudaSuccess) {
        last_error_code = value;
    }
    // return the call result
    return value;
}

The difference is whether a failed API call affects a global last_error_code. If we never call cudaGetLastError, they behave the same. However, since much code explicitly calls cudaGetLastError to check for errors, the difference matters in practice.
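
The documented contract is that cudaGetLastError returns the last recorded error and resets it to cudaSuccess, while cudaPeekAtLastError returns it without resetting. Applied to the hypothetical last_error_code variable in the pseudocode above, that looks like this GPU-free sketch (simulated names, not real CUDA internals):

```cpp
#include <cassert>

// Hypothetical error codes standing in for cudaError_t values.
enum SimError { simSuccess, simLaunchFailure };

// The runtime's per-thread "last error" slot, modeled as a global.
SimError last_error_code = simSuccess;

// cudaGetLastError analogue: reading the error clears it.
SimError sim_get_last_error() {
    SimError e = last_error_code;
    last_error_code = simSuccess;
    return e;
}

// cudaPeekAtLastError analogue: reading leaves the error in place.
SimError sim_peek_at_last_error() {
    return last_error_code;
}
```

So once a runtime call has recorded an error, the slot stays polluted until something reads it with the clearing variant.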

So the code in How to clear cuda errors? - #3 by njuffa is actually problematic: although it can allocate memory successfully, the global error state is polluted. We need to call cudaGetLastError to clear the error for it to be useful.