Difference in error handling between driver API and runtime API

Coming from How to clear cuda errors?, but that conversation is locked, so I opened a new one.

I checked the resource https://www.olcf.ornl.gov/wp-content/uploads/2021/06/cuda_training_series_cuda_debugging.pdf, but found that it does not cover the driver API.

It seems the driver API error handling section (CUDA Driver API :: CUDA Toolkit Documentation) only has cuGetErrorName and cuGetErrorString, which are clearly stateless functions that just implement a look-up table.

It seems we only have the concept of “error clearing” in the runtime API, via cudaGetLastError?

My mental model is:

Each CUDA context has a flag that tracks whether the current context is corrupted. When a kernel runs into issues (illegal memory access, illegal instruction, etc.), that flag is set and the context cannot be used anymore.

For the driver API: if that flag is set, return the error; otherwise, just return the execution result of the driver API call.

For the runtime API (including kernel launches): it additionally tracks a flag for persistent (i.e., persistent across runtime API calls) but clearable errors, notably kernel launch errors like an invalid shared memory size. If either flag is set, return the error; otherwise, return the execution result of the runtime API call.

Only certain runtime APIs set the persistent error flag; simple calls like cudaMalloc will not set it. So a cudaMalloc failure will not affect the following kernel launch, but a failed kernel launch will affect the following cudaMalloc. Of course, an illegal memory access inside a kernel will fail both of them.
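
For concreteness, the model above can be sketched as a small, GPU-free C++ simulation. All names here (`SimContext`, `sim_runtime_call`, etc.) are hypothetical scaffolding for the proposed semantics, not actual CUDA internals; whether it matches real CUDA is exactly the question:

```cpp
#include <cassert>

// Hypothetical error codes standing in for cudaError_t values.
enum SimError { simSuccess, simInvalidConfig, simIllegalAccess };

struct SimContext {
    bool corrupted = false;            // sticky: set by illegal memory access, etc.
    SimError persistent = simSuccess;  // clearable: set by e.g. a bad launch config
};

// Runtime-style call in the proposed model: fails if either flag is set,
// otherwise returns the call's own result. `sets_persistent_flag` marks
// calls (like kernel launches) whose failures are recorded; simple calls
// like cudaMalloc would pass false.
SimError sim_runtime_call(SimContext& ctx, SimError result,
                          bool sets_persistent_flag) {
    if (ctx.corrupted) return simIllegalAccess;
    if (ctx.persistent != simSuccess) return ctx.persistent;
    if (result != simSuccess && sets_persistent_flag) ctx.persistent = result;
    return result;
}
```

Under this model, a failed "cudaMalloc" (which does not set the flag) leaves a later launch unaffected, while a failed launch (which sets the flag) poisons the next "cudaMalloc".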

Is it correct?


After digging for a while, I think we can treat the CUDA driver API as the following:

CUresult some_driver_api(some_args) {
    // check if context is corrupted
    if (context_is_corrupted) {
        return corresponding_error_code;
    }
    // execute the corresponding driver API implementation
    return some_driver_api_implementation(some_args);
}

And treat the CUDA runtime API as the following:

cudaError_t some_runtime_api(some_args) {
    // check if context is corrupted
    if (context_is_corrupted) {
        return corresponding_error_code;
    }
    // execute the corresponding runtime API implementation
    // (internally built on the driver API)
    cudaError_t value = some_runtime_api_implementation(some_args);
    // if the call is not successful, update the global variable
    if (value != cudaSuccess) {
        last_error_code = value;
    }
    // return the call result
    return value;
}

The difference is whether a failed API call affects a global last_error_code. If we never call cudaGetLastError, they behave the same. However, since much code explicitly calls cudaGetLastError to check for errors, the difference matters in practice.
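
The documented contract is that cudaGetLastError returns the last recorded error and resets it to cudaSuccess, while cudaPeekAtLastError returns it without resetting. Applied to the hypothetical last_error_code variable in the pseudocode above, that looks like this GPU-free sketch (simulated names, not real CUDA internals):

```cpp
#include <cassert>

// Hypothetical error codes standing in for cudaError_t values.
enum SimError { simSuccess, simLaunchFailure };

// The runtime's per-thread "last error" slot, modeled as a global.
SimError last_error_code = simSuccess;

// cudaGetLastError analogue: reading the error clears it.
SimError sim_get_last_error() {
    SimError e = last_error_code;
    last_error_code = simSuccess;
    return e;
}

// cudaPeekAtLastError analogue: reading leaves the error in place.
SimError sim_peek_at_last_error() {
    return last_error_code;
}
```

So once a runtime call has recorded an error, the slot stays polluted until something reads it with the clearing variant.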

So the code in How to clear cuda errors? - #3 by njuffa is actually problematic: although it can allocate memory successfully, the global error state is polluted. We need to call cudaGetLastError to clear the error for it to be useful.