Coming from How to clear cuda errors?, but that conversation is locked, so I opened a new one.
I checked the resource https://www.olcf.ornl.gov/wp-content/uploads/2021/06/cuda_training_series_cuda_debugging.pdf, but found that it does not cover the driver API.
It seems the driver API error handling (CUDA Driver API :: CUDA Toolkit Documentation) only has cuGetErrorName and cuGetErrorString, which are clearly stateless functions that just implement a look-up table.
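For reference, this is roughly how I wrap driver API calls today (the `checkDrv` helper and its name are just mine); the two functions only translate a CUresult into strings, and I do not see anything here that could reset state:

```cpp
#include <cuda.h>
#include <cstdio>

// My own helper: print the error name and description for a failed driver
// API call. Both lookups are stateless translations of the CUresult value.
static void checkDrv(CUresult res, const char *what) {
    if (res != CUDA_SUCCESS) {
        const char *name = nullptr;
        const char *desc = nullptr;
        cuGetErrorName(res, &name);    // CUresult -> "CUDA_ERROR_..."
        cuGetErrorString(res, &desc);  // CUresult -> human-readable description
        std::fprintf(stderr, "%s failed: %s (%s)\n", what,
                     name ? name : "?", desc ? desc : "?");
    }
}

int main() {
    checkDrv(cuInit(0), "cuInit");
    CUdevice dev;
    checkDrv(cuDeviceGet(&dev, 0), "cuDeviceGet");
    CUcontext ctx;
    checkDrv(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");
    // ... use the context ...
    checkDrv(cuCtxDestroy(ctx), "cuCtxDestroy");
    return 0;
}
```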
It seems we only have the concept of “error clearing” in the runtime API, via cudaGetLastError?
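To show what I mean by “clearing”, here is a minimal sketch (assuming the usual 1024-threads-per-block limit, so the launch below is rejected):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void noop() {}

int main() {
    // 4096 threads per block exceeds the usual 1024-thread limit, so the
    // launch is rejected with an invalid-configuration error.
    noop<<<1, 4096>>>();

    cudaError_t err = cudaGetLastError();   // returns the launch error ...
    std::printf("first  cudaGetLastError: %s\n", cudaGetErrorString(err));

    err = cudaGetLastError();               // ... and has already cleared it
    std::printf("second cudaGetLastError: %s\n", cudaGetErrorString(err));

    // A later runtime call is unaffected, because the context itself is fine.
    void *p = nullptr;
    std::printf("cudaMalloc after bad launch: %s\n",
                cudaGetErrorString(cudaMalloc(&p, 1 << 20)));
    cudaFree(p);
    return 0;
}
```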
My mental model is:
Each CUDA context has a flag that tracks whether the current context is corrupted. When a kernel runs into issues (illegal memory access, illegal instruction, etc.), that flag is set and the context cannot be used anymore.
For the driver API, if that flag is set, return the error; otherwise, just return the execution result of the driver API call.
For the runtime API (including kernel launches), it additionally tracks a flag for errors that are persistent across runtime API calls but clearable, notably kernel launch errors like an invalid shared memory size. If either flag is set, return the error; otherwise, return the execution result of the runtime API call.
Only certain runtime APIs set the persistent error flag; simple calls like cudaMalloc will not set it. So a cudaMalloc failure will not affect the following kernel launch, but a failed kernel launch will affect the following cudaMalloc. Of course, an illegal memory access inside the kernel will fail both of them.
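To make this concrete, here is a small test sketch of the behavior I expect under this model (the huge allocation size and the deliberate out-of-bounds index are only for illustration):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void noop() {}
__global__ void oob_write(int *p) { p[1 << 30] = 42; }  // deliberate illegal memory access

// Print the stringified call next to its error string.
#define SHOW(call) std::printf("%-45s -> %s\n", #call, cudaGetErrorString(call))

int main() {
    void *p = nullptr;

    // 1) A failed cudaMalloc is a plain, non-sticky error ...
    SHOW(cudaMalloc(&p, (size_t)1 << 60));  // absurd size, expect cudaErrorMemoryAllocation
    // ... so the next kernel launch should still succeed.
    noop<<<1, 1>>>();
    SHOW(cudaDeviceSynchronize());          // expect cudaSuccess

    // 2) An illegal memory access inside a kernel corrupts the context ...
    int *d = nullptr;
    SHOW(cudaMalloc((void **)&d, sizeof(int)));
    oob_write<<<1, 1>>>(d);
    SHOW(cudaDeviceSynchronize());          // expect cudaErrorIllegalAddress
    // ... and that error is sticky: cudaGetLastError does not clear it,
    // and later calls in the same process keep failing.
    SHOW(cudaGetLastError());
    SHOW(cudaMalloc(&p, 1 << 20));          // expect cudaErrorIllegalAddress again
    return 0;
}
```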
Is it correct?