I want to try some operations to see whether they are supported. Normally I could use the device query APIs, but the driver can sometimes be broken, so I cannot rely on it.
However, it seems that if an operation is not supported, an error code is set, and I cannot recover from the error to proceed.
Is it possible to somehow reset the CUDA error code?
If you cannot allocate any memory at all, you would want to look at the actual CUDA error code returned. For example, there could be an incompatibility between the CUDA driver and the CUDA runtime that renders the runtime inoperable.
Yeah, that’s just one example. What I want to ask in general is: if a CUDA API call returns an error, will it put later CUDA API calls into an error state as well? Or can subsequent CUDA API calls proceed as if the failing call had never happened?
It depends on the type of error. Unit 12 of this online training series covers it in some detail. Basically, errors that are reported due to kernel code execution are “sticky” and cannot be cleared except by terminating the owning host process (for the runtime API; the driver API has the option to destroy the context and create a new one). Other errors are “non-sticky” and will be cleared after they are reported. A cudaMalloc error due to exceeding the available memory size, for example, is a non-sticky error. It will be returned as an error code on the call that encountered it, but subsequent usage of the CUDA runtime API is still possible and should return cudaSuccess for acceptable usage.
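To make the non-sticky case concrete, here is a minimal sketch (assuming a working CUDA installation and a device with far less than an exabyte of memory): a deliberately oversized cudaMalloc fails, the error is cleared, and a reasonable allocation afterwards still succeeds.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;

    // Deliberately request far more memory than any device has.
    // This fails with cudaErrorMemoryAllocation -- a non-sticky error.
    cudaError_t err = cudaMalloc(&p, (size_t)1 << 60);
    printf("huge alloc:  %s\n", cudaGetErrorString(err));

    // Clear the error code left behind by the failed call.
    cudaGetLastError();

    // A reasonable allocation still succeeds: the runtime is not broken.
    err = cudaMalloc(&p, 1 << 20);
    printf("small alloc: %s\n", cudaGetErrorString(err));
    cudaFree(p);
    return 0;
}
```

The second cudaMalloc is expected to return cudaSuccess, demonstrating that the failed allocation did not poison the runtime.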
As njuffa has pointed out below, cudaGetLastError() and cudaPeekAtLastError() have somewhat special semantics. Roughly speaking, cudaGetLastError() returns the last instance of a non-cudaSuccess error code and clears it. cudaPeekAtLastError() returns the last error code but does not clear it.
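A short sketch of the difference (again assuming a working CUDA installation): peeking leaves the error code in place, while getting it resets the code to cudaSuccess.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;
    cudaMalloc(&p, (size_t)1 << 60);  // fails, leaves an error code behind

    // Peek does not clear the error, so the subsequent get still sees it.
    printf("peek:      %s\n", cudaGetErrorString(cudaPeekAtLastError()));
    printf("get:       %s\n", cudaGetErrorString(cudaGetLastError()));

    // cudaGetLastError() reset the code, so a second read reports success.
    printf("get again: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```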
There is substantial nuance here. The linked training will cover some of it.
Returns the last error that has been produced by any of the runtime calls in the same instance of the CUDA Runtime library in the host thread and resets it to cudaSuccess.
Note that some errors (e.g. unspecified launch failure, CUDA’s equivalent of a segfault) are so severe that they result in the destruction of the CUDA context, which obviously cannot be used any more. How your application reacts to that is your choice, but in many cases it likely means that the application needs to be terminated.
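For contrast with the non-sticky case, here is a sketch of a sticky error (assuming a working CUDA installation; the faulting kernel and addresses are illustrative). A kernel dereferences an invalid pointer, and after the error surfaces, clearing it does not restore the context: later runtime calls keep failing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that writes through an invalid pointer: roughly CUDA's
// device-side equivalent of a segfault.
__global__ void crash() { *(int *)0xdead = 42; }

int main() {
    crash<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();  // the error surfaces here
    printf("after crash: %s\n", cudaGetErrorString(err));

    // The context is now corrupted and the error is "sticky":
    // clearing it does not help, subsequent calls keep failing.
    cudaGetLastError();
    void *p = nullptr;
    err = cudaMalloc(&p, 256);
    printf("later call:  %s\n", cudaGetErrorString(err));
    return 0;
}
```

The second printf is expected to report an error (not cudaSuccess), which is the defining behavior of a sticky error in the runtime API.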