I would like to clarify some basics of CUDA error handling when dealing with CUDA streams. We have an application with all necessary GPU memory pre-allocated at startup. Multiple threads, each with its own CUDA stream, queue memory copies and kernel executions.
What happens on a memory copy error or a kernel execution error? Does the stream continue executing the queued operations, or does it exit early? Is the stream's error state reset after cudaStreamSynchronize() returns the error?
Currently, we have some unverified reports that adding cudaStreamAddCallback() with a stream state check after each queued operation reveals errors that are not otherwise visible.
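For context, the pattern in question looks roughly like the sketch below (the kernel, callback, and buffer names are made up for illustration; the real code is more involved):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel; the real kernels differ.
__global__ void process_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Host callback registered after each queued operation; it receives the
// stream's error status at the point it executes.
void CUDART_CB check_status_callback(cudaStream_t stream, cudaError_t status, void *userData) {
    if (status != cudaSuccess) {
        fprintf(stderr, "stream operation failed: %s\n", cudaGetErrorString(status));
    }
}

// One worker thread's view: queue a copy and a kernel, with a status check after each.
void worker(cudaStream_t stream, const float *h_in, float *d_in, float *d_out, int n) {
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaStreamAddCallback(stream, check_status_callback, nullptr, 0);

    process_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaStreamAddCallback(stream, check_status_callback, nullptr, 0);

    // Without the callbacks, this is where we expect to see any failure reported.
    cudaError_t err = cudaStreamSynchronize(stream);
    (void)err;
}
```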
PS. I haven’t found proper documentation on this topic – I would appreciate being pointed in the right direction.
What about cudaMemcpyAsync()? Since “all asynchronous errors are sticky”, I assume it produces a sticky error as well.
Does a sticky error discard all operations queued after the failed operation? I have a feeling that it does not, but then all queued operations quickly fail because of the sticky error flagged in the corresponding context. That also means some “trivial” usage of the CUDA runtime API might still do something after the initial error is encountered.
The reported error is 214 (cudaErrorECCUncorrectable).
Given the ambiguity of error handling in CUDA streams, I wonder: are there any reasons to use cudaStreamAddCallback() with a host callback implementing a simple status check? Or would it be effectively equivalent to checking the return value of cudaStreamSynchronize() after all necessary tasks are queued into the stream?
cudaMemcpyAsync won’t produce sticky errors by itself.
Sticky errors come about as a result of device code execution, not anything else that I know of.
Yes, after the occurrence of a sticky error, you should assume that any further use of CUDA in that context/host process will be impossible, and nothing will succeed.
The only “trivial” usage of the CUDA API that may work at that point is retrieval of the error via e.g. cudaGetLastError().
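To illustrate what “sticky” means here, a small sketch (the faulting kernel is contrived, and the exact error code reported may vary):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fault_kernel(int *p) { *p = 42; }  // deliberately writes through a null pointer

int main() {
    fault_kernel<<<1, 1>>>(nullptr);           // queue a kernel that faults
    cudaError_t e1 = cudaDeviceSynchronize();  // the asynchronous error surfaces here
    printf("after sync:       %s\n", cudaGetErrorString(e1));

    // Because the error is sticky, subsequent runtime calls in this process
    // keep failing; only error retrieval still "works".
    int *d = nullptr;
    cudaError_t e2 = cudaMalloc(&d, sizeof(int));
    printf("later cudaMalloc: %s\n", cudaGetErrorString(e2));
    return 0;
}
```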
proper - adhering to the linked instructions, carefully
rigorously - at every opportunity (at every kernel launch, and at every usage of the runtime API)
If you don’t wish to do that, you should be able to get a current “check” on whether or not an error has occurred using cudaGetLastError() or cudaPeekAtLastError().
Mixing these methods may produce unexpected results - spotty application of error checking may still be confusing.
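As an illustration of what that can look like in practice, here is one common macro-based sketch (one possible form, not the only one):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check the return code of every runtime API call.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err__ = (call);                                         \
        if (err__ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                    cudaGetErrorString(err__), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Kernels do not return a status directly, so after a launch:
//   my_kernel<<<grid, block, 0, stream>>>(...);
//   CUDA_CHECK(cudaPeekAtLastError());         // catches launch-time errors
//   CUDA_CHECK(cudaStreamSynchronize(stream)); // catches asynchronous errors
```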
@Robert_Crovella That is an interesting point. So, let’s say, attempting to copy some data to a device using an invalid device pointer would not produce a sticky error?
Would it also mean that a failed cudaMemcpyAsync might lead to a subsequent kernel execution (queued in the same stream) tripping over uninitialized memory (e.g. memory used to index an array)? I suppose cudaMemcpyAsync might check the validity of the address at call time, but there is still a risk of a concurrent cudaFree after the check is done.
Such callbacks receive the status from the CUDA runtime:
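The relevant declaration from the runtime API headers/documentation (the `status` parameter carries the stream’s error state):

```cpp
// Callback type used by cudaStreamAddCallback; `status` is the stream's
// error status at the time the callback runs.
typedef void (CUDART_CB *cudaStreamCallback_t)(cudaStream_t stream,
                                               cudaError_t status,
                                               void *userData);

// cudaError_t cudaStreamAddCallback(cudaStream_t stream,
//                                   cudaStreamCallback_t callback,
//                                   void *userData, unsigned int flags);
```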
So, there is no need to call any additional CUDA APIs. This way it is very easy to track the exact moment an error appears, so it is possible to log relevant information and abort() after that. But it is not obvious whether the added overhead would be more noticeable/worse compared to alternative approaches (especially in the “happy” case).
I don’t believe so. Did you try it? From what I can tell, it produces an “invalid argument” error, which is not sticky. That suggests to me that the cudaMemcpy... calls do considerable up-front validity checking on the arguments (e.g. device pointers and size) you pass. You can easily write a simple test to confirm these claims for yourself.
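A minimal test along those lines might look like this (a sketch; the exact error reported may vary by CUDA version):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float *d_ok = nullptr;
    cudaMalloc(&d_ok, 4 * sizeof(float));

    float host[4] = {0.f, 1.f, 2.f, 3.f};
    float *bogus = reinterpret_cast<float *>(0xdeadbeef);  // deliberately invalid device pointer

    // Expect an error such as "invalid argument" here...
    cudaError_t e1 = cudaMemcpyAsync(bogus, host, sizeof(host),
                                     cudaMemcpyHostToDevice, 0);
    printf("bad copy:  %s\n", cudaGetErrorString(e1));

    // ...and if that error is not sticky, a subsequent valid operation still succeeds.
    cudaError_t e2 = cudaMemcpy(d_ok, host, sizeof(host), cudaMemcpyHostToDevice);
    printf("good copy: %s\n", cudaGetErrorString(e2));

    cudaFree(d_ok);
    return 0;
}
```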
I think that is possible. Did I mention I suggest rigorous, proper error checking?
Although it may not be well documented, it seems to me that cudaFree is synchronizing with respect to device activity. If you issue a cudaFree after a kernel call that uses the referenced pointer, the free operation should not happen until the kernel is done. If you issue it before the kernel call, then the onus is on you as the programmer. CUDA also recently provides stream-ordered allocation/free possibilities.
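For reference, the stream-ordered allocation API mentioned here (available in more recent CUDA toolkits) can be used roughly like this; the kernel is a placeholder:

```cpp
#include <cuda_runtime.h>

__global__ void scale_kernel(float *buf, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

void stream_ordered_example(cudaStream_t stream, int n) {
    float *d_buf = nullptr;
    // The allocation and the free are themselves stream-ordered operations,
    // so the free cannot execute before the preceding kernel in this stream completes.
    cudaMallocAsync((void **)&d_buf, n * sizeof(float), stream);
    scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaFreeAsync(d_buf, stream);
    cudaStreamSynchronize(stream);
}
```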
I certainly can do that, but there is still a difference between observed behavior and guaranteed behavior which can be relied upon :)
Well, I meant that the address check at the call (= enqueue) time of cudaMemcpyAsync is unlikely to be sufficient. We can put quite a long queue of kernels into the stream ahead of cudaMemcpyAsync, which means that cudaFree must be blocked for a potentially very long time (and across a number of different operations) if we want the address check for cudaMemcpyAsync to still hold at the time it is actually executed. That likely means every allocation must have an associated queue of operations that prevent the corresponding memory release, which just seems like a lost cause anyway. But I might be wrong, of course.
That’s very interesting, I haven’t seen this yet.
Well, the problem is that once a task is queued into a CUDA stream, we have no control over error handling, and async APIs inherently have TOCTOU problems. So a non-fatal invalid-argument error caused by cudaMemcpyAsync is going to be “swallowed”/ignored within the stream execution, and then a subsequent kernel execution might “upgrade” this error into an illegal address error (which I assume is fatal/sticky). It just seems that aborting the queue in the CUDA stream on any error would be safer and easier to reason about.
I’m not aware of any documented guarantee of the behavior of cudaMemcpyAsync in this respect, other than what you may find here. If you would like to see documentation changes, you can file a bug requesting those; the bug reporting method is linked in a sticky post at the top of this sub-forum.
I’m not sure why.
I don’t think any of that commentary changes what I said about cudaFree. But as I already mentioned, the documentation is not crystal clear in that respect.
Only if your error checking allows that to happen. Why would your error checking allow a kernel to execute when the data it needs was not successfully delivered to it? You are the programmer. If you allow something to be swallowed up, you should fix that. I would expect that in most cases an invalid argument error from cudaMemcpyAsync is a serious issue and merits something like immediate catastrophic termination. I don’t think it’s just my view, either. You can find many substantial software packages, such as numba, pytorch, tensorflow, and others, that behave this way. They don’t allow such a thing to be swallowed up, from what I have seen.
I think I have covered most of what I know in this area. It’s possible we may simply disagree on some things.
Error checking (i.e. checking the status codes returned from the CUDA API) works only if there are no gaps between the time of check and the time of use. For async APIs there is such a gap, since anything checked at enqueue time is not necessarily still true at execution time (this might be debatable).
There is also cudaStreamAddCallback, which allows you to see errors at the time of actual execution, but there is no way to steer the CUDA stream execution from within the callback (other than crashing the program).
I will try reproducing such cases in practice to dig deeper.
In any case, thank you for the help, I appreciate it!