I would like to clarify some basics of CUDA error handling when dealing with CUDA streams. We have an application with all necessary GPU memory pre-allocated at startup. Multiple threads, each with its own CUDA stream, queue memory copies and kernel executions.
What happens on a memory copy error or a kernel execution error? Does the stream continue executing the queued operations, or does it exit early? Is the stream's error state reset after cudaStreamSynchronize() returns the error?
Currently, we have some unverified reports that adding cudaStreamAddCallback() with a stream state check after each queued operation reveals errors that are not otherwise visible.
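For context, the pattern in question looks roughly like the sketch below (the kernel, callback, and buffer names are made up for illustration; the real code is more involved):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel; the real kernels differ.
__global__ void process_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Host callback registered after each queued operation; it receives the
// stream's error status at the point it executes.
void CUDART_CB check_status_callback(cudaStream_t stream, cudaError_t status, void *userData) {
    if (status != cudaSuccess) {
        fprintf(stderr, "stream operation failed: %s\n", cudaGetErrorString(status));
    }
}

// One worker thread's view: queue a copy and a kernel, with a status check after each.
void worker(cudaStream_t stream, const float *h_in, float *d_in, float *d_out, int n) {
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaStreamAddCallback(stream, check_status_callback, nullptr, 0);

    process_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaStreamAddCallback(stream, check_status_callback, nullptr, 0);

    // Without the callbacks, this is where we expect to see any failure reported.
    cudaError_t err = cudaStreamSynchronize(stream);
    (void)err;
}
```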
PS. I haven’t found proper documentation on this topic – I would appreciate being pointed in the right direction.
What about cudaMemcpyAsync()? Since “all asynchronous errors are sticky”, I assume it produces a sticky error as well.
Does a sticky error discard all operations queued after the failed operation? I have a feeling that it does not, but then all queued operations quickly fail because of the sticky error flagged in the corresponding context. That also means some “trivial” usage of the CUDA runtime API might still do something after the initial error is encountered.
The reported error is 214 (cudaErrorECCUncorrectable).
Given the ambiguity of error handling in CUDA streams, I wonder: are there any reasons to use cudaStreamAddCallback() with a host callback implementing a simple status check? Or would it be effectively equivalent to checking the return value of cudaStreamSynchronize() after all necessary tasks are queued into the stream?
cudaMemcpyAsync won’t produce sticky errors by itself.
Sticky errors come about as a result of device code execution, not anything else that I know of.
Yes, after the occurrence of a sticky error, you should assume that any further use of CUDA in that context/host process will be impossible, and nothing will succeed.
The only “trivial” usage of the CUDA API that may work at that point is retrieval of the error via e.g. cudaGetLastError().
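To illustrate what “sticky” means here, a small sketch (the faulting kernel is contrived, and the exact error code reported may vary):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fault_kernel(int *p) { *p = 42; }  // deliberately writes through a null pointer

int main() {
    fault_kernel<<<1, 1>>>(nullptr);           // queue a kernel that faults
    cudaError_t e1 = cudaDeviceSynchronize();  // the asynchronous error surfaces here
    printf("after sync:       %s\n", cudaGetErrorString(e1));

    // Because the error is sticky, subsequent runtime calls in this process
    // keep failing; only error retrieval still "works".
    int *d = nullptr;
    cudaError_t e2 = cudaMalloc(&d, sizeof(int));
    printf("later cudaMalloc: %s\n", cudaGetErrorString(e2));
    return 0;
}
```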
proper - adhering to the linked instructions, carefully
rigorously - at every opportunity (at every kernel launch, and at every usage of the runtime API)
If you don’t wish to do that, you should be able to get a current “check” on whether or not an error has occurred using cudaGetLastError() or cudaPeekAtLastError().
Mixing these methods may produce unexpected results - spotty application of error checking may still be confusing.
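As an illustration of what that can look like in practice, here is one common macro-based sketch (one possible form, not the only one):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check the return code of every runtime API call.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err__ = (call);                                         \
        if (err__ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                    cudaGetErrorString(err__), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Kernels do not return a status directly, so after a launch:
//   my_kernel<<<grid, block, 0, stream>>>(...);
//   CUDA_CHECK(cudaPeekAtLastError());         // catches launch-time errors
//   CUDA_CHECK(cudaStreamSynchronize(stream)); // catches asynchronous errors
```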
@Robert_Crovella That is an interesting point. So, let’s say, attempting to copy some data to a device using an invalid device pointer would not produce a sticky error?
Would it also mean that a failed cudaMemcpyAsync might lead to a subsequent kernel execution (queued in the same stream) tripping over uninitialized memory (e.g. memory used to index an array)? I suppose cudaMemcpyAsync might check the validity of the address at call time, but there is still a risk of a concurrent cudaFree after the check is done.
Such callbacks receive the status from the CUDA runtime:
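The relevant declaration from the runtime API headers/documentation (the `status` parameter carries the stream’s error state):

```cpp
// Callback type used by cudaStreamAddCallback; `status` is the stream's
// error status at the time the callback runs.
typedef void (CUDART_CB *cudaStreamCallback_t)(cudaStream_t stream,
                                               cudaError_t status,
                                               void *userData);

// cudaError_t cudaStreamAddCallback(cudaStream_t stream,
//                                   cudaStreamCallback_t callback,
//                                   void *userData, unsigned int flags);
```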
So, there is no need to call any additional CUDA APIs. This way it is very easy to track the exact moment an error appears, so it is possible to log relevant information and abort() after that. But it is not obvious whether the added overhead would be more noticeable/worse compared to alternative approaches (especially in the “happy” case).
I don’t believe so. Did you try it? From what I can tell, it produces an “invalid argument” error, which is not sticky. That suggests to me that the cudaMemcpy... calls do considerable up-front validity checking on the arguments (e.g. device pointers and size) you pass. You can easily write a simple test to confirm these claims for yourself.
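A minimal test along those lines might look like this (a sketch; the exact error reported may vary by CUDA version):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float *d_ok = nullptr;
    cudaMalloc(&d_ok, 4 * sizeof(float));

    float host[4] = {0.f, 1.f, 2.f, 3.f};
    float *bogus = reinterpret_cast<float *>(0xdeadbeef);  // deliberately invalid device pointer

    // Expect an error such as "invalid argument" here...
    cudaError_t e1 = cudaMemcpyAsync(bogus, host, sizeof(host),
                                     cudaMemcpyHostToDevice, 0);
    printf("bad copy:  %s\n", cudaGetErrorString(e1));

    // ...and if that error is not sticky, a subsequent valid operation still succeeds.
    cudaError_t e2 = cudaMemcpy(d_ok, host, sizeof(host), cudaMemcpyHostToDevice);
    printf("good copy: %s\n", cudaGetErrorString(e2));

    cudaFree(d_ok);
    return 0;
}
```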
I think that is possible. Did I mention I suggest rigorous, proper error checking?
Although it may not be well documented, it seems to me that cudaFree is synchronizing with respect to device activity. If you issue a cudaFree after a kernel call that uses the referenced pointer, the free operation should not happen until the kernel is done. If you issue it before the kernel call, then the onus is on you as the programmer. CUDA also recently provides stream-ordered allocation/free possibilities.
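For reference, the stream-ordered allocation API mentioned here (available in more recent CUDA toolkits) can be used roughly like this; the kernel is a placeholder:

```cpp
#include <cuda_runtime.h>

__global__ void scale_kernel(float *buf, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

void stream_ordered_example(cudaStream_t stream, int n) {
    float *d_buf = nullptr;
    // The allocation and the free are themselves stream-ordered operations,
    // so the free cannot execute before the preceding kernel in this stream completes.
    cudaMallocAsync((void **)&d_buf, n * sizeof(float), stream);
    scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaFreeAsync(d_buf, stream);
    cudaStreamSynchronize(stream);
}
```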
I certainly can do that, but there is still a difference between observed behavior and guaranteed behavior which can be relied upon :)
Well, I meant that the address check at the call (= enqueue) time of cudaMemcpyAsync is unlikely to be sufficient. We can put quite a long queue of kernels into the stream ahead of cudaMemcpyAsync, which means that cudaFree must be blocked for a potentially very long time (and across a number of different operations) if we want the address check for cudaMemcpyAsync to still hold at the time it is actually executed. That likely means every allocation must have an associated queue of operations that prevent the corresponding memory release, which just seems like a lost cause anyway. But I might be wrong, of course.
That’s very interesting, I haven’t seen this yet.
Well, the problem is that once a task is queued into a CUDA stream, we have no control over error handling, and async APIs inherently have TOCTOU problems. So a non-fatal invalid-argument error caused by cudaMemcpyAsync is going to be “swallowed”/ignored within the stream execution, and then a subsequent kernel execution might “upgrade” this error into an illegal address error (which I assume is fatal/sticky). It just seems that aborting the queue in the CUDA stream on any error would be safer and easier to reason about.
I’m not aware of any documented guarantee of the behavior of cudaMemcpyAsync in this respect, other than what you may find here. If you would like to see documentation changes, you can file a bug requesting those; the bug reporting method is linked in a sticky post at the top of this sub-forum.
I’m not sure why.
I don’t think any of that commentary changes what I said about cudaFree. But as I already mentioned, the documentation is not crystal clear in that respect.
Only if your error checking allows that to happen. Why would your error checking allow a kernel to execute when the data it needs was not successfully delivered to it? You are the programmer. If you allow something to be swallowed up, you should fix that. I would expect that in most cases an invalid argument error from cudaMemcpyAsync is a serious issue and merits something like immediate catastrophic termination. I don’t think it’s just my view, either. You can find many substantial software packages, such as numba, pytorch, tensorflow, and others, that behave this way. They don’t allow such a thing to be swallowed up, from what I have seen.
I think I have covered most of what I know in this area. It’s possible we may simply disagree on some things.
Error checking (i.e. checking the status codes returned from the CUDA API) works only if there are no gaps between the time of check and the time of use. For async APIs there is such a gap, since anything checked at enqueue time is not necessarily still true at execution time (this might be debatable).
There is also cudaStreamAddCallback, which allows you to see errors at the time of actual execution, but there is no way to steer the CUDA stream execution from within the callback (other than crashing the program).
I will try reproducing such cases in practice to dig deeper.
In any case, thank you for the help, I appreciate it!