How to clear CUDA errors?

I want to try some operations to see if they are supported. Normally I could use the device query APIs, but sometimes the driver can be broken, so I cannot rely on it.

However, it seems that if an operation is not supported, an error code is set and I cannot recover from the error to proceed.

Is it possible to somehow reset the CUDA error code?

A simple example: when I dynamically allocate memory and an OOM error occurs, I want to retry with a smaller size. This seems impossible in CUDA.

#include <stdio.h>
#include <cuda_runtime.h>

#define NUM_ALLOCS 100      // Number of allocations
#define ALLOC_SIZE (1 << 30) // 1 GB of memory

int main() {
    // Pointer for memory allocation
    void *ptr;

    // Loop for multiple allocations
    for (int i = 0; i < NUM_ALLOCS; i++) {
        cudaError_t status = cudaMalloc(&ptr, ALLOC_SIZE);
        if (status != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(status));
        }
        else{
            printf("Successfully allocated memory %d times.\n", i);
        }
    }

    return 0;
}

That is quite possible. But your posted sample code does not actually do that. Here is a sample app that implements it:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define INITIAL_ALLOC_SIZE (1ULL << 36)

int main (void)
{
    void *my_allocation = 0;
    unsigned long long int allocsize = INITIAL_ALLOC_SIZE;

    while ((my_allocation == 0) && (allocsize != 0)) {
        printf ("trying to allocate %llu bytes ...\n", allocsize);
        if (cudaSuccess == cudaMalloc (&my_allocation, allocsize)) {
            printf ("Success! my_allocation = %p. Freeing memory & exiting\n", 
                    my_allocation);
            cudaFree (my_allocation);
            return EXIT_SUCCESS;
        } else {
            printf ("Failed! Trying smaller size.\n");
            allocsize /= 2;
        }        
    }
    printf ("Could not allocate any memory\n");
    return EXIT_FAILURE;
}

Sample output on a system with a very low end GPU:

C:\Users\Norbert\My Programs>decreasin_alloc_size
trying to allocate 68719476736 bytes ...
Failed! Trying smaller size.
trying to allocate 34359738368 bytes ...
Failed! Trying smaller size.
trying to allocate 17179869184 bytes ...
Failed! Trying smaller size.
trying to allocate 8589934592 bytes ...
Failed! Trying smaller size.
trying to allocate 4294967296 bytes ...
Failed! Trying smaller size.
trying to allocate 2147483648 bytes ...
Failed! Trying smaller size.
trying to allocate 1073741824 bytes ...
Failed! Trying smaller size.
trying to allocate 536870912 bytes ...
Success! my_allocation = 00000005008E0000. Freeing memory & exiting

If you cannot allocate any memory at all, you would want to look at the actual CUDA error code returned. For example, there could be an incompatibility between the CUDA driver and the CUDA runtime that renders the runtime inoperable.
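For illustration, here is a minimal sketch of inspecting the returned error code when even a tiny allocation fails. cudaGetErrorName() gives the symbolic name, which helps distinguish, say, a driver/runtime mismatch (cudaErrorInsufficientDriver) from an ordinary out-of-memory condition (cudaErrorMemoryAllocation):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    void *p = 0;

    // Request a single byte: if even this fails, something more fundamental
    // than memory pressure is wrong (driver/runtime mismatch, no device, ...)
    cudaError_t status = cudaMalloc(&p, 1);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaMalloc of 1 byte failed: %s (%s)\n",
                cudaGetErrorName(status), cudaGetErrorString(status));
        return EXIT_FAILURE;
    }
    cudaFree(p);
    return EXIT_SUCCESS;
}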

Yeah, that’s just one example. What I want to ask in general is: if a CUDA API call returns an error, will it put later CUDA API calls in an error state as well? Or can the remaining CUDA API calls proceed as if the failing call had never happened?

It depends on the type of error. Unit 12 of this online training series covers it in some detail. Basically, errors that arise from kernel code execution are “sticky” and cannot be cleared except by terminating the owning host process (for the runtime API; the driver API has the option to destroy the context and create a new one). Other errors are “non-sticky” and are cleared once they have been reported. A cudaMalloc failure due to exceeding the available memory, for example, is a non-sticky error: it is returned as an error code on the call that encountered it, but subsequent usage of the CUDA runtime API is still possible and should return cudaSuccess for acceptable usage.
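As a sketch of the non-sticky case: an oversized cudaMalloc fails, but a subsequent, reasonably sized cudaMalloc is expected to succeed because the error is cleared once it has been reported:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *p = 0;

    // Deliberately request an absurd amount of memory; this fails ...
    cudaError_t status = cudaMalloc(&p, 1ULL << 48);
    printf("huge cudaMalloc : %s\n", cudaGetErrorString(status));

    // ... but the failure is non-sticky, so the runtime remains usable
    // and a smaller request is expected to succeed.
    status = cudaMalloc(&p, 1 << 20);   // 1 MiB
    printf("small cudaMalloc: %s\n", cudaGetErrorString(status));

    cudaFree(p);
    return 0;
}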

As njuffa has pointed out below, cudaGetLastError() and cudaPeekAtLastError() have somewhat special semantics. Roughly speaking, cudaGetLastError() returns the last instance of a non-cudaSuccess error code and clears that error. cudaPeekAtLastError() returns the last error code but does not clear it.

There is substantial nuance here. The linked training will cover some of it.
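A minimal sketch of the difference, assuming a failed cudaMalloc is recorded as the calling thread's last error: cudaPeekAtLastError() keeps returning the recorded error on repeated calls, while cudaGetLastError() returns it once and then resets the state to cudaSuccess:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *p = 0;

    // Provoke a non-sticky error: an allocation that cannot succeed
    (void)cudaMalloc(&p, 1ULL << 48);

    // Peeking does not clear the recorded error ...
    printf("peek 1: %s\n", cudaGetErrorString(cudaPeekAtLastError()));
    printf("peek 2: %s\n", cudaGetErrorString(cudaPeekAtLastError()));

    // ... whereas cudaGetLastError returns it once, then resets to cudaSuccess
    printf("get  1: %s\n", cudaGetErrorString(cudaGetLastError()));
    printf("get  2: %s\n", cudaGetErrorString(cudaGetLastError()));

    return 0;
}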

Let's read the fine manual:

__host__ __device__ cudaError_t cudaGetLastError ( void )

Returns the last error from a runtime call.

Description

Returns the last error that has been produced by any of the runtime calls in the same instance of the CUDA Runtime library in the host thread and resets it to cudaSuccess.

Note that some errors (e.g. unspecified launch failure, CUDA’s equivalent of a segfault) are so severe that they result in the destruction of the CUDA context, which obviously cannot be used any more. How your application reacts to that is your choice, but in many cases it likely means that the application needs to be terminated.
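As a sketch of the sticky case (assuming a null-pointer write in a kernel triggers an illegal-memory-access error on the device in use): once the context has been corrupted, later runtime calls in the same process keep reporting an error, and clearing the last-error state with cudaGetLastError() does not help:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void bad_kernel(int *p)
{
    *p = 42;   // p is a null device pointer: illegal memory access
}

int main(void)
{
    bad_kernel<<<1, 1>>>(NULL);
    cudaError_t status = cudaDeviceSynchronize();
    printf("after bad kernel: %s\n", cudaGetErrorString(status));

    // The error is sticky: clearing the last-error state does not repair
    // the corrupted context, and subsequent calls keep failing.
    (void)cudaGetLastError();
    void *p = 0;
    status = cudaMalloc(&p, 16);
    printf("subsequent call : %s\n", cudaGetErrorString(status));

    return 0;
}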

Thanks so much!
