I want to try some operations to see whether they are supported. Normally I could use the device query APIs, but the driver can sometimes be broken, so I cannot rely on it.
However, it seems that if an operation is not supported, an error code is set, and I cannot recover from the error to proceed.
Is it possible to somehow reset the CUDA error code?
If you cannot allocate any memory at all, you would want to look at the actual CUDA error code returned. For example, there could be an incompatibility between the CUDA driver and the CUDA runtime that renders the runtime inoperable.
Yeah, that’s just one example. What I want to ask in general is: if a CUDA API call returns an error, will it put later CUDA API calls into an error state as well? Or can subsequent CUDA API calls proceed as if the failing call had never happened?
It depends on the type of error. Unit 12 of this online training series covers it in some detail. Basically, errors that are reported due to kernel code execution are “sticky” and cannot be cleared except by terminating the owning host process (for the runtime API; the driver API has the option to destroy the context and create a new one). Other errors are “non-sticky” and will be cleared after they are reported. A cudaMalloc error due to exceeding the available memory size, for example, is a non-sticky error. It will be returned as an error code on the call that encountered it, but subsequent usage of the CUDA runtime API is still possible and should return cudaSuccess for acceptable usage.
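To make the non-sticky case concrete, here is a minimal sketch (assuming a working CUDA installation and a device with far less than an exabyte of memory): a deliberately oversized cudaMalloc fails, the error is cleared, and a reasonable allocation afterwards still succeeds.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;

    // Deliberately request far more memory than any device has.
    // This fails with cudaErrorMemoryAllocation -- a non-sticky error.
    cudaError_t err = cudaMalloc(&p, (size_t)1 << 60);
    printf("huge alloc:  %s\n", cudaGetErrorString(err));

    // Clear the error code left behind by the failed call.
    cudaGetLastError();

    // A reasonable allocation still succeeds: the runtime is not broken.
    err = cudaMalloc(&p, 1 << 20);
    printf("small alloc: %s\n", cudaGetErrorString(err));
    cudaFree(p);
    return 0;
}
```

The second cudaMalloc is expected to return cudaSuccess, demonstrating that the failed allocation did not poison the runtime.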
As njuffa has pointed out below, cudaGetLastError() and cudaPeekAtLastError() have somewhat special semantics. Roughly speaking, cudaGetLastError() returns the last instance of a non-cudaSuccess error code and clears it. cudaPeekAtLastError() returns the last error code but does not clear it.
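A short sketch of the difference (again assuming a working CUDA installation): peeking leaves the error code in place, while getting it resets the code to cudaSuccess.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;
    cudaMalloc(&p, (size_t)1 << 60);  // fails, leaves an error code behind

    // Peek does not clear the error, so the subsequent get still sees it.
    printf("peek:      %s\n", cudaGetErrorString(cudaPeekAtLastError()));
    printf("get:       %s\n", cudaGetErrorString(cudaGetLastError()));

    // cudaGetLastError() reset the code, so a second read reports success.
    printf("get again: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```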
There is substantial nuance here. The linked training will cover some of it.
Returns the last error that has been produced by any of the runtime calls in the same instance of the CUDA Runtime library in the host thread and resets it to cudaSuccess.
Note that some errors (e.g. unspecified launch failure, CUDA’s equivalent of a segfault) are so severe that they result in the destruction of the CUDA context, which obviously cannot be used any more. How your application reacts to that is your choice, but in many cases it likely means that the application needs to be terminated.
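For contrast with the non-sticky case, here is a sketch of a sticky error (assuming a working CUDA installation; the faulting kernel and addresses are illustrative). A kernel dereferences an invalid pointer, and after the error surfaces, clearing it does not restore the context: later runtime calls keep failing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that writes through an invalid pointer: roughly CUDA's
// device-side equivalent of a segfault.
__global__ void crash() { *(int *)0xdead = 42; }

int main() {
    crash<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();  // the error surfaces here
    printf("after crash: %s\n", cudaGetErrorString(err));

    // The context is now corrupted and the error is "sticky":
    // clearing it does not help, subsequent calls keep failing.
    cudaGetLastError();
    void *p = nullptr;
    err = cudaMalloc(&p, 256);
    printf("later call:  %s\n", cudaGetErrorString(err));
    return 0;
}
```

The second printf is expected to report an error (not cudaSuccess), which is the defining behavior of a sticky error in the runtime API.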