Is there any way to get more specific info when Error code=2 (cudaErrorMemoryAllocation)?
e.g. something like an OpenGL debug callback that provides more details on the error.
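Right now, about the only context I can get is the error string and the current free/total device memory, e.g. something like the sketch below (only cudaGetErrorString and cudaMemGetInfo are real API calls; the helper name and the deliberately oversized request are just for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report the error string plus current free/total device memory.
static void reportAllocFailure(cudaError_t err, size_t requestedBytes)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);    // free/total device memory in bytes
    fprintf(stderr, "allocation of %zu bytes failed: %s (free=%zu, total=%zu)\n",
            requestedBytes, cudaGetErrorString(err), freeBytes, totalBytes);
}

int main()
{
    const size_t bytes = 1ULL << 35;            // deliberately oversized request (32 GiB)
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, bytes);
    if (err != cudaSuccess) {
        reportAllocFailure(err, bytes);
        return 1;
    }
    cudaFree(p);
    return 0;
}
```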
What kind of “more specific info” are you thinking of?
Here is a typical scenario: The last-level memory allocator gets a request to allocate a block of a particular size (and maybe some additional required properties), walks its list of free blocks, and cannot find any free block that satisfies the allocation request. At this point it either returns with “allocation failed” or it may call into a lower-level allocator to increase the amount of memory it can parcel out going forward.
If the lower-level allocator has no memory available, the last-level allocator returns “allocation failed”, otherwise it satisfies the current allocation request and adds the balance of the memory just made available to it (if any) to its freelist.
In this scenario, what additional information would be useful to an application? What would the application do differently based on this information? Keep in mind that this information would be based on internal implementation artifacts of the allocator that could change at any time.
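To make the scenario concrete, here is a rough, purely illustrative sketch of such a last-level allocator walking its free list and falling back to a lower-level allocator. None of the names, sizes, or policies reflect CUDA's actual implementation:

```cpp
#include <cstddef>
#include <list>

struct Block { void *ptr; std::size_t size; };

struct LastLevelAllocator {
    std::list<Block> freeList;

    // Lower-level allocator (e.g. asking the OS/driver for more memory).
    // Returns nullptr when it has nothing left to hand out.
    void *(*lowerLevelAlloc)(std::size_t);

    void *allocate(std::size_t size) {
        // 1. Walk the free list looking for a block that satisfies the request.
        for (auto it = freeList.begin(); it != freeList.end(); ++it) {
            if (it->size >= size) {
                void *p = it->ptr;
                // Return the remainder (if any) to the free list.
                if (it->size > size)
                    freeList.push_back({static_cast<char *>(p) + size,
                                        it->size - size});
                freeList.erase(it);
                return p;
            }
        }
        // 2. No suitable free block: ask the lower-level allocator for more.
        const std::size_t kMinChunk = 1 << 20;          // made-up growth granularity
        const std::size_t chunk = (size > kMinChunk) ? size : kMinChunk;
        void *p = lowerLevelAlloc(chunk);
        if (!p)
            return nullptr;                             // "allocation failed"
        if (chunk > size)                               // keep the balance on the free list
            freeList.push_back({static_cast<char *>(p) + size, chunk - size});
        return p;
    }
};
```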
Is there a maximum number of graphics driver allocations of GPU memory in CUDA? (there is a limit in Vulkan, OpenGL)
I just wonder whether there is a MAX number of allocations; if so, after reaching that limit, CUDA might return cudaErrorMemoryAllocation even though there is actually enough contiguous GPU memory available.
Thanks! And is there any good article about the memory allocator mechanism and the different levels inside CUDA?
Generally speaking, NVIDIA does not publish internal implementation details of their software.
I have written a few simple memory allocators myself when, for some reason or other, the system-provided allocators were not to my liking. Usually this happened when the performance was lower than desired. System-provided memory allocators have no knowledge of the usage patterns of a particular app, so better performance can often be achieved when the application grabs a huge chunk of memory from the system allocator at the start, and then uses that for memory pools, buffer rings, slab allocators, etc. custom-tailored to the needs of the application. For example, an application may only need to allocate blocks of a few different sizes.
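As a minimal sketch of that approach (all names and sizes are made up, and a real pool would need alignment handling, error reporting, thread safety, etc.): a fixed-size block pool carved out of a single upfront cudaMalloc, so acquiring and releasing blocks never touches the driver again:

```cpp
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

class DeviceBlockPool {
    char               *base_ = nullptr;   // one big chunk from the system allocator
    std::size_t         blockSize_;
    std::vector<void *> free_;             // blocks currently available
public:
    DeviceBlockPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(blockSize)
    {
        if (cudaMalloc(reinterpret_cast<void **>(&base_),
                       blockSize * blockCount) != cudaSuccess)
            return;                                   // pool stays empty on failure
        for (std::size_t i = 0; i < blockCount; ++i)
            free_.push_back(base_ + i * blockSize);
    }
    ~DeviceBlockPool() { cudaFree(base_); }

    void *acquire() {                                 // O(1), no driver call
        if (free_.empty()) return nullptr;
        void *p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(void *p) { free_.push_back(p); }     // O(1), no driver call
};
```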
Not sure, but some projects such as PyTorch have their own GPU memory management, which suggests that the raw CUDA API alone is not sufficient for their needs.
It does not matter whether the memory allocator is for a CPU-based or a GPU-based programming platform: the generic allocators provided by a system are usually a grand compromise and therefore rarely optimal for any particular purpose, which is why app-specific custom allocators are quite common where performance is important.
Even when custom allocators are used, one should try to minimize allocation and de-allocation of memory inside performance-critical code sections and re-use already allocated buffers as much as possible.
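For example (the kernel and sizes below are hypothetical, just to illustrate the pattern): allocate a scratch buffer once outside the hot loop and re-use it every iteration, instead of paying for a cudaMalloc/cudaFree pair per iteration:

```cpp
#include <cuda_runtime.h>

__global__ void processKernel(float *buf, int n)    // stand-in for real work
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

void processMany(int iterations, int n)
{
    float *scratch = nullptr;
    cudaMalloc(&scratch, n * sizeof(float));         // one allocation up front

    for (int i = 0; i < iterations; ++i) {
        // re-use the same buffer every iteration; no cudaMalloc/cudaFree here
        processKernel<<<(n + 255) / 256, 256>>>(scratch, n);
    }
    cudaDeviceSynchronize();
    cudaFree(scratch);                               // one de-allocation at the end
}
```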