cudaMemcpyAsync Device to Host : Need to synchronize before using data on host

Hi,

Let's say I have the following:

for (i = 0; i < N; i++)
{
    // asynchronous device-to-host copy, one stream per buffer
    checkCudaErrors(cudaMemcpyAsync(host[i], device[i], size_bytes,
                                    cudaMemcpyDeviceToHost, stream[i]));
}

// using host data after this point

Do I need to synchronize (using CUDA events, cudaStreamSynchronize, cudaDeviceSynchronize, etc.) to make sure the host data is available before using it?

yes
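
For example (a minimal sketch reusing N, host, and stream from the code above), you could synchronize each stream, or the whole device, after the copy loop and before touching the host data:

// wait for each stream's copy to finish before reading host[i]
for (i = 0; i < N; i++)
{
    checkCudaErrors(cudaStreamSynchronize(stream[i]));
}

// alternatively, block on all outstanding device work at once:
// checkCudaErrors(cudaDeviceSynchronize());

// host data is now safe to use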

Thanks txbob

@Robert_Crovella, I read conflicting information in the CUDA documentation.

This page explains some special cases in which cudaMemcpyAsync is performed synchronously internally, e.g. if the host memory is pageable, then the copy is synchronous (no need to sync after the copy to have valid data on the host):
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async

So for the code in the question above, the copy must have been synchronous and no stream sync should be needed.

Either the document I linked is outdated, or the answer to this question is “synchronization is redundant, the copy call is already synchronous.” Which one is correct?

Why is that?

Are you suggesting that you know that the host target is pageable memory for the posted code?

Yes, I made the assumption that “host” is pageable.

So the correct answer indeed depends on the “host” pointer, right? (E.g. if “host” is allocated as pageable memory, then synchronization is not needed (the copy is already synchronous), but if it is allocated as pinned memory, then synchronization is needed.)

Yes, a better answer would have been to qualify my answer with “assuming the host allocations are pinned”.
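
To make the distinction concrete, here is a sketch (the buffer names, size, and stream are illustrative, not from the original code):

float *pinned_host, *pageable_host, *dev;
size_t size_bytes = 1024 * sizeof(float);
cudaStream_t stream;

checkCudaErrors(cudaStreamCreate(&stream));
checkCudaErrors(cudaMallocHost((void **)&pinned_host, size_bytes));  // pinned host memory
pageable_host = (float *)malloc(size_bytes);                         // pageable host memory
checkCudaErrors(cudaMalloc((void **)&dev, size_bytes));

// Pinned destination: the call may return before the copy has completed,
// so synchronization is required before reading pinned_host.
checkCudaErrors(cudaMemcpyAsync(pinned_host, dev, size_bytes,
                                cudaMemcpyDeviceToHost, stream));
checkCudaErrors(cudaStreamSynchronize(stream));

// Pageable destination: per the API sync behavior document, the call
// returns only once the copy has completed, so pageable_host is already
// valid here without an explicit synchronization.
checkCudaErrors(cudaMemcpyAsync(pageable_host, dev, size_bytes,
                                cudaMemcpyDeviceToHost, stream));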

This is 5 years old, so I don’t have a memory of it. However, if I had to guess, the reasons I answered the way I did are perhaps:

  1. My answer is “safe” from a correctness point of view, whether the host allocations are pinned or pageable (your suggestion is not safe if the allocations are actually pinned).

  2. Usually people who are using cudaMemcpyAsync know, or should know, that the async character is lost when the host allocation is not pinned. This code appears to be carefully crafted that way, using streams for example, so I probably assumed something I shouldn’t have.

Providing all this description makes it a better answer. My answer was sloppy.

I will say that I find using cudaMemcpyAsync in a “non-typical” way problematic for a few reasons. When I am teaching CUDA, I encourage people to follow the recommended patterns.

  1. Many folks who are learning CUDA forget that the allocations must be pinned for async behavior.

  2. It makes for confusing code if you intentionally use cudaMemcpyAsync, intentionally use pageable allocations, and intentionally depend on the conversion to sync behavior. I like code that is easily understandable. Code that requires me to check a bunch of conditions to determine its behavior is troublesome, in my view. Therefore I would encourage folks who intend blocking behavior here to use cudaMemcpy, not cudaMemcpyAsync, because the intent is then clear (see the sketch below).

But to each his own. It’s not wrong, either way, as long as the underlying behavioral rules are understood.
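
For instance, if blocking behavior is what you want anyway, a plain cudaMemcpy over the same buffers makes that intent obvious (a sketch reusing the variables from the original loop):

// blocking copies: the intent is clear, and correctness does not depend
// on whether host[i] happens to be pinned or pageable
for (i = 0; i < N; i++)
{
    checkCudaErrors(cudaMemcpy(host[i], device[i], size_bytes,
                               cudaMemcpyDeviceToHost));
}

// host data is valid here with no further synchronization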

Thanks for clarifications and quick response.