for (i = 0; i < N; i++)
{
    checkCudaErrors(cudaMemcpyAsync(host[i], device[i], size_bytes, cudaMemcpyDeviceToHost, stream[i]));
}
// using host data after this point
Do I need to synchronize (using CUDA events, device synchronize, stream synchronize, etc.) to make sure the host data is available before using it?
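For reference, here is a minimal sketch (not part of the original question) of what synchronization before the host access could look like, assuming the host buffers are pinned and that host, device, stream, size_bytes, and N are set up as in the snippet above:

for (int i = 0; i < N; i++)
{
    // enqueue one device-to-host copy per stream
    checkCudaErrors(cudaMemcpyAsync(host[i], device[i], size_bytes, cudaMemcpyDeviceToHost, stream[i]));
}
for (int i = 0; i < N; i++)
{
    // wait for each stream's copy to finish before reading its host buffer
    checkCudaErrors(cudaStreamSynchronize(stream[i]));
}
// host[0..N-1] are now safe to read on the CPU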
So for the code in the question above, the copy must have been synchronous and no stream sync should be needed.
Either the document I have attached is outdated, or the answer to this question is “synchronization is redundant; the copy call is already synchronous.” Which one is correct?
So the correct answer indeed depends on the “host” pointer, right? (e.g., if “host” is allocated as pageable memory then synchronization is not needed (the copy is already synchronous), but if it is allocated as pinned memory then synchronization is needed.)
Yes, a better answer would have been to condition my answer with “assuming the host allocations are pinned.”
This is 5 years old, so I don’t have a memory of it. However, if I had to guess, the reason I answered the way I did is probably:
My answer is “safe” from a correctness point of view, whether the host allocations are pinned or pageable (your suggestion is not safe if the allocations are actually pinned).
Usually, people who are using cudaMemcpyAsync know, or should know, that the async character is lost when the host allocation is not pinned. This code appears to be carefully crafted that way (using streams, for example), so I probably assumed something I shouldn’t have.
Providing all this description makes it a better answer. My answer was sloppy.
I will say that I find the usage of cudaMemcpyAsync in a “non-typical” way problematic for a few reasons. When I am teaching CUDA, I encourage people to follow the recommended patterns.
Many folks who are learning CUDA forget that the allocations must be pinned for async behavior.
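As a hedged illustration (not from the original thread), the difference comes down to how the host buffer is allocated; the buffer names below are hypothetical and size_bytes is assumed to be defined as in the question:

// Pinned (page-locked) host allocation: cudaMemcpyAsync can be truly asynchronous,
// so a later cudaStreamSynchronize (or equivalent) is required before reading the data.
float *pinned_buf;
checkCudaErrors(cudaMallocHost(&pinned_buf, size_bytes));

// Pageable host allocation: the same cudaMemcpyAsync call falls back to an
// effectively blocking device-to-host transfer, so the data is ready on return.
float *pageable_buf = (float *)malloc(size_bytes);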
It makes for confusing code if you intentionally use cudaMemcpyAsync and intentionally use pageable allocations and intentionally depend on the conversion to sync behavior. I like code that is easily understandable. Code that requires me to check a bunch of conditions to determine its behavior is troublesome, in my view. Therefore I would encourage folks who intend to use blocking behavior here to use cudaMemcpy, not cudaMemcpyAsync, because the intent is then clear.
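A sketch of that clearer, intentionally blocking alternative, reusing the names from the question’s snippet:

for (i = 0; i < N; i++)
{
    // blocking copy: the intent is explicit, and the data is valid once the call returns
    checkCudaErrors(cudaMemcpy(host[i], device[i], size_bytes, cudaMemcpyDeviceToHost));
}
// no extra synchronization is needed before using the host data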
But to each his own. It’s not wrong, either way, as long as the underlying behavioral rules are understood.