Description
There is an issue on the following page: https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html#asynchronous-vs-synchronous-operations-with-non-blocking-true-cuda-cudamemcpyasync
In the section "Asynchronous vs. Synchronous Operations with non_blocking=True (CUDA cudaMemcpyAsync)", consider the plot for the command benchmark_with_profiler(streamed=True, pinned=True). Looking at the pink cudaEventSynchronize blocks, it appears that the event on the stream running the tensor multiplication (t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda) is synchronized first, and only then the event on the stream running the transfer (t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)).
However, in the code on the same page, the order is reversed: t_star_cuda_h2d_event (the event on the transfer stream) is synchronized first, followed by t3_cuda_h2d_event (the event on the multiplication stream):
```python
# The function we want to profile
def inner(pinned: bool, streamed: bool):
    with torch.cuda.stream(s) if streamed else contextlib.nullcontext():
        if pinned:
            t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
        else:
            t2_cuda = t2_cpu_paged.to(device, non_blocking=True)
        t_star_cuda_h2d_event = s.record_event()
    # This operation can be executed during the CPU to GPU copy if and only if
    # the tensor is pinned and the copy is done in the other stream
    t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()
    t_star_cuda_h2d_event.synchronize()
    t3_cuda_h2d_event.synchronize()
```
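For reference, if the code were meant to mirror the order of the pink cudaEventSynchronize blocks in the plot, the two waits would be swapped. Below is a minimal, self-contained sketch of that ordering (the names copy_event and mul_event are mine, not from the tutorial):

```python
import torch

device = torch.device("cuda")
s = torch.cuda.Stream()

t1_cpu_pinned = torch.randn(1024, 1024, pin_memory=True)
t3_cuda = torch.randn(1024, 1024, device=device)

# H2D copy on a side stream; the event marks its completion on that stream
with torch.cuda.stream(s):
    t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
    copy_event = s.record_event()

# Runs on the default stream and can overlap the pinned copy on s
t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
mul_event = torch.cuda.current_stream().record_event()

# Order as it appears in the plot: the multiplication-stream event first,
# then the transfer-stream event (the reverse of the tutorial code)
mul_event.synchronize()
copy_event.synchronize()
```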
Hence, I think there may be a slight discrepancy between the code and the images (please correct me if I am misinterpreting anything here). It doesn't change the conclusions at all, but I thought it would be worth pointing out, as it confused me for a moment.
Thanks a lot!
Best,
Shashank