Feedback about A guide on good usage of non_blocking and pin_memory() in PyTorch #3648

@shashkat

Description

I noticed the following issue on this page: https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html#asynchronous-vs-synchronous-operations-with-non-blocking-true-cuda-cudamemcpyasync

In the section "Asynchronous vs. Synchronous Operations with non_blocking=True (CUDA cudaMemcpyAsync)", look at the plot for benchmark_with_profiler(streamed=True, pinned=True). Judging from the pink cudaEventSynchronize blocks, the event on the stream running the tensor multiplication (t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda) appears to be synchronized first, and the event on the stream running the transfer (t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)) is synchronized second.

[Image: profiler trace for benchmark_with_profiler(streamed=True, pinned=True)]

However, the code on the same page does the opposite: t_star_cuda_h2d_event (the event recorded on the transfer stream) is synchronized first, and t3_cuda_h2d_event (the event recorded on the stream running the multiplication) second:

```python
# The function we want to profile
def inner(pinned: bool, streamed: bool):
    with torch.cuda.stream(s) if streamed else contextlib.nullcontext():
        if pinned:
            t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
        else:
            t2_cuda = t2_cpu_paged.to(device, non_blocking=True)
        t_star_cuda_h2d_event = s.record_event()
    # This operation can be executed during the CPU to GPU copy if and only if
    # the tensor is pinned and the copy is done in the other stream
    t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()
    t_star_cuda_h2d_event.synchronize()
    t3_cuda_h2d_event.synchronize()
```

Hence, I think there might be a slight discrepancy between the code and the image (please correct me if I am misinterpreting anything here). It doesn't change the conclusions at all, but I thought it was worth flagging, since it confused me for a second.
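For what it's worth, the synchronize order only affects which event the host thread blocks on first; both events are still waited on before inner() returns, so the timing conclusions hold either way. A minimal sketch of the ordering as written in the tutorial's code (variable names here are my own, and it assumes a CUDA-capable PyTorch install; on a CPU-only machine the CUDA part is skipped):

```python
import contextlib

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    HAS_CUDA = False

def sync_in_code_order(transfer_event, compute_event):
    # Matches the tutorial's code: wait on the transfer-stream event first,
    # then on the compute-stream event. Swapping these two calls changes
    # which event the host blocks on first, not the final state.
    transfer_event.synchronize()
    compute_event.synchronize()

if HAS_CUDA:
    device = torch.device("cuda")
    s = torch.cuda.Stream()  # side stream for the H2D copy
    t_cpu = torch.ones(1024, 1024).pin_memory()
    t_gpu = torch.ones(1024, 1024, device=device)
    with torch.cuda.stream(s):
        t_copied = t_cpu.to(device, non_blocking=True)
        transfer_event = s.record_event()
    t_mul = t_gpu * t_gpu  # runs on the default (current) stream
    compute_event = torch.cuda.current_stream().record_event()
    sync_in_code_order(transfer_event, compute_event)
```

If the trace in the image really shows the compute-stream event being synchronized first, the two cudaEventSynchronize calls in the plot would simply appear in the opposite order to the code above.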

Thanks a lot!
Best,
Shashank
