Feedback about A guide on good usage of non_blocking and pin_memory() in PyTorch #3648

@shashkat

Description

I noticed the following issue on this page: https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html#asynchronous-vs-synchronous-operations-with-non-blocking-true-cuda-cudamemcpyasync

In the section "Asynchronous vs. Synchronous Operations with non_blocking=True (CUDA cudaMemcpyAsync)", look at the plot for benchmark_with_profiler(streamed=True, pinned=True). Judging from the pink cudaEventSynchronize blocks, the event on the stream running the tensor multiplication (t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda) appears to be synchronized first, and the event on the stream running the transfer (t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)) is synchronized second.

[Image: profiler trace for benchmark_with_profiler(streamed=True, pinned=True)]

However, the code on the same page does the opposite: t_star_cuda_h2d_event (the event recorded on the transfer stream) is synchronized first, and t3_cuda_h2d_event (the event recorded on the stream running the multiplication) second:

```python
# The function we want to profile
def inner(pinned: bool, streamed: bool):
    with torch.cuda.stream(s) if streamed else contextlib.nullcontext():
        if pinned:
            t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
        else:
            t2_cuda = t2_cpu_paged.to(device, non_blocking=True)
        t_star_cuda_h2d_event = s.record_event()
    # This operation can be executed during the CPU to GPU copy if and only if
    # the tensor is pinned and the copy is done in the other stream
    t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()
    t_star_cuda_h2d_event.synchronize()
    t3_cuda_h2d_event.synchronize()
```

Hence, I think there might be a slight discrepancy between the code and the image (please correct me if I am misinterpreting anything here). It doesn't change the conclusions at all, but I thought it was worth flagging, since it confused me for a second.
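For what it's worth, the synchronize order only affects which event the host thread blocks on first; both events are still waited on before inner() returns, so the timing conclusions hold either way. A minimal sketch of the ordering as written in the tutorial's code (variable names here are my own, and it assumes a CUDA-capable PyTorch install; on a CPU-only machine the CUDA part is skipped):

```python
import contextlib

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    HAS_CUDA = False

def sync_in_code_order(transfer_event, compute_event):
    # Matches the tutorial's code: wait on the transfer-stream event first,
    # then on the compute-stream event. Swapping these two calls changes
    # which event the host blocks on first, not the final state.
    transfer_event.synchronize()
    compute_event.synchronize()

if HAS_CUDA:
    device = torch.device("cuda")
    s = torch.cuda.Stream()  # side stream for the H2D copy
    t_cpu = torch.ones(1024, 1024).pin_memory()
    t_gpu = torch.ones(1024, 1024, device=device)
    with torch.cuda.stream(s):
        t_copied = t_cpu.to(device, non_blocking=True)
        transfer_event = s.record_event()
    t_mul = t_gpu * t_gpu  # runs on the default (current) stream
    compute_event = torch.cuda.current_stream().record_event()
    sync_in_code_order(transfer_event, compute_event)
```

If the trace in the image really shows the compute-stream event being synchronized first, the two cudaEventSynchronize calls in the plot would simply appear in the opposite order to the code above.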

Thanks a lot!
Best,
Shashank
