Add performance tips tutorial #1065
base: main
Changes from all commits
304fdf9
5693776
e8b2a73
7ac0d2f
a74f653
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,170 @@ | ||
| # Copyright (c) Meta Platforms, Inc. and affiliates. | ||
| # All rights reserved. | ||
| # | ||
| # This source code is licensed under the BSD-style license found in the | ||
| # LICENSE file in the root directory of this source tree. | ||
|
|
||
| """ | ||
| ==================================== | ||
| Performance Tips and Best Practices | ||
| ==================================== | ||
|
|
||
| This tutorial consolidates performance optimization techniques for video | ||
| decoding with TorchCodec. Learn when and how to apply various strategies | ||
| to increase performance. | ||
| """ | ||
|
|
||
|
|
||
| # %% | ||
| # Overview | ||
| # -------- | ||
| # | ||
| # When decoding videos with TorchCodec, several techniques can significantly | ||
| # improve performance depending on your use case. This guide covers: | ||
| # | ||
| # 1. **Batch APIs** - Decode multiple frames at once | ||
| # 2. **Approximate Mode & Keyframe Mappings** - Speed up decoder creation by skipping the initial scan | ||
| # 3. **Multi-threading** - Parallelize decoding across videos or chunks | ||
| # 4. **CUDA Acceleration** - Use GPU decoding for supported formats | ||
| # | ||
| # We'll explore each technique and when to use it. | ||
|
|
||
| # %% | ||
| # 1. Use Batch APIs When Possible | ||
| # -------------------------------- | ||
| # | ||
| # If you need to decode multiple frames at once, the batch methods are faster than calling single-frame decoding methods multiple times. | ||
| # For example, :meth:`~torchcodec.decoders.VideoDecoder.get_frames_at` is faster than calling :meth:`~torchcodec.decoders.VideoDecoder.get_frame_at` multiple times. | ||
| # TorchCodec's batch APIs reduce overhead and can leverage internal optimizations. | ||
| # | ||
| # **Key Methods:** | ||
| # | ||
| # - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_at` for specific indices | ||
| # - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_in_range` for ranges | ||
| # - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_played_at` for timestamps | ||
| # - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_played_in_range` for time ranges | ||
| # | ||
| # **When to use:** | ||
| # | ||
| # - Decoding multiple frames | ||
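A minimal sketch of the batch pattern described above, assuming the goal is to sample a handful of evenly spaced frames from one video. The helper names `evenly_spaced_indices` and `decode_sampled_frames` are hypothetical and not part of TorchCodec; only `VideoDecoder`, `len(decoder)`, and `get_frames_at` come from the library.

```python
def evenly_spaced_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return num_samples frame indices spread across [0, num_frames - 1]."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    # Integer arithmetic keeps both the first and the last frame in the selection.
    return [i * (num_frames - 1) // (num_samples - 1) for i in range(num_samples)]


def decode_sampled_frames(video_path: str, num_samples: int = 8):
    """Decode evenly spaced frames with a single batch call (hypothetical helper)."""
    # Imported lazily so the index helper works without TorchCodec installed.
    from torchcodec.decoders import VideoDecoder

    decoder = VideoDecoder(video_path)
    indices = evenly_spaced_indices(len(decoder), num_samples)
    # One batch call instead of a Python loop over decoder.get_frame_at(i).
    return decoder.get_frames_at(indices=indices)
```

The batch call returns the frames together, so the caller does not need to stack per-frame results itself.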
|
|
||
| # %% | ||
| # .. note:: | ||
| # | ||
| # For complete examples with runnable code demonstrating batch decoding, | ||
| # iteration, and frame retrieval, see: | ||
| # | ||
| # - :ref:`sphx_glr_generated_examples_decoding_basic_example.py` | ||
|
|
||
| # %% | ||
| # 2. Approximate Mode & Keyframe Mappings | ||
| # ---------------------------------------- | ||
| # | ||
| # By default, TorchCodec uses ``seek_mode="exact"``, which performs a :term:`scan` when | ||
| # the decoder is created to build an accurate internal index of frames. This | ||
| # ensures frame-accurate seeking but takes longer for decoder initialization, | ||
| # especially on long videos. | ||
|
|
||
| # %% | ||
| # **Approximate Mode** | ||
| # ~~~~~~~~~~~~~~~~~~~~ | ||
| # | ||
| # Setting ``seek_mode="approximate"`` skips the initial :term:`scan` and relies on the | ||
| # video file's metadata headers. This dramatically speeds up | ||
| # :class:`~torchcodec.decoders.VideoDecoder` creation, particularly for long | ||
| # videos, but may result in slightly less accurate seeking in some cases. | ||
| # | ||
| # | ||
| # **Which mode should you use:** | ||
| # | ||
| # - If you care about exactness of frame seeking, use ``seek_mode="exact"``. | ||
| # - If the video is long and you're only decoding a small number of frames, ``seek_mode="approximate"`` should be faster. | ||
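One way to check which mode suits a given video is to time decoder creation under each `seek_mode`. The sketch below is an illustration only: `time_decoder_creation` and `video_decoder_factory` are hypothetical helpers, and the only TorchCodec call assumed is `VideoDecoder(path, seek_mode=...)`.

```python
import time


def time_decoder_creation(make_decoder, seek_mode: str):
    """Call make_decoder(seek_mode) and return (decoder, elapsed_seconds)."""
    start = time.perf_counter()
    decoder = make_decoder(seek_mode)
    return decoder, time.perf_counter() - start


def video_decoder_factory(video_path: str):
    """Return a callable that builds a VideoDecoder for a given seek mode."""

    def make_decoder(seek_mode: str):
        # Imported lazily so the timing helper works without TorchCodec installed.
        from torchcodec.decoders import VideoDecoder

        return VideoDecoder(video_path, seek_mode=seek_mode)

    return make_decoder
```

Calling `time_decoder_creation(video_decoder_factory(path), mode)` once with `"exact"` and once with `"approximate"` gives a rough comparison of initialization cost. It says nothing about seeking accuracy, which has to be checked separately.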
|
|
||
| # %% | ||
| # **Custom Frame Mappings** | ||
| # ~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| # | ||
| # For advanced use cases, you can pre-compute a custom mapping between desired | ||
| # frame indices and actual keyframe locations. This allows you to speed up :class:`~torchcodec.decoders.VideoDecoder` | ||
| # instantiation while maintaining the frame seeking accuracy of ``seek_mode="exact"``. | ||
| # | ||
| # **When to use:** | ||
| # | ||
| # - Frame accuracy is critical, so approximate mode cannot be used | ||
| # - Videos can be preprocessed once and then decoded many times | ||
| # | ||
| # **Performance impact:** Enables consistent, predictable performance for repeated | ||
| # random access without the overhead of exact mode's scanning. | ||
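The preprocess-once workflow can be sketched as follows, assuming the mapping is produced with `ffprobe` and passed to the decoder through the `custom_frame_mappings` parameter. The helper names are hypothetical, and the `frame=pts,duration,key_frame` field names are an assumption that can differ across FFmpeg versions, so check them against the linked custom frame mappings example.

```python
import subprocess


def build_ffprobe_command(video_path: str, stream_index: int = 0) -> list[str]:
    """Build an ffprobe command that dumps per-frame metadata as JSON."""
    return [
        "ffprobe", "-i", video_path,
        "-select_streams", str(stream_index),
        "-show_frames",
        # Assumed field names; older FFmpeg versions may name these differently.
        "-show_entries", "frame=pts,duration,key_frame",
        "-of", "json",
    ]


def write_frame_mappings(video_path: str, json_path: str, stream_index: int = 0) -> None:
    """Run ffprobe once and store its JSON output for later reuse."""
    result = subprocess.run(
        build_ffprobe_command(video_path, stream_index),
        check=True, capture_output=True, text=True,
    )
    with open(json_path, "w") as f:
        f.write(result.stdout)


def open_with_frame_mappings(video_path: str, json_path: str):
    """Create a decoder from a previously stored mapping file."""
    from torchcodec.decoders import VideoDecoder  # lazy import

    with open(json_path) as f:
        return VideoDecoder(video_path, custom_frame_mappings=f)
```

The cost of `write_frame_mappings` is paid once per video. Every later call to `open_with_frame_mappings` reuses the stored JSON instead of rescanning the file.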
|
|
||
| # %% | ||
| # .. note:: | ||
| # | ||
| # For complete benchmarks showing actual speedup numbers, accuracy comparisons, | ||
| # and implementation examples, see: | ||
| # | ||
| # - :ref:`sphx_glr_generated_examples_decoding_approximate_mode.py` | ||
| # | ||
| # - :ref:`sphx_glr_generated_examples_decoding_custom_frame_mappings.py` | ||
|
|
||
| # %% | ||
| # 3. Multi-threading for Parallel Decoding | ||
| # ----------------------------------------- | ||
| # | ||
| # When decoding multiple videos or decoding a large number of frames from a single video, there are a few parallelization strategies to speed up the decoding process: | ||
| # | ||
| # - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames | ||
| # - **Multiprocessing** - Distributing work across multiple processes | ||
| # - **Multithreading** - Using multiple threads within a single process | ||
| # | ||
| # Both multiprocessing and multithreading can be used to decode multiple videos in parallel, or to decode a single long video in parallel by splitting it into chunks. | ||
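The chunking strategy for a single long video could look like the sketch below. It assumes each worker thread creates its own `VideoDecoder` rather than sharing one across threads. `split_into_chunks`, `decode_chunk_with_torchcodec`, and `decode_in_threads` are hypothetical names used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(indices, num_chunks: int):
    """Split indices into at most num_chunks contiguous, near-equal chunks."""
    indices = list(indices)
    if not indices:
        return []
    num_chunks = max(1, min(num_chunks, len(indices)))
    size, extra = divmod(len(indices), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional index each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(indices[start:end])
        start = end
    return chunks


def decode_chunk_with_torchcodec(video_path: str, chunk):
    """Decode one chunk of frame indices with a decoder owned by this worker."""
    from torchcodec.decoders import VideoDecoder  # lazy import

    decoder = VideoDecoder(video_path)
    return decoder.get_frames_at(indices=chunk).data


def decode_in_threads(video_path, indices, num_workers=4,
                      decode_chunk=decode_chunk_with_torchcodec):
    """Decode chunks in parallel threads and return results in chunk order."""
    chunks = split_into_chunks(indices, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in input order, so frame order is preserved.
        return list(pool.map(lambda chunk: decode_chunk(video_path, chunk), chunks))
```

The `decode_chunk` parameter exists so the same scaffolding can be reused with a different per-chunk function. A process pool could be swapped in for the thread pool, although a lambda cannot be pickled and would need to be replaced with a module-level function.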
|
|
||
|
||
| # %% | ||
| # .. note:: | ||
| # | ||
| # For complete examples comparing | ||
| # sequential, FFmpeg-based parallelism, multi-process, and multi-threaded approaches, see: | ||
| # | ||
| # - :ref:`sphx_glr_generated_examples_decoding_parallel_decoding.py` | ||
|
|
||
| # %% | ||
| # 4. CUDA Acceleration | ||
| # -------------------- | ||
| # | ||
| # TorchCodec supports GPU-accelerated decoding using NVIDIA's hardware decoder | ||
| # (NVDEC) on supported hardware. This keeps decoded tensors in GPU memory, | ||
| # avoiding expensive CPU-GPU transfers for downstream GPU operations. | ||
| # | ||
| # **When to use:** | ||
| # | ||
| # - You are decoding high-resolution videos | ||
| # - You are decoding a large batch of videos that saturates the CPU | ||
| # - Your pipeline is GPU-intensive, with transforms like scaling and cropping | ||
| # - The CPU is saturated and you want to free it up for other work | ||
| # | ||
| # **When NOT to use:** | ||
| # | ||
| # - You need bit-exact results | ||
| # - You are decoding low-resolution videos and the PCIe transfer latency is large | ||
| # - The GPU is already busy and the CPU is idle | ||
|
Contributor
super nit: this is a personal writing style preference, but within a section let's consistently use either active or passive voice. For example, we could remove "you" from the first bullet point and use the passive voice instead: "bit-exact results are needed".
Contributor
Let's consistently use active voice. :) |
||
| # | ||
| # **Performance impact:** CUDA decoding can significantly outperform CPU decoding, | ||
| # especially for high-resolution videos and when combined with GPU-based transforms. | ||
| # Actual speedup varies by hardware, resolution, and codec. | ||
|
Contributor
I think it's good to have those bullet points here. They overlap with what is already in the CUDA decoding tutorial, and I think we'll want to remove them from there and have them here instead. Eventually we'll also want to update the CUDA tutorial to explain to users how to check whether they're falling back to the CPU. Mainly here in this tutorial, I think we should insist on one thing (as the main point): users should be using the Beta interface:
with set_cuda_backend("beta"):
    dec = VideoDecoder("file.mp4", device="cuda") |
||
|
|
||
| # %% | ||
| # **Recommended Usage for Beta Interface** | ||
| # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| # | ||
| # .. code-block:: python | ||
| # | ||
| #     from torchcodec.decoders import VideoDecoder, set_cuda_backend | ||
| # | ||
| #     with set_cuda_backend("beta"): | ||
| #         decoder = VideoDecoder("file.mp4", device="cuda") | ||
| # | ||
|
|
||
| # %% | ||
| # .. note:: | ||
| # | ||
| # For installation instructions, detailed examples, and visual comparisons | ||
| # between CPU and CUDA decoding, see: | ||
| # | ||
| # - :ref:`sphx_glr_generated_examples_decoding_basic_cuda_example.py` | ||
Maybe it would be clearer to group the similar functions here?
For example, we could add two small headers to group index based vs timestamp based retrieval:

For index based frame retrieval:
- get_frames_at
- get_frames_in_range

For timestamp based frame retrieval:
...