Skip to content

Conversation

@scotts
Copy link
Contributor

@scotts scotts commented Nov 18, 2025

I've thought this was strange for a long time now - on main, in the public VideoDecoder and AudioDecoder, we add a stream before getting the metadata. This was not the originally intended order, as evidenced by some of the error checking we do:

if stream_index is None:
if (stream_index := container_metadata.best_video_stream_index) is None:
raise ValueError(
"The best video stream is unknown and there is no specified stream. "
+ ERROR_REPORTING_INSTRUCTIONS
)

We should never hit that error condition, as before we call it, we add the stream. And if the video file has no best video stream, the C++ layer would have thrown before we ever had a chance to reach this condition. I feel that it's more natural to do things in the order in this PR: first get the metadata from the file, then add the stream if the metadata is valid.

The reason why I'm doing this now is that this should simplify the decoder-native transforms. We'll want to know a video stream's height and width when pre-processing the transforms before adding a stream. And that means getting that metadata before adding a stream. In the C++ layer, this does mean accessing values in the headers in initializeDecoder() through AVCodecParameters that we didn't before.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 18, 2025
@scotts scotts marked this pull request as ready for review November 18, 2025 03:00
Comment on lines 527 to 530
// This metadata was already set in initializeDecoder() from the
// AVCodecParameters that are part of the AVStream. But we consider the
// AVCodecContext to be more authoritative, so we use that for our decoding
// stream.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, the AVCodecContext fields were set to those of the AVCodecParameters when we called avcodec_parameters_to_context just above in addStream:

int retVal = avcodec_parameters_to_context(
streamInfo.codecContext.get(), streamInfo.stream->codecpar);

I think it's best to remove the lines below and trust that avcodec_parameters_to_context is doing what we expect it to do. Right now, we are setting the streamMetadata in a lot of different places and it makes it harder to reason about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good call! Yup, I'm happy to remove more code. :)

@NicolasHug
Copy link
Contributor

oh, I approved but I wonder if the docs failure is real 🤔

 File "/__w/_temp/conda_environment_19470203429/lib/python3.10/site-packages/torchcodec/decoders/_video_decoder.py", line 423, in _get_and_validate_stream_metadata
    raise ValueError(
ValueError: The minimum pts value in seconds is unknown. 
This should never happen. Please report an issue following the steps in

// not the constructor because we need to know the stream index. If we
// can encode the relevant stream indices into custom frame mappings
// itself, then we can put it in the constructor.
writeFallbackBasedMetadata(map, streamMetadata, SeekMode::approximate);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand these workarounds are needed right now, but I have a really hard time cleanly reasoning about all the comments above.

We might eventually want to revisit the existence of addStream? Maybe we should just have a constructor, just like we do in Python. I think all the "add stream" logic is mainly a relic of when the decoder was potentially thought to be a multi-stream decoder, but it seems like it's hurting us now

We still want to enable existing use-cases of users getting metadata without having to scan, but I'm pretty sure we can support that by passing approximate mode (that's what we do from the public Python APIs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NicolasHug, you're right, that's actually the cleanest resolution here: there's no value any more in differentiating the constructor from adding a stream. Created #1064 for follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To double check my understanding, the expected behaviour for custom frame mappings is

  • like exact mode for the active stream
  • like approximate mode for non active streams

The reason we have to make this distinction with these conditions is because we don't know the active stream index during construction (since we addStream separately from the constructor). Once we consolidate addStream into the constructor, we would be able to get rid of all the conditions and just call
writeFallbackBasedMetadata(map, streamMetadata, seekMode);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding is correct @mollyxu !

@scotts scotts merged commit 04b02b9 into meta-pytorch:main Nov 19, 2025
77 of 78 checks passed
@scotts scotts deleted the refactor_metadata_order branch November 19, 2025 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants