The simplest way to deal with variable frame-rates is to edit how we peform axial RoPE such that actual timestamp is used for rotation, so that a model can
- Learn from variable framerates but have a positional encoding that expresses timestamp
- Allow for token dropping during inference so that when looking far into the past, we reduce framerate. This should allow for signifigantly longer context lengths during inference.
The simplest way to deal with variable frame-rates is to edit how we peform axial RoPE such that actual timestamp is used for rotation, so that a model can