Issue:
Typical pretrained text-to-video models use a temporal autoencoder, e.g. one that compresses 128 frames at 512x512 into an 8x8x8 latent, so the time axis is compressed along with the spatial axes. Our autoencoders are purely spatial, to allow low-latency decoding and frame presentation in high-FPS simulations. As a result, the WM needs a longer context to cover the same time span, and we cannot use any existing video diffusion models.
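As a rough illustration of the context-length cost, the sketch below counts latent positions for the clip above under both schemes. The 64x spatial and 16x temporal factors are inferred from the 128-frame, 512x512 to 8x8x8 example. Applying the same 64x spatial factor to the spatial-only case is an assumption made purely for comparison, and `latent_positions` is a hypothetical helper, not part of our codebase.

```python
def latent_positions(frames, height, width, spatial_factor, temporal_factor=1):
    """Count latent positions for a clip under the given downsampling factors."""
    t = -(-frames // temporal_factor)  # ceiling division along time
    h = height // spatial_factor
    w = width // spatial_factor
    return t * h * w


# Temporal autoencoder: 128 x 512 x 512 -> 8 x 8 x 8 (factors inferred from the example).
temporal = latent_positions(128, 512, 512, spatial_factor=64, temporal_factor=16)

# Spatial-only autoencoder: ASSUMED to use the same 64x spatial factor, no temporal compression.
spatial_only = latent_positions(128, 512, 512, spatial_factor=64)

ratio = spatial_only // temporal
print(temporal, spatial_only, ratio)
```

Under these assumptions the spatial-only latent has as many time steps as the clip has frames, so the context grows by exactly the temporal factor.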
What We Need:
- Some way to decode one new frame from a temporally compressed latent video
- Some way to add new frames into the latent without re-encoding the entire video
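One possible shape for an interface that meets both requirements is a causal, chunked buffer: frames are compressed in fixed-size groups, so appending a frame only ever encodes the newest group, and presenting the newest frame only touches the tail of the buffer. The sketch below is a toy under stated assumptions, not a proposed design: `ChunkedTemporalBuffer` is a hypothetical name, frames are plain lists of floats, and averaging over time stands in for a learned temporal encoder and decoder.

```python
from typing import List

Frame = List[float]  # toy stand-in for one frame's flattened spatial latent


class ChunkedTemporalBuffer:
    """Toy causal codec: every `chunk` frames collapse into one latent (their mean)."""

    def __init__(self, chunk: int):
        self.chunk = chunk
        self.latents: List[Frame] = []  # one entry per completed chunk
        self.pending: List[Frame] = []  # newest frames, not yet compressed
        self.encode_calls = 0           # counts chunk encodes to show nothing is re-encoded

    def _encode(self, frames: List[Frame]) -> Frame:
        # ASSUMPTION: mean over time replaces a learned temporal encoder.
        self.encode_calls += 1
        n = len(frames)
        return [sum(values) / n for values in zip(*frames)]

    def append(self, frame: Frame) -> None:
        """Add one frame; only the newest chunk is ever encoded."""
        self.pending.append(frame)
        if len(self.pending) == self.chunk:
            self.latents.append(self._encode(self.pending))
            self.pending = []

    def decode_latest(self) -> Frame:
        """Return the newest frame using only the tail of the buffer."""
        if self.pending:
            return self.pending[-1]
        if not self.latents:
            raise ValueError("buffer is empty")
        # ASSUMPTION: the toy decoder returns the chunk latent itself.
        return self.latents[-1]

    def context_length(self) -> int:
        """Time steps the model would attend over."""
        return len(self.latents) + len(self.pending)


buf = ChunkedTemporalBuffer(chunk=4)
for i in range(10):
    buf.append([float(i), 2.0 * i])

print(buf.context_length(), buf.encode_calls, buf.decode_latest())
```

In this toy, ten frames occupy four context slots (two compressed chunks plus two uncompressed frames) and earlier chunks are never revisited. Whether a learned codec can keep the newest frames sharp enough for presentation is the open question.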