Self-Updating Temporal VAE #11

@shahbuland

Issue:

Typical pretrained text-to-video models use a temporal autoencoder, e.g. one that compresses 128 frames at 512x512 into an 8x8x8 latent, so the video is compressed along the time axis as well as spatially. Our autoencoders are purely spatial, which keeps decoding and frame presentation low-latency for high-FPS simulations. The cost is that every frame contributes its own latent: this increases the context length the world model (WM) has to handle for long time horizons, and it prevents us from using any existing video diffusion models.
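
To put a rough number on the context-length cost, here is a back-of-the-envelope sketch. It rests on two assumptions that are not stated above: that "8x8x8" means time x height x width, and that a purely spatial autoencoder would produce the same 8x8 grid per frame. The real latent shapes may differ.

```python
# Figures from the example above.
frames = 128

# ASSUMPTION: "8x8x8" is read as (time, height, width).
latent_t, latent_h, latent_w = 8, 8, 8

# Latent positions for the temporally compressed video.
temporal_positions = latent_t * latent_h * latent_w

# ASSUMPTION: a purely spatial autoencoder with the same 8x8 grid per frame
# keeps one latent per frame, so nothing is saved along the time axis.
spatial_only_positions = frames * latent_h * latent_w

ratio = spatial_only_positions // temporal_positions

print(temporal_positions)      # 512
print(spatial_only_positions)  # 8192
print(ratio)                   # 16
```

Under these assumptions the spatial-only setup carries 16x more latent positions for the same clip, which is exactly the temporal compression factor (128 frames down to 8).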

What We Need:

  1. Some way to decode one new frame from a temporally compressed latent video
  2. Some way to add new frames into the latent without re-encoding the entire video
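
One possible direction for the second requirement, sketched under assumptions rather than proposed as the design: if the temporal encoder is causal (each latent depends only on past frames), new latents can be appended from a small rolling buffer without touching earlier frames. The toy below is pure Python and uses a fixed weighted sum over time in place of a learned network. `encode_window`, `encode_full`, and `StreamingTemporalEncoder` are hypothetical names made up for this illustration, and it does not address decoding a single new frame.

```python
from collections import deque


def encode_window(window, weights):
    """Weighted sum over time of the frames in `window` (oldest first)."""
    dim = len(window[0])
    return [sum(w * f[d] for w, f in zip(weights, window)) for d in range(dim)]


def encode_full(frames, weights, stride):
    """Reference path: re-encode the whole video with a causal, strided window."""
    k = len(weights)
    zero = [0.0] * len(frames[0])
    latents = []
    for j in range(len(frames) // stride):
        end = (j + 1) * stride  # exclusive; window ends at the chunk's last frame
        # Frames before the start of the video are treated as zeros (causal padding).
        window = [frames[t] if t >= 0 else zero for t in range(end - k, end)]
        latents.append(encode_window(window, weights))
    return latents


class StreamingTemporalEncoder:
    """Hypothetical incremental encoder: keeps only the last k frames."""

    def __init__(self, weights, stride, dim):
        self.weights = weights
        self.stride = stride
        k = len(weights)
        # Pre-fill with zeros so early windows match the causal padding above.
        self.buffer = deque([[0.0] * dim for _ in range(k)], maxlen=k)
        self.seen = 0
        self.latents = []

    def push(self, frame):
        """Add one frame; returns a new latent every `stride` frames, else None."""
        self.buffer.append(frame)
        self.seen += 1
        if self.seen % self.stride == 0:
            latent = encode_window(list(self.buffer), self.weights)
            self.latents.append(latent)
            return latent
        return None


# Toy data: 8 frames, each a 2-value stand-in for a flattened spatial latent.
weights = [0.25, 0.25, 0.25, 0.25]  # temporal kernel of 4 frames
stride = 2                          # 2x temporal compression
frames = [[float(t), float(2 * t)] for t in range(8)]

encoder = StreamingTemporalEncoder(weights, stride, dim=2)
for frame in frames:
    encoder.push(frame)

# Streaming one frame at a time matches re-encoding the full video.
print(encoder.latents == encode_full(frames, weights, stride))
```

The point of the sketch is the invariant in the last line: because the window never looks ahead, earlier latents stay valid as frames arrive, and the per-frame cost depends on the kernel length rather than the video length. Whether a learned causal encoder of this kind gives acceptable quality is an open question here.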
