Self-Updating Temporal VAE

### Issue:  
Typical pretrained text-to-video models use a temporal autoencoder, e.g. one that compresses 128 frames at 512x512 into an 8x8x8 latent, so the video is compressed along time as well as space. Our autoencoders are purely spatial, which allows low-latency decoding and frame presentation for high-FPS simulations. The trade-off is that the WM's context length grows for long time horizons, and we cannot reuse any existing video diffusion models.
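
To make the context-length cost concrete, here is a rough back-of-the-envelope sketch. It assumes the 8x8x8 latent is laid out as (time, height, width), and that our spatial autoencoder would also produce an 8x8 grid per frame. The per-frame grid size is an assumption for illustration only and is not stated in this issue.

```python
def latent_positions(frames, temporal_factor, latent_h, latent_w):
    """Count latent positions for a clip, given a temporal downsampling factor."""
    latent_frames = -(-frames // temporal_factor)  # ceiling division
    return latent_frames * latent_h * latent_w

# Temporal autoencoder: 128 frames -> 8 latent frames (factor 16), 8x8 each.
temporal = latent_positions(128, temporal_factor=16, latent_h=8, latent_w=8)

# Purely spatial autoencoder: one 8x8 latent per frame (assumed grid size).
spatial = latent_positions(128, temporal_factor=1, latent_h=8, latent_w=8)

ratio = spatial // temporal
print(temporal, spatial, ratio)
```

Under these assumptions the spatial-only setup carries 16 times as many latent positions for the same clip, which is the context-length pressure described above.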
  
### What We Need:   
1. Some way to decode one new frame from a temporally compressed latent video  
2. Some way to add new frames into the latent without re-encoding the entire video  
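
One possible shape for these two requirements, sketched under heavy assumptions: frames are grouped into fixed-size temporal chunks and each chunk is encoded independently, so adding a frame only re-encodes the chunk currently being filled, and decoding the newest frame only touches the newest latent. `ChunkedLatentBuffer`, `toy_encode`, and `toy_decode` are hypothetical names, and the mean-over-time "codec" is a stand-in for a real VAE, not a proposal for the actual model.

```python
import numpy as np

class ChunkedLatentBuffer:
    """Hypothetical rolling buffer of per-chunk latents (illustration only)."""

    def __init__(self, chunk_size, encode_chunk, decode_last):
        self.chunk_size = chunk_size
        self.encode_chunk = encode_chunk  # list of frames -> latent
        self.decode_last = decode_last    # (latent, frames in chunk) -> newest frame
        self.closed = []                  # latents of completed chunks, never re-encoded
        self.pending = []                 # raw frames of the chunk still being filled
        self.pending_latent = None

    def append(self, frame):
        if len(self.pending) == self.chunk_size:
            # The previous chunk is full: freeze its latent and start a new chunk.
            self.closed.append(self.pending_latent)
            self.pending = []
        self.pending.append(frame)
        # Only the current chunk is re-encoded, never the whole video.
        self.pending_latent = self.encode_chunk(self.pending)

    def decode_latest(self):
        if self.pending_latent is None:
            raise ValueError("no frames have been appended yet")
        return self.decode_last(self.pending_latent, len(self.pending))

calls = []  # records how many frames each encoder call receives

def toy_encode(frames):
    calls.append(len(frames))
    return np.mean(np.stack(frames), axis=0)  # stand-in for a temporal encoder

def toy_decode(latent, n_frames):
    return latent  # stand-in: a real decoder would reconstruct the newest frame

buf = ChunkedLatentBuffer(chunk_size=4, encode_chunk=toy_encode, decode_last=toy_decode)
for t in range(10):
    buf.append(np.full((8, 8), float(t)))
latest = buf.decode_latest()
```

In this sketch the encoder never receives more than `chunk_size` frames per call, at the cost of keeping the raw frames of the open chunk around. Whether independent chunks give acceptable quality at chunk boundaries is an open question this example does not address.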


Self-Updating Temporal VAE #11
