Issue
VAEs used for world models (WMs) encode each frame independently, so nothing ties consecutive reconstructions together. This creates flickering artifacts as the video progresses.
See here:

Details
This is an open research problem with no obvious solution.
Ideally, a solution should involve tuning only the ResNet decoder.
We have some leads that might work:
- Temporal attention across the decoder's hidden states
- Six-channel discriminator to create a temporal + spatial GAN loss
- An optical-flow loss that matches the flow between generated frames to the flow between the corresponding real frames
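
As a rough illustration of the six-channel discriminator lead, one possible reading is that the discriminator sees two consecutive RGB frames stacked along the channel axis, so it can penalize both per-frame artifacts and frame-to-frame changes. That reading is an assumption, not something settled in this issue. The sketch below only builds such an input with NumPy; the function name `stack_frame_pairs` and the `(T, 3, H, W)` layout are hypothetical, and the discriminator itself is left out.

```python
import numpy as np


def stack_frame_pairs(frames: np.ndarray) -> np.ndarray:
    """Stack each frame with its successor along the channel axis.

    frames: one RGB clip of shape (T, 3, H, W) (assumed layout).
    returns: array of shape (T - 1, 6, H, W).
    """
    if frames.ndim != 4 or frames.shape[1] != 3:
        raise ValueError("expected frames of shape (T, 3, H, W)")
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames to form a pair")
    # Channels 0-2 hold frame t, channels 3-5 hold frame t + 1.
    return np.concatenate([frames[:-1], frames[1:]], axis=1)


# Random data standing in for a real or decoded clip.
rng = np.random.default_rng(0)
real_clip = rng.random((8, 3, 64, 64), dtype=np.float32)
pairs = stack_frame_pairs(real_clip)
```

The same stacking would be applied to decoded clips and to ground-truth clips, so the discriminator compares real frame pairs against generated frame pairs rather than single frames.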