How is the “ground-truth edited video” constructed?

Hi, thanks for the great work on Dream4Drive.
I have a question regarding the training supervision for the diffusion model.
From the paper, I understand that the model takes dense 3D-aware guidance maps as input:
depth map D
normal map N
edge map E
object image O
object mask M
and is fine-tuned to generate edited photorealistic videos.
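
To make my understanding of the inputs concrete, here is a minimal sketch of how I imagine the guidance maps being assembled per clip. The function name `stack_guidance`, the tensor shapes, and the choice of channel-wise concatenation are all my own assumptions for illustration; they are not taken from the paper or the released code.

```python
import numpy as np

def stack_guidance(depth, normal, edge, obj_rgb, obj_mask):
    """Concatenate per-frame guidance maps along the channel axis.

    Assumed (hypothetical) shapes, with T frames of size H x W:
      depth    D: (T, H, W, 1)
      normal   N: (T, H, W, 3)
      edge     E: (T, H, W, 1)
      obj_rgb  O: (T, H, W, 3)
      obj_mask M: (T, H, W, 1)
    """
    maps = [depth, normal, edge, obj_rgb, obj_mask]
    lead = maps[0].shape[:3]
    for m in maps:
        # All maps must share the same frame count and resolution.
        if m.shape[:3] != lead:
            raise ValueError(f"shape mismatch: {m.shape[:3]} vs {lead}")
    return np.concatenate(maps, axis=-1)

# Dummy inputs only; real maps would come from the 3D-aware rendering step.
T, H, W = 2, 4, 6
cond = stack_guidance(
    np.zeros((T, H, W, 1)),  # D
    np.zeros((T, H, W, 3)),  # N
    np.zeros((T, H, W, 1)),  # E
    np.zeros((T, H, W, 3)),  # O
    np.zeros((T, H, W, 1)),  # M
)
print(cond.shape)  # (2, 4, 6, 9)
```

If the actual conditioning is injected differently (for example through separate encoders per map), please correct me; the sketch only describes the inputs, not the RGB target I am asking about.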

However, I am still confused about the source of the target edited RGB videos used during training. Are they rendered directly from the inserted 3D assets, or are they first refined by a pretrained diffusion model and then used as the supervision target for fine-tuning the diffusion model itself? I am trying to understand whether the method involves a form of self-bootstrapping, i.e. self-generated supervision.
Thanks a lot!
