How is the “ground-truth edited video” constructed? #3

@a18805193766-arch

Hi, thanks for the great work on Dream4Drive.
I have a question regarding the training supervision for the diffusion model.
From the paper, I understand that the model takes dense 3D-aware guidance maps as input:
- depth map D
- normal map N
- edge map E
- object image O
- object mask M
and is fine-tuned to generate edited photorealistic videos.

However, I am still confused about the source of the target edited RGB videos used during training. Are they rendered directly from the inserted 3D assets, or are the renders first refined by a pretrained diffusion model and then used as the supervision target for training the diffusion model itself? I am trying to understand whether the method involves a form of self-bootstrapping, i.e. self-generated supervision.
Thanks a lot!
