Confusion on syncam module implementation. #8

Closed
zyhbili opened this issue Feb 6, 2025 · 1 comment

Comments

@zyhbili commented Feb 6, 2025

Thanks for your great work!
I am trying to reproduce the results on CogVideoX-5B, but so far I have not been able to produce synchronized videos.
The following code repeats the text embedding view_num times, which means that (view_num * text_seq_length + visual_tokens_length) tokens take part in self-attention according to CogVideoXAttnProcessor2_0.

norm_encoder_hidden_states = rearrange(norm_encoder_hidden_states, "(b v) n d -> b (v n) d", v=view_num)

I am concerned that this operation is redundant and may dominate the view-sync attention process, thereby hindering its effectiveness.
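For concreteness, here is a minimal shape sketch of this concern. The sizes, tensors, and the concatenation step are illustrative assumptions, not the repository's actual configuration:

```python
# Illustrative shape check (example sizes; not the repository's actual configuration).
import torch
from einops import rearrange

b, v, d = 1, 3, 64                 # batch, view_num, hidden dim (hypothetical values)
text_seq_length = 226              # text tokens per view (example value)
visual_tokens_length = 17550       # visual tokens across all views (hypothetical)

# per-view text embeddings, shape (b * view_num, text_seq_length, d)
norm_encoder_hidden_states = torch.randn(b * v, text_seq_length, d)

# the line quoted above: lays out one copy of the text embedding per view
text_tokens = rearrange(norm_encoder_hidden_states, "(b v) n d -> b (v n) d", v=v)

# a CogVideoX-style processor concatenates text and visual tokens before self-attention,
# so the sequence length becomes view_num * text_seq_length + visual_tokens_length
visual_tokens = torch.randn(b, visual_tokens_length, d)
hidden_states = torch.cat([text_tokens, visual_tokens], dim=1)
assert hidden_states.shape[1] == v * text_seq_length + visual_tokens_length
```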

Besides, I would like to ask about a few more training details.
(1) Dataset scale and frame length. Is 49 frames an appropriate training frame length? Are 6000 samples sufficient to fine-tune SynCamMaster?
(2) How many views do you use in training and inference, respectively? Currently, I randomly sample 3 views per sample and treat the first view as the reference view. Training occupies around 61 GB on each card with the gradient_checkpointing trick.
(3) How many cards are required?

@JianhongBai (Collaborator)

Hi @zyhbili, thanks for your interest!

> Repeating the text embedding view_num times is redundant and may dominate the view-sync attention process, thereby hindering its effectiveness.

Yes, it's more reasonable for the multi-view synchronization module to perform self-attention solely on visual features. You can refer to the implementation we recently released.
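For illustration only, here is a toy sketch of what "self-attention solely on visual features" could look like. The class name, the use of nn.MultiheadAttention, and the flattening scheme are assumptions made for this sketch; the released code is the reference for the actual module:

```python
# Toy sketch of a multi-view sync block that attends only over visual tokens.
# Text tokens are excluded entirely; all views' visual tokens attend to each other.
import torch
import torch.nn as nn
from einops import rearrange

class ViewSyncAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor, view_num: int) -> torch.Tensor:
        # visual_tokens: (b * view_num, n, d) -- per-view visual features, no text tokens
        x = rearrange(visual_tokens, "(b v) n d -> b (v n) d", v=view_num)
        out, _ = self.attn(x, x, x)   # joint self-attention across all views' visual tokens
        return rearrange(out, "b (v n) d -> (b v) n d", v=view_num)

# Usage with made-up sizes: 2 views, 1350 visual tokens per view, hidden dim 128
sync = ViewSyncAttention(dim=128, num_heads=8)
out = sync(torch.randn(2, 1350, 128), view_num=2)
print(out.shape)   # torch.Size([2, 1350, 128])
```

The sketch only shows the attention pattern (visual tokens only, shared across views); where such a block sits inside the transformer is best taken from the released implementation.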

> Dataset scale and frame length. Is 49 frames an appropriate training frame length? Are 6000 samples sufficient to fine-tune SynCamMaster?

I believe that maintaining the same number of video frames as the base model is optimal, as it allows for maximizing the model's generative capabilities. Regarding the data volume, I think the scale of the training data primarily affects the model's generalization ability. A feasible validation approach would be to first train on a smaller dataset to verify the fine-tuned model's ability to generate synchronized videos, and then scale up the dataset to enhance the model's generalization ability.

> How many views do you use in training and inference, respectively? Currently, I randomly sample 3 views per sample and treat the first view as the reference view. Training occupies around 61 GB on each card with the gradient_checkpointing trick.

I have tested with view_num set to 2 or 4, allowing the model to simultaneously generate 2 or 4 synchronized videos, and it works.

> How many cards are required?

I trained using 8 GPUs, with a batch size of 1 per GPU. A larger batch size should yield better results.

@zyhbili closed this as completed Apr 15, 2025