Confusion on syncam module implementation. #8

Closed
zyhbili opened this issue Feb 6, 2025 · 1 comment

Comments

@zyhbili commented Feb 6, 2025

Thanks for your great work!
I am trying to reproduce the results on CogVideoX-5B, but so far I have not been able to produce synchronized videos.
The following code repeats the text embedding view_num times, which means that (view_num * text_seq_length + visual_tokens_length) tokens take part in self-attention according to CogVideoXAttnProcessor2_0.

norm_encoder_hidden_states = rearrange(norm_encoder_hidden_states, "(b v) n d -> b (v n) d", v=view_num)

I am concerned that this operation is redundant and may dominate the view-sync attention process, thereby hindering its effectiveness.
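For concreteness, here is a minimal shape sketch of this concern. The sizes, tensors, and the concatenation step are illustrative assumptions, not the repository's actual configuration:

```python
# Illustrative shape check (example sizes; not the repository's actual configuration).
import torch
from einops import rearrange

b, v, d = 1, 3, 64                 # batch, view_num, hidden dim (hypothetical values)
text_seq_length = 226              # text tokens per view (example value)
visual_tokens_length = 17550       # visual tokens across all views (hypothetical)

# per-view text embeddings, shape (b * view_num, text_seq_length, d)
norm_encoder_hidden_states = torch.randn(b * v, text_seq_length, d)

# the line quoted above: lays out one copy of the text embedding per view
text_tokens = rearrange(norm_encoder_hidden_states, "(b v) n d -> b (v n) d", v=v)

# a CogVideoX-style processor concatenates text and visual tokens before self-attention,
# so the sequence length becomes view_num * text_seq_length + visual_tokens_length
visual_tokens = torch.randn(b, visual_tokens_length, d)
hidden_states = torch.cat([text_tokens, visual_tokens], dim=1)
assert hidden_states.shape[1] == v * text_seq_length + visual_tokens_length
```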

Besides, I would like to ask about a few more training details.
(1) Dataset scale and frame length. Is 49 frames an appropriate training frame length? Are 6000 samples sufficient to fine-tune SynCamMaster?
(2) How many views do you use in training and inference, respectively? Currently, I randomly sample 3 views per sample and treat the first view as the reference view. Training occupies around 61 GB on each card with the gradient_checkpointing trick.
(3) How many cards are required?

@JianhongBai (Collaborator)

Hi @zyhbili, thanks for your interest!

> Repeating the text embedding view_num times is redundant and may dominate the view-sync attention process, thereby hindering its effectiveness.

Yes, it's more reasonable for the multi-view synchronization module to perform self-attention solely on visual features. You can refer to the implementation we recently released.
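For illustration only, here is a toy sketch of what "self-attention solely on visual features" could look like. The class name, the use of nn.MultiheadAttention, and the flattening scheme are assumptions made for this sketch; the released code is the reference for the actual module:

```python
# Toy sketch of a multi-view sync block that attends only over visual tokens.
# Text tokens are excluded entirely; all views' visual tokens attend to each other.
import torch
import torch.nn as nn
from einops import rearrange

class ViewSyncAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor, view_num: int) -> torch.Tensor:
        # visual_tokens: (b * view_num, n, d) -- per-view visual features, no text tokens
        x = rearrange(visual_tokens, "(b v) n d -> b (v n) d", v=view_num)
        out, _ = self.attn(x, x, x)   # joint self-attention across all views' visual tokens
        return rearrange(out, "b (v n) d -> (b v) n d", v=view_num)

# Usage with made-up sizes: 2 views, 1350 visual tokens per view, hidden dim 128
sync = ViewSyncAttention(dim=128, num_heads=8)
out = sync(torch.randn(2, 1350, 128), view_num=2)
print(out.shape)   # torch.Size([2, 1350, 128])
```

The sketch only shows the attention pattern (visual tokens only, shared across views); where such a block sits inside the transformer is best taken from the released implementation.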

> Dataset scale and frame length. Is 49 frames an appropriate training frame length? Are 6000 samples sufficient to fine-tune SynCamMaster?

I believe that maintaining the same number of video frames as the base model is optimal, as it allows for maximizing the model's generative capabilities. Regarding the data volume, I think the scale of the training data primarily affects the model's generalization ability. A feasible validation approach would be to first train on a smaller dataset to verify the fine-tuned model's ability to generate synchronized videos, and then scale up the dataset to enhance the model's generalization ability.

> How many views do you use in training and inference, respectively? Currently, I randomly sample 3 views per sample and treat the first view as the reference view. Training occupies around 61 GB on each card with the gradient_checkpointing trick.

I have tested with view_num set to 2 or 4, allowing the model to simultaneously generate 2 or 4 synchronized videos, and it works.

> How many cards are required?

I trained using 8 GPUs, with a batch size of 1 per GPU. A larger batch size should yield better results.

@zyhbili closed this as completed Apr 15, 2025