Thanks for your great work!
I am trying to reproduce the results on CogVideoX-5B, but so far I have failed to produce synchronized videos.
The following code (SynCamMaster/syncammaster/transformer_3d.py, line 127 at b4c60fe) repeats the text embedding view_num times (once per view), which means that (view_num * text_seq_length + visual_tokens_length) tokens are involved in self-attention according to CogVideoXAttnProcessor2_0.
norm_encoder_hidden_states=rearrange(norm_encoder_hidden_states, "(b v) n d -> b (v n) d", v=view_num)
I am concerned that this operation is redundant and may dominate the view-sync attention process, thereby hindering its effectiveness.
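To make the concern concrete, here is a minimal, self-contained sketch of the token counting (the tensor sizes are made-up examples, not the repo's actual configuration):

```python
# Sketch of why repeating the text embedding inflates the view-sync attention
# sequence to view_num * text_seq_length + visual tokens (example sizes only).
import torch
from einops import rearrange

b, view_num, text_len, d = 1, 3, 226, 64        # hypothetical sizes
vis_len_per_view = 1350                         # hypothetical visual tokens per view

text = torch.randn(b * view_num, text_len, d)                  # per-view text embeddings
visual = torch.randn(b, view_num * vis_len_per_view, d)        # all views' visual tokens

text_mv = rearrange(text, "(b v) n d -> b (v n) d", v=view_num)  # the questioned op
attn_tokens = torch.cat([text_mv, visual], dim=1)
print(attn_tokens.shape[1])  # = view_num * text_len + view_num * vis_len_per_view
```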
Besides, I would like to know more training details.
(1) Dataset scale and frame length. Is 49 frames an appropriate training frame length? Are 6,000 samples sufficient to fine-tune SynCamMaster?
(2) How many views do you use in the training and inference procedures, respectively? Currently, I randomly sample 3 views per sample and treat the first view as the reference view. Training occupies around 61 GB on each card with the gradient_checkpointing trick.
(3) How many GPUs are required?
Repeating the text embedding view_num times is redundant and may dominate the view-sync attention process, thereby hindering its effectiveness.
Yes, it's more reasonable for the multi-view synchronization module to perform self-attention solely on visual features. You can refer to the implementation we recently released.
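For illustration, a rough sketch of what a visual-only multi-view synchronization block could look like is below. This is an assumption-laden example built on nn.MultiheadAttention, not the released implementation:

```python
# A minimal sketch, assuming per-view visual features of shape (b * v, n, d);
# text embeddings are deliberately excluded from this attention.
import torch
import torch.nn as nn
from einops import rearrange

class MultiViewSyncAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, view_num):
        # fold the view dimension into the sequence so tokens attend across views
        x = rearrange(visual_tokens, "(b v) n d -> b (v n) d", v=view_num)
        x_norm = self.norm(x)
        out, _ = self.attn(x_norm, x_norm, x_norm, need_weights=False)
        out = rearrange(out, "b (v n) d -> (b v) n d", v=view_num)
        return visual_tokens + out  # residual connection

# usage sketch:
# sync = MultiViewSyncAttention(dim=64)
# hidden = sync(torch.randn(3, 1350, 64), view_num=3)   # b=1, v=3
```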
Dataset scale and frame length. Is 49 frames an appropriate training frame length? Are 6,000 samples sufficient to fine-tune SynCamMaster?
I believe that maintaining the same number of video frames as the base model is optimal, as it allows for maximizing the model's generative capabilities. Regarding the data volume, I think the scale of the training data primarily affects the model's generalization ability. A feasible validation approach would be to first train on a smaller dataset to verify the fine-tuned model's ability to generate synchronized videos, and then scale up the dataset to enhance the model's generalization ability.
How many views do you use in the training and inference procedures, respectively? Currently, I randomly sample 3 views per sample and treat the first view as the reference view. Training occupies around 61 GB on each card with the gradient_checkpointing trick.
I have tested with view_num set to 2 or 4, allowing the model to simultaneously generate 2 or 4 synchronized videos, and it works.
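As a purely illustrative sketch (latent sizes and the camera-pose format are assumptions, not the released code), preparing a batch for view_num synchronized videos could look like:

```python
# Hypothetical sketch: one noise latent per camera view, stacked so the
# transformer sees a batch of size (b * view_num).
import torch

view_num = 4                               # e.g. 2 or 4 synchronized videos
frames, c, h, w = 13, 16, 60, 90           # example latent sizes, not the real config
latents = torch.randn(view_num, frames, c, h, w)   # shared scene, different viewpoints
cam_poses = torch.randn(view_num, 12)              # e.g. flattened 3x4 extrinsics (assumed format)
# the view-sync attention is what couples the views so the generated
# videos stay synchronized across cameras.
```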
How many GPUs are required?
I trained using 8 GPUs, with a batch size of 1 per GPU. A larger batch size should yield better results.