Thanks for your great work! I noticed that in all four training stages the model is reloaded with:
```python
model = CoVTForConditionalGeneration.from_pretrained(
    model_args.model_path,
    torch_dtype=compute_dtype,
    attn_implementation="flash_attention_2" if not training_args.disable_flash_attn2 else "sdpa",
    **bnb_model_from_pretrained_args
)
```
The following code in the initialization of `CoVTForConditionalGeneration` creates the variables used by the SAM cross-attention:
```python
self.sam_projection = nn.Linear(3584, 256)
self.sam_query_vectors = nn.Parameter(torch.randn(8, 256, dtype=torch.bfloat16, requires_grad=True))
self.sam_cross_attention = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
```
This means these variables are randomly re-initialized at the start of each of the four stages. However, `sam_query_vectors` and `sam_projection` are trainable. Why not save these variables at the end of each stage and load them back into the model at the start of the next, so that the `sam_query_vectors` and `sam_projection` used across the four stages stay consistent and uninterrupted?
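For concreteness, here is a minimal sketch of what I mean. The helper names (`save_sam_modules`, `load_sam_modules`) and the checkpoint layout are my own assumptions, not part of the repo; the idea is just to persist the SAM-related parameters after one stage and restore them after `from_pretrained` re-initializes them in the next:

```python
import torch
import torch.nn as nn

# Hypothetical helpers (names are my own): checkpoint only the SAM
# cross-attention parameters so they survive across training stages.

def save_sam_modules(model, path):
    # Collect just the SAM-related parameters into one file.
    state = {
        "sam_projection": model.sam_projection.state_dict(),
        "sam_query_vectors": model.sam_query_vectors.detach().cpu(),
        "sam_cross_attention": model.sam_cross_attention.state_dict(),
    }
    torch.save(state, path)

def load_sam_modules(model, path):
    # Overwrite the freshly re-initialized SAM parameters with the
    # values trained in the previous stage.
    state = torch.load(path, map_location="cpu")
    model.sam_projection.load_state_dict(state["sam_projection"])
    with torch.no_grad():
        model.sam_query_vectors.copy_(state["sam_query_vectors"])
    model.sam_cross_attention.load_state_dict(state["sam_cross_attention"])
```

Calling `save_sam_modules(model, "sam_modules.pt")` at the end of one stage and `load_sam_modules(model, "sam_modules.pt")` right after `from_pretrained` in the next would keep these weights continuous across stages.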