In this paper, it is mentioned that:
"Ultimately, we concatenate visual features of all frames along the time dimension with special tokens interleaved among the sequence to model the temporal inter-dependencies. The resulting visual sequence is then projected into LLM embedding space via the multi-modal projector as the final output of the visual branch, denoted as 𝑍_𝑉∈𝑅^{(ℎ×𝑤×𝑛)×𝑐}."
I have the following questions regarding this process:
(1) What exactly are the "special tokens" interleaved among the visual sequence? Are they fixed embeddings, learnable parameters, or derived from any other sources?
(2) If 𝑘 special tokens are inserted, shouldn't the temporal length of the sequence increase, resulting in 𝑍_𝑉∈𝑅^{(ℎ×𝑤×(𝑛+𝑘))×𝑐}? Could you clarify how the dimensionality is computed in this case? Your clarification would be greatly appreciated.
Sorry for the late response; I just noticed this issue in the project. As shown in Figure 2, we first encode the frames into video features, which we denote as Z_V with shape (h, w, n, c) in the paper. We then insert special tokens (<vi_start>, etc.) within the video tokens to form the final visual sequence, which, as you note, has dimension (h, w, (n+k), c).
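To make the shape bookkeeping concrete, here is a minimal numpy sketch of the interleaving described above. It assumes the special tokens are learnable embeddings broadcast across the spatial grid; the function name and the choice to prepend the tokens (rather than scatter them between frames) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def interleave_special_tokens(video_feats, special_tokens):
    """Insert k special-token embeddings along the time axis.

    video_feats:    (h, w, n, c) per-frame visual features Z_V
    special_tokens: (k, c) embeddings, e.g. <vi_start> (assumed learnable)
    returns:        flattened sequence of shape (h*w*(n+k), c)
    """
    h, w, n, c = video_feats.shape
    k = special_tokens.shape[0]
    # Broadcast each special token over the (h, w) spatial grid so the
    # tensor stays rectangular: (k, c) -> (h, w, k, c).
    special = np.broadcast_to(special_tokens, (h, w, k, c))
    # Simplest placement: prepend the special "frames" along the time
    # axis; a real scheme may interleave them between frames instead.
    seq = np.concatenate([special, video_feats], axis=2)  # (h, w, n+k, c)
    # Flatten spatial and temporal axes into one token sequence for the
    # multi-modal projector.
    return seq.reshape(h * w * (n + k), c)

Z_V = interleave_special_tokens(np.zeros((2, 3, 4, 8)), np.zeros((1, 8)))
print(Z_V.shape)  # (2*3*(4+1), 8) = (30, 8)
```

This matches the point above: with k inserted tokens the flattened sequence length is h×w×(n+k), not h×w×n.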