Clarification on Special Tokens in Visual Sequence and Dimensionality of 𝑍_𝑉 #8

Closed
PANPANKK opened this issue Dec 21, 2024 · 1 comment

Comments


PANPANKK commented Dec 21, 2024

In this paper, it is mentioned that:
"Ultimately, we concatenate visual features of all frames along the time dimension with special tokens interleaved among the sequence to model the temporal inter-dependencies. The resulting visual sequence is then projected into LLM embedding space via the multi-modal projector as the final output of the visual branch, denoted as 𝑍_𝑉∈𝑅^{(ℎ×𝑤×𝑛)×𝑐}."

I have the following questions regarding this process:
(1) What exactly are the "special tokens" interleaved among the visual sequence? Are they fixed embeddings, learnable parameters, or derived from any other sources?
(2) If 𝑘 special tokens are inserted, shouldn't the temporal length of the sequence increase, resulting in 𝑍_𝑉∈𝑅^{(ℎ×𝑤×(𝑛+𝑘))×𝑐}? Could you clarify how the dimensionality is computed in this case? Your clarification on this would be greatly appreciated.

Collaborator

IceWYB commented Feb 7, 2025

Sorry for the late response; I only just noticed this issue. As shown in Figure 2, we first encode the frames into video features, which we denote as Z_V with shape (h, w, n, c) in the paper. We then insert special tokens (<vi_start>, , etc.) among the video tokens to form the final visual sequence. As you mentioned, the resulting dimension is (h, w, (n+k), c).
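To make the length bookkeeping concrete, here is a minimal sketch in plain Python, not the project's actual code. It assumes each special token is a single c-dimensional embedding that takes one position in the flattened sequence, and that one opening and one closing marker wrap every frame. The placement, the `frame_end` marker, and the function name `interleave_special_tokens` are hypothetical; only `<vi_start>` is named in this thread. Under that assumption the flattened length is h×w×n + k. The (h, w, (n+k), c) notation above would instead correspond to each special token filling a whole h×w slot.

```python
def interleave_special_tokens(frames, start_tok, end_tok):
    """Flatten per-frame tokens into one sequence, wrapping each frame in markers.

    frames: list of n frames, each a list of h*w token vectors (length c).
    start_tok / end_tok: single token vectors (length c). The per-frame
    placement is an assumption made for illustration only.
    """
    seq = []
    for frame_tokens in frames:
        seq.append(start_tok)      # stands in for <vi_start>
        seq.extend(frame_tokens)   # the h*w visual tokens of this frame
        seq.append(end_tok)        # hypothetical closing marker
    return seq


h, w, n, c = 2, 3, 4, 8

# n frames, each with h*w tokens of dimension c (zeros as placeholders)
frames = [[[0.0] * c for _ in range(h * w)] for _ in range(n)]
start_tok = [1.0] * c
frame_end = [-1.0] * c

seq = interleave_special_tokens(frames, start_tok, frame_end)
k = 2 * n  # one start and one end marker per frame in this sketch

print(len(seq), len(seq[0]))  # sequence length and channel dimension
```

With these toy sizes the visual tokens contribute h×w×n = 24 positions and the markers add k = 8, so the sequence holds 32 vectors of dimension 8.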

IceWYB closed this as completed Feb 17, 2025