In this paper, it is mentioned that:
"Ultimately, we concatenate visual features of all frames along the time dimension with special tokens interleaved among the sequence to model the temporal inter-dependencies. The resulting visual sequence is then projected into LLM embedding space via the multi-modal projector as the final output of the visual branch, denoted as 𝑍_𝑉∈𝑅^{(ℎ×𝑤×𝑛)×𝑐}."
I have the following questions regarding this process:
(1) What exactly are the "special tokens" interleaved among the visual sequence? Are they fixed embeddings, learnable parameters, or derived from any other sources?
(2) If 𝑘 special tokens are inserted, shouldn't the temporal length of the sequence increase, resulting in 𝑍_𝑉∈𝑅^{(ℎ×𝑤×(𝑛+𝑘))×𝑐}? Could you clarify how the dimensionality is computed in this case? Your clarification would be greatly appreciated.
Sorry for the late response; I just noticed this issue in the project. As shown in Figure 2, we first encode the frames into video features, which we denote as Z_V with shape (h, w, n, c) in the paper. We then insert special tokens (<vi_start>, etc.) within the video tokens to form the final visual sequence, which, as you note, has dimension (h, w, (n+k), c).
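To make the shape bookkeeping concrete, here is a minimal numpy sketch of the interleaving described above. It assumes the special tokens are learnable embeddings broadcast across the spatial grid; the function name and the choice to prepend the tokens (rather than scatter them between frames) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def interleave_special_tokens(video_feats, special_tokens):
    """Insert k special-token embeddings along the time axis.

    video_feats:    (h, w, n, c) per-frame visual features Z_V
    special_tokens: (k, c) embeddings, e.g. <vi_start> (assumed learnable)
    returns:        flattened sequence of shape (h*w*(n+k), c)
    """
    h, w, n, c = video_feats.shape
    k = special_tokens.shape[0]
    # Broadcast each special token over the (h, w) spatial grid so the
    # tensor stays rectangular: (k, c) -> (h, w, k, c).
    special = np.broadcast_to(special_tokens, (h, w, k, c))
    # Simplest placement: prepend the special "frames" along the time
    # axis; a real scheme may interleave them between frames instead.
    seq = np.concatenate([special, video_feats], axis=2)  # (h, w, n+k, c)
    # Flatten spatial and temporal axes into one token sequence for the
    # multi-modal projector.
    return seq.reshape(h * w * (n + k), c)

Z_V = interleave_special_tokens(np.zeros((2, 3, 4, 8)), np.zeros((1, 8)))
print(Z_V.shape)  # (2*3*(4+1), 8) = (30, 8)
```

This matches the point above: with k inserted tokens the flattened sequence length is h×w×(n+k), not h×w×n.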