Question about how the caption is input

I have a question about where the caption goes, and in what format, in the input of the instruction data. Take the following sentence from the paper as an example: "A video of a Super-hero Movie." Is this sentence part of the text prompt, or does it need to be embedded by the ImageBind model before being fed into the LLM?