How did you incorporate the generated visual tokens into the input? Did you simply look up their token IDs in the embedding layer?