Error: Token indices sequence length is longer than the specified maximum sequence length for this model #9101
Replies: 1 comment
-
Welp, I found this: TL;DR: In the context of the HybridChunker, this is a known and anticipated "false alarm".
-
Hello everyone, I've been trying to figure out why I get this error/warning when I run my pipeline. I'm not even sure this is the right place to ask, but here we go.
Let me give you a little background on what I want to do first:
I want to create a RAG pipeline using Milvus as the vector database and Docling as the document converter, with Haystack as the backend.
Here's a snippet of the code:
Output
I tried both the OpenAI embedder and the HF embedder and I get the same result.
My main question is: where does it find a sequence of length 26838? Why is there a sequence of that length when my max_tokens param is set to 2500? Shouldn't all chunks be at most 2500 tokens?
Also, in the output, the sequence warning shows up before the first print statement "Indexing 1 files...". Does that mean the warning comes from somewhere outside that cell?
I am very confused and I've been looking at this for days.
The tokenizer is BAAI/bge-m3 and the embedder I use is text-embedding-3-large from OpenAI; both have the same maximum input size.
I cannot narrow down which step of the code triggers this warning or how to solve it.
Should I add Haystack's splitter to the pipeline to split the chunks further? I am under the impression that HybridChunker handles that, and I have it set up correctly with the max_tokens parameter.
Checking the tokens of each chunk, I get this:
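For anyone who wants to run the same kind of per-chunk check, here is a minimal sketch. It is not the code from my notebook: `oversized_chunks` and `count_tokens` are hypothetical names, and the whitespace counter below is only a stand-in. In a real check, `count_tokens` would presumably wrap the same tokenizer that was passed to the chunker (for example `len(tokenizer.tokenize(text))` with a Hugging Face tokenizer), applied to the text that actually gets embedded.

```python
from typing import Callable, Iterable, List, Tuple


def oversized_chunks(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every chunk longer than max_tokens."""
    over_limit = []
    for index, text in enumerate(chunks):
        n_tokens = count_tokens(text)
        if n_tokens > max_tokens:
            over_limit.append((index, n_tokens))
    return over_limit


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


# Toy data, not real chunks from the pipeline.
example_chunks = ["a b c", "a b c d e f", "a"]
over_limit = oversized_chunks(example_chunks, whitespace_count, max_tokens=4)
print(over_limit)
```

An empty result would mean every chunk respects the limit, which is what I would expect if the warning really is a false alarm.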
Sorry for the long text, thank you for reading!
EDIT:
If I run the last cell again without restarting the notebook, I do not get the sequence length warning/error. But if I restart the notebook and run the cell, I get the warning. So this error pops up only once per notebook session. Running it again and again does not make it appear.
Further down I have a RAG pipeline which works, so I do not know whether to ignore the error or find a solution.
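A toy sketch of what I think is going on, under loud assumptions: `ToyTokenizer` and `chunk_text` are hypothetical stand-ins, not Docling's or transformers' actual code. The idea is that the chunker has to tokenize the full, unsplit text in order to count tokens before splitting, and the tokenizer only warns the first time it sees an over-long input. As far as I understand, Hugging Face tokenizers keep a similar per-instance "already warned" flag, which would explain why the message shows up once per notebook session.

```python
import warnings


class ToyTokenizer:
    """Whitespace tokenizer that warns once per instance on over-long input."""

    def __init__(self, model_max_length: int) -> None:
        self.model_max_length = model_max_length
        self._already_warned = False  # mimics a warn-once flag (assumption)

    def encode(self, text: str) -> list:
        tokens = text.split()
        if len(tokens) > self.model_max_length and not self._already_warned:
            warnings.warn(
                "Token indices sequence length is longer than the specified "
                f"maximum sequence length ({len(tokens)} > {self.model_max_length})."
            )
            self._already_warned = True
        return tokens


def chunk_text(text: str, tokenizer: ToyTokenizer, max_tokens: int) -> list:
    """Tokenize the whole text first (this may warn), then split into chunks."""
    tokens = tokenizer.encode(text)
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


text = "one two three four five six seven eight nine ten eleven twelve"
tokenizer = ToyTokenizer(model_max_length=8)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chunks = chunk_text(text, tokenizer, max_tokens=5)        # first run: warns
    chunks_again = chunk_text(text, tokenizer, max_tokens=5)  # second run: silent

print(len(caught), [len(c.split()) for c in chunks])
```

In this toy version the warning fires on the pre-split text, yet every chunk that comes out is within `max_tokens`, and nothing over-long is ever handed to the embedder.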