Fix visual encoders with no CLS #11982
Merged
+5 −1
This PR fixes the bug outlined in this issue: #10157
It is also discussed in projects that leverage llama.cpp, like Ollama: ollama/ollama#7441 and ollama/ollama-python#433
Summary
In `clip.cpp`, we initialize a `patches` vector, which is then used to index into the embedding with a `get rows` op (here). This can trigger the out-of-bounds assertion when run with the CPU backend if the visual encoder has no CLS embedding, e.g., `siglip`. I.e., with `729` patches and no CLS, the vector is `[1, 2, ..., 729]` instead of the correct `[0, 1, ..., 728]`.
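For illustration, here is a minimal standalone sketch of the indexing logic involved (not the actual `clip.cpp` code; `build_patch_indices` and `has_class_embedding` are made-up names): the offset applied to the patch row indices should depend on whether a CLS embedding occupies row 0.

```cpp
#include <cstdio>
#include <vector>

// Build the row indices used to gather patch embeddings from the encoder
// output. With a CLS embedding, row 0 holds the class token and the patches
// start at row 1; without CLS, the patches already start at row 0.
static std::vector<int> build_patch_indices(int num_patches, bool has_class_embedding) {
    const int offset = has_class_embedding ? 1 : 0; // unconditionally using 1 is the bug
    std::vector<int> patches(num_patches);
    for (int i = 0; i < num_patches; i++) {
        patches[i] = i + offset;
    }
    return patches;
}

int main() {
    const int num_patches = 729; // e.g. a 27x27 patch grid

    // No CLS: indices must be [0 .. 728]; [1 .. 729] indexes one row past the
    // end and trips the CPU backend's bounds assertion in the get rows op.
    std::vector<int> no_cls = build_patch_indices(num_patches, /*has_class_embedding=*/false);
    printf("no CLS  : first=%d last=%d\n", no_cls.front(), no_cls.back());

    // With CLS: row 0 is the class token, so [1 .. 729] is correct.
    std::vector<int> with_cls = build_patch_indices(num_patches, /*has_class_embedding=*/true);
    printf("with CLS: first=%d last=%d\n", with_cls.front(), with_cls.back());
    return 0;
}
```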
Steps to Verify
1. Build the llava CLI with `cmake --build build --config Release --target llama-llava-cli`
2. Try running the model; a sketched example invocation is below.
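Assuming a converted model, projector, and test image on disk (the paths below are placeholders, and exact flags may vary by build), an invocation along these lines exercises the vision path:

```sh
./build/bin/llama-llava-cli \
    -m /path/to/model.gguf \
    --mmproj /path/to/mmproj.gguf \
    --image /path/to/image.png \
    -p "Describe this image."
```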
On `main`, it blows up because of the patch `729`; on this branch, things are happy.
@ngxson @ggerganov @gabe-l-hart PTAL when you can - this change is also needed to run Granite vision models correctly (they are being added in this PR), but I'm decoupling the bug fix from the new model support 🙂