Fix issue #27486 double bos token #27515
Conversation
Code Review
This pull request aims to fix an issue with duplicate BOS tokens when a chat template is manually applied. The approach is to heuristically detect if special tokens are already present in the prompt. While the fix works for the intended cases, the implementation introduces a critical bug due to an overly broad heuristic for detecting the <s> token. My review includes a critical comment with a suggested code change to fix this bug by making the detection logic more robust and less prone to false positives.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
chat_markers = [
    "<|start_header_id|>",  # Llama 3.x
    "<|im_start|>",         # ChatML format
    "<s>",                  # Common BOS in older models
]

for marker in chat_markers:
    if marker in prompt[:100]:  # Check first 100 chars
        logger.debug(
```
Avoid treating a literal "<s>" in user text as a chat template marker
The new _should_add_special_tokens uses if marker in prompt[:100] with chat_markers containing <s>. Any plain prompt that mentions the literal string, e.g. asking about the HTML <s> tag, will now be classified as already containing special tokens and add_special_tokens will be forced to False. That removes the BOS/EOS tokens that were added before this change and can noticeably degrade generation quality for legitimate prompts that simply contain <s> in their content. The detection should be restricted to prefix checks or token boundary matches instead of substring search to avoid this regression.
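One way to tighten the heuristic along the lines suggested above is to match markers only at the start of the prompt rather than anywhere in the text. This is an illustrative sketch, not the code in this PR; the function and constant names are hypothetical:

```python
# Illustrative sketch of a stricter check: only treat a marker as a chat
# template token when it appears at the very beginning of the prompt
# (ignoring leading whitespace), rather than anywhere in the first 100 chars.
CHAT_MARKERS = (
    "<|begin_of_text|>",    # Llama 3.x BOS
    "<|start_header_id|>",  # Llama 3.x header
    "<|im_start|>",         # ChatML format
    "<s>",                  # SentencePiece-style BOS in older models
)

def prompt_starts_with_chat_marker(prompt: str) -> bool:
    """Return True only if the prompt *begins* with a known special token."""
    return prompt.lstrip().startswith(CHAT_MARKERS)
```

A prefix check like this avoids the false positive described above, e.g. a prompt asking about the HTML `<s>` tag in its body would no longer be misclassified.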
cc @njhill I remember that a similar issue was filed a while back?
See #9519
Purpose
Fix issue #27486: Prevent duplicate BOS tokens when users manually apply chat templates before calling `generate()`.

Problem
When users manually apply chat templates using `tokenizer.apply_chat_template()` and then call `llm.generate()` with the templated string, vLLM was incorrectly adding an additional BOS token during tokenization, resulting in duplicate BOS tokens at the start of the prompt.

This happened because:
- `apply_chat_template()` generates text that already contains BOS token markers (e.g., `<|begin_of_text|>`)
- vLLM then tokenizes the string with `add_special_tokens=True` (the default), which prepends a second BOS token

A minimal reproduction is sketched below.
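For illustration, here is a minimal sketch of the failure mode using a Hugging Face tokenizer directly. The model name and token IDs are illustrative, and this mirrors what vLLM's tokenization step did; it is not vLLM code:

```python
# Minimal sketch of the double-BOS failure mode, assuming a Llama-3-style
# tokenizer (model name and token IDs are illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The user renders the chat template themselves; the resulting string
# already starts with the BOS marker <|begin_of_text|>.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenizing with add_special_tokens=True (the default) prepends a second
# BOS token on top of the one already spelled out in the text.
token_ids = tokenizer.encode(prompt)
print(token_ids[:2])  # before this fix: [128000, 128000] -- duplicated BOS
```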
Solution

Added detection in `InputPreprocessor._tokenize_prompt()` to check whether a prompt already contains special tokens before adding them.

Changes:
- New method `_should_add_special_tokens()` (`vllm/inputs/preprocess.py`); see the sketch after this list
  - Detects chat template markers (e.g., `<|begin_of_text|>`), including `<|start_header_id|>` (Llama 3.x), `<|im_start|>` (ChatML format), and `<s>` (older models)
  - Returns `False` if special tokens are detected, `True` otherwise
- Modified `_tokenize_prompt()` (`vllm/inputs/preprocess.py`)
  - Calls `_should_add_special_tokens()` before tokenizing
  - Passes `add_special_tokens=False` when chat template markers are detected
- New test case (`tests/entrypoints/llm/test_chat.py`)
  - `test_llm_generate_with_chat_template_no_double_bos()`
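The following sketch of the detection helper is reconstructed from the description above and the quoted diff; it is not necessarily the exact code in the PR:

```python
# Sketch of the detection heuristic described above, reconstructed from the
# PR description and the quoted diff; not the exact code merged in the PR.
import logging

logger = logging.getLogger(__name__)

CHAT_MARKERS = [
    "<|begin_of_text|>",    # Llama 3.x BOS
    "<|start_header_id|>",  # Llama 3.x header
    "<|im_start|>",         # ChatML format
    "<s>",                  # Common BOS in older models
]

def _should_add_special_tokens(prompt: str) -> bool:
    """Return False if the prompt already appears to contain special tokens."""
    for marker in CHAT_MARKERS:
        if marker in prompt[:100]:  # only inspect the beginning of the prompt
            logger.debug("Chat template marker %r found; skipping BOS/EOS.", marker)
            return False
    return True
```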
Test Plan

pytest tests/entrypoints/llm/test_chat.py::test_llm_generate_with_chat_template_no_double_bos -v -s

Result: pass
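For reference, this is a hedged sketch of the kind of assertion such a test might make; the actual test in the PR may differ, and the model name is illustrative:

```python
# Hedged sketch of the kind of check the new test performs; the actual test
# in the PR may differ. Model name is illustrative.
from vllm import LLM, SamplingParams

def test_no_double_bos_sketch():
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    tokenizer = llm.get_tokenizer()

    # Render the chat template manually, as a user would before generate().
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Hello!"}],
        tokenize=False,
        add_generation_prompt=True,
    )

    outputs = llm.generate([prompt], SamplingParams(max_tokens=1))
    prompt_ids = outputs[0].prompt_token_ids

    # The tokenized prompt should start with exactly one BOS token.
    assert prompt_ids[0] == tokenizer.bos_token_id
    assert prompt_ids[1] != tokenizer.bos_token_id
```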