Add thinking budget parameter to limit reasoning tokens #637
base: main
Conversation
Hey @tasercake, one question: is there a way to find the thinking token IDs automatically? If not, that's fine, we can go ahead and merge.
Could you add support to the batch_gen API as well?
Thanks for taking a look @Blaizzy! I'll see if I can get it working with
Unfortunately, it looks like there is no standardized API for reasoning token IDs: qwen3-vl defines them in its tokenizer config, while deepseek does not appear to.
@Blaizzy I've taken a crack at supporting this in the batch_generate API as well (and also fixed a couple of minor issues with the returned logprobs). Let me know what you think. Thanks!
Happy new year! Agreed, we can address it in a future PR :)
Hey @tasercake, I really like the thinking budget concept, but the current approach feels a bit verbose. 1. Create a new
This adds a `thinking_budget` parameter to the generate functions that limits the number of tokens a model can generate inside `<think>...</think>` blocks. When the budget is exceeded, the `</think>` token is force-inserted to end the thinking phase.

Features:
- New parameters: `thinking_budget`, `thinking_start_token`, `thinking_end_token`
- Configurable think tags (default: `<think>`/`</think>`)
- CLI support via `--thinking-budget`, `--thinking-start-token`, `--thinking-end-token`
- Works with any reasoning model that uses think tags (Qwen3, DeepSeek-R1, etc.)

The implementation tracks thinking state in the generation loop and enforces the budget by replacing the sampled token with the end token when exceeded.
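The enforcement described above can be sketched in plain Python. This is a minimal stand-in, not the PR's actual mlx code: the token IDs, function name, and state shape are illustrative.

```python
# Illustrative stand-in token IDs for the configured think tags.
THINK_START = 1000  # e.g. <think>
THINK_END = 1001    # e.g. </think>

def enforce_thinking_budget(token, in_thinking, thinking_tokens, budget):
    """One step of budget enforcement on a freshly sampled token.

    Returns (token_to_emit, in_thinking, thinking_tokens). The start token
    itself is not counted against the budget, matching the PR's behavior.
    """
    if token == THINK_START:
        return token, True, 0  # enter the thinking phase; reset the counter
    if not in_thinking:
        return token, False, thinking_tokens  # outside thinking: pass through
    if token == THINK_END:
        return token, False, thinking_tokens  # model ended thinking on its own
    thinking_tokens += 1
    if budget is not None and thinking_tokens >= budget:
        # Budget exhausted: replace the sampled token with the end token.
        return THINK_END, False, thinking_tokens
    return token, True, thinking_tokens
```

With a budget of 3, a stream that would otherwise keep thinking gets `</think>` forced in after the third thinking token, and subsequent tokens pass through unchanged.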
* Add thinking budget support to batch generation

  Extends the thinking budget enforcement to `BatchGenerator` to support streaming batch inference with token limits on thinking blocks. Tracks thinking state per sample and forces `</think>` tokens when the budget is exceeded.

* Update logprobs when forcing thinking end token

  When the thinking budget is exceeded and the `</think>` token is forced, the logprobs are now updated to reflect this forced selection (setting the forced token's log probability to 0 and all others to -inf).
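The logprobs fix can be illustrated with a small plain-Python sketch (the PR operates on mlx arrays; this list-based helper is a stand-in): a forced token is a deterministic choice, so its log probability is log(1) = 0 and every other token's is log(0) = -inf.

```python
import math

def force_token_logprobs(vocab_size, forced_id):
    """Logprob vector reflecting a forced (deterministic) token choice:
    0.0 (= log 1) for the forced token, -inf (= log 0) for all others."""
    return [0.0 if i == forced_id else -math.inf for i in range(vocab_size)]
```

This keeps downstream consumers of the returned logprobs consistent with what was actually emitted, rather than reporting the distribution the model sampled from before the override.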
* Add test and fix mlx array syntax for thinking budget

  - Add `TestBatchGeneratorThinking` test class with a thinking budget test
  - Fix mlx array syntax: use direct indexing instead of `.at[].set()`
  - Fix `Batch.filter`/`Batch.extend` to handle `None` thinking state
  - Skip counting the start token in the thinking token count
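The per-sample tracking in the batch path can be sketched as follows. This is a hypothetical plain-Python illustration (the real code works on mlx arrays inside `BatchGenerator`); entering the thinking phase is omitted for brevity, and the token ID is illustrative.

```python
THINK_END = 1001  # illustrative id for </think>

def step_batch(sampled, states, budget):
    """Apply budget enforcement to one decoding step of a batch.

    states: one [in_thinking, count] pair per sample, mutated in place.
    Returns the tokens actually emitted; </think> is forced only for the
    samples whose own budget is exhausted, independently of the others.
    """
    out = []
    for i, tok in enumerate(sampled):
        in_think, count = states[i]
        if in_think and tok != THINK_END:
            count += 1
            if count >= budget:
                tok = THINK_END  # this sample's budget is spent
                in_think = False
        elif tok == THINK_END:
            in_think = False  # sample ended thinking on its own
        states[i] = [in_think, count]
        out.append(tok)
    return out
```

The key property is that each sample carries its own counter, so one long-thinking sample being cut off never affects its batch neighbors.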
@Blaizzy, agreed that something along these lines would be nice! I'd been thinking of doing something similar by extending the StoppingCriteria class instead, but I think your suggestion is much cleaner.
That's awesome, extending the stop criteria is a great idea too and is simpler! Looking forward to your changes 🚀
Hello! I wanted to place a hard cap on the number of tokens used for reasoning (inside a `<think>...</think>` block from models like Qwen3) because the models would often spend way too long thinking. So this is something I built for my own use, and I figured there might be some interest in having it upstream as well. Happy to iterate on this if needed :)
Here's a summary of changes:
- New `thinking_budget` parameter to limit tokens in `<think>...</think>` blocks
- Force-inserts `</think>` when the budget is exceeded to end the reasoning phase
- Configurable `thinking_start_token` and `thinking_end_token`
- CLI flags: `--thinking-budget`, `--thinking-start-token`, `--thinking-end-token`

Usage
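As a minimal sketch of how the CLI flags above might be wired up (this argparse parser is a hypothetical stand-in; the repo's actual generate script defines its own arguments):

```python
import argparse

# Stand-in parser demonstrating the three flags added by this PR.
parser = argparse.ArgumentParser()
parser.add_argument("--thinking-budget", type=int, default=None,
                    help="max tokens allowed inside the think block")
parser.add_argument("--thinking-start-token", type=str, default="<think>")
parser.add_argument("--thinking-end-token", type=str, default="</think>")

args = parser.parse_args(["--thinking-budget", "256"])
```

Leaving `--thinking-budget` unset (the default `None`) disables enforcement entirely.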
Testing

Added a `TestThinkingBudget` class to cover a few cases.