feat: add max-num-batched-tokens configuration and implement request handling constraints (#83) #97
Conversation
…handling constraints
```diff
@@ -88,6 +88,7 @@ var _ = Describe("Simulator configuration", func() {
 	c = createDefaultConfig(qwenModelName)
 	c.Port = 8001
 	c.ServedModelNames = []string{"model1", "model2"}
+	c.MaxNumBatchedTokens = 2048
```
Please move this (and all other occurrences) to createDefaultConfig().
done
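For reference, a sketch of what the suggested consolidation might look like. This is hypothetical: only `MaxNumBatchedTokens = 2048` comes from the diff above; the struct name and field set are assumptions mirroring the fields the test touches.

```go
package sim

// Hypothetical configuration type standing in for the simulator's real one;
// the field set mirrors what the diff above touches.
type Configuration struct {
	Model               string
	Port                int
	ServedModelNames    []string
	MaxNumBatchedTokens int
}

// createDefaultConfig per the reviewer's suggestion: set the shared default
// once here instead of repeating it in each test.
func createDefaultConfig(model string) *Configuration {
	return &Configuration{
		Model:               model,
		MaxNumBatchedTokens: 2048, // value the configuration tests expect
	}
}
```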
```go
	if outputTokens < 0 {
		outputTokens = 0
	}
}
```
If maxCompletionTokens is nil, shouldn't this function just return s.config.MaxModelLen?
resolved
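For context, a hedged sketch of what the resolved logic might look like. `calculateProcessingTokens()` is named in the PR description and `getNumberOfPromptTokens()` appears in the diff below; the other accessor, the interface, and the fact that the model length is passed in directly are assumptions made for brevity.

```go
package sim

// Stand-in for the simulator's request type; the accessor names are assumptions.
type completionRequest interface {
	getNumberOfPromptTokens() int
	getMaxCompletionTokens() *int64
}

// Hedged sketch of the resolved logic: if no completion-token limit is set,
// budget the whole context window, as the reviewer suggests.
func calculateProcessingTokens(req completionRequest, maxModelLen int) int {
	maxCompletionTokens := req.getMaxCompletionTokens()
	if maxCompletionTokens == nil {
		return maxModelLen
	}
	outputTokens := int(*maxCompletionTokens)
	if outputTokens < 0 {
		outputTokens = 0
	}
	return req.getNumberOfPromptTokens() + outputTokens
}
```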
pkg/llm-d-inference-sim/simulator.go (Outdated)
```go
	promptTokens: req.getNumberOfPromptTokens(),
	maxTokens:    processingTokens,
	totalTokens:  processingTokens,
}
```
I could only find where 'totalTokens' is used (to update processingTokensCount), so why are 'promptTokens' and 'maxTokens' needed? And if they are not used, I guess we don't need runningRequestsMap at all? (And requestID?)
You're right, that was unnecessary. I was biased toward a general, one-size-fits-all approach in case we'd want further control over parallel requests in the near future. But removing it for now is better and leaner.
Removed unused fields and structures:
- runningRequest struct - Completely removed since promptTokens and maxTokens were never used
- runningRequestsMap sync.Map - Removed since we don't need to map request IDs to token counts
- requestID field - Removed from completionReqCtx since we no longer need unique request tracking
Simplified token tracking:
- Before: Store a complex runningRequest struct with 3 fields in a map, indexed by requestID
- After: Store just the processingTokens count directly in the completionReqCtx
Updated method signatures:
- addRunningRequest() now takes *completionReqCtx instead of (reqID, req)
- removeRunningRequest() now takes *completionReqCtx instead of reqID
- Both methods are simpler and more direct
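Roughly, the simplified shape might look like this. Only `completionReqCtx`, `processingTokens`, `addRunningRequest`, and `removeRunningRequest` are named in the discussion; the atomic counter, the struct layout, and the receiver type are assumptions.

```go
package sim

import "sync/atomic"

// Sketch of the simplified tracking: the context keeps only the token count,
// and the simulator keeps a single running total.
type completionReqCtx struct {
	processingTokens int64 // prompt tokens + max output tokens for this request
}

type simulator struct {
	processingTokensCount atomic.Int64 // running total across accepted requests
}

func (s *simulator) addRunningRequest(reqCtx *completionReqCtx) {
	s.processingTokensCount.Add(reqCtx.processingTokens)
}

func (s *simulator) removeRunningRequest(reqCtx *completionReqCtx) {
	s.processingTokensCount.Add(-reqCtx.processingTokens)
}
```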
@mohitpalsingh Thank you very much for this PR. After reviewing this PR and issue #83 with a colleague of mine, we need to think more carefully about what exactly needs to be simulated here. Therefore, there will be a delay in the review of your PR.
Yeah, no issues @irar2. I've updated the PR as per your comments, and it should be good to go for the current scope and expected behavior. Let me know if you decide to change something and I can help with that.
✨ New Feature: `max-num-batched-tokens` Support

This PR implements the `max-num-batched-tokens` parameter, which limits the total number of tokens (prompt + max output tokens) that can be processed simultaneously across all running requests.

🔧 Technical Implementation

- `calculateProcessingTokens()` function to compute total token requirements per request.
- `canAcceptRequest()` function to check constraint satisfaction (a hedged sketch follows this list).
- `addRunningRequest()` and `removeRunningRequest()` for proper token tracking.
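A minimal sketch of how such an admission check might look. Only `canAcceptRequest()` and `MaxNumBatchedTokens` come from this PR; the counter and config names (`runningReqCount`, `processingTokensCount`, `MaxNumSeqs`) and the stand-in types are assumptions.

```go
package sim

import "sync/atomic"

// Minimal stand-ins for the simulator's config and state; the names are assumptions.
type simConfig struct {
	MaxNumSeqs          int
	MaxNumBatchedTokens int
}

type simulator struct {
	config                simConfig
	runningReqCount       atomic.Int64 // requests currently running
	processingTokensCount atomic.Int64 // prompt + max output tokens of running requests
}

// canAcceptRequest reports whether a new request with the given token budget
// satisfies both the max-num-seqs and max-num-batched-tokens constraints.
func (s *simulator) canAcceptRequest(processingTokens int64) bool {
	if s.runningReqCount.Load() >= int64(s.config.MaxNumSeqs) {
		return false
	}
	if s.config.MaxNumBatchedTokens > 0 &&
		s.processingTokensCount.Load()+processingTokens > int64(s.config.MaxNumBatchedTokens) {
		return false
	}
	return true
}
```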
📝 Configuration & Documentation

- Added the `--max-num-batched-tokens` flag with proper help text (a hedged sketch of the config-side change follows this list).
- Set defaults in `config.yaml` (2048) and `basic-config.yaml` (1024).
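A rough sketch of the `config.go` side. Only the parameter name and its meaning come from the PR; the struct name, tags, and the validation rule shown here are assumptions.

```go
package sim

import "errors"

type Configuration struct {
	// MaxNumBatchedTokens limits the total tokens (prompt + max output tokens)
	// processed simultaneously across all running requests; 0 disables the limit.
	MaxNumBatchedTokens int `yaml:"max-num-batched-tokens" json:"max-num-batched-tokens"`
}

func (c *Configuration) validate() error {
	if c.MaxNumBatchedTokens < 0 {
		return errors.New("max-num-batched-tokens cannot be negative")
	}
	return nil
}
```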
🔄 Code Quality Improvements

- When `max-num-batched-tokens` is `0`, only the `max-num-seqs` constraint is enforced.

🧪 Testing
📊 Behavior

- When `max-num-batched-tokens` is configured: a request is accepted only if both the `max-num-seqs` and `max-num-batched-tokens` constraints are satisfied.
- When `max-num-batched-tokens` is `0` or not set: only the `max-num-seqs` constraint is enforced (existing behavior).
🏗️ Files Modified

- `config.go` - Added parameter and validation
- `simulator.go` - Core implementation and renamed field
- `pkg/llm-d-inference-sim/*_test.go` - Updated tests and expectations
- `config.yaml` & `basic-config.yaml` - Added parameter
- `README.md` - Documentation updates

✅ Validation

- `--help` output

🎯 Addresses

This implementation follows the vLLM specification for the `max-num-batched-tokens` parameter, ensuring requests only proceed when both sequence and token constraints are satisfied. This enables better resource management and throughput control.