
feat: add max-num-batched-tokens configuration and implement request handling constraints (#83) #97


Open

wants to merge 2 commits into main

Conversation

mohitpalsingh (Contributor)

✨ New Feature: max-num-batched-tokens Support

This PR implements the max-num-batched-tokens parameter, which limits the total number of tokens (prompt + max output tokens) that can be processed simultaneously across all running requests.

🔧 Technical Implementation

  • Added calculateProcessingTokens() function to compute total token requirements per request.
  • Implemented canAcceptRequest() function to check constraint satisfaction.
  • Added addRunningRequest() and removeRunningRequest() for proper token tracking.
  • Redesigned queue manager with ticker-based processing to avoid busy-waiting.
  • Added atomic counters for thread-safe token counting (see the sketch after this list).
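
A minimal sketch of how these pieces could fit together is shown below. The type and field names (simulator, simConfig, runningRequestsCount, processingTokensCount) are illustrative stand-ins rather than the repository's actual definitions, and the ticker-based queue manager is omitted:

```go
// Sketch only: names are illustrative, not the simulator's real types.
package main

import (
	"fmt"
	"sync/atomic"
)

type simConfig struct {
	MaxNumSeqs          int64 // maximum number of concurrently running requests
	MaxNumBatchedTokens int64 // 0 disables the token constraint
}

type simulator struct {
	config                simConfig
	runningRequestsCount  int64 // requests currently running
	processingTokensCount int64 // sum of (prompt + max output) tokens across running requests
}

// calculateProcessingTokens computes the total token requirement of a request.
func calculateProcessingTokens(promptTokens, maxOutputTokens int64) int64 {
	return promptTokens + maxOutputTokens
}

// canAcceptRequest reports whether a request needing `tokens` processing
// tokens can start without violating either constraint.
func (s *simulator) canAcceptRequest(tokens int64) bool {
	if atomic.LoadInt64(&s.runningRequestsCount) >= s.config.MaxNumSeqs {
		return false
	}
	if s.config.MaxNumBatchedTokens == 0 {
		return true // token constraint disabled: only max-num-seqs applies
	}
	return atomic.LoadInt64(&s.processingTokensCount)+tokens <= s.config.MaxNumBatchedTokens
}

// addRunningRequest and removeRunningRequest keep the atomic counters in sync.
func (s *simulator) addRunningRequest(tokens int64) {
	atomic.AddInt64(&s.runningRequestsCount, 1)
	atomic.AddInt64(&s.processingTokensCount, tokens)
}

func (s *simulator) removeRunningRequest(tokens int64) {
	atomic.AddInt64(&s.runningRequestsCount, -1)
	atomic.AddInt64(&s.processingTokensCount, -tokens)
}

func main() {
	s := &simulator{config: simConfig{MaxNumSeqs: 4, MaxNumBatchedTokens: 2048}}
	need := calculateProcessingTokens(512, 256) // 768 processing tokens
	if s.canAcceptRequest(need) {
		s.addRunningRequest(need)
	}
	fmt.Println("running:", atomic.LoadInt64(&s.runningRequestsCount),
		"tokens:", atomic.LoadInt64(&s.processingTokensCount))
}
```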

📝 Configuration & Documentation

  • Added command-line parameter: --max-num-batched-tokens with proper help text.
  • Updated configuration files: added the parameter to config.yaml (2048) and basic-config.yaml (1024); a snippet follows this list.
  • Enhanced README: Added documentation for the new parameter.
  • Updated manifests: Configuration examples now include the new parameter.
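
For reference, the configuration entry might look like the following sketch; the YAML key name is assumed to mirror the command-line flag, so check the actual config.yaml for the exact spelling:

```yaml
# Sketch only: key name assumed to mirror the --max-num-batched-tokens flag.
max-num-batched-tokens: 2048   # basic-config.yaml would use 1024 instead
```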

🔄 Code Quality Improvements

  • Maintained backward compatibility - when set to 0, only the max-num-seqs constraint is enforced.

🧪 Testing

  • ✅ All existing tests pass (102/102 specs).
  • ✅ Added comprehensive test coverage for the new functionality (a sample spec sketch follows this list).
  • ✅ Test scenarios include:
    • Parameter validation and edge cases
    • Queue constraint enforcement
    • Token calculation accuracy
    • Configuration loading from files
    • Backward compatibility (disabled mode)
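
A sketch of one such spec in the Ginkgo/Gomega style the suite already uses, placed in a _test.go file; canAcceptRequest is written here as a local stand-in so the snippet is self-contained rather than the simulator's actual method:

```go
package sim

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// canAcceptRequest is a stand-in for the simulator method: both the sequence
// limit and (when non-zero) the token budget must hold.
func canAcceptRequest(running, maxSeqs, usedTokens, newTokens, maxBatchedTokens int64) bool {
	if running >= maxSeqs {
		return false
	}
	if maxBatchedTokens == 0 {
		return true // disabled: only max-num-seqs applies
	}
	return usedTokens+newTokens <= maxBatchedTokens
}

func TestConstraints(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "max-num-batched-tokens")
}

var _ = Describe("max-num-batched-tokens", func() {
	It("rejects a request that would exceed the token budget", func() {
		Expect(canAcceptRequest(1, 4, 900, 200, 1024)).To(BeFalse())
	})

	It("enforces only max-num-seqs when the budget is 0", func() {
		Expect(canAcceptRequest(0, 1, 0, 100000, 0)).To(BeTrue())
	})
})
```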

📊 Behavior

When max-num-batched-tokens is configured:

  • Requests are queued until both constraints are satisfied:
    • Number of running requests < max-num-seqs
    • Total processing tokens + new request tokens ≤ max-num-batched-tokens

When max-num-batched-tokens is 0 or not set:

  • Only the max-num-seqs constraint is enforced (existing behavior).

🏗️ Files Modified

  • config.go - Added parameter and validation
  • simulator.go - Core implementation and renamed field
  • pkg/llm-d-inference-sim/*_test.go - Updated tests and expectations
  • config.yaml & basic-config.yaml - Added parameter
  • README.md - Documentation updates

✅ Validation

  • Build succeeds without errors
  • All tests pass (102/102 specs)
  • Parameter appears in --help output
  • Configuration loading works correctly
  • Token tracking is thread-safe
  • Backward compatibility maintained

🎯 Addresses

This implementation follows the vLLM specification for the max-num-batched-tokens parameter, ensuring requests only proceed when both sequence and token constraints are satisfied. This enables better resource management and throughput control.

@@ -88,6 +88,7 @@ var _ = Describe("Simulator configuration", func() {
 		c = createDefaultConfig(qwenModelName)
 		c.Port = 8001
 		c.ServedModelNames = []string{"model1", "model2"}
+		c.MaxNumBatchedTokens = 2048
Collaborator

Please move this (and all other occurrences) to createDefaultConfig().

Contributor Author

done
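
For illustration, the change the reviewer asks for could look roughly like the following; the Configuration type and helper body are stand-ins, not the repository's actual code:

```go
package sim

// Configuration is a stand-in for the simulator's config struct.
type Configuration struct {
	Model               string
	MaxNumBatchedTokens int
}

// createDefaultConfig sets MaxNumBatchedTokens once, so individual specs no
// longer need to set it themselves.
func createDefaultConfig(model string) *Configuration {
	return &Configuration{
		Model:               model,
		MaxNumBatchedTokens: 2048,
	}
}
```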

	if outputTokens < 0 {
		outputTokens = 0
	}
}
Collaborator

If maxCompletionTokens is nil, shouldn't this function just return s.config.MaxModelLen?

Contributor Author

resolved
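
A sketch of the behavior the reviewer suggests, with illustrative names (only s.config.MaxModelLen comes from the discussion above):

```go
package sim

type config struct{ MaxModelLen int64 }

type simulator struct{ config config }

// maxOutputTokens returns the output-token budget for a request. When the
// request does not set a completion-token limit, fall back to the model's
// context length; otherwise clamp negative values to zero.
func (s *simulator) maxOutputTokens(maxCompletionTokens *int64) int64 {
	if maxCompletionTokens == nil {
		return s.config.MaxModelLen
	}
	outputTokens := *maxCompletionTokens
	if outputTokens < 0 {
		outputTokens = 0
	}
	return outputTokens
}
```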

	promptTokens: req.getNumberOfPromptTokens(),
	maxTokens:    processingTokens,
	totalTokens:  processingTokens,
}
Collaborator

I could only find where 'totalTokens' is used (to update processingTokensCount); why are 'promptTokens' and 'maxTokens' needed? And if they are not used, I guess we don't need runningRequestsMap at all (or requestID)?

Contributor Author

You're right, that was unnecessary. I was leaning toward a general, one-size-fits-all approach in case we want finer control over parallel requests in the near future, but removing it for now is better and leaner.

Removed unused fields and structures:

  • runningRequest struct - Completely removed since promptTokens and maxTokens were never used
  • runningRequestsMap sync.Map - Removed since we don't need to map request IDs to token counts
  • requestID field - Removed from completionReqCtx since we no longer need unique request tracking

Simplified token tracking:

  • Before: Store a complex runningRequest struct with 3 fields in a map, indexed by requestID
  • After: Store just the processingTokens count directly in the completionReqCtx

Updated method signatures:

  • addRunningRequest() now takes *completionReqCtx instead of (reqID, req)
  • removeRunningRequest() now takes *completionReqCtx instead of reqID
  • Both methods are simpler and more direct (see the sketch after this list).
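
A before/after sketch of this simplification; all names are illustrative rather than the repository's actual definitions:

```go
package sim

import (
	"sync"
	"sync/atomic"
)

// Before: a per-request struct stored in a sync.Map keyed by requestID.
type runningRequest struct {
	promptTokens int64
	maxTokens    int64
	totalTokens  int64 // the only field that was ever read
}

type simulatorBefore struct {
	runningRequestsMap    sync.Map // requestID -> *runningRequest
	processingTokensCount int64
}

// After: the request context carries the single number that is needed, so the
// struct, the map, and the requestID all disappear.
type completionReqCtx struct {
	processingTokens int64
}

type simulator struct {
	processingTokensCount int64
}

// addRunningRequest now takes *completionReqCtx instead of (reqID, req).
func (s *simulator) addRunningRequest(ctx *completionReqCtx) {
	atomic.AddInt64(&s.processingTokensCount, ctx.processingTokens)
}

// removeRunningRequest mirrors addRunningRequest.
func (s *simulator) removeRunningRequest(ctx *completionReqCtx) {
	atomic.AddInt64(&s.processingTokensCount, -ctx.processingTokens)
}
```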

@irar2 (Collaborator) commented Jul 16, 2025

@mohitpalsingh Thank you very much for this PR. After reviewing this PR and issue #83 with a colleague of mine, we need to think more carefully about what exactly needs to be simulated here. Therefore, there will be a delay in the review of your PR.

mohitpalsingh requested a review from irar2 on July 16, 2025 at 09:06
@mohitpalsingh (Contributor Author)

> @mohitpalsingh Thank you very much for this PR. After reviewing this PR and issue #83 with a colleague of mine, we need to think more carefully about what exactly needs to be simulated here. Therefore, there will be a delay in the review of your PR.

No issues @irar2, I've updated the PR as per your comments and it should be good to go for the current scope and expected behavior.

Let me know if you decide to change anything, and I'm happy to help with that.
Until then, thanks.
