feat: add max-num-batched-tokens configuration and implement request handling constraints (#83) #97
Conversation
…handling constraints
```diff
@@ -88,6 +88,7 @@ var _ = Describe("Simulator configuration", func() {
 	c = createDefaultConfig(qwenModelName)
 	c.Port = 8001
 	c.ServedModelNames = []string{"model1", "model2"}
+	c.MaxNumBatchedTokens = 2048
```
Please move this (and all other occurrences) to createDefaultConfig().
done
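For reference, a sketch of what the suggested consolidation might look like. This is hypothetical: only `MaxNumBatchedTokens = 2048` comes from the diff above; the struct name and field set are assumptions mirroring the fields the test touches.

```go
package sim

// Hypothetical configuration type standing in for the simulator's real one;
// the field set mirrors what the diff above touches.
type Configuration struct {
	Model               string
	Port                int
	ServedModelNames    []string
	MaxNumBatchedTokens int
}

// createDefaultConfig per the reviewer's suggestion: set the shared default
// once here instead of repeating it in each test.
func createDefaultConfig(model string) *Configuration {
	return &Configuration{
		Model:               model,
		MaxNumBatchedTokens: 2048, // value the configuration tests expect
	}
}
```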
```go
	if outputTokens < 0 {
		outputTokens = 0
	}
}
```
If maxCompletionTokens is nil, shouldn't this function just return s.config.MaxModelLen?
resolved
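For context, a hedged sketch of what the resolved logic might look like. `calculateProcessingTokens()` is named in the PR description and `getNumberOfPromptTokens()` appears in the diff below; the other accessor, the interface, and the fact that the model length is passed in directly are assumptions made for brevity.

```go
package sim

// Stand-in for the simulator's request type; the accessor names are assumptions.
type completionRequest interface {
	getNumberOfPromptTokens() int
	getMaxCompletionTokens() *int64
}

// Hedged sketch of the resolved logic: if no completion-token limit is set,
// budget the whole context window, as the reviewer suggests.
func calculateProcessingTokens(req completionRequest, maxModelLen int) int {
	maxCompletionTokens := req.getMaxCompletionTokens()
	if maxCompletionTokens == nil {
		return maxModelLen
	}
	outputTokens := int(*maxCompletionTokens)
	if outputTokens < 0 {
		outputTokens = 0
	}
	return req.getNumberOfPromptTokens() + outputTokens
}
```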
pkg/llm-d-inference-sim/simulator.go (Outdated)
```go
	promptTokens: req.getNumberOfPromptTokens(),
	maxTokens:    processingTokens,
	totalTokens:  processingTokens,
}
```
I could only find where 'totalTokens' is used (to update processingTokensCount), so why are 'promptTokens' and 'maxTokens' needed? And if they are not used, I guess we don't need runningRequestsMap at all? (And requestID?)
You're right, that was unnecessary. I was biased toward a general, one-size-fits-all approach in case we'd want further control over parallel requests in the near future. But removing it for now is better and leaner.
Removed unused fields and structures:
- runningRequest struct - Completely removed since promptTokens and maxTokens were never used
- runningRequestsMap sync.Map - Removed since we don't need to map request IDs to token counts
- requestID field - Removed from completionReqCtx since we no longer need unique request tracking
Simplified token tracking:
- Before: Store a complex runningRequest struct with 3 fields in a map, indexed by requestID
- After: Store just the processingTokens count directly in the completionReqCtx
Updated method signatures:
- addRunningRequest() now takes *completionReqCtx instead of (reqID, req)
- removeRunningRequest() now takes *completionReqCtx instead of reqID
- Both methods are simpler and more direct
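Roughly, the simplified shape might look like this. Only `completionReqCtx`, `processingTokens`, `addRunningRequest`, and `removeRunningRequest` are named in the discussion; the atomic counter, the struct layout, and the receiver type are assumptions.

```go
package sim

import "sync/atomic"

// Sketch of the simplified tracking: the context keeps only the token count,
// and the simulator keeps a single running total.
type completionReqCtx struct {
	processingTokens int64 // prompt tokens + max output tokens for this request
}

type simulator struct {
	processingTokensCount atomic.Int64 // running total across accepted requests
}

func (s *simulator) addRunningRequest(reqCtx *completionReqCtx) {
	s.processingTokensCount.Add(reqCtx.processingTokens)
}

func (s *simulator) removeRunningRequest(reqCtx *completionReqCtx) {
	s.processingTokensCount.Add(-reqCtx.processingTokens)
}
```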
@mohitpalsingh Thank you very much for this PR. After reviewing this PR and issue #83 with a colleague of mine, we need to think more carefully about what exactly needs to be simulated here. Therefore, there will be a delay in the review of your PR.
Yeah, no issues @irar2. I've updated the PR as per your comments, and it should be good to go for the current scope and expected behavior. Let me know if you decide to change something and I can help with that.
✨ New Feature: `max-num-batched-tokens` Support

This PR implements the `max-num-batched-tokens` parameter, which limits the total number of tokens (prompt + max output tokens) that can be processed simultaneously across all running requests.

🔧 Technical Implementation

- `calculateProcessingTokens()` function to compute total token requirements per request.
- `canAcceptRequest()` function to check constraint satisfaction (a hedged sketch follows this list).
- `addRunningRequest()` and `removeRunningRequest()` for proper token tracking.
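A minimal sketch of how such an admission check might look. Only `canAcceptRequest()` and `MaxNumBatchedTokens` come from this PR; the counter and config names (`runningReqCount`, `processingTokensCount`, `MaxNumSeqs`) and the stand-in types are assumptions.

```go
package sim

import "sync/atomic"

// Minimal stand-ins for the simulator's config and state; the names are assumptions.
type simConfig struct {
	MaxNumSeqs          int
	MaxNumBatchedTokens int
}

type simulator struct {
	config                simConfig
	runningReqCount       atomic.Int64 // requests currently running
	processingTokensCount atomic.Int64 // prompt + max output tokens of running requests
}

// canAcceptRequest reports whether a new request with the given token budget
// satisfies both the max-num-seqs and max-num-batched-tokens constraints.
func (s *simulator) canAcceptRequest(processingTokens int64) bool {
	if s.runningReqCount.Load() >= int64(s.config.MaxNumSeqs) {
		return false
	}
	if s.config.MaxNumBatchedTokens > 0 &&
		s.processingTokensCount.Load()+processingTokens > int64(s.config.MaxNumBatchedTokens) {
		return false
	}
	return true
}
```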
📝 Configuration & Documentation

- Added the `--max-num-batched-tokens` flag with proper help text (a hedged sketch of the config-side change follows this list).
- Set defaults in `config.yaml` (2048) and `basic-config.yaml` (1024).
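A rough sketch of the `config.go` side. Only the parameter name and its meaning come from the PR; the struct name, tags, and the validation rule shown here are assumptions.

```go
package sim

import "errors"

type Configuration struct {
	// MaxNumBatchedTokens limits the total tokens (prompt + max output tokens)
	// processed simultaneously across all running requests; 0 disables the limit.
	MaxNumBatchedTokens int `yaml:"max-num-batched-tokens" json:"max-num-batched-tokens"`
}

func (c *Configuration) validate() error {
	if c.MaxNumBatchedTokens < 0 {
		return errors.New("max-num-batched-tokens cannot be negative")
	}
	return nil
}
```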
🔄 Code Quality Improvements

- When `max-num-batched-tokens` is `0`, only the `max-num-seqs` constraint is enforced.

🧪 Testing
📊 Behavior

- When `max-num-batched-tokens` is configured: a request is accepted only if both the `max-num-seqs` and `max-num-batched-tokens` constraints are satisfied.
- When `max-num-batched-tokens` is `0` or not set: only the `max-num-seqs` constraint is enforced (existing behavior).
🏗️ Files Modified

- `config.go` - Added parameter and validation
- `simulator.go` - Core implementation and renamed field
- `pkg/llm-d-inference-sim/*_test.go` - Updated tests and expectations
- `config.yaml` & `basic-config.yaml` - Added parameter
- `README.md` - Documentation updates

✅ Validation

- `--help` output

🎯 Addresses

This implementation follows the vLLM specification for the `max-num-batched-tokens` parameter, ensuring requests only proceed when both sequence and token constraints are satisfied. This enables better resource management and throughput control.