Description
Definition: `max-num-batched-tokens` is the maximum number of batched tokens per iteration. Currently only `max-num-seqs` is supported, which defines how many requests can run in parallel. If `max-num-batched-tokens` is defined, the actual total number of tokens allowed in a batch at runtime should be limited.
Given a queue of received requests, the worker should pull a request from the queue only if both of the following conditions are met:

- The number of currently running requests is less than `max-num-seqs`, and
- The total number of processing tokens is less than `max-num-batched-tokens`.
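The two admission conditions above can be sketched as a small check function. This is a minimal Python illustration, not the project's actual implementation; the `Request` type and all names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    """Hypothetical request record (not the project's real type)."""
    prompt_len: int        # number of tokens in the prompt
    max_output_tokens: int # maximum number of output tokens

def processing_tokens(req: Request) -> int:
    # Processing tokens for one request: prompt tokens plus
    # the maximum possible output tokens.
    return req.prompt_len + req.max_output_tokens

def can_pull(running: list[Request],
             max_num_seqs: int,
             max_num_batched_tokens: int) -> bool:
    """Return True if the worker may pull the next queued request."""
    # Condition 1: running request count is below max-num-seqs.
    if len(running) >= max_num_seqs:
        return False
    # Condition 2: total processing tokens across running requests
    # is below max-num-batched-tokens.
    total = sum(processing_tokens(r) for r in running)
    return total < max_num_batched_tokens
```

Both conditions must hold; failing either one keeps the request in the queue until a running request completes and frees capacity.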
The total number of processing tokens is calculated by summing, for each running request, the number of tokens in the prompt plus the maximum number of tokens in the output. The number of output tokens is defined by the `max_tokens` parameter in the request. If `max_tokens` is not specified, it is calculated as `max-model-len - prompt-len`.