Support --max-num-batched-tokens configuration parameter #83

@mayabar

Description

Definition: max-num-batched-tokens is the maximum number of batched tokens per iteration.

Currently only max-num-seqs is supported; it defines how many requests can run in parallel.

If max-num-batched-tokens is defined, the total number of tokens allowed in a batch at runtime should be limited accordingly.

Given a queue of received requests, the worker should pull a request from the queue only if both of the following conditions are met:

  • The number of currently running requests is less than max-num-seqs, and

  • The total number of processing tokens is less than max-num-batched-tokens.

The total number of processing tokens is calculated by summing, for each running request, the number of tokens in the prompt plus the maximum number of tokens in the output.
The number of output tokens is given by the max_tokens parameter in the request; if max_tokens is not specified, it is calculated as max-model-len - prompt-len.
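The admission rule above could be sketched roughly as follows. This is a minimal illustrative sketch, not the project's actual implementation: the names (Request, Scheduler, try_pull, token_budget) are hypothetical, and the real worker loop, queueing, and request types will differ.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    # Hypothetical request shape: prompt length plus the optional
    # max_tokens parameter from the request body.
    prompt_len: int
    max_tokens: Optional[int] = None

class Scheduler:
    def __init__(self, max_num_seqs: int, max_num_batched_tokens: int,
                 max_model_len: int):
        self.max_num_seqs = max_num_seqs
        self.max_num_batched_tokens = max_num_batched_tokens
        self.max_model_len = max_model_len
        self.queue: deque[Request] = deque()
        self.running: list[Request] = []

    def token_budget(self, req: Request) -> int:
        # Prompt tokens plus the maximum possible output tokens.
        # If max_tokens is unset, fall back to max-model-len - prompt-len.
        out = (req.max_tokens if req.max_tokens is not None
               else self.max_model_len - req.prompt_len)
        return req.prompt_len + out

    def processing_tokens(self) -> int:
        return sum(self.token_budget(r) for r in self.running)

    def try_pull(self) -> Optional[Request]:
        """Pull the next queued request only if BOTH conditions hold:
        running requests < max-num-seqs AND
        processing tokens < max-num-batched-tokens."""
        if not self.queue:
            return None
        if len(self.running) >= self.max_num_seqs:
            return None
        if self.processing_tokens() >= self.max_num_batched_tokens:
            return None
        req = self.queue.popleft()
        self.running.append(req)
        return req
```

For example, with max-num-seqs=2, max-num-batched-tokens=100, and max-model-len=50, a request with prompt_len=5 and no max_tokens contributes 5 + (50 - 5) = 50 tokens to the running total, and a third pull is refused once two requests are running.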


Labels

enhancement (New feature or request)
