Support --max-num-batched-tokens configuration parameter #83

@mayabar

Description

Definition: max-num-batched-tokens is the maximum number of batched tokens per iteration.

Currently only max-num-seqs is supported; it defines how many requests can run in parallel.

If max-num-batched-tokens is defined, the total number of tokens allowed in a batch at runtime should be limited accordingly.

Given a queue of received requests, the worker should pull a request from the queue only if both of the following conditions are met:

  • The number of currently running requests is less than max-num-seqs, and

  • The total number of processing tokens is less than max-num-batched-tokens.

The total number of processing tokens is calculated by summing, for each running request, the number of tokens in the prompt plus the maximum number of tokens in the output.
The number of output tokens is given by the max_tokens parameter in the request; if max_tokens is not specified, it is calculated as max-model-len - prompt-len.
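The admission rule above could be sketched roughly as follows. This is a minimal illustrative sketch, not the project's actual implementation: the names (Request, Scheduler, try_pull, token_budget) are hypothetical, and the real worker loop, queueing, and request types will differ.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    # Hypothetical request shape: prompt length plus the optional
    # max_tokens parameter from the request body.
    prompt_len: int
    max_tokens: Optional[int] = None

class Scheduler:
    def __init__(self, max_num_seqs: int, max_num_batched_tokens: int,
                 max_model_len: int):
        self.max_num_seqs = max_num_seqs
        self.max_num_batched_tokens = max_num_batched_tokens
        self.max_model_len = max_model_len
        self.queue: deque[Request] = deque()
        self.running: list[Request] = []

    def token_budget(self, req: Request) -> int:
        # Prompt tokens plus the maximum possible output tokens.
        # If max_tokens is unset, fall back to max-model-len - prompt-len.
        out = (req.max_tokens if req.max_tokens is not None
               else self.max_model_len - req.prompt_len)
        return req.prompt_len + out

    def processing_tokens(self) -> int:
        return sum(self.token_budget(r) for r in self.running)

    def try_pull(self) -> Optional[Request]:
        """Pull the next queued request only if BOTH conditions hold:
        running requests < max-num-seqs AND
        processing tokens < max-num-batched-tokens."""
        if not self.queue:
            return None
        if len(self.running) >= self.max_num_seqs:
            return None
        if self.processing_tokens() >= self.max_num_batched_tokens:
            return None
        req = self.queue.popleft()
        self.running.append(req)
        return req
```

For example, with max-num-seqs=2, max-num-batched-tokens=100, and max-model-len=50, a request with prompt_len=5 and no max_tokens contributes 5 + (50 - 5) = 50 tokens to the running total, and a third pull is refused once two requests are running.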


Labels

enhancement (New feature or request)
