v0.3.0
Pre-release
    
Release Notes
Compatibility with vLLM
- Aligned command-line parameters with vLLM. All parameters supported by both the simulator and vLLM now share the same name and format:
  - Support for --served-model-name
  - Support for --seed
  - Support for --max-model-len
- Added support for tools in chat completions (see the example after this list)
- Included usage information in the response
- Added the object field to the response JSON
- Added support for multimodal inputs in chat completions
- Added health and readiness endpoints
- Added P/D support; the connector type must be set to nixl
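
The tools and usage additions can be exercised through the simulator's OpenAI-compatible chat completions endpoint. Below is a minimal sketch using the openai Python client; the base URL, port, model name, and tool definition are assumptions and should be adjusted to match your deployment.

```python
# Minimal sketch: send a chat completion with a tool definition to the
# simulator's OpenAI-compatible endpoint and inspect the new response fields.
# The base URL, port, model name, and tool schema below are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-model",  # assumed: the model name the simulator is serving
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
)

# The response now includes the object field and token usage.
print(response.object)
print(response.usage)
print(response.choices[0].message.tool_calls)
```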
 
Additional Features
- Introduced configuration file support. All parameters can now be loaded from a configuration file in addition to being set via the command line (see the sketch after this list).
- Added new test coverage
- Changed the Docker base image
- Added the ability to randomize time to first token, inter-token latency, and KV-cache transfer latency
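
As an illustration of the configuration file, here is a minimal sketch. The YAML format, the key names (assumed to mirror the command-line flags), and the way the file is passed to the simulator are assumptions; consult the repository's README for the authoritative format.

```yaml
# Hypothetical config.yaml -- key names are assumed to mirror the CLI flags
# listed in these notes; values are placeholders.
model: my-model
served-model-name: my-model-alias
seed: 42
max-model-len: 4096
max-num-seqs: 256
```

The file would then be referenced on the command line, for example via a flag such as --config config.yaml (the exact flag name is an assumption).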
 
Migration Notes (for users upgrading from versions prior to v0.2.0)
- max-running-requests has been renamed to max-num-seqs
- lora has been replaced by lora-modules, which now accepts a list of JSON strings, e.g., '{"name": "name", "path": "lora_path", "base_model_name": "id"}'
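
For illustration, a post-migration invocation might look like the following; the binary name is a placeholder and the values follow the JSON-string format shown above.

```bash
# Placeholder binary name and values; the lora-modules JSON string follows
# the format given in the migration note above.
<simulator-binary> \
  --max-num-seqs 4 \
  --lora-modules '{"name": "name", "path": "lora_path", "base_model_name": "id"}'
```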
 
Change details since v0.2.2
- feat: add max-model-len configuration and validation for context window (#82) by @mohitpalsingh in #85
- Fixed readme, removed error for --help by @irar2 in #89
- Pd support by @mayabar in #94
- fix: crash when omitted stream_options by @jasonmadigan in #95
- style: 🔨 splits all import blocks into different sections by @yafengio in #98
- Fixed deployment.yaml by @irar2 in #99
- Enable configuration of various parameters in tools by @irar2 in #100
- Choose latencies randomly by @irar2 in #103
 
New Contributors
- @mohitpalsingh made their first contribution in #85
- @jasonmadigan made their first contribution in #95
 
Full Changelog: v0.2.2...v0.3.0