
v0.5.0


@mayabar released this on 16 Sep 06:54
9c541b9

New features

  • Adjust request processing time to the current server load
  • Base the time-to-first-token (TTFT) parameter on the number of request tokens
  • Take cached prompt tokens (KV cache) into account in prefill time
  • Support failure injection
  • Implement KV cache usage and waiting LoRAs Prometheus metrics
  • Randomize response length based on a histogram when max_tokens is defined in the request
  • Support DP (data parallel)
  • Support /tokenize endpoint

What's Changed

  • Fix server interrupt by @npolshakova in #161
  • Show final config in simulator default logger at Info level by @pancak3 in #154
  • Cast bounds in tests to int to match the function definitions of latency, interToken, and timeToFirst by @pancak3 in #163
  • Remove unnecessary deferral of server close by @pancak3 in #162
  • Fix: random generator was not set in a test suite, causing a nil pointer access at runtime when only that test suite was run by @pancak3 in #166
  • Use channels for metrics updates, added metrics tests by @irar2 in #171
  • Remove rerun on comment action by @irar2 in #174
  • Add failure injection mode to simulator by @smarunich in #131
  • Add waiting loras list to loraInfo metrics by @mayabar in #175
  • feat: generate response length based on a histogram when max_tokens is defined in the request by @mayabar in #169
  • Extend response length bucket calculation to allow buckets that are not necessarily equal in size by @mayabar in #176
  • Use dynamic ports in zmq tests by @pancak3 in #170
  • Change time-to-first-token parameter to be based on number of request tokens #137 by @pancak3 in #165
  • Bugfix: number of tokens was read from a nil variable; it is now taken from the request instead by @pancak3 in #177
  • feat: add helm charts for Kubernetes deployment by @Blackoutta in #182
  • chore: Make the image smaller by @shmuelk in #183
  • Take cached prompt tokens into account in prefill time calculation by @irar2 in #184
  • Add ignore eos in request by @pancak3 in #187
  • Support DP by @irar2 in #188
  • Change RandomNorm from float types to int by @pancak3 in #190
  • KV cache usage metric by @irar2 in #192
  • Adjust request "processing time" to current load by @pancak3 in #189
  • Updates for the new release of kv-cache-manager by @irar2 in #194
  • DP bug fix: wait after starting rank 0 sim by @irar2 in #193
  • Support /tokenize endpoint by @irar2 in #198
  • add Service to expose vLLM deployment and update doc by @googs1025 in #201
  • Split simulator.go into several files by @irar2 in #199


Full Changelog: v0.4.0...v0.5.0