Limit concurrent requests to 28000 #19
base: main
Conversation
benchmark_serving.py
Outdated
```python
prompts_sent += 1

results = await asyncio.gather(*tasks)
async with aiohttp.ClientSession(trust_env=False, connector=aiohttp.TCPConnector(keepalive_timeout=30, enable_cleanup_closed=True, limit=28000), timeout=None, trace_configs=[trace_config]) as clientSession:
```
I am less inclined to hard-code 28k; I'd rather set the limit to 0 (no limit) here, catch the appropriate error, log it, and retry. The added metric and logging will help observability, and the retry is effectively the same outcome (a QPS slowdown).
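A minimal sketch of that alternative, assuming a hypothetical `send_one_request` coroutine, endpoint URL, and retry count that are not part of this PR:

```python
import asyncio
import logging

import aiohttp

MAX_RETRIES = 3  # illustrative only

async def send_one_request(session: aiohttp.ClientSession, url: str, payload: dict) -> dict:
    for attempt in range(MAX_RETRIES):
        try:
            async with session.post(url, json=payload) as resp:
                return await resp.json()
        except aiohttp.ClientConnectorError as e:
            # Likely ephemeral port exhaustion at high QPS; log and retry,
            # which effectively slows the offered QPS.
            logging.warning("connect failed (attempt %d/%d): %s", attempt + 1, MAX_RETRIES, e)
            await asyncio.sleep(1.0)
    raise RuntimeError("request failed after retries")

async def main() -> None:
    # limit=0 removes aiohttp's cap on concurrent connections.
    connector = aiohttp.TCPConnector(limit=0, keepalive_timeout=30, enable_cleanup_closed=True)
    async with aiohttp.ClientSession(trust_env=False, connector=connector) as session:
        await send_one_request(session, "http://localhost:8000/generate", {"prompt": "hi"})
```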
Yes, if we can log failures due to ephemeral port exhaustion, so that we know the experiment is not valid and the user needs to reduce the QPS or num_prompts.
+1 to logging when we exhaust ports. Added that, and also prevented server metrics from being included when we exhaust ports, since the wait time invalidates them. Non-server metrics could still be valuable, since the measured e2e latency includes the time spent waiting to send the request; but if no requests are ever queued on any model server and the bottleneck is this tool, then yes, the experiment data would certainly be invalid.
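A rough sketch of that behavior, with hypothetical names (the counter, report builder, and warning text are illustrative, not the PR's exact code):

```python
port_exhaustion_errors = 0  # incremented wherever aiohttp.ClientConnectorError is caught

def build_report(client_metrics: dict, server_metrics: dict) -> dict:
    # Client-side metrics (e.g. e2e latency) still include time spent waiting
    # to send the request, so they are always reported.
    report = dict(client_metrics)
    if port_exhaustion_errors == 0:
        report.update(server_metrics)
    else:
        report["warning"] = (
            f"{port_exhaustion_errors} requests hit ephemeral port exhaustion; "
            "server metrics omitted. Reduce the QPS or num_prompts."
        )
    return report
```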
This PR fixes the ClientConnectorErrors seen at high QPS due to port exhaustion by upper-bounding the number of concurrent requests to 28000. This is a temporary fix until the approach for high-QPS benchmarking is decided on. 28000 is roughly the number of ephemeral ports available when containerized. Added an active_connections_metric gauge.
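A minimal sketch of how such a gauge could be wired through aiohttp's TraceConfig (matching the trace_configs=[trace_config] argument in the diff above); the prometheus_client usage and metric wiring here are assumptions, not the PR's exact implementation:

```python
import aiohttp
from prometheus_client import Gauge

active_connections_metric = Gauge(
    "benchmark_active_connections",
    "In-flight requests issued by benchmark_serving.py",
)

async def on_request_start(session, trace_config_ctx, params):
    active_connections_metric.inc()

async def on_request_end(session, trace_config_ctx, params):
    active_connections_metric.dec()

trace_config = aiohttp.TraceConfig()
trace_config.on_request_start.append(on_request_start)
trace_config.on_request_end.append(on_request_end)
# In practice, on_request_exception should also decrement the gauge so
# failed requests (e.g. port exhaustion) are not counted as still active.
```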