Limit concurrent requests to 28000 #19

Open · Bslabe123 wants to merge 29 commits into main
Conversation

@Bslabe123 (Collaborator) commented on Mar 26, 2025:

This PR fixes the ClientConnectorErrors seen at high QPS due to port exhaustion by upper-bounding the number of concurrent requests at 28000, roughly the number of ephemeral ports available when containerized. This is a temporary fix until the approach for high-QPS benchmarking is decided on. Also adds an active_connections gauge metric.
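For context on the 28000 figure: on default Linux settings the ephemeral port range is 32768 to 60999, about 28k ports. A quick check (plain Python, independent of this project's code):

# Read the kernel's ephemeral port range; the Linux default is
# "32768 60999", i.e. 28232 usable ports, which is where the
# ~28000 cap comes from.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())
print(high - low + 1)  # 28232 with default settings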

LatencyProfileGenerator:time_to_first_token_created 1.7431152424978657e+09
# HELP LatencyProfileGenerator:active_requests How many requests actively being processed
# TYPE LatencyProfileGenerator:active_requests gauge
LatencyProfileGenerator:active_requests 29888.0
# HELP LatencyProfileGenerator:active_connections How many active connections
# TYPE LatencyProfileGenerator:active_connections gauge
LatencyProfileGenerator:active_connections 28000.0
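The gauge wiring is not shown in this excerpt; below is a minimal sketch of how the trace_config referenced in the diff could maintain it, using aiohttp's TraceConfig hooks. The metric name mirrors the output above; the rest is an assumption about the implementation, not the PR's actual code.

import aiohttp
from prometheus_client import Gauge

# Gauge mirroring the sample /metrics output above.
ACTIVE_CONNECTIONS = Gauge(
    "LatencyProfileGenerator:active_connections",
    "How many active connections",
)

async def on_connection_acquired(session, trace_config_ctx, params):
    # Fires when a request obtains a connection, newly created or reused.
    ACTIVE_CONNECTIONS.inc()

async def on_connection_released(session, trace_config_ctx, params):
    # Fires when the request finishes (successfully or not) and its
    # connection goes back to the pool.
    ACTIVE_CONNECTIONS.dec()

trace_config = aiohttp.TraceConfig()
trace_config.on_connection_create_end.append(on_connection_acquired)
trace_config.on_connection_reuseconn.append(on_connection_acquired)
trace_config.on_request_end.append(on_connection_released)
trace_config.on_request_exception.append(on_connection_released)

Each request acquires exactly one connection (created or reused) and releases it when the request ends, so the inc/dec pairs balance; a request that fails before connecting can decrement without a matching increment, which is acceptable noise for a gauge.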

@Bslabe123 Bslabe123 changed the title [WIP] Limit concurrent requests to prevent exhausting ephemeral ports Limit concurrent requests to 28000 prevent exhausting ephemeral ports Mar 27, 2025
@Bslabe123 Bslabe123 changed the title Limit concurrent requests to 28000 prevent exhausting ephemeral ports Limit concurrent requests to 28000 Mar 27, 2025
prompts_sent += 1

results = await asyncio.gather(*tasks)
# Cap concurrent connections at 28000, roughly the number of ephemeral
# ports available when containerized, to avoid ClientConnectorErrors.
async with aiohttp.ClientSession(
    trust_env=False,
    connector=aiohttp.TCPConnector(
        keepalive_timeout=30,
        enable_cleanup_closed=True,
        limit=28000,
    ),
    timeout=None,
    trace_configs=[trace_config],
) as clientSession:
A collaborator commented on this line:
I am less inclined to hard-code 28k than to set the limit to 0 (no limit) here and catch the appropriate error, log, and retry.

The added metric and logging will help observability. The retry is effectively the same outcome (QPS slowdown).
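A sketch of the alternative described here: no connector limit, catch the connect error, log, back off, retry. The function name and parameters are illustrative, not from the PR.

import asyncio
import logging

import aiohttp

async def send_with_retry(session: aiohttp.ClientSession, url: str,
                          payload: dict, max_retries: int = 5):
    """Illustrative retry loop for the suggested no-limit approach."""
    for attempt in range(max_retries):
        try:
            async with session.post(url, json=payload) as resp:
                return await resp.json()
        except aiohttp.ClientConnectorError as e:
            # Likely ephemeral-port exhaustion under high QPS; log so the
            # run can be flagged, then back off and retry.
            logging.warning("connect failed (attempt %d): %s",
                            attempt + 1, e)
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {max_retries} retries")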

Another collaborator replied:
Yes, if we can log failures due to ephemeral port exhaustion, then we know the experiment is not valid and the user needs to reduce the QPS or num_prompts.

@Bslabe123 (Collaborator, author) replied:

+1 to logging when we exhaust ports. Added that, and also prevented including server metrics when we exhaust ports, since the wait time invalidates them. Non-server metrics could still be valuable, because the measured e2e latency includes the time spent waiting to send the request; but if no requests are ever queued on any model server and the bottleneck is this tool, then the experiment data would certainly be invalid.
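The change itself is not shown in this excerpt; below is a minimal sketch of the invalidation logic described above, with hypothetical names.

import logging

import aiohttp

# Set when any request fails to connect; the final report skips
# server-side metrics in that case, since time spent waiting for an
# ephemeral port inflates them.
port_exhaustion_seen = False

def record_connect_failure(exc: aiohttp.ClientConnectorError) -> None:
    global port_exhaustion_seen
    port_exhaustion_seen = True
    logging.error(
        "request failed to connect (possible ephemeral port exhaustion); "
        "server metrics will be excluded from the report: %s", exc)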
