
QPS observability #32


Open
wants to merge 2 commits into main
Conversation

@jjk-g (Collaborator) commented Apr 1, 2025

Adds two QPS observability features:

  • Prometheus request counter, enabling PromQL-based observability
  • Async singleton counter that reports QPS in benchmark_result (sketched below)
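
A minimal sketch of what the async singleton counter could look like; the class and attribute names are illustrative rather than the exact ones in this PR, though the increment method matches the diff excerpt quoted in the review below.

import asyncio
import time

class RequestCounter:
    """Process-wide, async-safe request counter used to derive QPS (illustrative)."""

    _instance = None

    def __new__(cls):
        # Singleton: every coroutine that records a request shares this instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._count = 0
            cls._instance._start = time.monotonic()
            cls._instance._lock = asyncio.Lock()
        return cls._instance

    async def increment(self):
        # Serialize updates so concurrent request coroutines do not race.
        async with self._lock:
            self._count += 1

    def qps(self):
        elapsed = time.monotonic() - self._start
        return self._count / elapsed if elapsed > 0 else 0.0

The benchmark can then report Queries/sec as the count divided by elapsed wall time. On the Prometheus side, the same number would come from a rate() query over the request counter, e.g. rate(benchmark_request_count[1m]) (metric name hypothetical; the actual name depends on the PR).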

Tested:

python3 benchmark_serving.py --save-json-results --host=llama3-8b-vllm-service --port=8000 --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Meta-Llama-3-8B --request-rate=20 --backend=vllm --num-prompts=400 --max-input-length=1024 --max-output-length=1024 --file-prefix=benchmark --models=meta-llama/Meta-Llama-3-8B --scrape-server-metrics
Namespace(backend='vllm', sax_model='', file_prefix='benchmark', endpoint='generate', host='llama3-8b-vllm-service', port=8000, dataset='ShareGPT_V3_unfiltered_cleaned_split.json', models='meta-llama/Meta-Llama-3-8B', traffic_split=None, stream_request=False, request_timeout=10800.0, tokenizer='meta-llama/Meta-Llama-3-8B', best_of=1, use_beam_search=False, num_prompts=400, max_input_length=1024, max_output_length=1024, top_k=32000, request_rate=20.0, seed=1743547367, trust_remote_code=False, machine_cost=None, use_dummy_text=False, save_json_results=True, output_bucket='', output_bucket_filepath=None, save_aggregated_result=False, additional_metadata_metrics_to_save=None, scrape_server_metrics=True, pm_namespace='default', pm_job='vllm-podmonitoring')
Models to benchmark: ['meta-llama/Meta-Llama-3-8B']
No traffic split specified. Defaulting to uniform traffic split.
Starting Prometheus Server on port 9090
====Result for Model: weighted====
Errors: {'ClientConnectorError': 0, 'TimeoutError': 0, 'ContentTypeError': 0, 'ClientOSError': 0, 'ServerDisconnectedError': 0, 'unknown_error': 0}
Total time: 93.23 s
Successful/total requests: 400/400
Requests/min: 257.42
Queries/sec: 20.49
Output_tokens/min: 24535.02
Input_tokens/min: 64095.85
Tokens/min: 88630.87
Average seconds/token (includes waiting time on server): 0.12
Average milliseconds/request (includes waiting time on server): 23495.42
Average milliseconds/output_token (includes waiting time on server): 2225.20
Average input length: 248.99
Average output length: 95.31
====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorError': 0, 'TimeoutError': 0, 'ContentTypeError': 0, 'ClientOSError': 0, 'ServerDisconnectedError': 0, 'unknown_error': 0}
Total time: 93.23 s
Successful/total requests: 400/400
Requests/min: 257.42
Queries/sec: 20.49
Output_tokens/min: 24535.02
Input_tokens/min: 64095.85
Tokens/min: 88630.87
Average seconds/token (includes waiting time on server): 0.12
Average milliseconds/request (includes waiting time on server): 23495.42
Average milliseconds/output_token (includes waiting time on server): 2225.20
Average input length: 248.99
Average output length: 95.31

jjk-g added 2 commits April 1, 2025 14:35
Adds a singleton counter that allows for calculating QPS.
The review comment below refers to this part of the diff:

        return cls._instance

    async def increment(self):
        async with self._lock:
Since we have to lock to increment the counter each time, does it lead to any slowdowns waiting for this to happen when the QPS is high? Is there a way to check? I'm wondering if this can slow the rate at which we send requests.
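
One way to check, as a sketch rather than anything in the PR: micro-benchmark the locked increment against a bare one. asyncio runs coroutines on a single thread, so the integer increment is already effectively atomic between awaits and the lock mostly adds scheduling overhead; a timing like the one below would quantify it.

import asyncio
import time

async def measure(increment, n=100_000):
    # Time n sequential awaits of the given increment coroutine.
    start = time.perf_counter()
    for _ in range(n):
        await increment()
    return time.perf_counter() - start

async def main():
    lock = asyncio.Lock()
    count = 0

    async def locked():
        nonlocal count
        async with lock:
            count += 1

    async def unlocked():
        nonlocal count
        count += 1

    t_locked = await measure(locked)
    t_unlocked = await measure(unlocked)
    print(f"locked:   {t_locked:.3f}s per 100k increments")
    print(f"unlocked: {t_unlocked:.3f}s per 100k increments")

asyncio.run(main())

At a request rate around 20 QPS the per-increment cost (microseconds) should be negligible next to network latency, but a measurement like this would confirm it.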
