make eval script also handle performance measurement #3473

vkuzo · 2025-12-09T20:47:51Z

Summary:

1. refactors the eval script to also handle performance measurement in vllm
2. adds a simple `vllm bench latency` script to benchmark prefill and decode in vllm

Also, add convenience flags to skip model creation, lm_eval, vllm as
needed to enable running just a single model + single step.
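
A minimal sketch of how skip flags like these can gate each step of a shell script. The variable names `SKIP_MODEL_CREATE`, `SKIP_LM_EVAL`, and `SKIP_VLLM` are taken from the test plan in this PR's commit message; the `run_step` helper and the echoed step names are hypothetical placeholders, not the actual contents of `measure_accuracy_and_performance.sh`.

```shell
#!/usr/bin/env bash
set -eu

# Each flag defaults to 0 (run the step); set it to 1 to skip the step.
SKIP_MODEL_CREATE="${SKIP_MODEL_CREATE:-0}"
SKIP_LM_EVAL="${SKIP_LM_EVAL:-0}"
SKIP_VLLM="${SKIP_VLLM:-0}"

# run_step SKIP_FLAG NAME (hypothetical helper): report whether NAME would run.
# A real script would invoke the step in the else branch instead of echoing.
run_step() {
  if [ "$1" = "1" ]; then
    echo "skip: $2"
  else
    echo "run: $2"
  fi
}

run_step "$SKIP_MODEL_CREATE" "model creation"
run_step "$SKIP_LM_EVAL" "lm_eval accuracy"
run_step "$SKIP_VLLM" "vllm performance"
```

Defaulting every flag to 0 keeps the full pipeline as the default behavior, so setting two flags to 1 is enough to isolate a single step.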

Test Plan:

with-proxy ./benchmarks/quantization/measure_accuracy_and_performance.sh h100
// full output: https://www.internalfb.com/phabricator/paste/view/P2094641791
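
One way to separate prefill from decode latency with `vllm bench latency`, shown as a dry-run sketch that only prints the commands. A long prompt with a single output token approximates prefill; a short prompt with many output tokens approximates decode. The model name, token lengths, and batch size below are illustrative assumptions, not the values used by this PR's script, and the `--model`, `--input-len`, `--output-len`, and `--batch-size` options are assumed to match the installed vLLM version (check `vllm bench latency --help`).

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical defaults; override them from the environment.
MODEL="${MODEL:-facebook/opt-125m}"
BATCH_SIZE="${BATCH_SIZE:-1}"

# bench_cmd INPUT_LEN OUTPUT_LEN: print the command line for one latency run.
bench_cmd() {
  echo "vllm bench latency --model $MODEL --input-len $1 --output-len $2 --batch-size $BATCH_SIZE"
}

# Prefill-dominated: long prompt, a single generated token.
PREFILL_CMD="$(bench_cmd 4096 1)"
# Decode-dominated: short prompt, many generated tokens.
DECODE_CMD="$(bench_cmd 32 512)"

echo "$PREFILL_CMD"
echo "$DECODE_CMD"
# To actually run a benchmark (requires vllm and a GPU): eval "$PREFILL_CMD"
```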

[ghstack-poisoned]

vkuzo · 2025-12-09T20:47:52Z

Stack from ghstack (oldest at bottom):

-> make eval script also handle performance measurement #3473

pytorch-bot · 2025-12-09T20:47:55Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3473

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 6 Pending

As of commit 55c2ab4 with merge base 486fe0d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary:

1. refactors the eval script to also handle performance measurement in vllm
2. adds a simple `vllm bench latency` script to bench in vllm

The script is broken on every single recipe, we'll have to fix and enable things in future PRs, will update the performance tables afterwards.

Also, add convenience flags to skip model creation, lm_eval, vllm as needed to enable running just a single model + single step.

Test Plan:

```
SKIP_MODEL_CREATE=1 SKIP_LM_EVAL=1 SKIP_VLLM=0 with-proxy ./benchmarks/quantization/measure_accuracy_and_performance.sh h100
```

ghstack-source-id: e1d713e
ghstack-comment-id: 3634216524
Pull-Request: #3473

[ghstack-poisoned]

Later revisions repeat this message with only ghstack-source-id changed: 15f7481, 15f7481, 665f2c8, cae97ab, 42466df, 79c5722, a0019d4.

vkuzo added 4 commits December 9, 2025 06:30

Update

cbc18b3

[ghstack-poisoned]

Update

cf212a9

[ghstack-poisoned]

Update

d4f3afd

[ghstack-poisoned]

Update

9450d20

[ghstack-poisoned]

meta-cla bot added the CLA Signed label Dec 9, 2025

This was referenced Dec 9, 2025

simplify accuracy eval #3470

Merged

refactor accuracy eval script to be organized by hardware #3472

Merged

vkuzo added the topic: for developers label Dec 9, 2025

vkuzo requested review from jainapurva and jerryzh168 December 10, 2025 11:24

vkuzo added 2 commits December 10, 2025 10:08

Update

d99f6d8

[ghstack-poisoned]

Update

17ef1f7

[ghstack-poisoned]

Update

8aba356

[ghstack-poisoned]

vkuzo changed the base branch from gh/vkuzo/182/head to main December 10, 2025 18:09

Update

ebde070

[ghstack-poisoned]

Update

86304cb

[ghstack-poisoned]

Update

d0f8a00

[ghstack-poisoned]

Update

1eb4438

[ghstack-poisoned]

Update

55c2ab4

[ghstack-poisoned]
