Conversation

@rzabarazesh commented Oct 3, 2025

Adds metrics for both CI runtime and code review cycle. Updated to also add reliability metrics.


vercel bot commented Oct 3, 2025

@rzabarazesh is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 3, 2025
@rzabarazesh rzabarazesh requested a review from huydhn October 3, 2025 15:26

vercel bot commented Oct 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project    Deployment    Preview    Updated (UTC)
torchci    Ready         Preview    Oct 7, 2025 3:13am

@huydhn (Contributor) commented Oct 3, 2025

PyTorch and test-infra use a tool called lintrunner to run all linters; it's our version of pre-commit. You'll want to install it from https://pypi.org/project/lintrunner and run lintrunner init && lintrunner -a to fix these lint failures.

Also, the failure in https://github.com/pytorch/test-infra/actions/runs/18228918765/job/51907208089?pr=7285 can be fixed easily by running yarn format to format the React code automatically.

}
}

const options: EChartsOption = {
Contributor

Do you want only some legends to be selectable? Right now
{ name: "Success" },
{ name: "Failed" },
{ name: "Canceled" },
are not clickable.

Author

I'm not sure I understood this one. Do you mean the data points, or that the legend itself isn't clickable?
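For reference, a minimal sketch (not the PR's actual code) of how ECharts legend toggling is usually wired up: the legend component must be declared and its data entries must match the series names exactly, otherwise clicking a legend entry has no effect. The option shape below is an assumption modeled on the snippet under review.

```typescript
// Hedged sketch: series names and option shape are assumptions, not the
// PR's actual EChartsOption object.
const options = {
  legend: {
    // Entries must match the series `name`s exactly to be clickable.
    data: ["Success", "Failed", "Canceled"],
    // "multiple" (the default) lets each legend entry toggle its series.
    selectedMode: "multiple",
  },
  series: [
    { name: "Success", type: "bar", data: [3, 5] },
    { name: "Failed", type: "bar", data: [1, 0] },
    { name: "Canceled", type: "bar", data: [0, 2] },
  ],
};

// Sanity check: every series name has a matching legend entry.
const allToggleable = options.series.every((s) =>
  options.legend.data.includes(s.name)
);
```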

@rzabarazesh rzabarazesh requested a review from yangw-dev October 6, 2025 21:26
@rzabarazesh rzabarazesh changed the title vllm - add CI runtime and review cycle metrics vllm - Add initial set of metrics Oct 7, 2025
bucket,
countIf(lowerUTF8(build_state) IN ('passed', 'finished', 'success'))
AS passed_count,
countIf(lowerUTF8(build_state) = 'failed') AS failed_count,
Contributor

Do you want to split this up into actual failures and soft failures? Maybe we only care about the former category.

@rzabarazesh (Author) commented Oct 13, 2025

Correct. We mostly care about hard failures. I added a commit to be more explicit about soft failures.
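A minimal sketch of the split being discussed, with the pass states taken from the countIf in the query above; the soft_failed state name is an assumption for illustration, not necessarily an actual value in the build_state column.

```typescript
// Hedged sketch: separate hard failures from soft failures so that jobs
// allowed to fail do not count against reliability. `soft_failed` is a
// hypothetical state name.
type Counts = { passed: number; hardFailed: number; softFailed: number };

function tallyBuildStates(states: string[]): Counts {
  const counts: Counts = { passed: 0, hardFailed: 0, softFailed: 0 };
  for (const raw of states) {
    const s = raw.toLowerCase(); // mirrors lowerUTF8(build_state)
    if (["passed", "finished", "success"].includes(s)) counts.passed += 1;
    else if (s === "soft_failed") counts.softFailed += 1;
    else if (s === "failed") counts.hardFailed += 1;
  }
  return counts;
}
```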

success_rate
FROM job_stats
ORDER BY
success_rate ASC,
Contributor

A curious question: my understanding is that this query would return the worst job first. Why, then, does the preview show all the jobs with a 100% success rate first? I guess we want to focus on those that are not in a good state, so we should show unreliable jobs first, right?

Author

Sure. Changed it to show the worst jobs first.
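For clarity, the intended ordering can be sketched like this (job names are made up for illustration); it mirrors what ORDER BY success_rate ASC should produce, surfacing the least reliable jobs first.

```typescript
// Hedged sketch: sort job stats so the least reliable jobs come first,
// matching the `ORDER BY success_rate ASC` intent in the query.
type JobStat = { name: string; successRate: number };

function worstFirst(jobs: JobStat[]): JobStat[] {
  // Copy before sorting so the input array is left untouched.
  return [...jobs].sort((a, b) => a.successRate - b.successRate);
}

const sorted = worstFirst([
  { name: "lint", successRate: 1.0 },
  { name: "e2e", successRate: 0.42 },
  { name: "build", successRate: 0.9 },
]);
```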

GROUP BY
bucket
),
manual_merged_prs_pending AS (
@huydhn (Contributor) commented Oct 9, 2025

Let's chat more about this one because I don't think it would work like this. My understanding is that job_state is a field that is updated as the job progresses, changing from scheduled to pending to running, then to succeeded, failed, cancelled, etc. A manual merge due to impatience means that the job is scheduled, pending, or running at the time the merge occurs. So, it's a snapshot in time. However, the job information we have here is only the latest state, which means this query's results change depending on when you run it.

If you agree with this, we could exclude this KPI and implement it later in a different PR, as I need to double-check whether the above snapshot is even kept in the database instead of being overwritten. If it is indeed being overwritten, we need to think about a way to persist the snapshot of all jobs at the time of a merge. Just FYI, PyTorch keeps that in a table called merges, although I don't think we could reuse that one.

Author

You're right. Removed it for now.
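The snapshot-based alternative described above can be sketched as follows. All field names here are hypothetical; the point is that the KPI is evaluated against job states frozen at merge time, so it cannot drift as jobs later finish.

```typescript
// Hedged sketch: persist job states at merge time (as PyTorch does in its
// `merges` table) and evaluate "merged while CI pending" against that
// frozen snapshot. Field names are assumptions.
type MergeSnapshot = {
  prNumber: number;
  mergedAt: string;
  jobStates: Record<string, string>; // job name -> state at merge time
};

function wasMergedWhilePending(snap: MergeSnapshot): boolean {
  return Object.values(snap.jobStates).some((s) =>
    ["scheduled", "pending", "running"].includes(s.toLowerCase())
  );
}
```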

@@ -0,0 +1,23 @@
-- vLLM trunk health history
@huydhn (Contributor) commented Oct 9, 2025

This is more of a high-level comment on the approach for many of these CI metrics, including:

  • torchci/clickhouse_queries/vllm/ci_reliability/query.sql
  • torchci/clickhouse_queries/vllm/ci_run_duration/query.sql
  • torchci/clickhouse_queries/vllm/job_reliability/query.sql

Should we adopt the same approach as torchci/clickhouse_queries/vllm/trunk_health/query.sql and limit these queries to only jobs from the main branch? The reason I bring this up is that contributors are free to experiment in their PRs, and including PRs could really skew these metrics: for example, new components that take longer to build, or tests that are flaky because the PR is a work in progress, which is fine. Only when the changes are approved and landed do they become the new norm. For this reason, in PyTorch we generally just look at CI metrics from the main branch.

Basically, my thought is that we should only capture issues that affect multiple contributors, and exclude work-in-progress noise.

Author

Good idea! Done

),

-- Track state changes
build_with_prev AS (
@huydhn (Contributor) commented Oct 9, 2025

Um, how does this work when there are multiple build failures before trunk recovers? My understanding is that we want to capture this pattern: last success, F, F, ..., F, F, success (recovered), and the time in between. I can see the second transition from F to Success here, but what about the first transition from Success to F?

Author

Good catch! Let me fix that
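The pattern described above (last success, then a run of failures, then the recovering success) can be sketched like this; the Build shape is hypothetical, but the logic shows that both transitions must be tracked to measure an outage window.

```typescript
// Hedged sketch: detect Success -> Failed (breakage) and the matching
// Failed -> Success (recovery) transitions in a time-ordered build list,
// and report each outage window. Field names are assumptions.
type Build = { finishedAt: number; ok: boolean };

function outageWindows(
  builds: Build[]
): Array<{ brokeAt: number; recoveredAt: number }> {
  const windows: Array<{ brokeAt: number; recoveredAt: number }> = [];
  let brokeAt: number | null = null;
  for (let i = 1; i < builds.length; i++) {
    const prev = builds[i - 1];
    const cur = builds[i];
    // First transition: Success -> Failed marks the start of the outage.
    if (prev.ok && !cur.ok) brokeAt = cur.finishedAt;
    // Second transition: Failed -> Success marks the recovery; any number
    // of consecutive failures in between belongs to the same window.
    if (!prev.ok && cur.ok && brokeAt !== null) {
      windows.push({ brokeAt, recoveredAt: cur.finishedAt });
      brokeAt = null;
    }
  }
  return windows;
}
```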
