Conversation

@rzabarazesh commented Oct 3, 2025

Adds metrics for both CI runtime and code review cycle. Updated to also add reliability metrics.


vercel bot commented Oct 3, 2025

@rzabarazesh is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 3, 2025
@rzabarazesh rzabarazesh requested a review from huydhn October 3, 2025 15:26

vercel bot commented Oct 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project    Deployment    Preview    Updated (UTC)
torchci    Ready         Preview    Oct 7, 2025 3:13am

@huydhn (Contributor) commented Oct 3, 2025

PyTorch and test-infra use a tool called lintrunner to run all linters; it's our version of pre-commit. You'll want to install it from https://pypi.org/project/lintrunner and run lintrunner init && lintrunner -a to fix these lint failures.

Also, the failure in https://github.com/pytorch/test-infra/actions/runs/18228918765/job/51907208089?pr=7285 can be fixed easily by running yarn format to format the React code automatically.

}
}

const options: EChartsOption = {
Contributor

Do you want only some legends to be selectable? Right now
{ name: "Success" },
{ name: "Failed" },
{ name: "Canceled" },
are not clickable.

Author

I'm not sure I understood this one. Do you mean the data points, or that the legend itself isn't clickable?
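For reference, a minimal sketch (not the PR's actual code) of how ECharts legend toggling is usually wired up: the legend component must be declared and its data entries must match the series names exactly, otherwise clicking a legend entry has no effect. The option shape below is an assumption modeled on the snippet under review.

```typescript
// Hedged sketch: series names and option shape are assumptions, not the
// PR's actual EChartsOption object.
const options = {
  legend: {
    // Entries must match the series `name`s exactly to be clickable.
    data: ["Success", "Failed", "Canceled"],
    // "multiple" (the default) lets each legend entry toggle its series.
    selectedMode: "multiple",
  },
  series: [
    { name: "Success", type: "bar", data: [3, 5] },
    { name: "Failed", type: "bar", data: [1, 0] },
    { name: "Canceled", type: "bar", data: [0, 2] },
  ],
};

// Sanity check: every series name has a matching legend entry.
const allToggleable = options.series.every((s) =>
  options.legend.data.includes(s.name)
);
```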

@rzabarazesh rzabarazesh requested a review from yangw-dev October 6, 2025 21:26
@rzabarazesh rzabarazesh changed the title vllm - add CI runtime and review cycle metrics vllm - Add initial set of metrics Oct 7, 2025
bucket,
countIf(lowerUTF8(build_state) IN ('passed', 'finished', 'success'))
AS passed_count,
countIf(lowerUTF8(build_state) = 'failed') AS failed_count,
Contributor

Do you want to split this up into actual failures and soft failures? Maybe we only care about the former category.

@rzabarazesh (Author) commented Oct 13, 2025

Correct. We mostly care about hard failures. I added a commit to be more explicit about soft failures.
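A minimal sketch of the split being discussed, with the pass states taken from the countIf in the query above; the soft_failed state name is an assumption for illustration, not necessarily an actual value in the build_state column.

```typescript
// Hedged sketch: separate hard failures from soft failures so that jobs
// allowed to fail do not count against reliability. `soft_failed` is a
// hypothetical state name.
type Counts = { passed: number; hardFailed: number; softFailed: number };

function tallyBuildStates(states: string[]): Counts {
  const counts: Counts = { passed: 0, hardFailed: 0, softFailed: 0 };
  for (const raw of states) {
    const s = raw.toLowerCase(); // mirrors lowerUTF8(build_state)
    if (["passed", "finished", "success"].includes(s)) counts.passed += 1;
    else if (s === "soft_failed") counts.softFailed += 1;
    else if (s === "failed") counts.hardFailed += 1;
  }
  return counts;
}
```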

success_rate
FROM job_stats
ORDER BY
success_rate ASC,
Contributor

A curious question: my understanding is that this query would return the worst job first. Why, then, does the preview show all the jobs with a 100% success rate first? I guess we want to focus on those that are not in a good state, so we should show unreliable jobs first, right?

Author

Sure. Changed it to show the worst jobs first.
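For clarity, the intended ordering can be sketched like this (job names are made up for illustration); it mirrors what ORDER BY success_rate ASC should produce, surfacing the least reliable jobs first.

```typescript
// Hedged sketch: sort job stats so the least reliable jobs come first,
// matching the `ORDER BY success_rate ASC` intent in the query.
type JobStat = { name: string; successRate: number };

function worstFirst(jobs: JobStat[]): JobStat[] {
  // Copy before sorting so the input array is left untouched.
  return [...jobs].sort((a, b) => a.successRate - b.successRate);
}

const sorted = worstFirst([
  { name: "lint", successRate: 1.0 },
  { name: "e2e", successRate: 0.42 },
  { name: "build", successRate: 0.9 },
]);
```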

GROUP BY
bucket
),
manual_merged_prs_pending AS (
@huydhn (Contributor) commented Oct 9, 2025

Let's chat more about this one because I don't think it would work like this. My understanding is that job_state is a field that is updated as the job progresses, changing from scheduled to pending to running, then to succeeded, failed, cancelled, etc. A manual merge due to impatience means that the job is scheduled, pending, or running at the time the merge occurs. So, it's a snapshot in time. However, the job information we have here is only the latest state, which means this query's results change depending on when you run it.

If you agree with this, we could exclude this KPI and implement it later in a different PR, as I need to double-check whether the above snapshot is even kept in the database instead of being overwritten. If it is indeed being overwritten, we need to think about a way to persist the snapshot of all jobs at the time of a merge. Just FYI, PyTorch keeps that in a table called merges, although I don't think we could reuse that one.

Author

You're right. Removed it for now.
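The snapshot-based alternative described above can be sketched as follows. All field names here are hypothetical; the point is that the KPI is evaluated against job states frozen at merge time, so it cannot drift as jobs later finish.

```typescript
// Hedged sketch: persist job states at merge time (as PyTorch does in its
// `merges` table) and evaluate "merged while CI pending" against that
// frozen snapshot. Field names are assumptions.
type MergeSnapshot = {
  prNumber: number;
  mergedAt: string;
  jobStates: Record<string, string>; // job name -> state at merge time
};

function wasMergedWhilePending(snap: MergeSnapshot): boolean {
  return Object.values(snap.jobStates).some((s) =>
    ["scheduled", "pending", "running"].includes(s.toLowerCase())
  );
}
```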

@@ -0,0 +1,23 @@
-- vLLM trunk health history
@huydhn (Contributor) commented Oct 9, 2025

This is more of a high-level comment on the approach for many of these CI metrics, including:

  • torchci/clickhouse_queries/vllm/ci_reliability/query.sql
  • torchci/clickhouse_queries/vllm/ci_run_duration/query.sql
  • torchci/clickhouse_queries/vllm/job_reliability/query.sql

Should we adopt the same approach as torchci/clickhouse_queries/vllm/trunk_health/query.sql and limit these queries to only jobs from the main branch? The reason I bring this up is that contributors are free to experiment in their PRs, and including PRs could really skew these metrics: for example, new components that take longer to build, or tests that are flaky because the PR is a work in progress, which is fine. Only when the changes are approved and landed do they become the new norm. For this reason, in PyTorch we generally just look at CI metrics from the main branch.

Basically, my thought is that we should only capture issues that affect multiple contributors, and exclude work-in-progress noise.

Author

Good idea! Done

),

-- Track state changes
build_with_prev AS (
@huydhn (Contributor) commented Oct 9, 2025

Um, how does this work when there are multiple build failures before trunk recovers? My understanding is that we want to capture this pattern: last success, F, F, ..., F, F, success (recovered), and the time in between. I can see the second transition from F to Success here, but what about the first transition from Success to F?

Author

Good catch! Let me fix that
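The pattern described above (last success, then a run of failures, then the recovering success) can be sketched like this; the Build shape is hypothetical, but the logic shows that both transitions must be tracked to measure an outage window.

```typescript
// Hedged sketch: detect Success -> Failed (breakage) and the matching
// Failed -> Success (recovery) transitions in a time-ordered build list,
// and report each outage window. Field names are assumptions.
type Build = { finishedAt: number; ok: boolean };

function outageWindows(
  builds: Build[]
): Array<{ brokeAt: number; recoveredAt: number }> {
  const windows: Array<{ brokeAt: number; recoveredAt: number }> = [];
  let brokeAt: number | null = null;
  for (let i = 1; i < builds.length; i++) {
    const prev = builds[i - 1];
    const cur = builds[i];
    // First transition: Success -> Failed marks the start of the outage.
    if (prev.ok && !cur.ok) brokeAt = cur.finishedAt;
    // Second transition: Failed -> Success marks the recovery; any number
    // of consecutive failures in between belongs to the same window.
    if (!prev.ok && cur.ok && brokeAt !== null) {
      windows.push({ brokeAt, recoveredAt: cur.finishedAt });
      brokeAt = null;
    }
  }
  return windows;
}
```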
