
Conversation

izaitsevfb (Contributor)

This helps filter out signals from flaky tests (tests that fail and then succeed after a rerun) and avoids reverting based on them. Such tests are excluded from HUD, so this change makes autorevert more consistent with HUD.

Changes:

  • Extract individual test reruns as separate Signal events
  • Dedup retains both outcomes
  • For each attempt, emit at most one FAILURE and one SUCCESS event
  • Skipped tests are excluded from "success"
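
A minimal sketch of the per-attempt rule above. The TestRow and SignalStatus types and their field names are illustrative assumptions, not the PR's actual code:

# Hypothetical sketch of the per-attempt collapsing rule; TestRow and
# SignalStatus are stand-ins, not the PR's actual types.
from dataclasses import dataclass
from enum import Enum
from typing import List

class SignalStatus(Enum):
    FAILURE = "failure"
    SUCCESS = "success"

@dataclass
class TestRow:
    failing: bool
    skipped: bool

def collapse_attempt(rows: List[TestRow]) -> List[SignalStatus]:
    """Emit at most one FAILURE and one SUCCESS for a single attempt.
    Skipped rows never count toward SUCCESS."""
    out: List[SignalStatus] = []
    if any(r.failing for r in rows):
        out.append(SignalStatus.FAILURE)
    if any(not r.failing and not r.skipped for r in rows):
        out.append(SignalStatus.SUCCESS)
    return out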

Extraction result before:
2025-10-08T18-48-39.575589-00-00.html

Extraction result after:
2025-10-08T21-30-45.676074-00-00.html

Notice flaky signals like:
pull:inductor/test_compile_subprocess.py::test_remove_noop_slice_scatter_cpu
pull:inductor/test_compile_subprocess.py::test_remove_noop_slice1_cpu

pytorch-bot added the ci-no-td label (Oct 8, 2025)
meta-cla bot added the CLA Signed label (Oct 8, 2025)

Review thread (on the dedup key change):

prev_key: Optional[Tuple[datetime, int, SignalStatus]] = None
for e in c.events:  # already sorted by (started_at, wf_run_id)
-   key = (e.started_at, e.wf_run_id)
+   key = (e.started_at, e.wf_run_id, e.status)
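
For context, a runnable sketch of how the dedup loop might look with the status-aware key. Only the key line is from the diff; the Event dataclass and the dedup_events helper are assumptions for illustration:

# Illustrative sketch only: Event, SignalStatus, and dedup_events are
# hypothetical stand-ins for the real code around this diff.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional, Tuple

class SignalStatus(Enum):
    FAILURE = "failure"
    SUCCESS = "success"

@dataclass
class Event:
    started_at: datetime
    wf_run_id: int
    status: SignalStatus

def dedup_events(events: List[Event]) -> List[Event]:
    """Drop consecutive duplicates; input is sorted by (started_at, wf_run_id)."""
    out: List[Event] = []
    prev_key: Optional[Tuple[datetime, int, SignalStatus]] = None
    for e in events:
        # With status in the key, a FAILURE and a SUCCESS produced by the
        # same job (same timestamp and run id) survive as two events.
        key = (e.started_at, e.wf_run_id, e.status)
        if key != prev_key:
            out.append(e)
        prev_key = key
    return out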
Contributor

Do we want to dedup by status? I mean, for flakiness analysis it is relevant whether the job failed 3x before succeeding once or failed 1x and succeeded once.

izaitsevfb (Author)

Adding status to the key prevents deduping events with different statuses.

There are two constraints:

  1. GitHub tends to reuse the same jobs (assigning them different ids) when "retry failed" is invoked for the workflow; this is what dedup is trying to solve.
  2. Test reruns come from the same job, so they share the same timestamp and workflow run_id.

I can think of an alternative way to extract test retries and avoid deduplicating them: we could artificially increase each synthetic event's time by one second.
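
For instance (purely illustrative, reusing the hypothetical Event type from the sketch above; bumping each successive rerun by one more second is one reading of the idea):

from datetime import timedelta
from typing import List

def spread_rerun_times(rerun_events: List[Event]) -> None:
    # Offset each synthetic rerun event by its index, in seconds, so the
    # (started_at, wf_run_id) key no longer collides across reruns of the
    # same job. The first rerun (index 0) keeps its original time.
    for i, e in enumerate(rerun_events):
        e.started_at += timedelta(seconds=i)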

However, you're right to question how this change would affect signal processing. I think tests are retried only when the first run fails, so in theory this change only acts as a filter. Maybe we can simplify the whole thing and just not consider signals when we detect mixed retries at the test level (pre-filter them before they are processed).
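
A possible shape for that pre-filter, again as a hypothetical helper reusing the Event type from the sketch above; signals for which it returns True would be dropped before any autorevert processing:

from collections import defaultdict

def has_mixed_retries(events: List[Event]) -> bool:
    # True when a single job attempt (same timestamp and run id) produced
    # both a failure and a success, i.e. the test flaked within the attempt.
    statuses = defaultdict(set)
    for e in events:
        statuses[(e.started_at, e.wf_run_id)].add(e.status)
    return any(len(s) > 1 for s in statuses.values())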

jeanschmidt (Contributor) commented Oct 9, 2025

There are three things we need to discuss before moving on:

  1. If the goal is to extract information about individual test failures/successes for flakiness analysis, it is relevant to know how many times a test failed before succeeding, so maybe we need to change the approach;

  2. I see that tests are green, but this change has strong potential to impact the signal detection logic in unexpected ways. Can we have some other assurance that its behaviour is consistent? Maybe we're at the point where our capacity to backtest autorevert with real historical data is becoming more and more important.

  3. Maybe at this stage we also want to enable some sort of feature control, where we can switch from one behaviour to the other live, to quickly test code changes while still gathering feedback from both logics.
