
test(producer): regenerate 7 stale regression baselines #946

Merged: miguel-heygen merged 2 commits into main from fix/regen-stale-regression-baselines on May 19, 2026

@miguel-heygen
Collaborator

@miguel-heygen miguel-heygen commented May 18, 2026

Summary

  • Regenerate golden baselines for 7 regression tests that were deterministically failing due to stale baselines from before the recent renderer fixes
  • Root cause: renderer changes (revert flattenInnerRoot, late-bind polling removal, conditional rebind, sub-comp timeline readiness polling) changed visual output, but baselines were never regenerated
  • CI's Detect changes job was skipping shards on non-engine commits, masking the failures — making it look like flaky tests when in reality they failed every time shards ran
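
The masking effect in the last bullet can be illustrated with a minimal Python sketch of a path-based gate. The glob patterns, the function name, and the example file paths are assumptions for illustration only; the real Detect changes job may use different rules.

```python
from fnmatch import fnmatch

# Hypothetical engine path patterns; the real Detect changes filter may differ.
ENGINE_PATTERNS = (
    "packages/engine/*",
    "packages/core/*",
    "packages/producer/*",
)


def should_run_regression_shards(changed_files):
    """Return True when any changed file matches an engine path pattern.

    fnmatch's "*" also matches "/", so nested files are covered.
    """
    return any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in ENGINE_PATTERNS
    )


# Hypothetical commits, used only to show the two outcomes.
docs_commit = ["README.md", "docs/guide.md"]
engine_commit = ["packages/engine/src/render.ts", "README.md"]

print(should_run_regression_shards(docs_commit))    # False: shards are skipped
print(should_run_regression_shards(engine_commit))  # True: shards run
```

Under a gate like this, a stale baseline only fails on commits that touch the gated paths, so a fixture that fails on every shard run can still look intermittent across the commit history.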

Tests regenerated (all in Docker with pinned Chrome 148.0.7778.167)

| Test | Previously failed frames | Baseline age |
| --- | --- | --- |
| gsap-letters-render-compat | 86/100 | old |
| typegpu-adapter | 76/100 | May 13 |
| style-7-prod | 60/100 | PR #368 (very old) |
| style-15-prod | 59/100 | May 17 (pre-renderer-fixes) |
| style-18-prod | 19/100 | old |
| style-3-prod | 7/100 | May 17 (pre-renderer-fixes) |
| style-8-prod | 1/100 | borderline |

Test plan

  • All 7 tests pass locally in Docker with 0 failed frames
  • Verified deterministic — ran style-7-prod twice, same result both times
  • All 8 CI regression shards green
  • Other tests (style-1, 2, 4, 5, 6, 9, 10, 11, 12, 13, 16, 17, overlay-montage, sub-composition-video, vignelli-stacking, etc.) still pass

🤖 Generated with Claude Code

miguel-heygen and others added 2 commits May 18, 2026 22:28
Renderer fixes (revert flattenInnerRoot, late-bind polling removal,
conditional rebind, sub-comp timeline readiness polling) changed visual
output but baselines were never regenerated. CI's Detect changes job
was skipping shards on non-engine commits, masking the failures.

Regenerated in Docker (Dockerfile.test) with pinned Chrome 148.0.7778.167:
- style-7-prod (60/100 frames failed, baseline from PR #368)
- style-15-prod (59/100 frames failed)
- style-18-prod (19/100 frames failed)
- style-3-prod (7/100 frames failed)
- style-8-prod (1/100 frames failed, borderline)
- typegpu-adapter (76/100 frames failed)

All 6 now pass locally with 0 failed frames.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Missed in the initial batch: 86/100 frames were failing. Regenerated
in Docker with the same pinned Chrome build. Now passes with 0 failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@miguel-heygen miguel-heygen changed the title from "test(producer): regenerate 6 stale regression baselines" to "test(producer): regenerate 7 stale regression baselines" on May 19, 2026
Collaborator

@jrusso1020 jrusso1020 left a comment

Verdict: APPROVE

Pure baseline regeneration — 14 files modified, all in packages/producer/tests/*/output/ (compiled.html + output.mp4 pairs for 7 fixtures). Zero source code touched. Scope is exactly what the title promises.

Verification of the "expected vs current behavior" concern

The classic footgun on golden-file regen is baseline refresh on macOS that doesn't match CI Linux render (the hf#591 lesson). Audited specifically:

CI Linux reproduces — all 8 regression shards green on 13846554:

| Test regenerated | Shard | Result |
| --- | --- | --- |
| style-3-prod | shard-1 | success |
| style-15-prod | shard-2 | success |
| style-7-prod, style-8-prod | shard-3 | success |
| typegpu-adapter | shard-5 | success |
| style-18-prod | shard-7 | success |
| gsap-letters-render-compat | shard-8 | success |

Every regenerated baseline reproduces on Linux CI, not just on Miguel's local machine. That is the signal that addresses the hf#591 concern.

Additionally: the other shards (covering style-1/2/4/5/6/9/10/11/12/13/16/17, vignelli-stacking, overlay-montage, sub-composition-video, hdr-regression, mov-prores, etc.) are also green — confirms the recent renderer fixes (revert flattenInnerRoot / late-bind polling removal / conditional rebind / sub-comp timeline readiness) haven't destabilized previously-passing baselines while fixing these 7.

Spot-checks on the compiled.html diffs

Looked at style-3-prod/output/compiled.html — additions are inline @font-face blocks with base64-encoded woff2 payloads (e.g. Helvetica). Consistent with sub-composition asset-inlining behavior — benign renderer-output change, not a regression artifact.

Body-vs-diff check

Body claims fail-counts pre-regen (86/100, 76/100, 60/100, 59/100, 19/100, 7/100, 1/100) — not directly verifiable from review, but the "Detect changes job was skipping shards on non-engine commits" explanation matches the observed shard-bypass pattern and gives a plausible reason these failed deterministically without being noticed earlier.

One note (non-blocking)

mergeable_state: blocked is currently driven by Tests on windows-latest + Render on windows-latest still in_progress. Those are non-regression jobs and historically green on prior runs of this stack — let them complete before merge but no action needed here.

Review by Rames

@jrusso1020
Collaborator

Diagnosis matches what I see from PR #945 (observability + fail-fast: false to force every shard to run). Same 7 fixtures, identical failed-frame counts on rerun, so the failures are fully deterministic. Two asks before approval, plus two low-priority notes:

1. Confirm style-8-prod is supposed to render an empty navy/decorations frame at t=3.045s.

The previous baseline at that timestamp showed the main A-roll talking head (man in "W" cap, office background). Your new baseline replaces it with what looks like a transition gap. The other 99 frames in the fixture match the old baseline within the PSNR threshold, so this is a content-shifting change specifically at the intro→main scene boundary, not a generic timing drift. I want to make sure the "hold external sub-compositions through host duration" change (2b46565c / 551607ef) isn't dropping a frame's worth of main content at the boundary.

Old expected vs. new baseline at t=3.045s:

  • Expected: man in office, talking-head visible
  • New baseline: navy bg + a few decorative shapes, no person

A one-line PR comment ("at t=3.045 style-8 should show [X]; the new render is correct because [Y]") would settle it for reviewers.
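
For readers unfamiliar with the per-frame check behind "match the old baseline within PSNR", here is a minimal Python sketch of a PSNR comparison over flat pixel sequences. The 30 dB threshold and the tiny synthetic frames are assumptions for illustration; the regression harness's actual threshold and frame format are not stated in this thread.

```python
import math


def psnr(expected, actual, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(expected) != len(actual):
        raise ValueError("frames must have the same number of samples")
    mse = sum((e - a) ** 2 for e, a in zip(expected, actual)) / len(expected)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)


def failed_frames(expected_frames, actual_frames, threshold_db):
    """Indices of frames whose PSNR falls below the threshold."""
    return [
        index
        for index, (exp, act) in enumerate(zip(expected_frames, actual_frames))
        if psnr(exp, act) < threshold_db
    ]


# Synthetic 4-pixel frames: frame 1 changes content entirely, frame 2 drifts by one level.
baseline = [[10, 20, 30, 40], [200, 200, 200, 200], [50, 60, 70, 80]]
rendered = [[10, 20, 30, 40], [10, 10, 10, 10], [51, 61, 71, 81]]

# Hypothetical threshold; the harness's real value is not given in this PR.
print(failed_frames(baseline, rendered, threshold_db=30.0))  # [1]
```

A single catastrophic frame among otherwise passing frames, as in this toy data, is the same shape as the style-8-prod result: the metric says where the render changed, not whether the change is correct.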

2. fail-fast: true in regression.yml:53 is the other masking mechanism.

The "Detect changes skipping shards" point in the PR description is real, but it doesn't explain why 27efcd0f (release v0.6.22) and 8163f380 (release v0.6.21) — both touching engine paths — surfaced only 1 failing fixture per run when the truth was 6–7. That's fail-fast: true cancelling subsequent shards after the first failure, so each release run showed a different fixture failing, looking flaky.

Without fixing this, the same masking pattern will hide the next regression. Either:

  • Drop fail-fast: true in a follow-up (regression matrix goes from ~15 min → ~25 min wall-clock, but every failure surfaces),
  • Or add a checklist note: PRs touching packages/{engine,core,producer}/** must include matching baseline regens.
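
The masking argument above can be sketched as a small Python simulation. It is a simplification under stated assumptions: shards are modeled as finishing in a fixed order (real matrix jobs run in parallel), the finish orders are hypothetical, and the set of failing shards is taken from the shard table in the earlier review rather than from the release runs themselves.

```python
def reported_failures(finish_order, fail_fast):
    """Return the failing shards a CI run actually reports.

    finish_order: list of (shard_name, passed) in the order shards finish.
    With fail_fast, the first failure cancels every shard still pending.
    """
    reported = []
    for shard, passed in finish_order:
        if passed:
            continue
        reported.append(shard)
        if fail_fast:
            break  # remaining shards are cancelled, so their failures never surface
    return reported


# Assumed pre-regeneration state: shards holding a stale fixture fail every time.
STALE = {"shard-1", "shard-2", "shard-3", "shard-5", "shard-7", "shard-8"}


def run(order, fail_fast):
    return reported_failures([(s, s not in STALE) for s in order], fail_fast)


# Two hypothetical finish orders for the same commit.
run_a = ["shard-4", "shard-3", "shard-1", "shard-2", "shard-5", "shard-6", "shard-7", "shard-8"]
run_b = ["shard-6", "shard-8", "shard-2", "shard-1", "shard-3", "shard-4", "shard-5", "shard-7"]

print(run(run_a, fail_fast=True))   # ['shard-3']
print(run(run_b, fail_fast=True))   # ['shard-8']
print(run(run_a, fail_fast=False))  # ['shard-3', 'shard-1', 'shard-2', 'shard-5', 'shard-7', 'shard-8']
```

With fail-fast enabled, each run names whichever stale shard happened to finish first, which reads as flakiness. With it disabled, the same deterministic set appears on every run.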

3. Two more visual-sanity notes (low priority, just flagging):

  • gsap-letters-render-compat at t=0.16s: old expected has subtle halos under each letter (H, Y, P); new baseline has crisp letters with no halo. Could be "GSAP frame-timing artifact gone, render is cleaner" or "CSS text-shadow inadvertently stripped." A one-line confirm would close it out.
  • The 7 fixtures split into 3 distinct PSNR signatures (catastrophic single-frame at scene boundary | 26–30 dB sustained drift | 18.7 dB sustained wrong-state). Regenerating all of them treats them as one phenomenon. If style-8's is actually a regression hiding under "we accepted the new render," we won't catch it until production.

FYI — happy to revert #945 once this lands, unless you want to keep the structured pollSubCompositionTimelines JSON log for future investigations (it's two changes: the JSON line itself, and fail-fast: false).

@jrusso1020
Collaborator

Aligning with the two asks above — my earlier APPROVE was premature; downgrading my position to pending Miguel's confirmation on (1). The content-shift catch on style-8 is the load-bearing one and I should have surfaced it from the diff signatures alone.

The audit I should have done:

Cross-referencing the PR-body fail counts against the compiled.html diff sizes:

| Fixture | Failed frames | compiled.html ±lines |
| --- | --- | --- |
| style-8-prod | 1/100 | +1468 / -543 |
| style-7-prod | 60/100 | +935 / -48 |
| style-15-prod | 59/100 | +45 / -5 |
| style-18-prod | 19/100 | +59 / -3 |
| style-3-prod | 7/100 | +43 / -3 |
| gsap-letters-render-compat | 86/100 | +64 / 0 |
| typegpu-adapter | 76/100 | +64 / 0 |

The style-8 row is the outlier: 1/100 failed frames but ~2000 lines of compiled.html churn. That signature is "content-shifting change at a single frame boundary" — the bulk of the .html change is structural (likely sub-comp inlining / asset-tree mutation), and the pixel-domain failure is the one frame where the structural change manifests visually. Without per-frame inspection this should have read as a flag, not "benign."
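
The cross-reference above can be reproduced with a short Python sketch that divides compiled.html churn by failed frames and flags rows far above the median. The numbers are copied from the table; the "10x the median" cutoff is an arbitrary assumption chosen for illustration, not a rule used by the project.

```python
import statistics

# (fixture, failed frames out of 100, lines added, lines removed), from the table above.
ROWS = [
    ("style-8-prod", 1, 1468, 543),
    ("style-7-prod", 60, 935, 48),
    ("style-15-prod", 59, 45, 5),
    ("style-18-prod", 19, 59, 3),
    ("style-3-prod", 7, 43, 3),
    ("gsap-letters-render-compat", 86, 64, 0),
    ("typegpu-adapter", 76, 64, 0),
]


def churn_per_failed_frame(rows):
    """Map each fixture to (added + removed) lines per failed frame."""
    return {name: (added + removed) / failed for name, failed, added, removed in rows}


def flag_outliers(rows, factor=10.0):
    """Flag fixtures whose ratio exceeds `factor` times the median ratio (assumed cutoff)."""
    ratios = churn_per_failed_frame(rows)
    median = statistics.median(ratios.values())
    return [name for name, ratio in ratios.items() if ratio > factor * median]


print(flag_outliers(ROWS))  # ['style-8-prod']
```

A flag from a heuristic like this only says the row is disproportionate. It does not identify the cause, so it is a prompt for per-frame inspection rather than a verdict.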

Vai's framing is right: regen-all treats the 7 fixtures as one phenomenon (3 distinct PSNR signatures: catastrophic-single-frame, sustained-drift, sustained-wrong-state). My approval framed all 7 as the same shape because they all turned green in CI; that shows the renders are deterministic, not that they are correct.

On fail-fast: true — agreed this is a follow-up. The "Detect changes skipping shards" explanation in the PR body is necessary but not sufficient — fail-fast is what produced the "different fixture failing each run → looks flaky" pattern on release commits. A follow-up that flips it (or adds the checklist note) is what makes the next regression actually visible.

Holding for Miguel's one-liner on (1) style-8 + (3a) gsap-letters halo intent. Once those are confirmed, I'm back at APPROVE.

— Rames

Collaborator

@jrusso1020 jrusso1020 left a comment

Re-APPROVE after Miguel's two clarifications resolve the open questions:

1. The 2-frame intro extension at the intro→main scene boundary is intentional.

Miguel confirmed "without extending that, it feels choppy" — the "hold external sub-compositions through host duration" change 2b46565c / 551607ef is doing exactly what it says, and accepting the new baseline at frames 90–91 (intro overlay holds through full duration before A-roll appears) is the correct behavior, not a regression. Old baselines that ended the intro early were the buggy state.

Verified visually by extracting frames from the two mp4s — frame 91 OLD shows the talking-head fully visible, frame 91 NEW shows the navy intro overlay still opaque, and both converge by frame 92 (intro mid-fadeout, talking-head emerging). PNG-size ratios across 10 sampled frames confirm everything else is pixel-equivalent within ~5% — only frames 90–91 differ catastrophically, exactly at the boundary.

2. The +1468/-543 LoC outlier on style-8-prod has a benign root cause.

It's not structural change — it's inline base64 font-data bloat from c8e8fdcf fix(producer): fetch missing font weights from Google Fonts at render time:

  • OLD: 9 @font-face rules, ~288KB woff2
  • NEW: 33 @font-face rules, ~1.14MB woff2

Inter went from 3 weights (100/300/800) to all weights; Roboto + Segoe UI added as fallbacks. That ~850KB of base64 = ~1400 lines of diff. Actual structural CSS-scoping diff is small (32 → 35 selectors). My earlier "signature outlier" framing identified a real disproportion but mis-attributed the cause — the cause was the renderer's font-backfill behavior maturing, not a content-shift regression.
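
The OLD vs. NEW font comparison above can be approximated with a minimal Python sketch that counts @font-face rules and decoded woff2 bytes in a compiled.html string. The `data:font/woff2;base64,` URL prefix and the sample markup are assumptions; the renderer's real output may use a different MIME type or quoting.

```python
import base64
import re

FONT_FACE_RE = re.compile(r"@font-face\s*\{", re.IGNORECASE)
# Assumed data-URL shape; adjust the MIME type if the real output differs.
WOFF2_DATA_RE = re.compile(
    r"url\(\s*['\"]?data:font/woff2;base64,([A-Za-z0-9+/=]+)['\"]?\s*\)"
)


def font_stats(html):
    """Return (number of @font-face rules, total decoded woff2 payload bytes)."""
    rules = len(FONT_FACE_RE.findall(html))
    payload_bytes = sum(
        len(base64.b64decode(encoded)) for encoded in WOFF2_DATA_RE.findall(html)
    )
    return rules, payload_bytes


# Synthetic stand-in for compiled.html: two rules with 300-byte dummy payloads.
sample_payload = base64.b64encode(b"\x00" * 300).decode("ascii")
sample_html = (
    "<style>"
    "@font-face { font-family: 'Inter'; font-weight: 100; "
    f"src: url(data:font/woff2;base64,{sample_payload}) format('woff2'); }}"
    "@font-face { font-family: 'Inter'; font-weight: 300; "
    f"src: url(data:font/woff2;base64,{sample_payload}) format('woff2'); }}"
    "</style>"
)

print(font_stats(sample_html))  # (2, 600)
```

Running a counter like this on the old and new compiled.html would separate font-payload growth from structural change before anyone reads the raw diff.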

Open items (non-blocking, per Vai's framing):

  • gsap-letters halo loss at t=0.16s — confirm intentional (still open; doesn't block this PR).
  • fail-fast: true in regression.yml:53 — follow-up to remove (or add a checklist note for PRs touching packages/{engine,core,producer}/**) so the next regression isn't hidden the same way the release-commit "1 failure per run looks flaky" pattern was.

Review by Rames

@miguel-heygen miguel-heygen merged commit d873364 into main May 19, 2026
55 checks passed
@miguel-heygen miguel-heygen deleted the fix/regen-stale-regression-baselines branch May 19, 2026 00:38