test(producer): regenerate 7 stale regression baselines#946
Conversation
Renderer fixes (revert flattenInnerRoot, late-bind polling removal, conditional rebind, sub-comp timeline readiness polling) changed visual output but baselines were never regenerated. CI's Detect changes job was skipping shards on non-engine commits, masking the failures. Regenerated in Docker (Dockerfile.test) with pinned Chrome 148.0.7778.167: - style-7-prod (60/100 frames failed, baseline from PR #368) - style-15-prod (59/100 frames failed) - style-18-prod (19/100 frames failed) - style-3-prod (7/100 frames failed) - style-8-prod (1/100 frames failed, borderline) - typegpu-adapter (76/100 frames failed) All 6 now pass locally with 0 failed frames. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Missed in the initial batch — 86/100 frames were failing. Regenerated in Docker with the same pinned Chrome build. Now passes with 0 failures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jrusso1020
left a comment
There was a problem hiding this comment.
Verdict: APPROVE
Pure baseline regeneration — 14 files modified, all in packages/producer/tests/*/output/ (compiled.html + output.mp4 pairs for 7 fixtures). Zero source code touched. Scope is exactly what the title promises.
Verification of the "expected vs current behavior" concern
The classic footgun on golden-file regen is baseline refresh on macOS that doesn't match CI Linux render (the hf#591 lesson). Audited specifically:
CI Linux reproduces — all 8 regression shards green on 13846554:
| Test regenerated | Shard | Result |
|---|---|---|
style-3-prod |
shard-1 | success |
style-15-prod |
shard-2 | success |
style-7-prod, style-8-prod |
shard-3 | success |
typegpu-adapter |
shard-5 | success |
style-18-prod |
shard-7 | success |
gsap-letters-render-compat |
shard-8 | success |
Every regenerated baseline reproduces on Linux CI — not just on Miguel's local. That's the signal that resolves hf#591.
Additionally: the other shards (covering style-1/2/4/5/6/9/10/11/12/13/16/17, vignelli-stacking, overlay-montage, sub-composition-video, hdr-regression, mov-prores, etc.) are also green — confirms the recent renderer fixes (revert flattenInnerRoot / late-bind polling removal / conditional rebind / sub-comp timeline readiness) haven't destabilized previously-passing baselines while fixing these 7.
Spot-checks on the compiled.html diffs
Looked at style-3-prod/output/compiled.html — additions are inline @font-face blocks with base64-encoded woff2 payloads (e.g. Helvetica). Consistent with sub-composition asset-inlining behavior — benign renderer-output change, not a regression artifact.
Body-vs-diff check
Body claims fail-counts pre-regen (86/100, 76/100, 60/100, 59/100, 19/100, 7/100, 1/100) — not directly verifiable from review, but the "Detect changes job was skipping shards on non-engine commits" explanation matches the observed shard-bypass pattern and gives a plausible reason these failed deterministically without being noticed earlier.
One note (non-blocking)
mergeable_state: blocked is currently driven by Tests on windows-latest + Render on windows-latest still in_progress. Those are non-regression jobs and historically green on prior runs of this stack — let them complete before merge but no action needed here.
Review by Rames
|
Diagnosis matches what I see from PR #945 (observability + 1. Confirm The previous baseline at that timestamp showed the main A-roll talking head (man in "W" cap, office background). Your new baseline replaces it with what looks like a transition gap. The other 99 frames in the fixture match the old baseline within PSNR, so this is a content-shifting change at the intro→main scene boundary specifically — not a generic timing drift. Want to make sure the "hold external sub-compositions through host duration" change ( Old expected vs. new baseline at t=3.045s:
A one-line PR comment ("at t=3.045 style-8 should show [X]; the new render is correct because [Y]") would settle it for reviewers. 2. The "Detect changes skipping shards" point in the PR description is real, but it doesn't explain why Without fixing this, the same masking pattern will hide the next regression. Either:
3. Two more visual-sanity notes (low priority, just flagging):
FYI — happy to revert #945 once this lands, unless you want to keep the structured |
|
Aligning with the two asks above — my earlier APPROVE was premature; downgrading my position to pending Miguel's confirmation on (1). The content-shift catch on style-8 is the load-bearing one and I should have surfaced it from the diff signatures alone. The audit I should have done: Cross-referencing the PR-body fail counts against the compiled.html diff sizes:
The Vai's framing is right: regen-all treats the 7 fixtures as one phenomenon (3 distinct PSNR signatures: catastrophic-single-frame, sustained-drift, sustained-wrong-state). My approval framed all 7 as the same shape because they all turned green in CI; that's deterministic, not correct. On Holding for Miguel's one-liner on (1) style-8 + (3a) gsap-letters halo intent. Once those are confirmed, I'm back at APPROVE. — Rames |
jrusso1020
left a comment
There was a problem hiding this comment.
Re-APPROVE after Miguel's two clarifications resolve the open questions:
1. The 2-frame intro extension at the intro→main scene boundary is intentional.
Miguel confirmed "without extending that, it feels choppy" — the "hold external sub-compositions through host duration" change 2b46565c / 551607ef is doing exactly what it says, and accepting the new baseline at frames 90–91 (intro overlay holds through full duration before A-roll appears) is the correct behavior, not a regression. Old baselines that ended the intro early were the buggy state.
Verified visually by extracting frames from the two mp4s — frame 91 OLD shows the talking-head fully visible, frame 91 NEW shows the navy intro overlay still opaque, and both converge by frame 92 (intro mid-fadeout, talking-head emerging). PNG-size ratios across 10 sampled frames confirm everything else is pixel-equivalent within ~5% — only frames 90–91 differ catastrophically, exactly at the boundary.
2. The +1468/-543 LoC outlier on style-8-prod has a benign root cause.
It's not structural change — it's inline base64 font-data bloat from c8e8fdcf fix(producer): fetch missing font weights from Google Fonts at render time:
- OLD: 9
@font-facerules, ~288KB woff2 - NEW: 33
@font-facerules, ~1.14MB woff2
Inter went from 3 weights (100/300/800) to all weights; Roboto + Segoe UI added as fallbacks. That ~850KB of base64 = ~1400 lines of diff. Actual structural CSS-scoping diff is small (32 → 35 selectors). My earlier "signature outlier" framing identified a real disproportion but mis-attributed the cause — the cause was the renderer's font-backfill behavior maturing, not a content-shift regression.
Open items (non-blocking, per Vai's framing):
- gsap-letters halo loss at t=0.16s — confirm intentional (still open; doesn't block this PR).
fail-fast: trueinregression.yml:53— follow-up to remove (or add a checklist note for PRs touchingpackages/{engine,core,producer}/**) so the next regression isn't hidden the same way the release-commit "1 failure per run looks flaky" pattern was.
Review by Rames
Summary
Detect changesjob was skipping shards on non-engine commits, masking the failures — making it look like flaky tests when in reality they failed every time shards ranTests regenerated (all in Docker with pinned Chrome 148.0.7778.167)
gsap-letters-render-compattypegpu-adapterstyle-7-prodstyle-15-prodstyle-18-prodstyle-3-prodstyle-8-prodTest plan
🤖 Generated with Claude Code