test(producer): regenerate 7 stale regression baselines by miguel-heygen · Pull Request #946 · heygen-com/hyperframes

miguel-heygen · 2026-05-18T22:28:28Z

Summary

- Regenerate golden baselines for 7 regression tests that were failing deterministically because their baselines predate the recent renderer fixes.
- Root cause: renderer changes (revert flattenInnerRoot, late-bind polling removal, conditional rebind, sub-comp timeline readiness polling) changed visual output, but the baselines were never regenerated.
- CI's Detect changes job was skipping shards on non-engine commits, which masked the failures: the tests looked flaky when in reality they failed every time the shards ran.
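A minimal sketch of how a path-filtered change-detection step can hide stale baselines. The patterns and function names below are hypothetical (the real Detect changes configuration is not shown in this PR); the point is only that a deterministic failure becomes invisible on every commit that does not match the filter.

```python
from fnmatch import fnmatchcase

# Hypothetical filters, loosely based on the packages/{engine,core,producer}/** paths
# mentioned later in the review. Not the real CI configuration.
ENGINE_PATTERNS = ["packages/engine/*", "packages/core/*", "packages/producer/*"]


def should_run_shards(changed_files):
    """Return True when any changed file matches an engine path pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in ENGINE_PATTERNS
    )


def visible_failures(commits, stale_fixtures):
    """Map each commit to the stale fixtures CI would report for it.

    A fixture with a stale baseline fails every time it runs, but it is only
    reported on commits where the regression shards are not skipped.
    """
    return {
        sha: sorted(stale_fixtures) if should_run_shards(files) else []
        for sha, files in commits.items()
    }


commits = {
    "docs-only": ["README.md"],
    "engine-change": ["packages/engine/src/render.ts"],  # illustrative path
}
report = visible_failures(commits, {"style-7-prod", "style-3-prod"})
```

In this sketch the docs-only commit reports nothing even though both fixtures would fail if run, which is the "looks flaky" pattern described above.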

Tests regenerated (all in Docker with pinned Chrome 148.0.7778.167)

| Test | Previously failed frames | Baseline age |
| --- | --- | --- |
| `gsap-letters-render-compat` | 86/100 | old |
| `typegpu-adapter` | 76/100 | May 13 |
| `style-7-prod` | 60/100 | PR #368 (very old) |
| `style-15-prod` | 59/100 | May 17 (pre-renderer-fixes) |
| `style-18-prod` | 19/100 | old |
| `style-3-prod` | 7/100 | May 17 (pre-renderer-fixes) |
| `style-8-prod` | 1/100 | borderline |

Test plan

- All 7 tests pass locally in Docker with 0 failed frames
- Verified deterministic: ran style-7-prod twice, same result both times
- All 8 CI regression shards green
- Other tests (style-1, 2, 4, 5, 6, 9, 10, 11, 12, 13, 16, 17, overlay-montage, sub-composition-video, vignelli-stacking, etc.) still pass

🤖 Generated with Claude Code

Renderer fixes (revert flattenInnerRoot, late-bind polling removal, conditional rebind, sub-comp timeline readiness polling) changed visual output but baselines were never regenerated. CI's Detect changes job was skipping shards on non-engine commits, masking the failures.

Regenerated in Docker (Dockerfile.test) with pinned Chrome 148.0.7778.167:

- style-7-prod (60/100 frames failed, baseline from PR #368)
- style-15-prod (59/100 frames failed)
- style-18-prod (19/100 frames failed)
- style-3-prod (7/100 frames failed)
- style-8-prod (1/100 frames failed, borderline)
- typegpu-adapter (76/100 frames failed)

All 6 now pass locally with 0 failed frames.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Missed in the initial batch: 86/100 frames were failing. Regenerated in Docker with the same pinned Chrome build. Now passes with 0 failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jrusso1020

Verdict: APPROVE

Pure baseline regeneration — 14 files modified, all in packages/producer/tests/*/output/ (compiled.html + output.mp4 pairs for 7 fixtures). Zero source code touched. Scope is exactly what the title promises.

Verification of the "expected vs current behavior" concern

The classic footgun on golden-file regen is baseline refresh on macOS that doesn't match CI Linux render (the hf#591 lesson). Audited specifically:

CI Linux reproduces — all 8 regression shards green on 13846554:

| Test regenerated | Shard | Result |
| --- | --- | --- |
| `style-3-prod` | shard-1 | success |
| `style-15-prod` | shard-2 | success |
| `style-7-prod`, `style-8-prod` | shard-3 | success |
| `typegpu-adapter` | shard-5 | success |
| `style-18-prod` | shard-7 | success |
| `gsap-letters-render-compat` | shard-8 | success |

Every regenerated baseline reproduces on Linux CI, not just on Miguel's local machine. That's the signal that resolves the hf#591 concern.

Additionally: the other shards (covering style-1/2/4/5/6/9/10/11/12/13/16/17, vignelli-stacking, overlay-montage, sub-composition-video, hdr-regression, mov-prores, etc.) are also green — confirms the recent renderer fixes (revert flattenInnerRoot / late-bind polling removal / conditional rebind / sub-comp timeline readiness) haven't destabilized previously-passing baselines while fixing these 7.

Spot-checks on the compiled.html diffs

Looked at style-3-prod/output/compiled.html — additions are inline @font-face blocks with base64-encoded woff2 payloads (e.g. Helvetica). Consistent with sub-composition asset-inlining behavior — benign renderer-output change, not a regression artifact.

Body-vs-diff check

Body claims fail-counts pre-regen (86/100, 76/100, 60/100, 59/100, 19/100, 7/100, 1/100) — not directly verifiable from review, but the "Detect changes job was skipping shards on non-engine commits" explanation matches the observed shard-bypass pattern and gives a plausible reason these failed deterministically without being noticed earlier.

One note (non-blocking)

mergeable_state: blocked is currently driven by Tests on windows-latest + Render on windows-latest still in_progress. Those are non-regression jobs and historically green on prior runs of this stack — let them complete before merge but no action needed here.

Review by Rames

jrusso1020 · 2026-05-19T00:19:55Z

Diagnosis matches what I see from PR #945 (observability + fail-fast: false to force every shard to run). Same 7 fixtures, identical failed-frame counts on rerun — fully deterministic. Two asks before approve:

1. Confirm style-8-prod is supposed to render an empty navy/decorations frame at t=3.045s.

The previous baseline at that timestamp showed the main A-roll talking head (man in "W" cap, office background). Your new baseline replaces it with what looks like a transition gap. The other 99 frames in the fixture match the old baseline within PSNR, so this is a content-shifting change at the intro→main scene boundary specifically — not a generic timing drift. Want to make sure the "hold external sub-compositions through host duration" change (2b46565c / 551607ef) isn't dropping a frame's worth of main content at the boundary.

Old expected vs. new baseline at t=3.045s:

Expected: man in office, talking-head visible
New baseline: navy bg + a few decorative shapes, no person

A one-line PR comment ("at t=3.045 style-8 should show [X]; the new render is correct because [Y]") would settle it for reviewers.

2. fail-fast: true in regression.yml:53 is the other masking mechanism.

The "Detect changes skipping shards" point in the PR description is real, but it doesn't explain why 27efcd0f (release v0.6.22) and 8163f380 (release v0.6.21) — both touching engine paths — surfaced only 1 failing fixture per run when the truth was 6–7. That's fail-fast: true cancelling subsequent shards after the first failure, so each release run showed a different fixture failing, looking flaky.

Without fixing this, the same masking pattern will hide the next regression. Either:

- Drop fail-fast: true in a follow-up (regression matrix goes from ~15 min → ~25 min wall-clock, but every failure surfaces),
- Or add a checklist note: PRs touching packages/{engine,core,producer}/** must include matching baseline regens.
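To make the masking concrete, here is a small simulation of a matrix run with and without fail-fast. The shard-to-fixture mapping is taken from the shard table earlier in this thread; `run_matrix` and the completion orders are hypothetical, and real CI cancellation is asynchronous rather than strictly sequential.

```python
# Shards that contain a regenerated fixture, per the shard table in this thread.
SHARDS = {
    "shard-1": ["style-3-prod"],
    "shard-2": ["style-15-prod"],
    "shard-3": ["style-7-prod", "style-8-prod"],
    "shard-5": ["typegpu-adapter"],
    "shard-7": ["style-18-prod"],
    "shard-8": ["gsap-letters-render-compat"],
}

# All 7 fixtures fail deterministically before the regen.
FAILING = {fixture for fixtures in SHARDS.values() for fixture in fixtures}


def run_matrix(shards, failing, fail_fast, finish_order):
    """Return the set of fixtures a run reports as failed.

    finish_order is the (hypothetical) order in which shards complete.
    With fail_fast, the first shard that fails cancels everything after it.
    """
    reported = set()
    for shard in finish_order:
        failed_here = [f for f in shards[shard] if f in failing]
        reported.update(failed_here)
        if failed_here and fail_fast:
            break  # later shards are cancelled, so their failures never surface
    return reported


run_a = run_matrix(SHARDS, FAILING, True, ["shard-1", "shard-2", "shard-3"])
run_b = run_matrix(SHARDS, FAILING, True, ["shard-8", "shard-1", "shard-2"])
full = run_matrix(SHARDS, FAILING, False, list(SHARDS))
```

Under these assumptions `run_a` and `run_b` each report a different single fixture, while `full` reports all seven, which is why the same deterministic failures can read as flakiness across release runs.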

3. Two more visual-sanity notes (low priority, just flagging):

- gsap-letters-render-compat at t=0.16s: old expected has subtle halos under each letter (H, Y, P); new baseline has crisp letters with no halo. Could be "GSAP frame-timing artifact gone, render is cleaner" or "CSS text-shadow inadvertently stripped." A one-line confirm would close it out.
- The 7 fixtures split into 3 distinct PSNR signatures (catastrophic single-frame at scene boundary | 26–30 dB sustained drift | 18.7 dB sustained wrong-state). Regenerating all of them treats them as one phenomenon. If style-8's is actually a regression hiding under "we accepted the new render," we won't catch it until production.
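For readers less familiar with the metric, a minimal sketch of per-frame PSNR over 8-bit samples. The threshold parameter is an assumption (the regression harness's actual threshold is not stated in this thread), and frames are modelled as flat lists of pixel values rather than decoded video.

```python
import math


def psnr(expected, actual, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(expected) != len(actual):
        raise ValueError("frames must have the same number of samples")
    mse = sum((e - a) ** 2 for e, a in zip(expected, actual)) / len(expected)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)


def failed_frames(expected_frames, actual_frames, threshold_db):
    """Indices of frames whose PSNR falls below threshold_db (hypothetical threshold)."""
    return [
        index
        for index, (expected, actual) in enumerate(zip(expected_frames, actual_frames))
        if psnr(expected, actual) < threshold_db
    ]


# A uniform offset of 10 levels gives an MSE of 100, i.e. about 28 dB:
drift_db = psnr([100, 100, 100, 100], [110, 110, 110, 110])
```

A uniform 10-level offset lands inside the 26–30 dB "sustained drift" band quoted above, whereas a frame showing entirely different content drops far lower, which is what separates the signatures.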

FYI — happy to revert #945 once this lands, unless you want to keep the structured pollSubCompositionTimelines JSON log for future investigations (it's two changes: the JSON line itself, and fail-fast: false).

jrusso1020 · 2026-05-19T00:21:31Z

Aligning with the two asks above — my earlier APPROVE was premature; downgrading my position to pending Miguel's confirmation on (1). The content-shift catch on style-8 is the load-bearing one and I should have surfaced it from the diff signatures alone.

The audit I should have done:

Cross-referencing the PR-body fail counts against the compiled.html diff sizes:

| Fixture | Failed frames | compiled.html ±lines |
| --- | --- | --- |
| `style-8-prod` | 1/100 | +1468 / -543 |
| `style-7-prod` | 60/100 | +935 / -48 |
| `style-15-prod` | 59/100 | +45 / -5 |
| `style-18-prod` | 19/100 | +59 / -3 |
| `style-3-prod` | 7/100 | +43 / -3 |
| `gsap-letters-render-compat` | 86/100 | +64 / 0 |
| `typegpu-adapter` | 76/100 | +64 / 0 |

The style-8 row is the outlier: 1/100 failed frames but ~2000 lines of compiled.html churn. That signature is "content-shifting change at a single frame boundary" — the bulk of the .html change is structural (likely sub-comp inlining / asset-tree mutation), and the pixel-domain failure is the one frame where the structural change manifests visually. Without per-frame inspection this should have read as a flag, not "benign."
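The disproportion can be checked mechanically. A minimal sketch using the numbers from the table above; "changed lines per failed frame" is an ad hoc heuristic chosen here for illustration, not an established metric or part of the test harness.

```python
# (fixture, failed frames out of 100, lines added, lines removed), from the table above.
ROWS = [
    ("style-8-prod", 1, 1468, 543),
    ("style-7-prod", 60, 935, 48),
    ("style-15-prod", 59, 45, 5),
    ("style-18-prod", 19, 59, 3),
    ("style-3-prod", 7, 43, 3),
    ("gsap-letters-render-compat", 86, 64, 0),
    ("typegpu-adapter", 76, 64, 0),
]


def churn_per_failed_frame(rows):
    """Changed compiled.html lines divided by failed frames, per fixture."""
    return {
        name: (added + removed) / failed
        for name, failed, added, removed in rows
    }


ratios = churn_per_failed_frame(ROWS)
outlier = max(ratios, key=ratios.get)  # fixture with the most churn per failed frame
```

Here style-8-prod has 2011 changed lines for a single failed frame, while every other fixture stays below 17 lines per failed frame.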

Vai's framing is right: regen-all treats the 7 fixtures as one phenomenon (3 distinct PSNR signatures: catastrophic-single-frame, sustained-drift, sustained-wrong-state). My approval framed all 7 as the same shape because they all turned green in CI; green CI shows the renders are deterministic, not that they are correct.

On fail-fast: true — agreed this is a follow-up. The "Detect changes skipping shards" explanation in the PR body is necessary but not sufficient — fail-fast is what produced the "different fixture failing each run → looks flaky" pattern on release commits. A follow-up that flips it (or adds the checklist note) is what makes the next regression actually visible.

Holding for Miguel's one-liner on (1) style-8 + (3a) gsap-letters halo intent. Once those are confirmed, I'm back at APPROVE.

— Rames

jrusso1020

Re-APPROVE after Miguel's two clarifications resolve the open questions:

1. The 2-frame intro extension at the intro→main scene boundary is intentional.

Miguel confirmed "without extending that, it feels choppy" — the "hold external sub-compositions through host duration" change 2b46565c / 551607ef is doing exactly what it says, and accepting the new baseline at frames 90–91 (intro overlay holds through full duration before A-roll appears) is the correct behavior, not a regression. Old baselines that ended the intro early were the buggy state.

Verified visually by extracting frames from the two mp4s — frame 91 OLD shows the talking-head fully visible, frame 91 NEW shows the navy intro overlay still opaque, and both converge by frame 92 (intro mid-fadeout, talking-head emerging). PNG-size ratios across 10 sampled frames confirm everything else is pixel-equivalent within ~5% — only frames 90–91 differ catastrophically, exactly at the boundary.
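A rough sketch of the PNG-size comparison described above, assuming the frames from each mp4 have already been extracted to PNG files with matching names. The function names and directory layout are hypothetical. File size is only a cheap proxy for content change, so a hit still needs a visual check.

```python
from pathlib import Path


def png_sizes(frame_dir):
    """File sizes in bytes of the PNG frames in a directory, sorted by file name."""
    return [path.stat().st_size for path in sorted(Path(frame_dir).glob("*.png"))]


def divergent_frames(old_sizes, new_sizes, tolerance=0.05):
    """Indices where the new/old size ratio deviates from 1.0 by more than tolerance."""
    return [
        index
        for index, (old, new) in enumerate(zip(old_sizes, new_sizes))
        if abs(new / old - 1.0) > tolerance
    ]


# Illustrative sizes only: the middle frame shrinks sharply, the others stay within 5%.
example = divergent_frames([100_000, 100_000, 100_000], [101_000, 40_000, 99_000])
```

With real frames, the two size lists would come from `png_sizes("old_frames")` and `png_sizes("new_frames")` (directory names are placeholders).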

2. The +1468/-543 LoC outlier on style-8-prod has a benign root cause.

It's not structural change — it's inline base64 font-data bloat from c8e8fdcf fix(producer): fetch missing font weights from Google Fonts at render time:

OLD: 9 @font-face rules, ~288KB woff2
NEW: 33 @font-face rules, ~1.14MB woff2

Inter went from 3 weights (100/300/800) to all weights; Roboto + Segoe UI added as fallbacks. That ~850KB of base64 = ~1400 lines of diff. Actual structural CSS-scoping diff is small (32 → 35 selectors). My earlier "signature outlier" framing identified a real disproportion but mis-attributed the cause — the cause was the renderer's font-backfill behavior maturing, not a content-shift regression.
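A minimal sketch of how the @font-face counts and inlined payload sizes could be measured from a compiled.html file. The regular expressions assume simple, non-nested @font-face blocks with `data:...;base64,` URLs; that is an assumption about the compiled output, not something confirmed in this thread.

```python
import re

# Assumes @font-face blocks contain no nested braces.
FONT_FACE = re.compile(r"@font-face\s*\{[^}]*\}")
# Captures the base64 payload of a data: URL such as url(data:font/woff2;base64,....)
BASE64_URL = re.compile(r"url\(\s*['\"]?data:[^;,]+;base64,([A-Za-z0-9+/=]+)")


def font_stats(html):
    """Count @font-face rules and estimate the size of their inlined font data."""
    rules = FONT_FACE.findall(html)
    base64_chars = sum(
        len(payload) for rule in rules for payload in BASE64_URL.findall(rule)
    )
    return {
        "rules": len(rules),
        "base64_chars": base64_chars,
        # base64 encodes 3 bytes per 4 characters (padding ignored here)
        "approx_bytes": base64_chars * 3 // 4,
    }


sample = (
    "<style>"
    "@font-face { font-family: A; src: url(data:font/woff2;base64,AAAABBBB); }"
    "@font-face { font-family: B; src: url('data:font/woff2;base64,AAAA'); }"
    "</style>"
)
stats = font_stats(sample)
```

Running `font_stats` on the old and new compiled.html and comparing the two results would show whether the diff is dominated by font payload, as described above, or by structural changes.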

Open items (non-blocking, per Vai's framing):

- gsap-letters halo loss at t=0.16s: confirm intentional (still open; doesn't block this PR).
- fail-fast: true in regression.yml:53: follow-up to remove (or add a checklist note for PRs touching packages/{engine,core,producer}/**) so the next regression isn't hidden the same way the release-commit "1 failure per run looks flaky" pattern was.

Review by Rames

miguel-heygen and others added 2 commits May 18, 2026 22:28

test(producer): regenerate gsap-letters-render-compat baseline

1384655

miguel-heygen changed the title ~~test(producer): regenerate 6 stale regression baselines~~ test(producer): regenerate 7 stale regression baselines May 19, 2026

jrusso1020 approved these changes May 19, 2026


miguel-heygen merged commit d873364 into main May 19, 2026
55 checks passed

miguel-heygen deleted the fix/regen-stale-regression-baselines branch May 19, 2026 00:38

jrusso1020 mentioned this pull request May 19, 2026

fix(producer): treat 4xx from Google Fonts as deterministic "not served", not as failClosed trigger #957

Merged
