[plan] sfx + music via elevenlabs · gemini video & image intelligence#25

Draft
cuio wants to merge 1 commit into main from plan/sfx-music-and-gemini-intelligence

@cuio cuio commented Apr 27, 2026

Tracking doc only — DRAFT. This PR exists as a permanent URL for the plan. Do not merge.
The next Claude session checks out this branch and starts at Phase 1.

TL;DR

Two parallel feature tracks, ElevenLabs audio (A and B) and Gemini intelligence (C through F), that land the missing retention levers. Both feed the existing Storyline cockpit's applyPatch pipeline, so there are no parallel write surfaces.

| Track | What | Models | Cost |
| --- | --- | --- | --- |
| A: SFX | Per-scene generated sound effects on the existing SFX lane | Haiku proposes · ElevenLabs Sound Generation | ~$0.05/video |
| B: Music | Background music tracks scoring multi-scene sections | Haiku proposes · ElevenLabs Music v3 | ~$0.10/video |
| C: Render review | Gemini analyses the rendered MP4 → structured retention feedback + scroll-risk windows | Gemini 2.5 Flash | ~$0.05/video |
| D: Image analysis | Auto-detect role / vibe / suggested treatment on upload | Gemini 2.5 Flash | ~$0.001/image |
| E: Scroll test | Per-scene "would they scroll?" prediction with one-change fix proposal | Gemini 2.5 Flash | ~$0.005/scene |
| F: Retention map | Horizontal strip at the top of the Storyline tab: one square per scene, coloured by retention score (composes C+E) | | free |

Total AI-augmented direction cost per video: under $0.70, less than the compute cost of a single render.

Why this stack

| Model | Best at | Used for |
| --- | --- | --- |
| Haiku 4.5 | Cheap, fast, scene-scoped directorial polish | SFX prompt generation, music prompt generation |
| ElevenLabs | Production audio (voice, SFX, music) | Generating actual audio assets |
| Gemini 2.5 Flash | Long-context video + image understanding | Render review, scroll prediction, image role detection |

No overlap. Haiku proposes, ElevenLabs generates, Gemini judges.

How the pieces compose

```
                     ┌─ Director (Storyline / Project) ─┐
                     │ Haiku, single textarea           │
                     └──────────────┬───────────────────┘
                                    │ proposes patches
┌─ Per-scene Director ──────────────┴──────────────────┐
│ "✦ Direct this scene" (already shipped)              │
└──────────────┬───────────────────────────────────────┘
               │ all funnel into …
┌──────────────▼───────────────────────────────────────┐
│ applyPatch(sceneId, patch) → PUT /scenes/:id         │
│ (single write surface)                               │
└──────────────┬───────────────────────────────────────┘
               │
               └─ Storyline reload
                  ├─ SFX manifest → SFX lane (Milestone A)
                  ├─ Music manifest → Music lane (Milestone B)
                  └─ Retention map ← Gemini (C+E+F)
```

Suggested phase order

  1. Read `.claude/handoffs/sfx-music-gemini-intelligence.md` end to end. It's the source of truth — endpoints, prompt shapes, tool schemas, cost notes, file paths, gotchas.
  2. Milestone A (SFX) backend then frontend — fastest win, low risk, validates the manifest pattern.
  3. Milestone C (Gemini render review) before B — the render-review feedback loop unlocks the rest.
  4. Milestone B (Music) wired into the music lane.
  5. Milestone D (image analysis) as a small standalone PR.
  6. Milestone E (per-scene scroll test): the killer feature for retention.
  7. Milestone F (retention map) — ties it all together at the top of the Storyline tab.

Out of scope this round

  • Per-scene voice cloning (different reader per scene = retention killer)
  • Auto-generated B-roll (Sora/Runway is a separate stack)
  • Render-blocking quality gates (Gemini is advisory, not gating)
  • Music-as-score timing-aware generation (phase 2)
  • Multi-language SFX (ElevenLabs SFX is English-only currently)

Reading order for the next session

  1. This PR description
  2. Then the full handoff doc on disk
  3. Then start coding at Milestone A

🤖 Generated with Claude Code

cuio added a commit that referenced this pull request Apr 27, 2026
feat: per-scene sfx via elevenlabs (milestone a of #25)
cuio pushed a commit that referenced this pull request Apr 29, 2026
Four milestones from #25 in one drop. Closes the retention feedback
loop: write → render → grade → fix.

**Foundation** (`packages/core/src/gemini/`)

- env.ts: GEMINI_API_KEY loader, mirrors anthropic/env.ts.
- client.ts: REST client with uploadFile (resumable Files API) +
  generateStructured (function-tool call). Picked direct fetch over
  @google/genai SDK because the SDK pulls in gRPC + Vertex auth we
  don't need.
- Cost-tracking: kind: "gemini" entries in CostOp. gemini-2.5-flash
  priced at $0.30/$2.50 per M tokens.
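
As a rough illustration of the cost-tracking entry, the sketch below turns token counts into a dollar figure using the $0.30 input / $2.50 output per-million prices quoted above. The `GeminiCostOp` shape and the `geminiCostOp` helper are hypothetical; the real `CostOp` type in the codebase may carry different fields.

```typescript
// Hypothetical entry shape: only `kind: "gemini"` is confirmed by the commit message.
interface GeminiCostOp {
  kind: "gemini";
  model: string;
  inputTokens: number;
  outputTokens: number;
  usd: number;
}

// gemini-2.5-flash prices from the commit message, in USD per million tokens.
const GEMINI_FLASH_USD_PER_M = { input: 0.3, output: 2.5 };

// Build a cost entry from the token counts reported for one Gemini call.
function geminiCostOp(inputTokens: number, outputTokens: number): GeminiCostOp {
  const usd =
    (inputTokens / 1_000_000) * GEMINI_FLASH_USD_PER_M.input +
    (outputTokens / 1_000_000) * GEMINI_FLASH_USD_PER_M.output;
  return { kind: "gemini", model: "gemini-2.5-flash", inputTokens, outputTokens, usd };
}
```

Under these prices, a call with 100,000 input tokens and 2,000 output tokens comes to about $0.035.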

**Milestone B — ElevenLabs Music**

- elevenlabs/music.ts: async polled-job pattern (POST returns music_id,
  GET polls until completed, then download signed audio_url).
  Geometric backoff 2/4/8s, max 30s per poll, 5min total.
- script/music/manifest.ts: same shape as SFX with scenesCovered (a
  track spans multiple scenes). resolveMusicSpan computes the
  master-timeline window from covered scenes. 9 unit tests.
- assembler emits <audio data-track-index="2" data-timeline-group="music">
  per entry, including data-music-duck-db so the producer applies a
  sidechain duck during voiceover.
- 4 routes: music-suggest (Haiku) / music-generate (ElevenLabs polled
  job) / GET music / DELETE music/:entryId.
- Frontend: 🎵 Music Wizard panel above Director — vibe textarea →
  Haiku proposes 1-3 tracks → click Generate per track. Applied
  tracks list below with Remove buttons.
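
The polled-job timing for music generation can be sketched as below. This is one reading of "geometric backoff 2/4/8s, max 30s per poll, 5min total": the wait doubles from 2s, is capped at 30s, and polling stops once the summed waits would exceed five minutes. `pollDelays`, `pollUntilDone`, and the `check` callback are hypothetical names, not the actual contents of `elevenlabs/music.ts`.

```typescript
// Waits (ms) before each poll: doubling from `firstMs`, capped at `maxDelayMs`,
// stopping once one more wait would push the total past `totalBudgetMs`.
function pollDelays(
  firstMs = 2_000,
  maxDelayMs = 30_000,
  totalBudgetMs = 300_000,
): number[] {
  const delays: number[] = [];
  let next = firstMs;
  let elapsed = 0;
  while (elapsed + next <= totalBudgetMs) {
    delays.push(next);
    elapsed += next;
    next = Math.min(next * 2, maxDelayMs);
  }
  return delays;
}

// Generic polling loop. `check` is a hypothetical callback that wraps the status GET
// and resolves to undefined while the job is still running.
async function pollUntilDone<T>(
  check: () => Promise<T | undefined>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (const delay of pollDelays()) {
    await sleep(delay);
    const result = await check();
    if (result !== undefined) return result;
  }
  throw new Error("music job did not complete within the polling budget");
}
```

The budget here counts only the waits, not the time spent in each request, so a real implementation might track wall-clock time instead.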

**Milestone C — Gemini render review**

- POST /storyline/render-review uploads the most recent .mp4 from
  <project>/renders/ to Gemini Files API, prompts with script meta +
  per-scene timings, returns:
    overallRetentionScore (0-100)
    scrollRiskWindows[] — severity, why, one-sentence fix
    brandConsistency { score, drift[] }
    audioMix { voiceClarity, musicLevels, sfxBalance }
    perScene[] { visualHook, paceMatch, onBrand, note }
- Persisted to .hyperframes/render-reviews/<ts>.json so reload shows
  the last review without re-running. GET /render-review serves it.
- Frontend: Retention Review panel with overall-score chip, scroll-risk
  windows with timestamps, 3-column audio/brand summary.
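
For orientation, the review payload listed above could be typed roughly as follows. The field names come from this commit message; every field type, the `severity` values, and the `startSec`/`endSec` timestamps are assumptions, and `isRenderReview` is a hypothetical shallow check one might run before persisting model output.

```typescript
// Assumed shapes: only the field names are taken from the commit message.
interface ScrollRiskWindow {
  startSec: number; // hypothetical: the source only says windows have timestamps
  endSec: number; // hypothetical
  severity: "low" | "medium" | "high"; // assumed value set
  why: string;
  fix: string; // the one-sentence fix
}

interface RenderReview {
  overallRetentionScore: number; // 0-100
  scrollRiskWindows: ScrollRiskWindow[];
  brandConsistency: { score: number; drift: string[] };
  audioMix: { voiceClarity: number; musicLevels: number; sfxBalance: number };
  perScene: { visualHook: number; paceMatch: number; onBrand: number; note: string }[];
}

// Shallow structural check only: it does not validate nested fields.
function isRenderReview(value: unknown): value is RenderReview {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const score = v.overallRetentionScore;
  return (
    typeof score === "number" &&
    score >= 0 &&
    score <= 100 &&
    Array.isArray(v.scrollRiskWindows) &&
    Array.isArray(v.perScene) &&
    typeof v.brandConsistency === "object" &&
    v.brandConsistency !== null &&
    typeof v.audioMix === "object" &&
    v.audioMix !== null
  );
}
```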

**Milestone E — Per-scene scroll test**

- POST /storyline/scroll-test samples 3 frames per scene via the new
  adapter.extractVideoFrameToBytes hook, asks Gemini "would they
  scroll?". Returns verdict + sceneStrengthScore + optional concrete
  patch.
- New per-card AI action: 📉 Scroll test. Patches drop into the
  amber suggestion stack, applied via the same pipeline.
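
The commit does not say where in a scene the three frames are taken. One plausible scheme, shown purely as an assumption, splits the scene into equal slices and samples the midpoint of each; `sampleTimes` is a hypothetical helper, not part of the adapter API.

```typescript
// Assumed strategy: divide [startSec, endSec] into `count` equal slices
// and return the midpoint of each slice as a sample timestamp.
function sampleTimes(startSec: number, endSec: number, count = 3): number[] {
  const step = (endSec - startSec) / count;
  return Array.from({ length: count }, (_, i) => startSec + step * (i + 0.5));
}
```

For a scene spanning 0s to 6s this yields samples at 1s, 3s, and 5s, which avoids the first and last frames where transitions tend to sit.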

**Milestone F — Retention map**

- Horizontal strip at the top of Storyline, one cell per scene,
  color-coded by retention strength (review's 3-dim avg, fallback to
  scrollTest score). Click → smooth-scroll to scene. Hidden until at
  least one signal lands.
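
The cell-scoring rule described above (average of the review's three per-scene dimensions, falling back to the scroll-test score) might look like this. The `SceneSignals` shape, the 0-100 range, the colour names, and the 70/40 thresholds are all assumptions made for illustration.

```typescript
// Hypothetical per-scene inputs to the retention map.
interface SceneSignals {
  // Per-scene dimensions from the render review (names from the commit, range assumed 0-100).
  review?: { visualHook: number; paceMatch: number; onBrand: number };
  // Score from the per-scene scroll test, if it has run.
  scrollTestScore?: number;
}

// Prefer the review's three-dimension average; fall back to the scroll-test score.
function retentionStrength(s: SceneSignals): number | undefined {
  if (s.review) {
    const { visualHook, paceMatch, onBrand } = s.review;
    return (visualHook + paceMatch + onBrand) / 3;
  }
  return s.scrollTestScore;
}

// Assumed thresholds and colour names.
function cellColor(strength: number | undefined): "green" | "amber" | "red" | "empty" {
  if (strength === undefined) return "empty";
  if (strength >= 70) return "green";
  if (strength >= 40) return "amber";
  return "red";
}

// The strip stays hidden until at least one scene has a signal.
function showRetentionMap(scenes: SceneSignals[]): boolean {
  return scenes.some((s) => retentionStrength(s) !== undefined);
}
```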

Tests: 728 core (+9 music-span), 281 studio. Lint, format, typecheck
clean. Live verify: all four panels render in expanded sidebar mode,
zero console errors.

Plan: #25 (Milestones B + C + E + F shipped; D image analysis is the
remaining piece, smaller follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>