feat(image-gen): add live denoising preview event to ImageGenerationEvent #1747

@roryford

Summary

ImageGenerationEvent currently emits only .progress(step:total:) and .completed(URL). Consuming apps can show a "Step X of Y" spinner but cannot show the image forming during the denoising loop. The diffusion backends already iterate per-step latents internally — the data exists, it just isn't surfaced through the public event stream.

Adding a preview event would let apps deliver a live in-progress preview (the single biggest "delight" UX win for a local image-gen app), without each consumer having to reach into backend internals.

Current state

  • Sources/ManifoldModelCatalog/ImageGenerationEvent.swift — two cases only (.progress, .completed).
  • Sources/ManifoldMLX/MLXDiffusionBackend.swift and Sources/ManifoldMLX/Diffusion/Flux/FluxDiffusionBackend.swift already step through latents (generateLatents() / denoiser.i), so intermediate representations are available mid-loop.

Proposed change

Add a preview case to ImageGenerationEvent, e.g.:

/// Optional low-res/decoded preview of the image as it forms during the
/// denoising loop. Emitted on a subset of steps (backend-throttled).
case preview(step: Int, total: Int, url: URL)   // or CGImage / raw pixels
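
To make the shape concrete, here is a minimal sketch of how the enum and a consumer might look with the URL payload variant. It is an illustration only: the two existing cases are written from the description above, not copied from ImageGenerationEvent.swift, and describe(_:) is a hypothetical consumer-side helper, not part of the library.

```swift
import Foundation

// Sketch only. The real type lives in
// Sources/ManifoldModelCatalog/ImageGenerationEvent.swift and may differ
// (extra conformances, different labels, and so on).
enum ImageGenerationEvent {
    case progress(step: Int, total: Int)
    // Proposed case. The payload type (URL vs. in-memory image) is still open.
    case preview(step: Int, total: Int, url: URL)
    case completed(URL)
}

// Hypothetical helper: turns an event into a status line for the UI.
func describe(_ event: ImageGenerationEvent) -> String {
    switch event {
    case .progress(let step, let total):
        return "Step \(step) of \(total)"
    case .preview(let step, let total, let url):
        return "Preview at step \(step) of \(total): \(url.lastPathComponent)"
    case .completed(let url):
        return "Done: \(url.lastPathComponent)"
    }
}
```

Note that adding a case is source-breaking for any consumer that switches exhaustively over the enum without a default branch, which may be worth calling out in the release notes.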

Design questions to decide:

  • Payload: file URL (consistent with .completed, avoids CoreGraphics in the inference layer) vs. CGImage/pixel buffer (no disk write per preview tick). Decoding a latent to a viewable image has a cost on every tick, and encoding and writing a file for each preview adds to it, which may be too heavy. A lightweight in-memory buffer emitted only every N steps may be better.
  • Throttling: previews should be throttled (e.g. every 2-4 steps, or via a previewStride on ImageGenerationConfig) so that short 1-4 step Turbo/Schnell runs and long 20-50 step runs both behave sensibly.
  • Opt-in: gate behind a config flag (emitPreviews: Bool / previewStride: Int?) so backends that can't cheaply decode intermediates can no-op.
  • Cost: VAE-decoding intermediate latents adds GPU work per preview; document the perf tradeoff.
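
One possible way to combine the opt-in and throttling points is a single optional stride, where nil disables previews. The sketch below assumes 1-based step numbers and a standalone PreviewConfig type; both are assumptions made for illustration, and neither previewStride nor these helper functions exist in the codebase today.

```swift
// Hypothetical config. In practice this field would presumably live on
// ImageGenerationConfig; it is a separate struct here to stay self-contained.
struct PreviewConfig {
    var previewStride: Int? = nil   // nil means previews are off (opt-in)
}

/// Returns true when a preview should be emitted after `step` of `total`.
/// Assumes steps are numbered from 1.
func shouldEmitPreview(step: Int, total: Int, config: PreviewConfig) -> Bool {
    // Previews are off unless a positive stride was set.
    guard let every = config.previewStride, every > 0 else { return false }
    // Skip the final step: .completed already delivers the finished image.
    guard step >= 1, step < total else { return false }
    return step % every == 0
}

/// Lists the steps that would produce a preview for a run of `total` steps.
func previewSteps(total: Int, config: PreviewConfig) -> [Int] {
    guard total >= 1 else { return [] }
    return (1...total).filter {
        shouldEmitPreview(step: $0, total: total, config: config)
    }
}
```

With a stride of 4, a 20-step run would emit previews after steps 4, 8, 12, and 16, while a 4-step Turbo/Schnell run with a stride of 2 would emit a single preview after step 2. A backend that cannot cheaply decode intermediates could ignore the stride and never emit the event.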

Motivation / consumer

LocalImage (the macOS showcase app) wants to show the image emerging during generation instead of just a step counter. It can't today because the public stream has no preview channel. This belongs in MK so every consumer benefits and backends own the latent-decode logic.

Filed from the LocalImage integration review.
