feat(video): text-to-video generation engine (Wan2.2 via mlx-gen) by panwudi · Pull Request #53 · panwudi/flyto-mlx

panwudi · 2026-06-10T17:47:17Z

Summary

Adds a text-to-video generation engine to fmlx: Wan2.2 T2V A14B (MLX-q8, diffusers layout) served through an OpenAI-style async job API (/v1/videos), generated by mlx-gen in an isolated subprocess worker, with a memory lease held against the ProcessMemoryEnforcer ceiling for safe co-residency with LLM serving on 128GB UMA machines.

Design spec: docs/video-generation-engine-spec.md (6-lens adversarial review, 22 findings incorporated; P0 measurement results in section 6.1).

Key pieces:

- Discovery: diffusers roots (model_index.json) register as model_type=video for allow-listed pipelines; this closes the phantom-component hazard
- Pool/server guards: typed rejection BEFORE the admission loop; default-model and fallback hygiene; a chat request against a video model returns 400 with an endpoint hint
- VideoJobManager: FIFO queue, persistence + restart replay, lease admission, watchdog (lease breach / stall / timeout), artifact retention
- Worker: own venv (hash-locked mlx-gen==0.18.14), Metal wired-limit self-containment, low-RAM mode by default (49.3GB -> 18.8GB measured, not slower)
- Memory lease: ceiling subtraction at the single choke point + dynamic-ceiling add-back (worker counted exactly once) + parent wired reset
- /v1/videos: create (JSON + openai-SDK multipart) / list / get / content (Range) / delete; P0-calibrated per-request peak predictor
- Admin UI: video settings section (en/zh/ru), model-type dropdown; ModelInfo.model_type extension
- Settings: huggingface.disable_xet (China Xet wall mitigation)
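
A minimal sketch of the discovery rule, assuming the pipeline class is read from the `_class_name` key of model_index.json (the diffusers convention). The function name and the allow-list constant are hypothetical, not taken from the PR; only WanPipeline is named as an allowed class in the commit notes.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical allow-list; the PR names WanPipeline as an allowed pipeline class.
ALLOWED_VIDEO_PIPELINES = {"WanPipeline"}


def classify_diffusers_root(model_dir: Path) -> Optional[str]:
    """Return "video" for an allow-listed diffusers root, else None (skip)."""
    index_path = model_dir / "model_index.json"
    if not index_path.is_file():
        return None  # not a diffusers root; other discovery rules would apply
    try:
        index = json.loads(index_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None  # unreadable index: skip rather than guess
    if not isinstance(index, dict):
        return None
    if index.get("_class_name") in ALLOWED_VIDEO_PIPELINES:
        return "video"
    return None  # unknown or class-less pipeline: skipped outright
```

Under this sketch a caller would stop descending once a directory is identified as a diffusers root, which is one way to keep transformer/vae subdirectories from registering as separate models.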

Disclosure

This branch includes the two prefill-gate commits from #52 by ancestry (1ea1eb3, 2080310; already validated on m5max). Merging this PR supersedes #52, which will be closed with a comment.

Self-merge under explicit owner authorization given in-session (2026-06-11).

Test plan

- Unit: 112 video tests (discovery / pool rejection + lease / manager state machine incl. stall + watchdog + retention + restart replay / routes incl. multipart + pagination + predictor)
- Full suite: 4659 passed, 0 failed, 19 skipped (zero regression; the 3 baseline api_key failures were traced to an OMLX_API_KEY env leak and do not reproduce with a clean env)
- Real-machine P0 (m5max, M5 Max 128GB): 7-profile measurement matrix; peak memory is invariant to frame count and step count and linear in spatial tokens; low-RAM mode cuts memory by 62%
- Real-machine E2E (m5max, integrated): model discovered as video; misroute guards return 400; POST /v1/videos job completed in 111s (384x224x25f draft); lease acquired (ceiling 107.5 -> 70.7GB) and released (back to 107.5); co-resident gemma chat served normally during generation; content endpoint streamed a valid mp4; zero OVER_HARD, zero panic

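Adds a text-to-video engine to fmlx: Wan2.2 T2V A14B is exposed through an OpenAI-shaped async job API (/v1/videos), mlx-gen generates inside an isolated-venv subprocess, and the memory lease is deducted at the single ProcessMemoryEnforcer choke point, allowing safe co-residency with LLM serving on 128GB UMA machines.

Design spec: docs/video-generation-engine-spec.md (6-lens adversarial review, all 22 findings incorporated; P0 measurements in section 6.1).

Disclosure: by ancestry this branch contains the two prefill-gate commits from #52 (already validated on real m5max hardware); merging it supersedes #52. The self-merge rests on explicit owner authorization given in-session on 2026-06-11.

Tests: 112 video unit tests; full suite 4659 passed, 0 failed, 19 skipped (zero regression; the 3 baseline api_key failures trace to an environment-variable leak and do not reproduce in a clean environment); m5max real-machine P0 7-profile measurement plus integrated E2E (job completed in 111s, lease 107.5 -> 70.7 -> 107.5, co-resident gemma served normally, zero OVER_HARD, zero panic).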

…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill transient breaching the Metal cap. The model loads and runs normally; only an individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (the enforcer polls phys every 1s), and the in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA, a chunk that overshoots the Metal wired limit kernel-panics the whole machine, so the post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left ~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB, with peaks at 110GB against the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop + chunked-step mirror), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV+SDPA) + a conservative transient margin; if the sum would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup converts that RuntimeError into a finish_reason="error" output, so the request is refused cleanly and the machine does not crash. The legacy post-forward check stays as a backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10) > worst observed single-step jump (7.4GB).
- Preflight now also takes the max against the enforcer's recent high-water mark.
- Reverts the model_load_prefill_headroom_gb load rejection (user feedback: a model must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear (the active trough), leaning on phys_footprint stickiness + recent_peak to avoid a trough misread; a misread could still admit a crashing chunk. This is not a literal never-panic guarantee. It needs on-hardware [memgate] log validation; the real fix is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration asserting the model forward is NOT called over-cap), verified to fail with the gate neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max: zero regression.
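
The gate arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the Scheduler._prefill_forward_gate implementation: the function name, the byte-count parameters, and the error message are hypothetical, and the real gate obtains its readings from the enforcer and scheduler rather than from arguments.

```python
GB = 1024 ** 3


def prefill_forward_gate(active: int, phys: int, recent_peak: int,
                         estimate: int, margin: int, hard_cap: int) -> int:
    """Raise before the forward pass if the predicted peak exceeds the cap.

    All arguments are byte counts. Returns the predicted peak when it fits.
    """
    # High-water reading: the largest of the three live memory figures.
    current = max(active, phys, recent_peak)
    # Modelled prefill cost plus the margin for the un-modelled transient.
    predicted = current + estimate + margin
    if predicted > hard_cap:
        raise RuntimeError(
            f"prefill refused: predicted {predicted / GB:.1f}GB "
            f"exceeds hard cap {hard_cap / GB:.1f}GB"
        )
    return predicted
```

With a 107.5GB cap and a 10GB margin, a chunk arriving at a 90GB high-water mark with a 1GB estimate passes (101GB), while the same chunk at 98GB is refused (109GB) before any forward pass runs.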

…prod

On-hardware validation found the forward gate (and every estimate-based memory guard) was INERT in production: scheduler.memory_monitor is never wired, so the gate's `if memory_monitor is None: return` / `if estimate == 0: return` no-op fired on every chunk. A single glm4.5-air-106b (85GB) prefill drove memory to 107.525GB (OVER_HARD, past the 107.5 cap) with the gate silent; only the legacy reactive post-forward check caught it, after the transient had already landed (no panic only because it fell inside the tolerance band). Unit tests passed because they inject a mock monitor.

Make the gate PHYS-based, so it no longer depends on the dead monitor:

- current = max(active, phys, recent_peak): all LIVE production readings (the same ones [memcheck:external] and the enforcer use).
- The model-dim estimate is now OPTIONAL: used if a monitor is present, else 0. At chunk granularity it is tiny anyway; the margin is the real guarantee.
- The transient margin is propagated from the ENFORCER (live), not the monitor (dead), so the gate's safety actually fires. Bumped 10 -> 12GB: the un-modelled MoE expert-dequant transient is sub-poll (invisible to every memory read) and the m5max crash showed ~10.6GB; 12 = ceil(10.6) plus padding, and the box only panics nearer ~110, so 12 has real cushion.
- The gate now no-ops ONLY when the guard is off, the hard limit is unset, or chunk<=0; never because the monitor/estimate is absent.

Add _log_prefill_gate_state_once: on the first prefill, log the RESOLVED state (phys-based, margin, cap, estimator active/disabled) at info, or at WARNING if the margin propagated as 0 (gate degraded to the bare cap check). The prior failure was not the wiring gap itself but that the gap was SILENT; this makes that class of inert guard impossible to ship blind.

Document _preflight_memory_check as inert in prod (monitor-dependent; a phys-only version would never reject at idle).

Functional residual: a model that fills most of the cap (85GB on 128GB) gets long prompts refused cleanly (503-class) once accumulated KV nears the headroom. That is correct behaviour (refuse the request, do not crash the box); to fit longer contexts, use a smaller quant. The refusal reuses the existing call-site RuntimeError handler, which calls _sync_and_clear_cache() to release the KV.

Not yet on-hardware validated; this commit is what makes that validation meaningful (the prior gate could not fire). The m5max round must confirm:

- the startup log shows ACTIVE+margin before any request;
- a long glm4.5 prompt is refused cleanly AND memory returns to the ~85GB base;
- normal requests still succeed.

Tests: 2 monitor-dependent no-op tests replaced with phys-based firing tests (fires without a monitor, passes when it fits, margin-carries-it with estimate 0) + 3 startup-log tests (logs once, warns at margin 0, reports estimator disabled). Full suite 4546 pass / 3 known api_key fail / 19 skip: zero regression.
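
The log-once behaviour can be sketched like this. It is a hedged illustration, not the body of _log_prefill_gate_state_once: the class name, logger name, and message format are hypothetical; only the rules (log the resolved state once, WARNING when the margin resolved to 0) come from the commit message.

```python
import logging

logger = logging.getLogger("memgate.sketch")  # hypothetical logger name


class GateStateLogger:
    """Logs the resolved prefill-gate state exactly once."""

    def __init__(self) -> None:
        self._logged = False

    def log_once(self, margin_gb: float, hard_cap_gb: float,
                 estimator_active: bool) -> bool:
        """Emit the state on the first call; return True only if it logged."""
        if self._logged:
            return False
        self._logged = True
        # A zero margin means the gate degraded to a bare cap check: warn loudly.
        level = logging.WARNING if margin_gb <= 0 else logging.INFO
        logger.log(
            level,
            "[memgate] phys-based gate: margin=%.1fGB cap=%.1fGB estimator=%s",
            margin_gb,
            hard_cap_gb,
            "active" if estimator_active else "disabled",
        )
        return True
```

Keeping the once-flag on an object rather than in a module global makes the three startup-log cases (logs once, warns at margin 0, reports estimator disabled) straightforward to test in isolation.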

Design spec for the fmlx video generation engine (Wan2.2 T2V A14B via the mlx-gen runtime, subprocess worker + memory lease architecture). The spec went through a 6-lens adversarial review; all 22 confirmed blocker/major findings are incorporated (rejection placement before pool admission, OpenAI SDK multipart compat, Metal wired-limit containment, lease double-count fix, venv bootstrap correctness, A/B protocol arithmetic).

Also: scripts/video_p0_measure.py (a standalone real-machine measurement harness that uses the kernel lifetime-max phys_footprint ledger, one child process per profile) and the hash-pinned worker venv lockfile generated on m5max (darwin/arm64, mlx-gen==0.18.14).

Implements docs/video-generation-engine-spec.md end to end:

- Discovery: diffusers-layout roots (model_index.json) register as model_type=video for allow-listed pipeline classes (WanPipeline); unknown or class-less pipelines are skipped outright. This also closes the phantom-component hazard where a flat-layout diffusers dir registered its transformer/vae subdirs as bogus llm models.
- Pool guard: video entries are rejected with ModelTypeNotLoadableError at the TOP of pool.get_engine, before the admission/eviction loop, so a misrouted chat request can never evict resident LLMs. Server-side pre-pool 400s + default-model/fallback hygiene (chat-capable filter).
- VideoJobManager: FIFO queue (one job at a time), per-job JSON persistence with restart replay, memory-lease admission, subprocess worker spawn (isolated venv, -I, env whitelist), JSONL progress, watchdog (lease breach / stall / timeout / monitor failure), artifact retention with expires_at semantics.
- Worker: runs mflux Wan2_2_TI2V in the video venv; pins its own Metal wired limit inside the lease before loading (preventive wired-sum containment); low-RAM mode on by default (P0 measured the natural-mode peak at 49.3GB even for small profiles); failure manifests.
- Memory lease: enforcer ceiling subtraction (clamped >=1, never 0) + dynamic-ceiling add-back of min(worker footprint, lease) so the worker is counted exactly once; parent wired-limit reset on acquire/release.
- API: /v1/videos create/list/get/content/delete, OpenAI video-job shape; POST accepts both JSON and the openai SDK's multipart form; a per-request peak predictor gates memory beyond the static caps.
- Settings: video section (4-site wiring) + admin global-settings read/write + model-type dropdown; huggingface.disable_xet exported to HF_HUB_DISABLE_XET at serve startup (China Xet wall mitigation).

Tests: 108 new unit tests across discovery/pool+lease/manager/routes. Full suite zero regression (env-independent baseline; the 3 api_key failures reproduce at the base commit with OMLX_API_KEY leaked in env).

Known follow-up before merge: the memory_lease_gb default and the peak predictor constants await calibration from the low-RAM P0 pass on m5max (the out-of-the-box predictor currently exceeds the default lease).
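
The lease arithmetic can be illustrated with a small sketch. It assumes the raw dynamic ceiling already shrinks by whatever the worker process is using, which is why an add-back is needed to avoid charging the worker twice; that model, the function names, and the unit of the clamp are assumptions made for illustration, not details confirmed by the PR.

```python
def leased_ceiling(ceiling: int, lease: int) -> int:
    """Static ceiling with the lease subtracted, clamped to at least 1."""
    return max(1, ceiling - lease)


def leased_dynamic_ceiling(raw_dynamic_ceiling: int, lease: int,
                           worker_footprint: int) -> int:
    """Dynamic ceiling with the lease applied and the worker counted once.

    raw_dynamic_ceiling is assumed to have already dropped by the worker's
    footprint, so the part of that footprint covered by the lease is added back.
    """
    add_back = min(worker_footprint, lease)
    return max(1, raw_dynamic_ceiling + add_back - lease)
```

In this model, with a raw ceiling of 100 units and a lease of 30, the leased ceiling is 70 before the worker starts. Once the worker uses 20 and the raw ceiling drops to 80, the result is still 70, so the lease is charged exactly once. Only usage above the lease (for example 40 against a lease of 30) lowers the ceiling further.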

List the upstream-shared files patched by feat/video-engine so that future cherry-pick conflicts are easy to attribute and resolve.

Frontend: Video Generation section in admin settings (enable toggle, memory lease, defaults, queue/timeout/retention, worker python path), wired through dashboard.js load/save and i18n (en/zh/ru).

Calibration from the m5max P0 low-RAM matrix: peak memory is invariant to step count AND frame count (measured byte-identical), scaling only with per-frame spatial tokens. The predictor is reparameterized to BASE 17.5GB + 0.0029 GB/token with a 6GB transient margin; the default memory_lease_gb goes 28 -> 36 so that out-of-the-box settings admit the caps corner (fixes the v1 always-413 defect). Spec section 6.1 records the full measurement table.
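
A worked sketch of the calibrated predictor, using the constants quoted above. How the per-frame spatial token count is derived from the requested resolution is not stated here, so the sketch takes it as an input; adding the margin to the prediction and comparing the sum against the lease is also an assumption, and the function names are hypothetical.

```python
BASE_GB = 17.5              # calibrated base from the P0 low-RAM matrix
GB_PER_TOKEN = 0.0029       # linear cost per per-frame spatial token
TRANSIENT_MARGIN_GB = 6.0   # transient margin
DEFAULT_LEASE_GB = 36.0     # default memory_lease_gb after this change


def predict_peak_gb(spatial_tokens: int) -> float:
    """Predicted peak in GB; frame and step counts do not enter the formula."""
    return BASE_GB + GB_PER_TOKEN * spatial_tokens + TRANSIENT_MARGIN_GB


def fits_lease(spatial_tokens: int, lease_gb: float = DEFAULT_LEASE_GB) -> bool:
    """True if the predicted peak fits inside the lease."""
    return predict_peak_gb(spatial_tokens) <= lease_gb
```

Under these assumptions a request with 4000 spatial tokens predicts 17.5 + 11.6 + 6 = 35.1GB and fits the 36GB default lease, while 5000 tokens predicts 38GB and would be rejected.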

The fallback hygiene check blocked fallback whenever the default model's entry was missing or was a test double; only a KNOWN non-chat model_type (video/audio/embedding) should block it.
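
The corrected rule can be sketched as below. The function and constant names are hypothetical, and the real check may inspect the entry differently; the sketch only encodes the stated rule that an unknown, missing, or non-string model_type must not block fallback.

```python
from typing import Any

# Known non-chat types listed in the fix description.
NON_CHAT_MODEL_TYPES = {"video", "audio", "embedding"}


def fallback_allowed(default_entry: Any) -> bool:
    """Allow fallback unless the default model is a KNOWN non-chat type."""
    model_type = getattr(default_entry, "model_type", None)
    if not isinstance(model_type, str):
        # Missing entry or a test double: nothing is known, so do not block.
        return True
    return model_type not in NON_CHAT_MODEL_TYPES
```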

Generated videos live under {base_path}/video-artifacts, never in-repo; ignore patterns guard against measurement or test runs writing mp4s into the worktree.

#54) Follow-up to #53: video models no longer appear in the chat model picker.

yuanwei added 8 commits June 6, 2026 00:57

panwudi merged commit c617367 into main Jun 10, 2026

This was referenced Jun 10, 2026

fix(memory): make the prefill forward gate phys-based (validated on m5max) #52

Merged

fix(chat): allowlist llm/vlm in the chat model picker #54

Merged

This was referenced Jun 10, 2026

feat(chat): generate videos inline from the chat UI #55

Merged

fix(chat): video bubble UX defects + active-model overlay #56

Merged

sync: DiffusionGemma basic serving support from upstream #74

Merged

panwudi merged 8 commits into main from feat/video-engine
