feat(video): text-to-video generation engine (Wan2.2 via mlx-gen) #53
Merged
added 8 commits
June 6, 2026 00:57
…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill transient breaching the Metal cap. The model loads and runs normally; only an individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (the enforcer polls phys every 1s), and the in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA a chunk that overshoots the Metal wired limit kernel-panics the whole machine, so the post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left ~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB, peaks hit 110GB vs the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop + chunked-step mirror), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV+SDPA) + a conservative transient margin; if the sum would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup converts that RuntimeError into a finish_reason="error" output -- the request is refused cleanly and the machine does not crash. The legacy post-forward check stays as a backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10GB) > worst observed single-step jump (7.4GB).
- Preflight now also takes the max against the enforcer's recent high-water mark.
- Reverts the model_load_prefill_headroom_gb load rejection (user feedback: a model must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear (the active trough) and leans on phys_footprint stickiness + recent_peak to avoid a trough misread; a misread could still admit a crashing chunk. This is not a literal never-panic guarantee -- it needs on-hardware [memgate] log validation; the real fix is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration asserting the model forward is NOT called over-cap), verified to fail with the gate neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max -- zero regression.
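The gate arithmetic described in this commit can be sketched as follows. This is a minimal illustration, not the actual Scheduler._prefill_forward_gate: the function names, parameter names, and the plain-bytes interface are assumptions made for the example; only the formula (high-water current + estimate + margin compared against the hard cap, raising before the forward) comes from the commit message.

```python
GB = 1024 ** 3


def prefill_gate_admits(*, active, phys, recent_peak, estimate, margin, hard_cap):
    """Return True if the next prefill forward is predicted to fit (all values in bytes).

    Hypothetical helper: names and signature are illustrative only.
    """
    # High-water reading: take the largest of the three so that a momentary
    # trough in any single reading does not under-report current usage.
    current = max(active, phys, recent_peak)
    predicted = current + estimate + margin
    return predicted <= hard_cap


def run_chunk(forward, **readings):
    """Refuse BEFORE calling forward() when the prediction exceeds the cap."""
    if not prefill_gate_admits(**readings):
        # Per the commit message, a RuntimeError here is later converted
        # into a clean per-request error by existing cleanup code.
        raise RuntimeError("prefill refused: predicted peak exceeds hard cap")
    return forward()
```

With a 107.5GB cap and a 10GB margin, a reading of 95GB plus a 1GB estimate is admitted (106GB), while a reading of 100GB is refused (111GB) without the forward ever running.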
…prod

On-hardware validation found that the forward gate (and every estimate-based memory guard) was INERT in production: scheduler.memory_monitor is never wired, so the gate's `if memory_monitor is None: return` / `if estimate == 0: return` no-op fired on every chunk. A single glm4.5-air-106b (85GB) prefill drove memory to 107.525GB (OVER_HARD, past the 107.5 cap) with the gate silent; only the legacy reactive post-forward check caught it -- after the transient had already landed (no panic only because it fell inside the tolerance band). Unit tests passed because they inject a mock monitor.

Make the gate PHYS-based, so it no longer depends on the dead monitor:

- current = max(active, phys, recent_peak) -- all LIVE production readings (the same ones [memcheck:external] and the enforcer use).
- The model-dim estimate is now OPTIONAL: used if a monitor is present, else 0. At chunk granularity it is tiny anyway; the margin is the real guarantee.
- The transient margin is propagated from the ENFORCER (live), not the monitor (dead), so the gate's safety check actually fires. Bumped 10 -> 12GB: the un-modelled MoE expert-dequant transient is sub-poll (invisible to every memory read) and the m5max crash showed ~10.6GB; 12 is ceil(10.6) = 11 plus 1GB of padding, and the box only panics nearer ~110, so 12 leaves real cushion.
- The gate now no-ops ONLY when the guard is off, the hard limit is unset, or chunk<=0 -- never because the monitor/estimate is absent.

Add _log_prefill_gate_state_once: on the first prefill, log the RESOLVED state (phys-based, margin, cap, estimator active/disabled) at info, or at WARNING if the margin propagated as 0 (gate degraded to the bare cap check). The real failure was not the wiring gap itself but that the gap was SILENT; this makes that class of inert guard impossible to ship blind.

Document _preflight_memory_check as inert in prod (monitor-dependent; a phys-only version would never reject at idle).

Functional residual: a model that fills most of the cap (85GB on 128GB) gets long prompts refused cleanly (503-class) once accumulated KV nears the headroom -- correct behaviour (refuse the request, do not crash the box); to fit longer contexts, use a smaller quant. The refusal reuses the existing call-site RuntimeError handler, which calls _sync_and_clear_cache() to release the KV.

Not yet on-hardware validated -- this commit is what makes that validation meaningful (the prior gate could not fire). The m5max round must confirm: the startup log shows ACTIVE+margin before any request; a long glm4.5 prompt is refused cleanly AND memory returns to the ~85GB base; normal requests still succeed.

Tests: 2 monitor-dependent no-op tests replaced with phys-based firing tests (fires without a monitor, passes when it fits, margin-carries-it with estimate 0) + 3 startup-log tests (logs once, warns at margin 0, reports estimator disabled). Full suite 4546 pass / 3 known api_key fail / 19 skip -- zero regression.
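A rough sketch of the two behaviours this commit describes: the gate's narrowed no-op conditions with an optional estimator, and a log-once report of the resolved state. Everything named here (gate_should_refuse, GateStateLog, the logger name, the log message format) is hypothetical and chosen for illustration; the real _prefill_forward_gate and _log_prefill_gate_state_once may be shaped quite differently.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("memgate_sketch")  # hypothetical logger name
GB = 1024 ** 3


def gate_should_refuse(*, guard_enabled: bool, hard_cap: Optional[int],
                       chunk_tokens: int, active: int, phys: int,
                       recent_peak: int, margin: int,
                       estimator: Optional[Callable[[int], int]] = None) -> bool:
    """Return True when the chunk must be refused (all sizes in bytes)."""
    # The only no-op conditions: guard off, hard limit unset, or empty chunk.
    if not guard_enabled or not hard_cap or chunk_tokens <= 0:
        return False
    # A missing estimator contributes 0; it never disables the gate.
    estimate = estimator(chunk_tokens) if estimator is not None else 0
    current = max(active, phys, recent_peak)
    return current + estimate + margin > hard_cap


class GateStateLog:
    """Emit the resolved gate state exactly once."""

    def __init__(self) -> None:
        self._done = False

    def emit_once(self, *, margin: int, hard_cap: int,
                  estimator_active: bool) -> Optional[int]:
        """Log once and return the level used; return None on later calls."""
        if self._done:
            return None
        self._done = True
        # A margin of 0 means the gate has degraded to a bare cap check.
        level = logging.WARNING if margin <= 0 else logging.INFO
        log.log(level, "[memgate] phys-based gate: margin=%.1fGB cap=%.1fGB estimator=%s",
                margin / GB, hard_cap / GB,
                "active" if estimator_active else "disabled")
        return level
```

With no estimator at all, a 96GB reading plus the 12GB margin (108GB) is still refused against a 107.5GB cap, which is the property the earlier monitor-dependent gate lacked.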
Design spec for the fmlx video generation engine (Wan2.2 T2V A14B via the mlx-gen runtime, subprocess worker + memory lease architecture). The spec went through a 6-lens adversarial review; all 22 confirmed blocker/major findings are incorporated (rejection placement before pool admission, OpenAI SDK multipart compat, Metal wired-limit containment, lease double-count fix, venv bootstrap correctness, A/B protocol arithmetic).

Also: scripts/video_p0_measure.py (a standalone real-machine measurement harness that uses the kernel lifetime-max phys_footprint ledger, one child process per profile) and the hash-pinned worker venv lockfile generated on m5max (darwin/arm64, mlx-gen==0.18.14).
Implements docs/video-generation-engine-spec.md end to end:

- Discovery: diffusers-layout roots (model_index.json) register as model_type=video for allow-listed pipeline classes (WanPipeline); unknown or class-less pipelines are skipped outright -- this also closes the phantom-component hazard where a flat-layout diffusers dir registered its transformer/vae subdirs as bogus llm models.
- Pool guard: video entries are rejected with ModelTypeNotLoadableError at the TOP of pool.get_engine, before the admission/eviction loop, so a misrouted chat request can never evict resident LLMs. Server-side pre-pool 400s + default-model/fallback hygiene (chat-capable filter).
- VideoJobManager: FIFO queue (one job at a time), per-job JSON persistence with restart replay, memory-lease admission, subprocess worker spawn (isolated venv, -I, env whitelist), JSONL progress, watchdog (lease breach / stall / timeout / monitor failure), artifact retention with expires_at semantics.
- Worker: runs mflux Wan2_2_TI2V in the video venv; pins its own Metal wired limit inside the lease before loading (preventive wired-sum containment); low-RAM mode on by default (P0 measured the natural-mode peak at 49.3GB even for small profiles); failure manifests.
- Memory lease: enforcer ceiling subtraction (clamped >=1, never 0) + dynamic-ceiling add-back of min(worker footprint, lease) so the worker is counted exactly once; parent wired-limit reset on acquire/release.
- API: /v1/videos create/list/get/content/delete, OpenAI video-job shape; POST accepts both JSON and the openai SDK's multipart form; a per-request peak predictor gates memory beyond the static caps.
- Settings: video section (4-site wiring) + admin global-settings read/write + model-type dropdown; huggingface.disable_xet exported to HF_HUB_DISABLE_XET at serve startup (China Xet wall mitigation).

Tests: 108 new unit tests across discovery/pool+lease/manager/routes. Full suite zero regression (env-independent baseline; the 3 api_key failures reproduce at the base commit with OMLX_API_KEY leaked in env).

Known follow-up before merge: the memory_lease_gb default and the peak predictor constants await calibration from the low-RAM P0 pass on m5max (the out-of-the-box predictor currently exceeds the default lease).
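To illustrate why the pool guard sits at the top of get_engine, here is a toy pool, assuming a simple size-based FIFO eviction policy. SketchPool, Entry, the capacity model, and the eviction order are all invented for the example; only the ordering (type check before the admission/eviction loop) and the ModelTypeNotLoadableError name come from the commit message.

```python
from dataclasses import dataclass, field


class ModelTypeNotLoadableError(Exception):
    """Raised for entries the pool must never try to load (name from the commit)."""


@dataclass
class Entry:
    model_id: str
    model_type: str
    size_gb: float


@dataclass
class SketchPool:
    """Toy pool with FIFO eviction; a stand-in, not the real pool."""
    capacity_gb: float
    entries: dict
    resident: list = field(default_factory=list)

    def _used(self) -> float:
        return sum(self.entries[m].size_gb for m in self.resident)

    def get_engine(self, model_id: str) -> Entry:
        entry = self.entries[model_id]
        # Guard FIRST: reject before any admission or eviction work, so a
        # misrouted request cannot push resident models out.
        if entry.model_type == "video":
            raise ModelTypeNotLoadableError(model_id)
        if model_id in self.resident:
            return entry
        # Admission/eviction loop: evict oldest residents until the entry fits.
        while self.resident and self._used() + entry.size_gb > self.capacity_gb:
            self.resident.pop(0)
        self.resident.append(model_id)
        return entry
```

If the type check ran after the eviction loop instead, a chat request naming a 40GB video model would first evict resident LLMs to make room and only then fail.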
List the upstream-shared files patched by feat/video-engine so that future cherry-pick conflicts are easy to attribute and resolve.
Frontend: Video Generation section in admin settings (enable toggle, memory lease, defaults, queue/timeout/retention, worker python path) wired through dashboard.js load/save and i18n (en/zh/ru).

Calibration from the m5max P0 low-RAM matrix: peak memory is invariant to step count AND frame count (measured byte-identical) and scales only with per-frame spatial tokens, linearly. Predictor reparameterized to BASE 17.5GB + 0.0029 GB/token with a 6GB transient margin; default memory_lease_gb 28 -> 36 so that out-of-the-box settings admit the caps corner (fixes the v1 always-413 defect). Spec section 6.1 records the full measurement table.
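The reparameterized predictor can be written out as a short worked example. The constants (17.5GB base, 0.0029 GB/token, 6GB margin, 36GB default lease) are the ones quoted in this commit; the function names, the choice to add the margin to the prediction before comparing against the lease, and the way spatial tokens are supplied by the caller are assumptions, since the commit does not spell them out.

```python
BASE_GB = 17.5           # base footprint, from the commit message
GB_PER_TOKEN = 0.0029    # slope per per-frame spatial token, from the commit message
TRANSIENT_MARGIN_GB = 6.0
DEFAULT_LEASE_GB = 36.0


def predict_peak_gb(spatial_tokens_per_frame: int) -> float:
    """Predicted peak in GB.

    Steps and frame count are deliberately not parameters: the P0
    measurements found the peak invariant to both.
    """
    return BASE_GB + GB_PER_TOKEN * spatial_tokens_per_frame + TRANSIENT_MARGIN_GB


def admits(spatial_tokens_per_frame: int, lease_gb: float = DEFAULT_LEASE_GB) -> bool:
    """Assumed admission rule: the predicted peak must fit inside the lease."""
    return predict_peak_gb(spatial_tokens_per_frame) <= lease_gb


if __name__ == "__main__":
    for tokens in (1000, 4000, 5000):
        print(tokens, round(predict_peak_gb(tokens), 2), admits(tokens))
```

Under these assumptions the 36GB lease admits requests up to roughly 4300 spatial tokens per frame, since (36 - 17.5 - 6) / 0.0029 is about 4310.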
The fallback hygiene check blocked fallback whenever the default model's entry was missing or was a test double; only a KNOWN non-chat model_type (video/audio/embedding) should block it.
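The corrected rule can be sketched as below, assuming entries expose a model_type attribute. fallback_allowed and NON_CHAT_TYPES are hypothetical names; the set of blocking types is the one listed in the commit message.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

# Known non-chat types, as listed in the commit message.
NON_CHAT_TYPES = frozenset({"video", "audio", "embedding"})


def fallback_allowed(entry) -> bool:
    """Block fallback only for a KNOWN non-chat model_type.

    A missing entry (None) or a test double whose model_type is not a
    real string is treated as unknown, and unknown does not block.
    """
    model_type = getattr(entry, "model_type", None)
    if not isinstance(model_type, str):
        return True
    return model_type not in NON_CHAT_TYPES


if __name__ == "__main__":
    print(fallback_allowed(None))                                # missing entry
    print(fallback_allowed(MagicMock()))                         # test double
    print(fallback_allowed(SimpleNamespace(model_type="llm")))
    print(fallback_allowed(SimpleNamespace(model_type="video")))
```

The isinstance check is what keeps a MagicMock from blocking: its model_type attribute is another mock rather than a string, so it never matches the known non-chat set.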
Generated videos live under {base_path}/video-artifacts, never in-repo;
ignore patterns guard against measurement or test runs writing mp4s
into the worktree.
This was referenced Jun 10, 2026
Summary
Adds a text-to-video generation engine to fmlx: Wan2.2 T2V A14B (MLX-q8, diffusers layout) served through an OpenAI-style async job API (/v1/videos), generated by mlx-gen in an isolated subprocess worker, with a memory lease held against the ProcessMemoryEnforcer ceiling for safe co-residency with LLM serving on 128GB UMA machines.
Design spec: docs/video-generation-engine-spec.md (6-lens adversarial review, 22 findings incorporated; P0 measurement results in section 6.1).
Key pieces:
Disclosure
This branch includes the two prefill-gate commits from #52 by ancestry (1ea1eb3, 2080310; already validated on m5max). Merging this PR supersedes #52, which will be closed with a comment.
Self-merge under explicit owner authorization given in-session (2026-06-11).
Test plan
- 112 video unit tests.
- Full suite: 4659 pass / 0 fail / 19 skip (zero regression; the 3 baseline api_key failures were traced to a leaked environment variable and do not reproduce in a clean environment).
- m5max on-hardware: P0 seven-profile measurement plus an integration E2E run (job completed in 111s; lease 107.5 -> 70.7 -> 107.5; the co-resident gemma kept serving normally; zero OVER_HARD, zero panic).