feat(video): text-to-video generation engine (Wan2.2 via mlx-gen)#53

Merged
panwudi merged 8 commits into main from feat/video-engine on Jun 10, 2026

@panwudi panwudi commented Jun 10, 2026

Summary

Adds a text-to-video generation engine to fmlx. Wan2.2 T2V A14B (MLX-q8, diffusers layout) is served through an OpenAI-style async job API (/v1/videos) and generated by mlx-gen in an isolated subprocess worker. The worker holds a memory lease against the ProcessMemoryEnforcer ceiling so that it can safely co-reside with LLM serving on 128GB UMA machines.

Design spec: docs/video-generation-engine-spec.md (6-lens adversarial review, 22 findings incorporated; P0 measurement results in section 6.1).

Key pieces:

  • Discovery: diffusers roots (model_index.json) register as model_type=video for allow-listed pipelines; phantom-component hazard closed
  • Pool/server guards: typed rejection BEFORE the admission loop; default-model and fallback hygiene; chat on a video model gives 400 + endpoint hint
  • VideoJobManager: FIFO queue, persistence + restart replay, lease admission, watchdog (lease breach / stall / timeout), artifact retention
  • Worker: own venv (hash-locked mlx-gen==0.18.14), Metal wired-limit self-containment, low-RAM mode default (49.3GB -> 18.8GB measured, not slower)
  • Memory lease: ceiling subtraction at the single choke point + dynamic-ceiling add-back (worker counted exactly once) + parent wired reset
  • /v1/videos: create (JSON + openai-SDK multipart) / list / get / content (Range) / delete; P0-calibrated per-request peak predictor
  • Admin UI: video settings section (en/zh/ru), model-type dropdown; ModelInfo.model_type extension
  • huggingface.disable_xet setting (China Xet wall mitigation)
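
The "typed rejection BEFORE the admission loop" guard can be illustrated with a minimal sketch. Only `ModelTypeNotLoadableError` and `get_engine` are names taken from this PR; the `EnginePool` class, its `registry` and `capacity` fields, and the oldest-first eviction policy are hypothetical stand-ins, not the actual fmlx pool.

```python
class ModelTypeNotLoadableError(Exception):
    """Raised when a model type cannot be served through the LLM engine pool."""


NON_POOL_TYPES = {"video"}  # assumption: types the pool must never load


class EnginePool:
    """Hypothetical pool: a registry of model types plus a resident list."""

    def __init__(self, registry, capacity=1):
        self.registry = registry  # model_id -> model_type
        self.capacity = capacity
        self.resident = []        # loaded model ids, oldest first

    def get_engine(self, model_id):
        model_type = self.registry[model_id]
        # Typed rejection comes first: for a video entry nothing below runs,
        # so a misrouted chat request cannot trigger an eviction.
        if model_type in NON_POOL_TYPES:
            raise ModelTypeNotLoadableError(
                f"{model_id} is a {model_type} model; use /v1/videos"
            )
        if model_id in self.resident:
            return model_id
        # Admission/eviction loop (illustrative policy: evict the oldest).
        while len(self.resident) >= self.capacity:
            self.resident.pop(0)
        self.resident.append(model_id)
        return model_id


pool = EnginePool({"gemma": "llm", "wan2.2": "video"})
pool.get_engine("gemma")
try:
    pool.get_engine("wan2.2")
except ModelTypeNotLoadableError as exc:
    print("rejected:", exc)
print("still resident:", pool.resident)
```

Placing the check above the loop is the whole point: if it sat after admission, the resident LLM would already have been evicted by the time the request was refused.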

Disclosure

This branch includes the two prefill-gate commits from #52 by ancestry (1ea1eb3, 2080310; already validated on m5max). Merging this PR supersedes #52, which will be closed with a comment.

Self-merge under explicit owner authorization given in-session (2026-06-11).

Test plan

  • Unit: 112 video tests (discovery / pool rejection + lease / manager state machine incl. stall + watchdog + retention + restart replay / routes incl. multipart + pagination + predictor)
  • Full suite: 4659 passed, 0 failed, 19 skipped (zero regression; the 3 baseline api_key failures were traced to an OMLX_API_KEY env leak and do not reproduce with a clean env)
  • Real-machine P0 (m5max, M5 Max 128GB): 7-profile measurement matrix; peak memory frames- and steps-invariant, spatial-token-linear; low-RAM mode 62% memory reduction
  • Real-machine E2E (m5max, integrated): model discovered as video; misroute guards return 400; POST /v1/videos job completed in 111s (384x224x25f draft); lease acquired (ceiling 107.5 -> 70.7GB) and released (back to 107.5); co-resident gemma chat served normally during generation; content endpoint streamed a valid mp4; zero OVER_HARD, zero panic

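Adds a text-to-video engine to fmlx: Wan2.2 T2V A14B is served through an OpenAI-style async job API (/v1/videos), mlx-gen generates in an isolated-venv subprocess, and the memory lease is deducted at the single ProcessMemoryEnforcer choke point, so the engine can safely co-reside with LLM serving on 128GB UMA machines.

Design spec: docs/video-generation-engine-spec.md (6-lens adversarial review, all 22 findings incorporated; P0 measurements in section 6.1).

Disclosure: by ancestry, this branch includes the two prefill-gate commits from #52 (already validated on m5max hardware); merging supersedes #52. The self-merge rests on explicit owner authorization given in-session on 2026-06-11.

Tests: 112 video unit tests; full suite 4659 passed, 0 failed, 19 skipped (zero regression; the 3 baseline api_key failures were traced to an environment-variable leak and do not reproduce in a clean environment); m5max real-machine P0 seven-profile measurement plus integrated E2E (job completed in 111s, lease 107.5 -> 70.7 -> 107.5, co-resident gemma served normally, zero OVER_HARD, zero panic).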

yuanwei added 8 commits June 6, 2026 00:57
…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill
transient breaching the Metal cap. The model loads and runs normally; only an
individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (enforcer polls phys every 1s) and the
in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA a chunk that
overshoots the Metal wired limit kernel-panics the whole machine, so the
post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left
~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB,
peaks hit 110GB vs the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop +
  chunked-step mirror), predict current(high-water: max active/phys/recent_peak)
  + estimate_prefill_peak_bytes(KV+SDPA) + a conservative transient margin; if it
  would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup
  converts that RuntimeError into a finish_reason="error" output -- request
  refused cleanly, machine not crashed. Legacy post-forward check stays as backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated
  settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the
  MoE expert-dequant spike, so this margin carries the guarantee: margin (10) >
  worst observed single-step jump (7.4GB).
- Preflight now also maxes against the enforcer recent high-water mark.
- Reverts model_load_prefill_headroom_gb load rejection (user feedback: a model
  must be loadable; refuse the request, not the model).
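
The gate arithmetic described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's code: the function name, the use of GB floats instead of bytes, and the argument list are all hypothetical. Only the formula (high-water current + estimate + margin compared against the hard cap, raising `RuntimeError` before the forward) comes from the commit message.

```python
def prefill_forward_gate(active_gb, phys_gb, recent_peak_gb,
                         estimate_gb, margin_gb, hard_cap_gb):
    """Refuse a prefill chunk BEFORE the forward pass if it might breach the cap.

    All arguments are GB floats (an assumption made for readability).
    """
    # High-water reading: the largest of the three current measurements.
    current_gb = max(active_gb, phys_gb, recent_peak_gb)
    # The margin covers the transient that the estimate does not model.
    predicted_gb = current_gb + estimate_gb + margin_gb
    if predicted_gb > hard_cap_gb:
        raise RuntimeError(
            f"prefill refused: predicted {predicted_gb:.1f}GB "
            f"> hard cap {hard_cap_gb:.1f}GB"
        )
    return predicted_gb


# Fits: 95 + 1 + 10 = 106 <= 107.5
print(prefill_forward_gate(85.0, 90.0, 95.0, 1.0, 10.0, 107.5))

# Does not fit: 97 + 1 + 10 = 108 > 107.5, so the request is refused.
try:
    prefill_forward_gate(85.0, 90.0, 97.0, 1.0, 10.0, 107.5)
except RuntimeError as exc:
    print(exc)
```

Raising before the forward matters because, per the root cause above, a post-forward check never gets to run once the chunk has overshot the wired limit.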

Honest residual: the gate reads current just after the prior chunk's cache clear
(active trough), leaning on phys_footprint stickiness + recent_peak to avoid a
trough misread; a misread could still admit a crashing chunk. Not a literal
never-panic guarantee -- needs on-hardware [memgate] log validation; the real fix
is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration
asserting model forward NOT called over-cap), verified to fail with gate
neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max -- zero
regression.

---

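Stops the whole-machine watchdog panic on m5max caused by a large model's prefill
transient breaching the Metal cap. The model loads and runs normally; only the
individual request that cannot fit is refused (503-class), and the model is not banned.

Root cause: oMLX manages memory reactively (the enforcer polls phys every 1s), and the
prefill memory check runs after self.model()+mx.eval. On Apple UMA a chunk that breaches
the Metal wired limit kernel-panics the whole machine, so the post-forward Python check
never gets to run. glm4.5-air-106b (85GB) on a 128GB machine leaves only ~22GB for
KV+prefill; the per-chunk MoE dequantization transient (measured up to 7.4GB, peaking at
110GB against the 107.5GB cap) breached the cap and rebooted the machine twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop +
  chunked-step mirror), predict current (high-water: max of active/phys/recent_peak) +
  estimate_prefill_peak_bytes(KV+SDPA) + a conservative transient margin; if that exceeds
  the hard cap, raise before the forward. The existing jundot#1405 cleanup converts the
  RuntimeError into a finish_reason=error output -- the request is refused cleanly and
  the machine does not crash. The old post-forward check stays as a backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings ->
  enforcer -> scheduler. The estimate does not include the MoE dequantization transient,
  so the margin carries the guarantee: margin (10) > worst measured single-step jump
  (7.4GB).
- Preflight now also takes the max against the enforcer's recent high-water mark.
- Removes the model_load_prefill_headroom_gb load rejection (user feedback: a model must
  be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the previous chunk's cache clear (the
active trough) and relies on phys_footprint stickiness + recent_peak to avoid reading the
trough; a trough read could still admit a chunk that crashes. This is not a literal
"never panic" guarantee -- it needs on-hardware [memgate] log validation; the real fix is
preemptive KV offload (separate work).

Tests: 12 new, verified to fail when the gate is disabled. Full suite on m2max: 4542
pass / 3 known api_key fail / 19 skip -- zero regression.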
…prod

On-hardware validation found the forward gate (and every estimate-based memory
guard) was INERT in production: scheduler.memory_monitor is never wired, so the
gate's `if memory_monitor is None: return` / `if estimate == 0: return` no-op
fired on every chunk. A single glm4.5-air-106b (85GB) prefill drove memory to
107.525GB (OVER_HARD, past the 107.5 cap) with the gate silent; only the legacy
reactive post-forward check caught it -- after the transient had already landed
(no panic only because it fell inside the tolerance band). Unit tests passed
because they inject a mock monitor.

Make the gate PHYS-based, so it no longer depends on the dead monitor:
- current = max(active, phys, recent_peak) -- all LIVE production readings (the
  same ones [memcheck:external] and the enforcer use).
- The model-dim estimate is now OPTIONAL: used if a monitor is present, else 0.
  At chunk granularity it is tiny anyway; the margin is the real guarantee.
- The transient margin is propagated from the ENFORCER (live), not the monitor
  (dead), so the gate's safety actually fires. Bumped 10 -> 12GB: the un-modelled
  MoE expert-dequant transient is sub-poll (invisible to every memory read) and
  the m5max crash showed ~10.6GB; 12 = ceil(10.6) padded, and the box only
  panics nearer ~110 so 12 has real cushion.
- The gate now no-ops ONLY when the guard is off, the hard limit is unset, or
  chunk<=0 -- never because the monitor/estimate is absent.

Add _log_prefill_gate_state_once: on the first prefill, log the RESOLVED state
(phys-based, margin, cap, estimator active/disabled) at info, or WARNING if the
margin propagated as 0 (gate degraded to the bare cap check). The prior failure
was not the wiring gap but that it was SILENT; this makes that class of inert
guard impossible to ship blind. Document _preflight_memory_check as inert in
prod (monitor-dependent; a phys-only version would never reject at idle).
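
A log-once helper of the kind described above might look like this minimal sketch. The class name, logger name, and message format are hypothetical. Only the behaviour is taken from the commit message: log the resolved state on the first prefill, at info normally and at WARNING when the margin resolved to 0.

```python
import logging

logger = logging.getLogger("memgate")  # logger name is an assumption


class PrefillGateStateLogger:
    """Hypothetical helper: report the resolved gate state exactly once."""

    def __init__(self):
        self._logged = False

    def log_state_once(self, margin_gb, hard_cap_gb, estimator_active):
        """Return the level used, or None if the state was already logged."""
        if self._logged:
            return None
        self._logged = True
        # A zero margin means the gate has degraded to a bare cap check,
        # which deserves a WARNING rather than a routine info line.
        level = logging.WARNING if margin_gb <= 0 else logging.INFO
        logger.log(
            level,
            "prefill gate: phys-based, margin=%.1fGB, cap=%.1fGB, estimator=%s",
            margin_gb,
            hard_cap_gb,
            "active" if estimator_active else "disabled",
        )
        return level


state = PrefillGateStateLogger()
state.log_state_once(12.0, 107.5, estimator_active=False)  # logs at INFO
state.log_state_once(12.0, 107.5, estimator_active=False)  # no-op
```

Returning the level makes the behaviour easy to assert in a unit test without capturing log output.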

Functional residual: a model that fills most of the cap (85GB on 128GB) gets
long prompts refused cleanly (503-class) once accumulated KV nears the headroom
-- correct behaviour (refuse the request, not crash the box); fit longer
contexts with a smaller quant. The refusal reuses the existing call-site
RuntimeError handler that _sync_and_clear_cache()s the KV.

Not yet on-hardware validated -- this commit is what makes that validation
meaningful (the prior gate could not fire). m5max round must confirm: startup
log shows ACTIVE+margin before any request; a long glm4.5 prompt is refused
cleanly AND memory returns to the ~85GB base; normal requests still succeed.

Tests: 2 monitor-dependent no-op tests replaced with phys-based firing tests
(fires without a monitor, passes when it fits, margin-carries-it with estimate
0) + 3 startup-log tests (logs once, warns at margin 0, reports estimator
disabled). Full suite 4546 pass / 3 known api_key fail / 19 skip -- zero
regression.

---

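On-hardware validation found that the forward gate (and every estimate-based memory
guard) is INERT in production: scheduler.memory_monitor was never wired, so the gate's
`if memory_monitor is None: return` / `if estimate == 0: return` no-ops on every chunk.
A single glm4.5-air-106b (85GB) prefill pushed memory to 107.525GB (OVER_HARD, past the
107.5 cap) while the gate stayed silent; only the legacy reactive post-forward check
caught it, after the transient had already landed (no crash purely because it fell inside
the tolerance band). The unit tests were green only because they inject a mock monitor.

Make the gate PHYS-based so that it no longer depends on the dead monitor:
- current = max(active, phys, recent_peak) -- all LIVE production readings (the same
  sources as memcheck:external and the enforcer).
- The model-dimension estimate is now OPTIONAL: used only when a monitor is present,
  otherwise 0. At chunk granularity it is tiny anyway; the margin is the real guarantee.
- The transient margin is propagated from the ENFORCER (live), not the monitor (dead), so
  the gate's safety can actually fire. 10 -> 12GB: the unmodelled MoE dequantization
  transient is sub-poll (invisible to every reading), and the m5max crash measured
  ~10.6GB; 12 = ceil(10.6) plus padding, and the machine only panics near ~110, so 12
  leaves real cushion.
- The gate now no-ops only when the guard is off, the hard limit is unset, or chunk<=0 --
  never because the monitor/estimate is missing.

Add _log_prefill_gate_state_once: on the first prefill, log the resolved state
(phys-based, margin, cap, estimator active/disabled) at info, or at WARNING if the margin
propagated as 0 (the gate degrades to a bare cap check). The earlier failure was not the
missing wiring itself but that it was silent; this makes it impossible to ship that class
of inert guard blind again. Mark _preflight_memory_check as inert in production
(monitor-dependent; a phys-only version would never reject at idle).

Functional residual: for a model that fills most of the cap (85GB on 128GB), long prompts
are refused cleanly (503-class) once accumulated KV nears the headroom -- correct
behaviour (refuse the request, do not crash the machine); use a smaller quant for longer
contexts. The refusal reuses the existing call-site RuntimeError handler, which calls
_sync_and_clear_cache() to release the KV.

Not yet validated on hardware -- this commit is the precondition that makes validation
meaningful (the previous gate could never fire). The m5max round must confirm: the
startup log shows ACTIVE+margin before any request; a long glm4.5 prompt is refused
cleanly and memory falls back to the ~85GB base; normal requests still succeed.

Tests: 2 monitor-dependent no-op tests replaced with phys-based firing tests (fires
without a monitor, passes when it fits, margin carries it when the estimate is 0) + 3
startup-log tests (logs only once, WARNING at margin 0, reports estimator disabled
without a monitor). Full suite 4546 pass / 3 known api_key fail / 19 skip -- zero
regression.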
Design spec for the fmlx video generation engine (Wan2.2 T2V A14B via the
mlx-gen runtime, subprocess worker + memory lease architecture). The spec
went through a 6-lens adversarial review; all 22 confirmed blocker/major
findings are incorporated (rejection placement before pool admission,
OpenAI SDK multipart compat, Metal wired-limit containment, lease
double-count fix, venv bootstrap correctness, A/B protocol arithmetic).

Also: scripts/video_p0_measure.py (standalone real-machine measurement
harness using the kernel lifetime-max phys_footprint ledger, one child
process per profile) and the hash-pinned worker venv lockfile generated
on m5max (darwin/arm64, mlx-gen==0.18.14).

---

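Design spec for the fmlx video generation engine (Wan2.2 T2V A14B, mlx-gen runtime,
subprocess worker + memory lease). It went through a 6-lens adversarial review, and all
22 blocker/major findings are incorporated (rejection point moved ahead of pool
admission, OpenAI SDK multipart compatibility, Metal wired-limit self-containment, lease
double-count fix, venv bootstrap command fix, A/B protocol arithmetic fix).

Also included: the P0 real-machine measurement harness (kernel lifetime-max ledger, one
child process per profile) and the hash-locked worker venv lockfile generated on m5max
(mlx-gen==0.18.14).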
Implements docs/video-generation-engine-spec.md end to end:

- Discovery: diffusers-layout roots (model_index.json) register as
  model_type=video for allow-listed pipeline classes (WanPipeline);
  unknown or class-less pipelines are skipped outright -- this also
  closes the phantom-component hazard where a flat-layout diffusers dir
  registered its transformer/vae subdirs as bogus llm models.
- Pool guard: video entries are rejected with ModelTypeNotLoadableError
  at the TOP of pool.get_engine, before the admission/eviction loop, so
  a misrouted chat request can never evict resident LLMs. Server-side
  pre-pool 400s + default-model/fallback hygiene (chat-capable filter).
- VideoJobManager: FIFO queue (one job at a time), per-job JSON
  persistence with restart replay, memory-lease admission, subprocess
  worker spawn (isolated venv, -I, env whitelist), JSONL progress,
  watchdog (lease breach / stall / timeout / monitor failure), artifact
  retention with expires_at semantics.
- Worker: runs mflux Wan2_2_TI2V in the video venv; pins its own Metal
  wired limit inside the lease before loading (preventive wired-sum
  containment); low-RAM mode on by default (P0 measured the natural-mode
  peak at 49.3GB even for small profiles); failure manifests.
- Memory lease: enforcer ceiling subtraction (clamped >=1, never 0) +
  dynamic-ceiling add-back of min(worker footprint, lease) so the worker
  is counted exactly once; parent wired-limit reset on acquire/release.
- API: /v1/videos create/list/get/content/delete, OpenAI video-job
  shape; POST accepts both JSON and the openai SDK's multipart form;
  per-request peak predictor gates memory beyond the static caps.
- Settings: video section (4-site wiring) + admin global-settings
  read/write + model-type dropdown; huggingface.disable_xet exported to
  HF_HUB_DISABLE_XET at serve startup (China Xet wall mitigation).
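
The lease accounting ("ceiling subtraction" plus "add-back of min(worker footprint, lease)") can be sketched as below. This is one reading of the bullet above, not the enforcer's code. It assumes that the dynamic ceiling is derived from live system memory and therefore already shrinks as the worker grows, and that the clamp floor of 1 is in the same unit as the ceiling. The function name and GB units are hypothetical.

```python
def effective_ceiling_gb(dynamic_ceiling_gb, lease_gb, worker_footprint_gb):
    """Ceiling available to LLM serving while a video lease is held.

    Assumption: dynamic_ceiling_gb already reflects the worker's memory use,
    so subtracting the lease on top of it would count the worker twice.
    """
    # Add back what the worker really uses, capped at the lease it was granted.
    add_back = min(worker_footprint_gb, lease_gb)
    # Subtract the lease once; clamp so the ceiling never reaches 0.
    return max(dynamic_ceiling_gb + add_back - lease_gb, 1.0)


BASE = 107.5   # illustrative base ceiling in GB
LEASE = 36.0   # default memory_lease_gb from this PR

# Worker not started yet: the full lease is reserved up front.
print(effective_ceiling_gb(BASE, LEASE, 0.0))          # 71.5
# Worker at 20GB: the dynamic ceiling dropped by 20, the add-back restores it.
print(effective_ceiling_gb(BASE - 20.0, LEASE, 20.0))  # 71.5
# Worker at 40GB, over its lease: only the 4GB overshoot eats into the LLM budget.
print(effective_ceiling_gb(BASE - 40.0, LEASE, 40.0))  # 67.5
```

Under these assumptions the LLM-side ceiling stays constant while the worker remains inside its lease, which is what "counted exactly once" means in this sketch.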

Tests: 108 new unit tests across discovery/pool+lease/manager/routes.
Full suite zero regression (env-independent baseline; the 3 api_key
failures reproduce at the base commit with OMLX_API_KEY leaked in env).

Known follow-up before merge: memory_lease_gb default and the peak
predictor constants await calibration from the low-RAM P0 pass on
m5max (out-of-the-box predictor currently exceeds the default lease).

---

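Implements the text-to-video engine end to end per docs/video-generation-engine-spec.md:

- Discovery: diffusers-layout roots (model_index.json) register as type video per the
  allow-list; unknown or class-less pipelines are skipped outright, which also fixes the
  phantom-component hazard.
- Guards: typed rejection at the pool.get_engine entry (before the admission loop), so a
  misrouted chat request cannot evict resident LLMs; server-side pre-pool 400 +
  default-model/fallback hygiene.
- VideoJobManager: FIFO queue + restart replay + lease admission + isolated-venv
  subprocess + JSONL progress + three watchdog kill conditions + artifact retention
  policy.
- Worker: pins its own Metal wired limit on entry (wired-sum prevention), low-RAM mode by
  default (P0 measured the natural-mode peak at 49.3GB), failure manifests.
- Memory lease: ceiling subtraction (clamped >=1) + dynamic-ceiling add-back (worker
  counted exactly once) + parent-process wired-limit reset.
- API: five /v1/videos endpoints, OpenAI video-job shape, POST compatible with the
  official SDK's multipart form; a per-request peak predictor guards the memory boundary.
- Settings: video section (4-site wiring) + admin read/write + huggingface.disable_xet
  (China Xet wall mitigation).

Tests: 108 new unit tests; full suite zero regression (the 3 api_key failures come from
an environment-variable leak and reproduce at the base commit as well). To do before
merge: the lease default and predictor coefficients await calibration from the m5max
low-RAM P0 data (the out-of-the-box prediction currently exceeds the default lease).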
List the upstream-shared files patched by feat/video-engine so future
cherry-pick conflicts are easy to attribute and resolve.

---

Records the patch surface of feat/video-engine on upstream-shared files, so that
future cherry-pick conflicts are easy to attribute and resolve.
Frontend: Video Generation section in admin settings (enable toggle,
memory lease, defaults, queue/timeout/retention, worker python path)
wired through dashboard.js load/save and i18n (en/zh/ru).

Calibration from the m5max P0 low-RAM matrix: peak memory is invariant
to step count AND frame count (measured byte-identical), scaling only
with per-frame spatial tokens. Predictor reparameterized to
BASE 17.5GB + 0.0029 GB/token with a 6GB transient margin; default
memory_lease_gb 28 -> 36 so out-of-the-box settings admit the caps
corner (fixes the v1 always-413 defect). Spec section 6.1 records the
full measurement table.
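
The reparameterized predictor can be written out as a short sketch. The three constants come from the paragraph above. Everything else is an assumption: the function names, treating the spatial-token count as a caller-supplied input (how the engine derives it from resolution is not stated here), and comparing the margin-inclusive prediction against the lease.

```python
BASE_GB = 17.5              # fixed base from the calibration above
GB_PER_TOKEN = 0.0029       # linear cost per per-frame spatial token
TRANSIENT_MARGIN_GB = 6.0   # transient margin from the calibration above


def predict_peak_gb(spatial_tokens):
    """Predicted peak memory; independent of step count and frame count."""
    return BASE_GB + GB_PER_TOKEN * spatial_tokens + TRANSIENT_MARGIN_GB


def admits(spatial_tokens, lease_gb=36.0):
    """Assumed admission rule: the prediction must fit inside the lease."""
    return predict_peak_gb(spatial_tokens) <= lease_gb


for tokens in (1000, 2000, 5000):
    print(tokens, round(predict_peak_gb(tokens), 1),
          admits(tokens, lease_gb=28.0), admits(tokens, lease_gb=36.0))
```

For a hypothetical 2000-token request the prediction is about 29.3GB, which the old 28GB default would refuse and the new 36GB default admits. This illustrates why raising the default changes out-of-the-box behaviour.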

---

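Frontend: new Video Generation section on the admin settings page (enable toggle,
memory lease, default parameters, queue/timeout/artifact retention, worker python path),
wired through dashboard.js load/save, in three languages (en/zh/ru).

Calibration: per the m5max P0 low-RAM matrix, peak memory is independent of both step
count and frame count (byte-identical) and grows linearly only with per-frame spatial
tokens. The predictor is reparameterized to BASE 17.5GB + 0.0029 GB/token + a 6GB
transient margin; the memory_lease_gb default goes 28 -> 36, so out-of-the-box settings
admit the caps corner (fixes the v1 always-413 defect). Spec section 6.1 records the full
measurement table.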
The fallback hygiene check blocked fallback whenever the default model's
entry was missing or a test double; only a KNOWN non-chat model_type
(video/audio/embedding) should block it.
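
The corrected rule can be sketched in a few lines. The set of blocking types comes from the sentence above. The function name and the attribute-based entry shape are hypothetical.

```python
from types import SimpleNamespace
from unittest.mock import Mock

KNOWN_NON_CHAT_TYPES = {"video", "audio", "embedding"}


def fallback_allowed(entry):
    """Block fallback only when the entry has a KNOWN non-chat model_type.

    A missing entry (None) or a test double whose model_type is not a plain
    string must not block the fallback.
    """
    model_type = getattr(entry, "model_type", None)
    return not (isinstance(model_type, str) and model_type in KNOWN_NON_CHAT_TYPES)


print(fallback_allowed(None))                                 # True: entry missing
print(fallback_allowed(Mock()))                               # True: test double
print(fallback_allowed(SimpleNamespace(model_type="llm")))    # True: chat-capable
print(fallback_allowed(SimpleNamespace(model_type="video")))  # False: known non-chat
```

The `isinstance` check is what keeps a mock from blocking: its `model_type` attribute exists but is another mock, not one of the known strings.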

---

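The fallback hygiene check wrongly blocked fallback when the entry could not be found or
was a test double; only a known non-chat type should block fallback.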
Generated videos live under {base_path}/video-artifacts, never in-repo;
ignore patterns guard against measurement or test runs writing mp4s
into the worktree.

---

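Generated videos are stored only under {base_path}/video-artifacts and never enter the
repo; the ignore rules keep measurement or test runs from writing mp4s into the worktree.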
@panwudi panwudi merged commit c617367 into main Jun 10, 2026
panwudi added a commit that referenced this pull request Jun 10, 2026
#54)

Followup of #53: video models no longer appear in the chat model picker.

---

Follow-up to #53: video models no longer appear in the chat model dropdown.