fix(memory): make the prefill forward gate phys-based (validated on m5max) #52
…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill transient breaching the Metal cap. The model loads and runs normally; only an individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (the enforcer polls phys every 1s), and the in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA a chunk that overshoots the Metal wired limit kernel-panics the whole machine, so the post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left ~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB; peaks hit 110GB vs the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop + chunked-step mirror), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV+SDPA) + a conservative transient margin; if the sum would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup converts that RuntimeError into a finish_reason="error" output -- the request is refused cleanly and the machine does not crash. The legacy post-forward check stays as a backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10GB) > worst observed single-step jump (7.4GB).
- Preflight now also maxes against the enforcer's recent high-water mark.
- Reverts the model_load_prefill_headroom_gb load rejection (user feedback: a model must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear (the active trough) and leans on phys_footprint stickiness + recent_peak to avoid a trough misread; a misread could still admit a crashing chunk. This is not a literal never-panic guarantee -- it needs on-hardware [memgate] log validation; the real fix is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration asserting the model forward is NOT called over-cap), verified to fail with the gate neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max -- zero regression.
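The prediction this commit describes can be sketched roughly as below. This is an illustration only, not the oMLX code: the names `prefill_forward_gate` and `run_prefill_chunk`, the use of plain GB floats, and the example readings are assumptions. Only the formula (high-water current + estimate + margin compared against the hard cap, raising RuntimeError before the forward) is taken from the description above.

```python
from typing import Callable


def prefill_forward_gate(
    active_gb: float,
    phys_gb: float,
    recent_peak_gb: float,
    estimate_gb: float,
    margin_gb: float,
    hard_cap_gb: float,
) -> float:
    """Return the predicted peak, or raise if it would exceed the hard cap."""
    # High-water reading: take the worst of the available measurements so a
    # post-cache-clear trough in `active` does not under-report usage.
    current_gb = max(active_gb, phys_gb, recent_peak_gb)
    predicted_gb = current_gb + estimate_gb + margin_gb
    if predicted_gb > hard_cap_gb:
        raise RuntimeError(
            f"refusing prefill chunk BEFORE forward: predicted peak "
            f"{predicted_gb:.3f}GB exceeds hard cap {hard_cap_gb:.3f}GB"
        )
    return predicted_gb


def run_prefill_chunk(forward: Callable[[], None], **readings: float) -> str:
    """Hypothetical call site: gate first, forward only if the gate passes."""
    try:
        prefill_forward_gate(**readings)
    except RuntimeError:
        # The request is refused; the forward is never started.
        return "error"
    forward()
    return "ok"
```

With illustrative readings of active=90, phys=99, recent_peak=98, estimate=0.5 and margin=10 against a 107.5 cap, the predicted peak is 109.5 and the forward callable is never invoked. The margin matters here: the same readings without it would sum to 99.5 and pass.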
…prod

On-hardware validation found that the forward gate (and every estimate-based memory guard) was INERT in production: scheduler.memory_monitor is never wired, so the gate's `if memory_monitor is None: return` / `if estimate == 0: return` no-op fired on every chunk. A single glm4.5-air-106b (85GB) prefill drove memory to 107.525GB (OVER_HARD, past the 107.5 cap) with the gate silent; only the legacy reactive post-forward check caught it -- after the transient had already landed (no panic only because it fell inside the tolerance band). Unit tests passed because they inject a mock monitor.

Make the gate PHYS-based, so it no longer depends on the dead monitor:

- current = max(active, phys, recent_peak) -- all LIVE production readings (the same ones [memcheck:external] and the enforcer use).
- The model-dim estimate is now OPTIONAL: used if a monitor is present, else 0. At chunk granularity it is tiny anyway; the margin is the real guarantee.
- The transient margin is propagated from the ENFORCER (live), not the monitor (dead), so the gate's safety actually fires. Bumped 10 -> 12GB: the un-modelled MoE expert-dequant transient is sub-poll (invisible to every memory read) and the m5max crash showed ~10.6GB; ceil(10.6) = 11, padded to 12, and the box only panics nearer ~110, so 12 has real cushion.
- The gate now no-ops ONLY when the guard is off, the hard limit is unset, or chunk<=0 -- never because the monitor/estimate is absent.

Add _log_prefill_gate_state_once: on the first prefill, log the RESOLVED state (phys-based, margin, cap, estimator active/disabled) at info, or at WARNING if the margin propagated as 0 (gate degraded to the bare cap check). The prior failure was not the wiring gap itself but that the gap was SILENT; this makes that class of inert guard impossible to ship blind.

Document _preflight_memory_check as inert in prod (monitor-dependent; a phys-only version would never reject at idle).

Functional residual: a model that fills most of the cap (85GB on 128GB) gets long prompts refused cleanly (503-class) once accumulated KV nears the headroom -- correct behaviour (refuse the request, do not crash the box); fit longer contexts with a smaller quant. The refusal reuses the existing call-site RuntimeError handler, which calls _sync_and_clear_cache() to release the KV.

Not yet validated on hardware -- this commit is what makes that validation meaningful (the prior gate could not fire). The m5max round must confirm: the startup log shows ACTIVE+margin before any request; a long glm4.5 prompt is refused cleanly AND memory returns to the ~85GB base; normal requests still succeed.

Tests: 2 monitor-dependent no-op tests replaced with phys-based firing tests (fires without a monitor, passes when it fits, margin-carries-it with estimate 0) + 3 startup-log tests (logs once, warns at margin 0, reports estimator disabled). Full suite 4546 pass / 3 known api_key fail / 19 skip -- zero regression.
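The no-op conditions and the log-once behaviour from this commit might look something like the sketch below. The class `PrefillGateState`, its fields, and the wording of the WARNING message are hypothetical. The info message follows the `[memgate]` line quoted in the PR description, but the real formatting code is not shown in the source.

```python
import logging
from typing import Optional

logger = logging.getLogger("memgate")


class PrefillGateState:
    """Hypothetical holder for the resolved gate configuration."""

    def __init__(
        self,
        guard_enabled: bool,
        hard_cap_gb: Optional[float],
        margin_gb: float,
        estimator_wired: bool,
    ) -> None:
        self.guard_enabled = guard_enabled
        self.hard_cap_gb = hard_cap_gb
        self.margin_gb = margin_gb
        self.estimator_wired = estimator_wired
        self._logged = False

    def is_noop(self, chunk_tokens: int) -> bool:
        # Only these three conditions disable the gate; a missing monitor or
        # a zero estimate is deliberately NOT one of them.
        return (
            not self.guard_enabled
            or self.hard_cap_gb is None
            or chunk_tokens <= 0
        )

    def log_state_once(self) -> None:
        if self._logged:
            return
        self._logged = True
        cap = "unset" if self.hard_cap_gb is None else f"{self.hard_cap_gb:.1f}GB"
        estimator = "ACTIVE" if self.estimator_wired else "DISABLED (phys+margin only)"
        if self.margin_gb <= 0:
            # Assumed wording: a zero margin means only the bare cap is checked.
            logger.warning(
                "[memgate] prefill forward gate DEGRADED: margin=0 "
                "(bare cap check), cap=%s, model-dim estimator=%s",
                cap, estimator,
            )
        else:
            logger.info(
                "[memgate] prefill forward gate ACTIVE (phys-based): "
                "margin=%.1fGB, cap=%s, model-dim estimator=%s",
                self.margin_gb, cap, estimator,
            )
```

Calling `log_state_once()` on every prefill is cheap because everything after the first call returns immediately. Resolving the state at the point of use, rather than at configuration time, is what surfaces a margin that was lost somewhere along the settings -> enforcer -> scheduler path.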
Honest caveats / scoped follow-ups (not gaps in the panic protection)
To state the validation accurately:
None of these affect the headline result: the single-request / accumulation OVER_HARD that breached on the old code is now refused before the forward, and the box stays safe.
Wan2.2 T2V via mlx-gen subprocess worker, /v1/videos async job API, memory-lease co-residency. Validated end to end on m5max. Supersedes #52 (included by ancestry). Self-merge under explicit owner authorization (2026-06-11).
Problem (found by on-hardware validation)
The forward gate -- and every estimate-based memory guard -- was INERT in production:
`scheduler.memory_monitor` is never wired, so the gate's `if memory_monitor is None: return` / `if estimate == 0: return` no-op'd on every chunk. A single (or KV-accumulated) glm4.5-air-106b prefill drove memory to 107.525GB OVER_HARD (past the 107.5 cap) with the gate SILENT; only the legacy reactive post-forward check caught it, after the transient had already landed (no panic only by tolerance-band luck). Unit tests passed because they inject a mock monitor.
Fix
Make the gate phys-based so it no longer depends on the dead monitor:
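- `current = max(active, phys, recent_peak)` -- all LIVE production readings.
- The model-dim estimate is optional: used if a monitor is present, else 0.
- The transient margin is propagated from the enforcer (live) and bumped 10 -> 12GB.
- The gate no-ops only when the guard is off, the hard limit is unset, or chunk<=0.
- `_log_prefill_gate_state_once`: logs the RESOLVED state (margin/cap/estimator) so a mis-propagated margin can't ship silently inert again (WARNING if margin=0).
- `_preflight_memory_check` documented as inert-in-prod (monitor-dependent).
On-hardware validation (m5max, glm4.5-air-moe-106b-a12b-6bit, 2026-06-06)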
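A/B against the same n=4 unique-22K-token load that breached on the old code:
- Old code: max current 107.525GB, OVER_HARD.
- This PR: max current 93.7GB, 0 OVER_HARD, 3 clean refusals + 1 success, memory back to the ~85GB base, no panic.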
All three required assertions passed:
- Startup log, shown before the stress test: `[memgate] prefill forward gate ACTIVE (phys-based): margin=12.0GB, cap=107.5GB, model-dim estimator=DISABLED (phys+margin only)` -- confirming that the margin propagated live (12.0) AND that the estimator is indeed unwired (the root cause), with the gate active anyway.
- Refusal before the forward: `refusing prefill chunk BEFORE forward: predicted peak 109.655GB = current 97.655GB + margin 12.000GB (estimate 0.000GB) exceeds hard cap 107.520GB`. current never crossed 107.5, 0 OVER_HARD, no panic, and memory returned to the ~85GB base after each refusal (no KV wedging).
- Normal short/medium requests still succeed.
Functional residual
A model that fills most of the cap (85GB on 128GB) gets long/accumulated prompts refused cleanly (503-class) once KV nears the headroom -- correct (refuse the request, do not crash the box). Fit longer contexts with a smaller quant (e.g. glm4.5 4bit ~55GB).
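To make the headroom concrete, the arithmetic behind this residual can be worked through in a few lines. This is a back-of-the-envelope sketch, not project code: the function names are hypothetical, the model-dim estimate is taken as 0 (as in the refusal log above), and the model's resident size is treated as the memory floor, which ignores everything else on the box.

```python
HARD_CAP_GB = 107.52  # hard cap as printed in the refusal log
MARGIN_GB = 12.0      # transient margin after the 10 -> 12GB bump


def trip_point_gb(hard_cap_gb: float, margin_gb: float, estimate_gb: float = 0.0) -> float:
    """Highest `current` reading the gate still admits."""
    return hard_cap_gb - margin_gb - estimate_gb


def kv_headroom_gb(model_base_gb: float) -> float:
    """Room left for accumulated KV/prefill above the resident model."""
    return trip_point_gb(HARD_CAP_GB, MARGIN_GB) - model_base_gb


def is_refused(current_gb: float, estimate_gb: float = 0.0) -> bool:
    """Same comparison the gate makes: predicted peak vs. hard cap."""
    return current_gb + estimate_gb + MARGIN_GB > HARD_CAP_GB


if __name__ == "__main__":
    print(f"gate trips above {trip_point_gb(HARD_CAP_GB, MARGIN_GB):.2f}GB")
    print(f"85GB model headroom: {kv_headroom_gb(85.0):.2f}GB")
    print(f"55GB model headroom: {kv_headroom_gb(55.0):.2f}GB")
    print(f"current=97.655GB refused: {is_refused(97.655)}")
```

Under these assumptions the gate trips once current exceeds about 95.5GB, leaving roughly 10.5GB above the 85GB model and roughly 40.5GB above a ~55GB 4-bit quant. That gap is why the smaller quant fits much longer contexts.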
Relationship to #51
#51 (memory-budget admission queue) stays BLOCKED: it was also monitor-inert, and glm4.5 runs the external prefill loop which already serializes prefills, so its concurrent-queue targets a path glm4.5 does not use. This PR is the live single-request/accumulation protection.
Tests
2 monitor-dependent no-op tests replaced with phys-based firing tests; 3 startup-log tests added. Full suite 4546 pass / 3 known api_key fail / 19 skip on m2max -- zero regression. fix/* branch -> human review + merge.
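Problem (found by on-hardware validation)
The forward gate and every estimate-based memory guard were INERT in production: scheduler.memory_monitor was never wired, so the gate no-op'd on every chunk. A single/accumulated glm4.5 prefill hit 107.525GB OVER_HARD while the gate stayed silent; only the legacy reactive check caught it, after the transient had landed (no panic purely thanks to the tolerance band). Unit tests were green only because they inject a mock.
Fix
The gate is now phys-based: current=max(active,phys,recent_peak), all live; the estimate is optional; the margin is propagated from the enforcer (live), 10->12GB; the gate no-ops only when the guard is off / the limit is unset / chunk<=0; _log_prefill_gate_state_once is added to log the resolved state explicitly (WARNING if margin=0).
On-hardware validation (m5max glm4.5, 2026-06-06)
A/B on the same n=4 unique-22K load: old code max current 107.525GB OVER_HARD; this PR max current 93.7GB, 0 OVER_HARD, 3 clean refusals + 1 success, memory back to the ~85GB base, no panic. All three assertions passed: (a) the log before the stress test shows ACTIVE margin=12.0 estimator=DISABLED; (b) the gate refuses before the forward at 109.655>107.520, current never crosses the cap, and KV is released with no wedging; (c) normal short/medium requests still succeed.
Residual + relationship
A model that fills most of the cap gets long prompts refused cleanly with a 503 (use a smaller quant). #51 stays BLOCKED (monitor-inert, and glm4.5 runs the external loop, which is already serialized). Full suite 4546 pass / 3 known api_key fail / 19 skip, zero regression. fix/* -> human review + human merge.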