fix(memory): make the prefill forward gate phys-based (validated on m5max) #52

Merged: panwudi merged 2 commits into main from fix/prefill-gate-phys-based on Jun 10, 2026
@panwudi

@panwudi panwudi commented Jun 6, 2026

Owner

Problem (found by on-hardware validation)

The forward gate -- and every estimate-based memory guard -- was INERT in production: scheduler.memory_monitor is never wired, so the gate's if memory_monitor is None: return / if estimate == 0: return no-op'd on every chunk. A single (or KV-accumulated) glm4.5-air-106b prefill drove memory to 107.525GB OVER_HARD (past the 107.5 cap) with the gate SILENT; only the legacy reactive post-forward check caught it, after the transient had already landed (no panic only by tolerance-band luck). Unit tests passed because they inject a mock monitor.

Fix

Make the gate phys-based so it no longer depends on the dead monitor:

  • current = max(active, phys, recent_peak) -- all LIVE production readings.
  • model-dim estimate is now OPTIONAL (used if a monitor exists, else 0). At chunk granularity it is tiny; the margin is the guarantee, and it is propagated from the enforcer (live), not the monitor (dead).
  • margin 10 -> 12GB (sub-poll MoE transient measured ~10.6GB; 12 has cushion below the ~110 real-panic point).
  • gate no-ops ONLY when guard off / hard limit unset / chunk<=0 -- never because the monitor/estimate is absent.
  • _log_prefill_gate_state_once: logs the RESOLVED state (margin/cap/estimator) so a mis-propagated margin can't ship silently inert again (WARNING if margin=0).
  • _preflight_memory_check documented as inert-in-prod (monitor-dependent).
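The bullets above can be sketched as a small function. This is a minimal illustration, not the PR's code: `GateConfig`, its field names, and the function signature are hypothetical, and memory is expressed in GB floats for readability. Only the formula (`current = max(active, phys, recent_peak)`, then `current + margin + estimate` against the hard cap), the three no-op conditions, and the message wording come from the description and the validation log.

```python
from dataclasses import dataclass


@dataclass
class GateConfig:
    # Hypothetical container; field names are illustrative only.
    guard_enabled: bool
    hard_limit_gb: float  # 0 means "unset"
    margin_gb: float      # transient margin, propagated from the enforcer


def prefill_forward_gate(cfg, chunk_tokens, active_gb, phys_gb, recent_peak_gb,
                         estimate_gb=0.0):
    """Raise BEFORE the forward if the predicted peak exceeds the hard cap.

    Returns None when the gate is a no-op, else the predicted peak in GB.
    """
    # The only no-op conditions: guard off, hard limit unset, or empty chunk.
    # A missing monitor/estimate is NOT a reason to skip the check.
    if not cfg.guard_enabled or cfg.hard_limit_gb <= 0 or chunk_tokens <= 0:
        return None

    # High-water mark over the live readings.
    current = max(active_gb, phys_gb, recent_peak_gb)
    predicted = current + cfg.margin_gb + estimate_gb

    if predicted > cfg.hard_limit_gb:
        raise RuntimeError(
            f"refusing prefill chunk BEFORE forward: predicted peak "
            f"{predicted:.3f}GB = current {current:.3f}GB + margin "
            f"{cfg.margin_gb:.3f}GB (estimate {estimate_gb:.3f}GB) "
            f"exceeds hard cap {cfg.hard_limit_gb:.3f}GB"
        )
    return predicted
```

With `estimate_gb` left at its default of 0, the margin alone decides whether a chunk is admitted, which is the behaviour the description calls "the margin is the guarantee".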

On-hardware validation (m5max, glm4.5-air-moe-106b-a12b-6bit, 2026-06-06)

A/B against the same n=4 unique-22K-token load that breached on the old code:

| | old code | this PR |
|---|---|---|
| max current | 107.525GB OVER_HARD | 93.7GB |
| OVER_HARD events | breached cap | 0 |
| protection | too-late post-forward check -> unhandled 500 | gate refuses BEFORE the forward |
| outcome | near-panic | 3 clean refusals + 1 success; memory returns to ~85GB base; no panic |

All three required assertions passed:

  • (a) Before any stress request, the startup log showed: [memgate] prefill forward gate ACTIVE (phys-based): margin=12.0GB, cap=107.5GB, model-dim estimator=DISABLED (phys+margin only) -- confirming the margin propagated live (12.0) AND that the estimator is indeed unwired (the root cause), with the gate active anyway.
  • (b) Under the accumulating load the gate fired: refusing prefill chunk BEFORE forward: predicted peak 109.655GB = current 97.655GB + margin 12.000GB (estimate 0.000GB) exceeds hard cap 107.520GB. current never crossed 107.5, 0 OVER_HARD, no panic, and free memory returned to ~85GB base after each refusal (no KV wedging).
  • (c) Normal short (148 tok) / moderate (3338 tok) / 2-concurrent-moderate requests all still succeeded -- the gate is not over-eager.

Functional residual

A model that fills most of the cap (85GB on 128GB) gets long/accumulated prompts refused cleanly (503-class) once KV nears the headroom -- correct (refuse the request, do not crash the box). Fit longer contexts with a smaller quant (e.g. glm4.5 4bit ~55GB).
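The headroom arithmetic behind this residual can be worked through in a few lines. The sketch assumes the estimator is disabled (estimate 0), so the gate trips once `current` exceeds `cap - margin`, and that the idle reading equals the model's base footprint. The helper name is hypothetical; the figures (107.52GB cap, 12GB margin, ~85GB and ~55GB bases) are the ones quoted in this PR.

```python
def kv_headroom_gb(cap_gb, margin_gb, base_gb):
    """Memory left for KV + prefill before the gate starts refusing.

    Assumes estimate == 0, so the trip point is simply cap - margin.
    """
    trip_point = cap_gb - margin_gb
    return max(0.0, trip_point - base_gb)


for label, base in (("6bit, ~85GB base", 85.0), ("4bit, ~55GB base", 55.0)):
    headroom = kv_headroom_gb(107.52, 12.0, base)
    print(f"{label}: ~{headroom:.2f}GB of headroom before refusals begin")
```

Under these assumptions the 6bit model has roughly 10.5GB of headroom and the 4bit quant roughly 40.5GB, which is why a smaller quant fits longer contexts.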

Relationship to #51

#51 (memory-budget admission queue) stays BLOCKED: it was also monitor-inert, and glm4.5 runs the external prefill loop which already serializes prefills, so its concurrent-queue targets a path glm4.5 does not use. This PR is the live single-request/accumulation protection.

Tests

2 monitor-dependent no-op tests replaced with phys-based firing tests; 3 startup-log tests added. Full suite 4546 pass / 3 known api_key fail / 19 skip on m2max -- zero regression. fix/* branch -> human review + merge.


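Problem (found by on-hardware validation)

The forward gate and every estimate-based memory guard were INERT in production: scheduler.memory_monitor was never wired, so the gate no-op'd on every chunk. A single or accumulated glm4.5 prefill reached 107.525GB OVER_HARD while the gate stayed silent; only the legacy reactive check caught it, after the transient had already landed (the box survived purely thanks to the tolerance band). Unit tests were green only because they inject a mock.

Fix

The gate is now phys-based: current=max(active,phys,recent_peak), all live readings; the estimate is optional; the margin is propagated from the enforcer (live) and raised 10->12GB; the gate no-ops only when the guard is off, the limit is unset, or chunk<=0; _log_prefill_gate_state_once explicitly logs the resolved state (WARNING if margin=0).

On-hardware validation (m5max glm4.5, 2026-06-06)

A/B on the same n=4 unique-22K load: old code reached max current 107.525GB OVER_HARD; this PR reached max current 93.7GB, 0 OVER_HARD, 3 clean refusals + 1 success, memory back to the ~85GB base, no panic. All three assertions passed: (a) before the stress run the log showed ACTIVE margin=12.0 estimator=DISABLED; (b) the gate refused at predicted 109.655 > 107.520 before the forward, current never crossed the cap, and KV was released without wedging; (c) normal short/moderate requests still succeeded.

Residual + relationship

A model that fills most of the cap gets long prompts refused cleanly with a 503 (switch to a smaller quant). #51 stays BLOCKED (monitor-inert, and glm4.5 uses the external loop, which is already serialized). Full suite 4546 pass / 3 known api_key fail / 19 skip, zero regression. fix/* -> human review and human merge.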

yuanwei added 2 commits June 6, 2026 00:57
…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill
transient breaching the Metal cap. The model loads and runs normally; only an
individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (enforcer polls phys every 1s) and the
in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA a chunk that
overshoots the Metal wired limit kernel-panics the whole machine, so the
post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left
~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB,
peaks hit 110GB vs the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop +
  chunked-step mirror), predict current(high-water: max active/phys/recent_peak)
  + estimate_prefill_peak_bytes(KV+SDPA) + a conservative transient margin; if it
  would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup
  converts that RuntimeError into a finish_reason="error" output -- request
  refused cleanly, machine not crashed. Legacy post-forward check stays as backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated
  settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the
  MoE expert-dequant spike, so this margin carries the guarantee: margin (10) >
  worst observed single-step jump (7.4GB).
- Preflight now also maxes against the enforcer recent high-water mark.
- Reverts model_load_prefill_headroom_gb load rejection (user feedback: a model
  must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear
(active trough), leaning on phys_footprint stickiness + recent_peak to avoid a
trough misread; a misread could still admit a crashing chunk. Not a literal
never-panic guarantee -- needs on-hardware [memgate] log validation; the real fix
is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration
asserting model forward NOT called over-cap), verified to fail with gate
neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max -- zero
regression.

---

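Stop the whole-machine watchdog panic on m5max caused by a large model's prefill
transient punching through the Metal cap. The model loads and runs normally; only
the individual request that does not fit is refused (503-class), not the model.

Root cause: oMLX manages memory reactively (the enforcer polls phys every 1s) and
the prefill memory check runs after self.model()+mx.eval. On Apple UMA a chunk
that punches through the Metal wired limit kernel-panics the whole machine, so the
post-forward Python check never gets to run. glm4.5-air-106b (85GB) on a 128GB
machine leaves only ~22GB for KV+prefill; the per-chunk MoE dequant transient
(measured up to 7.4GB, peaks reaching 110GB vs the 107.5 cap) broke through and
rebooted the machine twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop +
  chunked-step mirror), predict current (high-water: max active/phys/recent_peak)
  + estimate_prefill_peak_bytes(KV+SDPA) + a conservative transient margin; if
  that exceeds the hard cap, raise before the forward. The existing jundot#1405
  cleanup converts the RuntimeError into a finish_reason=error output -- request
  refused cleanly, machine not crashed. The old post-forward check stays as backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated
  settings -> enforcer -> scheduler. The estimate does not include the MoE dequant
  transient, so the margin carries the guarantee: margin (10) > worst measured
  single-step jump (7.4GB).
- Preflight now also takes the max against the enforcer's recent high-water mark.
- Removes the model_load_prefill_headroom_gb load rejection (user feedback: a model
  must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the previous chunk's cache
clear (active trough) and relies on phys_footprint stickiness + recent_peak to
avoid reading the trough; a trough read could still admit a chunk that crashes.
Not a literal "never panics" guarantee -- it needs on-hardware [memgate] log
validation; the real cure is preemptive KV offload (separate work).

Tests: 12 new, verified to fail when the gate is neutered. Full suite on m2max
4542 pass / 3 known api_key fail / 19 skip -- zero regression.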
…prod

On-hardware validation found the forward gate (and every estimate-based memory
guard) was INERT in production: scheduler.memory_monitor is never wired, so the
gate's `if memory_monitor is None: return` / `if estimate == 0: return` no-op
fired on every chunk. A single glm4.5-air-106b (85GB) prefill drove memory to
107.525GB (OVER_HARD, past the 107.5 cap) with the gate silent; only the legacy
reactive post-forward check caught it -- after the transient had already landed
(no panic only because it fell inside the tolerance band). Unit tests passed
because they inject a mock monitor.
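The failure mode described above, a guard that fires under test but silently returns in production, can be reproduced in miniature. This is a hedged reconstruction, not the removed code: only the two early-return conditions are quoted from the commit message, while the function name, the `estimate_prefill_peak_gb` method, and the GB-float units are hypothetical.

```python
from unittest.mock import Mock


def old_style_gate(memory_monitor, current_gb, margin_gb, cap_gb):
    """Monitor-dependent gate shape; returns True if it would refuse."""
    if memory_monitor is None:
        return False  # production: the monitor is never wired, so we exit here
    estimate_gb = memory_monitor.estimate_prefill_peak_gb()  # hypothetical method
    if estimate_gb == 0:
        return False
    return current_gb + estimate_gb + margin_gb > cap_gb


# Unit-test style: a mock monitor is injected, so the gate can fire.
mock_monitor = Mock()
mock_monitor.estimate_prefill_peak_gb.return_value = 0.5
fired_in_test = old_style_gate(mock_monitor, 97.655, 10.0, 107.52)

# Production style: no monitor, so the same over-cap reading is admitted.
fired_in_prod = old_style_gate(None, 97.655, 10.0, 107.52)

print(f"fired under test: {fired_in_test}, fired in production: {fired_in_prod}")
```

A test that builds the gate the way production does, with no monitor at all, is what exposes this class of bug.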

Make the gate PHYS-based, so it no longer depends on the dead monitor:
- current = max(active, phys, recent_peak) -- all LIVE production readings (the
  same ones [memcheck:external] and the enforcer use).
- The model-dim estimate is now OPTIONAL: used if a monitor is present, else 0.
  At chunk granularity it is tiny anyway; the margin is the real guarantee.
- The transient margin is propagated from the ENFORCER (live), not the monitor
  (dead), so the gate's safety actually fires. Bumped 10 -> 12GB: the un-modelled
  MoE expert-dequant transient is sub-poll (invisible to every memory read) and
  the m5max crash showed ~10.6GB; 12 = ceil(10.6) padded, and the box only
  panics nearer ~110 so 12 has real cushion.
- The gate now no-ops ONLY when the guard is off, the hard limit is unset, or
  chunk<=0 -- never because the monitor/estimate is absent.

Add _log_prefill_gate_state_once: on the first prefill, log the RESOLVED state
(phys-based, margin, cap, estimator active/disabled) at info, or WARNING if the
margin propagated as 0 (gate degraded to the bare cap check). The prior failure
was not the wiring gap but that it was SILENT; this makes that class of inert
guard impossible to ship blind. Document _preflight_memory_check as inert in
prod (monitor-dependent; a phys-only version would never reject at idle).
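A log-once helper of the kind described here might look like the following sketch. The class name, method signature, and the wording of the WARNING message are assumptions for illustration; the INFO message follows the startup line quoted in the PR description.

```python
import logging

logger = logging.getLogger("memgate")


class GateStateLogger:
    """Logs the RESOLVED gate configuration exactly once (hypothetical helper)."""

    def __init__(self):
        self._logged = False

    def log_once(self, margin_gb, cap_gb, estimator_active):
        """Return True if this call emitted the log line, False if already logged."""
        if self._logged:
            return False
        self._logged = True
        estimator = "ACTIVE" if estimator_active else "DISABLED (phys+margin only)"
        if margin_gb <= 0:
            # Margin propagated as 0: the gate has degraded to a bare cap check.
            # The wording of this message is illustrative, not from the PR.
            logger.warning(
                "[memgate] prefill forward gate DEGRADED: margin=0, cap=%.1fGB, "
                "model-dim estimator=%s", cap_gb, estimator)
        else:
            logger.info(
                "[memgate] prefill forward gate ACTIVE (phys-based): "
                "margin=%.1fGB, cap=%.1fGB, model-dim estimator=%s",
                margin_gb, cap_gb, estimator)
        return True
```

Logging the resolved values, rather than the configured ones, is the point: a margin that was set in settings but lost during propagation shows up as a WARNING on the first prefill.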

Functional residual: a model that fills most of the cap (85GB on 128GB) gets
long prompts refused cleanly (503-class) once accumulated KV nears the headroom
-- correct behaviour (refuse the request, not crash the box); fit longer
contexts with a smaller quant. The refusal reuses the existing call-site
RuntimeError handler that _sync_and_clear_cache()s the KV.

Not yet on-hardware validated -- this commit is what makes that validation
meaningful (the prior gate could not fire). m5max round must confirm: startup
log shows ACTIVE+margin before any request; a long glm4.5 prompt is refused
cleanly AND memory returns to the ~85GB base; normal requests still succeed.

Tests: 2 monitor-dependent no-op tests replaced with phys-based firing tests
(fires without a monitor, passes when it fits, margin-carries-it with estimate
0) + 3 startup-log tests (logs once, warns at margin 0, reports estimator
disabled). Full suite 4546 pass / 3 known api_key fail / 19 skip -- zero
regression.

---

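On-hardware validation found the forward gate (and every estimate-based memory
guard) was INERT in production: scheduler.memory_monitor was never wired, so the
gate's `if memory_monitor is None: return` / `if estimate == 0: return` no-op'd
on every chunk. A single glm4.5-air-106b (85GB) prefill pushed memory to
107.525GB (OVER_HARD, past the 107.5 cap) while the gate stayed silent throughout;
only the legacy reactive post-forward check caught it, after the transient had
already landed (no crash purely because it fell inside the tolerance band). Unit
tests were green only because they inject a mock monitor.

Make the gate PHYS-based so it no longer depends on the dead monitor:
- current = max(active, phys, recent_peak) -- all LIVE production readings (same
  source as memcheck:external and the enforcer).
- The model-dim estimate is now OPTIONAL: used only if a monitor exists, else 0.
  At chunk granularity it is tiny anyway; the margin is the real guarantee.
- The transient margin is propagated from the ENFORCER (live), not the monitor
  (dead), so the gate's safety can actually fire. 10 -> 12GB: the un-modelled MoE
  dequant transient is sub-poll (invisible to every reading) and the m5max crash
  measured ~10.6GB; 12 = ceil(10.6) plus padding, and the machine only panics
  near ~110, so 12 leaves real cushion.
- The gate now no-ops only when the guard is off, the hard limit is unset, or
  chunk<=0 -- never because the monitor/estimate is absent.

Add _log_prefill_gate_state_once: on the first prefill, log the resolved state at
info (phys-based, margin, cap, estimator active/disabled), or WARNING if the
margin propagated as 0 (gate degraded to a bare cap check). The earlier failure
was not the wiring gap itself but that the gap was silent; this makes that class
of inert guard impossible to ship blind again. Mark _preflight_memory_check as
inert in production (monitor-dependent; a phys-only version would never reject
at idle).

Functional residual: a model that fills most of the cap (85GB on 128GB) gets long
prompts refused cleanly (503-class) once accumulated KV approaches the headroom
-- correct behaviour (refuse the request, do not crash the box); use a smaller
quant for long contexts. The refusal reuses the existing call-site RuntimeError
handler, which calls _sync_and_clear_cache() to release the KV.

Not yet validated on hardware -- this commit is the precondition that makes the
validation meaningful (the previous gate could not fire at all). The m5max round
must confirm: the startup log shows ACTIVE+margin before any request; a long
glm4.5 prompt is refused cleanly and memory falls back to the ~85GB base; normal
requests still succeed.

Tests: 2 monitor-dependent no-op tests replaced with phys-based firing tests
(fires without a monitor, passes when it fits, margin carries it with estimate 0)
+ 3 startup-log tests (logs only once, WARNING at margin 0, reports estimator
disabled without a monitor). Full suite 4546 pass / 3 known api_key fail /
19 skip -- zero regression.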
@panwudi

panwudi commented Jun 6, 2026

Owner Author

Honest caveats / scoped follow-ups (not gaps in the panic protection)

To state the validation accurately:

  1. Refusal is box-safe but the CLIENT error is not yet clean. The gate keeps the box safe and KV IS reclaimed (the load-bearing wins, confirmed: 0 OVER_HARD, memory returns to ~85GB base). But the client receives a dropped connection (IncompleteRead), not a 503 carrying the "reduce context length" message -- the refusal RuntimeError still flows through the pre-existing finish_reason=error -> generic 500 path (server.py:594). So a caller cannot distinguish "refused, shorten your prompt" from "server crashed." Translating the gate refusal into a proper 503 with the helpful body is a follow-up (does not need another hardware round -- the IncompleteRead is the evidence).

  2. Only the external prefill path was metal-validated. That is the path glm4.5-air uses (external loop, prefills serialized), so it is the right one to prove. The gate's other call site (_advance_chunked_prefills) advances N concurrent chunked prefills one chunk per step; each checks current+margin against the same pre-step trough, then all forward -- siblings' simultaneous transients are not modeled, only the margin cushions them. Moot for glm4.5; a residual for chunked-path models under high concurrency.

  3. KV-eviction estimate path is also monitor-inert (watch item). Reclamation works post-completion (validated). But under sustained unique-prompt load, if the prefix cache pins current near the trip point (~cap-margin) faster than eviction frees it, the gate could begin refusing even moderate requests. The burst test cannot observe this steady state -- flag for production observation, not a blocker.

None of these affect the headline result: the single-request / accumulation OVER_HARD that breached on old code is now refused before the forward, box stays safe.
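For the first caveat, one possible shape of the follow-up is a dedicated exception type that the HTTP layer maps to a 503. This is only a sketch under stated assumptions: `PrefillRefusedError`, `to_http_response`, and the response body layout are hypothetical, and the code deliberately avoids any web framework rather than guessing at what server.py uses.

```python
import json


class PrefillRefusedError(RuntimeError):
    """Hypothetical subclass marking a gate refusal.

    Subclassing RuntimeError means an existing `except RuntimeError` cleanup
    handler would still run for it.
    """


def to_http_response(exc):
    """Map an exception to (status, headers, body), framework-agnostic."""
    headers = {"Content-Type": "application/json"}
    if isinstance(exc, PrefillRefusedError):
        body = {
            "error": {
                "type": "prefill_refused",
                "message": f"{exc}; reduce context length and retry",
            }
        }
        return 503, headers, json.dumps(body)
    # Anything else stays a generic 500 without leaking internals.
    body = {"error": {"type": "internal_error", "message": "internal server error"}}
    return 500, headers, json.dumps(body)


status, _, payload = to_http_response(
    PrefillRefusedError("predicted peak 109.655GB exceeds hard cap 107.520GB"))
print(status, payload)
```

A distinct type lets the caller tell "refused, shorten your prompt" apart from "server crashed" without changing how the refusal is raised.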


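Honest addenda (scoped follow-ups, not gaps in the panic protection):

  1. The refusal is box-safe, but the client error is not yet clean: box safety and KV release are real (0 OVER_HARD, memory back to the ~85GB base), but the client gets a dropped connection (IncompleteRead), not a 503 carrying a "shorten the context" message -- the refusal RuntimeError still goes through the pre-existing finish_reason=error -> generic 500 path (server.py:594). Translating the gate refusal into a clean 503 is a follow-up (no further hardware round needed; the IncompleteRead is the evidence).
  2. Only the external path was validated on hardware (glm4.5 uses it, so it is the right one). On the chunked path (_advance_chunked_prefills) the simultaneous transients of concurrent siblings are not modeled and are covered only by the margin. No impact on glm4.5; a residual for chunked-path models under high concurrency.
  3. The KV-eviction estimate path is also monitor-inert (watch item): reclamation after completion works (validated), but under sustained unique-prompt load, if the prefix cache pins current near the trip point faster than eviction frees it, the gate could start refusing moderate requests. The burst test cannot see this steady state; flagged for production observation, not a blocker.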

panwudi added a commit that referenced this pull request Jun 10, 2026
Wan2.2 T2V via mlx-gen subprocess worker, /v1/videos job API, memory-lease
co-residency. Validated end to end on m5max. Supersedes #52 (included by
ancestry). Self-merge under explicit owner authorization (2026-06-11).

---

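Wan2.2 text-to-video (mlx-gen subprocess worker), async /v1/videos job API,
memory-lease co-residency. Validated end to end on hardware (m5max). Supersedes
#52 by ancestry. Self-merged under explicit owner authorization (2026-06-11).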
@panwudi panwudi merged commit 2080310 into main Jun 10, 2026
@panwudi

panwudi commented Jun 10, 2026

Owner Author

Superseded by #53 (feat/video-engine), which contains both gate commits by ancestry and is now merged to main. The gate's m5max A/B validation evidence remains recorded in this PR. Closing.
