fix(memory): make the prefill forward gate phys-based (validated on m5max) by panwudi · Pull Request #52 · panwudi/flyto-mlx

panwudi · 2026-06-06T08:27:33Z

Problem (found by on-hardware validation)

The forward gate -- and every estimate-based memory guard -- was INERT in production: scheduler.memory_monitor is never wired, so the gate's if memory_monitor is None: return / if estimate == 0: return no-op'd on every chunk. A single (or KV-accumulated) glm4.5-air-106b prefill drove memory to 107.525GB OVER_HARD (past the 107.5 cap) with the gate SILENT; only the legacy reactive post-forward check caught it, after the transient had already landed (no panic only by tolerance-band luck). Unit tests passed because they inject a mock monitor.

Fix

Make the gate phys-based so it no longer depends on the dead monitor:

current = max(active, phys, recent_peak) -- all LIVE production readings.
model-dim estimate is now OPTIONAL (used if a monitor exists, else 0). At chunk granularity it is tiny; the margin is the guarantee, and it is propagated from the enforcer (live), not the monitor (dead).
margin 10 -> 12GB (sub-poll MoE transient measured ~10.6GB; 12 has cushion below the ~110 real-panic point).
gate no-ops ONLY when guard off / hard limit unset / chunk<=0 -- never because the monitor/estimate is absent.
_log_prefill_gate_state_once: logs the RESOLVED state (margin/cap/estimator) so a mis-propagated margin can't ship silently inert again (WARNING if margin=0).
_preflight_memory_check documented as inert-in-prod (monitor-dependent).
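
The gate logic listed above can be sketched as follows. This is a minimal illustration, not the code in this PR: the names `GateConfig` and `prefill_forward_gate`, the argument names, and the use of GB floats are assumptions; only the formula (current = max(active, phys, recent_peak), plus margin, plus optional estimate, compared to the hard cap) and the three no-op conditions come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateConfig:
    # Hypothetical container; the real settings objects are not shown in this PR text.
    guard_enabled: bool = True
    hard_limit_gb: Optional[float] = 107.52  # hard cap; None means "unset"
    margin_gb: float = 12.0                  # transient margin propagated from the enforcer


def prefill_forward_gate(cfg, chunk_tokens, active_gb, phys_gb, recent_peak_gb,
                         estimate_gb=0.0):
    """Return the predicted peak (GB), None if the gate no-ops, or raise if over cap."""
    # The only no-op conditions: guard off, hard limit unset, or an empty chunk.
    # A missing monitor or a zero estimate must NOT disable the gate.
    if not cfg.guard_enabled or cfg.hard_limit_gb is None or chunk_tokens <= 0:
        return None

    # High-water mark over the live readings.
    current = max(active_gb, phys_gb, recent_peak_gb)
    predicted = current + cfg.margin_gb + estimate_gb

    if predicted > cfg.hard_limit_gb:
        # Raised BEFORE the forward pass, so the transient never lands.
        raise RuntimeError(
            f"refusing prefill chunk BEFORE forward: predicted peak {predicted:.3f}GB"
            f" = current {current:.3f}GB + margin {cfg.margin_gb:.3f}GB"
            f" (estimate {estimate_gb:.3f}GB) exceeds hard cap {cfg.hard_limit_gb:.3f}GB"
        )
    return predicted
```

With the estimate at 0, the margin alone decides the outcome, which is the "margin is the guarantee" point made above.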

On-hardware validation (m5max, glm4.5-air-moe-106b-a12b-6bit, 2026-06-06)

A/B against the same n=4 unique-22K-token load that breached on the old code:

	old code	this PR
max current	107.525GB OVER_HARD	93.7GB
OVER_HARD events	breached cap	0
protection	too-late post-forward check -> unhandled 500	gate refuses BEFORE the forward
outcome	near-panic	3 clean refusals + 1 success; memory returns to ~85GB base; no panic

All three required assertions passed:

(a) Before any stress request, the startup log showed: [memgate] prefill forward gate ACTIVE (phys-based): margin=12.0GB, cap=107.5GB, model-dim estimator=DISABLED (phys+margin only) -- confirming the margin propagated live (12.0) AND that the estimator is indeed unwired (the root cause), with the gate active anyway.
(b) Under the accumulating load the gate fired: refusing prefill chunk BEFORE forward: predicted peak 109.655GB = current 97.655GB + margin 12.000GB (estimate 0.000GB) exceeds hard cap 107.520GB. current never crossed 107.5, 0 OVER_HARD, no panic, and free memory returned to ~85GB base after each refusal (no KV wedging).
(c) Normal short (148 tok) / moderate (3338 tok) / 2-concurrent-moderate requests all still succeeded -- the gate is not over-eager.

Functional residual

A model that fills most of the cap (85GB on 128GB) gets long/accumulated prompts refused cleanly (503-class) once KV nears the headroom -- correct (refuse the request, do not crash the box). Fit longer contexts with a smaller quant (e.g. glm4.5 4bit ~55GB).
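
The headroom arithmetic behind this residual can be made explicit. The sketch below is illustrative only: `kv_headroom_gb` is a hypothetical helper, and the 85GB and 55GB footprints are the approximate figures quoted in this PR, so the results are approximate as well.

```python
def kv_headroom_gb(hard_cap_gb, margin_gb, base_gb, estimate_gb=0.0):
    """GB that KV and prefill growth may add above the model's resting
    footprint before the gate starts refusing chunks."""
    # The gate refuses once current + margin + estimate exceeds the cap,
    # so the trip point for `current` is cap - margin - estimate.
    trip_point_gb = hard_cap_gb - margin_gb - estimate_gb
    return max(0.0, trip_point_gb - base_gb)


# 6bit model at ~85GB: roughly 10.5GB of growth before refusals begin.
print(round(kv_headroom_gb(107.52, 12.0, 85.0), 2))
# 4bit model at ~55GB: roughly 40.5GB of growth before refusals begin.
print(round(kv_headroom_gb(107.52, 12.0, 55.0), 2))
```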

Relationship to #51

#51 (memory-budget admission queue) stays BLOCKED: it was also monitor-inert, and glm4.5 runs the external prefill loop which already serializes prefills, so its concurrent-queue targets a path glm4.5 does not use. This PR is the live single-request/accumulation protection.

Tests

2 monitor-dependent no-op tests replaced with phys-based firing tests; 3 startup-log tests added. Full suite 4546 pass / 3 known api_key fail / 19 skip on m2max -- zero regression. fix/* branch -> human review + merge.

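Problem (found by on-hardware validation)

The forward gate and every estimate-based memory guard were INERT in production: scheduler.memory_monitor was never wired, so the gate no-op'd on every chunk. A single or accumulated glm4.5 prefill reached 107.525GB OVER_HARD while the gate stayed silent; only the legacy reactive check caught it, after the transient had landed (no crash purely thanks to the tolerance band). Unit tests were green only because they inject a mock.

Fix

The gate is now phys-based: current=max(active,phys,recent_peak), all live; the estimate is optional; the margin is propagated from the enforcer (live) and raised 10->12GB; the gate no-ops only when the guard is off, the limit is unset, or chunk<=0; _log_prefill_gate_state_once explicitly logs the resolved state (WARNING if margin=0).

On-hardware validation (m5max glm4.5, 2026-06-06)

A/B on the same n=4 unique-22K load: old code max current 107.525GB OVER_HARD; this PR max current 93.7GB, 0 OVER_HARD, 3 clean refusals + 1 success, memory returns to the ~85GB base, no panic. All three assertions passed: (a) the log before the stress run showed ACTIVE margin=12.0 estimator=DISABLED; (b) the gate refused before the forward at predicted 109.655 > 107.520, current never crossed the cap, and KV was released without wedging; (c) normal short/moderate requests still succeeded.

Residual + relationship

A model that fills most of the cap gets long prompts refused with a clean 503 (switch to a smaller quant). #51 stays BLOCKED (monitor-inert, and glm4.5 runs the external loop, which is already serial). Full suite 4546 pass / 3 known api_key fail / 19 skip, zero regression. fix/* -> human review + human merge.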

…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill transient breaching the Metal cap. The model loads and runs normally; only an individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (the enforcer polls phys every 1s) and the in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA a chunk that overshoots the Metal wired limit kernel-panics the whole machine, so the post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left ~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB, peaks hit 110GB vs the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop + chunked-step mirror), predict current (high-water: max active/phys/recent_peak) + estimate_prefill_peak_bytes (KV+SDPA) + a conservative transient margin; if the sum would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup converts that RuntimeError into a finish_reason="error" output -- request refused cleanly, machine not crashed. The legacy post-forward check stays as a backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10) > worst observed single-step jump (7.4GB).
- Preflight now also maxes against the enforcer's recent high-water mark.
- Reverts the model_load_prefill_headroom_gb load rejection (user feedback: a model must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear (the active trough), leaning on phys_footprint stickiness + recent_peak to avoid a trough misread; a misread could still admit a crashing chunk. Not a literal never-panic guarantee -- needs on-hardware [memgate] log validation; the real fix is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration asserting model forward NOT called over-cap), verified to fail with the gate neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max -- zero regression.
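
The ordering this commit establishes (check first, forward second, refusal surfaced as an error output) can be sketched like this. It is a simplified illustration under stated assumptions: `run_prefill`, `read_current_gb`, and `forward` are hypothetical stand-ins for the scheduler loop, the live memory reading, and self.model()+mx.eval, and the estimate term is omitted.

```python
def run_prefill(chunks, read_current_gb, forward, hard_cap_gb=107.5, margin_gb=10.0):
    """Process prefill chunks in order, refusing before any forward whose
    predicted peak (current + margin) would exceed the hard cap."""
    try:
        for chunk in chunks:
            # Gate: evaluated BEFORE the forward pass, because an overshoot
            # during the forward can take the machine down before any
            # post-forward check gets to run.
            predicted = read_current_gb() + margin_gb
            if predicted > hard_cap_gb:
                raise RuntimeError(
                    f"prefill refused before forward: predicted {predicted:.1f}GB"
                    f" > cap {hard_cap_gb:.1f}GB"
                )
            forward(chunk)
    except RuntimeError as exc:
        # Refusal becomes an error output for this request only.
        return {"finish_reason": "error", "error": str(exc)}
    # Prefill completed; decoding would continue from here.
    return {"finish_reason": None, "error": None}
```

A test in the spirit of the integration test described above would record the calls to `forward` and assert that the over-cap chunk never reaches it.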

…prod

On-hardware validation found the forward gate (and every estimate-based memory guard) was INERT in production: scheduler.memory_monitor is never wired, so the gate's `if memory_monitor is None: return` / `if estimate == 0: return` no-op fired on every chunk. A single glm4.5-air-106b (85GB) prefill drove memory to 107.525GB (OVER_HARD, past the 107.5 cap) with the gate silent; only the legacy reactive post-forward check caught it -- after the transient had already landed (no panic only because it fell inside the tolerance band). Unit tests passed because they inject a mock monitor.

Make the gate PHYS-based, so it no longer depends on the dead monitor:

- current = max(active, phys, recent_peak) -- all LIVE production readings (the same ones [memcheck:external] and the enforcer use).
- The model-dim estimate is now OPTIONAL: used if a monitor is present, else 0. At chunk granularity it is tiny anyway; the margin is the real guarantee.
- The transient margin is propagated from the ENFORCER (live), not the monitor (dead), so the gate's safety actually fires. Bumped 10 -> 12GB: the un-modelled MoE expert-dequant transient is sub-poll (invisible to every memory read) and the m5max crash showed ~10.6GB; 12 = ceil(10.6) padded, and the box only panics nearer ~110, so 12 has real cushion.
- The gate now no-ops ONLY when the guard is off, the hard limit is unset, or chunk<=0 -- never because the monitor/estimate is absent.

Add _log_prefill_gate_state_once: on the first prefill, log the RESOLVED state (phys-based, margin, cap, estimator active/disabled) at info, or WARNING if the margin propagated as 0 (gate degraded to the bare cap check). The prior failure was not the wiring gap but that it was SILENT; this makes that class of inert guard impossible to ship blind.

Document _preflight_memory_check as inert in prod (monitor-dependent; a phys-only version would never reject at idle).

Functional residual: a model that fills most of the cap (85GB on 128GB) gets long prompts refused cleanly (503-class) once accumulated KV nears the headroom -- correct behaviour (refuse the request, not crash the box); fit longer contexts with a smaller quant. The refusal reuses the existing call-site RuntimeError handler, which calls _sync_and_clear_cache() to release the KV.

Not yet on-hardware validated -- this commit is what makes that validation meaningful (the prior gate could not fire). The m5max round must confirm: the startup log shows ACTIVE+margin before any request; a long glm4.5 prompt is refused cleanly AND memory returns to the ~85GB base; normal requests still succeed.

Tests: 2 monitor-dependent no-op tests replaced with phys-based firing tests (fires without a monitor, passes when it fits, margin-carries-it with estimate 0) + 3 startup-log tests (logs once, warns at margin 0, reports estimator disabled). Full suite 4546 pass / 3 known api_key fail / 19 skip -- zero regression.
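
A log-once helper of the kind described for _log_prefill_gate_state_once might look like the sketch below. The class name `GateStateLogger`, its method signature, and the wording of the WARNING line are assumptions made for illustration; only the ACTIVE line mirrors the log text quoted in this PR.

```python
import logging

logger = logging.getLogger("memgate")


class GateStateLogger:
    """Logs the resolved gate state exactly once (hypothetical helper)."""

    def __init__(self):
        self._logged = False

    def log_once(self, margin_gb, cap_gb, estimator_active):
        """Return True if this call emitted the log line, False if already logged."""
        if self._logged:
            return False
        self._logged = True

        estimator = "ACTIVE" if estimator_active else "DISABLED (phys+margin only)"
        if margin_gb <= 0:
            # A zero margin means the gate has degraded to a bare cap check;
            # surface that loudly instead of shipping it silently.
            logger.warning(
                "[memgate] prefill forward gate DEGRADED: margin=%.1fGB, cap=%.1fGB, "
                "model-dim estimator=%s",
                margin_gb, cap_gb, estimator,
            )
        else:
            logger.info(
                "[memgate] prefill forward gate ACTIVE (phys-based): margin=%.1fGB, "
                "cap=%.1fGB, model-dim estimator=%s",
                margin_gb, cap_gb, estimator,
            )
        return True
```

Logging the resolved values, rather than the configured ones, is the design point: a margin that failed to propagate shows up as margin=0.0GB in the first log line instead of passing unnoticed.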

panwudi · 2026-06-06T08:32:18Z

Honest caveats / scoped follow-ups (not gaps in the panic protection)

To state the validation accurately:

Refusal is box-safe but the CLIENT error is not yet clean. The gate keeps the box safe and KV IS reclaimed (the load-bearing wins, confirmed: 0 OVER_HARD, memory returns to ~85GB base). But the client receives a dropped connection (IncompleteRead), not a 503 carrying the "reduce context length" message -- the refusal RuntimeError still flows through the pre-existing finish_reason=error -> generic 500 path (server.py:594). So a caller cannot distinguish "refused, shorten your prompt" from "server crashed." Translating the gate refusal into a proper 503 with the helpful body is a follow-up (does not need another hardware round -- the IncompleteRead is the evidence).
Only the external prefill path was metal-validated. That is the path glm4.5-air uses (external loop, prefills serialized), so it is the right one to prove. The gate's other call site (_advance_chunked_prefills) advances N concurrent chunked prefills one chunk per step; each checks current+margin against the same pre-step trough, then all forward -- siblings' simultaneous transients are not modeled, only the margin cushions them. Moot for glm4.5; a residual for chunked-path models under high concurrency.
KV-eviction estimate path is also monitor-inert (watch item). Reclamation works post-completion (validated). But under sustained unique-prompt load, if the prefix cache pins current near the trip point (~cap-margin) faster than eviction frees it, the gate could begin refusing even moderate requests. The burst test cannot observe this steady state -- flag for production observation, not a blocker.
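
For the first follow-up, one possible shape is a dedicated exception type that the server maps to a 503, so callers can tell a refusal from a crash. This is only a sketch of the idea, not the planned change: `PrefillRefusedError`, `to_http_response`, and the response body fields are hypothetical, and the real handler in server.py is not shown in this PR.

```python
class PrefillRefusedError(RuntimeError):
    """Hypothetical subclass marking a gate refusal rather than a server fault."""


def to_http_response(exc):
    """Map an exception to (status_code, json_body)."""
    if isinstance(exc, PrefillRefusedError):
        # The server is healthy; this request simply does not fit right now.
        return 503, {
            "error": {
                "type": "memory_refused",
                "message": f"{exc}; reduce context length and retry",
            }
        }
    # Anything else stays on the generic internal-error path.
    return 500, {"error": {"type": "internal_error", "message": "internal server error"}}
```

Because the subclass still derives from RuntimeError, an existing `except RuntimeError` cleanup path would keep catching it.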

None of these affect the headline result: the single-request / accumulation OVER_HARD that breached on old code is now refused before the forward, box stays safe.

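Factual addendum (scoped follow-ups, not gaps in the panic protection):

The refusal is box-safe, but the client error is not yet clean: box safety and KV release are real (0 OVER_HARD, memory returns to the ~85GB base), but the client gets a dropped connection (IncompleteRead), not a 503 carrying a "shorten the context" message -- the refusal RuntimeError still flows through the pre-existing finish_reason=error -> generic 500 path (server.py:594). Translating the gate refusal into a clean 503 is a follow-up (no further hardware round needed; the IncompleteRead is the evidence).
Only the external path was validated on hardware (glm4.5 uses this path, so it is the right one); on the chunked path (_advance_chunked_prefills) the simultaneous transients of concurrent siblings are not modeled and are cushioned only by the margin. No impact on glm4.5; a residual for chunked-path models under high concurrency.
The KV-eviction estimate path is also monitor-inert (watch item): reclamation after completion works (validated), but under sustained unique-prompt load, if the prefix cache pins current near the trip point faster than eviction frees it, the gate may begin refusing moderate requests. The burst test cannot observe the steady state; flagged for production observation, not a blocker.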

Wan2.2 T2V via mlx-gen subprocess worker, /v1/videos asynchronous job API, memory-lease co-residency. Validated end to end on m5max. Supersedes #52 (included by ancestry). Self-merge under explicit owner authorization (2026-06-11).

panwudi · 2026-06-10T17:47:41Z

Superseded by #53 (feat/video-engine), which contains both gate commits by ancestry and is now merged to main. The gate's m5max A/B validation evidence remains recorded in this PR. Closing.

yuanwei added 2 commits June 6, 2026 00:57

panwudi mentioned this pull request Jun 10, 2026

feat(video): text-to-video generation engine (Wan2.2 via mlx-gen) #53

Merged

panwudi merged commit 2080310 into main Jun 10, 2026

panwudi merged 2 commits into main from fix/prefill-gate-phys-based
