fix(memory): refuse oversized prefill chunks before the forward (anti-panic) by panwudi · Pull Request #49 · panwudi/flyto-mlx

panwudi · 2026-06-05T21:31:06Z

What this is now (replaces the rejected load-admission headroom)

The PR originally rejected oversized models at load. Per user feedback, that is the wrong shape: a model must be loadable; refuse the request, not the model. That approach is reverted (engine_pool/settings are back to main). This PR now stops the crash at the prefill level.

Why

glm4.5-air-106b (85GB) on a 128GB box (Metal cap 107.5GB) leaves ~22GB for KV+prefill. A per-chunk MoE-dequant transient (measured up to 7.4GB on m5max, with peaks hitting 110GB) crosses the cap. oMLX checks memory AFTER the forward, but on Apple UMA an over-cap allocation kernel-panics the whole machine before any Python check runs. This rebooted the box twice (2026-06-01 and 2026-06-06).

What

- Scheduler._prefill_forward_gate: before each prefill forward (external + chunked-step loops), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV+SDPA) + margin; raise BEFORE the forward if the sum would breach the hard cap. The existing jundot/omlx#1405 cleanup ("[Bug] Prefill requests stuck after force-stopped due to memory limit exceeded - never released") turns that into a finish_reason=error (503-class): the request is refused cleanly and the machine does not crash. The legacy post-forward check is kept as a backstop.
- MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10GB) > worst observed single-step jump (7.4GB), measured from m5max crash logs.
- Preflight also takes the max against the enforcer's recent high-water mark.
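The gate logic described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `MemorySnapshot`, `PrefillOverCapError`, and the standalone `prefill_forward_gate` function are hypothetical names, and the real `Scheduler._prefill_forward_gate` reads its inputs from the enforcer and from `estimate_prefill_peak_bytes` rather than taking them as arguments.

```python
from dataclasses import dataclass

GB = 1024 ** 3


class PrefillOverCapError(RuntimeError):
    """Hypothetical error type: raised before the forward when a chunk would breach the cap."""


@dataclass
class MemorySnapshot:
    # Hypothetical container for the three signals the PR names.
    active_bytes: int
    phys_footprint_bytes: int
    recent_peak_bytes: int

    def high_water(self) -> int:
        # Use the largest reading so a momentary dip in one signal is not trusted.
        return max(self.active_bytes, self.phys_footprint_bytes, self.recent_peak_bytes)


def prefill_forward_gate(
    snapshot: MemorySnapshot,
    estimated_peak_bytes: int,
    hard_cap_bytes: int,
    margin_bytes: int = 10 * GB,
) -> int:
    """Return the predicted peak, or raise before any forward pass is attempted."""
    predicted = snapshot.high_water() + estimated_peak_bytes + margin_bytes
    if predicted > hard_cap_bytes:
        raise PrefillOverCapError(
            f"predicted {predicted / GB:.1f}GB exceeds cap {hard_cap_bytes / GB:.1f}GB"
        )
    return predicted


if __name__ == "__main__":
    cap = int(107.5 * GB)
    ok = MemorySnapshot(80 * GB, 90 * GB, 85 * GB)
    prefill_forward_gate(ok, 2 * GB, cap)  # 90 + 2 + 10 = 102GB, under the cap
    try:
        too_big = MemorySnapshot(80 * GB, 96 * GB, 85 * GB)
        prefill_forward_gate(too_big, 2 * GB, cap)  # 96 + 2 + 10 = 108GB, over the cap
    except PrefillOverCapError as exc:
        print("refused:", exc)
```

Raising a `RuntimeError` subclass keeps the sketch compatible with the cleanup path described above, which converts the error into a `finish_reason=error` output instead of letting the forward run.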

Honest residual (read before merge)

The gate reads current right after the prior chunk's cache clear (active-memory trough), leaning on phys_footprint stickiness + recent_peak to avoid a trough misread. A trough misread could still admit a crashing chunk -- this cannot be mock-tested; it needs on-hardware [memgate] log validation. This is "crash -> 503 for the common cases", NOT a literal never-panic guarantee. The real fix is preemptive KV offload (mlx-lm's extract_cache/remove/insert primitives support it -- separate, larger work).
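To make the trough-misread risk concrete, here is a small sketch with made-up readings (the 88/98/99GB figures are illustrative, not measurements; the 1GB estimate is also assumed). The helper names are hypothetical. It shows that the gate refuses as long as at least one signal still remembers the peak, and admits the chunk if all three have dropped to the trough.

```python
def high_water_gb(active_gb: float, phys_gb: float, recent_peak_gb: float) -> float:
    """Take the largest of the three readings so one low signal cannot hide a peak."""
    return max(active_gb, phys_gb, recent_peak_gb)


def gate_admits(current_gb: float, estimate_gb: float, margin_gb: float, cap_gb: float) -> bool:
    """True when the predicted peak stays at or under the cap."""
    return current_gb + estimate_gb + margin_gb <= cap_gb


CAP_GB, MARGIN_GB, ESTIMATE_GB = 107.5, 10.0, 1.0

# Case A: active memory dipped after the cache clear, but recent_peak still holds 99GB.
sticky = high_water_gb(active_gb=88.0, phys_gb=98.0, recent_peak_gb=99.0)

# Case B: all three signals read the trough (the misread this section warns about).
misread = high_water_gb(active_gb=88.0, phys_gb=88.0, recent_peak_gb=88.0)

print(gate_admits(sticky, ESTIMATE_GB, MARGIN_GB, CAP_GB))   # 99 + 1 + 10 = 110 > 107.5 -> False
print(gate_admits(misread, ESTIMATE_GB, MARGIN_GB, CAP_GB))  # 88 + 1 + 10 = 99 <= 107.5 -> True
```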

Also: with a 10GB margin, glm4.5 on this box refuses requests aggressively (~97GB trip point). It stops the crash; it does NOT make glm4.5 comfortably usable for long-context agentic work on a 128GB box -- that needs a smaller quant or the preemption work.
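The ~97GB trip point follows directly from the figures quoted in this PR. The arithmetic below is a sketch; the last line assumes, for illustration only, that the 85GB of model weights is the whole of "current" before any KV cache exists.

```python
METAL_CAP_GB = 107.5   # Metal cap on the 128GB box (from the PR)
MODEL_GB = 85.0        # glm4.5-air-106b weights (from the PR)
MARGIN_GB = 10.0       # prefill_transient_margin_gb default (from the PR)

# The gate trips once current + estimate exceeds cap - margin.
trip_point_gb = METAL_CAP_GB - MARGIN_GB
print(trip_point_gb)  # 97.5

# Headroom without the margin: the "~22GB for KV+prefill" quoted under Why.
raw_headroom_gb = METAL_CAP_GB - MODEL_GB
print(raw_headroom_gb)  # 22.5

# Assumption: weights alone account for current. Then only this much is left
# for KV cache plus the per-chunk prefill estimate before requests are refused.
usable_gb = trip_point_gb - MODEL_GB
print(usable_gb)  # 12.5
```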

Test plan

- Targeted (self-run, m2max): test_scheduler_prefill_forward_gate.py + admission + enforcer + engine_pool -> 157 passed.
- 12 new gate tests; verified to FAIL with the gate neutered (pins the fix).
- Full suite (m2max): 4542 pass / 3 fail / 19 skip. The 3 fails are the pre-existing test_settings.py api_key env-bleed (confirmed identical on clean main). Zero regression.
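The "fails with the gate neutered" property can be sketched as below. This is not one of the 12 tests in the PR: `run_prefill_chunk`, `OverCap`, and the `gate_enabled` flag are hypothetical stand-ins for the scheduler's prefill step, used only to show why an assertion that the model forward was NOT called pins the fix.

```python
from unittest.mock import MagicMock


class OverCap(RuntimeError):
    """Hypothetical stand-in for the error the gate raises."""


def run_prefill_chunk(model, tokens, current, estimate, margin, cap, gate_enabled=True):
    # Hypothetical prefill step: the gate must run before the model is invoked.
    if gate_enabled and current + estimate + margin > cap:
        raise OverCap("refused before forward")
    return model(tokens)


def check_forward_not_called_over_cap(gate_enabled: bool) -> bool:
    """Return True when the over-cap chunk was refused without touching the model."""
    model = MagicMock()
    try:
        run_prefill_chunk(model, [1, 2, 3], current=100, estimate=5, margin=10,
                          cap=107, gate_enabled=gate_enabled)
    except OverCap:
        pass
    return not model.called


def test_gate_blocks_forward():
    assert check_forward_not_called_over_cap(gate_enabled=True)


def test_neutered_gate_is_detected():
    # With the gate disabled the forward runs, so the check above would fail.
    assert not check_forward_not_called_over_cap(gate_enabled=False)
```

Asserting on the mock's call state, and not only on the raised error, is what distinguishes the pre-forward gate from the legacy post-forward check, which would also raise but only after the model had already run.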

Deploy validation (after merge, on m5max)

1. Deploy + restart serve.
2. Load glm4.5-air-106b + send a long prompt -> expect a clean 503-class refusal (NOT a crash). Watch the [memgate] logs: confirm current reflects the high-water mark (not a sub-97GB trough) before a would-overshoot chunk.
3. Load a normal model (e.g. 27b) -> confirm normal requests still run.

…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill transient breaching the Metal cap. The model loads and runs normally; only an individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (enforcer polls phys every 1s) and the in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA a chunk that overshoots the Metal wired limit kernel-panics the whole machine, so the post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left ~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB, peaks hit 110GB vs the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop + chunked-step mirror), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV+SDPA) + a conservative transient margin; if it would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup converts that RuntimeError into a finish_reason="error" output -- request refused cleanly, machine not crashed. Legacy post-forward check stays as backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10) > worst observed single-step jump (7.4GB).
- Preflight now also maxes against the enforcer recent high-water mark.
- Reverts model_load_prefill_headroom_gb load rejection (user feedback: a model must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear (active trough), leaning on phys_footprint stickiness + recent_peak to avoid a trough misread; a misread could still admit a crashing chunk. Not a literal never-panic guarantee -- needs on-hardware [memgate] log validation; the real fix is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration asserting model forward NOT called over-cap), verified to fail with gate neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max -- zero regression.

panwudi · 2026-06-06T06:47:55Z

Superseded by #51. The forward gate read the mid-prefill trough with margin 10 and under-reported (10 < the observed ~10.6GB transient). #51 makes entry-point budget admission the primary fix (stable read, queue-not-stack), bumps the margin to 12, and keeps this forward gate un-retired as the per-chunk concurrent-drift backstop.


panwudi force-pushed the fix/memory-load-admission-headroom branch from fefdecf to 35feba7 Compare June 5, 2026 23:34

panwudi changed the title ~~fix(memory): reserve prefill headroom at load + high-water preflight~~ fix(memory): refuse oversized prefill chunks before the forward (anti-panic) Jun 5, 2026

panwudi mentioned this pull request Jun 6, 2026

[BLOCKED: inert in prod] fix(memory): admit prefill by memory budget, queue instead of stacking #51

Draft

panwudi closed this Jun 6, 2026

panwudi wants to merge 1 commit into main from fix/memory-load-admission-headroom
