fix(memory): refuse oversized prefill chunks before the forward (anti-panic) #49

Closed
panwudi wants to merge 1 commit into main from fix/memory-load-admission-headroom

Conversation

panwudi commented Jun 5, 2026

Owner

What this is now (replaces the rejected load-admission headroom)

The PR originally rejected oversized models at load. Per user feedback, that is the wrong shape: a model must be loadable, so refuse the request, not the model. That approach is reverted (engine_pool/settings are back to main). The PR now stops the crash at the prefill level.

Why

glm4.5-air-106b (85GB) on a 128GB box (Metal cap 107.5GB) leaves ~22GB for KV+prefill. A per-chunk MoE-dequant transient (measured up to 7.4GB on m5max, with peaks hitting 110GB) crosses the cap. oMLX checks memory AFTER the forward, but on Apple UMA an over-cap allocation kernel-panics the whole machine before any Python check runs. This rebooted the box twice (2026-06-01 and 2026-06-06).
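
The budget above can be worked through directly. This is a minimal sketch that uses only the figures quoted in this description; the variable names are illustrative and are not oMLX identifiers.

```python
# Figures quoted in the PR description, in GB. Names are illustrative only.
METAL_CAP_GB = 107.5      # Metal hard cap on the 128GB box
MODEL_WEIGHTS_GB = 85.0   # glm4.5-air-106b resident size
OBSERVED_PEAK_GB = 110.0  # highest peak quoted from the crash logs

# Room left for KV cache plus prefill activations once the model is loaded.
headroom_gb = METAL_CAP_GB - MODEL_WEIGHTS_GB

# How far the observed peak went past the cap.
overshoot_gb = OBSERVED_PEAK_GB - METAL_CAP_GB

print(f"headroom for KV+prefill: {headroom_gb:.1f} GB")
print(f"observed overshoot past the cap: {overshoot_gb:.1f} GB")
```

The headroom is what the description rounds to "~22GB"; any prefill step whose transient pushes usage past the cap lands in the overshoot region where the post-forward check never gets to run.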

What

  • Scheduler._prefill_forward_gate: before each prefill forward (in both the external and chunked-step loops), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV+SDPA) + margin, and raise BEFORE the forward if the sum would breach the hard cap. The existing cleanup from jundot/omlx#1405 ("[Bug] Prefill requests stuck after force-stopped due to memory limit exceeded - never released") turns that into a finish_reason=error (503-class) output -- request refused cleanly, machine not crashed. The legacy post-forward check is kept as a backstop.
  • MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin(10) > worst observed single-step jump(7.4GB), measured from m5max crash logs.
  • Preflight also maxes against the enforcer recent high-water mark.
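
The gate arithmetic described in the list above can be sketched as follows. This is not the code in the PR: `MemoryReading`, `prefill_forward_gate` as a free function, and the example numbers are assumptions made for illustration. Only the formula (high-water current + estimate + margin compared against the hard cap, raising a RuntimeError before the forward) comes from the description.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class MemoryReading:
    """Hypothetical snapshot of the three readings named in the description."""
    active_bytes: int
    phys_footprint_bytes: int
    recent_peak_bytes: int

    def high_water(self) -> int:
        # "current" is the max of the three readings, so one low reading
        # cannot hide a higher one.
        return max(self.active_bytes, self.phys_footprint_bytes, self.recent_peak_bytes)


def prefill_forward_gate(
    reading: MemoryReading,
    estimated_peak_bytes: int,
    margin_bytes: int,
    hard_cap_bytes: int,
) -> int:
    """Return the predicted peak, or raise before the forward would run."""
    predicted = reading.high_water() + estimated_peak_bytes + margin_bytes
    if predicted > hard_cap_bytes:
        raise RuntimeError(
            f"prefill chunk refused: predicted {predicted / GIB:.1f} GiB "
            f"exceeds hard cap {hard_cap_bytes / GIB:.1f} GiB"
        )
    return predicted


# Example numbers are made up for the sketch; the cap and margin mirror the PR.
cap = int(107.5 * GIB)
margin = 10 * GIB
fits = MemoryReading(88 * GIB, 90 * GIB, 89 * GIB)
too_big = MemoryReading(94 * GIB, 96 * GIB, 95 * GIB)

predicted_ok = prefill_forward_gate(fits, 2 * GIB, margin, cap)
```

Because the check runs before the forward, a refusal surfaces as an ordinary Python exception that the existing cleanup path can turn into an error output, instead of an allocation the kernel has already attempted.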

Honest residual (read before merge)

The gate reads current right after the prior chunk's cache clear (active-memory trough), leaning on phys_footprint stickiness + recent_peak to avoid a trough misread. A trough misread could still admit a crashing chunk -- this cannot be mock-tested; it needs on-hardware [memgate] log validation. This is "crash -> 503 for the common cases", NOT a literal never-panic guarantee. The real fix is preemptive KV offload (mlx-lm's extract_cache/remove/insert primitives support it -- separate, larger work).
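
To make the trough-misread risk concrete, here is a small sketch of a rolling recent-peak tracker. How the enforcer actually computes recent_peak is not shown in this PR, so `RecentPeak`, its time window, and `gate_current` are all hypothetical; the point is only that the high-water value protects the gate while a recent sample is still in the window, and stops protecting it once that sample ages out.

```python
from collections import deque


class RecentPeak:
    """Hypothetical rolling maximum over the last `window_s` seconds."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._samples = deque()  # (timestamp_s, used_gb) pairs, oldest first

    def record(self, now_s: float, used_gb: float) -> None:
        self._samples.append((now_s, used_gb))

    def peak(self, now_s: float) -> float:
        # Drop samples that have aged out of the window.
        while self._samples and now_s - self._samples[0][0] > self.window_s:
            self._samples.popleft()
        return max((used for _, used in self._samples), default=0.0)


def gate_current(active_gb: float, tracker: RecentPeak, now_s: float) -> float:
    """The value the gate would treat as 'current': max of trough and recent peak."""
    return max(active_gb, tracker.peak(now_s))


tracker = RecentPeak(window_s=5.0)
tracker.record(0.0, 105.0)  # peak during the previous chunk's forward
tracker.record(1.0, 96.0)   # trough right after the cache clear

fresh = gate_current(96.0, tracker, now_s=1.0)   # peak still in the window
stale = gate_current(96.0, tracker, now_s=10.0)  # peak has aged out
```

In this sketch `fresh` reflects the earlier peak while `stale` falls back to the trough reading, which is the misread the paragraph above warns about.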

Also: with a 10GB margin, glm4.5 on this box refuses requests aggressively (~97GB trip point). It stops the crash; it does NOT make glm4.5 comfortably usable for long-context agentic work on a 128GB box -- that needs a smaller quant or the preemption work.
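
The trip point follows from the cap and the margin. A minimal sketch, assuming "current" never drops below the resident model weights; the function names are illustrative, not oMLX settings or APIs.

```python
def trip_point_gb(hard_cap_gb: float, margin_gb: float) -> float:
    """Largest current + estimate that the gate still admits."""
    return hard_cap_gb - margin_gb


def room_after_load_gb(hard_cap_gb: float, margin_gb: float, resident_gb: float) -> float:
    """What is left for KV plus the per-chunk estimate once the model is resident."""
    return trip_point_gb(hard_cap_gb, margin_gb) - resident_gb


trip = trip_point_gb(107.5, 10.0)             # the "~97GB trip point"
room = room_after_load_gb(107.5, 10.0, 85.0)  # glm4.5-air-106b on this box

print(f"trip point: {trip:.1f} GB, room for KV + estimate: {room:.1f} GB")
```

A larger margin lowers the trip point one-for-one, which is why the gate trades crash safety against how much context a model of this size can actually serve.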

Test plan

  • Targeted (self-run, m2max): test_scheduler_prefill_forward_gate.py + admission + enforcer + engine_pool -> 157 passed.
  • 12 new gate tests; verified to FAIL with the gate neutered (pins the fix).
  • Full suite (m2max): 4542 pass / 3 fail / 19 skip. The 3 fails are pre-existing test_settings.py api_key env-bleed (confirmed identical on clean main). Zero regression.
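
The "forward NOT called over-cap" style of test, and the idea of pinning it by neutering the gate, can be sketched like this. It is not the PR's test file: `guarded_forward` and its `gate_enabled` switch are stand-ins invented for the example, with `unittest.mock.Mock` playing the model forward.

```python
from unittest.mock import Mock


def guarded_forward(forward, predicted_bytes, hard_cap_bytes, gate_enabled=True):
    """Hypothetical stand-in: run the gate check, then the model forward."""
    if gate_enabled and predicted_bytes > hard_cap_bytes:
        raise RuntimeError("prefill refused before forward")
    return forward()


def test_forward_not_called_over_cap():
    forward = Mock(return_value="logits")
    try:
        guarded_forward(forward, predicted_bytes=108, hard_cap_bytes=107)
    except RuntimeError:
        pass
    else:
        raise AssertionError("gate should have raised")
    # The key property: the forward never ran.
    forward.assert_not_called()


def test_forward_called_under_cap():
    forward = Mock(return_value="logits")
    assert guarded_forward(forward, predicted_bytes=100, hard_cap_bytes=107) == "logits"
    forward.assert_called_once()


test_forward_not_called_over_cap()
test_forward_called_under_cap()
```

With `gate_enabled=False` the over-cap call reaches the forward, so the first test would fail; that is the sense in which a test of this shape pins the fix.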

Deploy validation (after merge, on m5max)

  1. Deploy + restart serve.
  2. Load glm4.5-air-106b + send a long prompt -> expect a clean 503-class refusal (NOT a crash). Watch [memgate] logs: confirm current reflects high-water (not a sub-97GB trough) before a would-overshoot chunk.
  3. Load a normal model (e.g. 27b) -> confirm normal requests still run.

…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill
transient breaching the Metal cap. The model loads and runs normally; only an
individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (enforcer polls phys every 1s) and the
in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA a chunk that
overshoots the Metal wired limit kernel-panics the whole machine, so the
post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left
~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB,
peaks hit 110GB vs the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop +
  chunked-step mirror), predict current(high-water: max active/phys/recent_peak)
  + estimate_prefill_peak_bytes(KV+SDPA) + a conservative transient margin; if it
  would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup
  converts that RuntimeError into a finish_reason="error" output -- request
  refused cleanly, machine not crashed. Legacy post-forward check stays as backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated
  settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the
  MoE expert-dequant spike, so this margin carries the guarantee: margin (10) >
  worst observed single-step jump (7.4GB).
- Preflight now also maxes against the enforcer recent high-water mark.
- Reverts model_load_prefill_headroom_gb load rejection (user feedback: a model
  must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear
(active trough), leaning on phys_footprint stickiness + recent_peak to avoid a
trough misread; a misread could still admit a crashing chunk. Not a literal
never-panic guarantee -- needs on-hardware [memgate] log validation; the real fix
is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration
asserting model forward NOT called over-cap), verified to fail with gate
neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max -- zero
regression.

---

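Stop the whole-machine watchdog panic on m5max caused by a large model's prefill
transient breaching the Metal cap. The model loads and runs normally; only the
individual request that cannot fit is refused (503-class), and the model itself
is not blocked.

Root cause: oMLX manages memory reactively (the enforcer polls phys every 1s),
and the prefill memory check runs after self.model()+mx.eval. On Apple UMA a
chunk that breaches the Metal wired limit kernel-panics the whole machine
directly, so the post-forward Python check never gets a chance to run.
glm4.5-air-106b (85GB) on a 128GB machine leaves only ~22GB for KV+prefill; the
per-chunk MoE dequantization transient (measured up to 7.4GB, with peaks
reaching 110GB vs the 107.5GB cap) breached the cap and rebooted the machine
twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop +
  chunked-step mirror), predict current (high-water: max active/phys/recent_peak)
  + estimate_prefill_peak_bytes (KV+SDPA) + a conservative transient margin; if
  that exceeds the hard cap, raise before the forward. The existing jundot#1405
  cleanup converts the RuntimeError into a finish_reason=error output -- request
  refused cleanly, machine not crashed. The old post-forward check stays as a
  backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated
  settings -> enforcer -> scheduler. The estimate does not include the MoE
  dequantization transient, so the margin carries the guarantee: margin (10) >
  worst measured single-step jump (7.4GB).
- Preflight now also takes the max against the enforcer's recent high-water mark.
- Removes the model_load_prefill_headroom_gb load rejection (user feedback: a
  model must be loadable; refuse the request, not the model).

Honest residual: the gate reads current after the previous chunk's cache clear
(active trough), relying on phys_footprint stickiness + recent_peak to avoid
reading the trough; if it does read the trough, it can still admit a chunk that
crashes. Not a literal "never panic" guarantee -- needs on-hardware [memgate] log
validation; the real cure is preemptive KV offload (separate work).

Tests: 12 new, verified to fail when the gate is disabled. Full suite on m2max:
4542 pass / 3 known api_key fail / 19 skip -- zero regression.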
panwudi force-pushed the fix/memory-load-admission-headroom branch from fefdecf to 35feba7 on June 5, 2026 23:34
panwudi changed the title from "fix(memory): reserve prefill headroom at load + high-water preflight" to "fix(memory): refuse oversized prefill chunks before the forward (anti-panic)" on Jun 5, 2026
panwudi commented Jun 6, 2026

Owner Author

Superseded by #51. The forward gate read the mid-prefill trough with margin 10 and under-reported (10 < the observed ~10.6GB transient). #51 makes entry-point budget admission the primary fix (stable read, queue-not-stack), bumps the margin to 12, and keeps this forward gate un-retired as the per-chunk concurrent-drift backstop.


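Superseded by #51. The forward gate read the mid-prefill trough value and, with margin 10, under-reported (10 < the measured ~10.6GB transient). #51 makes entry-point budget admission the primary fix (stable reading, queue rather than stack), raises the margin to 12, and keeps this forward gate as the per-chunk concurrent-drift backstop.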

@panwudi panwudi closed this Jun 6, 2026