fix(memory): refuse oversized prefill chunks before the forward (anti-panic) #49
Closed
panwudi wants to merge 1 commit into
…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill transient breaching the Metal cap. The model loads and runs normally; only an individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (the enforcer polls phys every 1s), and the in-prefill memory check ran AFTER self.model() + mx.eval. On Apple UMA a chunk that overshoots the Metal wired limit kernel-panics the whole machine, so the post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left ~22GB for KV + prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB, with peaks hitting 110GB against the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop + chunked-step mirror), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV + SDPA) + a conservative transient margin; if the sum would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup converts that RuntimeError into a finish_reason="error" output -- request refused cleanly, machine not crashed. The legacy post-forward check stays as a backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10) > worst observed single-step jump (7.4GB).
- Preflight now also maxes against the enforcer's recent high-water mark.
- Reverts the model_load_prefill_headroom_gb load rejection (user feedback: a model must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear (the active-memory trough), leaning on phys_footprint stickiness + recent_peak to avoid a trough misread; a misread could still admit a crashing chunk. This is not a literal never-panic guarantee -- it needs on-hardware [memgate] log validation; the real fix is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration asserting the model forward is NOT called when over cap), verified to fail with the gate neutered. Full suite on m2max: 4542 pass / 3 known api_key fail / 19 skip -- zero regression.
fefdecf to 35feba7
Owner
Author
Superseded by #51. The forward gate read the mid-prefill trough and, with a margin of 10, under-reported: 10 < the observed ~10.6GB transient. #51 makes entry-point budget admission the primary fix (stable read; requests queue rather than stack), raises the margin to 12, and keeps this forward gate in place as the per-chunk backstop against concurrent drift.
What this is now (replaces the rejected load-admission headroom)
The PR originally rejected oversized models at load. Per user feedback, that is the wrong shape: a model must be loadable; refuse the request, not the model. That approach is reverted (engine_pool and settings are back to main). The PR now stops the crash at the prefill level instead.
Why
glm4.5-air-106b (85GB) on a 128GB box (Metal cap 107.5GB) leaves ~22GB for KV + prefill. A per-chunk MoE-dequant transient (measured up to 7.4GB on m5max, with peaks hitting 110GB) crosses the cap. oMLX checks memory AFTER the forward, but on Apple UMA an over-cap allocation kernel-panics the whole machine before any Python check runs. This rebooted the box twice (2026-06-01 and 2026-06-06).
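The numbers above can be checked with a few lines of arithmetic. This is only an illustration using the figures quoted in this PR; treating the 85GB model size as resident memory under the cap, and the variable names themselves, are assumptions made for the sketch.

```python
# Figures quoted in the PR description, in GB.
# Assumption: the 85GB model size is counted as resident memory under the cap.
METAL_CAP_GB = 107.5
MODEL_GB = 85.0
OBSERVED_PEAK_GB = 110.0
WORST_TRANSIENT_GB = 7.4

# Room left under the Metal cap once the model is loaded (the "~22GB").
headroom_gb = METAL_CAP_GB - MODEL_GB

# How far the observed peak went past the cap.
overshoot_gb = OBSERVED_PEAK_GB - METAL_CAP_GB

# What remains for KV + prefill if the worst observed transient is reserved.
budget_after_transient_gb = headroom_gb - WORST_TRANSIENT_GB

print(f"headroom under cap:      {headroom_gb:.1f} GB")
print(f"observed overshoot:      {overshoot_gb:.1f} GB")
print(f"budget after transient:  {budget_after_transient_gb:.1f} GB")
```

The point of the exercise: the overshoot is small relative to the headroom, so a check that runs before the forward only has to be conservative by a few GB to catch it.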
What
- Scheduler._prefill_forward_gate: before each prefill forward (external + chunked-step loops), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV + SDPA) + margin; raise BEFORE the forward if the sum would breach the hard cap. The existing jundot/omlx#1405 cleanup ("[Bug] Prefill requests stuck after force-stopped due to memory limit exceeded - never released") turns that into a finish_reason=error (503-class) -- request refused cleanly, machine not crashed. The legacy post-forward check is kept as a backstop.
- MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10) > worst observed single-step jump (7.4GB), measured from m5max crash logs.
Honest residual (read before merge)
The gate reads current right after the prior chunk's cache clear (the active-memory trough), leaning on phys_footprint stickiness + recent_peak to avoid a trough misread. A trough misread could still admit a crashing chunk -- this cannot be mock-tested; it needs on-hardware [memgate] log validation. This is "crash -> 503 for the common cases", NOT a literal never-panic guarantee. The real fix is preemptive KV offload (mlx-lm's extract_cache/remove/insert primitives support it -- separate, larger work).
Also: with a 10GB margin, glm4.5 on this box refuses requests aggressively (~97GB trip point). It stops the crash; it does NOT make glm4.5 comfortably usable for long-context agentic work on a 128GB box -- that needs a smaller quant or the preemption work.
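The gate described under "What" can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's code: MemorySnapshot, PrefillRefused, prefill_forward_gate, run_chunk and the GiB-based GB constant are hypothetical names and assumptions; the PR itself only states that a RuntimeError is raised before the forward and that the prediction is high-water + estimate + margin.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")
GB = 1024 ** 3  # assumption: the unit convention is not stated in the PR


class PrefillRefused(RuntimeError):
    """Hypothetical subclass; the PR only says a RuntimeError is raised."""


@dataclass
class MemorySnapshot:
    """Hypothetical container for the three readings the gate takes a max over."""
    active_bytes: int
    phys_bytes: int
    recent_peak_bytes: int

    def high_water(self) -> int:
        # Taking the max guards against reading only the post-cache-clear trough.
        return max(self.active_bytes, self.phys_bytes, self.recent_peak_bytes)


def prefill_forward_gate(snapshot: MemorySnapshot, estimated_peak_bytes: int,
                         hard_cap_bytes: int, margin_gb: float = 10.0) -> int:
    """Return the predicted peak, or raise before any forward pass is attempted."""
    predicted = snapshot.high_water() + estimated_peak_bytes + int(margin_gb * GB)
    if predicted > hard_cap_bytes:
        raise PrefillRefused(
            f"predicted {predicted / GB:.1f}GB exceeds cap {hard_cap_bytes / GB:.1f}GB"
        )
    return predicted


def run_chunk(forward: Callable[[], T], snapshot: MemorySnapshot,
              estimated_peak_bytes: int, hard_cap_bytes: int,
              margin_gb: float = 10.0) -> T:
    # The gate runs first, so an over-cap chunk never reaches the forward.
    prefill_forward_gate(snapshot, estimated_peak_bytes, hard_cap_bytes, margin_gb)
    return forward()
```

With a 107.5GB cap and the default margin of 10, any high-water reading above 97.5GB trips the gate even with a zero estimate, which is consistent with the ~97GB trip point noted above. The sketch also makes the residual visible: if all three readings in the snapshot are taken in a trough, the max is still too low and the gate passes.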
Test plan
- test_scheduler_prefill_forward_gate.py + admission + enforcer + engine_pool -> 157 passed.
- test_settings.py api_key env-bleed (confirmed identical on clean main). Zero regression.
Deploy validation (after merge, on m5max)
- [memgate] logs: confirm current reflects high-water (not a sub-97GB trough) before a would-overshoot chunk.