[BLOCKED: inert in prod] fix(memory): admit prefill by memory budget, queue instead of stacking #51
panwudi wants to merge 2 commits into
…-panic)

Stop the m5max whole-machine watchdog panic caused by a large model's prefill transient breaching the Metal cap. The model loads and runs normally; only an individual request that cannot fit is refused (503-class), not the model.

Root cause: oMLX bounds memory reactively (the enforcer polls phys every 1s), and the in-prefill memory check ran AFTER self.model()+mx.eval. On Apple UMA a chunk that overshoots the Metal wired limit kernel-panics the whole machine, so the post-forward Python check never runs. glm4.5-air-106b (85GB) on a 128GB box left ~22GB for KV+prefill; a per-chunk MoE-dequant transient (measured up to 7.4GB; peaks hit 110GB vs the 107.5GB cap) crossed the cap and rebooted the box twice.

- Scheduler._prefill_forward_gate: before each prefill forward (external loop + chunked-step mirror), predict current (high-water: max of active/phys/recent_peak) + estimate_prefill_peak_bytes (KV+SDPA) + a conservative transient margin; if the sum would exceed the hard cap, raise BEFORE the forward. The existing jundot#1405 cleanup converts that RuntimeError into a finish_reason="error" output -- the request is refused cleanly and the machine does not crash. The legacy post-forward check stays as a backstop.
- New MemorySettings.prefill_transient_margin_gb (default 10GB), propagated settings -> enforcer -> scheduler. estimate_prefill_peak_bytes does not model the MoE expert-dequant spike, so this margin carries the guarantee: margin (10) > worst observed single-step jump (7.4GB).
- Preflight now also takes the max against the enforcer's recent high-water mark.
- Reverts the model_load_prefill_headroom_gb load rejection (user feedback: a model must be loadable; refuse the request, not the model).

Honest residual: the gate reads current just after the prior chunk's cache clear (the active trough) and leans on phys_footprint stickiness + recent_peak to avoid a trough misread; a misread could still admit a crashing chunk. This is not a literal never-panic guarantee -- it needs on-hardware [memgate] log validation; the real fix is preemptive KV offload (separate work).

Tests: 12 new (gate raise/pass, margin-is-the-trip, no-op guards, integration asserting the model forward is NOT called over-cap), verified to fail with the gate neutered. Full suite 4542 pass / 3 known api_key fail / 19 skip on m2max -- zero regression.
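The gate's predicate from the commit message above can be sketched as follows. This is a minimal illustration, not the code in the PR: the function names, the PrefillOverBudget type and the GB constant are hypothetical, and the only behaviour taken from the source is "high-water current + estimate + margin, raise before the forward if that exceeds the hard cap, no-op when the cap is unset".

```python
from typing import Optional

GB = 1024 ** 3  # illustrative unit only; the real code may count bytes differently


class PrefillOverBudget(RuntimeError):
    """Hypothetical error type; the source only says a RuntimeError is raised."""


def predicted_peak_bytes(active: int, phys: int, recent_peak: int,
                         estimate: int, margin: int) -> int:
    # "current" is the high-water mark across the three readings.
    current = max(active, phys, recent_peak)
    return current + estimate + margin


def prefill_forward_gate(*, active: int, phys: int, recent_peak: int,
                         estimate: int, margin: int,
                         hard_cap: Optional[int]) -> int:
    """Raise BEFORE the forward pass if the predicted peak exceeds hard_cap.

    Returns the predicted peak so a caller could log it.
    """
    predicted = predicted_peak_bytes(active, phys, recent_peak, estimate, margin)
    # No-op guard: without a resolved hard cap there is nothing to compare against.
    if hard_cap is not None and predicted > hard_cap:
        raise PrefillOverBudget(
            f"predicted prefill peak {predicted / GB:.1f}GB "
            f"exceeds hard cap {hard_cap / GB:.1f}GB"
        )
    return predicted
```

With a 97GB high-water, a 2GB estimate and a 107.5GB cap, the call passes with margin 0 (99GB) and raises with margin 10GB (109GB), which is the "margin-is-the-trip" case the tests describe.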
Make request admission memory-aware so concurrent prefills can no longer stack past the Metal cap and kernel-panic the whole m5max box. Admission was capped only by max_num_seqs (a fixed concurrency), so N requests against an 85GB model each ran prefill and their transients summed past the ~22GB of headroom.

Primary fix -- entry-point budget admission (Scheduler._schedule_waiting): before admitting a waiting request while work is in flight, predict its prefill peak (_predicted_prefill_peak_bytes = current high-water + KV+SDPA estimate + transient margin); if it would breach the hard cap, leave it QUEUED (appendleft + break, NOT rejected) and re-check on the next step, once in-flight work has freed its KV. Concurrency becomes adaptive with no per-model knob: a memory-rich model packs many requests, an 85GB model collapses to 1 and queues the rest. The first request (nothing in flight) is never deferred -- a lone request that cannot fit even alone is instead REJECTED by _preflight_memory_check (now sharing the same predicate, so it too carries the margin). A reading taken at admission time, between generation steps, is a stable high-water, not the mid-prefill chunk trough the forward gate had to read.

In flight means running OR prefilling (_has_inflight_kv), NOT running alone. The documented crash is concurrent CHUNKED prefills stacking, and a request mid-chunked-prefill lives in self.prefilling, not self.running. Gating on self.running alone would wave the 2nd, 3rd, ... prefill straight past the budget into the exact stack this guard exists to prevent. self.prefilling always drains, so deferring on it cannot deadlock; a lone request still hits the preflight reject (both empty).

Margin 10 -> 12GB. The un-modelled MoE expert-dequant transient is SUB-POLL (faster than the enforcer's 1s sample), so it is invisible to every memory read and MUST be carried by the margin, not by reading the footprint more cleverly. The 2026-06-06 m5max crash showed an effective transient up to ~10.6GB; margin 10 was the actual root cause of the prior forward-gate miss (10 < 10.6), not the trough read. 12 is ceil(10.6) = 11 plus 1GB of padding.

The forward gate (_prefill_forward_gate) is kept, un-retired, as a per-chunk CONCURRENT-DRIFT backstop: admission snapshots memory once, while the gate re-reads before every chunk and catches another in-flight request's KV growing during a long prefill. Because admission uses the full-prompt estimate (>= the gate's per-chunk estimate) with the same margin, a correctly admitted request cannot trip the gate under static memory -- it only fires on post-admission drift. The margin 12 fix repairs the gate's prior miss too.

The pre-pop generation soft-guard is left as a complementary coarse early-out (soft limit, request-agnostic); the budget defer is the fine, request-aware, hard-cap check. The defer decision logs at info ([memadmit]) so it is visible during on-hardware validation alongside the enforcer [memcheck] ceiling.

Honest residual: a single huge-context request on an 85GB model can still need a preflight reject (or a smaller quant) -- entry admission solves concurrent stacking, not a lone prefill that physically cannot fit. The stable-read assumption still needs on-hardware [memadmit]/[memcheck] validation under concurrent + long glm4.5 load before this is proven.

Tests: 14 new (8 predicate unit tests, including recent_peak folding in when running OR prefilling; budget-defer requeues-not-rejects for both the running and the chunked-prefill-only case; the dominance property that an admitted request never trips the gate under static memory, plus the gate firing under drift; lone-request ignores stale recent_peak). Full suite 4556 pass / 3 known api_key fail / 19 skip on m2max -- zero regression vs the baseline.
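The margin arithmetic and the dominance property in the commit message above can be checked with a few lines. This is a worked sketch under stated assumptions: all names are hypothetical, the 1GB pad is inferred only from 12 - ceil(10.6), and memory is modelled as plain GB floats rather than whatever the scheduler actually reads.

```python
import math

OBSERVED_TRANSIENT_GB = 10.6   # effective transient quoted for the 2026-06-06 crash
OLD_MARGIN_GB = 10
PAD_GB = 1                     # assumption: the padding that takes ceil(10.6) = 11 up to 12
NEW_MARGIN_GB = math.ceil(OBSERVED_TRANSIENT_GB) + PAD_GB


def fits(current_gb: float, estimate_gb: float, margin_gb: float, cap_gb: float) -> bool:
    # Shared predicate: predicted peak must not exceed the hard cap.
    return current_gb + estimate_gb + margin_gb <= cap_gb


def gate_passes_all_chunks(current_gb: float, chunk_estimates_gb: list,
                           margin_gb: float, cap_gb: float) -> bool:
    # The forward gate re-evaluates the same predicate before every chunk.
    return all(fits(current_gb, e, margin_gb, cap_gb) for e in chunk_estimates_gb)


# Old margin is below the observed transient; the new one is above it.
print(OLD_MARGIN_GB < OBSERVED_TRANSIENT_GB < NEW_MARGIN_GB)

# Dominance: admitted with the full-prompt estimate (5GB), so every per-chunk
# estimate (<= 5GB) also fits while "current" stays at 90GB.
print(fits(90.0, 5.0, NEW_MARGIN_GB, 107.5),
      gate_passes_all_chunks(90.0, [1.0, 2.5, 5.0], NEW_MARGIN_GB, 107.5))

# Drift: another request grows "current" to 95GB after admission, and the gate fires.
print(gate_passes_all_chunks(95.0, [1.0], NEW_MARGIN_GB, 107.5))
```

The dominance argument is just monotonicity: with current and margin fixed, lowering the estimate can only lower the predicted peak, so the gate can trip only if current moved after admission.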
BLOCKED -- do not merge: inert in production

On-hardware validation (m5max, glm4.5-air-moe-106b-a12b-6bit, 2026-06-06) found this fix is INERT in production. Root cause is pre-existing and bigger than this PR.

Static confirmation: only assignment in the repo is

Empirical: a single ~22K-token glm4.5 prefill drove current to 107.525GB = OVER_HARD (past the 107.5 cap) with memadmit=0 and memgate=0. Only the legacy reactive

Reframe: glm4.5 runs the external prefill loop, which serializes prefills inline -- so the 4 concurrent requests were already serialized by architecture (not by this PR). The actual breach was a single request's transient. So the real fix target is single-request per-chunk protection (the forward gate), and it needs the estimate path to actually be live. This PR's concurrent-admission queue addresses a path glm4.5 does not use.

Next step (scope change, pending owner decision): either revive the monitor path, after determining whether it was intentionally retired and is now vestigial (reviving means both wiring + dims, which also reactivates the never-run KV-eviction subsystem), or compute the estimate inline from config (head_dim etc. are present). Then log the guard's resolved state at startup so this class of silent inertness cannot ship again. Re-validation requires another m5max panic-risk round.
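One way to "log the guard's resolved state at startup" is sketched below. It is a hypothetical helper, not code from this repo: the GuardState type, the function name, the logger name and the three input names are all assumptions, and the only idea taken from the comment above is that a guard whose inputs did not resolve should say so loudly instead of silently no-op'ing.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

logger = logging.getLogger("memguard")  # hypothetical logger name


@dataclass(frozen=True)
class GuardState:
    name: str
    live: bool
    missing: Tuple[str, ...]


def resolve_guard_state(name: str, *,
                        hard_cap_bytes: Optional[int],
                        margin_bytes: Optional[int],
                        estimator: Optional[Callable[..., int]]) -> GuardState:
    """Report whether a guard has every input it needs to ever fire."""
    inputs = {
        "hard_cap_bytes": hard_cap_bytes,
        "margin_bytes": margin_bytes,
        "estimator": estimator,
    }
    # Any unresolved input means the guard's None checks will make it a no-op.
    missing = tuple(key for key, value in inputs.items() if value is None)
    state = GuardState(name=name, live=not missing, missing=missing)
    if state.live:
        logger.info("[%s] live", name)
    else:
        logger.warning("[%s] INERT at startup: missing %s", name, ", ".join(missing))
    return state
```

Calling this once per guard (for example for the [memadmit] and [memgate] paths) at startup would make an inert configuration visible in the first lines of the log rather than after a panic-risk validation round.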
Problem
m5max kernel-panics the whole box when concurrent requests against a large model (glm4.5-air-106b, 85GB on a 128GB box, ~22GB headroom) each run prefill and their memory transients sum past the Metal cap. Admission was bounded only by max_num_seqs (a fixed concurrency), so it had no notion of memory budget -- N prefills could stack.
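The stacking arithmetic can be made concrete with a toy model. It assumes, purely for illustration, that each concurrent prefill adds one fixed transient on top of the 85GB model and that KV growth is ignored, so the real limit is lower; the 107.5GB cap and the 7.4GB / 10.6GB transients are the figures quoted in the commit messages, and the function names are hypothetical.

```python
import math

CAP_GB = 107.5    # hard cap quoted in the commit messages
MODEL_GB = 85.0   # glm4.5-air-106b resident size


def headroom_gb() -> float:
    return CAP_GB - MODEL_GB


def stacked_peak_gb(n_prefills: int, transient_gb: float) -> float:
    # Toy model: transients of concurrent prefills simply add up.
    return MODEL_GB + n_prefills * transient_gb


def max_stackable(transient_gb: float) -> int:
    # Largest N whose stacked transients still fit in the headroom.
    return math.floor(headroom_gb() / transient_gb)


print(headroom_gb())                  # 22.5
print(max_stackable(7.4))             # 3
print(max_stackable(10.6))            # 2
print(stacked_peak_gb(4, 7.4) > CAP_GB)  # True
```

A fixed max_num_seqs above these counts has no way to notice the breach, which is the gap the memory-aware admission is meant to close.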
Fix
Make admission memory-aware, at the entry point where the read is stable (between generation steps), not mid-prefill.
- Entry-point budget admission (Scheduler._schedule_waiting): while work is in flight, predict the incoming request's prefill peak (_predicted_prefill_peak_bytes = current high-water + KV+SDPA estimate + transient margin). If it would breach the hard cap, leave it QUEUED (appendleft + break, not rejected); it is re-checked every step() and admitted once in-flight work frees its KV. Concurrency becomes adaptive with no per-model knob: memory-rich models pack many requests, an 85GB model collapses to 1 and queues the rest.
- In flight means running OR prefilling (_has_inflight_kv). The crash is concurrent chunked prefills stacking, and a request mid-chunked-prefill lives in self.prefilling, not self.running -- gating on self.running alone would wave the 2nd/3rd prefill straight into the stack. self.prefilling always drains, so this cannot deadlock.
- A lone request that cannot fit even alone is rejected by _preflight_memory_check (now sharing the same predicate, so it carries the margin too). Nothing would drain to admit it, so queuing would deadlock -- reject (503-class) instead.

Supersedes #49 (the forward-gate-only branch, which under-reported because it read the mid-prefill trough with margin 10).
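The defer-not-reject behaviour described above can be sketched with a toy scheduler. Everything here is hypothetical (MiniScheduler, Request, the field names, and the way current is bumped on admission); it only mirrors the stated rules: in flight means running OR prefilling, an over-budget request goes back to the front of the queue and scheduling stops, and a request with nothing in flight is never deferred. The preflight reject of a lone request that cannot fit is deliberately left out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Request:
    rid: str
    estimate: int  # hypothetical KV+SDPA prefill estimate, arbitrary units


@dataclass
class MiniScheduler:
    hard_cap: int
    margin: int
    current: int  # stand-in for the high-water memory reading
    waiting: Deque[Request] = field(default_factory=deque)
    running: List[Request] = field(default_factory=list)
    prefilling: List[Request] = field(default_factory=list)

    def has_inflight_kv(self) -> bool:
        # In flight = running OR prefilling, not running alone.
        return bool(self.running or self.prefilling)

    def schedule_waiting(self) -> List[str]:
        admitted = []
        while self.waiting:
            req = self.waiting.popleft()
            predicted = self.current + req.estimate + self.margin
            if self.has_inflight_kv() and predicted > self.hard_cap:
                # Defer: keep queue order and retry on the next step.
                self.waiting.appendleft(req)
                break
            self.prefilling.append(req)
            self.current += req.estimate  # toy accounting of the admitted request
            admitted.append(req.rid)
        return admitted


sched = MiniScheduler(hard_cap=100, margin=12, current=60)
sched.waiting.extend([Request("a", 10), Request("b", 10), Request("c", 10)])
print(sched.schedule_waiting())           # ['a', 'b']
print([r.rid for r in sched.waiting])     # ['c']
```

Note that request "c" is deferred while only self.prefilling is populated and self.running is empty, which is the chunked-prefill-only case the second bullet calls out.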
Honest residual
A single huge-context request on an 85GB model can still need a preflight reject (or a smaller quant). Entry admission solves concurrent stacking, not a lone prefill that physically cannot fit.
Test plan
Targeted: tests/test_scheduler_budget_admission.py (14 new), test_scheduler_admission.py, test_scheduler_prefill_forward_gate.py, test_process_memory_enforcer.py, test_memory_monitor.py -- all green.

New tests cover: the shared predicate (margin folded, recent_peak folded only when running OR prefilling, cached-tokens accounting, all None guards); budget-defer requeues-not-rejects for both the running case and the chunked-prefill-only case; the dominance property (an admitted request never trips the gate under static memory, using a REAL MemoryMonitor) + the gate firing under drift; lone-request ignores stale recent_peak.

Full suite on m2max: 4556 pass / 3 known api_key fail / 19 skip -- zero regression vs the baseline (the 3 api_key fails are pre-existing, verified by re-running them on the stashed baseline).
REQUIRED before merge -- on-hardware validation (not yet done)
Unit-green is necessary, not sufficient (the prior forward gate passed 12 unit tests + the full suite and still under-reported on real hardware). This needs m5max validation: load glm4.5-air-106b, drive concurrent + long requests, and confirm that [memcheck:external] current never crosses the cap and that [memadmit] shows a clean queue/503 rather than a panic. Ops-risky (can panic the prod box), so it is gated on explicit go-ahead. This is a fix/* branch -- human review + merge, no AI self-merge.