Skip to content

sync: independent upstream fixes (boundary-store, per-engine threads, VLM, profiles)#8

Merged
panwudi merged 10 commits into
mainfrom
sync/upstream-2026-05-28
May 27, 2026
Merged

sync: independent upstream fixes (boundary-store, per-engine threads, VLM, profiles)#8
panwudi merged 10 commits into
mainfrom
sync/upstream-2026-05-28

Conversation

@panwudi

@panwudi panwudi commented May 27, 2026

Copy link
Copy Markdown
Owner

Summary

Catch-up sync of upstream commits that landed since v0.3.12 and are independent of the memory_guard_tier work. Notably fixes the boundary-store flake we have been carrying in the baseline (test_cleanup_all_drains_queue) and brings in per-engine MLX streams to eliminate cross-engine contamination during multi-model serving.

upstream content
4f3a9b9 (jundot#1423) fix(boundary-store): serialize cleanup_all + cleanup_request with the writer thread
bc1c427 cleanup(boundary-store): drop unreachable shutdown(cleanup=) path
2916ab4 (jundot#1422) cleanup(cache): remove dead TieredCacheManager
56860b3 (jundot#1304) fix(engine): per-engine threads to eliminate cross-engine stream contamination
b7cb489 (a62f953) refactor(engine): remove redundant _ensure_wired_limit guard
c50d64e (e6d8a3f jundot#1445) test(mtp): drop monkeypatch of removed _get_generation_stream
f554f19 (2e698ff jundot#1437) fix(scheduler): wait on generation_stream in store-cache worker sync
ebf2c21 (9d5bed8) fix(engine): materialize VLM model lazy state on loader thread
414b843 (ff7522b) fix(load): skip VLM MTPModule attach when checkpoint lacks mtp.* weights
59d9e7e (0c881f5 jundot#1399) refactor(profiles): three-scope template contract + drop is_builtin emission

Conflict resolution

Test plan

  • Full suite: 4567 pass / 3 fail / 36 skip / 59 deselected (244s).
  • +31 tests vs main; covers per-engine threads (tests/test_per_engine_threads.py), boundary-store cleanup races, and the existing scheduler / VLM paths.
  • boundary_snapshot flake eliminated: tests/test_boundary_snapshot_store.py::TestBoundarySnapshotSSDStore::test_cleanup_all_drains_queue (previously ~20% flake) passes cleanly. That removes one of the 4 baseline failures; new baseline becomes the 3 known OMLX_API_KEY env-override flakes.
  • Smoke import all modified modules.

Standing authorization (sync/*)

All 5 criteria satisfied. Self-merging per user delegation.

cfbraun and others added 10 commits May 27, 2026 05:08
…th the writer thread (jundot#1423)

test_cleanup_all_drains_queue failed ~20% of the time, in isolation
and in suites. The race:

  T0 save() puts item in queue
  T1 writer queue.get() pulls item (already off the queue)
  T2 cleanup_all() drains queue — finds it empty
  T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir)
  T4 writer mkdir(req_dir) + temp write + rename(final)

Result: ``req-X/N.safetensors`` survives the supposed cleanup, the
caller (scheduler / shutdown) believes the snapshot dir is empty, and
the leftover file is the disk shadow of a request that should have
been forgotten.

Add ``_writer_busy`` (threading.Lock). The writer holds it for the
entire duration of each ``_process_write_item`` call. ``cleanup_all()``
drains the queue first (no new items can start) and then acquires
``_writer_busy``, so any item the writer had already pulled finishes
before rmtree runs.

Same class of race exists for ``cleanup_request()`` — writer pulls a
request's item before cleanup_request sets the cancelled-counter, then
mkdir's the req_dir back into existence after the rmtree, and the
late-rename rescue at the tail of ``_process_write_item`` catches it
in most timings but a small window remains. Apply the same
``_writer_busy`` barrier (bounded, see below) so the two cleanup paths
are symmetric.

Per-path timeouts so cleanup_request can yield faster than
cleanup_all:

- ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset /
  startup where blocking longer is tolerable in exchange for a
  stronger orphan-avoidance guarantee.
- ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called
  from the scheduler's abort hot path (~3 sites) where bounding
  latency matters more than chasing one in-flight write.

The bounded acquires have a logged-warning fallback. Without that
bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock
the scheduler's hot path — cleanup_all() is called from request abort
/ reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The
worst-case fallback behaviour is an orphaned file under the recreated
snapshot dir until next startup's constructor cleanup, which is the
exact same state the pre-fix code produced on every cleanup_all.

Counter pop on cleanup_request now only runs when the
``_writer_busy`` acquire succeeded. On the timeout fallback the
writer is still mid-item and may not yet have consulted
``_is_cancelled``; popping the counter there would defeat the late-
rename rescue that the docstring advertises as the timeout-
fallback safety net. The previous code popped unconditionally.

Counter bump itself is now additive (``get + count``) and skipped
entirely when ``count == 0``. Two distinct bugs that closes:

- Skip-on-zero: cleanup_request("X") for an rid with NO pending
  items previously wrote ``cancelled[X] = 0`` then popped on the
  acquired path. On the timeout fallback the pop never ran and the
  ``X: 0`` entry lingered for the process lifetime — every later
  ``save()`` under that rid (or any reuse of the string) was
  silently discarded by the writer's ``_is_cancelled`` gates,
  which check key membership not value > 0. Counter must only
  exist when there is at least one in-flight item to drain it.
  Regression: test_cleanup_request_no_pending_does_not_pin_counter
  _on_timeout (new).
- Additive: a re-entrant cleanup_request("X") for an rid that
  already has an in-flight cancellation must NOT overwrite the
  previous count. The writer's ``cleared_by_cleanup`` branch +
  ``_writer_busy`` lock together close the file-write race today
  (so no orphan file slips out), but the per-item dec_cancelled
  bookkeeping has to balance — overwriting drops remaining decs on
  the floor and on the next ``save()`` under the same rid the
  writer would see a non-zero counter from the earlier batch and
  silently discard the new item.

shutdown() now accepts ``cleanup=True`` so callers that want both
operations can express the cleanup-before-shutdown ordering in one
call instead of sequencing them manually. The cleanup_all() warning
at the top of the function catches misordered callers that still
call cleanup_all() AFTER shutdown — the writer no longer reacquires
``_writer_busy`` past the sentinel, so a post-shutdown cleanup
degrades to an in-memory-only clear.

Tests:
- ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the
  lock to guarantee ordering, runs deterministically.
- ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` —
  monkey-patches ``_write_safetensors_no_mx`` to block on an Event,
  pins the writer mid-item, asserts cleanup_all does NOT return until
  the writer releases. Without the lock this test fails
  deterministically rather than the original ~20% flake rate.
- ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` —
  symmetric for cleanup_request.
- ``test_cleanup_request_keeps_counter_on_timeout`` — regression for
  the bug where the timeout-fallback pop dropped the late-rename
  rescue's safety net.
- ``test_cleanup_request_timeout_drains_counter_on_writer_early_return``
  — the writer's cleared_by_cleanup early-return must still
  dec_cancelled or the counter pins the rid forever.
- ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout``
  (new) — regression for the skip-on-zero bug above. Pins the writer
  with an unrelated save, calls cleanup_request("never-saved-rid")
  past the 0.1s timeout, asserts the rid does NOT appear in
  ``_cancelled_requests`` AND that a subsequent save under the same
  rid is not silently discarded.
- ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` —
  pins the cleanup-before-shutdown ordering of the new
  ``shutdown(cleanup=True)`` path.

24/24 boundary-store tests pass.

(cherry picked from commit 4f3a9b9)
Scheduler.shutdown() never calls _boundary_snapshot_store.shutdown(), so _shutdown.is_set() is always False in production. The post-shutdown branches in cleanup_request / cleanup_all and the cleanup= kwarg on shutdown() were dead code, kept alive only by a self-referential regression test. Drops them along with the test.

(cherry picked from commit bc1c427)
Verified unused: zero instantiations across the entire git history
(`git log -S 'TieredCacheManager('` returns only the initial-commit
declaration), zero imports outside the module's own __init__ re-export,
zero test references. The class was a planned coordinator between
PagedCacheManager / BlockAwarePrefixCache / PagedSSDCacheManager /
MemoryMonitor — Scheduler.__init__ does that wiring directly, making
the abstraction redundant.

Removes 353 lines plus the re-export from omlx.cache.__all__.

(cherry picked from commit 2916ab4)
…amination (jundot#1304)

Replace the shared _global_mlx_executor with per-EngineCore
ThreadPoolExecutor + mx.Stream, and fix the MTP patch reading the
module-level generation_stream instead of the per-engine stream.

(cherry picked from commit 56860b3)
BatchGenerator.__init__ already calls mx.set_wired_limit() on each
instance, and concurrent calls with the same value are race-free
(verified empirically). The guard never prevented the race it claimed
to fix.

Follow-up to jundot#1304.

(cherry picked from commit a62f953)
…ndot#1445)

jundot#1304 (``fix(engine): per-engine threads to eliminate cross-engine
stream contamination``) refactored the patched ``BatchGenerator`` to
inherit its execution stream from the enclosing engine context, and
removed the module-level ``_get_generation_stream`` helper as part of
that. ``TestBatchGeneratorDispatch._make_reconcile_batch`` still tried
to monkeypatch that name and failed at collection of every test that
depends on the fixture with ``AttributeError: module ... has no
attribute '_get_generation_stream'`` — taking down 4 reconcile-path
tests on every CI run.

The override is no longer needed: the surrounding fixture replaces
``_rebuild_singleton_cache`` and ``_call_backbone`` with fakes that
do all of their work via ``np.array`` / ``mx.array`` directly, so
neither MLX dispatch nor stream selection is reached.

Tests (tests/test_mlx_lm_mtp_patch.py::TestBatchGeneratorDispatch):
- test_reconcile_uses_queue_front_as_next_token
- test_reconcile_empty_queue_samples_from_logits
- test_reconcile_returns_false_on_empty_tokens
- test_reconcile_fallback_on_rebuild_failure

11/11 TestBatchGeneratorDispatch tests pass.

(cherry picked from commit e6d8a3f)
…undot#1437)

mx.synchronize() with no args only waits on the default stream (gpu:0), not the generation_stream (gpu:2) where mx.async_eval dispatched the arrays. The lazy memoryview() in _extract_tensor_bytes triggered a cross-thread eval that aborted at get_command_encoder(gpu:2). Route the worker sync through _safe_sync_generation_stream so the wait targets the correct stream. Closes jundot#1437.

(cherry picked from commit 2e698ff)
mlx-vlm's load() only materializes model.language_model.parameters(), leaving frozen buffers (RoPE freqs) and sibling sub-trees (vision_tower, audio_tower) as lazy arrays bound to the loader thread's default stream. Pre-jundot#1304 this was invisible because loader and forward shared one global thread; the per-engine executor split exposed it as "no Stream(gpu, X) in current thread" when mx.eval touches a sibling buffer during prefill.

Fix: materialize the full model tree on the loader thread right after load. Verified against gemma-4-E2B-it and gemma-4-31b-it-4bit.

Also fixes test_safe_sync_passes_generation_stream to match the _default_generation_stream alias introduced by jundot#1304.

Reported by @zviratko in jundot#1304.

(cherry picked from commit 9d5bed8)
Some Qwen3.6 MoE VLM exports declare mtp_num_hidden_layers > 0 in
config.json but ship no mtp.* weights in the safetensors (unsloth
Qwen3.6 UD MLX builds across 3bit/4bit). PR jundot#1404 unconditionally
attaches MTPModule whenever the config declares MTP heads so persisted
weights have a binding site; for these weight-stripped checkpoints there
are no weights to bind, mlx-vlm strict load_weights fails with
"Missing N parameters: language_model.mtp.*", the engine silently falls
back to LLM and vision is dropped. The user just sees the model answer
as if no image was attached, hence the hallucination report.

Add _checkpoint_has_mtp_weights to scan model.safetensors.index.json
(or the first safetensors shard header) for any mtp.* /
language_model.mtp.* / model.mtp.* key. Flip
set_mtp_attach_enabled(False) before mlx_vlm.utils.load() runs when the
scan misses, so the runtime patch's __init__ wrap skips attachment.
PR jundot#1404's intended behavior is preserved when the checkpoint does
ship mtp.* weights.

Smoke test: unsloth_Qwen3.6-35B-A3B-UD-MLX-3bit + red square PNG.
Before: "the user hasn't provided an image" (vision dropped).
After:  "Red square" identified correctly
        (prompt_tokens 25 -> 99 with vision tokens).

Fixes jundot#1426.

(cherry picked from commit ff7522b)
…mission (jundot#1399)

* feat(profiles): three-scope template fields + persistence

Adds the universal/per-model field split and built-in preset templates
shipped on first run. UNIVERSAL_PROFILE_FIELDS gates which keys a global
template may carry; per-model profiles continue to accept the full
sampling surface. New default_global_profile.json seeds the preset
catalog when the user has no templates of their own.

Server-side foundation for the design's three-scope profile model
(preset / global / model) the Swift app now binds to.

* refactor(profiles): retire server builtins, align on preset bundle

Server-side slice. Drops the 4 hard-coded preset templates the
ModelSettingsManager shipped on first run and lets clients source their
catalog from `omlx/admin/static/omlx_preset.json` (the bundle that's
already refreshable from omlx.ai). The 4 builtin entries were never on
omlx.ai's side, so both HTML and the upcoming Swift app converge on the
bundle's 10-entry catalog with no drift.

- Delete `omlx/template/default_global_profile.json` and the
  `_get_builtin_templates()` machinery in `ModelSettingsManager`.
- Drop the rename/save/delete rejection branches that special-cased
  builtin names — `/api/profile-templates` now serves user templates
  only. `is_builtin` stays on the response schema (always false) so
  back-compat with older HTML/Swift clients is intact; a follow-up
  removes it once both clients no longer key on the field.
- Sync `model_profiles.py` to drop `load_default_global_templates()` —
  it's dead code once the builtin source is gone.
- Tests: drop the qwen3.6 builtin assertions; keep CRUD + persistence
  coverage.

Carved out of the Swift app PR (jundot#1371) for focused review per
maintainer ask — the Swift-side preset bundle client lives there.

* chore(profiles): stop emitting is_builtin on template responses

`ModelSettingsManager` previously stamped `is_builtin: false` on every
template payload — belt-and-suspenders left over from when server-side
builtins existed. With those retired (prior commit), the flag is dead
weight: HTML doesn't read it, and the Swift app gates on
`templateScope` (defaults to `.global` when the field is absent) so
older clients keep working unchanged.

Schema-compat note: the field stays decodable on the client side as
`Optional<Bool>` — we're just dropping the emission. Older clients
that still expect the key continue to work because their decoders
treat it as optional.

Carved out of the Swift app PR (jundot#1371). The Swift-side stale-reference
patches and packaging/build.py cleanup that were also in the original
commit stay in jundot#1371 since they touch Swift sources.

(cherry picked from commit 0c881f5)
@panwudi panwudi merged commit 403c278 into main May 27, 2026
panwudi added a commit that referenced this pull request May 27, 2026
Docs only catch-up for #5 #6 #7 #8. Standing-authorization sync/* self-merge.

---

Docs only 跟踪 #5 #6 #7 #8 的落地记录. Standing-authorization sync/* self-merge.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants