sync: independent upstream fixes (boundary-store, per-engine threads, VLM, profiles)#8
Merged
Conversation
…th the writer thread (jundot#1423) test_cleanup_all_drains_queue failed ~20% of the time, in isolation and in suites. The race: T0 save() puts item in queue T1 writer queue.get() pulls item (already off the queue) T2 cleanup_all() drains queue — finds it empty T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir) T4 writer mkdir(req_dir) + temp write + rename(final) Result: ``req-X/N.safetensors`` survives the supposed cleanup, the caller (scheduler / shutdown) believes the snapshot dir is empty, and the leftover file is the disk shadow of a request that should have been forgotten. Add ``_writer_busy`` (threading.Lock). The writer holds it for the entire duration of each ``_process_write_item`` call. ``cleanup_all()`` drains the queue first (no new items can start) and then acquires ``_writer_busy``, so any item the writer had already pulled finishes before rmtree runs. Same class of race exists for ``cleanup_request()`` — writer pulls a request's item before cleanup_request sets the cancelled-counter, then mkdir's the req_dir back into existence after the rmtree, and the late-rename rescue at the tail of ``_process_write_item`` catches it in most timings but a small window remains. Apply the same ``_writer_busy`` barrier (bounded, see below) so the two cleanup paths are symmetric. Per-path timeouts so cleanup_request can yield faster than cleanup_all: - ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset / startup where blocking longer is tolerable in exchange for a stronger orphan-avoidance guarantee. - ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called from the scheduler's abort hot path (~3 sites) where bounding latency matters more than chasing one in-flight write. The bounded acquires have a logged-warning fallback. Without that bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock the scheduler's hot path — cleanup_all() is called from request abort / reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The worst-case fallback behaviour is an orphaned file under the recreated snapshot dir until next startup's constructor cleanup, which is the exact same state the pre-fix code produced on every cleanup_all. Counter pop on cleanup_request now only runs when the ``_writer_busy`` acquire succeeded. On the timeout fallback the writer is still mid-item and may not yet have consulted ``_is_cancelled``; popping the counter there would defeat the late- rename rescue that the docstring advertises as the timeout- fallback safety net. The previous code popped unconditionally. Counter bump itself is now additive (``get + count``) and skipped entirely when ``count == 0``. Two distinct bugs that closes: - Skip-on-zero: cleanup_request("X") for an rid with NO pending items previously wrote ``cancelled[X] = 0`` then popped on the acquired path. On the timeout fallback the pop never ran and the ``X: 0`` entry lingered for the process lifetime — every later ``save()`` under that rid (or any reuse of the string) was silently discarded by the writer's ``_is_cancelled`` gates, which check key membership not value > 0. Counter must only exist when there is at least one in-flight item to drain it. Regression: test_cleanup_request_no_pending_does_not_pin_counter _on_timeout (new). - Additive: a re-entrant cleanup_request("X") for an rid that already has an in-flight cancellation must NOT overwrite the previous count. The writer's ``cleared_by_cleanup`` branch + ``_writer_busy`` lock together close the file-write race today (so no orphan file slips out), but the per-item dec_cancelled bookkeeping has to balance — overwriting drops remaining decs on the floor and on the next ``save()`` under the same rid the writer would see a non-zero counter from the earlier batch and silently discard the new item. shutdown() now accepts ``cleanup=True`` so callers that want both operations can express the cleanup-before-shutdown ordering in one call instead of sequencing them manually. The cleanup_all() warning at the top of the function catches misordered callers that still call cleanup_all() AFTER shutdown — the writer no longer reacquires ``_writer_busy`` past the sentinel, so a post-shutdown cleanup degrades to an in-memory-only clear. Tests: - ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the lock to guarantee ordering, runs deterministically. - ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` — monkey-patches ``_write_safetensors_no_mx`` to block on an Event, pins the writer mid-item, asserts cleanup_all does NOT return until the writer releases. Without the lock this test fails deterministically rather than the original ~20% flake rate. - ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` — symmetric for cleanup_request. - ``test_cleanup_request_keeps_counter_on_timeout`` — regression for the bug where the timeout-fallback pop dropped the late-rename rescue's safety net. - ``test_cleanup_request_timeout_drains_counter_on_writer_early_return`` — the writer's cleared_by_cleanup early-return must still dec_cancelled or the counter pins the rid forever. - ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout`` (new) — regression for the skip-on-zero bug above. Pins the writer with an unrelated save, calls cleanup_request("never-saved-rid") past the 0.1s timeout, asserts the rid does NOT appear in ``_cancelled_requests`` AND that a subsequent save under the same rid is not silently discarded. - ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` — pins the cleanup-before-shutdown ordering of the new ``shutdown(cleanup=True)`` path. 24/24 boundary-store tests pass. (cherry picked from commit 4f3a9b9)
Scheduler.shutdown() never calls _boundary_snapshot_store.shutdown(), so _shutdown.is_set() is always False in production. The post-shutdown branches in cleanup_request / cleanup_all and the cleanup= kwarg on shutdown() were dead code, kept alive only by a self-referential regression test. Drops them along with the test. (cherry picked from commit bc1c427)
Verified unused: zero instantiations across the entire git history
(`git log -S 'TieredCacheManager('` returns only the initial-commit
declaration), zero imports outside the module's own __init__ re-export,
zero test references. The class was a planned coordinator between
PagedCacheManager / BlockAwarePrefixCache / PagedSSDCacheManager /
MemoryMonitor — Scheduler.__init__ does that wiring directly, making
the abstraction redundant.
Removes 353 lines plus the re-export from omlx.cache.__all__.
(cherry picked from commit 2916ab4)
…amination (jundot#1304) Replace the shared _global_mlx_executor with per-EngineCore ThreadPoolExecutor + mx.Stream, and fix the MTP patch reading the module-level generation_stream instead of the per-engine stream. (cherry picked from commit 56860b3)
BatchGenerator.__init__ already calls mx.set_wired_limit() on each instance, and concurrent calls with the same value are race-free (verified empirically). The guard never prevented the race it claimed to fix. Follow-up to jundot#1304. (cherry picked from commit a62f953)
…ndot#1445) jundot#1304 (``fix(engine): per-engine threads to eliminate cross-engine stream contamination``) refactored the patched ``BatchGenerator`` to inherit its execution stream from the enclosing engine context, and removed the module-level ``_get_generation_stream`` helper as part of that. ``TestBatchGeneratorDispatch._make_reconcile_batch`` still tried to monkeypatch that name and failed at collection of every test that depends on the fixture with ``AttributeError: module ... has no attribute '_get_generation_stream'`` — taking down 4 reconcile-path tests on every CI run. The override is no longer needed: the surrounding fixture replaces ``_rebuild_singleton_cache`` and ``_call_backbone`` with fakes that do all of their work via ``np.array`` / ``mx.array`` directly, so neither MLX dispatch nor stream selection is reached. Tests (tests/test_mlx_lm_mtp_patch.py::TestBatchGeneratorDispatch): - test_reconcile_uses_queue_front_as_next_token - test_reconcile_empty_queue_samples_from_logits - test_reconcile_returns_false_on_empty_tokens - test_reconcile_fallback_on_rebuild_failure 11/11 TestBatchGeneratorDispatch tests pass. (cherry picked from commit e6d8a3f)
…undot#1437) mx.synchronize() with no args only waits on the default stream (gpu:0), not the generation_stream (gpu:2) where mx.async_eval dispatched the arrays. The lazy memoryview() in _extract_tensor_bytes triggered a cross-thread eval that aborted at get_command_encoder(gpu:2). Route the worker sync through _safe_sync_generation_stream so the wait targets the correct stream. Closes jundot#1437. (cherry picked from commit 2e698ff)
mlx-vlm's load() only materializes model.language_model.parameters(), leaving frozen buffers (RoPE freqs) and sibling sub-trees (vision_tower, audio_tower) as lazy arrays bound to the loader thread's default stream. Pre-jundot#1304 this was invisible because loader and forward shared one global thread; the per-engine executor split exposed it as "no Stream(gpu, X) in current thread" when mx.eval touches a sibling buffer during prefill. Fix: materialize the full model tree on the loader thread right after load. Verified against gemma-4-E2B-it and gemma-4-31b-it-4bit. Also fixes test_safe_sync_passes_generation_stream to match the _default_generation_stream alias introduced by jundot#1304. Reported by @zviratko in jundot#1304. (cherry picked from commit 9d5bed8)
Some Qwen3.6 MoE VLM exports declare mtp_num_hidden_layers > 0 in config.json but ship no mtp.* weights in the safetensors (unsloth Qwen3.6 UD MLX builds across 3bit/4bit). PR jundot#1404 unconditionally attaches MTPModule whenever the config declares MTP heads so persisted weights have a binding site; for these weight-stripped checkpoints there are no weights to bind, mlx-vlm strict load_weights fails with "Missing N parameters: language_model.mtp.*", the engine silently falls back to LLM and vision is dropped. The user just sees the model answer as if no image was attached, hence the hallucination report. Add _checkpoint_has_mtp_weights to scan model.safetensors.index.json (or the first safetensors shard header) for any mtp.* / language_model.mtp.* / model.mtp.* key. Flip set_mtp_attach_enabled(False) before mlx_vlm.utils.load() runs when the scan misses, so the runtime patch's __init__ wrap skips attachment. PR jundot#1404's intended behavior is preserved when the checkpoint does ship mtp.* weights. Smoke test: unsloth_Qwen3.6-35B-A3B-UD-MLX-3bit + red square PNG. Before: "the user hasn't provided an image" (vision dropped). After: "Red square" identified correctly (prompt_tokens 25 -> 99 with vision tokens). Fixes jundot#1426. (cherry picked from commit ff7522b)
…mission (jundot#1399) * feat(profiles): three-scope template fields + persistence Adds the universal/per-model field split and built-in preset templates shipped on first run. UNIVERSAL_PROFILE_FIELDS gates which keys a global template may carry; per-model profiles continue to accept the full sampling surface. New default_global_profile.json seeds the preset catalog when the user has no templates of their own. Server-side foundation for the design's three-scope profile model (preset / global / model) the Swift app now binds to. * refactor(profiles): retire server builtins, align on preset bundle Server-side slice. Drops the 4 hard-coded preset templates the ModelSettingsManager shipped on first run and lets clients source their catalog from `omlx/admin/static/omlx_preset.json` (the bundle that's already refreshable from omlx.ai). The 4 builtin entries were never on omlx.ai's side, so both HTML and the upcoming Swift app converge on the bundle's 10-entry catalog with no drift. - Delete `omlx/template/default_global_profile.json` and the `_get_builtin_templates()` machinery in `ModelSettingsManager`. - Drop the rename/save/delete rejection branches that special-cased builtin names — `/api/profile-templates` now serves user templates only. `is_builtin` stays on the response schema (always false) so back-compat with older HTML/Swift clients is intact; a follow-up removes it once both clients no longer key on the field. - Sync `model_profiles.py` to drop `load_default_global_templates()` — it's dead code once the builtin source is gone. - Tests: drop the qwen3.6 builtin assertions; keep CRUD + persistence coverage. Carved out of the Swift app PR (jundot#1371) for focused review per maintainer ask — the Swift-side preset bundle client lives there. * chore(profiles): stop emitting is_builtin on template responses `ModelSettingsManager` previously stamped `is_builtin: false` on every template payload — belt-and-suspenders left over from when server-side builtins existed. With those retired (prior commit), the flag is dead weight: HTML doesn't read it, and the Swift app gates on `templateScope` (defaults to `.global` when the field is absent) so older clients keep working unchanged. Schema-compat note: the field stays decodable on the client side as `Optional<Bool>` — we're just dropping the emission. Older clients that still expect the key continue to work because their decoders treat it as optional. Carved out of the Swift app PR (jundot#1371). The Swift-side stale-reference patches and packaging/build.py cleanup that were also in the original commit stay in jundot#1371 since they touch Swift sources. (cherry picked from commit 0c881f5)
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Catch-up sync of upstream commits that landed since v0.3.12 and are independent of the memory_guard_tier work. Notably fixes the boundary-store flake we have been carrying in the baseline (
test_cleanup_all_drains_queue) and brings in per-engine MLX streams to eliminate cross-engine contamination during multi-model serving.4f3a9b9(jundot#1423)bc1c4272916ab4(jundot#1422)56860b3(jundot#1304)b7cb489(a62f953)c50d64e(e6d8a3f jundot#1445)_get_generation_streamf554f19(2e698ff jundot#1437)ebf2c21(9d5bed8)414b843(ff7522b)59d9e7e(0c881f5 jundot#1399)Conflict resolution
_safe_sync_generation_streamthen_safe_sync_stream(self._stream)). 56860b3 is the per-engine version and strictly supersedes 2e698ff. Took HEAD on the textual conflict; the cherry-pick still left a non-trivial diff (paged_ssd_cache.py tweak + a few test additions), so it's kept as its own commit rather than skipped.self._stream(per-engine) instead of the older module-levelgeneration_stream.with mx.stream(_get_generation_stream())wrap around_call_backbone— the per-engine stream is now propagated by the executor, so the explicit wrap (reading the module-level stream) is wrong post-fix(engine): per-engine threads to eliminate cross-engine stream contamination jundot/omlx#1304.Test plan
tests/test_per_engine_threads.py), boundary-store cleanup races, and the existing scheduler / VLM paths.tests/test_boundary_snapshot_store.py::TestBoundarySnapshotSSDStore::test_cleanup_all_drains_queue(previously ~20% flake) passes cleanly. That removes one of the 4 baseline failures; new baseline becomes the 3 known OMLX_API_KEY env-override flakes.Standing authorization (sync/*)
All 5 criteria satisfied. Self-merging per user delegation.