sync: independent upstream fixes (boundary-store, per-engine threads, VLM, profiles) by panwudi · Pull Request #8 · panwudi/flyto-mlx

panwudi · 2026-05-27T12:19:16Z

Summary

Catch-up sync of upstream commits that landed since v0.3.12 and are independent of the memory_guard_tier work. Notably fixes the boundary-store flake we have been carrying in the baseline (test_cleanup_all_drains_queue) and brings in per-engine MLX streams to eliminate cross-engine contamination during multi-model serving.

upstream	content
`4f3a9b9` (jundot#1423)	fix(boundary-store): serialize cleanup_all + cleanup_request with the writer thread
`bc1c427`	cleanup(boundary-store): drop unreachable shutdown(cleanup=) path
`2916ab4` (jundot#1422)	cleanup(cache): remove dead TieredCacheManager
`56860b3` (jundot#1304)	fix(engine): per-engine threads to eliminate cross-engine stream contamination
`b7cb489` (`a62f953`)	refactor(engine): remove redundant _ensure_wired_limit guard
`c50d64e` (`e6d8a3f` jundot#1445)	test(mtp): drop monkeypatch of removed `_get_generation_stream`
`f554f19` (`2e698ff` jundot#1437)	fix(scheduler): wait on generation_stream in store-cache worker sync
`ebf2c21` (`9d5bed8`)	fix(engine): materialize VLM model lazy state on loader thread
`414b843` (`ff7522b`)	fix(load): skip VLM MTPModule attach when checkpoint lacks mtp.* weights
`59d9e7e` (`0c881f5` jundot#1399)	refactor(profiles): three-scope template contract + drop is_builtin emission

Conflict resolution

2e698ff (SIGABRT / Crash on background cache-saver thread due to cross-thread MLX Stream isolation jundot/omlx#1437): same fix as a piece of 56860b3 (fix(engine): per-engine threads to eliminate cross-engine stream contamination jundot/omlx#1304) — upstream had two takes on the worker sync (_safe_sync_generation_stream then _safe_sync_stream(self._stream)). 56860b3 is the per-engine version and strictly supersedes 2e698ff. Took HEAD on the textual conflict; the cherry-pick still left a non-trivial diff (paged_ssd_cache.py tweak + a few test additions), so it's kept as its own commit rather than skipped.
56860b3 scheduler.py boundary_capture / store_cache_main_prep: auto-merge tangled flyto's already-3-phase split with upstream's same split. Manually wrote a clean 3-phase block that uses self._stream (per-engine) instead of the older module-level generation_stream.
56860b3 batch_generator.py: removed the explicit with mx.stream(_get_generation_stream()) wrap around _call_backbone — the per-engine stream is now propagated by the executor, so the explicit wrap (reading the module-level stream) is wrong post-fix(engine): per-engine threads to eliminate cross-engine stream contamination jundot/omlx#1304.
a62f953 engine_core.py: came in with 56860b3's per-engine stream / executor as context, so the cherry-pick applied cleanly once 56860b3 was in.

Test plan

Full suite: 4567 pass / 3 fail / 36 skip / 59 deselected (244s).
+31 tests vs main; covers per-engine threads (tests/test_per_engine_threads.py), boundary-store cleanup races, and the existing scheduler / VLM paths.
boundary_snapshot flake eliminated: tests/test_boundary_snapshot_store.py::TestBoundarySnapshotSSDStore::test_cleanup_all_drains_queue (previously ~20% flake) passes cleanly. That removes one of the 4 baseline failures; new baseline becomes the 3 known OMLX_API_KEY env-override flakes.
Smoke import all modified modules.

Standing authorization (sync/*)

All 5 criteria satisfied. Self-merging per user delegation.

…th the writer thread (jundot#1423) test_cleanup_all_drains_queue failed ~20% of the time, in isolation and in suites. The race: T0 save() puts item in queue T1 writer queue.get() pulls item (already off the queue) T2 cleanup_all() drains queue — finds it empty T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir) T4 writer mkdir(req_dir) + temp write + rename(final) Result: ``req-X/N.safetensors`` survives the supposed cleanup, the caller (scheduler / shutdown) believes the snapshot dir is empty, and the leftover file is the disk shadow of a request that should have been forgotten. Add ``_writer_busy`` (threading.Lock). The writer holds it for the entire duration of each ``_process_write_item`` call. ``cleanup_all()`` drains the queue first (no new items can start) and then acquires ``_writer_busy``, so any item the writer had already pulled finishes before rmtree runs. Same class of race exists for ``cleanup_request()`` — writer pulls a request's item before cleanup_request sets the cancelled-counter, then mkdir's the req_dir back into existence after the rmtree, and the late-rename rescue at the tail of ``_process_write_item`` catches it in most timings but a small window remains. Apply the same ``_writer_busy`` barrier (bounded, see below) so the two cleanup paths are symmetric. Per-path timeouts so cleanup_request can yield faster than cleanup_all: - ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset / startup where blocking longer is tolerable in exchange for a stronger orphan-avoidance guarantee. - ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called from the scheduler's abort hot path (~3 sites) where bounding latency matters more than chasing one in-flight write. The bounded acquires have a logged-warning fallback. Without that bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock the scheduler's hot path — cleanup_all() is called from request abort / reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The worst-case fallback behaviour is an orphaned file under the recreated snapshot dir until next startup's constructor cleanup, which is the exact same state the pre-fix code produced on every cleanup_all. Counter pop on cleanup_request now only runs when the ``_writer_busy`` acquire succeeded. On the timeout fallback the writer is still mid-item and may not yet have consulted ``_is_cancelled``; popping the counter there would defeat the late- rename rescue that the docstring advertises as the timeout- fallback safety net. The previous code popped unconditionally. Counter bump itself is now additive (``get + count``) and skipped entirely when ``count == 0``. Two distinct bugs that closes: - Skip-on-zero: cleanup_request("X") for an rid with NO pending items previously wrote ``cancelled[X] = 0`` then popped on the acquired path. On the timeout fallback the pop never ran and the ``X: 0`` entry lingered for the process lifetime — every later ``save()`` under that rid (or any reuse of the string) was silently discarded by the writer's ``_is_cancelled`` gates, which check key membership not value > 0. Counter must only exist when there is at least one in-flight item to drain it. Regression: test_cleanup_request_no_pending_does_not_pin_counter _on_timeout (new). - Additive: a re-entrant cleanup_request("X") for an rid that already has an in-flight cancellation must NOT overwrite the previous count. The writer's ``cleared_by_cleanup`` branch + ``_writer_busy`` lock together close the file-write race today (so no orphan file slips out), but the per-item dec_cancelled bookkeeping has to balance — overwriting drops remaining decs on the floor and on the next ``save()`` under the same rid the writer would see a non-zero counter from the earlier batch and silently discard the new item. shutdown() now accepts ``cleanup=True`` so callers that want both operations can express the cleanup-before-shutdown ordering in one call instead of sequencing them manually. The cleanup_all() warning at the top of the function catches misordered callers that still call cleanup_all() AFTER shutdown — the writer no longer reacquires ``_writer_busy`` past the sentinel, so a post-shutdown cleanup degrades to an in-memory-only clear. Tests: - ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the lock to guarantee ordering, runs deterministically. - ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` — monkey-patches ``_write_safetensors_no_mx`` to block on an Event, pins the writer mid-item, asserts cleanup_all does NOT return until the writer releases. Without the lock this test fails deterministically rather than the original ~20% flake rate. - ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` — symmetric for cleanup_request. - ``test_cleanup_request_keeps_counter_on_timeout`` — regression for the bug where the timeout-fallback pop dropped the late-rename rescue's safety net. - ``test_cleanup_request_timeout_drains_counter_on_writer_early_return`` — the writer's cleared_by_cleanup early-return must still dec_cancelled or the counter pins the rid forever. - ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout`` (new) — regression for the skip-on-zero bug above. Pins the writer with an unrelated save, calls cleanup_request("never-saved-rid") past the 0.1s timeout, asserts the rid does NOT appear in ``_cancelled_requests`` AND that a subsequent save under the same rid is not silently discarded. - ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` — pins the cleanup-before-shutdown ordering of the new ``shutdown(cleanup=True)`` path. 24/24 boundary-store tests pass. (cherry picked from commit 4f3a9b9)

Scheduler.shutdown() never calls _boundary_snapshot_store.shutdown(), so _shutdown.is_set() is always False in production. The post-shutdown branches in cleanup_request / cleanup_all and the cleanup= kwarg on shutdown() were dead code, kept alive only by a self-referential regression test. Drops them along with the test. (cherry picked from commit bc1c427)

Verified unused: zero instantiations across the entire git history (`git log -S 'TieredCacheManager('` returns only the initial-commit declaration), zero imports outside the module's own __init__ re-export, zero test references. The class was a planned coordinator between PagedCacheManager / BlockAwarePrefixCache / PagedSSDCacheManager / MemoryMonitor — Scheduler.__init__ does that wiring directly, making the abstraction redundant. Removes 353 lines plus the re-export from omlx.cache.__all__. (cherry picked from commit 2916ab4)

…amination (jundot#1304) Replace the shared _global_mlx_executor with per-EngineCore ThreadPoolExecutor + mx.Stream, and fix the MTP patch reading the module-level generation_stream instead of the per-engine stream. (cherry picked from commit 56860b3)

BatchGenerator.__init__ already calls mx.set_wired_limit() on each instance, and concurrent calls with the same value are race-free (verified empirically). The guard never prevented the race it claimed to fix. Follow-up to jundot#1304. (cherry picked from commit a62f953)

…ndot#1445) jundot#1304 (``fix(engine): per-engine threads to eliminate cross-engine stream contamination``) refactored the patched ``BatchGenerator`` to inherit its execution stream from the enclosing engine context, and removed the module-level ``_get_generation_stream`` helper as part of that. ``TestBatchGeneratorDispatch._make_reconcile_batch`` still tried to monkeypatch that name and failed at collection of every test that depends on the fixture with ``AttributeError: module ... has no attribute '_get_generation_stream'`` — taking down 4 reconcile-path tests on every CI run. The override is no longer needed: the surrounding fixture replaces ``_rebuild_singleton_cache`` and ``_call_backbone`` with fakes that do all of their work via ``np.array`` / ``mx.array`` directly, so neither MLX dispatch nor stream selection is reached. Tests (tests/test_mlx_lm_mtp_patch.py::TestBatchGeneratorDispatch): - test_reconcile_uses_queue_front_as_next_token - test_reconcile_empty_queue_samples_from_logits - test_reconcile_returns_false_on_empty_tokens - test_reconcile_fallback_on_rebuild_failure 11/11 TestBatchGeneratorDispatch tests pass. (cherry picked from commit e6d8a3f)

…undot#1437) mx.synchronize() with no args only waits on the default stream (gpu:0), not the generation_stream (gpu:2) where mx.async_eval dispatched the arrays. The lazy memoryview() in _extract_tensor_bytes triggered a cross-thread eval that aborted at get_command_encoder(gpu:2). Route the worker sync through _safe_sync_generation_stream so the wait targets the correct stream. Closes jundot#1437. (cherry picked from commit 2e698ff)

@zviratko

mlx-vlm's load() only materializes model.language_model.parameters(), leaving frozen buffers (RoPE freqs) and sibling sub-trees (vision_tower, audio_tower) as lazy arrays bound to the loader thread's default stream. Pre-jundot#1304 this was invisible because loader and forward shared one global thread; the per-engine executor split exposed it as "no Stream(gpu, X) in current thread" when mx.eval touches a sibling buffer during prefill. Fix: materialize the full model tree on the loader thread right after load. Verified against gemma-4-E2B-it and gemma-4-31b-it-4bit. Also fixes test_safe_sync_passes_generation_stream to match the _default_generation_stream alias introduced by jundot#1304. Reported by @zviratko in jundot#1304. (cherry picked from commit 9d5bed8)

Some Qwen3.6 MoE VLM exports declare mtp_num_hidden_layers > 0 in config.json but ship no mtp.* weights in the safetensors (unsloth Qwen3.6 UD MLX builds across 3bit/4bit). PR jundot#1404 unconditionally attaches MTPModule whenever the config declares MTP heads so persisted weights have a binding site; for these weight-stripped checkpoints there are no weights to bind, mlx-vlm strict load_weights fails with "Missing N parameters: language_model.mtp.*", the engine silently falls back to LLM and vision is dropped. The user just sees the model answer as if no image was attached, hence the hallucination report. Add _checkpoint_has_mtp_weights to scan model.safetensors.index.json (or the first safetensors shard header) for any mtp.* / language_model.mtp.* / model.mtp.* key. Flip set_mtp_attach_enabled(False) before mlx_vlm.utils.load() runs when the scan misses, so the runtime patch's __init__ wrap skips attachment. PR jundot#1404's intended behavior is preserved when the checkpoint does ship mtp.* weights. Smoke test: unsloth_Qwen3.6-35B-A3B-UD-MLX-3bit + red square PNG. Before: "the user hasn't provided an image" (vision dropped). After: "Red square" identified correctly (prompt_tokens 25 -> 99 with vision tokens). Fixes jundot#1426. (cherry picked from commit ff7522b)

…mission (jundot#1399) * feat(profiles): three-scope template fields + persistence Adds the universal/per-model field split and built-in preset templates shipped on first run. UNIVERSAL_PROFILE_FIELDS gates which keys a global template may carry; per-model profiles continue to accept the full sampling surface. New default_global_profile.json seeds the preset catalog when the user has no templates of their own. Server-side foundation for the design's three-scope profile model (preset / global / model) the Swift app now binds to. * refactor(profiles): retire server builtins, align on preset bundle Server-side slice. Drops the 4 hard-coded preset templates the ModelSettingsManager shipped on first run and lets clients source their catalog from `omlx/admin/static/omlx_preset.json` (the bundle that's already refreshable from omlx.ai). The 4 builtin entries were never on omlx.ai's side, so both HTML and the upcoming Swift app converge on the bundle's 10-entry catalog with no drift. - Delete `omlx/template/default_global_profile.json` and the `_get_builtin_templates()` machinery in `ModelSettingsManager`. - Drop the rename/save/delete rejection branches that special-cased builtin names — `/api/profile-templates` now serves user templates only. `is_builtin` stays on the response schema (always false) so back-compat with older HTML/Swift clients is intact; a follow-up removes it once both clients no longer key on the field. - Sync `model_profiles.py` to drop `load_default_global_templates()` — it's dead code once the builtin source is gone. - Tests: drop the qwen3.6 builtin assertions; keep CRUD + persistence coverage. Carved out of the Swift app PR (jundot#1371) for focused review per maintainer ask — the Swift-side preset bundle client lives there. * chore(profiles): stop emitting is_builtin on template responses `ModelSettingsManager` previously stamped `is_builtin: false` on every template payload — belt-and-suspenders left over from when server-side builtins existed. With those retired (prior commit), the flag is dead weight: HTML doesn't read it, and the Swift app gates on `templateScope` (defaults to `.global` when the field is absent) so older clients keep working unchanged. Schema-compat note: the field stays decodable on the client side as `Optional<Bool>` — we're just dropping the emission. Older clients that still expect the key continue to work because their decoders treat it as optional. Carved out of the Swift app PR (jundot#1371). The Swift-side stale-reference patches and packaging/build.py cleanup that were also in the original commit stay in jundot#1371 since they touch Swift sources. (cherry picked from commit 0c881f5)

Docs only catch-up for #5 #6 #7 #8. Standing-authorization sync/* self-merge. --- Docs only 跟踪 #5 #6 #7 #8 的落地记录. Standing-authorization sync/* self-merge.

cfbraun and others added 10 commits May 27, 2026 05:08

panwudi merged commit 403c278 into main May 27, 2026

panwudi mentioned this pull request May 27, 2026

docs(upstream-sync): record today's 4 sync PRs #9

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sync: independent upstream fixes (boundary-store, per-engine threads, VLM, profiles)#8

sync: independent upstream fixes (boundary-store, per-engine threads, VLM, profiles)#8
panwudi merged 10 commits into
mainfrom
sync/upstream-2026-05-28

panwudi commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

panwudi commented May 27, 2026

Summary

Conflict resolution

Test plan

Standing authorization (sync/*)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants