Eval Reliability, Benchmark History, Model Manager QoL, and MTP Batch Safety #1254
Open
aljen wants to merge 16 commits into
7444f4a to c92fdf8
I really like the model manager. How often will the models be updated?
Author
Thanks!
4898113 to a6cf21d
Native MTP state was initialized on fresh singleton donor batches and then carried across GenerationBatch.extend/filter transitions. When continuous batching merged or shrank requests, the sidecar MTP cache could become detached from the uid/slot whose main cache was advanced by the standard decode path, then later resume from stale state. Activate MTP lazily only for the current singleton batch, stamp state with its owning uid, and drop it whenever a reshape or non-singleton fallback breaks that ownership.
Replace fixed chunk barriers with a refill-on-completion queue for accuracy eval generation.

The old runner launched N requests, waited for the slowest request in the chunk, then scored and launched the next chunk. Thinking/code workloads have long-tail generations, so completed slots sat idle while one straggler blocked the whole suite.

The new helper keeps up to batch_size generation jobs in flight, refills a slot as soon as generation completes, scores completed outputs behind a bounded queue, and records per-question generation time instead of spreading chunk wall time across every question in the batch. HumanEval, MBPP, LiveCodeBench, and generic evals now share that path.

Validated on MacBook Pro 16-inch, M3 Max, 128GB, Qwen3.6-35B-A3B-oQ4-mtp.

LiveCodeBench 300, non-thinking:
- original 8x chunk barrier: 150/300 in 4774.5s
- refill 8x (best): 151/300 in 3454.4s
- refill 4x: 147/300 in 2788.0s
Refill 8x saves 1320.1s / 22m over the original 8x chunk barrier with effectively the same score (151 vs. 150).

LiveCodeBench 300, thinking:
- original 8x chunk barrier: 158/300 in 33914.9s
- refill 8x: 158/300 in 23904.1s
Refill 8x saves 10010.8s / 166.8m / 2.78h with the same score.

Focused checks:
- python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py
- python -m black --check omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py
- python -m ruff check --select F,N,I omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py
- update MTP eligibility test to account for the upstream mtp_active gate while restoring global state after the test
- classify upstream VLM MTP model settings as model-specific profile fields
- remove stale dflash_draft_quant_bits from profile field classification
- modernize admin route tags typing after upstream Optional/List cleanup
- add missing Apache-2.0 SPDX header to touched model_settings source file

Verification:
- python -m pytest tests/test_mlx_lm_mtp_patch.py::TestBatchGeneratorDispatch::test_is_mtp_eligible_requires_mtp_forward_and_solo_batch
- python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py tests/test_benchmark.py tests/test_mlx_lm_mtp_patch.py tests/test_model_settings.py
- python -m pytest tests/test_admin_profiles_api.py tests/test_admin_api_key.py tests/test_model_settings.py
- python -m pytest tests/test_model_settings.py tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified
Fix downloader catalog writes after the nested owner/model download layout. The hooks were passing an undefined model_name, so provenance recording failed and reconcile later recreated the model as a local entry. Use target_dir.name as the catalog model id for HF and ModelScope downloads, matching discovery's leaf-name ids. Add reconcile-side HF provenance inference from downloader cache markers so already-broken nested Hugging Face entries repair on refresh without editing the user's catalog by hand. Tests cover HF and ModelScope downloader catalog writes, benchmark-summary preservation during record_download, and reconcile upgrading nested Hugging Face downloads from Local to HF.
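The leaf-name id rule in this commit can be illustrated with a small sketch. It is not the PR's code: the function name record_download_sketch, the dict-based catalog, and the field names source, repo_id, and benchmarks are assumptions chosen only to show the idea that the catalog key comes from target_dir.name and that an existing benchmark summary survives the provenance update.

```python
from pathlib import Path


def record_download_sketch(
    catalog: dict, target_dir: Path, source: str, repo_id: str
) -> dict:
    """Record provenance for a downloaded model (hypothetical helper).

    The catalog key is the leaf directory name, so a nested
    owner/model layout still maps to the id that discovery uses.
    """
    model_id = target_dir.name  # "models/owner/Model-A" -> "Model-A"
    entry = catalog.get(model_id, {})
    # Update provenance in place; unrelated keys such as a saved
    # benchmark summary are left untouched.
    entry.update({"source": source, "repo_id": repo_id})
    catalog[model_id] = entry
    return entry


catalog = {"Model-A": {"source": "local", "benchmarks": {"HE": 93.9}}}
record_download_sketch(catalog, Path("models/owner/Model-A"), "hf", "owner/Model-A")
```

Keying by the leaf name means a later reconcile pass finds the same entry instead of creating a second, local one.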
Add the French translations for the model manager update labels, model tags modal, benchmark upload checkbox, and saved accuracy-result delete tooltip introduced by the QoL changes.
a6cf21d to 6ab207a
Eval Reliability, Benchmark History, Model Manager QoL, and MTP Batch Safety
Summary
This PR improves oMLX benchmark reliability and observability, adds persistent
benchmark/model metadata, refines the model manager UI, and fixes a native MTP
state bug that could corrupt batched generation after batch reshapes.
The largest behavior change is in accuracy benchmark execution: evals no longer
run as fixed chunks that wait for the slowest request before starting the next
chunk. They now keep up to batch_size requests in flight and refill completed
slots immediately. This removes a large amount of idle time for thinking/code
benchmarks with long-tail generations.
Why
scoring was too brittle around common model output formats, indentation,
markdown fences, tiny numeric drift, and diagnostic reporting.
one visual blob.
GenerationBatch.extend()/filter() reshapes, making it possible to resume
a stale sidecar MTP cache after continuous batching changed ownership.
profile classification/test alignment on this branch.
Major Changes
Accuracy benchmark execution
BaseBenchmark.batch_size generation requests in flight.
for the slowest request in the current chunk.
scoring in a worker thread so subprocess-based code evaluation does not block
generation refill.
chunk average.
a probe batch before entering the queue.
LiveCodeBench.
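The refill-on-completion scheduling described in this section can be sketched with asyncio. This is a minimal illustration, not the PR's helper: run_with_refill and the generate callback are hypothetical names, and the sketch leaves out scoring, the bounded scoring queue, cancellation, and the thinking probe.

```python
import asyncio
import time
from typing import Awaitable, Callable, Sequence


async def run_with_refill(
    prompts: Sequence[str],
    generate: Callable[[str], Awaitable[str]],
    batch_size: int,
) -> list:
    """Keep up to batch_size generations in flight (hypothetical helper).

    Returns one (output, seconds) pair per prompt, in input order.
    """
    results: list = [None] * len(prompts)
    pending = iter(range(len(prompts)))
    in_flight: dict = {}  # task -> (prompt index, start time)

    def launch_next() -> bool:
        idx = next(pending, None)
        if idx is None:
            return False
        task = asyncio.ensure_future(generate(prompts[idx]))
        in_flight[task] = (idx, time.perf_counter())
        return True

    # Fill the initial slots.
    for _ in range(batch_size):
        if not launch_next():
            break

    while in_flight:
        done, _ = await asyncio.wait(
            in_flight.keys(), return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            idx, started = in_flight.pop(task)
            # Per-question timing, not a share of chunk wall time.
            results[idx] = (task.result(), time.perf_counter() - started)
            launch_next()  # refill the freed slot immediately
    return results
```

With a batch size of 2 and one slow prompt, the remaining prompts cycle through the second slot while the slow one is still running, which is the behavior a fixed chunk barrier prevents.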
Measured locally on MacBook Pro 16-inch, M3 Max, 128GB,
Qwen3.6-35B-A3B-oQ4-mtp:
Accuracy benchmark correctness and diagnostics
pass_mode, failure_type, and error diagnostics to per-question
results.
body-only completions, extracted code, normalized indentation, top-level
indentation, and full-function responses.
after generated code when needed.
pass mode, failure type, and errors.
model_settings uses saved model sampling settings, while deterministic
forces neutral greedy settings for reproducibility.
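The two sampling profiles can be pictured as a small resolver. This is a sketch under stated assumptions: SamplingParams, GREEDY, and resolve_sampling are hypothetical names, and "neutral greedy" is modeled here as temperature 0.0 with top_p 1.0, which the PR text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingParams:
    temperature: float
    top_p: float


# Assumption: "neutral greedy" means temperature 0.0 and top_p 1.0.
GREEDY = SamplingParams(temperature=0.0, top_p=1.0)


def resolve_sampling(profile: str, saved: SamplingParams) -> SamplingParams:
    """Pick sampling parameters for an accuracy run (hypothetical helper)."""
    if profile == "model_settings":
        return saved  # use the model's saved sampling settings
    if profile == "deterministic":
        return GREEDY  # ignore saved settings for reproducibility
    raise ValueError(f"unknown sampling_profile: {profile!r}")
```

Rejecting unknown profile names keeps a typo in a request from silently falling back to one of the two modes.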
Persistent benchmark results
survive server/browser restarts.
results.
result_id and created_at metadata.
Model catalog and model manager QoL
ModelCatalog for durable local model metadata.
hf, modelscope, local, unknown), repo id,
revisions/update timestamps, update status, and benchmark summaries.
models.
HF, MS, Local)
HE 93.9%
↓/↑ stats
Reload runtime, Rescan disk, and Check remote updates.
User model tags
tags to ModelSettings.
trimmed, empty values removed, case-insensitive de-duplicated, and capped to
64 chars.
Dense and MoE; they are not auto-applied.
metadata, not model behavior settings.
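The tag normalization rules (trim, drop empty values, case-insensitive de-duplication, 64-character cap) might look like the sketch below. normalize_tags and MAX_TAG_LENGTH are hypothetical names, and two details are assumptions: over-long tags are truncated rather than rejected, and the first spelling of a duplicate wins.

```python
MAX_TAG_LENGTH = 64  # cap stated in the PR description


def normalize_tags(raw_tags: list) -> list:
    """Normalize user-entered model tags (hypothetical helper)."""
    seen = set()
    normalized = []
    for tag in raw_tags:
        # Assumption: the cap truncates instead of rejecting the tag.
        cleaned = tag.strip()[:MAX_TAG_LENGTH]
        if not cleaned:
            continue  # drop empty values
        key = cleaned.casefold()  # case-insensitive comparison key
        if key in seen:
            continue  # keep only the first spelling
        seen.add(key)
        normalized.append(cleaned)
    return normalized
```

Tags such as Dense and MoE pass through unchanged as literal values; nothing in the sketch interprets them.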
Dense and MoE remain literal tag values.
Performance benchmark upload control
DFlash/SpecPrefill/TurboQuant-style runs do not pollute community benchmark
submissions.
Native MTP batch-state fix
GenerationBatch.__init__.
next() when the current batch is eligible.
extend()/filter() / fallback paths break singleton
ownership.
on standard GenerationBatch._step token-acceptance hooks.
Reasoning: fresh singleton prompt batches may later be merged into a larger
continuous batch. If MTP mutates the donor cache before merge, then the sidecar
MTP cache can become detached from the uid/slot advanced by the standard
batched decode path and resume from stale state later.
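The ownership rule behind this fix can be shown with a toy model. This sketch does not use the real GenerationBatch: BatchSketch and MtpState are hypothetical stand-ins, and the "cache" is a plain list. It only illustrates lazy activation for a singleton batch, stamping state with the owning uid, and dropping state on any reshape or non-singleton step.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MtpState:
    owner_uid: int  # uid whose main cache this sidecar state tracks
    cache: list = field(default_factory=list)


class BatchSketch:
    """Toy stand-in for a generation batch (not the real class)."""

    def __init__(self, uids):
        self.uids = list(uids)
        self.mtp_state: Optional[MtpState] = None  # never created eagerly

    def _drop_mtp(self) -> None:
        self.mtp_state = None

    def extend(self, other_uids) -> None:
        self.uids.extend(other_uids)
        self._drop_mtp()  # a reshape breaks singleton ownership

    def filter(self, keep_uids) -> None:
        keep = set(keep_uids)
        self.uids = [u for u in self.uids if u in keep]
        self._drop_mtp()  # a reshape breaks singleton ownership

    def next(self) -> str:
        if len(self.uids) != 1:
            self._drop_mtp()  # non-singleton fallback
            return "standard"
        uid = self.uids[0]
        # Activate lazily, and only for the uid that owns this batch now.
        if self.mtp_state is None or self.mtp_state.owner_uid != uid:
            self.mtp_state = MtpState(owner_uid=uid)
        self.mtp_state.cache.append("draft")
        return "mtp"
```

Because state is rebuilt whenever the owner changes, a request that becomes a singleton again after a merge starts from a fresh sidecar cache instead of resuming a stale one.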
Local validation after the fix:
Rebase hygiene and settings/profile classification
fields.
dflash_draft_quant_bits from profile classification.
preserve_thinking in universal profile fields.
tags excluded from profiles/templates.
mtp_active gate.
Optional/List usage in admin routes.
model_settings.py; branch-added files already had the header.
Runtime model reload fix
defaults as global settings.
model_dirs are empty and the default base-path model directory should be used.
API / Data Surface Changes
AccuracyBenchmarkRequest adds sampling_profile:
model_settings
deterministic
upload_to_omlx.
tags: list[str].
fields.
POST /admin/api/models/{model_id}/check-update
POST /admin/api/models/check-updates
GET /admin/api/bench/results
POST /admin/api/bench/results/reset
DELETE /admin/api/bench/results/{result_id}
DELETE /admin/api/bench/accuracy/results/{result_id}
provenance, batch size, benchmark variant, diagnostics, and upload status.
model_catalog.json under the oMLX base path.
benchmarks/accuracy/results
benchmarks/performance/results
Tests Added / Updated
tests/test_eval_streaming.py for refill scheduling behavior:
slot refill before slow request finishes, result ordering, per-question timing,
cancellation propagation, thinking probe rerun, and scoring not blocking
generation refill.
malformed fences, runtime/syntax/wrong-answer diagnostics, setup placement,
and numeric drift.
for upstream mtp_active.
Verification Performed
Focused checks run during development/rebase:
Latest focused pass after rebase cleanup:
Notes / Caveats
storage style. A future SQLite consolidation could simplify this, but this PR
does not introduce a database dependency.
is persisted separately and summarized through the model catalog.
community-upload behavior, but direct API callers must opt in explicitly.
PR fixes stale state across batch reshapes rather than enabling multi-request
native MTP speculation.