Eval Reliability, Benchmark History, Model Manager QoL, and MTP Batch Safety by aljen · Pull Request #1254 · jundot/omlx

aljen · 2026-05-14T05:40:13Z

Eval Reliability, Benchmark History, Model Manager QoL, and MTP Batch Safety

Summary

This PR improves oMLX benchmark reliability and observability, adds persistent
benchmark/model metadata, refines the model manager UI, and fixes a native MTP
state bug that could corrupt batched generation after batch reshapes.

The largest behavior change is in accuracy benchmark execution: evals no longer
run as fixed chunks that wait for the slowest request before starting the next
chunk. They now keep up to batch_size requests in flight and refill completed
slots immediately. This removes a large amount of idle time for thinking/code
benchmarks with long-tail generations.

Why

Accuracy/code evals were hard to trust because HumanEval/MBPP/LiveCodeBench
scoring was too brittle around common model output formats, indentation,
markdown fences, tiny numeric drift, and diagnostic reporting.
Accuracy benchmark results were process-local and easy to lose after restart.
Performance benchmark results were not retained in the UI/model list.
The model manager row mixed names, source tags, size, speed, and accuracy into
one visual blob.
Native MTP state was initialized on singleton donor batches and could survive
GenerationBatch.extend() / filter() reshapes, making it possible to resume
a stale sidecar MTP cache after continuous batching changed ownership.
After the rebase, upstream had added new VLM MTP settings and a typing cleanup,
which required aligning profile classification and tests on this branch.

Major Changes

Accuracy benchmark execution

Replaces fixed chunk barriers with a refill-on-completion queue in
BaseBenchmark.
Keeps up to batch_size generation requests in flight.
Starts the next item as soon as a generation slot completes instead of waiting
for the slowest request in the current chunk.
Moves scoring of generated code outputs behind a bounded async queue and runs
scoring in a worker thread so subprocess-based code evaluation does not block
generation refill.
Records per-question generation time instead of assigning each question the
chunk average.
Preserves the existing first-batch thinking auto-detection behavior by running
a probe batch before entering the queue.
Applies the shared refill path to generic evals, HumanEval, MBPP, and
LiveCodeBench.
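
The refill behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in BaseBenchmark: `run_with_refill`, `generate`, and the result shape are hypothetical names chosen for the example, and the bounded scoring queue and thinking probe batch are left out.

```python
import asyncio
import time


async def run_with_refill(items, generate, batch_size):
    """Run generate(item) for every item with at most batch_size calls in flight.

    Returns a list of (output, seconds) tuples in the original item order.
    Hypothetical helper for illustration only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    results = [None] * len(items)
    remaining = iter(enumerate(items))
    in_flight = set()

    async def timed(index, item):
        # Per-question timing: each request measures its own wall time.
        started = time.perf_counter()
        output = await generate(item)
        return index, output, time.perf_counter() - started

    def refill():
        # Top up free slots; called at start and after every completion.
        while len(in_flight) < batch_size:
            nxt = next(remaining, None)
            if nxt is None:
                return
            in_flight.add(asyncio.create_task(timed(*nxt)))

    refill()
    while in_flight:
        done, _ = await asyncio.wait(in_flight, return_when=asyncio.FIRST_COMPLETED)
        in_flight.difference_update(done)
        for task in done:
            index, output, seconds = task.result()
            results[index] = (output, seconds)
        refill()  # no chunk barrier: new work starts while slow requests keep running
    return results
```

With a chunk barrier, one slow item holds back the rest of its chunk; here the other slots keep cycling through the remaining items while the slow request runs.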

Measured locally on MacBook Pro 16-inch, M3 Max, 128GB,
Qwen3.6-35B-A3B-oQ4-mtp:

LiveCodeBench 300, non-thinking:
- old 8x chunk barrier: 150/300 in 4774.5s
- refill 8x best:       151/300 in 3454.4s
- refill 4x:            147/300 in 2788.0s

LiveCodeBench 300, thinking:
- old 8x chunk barrier: 158/300 in 33914.9s
- refill 8x:            158/300 in 23904.1s

Thinking run saved ~10010.8s / 166.8m / 2.78h with the same score.
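
For anyone re-checking the arithmetic, the savings follow directly from the wall times listed above. The snippet below only restates those numbers; `saving` is a throwaway helper, not part of oMLX.

```python
# Wall times in seconds from the LiveCodeBench 300 runs above: (old 8x barrier, refill 8x).
RUNS = {
    "non-thinking": (4774.5, 3454.4),
    "thinking": (33914.9, 23904.1),
}


def saving(old_s, new_s):
    """Return (seconds saved, percent of the old wall time saved), rounded to 0.1."""
    saved = old_s - new_s
    return round(saved, 1), round(100 * saved / old_s, 1)


for name, (old_s, new_s) in RUNS.items():
    seconds, percent = saving(old_s, new_s)
    print(f"{name}: saved {seconds}s ({seconds / 60:.1f}m, {percent}% of old wall time)")
```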

Accuracy benchmark correctness and diagnostics

Adds pass_mode, failure_type, and error diagnostics to per-question
results.
HumanEval now tries deterministic canonical/chat-output repairs:
body-only completions, extracted code, normalized indentation, top-level
indentation, and full-function responses.
MBPP now tolerates tiny numeric assertion drift and can retry setup placement
after generated code when needed.
LiveCodeBench now reports structured code-check failure types.
Result exports now include batch size, sampling profile, effective sampling,
pass mode, failure type, and errors.
Adds a benchmark sampling profile switch:
model_settings uses saved model sampling settings, while deterministic
forces neutral greedy settings for reproducibility.
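
As an illustration of what tolerating tiny numeric assertion drift can look like, here is a small sketch. It is an assumption about the general approach, not the MBPP scorer from this PR: `values_match` and both tolerance values are made up for the example.

```python
import math


def values_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two values, allowing tiny drift when a float is involved.

    Hypothetical helper; the tolerances are illustrative, not oMLX's.
    """
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        if isinstance(expected, int) and isinstance(actual, int):
            return expected == actual  # integers stay exact
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
    if isinstance(expected, (list, tuple)) and type(expected) is type(actual):
        # Recurse so drift inside nested containers is tolerated as well.
        return len(expected) == len(actual) and all(
            values_match(e, a, rel_tol, abs_tol) for e, a in zip(expected, actual)
        )
    return expected == actual
```

Keeping integer comparisons exact means off-by-one answers still fail; only float rounding noise is forgiven.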

Persistent benchmark results

Accuracy benchmark results are now persisted under the oMLX base path and
survive server/browser restarts.
Performance benchmark results are also persisted.
Adds delete/reset endpoints for saved accuracy and performance benchmark
results.
Adds stable result_id and created_at metadata.
Updates model catalog summaries when benchmark results are added/deleted/reset.
Accuracy summaries track latest/best result and best result by benchmark.
Performance summaries expose canonical pp/tg throughput for model rows.
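
A rough sketch of JSON-backed result persistence with result_id and created_at, plus a latest/best summary. The file layout, the `benchmark` and `accuracy` field names, and all function names here are assumptions for illustration; the PR's actual schema may differ.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def save_result(results_dir, result):
    """Write one result as JSON, stamping result_id and created_at (hypothetical)."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    record = {
        **result,
        "result_id": uuid.uuid4().hex,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    path = results_dir / f"{record['result_id']}.json"
    tmp = results_dir / f"{record['result_id']}.json.tmp"
    tmp.write_text(json.dumps(record, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename into place so readers never see a partial file
    return record


def load_results(results_dir):
    """Load every saved result from disk, e.g. after a server restart."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in Path(results_dir).glob("*.json")
    ]


def summarize(results):
    """Latest result, best result, and best result per benchmark."""
    if not results:
        return {"latest": None, "best": None, "best_by_benchmark": {}}
    best_by_benchmark = {}
    for record in results:
        current = best_by_benchmark.get(record["benchmark"])
        if current is None or record["accuracy"] > current["accuracy"]:
            best_by_benchmark[record["benchmark"]] = record
    return {
        "latest": max(results, key=lambda r: r["created_at"]),
        "best": max(results, key=lambda r: r["accuracy"]),
        "best_by_benchmark": best_by_benchmark,
    }
```

Because the summary is recomputed from the stored records, deleting or resetting results only needs to remove files and call `summarize` again.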

Model catalog and model manager QoL

Adds a JSON-backed ModelCatalog for durable local model metadata.
Records model source (hf, modelscope, local, unknown), repo id,
revisions/update timestamps, update status, and benchmark summaries.
Downloaders record catalog entries after successful HF/ModelScope downloads.
Model list reconciles discovered local models with the catalog.
Adds per-model and all-model remote update checks for HF/ModelScope-backed
models.
Model manager rows now show:
- source badge (HF, MS, Local)
- remote update status only for remote-backed models
- source repository link when available
- user-defined tags
- compact accuracy summary, e.g. HE 93.9%
- prompt-processing and token-generation speeds as ↓ / ↑ stats
- row-level actions for update check, settings, benchmark, and delete
Clarifies model manager labels:
Reload runtime, Rescan disk, and Check remote updates.

User model tags

Adds user-defined tags to ModelSettings.
Tags are editable in the model settings modal.
Tags are normalized server-side:
trimmed, empty values removed, case-insensitive de-duplicated, and capped to
64 chars.
Suggested tags include Dense and MoE; they are not auto-applied.
Tags are intentionally excluded from profiles/templates because they are UI
metadata, not model behavior settings.
Adds i18n keys for tag UI strings across existing locale files. Dense and
MoE remain literal tag values.
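
The normalization rules above could be implemented along these lines. This is a sketch under assumptions: the 64-character cap is read as a per-tag limit, the first spelling of a duplicate wins, and `normalize_tags` is a hypothetical name rather than the function in this PR.

```python
MAX_TAG_LENGTH = 64  # per-tag cap, as read from the description above


def normalize_tags(raw_tags):
    """Trim, drop empties, de-duplicate case-insensitively, and cap tag length."""
    seen = set()
    tags = []
    for raw in raw_tags or []:
        # Strip again after truncating so a cut never leaves trailing whitespace.
        tag = str(raw).strip()[:MAX_TAG_LENGTH].strip()
        if not tag:
            continue
        key = tag.casefold()
        if key in seen:
            continue  # keep the first spelling, e.g. "MoE" over "moe"
        seen.add(key)
        tags.append(tag)
    return tags
```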

Performance benchmark upload control

Adds an explicit “Upload to oMLX community benchmarks” checkbox.
The dashboard checkbox defaults to enabled, preserving previous UI behavior.
The request model default remains off, so direct API callers must opt in
explicitly.
Uploads are skipped and surfaced when experimental features are active, so
DFlash/SpecPrefill/TurboQuant-style runs do not pollute community benchmark
submissions.
Saved performance results include upload request/status metadata.
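
The upload rule can be summarized as a small decision function. This is only a sketch of the behavior described above; `decide_upload`, the returned keys, and the reason strings are hypothetical and not the metadata format saved by oMLX.

```python
def decide_upload(upload_requested, active_experimental_features):
    """Return upload request/status metadata for one performance run (hypothetical)."""
    if not upload_requested:
        return {"requested": False, "uploaded": False, "reason": "not_requested"}
    if active_experimental_features:
        # Skip and surface the reason so the run does not pollute community data.
        names = ", ".join(sorted(active_experimental_features))
        return {
            "requested": True,
            "uploaded": False,
            "reason": f"experimental_features_active: {names}",
        }
    return {"requested": True, "uploaded": True, "reason": None}
```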

Native MTP batch-state fix

Stops activating native MTP in GenerationBatch.__init__.
Lazily initializes MTP only in next() when the current batch is eligible.
Stamps MTP state with the owning singleton uid.
Drops MTP state when extend() / filter() / fallback paths break singleton
ownership.
Disables MTP for grammar-constrained decoding because grammar processors rely
on standard GenerationBatch._step token-acceptance hooks.
Mirrors singleton mRoPE delta setup for direct MTP forwards.

Reasoning: fresh singleton prompt batches may later be merged into a larger
continuous batch. If MTP mutates the donor cache before merge, then the sidecar
MTP cache can become detached from the uid/slot advanced by the standard
batched decode path and resume from stale state later.
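
The ownership rule can be shown with a toy model. `ToyBatch` below is deliberately not mlx-lm's GenerationBatch and holds no real caches; it only illustrates lazy activation in next(), stamping state with the owning uid, and dropping state when a reshape breaks singleton ownership.

```python
class ToyBatch:
    """Toy stand-in for a generation batch, for illustration only."""

    def __init__(self, uids):
        self.uids = list(uids)
        self.mtp_state = None  # MTP is never activated in __init__

    def _drop_mtp_if_not_owner(self):
        state = self.mtp_state
        if state is None:
            return
        if len(self.uids) != 1 or state["owner_uid"] != self.uids[0]:
            self.mtp_state = None  # a stale sidecar state must not be resumed

    def extend(self, other):
        self.uids.extend(other.uids)
        self._drop_mtp_if_not_owner()

    def filter(self, keep_uids):
        keep = set(keep_uids)
        self.uids = [uid for uid in self.uids if uid in keep]
        self._drop_mtp_if_not_owner()

    def next(self):
        self._drop_mtp_if_not_owner()
        if len(self.uids) != 1:
            return "standard"  # non-singleton batches use the normal decode path
        if self.mtp_state is None:
            # Lazy init, stamped with the uid that owns this state.
            self.mtp_state = {"owner_uid": self.uids[0], "steps": 0}
        self.mtp_state["steps"] += 1
        return "mtp"
```

In this toy, a batch that grows to two requests and later shrinks back to one starts MTP from fresh state for whichever uid remains, instead of resuming state built before the merge.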

Local validation after the fix:

Before:
HumanEval Qwen3.6-35B-A3B-oQ4-mtp 2x batch: 96/164, 58.5%

After:
HumanEval Qwen3.6-35B-A3B-oQ4-mtp 2x batch: 153/164, 93.3%

Rebase hygiene and settings/profile classification

Classifies current DFlash cache/context settings as model-specific profile
fields.
Adds upstream VLM MTP settings to model-specific profile fields.
Removes stale dflash_draft_quant_bits from profile classification.
Keeps preserve_thinking in universal profile fields.
Keeps tags excluded from profiles/templates.
Updates the MTP eligibility test for upstream’s mtp_active gate.
Modernizes the added tags type annotations after upstream removed legacy
Optional/List usage in admin routes.
Adds missing Apache-2.0 SPDX header to touched model_settings.py; files added
on this branch already had the header.

Runtime model reload fix

Runtime model reload now resolves effective model directories through the same
defaults as global settings.
This fixes reload/rescan behavior when explicit model_dirs are empty and the
default base-path model directory should be used.
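
The fallback can be pictured as below. This is a sketch, not the resolver in this PR: `effective_model_dirs` is a hypothetical name and the `models` subdirectory under the base path is an assumed default.

```python
from pathlib import Path


def effective_model_dirs(model_dirs, base_path):
    """Return explicit model dirs, or the base-path default when none are set."""
    explicit = [Path(d) for d in (model_dirs or []) if str(d).strip()]
    # Assumption: the default model directory is "<base_path>/models".
    return explicit or [Path(base_path) / "models"]
```

Using one resolver for both global settings and runtime reload keeps the two code paths from disagreeing when model_dirs is empty.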

API / Data Surface Changes

AccuracyBenchmarkRequest adds sampling_profile:
- model_settings
- deterministic
Performance benchmark start requests add upload_to_omlx.
Model settings add tags: list[str].
Model list responses include catalog metadata and more complete settings
fields.
New admin endpoints:
- POST /admin/api/models/{model_id}/check-update
- POST /admin/api/models/check-updates
- GET /admin/api/bench/results
- POST /admin/api/bench/results/reset
- DELETE /admin/api/bench/results/{result_id}
- DELETE /admin/api/bench/accuracy/results/{result_id}
Saved benchmark result JSON includes result ids, timestamps, sampling
provenance, batch size, benchmark variant, diagnostics, and upload status.
Adds model_catalog.json under the oMLX base path.
Adds persisted benchmark result files under:
- benchmarks/accuracy/results
- benchmarks/performance/results
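
To make the sampling_profile field concrete, here is one way the effective sampling could be resolved and recorded as provenance. The neutral greedy values, key names, and `resolve_sampling` itself are assumptions for illustration; the source only states that deterministic forces neutral greedy settings and model_settings uses the saved ones.

```python
# Hypothetical neutral greedy settings; the PR's actual values may differ.
NEUTRAL_GREEDY = {"temperature": 0.0, "top_p": 1.0, "top_k": 0, "repetition_penalty": 1.0}


def resolve_sampling(sampling_profile, saved_model_settings):
    """Return provenance for a run: the chosen profile and the effective sampling."""
    if sampling_profile == "deterministic":
        effective = dict(NEUTRAL_GREEDY)
    elif sampling_profile == "model_settings":
        effective = dict(saved_model_settings)
    else:
        raise ValueError(f"unknown sampling_profile: {sampling_profile!r}")
    return {"sampling_profile": sampling_profile, "effective_sampling": effective}
```

Storing both the profile name and the resolved values lets a saved result be compared later even if the model's saved settings have changed since the run.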

Tests Added / Updated

Adds tests/test_eval_streaming.py for refill scheduling behavior:
slot refill before slow request finishes, result ordering, per-question timing,
cancellation propagation, thinking probe rerun, and scoring not blocking
generation refill.
Expands HumanEval/MBPP/LiveCodeBench tests around indentation/code repair,
malformed fences, runtime/syntax/wrong-answer diagnostics, setup placement,
and numeric drift.
Adds persistence/delete/reset/catalog summary tests for accuracy results.
Adds sampling provenance tests for deterministic vs model-settings profiles.
Adds model settings tag roundtrip/persistence tests.
Adds MTP state ownership/reshape tests and updates eligibility expectations
for upstream mtp_active.
Updates profile classification tests for new DFlash/VLM MTP fields.

Verification Performed

Focused checks run during development/rebase:

python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py
python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py tests/test_benchmark.py tests/test_mlx_lm_mtp_patch.py tests/test_model_settings.py
python -m pytest tests/test_admin_profiles_api.py tests/test_admin_api_key.py tests/test_model_settings.py
python -m pytest tests/test_model_settings.py tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified
python -m pytest tests/test_mlx_lm_mtp_patch.py::TestBatchGeneratorDispatch::test_is_mtp_eligible_requires_mtp_forward_and_solo_batch
python -m black --check omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py
python -m ruff check --select F,N,I omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py

Latest focused pass after rebase cleanup:

tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py
tests/test_benchmark.py tests/test_mlx_lm_mtp_patch.py tests/test_model_settings.py
=> 252 passed

tests/test_admin_profiles_api.py tests/test_admin_api_key.py tests/test_model_settings.py
=> 122 passed

tests/test_model_settings.py tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified
=> 44 passed

Notes / Caveats

Result history and model catalog are JSON-backed to match current project
storage style. A future SQLite consolidation could simplify this, but this PR
does not introduce a database dependency.
Dashboard average-speed runtime stats are not changed here; benchmark history
is persisted separately and summarized through the model catalog.
The dashboard upload checkbox defaults to enabled to preserve previous
community-upload behavior, but direct API callers must opt in explicitly.
Tags are manual user labels only. No automatic Dense/MoE detection is applied.
MTP remains intentionally restricted to singleton native MTP execution; this
PR fixes stale state across batch reshapes rather than enabling multi-request
native MTP speculation.

MartinKuhl · 2026-05-18T07:36:46Z

I really like the model manager. How often are the models updated?

aljen · 2026-05-18T09:11:35Z

Thanks!
It depends on the uploader. For original/upstream models, updates are fairly common during the first few days or weeks after release, then usually slow down once the model stabilizes.
The manager just checks the upstream revision/metadata and reports when a newer version is available :)

Native MTP state was initialized on fresh singleton donor batches and then carried across GenerationBatch.extend/filter transitions. When continuous batching merged or shrank requests, the sidecar MTP cache could become detached from the uid/slot whose main cache was advanced by the standard decode path, then later resume from stale state. Activate MTP lazily only for the current singleton batch, stamp state with its owning uid, and drop it whenever a reshape or non-singleton fallback breaks that ownership.

Replace fixed chunk barriers with a refill-on-completion queue for accuracy eval generation.

The old runner launched N requests, waited for the slowest request in the chunk, then scored and launched the next chunk. Thinking/code workloads have long-tail generations, so completed slots sat idle while one straggler blocked the whole suite.

The new helper keeps up to batch_size generation jobs in flight, refills a slot as soon as generation completes, scores completed outputs behind a bounded queue, and records per-question generation time instead of spreading chunk wall time across every question in the batch. HumanEval, MBPP, LiveCodeBench, and generic evals now share that path.

Validated on MacBook Pro 16-inch, M3 Max, 128GB, Qwen3.6-35B-A3B-oQ4-mtp.

LiveCodeBench 300, non-thinking:
- original 8x chunk barrier: 150/300 in 4774.5s
- refill 8x best: 151/300 in 3454.4s
- refill 4x: 147/300 in 2788.0s
Refill 8x saved 1320.1s / 22m over the 8x chunk barrier at a near-identical score (151 vs 150).

LiveCodeBench 300, thinking:
- original 8x chunk barrier: 158/300 in 33914.9s
- refill 8x: 158/300 in 23904.1s
Refill 8x saved 10010.8s / 166.8m / 2.78h with the same score.

Focused checks:
- python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py
- python -m black --check omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py
- python -m ruff check --select F,N,I omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py

- update MTP eligibility test to account for the upstream mtp_active gate while restoring global state after the test
- classify upstream VLM MTP model settings as model-specific profile fields
- remove stale dflash_draft_quant_bits from profile field classification
- modernize admin route tags typing after upstream Optional/List cleanup
- add missing Apache-2.0 SPDX header to touched model_settings source file

Verification:
- python -m pytest tests/test_mlx_lm_mtp_patch.py::TestBatchGeneratorDispatch::test_is_mtp_eligible_requires_mtp_forward_and_solo_batch
- python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py tests/test_benchmark.py tests/test_mlx_lm_mtp_patch.py tests/test_model_settings.py
- python -m pytest tests/test_admin_profiles_api.py tests/test_admin_api_key.py tests/test_model_settings.py
- python -m pytest tests/test_model_settings.py tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified

Fix downloader catalog writes after the nested owner/model download layout. The hooks were passing an undefined model_name, so provenance recording failed and reconcile later recreated the model as a local entry. Use target_dir.name as the catalog model id for HF and ModelScope downloads, matching discovery's leaf-name ids. Add reconcile-side HF provenance inference from downloader cache markers so already-broken nested Hugging Face entries repair on refresh without editing the user's catalog by hand. Tests cover HF and ModelScope downloader catalog writes, benchmark-summary preservation during record_download, and reconcile upgrading nested Hugging Face downloads from Local to HF.

Add the French translations for the model manager update labels, model tags modal, benchmark upload checkbox, and saved accuracy-result delete tooltip introduced by the QoL changes.

aljen marked this pull request as ready for review May 14, 2026 10:07

aljen force-pushed the feature/eval-and-qol-improvements branch from 7444f4a to c92fdf8 Compare May 15, 2026 07:46

aljen force-pushed the feature/eval-and-qol-improvements branch 2 times, most recently from 4898113 to a6cf21d Compare May 19, 2026 09:32

aljen added 16 commits May 21, 2026 03:59

fix: normalize coding benchmark evaluation

736dec9

fix: persist accuracy benchmark results

ca28465

feat: add model catalog and benchmark history

ce0417f

fix: normalize humaneval body indentation

ae2a702

feat: track benchmark sampling provenance

0c79f95

fix: tolerate mbpp numeric drift

8faf414

fix: reload effective model directories

7b94ad8

fix: clean up eval test regressions

e3d9f84

feat: refine model manager stats and tags

2c143cd

fix: classify dflash cache entry setting

7d40e95

fix: default benchmark upload opt-in

83bd0ad

fix(i18n): add French labels for model QoL UI

6ab207a

aljen force-pushed the feature/eval-and-qol-improvements branch from a6cf21d to 6ab207a Compare May 21, 2026 02:51

aljen wants to merge 16 commits into jundot:main from aljen:feature/eval-and-qol-improvements
