
Eval Reliability, Benchmark History, Model Manager QoL, and MTP Batch Safety #1254

Open
aljen wants to merge 16 commits into jundot:main from aljen:feature/eval-and-qol-improvements


@aljen

aljen commented May 14, 2026

Eval Reliability, Benchmark History, Model Manager QoL, and MTP Batch Safety

Summary

This PR improves oMLX benchmark reliability and observability, adds persistent
benchmark/model metadata, refines the model manager UI, and fixes a native MTP
state bug that could corrupt batched generation after batch reshapes.

The largest behavior change is in accuracy benchmark execution: evals no longer
run as fixed chunks that wait for the slowest request before starting the next
chunk. They now keep up to batch_size requests in flight and refill completed
slots immediately. This removes a large amount of idle time for thinking/code
benchmarks with long-tail generations.

Why

  • Accuracy/code evals were hard to trust because HumanEval/MBPP/LiveCodeBench
    scoring was too brittle around common model output formats, indentation,
    markdown fences, tiny numeric drift, and diagnostic reporting.
  • Accuracy benchmark results were process-local and easy to lose after restart.
  • Performance benchmark results were not retained in the UI/model list.
  • The model manager row mixed names, source tags, size, speed, and accuracy into
    one visual blob.
  • Native MTP state was initialized on singleton donor batches and could survive
    GenerationBatch.extend() / filter() reshapes, making it possible to resume
    a stale sidecar MTP cache after continuous batching changed ownership.
  • Rebasing onto upstream brought in new VLM MTP settings and a typing cleanup,
    which required profile classification and test alignment on this branch.

Major Changes

Accuracy benchmark execution

  • Replaces fixed chunk barriers with a refill-on-completion queue in
    BaseBenchmark.
  • Keeps up to batch_size generation requests in flight.
  • Starts the next item as soon as a generation slot completes instead of waiting
    for the slowest request in the current chunk.
  • Moves scoring of generated code outputs behind a bounded async queue and runs
    scoring in a worker thread so subprocess-based code evaluation does not block
    generation refill.
  • Records per-question generation time instead of assigning each question the
    chunk average.
  • Preserves the existing first-batch thinking auto-detection behavior by running
    a probe batch before entering the queue.
  • Applies the shared refill path to generic evals, HumanEval, MBPP, and
    LiveCodeBench.
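
The refill-on-completion idea above can be sketched with plain asyncio. This is a minimal illustration, not the code in BaseBenchmark: run_with_refill, fake_generate, and demo are hypothetical names, and the scoring queue, cancellation, and the thinking probe batch are left out.

```python
import asyncio
import time


async def run_with_refill(items, generate, batch_size):
    """Run generate(item) for every item with at most batch_size in flight.

    A slot picks up the next item as soon as its request completes, so one
    slow request never stalls the others. Results keep the input order.
    """
    results = [None] * len(items)
    pending = iter(enumerate(items))  # shared iterator: each worker pulls the next item

    async def worker():
        for index, item in pending:
            started = time.perf_counter()
            output = await generate(item)
            # Per-question generation time, not a chunk average.
            results[index] = (output, time.perf_counter() - started)

    workers = [
        asyncio.create_task(worker())
        for _ in range(min(batch_size, len(items)))
    ]
    await asyncio.gather(*workers)
    return results


async def demo():
    in_flight = 0
    peak = 0

    async def fake_generate(delay):
        # Stand-in for a model call; tracks how many requests overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(delay)
        in_flight -= 1
        return f"done after {delay}"

    # One straggler (0.2s) followed by five short requests.
    delays = [0.2, 0.01, 0.01, 0.01, 0.01, 0.01]
    results = await run_with_refill(delays, fake_generate, batch_size=2)
    return results, peak
```

With batch_size=2, one slot stays busy with the straggler while the other slot works through the short requests; a fixed chunk barrier would instead hold the second slot idle until the straggler finished.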

Measured locally on MacBook Pro 16-inch, M3 Max, 128GB,
Qwen3.6-35B-A3B-oQ4-mtp:

LiveCodeBench 300, non-thinking:
- old 8x chunk barrier: 150/300 in 4774.5s
- refill 8x best:       151/300 in 3454.4s
- refill 4x:            147/300 in 2788.0s

LiveCodeBench 300, thinking:
- old 8x chunk barrier: 158/300 in 33914.9s
- refill 8x:            158/300 in 23904.1s

Thinking run saved ~10010.8s / 166.8m / 2.78h with the same score.
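
The savings figures can be re-derived from the wall times listed above. A small sketch follows; saved_time is a hypothetical helper, and the percentage is derived here rather than taken from the PR text.

```python
def saved_time(old_seconds, new_seconds):
    """Summarize the wall-time difference between two benchmark runs."""
    saved = old_seconds - new_seconds
    return {
        "seconds": round(saved, 1),
        "minutes": round(saved / 60, 1),
        "hours": round(saved / 3600, 2),
        "percent": round(100 * saved / old_seconds, 1),
    }


# Wall times quoted above for LiveCodeBench 300 at 8x.
thinking = saved_time(33914.9, 23904.1)
non_thinking = saved_time(4774.5, 3454.4)
```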

Accuracy benchmark correctness and diagnostics

  • Adds pass_mode, failure_type, and error diagnostics to per-question
    results.
  • HumanEval now tries deterministic canonical/chat-output repairs:
    body-only completions, extracted code, normalized indentation, top-level
    indentation, and full-function responses.
  • MBPP now tolerates tiny numeric assertion drift and, when needed, can retry
    with the setup code placed after the generated code.
  • LiveCodeBench now reports structured code-check failure types.
  • Result exports now include batch size, sampling profile, effective sampling,
    pass mode, failure type, and errors.
  • Adds a benchmark sampling profile switch:
    model_settings uses saved model sampling settings, while deterministic
    forces neutral greedy settings for reproducibility.
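
Two of the tolerances described above, markdown fences and tiny numeric drift, can be illustrated as follows. This is a sketch under assumptions, not the scoring code in this PR: extract_code, values_match, and the tolerance values are hypothetical.

```python
import math
import re

# Built at runtime so this example can itself sit inside a fenced block.
FENCE = "`" * 3
_BLOCK = re.compile(
    FENCE + r"[\w+-]*[ \t]*\n(.*?)(?:\n" + FENCE + r"|\Z)",
    re.DOTALL,
)


def extract_code(reply):
    """Return the body of the first fenced block, or the reply if unfenced.

    An unterminated fence is tolerated: everything after the opening
    fence line is treated as code.
    """
    match = _BLOCK.search(reply)
    return match.group(1) if match else reply


def values_match(actual, expected, rel_tol=1e-6, abs_tol=1e-9):
    """Exact equality, except that floats may differ by a tiny tolerance."""
    if isinstance(actual, float) or isinstance(expected, float):
        try:
            return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol)
        except TypeError:
            # One side is not numeric, so the values cannot match.
            return False
    return actual == expected
```

For example, values_match(0.1 + 0.2, 0.3) is true even though the two floats are not exactly equal, while values_match(1.0, 1.1) stays false.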

Persistent benchmark results

  • Accuracy benchmark results are now persisted under the oMLX base path and
    survive server/browser restarts.
  • Performance benchmark results are also persisted.
  • Adds delete/reset endpoints for saved accuracy and performance benchmark
    results.
  • Adds stable result_id and created_at metadata.
  • Updates model catalog summaries when benchmark results are added/deleted/reset.
  • Accuracy summaries track latest/best result and best result by benchmark.
  • Performance summaries expose canonical pp/tg throughput for model rows.
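
A minimal sketch of JSON-backed result persistence with result_id and created_at metadata. The one-file-per-result layout and the helper names (save_result, load_results, delete_result) are assumptions for illustration; the PR's actual storage code may differ.

```python
import json
import os
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def save_result(results_dir, payload):
    """Persist one benchmark result and return the stored record."""
    record = dict(payload)
    record["result_id"] = uuid.uuid4().hex
    record["created_at"] = datetime.now(timezone.utc).isoformat()
    results_dir.mkdir(parents=True, exist_ok=True)
    target = results_dir / f"{record['result_id']}.json"
    # Write to a temp file, then rename, so readers never see a partial file.
    fd, tmp_name = tempfile.mkstemp(dir=results_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
    os.replace(tmp_name, target)
    return record


def load_results(results_dir):
    """Load all saved results, oldest first."""
    records = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in results_dir.glob("*.json")
    ]
    return sorted(records, key=lambda record: record["created_at"])


def delete_result(results_dir, result_id):
    """Delete one saved result; return False if it did not exist."""
    path = results_dir / f"{result_id}.json"
    if path.exists():
        path.unlink()
        return True
    return False
```

Because each record lives on disk, a restarted process can rebuild its history by calling load_results on the same directory.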

Model catalog and model manager QoL

  • Adds a JSON-backed ModelCatalog for durable local model metadata.
  • Records model source (hf, modelscope, local, unknown), repo id,
    revisions/update timestamps, update status, and benchmark summaries.
  • Downloaders record catalog entries after successful HF/ModelScope downloads.
  • Model list reconciles discovered local models with the catalog.
  • Adds per-model and all-model remote update checks for HF/ModelScope-backed
    models.
  • Model manager rows now show:
    • source badge (HF, MS, Local)
    • remote update status only for remote-backed models
    • source repository link when available
    • user-defined tags
    • compact accuracy summary, e.g. HE 93.9%
    • prompt-processing and token-generation speeds as pp/tg stats
    • row-level actions for update check, settings, benchmark, and delete
  • Clarifies model manager labels:
    Reload runtime, Rescan disk, and Check remote updates.

User model tags

  • Adds user-defined tags to ModelSettings.
  • Tags are editable in the model settings modal.
  • Tags are normalized server-side:
    trimmed, empty values removed, de-duplicated case-insensitively, and capped to
    64 chars.
  • Suggested tags include Dense and MoE; they are not auto-applied.
  • Tags are intentionally excluded from profiles/templates because they are UI
    metadata, not model behavior settings.
  • Adds i18n keys for tag UI strings across existing locale files. Dense and
    MoE remain literal tag values.
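
The normalization rules above can be sketched as a single function. normalize_tags is a hypothetical name, and this sketch assumes the 64-character cap applies to each individual tag.

```python
MAX_TAG_LENGTH = 64  # assumption: the cap applies to each tag's length


def normalize_tags(raw_tags):
    """Trim, drop empties, de-duplicate case-insensitively, cap length."""
    seen = set()
    normalized = []
    for raw in raw_tags or []:
        tag = str(raw).strip()[:MAX_TAG_LENGTH].strip()
        if not tag:
            continue
        key = tag.casefold()  # case-insensitive comparison key
        if key in seen:
            continue
        seen.add(key)
        normalized.append(tag)  # the first spelling wins
    return normalized
```

The first spelling of a tag is kept, so a later "dense" does not replace an earlier "Dense".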

Performance benchmark upload control

  • Adds an explicit “Upload to oMLX community benchmarks” checkbox.
  • The dashboard checkbox defaults to enabled, preserving previous UI behavior.
  • The request model default remains conservative for direct API callers.
  • Uploads are skipped and surfaced when experimental features are active, so
    DFlash/SpecPrefill/TurboQuant-style runs do not pollute community benchmark
    submissions.
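
The upload gate described above reduces to a small decision function. This is a sketch: decide_upload and the status strings are hypothetical, and the feature names are passed in as plain strings.

```python
def decide_upload(upload_requested, active_experimental_features):
    """Return (should_upload, status) for a finished performance benchmark."""
    if not upload_requested:
        return False, "not requested"
    if active_experimental_features:
        names = ", ".join(sorted(active_experimental_features))
        # Skip the upload, but report why so the skip is visible in the UI.
        return False, f"skipped: experimental features active ({names})"
    return True, "eligible"
```

Returning a status string alongside the boolean lets the saved result record why an upload did or did not happen.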
  • Saved performance results include upload request/status metadata.

Native MTP batch-state fix

  • Stops activating native MTP in GenerationBatch.__init__.
  • Lazily initializes MTP only in next() when the current batch is eligible.
  • Stamps MTP state with the owning singleton uid.
  • Drops MTP state when extend() / filter() / fallback paths break singleton
    ownership.
  • Disables MTP for grammar-constrained decoding because grammar processors rely
    on standard GenerationBatch._step token-acceptance hooks.
  • Mirrors singleton mRoPE delta setup for direct MTP forwards.

Reasoning: fresh singleton prompt batches may later be merged into a larger
continuous batch. If MTP mutates the donor cache before merge, then the sidecar
MTP cache can become detached from the uid/slot advanced by the standard
batched decode path and resume from stale state later.
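
The ownership rule can be illustrated with a toy model. ToyBatch and MTPState are hypothetical stand-ins, not the real GenerationBatch; the sketch only shows lazy activation in next(), uid stamping, and dropping state on reshape.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MTPState:
    owner_uid: int  # uid of the singleton request that created this state
    draft_tokens: list = field(default_factory=list)


class ToyBatch:
    """Toy stand-in for a generation batch; not the real GenerationBatch."""

    def __init__(self, uids):
        self.uids = list(uids)
        self.mtp_state: Optional[MTPState] = None  # never activated here

    def next(self):
        if len(self.uids) != 1:
            # Non-singleton fallback: any leftover state is stale, drop it.
            self.mtp_state = None
            return "standard"
        uid = self.uids[0]
        if self.mtp_state is None or self.mtp_state.owner_uid != uid:
            # Lazy activation, stamped with the owning uid.
            self.mtp_state = MTPState(owner_uid=uid)
        return "mtp"

    def extend(self, other):
        self.uids.extend(other.uids)
        self.mtp_state = None  # reshape breaks singleton ownership

    def filter(self, keep_uids):
        keep = set(keep_uids)
        self.uids = [uid for uid in self.uids if uid in keep]
        self.mtp_state = None  # reshape breaks singleton ownership
```

After a reshape leaves a single request behind, next() builds fresh state for that uid instead of resuming whatever the previous owner left.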

Local validation after the fix:

Before:
HumanEval Qwen3.6-35B-A3B-oQ4-mtp 2x batch: 96/164, 58.5%

After:
HumanEval Qwen3.6-35B-A3B-oQ4-mtp 2x batch: 153/164, 93.3%

Rebase hygiene and settings/profile classification

  • Classifies current DFlash cache/context settings as model-specific profile
    fields.
  • Adds upstream VLM MTP settings to model-specific profile fields.
  • Removes stale dflash_draft_quant_bits from profile classification.
  • Keeps preserve_thinking in universal profile fields.
  • Keeps tags excluded from profiles/templates.
  • Updates the MTP eligibility test for upstream’s mtp_active gate.
  • Modernizes the added tags type annotations after upstream removed legacy
    Optional/List usage in admin routes.
  • Adds the missing Apache-2.0 SPDX header to the touched model_settings.py;
    files added on this branch already had the header.

Runtime model reload fix

  • Runtime model reload now resolves effective model directories through the same
    defaults as global settings.
  • This fixes reload/rescan behavior when explicit model_dirs are empty and the
    default base-path model directory should be used.
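
A sketch of the fallback described above. resolve_model_dirs is a hypothetical helper, and the "models" subdirectory name under the base path is an assumption, not confirmed by this PR.

```python
from pathlib import Path


def resolve_model_dirs(model_dirs, base_path):
    """Return explicit model dirs, or the base-path default when none are set."""
    explicit = [
        Path(entry).expanduser()
        for entry in (model_dirs or [])
        if str(entry).strip()
    ]
    if explicit:
        return explicit
    # Assumption: the default model directory is <base_path>/models.
    return [Path(base_path).expanduser() / "models"]
```

Routing both global settings and runtime reload through one resolver of this kind is what keeps an empty model_dirs from resolving to nothing on reload.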

API / Data Surface Changes

  • AccuracyBenchmarkRequest adds sampling_profile:
    • model_settings
    • deterministic
  • Performance benchmark start requests add upload_to_omlx.
  • Model settings add tags: list[str].
  • Model list responses include catalog metadata and more complete settings
    fields.
  • New admin endpoints:
    • POST /admin/api/models/{model_id}/check-update
    • POST /admin/api/models/check-updates
    • GET /admin/api/bench/results
    • POST /admin/api/bench/results/reset
    • DELETE /admin/api/bench/results/{result_id}
    • DELETE /admin/api/bench/accuracy/results/{result_id}
  • Saved benchmark result JSON includes result ids, timestamps, sampling
    provenance, batch size, benchmark variant, diagnostics, and upload status.
  • Adds model_catalog.json under the oMLX base path.
  • Adds persisted benchmark result files under:
    • benchmarks/accuracy/results
    • benchmarks/performance/results

Tests Added / Updated

  • Adds tests/test_eval_streaming.py for refill scheduling behavior:
    slot refill before slow request finishes, result ordering, per-question timing,
    cancellation propagation, thinking probe rerun, and scoring not blocking
    generation refill.
  • Expands HumanEval/MBPP/LiveCodeBench tests around indentation/code repair,
    malformed fences, runtime/syntax/wrong-answer diagnostics, setup placement,
    and numeric drift.
  • Adds persistence/delete/reset/catalog summary tests for accuracy results.
  • Adds sampling provenance tests for deterministic vs model-settings profiles.
  • Adds model settings tag roundtrip/persistence tests.
  • Adds MTP state ownership/reshape tests and updates eligibility expectations
    for upstream mtp_active.
  • Updates profile classification tests for new DFlash/VLM MTP fields.

Verification Performed

Focused checks run during development/rebase:

python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py
python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py tests/test_benchmark.py tests/test_mlx_lm_mtp_patch.py tests/test_model_settings.py
python -m pytest tests/test_admin_profiles_api.py tests/test_admin_api_key.py tests/test_model_settings.py
python -m pytest tests/test_model_settings.py tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified
python -m pytest tests/test_mlx_lm_mtp_patch.py::TestBatchGeneratorDispatch::test_is_mtp_eligible_requires_mtp_forward_and_solo_batch
python -m black --check omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py
python -m ruff check --select F,N,I omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py

Latest focused pass after rebase cleanup:

tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py
tests/test_benchmark.py tests/test_mlx_lm_mtp_patch.py tests/test_model_settings.py
=> 252 passed

tests/test_admin_profiles_api.py tests/test_admin_api_key.py tests/test_model_settings.py
=> 122 passed

tests/test_model_settings.py tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified
=> 44 passed

Notes / Caveats

  • Result history and model catalog are JSON-backed to match current project
    storage style. A future SQLite consolidation could simplify this, but this PR
    does not introduce a database dependency.
  • Dashboard average-speed runtime stats are not changed here; benchmark history
    is persisted separately and summarized through the model catalog.
  • The dashboard upload checkbox defaults to enabled to preserve previous
    community-upload behavior, but direct API callers must opt in explicitly.
  • Tags are manual user labels only. No automatic Dense/MoE detection is applied.
  • MTP remains intentionally restricted to singleton native MTP execution; this
    PR fixes stale state across batch reshapes rather than enabling multi-request
    native MTP speculation.

@aljen aljen marked this pull request as ready for review May 14, 2026 10:07
@aljen aljen force-pushed the feature/eval-and-qol-improvements branch from 7444f4a to c92fdf8 Compare May 15, 2026 07:46
@MartinKuhl

I really like the model manager. How often are the models updated?

@aljen
Author

aljen commented May 18, 2026

Thanks!
It depends on the uploader. For original/upstream models, updates are fairly common during the first few days or weeks after release, then usually slow down once the model stabilizes.
The manager just checks the upstream revision/metadata and reports when a newer version is available :)

@aljen aljen force-pushed the feature/eval-and-qol-improvements branch 2 times, most recently from 4898113 to a6cf21d Compare May 19, 2026 09:32
aljen added 16 commits May 21, 2026 03:59
Native MTP state was initialized on fresh singleton donor batches and then carried across GenerationBatch.extend/filter transitions. When continuous batching merged or shrank requests, the sidecar MTP cache could become detached from the uid/slot whose main cache was advanced by the standard decode path, then later resume from stale state.

Activate MTP lazily only for the current singleton batch, stamp state with its owning uid, and drop it whenever a reshape or non-singleton fallback breaks that ownership.

Replace fixed chunk barriers with a refill-on-completion queue for accuracy eval generation.
The old runner launched N requests, waited for the slowest request in the chunk, then scored and launched the next chunk.

Thinking/code workloads have long-tail generations, so completed slots sat idle while one straggler blocked the whole suite.

The new helper keeps up to batch_size generation jobs in flight, refills a slot as soon as generation completes, scores completed outputs behind a bounded queue, and records per-question generation time instead of spreading chunk wall time across every question in the batch.
HumanEval, MBPP, LiveCodeBench, and generic evals now share that path.

Validated on MacBook Pro 16-inch, M3 Max, 128GB, Qwen3.6-35B-A3B-oQ4-mtp.

LiveCodeBench 300, non-thinking:
- original 8x chunk barrier: 150/300 in 4774.5s
- refill 8x best: 151/300 in 3454.4s
- refill 4x: 147/300 in 2788.0s
Refill 8x saves 1320.1s / 22.0m with a comparable score (151 vs 150).

LiveCodeBench 300, thinking:
- original 8x chunk barrier: 158/300 in 33914.9s
- refill 8x: 158/300 in 23904.1s
Refill 8x saves 10010.8s / 166.8m / 2.78h with the same score.

Focused checks:
python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py
python -m black --check omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py
python -m ruff check --select F,N,I omlx/eval/base.py omlx/eval/humaneval.py omlx/eval/mbpp.py omlx/eval/livecodebench.py tests/test_eval_streaming.py

- update MTP eligibility test to account for the upstream mtp_active gate while restoring global state after the test
- classify upstream VLM MTP model settings as model-specific profile fields
- remove stale dflash_draft_quant_bits from profile field classification
- modernize admin route tags typing after upstream Optional/List cleanup
- add missing Apache-2.0 SPDX header to touched model_settings source file

Verification:
- python -m pytest tests/test_mlx_lm_mtp_patch.py::TestBatchGeneratorDispatch::test_is_mtp_eligible_requires_mtp_forward_and_solo_batch
- python -m pytest tests/test_eval_streaming.py tests/test_eval.py tests/test_accuracy_benchmark.py tests/test_benchmark.py tests/test_mlx_lm_mtp_patch.py tests/test_model_settings.py
- python -m pytest tests/test_admin_profiles_api.py tests/test_admin_api_key.py tests/test_model_settings.py
- python -m pytest tests/test_model_settings.py tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified

Fix downloader catalog writes after the nested owner/model download layout. The hooks were passing an undefined model_name, so provenance recording failed and reconcile later recreated the model as a local entry.

Use target_dir.name as the catalog model id for HF and ModelScope downloads, matching discovery's leaf-name ids. Add reconcile-side HF provenance inference from downloader cache markers so already-broken nested Hugging Face entries repair on refresh without editing the user's catalog by hand.

Tests cover HF and ModelScope downloader catalog writes, benchmark-summary preservation during record_download, and reconcile upgrading nested Hugging Face downloads from Local to HF.

Add the French translations for the model manager update labels, model tags modal, benchmark upload checkbox, and saved accuracy-result delete tooltip introduced by the QoL changes.
@aljen aljen force-pushed the feature/eval-and-qol-improvements branch from a6cf21d to 6ab207a Compare May 21, 2026 02:51