Add n-gram speculative fallback for native MTP #1319

Closed
youndukn wants to merge 1 commit into jundot:main from youndukn:codex/ngram-mtp-speculation

@youndukn

Summary

  • Add optional draftless n-gram speculation in the native MTP BatchGenerator path.
  • Prefer used n-gram continuations, then repeated prompt n-grams, with native MTP as adaptive fallback on n-gram misses.
  • Add model settings/profile fields, prompt-token tracking, focused tests, and a concise benchmark note.
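The lookup order in the second bullet can be sketched roughly as follows. This is not the code from the patch: the function names, the `ngram_size` and `max_draft` parameters, and the reading of "used" continuations as those found in already-generated tokens are all assumptions made for illustration. A `None` result stands for an n-gram miss, which is where the PR falls back to native MTP.

```python
from typing import Optional, Sequence


def find_continuation(seq: Sequence[int], suffix: Sequence[int], max_draft: int) -> list[int]:
    """Return up to max_draft tokens that follow the most recent
    occurrence of suffix in seq (hypothetical helper)."""
    n = len(suffix)
    target = list(suffix)
    # Scan right to left so the most recent match wins. The upper bound
    # guarantees at least one token follows the matched n-gram.
    for start in range(len(seq) - n - 1, -1, -1):
        if list(seq[start:start + n]) == target:
            return list(seq[start + n:start + n + max_draft])
    return []


def propose_draft(
    prompt_tokens: Sequence[int],
    generated_tokens: Sequence[int],
    ngram_size: int = 3,   # assumed value, not taken from the PR
    max_draft: int = 4,    # assumed value; the PR only says "short drafts"
) -> Optional[list[int]]:
    """Propose draft tokens without a draft model, or None on a miss."""
    context = list(prompt_tokens) + list(generated_tokens)
    if len(context) < ngram_size:
        return None
    suffix = context[-ngram_size:]
    # Assumed preference order: generated tokens first, then the prompt.
    for source in (generated_tokens, prompt_tokens):
        draft = find_continuation(source, suffix, max_draft)
        if draft:
            return draft
    return None  # miss: the caller would fall back to native MTP
```

For example, `propose_draft([1, 2, 3, 4, 5, 1, 2, 3], [])` finds the earlier `1, 2, 3` in the prompt and proposes the tokens that followed it. The sketch ignores matches that straddle the prompt/generation boundary.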

Notes

  • Disabled by default.
  • The current safe target is greedy decoding on long-context roleplay or conversations with repeated structure, using short drafts.
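One reason greedy decoding is a natural safe target is that draft verification reduces to an exact comparison against the target model's argmax tokens, so the output matches plain greedy decoding token for token. The sketch below shows the standard greedy acceptance rule under that assumption. It is not confirmed to be how this patch verifies drafts, and `accept_greedy` and its arguments are hypothetical names.

```python
from typing import Sequence


def accept_greedy(draft: Sequence[int], target_argmax: Sequence[int]) -> list[int]:
    """Standard greedy verification of a speculative draft (illustrative).

    target_argmax[i] is the target model's greedy token at the position
    of draft[i]. It must have len(draft) + 1 entries: the last one is
    the token the target would emit after a fully accepted draft.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have len(draft) + 1 entries")
    accepted: list[int] = []
    for i, token in enumerate(draft):
        if token != target_argmax[i]:
            break  # first mismatch: discard the rest of the draft
        accepted.append(token)
    # Always emit the target's own token at the first unverified
    # position, so every step yields at least one token.
    accepted.append(target_argmax[len(accepted)])
    return accepted
```

With short drafts, a mismatch wastes little verification work, which fits the note's preference for short drafts on repetitive text.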

Tests

  • python -m py_compile omlx/model_settings.py omlx/model_profiles.py omlx/engine/batched.py omlx/scheduler.py omlx/patches/mlx_lm_mtp/batch_generator.py tests/test_mlx_lm_mtp_patch.py
  • PYTHONPATH=/Users/youndukn/projects/oMLX pytest -q tests/test_mlx_lm_mtp_patch.py::TestModelSettingsMtp
  • PYTHONPATH=/Users/youndukn/projects/oMLX pytest -q tests/test_mlx_lm_mtp_patch.py

Benchmark summary

40-turn roleplay benchmark, 320 generated tokens, greedy decoding:

  • Plain greedy: 48.72 wall tok/s, 62.67 decode tok/s
  • N-gram + target fallback: 67.44 wall tok/s, 101.33 decode tok/s
  • N-gram + MTP fallback: 68.85 wall tok/s, 103.87 decode tok/s
  • N-gram + adaptive MTP fallback: 69.61 wall tok/s, 104.35 decode tok/s
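To make the relative gains easier to read, the figures above can be turned into speedups over plain greedy. The numbers are copied from the list; the helper name `speedups` is only for illustration.

```python
# (wall tok/s, decode tok/s) as reported in the benchmark summary.
RESULTS = {
    "Plain greedy": (48.72, 62.67),
    "N-gram + target fallback": (67.44, 101.33),
    "N-gram + MTP fallback": (68.85, 103.87),
    "N-gram + adaptive MTP fallback": (69.61, 104.35),
}


def speedups(results: dict[str, tuple[float, float]], baseline: str) -> dict[str, tuple[float, float]]:
    """Return (wall, decode) throughput ratios relative to the baseline row."""
    base_wall, base_decode = results[baseline]
    return {
        name: (wall / base_wall, decode / base_decode)
        for name, (wall, decode) in results.items()
    }


if __name__ == "__main__":
    for name, (wall, decode) in speedups(RESULTS, "Plain greedy").items():
        print(f"{name}: {wall:.2f}x wall, {decode:.2f}x decode")
```

All three n-gram variants land at roughly 1.4x wall and 1.6x to 1.7x decode throughput over plain greedy, and the gaps between the three fallback strategies are small compared with the gain from n-gram speculation itself.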

@youndukn force-pushed the codex/ngram-mtp-speculation branch from 09b7ab7 to 1ce631a on May 20, 2026 04:51
@youndukn closed this on May 20, 2026