Skip to content

Add JANG model loader integration#212

Open
samuelfaj wants to merge 27 commits into
raullenchai:mainfrom
samuelfaj:add-jangtq-loader-v2
Open

Add JANG model loader integration#212
samuelfaj wants to merge 27 commits into
raullenchai:mainfrom
samuelfaj:add-jangtq-loader-v2

Conversation

@samuelfaj

Copy link
Copy Markdown
Contributor

Summary

  • Detect local or Hugging Face models with jang_config.json before the vendored architecture fallback.
  • Route JANGTQ/MXTQ models through jang_tools.load_jangtq.load_jangtq_model and standard JANG models through jang_tools.loader.load_jang_model.
  • Add optional rapid-mlx[jang] dependency extra and regression tests for JANGTQ, JANG v2, and normal DeepSeek V4 fallback behavior.
  • Patch DeepSeek V4 JANGTQ tokenizer loading so jang-tools does not fall through Transformers AutoConfig for the vendored deepseek_v4 architecture.

Root cause

DeepSeek V4 JANGTQ bundles declare weight_format: mxtq and store routed experts as tq_packed/tq_norms tensors. The existing loader treated them like normal DeepSeek V4 MLX weights, so mlx_lm.load_model rejected thousands of unexpected JANGTQ parameters. During live validation, jang-tools also hit a DSV4 tokenizer/EOS expansion path that calls Transformers AutoConfig; the wrapper now patches that call for DSV4 JANGTQ to load tokenizer.json directly.

Validation

  • uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -q
  • uv run --extra dev ruff check pyproject.toml vllm_mlx/utils/tokenizer.py tests/test_jangtq_loader.py
  • uv run --extra jang python - <<'PY' ... import jang_tools ... PY
  • Local model detection: DeepSeek-V4-Flash-JANGTQ detected as weight_format=mxtq, profile=JANGTQ2.
  • Live serve validation reached DSV4 streaming hydrate, replaced 129 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, then exposed a tokenizer path bug that this branch patches.

@samuelfaj

Copy link
Copy Markdown
Contributor Author

Validation update:

  • Full JANGTQ serve startup completed locally for .
  • Hydration replaced 129 DSV4 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, completed warmup, and served on port 8011.
  • OpenAI-compatible request returned HTTP 200 with , , , .
  • Additional compatibility fixes landed in the branch for DSV4 tokenizer metadata and MLX scalar RoPE offsets under rapid-mlx batching.

@samuelfaj

Copy link
Copy Markdown
Contributor Author

Validation update:

  • Full JANGTQ serve startup completed locally for /Users/samuelfajreldines/dev/models/DeepSeek-V4-Flash-JANGTQ.
  • Hydration replaced 129 DSV4 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, completed warmup, and served on port 8011.
  • OpenAI-compatible /v1/chat/completions request returned HTTP 200 with model=local, prompt_tokens=9, completion_tokens=8, total_tokens=17.
  • Additional compatibility fixes landed in the branch for DSV4 tokenizer metadata and MLX scalar RoPE offsets under rapid-mlx batching.

@samuelfaj

Copy link
Copy Markdown
Contributor Author

Final validation update:

  • Fixed quality issue by routing DSV4 JANGTQ through direct mlx_lm.generate on the model-owning MLX worker instead of the continuous batching generator path, which produced corrupted/repetitive tokens for this runtime.
  • Server validation command completed on port 8013 with /Users/samuelfajreldines/dev/models/DeepSeek-V4-Flash-JANGTQ.
  • /v1/chat/completions simple math request returned HTTP 200 with content exactly 4, prompt_tokens=17, completion_tokens=1, total_tokens=18.
  • /v1/chat/completions exact-ok request returned HTTP 200 with content exactly ok, prompt_tokens=9, completion_tokens=1, total_tokens=10.
  • Regression tests: uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -q passed, 12 tests.
  • Ruff passed for changed files.

@samuelfaj

Copy link
Copy Markdown
Contributor Author

Performance/streaming update:

  • The DeepSeek V4 JANGTQ direct fallback now uses mlx_lm.stream_generate for streaming requests, so tokens are delivered as they are produced instead of waiting for full completion.
  • Non-streaming requests keep the safe direct mlx_lm.generate path.
  • Added an explicit TODO in the direct fallback explaining the future real batching fix: compare BatchGenerator logits/output against mlx_lm.generate, then fix cache offset handling, prompt-cache merge/extract, and RoPE position state until batching is bit-consistent with the direct path.
  • Live streaming validation returned SSE chunks with content exactly ok and final usage prompt_tokens=9, completion_tokens=2, total_tokens=11.
  • Focused tests passed: 17 tests.
  • Ruff passed.

@samuelfaj samuelfaj force-pushed the add-jangtq-loader-v2 branch from 2f48ce6 to 0ee615b Compare May 5, 2026 03:31
@samuelfaj samuelfaj marked this pull request as draft May 5, 2026 14:23
@samuelfaj samuelfaj marked this pull request as ready for review May 5, 2026 15:48
@samuelfaj samuelfaj force-pushed the add-jangtq-loader-v2 branch from ea128df to 9b0bb10 Compare May 5, 2026 15:58
@raullenchai

Copy link
Copy Markdown
Owner

Hi @samuelfaj — thanks for the work. Applying our new SOP §0 necessity gate (see docs/development/pr_merge_sop.md) I need a demand signal before merging.

Holding for clarification, not closing yet.

Reasoning:

To unlock merge, I need one or more of:

  1. User demand: a GitHub issue from a user (you or someone else) saying "I want to serve JANG model X with rapid-mlx and it doesn't work". Even one is enough.
  2. JANG popularity signal: pointer to a HuggingFace model page using JANGTQ/MXTQ format with non-trivial download counts, or a community discussion (Reddit/Discord/X) showing people are trying to run JANG locally.
  3. Scope split: separate the JANG-specific changes (vllm_mlx/jang_tools/*, tests/test_jangtq_loader.py, jang detection in loader, [jang] extras) from the unrelated infra changes (anthropic auth, completions, health, request_metrics, etc.). The current diff makes it impossible to review JANG support on its own merits.

For now please rebase on top of latest main (which now has #260, #262, #258 merged) and drop the parts that came from #205/#212-stack-overlap. After that I can give the JANG-specific surface the focused review it deserves.

Apologies for the friction — the necessity gate is new this week and I'm working through the backlog. Your #204 (Qwen tool-call fix) is being reviewed now since it has clear user value.

@raullenchai

Copy link
Copy Markdown
Owner

Thanks for putting this together. Two requests before review:

(1) Please split this into independent PRs. The diff is +4007 LOC across 27 files but the title scopes it to the JANG loader. The JANG-loader part is a coherent change on its own:

  • pyproject.toml (the [jang] extra)
  • vllm_mlx/utils/tokenizer.py (DSV4 JANGTQ tokenizer patch)
  • tests/test_jangtq_loader.py
  • whichever loader-routing code path detects jang_config.json before the vendored-arch fallback

The TUI (vllm_mlx/tui.py +736), metrics middleware (vllm_mlx/middleware/metrics.py +247, vllm_mlx/request_metrics.py +201), chat-route refactor (vllm_mlx/routes/chat.py +374), postprocessor changes (vllm_mlx/service/postprocessor.py +176), and batched-engine changes (vllm_mlx/engine/batched.py +224) are each their own scope and should be reviewed separately — they're unrelated to JANG and bundling them makes the diff impossible to review responsibly.

(2) Verify the JANG import path. The PR imports jang_tools.loader.load_jang_model and jang_tools.load_jangtq.load_jangtq_model, but the package published on PyPI is named jang, not jang-tools (https://pypi.org/project/jang/jang-tools returns 404). Either the published name has changed since you tested, or the imports here won't resolve on a clean install. Please:

  • Confirm the actual import path on a fresh venv (uv venv && uv pip install jang && python -c "import jang_tools" vs import jang).
  • Pin the exact version in the [jang] extra (jang-tools>=X.Y or jang>=X.Y) — this is a single-maintainer dependency with custom Metal kernels, so an unpinned floor is risky.
  • Add a one-line note in the PR description acknowledging the JANGQ-AI ecosystem is a small Apple-Silicon community (no academic backing, single primary maintainer at jangq.ai) so reviewers understand the supply-chain shape.

Happy to review the loader-only PR once it's split out — that part looks reasonable on first read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants