feat(oq): extend calibration data with code, tool-calling, and reasoning by a4501150 · Pull Request #1246 · jundot/omlx

a4501150 · 2026-05-13T18:02:21Z

Summary

Extend the shipped calibration data with additional code, tool-calling, and reasoning samples so that the per-layer sensitivity measurement in oQ quantization sees a more diverse set of workloads. Inspired by the imatrix calibration approach (unsloth/llama.cpp): category-aware sampling ensures the sensitivity measurement sees balanced token distributions across workload types rather than being dominated by one category.

Changes:

Add new samples to code (mixed-exl, reasoning-exl), tool_calling (hermes function-calling + agent traces), and reasoning (qwen3-dwq <think> conversations, reasoning-exl, hermes agent traces with <think> tags)
Keep upstream multilingual data (en/ko/zh/ja) verbatim — no changes
Each focus category sampled to a ~200K char budget using deterministic hash-sorted selection (fully reproducible, no random seed)
scripts/build_calibration_data.py rebuilds only the 3 focus categories and merges with the existing JSON's multilingual keys on write

Addresses review feedback:

Removed unreachable categories (bartowski/chat/mixed) — _load_builtin_calibration line 2725 only iterates (code, en, ko, zh, ja, tool_calling, reasoning)
Rebalanced tool_calling from 78% → 18% of token weight (was dominated by 16K char/sample hermes traces)
Re-routed <think>-containing samples from dead "chat" category into "reasoning"; split agent traces with <think> tags from pure tool_calling
File size: 1.6 MB (upstream 1.3 MB) — was 18 MB in the original version

Final distribution:

| Category | Samples | Chars | Weight | Source |
|---|---|---|---|---|
| code | 169 | 200K | 17.9% | merged + hash-selected |
| en | 150 | 170K | 15.2% | upstream verbatim |
| ko | 60 | 130K | 11.6% | upstream verbatim |
| zh | 50 | 93K | 8.3% | upstream verbatim |
| ja | 60 | 124K | 11.1% | upstream verbatim |
| tool_calling | 17 | 200K | 17.9% | merged + hash-selected |
| reasoning | 119 | 200K | 17.9% | merged + hash-selected |
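
As a quick sanity check on the Weight column, here is a minimal sketch (not part of the PR) that recomputes each share from the Chars column. It assumes the weight is simply the category's fraction of total characters; `CHARS_K` and `category_weights` are illustrative names, not identifiers from the repository.

```python
# Character totals in thousands, copied from the distribution table.
CHARS_K = {
    "code": 200,
    "en": 170,
    "ko": 130,
    "zh": 93,
    "ja": 124,
    "tool_calling": 200,
    "reasoning": 200,
}


def category_weights(chars):
    """Return each category's share of the total character count, in percent."""
    total = sum(chars.values())
    return {name: round(100 * count / total, 1) for name, count in chars.items()}


weights = category_weights(CHARS_K)
for name, pct in weights.items():
    print(f"{name:<13}{pct:>5}%")
```

Under that assumption the recomputed shares match the table to one decimal place, which suggests the Weight column is character-based rather than measured in tokens.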

Test plan

pytest tests/test_oq.py — 204 passed
Verified all 7 JSON keys are reachable by _load_builtin_calibration
Verified code, multilingual, and code_multilingual dataset paths all produce sufficient tokens (≥32K)
No dead categories in the JSON
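
The reachability checks above could be automated along these lines. This is a hedged sketch: the `REACHABLE` tuple is copied from the category list quoted in this description, and `dead_categories` / `missing_categories` are hypothetical helpers, not functions in `oq.py` or `tests/test_oq.py`.

```python
# Categories the loader is described as iterating over (taken from the PR text,
# not read from oq.py itself).
REACHABLE = ("code", "en", "ko", "zh", "ja", "tool_calling", "reasoning")


def dead_categories(calibration, reachable=REACHABLE):
    """Keys present in the calibration JSON that the loader would never read."""
    return sorted(set(calibration) - set(reachable))


def missing_categories(calibration, reachable=REACHABLE):
    """Categories the loader expects that are absent from the JSON."""
    return sorted(set(reachable) - set(calibration))


# Toy data standing in for the parsed JSON file.
example = {name: ["sample text"] for name in REACHABLE}
example["chat"] = ["never loaded"]
print(dead_categories(example))
print(missing_categories(example))
```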

🤖 Generated with Claude Code

jundot · 2026-05-19T04:42:57Z

Quick concern before merging. The loader already subsamples down to 128 × 256 = 32k tokens regardless of corpus size, so going from 600 to 4568 samples doesn't really translate into a stronger sensitivity signal on its own. What actually changes is the per-category token weight inside that 32k.

And the distribution shift looks off: tool_calling (hermes-agent-reasoning-traces, ~16k chars/sample × 500) ends up around 85% of the token weight in default code_multilingual mode, while ko / ja / zh together drop from ~25% to ~3%. bartowski / chat / mixed / reasoning are in the JSON but oq.py:2715 never reaches them.

Three things I'd like to clear up before merging:
(1) per-sample length cap on tool_calling, or a reason the dominance is intended
(2) the new reasoning samples aren't reached by the default code_multilingual loader at all, so the "reasoning samples" claim in the PR title doesn't actually land
(3) is there a concrete reason this dataset needs to be merged? I'm not convinced adding reasoning samples to calibration improves reasoning at inference, since calibration just reshapes per-layer sensitivity scores rather than retraining anything. ko / ja are unchanged at 60 / 60 samples vs the old JSON, and the wheel-size increase is roughly 14× (1.29 MB → 17.95 MB), so some concrete quality or behavior delta would help justify the change.

Keep upstream's multilingual base (en/ko/zh/ja) verbatim and add new samples for the three categories that benefit most from calibration diversity: code, tool_calling, and reasoning.

Addresses review feedback on PR jundot#1246:
- Remove unreachable categories (bartowski/chat/mixed) that were never loaded by _load_builtin_calibration's code_multilingual path
- Rebalance tool_calling from 78% down to 18% of token weight
- Re-route <think>-containing samples into "reasoning" instead of dead "chat"; split agent traces from pure tool_calling
- All sampling is deterministic (hash-sorted, no random seed)
- File size: 1.6 MB (upstream 1.3 MB, was 18 MB in original PR)

Build script (scripts/build_calibration_data.py) produces only the 3 focus categories; multilingual data comes from upstream's JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

a4501150 · 2026-05-20T15:26:37Z

Thanks for the detailed review. All three concerns were valid; here is how each is addressed:

(1) tool_calling dominance → fixed with char-budget sampling

tool_calling dropped from ~85% to 18% of token weight. Instead of sampling by count (500 samples × 16K chars/sample = 8 MB), we now sample complete items to a 200K char budget using deterministic hash-sorted selection. This gives 17 complete tool-calling conversations — enough diversity without drowning out the other categories.
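
For illustration, char-budget selection with a hash-sorted order might look like the sketch below. It is not the code in `scripts/build_calibration_data.py`: the choice of SHA-256, the function name `select_to_budget`, and the policy of skipping an oversized sample instead of stopping are all assumptions.

```python
import hashlib


def select_to_budget(samples, char_budget):
    """Pick whole samples, in content-hash order, until the char budget is spent."""
    # Sorting by a content hash gives a shuffle-like order that depends only on
    # the texts themselves, so no random seed is needed.
    ordered = sorted(
        samples,
        key=lambda text: hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )
    picked, used = [], 0
    for text in ordered:
        if used + len(text) > char_budget:
            continue  # assumption: skip items that overshoot, keep trying shorter ones
        picked.append(text)
        used += len(text)
    return picked


# Toy inputs; real tool-calling traces are far longer.
toy = ["a" * 50, "b" * 30, "c" * 40, "d" * 10]
first = select_to_budget(toy, 80)
second = select_to_budget(list(reversed(toy)), 80)
print(first == second)
```

Because the order depends only on sample content, the same input set yields the same selection regardless of how the source files were read.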

(2) unreachable categories → removed entirely

bartowski, chat, and mixed are gone from the JSON. The build script now only produces the 3 categories the code_multilingual loader actually reaches (code, tool_calling, reasoning). On write, it merges with the existing JSON's multilingual keys so en/ko/zh/ja are preserved.
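
The merge-on-write step can be sketched as a pure function over the two dictionaries. This is an assumption-laden illustration, not the build script's code: `merge_calibration`, `MULTILINGUAL`, and `FOCUS` are hypothetical names, and the JSON is assumed to map category names to lists of sample strings.

```python
MULTILINGUAL = ("en", "ko", "zh", "ja")
FOCUS = ("code", "tool_calling", "reasoning")


def merge_calibration(existing, rebuilt):
    """Keep only the multilingual keys from `existing`, then add the rebuilt focus keys."""
    # Anything else in the old file (for example a stale "chat" key) is dropped.
    merged = {key: existing[key] for key in MULTILINGUAL if key in existing}
    merged.update({key: rebuilt[key] for key in FOCUS})
    return merged


# Toy data standing in for the old JSON and the freshly built categories.
old_json = {"en": ["hello"], "ko": ["안녕하세요"], "chat": ["stale"], "code": ["old"]}
new_focus = {"code": ["new"], "tool_calling": ["call"], "reasoning": ["think"]}
merged = merge_calibration(old_json, new_focus)
print(sorted(merged))
```

The result can then be written back with `json.dump(merged, f, ensure_ascii=False)` so the non-ASCII multilingual text stays readable in the file.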

Additionally, <think>-containing samples that were landing in dead "chat" (252 of 500) have been re-routed into "reasoning", and hermes agent traces are now split on <think> presence — reasoning-style ones go to "reasoning", pure function-calling ones stay in "tool_calling".
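
The split on `<think>` presence amounts to a one-line routing rule. The sketch below assumes a plain substring check is enough to detect a reasoning-style trace; `route_agent_trace` and `partition_traces` are illustrative names, not functions from the build script.

```python
def route_agent_trace(text):
    """Send traces containing a <think> block to "reasoning", the rest to "tool_calling"."""
    return "reasoning" if "<think>" in text else "tool_calling"


def partition_traces(traces):
    """Group traces by the category that route_agent_trace assigns."""
    buckets = {"reasoning": [], "tool_calling": []}
    for text in traces:
        buckets[route_agent_trace(text)].append(text)
    return buckets


# Toy traces; real hermes agent traces are much longer.
traces = [
    '<think>Need the weather first.</think> {"name": "get_weather"}',
    '{"name": "get_time", "arguments": {}}',
]
buckets = partition_traces(traces)
print({name: len(items) for name, items in buckets.items()})
```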

(3) justification for reasoning in calibration

Fair point that calibration reshapes per-layer sensitivity rather than retraining. The argument is the same as imatrix calibration in llama.cpp/unsloth: the text distribution seen during sensitivity measurement determines which layers the quantizer treats as important. If calibration only sees natural language, layers that activate differently for structured outputs (JSON tool calls, <think> reasoning chains, code syntax) get their sensitivity underestimated, leading to worse quantization for those workloads.

We're not claiming reasoning samples improve reasoning ability — just that they help the quantizer assign bits to the layers that matter for the model's actual workload mix. Each category now gets ~18% of the 32K-token sensitivity window instead of tool_calling getting 85%.

Upstream multilingual data (en/ko/zh/ja) is kept verbatim — file grew from 1.3 MB to 1.6 MB (was 18 MB before this rework). All sampling is fully deterministic (hash-sorted, no random seed).

a4501150 mentioned this pull request May 14, 2026

fix(load): VLM model loading fixes for oQ-quantized checkpoints #1247

Merged

3 tasks

a4501150 force-pushed the pr/oq-calibration branch from 01f669a to 8ceb39f · May 20, 2026 15:23

a4501150 changed the title ~~feat(oq): update calibration data with reasoning + ko/ja samples~~ feat(oq): extend calibration data with code, tool-calling, and reasoning May 20, 2026

a4501150 wants to merge 1 commit into jundot:main from a4501150:pr/oq-calibration
