feat(oq): extend calibration data with code, tool-calling, and reasoning #1246

Open
a4501150 wants to merge 1 commit into jundot:main from a4501150:pr/oq-calibration

Conversation

@a4501150
Contributor

@a4501150 a4501150 commented May 13, 2026

Summary

Extend the shipped calibration data with additional code, tool-calling, and reasoning samples to improve per-layer sensitivity diversity during oQ quantization. Inspired by the imatrix calibration approach (unsloth/llama.cpp) — category-aware sampling ensures the sensitivity measurement sees balanced token distributions across workload types, rather than being dominated by one category.

Changes:

  • Add new samples to code (mixed-exl, reasoning-exl), tool_calling (hermes function-calling + agent traces), and reasoning (qwen3-dwq <think> conversations, reasoning-exl, hermes agent traces with <think> tags)
  • Keep upstream multilingual data (en/ko/zh/ja) verbatim — no changes
  • Each focus category sampled to a ~200K char budget using deterministic hash-sorted selection (fully reproducible, no random seed)
  • scripts/build_calibration_data.py rebuilds only the 3 focus categories and merges with the existing JSON's multilingual keys on write
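
The char-budget selection in the bullets above could look roughly like the sketch below. It is an illustration only, not the code in scripts/build_calibration_data.py: the helper name `select_to_budget`, the choice of SHA-256 as the sort key, and the rule of skipping a sample that would overflow the budget are all assumptions.

```python
import hashlib


def select_to_budget(samples, budget_chars=200_000):
    """Pick whole samples in a stable hash order until the char budget is used.

    Hypothetical helper: the hash function and the overflow rule are assumptions.
    """
    # Sorting by content hash gives a shuffle-like order that depends only on
    # the text itself, so the result is reproducible without a random seed.
    ordered = sorted(
        set(samples),  # drop exact duplicates before ordering
        key=lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest(),
    )
    selected, used = [], 0
    for text in ordered:
        if used + len(text) > budget_chars:
            continue  # would overflow the budget; a shorter sample may still fit
        selected.append(text)
        used += len(text)
    return selected
```

Because only complete samples are kept, a category with very long items (such as the ~16K-char agent traces) ends up with few samples but the same character total as the others.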

Addresses review feedback:

  • Removed unreachable categories (bartowski/chat/mixed) — _load_builtin_calibration line 2725 only iterates (code, en, ko, zh, ja, tool_calling, reasoning)
  • Rebalanced tool_calling from 78% → 18% of token weight (was dominated by 16K char/sample hermes traces)
  • Re-routed <think>-containing samples from dead "chat" category into "reasoning"; split agent traces with <think> tags from pure tool_calling
  • File size: 1.6 MB (upstream 1.3 MB) — was 18 MB in the original version

Final distribution:

| Category | Samples | Chars | Weight | Source |
|---|---|---|---|---|
| code | 169 | 200K | 17.9% | merged + hash-selected |
| en | 150 | 170K | 15.2% | upstream verbatim |
| ko | 60 | 130K | 11.6% | upstream verbatim |
| zh | 50 | 93K | 8.3% | upstream verbatim |
| ja | 60 | 124K | 11.1% | upstream verbatim |
| tool_calling | 17 | 200K | 17.9% | merged + hash-selected |
| reasoning | 119 | 200K | 17.9% | merged + hash-selected |
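
The Weight column is consistent with each category's share of total characters. The short check below reproduces it from the Chars column; treating character share as a stand-in for token weight is an assumption, since the exact ratio depends on the tokenizer.

```python
# Character totals (in thousands) copied from the distribution table.
chars_k = {
    "code": 200, "en": 170, "ko": 130, "zh": 93, "ja": 124,
    "tool_calling": 200, "reasoning": 200,
}

total_k = sum(chars_k.values())
# Share of each category, in percent of all calibration characters.
weights = {name: 100 * k / total_k for name, k in chars_k.items()}

for name, pct in weights.items():
    print(f"{name:<13}{pct:5.1f}%")
```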

Test plan

  • pytest tests/test_oq.py — 204 passed
  • Verified all 7 JSON keys are reachable by _load_builtin_calibration
  • Verified code, multilingual, and code_multilingual dataset paths all produce sufficient tokens (≥32K)
  • No dead categories in the JSON
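
The reachability and dead-category checks could be expressed as the small sketch below. The category tuple is the one quoted in this description; the helper names and the assumed JSON shape (a dict mapping category name to a list of text samples) are hypothetical and not taken from tests/test_oq.py.

```python
# Categories the loader iterates, as listed in this PR description.
REACHABLE = ("code", "en", "ko", "zh", "ja", "tool_calling", "reasoning")


def find_dead_categories(calibration: dict) -> set:
    """Return keys present in the JSON that the loader would never read."""
    return set(calibration) - set(REACHABLE)


def find_missing_categories(calibration: dict) -> set:
    """Return reachable categories that are absent or empty in the JSON."""
    return {name for name in REACHABLE if not calibration.get(name)}
```

Both sets should be empty for the shipped file.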

🤖 Generated with Claude Code

@jundot
Owner

jundot commented May 19, 2026

Quick concern before merging. The loader already subsamples down to 128 × 256 = 32k tokens regardless of corpus size, so going from 600 to 4568 samples doesn't really translate into a stronger sensitivity signal on its own. What actually changes is the per-category token weight inside that 32k.

And the distribution shift looks off: tool_calling (hermes-agent-reasoning-traces, ~16k chars/sample × 500) ends up around 85% of the token weight in default code_multilingual mode, while ko / ja / zh together drop from ~25% to ~3%. bartowski / chat / mixed / reasoning are in the JSON but oq.py:2715 never reaches them.

Three things I'd like to clear up before merging:
(1) per-sample length cap on tool_calling, or a reason the dominance is intended
(2) the new reasoning samples aren't reached by the default code_multilingual loader at all, so the "reasoning samples" claim in the PR title doesn't actually land
(3) is there a concrete reason this dataset needs to be merged? I'm not convinced adding reasoning samples to calibration improves reasoning at inference, since calibration just reshapes per-layer sensitivity scores rather than retraining anything. ko / ja are unchanged at 60 / 60 samples vs the old JSON, and the wheel-size increase is roughly 14× (1.29 MB → 17.95 MB), so some concrete quality or behavior delta would help justify the change.

a4501150 added a commit to a4501150/omlx that referenced this pull request May 20, 2026
Keep upstream's multilingual base (en/ko/zh/ja) verbatim and add new
samples for the three categories that benefit most from calibration
diversity: code, tool_calling, and reasoning.

Addresses review feedback on PR jundot#1246:
- Remove unreachable categories (bartowski/chat/mixed) that were never
  loaded by _load_builtin_calibration's code_multilingual path
- Rebalance tool_calling from 78% down to 18% of token weight
- Re-route <think>-containing samples into "reasoning" instead of dead
  "chat"; split agent traces from pure tool_calling
- All sampling is deterministic (hash-sorted, no random seed)
- File size: 1.6 MB (upstream 1.3 MB, was 18 MB in original PR)

Build script (scripts/build_calibration_data.py) produces only the 3
focus categories; multilingual data comes from upstream's JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@a4501150 a4501150 force-pushed the pr/oq-calibration branch from 01f669a to 8ceb39f May 20, 2026 15:23
@a4501150 a4501150 changed the title from "feat(oq): update calibration data with reasoning + ko/ja samples" to "feat(oq): extend calibration data with code, tool-calling, and reasoning" May 20, 2026
@a4501150
Contributor Author

a4501150 commented May 20, 2026

Thanks for the detailed review. All three concerns were valid; here is how each was addressed:

(1) tool_calling dominance → fixed with char-budget sampling

tool_calling dropped from ~85% to 18% of token weight. Instead of sampling by count (500 samples × 16K chars/sample = 8 MB), we now sample complete items to a 200K char budget using deterministic hash-sorted selection. This gives 17 complete tool-calling conversations — enough diversity without drowning out the other categories.

(2) unreachable categories → removed entirely

bartowski, chat, and mixed are gone from the JSON. The build script now only produces the 3 categories the code_multilingual loader actually reaches (code, tool_calling, reasoning). On write, it merges with the existing JSON's multilingual keys so en/ko/zh/ja are preserved.
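
A minimal sketch of that merge-on-write step, assuming the calibration file is a flat JSON object keyed by category. The function name `merge_and_write` and the file layout are assumptions for illustration, not the actual build script.

```python
import json
from pathlib import Path

# Upstream multilingual categories that must survive a rebuild unchanged.
MULTILINGUAL_KEYS = ("en", "ko", "zh", "ja")


def merge_and_write(path, focus):
    """Write rebuilt focus categories while keeping multilingual keys as-is."""
    path = Path(path)
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Carry over only the multilingual keys; anything else in the old file
    # (for example a dead category) is dropped rather than preserved.
    merged = {k: existing[k] for k in MULTILINGUAL_KEYS if k in existing}
    merged.update(focus)
    path.write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return merged
```

Writing with `ensure_ascii=False` keeps the ko/zh/ja text readable in the file instead of escaping it.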

Additionally, <think>-containing samples that were landing in dead "chat" (252 of 500) have been re-routed into "reasoning", and hermes agent traces are now split on <think> presence — reasoning-style ones go to "reasoning", pure function-calling ones stay in "tool_calling".
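
The split on `<think>` presence amounts to a one-line routing rule. The sketch below assumes each trace is a single string and that a literal `<think>` substring is the only signal used; the function names are hypothetical.

```python
def route_agent_trace(text: str) -> str:
    """Return 'reasoning' if the trace contains a <think> tag, else 'tool_calling'."""
    return "reasoning" if "<think>" in text else "tool_calling"


def split_traces(traces):
    """Group traces into the two categories by the routing rule above."""
    buckets = {"reasoning": [], "tool_calling": []}
    for text in traces:
        buckets[route_agent_trace(text)].append(text)
    return buckets
```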

(3) justification for reasoning in calibration

Fair point that calibration reshapes per-layer sensitivity rather than retraining. The argument is the same as imatrix calibration in llama.cpp/unsloth: the text distribution seen during sensitivity measurement determines which layers the quantizer treats as important. If calibration only sees natural language, layers that activate differently for structured outputs (JSON tool calls, <think> reasoning chains, code syntax) get their sensitivity underestimated, leading to worse quantization for those workloads.

We're not claiming reasoning samples improve reasoning ability, only that they help the quantizer assign bits to the layers that matter for the model's actual workload mix. Each of the three focus categories (code, tool_calling, reasoning) now gets ~18% of the 32K-token sensitivity window, instead of tool_calling alone taking ~85%.

Upstream multilingual data (en/ko/zh/ja) is kept verbatim — file grew from 1.3 MB to 1.6 MB (was 18 MB before this rework). All sampling is fully deterministic (hash-sorted, no random seed).

2 participants