feat(oq): extend calibration data with code, tool-calling, and reasoning #1246
a4501150 wants to merge 1 commit into
Quick concern before merging. The loader already subsamples down to 128 × 256 = 32k tokens regardless of corpus size, so going from 600 to 4568 samples doesn't really translate into a stronger sensitivity signal on its own. What actually changes is the per-category token weight inside that 32k. And the distribution shift looks off:

Three things I'd like to clear up before merging:
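To make the budget arithmetic in the comment above concrete, here is a minimal sketch. It is not the project's loader: the function names are hypothetical, and it assumes the subsample is drawn roughly in proportion to each category's share of the corpus, which the thread does not confirm.

```python
# Illustrative only: hypothetical helpers, not the oQ loader.
N_SEQUENCES = 128   # sequences kept after subsampling (from the review comment)
SEQ_LEN = 256       # tokens per sequence (from the review comment)
TOKEN_BUDGET = N_SEQUENCES * SEQ_LEN  # fixed budget, independent of corpus size


def category_shares(tokens_per_category):
    """Return each category's fraction of the total corpus tokens."""
    total = sum(tokens_per_category.values())
    return {name: count / total for name, count in tokens_per_category.items()}


def expected_budget_tokens(tokens_per_category, budget=TOKEN_BUDGET):
    """Expected tokens per category inside the fixed budget.

    Assumption: the subsample is proportional to corpus share.
    """
    shares = category_shares(tokens_per_category)
    return {name: round(share * budget) for name, share in shares.items()}
```

Under that assumption, adding more samples to one category does not enlarge the window; it only shifts how the same 32,768 tokens are divided between categories.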
Keep upstream's multilingual base (en/ko/zh/ja) verbatim and add new samples for the three categories that benefit most from calibration diversity: code, tool_calling, and reasoning.

Addresses review feedback on PR jundot#1246:
- Remove unreachable categories (bartowski/chat/mixed) that were never loaded by _load_builtin_calibration's code_multilingual path
- Rebalance tool_calling from 78% down to 18% of token weight
- Re-route <think>-containing samples into "reasoning" instead of dead "chat"; split agent traces from pure tool_calling
- All sampling is deterministic (hash-sorted, no random seed)
- File size: 1.6 MB (upstream 1.3 MB, was 18 MB in original PR)

Build script (scripts/build_calibration_data.py) produces only the 3 focus categories; multilingual data comes from upstream's JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
01f669a to 8ceb39f
Thanks for the detailed review. All three concerns were valid.

Addressed concerns:

(1) tool_calling dominance → fixed with char-budget sampling

tool_calling dropped from ~85% to 18% of token weight. Instead of sampling by count (500 samples × 16K chars/sample = 8 MB), we now sample complete items to a 200K char budget using deterministic hash-sorted selection. This gives 17 complete tool-calling conversations: enough diversity without drowning out the other categories.

(2) unreachable categories → removed entirely

Additionally, (3) justification for reasoning in calibration

Fair point that calibration reshapes per-layer sensitivity rather than retraining. The argument is the same as imatrix calibration in llama.cpp/unsloth: the text distribution seen during sensitivity measurement determines which layers the quantizer treats as important. If calibration only sees natural language, layers that activate differently for structured outputs (JSON tool calls, …

We're not claiming reasoning samples improve reasoning ability, just that they help the quantizer assign bits to the layers that matter for the model's actual workload mix. Each category now gets ~18% of the 32K-token sensitivity window instead of tool_calling getting 85%.

Upstream multilingual data (en/ko/zh/ja) is kept verbatim; the file grew from 1.3 MB to 1.6 MB (was 18 MB before this rework). All sampling is fully deterministic (hash-sorted, no random seed).
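The char-budget sampling described in point (1) could look roughly like the sketch below. This is an illustration, not the code in scripts/build_calibration_data.py: the function name, the choice of SHA-256 as the sort key, and the decision to skip (rather than stop at) an item that would overflow the budget are all assumptions.

```python
import hashlib


def sample_to_char_budget(items, char_budget):
    """Pick whole items in hash order until the character budget is spent.

    Hypothetical helper. Sorting by a content hash gives a shuffled-looking
    but fully reproducible order without any random seed.
    """
    ordered = sorted(
        items,
        key=lambda text: hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )
    picked, used = [], 0
    for text in ordered:
        if used + len(text) > char_budget:
            continue  # assumption: skip oversized items, a later one may fit
        picked.append(text)  # items are never truncated
        used += len(text)
    return picked
```

Because the order depends only on item content, the same corpus yields the same selection regardless of input order, e.g. `sample_to_char_budget(conversations, 200_000)` for the tool-calling budget mentioned above (`conversations` being a hypothetical list of strings).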
Summary
Extend the shipped calibration data with additional code, tool-calling, and reasoning samples to improve per-layer sensitivity diversity during oQ quantization. Inspired by the imatrix calibration approach (unsloth/llama.cpp): category-aware sampling ensures the sensitivity measurement sees balanced token distributions across workload types, rather than being dominated by one category.
Changes:
- <think> conversations, reasoning-exl, hermes agent traces with <think> tags)
- scripts/build_calibration_data.py rebuilds only the 3 focus categories and merges with the existing JSON's multilingual keys on write

Addresses review feedback:
- _load_builtin_calibration line 2725 only iterates (code, en, ko, zh, ja, tool_calling, reasoning)
- <think>-containing samples from dead "chat" category into "reasoning"; split agent traces with <think> tags from pure tool_calling

Final distribution:
Test plan
- pytest tests/test_oq.py: 204 passed
- _load_builtin_calibration code, multilingual, and code_multilingual dataset paths all produce sufficient tokens (≥32K)

🤖 Generated with Claude Code
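For readers who want a picture of the merge-on-write behaviour described under Changes, here is a minimal sketch. It is not the actual build script: the function name is hypothetical, and it assumes the calibration JSON is a top-level object mapping category names to lists of sample strings.

```python
# Hypothetical sketch of merge-on-write; the JSON layout is an assumption.
MULTILINGUAL_KEYS = ("en", "ko", "zh", "ja")           # kept verbatim from upstream
FOCUS_CATEGORIES = ("code", "tool_calling", "reasoning")  # rebuilt by the script


def merge_calibration(existing, rebuilt):
    """Combine upstream multilingual samples with freshly rebuilt categories."""
    # Carry over only the multilingual keys, untouched.
    merged = {key: existing[key] for key in MULTILINGUAL_KEYS if key in existing}
    # Replace the focus categories with the rebuilt versions.
    for name in FOCUS_CATEGORIES:
        merged[name] = rebuilt[name]
    return merged
```

In this sketch any other key in the existing file (for example a category the loader never reads) is dropped rather than carried over, which is one way to keep the file limited to categories that are actually loaded.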