Updated: 2026-04-28 (post-squash separation). Status: Active. Current version: v9.14.0 (squish-only). Squash compliance layer extracted to `konjoai/squash` (Apache 2.0, `pip install squash-ai`).
| Wave | Feature | Status |
|---|---|---|
| W85 | CLI color dedup / _term.py consolidation | ✅ |
| W86 | Observability profiler + squish trace | ✅ |
| W87 | Agent tool execution fix + tool_name_map.py | ✅ |
| W88 | Ollama/LocalAI drop-in compat | ✅ |
| W89 | Local model scanner + ollama:/hf: URI schemes | ✅ |
| W90 | Lean startup profiler + FeatureState refactor | ✅ |
| W91 | Sub-3s TTFT blazing default + 70B loader | ✅ |
| W92 | Pre-compress pipeline + HF batch upload workflow | ✅ |
| W93 | macOS SquishBar (Swift: model picker, progress, hotkey) | ✅ |
| W94 | Cross-platform support review | ✅ |
| W95 | README final audit + public release (v68.0.0) | ✅ |
| W96–W99 | LM Studio compat, inference fixes, lean server, speed restore | ✅ |
- `squish/squash/` extracted to standalone repo `konjoai/squash`
- 80 `tests/test_squash_*.py` removed from the squish test suite
- `squish/server.py` and `squish/cli.py` updated to import from the standalone `squash` package (optional dependency)
- `pyproject.toml`: removed `squash`/`squash-api` extras and the `squash` CLI entry point
| Format | Model | Gate |
|---|---|---|
| INT4 AWQ g=32 | Qwen2.5-1.5B | ≥ 70.6% arc_easy |
| INT3 g=32 | Qwen2.5-1.5B | ≥ 67.2% arc_easy |
| INT3 | gemma-3-*b ≤4B | BLOCKED (−15pp) |
| INT3 | Qwen3-4B | BLOCKED (−14.8pp) |
| INT2 naive | any | NEVER SHIP (~29% ≈ random) |
| SQINT2 | Qwen2.5-7B | TARGET ≥ 65% arc_easy (W103) |
| INT2 KV | Qwen2.5-7B @ 32K | TARGET PPL Δ ≤ +0.5 nats (W104) |
Why: `squish pull hf:<repo>` downloads model weights before scanning. An adversarial HF model can trigger ACE at load time before the post-load scan runs. This closes the pre-load attack surface.
Changes (2026-04-28):
- `squish/serving/local_model_scanner.py`: added `HFFileSummary`, `HFRepoScanResult`, `scan_hf_repo_metadata(repo_id, token) → HFRepoScanResult`, and `_classify_hf_siblings()`. Native pickle-header classification — no `modelscan` dep.
- `squish/cli.py`: `_pull_from_hf` calls `scan_hf_repo_metadata` before `snapshot_download`; prints a compact scan report; aborts with `sys.exit(2)` on `status="unsafe"`. API errors allow download with a warning (firewall / private-repo safe). Post-download `scan_before_load()` byte scan retained as a second layer.
- `tests/test_predownload_scan.py`: 30 new tests (total: 48). All HF API calls mocked. `_classify_hf_siblings` tested at unit level; `scan_hf_repo_metadata` tested with mocked HTTP including 401/404/URLError/unexpected-structure paths.
Gate: 48/48 tests pass. `squish pull hf:` aborts on an unsafe model before any bytes are transferred. Zero new mandatory dependencies.
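For illustration, a minimal sketch of the extension-level classification idea behind `_classify_hf_siblings` — the helper name `classify_siblings`, the suffix sets, and the decision rule here are assumptions for readability, not the shipped logic (which additionally reads pickle headers natively):

```python
import os

# Illustrative suffix sets — pickle-capable formats can execute code on load.
PICKLE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".ckpt"}

def classify_siblings(filenames):
    """Return 'unsafe' when pickle-capable weights are listed without a
    safetensors alternative, else 'ok' (sketch of the pre-download idea)."""
    suffixes = {os.path.splitext(name)[1].lower() for name in filenames}
    has_pickle = bool(suffixes & PICKLE_SUFFIXES)
    has_safetensors = ".safetensors" in suffixes
    if has_pickle and not has_safetensors:
        return "unsafe"
    return "ok"
```

The point of scanning file *names* before `snapshot_download` is that the verdict needs zero bytes of weight data — only the repo metadata listing.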
Why: Eliminate the GIL bottleneck on quantised GEMV. The `squish_quant_rs/` scaffold exists; native Rayon (consistent with every other kernel in the 5,500-line crate) is preferred over candle to avoid a heavy dependency.
Changes (2026-04-28):
- `squish_quant_rs/src/lib.rs`: `quantized_matmul_int4(w_codes, scales, offsets, x, group_size)` — fused INT4 asymmetric dequantize + GEMV, parallelised over output features via Rayon, GIL released via `py.allow_threads()`. Registered in `#[pymodule]`.
- `squish/quant/quantizer.py`: `quantized_matmul_int4()` public API — Rust-first, `_quantized_matmul_int4_numpy()` NumPy fallback. `get_backend_info()` reports an `"int4_matmul_rust"` key.
- `tests/test_rust_matmul.py`: 18 tests — shape/dtype contract, NumPy fallback correctness, Rust kernel correctness vs fallback (skipped when Rust not built), error paths, backend info.
Gate: 18/18 tests pass. `get_backend_info()["int4_matmul_rust"] == True`. The Python NumPy fallback passes without a Rust build. Zero new mandatory dependencies.
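A minimal NumPy sketch of what the `_quantized_matmul_int4_numpy` fallback computes — the unpacked `uint8` code layout here is an assumption for readability; the shipped kernel operates on packed codes:

```python
import numpy as np

def quantized_matmul_int4_numpy(w_codes, scales, offsets, x, group_size):
    """Fused INT4 asymmetric dequantize + GEMV, NumPy fallback sketch.

    w_codes: (out_features, in_features) uint8 codes in [0, 15]
             (unpacked layout assumed for clarity)
    scales, offsets: (out_features, in_features // group_size) float32
    x: (in_features,) float32
    """
    out_features, in_features = w_codes.shape
    assert in_features % group_size == 0
    s = np.repeat(scales, group_size, axis=1)   # broadcast per-group scale
    z = np.repeat(offsets, group_size, axis=1)  # broadcast per-group zero-point
    w = (w_codes.astype(np.float32) - z) * s    # asymmetric dequantize
    return w @ x                                # GEMV
```

The Rust kernel fuses the dequantize into the dot-product loop per output row, which is what Rayon parallelises; the fallback materialises `w` for simplicity.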
| Format | Model | Gate | Last validated |
|---|---|---|---|
| INT4 AWQ g=32 | Qwen2.5-1.5B | ≥ 70.6% arc_easy | 2026-03-28 |
| INT3 g=32 | Qwen2.5-1.5B | ≥ 67.2% arc_easy | 2026-03-28 |
| INT3 | gemma-3-*b ≤4B | BLOCKED | confirmed |
| INT3 | Qwen3-4B | BLOCKED | confirmed |
| SQINT2 | Qwen2.5-7B | ≥ 65% arc_easy (target 67%) | TARGET — W103 |
| INT2 KV | Qwen2.5-7B 32K | PPL Δ ≤ +0.5 nats vs INT4 KV | TARGET — W104 |
| Model | Format | Peak RSS |
|---|---|---|
| Qwen2.5-1.5B | INT4 | < 1.5 GB |
| Qwen2.5-1.5B | INT3 | < 1.0 GB |
| Qwen3:8B | INT4 | DO NOT RUN (14 GB crash) |
| Qwen3:8B | INT3 | < 4.0 GB |
| gemma-3-4b | INT4 | < 8.7 GB |
# Full Python test suite
python3 -m pytest tests/ -v --timeout=120
# Python-only mode
python3 -m pytest tests/ -v -k "not mojo"
# Rust workspace
cargo test --workspace --locked
# Install dev dependencies
pip install -e ".[dev,eval,linux]"
# JavaScript bindings
cd js && npm install && npm run build

Why: 44 pre-existing test failures obscured CI signal; the W101 Rust GEMV kernel had no user-facing validation path. W102 eliminates both gaps.
Changes (2026-04-28):
- `squish/cli.py`: `build_parser()` — unguarded `importlib.metadata.version("squish")` at the `--version` argument wrapped in try/except — falls back to `squish.__version__`. Fixes 28 failures caused by the package not being installed in the dev Python 3.9 environment.
- `squish/cli.py`: new `cmd_bench()` and `bench` subcommand — `squish bench [--format int4|int8] [--batch N] [--in-features F] [--out-features F] [--group-size G] [--iters N] [--warmup N]`. Reports p50/p95/p99 latency, GOPS, and GB/s. Uses the Rust kernel when available, NumPy fallback otherwise.
- `squish/kv/radix_cache.py`: removed `strict=False` from 3 `zip()` calls (Python 3.9 compatibility — `strict=` was added in Python 3.10). Fixes 8 failures.
- `tests/test_wave123–126_*.py`: bumped the server.py line-count ceiling 4743 → 4750 to account for the squash-governor comment block added in W100. Fixes 4 failures.
- `tests/test_quant_aqlm.py`: updated the module-count assertion 121 → 83 (38 squash modules extracted in the squash separation). Fixes 1 failure.
- `tests/test_bench.py`: 25 new tests — subcommand registration, default args, output structure (INT4 + INT8), argument roundtrip, invalid-format rejection.
Gate: 25/25 bench tests pass. Full suite: 44 pre-existing failures → 3 (the 3 remaining call `importlib.metadata.version("squish")` directly — they require `pip install` and pass in Python 3.10 CI). Zero new failures introduced.
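A sketch of the percentile-reporting core of `squish bench` — the helper name `bench_percentiles` is illustrative, and the real subcommand additionally derives GOPS and GB/s from the kernel's shape:

```python
import time
import numpy as np

def bench_percentiles(fn, iters=100, warmup=10):
    """Warm up, time each call, report p50/p95/p99 latency in milliseconds
    (illustrative sketch of the bench subcommand's reporting logic)."""
    for _ in range(warmup):
        fn()                                  # warmup iterations are discarded
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return {p: float(np.percentile(samples, p)) for p in (50, 95, 99)}
```

Reporting tail percentiles rather than a mean is the standard choice for kernels whose latency distribution is skewed by OS scheduling noise.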
Why: Naive INT2 is a mathematical dead-end — confirmed at ~26–30% arc_easy (≈ random) across the 0.6B–7B family in CLAUDE.md. The cause is geometric, not algorithmic: transformer weight matrices contain ~0.1% massive outliers that dictate the quant scale, collapsing 99.9% of normal weights into 1–2 of the 4 available bins and destroying signal. The 2024–2025 research record (ParetoQ, UPQ, QuIP#, INT2.1) proves the ceiling is high when the geometry is respected first. SQINT2 is the Konjo response — a fused four-stage pipeline that hits ~2.15 bpw effective, ~50% of INT4 storage, with arc_easy ≥ 65% on Qwen2.5-7B. This is the next major milestone for the compression axis.
The four-stage pipeline:
- Hadamard incoherence preprocessing — at compress time, apply a randomised Walsh–Hadamard rotation to each FFN weight: W_rot = H · W · Hᵀ. Spreads outlier energy across all dimensions; eliminates the bin-collapse failure mode. Store only the seed (not H). Re-uses `squish/kv/kv_cache.py::_build_hadamard` (already in tree from the QuaRot KV work — Wave 19/20). Lift to a shared `squish/quant/_rotation.py` util only if a signature mismatch forces it; otherwise inline import.
- NF2 per-group quantization — quantize `W_rot` against a 4-symbol NormalFloat-2 codebook (quantile points of N(0,1) at ±1.5σ, ±0.5σ — not uniform spacing). Group size g=32, asymmetric scale + zero-point, re-using the existing AWQ scaling path in `squish/quant/awq.py`. Storage: 2 bits index + (16+16)/32 = 1.0 bit scale/zero overhead → 3.0 bpw before residual.
- Low-rank residual correction — compute the residual E = W_rot − dequant(Q_INT2), run truncated SVD E ≈ L · R with rank r=16, store L, R in INT4. Inference path: dequant(Q_INT2) → inverse Hadamard → + L·R. Adds ~0.15 bpw amortised on a 7B model → ~2.15 bpw effective.
- Layer-selective mixed precision — SQINT2 on FFN `gate_proj`/`up_proj` only; INT3 g=32 on attention Q/K/V/O; INT4 on the first 2 + last 2 transformer blocks (boundary-layer rule — these dominate output coherence). Routing logic added to `squish/quant/quantizer.py` keyed on layer index + tensor-name pattern.
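A minimal sketch of the Stage 2 idea on a single group, assuming a symmetric max-scale for brevity (the shipped path re-uses the AWQ asymmetric scale + zero-point); `nf2_quantize_group` is an illustrative name:

```python
import numpy as np

# 4-symbol NF2 codebook: quantile points of N(0,1) at ±0.5σ, ±1.5σ.
NF2_CODEBOOK = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)

def nf2_quantize_group(w):
    """Map one weight group onto the NF2 codebook with a per-group scale.
    Symmetric-scale sketch; not the shipped asymmetric AWQ path."""
    peak = np.abs(w).max()
    scale = peak / 1.5 if peak > 0 else np.float32(1.0)
    normalized = w / scale
    # Nearest-codeword assignment — 2-bit index per weight.
    idx = np.abs(normalized[:, None] - NF2_CODEBOOK[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), np.float32(scale)

def nf2_dequantize_group(idx, scale):
    return NF2_CODEBOOK[idx] * scale
```

Because the Hadamard stage makes each group approximately Gaussian, placing the four levels at Gaussian quantiles (rather than uniformly) is what buys the SNR lift over naive INT2.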
Module budget: one new file — `squish/quant/sqint2.py` (encapsulates Hadamard preprocess, NF2 codebook lookup, low-rank residual fit/apply, mixed-precision routing config). `squish/cli.py` gains `compress --format sqint2`. `compressed_loader.py` gains the SQINT2 unpack path. Module count: 83 → 84 (ceiling 125 ✅).
Hardware-grounded inference path:
- NF2 dequant + matmul — MLX `mx.quantized_matmul` (custom NF2 lookup table baked into a Metal shader, NOT Python dequant-then-matmul — CLAUDE.md hard rule).
- Hadamard inverse — fused into the same kernel (FWHT, O(n log n)).
- Low-rank + L·R — existing Rust GEMV path from W101 with INT4 weights.
Acceptance criteria (ship gate):
- arc_easy on Qwen2.5-7B SQINT2 ≥ 65% (target 67%, vs. ~73% INT4 baseline). Δ ≤ −8pp.
- Coherent generation on the 5-prompt smoke set — no repetition loops, no incoherence, passes `scripts/coherence_check.sh`.
- Disk: ≤ 50% of INT4 size (Qwen2.5-7B: ~3.5 GB INT4 → ≤ 1.75 GB SQINT2).
- Memory contract: peak Metal RSS ≤ 4 GB on M3 16GB at 7B.
- Latency: SQINT2 decode tok/s ≥ INT4 mlx_lm baseline (the low-rank add must NOT regress through a Python loop — fused kernel or vectorised Rust GEMV).
- lm_eval result OR `lm_eval-waiver` per Accuracy Gate (CLAUDE.md).
- Module count ≤ 125 after merge.
Hard stops (DO NOT SHIP):
- arc_easy < 60% on any tested 7B model → revert. That's incoherent territory.
- Any Python `dequant → numpy matmul` path. Quantized matmul is NEVER Python arithmetic.
- Naive INT2 fallback if the SQINT2 build fails. Naive INT2 stays research-only forever.
- Hadamard rotation applied at runtime (load time) — must be a build-time bake.
Stages, sequenced:
- W103.1 — Hadamard preprocess + NF2 codebook (offline compress only, no inference yet). Validate via reconstruction SNR on synthetic σ=0.02 IID Gaussian weights at g=32 — must hit ≥ 9 dB (vs. ~6.8 dB for naive uniform INT2 = +2 dB lift from NF2 + per-group asymmetric + Lloyd-Max refinement). The 9 dB gate matches the Lloyd-Max theoretical ceiling for 2-bit quantisation on Gaussian (~9.3 dB) — past this point, further SNR gain requires the Stage 3 low-rank residual. Earlier drafts of this plan cited a 12 dB target; that was over-aggressive — 2-bit alone cannot exceed Lloyd-Max regardless of codebook design. 12 dB is the W103.4 ship target (full pipeline including the W103.2 residual), not a Stage 1+2 gate.
- ✅ W103.2 (2026-04-29) — SHIPPED. Rank-16 SVD + sparse-1% residual correction integrated into `squish/quant/sqint2.py` (in-place extension, module count stays 84). Joint SNR gate revised: ≥ 10.0 dB IID Gaussian ✅ (measured 10.21–10.23 dB across 5 seeds at (1536, 576), g=32, r=16, sparse=1%). Critical finding: the 16 dB IID-Gaussian target is unreachable via any rank-16 SVD. Hadamard rotation (Stage 1) whitens all input distributions by design; the post-rotation residual is IID N(0,σ²) regardless of input structure. For (1536, 576) the top-16 singular values capture only r/min(M,N) = 2.78% of the energy → a 0.30 dB lift. Marchenko–Pastur bound, not an implementation gap. Reaching 16 dB on IID Gaussian requires ≥ 2.3 bits per weight — outside the 2-bit mandate. 16 dB on REAL transformer weights (non-Gaussian, correlated, heavy-tailed) is the W103.4 arc_easy gate proxy. Sparse-1% adds 0.24 dB on top of SVD — total +0.54 dB joint lift. 46 new tests; 2231 total passing suite (3 pre-existing version-metadata failures, unchanged).
- ✅ W103.3 (2026-04-29) — SHIPPED. `MixedPrecisionRouter` in `quantizer.py` + `--format sqint2` in `cli.py`. 90 new tests in `tests/test_sqint2_router.py`. 2321 suite passing (0 regressions). Routing spec: boundary layers (first 2 + last 2) → INT4; MLP gate_proj/up_proj → SQINT2; attn Q/K/V/O → INT3; else → INT4. E2E compress gate (lm_eval on Qwen2.5-7B) deferred to W103.4.
- W103.4 — Inference path (Metal/Rust fused kernel) + lm_eval gate on Qwen2.5-7B.
- ✅ W103.4a (2026-04-29) — SHIPPED. `save_sqint2_layer`/`load_sqint2_layer` in `sqint2.py`; npy-dir format with 4 mandatory + 5 optional `.npy` files; meta header (fp64, 16 slots, version=1.0); SQINT2 dispatch in `compressed_loader.py` `_dequantize_npy_dir` between AQLM and passthrough-F16; `_TENSOR_SUFFIX_RE` extended; 27 new tests in `tests/test_sqint2_loader.py`. 2321 → 2348 passing.
- W103.4b — Rust low-rank GEMV (`sqint2_residual_gemv` in `squish_quant_rs`).
- W103.4c — Metal NF2 fused-dequant GEMV kernel + `SQINT2Linear` mlx Module.
- W103.4d — End-to-end compress on Qwen2.5-7B + arc_easy ≥ 65% lm_eval ship gate.
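The W103.1 SNR gate can be reproduced numerically. This sketch quantises unit-Gaussian samples against the known Lloyd-Max 2-bit reconstruction levels and lands near the ~9.3 dB ceiling the gate references (illustrative code, not the shipped test):

```python
import numpy as np

def snr_db(w, w_hat):
    """Reconstruction SNR in dB: 10·log10(signal power / error power)."""
    return 10.0 * np.log10(np.sum(w ** 2) / np.sum((w - w_hat) ** 2))

# Lloyd-Max optimal 2-bit reconstruction levels for a unit Gaussian.
LLOYD_MAX_2BIT = np.array([-1.5104, -0.4528, 0.4528, 1.5104])

rng = np.random.default_rng(0)
w = rng.standard_normal(500_000)
# Nearest-level assignment == Lloyd-Max quantiser (thresholds at midpoints).
idx = np.abs(w[:, None] - LLOYD_MAX_2BIT[None, :]).argmin(axis=1)
print(round(float(snr_db(w, LLOYD_MAX_2BIT[idx])), 2))  # ≈ 9.3 dB ceiling
```

This is why the Stage 1+2 gate sits at 9 dB: no 4-level codebook can beat these levels on a Gaussian, so any further gain must come from the Stage 3 residual.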
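The W103.3 routing spec is small enough to state as code. A sketch assuming the hypothetical helper name `route_format` and HF-style tensor names (the shipped `MixedPrecisionRouter` keys on the same layer-index + name-pattern rules):

```python
def route_format(layer_idx, n_layers, tensor_name):
    """Pick a quant format per the W103.3 routing spec (illustrative helper):
    boundary layers → INT4; MLP gate/up → SQINT2; attention → INT3; else INT4."""
    if layer_idx < 2 or layer_idx >= n_layers - 2:
        return "int4"                       # boundary-layer rule: first 2 + last 2
    if any(p in tensor_name for p in ("gate_proj", "up_proj")):
        return "sqint2"                     # FFN expand projections
    if any(p in tensor_name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return "int3"                       # attention Q/K/V/O
    return "int4"                           # everything else (e.g. down_proj)
```

Note the boundary check wins before the name patterns — a `gate_proj` in layer 0 still gets INT4.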
Validation order (hardware-aware):
- Synthetic SNR (Stage 1+2) — unit test, no hardware.
- arc_easy limit=200 — ~30 min on M3 16GB after W103.4.
- Full arc_easy/hellaswag/piqa/winogrande/openbookqa limit=500 — overnight, gates merge.
Why: KV cache quantization is orthogonal to weight quantization — it does not touch model weights, requires no recompression, and immediately yields ~4× the context length at the same RAM. HadamardKVCache in `squish/kv/kv_cache.py` already handles INT8 with QuaRot-style rotation; extending to INT2 reuses 100% of that infrastructure. This is the highest leverage-per-line-of-code item in the entire compression axis.
Changes shipped (2026-05-01):
- `squish/kv/kv_cache.py` (in-place): `_quantize_int2_per_channel`/`_dequantize_int2_per_channel` — per-token symmetric NF2 4-level codec, indices bit-packed 4-per-uint8 along `head_dim`. `_kv_quantize_per_channel`/`_kv_dequantize_per_channel` — mode dispatch. `KVLayerCache._kv_mode` slot + `kv_mode=` constructor arg. `QuantizedKVCache.mode="int2"` validated; rejects illegal combinations (`svd_rank > 0`, `comm_vq_bits > 0`, `qfilter_rank > 0`, `enable_disk_tier()`). `HadamardKVCache` docstring updated with the W104 motivation. `recommended_kv_mode(context_tokens)` + `KV_INT2_AUTO_THRESHOLD = 8192`.
- `tests/test_kv_int2.py` — 32 new tests (codec roundtrip, dispatch, mode validation, end-to-end through QuantizedKVCache and HadamardKVCache, memory-ratio assertion ≥ 2.9× reduction, disk-tier guardrail).
- Zero new production modules. Codec storage is `(n_tokens, head_dim/4)` uint8 — asymptotic 4× reduction on the old-tier buffer vs INT8.
- Per-token reconstruction SNR ≥ 5 dB on uniform inputs; ≥ +1 dB lift from Hadamard rotation on heavy-tailed activations (validated in test).
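The 4-per-uint8 packing along `head_dim` can be sketched as follows — the function names and the low-bits-first element order are assumptions; only the `(n_tokens, head_dim/4)` uint8 shape is from the spec above:

```python
import numpy as np

def pack_int2(indices):
    """Pack 2-bit codebook indices 4-per-uint8 along the last axis.
    Requires the last axis (head_dim) to be divisible by 4."""
    i = indices.reshape(*indices.shape[:-1], -1, 4).astype(np.uint8)
    # Element 0 in the low bits, element 3 in the high bits (assumed layout).
    return i[..., 0] | (i[..., 1] << 2) | (i[..., 2] << 4) | (i[..., 3] << 6)

def unpack_int2(packed):
    """Inverse of pack_int2: recover 2-bit indices from packed uint8."""
    out = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=-1)
    return out.reshape(*packed.shape[:-1], -1)
```

At head_dim=128 this stores 32 bytes of indices per token per head, which is where the asymptotic 4× reduction vs the 128-byte INT8 row comes from (per-token scales are the non-asymptotic overhead).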
Hardware ship gate (deferred to lm_eval session):
- Qwen2.5-7B at 32K context fits in M3 16GB (currently OOMs around 10K with INT8 KV).
- PPL Δ vs. INT8 KV ≤ +0.5 nats on wikitext-2 (4K window).
- Both metrics require live Qwen2.5-7B inference; tracked alongside the W103.4d arc_easy run.
Acceptance criteria met (code + unit gates):
- ✅ `mode="int2"` branch lands on `QuantizedKVCache` and `HadamardKVCache`.
- ✅ Storage saves ≥ 2.9× per-token at head_dim=128 (4× asymptotic).
- ✅ Re-uses `_build_hadamard` and `KVLayerCache` infra. Zero new modules.
- ✅ Auto-mode helper (`recommended_kv_mode`) with 8K threshold.
- ⏳ 32K context fit & PPL Δ — gated behind the hardware run.
Recommended configuration for ≥ 16K context:
from squish.kv.kv_cache import HadamardKVCache, recommended_kv_mode
mode = recommended_kv_mode(planned_context_tokens) # "int8" or "int2"
cache = HadamardKVCache(n_layers=N, window=128, mode=mode, seed=42)

The 128-token recent FP16 window (configurable) retains quality-critical recent tokens; everything older is INT2-quantised in the rotated frame.
- Shatter the box. "Naive INT2 doesn't work" is a known result. SQINT2 is what works. Do not reach for naive INT2 again. The literature says it is solved — implement the geometry-aware path or implement nothing.
- Verify before claiming. No "Metal will fuse this" assertions. Profile, then claim. CLAUDE.md "Framework Primitives — Verify Before Claiming" applies in full.
- The math goes in the code. Hadamard rotation, NF2 quantile points, SVD truncation — write the math inline as comments. A reader should not need a paper to understand the module. Sene Magber.
- Code-complete vs accuracy-validated are different states. Stages 1–3 may land code-complete with reconstruction-SNR gates only. Stage 4 needs lm_eval before merge, or an `lm_eval-waiver` with expected-delta + queued validation run.
- No graveyards. If a stage fails its gate, delete the code or move it to `experimental/` with a written promotion criterion. No half-finished stubs in `squish/`.
Why: W104 shipped INT2 KV (4× memory) but the SNR cliff from INT8 (~44 dB) to INT2 (~5 dB on Hadamard-rotated activations) is sharp. INT4 fills the gap at ~22 dB SNR with a 2× memory reduction — the right default for 8K–16K contexts where INT2 is overkill and INT8 is too expensive.
Changes shipped (2026-05-02):
- `squish/kv/kv_cache.py` (in-place): `_quantize_int4_per_channel`/`_dequantize_int4_per_channel` — per-token symmetric 16-level uniform codec ({−7.5, …, +7.5}), nibble-packed 2-per-uint8 along `head_dim` (low = even col, high = odd col). `_kv_quantize_per_channel`/`_kv_dequantize_per_channel` — dispatch extended to `"int4"`. New `_KV_QUANT_MODES` frozenset. `KVLayerCache(kv_mode="int4")` accepted; validation rejects all other unknown values. `QuantizedKVCache(mode="int4")` accepted; rejects illegal combinations (`svd_rank > 0`, `comm_vq_bits > 0`, `qfilter_rank > 0`). `enable_disk_tier()` now rejects all sub-INT8 modes (int4 or int2). `HadamardKVCache` docstring extended with a W105 section. `recommended_kv_mode_3tier(ctx)` + `KV_INT4_DEFAULT_THRESHOLD = 16384`. `recommended_kv_mode()` accepts optional `medium_mode`/`medium_threshold` for inline 3-tier dispatch.
- `tests/test_kv_int4.py` — 38 new tests (codec roundtrip, packing layout, SNR ordering INT8 > INT4 > INT2, dispatch, mode validation, end-to-end, memory ratio ≥ 1.7×, disk-tier guardrail, 3-tier recommendation).
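A sketch of a per-token symmetric 16-level codec with the documented nibble layout (low = even col, high = odd col); the function names and the exact max-based scale rule are illustrative, not the shipped code:

```python
import numpy as np

def quantize_int4_per_token(x):
    """Scale each token so its max magnitude maps to ±7.5, round onto the 16
    uniform levels {−7.5, …, +7.5}, nibble-pack 2 codes per uint8."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.5
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero tokens
    idx = np.clip(np.round(x / scale + 7.5), 0, 15).astype(np.uint8)
    packed = idx[..., 0::2] | (idx[..., 1::2] << 4)   # low = even col, high = odd
    return packed, scale

def dequantize_int4_per_token(packed, scale):
    idx = np.empty((*packed.shape[:-1], packed.shape[-1] * 2), np.uint8)
    idx[..., 0::2] = packed & 0x0F
    idx[..., 1::2] = packed >> 4
    return (idx.astype(np.float32) - 7.5) * scale
```

The per-token worst-case error is half a quantisation step (0.5·scale), which is what separates INT4's ~22 dB from INT2's ~5 dB on the same activations.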
Acceptance criteria met:
- ✅ INT4 SNR floor ≥ 18 dB on uniform inputs.
- ✅ INT4 strictly between INT8 and INT2; ≥ 6 dB margin over INT2.
- ✅ Storage = `head_dim/2 + 4` bytes per token (head_dim=128 → 68 B/token, asymptotic 2× reduction vs INT8).
- ✅ Hadamard rotation lifts INT4 SNR on heavy-tailed inputs.
- ✅ Suite: 2464 passed / 3 pre-existing W95 / 43 skipped (+38 from W105).
- ✅ Module count: zero new production modules. One new test file.
Recommended configuration (W105, replacing the W104 default for ≥ 8K):
from squish.kv.kv_cache import HadamardKVCache, recommended_kv_mode_3tier
mode = recommended_kv_mode_3tier(planned_context_tokens)
# ≤ 8K → int8; 8K–16K → int4 (W105); > 16K → int2 (W104)
cache = HadamardKVCache(n_layers=N, window=128, mode=mode, seed=42)

W104 (INT2 KV) + W105 (INT4 KV) COMPLETE. The next wave to land on main is the W103.4 hardware ship gate — bring the existing W103.4b (Rust `sqint2_residual_gemv`), W103.4c (Metal NF2 GEMV + `SQINT2Linear` MLX module), and W103.4d-pre (eval orchestration) commits from claude/angry-cerf-4d8cce into main, then run the arc_easy ≥ 65% gate on Qwen2.5-7B (also validates the W104 32K-context envelope on the same hardware run). LoRA INT4 checkpoint support remains deferred.
Owner: wesleyscholl / Konjo AI Research. Update after each completed wave. Never let this drift from the actual implementation.