Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257)#379
dannywillowliu-uchi wants to merge 2 commits into openai:main
Conversation
From arXiv:2603.09078: projects out the self-value component from attention output, forcing the network to rely on contextual information. Applied via a GQA-aware, zero-allocation view reshape on the last 4 of 11 layers. Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260) use XSA as a key technique.

Full next-gen stack: 11L, XSA, Partial RoPE 16/64, Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate, BigramHash, int6+zstd, Muon WD, OrthoInit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
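The self-value projection mentioned in the commit above can be sketched in a few lines. This is a minimal single-head reading, assuming "projecting out the self-value component" means removing each position's own attention-weighted value `a_ii * v_i` from the output; the function name and everything else here are illustrative, and the PR's GQA-aware zero-allocation reshape is not reproduced.

```python
import numpy as np

def xsa_attention(q, k, v):
    """Single-head causal attention that subtracts each position's
    self-value contribution (a_ii * v_i) from the output, so position i
    must be reconstructed from context tokens j < i.

    q, k, v: arrays of shape (T, d). Hypothetical sketch only."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(T, k=1)] = -np.inf      # causal mask
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)          # row-wise softmax
    out = a @ v
    out = out - np.diag(a)[:, None] * v            # remove self-value term
    return out
```

At T=1 the only available value is the position's own, so the output collapses to zero, which is the degenerate case of "forcing the network to use contextual information".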
original_model.md:
- Discard depth recurrence (amplifies quant error 900×, throughput loss)
- New direction: eval-time optimization stack (PPM-C + GPTQ-lite)
- Document all our experiment results (v3, v4, v4_30m, ringgolf)
- Add TTT/XSA interaction findings (PR openai#303: mutually exclusive)
- Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB)
- 4-phase execution plan targeting PPM-C as an original contribution

review_pr_records_track_10min_16mb.md:
- Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363
- New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128)
- Document negative results from $500 compute spend (PR openai#375)
- Unexplored opportunities: PPM-C, Neural Cache

review_records_track_10min_16mb.md:
- Add timestamp note (17 records, no changes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
**Community Review — Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257)**

**Compliance: HOLD** — scored-region SLOT pending Issue #1336. Head SHA: bcd61a1

**Check 1: N-gram Family Bug** (CLOSE trigger: target token in hash key): CLEAN.
`out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod`

**Check 2: Pre-Quant TTT** (CLOSE trigger: multi-epoch AdamW on val_tokens without score-first): CLEAN on strict criteria. Note: SDTTT was negative (-0.0003 bpb) and is disabled by default.

**Check 3: Legal TTT / Score-First Per Chunk**: CLEAN.
`s = 0 if ws == 0 else max(wlen - stride, 0)`
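The hash line quoted for Check 1 can be verified directly: the key at position i mixes only tokens i and i-1, never the prediction target at i+1. A minimal numpy sketch (the multipliers come from the quoted line; `MOD` is a hypothetical placeholder for the table size actually used in train_gpt.py):

```python
import numpy as np

MOD = 65536  # hypothetical table size; the real value lives in train_gpt.py

def bigram_hash(t):
    """Per-position n-gram key: position 0 keeps the raw token, every
    later position i mixes tokens i and i-1. The target token (i+1)
    never enters the key, which is exactly what Check 1 verifies."""
    out = t.copy()
    out[..., 1:] = np.bitwise_xor(36313 * t[..., 1:],
                                  27191 * t[..., :-1]) % MOD
    return out
```

Changing a token at position j only perturbs the keys at positions j and j+1, so no key ever depends on the token it is used to predict.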
**Verdict:** HOLD — the scored-region eval pattern needs a ruling from maintainers on Issue #1336 before this can be cleared. No other compliance flags found.
**Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:** **HOLD** pending Issue #1336 ruling on scored-region SLOT.
---
*Reviewed by [@MatoTeziTanka](https://github.com/MatoTeziTanka) — [The Agora](https://matotezitanka.github.io/parameter-golf). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.*
PR 379: SDTTT + GPTQ-lite Int6 — Review Summary

PR Title: Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257)

Red Flag Analysis
Findings
Recommendation: MERGE. GPTQ-lite is an established post-training quantization method with no loss manipulation, and the 3x MLP is standard parameter reallocation. Clean submission.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE, pending standard record-track checks.

Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
Summary
val_bpb: 1.1257 (sliding window, stride=64) | 8xH100 SXM, 600s
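The stride=64 sliding-window scoring behind this number can be sketched as follows, assuming the score-first-per-chunk rule quoted in the compliance review (`s = 0 if ws == 0 else max(wlen - stride, 0)`): the first window scores all of its tokens, every later window scores only the tokens past the previous window's end, so each token is scored exactly once with near-maximal context. Function name and the window length default are illustrative.

```python
def scored_regions(total_tokens, wlen=1024, stride=64):
    """Return absolute (start, end) token spans that contribute to
    val_bpb. Windows begin at offsets 0, stride, 2*stride, ...; each
    window covers [begin, begin + wlen) but only scores the tokens not
    already scored by an earlier window, so the spans tile
    [0, total_tokens) with no overlap."""
    regions, prev_end, begin = [], 0, 0
    while prev_end < total_tokens:
        end = min(begin + wlen, total_tokens)
        regions.append((prev_end, end))   # newly scored tokens only
        prev_end = end
        begin += stride
    return regions
```

After the first window, every span is at most `stride` tokens long, which is why a small stride trades eval throughput for context: each token is conditioned on up to `wlen - stride` tokens of history.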
Builds on PR #374's SOTA stack and adds GPTQ-lite: a per-layer optimal-clip-percentile search during int6 quantization.
Novel: GPTQ-lite
Standard int6 quantization uses row-wise absolute max for clipping. GPTQ-lite searches 5 clip percentiles per weight matrix (100%, 99.9%, 99.5%, 99%, 98%) and selects the one minimizing reconstruction error. This reduces quantization degradation at zero training cost.
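The search described above can be sketched in a few lines of numpy. This is a hedged illustration, not the PR's code (which lives in train_gpt.py): names and the per-row clipping granularity are assumptions. For each candidate percentile, clip row-wise, round to symmetric int6 levels in [-31, 31], dequantize, and keep the percentile with the lowest reconstruction error.

```python
import numpy as np

PERCENTILES = (100.0, 99.9, 99.5, 99.0, 98.0)

def int6_reconstruct(w, pct):
    """Row-wise symmetric int6 quantize/dequantize, with the clip
    threshold at the given percentile of |w| per row."""
    clip = np.percentile(np.abs(w), pct, axis=1, keepdims=True)
    scale = np.maximum(clip, 1e-12) / 31.0       # 6-bit signed: -31..31
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale                              # dequantized weights

def gptq_lite(w):
    """Pick the clip percentile minimizing Frobenius reconstruction
    error for this weight matrix; no training involved."""
    best = min(PERCENTILES,
               key=lambda p: np.linalg.norm(w - int6_reconstruct(w, p)))
    return best, int6_reconstruct(w, best)
```

The intuition: on outlier-heavy rows, a sub-100% percentile can shrink the scale and buy resolution for the bulk of the weights at the cost of clipping a few extremes; on well-behaved rows the plain absolute max (100%) wins, so the per-matrix search never does worse than the row-wise-max baseline.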
Architecture: 11L, XSA4, Tight SWA, Partial RoPE 16/64, LN Scale, Late QAT, Value Embedding, SmearGate, BigramHash, FA3, int6+zstd-22, WD=0.04.
Full source and experiment history: https://github.com/dannywillowliu-uchi/parameter-golf-entry