
Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257) #379

Open

dannywillowliu-uchi wants to merge 2 commits into openai:main from dannywillowliu-uchi:submission/sdttt-gptq-1.1260

Conversation


@dannywillowliu-uchi commented Mar 22, 2026

Summary

val_bpb: 1.1257 (sliding window, stride=64) | 8xH100 SXM, 600s

Builds on PR #374's SOTA stack and adds GPTQ-lite: a per-layer search for the optimal clip percentile during int6 quantization.

Novel: GPTQ-lite

Standard int6 quantization uses row-wise absolute max for clipping. GPTQ-lite searches 5 clip percentiles per weight matrix (100%, 99.9%, 99.5%, 99%, 98%) and selects the one minimizing reconstruction error. This reduces quantization degradation at zero training cost.
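
A minimal sketch of the idea, assuming a row-wise symmetric int6 scheme; the function name and the MSE criterion below are illustrative, not lifted from the submission's code:

```python
import torch

def quantize_int6_with_clip_search(w: torch.Tensor,
                                   percentiles=(1.0, 0.999, 0.995, 0.99, 0.98)):
    """GPTQ-lite style clip search (illustrative): try each clip percentile,
    quantize to symmetric int6 ([-31, 31]), and keep the clip that minimizes
    reconstruction error against the original weights."""
    best = None
    for p in percentiles:
        abs_w = w.abs().float()
        if p == 1.0:
            clip = abs_w.max(dim=1, keepdim=True).values   # row-wise abs max
        else:
            clip = torch.quantile(abs_w, p, dim=1, keepdim=True)
        scale = clip.clamp(min=1e-8) / 31.0                # int6 step size
        q = torch.clamp((w / scale).round(), -31, 31)
        err = ((q * scale - w) ** 2).mean().item()         # reconstruction MSE
        if best is None or err < best[0]:
            best = (err, q.to(torch.int8), scale, p)
    return best  # (mse, int6 codes in an int8 tensor, per-row scales, percentile)
```

Because the search happens once, post-training, per weight matrix, it adds only a few quantize/dequantize passes at export time, which is why it costs zero training compute.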

| Metric | Value |
| --- | --- |
| Steps | 6,733 (89.1 ms/step) |
| Pre-quant val_bpb | 1.1417 |
| Sliding window val_bpb (stride=64) | 1.1257 |

Architecture: 11L, XSA4, Tight SWA, Partial RoPE 16/64, LN Scale, Late QAT, Value Embedding, SmearGate, BigramHash, FA3, int6+zstd-22, WD=0.04.

Full source and experiment history: https://github.com/dannywillowliu-uchi/parameter-golf-entry

anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Mar 22, 2026
From arXiv:2603.09078. Projects out the self-value component from
attention output, forcing the network to use contextual information.
Applied via GQA-aware zero-alloc view reshape on last 4 of 11 layers.

Both top unmerged submissions (PR openai#374 at 1.1246 and PR openai#379 at 1.1260)
use XSA as a key technique.

Full next-gen stack now includes: 11L, XSA, Partial RoPE 16/64,
Late QAT STE, Tight SWA, GPTQ-lite, LN Scale, FA3, SmearGate,
BigramHash, int6+zstd, Muon WD, OrthoInit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
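
For readers unfamiliar with XSA, here is a hedged sketch of the self-value projection the commit above describes, under an assumed GQA layout; every name and shape is illustrative, not the referenced implementation:

```python
import torch

def xsa_attention(q, k, v, n_kv_heads):
    """Sketch of the self-value projection ("XSA") described above.
    Shapes assumed: q is (B, H, T, D); k and v are (B, Hkv, T, D) under GQA.
    Splitting query heads with view() and broadcasting k/v over the group dim
    is the zero-allocation reshape: no KV copies are materialized."""
    B, H, T, D = q.shape
    g = H // n_kv_heads
    qg = q.view(B, n_kv_heads, g, T, D)
    kg, vg = k.unsqueeze(2), v.unsqueeze(2)           # (B, Hkv, 1, T, D)
    att = (qg @ kg.transpose(-2, -1)) / D ** 0.5      # (B, Hkv, g, T, T)
    causal = torch.ones(T, T, dtype=torch.bool, device=q.device).triu(1)
    att = att.masked_fill(causal, float("-inf")).softmax(dim=-1)
    out = att @ vg
    # Project out the self-value component: subtract each position's own
    # value weighted by its self-attention probability, so the output must
    # be assembled from contextual tokens.
    self_p = att.diagonal(dim1=-2, dim2=-1).unsqueeze(-1)  # (B, Hkv, g, T, 1)
    return (out - self_p * vg).view(B, H, T, D)
```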
@dannywillowliu-uchi changed the title from Record: 11L GPTQ-lite + Self-Distillation TTT (val_bpb=1.1260) to Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257) on Mar 22, 2026
rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 22, 2026
original_model.md:
- Discard depth recurrence (amplifies quant error 900×, throughput loss)
- New direction: eval-time optimization stack (PPM-C + GPTQ-lite)
- Document all our experiment results (v3, v4, v4_30m, ringgolf)
- Add TTT/XSA interaction findings (PR openai#303: mutually exclusive)
- Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB)
- 4-phase execution plan targeting PPM-C as original contribution

review_pr_records_track_10min_16mb.md:
- Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363
- New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128)
- Document negative results from $500 compute spend (PR openai#375)
- Unexplored opportunities: PPM-C, Neural Cache

review_records_track_10min_16mb.md:
- Add timestamp note (17 records, no changes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257)

Compliance: HOLD — scored-region SLOT pending Issue #1336

Head SHA: bcd61a1
PR: #379 — "11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257)" by @dannywillowliu-uchi
Author: Danny Willow Liu


Check 1: N-gram Family Bug (CLOSE trigger: target token in hash key)

CLEAN. BigramHashEmbedding.bigram_hash() at line 753:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

Input to forward() is input_ids = the input sequence x, not targets. Position i hashes (x[i], x[i-1]) — both context tokens, no target token in the key. This is standard BigramHash, explicitly noted as legal. NOT the disqualifying bug.
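
A self-contained reproduction of that line (with a hypothetical `mod` standing in for the hash table size) makes the claim easy to verify:

```python
import torch

# Reproduction of the hash under review. The constants come from the snippet
# above; `mod` (the hash table size) is illustrative, not the PR's value.
def bigram_hash(t: torch.Tensor, mod: int = 65536) -> torch.Tensor:
    out = t % mod                                    # position 0: unigram key
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:],
                                     27191 * t[..., :-1]) % mod
    return out

x = torch.tensor([[5, 9, 2, 7]])                     # context tokens only
print(bigram_hash(x))  # key at position i depends on x[i] and x[i-1] alone
```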


Check 2: Pre-Quant TTT (CLOSE trigger: multi-epoch AdamW on val_tokens without score-first)

CLEAN on strict criteria. sdttt_adapt() at line 1223 uses torch.optim.SGD, not AdamW. The CLOSE trigger requires AdamW specifically. However, this function does run 2 epochs over val_tokens computing CE loss on targets without score-first gating. The SGD distinction is narrow — the semantic violation (adapt on val targets pre-scoring) is present, but the exact CLOSE criterion (AdamW) is not met. Flagged for reviewer judgment but not auto-CLOSE.

Note: SDTTT was negative (-0.0003 bpb) and is disabled by default (SDTTT_ENABLED=0). The submission score was achieved without SDTTT active.
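
For reviewers weighing the judgment call, a schematic of the pattern Check 2 describes; it is illustrative only, not the PR's sdttt_adapt():

```python
import torch
import torch.nn.functional as F

def sdttt_adapt(model, val_tokens, lr=1e-4, epochs=2, ctx=1024):
    """Schematic of the flagged pattern only (hypothetical code): plain SGD,
    two epochs over val_tokens, CE loss against the very targets that are
    scored later -- no score-first gating in between."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for i in range(0, val_tokens.numel() - ctx - 1, ctx):
            x = val_tokens[i : i + ctx].unsqueeze(0)   # (1, ctx) inputs
            y = val_tokens[i + 1 : i + ctx + 1]        # next-token targets
            loss = F.cross_entropy(model(x)[0], y)     # assumes logits output
            opt.zero_grad()
            loss.backward()
            opt.step()
```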


Check 3: Legal TTT / Score-First Per Chunk

CLEAN. eval_bpb_sliding_window() at lines 1104-1107:

s = 0 if ws == 0 else max(wlen - stride, 0)
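
Read in context, that line sets the per-window scoring start. A hedged reconstruction of the surrounding loop (window handling and model interface are assumptions, only the quoted line is from the source):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bpb_sliding_window(model, tokens, wlen=2048, stride=64):
    """Illustrative reconstruction around the quoted line: windows advance by
    `stride`; the first window scores every position, later windows score only
    their final `stride` positions, so each token is scored exactly once."""
    nats, count = 0.0, 0
    for ws in range(0, tokens.numel() - 1, stride):
        window = tokens[ws : ws + wlen + 1]
        if window.numel() < 2:
            break
        x, y = window[:-1].unsqueeze(0), window[1:]
        logits = model(x)[0]                         # (T, vocab), assumed
        s = 0 if ws == 0 else max(wlen - stride, 0)  # the line under review
        if s >= y.numel():                           # tail already scored
            continue
        nats += F.cross_entropy(logits[s:], y[s:], reduction="sum").item()
        count += y.numel() - s
    return nats / count / math.log(2)                # bits per token/byte
```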

**Verdict:** HOLD. The scored-region eval pattern needs a ruling from maintainers on Issue #1336 before this can be cleared. No other compliance flags found.

**Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:** **HOLD** pending Issue #1336 ruling on scored-region SLOT.

---
*Reviewed by [@MatoTeziTanka](https://github.com/MatoTeziTanka) — [The Agora](https://matotezitanka.github.io/parameter-golf). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.*

@MatoTeziTanka

PR 379: SDTTT + GPTQ-lite Int6

Review Summary

PR Title: Record: 11L GPTQ-lite + Int6 MLP3x (val_bpb=1.1257)
Status: OPEN | No reviewer comments
Train File: records/track_10min_16mb/2026-03-21_SDTTT_GPTQ_11L_Int6_MLP3x/train_gpt.py
Classification: PURE_NEURAL_CLEAN

Red Flag Analysis

| Signal | Finding |
| --- | --- |
| target-in-key loss | CLEAN - standard BPB metric |
| TTT/SLOT classes | CLEAN - no TTT logic in architecture |
| Custom tokenizer | CLEAN - standard SentencePiece |
| loss[] indexing | CLEAN - no custom loss dict access |

Findings

  1. SDTTT reference: Directory name mentions SDTTT, but training uses standard GPTQ-lite (post-training quantization), not TTT architecture
  2. Architecture: 11L transformer with 3x MLP expansion
  3. Quantization: GPTQ-lite + Int6 (post-training, not QAT)
  4. Result: val_bpb=1.1257 (record-track candidate)
  5. Techniques: Standard quantization pipeline combined with architectural optimization

Recommendation

MERGE - GPTQ-lite is an established post-training quantization method. No loss manipulation. The 3x MLP expansion is standard parameter reallocation. Clean submission.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard record-track checks.


Reviewed by @MatoTeziTanka, The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
