Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #543

Open
rarce wants to merge 1 commit into openai:main from rarce:submission/2026-03-23_PR374Stack_GPTQ

Conversation


@rarce rarce commented Mar 23, 2026

Summary

val_bpb: 1.1804 (post-quant, single seed) | 15.95 MB artifact | 8×H100 SXM, 615s

A non-record submission documenting the systematic combination of the PR #374 frontier techniques with MLP width optimization and GPTQ-lite quantization.

Key Techniques

| Technique | Source | Impact |
| --- | --- | --- |
| Partial RoPE (16/64 dims) | PR #315 | Position-free 75% of head dims |
| LN scale (1/sqrt(i+1)) | PR #315 | Damps deeper layers |
| XSA on last 4 layers | PR #265, #287 | GQA-aware self-value debiasing |
| Shared VE128 (layers 9, 10) | PR #374 | Value-embedding injection |
| Tight SWA (scale < 0.2) | PR #374 | Zero-penalty weight averaging |
| Late QAT (lr_scale < 0.1) | PR #297 | Avoids Muon momentum corruption |
| GPTQ-lite (clip search) | PR #379 | Per-tensor optimal clip ratio |
| MLP hidden=1408 | Novel | Faster steps → more training in 10 min |
| Int6 layers 1–9, int8 layers 0 and 10 | Reference | Mixed-precision quantization |
| zstd-22 | Standard | ~35% better than zlib |
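
For readers unfamiliar with the first row: partial RoPE rotates only the first 16 of the 64 head dims and passes the remaining 48 through position-free. A minimal sketch, assuming the standard half-split rotation; the function and argument names are illustrative and not taken from the submission's train_gpt.py:

```python
import torch

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rope_dims: int = 16) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim). Only the first `rope_dims` dims receive
    # position information; the other head_dim - rope_dims stay position-free.
    # cos/sin: (seq, rope_dims // 2), broadcast over batch and heads.
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin,   # standard rotary rotation
                         x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)
```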

Novel Contribution

MLP hidden=1408 vs. 1536: the narrower MLP keeps the artifact under the 16 MB limit while enabling 33% more training steps (137 ms vs. 178 ms per step). The extra ~1,010 steps more than compensate for the reduced per-step capacity:

  • MLP 1536: 3061 steps, val_bpb 1.1958, 18MB (over limit)
  • MLP 1408: 4071 steps, val_bpb 1.1804, 15.95MB (under limit)
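
A minimal sketch of the narrower MLP, assuming the ReLU² block style common to the speedrun baselines; `model_dim` and the class layout are assumptions, and only hidden=1408 comes from this PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, model_dim: int, hidden: int = 1408):  # 1408 rather than 1536
        super().__init__()
        self.c_fc = nn.Linear(model_dim, hidden, bias=False)
        self.c_proj = nn.Linear(hidden, model_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.relu(self.c_fc(x)).square())  # ReLU² activation (assumed)
```

The step arithmetic is consistent with the wall clock: 4,071 steps × 137 ms ≈ 558 s of optimizer time, leaving headroom for evals inside the 615 s run.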

Metrics

| Metric | Value |
| --- | --- |
| Pre-quant val_bpb | 1.1770 |
| Post-quant val_bpb | 1.1804 |
| Quant gap | +0.0034 |
| Steps | 4,071 @ 137 ms/step |
| Parameters | 25,224,291 |
| Artifact | 15,949,473 bytes |
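
The +0.0034 quant gap is what the GPTQ-lite clip search keeps small. A minimal sketch of a per-tensor clip-ratio search for symmetric uniform quantization; the candidate grid and names are assumptions, not PR #379's actual code:

```python
import torch

def best_clip_ratio(w: torch.Tensor, bits: int = 6,
                    ratios=(1.0, 0.95, 0.9, 0.85, 0.8)) -> float:
    # Pick the clip ratio that minimizes round-trip MSE under symmetric
    # uniform quantization at the given bit width (int6 for layers 1-9,
    # int8 for layers 0 and 10 per the techniques table above).
    qmax = 2 ** (bits - 1) - 1
    best_r, best_err = 1.0, float("inf")
    for r in ratios:
        scale = w.abs().max() * r / qmax
        deq = torch.round(w / scale).clamp_(-qmax - 1, qmax) * scale
        err = (deq - w).pow(2).mean().item()
        if err < best_err:
            best_r, best_err = r, err
    return best_r
```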

Test plan

  • Artifact under 16MB (15.95MB)
  • Trains in 615s on 8×H100 SXM
  • Post-quant roundtrip verified
  • train_gpt.py compiles and runs from records/ folder
  • Train log included
  • Multi-seed validation (single seed, budget constrained)

@rarce rarce force-pushed the submission/2026-03-23_PR374Stack_GPTQ branch from 81ea3ef to 9096bd9 on March 23, 2026 at 15:40
@MatoTeziTanka

Community Review — Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #543 — Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)

Head SHA: 9096bd9
File audited: records/track_10min_16mb/2026-03-23_11L_PR374Stack_PartialRoPE_XSA4_VE128_TightSWA_GPTQ/train_gpt.py (1179 lines)


Check 1 — N-gram family bug (CLOSE trigger)

No n-gram or bigram hash structures present. Class inventory: Hyperparameters, Muon, RMSNorm, CastedLinear, SmearGate, SharedValueEmbedding, Rotary, CausalSelfAttention, MLP, Block, GPT, TokenStream, DistributedTokenLoader. No BigramHash, NgramHash, or any hash-keyed token lookup. Hash functions appear nowhere in the model or embedding code. No violation.

Check 2 — Pre-Quant TTT (CLOSE trigger)

No test-time training, online adaptation, or multi-epoch AdamW loop over val_tokens anywhere in the file. The only uses of val_tokens are: (a) passed into eval_val() which is a pure torch.inference_mode() evaluation routine — no gradients, no optimizer steps — and (b) as an argument to the post-quant roundtrip eval. There are zero optimizer .step() calls that touch validation data. No violation.
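
For readers checking this claim themselves, a pure-evaluation routine of the kind described would look roughly like the following; the signature, chunking, and return convention are assumptions about the file, not quotations from it:

```python
import math
import torch

@torch.inference_mode()  # no gradients, so no optimizer step can touch val data
def eval_val(model, val_tokens: torch.Tensor, seq_len: int = 1024) -> float:
    model.eval()
    total_nats, total_tokens = 0.0, 0
    for i in range(0, val_tokens.numel() - seq_len - 1, seq_len):
        x = val_tokens[i : i + seq_len][None, :]
        y = val_tokens[i + 1 : i + seq_len + 1][None, :]
        loss = model(x, y)  # assumed to return mean cross-entropy in nats
        total_nats += loss.item() * y.numel()
        total_tokens += y.numel()
    # bits per token; val_bpb would further divide by the dataset's bytes/token
    return total_nats / total_tokens / math.log(2)
```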

Check 3 — Legal TTT (CLEAN check)

No TTT of any kind is present, so score-first-per-chunk legality is not applicable. Absence is clean. No issue.

Check 4 — Scored-region SLOT (HOLD trigger)

No scored-region masking, selective loss computation, or SLOT-pattern logic detected. The training loop applies uniform loss across all tokens; eval_val similarly computes full-sequence cross-entropy. No HOLD trigger.

Check 5 — Pure neural (CLEAN check)

Architecture is a standard transformer (GPT variant) with: partial RoPE (16/64 dims), XSA on last 4 layers, SharedValueEmbedding injection at layers 9–10, SmearGate token mixing, tight SWA weight averaging, and late QAT. All components are learned neural operations. No lookup tables, retrieval augmentation, or symbolic components in the forward pass. CLEAN.
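
To make the SharedValueEmbedding point concrete, a hypothetical sketch of a 128-dim token-indexed table shared by layers 9–10 and mixed into the attention value stream; everything here beyond the 128-dim width and the two-layer sharing is an assumption:

```python
import torch
import torch.nn as nn

class SharedValueEmbedding(nn.Module):
    # One 128-dim table shared across the layers that use it (9 and 10 here);
    # a per-layer learned gate would scale its contribution to the values.
    def __init__(self, vocab_size: int, model_dim: int, ve_dim: int = 128):
        super().__init__()
        self.table = nn.Embedding(vocab_size, ve_dim)
        self.proj = nn.Linear(ve_dim, model_dim, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # Added to v inside attention, e.g. v = v + gate * ve(idx) — a learned
        # neural operation, consistent with the CLEAN verdict above.
        return self.proj(self.table(idx))
```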


Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.


Reviewed by @MatoTeziTanka / The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
