Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #543

Open
rarce wants to merge 1 commit into openai:main from rarce:submission/2026-03-23_PR374Stack_GPTQ

Conversation


@rarce rarce commented Mar 23, 2026

Summary

val_bpb: 1.1804 (post-quant, single seed) | 15.95 MB artifact | 8×H100 SXM, 615s

A non-record submission documenting the systematic combination of the PR #374 frontier techniques with MLP width optimization and GPTQ-lite quantization.

Key Techniques

| Technique | Source | Impact |
| --- | --- | --- |
| Partial RoPE (16/64 dims) | PR #315 | Position-free 75% of head dims |
| LN scale (1/sqrt(i+1)) | PR #315 | Damps deeper layers |
| XSA on last 4 layers | PR #265, #287 | GQA-aware self-value debiasing |
| Shared VE128 (layers 9, 10) | PR #374 | Value-embedding injection |
| Tight SWA (scale < 0.2) | PR #374 | Zero-penalty weight averaging |
| Late QAT (lr_scale < 0.1) | PR #297 | Avoids Muon momentum corruption |
| GPTQ-lite (clip search) | PR #379 | Per-tensor optimal clip ratio |
| MLP hidden=1408 | Novel | Faster steps → more training in 10 min |
| Int6 layers 1–9, int8 layers 0 and 10 | Reference | Mixed-precision quantization |
| zstd-22 | Standard | ~35% better than zlib |
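
For readers unfamiliar with the first row: partial RoPE rotates only the first 16 of the 64 head dims and passes the remaining 48 through position-free. A minimal sketch, assuming the standard half-split rotation; the function and argument names are illustrative and not taken from the submission's train_gpt.py:

```python
import torch

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rope_dims: int = 16) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim). Only the first `rope_dims` dims receive
    # position information; the other head_dim - rope_dims stay position-free.
    # cos/sin: (seq, rope_dims // 2), broadcast over batch and heads.
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin,   # standard rotary rotation
                         x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)
```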

Novel Contribution

MLP hidden=1408 vs. 1536: the narrower MLP keeps the artifact under the 16 MB limit while enabling 33% more training steps (137 ms vs. 178 ms per step). The extra ~1,010 steps more than compensate for the reduced per-step capacity:

  • MLP 1536: 3061 steps, val_bpb 1.1958, 18MB (over limit)
  • MLP 1408: 4071 steps, val_bpb 1.1804, 15.95MB (under limit)
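
A minimal sketch of the narrower MLP, assuming the ReLU² block style common to the speedrun baselines; `model_dim` and the class layout are assumptions, and only hidden=1408 comes from this PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, model_dim: int, hidden: int = 1408):  # 1408 rather than 1536
        super().__init__()
        self.c_fc = nn.Linear(model_dim, hidden, bias=False)
        self.c_proj = nn.Linear(hidden, model_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.relu(self.c_fc(x)).square())  # ReLU² activation (assumed)
```

The step arithmetic is consistent with the wall clock: 4,071 steps × 137 ms ≈ 558 s of optimizer time, leaving headroom for evals inside the 615 s run.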

Metrics

| Metric | Value |
| --- | --- |
| Pre-quant val_bpb | 1.1770 |
| Post-quant val_bpb | 1.1804 |
| Quant gap | +0.0034 |
| Steps | 4,071 @ 137 ms/step |
| Parameters | 25,224,291 |
| Artifact | 15,949,473 bytes |
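
The +0.0034 quant gap is what the GPTQ-lite clip search keeps small. A minimal sketch of a per-tensor clip-ratio search for symmetric uniform quantization; the candidate grid and names are assumptions, not PR #379's actual code:

```python
import torch

def best_clip_ratio(w: torch.Tensor, bits: int = 6,
                    ratios=(1.0, 0.95, 0.9, 0.85, 0.8)) -> float:
    # Pick the clip ratio that minimizes round-trip MSE under symmetric
    # uniform quantization at the given bit width (int6 for layers 1-9,
    # int8 for layers 0 and 10 per the techniques table above).
    qmax = 2 ** (bits - 1) - 1
    best_r, best_err = 1.0, float("inf")
    for r in ratios:
        scale = w.abs().max() * r / qmax
        deq = torch.round(w / scale).clamp_(-qmax - 1, qmax) * scale
        err = (deq - w).pow(2).mean().item()
        if err < best_err:
            best_r, best_err = r, err
    return best_r
```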

Test plan

  • Artifact under 16MB (15.95MB)
  • Trains in 615s on 8×H100 SXM
  • Post-quant roundtrip verified
  • train_gpt.py compiles and runs from records/ folder
  • Train log included
  • Multi-seed validation (single seed, budget constrained)

@rarce rarce force-pushed the submission/2026-03-23_PR374Stack_GPTQ branch from 81ea3ef to 9096bd9 on March 23, 2026 at 15:40
@MatoTeziTanka

Community Review — Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #543 — Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)

Head SHA: 9096bd9
File audited: records/track_10min_16mb/2026-03-23_11L_PR374Stack_PartialRoPE_XSA4_VE128_TightSWA_GPTQ/train_gpt.py (1179 lines)


Check 1 — N-gram family bug (CLOSE trigger)

No n-gram or bigram hash structures present. Class inventory: Hyperparameters, Muon, RMSNorm, CastedLinear, SmearGate, SharedValueEmbedding, Rotary, CausalSelfAttention, MLP, Block, GPT, TokenStream, DistributedTokenLoader. No BigramHash, NgramHash, or any hash-keyed token lookup. Hash functions appear nowhere in the model or embedding code. No violation.

Check 2 — Pre-Quant TTT (CLOSE trigger)

No test-time training, online adaptation, or multi-epoch AdamW loop over val_tokens anywhere in the file. The only uses of val_tokens are: (a) passed into eval_val() which is a pure torch.inference_mode() evaluation routine — no gradients, no optimizer steps — and (b) as an argument to the post-quant roundtrip eval. There are zero optimizer .step() calls that touch validation data. No violation.
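
For readers checking this claim themselves, a pure-evaluation routine of the kind described would look roughly like the following; the signature, chunking, and return convention are assumptions about the file, not quotations from it:

```python
import math
import torch

@torch.inference_mode()  # no gradients, so no optimizer step can touch val data
def eval_val(model, val_tokens: torch.Tensor, seq_len: int = 1024) -> float:
    model.eval()
    total_nats, total_tokens = 0.0, 0
    for i in range(0, val_tokens.numel() - seq_len - 1, seq_len):
        x = val_tokens[i : i + seq_len][None, :]
        y = val_tokens[i + 1 : i + seq_len + 1][None, :]
        loss = model(x, y)  # assumed to return mean cross-entropy in nats
        total_nats += loss.item() * y.numel()
        total_tokens += y.numel()
    # bits per token; val_bpb would further divide by the dataset's bytes/token
    return total_nats / total_tokens / math.log(2)
```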

Check 3 — Legal TTT (CLEAN check)

No TTT of any kind is present, so score-first-per-chunk legality is not applicable. Absence is clean. No issue.

Check 4 — Scored-region SLOT (HOLD trigger)

No scored-region masking, selective loss computation, or SLOT-pattern logic detected. The training loop applies uniform loss across all tokens; eval_val similarly computes full-sequence cross-entropy. No HOLD trigger.

Check 5 — Pure neural (CLEAN check)

Architecture is a standard transformer (GPT variant) with: partial RoPE (16/64 dims), XSA on last 4 layers, SharedValueEmbedding injection at layers 9–10, SmearGate token mixing, tight SWA weight averaging, and late QAT. All components are learned neural operations. No lookup tables, retrieval augmentation, or symbolic components in the forward pass. CLEAN.
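
To make the SharedValueEmbedding point concrete, a hypothetical sketch of a 128-dim token-indexed table shared by layers 9–10 and mixed into the attention value stream; everything here beyond the 128-dim width and the two-layer sharing is an assumption:

```python
import torch
import torch.nn as nn

class SharedValueEmbedding(nn.Module):
    # One 128-dim table shared across the layers that use it (9 and 10 here);
    # a per-layer learned gate would scale its contribution to the values.
    def __init__(self, vocab_size: int, model_dim: int, ve_dim: int = 128):
        super().__init__()
        self.table = nn.Embedding(vocab_size, ve_dim)
        self.proj = nn.Linear(ve_dim, model_dim, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # Added to v inside attention, e.g. v = v + gate * ve(idx) — a learned
        # neural operation, consistent with the CLEAN verdict above.
        return self.proj(self.table(idx))
```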


Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.


Reviewed by @MatoTeziTanka / The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
