Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #543
rarce wants to merge 1 commit into openai:main
Conversation
Community Review — Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #543 — Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)
Head SHA: 9096bd9

Check 1 — N-gram family bug (CLOSE trigger): No n-gram or bigram hash structures present. Class inventory:

Check 2 — Pre-Quant TTT (CLOSE trigger): No test-time training, online adaptation, or multi-epoch AdamW loop over

Check 3 — Legal TTT (CLEAN check): No TTT of any kind is present, so score-first-per-chunk legality is not applicable. Absence is clean. No issue.

Check 4 — Scored-region SLOT (HOLD trigger): No scored-region masking, selective loss computation, or SLOT-pattern logic detected. The training loop applies uniform loss across all tokens;

Check 5 — Pure neural (CLEAN check): Architecture is a standard transformer (GPT variant) with partial RoPE (16/64 dims), XSA on the last 4 layers, SharedValueEmbedding injection at layers 9–10, SmearGate token mixing, tight SWA weight averaging, and late QAT. All components are learned neural operations. No lookup tables, retrieval augmentation, or symbolic components in the forward pass. CLEAN.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
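For readers unfamiliar with partial RoPE as referenced in Check 5 (rotary embedding applied to only 16 of the 64 head dims), here is a minimal sketch, assuming NeoX-style half-rotation; the function names and table layout are illustrative and not taken from this PR's train_gpt.py:

```python
import torch

def rope_tables(seq_len, rope_dims=16, base=10000.0, device=None):
    # Standard rotary frequency table, built only for the rotated slice.
    inv_freq = 1.0 / (base ** (torch.arange(0, rope_dims, 2, device=device).float() / rope_dims))
    t = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(t, inv_freq)            # (seq, rope_dims // 2)
    emb = torch.cat((freqs, freqs), dim=-1)     # (seq, rope_dims)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rope(x, cos, sin, rope_dims=16):
    # x: (batch, n_head, seq, head_dim); only the first `rope_dims` channels
    # receive the rotary position signal, the rest pass through unchanged.
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x_rot = x_rot * cos + rotate_half(x_rot) * sin
    return torch.cat((x_rot, x_pass), dim=-1)

# Hypothetical usage: head_dim = 64, only 16 dims rotated.
q = torch.randn(2, 8, 128, 64)
cos, sin = rope_tables(seq_len=128, rope_dims=16)
q = apply_partial_rope(q, cos, sin, rope_dims=16)
```

With head_dim=64 and rope_dims=16, only a quarter of each head carries explicit positional rotation; the remaining 48 dims stay position-agnostic.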
Summary
val_bpb: 1.1804 (post-quant, single seed) | 15.95 MB artifact | 8×H100 SXM, 615s
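For context, bits-per-byte is conventionally derived from summed token-level cross-entropy; a minimal sketch, assuming the eval loop reports total loss in nats (the function below is illustrative, not the repo's eval code):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    # Convert summed cross-entropy over all evaluated tokens (in nats)
    # into bits per UTF-8 byte of the underlying validation text.
    return total_loss_nats / math.log(2) / total_bytes
```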
Non-record submission documenting a systematic combination of the frontier techniques from PR #374 with MLP width optimization and GPTQ-lite quantization.
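The GPTQ-lite pass itself is not reproduced in this summary. As a rough illustration of the weight-quantization half of the idea, here is a symmetric per-output-channel round-to-nearest sketch; full GPTQ additionally uses second-order statistics from calibration data to compensate rounding error column by column, and what exactly "lite" drops is an assumption on my part:

```python
import torch

def quantize_per_channel(w: torch.Tensor, n_bits: int = 4):
    # Symmetric per-output-channel quantization of a weight matrix
    # w with shape (out_features, in_features).
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), min=-qmax - 1, max=qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reconstruct an fp32 approximation of the original weights.
    return q.float() * scale
```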
Key Techniques
Novel Contribution
MLP hidden=1408 vs 1536: the narrower MLP fits in the 16 MB artifact budget while enabling 33% more training steps (137 ms vs 178 ms per step). The extra ~1000 steps more than compensate for the reduced per-step capacity.
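A back-of-envelope check of that trade-off, assuming (simplistically) that the full 615 s wall-clock is spent on optimizer steps:

```python
# Rough step-budget arithmetic for the two MLP widths; the per-step times
# are the figures quoted above, and treating the whole 615 s run as pure
# training time ignores setup and eval overhead.
WALL_CLOCK_S = 615
for hidden, step_ms in [(1536, 178), (1408, 137)]:
    steps = int(WALL_CLOCK_S * 1000 / step_ms)
    print(f"hidden={hidden}: ~{steps} steps at {step_ms} ms/step")
# hidden=1536: ~3455 steps at 178 ms/step
# hidden=1408: ~4489 steps at 137 ms/step  (roughly 1000 extra steps)
```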
Metrics
Test plan