Record: K_KVShare_Wider full-recipe FLA — val_bpb 1.04090 (3-seed mean) #1687
resouer wants to merge 1 commit into openai:main from
Conversation
…ranch

This branch lifts the validated review package onto a clean upstream/main base so the official submission diff stays to one records folder and one commit. The package keeps the faithful multi-file surface because the packed single-file experiments drifted, while a direct smoke on the current multi-file surface matched the measured candidate within noise.

Constraint: The submission branch must contain only records/ files and must keep the exact measured candidate surface.
Rejected: Reuse the existing fork review branch as-is | it carries many exploratory commits and is noisier than a clean submit branch.
Rejected: Promote the packed single-file variant | it was not fidelity-cleared for this candidate.
Confidence: high | Scope-risk: narrow | Reversibility: clean
Directive: If packaging changes again, rerun at least one packaged smoke before treating the branch as submission-ready.
Tested: py_compile on packaged Python files; exact folder-size audit (15,991,282 bytes total); packaged multi-file smoke on PR-head surface at 1.03971272 BPB.
Not-tested: Re-running the full 3-seed sweep on this rebased records-only branch (package contents unchanged).
Hi @resouer — I think this PR hits the same SentencePiece byte-accounting issue.

Where: Reference (base Empirical check on

Per-token delta (PR − ref): 68.0% of tokens → 0, 31.5% → +1 (the ▁-tokens), 0.5% → +5 (looks like

Correction factor ×1.1555 → the reported 1.0409 3-seed mean becomes ≈ 1.20 under the reference scorer, which is in line with the prior GDN-Hybrid runs before rescoring.

Minimal fix, mirroring base
Happy to share the verification script if useful.
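The arithmetic behind this rescoring can be reproduced from the numbers in the comment (the implied mean bytes-per-token is derived here, not reported in the thread):

```python
# Reported per-token byte-count deltas (PR minus reference scorer):
# 68.0% of tokens -> +0, 31.5% -> +1, 0.5% -> +5 extra bytes in the PR count.
extra_bytes_per_token = 0.680 * 0 + 0.315 * 1 + 0.005 * 5   # = 0.340

# The reported correction factor is pr_bytes / ref_bytes per token.
# Solving b / (b - 0.340) = 1.1555 gives the implied mean PR bytes/token:
factor = 1.1555
b = 0.340 * factor / (factor - 1.0)   # ~2.53 bytes/token (implied)

# Rescoring the reported 3-seed mean under the reference byte count:
corrected = 1.0409 * factor           # ~1.20 BPB, matching the comment
```

Since the bit count is unchanged and only the byte denominator shrinks, BPB scales up by exactly the byte-count ratio.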
Closing this submission after local verification found a scoring bug in the SentencePiece byte accounting path. The corrected rerun on commit
… (3-seed mean)

GatedDeltaNet linear attention (FLA) + legal score-first TTT on PR openai#1687 K_KVShare_Wider architecture.
3-seed mean: 1.00995 BPB (std 0.0012). Seeds: 42 (1.01130), 314 (1.00896), 999 (1.00959)
TTT gain: ~-0.010 BPB per seed
All artifacts under 16 MiB.
Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
…ct concern; PR openai#1687 CLOSED BPB bug; PR openai#1693 casefold 1.05733; SOTA Day 8; Session 16 https://claude.ai/code/session_01LVvBLAM46dRKg53renpkq4
Independent 3-seed reproduction of GatedDeltaNet K_KVShare_Wider on 8xH100 SXM. Builds on PR openai#1687 (resouer). No TTT, no SLOT, no n-gram.
Seeds: 42 (1.0353), 1337 (1.0333), 2025 (1.0330)
Mean: 1.0339 ± 0.0012 | Artifact: 15.88 MB mean
- is_boundary defaults to True (was zeros)
- skip control/unknown/unused tokens early
- handle byte tokens as 1 byte explicitly
- strip sentencepiece space marker before UTF-8 encoding
- use int16 for base_bytes (was float32)

Same bug that closed PR openai#1687.
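The fixes above can be sketched as a standalone byte-accounting routine (the piece-kind constants and the `base_bytes` helper are illustrative; real SentencePiece exposes piece types through its model proto):

```python
from array import array

# Hypothetical piece kinds; real SentencePiece distinguishes these in its vocab.
CONTROL, UNKNOWN, UNUSED, BYTE, NORMAL = range(5)
SPACE_MARKER = "\u2581"  # SentencePiece word-boundary marker

def base_bytes(pieces):
    """Per-token base byte counts, applying the listed fixes.

    `pieces` is a list of (piece_text, kind) pairs.
    """
    out = array("h")  # int16, not float32
    for text, kind in pieces:
        if kind in (CONTROL, UNKNOWN, UNUSED):
            out.append(0)        # skip control/unknown/unused tokens early
        elif kind == BYTE:
            out.append(1)        # byte-fallback tokens are exactly 1 byte
        else:
            # Strip the space marker before UTF-8 encoding; the boundary
            # byte itself is tracked by the (default-True) is_boundary flag.
            stripped = text.replace(SPACE_MARKER, "")
            out.append(len(stripped.encode("utf-8")))
    return out
```

Counting the 3-byte UTF-8 encoding of ▁ as content bytes is exactly the kind of overcount that inflates the byte denominator and deflates BPB.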
…0 (3-seed mean)

GatedDeltaNet linear attention (FLA) K_KVShare_Wider + legal score-first TTT (SGD 3ep freeze=2) + brotli-11 compression.
3-seed mean: 1.00980 BPB (std 0.0015). All artifacts under 16 MB.
Seeds: 1337 (1.00803), 42 (1.01069), 2025 (1.01067)
TTT gain: ~-0.009 BPB per seed
Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
… mean)

GatedDeltaNet linear attention (FLA) K_KVShare_Wider + brotli-11 compression. No TTT — pure fixed predictor (Track A).
3-seed mean: 1.01902 BPB (std 0.0017). All artifacts under 16 MB.
Seeds: 1337 (1.01720), 42 (1.02054), 2025 (1.01933)
Based on PR openai#1687 by @resouer.
Aweb's record-attempt submission, building on PR openai#1711 (1.00980 BPB) by adding EMA-Teacher Distillation (Tarvainen & Valpola, NeurIPS 2017, 'Mean teachers are better role models') as the novel contribution.

Loss: L = (1-α)·CE(target) + α·KL(student || teacher.detach())

Teacher is a separate copy of the student model, periodically (every K=16 steps) synchronized from the EMA-smoothed state already maintained by the frontier code. Alpha ramps linearly 0 → 0.3 over the middle 40% of training (steps 30%-70%). Temperature scaling per Hinton soft-target convention (KL × T²).

Verified novel via gh search (mean teacher / EMA teacher / distillation / KL soft targets) — zero matching open PRs in the competition.

Verified legal under Issue openai#1017 conditions 1-4:
- Causal (teacher uses same forward as student)
- Full distribution (KL on full softmax over vocab)
- Score-before-update (distillation is training-time only; eval unchanged)
- Single L→R pass (no rescoring)

CPU smoke test (8 cases, FLA-independent) passes: CE-only path correct, EMT path differs from CE, gradient routes to student not teacher, temperature scaling active, alpha schedule correct, mini-training loss decreases, KL of identical distributions = 0.

Credits: PR openai#1711 (aamodbhatt) GDN+brotli base; PR openai#1687 (resouer) GDN K_KVShare; PR openai#461 (Christopher-Lee-McClendon) Score-First Legal TTT; FLA library (sustcsonglin); Tarvainen & Valpola (NeurIPS 2017) Mean Teacher framework.
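The loss and alpha schedule described above can be sketched framework-free (a minimal sketch: `alpha_at`, `emt_loss`, and the default temperature T=2.0 are illustrative choices, not the PR's actual code):

```python
import math

def alpha_at(step, total, alpha_max=0.3, start=0.30, end=0.70):
    """Linear ramp 0 -> alpha_max over the middle 40% of training."""
    frac = step / total
    if frac <= start:
        return 0.0
    if frac >= end:
        return alpha_max
    return alpha_max * (frac - start) / (end - start)

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def emt_loss(student_logits, teacher_logits, target, alpha, T=2.0):
    """(1-alpha)*CE(target) + alpha*T^2*KL(student || teacher).

    The T^2 factor follows the Hinton soft-target convention; the teacher
    logits stand in for the detached, periodically EMA-synced teacher copy.
    """
    ce = -math.log(softmax(student_logits)[target])
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    kl = sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))
    return (1 - alpha) * ce + alpha * (T * T) * kl
```

With identical student and teacher logits the KL term vanishes, which is one of the smoke-test cases the PR lists.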
Summary
3-seed mean val_bpb: 1.04089763 (std 0.00106003) | 3.16476760 nats | 8xH100 SXM, 600s
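As a consistency check, the two summary figures imply a mean bytes-per-token, assuming the nats value is the mean per-token loss (an assumption; the summary does not state the normalization):

```python
import math

val_bpb = 1.04089763    # bits per byte (3-seed mean)
val_nats = 3.16476760   # taken here as mean per-token loss in nats

# bpb = nats / (ln 2 * bytes_per_token)  =>  implied bytes per token:
bytes_per_token = val_nats / (math.log(2) * val_bpb)   # ~4.39
```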
Mechanism
This candidate packages the stronger `K_KVShare_Wider` point on a fuller upstream-style FLA / GatedDeltaNet recipe.

Main idea: `K_KVShare_Wider` uses KV sharing (`kv_sharing_stride=2`) to buy width rather than depth. Nearest prior family reference: PR #1370.

Compliance notes
What this candidate does not use:
- Sliding-window attention (`K_KVShare_Wider` has `num_swa_layers = 0`)

Hardening already applied:
- `train_gpt.py` does not perform runtime dependency downloads
- Dependencies are listed in `requirements.txt` and expected to be installed before evaluation

Packaging note
This PR keeps the faithful multi-file records surface rather than the tidier single-file experiments.
A direct packaged smoke on this exact multi-file surface (seed `1337`) completed at 1.03971272 BPB, versus the measured seed `1337` result of 1.03967403, a delta of only +0.00003869 BPB. Peak memory remained 41127 MiB.

Packaged-folder verification
Exact draft records-folder audit:
The packaged code in the records folder passes `py_compile` from inside the folder.

Reproduction
```shell
pip install --no-deps -r requirements.txt

SEED=$SEED ARCH_MODE=K MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 EVAL_COMPILE_ENABLED=0 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```