
Record: K_KVShare_Wider full-recipe FLA — val_bpb 1.04090 (3-seed mean) #1687

Closed
resouer wants to merge 1 commit into openai:main from resouer:submission/kkvsharewider-fla-record

Conversation


resouer commented Apr 17, 2026

Summary

3-seed mean val_bpb: 1.04089763 (std 0.00106003) | 3.16476760 nats | 8xH100 SXM, 600s

| Seed | Quantized BPB | val_loss (nats) | Artifact (bytes) |
| ---- | ------------- | --------------- | ---------------- |
| 1337 | 1.03967403    | 3.16104735      | 15,762,406       |
| 42   | 1.04153708    | 3.16671180      | 15,870,797       |
| 2025 | 1.04148177    | 3.16654364      | 15,648,800       |
| Mean | 1.04089763    | 3.16476760      | 15,760,668       |
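
A quick check of the reported aggregates from this table (a minimal sketch; note the reported std is the sample standard deviation, ddof = 1):

```python
# Verify the reported 3-seed mean and std from the per-seed BPB values above.
import statistics

bpb = [1.03967403, 1.04153708, 1.04148177]
print(f"{statistics.mean(bpb):.8f}")   # 1.04089763
print(f"{statistics.stdev(bpb):.8f}")  # 0.00106003 (sample std, n - 1)
```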

Mechanism

This candidate packages the stronger K_KVShare_Wider operating point on top of a fuller upstream-style FLA / GatedDeltaNet recipe.

Main idea:

  • FLA / GatedDeltaNet family
  • K_KVShare_Wider: KV sharing (kv_sharing_stride=2) used to buy width rather than depth (see the sketch after this list)
  • fuller upstream-style recipe
  • EMA + SWA + late QAT + int6 artifact path
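
A minimal sketch of the cross-layer KV-sharing idea, shown with plain softmax attention for brevity (the recipe itself uses FLA / GatedDeltaNet layers); module names and shapes are illustrative, not this PR's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVBlock(nn.Module):
    """Attention block that either owns K/V projections or borrows them."""
    def __init__(self, dim: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        # Only every kv_sharing_stride-th layer allocates K/V weights; the
        # saved parameters can be spent on width instead of depth.
        self.kv_proj = nn.Linear(dim, 2 * dim, bias=False) if owns_kv else None

    def forward(self, x, kv=None):
        B, T, D = x.shape
        if self.kv_proj is not None:
            kv = self.kv_proj(x).chunk(2, dim=-1)
        k, v = kv  # follower layers reuse the most recent owner's K/V
        shape = lambda t: t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(
            shape(self.q_proj(x)), shape(k), shape(v), is_causal=True
        )
        return x + self.o_proj(y.transpose(1, 2).reshape(B, T, D)), kv

# kv_sharing_stride=2: layers 0, 2, ... own K/V; layers 1, 3, ... borrow them.
stride, dim = 2, 64
blocks = nn.ModuleList(SharedKVBlock(dim, 4, i % stride == 0) for i in range(4))
x, kv = torch.randn(1, 8, dim), None
for block in blocks:
    x, kv = block(x, kv)
```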

Nearest prior family reference: PR #1370.

Compliance notes

What this candidate does not use:

  • no TTT
  • no SLOT
  • no n-gram overlay
  • no SWA/XSA final scoring path (K_KVShare_Wider has num_swa_layers = 0)

Hardening already applied:

  • all 3 seeds are on one script revision
  • train_gpt.py does not perform runtime dependency downloads
  • dependencies are declared in requirements.txt and expected to be installed before evaluation

Packaging note

This PR keeps the faithful multi-file records surface rather than the tidier single-file experiments.

A direct packaged smoke on this exact multi-file surface (seed 1337) completed at 1.03971272 BPB, versus the measured seed-1337 result of 1.03967403, a delta of only +0.00003869 BPB. Peak memory remained 41,127 MiB.

Packaged-folder verification

Exact draft records-folder audit:

  • packaged folder + largest artifact = 15,991,282 bytes
  • remaining headroom = 8,718 bytes (to the 16,000,000-byte limit)

The packaged code in the records folder passes py_compile from inside the folder.

Reproduction

# Dependencies are pinned; train_gpt.py performs no runtime downloads.
pip install --no-deps -r requirements.txt

# One 600 s run per seed on 8xH100; SEED is supplied via the environment.
SEED=$SEED ARCH_MODE=K MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 EVAL_COMPILE_ENABLED=0 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

…ranch

This branch lifts the validated review package onto a clean upstream/main base so the official submission diff stays to one records folder and one commit. The package keeps the faithful multi-file surface because the packed single-file experiments drifted, while a direct smoke on the current multi-file surface matched the measured candidate within noise.

Constraint: The submission branch must contain only records/ files and must keep the exact measured candidate surface.
Rejected: Reuse the existing fork review branch as-is | it carries many exploratory commits and is noisier than a clean submit branch
Rejected: Promote the packed single-file variant | it was not fidelity-cleared for this candidate
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If packaging changes again, rerun at least one packaged smoke before treating the branch as submission-ready
Tested: py_compile on packaged Python files; exact folder-size audit (15,991,282 bytes total); packaged multi-file smoke on PR-head surface at 1.03971272 BPB
Not-tested: Re-running the full 3-seed sweep on this rebased records-only branch (package contents unchanged)

bigbag commented Apr 17, 2026

Hi @resouer — I think this PR hits the same build_sentencepiece_luts byte-counting bug that closed #1545 / #1576 / #1632 / #1672 / #1681. Flagging for visibility.

Where: train_gdn_7k.py L192–205 sets base_bytes[i] = len(piece[1:].encode("utf-8")) + 1 for ▁-prefixed pieces — so the leading-space byte is already baked in. Then the scorer at L361–363 adds another +1 via has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]. Every ▁-token ends up counted with one extra byte.

Reference (base train_gpt.py L180–204) strips the ▁ from piece before calling .encode("utf-8"), so base_bytes counts only the content and the scorer's +1 is the only leading-space accounting.
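
A small, self-contained illustration of the double count (the values are made up for the demo; the real LUTs live in train_gdn_7k.py and train_gpt.py):

```python
piece = "\u2581the"  # a SentencePiece piece carrying the leading-space marker

# This PR's construction bakes the leading-space byte into base_bytes...
pr_base_bytes = len(piece[1:].encode("utf-8")) + 1   # 3 + 1 = 4
# ...and the scorer adds the leading-space byte again mid-sentence:
has_leading_space, prev_is_boundary = 1, 0
pr_count = pr_base_bytes + has_leading_space * (1 - prev_is_boundary)   # 5

# Reference construction: strip the marker first, so the scorer's +1 is the
# only leading-space accounting.
ref_base_bytes = len(piece.lstrip("\u2581").encode("utf-8"))            # 3
ref_count = ref_base_bytes + has_leading_space * (1 - prev_is_boundary)  # 4

print(pr_count, ref_count)  # 5 4 -> one extra byte per mid-sentence ▁-token
```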

Empirical check on fineweb_1024_bpe.model with a 3,648-byte FineWeb-style sample (reproduces the scorer arithmetic from L361–363):

|            | actual text | reference LUT | this PR's LUT   |
| ---------- | ----------- | ------------- | --------------- |
| byte_count | 3,648       | 3,647         | 4,214 (+15.55%) |

Per-token delta (PR − ref): 68.0% of tokens → 0, 31.5% → +1 (the ▁-tokens), 0.5% → +5 (looks like sp.is_byte tokens, where the reference uses base_bytes = 1 but this impl uses len("<0x..>".encode())).

Correction factor ×1.1555 → the reported 1.0409 3-seed mean becomes ≈ 1.20 under the reference scorer, which is in line with the prior GDN-Hybrid runs before rescoring.
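
The correction-factor arithmetic, spelled out with the numbers above:

```python
factor = 4214 / 3647                  # ≈ 1.1555 (+15.55%)
print(round(1.04089763 * factor, 4))  # ≈ 1.2027, i.e. the "≈ 1.20" above
```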

Minimal fix, mirroring base train_gpt.py L180–204 (a sketch follows the list):

  • strip ▁ from piece before piece.encode("utf-8")
  • default is_boundary_token to True; set False only for non-control/unknown/unused tokens
  • handle sp.is_byte(i) → base_bytes = 1
  • add sp.is_unused(i) to the control/unknown skip
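
A hedged sketch of that corrected LUT build (array and variable names follow this comment's wording, not necessarily the real files):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="fineweb_1024_bpe.model")
n = sp.get_piece_size()

base_bytes = np.zeros(n, dtype=np.int16)    # int16, not float32
has_leading_space = np.zeros(n, dtype=bool)
is_boundary_token = np.ones(n, dtype=bool)  # default True

for i in range(n):
    # control / unknown / unused tokens stay boundary tokens with 0 bytes
    if sp.is_control(i) or sp.is_unknown(i) or sp.is_unused(i):
        continue
    is_boundary_token[i] = False
    if sp.is_byte(i):                        # "<0x..>" pieces are 1 raw byte
        base_bytes[i] = 1
        continue
    piece = sp.id_to_piece(i)
    has_leading_space[i] = piece.startswith("\u2581")
    # strip the marker BEFORE encoding; the scorer's +1 then does the
    # leading-space accounting exactly once
    base_bytes[i] = len(piece.lstrip("\u2581").encode("utf-8"))
```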

Happy to share the verification script if useful.


resouer commented Apr 17, 2026

Closing this submission after local verification found a scoring bug in the SentencePiece byte-accounting path. The corrected rerun on commit 3bf1ec3 lands at final_int6_roundtrip_exact val_bpb: 1.22285488, so the reported 1.04089763 result does not hold under the base scorer semantics.

resouer closed this Apr 17, 2026
arsenis-cmd added a commit to arsenis-cmd/parameter-golf that referenced this pull request Apr 17, 2026
… (3-seed mean)

GatedDeltaNet linear attention (FLA) + legal score-first TTT on PR openai#1687
K_KVShare_Wider architecture. 3-seed mean: 1.00995 BPB (std 0.0012).

Seeds: 42 (1.01130), 314 (1.00896), 999 (1.00959)
TTT gain: ~-0.010 BPB per seed
All artifacts under 16 MiB.

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 17, 2026
genji0306 added a commit to genji0306/parameter-golf that referenced this pull request Apr 17, 2026
Independent 3-seed reproduction of GatedDeltaNet K_KVShare_Wider on
8xH100 SXM. Builds on PR openai#1687 (resouer). No TTT, no SLOT, no n-gram.

Seeds: 42 (1.0353), 1337 (1.0333), 2025 (1.0330)
Mean: 1.0339 ± 0.0012 | Artifact: 15.88 MB mean
genji0306 added a commit to genji0306/parameter-golf that referenced this pull request Apr 17, 2026
- is_boundary defaults to True (was zeros)
- skip control/unknown/unused tokens early
- handle byte tokens as 1 byte explicitly
- strip sentencepiece space marker before UTF-8 encoding
- use int16 for base_bytes (was float32)

Same bug that closed PR openai#1687.
aamodbhatt pushed a commit to aamodbhatt/parameter-golf that referenced this pull request Apr 18, 2026
…0 (3-seed mean)

GatedDeltaNet linear attention (FLA) K_KVShare_Wider + legal score-first
TTT (SGD 3ep freeze=2) + brotli-11 compression. 3-seed mean: 1.00980 BPB
(std 0.0015). All artifacts under 16 MB.

Seeds: 1337 (1.00803), 42 (1.01069), 2025 (1.01067)
TTT gain: ~-0.009 BPB per seed

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
aamodbhatt pushed a commit to aamodbhatt/parameter-golf that referenced this pull request Apr 18, 2026
… mean)

GatedDeltaNet linear attention (FLA) K_KVShare_Wider + brotli-11
compression. No TTT — pure fixed predictor (Track A). 3-seed mean:
1.01902 BPB (std 0.0017). All artifacts under 16 MB.

Seeds: 1337 (1.01720), 42 (1.02054), 2025 (1.01933)

Based on PR openai#1687 by @resouer.
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 18, 2026
Aweb's record-attempt submission, building on PR openai#1711 (1.00980 BPB) by
adding EMA-Teacher Distillation (Tarvainen & Valpola NeurIPS 2017,
'Mean teachers are better role models') as the novel contribution.

Loss: L = (1-α)·CE(target) + α·KL(student || teacher.detach())

Teacher is a separate copy of the student model, periodically (every K=16 steps)
synchronized from the EMA-smoothed state already maintained by the frontier code.
Alpha ramps linearly 0 → 0.3 over the middle 40% of training (steps 30%–70%).
Temperature scaling per Hinton soft-target convention (KL × T²).
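
A minimal sketch of this distillation objective, assuming the stated direction KL(student || teacher.detach()) and Hinton-style T² scaling; function and variable names here are illustrative, not the commit's code:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, alpha, T=2.0):
    # Hard-label term on the student.
    ce = F.cross_entropy(student_logits, targets)
    # Soft-target term: KL(student || teacher), teacher detached so the
    # gradient routes to the student only; KL scaled by T^2 per Hinton.
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach() / T, dim=-1)
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1).mean()
    return (1 - alpha) * ce + alpha * (T ** 2) * kl

def alpha_at(step, total_steps, peak=0.3):
    # Linear ramp 0 -> peak over the middle 40% of training (steps 30%-70%).
    frac = step / total_steps
    return peak * min(max((frac - 0.3) / 0.4, 0.0), 1.0)
```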

Verified novel via gh search (mean teacher / EMA teacher / distillation / KL
soft targets) — zero matching open PRs in the competition.

Verified legal under Issue openai#1017 conditions 1-4:
  - Causal (teacher uses same forward as student)
  - Full distribution (KL on full softmax over vocab)
  - Score-before-update (distillation is training-time only; eval unchanged)
  - Single L→R pass (no rescoring)

CPU smoke test (8 cases, FLA-independent) passes:
  CE-only path correct, EMT path differs from CE, gradient routes to student
  not teacher, temperature scaling active, alpha schedule correct, mini-training
  loss decreases, KL of identical distributions = 0.

Credits: PR openai#1711 (aamodbhatt) GDN+brotli base; PR openai#1687 (resouer) GDN K_KVShare;
PR openai#461 (Christopher-Lee-McClendon) Score-First Legal TTT; FLA library
(sustcsonglin); Tarvainen & Valpola (NeurIPS 2017) Mean Teacher framework.