
Record: K_KVShare_Wider full-recipe FLA — val_bpb 1.04090 (3-seed mean) #1687

Closed
resouer wants to merge 1 commit into openai:main from resouer:submission/kkvsharewider-fla-record

Conversation


resouer commented Apr 17, 2026

Summary

3-seed mean val_bpb: 1.04089763 (std 0.00106003) | 3.16476760 nats | 8xH100 SXM, 600s

| Seed | Quantized BPB | val_loss (nats) | Artifact (bytes) |
| ---- | ------------- | --------------- | ---------------- |
| 1337 | 1.03967403    | 3.16104735      | 15,762,406       |
| 42   | 1.04153708    | 3.16671180      | 15,870,797       |
| 2025 | 1.04148177    | 3.16654364      | 15,648,800       |
| Mean | 1.04089763    | 3.16476760      | 15,760,668       |
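
A quick check of the reported aggregates from this table (a minimal sketch; note the reported std is the sample standard deviation, ddof = 1):

```python
# Verify the reported 3-seed mean and std from the per-seed BPB values above.
import statistics

bpb = [1.03967403, 1.04153708, 1.04148177]
print(f"{statistics.mean(bpb):.8f}")   # 1.04089763
print(f"{statistics.stdev(bpb):.8f}")  # 0.00106003 (sample std, n - 1)
```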

Mechanism

This candidate packages the stronger K_KVShare_Wider operating point on top of a fuller upstream-style FLA / GatedDeltaNet recipe.

Main idea:

  • FLA / GatedDeltaNet family
  • K_KVShare_Wider: KV sharing (kv_sharing_stride=2) used to buy width rather than depth (see the sketch after this list)
  • fuller upstream-style recipe
  • EMA + SWA + late QAT + int6 artifact path
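
A minimal sketch of the cross-layer KV-sharing idea, shown with plain softmax attention for brevity (the recipe itself uses FLA / GatedDeltaNet layers); module names and shapes are illustrative, not this PR's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVBlock(nn.Module):
    """Attention block that either owns K/V projections or borrows them."""
    def __init__(self, dim: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        # Only every kv_sharing_stride-th layer allocates K/V weights; the
        # saved parameters can be spent on width instead of depth.
        self.kv_proj = nn.Linear(dim, 2 * dim, bias=False) if owns_kv else None

    def forward(self, x, kv=None):
        B, T, D = x.shape
        if self.kv_proj is not None:
            kv = self.kv_proj(x).chunk(2, dim=-1)
        k, v = kv  # follower layers reuse the most recent owner's K/V
        shape = lambda t: t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(
            shape(self.q_proj(x)), shape(k), shape(v), is_causal=True
        )
        return x + self.o_proj(y.transpose(1, 2).reshape(B, T, D)), kv

# kv_sharing_stride=2: layers 0, 2, ... own K/V; layers 1, 3, ... borrow them.
stride, dim = 2, 64
blocks = nn.ModuleList(SharedKVBlock(dim, 4, i % stride == 0) for i in range(4))
x, kv = torch.randn(1, 8, dim), None
for block in blocks:
    x, kv = block(x, kv)
```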

Nearest prior family reference: PR #1370.

Compliance notes

What this candidate does not use:

  • no TTT
  • no SLOT
  • no n-gram overlay
  • no SWA/XSA final scoring path (K_KVShare_Wider has num_swa_layers = 0)

Hardening already applied:

  • all 3 seeds are on one script revision
  • train_gpt.py does not perform runtime dependency downloads
  • dependencies are declared in requirements.txt and expected to be installed before evaluation

Packaging note

This PR keeps the faithful multi-file records surface rather than the tidier single-file experiments.

A direct packaged smoke on this exact multi-file surface (seed 1337) completed at 1.03971272 BPB, versus the measured seed-1337 result of 1.03967403, a delta of only +0.00003869 BPB. Peak memory remained 41,127 MiB.

Packaged-folder verification

Exact draft records-folder audit:

  • packaged folder + largest artifact = 15,991,282 bytes
  • remaining headroom = 8,718 bytes (to the 16,000,000-byte limit)

The packaged code in the records folder passes py_compile from inside the folder.

Reproduction

# Dependencies are pinned; train_gpt.py performs no runtime downloads.
pip install --no-deps -r requirements.txt

# One 600 s run per seed on 8xH100; SEED is supplied via the environment.
SEED=$SEED ARCH_MODE=K MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 EVAL_COMPILE_ENABLED=0 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

…ranch

This branch lifts the validated review package onto a clean upstream/main base so the official submission diff stays to one records folder and one commit. The package keeps the faithful multi-file surface because the packed single-file experiments drifted, while a direct smoke on the current multi-file surface matched the measured candidate within noise.

Constraint: The submission branch must contain only records/ files and must keep the exact measured candidate surface.
Rejected: Reuse the existing fork review branch as-is | it carries many exploratory commits and is noisier than a clean submit branch
Rejected: Promote the packed single-file variant | it was not fidelity-cleared for this candidate
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If packaging changes again, rerun at least one packaged smoke before treating the branch as submission-ready
Tested: py_compile on packaged Python files; exact folder-size audit (15,991,282 bytes total); packaged multi-file smoke on PR-head surface at 1.03971272 BPB
Not-tested: Re-running the full 3-seed sweep on this rebased records-only branch (package contents unchanged)

bigbag commented Apr 17, 2026

Hi @resouer — I think this PR hits the same build_sentencepiece_luts byte-counting bug that closed #1545 / #1576 / #1632 / #1672 / #1681. Flagging for visibility.

Where: train_gdn_7k.py L192–205 sets base_bytes[i] = len(piece[1:].encode("utf-8")) + 1 for ▁-prefixed pieces — so the leading-space byte is already baked in. Then the scorer at L361–363 adds another +1 via has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]. Every ▁-token ends up counted with one extra byte.

Reference (base train_gpt.py L180–204) strips the ▁ from piece before calling .encode("utf-8"), so base_bytes counts only the content and the scorer's +1 is the only leading-space accounting.
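
A small, self-contained illustration of the double count (the values are made up for the demo; the real LUTs live in train_gdn_7k.py and train_gpt.py):

```python
piece = "\u2581the"  # a SentencePiece piece carrying the leading-space marker

# This PR's construction bakes the leading-space byte into base_bytes...
pr_base_bytes = len(piece[1:].encode("utf-8")) + 1   # 3 + 1 = 4
# ...and the scorer adds the leading-space byte again mid-sentence:
has_leading_space, prev_is_boundary = 1, 0
pr_count = pr_base_bytes + has_leading_space * (1 - prev_is_boundary)   # 5

# Reference construction: strip the marker first, so the scorer's +1 is the
# only leading-space accounting.
ref_base_bytes = len(piece.lstrip("\u2581").encode("utf-8"))            # 3
ref_count = ref_base_bytes + has_leading_space * (1 - prev_is_boundary)  # 4

print(pr_count, ref_count)  # 5 4 -> one extra byte per mid-sentence ▁-token
```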

Empirical check on fineweb_1024_bpe.model with a 3,648-byte FineWeb-style sample (reproduces the scorer arithmetic from L361–363):

|            | actual text | reference LUT | this PR's LUT   |
| ---------- | ----------- | ------------- | --------------- |
| byte_count | 3,648       | 3,647         | 4,214 (+15.55%) |

Per-token delta (PR − ref): 68.0% of tokens → 0, 31.5% → +1 (the ▁-tokens), 0.5% → +5 (looks like sp.is_byte tokens, where the reference uses base_bytes = 1 but this impl uses len("<0x..>".encode())).

Correction factor ×1.1555 → the reported 1.0409 3-seed mean becomes ≈ 1.20 under the reference scorer, which is in line with the prior GDN-Hybrid runs before rescoring.
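
The correction-factor arithmetic, spelled out with the numbers above:

```python
factor = 4214 / 3647                  # ≈ 1.1555 (+15.55%)
print(round(1.04089763 * factor, 4))  # ≈ 1.2027, i.e. the "≈ 1.20" above
```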

Minimal fix, mirroring base train_gpt.py L180–204 (a sketch follows the list):

  • strip ▁ from piece before piece.encode("utf-8")
  • default is_boundary_token to True; set False only for non-control/unknown/unused tokens
  • handle sp.is_byte(i) → base_bytes = 1
  • add sp.is_unused(i) to the control/unknown skip
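
A hedged sketch of that corrected LUT build (array and variable names follow this comment's wording, not necessarily the real files):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="fineweb_1024_bpe.model")
n = sp.get_piece_size()

base_bytes = np.zeros(n, dtype=np.int16)    # int16, not float32
has_leading_space = np.zeros(n, dtype=bool)
is_boundary_token = np.ones(n, dtype=bool)  # default True

for i in range(n):
    # control / unknown / unused tokens stay boundary tokens with 0 bytes
    if sp.is_control(i) or sp.is_unknown(i) or sp.is_unused(i):
        continue
    is_boundary_token[i] = False
    if sp.is_byte(i):                        # "<0x..>" pieces are 1 raw byte
        base_bytes[i] = 1
        continue
    piece = sp.id_to_piece(i)
    has_leading_space[i] = piece.startswith("\u2581")
    # strip the marker BEFORE encoding; the scorer's +1 then does the
    # leading-space accounting exactly once
    base_bytes[i] = len(piece.lstrip("\u2581").encode("utf-8"))
```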

Happy to share the verification script if useful.


resouer commented Apr 17, 2026

Closing this submission after local verification found a scoring bug in the SentencePiece byte-accounting path. The corrected rerun on commit 3bf1ec3 lands at final_int6_roundtrip_exact val_bpb: 1.22285488, so the reported 1.04089763 result does not hold under the base scorer semantics.

resouer closed this Apr 17, 2026
arsenis-cmd added a commit to arsenis-cmd/parameter-golf that referenced this pull request Apr 17, 2026
… (3-seed mean)

GatedDeltaNet linear attention (FLA) + legal score-first TTT on PR openai#1687
K_KVShare_Wider architecture. 3-seed mean: 1.00995 BPB (std 0.0012).

Seeds: 42 (1.01130), 314 (1.00896), 999 (1.00959)
TTT gain: ~-0.010 BPB per seed
All artifacts under 16 MiB.

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 17, 2026
genji0306 added a commit to genji0306/parameter-golf that referenced this pull request Apr 17, 2026
Independent 3-seed reproduction of GatedDeltaNet K_KVShare_Wider on
8xH100 SXM. Builds on PR openai#1687 (resouer). No TTT, no SLOT, no n-gram.

Seeds: 42 (1.0353), 1337 (1.0333), 2025 (1.0330)
Mean: 1.0339 ± 0.0012 | Artifact: 15.88 MB mean
genji0306 added a commit to genji0306/parameter-golf that referenced this pull request Apr 17, 2026
- is_boundary defaults to True (was zeros)
- skip control/unknown/unused tokens early
- handle byte tokens as 1 byte explicitly
- strip sentencepiece space marker before UTF-8 encoding
- use int16 for base_bytes (was float32)

Same bug that closed PR openai#1687.
aamodbhatt pushed a commit to aamodbhatt/parameter-golf that referenced this pull request Apr 18, 2026
…0 (3-seed mean)

GatedDeltaNet linear attention (FLA) K_KVShare_Wider + legal score-first
TTT (SGD 3ep freeze=2) + brotli-11 compression. 3-seed mean: 1.00980 BPB
(std 0.0015). All artifacts under 16 MB.

Seeds: 1337 (1.00803), 42 (1.01069), 2025 (1.01067)
TTT gain: ~-0.009 BPB per seed

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
aamodbhatt pushed a commit to aamodbhatt/parameter-golf that referenced this pull request Apr 18, 2026
… mean)

GatedDeltaNet linear attention (FLA) K_KVShare_Wider + brotli-11
compression. No TTT — pure fixed predictor (Track A). 3-seed mean:
1.01902 BPB (std 0.0017). All artifacts under 16 MB.

Seeds: 1337 (1.01720), 42 (1.02054), 2025 (1.01933)

Based on PR openai#1687 by @resouer.
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 18, 2026
Aweb's record-attempt submission, building on PR openai#1711 (1.00980 BPB) by
adding EMA-Teacher Distillation (Tarvainen & Valpola NeurIPS 2017,
'Mean teachers are better role models') as the novel contribution.

Loss: L = (1-α)·CE(target) + α·KL(student || teacher.detach())

Teacher is a separate copy of the student model, periodically (every K=16 steps)
synchronized from the EMA-smoothed state already maintained by the frontier code.
Alpha ramps linearly 0 → 0.3 over the middle 40% of training (steps 30%–70%).
Temperature scaling per Hinton soft-target convention (KL × T²).
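
A minimal sketch of this distillation objective, assuming the stated direction KL(student || teacher.detach()) and Hinton-style T² scaling; function and variable names here are illustrative, not the commit's code:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, alpha, T=2.0):
    # Hard-label term on the student.
    ce = F.cross_entropy(student_logits, targets)
    # Soft-target term: KL(student || teacher), teacher detached so the
    # gradient routes to the student only; KL scaled by T^2 per Hinton.
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach() / T, dim=-1)
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1).mean()
    return (1 - alpha) * ce + alpha * (T ** 2) * kl

def alpha_at(step, total_steps, peak=0.3):
    # Linear ramp 0 -> peak over the middle 40% of training (steps 30%-70%).
    frac = step / total_steps
    return peak * min(max((frac - 0.3) / 0.4, 0.0), 1.0)
```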

Verified novel via gh search (mean teacher / EMA teacher / distillation / KL
soft targets) — zero matching open PRs in the competition.

Verified legal under Issue openai#1017 conditions 1-4:
  - Causal (teacher uses same forward as student)
  - Full distribution (KL on full softmax over vocab)
  - Score-before-update (distillation is training-time only; eval unchanged)
  - Single L→R pass (no rescoring)

CPU smoke test (8 cases, FLA-independent) passes:
  CE-only path correct, EMT path differs from CE, gradient routes to student
  not teacher, temperature scaling active, alpha schedule correct, mini-training
  loss decreases, KL of identical distributions = 0.

Credits: PR openai#1711 (aamodbhatt) GDN+brotli base; PR openai#1687 (resouer) GDN K_KVShare;
PR openai#461 (Christopher-Lee-McClendon) Score-First Legal TTT; FLA library
(sustcsonglin); Tarvainen & Valpola (NeurIPS 2017) Mean Teacher framework.