Record: Casefold V4 Tokenizer + Multi-Phase Global SGD TTT — val_bpb 1.05970 (3-seed mean) #1670

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:dexhunter/casefold-v4-multiphase-sgd-ttt

@dexhunter
Contributor

Summary

  • val_bpb: 1.05970 (3-seed mean, std 0.00031) | 3.05401 nats | ~15.20 MB | 8×H100 SXM, 600s | Phased TTT
  • Casefold tokenizer normalization (lowercase + retrained SP8192) reduces token entropy
  • Multi-phase global SGD TTT with 3 phases on 2000 prefix documents
  • Built on the PR #1530 base (Record: Varlen attention + fused MLP + doc-independent TTT, 1.07336) with adaptive GPTQ clip (MLP=12σ, ATTN=13σ)
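
The score-before-update ordering behind phased TTT can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the unigram model, the function names (`doc_nll`, `sgd_step`, `phased_ttt`), the learning rate, and the even split of documents into phases are hypothetical and are not taken from this PR's code. The only idea carried over from the summary is that each phase's documents are scored with the current weights before a global SGD update uses them.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def doc_nll(logits, doc):
    """Total negative log-likelihood (nats) of one document under a toy unigram model."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in doc)

def sgd_step(logits, docs, lr):
    """One SGD step on mean per-token NLL; the gradient is probs minus empirical frequency."""
    probs = softmax(logits)
    counts = [0] * len(logits)
    n = 0
    for doc in docs:
        for t in doc:
            counts[t] += 1
            n += 1
    return [w - lr * (p - c / n) for w, p, c in zip(logits, probs, counts)]

def phased_ttt(logits, prefix_docs, num_phases=3, lr=0.5):
    """Split docs into at most num_phases chunks; score each chunk, then update on it."""
    phase_size = math.ceil(len(prefix_docs) / num_phases)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(prefix_docs), phase_size):
        phase = prefix_docs[start:start + phase_size]
        # Score first: losses are recorded before the weights have seen this phase.
        for doc in phase:
            total_nll += doc_nll(logits, doc)
            total_tokens += len(doc)
        # Update second: one global step whose result is shared by all later phases.
        logits = sgd_step(logits, phase, lr)
    return logits, total_nll / total_tokens

# Hypothetical token-id documents over a 3-symbol vocabulary.
docs = [[0, 0, 1], [0, 0, 2], [0, 1, 0], [0, 0, 0], [0, 2, 0], [0, 0, 1]]
_, ttt_loss = phased_ttt([0.0, 0.0, 0.0], docs)
_, frozen_loss = phased_ttt([0.0, 0.0, 0.0], docs, lr=0.0)
print(f"frozen: {frozen_loss:.4f} nats/token, phased TTT: {ttt_loss:.4f} nats/token")
```

Setting `lr=0.0` gives the no-adaptation baseline, so the gap between the two printed losses plays the role of a "TTT gain" in this toy setting.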

Key Innovation

Casefold tokenizer preprocessing normalizes text to lowercase before SP8192 tokenization, reducing vocabulary entropy. It is combined with our multi-phase global SGD TTT from PR #1626. Casefold legality is pending organizer review at Issue #1604.
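
A minimal sketch of the normalization step, assuming it is a plain string transform applied before tokenization. The helper name `normalize` and its `mode` parameter are hypothetical; whether the actual pipeline uses `str.lower()` or `str.casefold()` is not stated here, so both are shown. The SentencePiece retraining step is omitted.

```python
def normalize(text: str, mode: str = "lower") -> str:
    """Case-normalize text before it reaches the tokenizer (illustrative only)."""
    if mode == "lower":
        return text.lower()
    if mode == "casefold":
        # casefold() is more aggressive than lower() for some scripts.
        return text.casefold()
    raise ValueError(f"unknown mode: {mode}")

sample = "The model. THE MODEL. the Model."

# Whitespace-split surface forms stand in for tokenizer inputs.
raw_types = set(sample.split())
folded_types = set(normalize(sample).split())

print(len(raw_types), "distinct forms before,", len(folded_types), "after")
print(normalize("Straße"), normalize("Straße", mode="casefold"))
```

Collapsing case variants onto one surface form means a fixed 8192-entry vocabulary no longer has to spend entries on capitalized duplicates, which is the mechanism the paragraph above refers to.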

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, Phased TTT)

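| Seed | Pre-TTT bpb | Post-TTT bpb | TTT gain | Artifact (bytes) |
|------|-------------|--------------|----------|------------------|
| 42   | 1.07122     | 1.05938      | -0.01184 | 15,935,851       |
| 0    | 1.07124     | 1.05961      | -0.01164 | 15,935,513       |
| 1234 | 1.07155     | 1.06010      | -0.01145 | 15,933,440       |
| Mean | 1.07134     | 1.05970      | -0.01164 |                  |
| Std  |             | 0.00031      |          |                  |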
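
The mean row can be re-derived from the per-seed values with a short check. This uses only the rounded figures printed in the results table; the variable names are illustrative.

```python
from statistics import mean

# Per-seed values copied from the results table (rounded to 5 decimals there).
pre_ttt = {42: 1.07122, 0: 1.07124, 1234: 1.07155}
post_ttt = {42: 1.05938, 0: 1.05961, 1234: 1.06010}

# TTT gain per seed: post-TTT bpb minus pre-TTT bpb (negative is better).
gains = {seed: round(post_ttt[seed] - pre_ttt[seed], 5) for seed in pre_ttt}

mean_pre = round(mean(pre_ttt.values()), 5)
mean_post = round(mean(post_ttt.values()), 5)
mean_gain = round(mean(gains.values()), 5)

print("gains:", gains)
print("mean pre:", mean_pre, "mean post:", mean_post, "mean gain:", mean_gain)
```

Recomputing from rounded inputs can differ from the table in the last digit (seed 0 comes out as -0.01163 rather than -0.01164), presumably because the table was produced from unrounded values; the means agree.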

Lineage

PR #1530 (@samacqua) → PR #1626 (@dexhunter, multi-phase SGD TTT) → this PR (+ casefold tokenizer)

Credits

Note

Casefold tokenizer normalization is pending organizer review at Issue #1604. This submission is offered for evaluation under that pending ruling.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 16, 2026
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 17, 2026
…TTT — val_bpb 1.05733 (3-seed mean)

Stacks per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate
on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base.
Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42:   val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0:    val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality pending organizer review at Issue openai#1604.
AttnOutGate and SmearGate are pure architectural additions and comply with
all Issue openai#1017 conditions (causality, normalized distribution, score-before-
update, single pass).
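
The deltas quoted in the commit message above can be checked with a few lines of arithmetic. The inputs are the 3-seed means reported for PR #1670 and for the stacked commit. The delta against PR #1585 is taken as quoted because that PR's own numbers do not appear here. The constant name `BAR_NATS` is illustrative.

```python
# 3-seed means as reported: PR #1670 base, then the stacked AttnOutGate + SmearGate commit.
base_bpb, base_nats = 1.05970, 3.05401
new_bpb, new_nats = 1.05733, 3.04721

delta_bpb = round(new_bpb - base_bpb, 5)
delta_nats = round(new_nats - base_nats, 5)

# Quoted delta vs the casefold leader (PR #1585), compared against the 0.005-nat bar.
leader_delta_nats = -0.01697
BAR_NATS = 0.005
bar_multiple = abs(leader_delta_nats) / BAR_NATS

print(f"delta vs #1670: {delta_bpb:+.5f} BPB / {delta_nats:+.5f} nats")
print(f"delta vs #1585 is {bar_multiple:.2f}x the {BAR_NATS}-nat bar")
```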