Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5-seed mean) #1394
Merged
cocohearts merged 1 commit into openai:main, Apr 9, 2026
Conversation
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 6, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 6, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 6, 2026
vaibhav-i added a commit to vaibhav-i/parameter-golf that referenced this pull request on Apr 6, 2026
New base: PR openai#1394 (clarkkev SP8192 + SDClip + GPTQ embeddings, 1.08563 BPB). Experiments (all build on new_base_pr1394):
- exp_polar_express: 4-step minimax-optimal NS (arXiv:2505.16932), ~-0.002 BPB
- exp_causal_slot: per-window delta on context tokens, AdamW 16 steps, ~-0.013 BPB
- exp_log_bias: streaming online log-bias (Nacrith arXiv:2602.19626), ~-0.015 BPB
Research briefs:
- research/2026-04-04-full-scan-brief.md
- research/2026-04-05-scan-brief.md (updated: pre-quant TTT ruled illegal)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This was referenced Apr 6, 2026
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 6, 2026
…(3-seed mean) On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all fitting 16MB with 7-11K margin.
Per-seed (post-TTT):
- seed 0: 1.08210 (val_loss 2.79517)
- seed 42: 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean: 1.08279 (2.79697 nats per token)
Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token, clearing the 0.005-nats record threshold by 0.00231 nats per seed. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT matches the PR openai#549 precedent: every chunk is scored under inference_mode() before any parameter update.
erichroepke added a commit to erichroepke/parameter-golf that referenced this pull request on Apr 6, 2026
…ed mean) Merges @clarkkev's openai#1394 (SP8192, SDClip, GPTQ embeddings, skip gates) with @stukenov's openai#1364 (pre-quant AdamW TTT). First combination of these techniques. 3-seed mean: 1.07948 BPB (std=0.00043), artifact 15.12 MB. Built with Claude Opus 4.6 as AI co-author.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aryanbhosale added a commit to aryanbhosale/parameter-golf that referenced this pull request on Apr 6, 2026
… mean) SP8192 + Pre-Quant AdamW TTT + QK-Gain 5.0 on PR openai#1394 base. 3-seed mean: 1.0791 BPB. Track A, no eval-time adaptation.
Contributor: @clarkkev strikes again. Clean and elegant as always.
aravhawk added a commit to aravhawk/parameter-golf that referenced this pull request on Apr 7, 2026
- matrix_lr 0.025 -> 0.02 (matches SP8192 base, better for larger models)
- scalar_lr 0.025 -> 0.02
- tied_embed_lr 0.035 -> 0.03
- warmdown_iters 2800 -> 3500 (~66.7% of training, matches all top 5)
These match the proven hyperparameters from clarkkev's SP8192 base (PR openai#1394) and every top-5 submission.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 7, 2026
…am Tilt — val_bpb 1.07800 (3-seed mean)
3-lever stack on top of the PR openai#1394 sp8192 baseline:
- Parallel residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3, LOOP_END=5; extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)
Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).
Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- beats PR openai#1394 (1.08563) by 0.01971 nats per token
- beats PR openai#1420 (1.08014) by 0.00553 nats per token
- beats own PR openai#1413 (1.08279) by 0.01237 nats per token
All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only hash construction, full-vocab renormalized one-token tilt, score-before-update ordering inside the C++ kernel, and a single left-to-right pass. The C++ n-gram kernel was ported from PR openai#1420 with the nanobind dependency removed (extern "C" shim + ctypes loader, single g++ -shared invocation at runtime). 5-seed re-verification via the shipped mini wrapper is in progress; this PR will be updated with the final 5-seed mean once s1337 and s2025 land.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 7, 2026
Adds s1337 (1.07801) and s2025 (1.07862) via the shipped mini wrapper. The 5-seed mean is +0.00013 worse than the initial 3-seed mean (1.07800), which is well within the std (~0.00046). Margins vs the legal open chronology are unchanged in direction:
- vs PR openai#1394 (1.08563): -0.01938 nats per token (margin +0.01438 over the 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00520 nats per token (margin +0.00020 over the 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01205 nats per token
3 of 5 seeds (s42, s1337, s2025) are now mini-wrapper-verified for fit; s0 and s1234 mini-wrapper re-runs are still in progress.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 7, 2026
All 5 seeds (s0, s42, s1234, s1337, s2025) re-run via the shipped mini wrapper. The mean improves slightly from the prior mixed-source 1.07813 to 1.07807 because s1234 produced a noticeably lower TTT under the mini wrapper (1.07813 mini vs 1.07848 raw, -0.00035: within float64 reordering noise but the largest single-seed drift in the verification set).
All 5 artifact sizes are direct from the mini-wrapper runs (not projections):
- s0: 15,992,304 bytes (7,696-byte headroom)
- s42: 15,993,733 bytes (6,267-byte headroom)
- s1234: 15,990,539 bytes (9,461-byte headroom)
- s1337: 15,988,039 bytes (11,961-byte headroom)
- s2025: 15,992,215 bytes (7,785-byte headroom)
Margins vs the legal open chronology:
- vs PR openai#1394 (1.08563): -0.01952 nats per token (margin +0.01452 over the 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00534 nats per token (margin +0.00034 over the 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01218 nats per token
All four issue openai#1017 conditions remain verified for the n-gram tilt path.
amrayach added a commit to amrayach/parameter-golf that referenced this pull request on Apr 7, 2026
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request on Apr 30, 2026
…val_bpb 1.07983
3-seed mean val_bpb 1.07983 (std 0.00050) on the PR openai#1394 sp8192 stack. Changes from the PR openai#1394 + PR openai#1413 baseline:
- Muon momentum = 0.97 (vs 0.99 default); warmup 0.92→0.97 unchanged
- Causal token n-gram tilt (base_beta=2.0, agree_bonus=0.1) on top of legal score-first TTT; within-word and word-start experts explicitly disabled (within_beta=0, word_beta=0) because they cannot be made fully causal
- 3-seed verification (seeds 0/42/1234)
Seeds:
- seed 0 → 1.07928 bpb / 2.78790 nats / 15,993,346 bytes
- seed 42 → 1.07997 bpb / 2.78967 nats / 15,992,995 bytes
- seed 1234 → 1.08025 bpb / 2.79039 nats / 15,994,604 bytes
- mean → 1.07983 bpb / 2.78932 nats / 15,993,648 bytes
Delta vs current merged SOTA PR openai#1493 (1.0810): 0.00117 bpb / 0.00302 nats per token.
Credits: @clarkkev (base PR openai#1394 sp8192 stack), @abaybektursun (n-gram tilt kernel PR openai#1420, causal fix applied), prior legal-TTT precedent PR openai#549 / PR openai#461.
Platform: 8xH100 80GB SXM, PyTorch 2.9.1+cu128. Training 588s, eval <437s per seed, both under the 600s budget. Artifact under 16 MB on all 3 seeds.
Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip + Simplifications — val_bpb 1.08563
val bpb: 1.08563 (5-seed mean, std=0.0007)
Changes
This script builds on #1218. The main changes are:
Quantization–Compression Tradeoffs
Quantization and compression interact in interesting ways. The compressed size depends not just on the bitwidth, but also on the clip range (also called the scale) used during quantization. An int5-quantized network can actually compress smaller than an int4 one if the int5 quantization uses a much wider clip range. The reason is that the effectiveness of compression algorithms like brotli depends on the entropy of the data they are compressing, and increasing the clip range can lower that entropy.
An example
Neural network weights are approximately normally distributed (a). In this example, we could clip the weights to [-1, 1] and uniformly quantize them into int5 (b). But this seems a bit wasteful because many of those bins are spent modeling the tails of the distribution, where very few weights lie. Instead, we could clip to [-0.5, 0.5] and use int4 (c). Or we could go one step further and use a non-uniform quantizer such as NF4 (d) so there are approximately the same number of weights at each quantized value.
Now here is the surprising part: after compression, int4 is only slightly smaller than int5, and NF4 is quite a bit larger. Why? Because the effectiveness of compression depends not just on the raw number of bits, but also on the entropy of the quantized values. When we moved from int5 to int4, we made the histogram flatter, which increases entropy. NF4 flattens the histogram even further by design, pushing the entropy higher still.
Another view is that the int4 and int5 parameters are mostly the same. The only difference is that the weights int4 would have clipped to its extreme quantized values (±7) can take on larger values in int5; since there are very few of them, they do not substantially increase the compressed size.
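The int5-vs-int4 effect can be checked numerically. Below is a small sketch (the function name and the σ = 0.25 setup are illustrative choices, not code from this PR) that quantizes the same Gaussian weights two ways and compares the entropy of the resulting values:

```python
import numpy as np

def quantized_entropy(weights, clip, bits):
    """Entropy (bits per weight) after clipping to [-clip, clip] and
    uniformly quantizing into 2**bits evenly spaced bins."""
    w = np.clip(weights, -clip, clip)
    nbins = 2 ** bits
    # map [-clip, clip] onto bin indices 0 .. nbins-1
    idx = np.minimum(((w + clip) / (2 * clip) * nbins).astype(np.int64), nbins - 1)
    p = np.bincount(idx, minlength=nbins) / len(idx)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.25, size=1_000_000)  # sigma = 0.25

h_int5 = quantized_entropy(w, clip=1.0, bits=5)  # wide clip: [-4 sigma, 4 sigma]
h_int4 = quantized_entropy(w, clip=0.5, bits=4)  # narrow clip: [-2 sigma, 2 sigma]
```

With these settings, h_int5 comes out only roughly 0.2 bits per weight above h_int4 despite carrying one extra raw bit, which is exactly the effect described above.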
Mathematical explanation
Suppose our network has $n$ weights and we quantize each one to $b$ bits. The quantized model size is $s_q = n b$. However, we also compress our network after quantizing. A useful first approximation is that the compressed size $s$ is proportional to $H(q)$, the entropy of the quantized weights:

$$s \approx n \, H(q)$$
This is not exact: compressors can also exploit structure beyond the marginal distribution. But neural network weights usually contain much less structure than natural data, so in practice their compressed size is often very close to what their entropy would suggest. So what is $H(q)$? Suppose our weights are normally distributed:

$$w \sim \mathcal{N}(0, \sigma^2)$$
The differential entropy is

$$h(w) = \tfrac{1}{2} \log\!\left(2 \pi e \sigma^2\right)$$
Now, suppose we clip our weights to $[-c, c]$ and quantize them into $2^b$ evenly spaced bins, i.e., we uniformly quantize them into int-$b$. Each bin then has width

$$\Delta = \frac{2c}{2^b}$$
The entropy of the resulting quantized weights, which we call $q$, is approximately

$$H(q) \approx h(w) - \log \Delta = \tfrac{1}{2} \log\!\left(2 \pi e \sigma^2\right) - \log \frac{2c}{2^b}$$
If we measure entropy in bits, this becomes

$$H(q) \approx \tfrac{1}{2} \log_2\!\left(2 \pi e \sigma^2\right) - \log_2(2c) + b$$
This approximation becomes more accurate when $c \gg \sigma$ (since in that case only a small fraction of the weights are clipped), when $b$ is large enough that the quantization bins are small, and when $n$ is large enough that we still have many weights per bin.
A natural choice is to set the clip range proportional to the standard deviation, writing $c = k\sigma$ for some hyperparameter $k$. This makes the amount of clipping scale-invariant: if the weights become 2x larger, the clip range also becomes 2x larger. Substituting $c = k\sigma$ into the expression above gives

$$H(q) \approx b + \tfrac{1}{2} \log_2(2 \pi e) - \log_2(2k)$$
This gives two ways to reduce compressed model size: decrease $b$ (for example, go from int5 to int4), or increase $k$ (use a wider clip range so the quantized values get more concentrated near the center, which lowers their entropy). In fact, increasing $b$ and increasing $k$ have roughly opposite effects. The histogram produced by $(b, k)$ exactly matches the middle $2^b$ bins of $(b + 1, 2k)$. The $(b + 1, 2k)$ quantization also includes additional outer bins, but very few weights lie in those bins, so $H(q)$ may not increase by much. This is exactly what we saw in the int5 versus int4 example.
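The closed-form approximation can be sanity-checked against simulated weights. The sketch below (function names and the specific $k$, $b$, $\sigma$ values are mine, chosen so that clipping and bin-width errors stay small) compares the formula to the empirical entropy of quantized Gaussian samples:

```python
import numpy as np

def empirical_entropy(sigma, k, b, n=2_000_000, seed=0):
    # entropy (bits per weight) of N(0, sigma^2) samples clipped to
    # [-k*sigma, k*sigma] and uniformly quantized into 2**b bins
    rng = np.random.default_rng(seed)
    w = np.clip(rng.normal(0.0, sigma, n), -k * sigma, k * sigma)
    nbins = 2 ** b
    idx = np.minimum(((w / (k * sigma) + 1.0) / 2.0 * nbins).astype(np.int64),
                     nbins - 1)
    p = np.bincount(idx, minlength=nbins) / n
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def predicted_entropy(k, b):
    # H(q) ~ b + (1/2) log2(2 pi e) - log2(2k); note that sigma drops out
    return b + 0.5 * np.log2(2 * np.pi * np.e) - np.log2(2 * k)

pred = predicted_entropy(k=6, b=8)
emp_small = empirical_entropy(sigma=0.1, k=6, b=8, seed=0)
emp_large = empirical_entropy(sigma=3.0, k=6, b=8, seed=1)
```

The two empirical values should match each other and the prediction to within a few hundredths of a bit, illustrating both the accuracy of the approximation in the $c \gg \sigma$ regime and its scale invariance.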
Of course our approximations do not hold exactly in practice: the derivation ignores clipping, the weight distribution is only approximately normal, and compression depends on the full byte representation, not just the marginal histogram of quantized values. However, when I examined some trained networks, I found that the standard deviation of a matrix (an estimate of $\sigma$) correlated very strongly ($R^2 = 0.995$) with the compression ratio of that matrix under a fixed clip width, suggesting the approximations are reasonable in practice. Lastly, I should note that usually each row is quantized separately, but the same reasoning applies on a per-row basis.
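The claim that compressed size tracks the entropy of the quantized values can be illustrated with a standard-library compressor (zlib below as a stand-in for brotli; the setup and names are illustrative, not from this PR). Storing one quantized symbol per byte makes the compressed size reflect the symbol entropy directly:

```python
import zlib
import numpy as np

def quantize_to_bytes(w, clip, bits):
    # one quantized symbol per byte, so compressed size tracks symbol entropy
    wc = np.clip(w, -clip, clip)
    nbins = 2 ** bits
    idx = np.minimum(((wc + clip) / (2 * clip) * nbins).astype(np.int64),
                     nbins - 1)
    return idx.astype(np.uint8).tobytes()

rng = np.random.default_rng(0)
sigma = 0.25
w = rng.normal(0.0, sigma, 1_000_000)

int4_size = len(zlib.compress(quantize_to_bytes(w, 2 * sigma, 4), 9))  # k=2, b=4
int5_size = len(zlib.compress(quantize_to_bytes(w, 4 * sigma, 5), 9))  # k=4, b=5
```

Here int5 with the 2x-wider clip range compresses only a few percent larger than int4, far less than the one extra raw bit per weight would suggest.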
Improved clipping
The previous practice was to search over multiple clip thresholds to find the one that minimized reconstruction error. In the new version, the clipping threshold for a matrix row is just set at
In practice, I used $b = 6$, $k = 12.85$ for matrix parameters (tuned so the artifact is close to 16MB) and $b = 8$, $k = 20$ for embeddings (they are more sensitive to quantization). As the above analysis suggests, upping the matrix params to int7 or int8 while doubling/quadrupling $k$ produced similarly-sized models, but I stuck with int6 to keep the script consistent with the previous version. Compared with the old approach, the new standard-deviation-based clipping has several advantages:
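A minimal sketch of per-row standard-deviation-based clipping as described above (the function names are mine; $b = 6$, $k = 12.85$ follow the values quoted for matrix parameters, but this is an illustration under those assumptions, not the PR's actual code):

```python
import numpy as np

def quantize_rows(W, b=6, k=12.85):
    # clip each row at c = k * std(row), then uniformly quantize to int-b
    nbins = 2 ** b
    c = k * W.std(axis=1, keepdims=True)   # per-row clip range
    Wc = np.clip(W, -c, c)
    q = np.round((Wc + c) / (2 * c) * (nbins - 1)).astype(np.uint8)
    return q, c

def dequantize_rows(q, c, b=6):
    nbins = 2 ** b
    return q.astype(np.float64) / (nbins - 1) * (2 * c) - c

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(256, 512))
q, c = quantize_rows(W)
W_hat = dequantize_rows(q, c)
```

With $k = 12.85$ essentially no weight is ever clipped, so the per-row reconstruction error is bounded by half a bin width, $c / (2^b - 1)$.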