Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean) by clarkkev · Pull Request #1218 · openai/parameter-golf

clarkkev · 2026-04-01T11:54:49Z

Record: 4096-Vocab + Larger Model + High WD + Simplifications — val_bpb 1.09785

val bpb: 1.09785 (3-seed mean, std=0.0004)

Seed	Steps	Pre-quant BPB	Post-quant BPB	Sliding BPB	Artifact
42	5967	1.10411	1.11588	1.09744	15,915,268
1337	5962	1.10482	1.11631	1.09795	15,905,460
2025	5961	1.10507	1.11641	1.09816	15,927,782
Mean		1.10467	1.11620	1.09785	15,916,170

Overview

This script builds on the 03-23 leaderboard record. The main changes are:

Fixes

Fixed a small bug in the sliding window evaluation causing it to score tokens at the end of the val dataset multiple times. This bug didn't significantly affect results: it added roughly 2k duplicate contributions to the total loss and byte counts over a validation set of about 6M tokens. The faulty line was:
window_starts = [ws for ws in range(0, total_tokens, stride) if min(ws + seq_len, total_tokens) - ws >= 1], and it should be:
window_starts = [ws for ws in range(0, total_tokens, stride) if ws + seq_len - stride < total_tokens]

Simplifications

Use XSA in all layers instead of only the last 4.
Removed parameter banking and distributed muon implementation and instead just used Muon + DDP.
Removed test time training. I doubt that 0.1% additional tokens will improve the model generally, and for long docs I think it makes more sense to work on extending the sequence length.
Removed quantization-aware training, since it appeared to provide little or no benefit.
Removed gated attention.
Removed value residuals.
Removed hash embeddings, which are probably less necessary after increasing the vocab size.
Removed the smear gate, for the same reason.

Additions

Increased the vocabulary size from 1024 to 4096. I used the existing data/download_hf_docs_and_tokenize.py to build the sentencepiece tokenizer and pre-tokenized data. The tokenizer model grew by ~50kb, but even with that added, the final artifacts would be below the 16MB cap. A larger vocab means the model sees more context for the same sequence length and more train data per step.
Use a bigger but more strongly regularized model. I discovered that the compression ratio of a weight matrix (i.e., quantized-and-compressed-mb / raw-mb) correlates extremely well with the matrix's root-mean-square (torch.sqrt(torch.mean(x**2))) with an R^2 near 0.99. This suggests that the weight decay is a good lever for reducing the compressed size, which can let us add more parameters to the model. In particular this script uses:
- Higher weight decays: muon weight decay increased 0.04 -> 0.085, and added an embeddings weight decay of 0.085. Additionally, decreased the adam weight decay 0.04 -> 0.02, as scalar parameters shouldn't need to be low-magnitude.
- Wider MLPs, increasing mlp_mult 3 -> 4.
- A decreased learning rate 0.025 -> 0.02, as larger models generally benefit from smaller LRs.
Added the coprime-stride data loader from #726. The benefit is that it avoids showing the model sequences from the same document in the same/nearby minibatches by jumping around the data files.
Added GPTQ Hessian-aware quantization. My implementation is based on #1060 and reserves some time from training for Hessian computation.
Use more efficient byte shuffle + brotli compression from #1089.
Added sigmoid-gated skip connections to the unet, also from #1089.
Increased qk_gain_init 1.5 -> 4 following #1125.

…pb 1.09785 (3-seed mean)

mikeapedia · 2026-04-01T22:52:45Z

Awesome results @clarkkev!

Strip complexity, bigger model, higher weight decay: MLP 3x → 4x (32.2M params vs 27M) MUON_WD 0.04 → 0.085 (better int6 compression) ADAM_WD 0.04 → 0.02 (scalars) BigramHash removed, VE removed QK_GAIN_INIT=4.0 PR openai#1218 proved this approach works: simplify + regularize = 1.098 on sp4096. On Scylla (998 tokens): should fit ~15.9MB at high WD.

abaybektursun · 2026-04-02T02:46:59Z

so elegant, thing of beauty my friend.
How did you discover that the compression ratio correlates with the matrix's root-mean-square?
In general what did you test and what experiments did you conduct before this?
Because there were a lot of great decisions made and it's not obvious how you made them. Is it just established expertise? Would you mind sharing? Thanks!

NoesisGenesis · 2026-04-02T14:18:55Z

@abaybektursun so elegant, thing of beauty my friend. How did you discover that the compression ratio correlates with the matrix's root-mean-square?

I am not the author of this PR, but I have spent some time on the same quantization-and-compression pipeline to have views on the moving parts.

Mostly I think the discovery itself is the boring part, and the mechanism is the interesting part.

The boring part is: you just instrument the exact export path matrix by matrix. For each tensor, log raw MB, quantized-and-compressed MB, and a few simple statistics on the float weights. Then scatter compressed_mb / raw_mb against those statistics. RMS jumps out very quickly.

The reason it jumps out is that, in this pipeline, RMS sits upstream of almost everything the exporter cares about. GPTQ is using rowwise scales, so if a matrix has lower RMS, its rows usually get smaller scales and therefore a finer effective grid. That pushes more coefficients into small quantized values, especially 0, ±1, ±2. Then selective pruning likes exactly those values, because they are the cheapest ones to zero. Then byte shuffle + Brotli likes the resulting stream, because the symbol histogram is more peaked and the scale bytes repeat more. So lower RMS changes the quantized representation into the exact kind of byte stream that the whole stack prefers.

I suspect the reason the R² got so absurdly high is that, in that regime, the matrices were fairly self-similar apart from scale. If the row max/RMS ratio and general histogram shape do not vary too much, then one scalar, RMS, ends up predicting most of the row-scale distribution seen by the quantizer, and from there a lot of the final compressed size is basically determined.

I would not overstate it as a universal law, though. It is a very strong empirical regularity for this specific setup: rowwise low-bit quantization, pruning of small codes, byte shuffle, then Brotli. It can break if two matrices have the same RMS but different outlier structure, different max/RMS, different scale distributions, different bit assignments, or if one tensor bypasses the low-bit path. In short, RMS is an excellent first-order proxy for the rate term of the deployed artifact.

The deeper point, to me, is that this is really a rate-distortion problem. RMS tells you a lot about the rate side, meaning how many bytes the matrix will want after quantization and compression. It does not tell you the whole distortion side, meaning how much loss you incur if you make that matrix smaller. I have seen cases where a matrix looks benign by raw MSE but is catastrophic by Hessian-weighted error.

What I would recommend is to use weight decay as a crude shadow price on bytes, then spend the saved bytes on the matrices whose Hessian-weighted quantization damage is worst.

Because there were a lot of great decisions made and it's not obvious how you made them.

The simplifications make sense in the local logic of that PR, though not all for the same reason. Once you have the compression lever via RMS and a 4096 vocab, removing lexical auxiliaries like hash embeddings and the smear gate becomes much easier to justify, because some of what they were buying is now being bought more directly by the tokenizer and by the larger core model. Likewise, if stronger regularization is making the main weight banks cheaper to quantize and compress, then spending the recovered budget on a wider MLP is a very coherent move.

At the same time, I do not think all of those removals should be read as settled truths. Some of them, especially the lexical extras, follow pretty naturally from the larger vocab; others feel more like strong empirical bets that also hand-wave away interactions that I suspect matter, particularly around matrix-specific quantization sensitivity and export behavior. My own not-yet-submitted SOTA PR has led me to somewhat different conclusions on a few of these choices. The space of good recipes is larger than any single record PR might suggest.

My own framing I keep coming back to is that Parameter Golf is not about training a model and then compressing it, although so far most PRs read like that is exactly what it is. It is about learning an equivalence class of functions, then choosing the member of that class whose quantized, side-informed, bank-packed serialization has the lowest task loss at 16MB.

Or, if you prefer it in fewer syllables: the true parameters are the bits.

Train weights that the quantizer will thank you for. This submission is what it looks like when someone starts pulling on that thread. There is a lot more to find in this direction. Go further: shape the training dynamics from the ground up so the learned solution already lives in a compression-friendly, distortion-stable basin. You can push a surprisingly large model through the 16 MB bottleneck and have it come out the other side intact.

…1.0929 (3-seed mean) Adds three techniques to PR openai#1218's 4096-vocab high-WD stack: - MuonEq-R optimizer (row-norm before NS5 orthogonalization) - Depth recurrence on layers 4,5 (shared MLP, zero extra params) - Mixed int5/int6 GPTQ via Hessian sensitivity ranking 3-seed mean: 1.0929 BPB / 2.5145 nats All seeds under 16MB (max: 15,981,324 bytes) No TTT, no SLOT, no eval-time adaptation.

mtybadger · 2026-04-02T20:36:16Z

I'm so happy V4096 is back

@clarkkev

… (3-seed mean) Improves PR openai#1260 (1.0929) by using N_INT6=61 (one more int6 layer) with a smaller mini runner (21,396 bytes) that creates enough headroom. 3-seed mean: 1.0924 BPB / 2.5133 nats (seeds 42, 0, 7) All seeds under 16MB (max: 15,996,591 bytes) No TTT, no SLOT, no eval-time adaptation. Techniques: MuonEq-R optimizer, depth recurrence (layers 4,5 shared MLP), 61 int6 + 5 int5 Hessian-ranked GPTQ, brotli-11 compression. Built on PR openai#1218 by @clarkkev.

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

clarkkev · 2026-04-03T12:56:32Z

Thanks everyone!

To answer @abaybektursun's questions

I originally looked at RMS versus per-matrix compression ratio because I was wondering whether different parameter groups should get different weight decays. Once I plotted it, though, the correlation was so strong and so consistent that it seemed like a general lever, so I shifted toward just increasing weight decay more broadly.
I had an intuition from the beginning that 1024 was too small of a vocab. In addition to the reasons mentioned in the PR, I knew that historically language models with small vocabularies (such as character-level models) have underperformed models operating on larger word/subword units.
In general, I prefer simple and scalable solutions over methods with lots of special cases or extra moving parts, so I tried removing as many components as I could and then checked to make sure the performance didn't regress. In fact, it improved a little!
I did test a few other things that didn't seem to help, most notably long-context training and inducing sparsity during training.

I agree with a lot of your points @NoesisGenesis and I do think there is a lot left to explore here. I basically deleted things that I felt weren't needed and made sure there wasn't a regression, but that doesn't mean some version of QAT, hashing, etc. might still be useful. Looking forward to see your PR when it is ready!

I also looked a bit deeper into the RMS and compression ratio correlation to see what is really going on.

The compression ratio is almost entirely governed by the entropy of the quantized weights: I measured an R^2 of 0.996 between the entropy and the compression ratio. This makes sense because the weights have no extra structure for the compressor to make use of.
NN weights are approximately normally distributed and the entropy of a gaussian is proportional to $\log{\sigma}$. RMS is just a measurement of $\sigma$ assuming the mean is 0. So actually log(RMS) correlates even better with an R^2 of 0.995 because it gives a good estimate of the entropy.
For a narrow range, log looks approximately linear so RMS without the log still has the high correlation of R^2 = 0.99.

…base Based on PR openai#1218 (clarkkev) SP4096/MLP4x/WD0.085 stack. Added: Polar Express NS (4 steps), MuonEq-R, depth recurrence (layers 4,5), SLOT eval-time delta. Target: sub-1.09 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@clarkkev

…b 1.0900 (3-seed mean) 3-layer depth recurrence (layers 3,4,5) with WD-LR synergy: higher WD (0.095) compresses for all-int6 headroom, higher MLR (0.022) recovers quality. All 66 layers at int6 precision. 3-seed mean: 1.0900 BPP / 2.5077 nats (seeds 42, 0, 7) All seeds under 16MB with 36K+ margins. No TTT, no SLOT, no eval-time adaptation. Improves PR openai#1285 (1.0912) by 0.0013 BPP. Beats PR openai#1218 by 0.0079. Built on PR openai#1218 by @clarkkev.

…el spectral test-time training Base: clarkkev openai#1218 (1.0974 BPB, 4096 vocab, brotli, 34M params) Added: depth recurrence L4,5 (from openai#1285), MuonEq-R, WD=0.09 Novel: Spectral TTT — adapt singular values at eval time (8192 params) Target: ~1.085 BPB

…mult4-wd085 Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

Improve post-training quantization on PR openai#1218 base (SP4096, MLP 4x, WD 0.085). Three changes: sequential cross-layer error propagation, groupwise int6 scales (group_size=128), and Hessian-weighted scale selection. Expected -0.004 to -0.008 dBPB with zero training-time cost. Made-with: Cursor

Three improvements to the post-training quantization pipeline on PR openai#1218: 1. Sequential cross-layer GPTQ: quantize layers one at a time, injecting quantized weights back before collecting later layers' Hessians. This propagates quantization error forward so later Hessians are accurate. 2. Groupwise int6 scales (group_size=128): per-group fp16 scales instead of per-row, giving finer control over weight variance within rows. 3. Hessian-weighted scale selection: minimize H_diag-weighted error instead of MSE when selecting per-row clip percentiles. Zero training-time cost. Expected -0.004 to -0.008 dBPB. Made-with: Cursor

- Match heading, table, and section format from openai#1218/openai#1394 - Add Post-quant BPB column, bold Sliding BPB values - Add missing submission.json fields (hardware, bytes_total, bytes_code) - Remove Deltas and Reproducibility sections - Round val_bpb to 5 decimal places consistently Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@clarkkev

….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.

Frontier records (PR openai#1285 MuonEq-R + WD=0.090, PR openai#1218 WD=0.085) use AdamW-style decoupled weight decay on the Muon optimizer. Add the knob with default 0.0 (backward-compatible). Applied as p.data.mul_(1 - lr * wd) before the Muon matrix update. MuonEq-R (row-normalized) variant is not ported — it would need more line budget than we have on this branch. WD alone accounts for the majority of that record's improvement per the commit notes. dev/run_frontier.sh sets MUON_WEIGHT_DECAY=0.09 by default. Also inlined restore_low_dim_params_to_fp32 at its single call site to free lines for this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Record: 4096-Vocab + Larger Model + High WD + Simplifications — val_b…

4421a8d

…pb 1.09785 (3-seed mean)

clarkkev changed the title ~~4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)~~ Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean) Apr 1, 2026

bigbag mentioned this pull request Apr 1, 2026

Non Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean) #1217

Open

clarkkev mentioned this pull request Apr 1, 2026

Tokenizer Request: larger variants #1189

Open

dexhunter mentioned this pull request Apr 2, 2026

Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean) #1260

Open

5 tasks

dexhunter mentioned this pull request Apr 3, 2026

Record: MuonEq-R + Depth Recurrence + N61 Mixed GPTQ — val_bpb 1.0924 (3-seed mean) #1279

Open

5 tasks

dexhunter mentioned this pull request Apr 3, 2026

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean) #1285

Merged

4 tasks

This was referenced Apr 3, 2026

Record: Vocab4096 + MLP4.0x + WD0.085 - val_bpb 1.1048 (3-seed mean) #1287

Open

Record: Vocab4096 + MLP4.0x + SLOT - val_bpb 1.0925 (3-seed mean) #1291

Open

aryanbhosale mentioned this pull request Apr 3, 2026

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean) #1296

Open

aryanbhosale mentioned this pull request Apr 3, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

abaybektursun mentioned this pull request Apr 3, 2026

Record: Fused MLP (Triton+CUTLASS EVT) + Fast Causal N-Gram Tilt & Subword Certainty (3-seed mean) #1105

Closed

aryanbhosale mentioned this pull request Apr 4, 2026

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT — val_bpb 1.0896 (3-seed mean) #1326

Open

dexhunter mentioned this pull request Apr 4, 2026

Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6 — val_bpb 1.0900 (3-seed mean) #1331

Open

Omrigotlieb mentioned this pull request Apr 4, 2026

Record: SP4096 + Polar Express NS + MuonEq-R + WD=0.090 — 1.0959 BPB (3-seed mean) #1332

Closed

5 tasks

This was referenced Apr 4, 2026

Record: SP4096 + Depth Recurrence + Parallel Residuals + Causal SLOT-16 — val_bpb 1.0766 (3-seed mean) #1333

Open

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean) #1334

Merged

This was referenced Apr 4, 2026

Record: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean) #1338

Closed

Record: SP2048 + 3-Layer Recurrence + SWA + BigramHash + Legal TTT — val_bpb 1.0955 (3-seed mean) #1339

Open

Omrigotlieb mentioned this pull request Apr 4, 2026

Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344

Open

4 tasks

X-Abhishek-X mentioned this pull request Apr 5, 2026

Cautious Muon + SP4096 + Depth Recurrence — val_bpb 1.1604 (non-record) #1381

Open

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 5, 2026

Base: openai#1218 SP4096 (Round 7)

0dc222e

Rome-1 mentioned this pull request Apr 5, 2026

Non-record: GPTQ-lite Scale Clamp Fix + 6-bit Packing + Depth Recurrence on Stack B #1389

Open

clarkkev mentioned this pull request Apr 5, 2026

Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5 seed mean) #1394

Merged

abaybektursun mentioned this pull request Apr 6, 2026

Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08309 (5-seed mean) #1420

Open

dentity007 mentioned this pull request Apr 6, 2026

Non-record: PROTEUS Feature Ablation - Parallel Residuals + Mixed INT5/INT6 + TTT on DGX Spark GB10 #1425

Open

This was referenced Apr 7, 2026

13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence #1426

Closed

13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence #1429

Open

mtybadger mentioned this pull request Apr 7, 2026

[Non-record] Codebooks! - val_bpb 1.2067 (3-seed mean) #1433

Open

mradassaad mentioned this pull request Apr 7, 2026

Non-record: Mamba-3 Hybrid + GPTQ + Late QAT + MuonEq-R — val_bpb 1.1526 (best SSM) #1355

Open

4 tasks

cocohearts merged commit d96169d into openai:main Apr 9, 2026

cocohearts mentioned this pull request Apr 9, 2026

Update README leaderboard for April records #1511

Merged

SH-Tan pushed a commit to SH-Tan/parameter-golf that referenced this pull request Apr 9, 2026

Merge pull request openai#1218 from clarkkev/submission/vocab4096-mlp…

5db4cc0

…mult4-wd085 Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)

MatoTeziTanka mentioned this pull request Apr 10, 2026

PR Acceptance Order and Competition Rules - A discussion - I want to know what you think #1522

Open

zoharb157 mentioned this pull request Apr 16, 2026

WIP: Sequential GPTQ with Groupwise Int6 — improved post-training quantization on SP4096 base #1664

Open

7 tasks

camden-git mentioned this pull request Apr 17, 2026

Non-record: wip Random-Basis MLPs + LoRA #1684

Open

dttdrv mentioned this pull request Apr 28, 2026

{RECORD} CaseOps pre-quant TTT record (1.0354 BPB) #1911

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)#1218

Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)#1218
cocohearts merged 1 commit intoopenai:mainfrom
clarkkev:submission/vocab4096-mlpmult4-wd085

clarkkev commented Apr 1, 2026 •

edited

Loading

Uh oh!

mikeapedia commented Apr 1, 2026

Uh oh!

abaybektursun commented Apr 2, 2026 •

edited

Loading

Uh oh!

NoesisGenesis commented Apr 2, 2026

Uh oh!

mtybadger commented Apr 2, 2026

Uh oh!

clarkkev commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

clarkkev commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Record: 4096-Vocab + Larger Model + High WD + Simplifications — val_bpb 1.09785

Overview

Fixes

Simplifications

Additions

Uh oh!

mikeapedia commented Apr 1, 2026

Uh oh!

abaybektursun commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NoesisGenesis commented Apr 2, 2026

Uh oh!

mtybadger commented Apr 2, 2026

Uh oh!

clarkkev commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

clarkkev commented Apr 1, 2026 •

edited

Loading

abaybektursun commented Apr 2, 2026 •

edited

Loading