
Non-Record: Crawler 3f+2cx2 d=832 + Mixed Int5 GPTQ + TTT — val_bpb 1.0903 (1-hour cluster) #1817

Open
Tonyy1977 wants to merge 1 commit into openai:main from Tonyy1977:crawler-d832-1hr-mixedint5

Conversation

Tonyy1977 commented Apr 25, 2026

Non-Record: Crawler Transformer 3f+2cx2 d=832 — Mixed Int5 GPTQ + Post-Quant TTT — val_bpb 1.0903

val_bpb: 1.0903 | 15.96 MB | 1x RTX 6000 Ada 48GB, 30 hours (1-hour 8xH100 cluster equivalent)

Builds on PR #1579 (10-min track, val_bpb 1.1372). 6x more training compute → -0.047 BPB.

Result Summary

| Stage | val_bpb |
| --- | --- |
| Pre-quant SWA | 1.0684 |
| int8 + SDClip roundtrip | 1.1381 |
| GPTQ mixed-int (int5 flat-attn / int6 rest) roundtrip | 1.1264 |
| Post-quant TTT (freeze=1) on GPTQ artifact | 1.0903 |
  • Steps: 30,374 (stopped by the 30-hour wallclock cap)
  • Artifact: 15,867,420 bytes (15.87 MB), zero pruning
  • Total: 15,959,106 bytes (15.96 MB, under the 16 MB budget)
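
As context for the last stage, here is a minimal sketch of a test-time-training (TTT) evaluation loop: each validation chunk is scored before the model updates on it, so the reported bpb stays predictive. It assumes a Hugging Face-style causal-LM interface, and the learning rate, chunking, and the reading of freeze=1 as "freeze the embedding table" are assumptions rather than this PR's actual settings.

```python
import math
import torch

def ttt_eval_bpb(model, val_chunks, lr=1e-4, freeze_embeddings=True):
    """Sketch of post-quant TTT: score each chunk, then take one training step on it.
    `val_chunks` yields (token_ids, n_bytes) pairs, token_ids shaped (1, T).
    Freezing the embeddings stands in for `freeze=1` and is an assumption."""
    if freeze_embeddings:
        for p in model.get_input_embeddings().parameters():
            p.requires_grad_(False)
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)

    total_nats, total_bytes = 0.0, 0
    for token_ids, n_bytes in val_chunks:
        out = model(input_ids=token_ids, labels=token_ids)       # HF-style causal LM
        total_nats += out.loss.item() * (token_ids.numel() - 1)  # mean loss -> summed nats
        total_bytes += n_bytes

        opt.zero_grad()
        out.loss.backward()   # update only after this chunk has been scored
        opt.step()

    return total_nats / (math.log(2) * total_bytes)              # bits per byte
```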

Comparison to PR #1579 (10-min track)

| Config | Steps | Pre-quant BPB | TTT BPB | Hardware |
| --- | --- | --- | --- | --- |
| d=736 int6 (PR #1579) | 6,042 | 1.1232 | 1.1372 | 10-min cluster |
| d=832 int5-flat (this PR) | 30,374 | 1.0684 | 1.0903 | 1-hour cluster |

Architecture: Crawler Transformer

3 flat blocks + 2 crawler blocks × 2 loops = 7 layers of effective depth. dim=832, 47.4M params. SP8192 tokenizer, with BigramHash, SmearGate, VE, and XSA on all 7 layers.
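
A minimal sketch of the flat + looped-crawler layout, with block internals reduced to a placeholder; the class names and the weight-sharing reading of "2 crawler blocks × 2 loops" are my assumptions, not code from this PR:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Placeholder block (the real blocks also carry attention, BigramHash, SmearGate, etc.)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class CrawlerStack(nn.Module):
    """3 flat blocks, then 2 crawler blocks re-applied for 2 loops:
    3 + 2*2 = 7 effective layers of depth from only 5 blocks' worth of parameters."""
    def __init__(self, dim=832, n_flat=3, n_crawler=2, n_loops=2):
        super().__init__()
        self.flat = nn.ModuleList([Block(dim) for _ in range(n_flat)])
        self.crawler = nn.ModuleList([Block(dim) for _ in range(n_crawler)])
        self.n_loops = n_loops

    def forward(self, x):
        for blk in self.flat:
            x = blk(x)
        for _ in range(self.n_loops):        # weight sharing: same crawler blocks each loop
            for blk in self.crawler:
                x = blk(x)
        return x

x = torch.randn(2, 16, 832)                  # (batch, seq, dim)
print(CrawlerStack()(x).shape)               # torch.Size([2, 16, 832])
```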

Quantization (Mixed Int5/Int6)

  • int5 for flat-block attention only (12 matrices)
  • int6 for everything else (22 matrices: flat MLP + all crawler)
  • int8 for embeddings
  • SDClip + GPTQ + Brotli, zero pruning (fits naturally at 15.96 MB)
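
As an illustration of the bit-width split (not a reproduction of GPTQ or SDClip, which additionally correct for quantization error as they go), a minimal sketch of per-matrix bit assignment with a plain symmetric quantize/dequantize roundtrip; the parameter-name patterns are hypothetical:

```python
import torch

def bits_for(name: str) -> int:
    """Assign bit widths per matrix following the scheme above.
    The name patterns ('embed', 'flat', 'attn') are illustrative, not this repo's."""
    if "embed" in name:
        return 8
    if "flat" in name and "attn" in name:
        return 5                      # flat-block attention matrices only
    return 6                          # flat MLP + all crawler matrices

def quantize_roundtrip(weight: torch.Tensor, bits: int) -> torch.Tensor:
    """Plain symmetric per-tensor quantize/dequantize used for illustration."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp_min(1e-12) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q * scale

def roundtrip_model(model: torch.nn.Module) -> None:
    """Roundtrip every 2-D weight in place at its assigned precision."""
    for name, p in model.named_parameters():
        if p.ndim == 2:
            p.data.copy_(quantize_roundtrip(p.data, bits_for(name)))
```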

Key Findings

  1. Mixed-int beats pruning: Standard int6 needs 13.5% pruning at d=832 (roundtrip 1.1664), while mixed int5/int6 fits naturally with no pruning (roundtrip 1.1264); the size trade-off is sketched after this list.
  2. Int5 attention is robust, int5 MLP is not: Quantizing only flat attention to int5 saves space without significant quality loss.
  3. Pre-quant matters most: 6x more training compute improves the pre-quant SWA checkpoint (1.1232 → 1.0684), and that gain carries through quantization and TTT.
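
To make the pruning-vs-mixed-int trade-off in finding 1 concrete, here is a back-of-the-envelope helper: the raw (pre-Brotli) payload under a bit assignment, and the pruning fraction a scheme would need to fit the budget. Compression, scales/zero-points, and sparse-index overhead are ignored, and a decimal 16 MB budget is assumed, so this only illustrates the direction of the trade-off rather than the repo's actual accounting.

```python
BUDGET_BYTES = 16_000_000            # assumed decimal 16 MB budget

def payload_bytes(matrices) -> float:
    """matrices: iterable of (n_params, bits) pairs; raw packed size before Brotli."""
    return sum(n * b for n, b in matrices) / 8

def pruning_needed(matrices, budget: int = BUDGET_BYTES) -> float:
    """Fraction of weights that would have to be dropped for the raw payload to fit;
    0.0 means the scheme fits with no pruning (the mixed int5/int6 case here)."""
    size = payload_bytes(matrices)
    return max(0.0, 1.0 - budget / size)
```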

Credits

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>