
Non-Record: Crawler 3f+2cx2 d=832 + Mixed Int5 GPTQ + TTT — val_bpb 1.0903 (1-hour cluster) #1817

Open
Tonyy1977 wants to merge 1 commit into openai:main from Tonyy1977:crawler-d832-1hr-mixedint5

Conversation

Tonyy1977 commented Apr 25, 2026

Non-Record: Crawler Transformer 3f+2cx2 d=832 — Mixed Int5 GPTQ + Post-Quant TTT — val_bpb 1.0903

val_bpb: 1.0903 | 15.96 MB | 1x RTX 6000 Ada 48GB, 30 hours (1-hour 8xH100 cluster equivalent)

Builds on PR #1579 (10-min track, val_bpb 1.1372). 6x more training compute → -0.047 BPB.

Result Summary

| Stage | val_bpb |
| --- | --- |
| Pre-quant SWA | 1.0684 |
| int8 + SDClip roundtrip | 1.1381 |
| GPTQ mixed-int (int5 flat-attn / int6 rest) roundtrip | 1.1264 |
| Post-quant TTT (freeze=1) on GPTQ artifact | 1.0903 |
  • Steps: 30,374 (stopped by the 30-hour wallclock cap)
  • Artifact: 15,867,420 bytes (15.87 MB), zero pruning
  • Total: 15,959,106 bytes (15.96 MB, under the 16 MB budget)
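
As context for the last stage, here is a minimal sketch of a test-time-training (TTT) evaluation loop: each validation chunk is scored before the model updates on it, so the reported bpb stays predictive. It assumes a Hugging Face-style causal-LM interface, and the learning rate, chunking, and the reading of freeze=1 as "freeze the embedding table" are assumptions rather than this PR's actual settings.

```python
import math
import torch

def ttt_eval_bpb(model, val_chunks, lr=1e-4, freeze_embeddings=True):
    """Sketch of post-quant TTT: score each chunk, then take one training step on it.
    `val_chunks` yields (token_ids, n_bytes) pairs, token_ids shaped (1, T).
    Freezing the embeddings stands in for `freeze=1` and is an assumption."""
    if freeze_embeddings:
        for p in model.get_input_embeddings().parameters():
            p.requires_grad_(False)
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)

    total_nats, total_bytes = 0.0, 0
    for token_ids, n_bytes in val_chunks:
        out = model(input_ids=token_ids, labels=token_ids)       # HF-style causal LM
        total_nats += out.loss.item() * (token_ids.numel() - 1)  # mean loss -> summed nats
        total_bytes += n_bytes

        opt.zero_grad()
        out.loss.backward()   # update only after this chunk has been scored
        opt.step()

    return total_nats / (math.log(2) * total_bytes)              # bits per byte
```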

Comparison to PR #1579 (10-min track)

| Config | Steps | Pre-quant BPB | TTT BPB | Hardware |
| --- | --- | --- | --- | --- |
| d=736 int6 (PR #1579) | 6,042 | 1.1232 | 1.1372 | 10-min cluster |
| d=832 int5-flat (this PR) | 30,374 | 1.0684 | 1.0903 | 1-hour cluster |

Architecture: Crawler Transformer

3 flat blocks + 2 crawler blocks × 2 loops = 7 layers of effective depth. dim=832, 47.4M params. SP8192 tokenizer, with BigramHash, SmearGate, VE, and XSA on all 7 layers.
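
A minimal sketch of the flat + looped-crawler layout, with block internals reduced to a placeholder; the class names and the weight-sharing reading of "2 crawler blocks × 2 loops" are my assumptions, not code from this PR:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Placeholder block (the real blocks also carry attention, BigramHash, SmearGate, etc.)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class CrawlerStack(nn.Module):
    """3 flat blocks, then 2 crawler blocks re-applied for 2 loops:
    3 + 2*2 = 7 effective layers of depth from only 5 blocks' worth of parameters."""
    def __init__(self, dim=832, n_flat=3, n_crawler=2, n_loops=2):
        super().__init__()
        self.flat = nn.ModuleList([Block(dim) for _ in range(n_flat)])
        self.crawler = nn.ModuleList([Block(dim) for _ in range(n_crawler)])
        self.n_loops = n_loops

    def forward(self, x):
        for blk in self.flat:
            x = blk(x)
        for _ in range(self.n_loops):        # weight sharing: same crawler blocks each loop
            for blk in self.crawler:
                x = blk(x)
        return x

x = torch.randn(2, 16, 832)                  # (batch, seq, dim)
print(CrawlerStack()(x).shape)               # torch.Size([2, 16, 832])
```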

Quantization (Mixed Int5/Int6)

  • int5 for flat-block attention only (12 matrices)
  • int6 for everything else (22 matrices: flat MLP + all crawler)
  • int8 for embeddings
  • SDClip + GPTQ + Brotli, zero pruning (fits naturally at 15.96 MB)
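
As an illustration of the bit-width split (not a reproduction of GPTQ or SDClip, which additionally correct for quantization error as they go), a minimal sketch of per-matrix bit assignment with a plain symmetric quantize/dequantize roundtrip; the parameter-name patterns are hypothetical:

```python
import torch

def bits_for(name: str) -> int:
    """Assign bit widths per matrix following the scheme above.
    The name patterns ('embed', 'flat', 'attn') are illustrative, not this repo's."""
    if "embed" in name:
        return 8
    if "flat" in name and "attn" in name:
        return 5                      # flat-block attention matrices only
    return 6                          # flat MLP + all crawler matrices

def quantize_roundtrip(weight: torch.Tensor, bits: int) -> torch.Tensor:
    """Plain symmetric per-tensor quantize/dequantize used for illustration."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp_min(1e-12) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q * scale

def roundtrip_model(model: torch.nn.Module) -> None:
    """Roundtrip every 2-D weight in place at its assigned precision."""
    for name, p in model.named_parameters():
        if p.ndim == 2:
            p.data.copy_(quantize_roundtrip(p.data, bits_for(name)))
```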

Key Findings

  1. Mixed-int beats pruning: Standard int6 needs 13.5% pruning at d=832 (roundtrip 1.1664), while mixed int5/int6 fits naturally with no pruning (roundtrip 1.1264); the size trade-off is sketched after this list.
  2. Int5 attention is robust, int5 MLP is not: Quantizing only flat attention to int5 saves space without significant quality loss.
  3. Pre-quant matters most: 6x more training compute improves the pre-quant SWA checkpoint (1.1232 → 1.0684), and that gain carries through quantization and TTT.
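
To make the pruning-vs-mixed-int trade-off in finding 1 concrete, here is a back-of-the-envelope helper: the raw (pre-Brotli) payload under a bit assignment, and the pruning fraction a scheme would need to fit the budget. Compression, scales/zero-points, and sparse-index overhead are ignored, and a decimal 16 MB budget is assumed, so this only illustrates the direction of the trade-off rather than the repo's actual accounting.

```python
BUDGET_BYTES = 16_000_000            # assumed decimal 16 MB budget

def payload_bytes(matrices) -> float:
    """matrices: iterable of (n_params, bits) pairs; raw packed size before Brotli."""
    return sum(n * b for n, b in matrices) / 8

def pruning_needed(matrices, budget: int = BUDGET_BYTES) -> float:
    """Fraction of weights that would have to be dropped for the raw payload to fit;
    0.0 means the scheme fits with no pruning (the mixed int5/int6 case here)."""
    size = payload_bytes(matrices)
    return max(0.0, 1.0 - budget / size)
```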

Credits

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>