# Parameter Golf — Approach Notes

## Strategy Overview

Maximize language model quality within a 16 MB artifact constraint and 10 minutes of training on 8×H100s. Seven pillars, informed by research in model compression, efficient architectures, and training optimization.

---

## 1. Depth Recurrence (Layer Sharing)

Instead of unique parameters per layer, reuse a small set of transformer blocks recursively. A 4-block recursive model with 8 passes achieves the effective depth of a 32-layer network while only storing 4 layers of parameters.

Research shows recursive transformers achieve comparable loss to standard architectures with 3-4× fewer parameters. The model learns to refine representations through repeated application of the same weights — a form of iterative refinement that naturally suits the extreme parameter constraint.

**Target:** Replace 12 unique layers with 4 recursive blocks × 3 passes = 12 effective layers at 1/3 the parameter cost.
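A minimal PyTorch sketch of that target (the `block_factory` argument and block internals are placeholders, not code from this repo):

```python
import torch.nn as nn

class RecursiveDepth(nn.Module):
    """Store num_blocks transformer blocks; apply them num_passes times
    for num_blocks * num_passes effective layers of depth."""
    def __init__(self, block_factory, num_blocks=4, num_passes=3):
        super().__init__()
        self.blocks = nn.ModuleList(block_factory() for _ in range(num_blocks))
        self.num_passes = num_passes

    def forward(self, x):
        # 4 blocks x 3 passes = 12 effective layers, stored as 4 layers of weights.
        for _ in range(self.num_passes):
            for block in self.blocks:
                x = block(x)
        return x
```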

## 2. Factorized Embeddings

The embedding matrix is often the largest single component. Instead of a full V×H matrix, decompose it into V×E and E×H where E << H. This technique (from ALBERT) can reduce embedding parameters by 80%+ while maintaining representation quality.

Combined with tied input/output embeddings, this eliminates the output projection layer entirely — the same factorized embedding serves both input and output.

**Math:** At vocab 1024, hidden 512: Full = 524K params. Factorized (E=128): 131K + 65K = 196K params. Savings: 63%.
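A sketch of the factorized, tied embedding at those dimensions (class and method names are illustrative):

```python
import torch.nn as nn

class FactorizedTiedEmbedding(nn.Module):
    """ALBERT-style factorization V x E -> E x H, with the same two
    factors reused (tied) to produce output logits."""
    def __init__(self, vocab_size=1024, embed_dim=128, hidden_dim=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)       # V x E = 131K
        self.up_proj = nn.Linear(embed_dim, hidden_dim, bias=False)  # E x H = 65K

    def embed(self, token_ids):                   # (B, T) -> (B, T, H)
        return self.up_proj(self.token_embed(token_ids))

    def logits(self, hidden):                     # (B, T, H) -> (B, T, V)
        # Tied output: run both factors in reverse; no separate unembedding.
        down = hidden @ self.up_proj.weight       # weight is (H, E) -> (B, T, E)
        return down @ self.token_embed.weight.T   # (B, T, V)
```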

## 3. Quantization-Aware Training (QAT)

Train the model knowing it will be quantized. The model learns weight distributions that survive low-precision conversion. At 2-bit precision, 16 MB holds ~64M raw parameters; the working target here is ~32M, reserving roughly half the budget for quantization scales, metadata, and safety margin.

Key insight: post-training quantization at 2-bit loses 15-20% quality. QAT at 2-bit loses only ~4%. The difference is massive at this scale.

**Approach:** Train at FP16/BF16, apply QAT during training with straight-through estimators, export at 2-bit for the final artifact.
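A minimal sketch of the straight-through estimator at 2-bit, with per-tensor symmetric scaling chosen for brevity (a real run would likely use per-channel or per-group scales):

```python
import torch

class FakeQuant2Bit(torch.autograd.Function):
    """Straight-through estimator: forward rounds weights to 2-bit levels,
    backward passes the gradient through to the full-precision weights."""
    @staticmethod
    def forward(ctx, w, num_bits=2):
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -2..1
        scale = w.abs().max().clamp(min=1e-8) / abs(qmin)
        return torch.clamp(torch.round(w / scale), qmin, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # identity gradient: the "straight-through" part

# Usage inside a layer's forward pass: q_w = FakeQuant2Bit.apply(self.weight)
```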

## 4. Knowledge Distillation

Use a larger pretrained model as a teacher during training. The 8×H100 budget can run a 7B teacher alongside a 32M student. The student learns from soft probability distributions rather than hard labels, capturing more knowledge per training step.

Distillation is especially powerful for small models — the teacher provides a richer gradient signal than raw cross-entropy on token predictions alone.
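A sketch of the standard soft-target loss (Hinton-style). One assumption to flag: it requires the teacher and student to share a tokenizer, which an off-the-shelf 7B teacher will not with a custom 1024-token vocabulary, so the teacher would need matching tokenization in practice:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend KL divergence against the temperature-softened teacher
    distribution with ordinary cross-entropy on the data labels."""
    V = student_logits.size(-1)
    soft = F.kl_div(
        F.log_softmax(student_logits.view(-1, V) / T, dim=-1),
        F.log_softmax(teacher_logits.view(-1, V) / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)  # rescale so gradient magnitude matches the hard-label term
    hard = F.cross_entropy(student_logits.view(-1, V), targets.view(-1))
    return alpha * soft + (1 - alpha) * hard
```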

## 5. Training Maximization

Every second of the 10-minute budget matters:

- **Sequence packing:** Multiple short examples per input sequence, no wasted padding tokens (see the sketch after this list)
- **Curriculum ordering:** Train on FineWeb examples ordered by difficulty (shorter/simpler first, longer/complex later) for faster initial convergence
- **Cosine LR schedule:** High initial learning rate with cosine decay over the 10-minute window
- **Gradient accumulation:** Effective batch size tuned for optimal loss curves on H100s
- **Mixed precision training:** BF16 compute for speed, QAT checkpoints for artifact size
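The packing bullet, as a minimal sketch (greedy concatenation with EOS separators; names are illustrative):

```python
def pack_sequences(token_streams, seq_len, eos_id):
    """Concatenate tokenized documents separated by EOS and slice
    fixed-length training rows, so no position is padding."""
    buffer, rows = [], []
    for doc in token_streams:          # each doc is a list of token ids
        buffer.extend(doc + [eos_id])
        while len(buffer) >= seq_len:
            rows.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return rows
```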

## 6. Tokenizer Optimization

Vocabulary size directly impacts the embedding parameter count. The baseline uses a 1024-token vocabulary. Exploring:

- Smaller BPE vocabularies (512, 256): fewer embedding parameters but worse compression
- Larger vocabularies: more embedding parameters but fewer tokens per byte; since the evaluation metric is bits per byte, better compression can offset the parameter cost
- A custom tokenizer trained specifically on the FineWeb distribution (see the sweep sketch below)
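A possible shape for the sweep using sentencepiece (corpus path and options are placeholders):

```python
import sentencepiece as spm

# Train a FineWeb-specific BPE vocab at several sizes and measure bytes/token.
for vocab_size in (256, 512, 1024, 2048):
    spm.SentencePieceTrainer.train(
        input="fineweb_sample.txt",            # placeholder corpus path
        model_prefix=f"fineweb_bpe_{vocab_size}",
        vocab_size=vocab_size,
        model_type="bpe",
        byte_fallback=True,                    # keep every byte representable
    )
    sp = spm.SentencePieceProcessor(model_file=f"fineweb_bpe_{vocab_size}.model")
    text = open("fineweb_sample.txt", encoding="utf-8").read()
    ratio = len(text.encode("utf-8")) / len(sp.encode(text))
    print(vocab_size, "bytes/token:", round(ratio, 3))
```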

## 7. Alternative Architectures

Beyond standard transformers:

- **State-space models (Mamba-style):** Linear scaling with sequence length, potentially more parameter-efficient for the same quality
- **Mixture of Experts at micro-scale:** Multiple tiny FFN experts with a router — only a subset active per token, more capacity per parameter (a top-1 router sketch follows this list)
- **Depth-adaptive inference:** Early exit for easy tokens, full depth for hard ones — maximizes quality where it matters most
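A hypothetical sketch of the micro-MoE idea with a top-1 router (dimensions are illustrative; load-balancing losses are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MicroMoE(nn.Module):
    """Tiny FFN experts with a top-1 router: each token pays the compute
    of one expert while the layer holds num_experts worth of capacity."""
    def __init__(self, hidden_dim=512, ffn_dim=128, num_experts=4):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(),
                          nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, hidden)
        gate = F.softmax(self.router(x), dim=-1)
        top = gate.argmax(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out
```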

---

## The Math

Raw capacity is artifact bits divided by bitwidth; the working budget below reserves roughly half for quantization scales, metadata, and margin.

| Bitwidth | Raw capacity in 16 MB | Working budget | Architecture |
|----------|----------------------|----------------|--------------|
| 2-bit | ~64M | ~32M | Recursive transformer, factorized embeddings |
| 3-bit | ~43M | ~21M | Standard transformer, tied embeddings |
| 4-bit | ~32M | ~16M | Compact transformer |
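For reference, the arithmetic behind the table as a quick check (not competition code):

```python
ARTIFACT_BYTES = 16 * 10**6          # 16 MB artifact limit
for bits in (2, 3, 4):
    raw = ARTIFACT_BYTES * 8 / bits  # parameters at full raw capacity
    print(f"{bits}-bit: {raw/1e6:.0f}M raw, ~{raw/2e6:.0f}M at 50% headroom")
# 2-bit: 64M raw, ~32M   3-bit: 43M raw, ~21M   4-bit: 32M raw, ~16M
```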

## Experiment Plan

- [ ] Run baseline (9-layer, 512-dim, 1024-vocab, tied embeddings) — establish score to beat (1.2244)
- [ ] Implement depth recurrence (4 recursive blocks × 3 passes)
- [ ] Add factorized embeddings (V×128 + 128×H)
- [ ] Test 2-bit QAT during training
- [ ] Knowledge distillation with 7B teacher
- [ ] Curriculum data ordering on FineWeb
- [ ] Tokenizer vocabulary sweep (256, 512, 1024, 2048)
- [ ] Mamba/SSM architecture comparison
- [ ] Combine best techniques into final submission

## Background

Five production fine-tuned models (7B-72B) deployed via QLoRA/GGUF/NVFP4 quantization on NVIDIA DGX hardware; a 130K-chunk expert knowledge base built for AI/ML research consultation; deep hands-on experience with compression-quality tradeoffs across bitwidths.

## Status

Credits requested. Local experimentation with MLX baseline in progress.
---
# Non-record: Universal Transformer + Adaptive Density

**val_bpb: 3.2483 (legal, no TTT)** | DGX Spark GB10, 200 steps sp1024, no torch.compile

## Update (April 11, 2026)

An earlier version of this PR reported **val_bpb 1.4390** using multi-epoch TTT on `val_tokens` without score-first discipline. @MatoTeziTanka correctly flagged this as the same illegal pattern that closed PR #1376 and the rest of the Pre-Quant TTT cluster.

The flagged code path (`ttt_adapt(args, base_model, device, val_tokens, ...)` at line 1194) was part of the `train_gpt_kitchen_sink.py` base script and defaulted to enabled. This submission has been updated to disable TTT by default and report honest BPB from a clean run on the DGX Spark.

Thanks to @MatoTeziTanka for the careful review.

## What This PR Is

Implements OpenAI's requested "Universal transformer" research direction from the README. Single shared transformer block looped N times with per-iteration learnable parameters (attn_scale, mlp_scale, resid_mix, iteration_embed). 50% sparse-to-dense curriculum during training.
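A minimal reading of that description in PyTorch (the actual `train_gpt.py` may differ in block structure and masking; the sparse-to-dense curriculum is not shown):

```python
import torch
import torch.nn as nn

class SharedBlockUT(nn.Module):
    """One block's weights looped num_iters times; the only unshared
    parameters are per-iteration attn_scale / mlp_scale / resid_mix
    plus an iteration embedding."""
    def __init__(self, dim=384, heads=6, num_iters=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.num_iters = num_iters
        # Per-iteration learnables (the unshared part):
        self.attn_scale = nn.Parameter(torch.ones(num_iters))
        self.mlp_scale = nn.Parameter(torch.ones(num_iters))
        self.resid_mix = nn.Parameter(torch.zeros(num_iters))  # sigmoid -> 0.5
        self.iteration_embed = nn.Embedding(num_iters, dim)

    def forward(self, x):                      # x: (B, T, dim)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        for t in range(self.num_iters):
            h = x + self.iteration_embed.weight[t]   # tell the block where it is
            n = self.norm1(h)
            a, _ = self.attn(n, n, n, attn_mask=causal, need_weights=False)
            h = h + self.attn_scale[t] * a
            h = h + self.mlp_scale[t] * self.mlp(self.norm2(h))
            mix = torch.sigmoid(self.resid_mix[t])
            x = mix * x + (1 - mix) * h              # learned residual mix
        return x
```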

This is a non-record track research submission. It exists to answer the question: does weight-shared depth recurrence work at the parameter golf scale? The answer is yes, but it plateaus fast and is dominated by mini depth recurrence (repeat 2-3 specific layers) as used in PR #1204 and PR #1334.

## Legal Results (No TTT)

All runs on NVIDIA DGX Spark GB10 (single GPU, 128GB unified memory), sp1024, 200 training steps, SEED=42, no torch.compile.

| Run | Config | Params | val_bpb | ms/step |
|-----|--------|--------|---------|---------|
| UT-1 | 1 block x 6 iters | 4,546,568 | **3.2483** | 707 |
| UT-2 | 1 block x 24 iters | 4,601,864 | 3.2490 | 2,734 |

## Finding

Increasing iterations from 6 to 24 (4x compute per step) changes BPB by less than 0.001 (3.2483 vs 3.2490). Full weight sharing hits a ceiling almost immediately at this model size. The compute budget is better spent on:

1. **Mini depth recurrence** (repeat 2-3 specific layers) as in PR #1204, which avoids the weight-sharing penalty on the non-repeated layers
2. **More training steps** rather than more iterations per step
3. **Wider models** (per MEGA-2 ablation: d=640 beats 11 layers at d=512)

The 2.87 MB artifact size means there is substantial headroom under the 16 MB limit. A hybrid approach combining partial weight sharing with a larger base model would likely beat the pure-shared approach tested here.

## Reproduction

```bash
pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
VOCAB_SIZE=1024 NUM_ITERS=6 TORCH_COMPILE_DISABLE=1 ITERATIONS=200 TTT_ENABLED=0 \
python3 records/track_non_record_16mb/2026-03-31_UniversalTransformer_AdaptiveDensity/train_gpt.py
```

## Full Ablation Data

Raw logs and CSV for all 22 runs across 7 architectures (Universal Transformer, Text Diffusion, Random Adapters, JEPA, Mamba SSM, H-Net, Megakernels):

https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

## Hardware Notes

DGX Spark GB10 is approximately 6x slower per step than 8xH100. Absolute BPB values are much higher than competition runs due to the short 200-step training budget. The relative ordering between configurations is what matters here: more iterations does not help, and depth recurrence plateaus quickly.

## Related

- PR #1191 H-Net Dynamic Chunking (non-record, same cluster)
- PR #1192 Fused Triton Megakernels (non-record)
- PR #1194 Text Diffusion (non-record)
- PR #1195 Random Adapter Maps (non-record)
- PR #1196 LLM-JEPA (non-record)
- PR #1197 Mamba SSM Hybrid (non-record)
- PR #1204 msisovic Mini Depth Recurrence (record-track implementation of this idea)
- PR #1334 aryanbhosale Track A with depth recurrence + parallel residuals (1.0897 BPB)
---
{"track":"non_record","title":"Universal Transformer + Adaptive Density","val_bpb":3.2483,"hardware":"1xDGX-Spark-GB10","author":"NathanMaine","notes":"Updated April 11: TTT-on-val disabled per MatoTeziTanka review. Honest no-TTT result from DGX Spark ablation (200 steps, sp1024)."}