# Parameter Golf — Approach Notes

## Strategy Overview

Maximize language model quality within a 16 MB artifact constraint and 10 minutes of training on 8×H100s. Seven pillars, informed by research in model compression, efficient architectures, and training optimization.

---

## 1. Depth Recurrence (Layer Sharing)

Instead of unique parameters per layer, reuse a small set of transformer blocks recursively. A 4-block recursive model with 8 passes achieves the effective depth of a 32-layer network while only storing 4 layers of parameters.

Research shows recursive transformers achieve comparable loss to standard architectures with 3-4× fewer parameters. The model learns to refine representations through repeated application of the same weights — a form of iterative refinement that naturally suits the extreme parameter constraint.

**Target:** Replace 12 unique layers with 4 recursive blocks × 3 passes = 12 effective layers at 1/3 the parameter cost.
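A minimal PyTorch sketch of that target (the `block_factory` argument and block internals are placeholders, not code from this repo):

```python
import torch.nn as nn

class RecursiveDepth(nn.Module):
    """Store num_blocks transformer blocks; apply them num_passes times
    for num_blocks * num_passes effective layers of depth."""
    def __init__(self, block_factory, num_blocks=4, num_passes=3):
        super().__init__()
        self.blocks = nn.ModuleList(block_factory() for _ in range(num_blocks))
        self.num_passes = num_passes

    def forward(self, x):
        # 4 blocks x 3 passes = 12 effective layers, stored as 4 layers of weights.
        for _ in range(self.num_passes):
            for block in self.blocks:
                x = block(x)
        return x
```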

## 2. Factorized Embeddings

The embedding matrix is often the largest single component. Instead of a full V×H matrix, decompose it into V×E and E×H where E << H. This technique (from ALBERT) can reduce embedding parameters by 80%+ while maintaining representation quality.

Combined with tied input/output embeddings, this eliminates the output projection layer entirely — the same factorized embedding serves both input and output.

**Math:** At vocab 1024, hidden 512: Full = 524K params. Factorized (E=128): 131K + 65K = 196K params. Savings: 63%.
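A sketch of the factorized, tied embedding at those dimensions (class and method names are illustrative):

```python
import torch.nn as nn

class FactorizedTiedEmbedding(nn.Module):
    """ALBERT-style factorization V x E -> E x H, with the same two
    factors reused (tied) to produce output logits."""
    def __init__(self, vocab_size=1024, embed_dim=128, hidden_dim=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)       # V x E = 131K
        self.up_proj = nn.Linear(embed_dim, hidden_dim, bias=False)  # E x H = 65K

    def embed(self, token_ids):                   # (B, T) -> (B, T, H)
        return self.up_proj(self.token_embed(token_ids))

    def logits(self, hidden):                     # (B, T, H) -> (B, T, V)
        # Tied output: run both factors in reverse; no separate unembedding.
        down = hidden @ self.up_proj.weight       # weight is (H, E) -> (B, T, E)
        return down @ self.token_embed.weight.T   # (B, T, V)
```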

## 3. Quantization-Aware Training (QAT)

Train the model knowing it will be quantized. The model learns weight distributions that survive low-precision conversion. At 2-bit precision, 16 MB holds ~64M raw parameters; the working target here is ~32M, reserving roughly half the budget for quantization scales, metadata, and safety margin.

Key insight: post-training quantization at 2-bit loses 15-20% quality. QAT at 2-bit loses only ~4%. The difference is massive at this scale.

**Approach:** Train at FP16/BF16, apply QAT during training with straight-through estimators, export at 2-bit for the final artifact.
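A minimal sketch of the straight-through estimator at 2-bit, with per-tensor symmetric scaling chosen for brevity (a real run would likely use per-channel or per-group scales):

```python
import torch

class FakeQuant2Bit(torch.autograd.Function):
    """Straight-through estimator: forward rounds weights to 2-bit levels,
    backward passes the gradient through to the full-precision weights."""
    @staticmethod
    def forward(ctx, w, num_bits=2):
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -2..1
        scale = w.abs().max().clamp(min=1e-8) / abs(qmin)
        return torch.clamp(torch.round(w / scale), qmin, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # identity gradient: the "straight-through" part

# Usage inside a layer's forward pass: q_w = FakeQuant2Bit.apply(self.weight)
```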

## 4. Knowledge Distillation

Use a larger pretrained model as a teacher during training. The 8×H100 budget can run a 7B teacher alongside a 32M student. The student learns from soft probability distributions rather than hard labels, capturing more knowledge per training step.

Distillation is especially powerful for small models — the teacher provides a richer gradient signal than raw cross-entropy on token predictions alone.
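A sketch of the standard soft-target loss (Hinton-style). One assumption to flag: it requires the teacher and student to share a tokenizer, which an off-the-shelf 7B teacher will not with a custom 1024-token vocabulary, so the teacher would need matching tokenization in practice:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend KL divergence against the temperature-softened teacher
    distribution with ordinary cross-entropy on the data labels."""
    V = student_logits.size(-1)
    soft = F.kl_div(
        F.log_softmax(student_logits.view(-1, V) / T, dim=-1),
        F.log_softmax(teacher_logits.view(-1, V) / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)  # rescale so gradient magnitude matches the hard-label term
    hard = F.cross_entropy(student_logits.view(-1, V), targets.view(-1))
    return alpha * soft + (1 - alpha) * hard
```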

## 5. Training Maximization

Every second of the 10-minute budget matters:

- **Sequence packing:** Multiple short examples per input sequence, no wasted padding tokens (see the sketch after this list)
- **Curriculum ordering:** Train on FineWeb examples ordered by difficulty (shorter/simpler first, longer/complex later) for faster initial convergence
- **Cosine LR schedule:** High initial learning rate with cosine decay over the 10-minute window
- **Gradient accumulation:** Effective batch size tuned for optimal loss curves on H100s
- **Mixed precision training:** BF16 compute for speed, QAT checkpoints for artifact size
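The packing bullet, as a minimal sketch (greedy concatenation with EOS separators; names are illustrative):

```python
def pack_sequences(token_streams, seq_len, eos_id):
    """Concatenate tokenized documents separated by EOS and slice
    fixed-length training rows, so no position is padding."""
    buffer, rows = [], []
    for doc in token_streams:          # each doc is a list of token ids
        buffer.extend(doc + [eos_id])
        while len(buffer) >= seq_len:
            rows.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return rows
```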

## 6. Tokenizer Optimization

Vocabulary size directly impacts the embedding parameter count. The baseline uses a 1024-token vocabulary. Exploring:

- Smaller BPE vocabularies (512, 256): fewer embedding parameters but worse compression
- Larger vocabularies: more embedding parameters but fewer tokens per byte; since the evaluation metric is bits per byte, better compression can offset the parameter cost
- A custom tokenizer trained specifically on the FineWeb distribution (see the sweep sketch below)
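A possible shape for the sweep using sentencepiece (corpus path and options are placeholders):

```python
import sentencepiece as spm

# Train a FineWeb-specific BPE vocab at several sizes and measure bytes/token.
for vocab_size in (256, 512, 1024, 2048):
    spm.SentencePieceTrainer.train(
        input="fineweb_sample.txt",            # placeholder corpus path
        model_prefix=f"fineweb_bpe_{vocab_size}",
        vocab_size=vocab_size,
        model_type="bpe",
        byte_fallback=True,                    # keep every byte representable
    )
    sp = spm.SentencePieceProcessor(model_file=f"fineweb_bpe_{vocab_size}.model")
    text = open("fineweb_sample.txt", encoding="utf-8").read()
    ratio = len(text.encode("utf-8")) / len(sp.encode(text))
    print(vocab_size, "bytes/token:", round(ratio, 3))
```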

## 7. Alternative Architectures

Beyond standard transformers:

- **State-space models (Mamba-style):** Linear scaling with sequence length, potentially more parameter-efficient for the same quality
- **Mixture of Experts at micro-scale:** Multiple tiny FFN experts with a router — only a subset active per token, more capacity per parameter (a top-1 router sketch follows this list)
- **Depth-adaptive inference:** Early exit for easy tokens, full depth for hard ones — maximizes quality where it matters most
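A hypothetical sketch of the micro-MoE idea with a top-1 router (dimensions are illustrative; load-balancing losses are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MicroMoE(nn.Module):
    """Tiny FFN experts with a top-1 router: each token pays the compute
    of one expert while the layer holds num_experts worth of capacity."""
    def __init__(self, hidden_dim=512, ffn_dim=128, num_experts=4):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(),
                          nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, hidden)
        gate = F.softmax(self.router(x), dim=-1)
        top = gate.argmax(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out
```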

---

## The Math

Raw capacity is artifact bits divided by bitwidth; the working budget below reserves roughly half for quantization scales, metadata, and margin.

| Bitwidth | Raw capacity in 16 MB | Working budget | Architecture |
|----------|----------------------|----------------|--------------|
| 2-bit | ~64M | ~32M | Recursive transformer, factorized embeddings |
| 3-bit | ~43M | ~21M | Standard transformer, tied embeddings |
| 4-bit | ~32M | ~16M | Compact transformer |
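For reference, the arithmetic behind the table as a quick check (not competition code):

```python
ARTIFACT_BYTES = 16 * 10**6          # 16 MB artifact limit
for bits in (2, 3, 4):
    raw = ARTIFACT_BYTES * 8 / bits  # parameters at full raw capacity
    print(f"{bits}-bit: {raw/1e6:.0f}M raw, ~{raw/2e6:.0f}M at 50% headroom")
# 2-bit: 64M raw, ~32M   3-bit: 43M raw, ~21M   4-bit: 32M raw, ~16M
```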

## Experiment Plan

- [ ] Run baseline (9-layer, 512-dim, 1024-vocab, tied embeddings) — establish score to beat (1.2244)
- [ ] Implement depth recurrence (4 recursive blocks × 3 passes)
- [ ] Add factorized embeddings (V×128 + 128×H)
- [ ] Test 2-bit QAT during training
- [ ] Knowledge distillation with 7B teacher
- [ ] Curriculum data ordering on FineWeb
- [ ] Tokenizer vocabulary sweep (256, 512, 1024, 2048)
- [ ] Mamba/SSM architecture comparison
- [ ] Combine best techniques into final submission

## Background

Five production fine-tuned models (7B-72B) deployed via QLoRA/GGUF/NVFP4 quantization on NVIDIA DGX hardware; a 130K-chunk expert knowledge base built for AI/ML research consultation; deep hands-on experience with compression-quality tradeoffs across bitwidths.

## Status

Credits requested. Local experimentation with MLX baseline in progress.
---
# Non-record: Universal Transformer + Adaptive Density

**val_bpb: 3.2483 (legal, no TTT)** | DGX Spark GB10, 200 steps sp1024, no torch.compile

## Update (April 11, 2026)

An earlier version of this PR reported **val_bpb 1.4390** using multi-epoch TTT on `val_tokens` without score-first discipline. @MatoTeziTanka correctly flagged this as the same illegal pattern that closed PR #1376 and the rest of the Pre-Quant TTT cluster.

The flagged code path (`ttt_adapt(args, base_model, device, val_tokens, ...)` at line 1194) was part of the `train_gpt_kitchen_sink.py` base script and defaulted to enabled. This submission has been updated to disable TTT by default and report honest BPB from a clean run on the DGX Spark.

Thanks to @MatoTeziTanka for the careful review.

## What This PR Is

Implements OpenAI's requested "Universal transformer" research direction from the README. Single shared transformer block looped N times with per-iteration learnable parameters (attn_scale, mlp_scale, resid_mix, iteration_embed). 50% sparse-to-dense curriculum during training.
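A minimal reading of that description in PyTorch (the actual `train_gpt.py` may differ in block structure and masking; the sparse-to-dense curriculum is not shown):

```python
import torch
import torch.nn as nn

class SharedBlockUT(nn.Module):
    """One block's weights looped num_iters times; the only unshared
    parameters are per-iteration attn_scale / mlp_scale / resid_mix
    plus an iteration embedding."""
    def __init__(self, dim=384, heads=6, num_iters=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.num_iters = num_iters
        # Per-iteration learnables (the unshared part):
        self.attn_scale = nn.Parameter(torch.ones(num_iters))
        self.mlp_scale = nn.Parameter(torch.ones(num_iters))
        self.resid_mix = nn.Parameter(torch.zeros(num_iters))  # sigmoid -> 0.5
        self.iteration_embed = nn.Embedding(num_iters, dim)

    def forward(self, x):                      # x: (B, T, dim)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        for t in range(self.num_iters):
            h = x + self.iteration_embed.weight[t]   # tell the block where it is
            n = self.norm1(h)
            a, _ = self.attn(n, n, n, attn_mask=causal, need_weights=False)
            h = h + self.attn_scale[t] * a
            h = h + self.mlp_scale[t] * self.mlp(self.norm2(h))
            mix = torch.sigmoid(self.resid_mix[t])
            x = mix * x + (1 - mix) * h              # learned residual mix
        return x
```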

This is a non-record track research submission. It exists to answer the question: does weight-shared depth recurrence work at the parameter golf scale? The answer is yes, but it plateaus fast and is dominated by mini depth recurrence (repeat 2-3 specific layers) as used in PR #1204 and PR #1334.

## Legal Results (No TTT)

All runs on NVIDIA DGX Spark GB10 (single GPU, 128GB unified memory), sp1024, 200 training steps, SEED=42, no torch.compile.

| Run | Config | Params | val_bpb | ms/step |
|-----|--------|--------|---------|---------|
| UT-1 | 1 block x 6 iters | 4,546,568 | **3.2483** | 707 |
| UT-2 | 1 block x 24 iters | 4,601,864 | 3.2490 | 2,734 |

## Finding

Increasing iterations from 6 to 24 (4x compute per step) changes BPB by less than 0.001 (3.2483 vs 3.2490). Full weight sharing hits a ceiling almost immediately at this model size. The compute budget is better spent on:

1. **Mini depth recurrence** (repeat 2-3 specific layers) as in PR #1204, which avoids the weight-sharing penalty on the non-repeated layers
2. **More training steps** rather than more iterations per step
3. **Wider models** (per MEGA-2 ablation: d=640 beats 11 layers at d=512)

The 2.87 MB artifact size means there is substantial headroom under the 16 MB limit. A hybrid approach combining partial weight sharing with a larger base model would likely beat the pure-shared approach tested here.

## Reproduction

```bash
pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
VOCAB_SIZE=1024 NUM_ITERS=6 TORCH_COMPILE_DISABLE=1 ITERATIONS=200 TTT_ENABLED=0 \
python3 records/track_non_record_16mb/2026-03-31_UniversalTransformer_AdaptiveDensity/train_gpt.py
```

## Full Ablation Data

Raw logs and CSV for all 22 runs across 7 architectures (Universal Transformer, Text Diffusion, Random Adapters, JEPA, Mamba SSM, H-Net, Megakernels):

https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

## Hardware Notes

DGX Spark GB10 is approximately 6x slower per step than 8xH100. Absolute BPB values are much higher than competition runs due to the short 200-step training budget. The relative ordering between configurations is what matters here: more iterations does not help, and depth recurrence plateaus quickly.

## Related

- PR #1191 H-Net Dynamic Chunking (non-record, same cluster)
- PR #1192 Fused Triton Megakernels (non-record)
- PR #1194 Text Diffusion (non-record)
- PR #1195 Random Adapter Maps (non-record)
- PR #1196 LLM-JEPA (non-record)
- PR #1197 Mamba SSM Hybrid (non-record)
- PR #1204 msisovic Mini Depth Recurrence (record-track implementation of this idea)
- PR #1334 aryanbhosale Track A with depth recurrence + parallel residuals (1.0897 BPB)
---
{"track":"non_record","title":"Universal Transformer + Adaptive Density","val_bpb":3.2483,"hardware":"1xDGX-Spark-GB10","author":"NathanMaine","notes":"Updated April 11: TTT-on-val disabled per MatoTeziTanka review. Honest no-TTT result from DGX Spark ablation (200 steps, sp1024)."}