Parameter Golf — Two leaderboard submissions to OpenAI's challenge

OpenAI's Parameter Golf challenge asks one question: how good a language model can you train in 10 minutes on 8×H100 GPUs, with the final artifact compressed to under 16 MB? Score is bits-per-byte on FineWeb-10B validation.
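For context on what a score like 1.08 means: bits-per-byte is the model's summed negative log-likelihood over the validation text, converted from nats to bits and normalised by the raw UTF-8 byte count. A minimal sketch of that conversion (the evaluation loop and tokenizer handling are not shown and are not the challenge's official scorer):

```python
import math

def bits_per_byte(sum_nll_nats: float, n_bytes: int) -> float:
    """Summed token negative log-likelihood (nats) over the eval set,
    converted to bits and divided by the UTF-8 byte length of the text."""
    return sum_nll_nats / (n_bytes * math.log(2))

# Equivalent per-token view: bpb = mean_ce_nats * (tokens / bytes) / ln(2)
```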

This repo is the working directory I used over six weeks (March 19 → April 30, 2026) to make two submissions, both built around a novel technique I introduced — adaptive Hessian-sensitivity GPTQ clipping.

Results

| # | Submitted | PR | val_bpb (3-seed) | Stack |
|---|-----------|----|------------------|-------|
| 1 | 2026-04-17 | #1689 | 1.0822 | SP8192 + adaptive Hessian-sensitivity GPTQ clip on the PR #1394 base |
| 2 | 2026-04-30 | #1962 | 1.06310 | Same technique ported onto PR #1855 (current leaderboard SOTA at 1.06108) |

The leaderboard moved fast: the SOTA front advanced from 1.2244 BPB at the challenge start to 1.06108 BPB by 2026-04-27, a ~13% relative improvement from the community in six weeks. My second submission was a deliberate follow-up to validate that the adaptive-clip technique generalised beyond its original base.

The technique in one paragraph

Adaptive Hessian-sensitivity GPTQ clipping replaces three hand-tuned hyperparameters (MLP_CLIP_SIGMAS, ATTN_CLIP_SIGMAS, MATRIX_CLIP_SIGMAS) with one automated per-tensor selection driven by the Hessian diagonal magnitude. Each weight tensor's quantization clip-σ is computed from its own sensitivity (H_diag.mean() × row_var); a binary-search offset preserves the overall compression budget. Tensors that need precision get tighter clipping; tensors that don't get looser. On heavily-tuned stacks like PR #1855's it reproduces the hand-tuned result within ~2σ while eliminating three hyperparameters; on un-tuned stacks it skips the manual search entirely.

→ See submissions/2026-04-30_phase9_PR1962_pr1855_stack/README.md for the full math, derivation, and per-tensor σ allocation table.
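To make the selection concrete, here is a minimal sketch of the idea in PyTorch. The sensitivity proxy matches the description above (mean Hessian diagonal × row variance); the σ range, the rank-based mapping from sensitivity to σ, and the "mean σ" budget are illustrative assumptions, not the submitted implementation.

```python
# Minimal sketch of adaptive per-tensor clip-sigma selection (illustrative only).
import torch

def tensor_sensitivity(weight: torch.Tensor, h_diag: torch.Tensor) -> float:
    # Sensitivity proxy: mean GPTQ Hessian diagonal times mean per-row weight variance.
    return (h_diag.mean() * weight.var(dim=1).mean()).item()

def adaptive_clip_sigmas(tensors, sigma_lo=2.0, sigma_hi=6.0, budget_mean_sigma=4.0):
    """tensors: list of (weight, hessian_diagonal) pairs, one per quantized layer.
    Returns one clip-sigma per tensor: sensitive tensors clip tighter (smaller sigma),
    and a binary-searched global offset keeps the mean sigma on budget."""
    sens = torch.tensor([tensor_sensitivity(w, h) for w, h in tensors])
    ranks = sens.argsort().argsort().float() / max(len(sens) - 1, 1)  # 0 = least sensitive
    base = sigma_hi - ranks * (sigma_hi - sigma_lo)                   # sensitive -> tight

    lo, hi = -(sigma_hi - sigma_lo), sigma_hi - sigma_lo
    for _ in range(30):  # binary search the shared offset against the budget
        mid = (lo + hi) / 2
        if (base + mid).clamp(sigma_lo, sigma_hi).mean() > budget_mean_sigma:
            hi = mid
        else:
            lo = mid
    return (base + (lo + hi) / 2).clamp(sigma_lo, sigma_hi)
```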

What's here

| File / directory | What it is |
|------------------|------------|
| train_gpt.py | Current model script (~3,870 lines, the PR #1962 / current-SOTA-base implementation, annotated for readability) |
| learning_concepts.md | Concept reference: 60+ techniques explained conversationally, from QAT to TTT to Muon to GPTQ. Written as I learned each one. |
| runs.md | Chronological log of every training run with date, config, val_bpb, artifact size, and what we learned. ~188 entries through 2026-04-17. |
| submissions/ | Frozen submission artifacts for both leaderboard PRs (each folder contains the exact train_gpt.py, 3-seed run logs, and submission.json that were submitted upstream) |
| docs/strategy.md | Original strategic blueprint + the 2026-03-22 leaderboard-pivot moment + scoring scoreboard |
| docs/phases/ | Per-phase planning docs (Phase 2 architecture → Phase 9 YOLO), the actual execution chronology |
| docs/workstation.md | Local hardware spec + multi-GPU training safety protocol + what the rig did for the project |
| docs/runpod_guide.md | RunPod 8×H100 deployment recipe for submission runs (deps, dataset, 3-seed loop, log retrieval) |
| data/ | Dataset download script (cached_challenge_fineweb.py) + custom retokenization script (download_hf_docs_and_tokenize.py) + tokenizer specs |

The narrative arc

I went into this project as a software engineer transitioning to AI engineering. Training a transformer from scratch was new to me — I'd never worked at the level of attention heads, quantization, or optimizers before. So I asked Claude to write learning_concepts.md as we went, one entry per technique we touched, in conversational language. That doc became the running glossary I'd reread between runs.

The original plan (docs/strategy.md) had six "pillars": ternary weights, factored embeddings, depth recurrence, compression-aware regularization, progressive recurrence, and TTT. Three of those got abandoned within the first two weeks (ternary, compression regularization, the original framing of recurrence) once I read the actual leaderboard and saw what was winning. TTT survived from start to finish and ended up being the bedrock of both submissions.

The second submission's story is genuinely the better one. By the time I shipped PR #1689 at 1.0822 BPB, the leaderboard had moved past my base stack. Rather than start over, I ported my novel contribution (adaptive Hessian-sensitivity clipping) onto the new SOTA stack (PR #1855) and shipped PR #1962 on the deadline. Same technique, applied to the current best base, validated to compose with everything in the modern pipeline (LQER asymmetric quantization, phased TTT, per-group lrzip).

How the project ran

Most of the project happened on a local AI workstation, not on RunPod. docs/workstation.md has the rig spec; the short version:

  • Local-first cost asymmetry. A bad 8×H100 run costs ~$5. A bad 4090 run costs ambient electricity. That asymmetry let me try unbounded ideas cheaply: ternary weights got disproved here, not on H100; the LoRA-TTT-compile bug was discovered and fixed locally before any expensive RunPod surprise.
  • 188 logged training runs in runs.md from 2026-03-17 through 2026-04-17 (Phases 1–7), with phase-8 and phase-9 work logged in the phase docs and submission folders.
  • ~10 H100 runs total across the entire project — only when local data justified the spend.
  • The 4090's 10-minute wallclock budget was deliberately set to mirror the H100 budget, so a local run produced a directionally meaningful "what would this look like at scale" signal.

What I learned (the SE→AIE part)

Six weeks in, I'd logged ~190 training runs and explored 60+ techniques. A few takeaways that went into learning_concepts.md and shaped both submissions:

  • Compression is the binding constraint, not parameter count. The 16 MB artifact limit means you optimize information density after compression, not raw weight count. This reframes everything: quantization isn't an afterthought, it's the central design choice.
  • Test-time training is the highest-ceiling lever. Both submissions use it; the technique evolved across phases (LoRA TTT → score-first SGD TTT → phased TTT) but the principle is unchanged: let the model adapt per-document at eval, not just be good at "everything on average." (A sketch follows this list.)
  • Track everything, decide once. runs.md was 300 KB by the end. Going back to it weekly to re-evaluate dead ends and lucky breaks was more valuable than any individual experiment.
  • Local-first is a discipline, not a constraint. Spending more on H100s wouldn't have moved my best result much; what would have moved it is more rigorous local triage of dead-end ideas.
  • Honest negative results are a contribution. Both submissions document techniques that failed (TrigramHash, BitNet ternary, mixed-precision GPTQ). Recruiters and reviewers read those harder than the headline result.
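To make the per-document TTT idea concrete, here is a minimal sketch of eval-time adaptation: for each validation document, take a few SGD steps on a throwaway copy of the model using the document's own prefix, then score the remaining tokens with the adapted weights. The prefix split, step count, learning rate, and the assumption that `model(input_ids)` returns logits are all illustrative; the submitted phased-TTT implementation differs.

```python
import copy
import torch
import torch.nn.functional as F

def eval_with_ttt(model, doc_tokens: torch.Tensor, prefix_len: int = 512,
                  steps: int = 4, lr: float = 1e-3) -> float:
    """Adapt a copy of the model on the document's prefix, then return the
    summed NLL (nats) of the tokens after the prefix under the adapted copy."""
    adapted = copy.deepcopy(model)          # never contaminate other documents
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)

    prefix = doc_tokens[:prefix_len]
    for _ in range(steps):                  # a few gradient steps on the prefix only
        logits = adapted(prefix[None, :-1])
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), prefix[1:])
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                   # score only the held-out remainder
        logits = adapted(doc_tokens[None, :-1])
        nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                              doc_tokens[1:], reduction="none")
    return nll[prefix_len - 1:].sum().item()
```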

Status

PR #1689 and PR #1962 are both currently open on openai/parameter-golf (neither beats the current SOTA by the +0.005 nats record threshold needed to become the new SOTA citation). PR #1962 reproduces PR #1855's hand-tuned result within ~2σ (+0.00203 BPB), a near statistical tie that demonstrates the adaptive technique generalises. The full PR description and reviewer-readable writeup are in submissions/2026-04-30_phase9_PR1962_pr1855_stack/README.md.

The challenge ended 2026-04-30; the leaderboard front at deadline was 1.06108 BPB.
