Progressive Depth + Hedge Mixer — val_bpb 1.1441 (3-seed mean) #1384

Closed

iverbovoy wants to merge 1 commit into openai:main from iverbovoy:submission/progressive-depth-hedge-mixer

Conversation

@iverbovoy

Summary

  • val_bpb: 1.1441 (3-seed mean, std 0.0051) — 5-expert Hedge Mixer eval
  • sliding_bpb: 1.1960 (3-seed mean) — standard sliding window eval
  • 3 shared blocks × 4 repeats = 12 effective layers, 17.14M params
  • Progressive depth training (2→3→4 repeats): +30% training steps vs fixed depth
  • int8 + zstd-22, ~15.88 MB artifact (see the packing sketch after this list)
  • 8×H100 SXM, PyTorch 2.5.1, 600s training + ~582s Hedge eval
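
A minimal sketch of the packing step behind the ~15.88 MB figure, assuming symmetric per-tensor int8 quantization plus the `zstandard` package; the function and names here are illustrative, not this PR's actual packing code:

```python
# Illustrative packing: quantize each float tensor to int8 with a
# per-tensor scale, then compress the serialized result at zstd level 22.
import io
import torch
import zstandard as zstd

def pack_state_dict(state_dict: dict) -> bytes:
    packed = {}
    for name, w in state_dict.items():
        scale = w.abs().max().clamp(min=1e-8) / 127.0      # symmetric [-127, 127]
        q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
        packed[name] = (q, scale)                          # keep scale for dequant
    buf = io.BytesIO()
    torch.save(packed, buf)
    return zstd.ZstdCompressor(level=22).compress(buf.getvalue())
```

int8 codes compress well under zstd because many layers have low-entropy weight distributions; level 22 is zstd's maximum and trades compression time for artifact size.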

3-Seed Results

| Seed | Steps | Roundtrip bpb | Sliding bpb | Hedge bpb |
|------|-------|---------------|-------------|-----------|
| 1337 | 5,668 | 1.2302 | 1.1965 | 1.1441 |
| 42 | 5,170 | 1.2298 | 1.1962 | 1.1491 |
| 7 | 5,405 | 1.2286 | 1.1952 | 1.1390 |
| Mean | 5,414 | 1.2295 | 1.1960 | 1.1441 |

Key Innovations

  1. Depth Recurrence: 3 shared blocks repeated 4× with cross-repeat skip connections, loop embeddings, and value embeddings (first sketch below)
  2. Progressive Depth Training: train at 2 repeats (fast) → 3 → 4 repeats (full depth), gaining +30% training steps in the same budget (second sketch below)
  3. Hedge Mixer: eval-time 5-expert online ensemble (neural + unigram + bigram + trigram + entropy) providing a −0.052 bpb improvement (third sketch below)
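
A minimal sketch of the depth recurrence (item 1) in PyTorch. `Block` is a stand-in with only a norm and an MLP; the real blocks also carry the XSA attention and value embeddings, and all names here are illustrative rather than this PR's code:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in block: pre-norm residual MLP (attention omitted for brevity)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 3 * d_model), nn.GELU(),
                                 nn.Linear(3 * d_model, d_model))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class RecurrentStack(nn.Module):
    """3 shared blocks unrolled num_repeats times (4 repeats -> 12 effective layers)."""
    def __init__(self, d_model: int, n_blocks: int = 3, max_repeats: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d_model) for _ in range(n_blocks))
        self.loop_emb = nn.Embedding(max_repeats, d_model)        # per-repeat vector
        self.skip_gain = nn.Parameter(torch.zeros(max_repeats))   # cross-repeat skip

    def forward(self, x, num_repeats: int = 4):
        h0 = x  # input to the first repeat, fed back into every later repeat
        for r in range(num_repeats):
            x = x + self.loop_emb.weight[r] + self.skip_gain[r] * h0
            for block in self.blocks:
                x = block(x)
        return x

# Usage (illustrative width): RecurrentStack(512)(torch.randn(2, 16, 512))
```

Because the blocks are shared, the parameter count is that of 3 layers while compute and effective depth are that of 12.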
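
For item 2, a sketch of the progressive depth schedule, assuming the `RecurrentStack` above. The phase boundaries are illustrative, chosen so the arithmetic lands near the reported +30% step gain:

```python
def repeats_at(step: int, total_steps: int) -> int:
    """Depth schedule: 2 repeats early, 3 mid-run, 4 for the final stretch."""
    frac = step / total_steps
    if frac < 0.35:
        return 2   # fast phase: half the block passes of full depth
    if frac < 0.55:
        return 3
    return 4       # full depth, matching how the model is evaluated

# Back of the envelope: a full-depth step costs 4 passes through the shared
# blocks, while this schedule averages 0.35*2 + 0.20*3 + 0.45*4 = 3.1 passes
# per step, so the same wall-clock budget buys about 4 / 3.1 ≈ 1.29x the steps.
```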
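
For item 3, a sketch of the mixer as textbook online Hedge (multiplicative weights) over per-token expert probabilities. The learning rate `eta` and the interface are assumptions, and how the "entropy" expert produces probabilities is not shown, so treat this as the shape of the method rather than the PR's exact rule:

```python
import math

def hedge_bits_per_token(expert_probs: list[list[float]], eta: float = 0.1) -> float:
    """Online Hedge mixture. expert_probs[k][t] is expert k's probability of
    the true token at step t; returns mean log loss in bits per token
    (dividing total bits by the byte count of the text then gives bpb)."""
    n_experts, n_steps = len(expert_probs), len(expert_probs[0])
    weights = [1.0 / n_experts] * n_experts
    total_bits = 0.0
    for t in range(n_steps):
        probs = [max(expert_probs[k][t], 1e-12) for k in range(n_experts)]
        mixed = sum(w * p for w, p in zip(weights, probs))  # predict first...
        total_bits += -math.log2(mixed)
        # ...then update: Hedge with log loss scales each weight by
        # exp(-eta * loss_k) = exp(eta * ln p_k) = p_k ** eta.
        weights = [w * p ** eta for w, p in zip(weights, probs)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return total_bits / n_steps
```

With eta = 1 the update reduces to a Bayesian mixture over experts; smaller eta adapts the weights more conservatively.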

Prior Work in This Repo

This submission consolidates our iterative work across several earlier PRs:

| PR | Score | What |
|----|-------|------|
| #148 | 1.2196 | Depth recurrence architecture |
| #784 | 1.2065 | + XSA, LeakyReLU² |
| #835 | 1.1980 | + Progressive depth |
| #856 | 1.1454 | + Hedge Mixer (1 seed) |
| This PR | 1.1441 | Clean submission, 3-seed validation |

Previous PRs will be closed in favor of this clean submission.

iverbovoy changed the title from "Record: Progressive Depth + Hedge Mixer — val_bpb 1.1441 (3-seed mean)" to "Progressive Depth + Hedge Mixer — val_bpb 1.1441 (3-seed mean)" on Apr 5, 2026
iverbovoy added a commit to iverbovoy/parameter-golf that referenced this pull request Apr 7, 2026
…eed mean)

3 shared blocks × 4 repeats (12 effective layers), MLP 3× (d=880),
int7 attention (63 levels) + int5 MLP (16 levels) mixed quantization,
8-GPU parallel Hedge Mixer eval (164s).

Key finding: int7 is the sweet spot for attention quantization —
recovers 98% of int8 hedge quality while saving 2MB for a wider model.

Improves on PR openai#1384 (1.1441) by −0.012 bpb.
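
For reference, a sketch of the mixed quantization described in the commit above: attention weights on a finer uniform grid (63 levels) than MLP weights (16 levels). The grid construction and the `"attn" in name` routing are illustrative assumptions, and the actual bit packing is not shown:

```python
import torch

def quantize_levels(w: torch.Tensor, n_levels: int) -> torch.Tensor:
    """Fake-quantize w onto a uniform grid of n_levels integer codes."""
    lo = -(n_levels // 2)
    hi = n_levels - 1 + lo           # 63 -> [-31, 31], 16 -> [-8, 7]
    scale = w.abs().max().clamp(min=1e-8) / max(-lo, hi)
    return torch.round(w / scale).clamp(lo, hi) * scale

def quantize_mixed(state_dict: dict) -> dict:
    # Finer grid where quantization hurts most (attention), coarser grid
    # where it does not (MLP), per the commit's sweet-spot finding.
    return {name: quantize_levels(w, 63 if "attn" in name else 16)
            for name, w in state_dict.items()}
```
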
@iverbovoy (Author)

Superseded by #1453 (1.1324 bpb, int7 mixed quantization). Keeping for historical reference — this was the first 3-seed validated submission in this depth recurrence line.

iverbovoy closed this on Apr 7, 2026