Record: 12L Shared-Specific Attention (d=16) + MLP 4.5x (3-seed mean val_bpb 1.0981)#1774
Open
aruniyer wants to merge 1 commit into openai:main
Conversation
….1218

Architecture changes from bigbag baseline (11L x 512d MLP 4x):
- 12 layers (up from 11): +1 unique non-looped layer
- Shared-Specific Attention (shared_head_dim=16): 25/75 Q/K split shared across heads, saves ~380KB artifact with no BPB cost
- MLP 4.5x (up from 4.0x): fills remaining artifact budget
- 38.3M params, artifact ~15.99 MB (fits 16 MB limit)

3-seed results (8xH100 SXM, 600s training):
- Seed 42: quantized BPB 1.12159, artifact 15,996,287 bytes
- Seed 314: quantized BPB 1.12270, artifact 15,994,335 bytes
- Seed 999: quantized BPB 1.12124, artifact 15,998,411 bytes
- Mean: quantized BPB 1.12184 (-0.0055 vs 11L baseline 1.12732)

Sliding window eval (quantized):
- Seed 42: 1.09785
- Seed 314: 1.09907
- Seed 999: 1.09746
- Mean: 1.09813

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
val_bpb = 1.0981 (3-seed mean, std 0.00083) | ~15.99 MB | 8xH100 SXM
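The headline mean and std can be reproduced from the three per-seed sliding-window values reported below; a quick check with Python's `statistics` module (the reported std 0.00083 is consistent with the sample std up to rounding):

```python
import statistics

# Per-seed quantized sliding-window BPB values (seeds 42, 314, 999)
vals = [1.09785, 1.09907, 1.09746]

mean = statistics.fmean(vals)
std = statistics.stdev(vals)  # sample std (ddof=1)

print(round(mean, 5), round(std, 5))  # → 1.09813 0.00084
```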
12 layers + Shared-Specific Attention (shared_head_dim=16) + MLP 4.5x
Built on bigbag's baseline (PR #1493, 11L x 512d MLP 4x). The code is verified structurally identical to PR #1493 except for three additions: shared-specific attention, +1 layer, and a wider MLP. No TTT.
Shared-Specific Attention compresses the Q/K projections by averaging 25% of each head's dimensions across all heads, saving ~416 KB in the quantized artifact at near-zero BPB cost. PR #1493's 11L model already uses ~15.99 MB of the 16 MB budget; without the shared head dimensions (SHD) there is no room for either a deeper or a wider model. SHD frees the budget needed for both the extra layer and the wider MLP.
Architecture Changes
The shared-specific attention mechanism splits each head's Q/K projections into specific_dim=48 (unique per head) and shared_dim=16 (averaged across all heads). V projections stay fully independent. RoPE positional encoding (16 dims) applies only to the specific dimensions. The sharing is baked into the weights before GPTQ quantization, so it adds zero inference cost.

Artifact budget (all 8xH100 sp8192, same code_bytes):
SHD=16 on 12L saves ~416 KB (15.08 -> 14.66 MB), creating just enough room to widen MLP from 4.0x to 4.5x and fill the budget.
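The "baking" step described above can be sketched as a plain weight transform. This is a hypothetical NumPy illustration, not the PR's code; n_heads=8 is an assumption (8 heads x 64 head_dim = 512 = d_model), and the layout placing the shared slice in the last `shared_dim` rows of each head is likewise assumed:

```python
import numpy as np

d_model, n_heads = 512, 8
specific_dim, shared_dim = 48, 16
head_dim = specific_dim + shared_dim  # 64

rng = np.random.default_rng(0)
# Per-head Q weights: (n_heads, head_dim, d_model); same treatment applies to K.
w_q = rng.standard_normal((n_heads, head_dim, d_model))

def bake_shared(w):
    """Replace each head's shared slice with the cross-head average."""
    w = w.copy()
    shared = w[:, specific_dim:, :].mean(axis=0)  # (shared_dim, d_model)
    w[:, specific_dim:, :] = shared               # broadcast back to all heads
    return w

w_q_baked = bake_shared(w_q)

# After baking, every head carries identical shared rows, so the artifact
# only needs to store them once: (n_heads-1) * shared_dim * d_model fewer
# unique Q weights per layer (and the same again for K).
assert all(
    np.allclose(w_q_baked[h, specific_dim:], w_q_baked[0, specific_dim:])
    for h in range(n_heads)
)
```

Because the averaging happens before GPTQ quantization, inference sees ordinary dense Q/K projections and pays no extra cost.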
Parameters: 38.3M | Steps: ~4,317 in 600s | Speed: ~5.9M tok/s on 8xH100 SXM
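The throughput and step-count figures are mutually consistent; a back-of-envelope check (all inputs are the rounded numbers above, so the result is approximate):

```python
# ~4,317 steps in 600 s at ~5.9M tok/s implies the effective tokens per step.
tokens_total = 5.9e6 * 600            # ≈ 3.54B tokens in the 600 s budget
tokens_per_step = tokens_total / 4317
print(f"{tokens_per_step:,.0f} tokens/step")  # ≈ 820k tokens/step
```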
3-Seed Validation (8xH100 SXM, 600s)
PR #1493 reference (bigbag, 11L with TTT):
Our no-TTT sliding window BPB (1.0981) does not surpass bigbag's TTT result (1.0810). The contribution here is the shared-specific attention mechanism, which compresses the artifact and enables deeper+wider architectures within the same 16 MB limit.
Ablation Table (8xH100 SXM, seed=42, sp8192)
Key insight: depth > width. Going from 11->12 layers was more effective than widening MLP at the same layer count, because wider MLP costs training speed (fewer steps in 600s) while an extra layer's speed cost is moderate.
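The two options are roughly comparable in parameter (and hence artifact) cost, which is why training speed becomes the deciding factor. A hedged back-of-envelope, assuming d=512 and a plain two-matrix MLP (neither detail is stated explicitly above):

```python
d = 512
attn = 4 * d * d                          # Q, K, V, O projections
mlp = lambda ratio: 2 * d * int(ratio * d)

extra_layer = attn + mlp(4.0)             # one more 11L-style layer
widen_all = 11 * (mlp(4.5) - mlp(4.0))    # widen MLP 4.0x -> 4.5x on 11 layers

print(extra_layer, widen_all)  # → 3145728 2883584
```

Both land near 3M parameters, so the artifact budget is nearly indifferent between them; the extra layer wins because it costs fewer steps in the fixed 600 s window.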
Rule Compliance
Credits
Built on PR #1493 (@bigbag).