Record: 12L Shared-Specific Attention (d=16) + MLP 4.5x (3-seed mean val_bpb 1.0981)#1774

Open
aruniyer wants to merge 1 commit into openai:main from aruniyer:submission/12L-shared-attn-mlp45

@aruniyer aruniyer commented Apr 22, 2026

Summary

val_bpb = 1.0981 (3-seed mean, std 0.00083) | ~15.99 MB | 8xH100 SXM

12 layers + Shared-Specific Attention (shared_head_dim=16) + MLP 4.5x

Built on bigbag's baseline (PR #1493, 11L x 512d MLP 4x). The code is verified structurally identical to PR #1493 except for three additions: shared-specific attention, one extra layer, and a wider MLP. No TTT.

Shared-Specific Attention compresses the Q/K projections by averaging 25% of each head's dimensions across all heads, saving ~416 KB in the quantized artifact at near-zero BPB cost. PR #1493's 11L model already uses ~15.99 MB of the 16 MB budget, so without SHD (the shared head dimension) there is no room for either a deeper or a wider model. SHD frees the budget needed for both the extra layer and the wider MLP.

Architecture Changes

| Change | Description | Effect |
|---|---|---|
| Shared-Specific Attention (SHD=16) | 25/75 Q/K split: last 16 dims of each head averaged across heads | Compresses artifact ~416 KB; enables the other two changes |
| 12 layers (from 11) | Extra non-looped physical layer | Better representation, moderate speed cost |
| MLP 4.5x (from 4.0x) | Wider MLP fills remaining artifact budget | Improves expressiveness |

The shared-specific attention mechanism splits each head's Q/K projections into specific_dim=48 (unique per head) and shared_dim=16 (averaged across all heads). V projections stay fully independent. RoPE positional encoding (16 dims) applies only to specific dimensions. The sharing is baked into weights before GPTQ quantization, so it adds zero inference cost.
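The weight-bake step described above can be sketched in a few lines. This is a minimal illustration, not the PR's code: the `specific_dim=48` / `shared_dim=16` split of a 64-dim head is taken from the description, while `n_heads=8` is an assumption (a 512d model with 64-dim heads); the function name `bake_shared_dims` is hypothetical.

```python
import numpy as np

# Assumed dimensions: head_dim = 64 split into specific_dim = 48 (unique
# per head) + shared_dim = 16 (averaged across heads). n_heads = 8 is an
# assumption (512d model / 64-dim heads); not stated in the PR.
n_heads, head_dim, d_model = 8, 64, 512
specific_dim = 48

def bake_shared_dims(w, specific_dim):
    """Average the trailing (shared) dims of each head across all heads.

    w: per-head Q or K projection weights, shape (n_heads, head_dim, d_model).
    After baking, the shared slice is identical in every head, so the
    artifact only needs to store one copy of it. The attention math is
    unchanged at inference time: these are still ordinary dense weights.
    """
    w = w.copy()
    # Mean over the head axis, keepdims so it broadcasts back.
    shared = w[:, specific_dim:, :].mean(axis=0, keepdims=True)
    w[:, specific_dim:, :] = shared
    return w

rng = np.random.default_rng(0)
wq = rng.standard_normal((n_heads, head_dim, d_model)).astype(np.float32)
wq_shared = bake_shared_dims(wq, specific_dim)

# Specific dims untouched; shared dims now identical across all heads.
assert np.array_equal(wq_shared[:, :specific_dim, :], wq[:, :specific_dim, :])
assert np.allclose(wq_shared[0, specific_dim:, :], wq_shared[5, specific_dim:, :])
```

Because the averaging is applied to the weights before GPTQ quantization, the quantizer sees redundant rows that deduplicate in the artifact, which is how the sharing saves space with zero runtime cost.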

Artifact budget (all 8xH100 sp8192, same code_bytes):

| Config | Artifact | vs PR #1493 |
|---|---|---|
| PR #1493 (11L MLP4x) | 15.99 MB | |
| 12L SHD=0 MLP4x | 15.08 MB | -0.91 MB |
| 12L SHD=16 MLP4x | 14.66 MB | -1.33 MB |
| 12L SHD=16 MLP4.5x | <16.00 MB | fills budget |

SHD=16 on 12L saves ~416 KB (15.08 -> 14.66 MB), creating just enough room to widen MLP from 4.0x to 4.5x and fill the budget.
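A back-of-envelope check makes the ~416 KB figure plausible. This is a sketch under stated assumptions: 8 heads (512d model with 64-dim heads, not stated in the PR), sharing applied to both Q and K in all 12 layers, and the per-parameter storage cost inferred rather than reported.

```python
# Assumptions: n_heads = 8 (inferred from 512d / 64-dim heads), sharing
# applied to both Q and K across all 12 layers.
n_heads, shared_dim, d_model, n_layers = 8, 16, 512, 12

# Each shared dim is stored once instead of once per head, so every layer
# drops (n_heads - 1) * shared_dim rows of size d_model from Q and from K.
saved_params = (n_heads - 1) * shared_dim * d_model * 2 * n_layers
print(saved_params)  # 1376256

# Implied storage per saved parameter, given the reported ~416 KB saving.
bits_per_param = 416 * 1024 * 8 / saved_params
print(round(bits_per_param, 2))  # ~2.48 bits/param
```

About 2.5 bits per parameter is in the usual range for a low-bit GPTQ artifact, so the reported saving is consistent with the assumed head count.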

Parameters: 38.3M | Steps: ~4,317 in 600s | Speed: ~5.9M tok/s on 8xH100 SXM

3-Seed Validation (8xH100 SXM, 600s)

| Seed | Quantized BPB | Sliding Window BPB | Artifact Size | Margin |
|---|---|---|---|---|
| 42 | 1.12159 | 1.09785 | 15,996,287 B | 3,713 B |
| 314 | 1.12270 | 1.09907 | 15,994,335 B | 5,665 B |
| 999 | 1.12124 | 1.09746 | 15,998,411 B | 1,589 B |
| Mean | 1.12184 | 1.09813 | | |
| Std | 0.00076 | 0.00083 | | |

PR #1493 reference (bigbag, 11L with TTT):

| Seed | Sliding BPB | TTT BPB | Artifact |
|---|---|---|---|
| 42 | 1.0829 | 1.0808 | 15,991,930 |
| 314 | 1.0827 | 1.0810 | 15,992,919 |
| 999 | 1.0826 | 1.0812 | 15,993,232 |
| Mean | 1.0827 | 1.0810 | 15,992,694 |

Our no-TTT sliding window BPB (1.0981) does not surpass bigbag's TTT result (1.0810). The contribution here is the shared-specific attention mechanism which compresses the artifact and enables deeper+wider architectures within the same 16 MB limit.

Ablation Table (8xH100 SXM, seed=42, sp8192)

| Config | Pre-quant | Quantized | Quant+SW | Steps | Artifact |
|---|---|---|---|---|---|
| 12L SHD=0 | 1.11592 | 1.12730 | | 4,521 | 15.08 MB |
| 12L SHD=16 | 1.11429 | 1.12560 | 1.10185 | 4,483 | 14.66 MB |
| 12L SHD=16 MLP 4.5x | 1.11087 | 1.12159 | 1.09785 | 4,317 | <16.00 MB |

Key insight: depth > width. Going from 11 to 12 layers was more effective than widening the MLP at the same layer count, because a wider MLP slows training (fewer steps in 600 s) while an extra layer's speed cost is moderate.

Rule Compliance

  • Sliding window eval enabled (SLIDING_WINDOW_ENABLED=1)
  • No TTT (TTT_ENABLED=0)
  • Artifact <= 16 MB (max 15,998,411 B across 3 seeds)
  • Training <= 600s on 8xH100 SXM
  • No validation data during training

Credits

Built on PR #1493 (@bigbag).

….1218

Architecture changes from bigbag baseline (11L x 512d MLP 4x):
- 12 layers (up from 11): +1 unique non-looped layer
- Shared-Specific Attention (shared_head_dim=16): 25/75 Q/K split
  shared across heads, saves ~380KB artifact with no BPB cost
- MLP 4.5x (up from 4.0x): fills remaining artifact budget
- 38.3M params, artifact ~15.99 MB (fits 16 MB limit)

3-seed results (8xH100 SXM, 600s training):
  Seed 42:  quantized BPB 1.12159, artifact 15,996,287 bytes
  Seed 314: quantized BPB 1.12270, artifact 15,994,335 bytes
  Seed 999: quantized BPB 1.12124, artifact 15,998,411 bytes
  Mean:     quantized BPB 1.12184 (-0.0055 vs 11L baseline 1.12732)

Sliding window eval (quantized):
  Seed 42:  1.09785  Seed 314: 1.09907  Seed 999: 1.09746
  Mean:     1.09813

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aruniyer changed the title from "Record: 12L Shared-Specific Attention (d=16) + MLP 4.5x (3-seed mean BPB 1.1218)" to "Record: 12L Shared-Specific Attention (d=16) + MLP 4.5x (3-seed mean val_bpb 1.0981)" on Apr 22, 2026