Record: 12L Shared-Specific Attention (d=16) + MLP 4.5x (3-seed mean val_bpb 1.0981)#1774
Open
aruniyer wants to merge 1 commit into openai:main
Conversation
….1218

Architecture changes from bigbag baseline (11L x 512d MLP 4x):
- 12 layers (up from 11): +1 unique non-looped layer
- Shared-Specific Attention (shared_head_dim=16): 25/75 Q/K split shared across heads, saves ~380KB artifact with no BPB cost
- MLP 4.5x (up from 4.0x): fills remaining artifact budget
- 38.3M params, artifact ~15.99 MB (fits 16 MB limit)

3-seed results (8xH100 SXM, 600s training):
- Seed 42: quantized BPB 1.12159, artifact 15,996,287 bytes
- Seed 314: quantized BPB 1.12270, artifact 15,994,335 bytes
- Seed 999: quantized BPB 1.12124, artifact 15,998,411 bytes
- Mean: quantized BPB 1.12184 (-0.0055 vs 11L baseline 1.12732)

Sliding window eval (quantized):
- Seed 42: 1.09785
- Seed 314: 1.09907
- Seed 999: 1.09746
- Mean: 1.09813

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
val_bpb = 1.0981 (3-seed mean, std 0.00083) | ~15.99 MB | 8xH100 SXM
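The headline mean and std can be reproduced from the three per-seed sliding-window values reported below; a quick check with Python's `statistics` module (the reported std 0.00083 is consistent with the sample std up to rounding):

```python
import statistics

# Per-seed quantized sliding-window BPB values (seeds 42, 314, 999)
vals = [1.09785, 1.09907, 1.09746]

mean = statistics.fmean(vals)
std = statistics.stdev(vals)  # sample std (ddof=1)

print(round(mean, 5), round(std, 5))  # → 1.09813 0.00084
```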
12 layers + Shared-Specific Attention (shared_head_dim=16) + MLP 4.5x
Built on bigbag's baseline (PR #1493, 11L x 512d MLP 4x). The code is verified structurally identical to PR #1493 except for three additions: shared-specific attention, +1 layer, and a wider MLP. No TTT.
Shared-Specific Attention compresses the Q/K projections by averaging 25% of each head's dimensions across all heads, saving ~416 KB in the quantized artifact at near-zero BPB cost. PR #1493's 11L model already uses ~15.99 MB of the 16 MB budget; without the shared head dimensions (SHD) there is no room for either a deeper or a wider model. SHD frees the budget needed for both the extra layer and the wider MLP.
Architecture Changes
The shared-specific attention mechanism splits each head's Q/K projections into specific_dim=48 (unique per head) and shared_dim=16 (averaged across all heads). V projections stay fully independent. RoPE positional encoding (16 dims) applies only to the specific dimensions. The sharing is baked into the weights before GPTQ quantization, so it adds zero inference cost.

Artifact budget (all 8xH100 sp8192, same code_bytes):
SHD=16 on 12L saves ~416 KB (15.08 -> 14.66 MB), creating just enough room to widen MLP from 4.0x to 4.5x and fill the budget.
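The "baking" step described above can be sketched as a plain weight transform. This is a hypothetical NumPy illustration, not the PR's code; n_heads=8 is an assumption (8 heads x 64 head_dim = 512 = d_model), and the layout placing the shared slice in the last `shared_dim` rows of each head is likewise assumed:

```python
import numpy as np

d_model, n_heads = 512, 8
specific_dim, shared_dim = 48, 16
head_dim = specific_dim + shared_dim  # 64

rng = np.random.default_rng(0)
# Per-head Q weights: (n_heads, head_dim, d_model); same treatment applies to K.
w_q = rng.standard_normal((n_heads, head_dim, d_model))

def bake_shared(w):
    """Replace each head's shared slice with the cross-head average."""
    w = w.copy()
    shared = w[:, specific_dim:, :].mean(axis=0)  # (shared_dim, d_model)
    w[:, specific_dim:, :] = shared               # broadcast back to all heads
    return w

w_q_baked = bake_shared(w_q)

# After baking, every head carries identical shared rows, so the artifact
# only needs to store them once: (n_heads-1) * shared_dim * d_model fewer
# unique Q weights per layer (and the same again for K).
assert all(
    np.allclose(w_q_baked[h, specific_dim:], w_q_baked[0, specific_dim:])
    for h in range(n_heads)
)
```

Because the averaging happens before GPTQ quantization, inference sees ordinary dense Q/K projections and pays no extra cost.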
Parameters: 38.3M | Steps: ~4,317 in 600s | Speed: ~5.9M tok/s on 8xH100 SXM
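The throughput and step-count figures are mutually consistent; a back-of-envelope check (all inputs are the rounded numbers above, so the result is approximate):

```python
# ~4,317 steps in 600 s at ~5.9M tok/s implies the effective tokens per step.
tokens_total = 5.9e6 * 600            # ≈ 3.54B tokens in the 600 s budget
tokens_per_step = tokens_total / 4317
print(f"{tokens_per_step:,.0f} tokens/step")  # ≈ 820k tokens/step
```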
3-Seed Validation (8xH100 SXM, 600s)
PR #1493 reference (bigbag, 11L with TTT):
Our no-TTT sliding window BPB (1.0981) does not surpass bigbag's TTT result (1.0810). The contribution here is the shared-specific attention mechanism, which compresses the artifact and enables deeper+wider architectures within the same 16 MB limit.
Ablation Table (8xH100 SXM, seed=42, sp8192)
Key insight: depth > width. Going from 11->12 layers was more effective than widening MLP at the same layer count, because wider MLP costs training speed (fewer steps in 600s) while an extra layer's speed cost is moderate.
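The two options are roughly comparable in parameter (and hence artifact) cost, which is why training speed becomes the deciding factor. A hedged back-of-envelope, assuming d=512 and a plain two-matrix MLP (neither detail is stated explicitly above):

```python
d = 512
attn = 4 * d * d                          # Q, K, V, O projections
mlp = lambda ratio: 2 * d * int(ratio * d)

extra_layer = attn + mlp(4.0)             # one more 11L-style layer
widen_all = 11 * (mlp(4.5) - mlp(4.0))    # widen MLP 4.0x -> 4.5x on 11 layers

print(extra_layer, widen_all)  # → 3145728 2883584
```

Both land near 3M parameters, so the artifact budget is nearly indifferent between them; the extra layer wins because it costs fewer steps in the fixed 600 s window.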
Rule Compliance
Credits
Built on PR #1493 (@bigbag).