
Record: L_ParallelResiduals_UNetSkips_DepthRecur_Muon_1.0901 #1893

Open
Hieuabssy wants to merge 5 commits into openai:main from Hieuabssy:hieufinal

Conversation

Hieuabssy commented Apr 28, 2026

# Submission: 11-Layer GPT with Parallel Residuals & U-Net Skips (track_10min_16mb)

(Building upon the foundation laid in PR #1445)

Hello team, this PR introduces our heavily optimized GPT-2-inspired model, tailored specifically to the track_10min_16mb track. By maximizing parameter utilization and aggressively compressing the weights, we achieved a highly competitive Val BPB of 1.0901. The solution fully complies with the strict 10-minute wall-clock training envelope and the 16.0 MB maximum storage constraint.

Below is a detailed breakdown of the methods introduced in this submission.


1. Architectural Innovations

To squeeze the most capacity out of our ~74M-parameter budget (11 layers, 512 dim, 8 heads, 4 KV heads), we restructured the forward-pass semantics:

  • U-Net Style Skip Connections: We implemented a logical encoder-decoder split. State tensors from the early encoder layers are cached and injected directly into the deep decoder layers via learnable scalar gates (skip_gates with sigmoid activation). This combats vanishing gradients and routes low-level lexical features to where they are most needed (see the combined sketch after this list).
  • Late-Stage Parallel Residuals: Starting at PARALLEL_START_LAYER = 7, processing forks into lane0 (Attention) and lane1 (MLP). Both lanes read the same input state and compute simultaneously; a lane_merge parameter then smoothly merges the two outputs before normalization. This increases effective network width and shortens the sequential depth of the graph.
  • Dynamic Depth Recurrence: To increase effective depth without adding parameters, layers [3, 4, 5] are traversed multiple times in a single forward pass. To keep convergence stable, this feature is switched on at RECUR_START_STEP = 3000.
  • Value Embeddings (VE): Extra learned embeddings are injected directly into the Multi-Head Attention blocks at the late layers ([9, 10]), offloading routing complexity from the MLPs.
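
To make the interplay of these mechanisms concrete, here is a minimal sketch of the modified forward pass. PARALLEL_START_LAYER, skip_gates, lane_merge, and RECUR_START_STEP are the names used in this PR; everything else (the Block interface with attn/mlp/ln1/ln2, the encoder depth of 3, the sigmoid lane blend, and a single extra traversal per recurrent layer) is an illustrative assumption, not the submitted code.

```python
import torch
import torch.nn as nn

N_LAYERS = 11
PARALLEL_START_LAYER = 7      # from the PR
ENCODER_LAYERS = 3            # assumption: cache states from the first 3 layers
RECUR_LAYERS = (3, 4, 5)      # from the PR
RECUR_START_STEP = 3000       # from the PR

class SkipGPTBody(nn.Module):
    """Illustrative transformer body: U-Net skips + parallel lanes + recurrence."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks  # assumed: each block exposes attn, mlp, ln1, ln2
        # one learnable scalar gate per cached encoder state
        self.skip_gates = nn.Parameter(torch.zeros(ENCODER_LAYERS))
        # per-layer scalar controlling the attention/MLP lane blend
        self.lane_merge = nn.Parameter(torch.zeros(N_LAYERS - PARALLEL_START_LAYER))

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        cache = []
        for i, block in enumerate(self.blocks):
            if i < PARALLEL_START_LAYER:
                x = block(x)
                # depth recurrence: re-run the middle layers after warm-up
                if i in RECUR_LAYERS and step >= RECUR_START_STEP:
                    x = block(x)
                if i < ENCODER_LAYERS:
                    cache.append(x)  # encoder state for the mirrored decoder layer
            else:
                # U-Net skip: gate the mirrored encoder state into the decoder
                j = N_LAYERS - 1 - i
                if j < len(cache):
                    x = x + torch.sigmoid(self.skip_gates[j]) * cache[j]
                # parallel residual: both lanes read the same input state
                lane0 = block.attn(block.ln1(x))   # attention lane
                lane1 = block.mlp(block.ln2(x))    # MLP lane
                w = torch.sigmoid(self.lane_merge[i - PARALLEL_START_LAYER])
                x = x + w * lane0 + (1.0 - w) * lane1
        return x
```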

2. Heterogeneous Training & Optimization

We deployed a two-part optimization strategy, split by tensor dimensionality:

  • Muon Optimizer: Exclusively handles the 2D matrix weights, orthogonalizing each update with 5-step Newton-Schulz (NS5) iterations (sketched below). Configured with an aggressive momentum of 0.99 (warmed up over 1500 steps) and a weight decay of 0.095; the matrix LR is set to 0.022.
  • AdamW Optimizer: Manages all 1D/0D parameters (embeddings, biases, layer norms, skip gates) with a conservative LR of 0.02 to keep these small tensors stable.
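
For reference, below is a sketch of the Newton-Schulz orthogonalization step that Muon applies to each matrix update. The quintic coefficients follow the widely circulated Muon reference implementation, not code from this PR, and the surrounding momentum/weight-decay/LR logic is omitted.

```python
import torch

@torch.no_grad()
def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update with a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)  # coefficients from the reference Muon
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:                        # iterate on the wide orientation
        X = X.T
    # Frobenius normalization bounds the spectral norm by 1, as the iteration requires
    X = X / (X.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X
```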

Data & Context Lifecycle:
Sequence length is fixed at 2048, fed by coprime-stride data loaders (sketched below). Early stopping is enforced directly via MAX_WALLCLOCK_SECONDS.
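
A minimal sketch of the coprime-stride idea, assuming a flat token array: because the stride is coprime with the number of valid start positions, repeated passes over the data visit differently aligned 2048-token windows. The function and variable names here are hypothetical, not the PR's loader.

```python
import math
import numpy as np

def coprime_stride_batches(tokens: np.ndarray, seq_len: int = 2048, batch_size: int = 8):
    """Yield (x, y) next-token batches whose start offsets advance by a stride
    coprime with the number of valid positions, cycling through all alignments."""
    n = len(tokens) - seq_len - 1       # number of valid window starts
    stride = seq_len + 1
    while math.gcd(stride, n) != 1:     # bump until coprime with n
        stride += 1
    pos = 0
    while True:
        xs, ys = [], []
        for _ in range(batch_size):
            xs.append(tokens[pos:pos + seq_len])
            ys.append(tokens[pos + 1:pos + seq_len + 1])  # targets shifted by one
            pos = (pos + stride) % n
        yield np.stack(xs), np.stack(ys)
```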

3. Brutal 16MB Compression Strategy

Taking a 74M-parameter model down to <16 MB requires three coordinated phases:

  1. GPTQ with ActOrder: Weights are quantized to INT6 using inverse-Hessian information gathered over 64 calibration batches.
  2. Deterministic ±1 Pruning: If the artifact projects to >16 MB, the algorithm targets the ±1 quantized states with the lowest scaled error impact and zero-prunes them until the size fits under the threshold.
  3. Transposed Brotli Packing: A custom _byte_shuffle stride alignment followed by quality-11 Brotli maximizes packing efficiency (see the sketch after this list).
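
A minimal sketch of phase 3, assuming the quantized tensors have already been serialized to a flat byte string. _byte_shuffle is reimplemented generically here (grouping same-significance bytes before compression); the stride would match the element width of the packed tensors, and none of these names are the PR's actual code.

```python
import brotli      # pip install brotli
import numpy as np

def pack_artifact(raw: bytes, stride: int = 4) -> bytes:
    """Byte-shuffle the serialized weights, then Brotli-compress at quality 11.

    Grouping same-significance bytes (all byte-0s, then all byte-1s, ...)
    makes the stream far more repetitive, which Brotli exploits."""
    buf = np.frombuffer(raw, dtype=np.uint8)
    pad = (-len(buf)) % stride
    if pad:  # pad so the buffer splits evenly into stride-wide elements
        buf = np.concatenate([buf, np.zeros(pad, dtype=np.uint8)])
    shuffled = buf.reshape(-1, stride).T.copy().tobytes()
    return brotli.compress(shuffled, quality=11)
```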

4. Sliding Window Evaluation

Block-scoring penalizes sequence edges unfairly, so we replaced it with a strided contextual-window evaluation (EVAL_STRIDE = 64): each token is scored using the fullest available historical context. The evaluation loop is wrapped in torch.compile(dynamic=False, fullgraph=True) to minimize evaluation latency; a sketch follows.
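
A minimal sketch of the strided evaluation loop, assuming the model maps a (batch, time) token tensor to (batch, time, vocab) logits. Only the newly revealed EVAL_STRIDE tokens of each window are scored, so every token is predicted with the longest context the window allows; the torch.compile wrapper from the PR is omitted for clarity.

```python
import torch

EVAL_STRIDE = 64

@torch.no_grad()
def strided_ce(model, tokens: torch.Tensor, seq_len: int = 2048) -> float:
    """Mean cross-entropy (nats/token) with strided contextual windows:
    each window scores only its last EVAL_STRIDE targets."""
    total_nll, counted = 0.0, 0
    for begin in range(0, tokens.numel() - 1, EVAL_STRIDE):
        start = max(0, begin + EVAL_STRIDE - seq_len)   # longest context that fits
        window = tokens[start:begin + EVAL_STRIDE + 1].unsqueeze(0)
        logits = model(window[:, :-1])                  # (1, T, vocab) assumed
        n_new = min(EVAL_STRIDE, tokens.numel() - 1 - begin)
        targets = window[0, -n_new:]                    # only newly revealed tokens
        logp = logits[0, -n_new:].log_softmax(dim=-1)
        total_nll -= logp.gather(-1, targets.unsqueeze(-1)).sum().item()
        counted += n_new
    return total_nll / counted
```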


5. Verified Metrics

Results are aggregated across three distinct seed runs (1337, 42, 1024):

| Metric | Aggregate Average |
| --- | --- |
| Max Training Steps | ~5080 |
| Val Loss (Cross-Entropy) | 2.5084 |
| Val BPB (Bits Per Byte) | 1.0901 |
| Final Artifact Size | 15,976,317 bytes |

Note: The 1.09 BPB suggests that the Depth Recurrence + Parallel Residual + U-Net Skip topology captures unusually dense structure under the FineWeb BPE4096 tokenization.
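
As a sanity check on these numbers: converting the reported cross-entropy (nats per token) to bits and dividing by the reported BPB back-solves an implied ~3.32 bytes per token, a plausible ratio for a 4096-entry BPE vocabulary. The true ratio depends on the validation set and is not stated in the PR.

```python
import math

val_ce = 2.5084                                      # mean cross-entropy per token, in nats
bits_per_token = val_ce / math.log(2)                # ~3.619 bits/token
implied_bytes_per_token = bits_per_token / 1.0901    # ~3.32, back-solved from the BPB
print(f"{bits_per_token:.4f} bits/token -> {implied_bytes_per_token:.2f} bytes/token implied")
```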

Please review the revised codebase and the full README.md for hyperparameter specifics! Let me know if you need any component isolated for closer review.
