Record: L_ParallelResiduals_UNetSkips_DepthRecur_Muon_1.0901 #1893
Open
Hieuabssy wants to merge 5 commits into openai:main from
leakyRelu0.5^2 + GPTQ + EMA + BigramHash(1.3069)
# Submission: 11-Layer GPT with Parallel Residuals & U-Net Skips (track_10min_16mb)
(Building upon the foundation laid in PR #1445)
Hello team, this PR introduces our heavily optimized GPT-2-inspired model, tailored specifically to dominate the `track_10min_16mb` track. By maximizing parameter utilization and rigorously compressing the weights, we achieved a highly competitive Val BPB of 1.0901. This solution fully complies with the strict 10-minute wall-clock training envelope and the 16.0 MB maximum storage constraint.

Below is an exhaustive breakdown of the methodologies introduced in this submission.
## 1. Architectural Innovations
To extract maximum entropy from our ~74M parameter budget (11 Layers, 512 Dim, 8 Heads, 4 KV Heads), we restructured the forward pass semantics:
- **U-Net Skips:** gated long-range connections (via learned `skip_gates` with sigmoid activation). This combats vanishing gradients and fluidly routes low-level lexical features where they are most needed.
- **Parallel Residuals:** from `PARALLEL_START_LAYER = 7`, the processing forks into `lane0` (Attention) and `lane1` (MLP). Both lanes digest the initial state and compute simultaneously. A `lane_merge` parameter then smoothly concatenates the outputs before normalization. This drastically increases effective network width and lowers deep-graph latency.
- **Depth Recurrence:** layers `[3, 4, 5]` are physically traversed multiple times in a single forward pass. To ensure stable convergence, this feature is toggled on seamlessly at `RECUR_START_STEP = 3000`.
- … (layers `[9, 10]`), offloading routing complexity from the MLPs.

## 2. Heterogeneous Training & Optimization
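For reviewers skimming section 1 above: the gated skip and lane fork can be sketched in a few lines. This is a minimal illustration, not the PR's code — `attn_fn`, `mlp_fn`, the plain-list tensors, and the reading of `lane_merge` as a learned scalar sigmoid blend (rather than a literal concat-and-project) are all assumptions:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_unet_skip(x, skip, gate):
    # Blend an early-layer activation back into the residual stream,
    # scaled by a learned sigmoid gate (one reading of `skip_gates`).
    g = sigmoid(gate)
    return [xi + g * si for xi, si in zip(x, skip)]

def parallel_block(x, attn_fn, mlp_fn, lane_merge):
    # From PARALLEL_START_LAYER onward: both lanes read the SAME input
    # instead of running sequentially (attention -> MLP).
    lane0 = attn_fn(x)       # attention lane
    lane1 = mlp_fn(x)        # MLP lane
    g = sigmoid(lane_merge)  # learned mixing coefficient (assumed scalar)
    return [xi + g * a + (1.0 - g) * m for xi, a, m in zip(x, lane0, lane1)]
```

With `lane_merge = 0` the two lanes contribute equally, so the fork degrades gracefully to an averaged parallel residual at initialization.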
We deployed a bipartite optimization strategy separated by tensor geometry:
- **Muon (matrix parameters):** uses Newton-Schulz 5 (NS5) steps to orthogonally update the manifold. Configured with a highly aggressive momentum of `0.99` (warmed up over 1500 steps) and a weight decay of `0.095`. Matrix LR is set to `0.022`.
- … set to `0.02` to strictly bound layout variance.

**Data & Context Lifecycle:**
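Before moving on to the data pipeline: the NS5 step named above is not shown in this description, so here is a stdlib-only sketch of the quintic Newton-Schulz orthogonalization as used in public Muon implementations (the coefficients are from that reference; the pure-list matrix helpers are ours, purely for illustration):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def ns5(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration
    (coefficients from the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = math.sqrt(sum(x * x for row in G for x in row)) + 1e-7
    X = [[x / fro for x in row] for row in G]  # normalize so iteration converges
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AX = matmul(A, X)
        AAX = matmul(A, AX)
        # X <- a*X + b*(A @ X) + c*(A @ A @ X)
        X = [[a * x + b * y + c * z for x, y, z in zip(rx, ry, rz)]
             for rx, ry, rz in zip(X, AX, AAX)]
    return X
```

Five steps push all singular values of the update toward 1 (the iteration deliberately trades precision for speed, so they land near 1 rather than exactly on it).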
Sequence length is locked at `2048`, utilizing coprime-stride loaders. We enforce early stopping directly via `MAX_WALLCLOCK_SECONDS`.

## 3. Brutal 16MB Compression Strategy
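Stepping back to the loaders for a second: "coprime-stride" is not spelled out in this description. One plausible mechanism — a sketch, not the PR's loader — is stepping the window start by a stride coprime to the number of start positions, so every position is visited exactly once before any repeats:

```python
from math import gcd

def coprime_stride_starts(n_positions: int, stride: int):
    """Yield every window start exactly once by stepping with a stride
    coprime to n_positions (one plausible reading of 'coprime-stride')."""
    assert gcd(n_positions, stride) == 1, "stride must be coprime to n_positions"
    pos = 0
    for _ in range(n_positions):
        yield pos
        pos = (pos + stride) % n_positions
```

Because `gcd(stride, n_positions) == 1`, the walk is a full cycle of the residues mod `n_positions`: shuffled-looking coverage with no duplicate windows per epoch.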
Taking a 74M parameter model down to <16MB requires three coordinated phases:
- … `±1` quantized states that have the lowest scaled error impact and zero-prunes them until the size mathematically fits under the threshold.
- … `_byte_shuffle` stride alignment followed by Level 11 Brotli guarantees maximal bit-packing efficiency.

## 4. Sliding Window Evaluation
Block-scoring penalizes sequence edges unfairly. We replaced it with a strided contextual-window evaluation (`EVAL_STRIDE = 64`): each token is evaluated using the fullest available historical context. Wrapped in a `torch.compile(dynamic=False, fullgraph=True)` block, this process minimizes internal evaluation latency.

## 5. Verified Metrics
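For concreteness, the strided scoring from section 4 follows the usual overlapping-window pattern: each token is charged exactly once, from the window that gives it the longest left context. A minimal sketch — `score_fn`, the default sizes, and the omission of the `torch.compile` wrapper are assumptions, not the PR's code:

```python
def strided_nll(score_fn, tokens, window=2048, stride=64):
    """Average per-token loss with overlapping windows.
    `score_fn(chunk)` must return one loss value per position in `chunk`;
    each token is counted once, near the end of its earliest covering window."""
    total, counted, prev_end = 0.0, 0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + window, len(tokens))
        losses = score_fn(tokens[start:end])
        first_new = max(prev_end, start)  # skip tokens already scored
        total += sum(losses[first_new - start:end - start])
        counted += end - first_new
        prev_end = end
        if end == len(tokens):
            break
    return total / counted
```

With `stride` much smaller than `window`, newly counted tokens always sit in the last `stride` positions of their window, which is what removes the unfair edge penalty of disjoint block scoring.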
Results are aggregated across three distinct seed runs (`1337`, `42`, `1024`).

Note: The 1.09 BPB highlights that the Deep Recurrence + Parallel Skip topology is capturing exceptionally dense semantics within the FineWeb BPE4096 constraints.
Please review the revised codebase and the full `README.md` for hyperparameter specifics! Let me know if you need any component isolated for closer review.