Record: Parcae px43 embed7 clip1300 (val_bpb = 1.0878) #1995
Open
User123331 wants to merge 32 commits into openai:main
Conversation
Fix critical bugs: MoS params now included in optimizer groups; use NLL loss (not cross_entropy) since MoS returns log-probs; skip logit softcap for the MoS path; re-normalize after LoRA correction. Low-rank factorization (MOS_RANK=64) keeps the artifact under the 16MB budget. Enable via: USE_MOS=1 MOS_K=2 MOS_RANK=64 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
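The NLL-vs-cross_entropy point in this commit is the key correctness fix: a Mixture of Softmaxes head already outputs log-probabilities, so applying a cross-entropy loss that softmaxes logits would double-softmax. A minimal NumPy sketch of the math (illustrative only; `W_proj`, `W_out`, and `prior_logits` are hypothetical names, and the PR's actual head is low-rank and written in PyTorch):

```python
import numpy as np

def log_softmax(x, axis=-1):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def mos_log_probs(h, W_proj, W_out, prior_logits):
    """Mixture of Softmaxes: p(y|h) = sum_k pi_k(h) * softmax(tanh(h W_k) W_out).
    Returns LOG-probabilities, which is why training must use NLL loss rather
    than cross_entropy-on-logits (that would apply a second softmax)."""
    K = W_proj.shape[0]
    log_prior = log_softmax(prior_logits, axis=-1)              # (B, K)
    comp = np.stack([log_softmax(np.tanh(h @ W_proj[k]) @ W_out)
                     for k in range(K)])                        # (K, B, V)
    joint = comp + log_prior.T[:, :, None]                      # log(pi_k) + log p_k
    m = joint.max(axis=0)
    return m + np.log(np.exp(joint - m).sum(axis=0))            # stable log-sum-exp over K

def nll_loss(log_probs, targets):
    # negative log-likelihood on already-log-probabilities
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Exponentiating the output and summing over the vocabulary gives exactly 1, confirming the head emits a proper (log) distribution rather than raw logits.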
Clones fork, downloads dataset, runs baseline vs. MoS K=2 rank=64 A/B comparison (10 min each on 1x H100).
Baseline bpb already known from prior runs (~1.2244); skipping it saves 10 min of GPU time.
First MoS pilot run: 1113 steps on 1x H100 SXM, 12.8MB artifact. Loss still dropping at the wall-clock cap.
1hr run with MoS K=2 R=64 + WARMDOWN_ITERS=100 on 1x H100. Target: beat the vanilla baseline val_bpb=1.2540 from PR #111.
Training now runs in the background, so it is safe to close the terminal. Monitor with: tail -f /workspace/mos_1h_log.txt
…peed optimizations, SOTA plan:
- techniques_encyclopedia.md: 39-technique catalog with bpb impacts and PR references
- combination_matrix.md: compatibility matrix (++/+/~/−) with stacking recommendations
- speed_optimizations.md: Triton/FA3/fused-kernel research for throughput gains
- PLAN_beat_SOTA.md: phase-by-phase implementation plan targeting <1.13 bpb
MoS rejected after experiments showed it +0.057 bpb worse than baseline.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mixture of Softmax (K=2) output layer integrated with the full SOTA technique stack: 11L Int6 + XSA4 + Partial RoPE + LN Scale + Tight SWA + VE128 + U-Net skips + Late QAT + SmearGate + BigramHash + FA3.
- train_gpt_mos_sota.py: MoS class, FA3 soft fallback, nll_loss branch
- run_mos_sota.sh: MODE=baseline|mos|smoke, auto FA3 selective build
Pings nvidia-smi every 60s in the background to keep the pod active during the FA3 build and other CPU-only phases.
train_gpt_mos_sota.py imports sentencepiece as spm at the top level; without it the script exits immediately on import. numpy is also used directly. Both are now checked and installed before training starts.
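A preflight check like the one this commit describes can be sketched as follows (a minimal illustration, not the PR's actual launcher code; the `ensure_deps` name is hypothetical):

```python
import importlib.util
import subprocess
import sys

# Top-level imports of train_gpt_mos_sota.py that must exist before launch
REQUIRED = ["sentencepiece", "numpy"]

def ensure_deps(packages):
    """Install any package whose import spec is missing, so the training
    script doesn't die immediately on `import sentencepiece as spm`."""
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if missing:
        # install only what's absent; a no-op on a fully provisioned pod
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing
```

Checking `find_spec` first avoids shelling out to pip at all when the environment is already complete, which matters on pods where network access is slow or restricted.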
pip copies the compiled .so into flash_attn_3/ relative to the hopper dir, but that subdir doesn't exist after a fresh clone. All kernels compiled successfully; only the final copy step was failing.
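The fix this commit describes amounts to creating the destination directory before the copy. A small sketch of the idea (the `install_fa3_so` helper is hypothetical, not code from the PR):

```python
import shutil
from pathlib import Path

def install_fa3_so(hopper_dir, so_path):
    """pip's final step copies the built .so into hopper/flash_attn_3/, but
    that subdir doesn't exist after a fresh clone -- create it first."""
    dest_dir = Path(hopper_dir) / "flash_attn_3"
    dest_dir.mkdir(parents=True, exist_ok=True)   # the actual fix
    dest = dest_dir / Path(so_path).name
    shutil.copy2(so_path, dest)                   # preserves metadata of the built artifact
    return dest
```

`exist_ok=True` makes the step idempotent, so re-running the build after a partial failure does not error out.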
Default to disabled for stability on fresh environments.
Parcae Loop Injection-px43-embed7-clip1300
This is a research submission based on the Parcae loop-injection direction from @mikeapedia's PR #1674: Non-record: Parcae Loop Injection + Gemma-style Attention + Gram NS.
What This Architecture Is
The main idea I wanted to test was whether the Parcae-style loop boundary can improve a small recurrent-depth transformer under the 8xH100 / 16MB setting. PR #1674 describes Parcae constrained loop injection as an SSM-inspired boundary condition at loop re-entry points: instead of passing the recurrent hidden state through unchanged, the loop boundary learns a stable decay term and a residual re-injection term from the original stream. In my run, this is combined with the px43/embed7/clip1300 compression setup and evaluated with the legal sliding-window path.
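The boundary condition described above can be sketched in a few lines. This is one reading of the idea, not PR #1674's actual code, and `decay_logit` and `inject_scale` are hypothetical names for the learned parameters:

```python
import numpy as np

def parcae_boundary(h, x0, decay_logit, inject_scale):
    """SSM-inspired loop re-entry: instead of passing the recurrent hidden
    state h through unchanged, decay it with a gate constrained to (0, 1)
    and re-inject a scaled residual of the original embedding stream x0."""
    a = 1.0 / (1.0 + np.exp(-decay_logit))   # sigmoid keeps the decay in (0, 1) -> stable loop
    return a * h + inject_scale * x0
```

Constraining the decay to (0, 1) is what makes the recurrence stable across repeated loop traversals: the contribution of old state shrinks geometrically, while the re-injection term keeps the loop anchored to the original stream.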
The submitted package uses:
Tokenizer / Data
This run uses the Mikeapedia SP8192 tokenizer and pretokenized data from:
datasets/tokenizers/fineweb_8192_bpe.model
The tokenizer SHA256 used by the runner is:
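Pinning the tokenizer by SHA256 lets the runner verify it downloaded the exact model this run was trained with. A minimal streaming hash check (a sketch; the runner's actual verification code may differ):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1MB chunks so large tokenizer/data files never need
    to be loaded into memory; compare against the pinned hash before training."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()
```

Usage would be a single comparison before launch, e.g. asserting `sha256_of("datasets/tokenizers/fineweb_8192_bpe.model")` equals the pinned value and aborting otherwise.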
Results
Three completed 8xH100 seeds are included:
Credits
Thanks to @mikeapedia for PR #1674 and the Parcae loop-injection research direction, plus the public Mikeapedia SP8192 tokenizer/data bundle used here. PR #1674 also points to its upstream inspirations, including xIELU/per-layer QK-gain work and the Parcae paper lineage; this experiment is an attempt to test that family of ideas under a 3-seed 8xH100 run.
Thanks also to the Parameter Golf community for the prior work on depth recurrence, QK gain, GPTQ, SP8192 tokenization, and compression/eval tooling that this run builds on.