
Record: Parcae px43 embed7 clip1300 (val_bpb = 1.0878) #1995

Open

User123331 wants to merge 32 commits into openai:main from User123331:user123331-parcae-px43-embed7-clip1300

Conversation

User123331 commented Apr 30, 2026

Parcae Loop Injection-px43-embed7-clip1300

This is a research submission based on the Parcae loop-injection direction from @mikeapedia's PR #1674 ("Non-record: Parcae Loop Injection + Gemma-style Attention + Gram NS").

What This Architecture Is

The main idea I wanted to test was whether the Parcae-style loop boundary can improve a small recurrent-depth transformer under the 8xH100 / 16MB setting. PR #1674 describes Parcae constrained loop injection as an SSM-inspired boundary condition at loop re-entry points: instead of passing the recurrent hidden state through unchanged, the loop boundary learns a stable decay term and a residual re-injection term from the original stream. In my run, this is combined with the px43/embed7/clip1300 compression setup and evaluated with the legal sliding-window path.
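
To make the boundary concrete, here is a minimal sketch of the idea as I read it from PR #1674: at each loop re-entry, decay the recurrent state and re-inject a gated copy of the original residual stream. Module and parameter names below are hypothetical, not taken from the actual code.

```python
import torch
import torch.nn as nn

class ParcaeLoopBoundary(nn.Module):
    """Illustrative loop-boundary mixer: decay the recurrent state and
    re-inject the original residual stream at each loop re-entry."""

    def __init__(self, dim: int):
        super().__init__()
        # Per-channel decay and injection gates; zero init gives an even
        # half/half mix of state and original stream at the start of training.
        self.decay = nn.Parameter(torch.zeros(dim))
        self.inject = nn.Parameter(torch.zeros(dim))

    def forward(self, hidden: torch.Tensor, stream0: torch.Tensor) -> torch.Tensor:
        # hidden:  recurrent state leaving the looped blocks
        # stream0: the original (pre-loop) residual stream
        a = torch.sigmoid(self.decay)    # stable decay in (0, 1)
        b = torch.sigmoid(self.inject)   # how much of the original stream to re-add
        return a * hidden + b * stream0

def run_looped_blocks(blocks, boundary, x, n_loops: int):
    """Apply the middle blocks n_loops times, mixing at each re-entry."""
    h, stream0 = x, x
    for _ in range(n_loops):
        h = boundary(h, stream0)
        for block in blocks:
            h = block(h)
    return h
```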

The submitted package uses the following; the quantization and compression steps are sketched after the list:

  • recurrent-depth transformer loop structure over the middle blocks
  • QK-gain attention initialization
  • skip gates and tied embedding/head path
  • EMA post-training weights
  • Hessian-aware mixed GPTQ
  • 6-bit matrix quantization and 7-bit embedding quantization
  • Brotli compression
  • final sliding-window evaluation
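
A rough illustration of the quantization and compression bullets follows. It is a sketch only: it uses plain round-to-nearest quantization in place of the Hessian-aware GPTQ actually used, and it Brotli-compresses an int8 buffer rather than the denser 6-bit packing in the real artifact.

```python
import numpy as np
import brotli  # pip install brotli

def quantize_symmetric(w: np.ndarray, bits: int):
    """Round-to-nearest symmetric quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Hypothetical weight matrix standing in for one transformer layer.
w = np.random.randn(768, 768).astype(np.float32)

q6, scale = quantize_symmetric(w, bits=6)
blob = brotli.compress(q6.tobytes(), quality=11)
print(f"raw fp32: {w.nbytes} B, 6-bit + Brotli (loose packing): {len(blob)} B")
```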

Tokenizer / Data

This run uses the Mikeapedia SP8192 tokenizer and pretokenized data from:

The tokenizer SHA256 used by the runner is:

a24fd9326f81c9456e24484aae2a05b209898738a0082f37b085ef2fe873cec7
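
For reproduction it is worth checking the pinned hash before launching a run. A minimal check, assuming a local tokenizer file (the name `sp8192.model` here is a placeholder, not the runner's actual path):

```python
import hashlib

EXPECTED = "a24fd9326f81c9456e24484aae2a05b209898738a0082f37b085ef2fe873cec7"

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256_of("sp8192.model") == EXPECTED, "tokenizer does not match the pinned hash"
```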

Results

Three completed 8xH100 seeds are included:

| Seed | Sliding BPB | Train Time | Eval Time | Artifact Bytes |
|------|-------------|------------|-----------|----------------|
| 42   | 1.08802944  | 600.024s   | 89.275s   | 15,633,824     |
| 1337 | 1.08783878  | 600.117s   | 89.174s   | 15,630,505     |
| 2024 | 1.08760994  | 600.093s   | 89.318s   | 15,630,862     |
| Mean | 1.08782605  | 600.078s   | 89.256s   | 15,631,730     |
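
For context on the metric: bits-per-byte is the summed evaluation NLL converted from nats to bits and divided by the UTF-8 byte count of the eval text, scored with a sliding window so every token keeps long left context. A rough sketch with illustrative names and window sizes (not the harness's actual API):

```python
import math
import torch

@torch.no_grad()
def sliding_bpb(model, tokens, n_bytes, window=2048, stride=1024, device="cuda"):
    """Score tokens with a sliding window, then convert total NLL (nats) to bits per byte."""
    total_nll, pos = 0.0, 0
    while pos + window <= len(tokens):
        chunk = torch.tensor(tokens[pos:pos + window], device=device).unsqueeze(0)
        logp = torch.log_softmax(model(chunk).float(), dim=-1)  # (1, T, vocab)
        # The first window scores every target; later windows score only the newest
        # `stride` targets, so each target keeps >= window - stride tokens of context.
        p0 = 0 if pos == 0 else window - stride - 1
        tgt = chunk[0, p0 + 1:]
        total_nll -= logp[0, p0:-1].gather(-1, tgt.unsqueeze(-1)).sum().item()
        pos += stride
    return total_nll / math.log(2) / n_bytes
```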

Credits

Thanks to @mikeapedia for PR #1674 and the Parcae loop-injection research direction, plus the public Mikeapedia SP8192 tokenizer/data bundle used here. PR #1674 also points to its upstream inspirations, including xIELU/per-layer QK-gain work and the Parcae paper lineage; this experiment is an attempt to test that family of ideas under a 3-seed 8xH100 run.

Thanks also to the Parameter Golf community for the prior work on depth recurrence, QK gain, GPTQ, SP8192 tokenization, and compression/eval tooling that this run builds on.

Billy Endson and others added 30 commits March 21, 2026 02:34
Fix critical bugs: MoS params now included in optimizer groups,
use NLL loss (not cross_entropy) since MoS returns log-probs,
skip logit softcap for MoS path, re-normalize after LoRA correction.
Low-rank factorization (MOS_RANK=64) keeps artifact under 16MB budget.

Enable via: USE_MOS=1 MOS_K=2 MOS_RANK=64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
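
For reference, the loss-function point above is that a MoS head already returns normalized log-probabilities, so the loss must be NLL on those log-probs; cross_entropy expects raw logits and applies log_softmax internally, which would normalize twice. A minimal illustration with made-up shapes:

```python
import torch
import torch.nn.functional as F

# log_probs: (batch, vocab) log-probabilities as a MoS head would produce them.
log_probs = torch.log_softmax(torch.randn(4, 8192), dim=-1)
targets = torch.randint(0, 8192, (4,))

loss = F.nll_loss(log_probs, targets)          # correct for log-prob outputs
# loss = F.cross_entropy(log_probs, targets)   # wrong: re-applies log_softmax
```
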
Clones fork, downloads dataset, runs baseline vs MoS K=2 rank=64
A/B comparison (10 min each on 1x H100).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Baseline bpb already known from prior runs (~1.2244).
Saves 10 min of GPU time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
First MoS pilot run. 1113 steps on 1xH100 SXM, 12.8MB artifact.
Loss still dropping at wallclock cap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1hr run with MoS K=2 R=64 + WARMDOWN_ITERS=100 on 1xH100.
Target: beat vanilla baseline val_bpb=1.2540 from PR#111.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Training now runs in background — safe to close terminal.
Monitor with: tail -f /workspace/mos_1h_log.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…peed optimizations, SOTA plan

- techniques_encyclopedia.md: 39 techniques catalog with bpb impacts and PR references
- combination_matrix.md: Compatibility matrix (++/+/~/−) with stacking recommendations
- speed_optimizations.md: Triton/FA3/fused kernels research for throughput gains
- PLAN_beat_SOTA.md: Phase-by-phase implementation plan targeting <1.13 bpb

MoS rejected after experiments showed +0.057 bpb worse than baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace train_gpt.py with thwu1's openai#1 implementation:
  - 10 layers, 3x MLP, BigramHash(10240), SmearGate
  - Mixed int5/int6 quantization, SWA, sliding eval
  - zstd-22 compression, magnitude pruning

- Add custom tokenizer training pipeline:
  - run_custom_tokenizer_pipeline.sh: all-in-one script
  - data/train_tokenizer.py: SentencePiece trainer

- Add run scripts:
  - run_competitive.sh: SOTA stack with default tokenizer
  - run_competitive_custom_tok.sh: SOTA stack with custom tokenizer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mixture of Softmax (K=2) output layer integrated with full SOTA technique
stack: 11L Int6 + XSA4 + Partial RoPE + LN Scale + Tight SWA + VE128 +
U-Net skips + Late QAT + SmearGate + BigramHash + FA3.

- train_gpt_mos_sota.py: MoS class, FA3 soft fallback, nll_loss branch
- run_mos_sota.sh: MODE=baseline|mos|smoke, auto FA3 selective build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
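
As background on the MoS output layer in this commit, here is a minimal Mixture-of-Softmaxes head: K softmax components over the vocabulary, mixed by per-token weights, returning log-probs (hence the nll_loss branch). Dimensions are illustrative; the actual train_gpt_mos_sota.py additionally uses the rank-64 factorization (MOS_RANK=64) mentioned earlier to stay under the size budget.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Minimal MoS head: mix K softmax components over the vocabulary."""

    def __init__(self, dim: int, vocab: int, k: int = 2):
        super().__init__()
        self.k = k
        self.prior = nn.Linear(dim, k)                  # mixture weights per token
        self.proj = nn.Linear(dim, k * dim)             # one context vector per component
        self.head = nn.Linear(dim, vocab, bias=False)   # shared output projection

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, dim) hidden states -> (batch, vocab) log-probabilities.
        pi = F.log_softmax(self.prior(h), dim=-1)                    # (B, K)
        ctx = torch.tanh(self.proj(h)).view(-1, self.k, h.size(-1))  # (B, K, D)
        comp = F.log_softmax(self.head(ctx), dim=-1)                 # (B, K, V)
        return torch.logsumexp(pi.unsqueeze(-1) + comp, dim=1)       # (B, V)
```
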
Pings nvidia-smi every 60s in background to keep pod active during
FA3 build and other CPU-only phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
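
The keepalive itself is a trivial loop; a sketch of the idea (the actual script in the PR may differ):

```python
import subprocess
import time

# Touch the GPU driver every 60 s so the pod's idle watchdog sees activity
# during CPU-only phases such as the FA3 build.
while True:
    subprocess.run(["nvidia-smi"], stdout=subprocess.DEVNULL, check=False)
    time.sleep(60)
```
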
train_gpt_mos_sota.py imports sentencepiece as spm at the top level;
without it the script exits immediately on import. numpy is also used
directly. Both are now checked and installed before training starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pip copies the compiled .so into flash_attn_3/ relative to the hopper
dir, but that subdir doesn't exist after a fresh clone. All kernels
compiled successfully; only the final copy step was failing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default to disabled for stability on fresh environments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Billy Endson and others added 2 commits March 23, 2026 20:50