
Record: Parcae px43 embed7 clip1300 (val_bpb = 1.0878) #1995

Open

User123331 wants to merge 32 commits into openai:main from User123331:user123331-parcae-px43-embed7-clip1300

Conversation

User123331 commented Apr 30, 2026

Parcae Loop Injection-px43-embed7-clip1300

This is a research submission based on the Parcae loop-injection direction from @mikeapedia's PR #1674 ("Non-record: Parcae Loop Injection + Gemma-style Attention + Gram NS").

What This Architecture Is

The main idea I wanted to test was whether the Parcae-style loop boundary can improve a small recurrent-depth transformer under the 8xH100 / 16MB setting. PR #1674 describes Parcae constrained loop injection as an SSM-inspired boundary condition at loop re-entry points: instead of passing the recurrent hidden state through unchanged, the loop boundary learns a stable decay term and a residual re-injection term from the original stream. In my run, this is combined with the px43/embed7/clip1300 compression setup and evaluated with the legal sliding-window path.
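
To make the boundary concrete, here is a minimal sketch of the idea as I read it from PR #1674: at each loop re-entry, decay the recurrent state and re-inject a gated copy of the original residual stream. Module and parameter names below are hypothetical, not taken from the actual code.

```python
import torch
import torch.nn as nn

class ParcaeLoopBoundary(nn.Module):
    """Illustrative loop-boundary mixer: decay the recurrent state and
    re-inject the original residual stream at each loop re-entry."""

    def __init__(self, dim: int):
        super().__init__()
        # Per-channel decay and injection gates; zero init gives an even
        # half/half mix of state and original stream at the start of training.
        self.decay = nn.Parameter(torch.zeros(dim))
        self.inject = nn.Parameter(torch.zeros(dim))

    def forward(self, hidden: torch.Tensor, stream0: torch.Tensor) -> torch.Tensor:
        # hidden:  recurrent state leaving the looped blocks
        # stream0: the original (pre-loop) residual stream
        a = torch.sigmoid(self.decay)    # stable decay in (0, 1)
        b = torch.sigmoid(self.inject)   # how much of the original stream to re-add
        return a * hidden + b * stream0

def run_looped_blocks(blocks, boundary, x, n_loops: int):
    """Apply the middle blocks n_loops times, mixing at each re-entry."""
    h, stream0 = x, x
    for _ in range(n_loops):
        h = boundary(h, stream0)
        for block in blocks:
            h = block(h)
    return h
```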

The submitted package uses the following; the quantization and compression steps are sketched after the list:

  • recurrent-depth transformer loop structure over the middle blocks
  • QK-gain attention initialization
  • skip gates and tied embedding/head path
  • EMA post-training weights
  • Hessian-aware mixed GPTQ
  • 6-bit matrix quantization and 7-bit embedding quantization
  • Brotli compression
  • final sliding-window evaluation
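
A rough illustration of the quantization and compression bullets follows. It is a sketch only: it uses plain round-to-nearest quantization in place of the Hessian-aware GPTQ actually used, and it Brotli-compresses an int8 buffer rather than the denser 6-bit packing in the real artifact.

```python
import numpy as np
import brotli  # pip install brotli

def quantize_symmetric(w: np.ndarray, bits: int):
    """Round-to-nearest symmetric quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Hypothetical weight matrix standing in for one transformer layer.
w = np.random.randn(768, 768).astype(np.float32)

q6, scale = quantize_symmetric(w, bits=6)
blob = brotli.compress(q6.tobytes(), quality=11)
print(f"raw fp32: {w.nbytes} B, 6-bit + Brotli (loose packing): {len(blob)} B")
```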

Tokenizer / Data

This run uses the Mikeapedia SP8192 tokenizer and pretokenized data from:

The tokenizer SHA256 used by the runner is:

a24fd9326f81c9456e24484aae2a05b209898738a0082f37b085ef2fe873cec7
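
For reproduction it is worth checking the pinned hash before launching a run. A minimal check, assuming a local tokenizer file (the name `sp8192.model` here is a placeholder, not the runner's actual path):

```python
import hashlib

EXPECTED = "a24fd9326f81c9456e24484aae2a05b209898738a0082f37b085ef2fe873cec7"

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256_of("sp8192.model") == EXPECTED, "tokenizer does not match the pinned hash"
```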

Results

Three completed 8xH100 seeds are included:

| Seed | Sliding BPB | Train Time | Eval Time | Artifact Bytes |
|------|-------------|------------|-----------|----------------|
| 42   | 1.08802944  | 600.024s   | 89.275s   | 15,633,824     |
| 1337 | 1.08783878  | 600.117s   | 89.174s   | 15,630,505     |
| 2024 | 1.08760994  | 600.093s   | 89.318s   | 15,630,862     |
| Mean | 1.08782605  | 600.078s   | 89.256s   | 15,631,730     |
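
For context on the metric: bits-per-byte is the summed evaluation NLL converted from nats to bits and divided by the UTF-8 byte count of the eval text, scored with a sliding window so every token keeps long left context. A rough sketch with illustrative names and window sizes (not the harness's actual API):

```python
import math
import torch

@torch.no_grad()
def sliding_bpb(model, tokens, n_bytes, window=2048, stride=1024, device="cuda"):
    """Score tokens with a sliding window, then convert total NLL (nats) to bits per byte."""
    total_nll, pos = 0.0, 0
    while pos + window <= len(tokens):
        chunk = torch.tensor(tokens[pos:pos + window], device=device).unsqueeze(0)
        logp = torch.log_softmax(model(chunk).float(), dim=-1)  # (1, T, vocab)
        # The first window scores every target; later windows score only the newest
        # `stride` targets, so each target keeps >= window - stride tokens of context.
        p0 = 0 if pos == 0 else window - stride - 1
        tgt = chunk[0, p0 + 1:]
        total_nll -= logp[0, p0:-1].gather(-1, tgt.unsqueeze(-1)).sum().item()
        pos += stride
    return total_nll / math.log(2) / n_bytes
```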

Credits

Thanks to @mikeapedia for PR #1674 and the Parcae loop-injection research direction, plus the public Mikeapedia SP8192 tokenizer/data bundle used here. PR #1674 also points to its upstream inspirations, including xIELU/per-layer QK-gain work and the Parcae paper lineage; this experiment is an attempt to test that family of ideas under a 3-seed 8xH100 run.

Thanks also to the Parameter Golf community for the prior work on depth recurrence, QK gain, GPTQ, SP8192 tokenization, and compression/eval tooling that this run builds on.

Billy Endson and others added 30 commits March 21, 2026 02:34
Fix critical bugs: MoS params now included in optimizer groups,
use NLL loss (not cross_entropy) since MoS returns log-probs,
skip logit softcap for MoS path, re-normalize after LoRA correction.
Low-rank factorization (MOS_RANK=64) keeps artifact under 16MB budget.

Enable via: USE_MOS=1 MOS_K=2 MOS_RANK=64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
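
For reference, the loss-function point above is that a MoS head already returns normalized log-probabilities, so the loss must be NLL on those log-probs; cross_entropy expects raw logits and applies log_softmax internally, which would normalize twice. A minimal illustration with made-up shapes:

```python
import torch
import torch.nn.functional as F

# log_probs: (batch, vocab) log-probabilities as a MoS head would produce them.
log_probs = torch.log_softmax(torch.randn(4, 8192), dim=-1)
targets = torch.randint(0, 8192, (4,))

loss = F.nll_loss(log_probs, targets)          # correct for log-prob outputs
# loss = F.cross_entropy(log_probs, targets)   # wrong: re-applies log_softmax
```
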
Clones fork, downloads dataset, runs baseline vs MoS K=2 rank=64
A/B comparison (10 min each on 1x H100).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Baseline bpb already known from prior runs (~1.2244).
Saves 10 min of GPU time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
First MoS pilot run. 1113 steps on 1xH100 SXM, 12.8MB artifact.
Loss still dropping at wallclock cap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1hr run with MoS K=2 R=64 + WARMDOWN_ITERS=100 on 1xH100.
Target: beat vanilla baseline val_bpb=1.2540 from PR#111.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Training now runs in background — safe to close terminal.
Monitor with: tail -f /workspace/mos_1h_log.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…peed optimizations, SOTA plan

- techniques_encyclopedia.md: 39 techniques catalog with bpb impacts and PR references
- combination_matrix.md: Compatibility matrix (++/+/~/−) with stacking recommendations
- speed_optimizations.md: Triton/FA3/fused kernels research for throughput gains
- PLAN_beat_SOTA.md: Phase-by-phase implementation plan targeting <1.13 bpb

MoS rejected after experiments showed +0.057 bpb worse than baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace train_gpt.py with thwu1's openai#1 implementation:
  - 10 layers, 3x MLP, BigramHash(10240), SmearGate
  - Mixed int5/int6 quantization, SWA, sliding eval
  - zstd-22 compression, magnitude pruning

- Add custom tokenizer training pipeline:
  - run_custom_tokenizer_pipeline.sh: all-in-one script
  - data/train_tokenizer.py: SentencePiece trainer

- Add run scripts:
  - run_competitive.sh: SOTA stack with default tokenizer
  - run_competitive_custom_tok.sh: SOTA stack with custom tokenizer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mixture of Softmax (K=2) output layer integrated with full SOTA technique
stack: 11L Int6 + XSA4 + Partial RoPE + LN Scale + Tight SWA + VE128 +
U-Net skips + Late QAT + SmearGate + BigramHash + FA3.

- train_gpt_mos_sota.py: MoS class, FA3 soft fallback, nll_loss branch
- run_mos_sota.sh: MODE=baseline|mos|smoke, auto FA3 selective build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
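
As background on the MoS output layer in this commit, here is a minimal Mixture-of-Softmaxes head: K softmax components over the vocabulary, mixed by per-token weights, returning log-probs (hence the nll_loss branch). Dimensions are illustrative; the actual train_gpt_mos_sota.py additionally uses the rank-64 factorization (MOS_RANK=64) mentioned earlier to stay under the size budget.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Minimal MoS head: mix K softmax components over the vocabulary."""

    def __init__(self, dim: int, vocab: int, k: int = 2):
        super().__init__()
        self.k = k
        self.prior = nn.Linear(dim, k)                  # mixture weights per token
        self.proj = nn.Linear(dim, k * dim)             # one context vector per component
        self.head = nn.Linear(dim, vocab, bias=False)   # shared output projection

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, dim) hidden states -> (batch, vocab) log-probabilities.
        pi = F.log_softmax(self.prior(h), dim=-1)                    # (B, K)
        ctx = torch.tanh(self.proj(h)).view(-1, self.k, h.size(-1))  # (B, K, D)
        comp = F.log_softmax(self.head(ctx), dim=-1)                 # (B, K, V)
        return torch.logsumexp(pi.unsqueeze(-1) + comp, dim=1)       # (B, V)
```
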
Pings nvidia-smi every 60s in background to keep pod active during
FA3 build and other CPU-only phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
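
The keepalive itself is a trivial loop; a sketch of the idea (the actual script in the PR may differ):

```python
import subprocess
import time

# Touch the GPU driver every 60 s so the pod's idle watchdog sees activity
# during CPU-only phases such as the FA3 build.
while True:
    subprocess.run(["nvidia-smi"], stdout=subprocess.DEVNULL, check=False)
    time.sleep(60)
```
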
train_gpt_mos_sota.py imports sentencepiece as spm at the top level;
without it the script exits immediately on import. numpy is also used
directly. Both are now checked and installed before training starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pip copies the compiled .so into flash_attn_3/ relative to the hopper
dir, but that subdir doesn't exist after a fresh clone. All kernels
compiled successfully; only the final copy step was failing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default to disabled for stability on fresh environments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Billy Endson and others added 2 commits March 23, 2026 20:50