Ouroboros — 1.13727008 val_bpb (seed 444)#1283
Closed
newjordan wants to merge 405 commits into openai:main from
Conversation
…ix mlp=6.0 in arms Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove permanently-disabled features:
- DeltaNet (DeltaNetMemory, CanonicalDeltaNet, FLA import)
- MTP heads and loss computation
- LATE_QAT branch
- All GPTQ functions (gptq_calibrate, loop_aware, mixed_quantize_int6_gptq)
- GPTQ Hessian collection hooks in training loop
- Nitrust bridge
- EMA accumulation loop (SKIP_EMA=1 locked)
Naive int6 + zstd compression pipeline, crawler architecture, and training loop all intact.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tall Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Config verified: dim=512, 4F+1C×3, mlp=6.0, SKIP_GPTQ=1. Beats CL3 3-seed mean (1.18742) by 0.00126. Seed 444 confirmed good. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…warmdown=0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was only downloading dataset shards, not the tokenizer — caused crash on fresh pod. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces wallclock cap with ITERATIONS= so proxy runs use identical training compute on any GPU count. Default 500 steps (~6 min on 1xH100). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
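The fixed-step override described above can be sketched as a small env-var resolver (the function name is hypothetical; only the `ITERATIONS` variable and the 500-step default come from the commit message):

```python
import os

# Hypothetical sketch of the ITERATIONS= override: a fixed step count
# replaces the wallclock cap, so proxy runs spend identical training
# compute regardless of GPU count.
def resolve_num_iterations(default_steps: int = 500) -> int:
    """Read ITERATIONS from the environment, falling back to the default."""
    raw = os.environ.get("ITERATIONS", "")
    return int(raw) if raw else default_steps

os.environ["ITERATIONS"] = "2000"
print(resolve_num_iterations())  # honors the env var when set
```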
The BW ablation never ran BW-00 (4F+1C) as a 500-step proxy arm; all comparisons were against a full-run anchor. This experiment closes that gap with two arms:
- BW2-00: 4F+1C, XSA=11 (the missing control)
- BW2-01: 5F+1C, XSA=14 (proportional XSA coverage for the 18-block model)
Also commits the BW ablation results log (2026-03-30).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Results (500 steps):
- BW2-00 (4F+1C control): 1.52365 BPB
- BW2-01 (5F+1C, XSA=14): 1.52963 BPB
- BW-03 (5F+1C, XSA=11, ref): 1.54404 BPB
4F+1C wins by 0.020 BPB over 5F+1C at equal compute. Raw learning is identical (~1.424 val_bpb); the delta is entirely quantization robustness. BW-03's apparent win was an artifact of the missing proxy control arm.
Secondary: XSA coverage is a quant-robustness lever (XSA=14 recovered 0.015 BPB vs XSA=11 for 5F). CL3 config (4F+1C) confirmed correct.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether wider XSA (cross-block attention bandwidth) reduces the quantization gap on the optimal 4F+1C model. BW5F established that raw learning is unaffected by XSA; the gain is purely quant robustness.
- BWXSA-01: XSA_LAST_N=13 (87% of 15 blocks)
- BWXSA-02: XSA_LAST_N=15 (100%, the ceiling)
- Control: XSA_LAST_N=11 (73%, BW2-00: 1.52365), carried
The script records step_avg alongside BPB to directly measure the speed tradeoff.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New train_gpt.py with a CRAWLER_MLP_LEAKY_SLOPE env var that separates the crawler block's leaky slope from the flat blocks (which stay locked at 0.5). Default is 0.5, bit-equivalent to all prior runs when unset.
4 surgical edits to train_gpt.py only (new file, tested scripts untouched):
- env var parse for CRAWLER_MLP_LEAKY_SLOPE
- new param on CrawlerGPT.__init__
- crawler_blocks construction uses crawler_mlp_leaky_slope
- build_model() threads the new param through
5 arms: slope=0.5 (control repin), 0.0, 0.25, 0.75, 1.0. BW3-00 must match BW2-00 (1.52365) to validate the code change before reading results.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
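A minimal sketch of the env-var split described above, assuming a plain leaky-ReLU activation (the helper function is hypothetical; only the variable name and the 0.5 default come from the commit):

```python
import os

# Crawler slope is env-controlled; defaulting to 0.5 keeps unset runs
# bit-identical to prior runs. Flat blocks keep their own locked 0.5.
CRAWLER_MLP_LEAKY_SLOPE = float(os.environ.get("CRAWLER_MLP_LEAKY_SLOPE", "0.5"))

def leaky_relu(x: float, slope: float) -> float:
    """Leaky ReLU: identity for x >= 0, scaled by `slope` below zero."""
    return x if x >= 0 else slope * x

print(leaky_relu(-2.0, 0.5))                      # flat-block activation (locked)
print(leaky_relu(-2.0, CRAWLER_MLP_LEAKY_SLOPE))  # crawler activation (env-tunable)
```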
- BWXSA-01 (XSA=13): 1.51982 BPB, 530ms/step
- BWXSA-02 (XSA=15): 1.51431 BPB, 514ms/step ← PROMOTED
- Control (XSA=11): 1.52365 BPB, 546ms/step
Counter-intuitive: full XSA coverage is 32ms/step FASTER than baseline. The quant gap shrinks monotonically (0.099 → 0.095 → 0.090); the mechanism is cross-block bandwidth smoothing the quantization perturbation. XSA=15 is both the BPB ceiling and the speed optimum. Gate at 2000 steps before 8×H100. Pending: combine with the crawler_mlp slope winner.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrawlerMLP class with loop-specific choke_down/choke_up pairs
(512→3072→act→[C per-loop]→act→512). Flat blocks unchanged. Sweep
covers choke_dim ∈ {0, 32, 128, 256, 512} to find optimal quant
surface reduction. BWC-00 (dim=0) is control repin targeting 1.52365.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
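The choke wiring above can be sketched as follows. This is a hedged reconstruction from the commit message alone: the class name and the 512→3072→act→choke→act→512 path are from the text, but the exact layer wiring, bias choices, and loop indexing are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a CrawlerMLP with per-loop choke_down/choke_up pairs.
# choke_dim=0 is the control (no choke, plain MLP).
class CrawlerMLP(nn.Module):
    def __init__(self, dim=512, hidden=3072, choke_dim=128, num_loops=2, slope=0.5):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)
        self.act = nn.LeakyReLU(slope)
        self.choke_dim = choke_dim
        if choke_dim > 0:
            # one choke_down/choke_up pair per crawler loop
            self.choke_down = nn.ModuleList(
                nn.Linear(hidden, choke_dim, bias=False) for _ in range(num_loops))
            self.choke_up = nn.ModuleList(
                nn.Linear(choke_dim, hidden, bias=False) for _ in range(num_loops))

    def forward(self, x, loop_idx=0):
        h = self.act(self.up(x))                       # 512 -> 3072 -> act
        if self.choke_dim > 0:                         # per-loop bottleneck
            h = self.act(self.choke_up[loop_idx](self.choke_down[loop_idx](h)))
        return self.down(h)                            # -> 512

x = torch.randn(2, 16, 512)
print(CrawlerMLP()(x, loop_idx=1).shape)
```

The motivation stated in the commit is quantization-surface reduction: the choke shrinks the activation surface the quantizer must cover inside the loop.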
…er loops Adds LoopSmearGate (~512 scalars, no matmuls) that blends each crawler loop output with the previous loop output. Loop 0 smears against the encoder output as a stable anchor. Attacks depth-compounding quant error directly at loop boundaries. BWS-00/01 on/off ablation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
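A hedged sketch of the LoopSmearGate idea: roughly dim learned scalars and no matmuls, blending each loop's output with the previous loop's output (loop 0 against the encoder output). The sigmoid gating and zero init are assumptions, not confirmed details.

```python
import torch
import torch.nn as nn

class LoopSmearGate(nn.Module):
    """Per-channel scalar blend of the current loop output with the previous one."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # ~dim scalars, no matmuls

    def forward(self, cur, prev):
        g = torch.sigmoid(self.gate)  # in (0, 1); zero init starts at an even blend
        return g * cur + (1 - g) * prev

gate = LoopSmearGate()
enc_out = torch.randn(2, 16, 512)   # loop 0 anchors on the encoder output
loop_out = torch.randn(2, 16, 512)
print(gate(loop_out, enc_out).shape)
```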
Adds encoder tap infrastructure: frozen intermediate encoder layer outputs
projected to tap_dim and injected per-loop via loop_tap_up[loop]. Mirrors
FLOW pattern but anchors crawler to pre-drift encoder signal rather than
self-referential x. Sweeps tap_dim ∈ {16,32,64}, loop specificity, and
which encoder layers to tap. BWT-02 (dim=32, per-loop, all) is core hypothesis.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
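The tap infrastructure above might look like the following sketch. The `loop_tap_up[loop]` name is from the commit; the down-projection name, the detach point, and the additive injection are assumptions.

```python
import torch
import torch.nn as nn

class EncoderTap(nn.Module):
    """Project a frozen encoder activation to tap_dim, re-inject per loop."""
    def __init__(self, dim=512, tap_dim=32, num_loops=2):
        super().__init__()
        self.tap_down = nn.Linear(dim, tap_dim, bias=False)
        self.loop_tap_up = nn.ModuleList(
            nn.Linear(tap_dim, dim, bias=False) for _ in range(num_loops))

    def forward(self, x, enc_act, loop_idx):
        tap = self.tap_down(enc_act.detach())   # pre-drift anchor, gradient-frozen
        return x + self.loop_tap_up[loop_idx](tap)

tap = EncoderTap(tap_dim=32)
x = torch.randn(2, 16, 512)
enc = torch.randn(2, 16, 512)   # intermediate encoder layer output
print(tap(x, enc, loop_idx=1).shape)
```

The design contrast the commit draws is FLOW-style injection anchored to the encoder signal rather than to the crawler's own (self-referential) stream.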
Adds CRAWLER_LOOP_ROPE_SCALES: divides inv_freq per loop to widen attention range without extra parameters. CausalSelfAttention.forward and Block.forward accept an optional cos_sin tuple; _run_crawler pre-computes it per loop.
run_all_ablations.sh runs all 4 series (choke/smear/tap/battery = 20 arms) sequentially using the unified train_gpt.py and prints a ranked summary with winners.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
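Per-loop RoPE scaling by dividing `inv_freq` can be sketched as below (the function name and base frequency are assumptions; the divide-by-scale mechanism is from the commit):

```python
import torch

def loop_cos_sin(seq_len, head_dim, loop_scale, base=10000.0):
    """Precompute a (cos, sin) RoPE table with a loop-specific frequency scale."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    inv_freq = inv_freq / loop_scale   # per-loop widening: lower freq = longer range
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)   # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

# e.g. a scale of 9 stretches every rotation wavelength 9x for that loop,
# with zero extra parameters; the tuple is passed into attention as cos_sin.
cos, sin = loop_cos_sin(seq_len=128, head_dim=64, loop_scale=9.0)
print(cos.shape)  # torch.Size([128, 32])
```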
ONE variable: swap int6+zstd(22) for int6+brotli(q11). 1k step gate to check artifact size delta and confirm no blowups before full run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parent: BWX 9F (1.13868 int6_sw_bpb, 15.24MB)
Changes: brotli compression (approved baseline) + GPTQ enabled (SKIP_GPTQ=0)
Expected: ~-0.002 BPB from GPTQ; artifact stays under 16MB via brotli savings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Best-foot-forward production run on the BWX 9F base:
- LOOP_AWARE_GPTQ=1 (confirmed -0.00380 BPB in BW10; NOT standard GPTQ)
- QK_GAIN_INIT=4.0 (high-confidence, -0.006 external signal)
- CRAWLER_LOOPS=2 (BW17 RAPID: -0.054 directional, faster steps)
- Brotli compression (approved from the BW20 gate)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy of SLOT_brotli with two eval-only changes:
- delta shape (1,1,dim) -> (bsz,1,dim) for per-sample adaptation
- SLOT_STEPS 8 -> 24
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3-seed confirmed crawler SOTA (Bandit Wagon XI):
- seed 444: 1.13727008 BPB, 15,034,550 B
- seed 4: 1.13565882 BPB, 15,042,594 B
- seed 300: 1.13638653 BPB, 15,049,936 B
- mean: 1.13643848 BPB
9F crawler + loop-aware GPTQ + QK4 + 2-loop cadence + brotli.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
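The quoted 3-seed mean checks out against the per-seed numbers:

```python
# Recompute the 3-seed mean from the per-seed val BPBs reported above.
seed_bpbs = {444: 1.13727008, 4: 1.13565882, 300: 1.13638653}
mean_bpb = sum(seed_bpbs.values()) / len(seed_bpbs)
print(round(mean_bpb, 8))  # 1.13643848
```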
Ouroboros
9-flat crawler with loop-aware GPTQ, QK gain 4.0, 2-loop cadence, and brotli compression — stacking five research signals on the Bandit Wagon 9F platform.
Results
Hardware: 8×H100 SXM · 600s wallclock
bytes_code: 121,677
Architecture
9-flat crawler with recurrent refinement: 9 unique flat transformer blocks followed by 1 shared crawler block looping 2× with differentiated RoPE scales (9,1,1). 26.25M parameters, ~100.85ms/step.
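The flat-then-loop cadence can be sketched with stand-in blocks (the class names and the toy residual block are hypothetical; the 9-unique-blocks-plus-one-shared-block-looped-2x structure is from the description):

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in residual block, not the real transformer block."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return x + self.proj(x)

class Crawler9F(nn.Module):
    def __init__(self, dim=512, n_flat=9, crawler_loops=2):
        super().__init__()
        self.flat = nn.ModuleList(TinyBlock(dim) for _ in range(n_flat))
        self.crawler = TinyBlock(dim)       # ONE shared block, applied repeatedly
        self.crawler_loops = crawler_loops

    def forward(self, x):
        for blk in self.flat:               # 9 unique flat blocks, once each
            x = blk(x)
        for _ in range(self.crawler_loops): # weights shared across loops:
            x = self.crawler(x)             # depth is reused, not re-parameterized
        return x

x = torch.randn(2, 16, 512)
print(Crawler9F()(x).shape)
```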
Research signals stacked
This submission is the product of a systematic crawler research program beginning with the beloved but ill-fated frugendorff. This is where the dream of recursion lives; it will not die:
Key finding from our crawler signal analysis: at this configuration, the crawler's advantage is 85% width and 15% implicit regularization. This shifted focus from adding crawler complexity toward maximizing flat depth, reducing loop overhead, and improving post-training quantization. Next up: an inverse kramer resolution with a 6-7 oscillator.
Parent: Bandit Wagon X (BWX 9F)
Reproduce