
Ouroboros — 1.13727008 val_bpb (seed 444)#1283

Closed
newjordan wants to merge 405 commits into openai:main from newjordan:submission/ouroboros

Conversation


@newjordan commented Apr 3, 2026


Ouroboros

9-flat crawler with loop-aware GPTQ, QK gain 4.0, 2-loop cadence, and brotli compression — stacking five research signals on the Bandit Wagon 9F platform.

Results

| Seed | val_bpb (sliding window) | Steps | Size |
|---|---|---|---|
| 444 | 1.13727008 | 5951 | 15,034,550 B |
| 4 | 1.13565882 | 5963 | 15,042,594 B |
| 300 | 1.13638653 | 5948 | 15,049,936 B |
| mean | 1.13643848 | | 15,049,936 B |

Hardware: 8×H100 SXM · 600s wallclock · bytes_code: 121,677

Architecture

9-flat crawler with recurrent refinement: 9 unique flat transformer blocks followed by 1 shared crawler block looping 2× with differentiated RoPE scales (9,1,1). 26.25M parameters, ~100.85ms/step.
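
A minimal sketch of this topology, with `Block` standing in for the real attention+MLP block and the RoPE-scale plumbing abbreviated to a comment; all names here are illustrative, not the actual train_gpt.py API:

```python
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for the real attention+MLP block (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim))

    def forward(self, x, rope_scale=1.0):
        # the real block would divide its RoPE inv_freq by rope_scale
        return x + self.body(x)

class CrawlerSketch(nn.Module):
    """9 unique flat blocks followed by 1 weight-shared crawler block."""
    def __init__(self, dim=512, num_flat=9, crawler_loops=2,
                 loop_rope_scales=(9.0, 1.0, 1.0)):
        super().__init__()
        self.flat_blocks = nn.ModuleList(Block(dim) for _ in range(num_flat))
        self.crawler_block = Block(dim)            # same weights every loop
        self.crawler_loops = crawler_loops
        self.loop_rope_scales = loop_rope_scales   # CRAWLER_LOOP_ROPE_SCALES

    def forward(self, x):
        for block in self.flat_blocks:             # 9 unique transformer blocks
            x = block(x)
        for loop in range(self.crawler_loops):     # 2 recurrent passes
            x = self.crawler_block(x, rope_scale=self.loop_rope_scales[loop])
        return x
```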

Research signals stacked

This submission is the product of a systematic crawler research program that began with the beloved but ill-fated frugendorff; this is where the dream of recursion lives, and it will not die:

  1. Loop-aware GPTQ (confirmed −0.00380 BPB in the BW10 full run) — 2-phase Hessian calibration that re-collects crawler importance scores after flat-layer quantization, addressing the quantization hostility of shared weights (a minimal sketch follows this list)
  2. Brotli compression (approved via the BW20 gate) — quality=11 replaces zstd level=22 for ~5-15% smaller artifacts, freeing headroom under the 16MB cap for GPTQ overhead
  3. QK_GAIN_INIT=4.0 (high-confidence, −0.006 external signal across 45 runs/3 codebases) — sharper initial attention gradients via a per-head q_gain scalar (also sketched after this list)
  4. 2-loop cadence (directional −0.054 in BW17 DGX-Spark RAPID) — fewer loops = faster steps (100.85 vs 110.19ms) = 505 more training steps in 600s budget + smaller quant gap
  5. Optimized warmdown (WARMDOWN_ITERS=2000, confirmed 2000 > 3500 > 5000)
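
For item 1, a minimal sketch of the two-phase calibration, assuming hypothetical helpers `run_calibration` (maps each layer to its calibration input activations) and `gptq_quantize` (a standard GPTQ solve against a Hessian); none of these names come from the actual train_gpt.py:

```python
import torch

def proxy_hessian(acts):
    """GPTQ importance proxy: H = sum over calibration tokens of x x^T."""
    H = torch.zeros(acts[0].shape[-1], acts[0].shape[-1])
    for x in acts:                                   # x: (tokens, in_features)
        H += x.T @ x
    return H

def loop_aware_gptq(model, calib_batches):
    # Phase 1: calibrate once, then quantize the 9 unique flat layers.
    acts = run_calibration(model, calib_batches)     # hypothetical helper
    for layer in model.flat_blocks:
        gptq_quantize(layer, proxy_hessian(acts[layer]))   # hypothetical

    # Phase 2: re-run calibration THROUGH the now-quantized flat stack, so
    # the shared crawler weights see post-quantization activations summed
    # across both loops; only then quantize the crawler block.
    acts = run_calibration(model, calib_batches)
    gptq_quantize(model.crawler_block,
                  proxy_hessian(acts[model.crawler_block]))
```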
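And for item 3, the per-head gain as a sketch; the class name and tensor shapes are illustrative, not the actual attention code:

```python
import torch
import torch.nn as nn

class QKGainSketch(nn.Module):
    """One learnable scalar per head, initialized to QK_GAIN_INIT."""
    def __init__(self, num_heads=8, qk_gain_init=4.0):
        super().__init__()
        self.q_gain = nn.Parameter(torch.full((num_heads, 1, 1), qk_gain_init))

    def forward(self, q):                # q: (batch, heads, seq, head_dim)
        # a larger initial gain sharpens early attention logits and gradients
        return q * self.q_gain
```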

Key finding from our crawler signal analysis: at this configuration, the crawler's advantage is 85% width, 15% implicit regularization. This shifted focus from adding crawler complexity toward maximizing flat depth, reducing loop overhead, and improving post-training quantization. Next up is an inverse kramer resolution with a 6-7 oscillator.

Parent: Bandit Wagon X (BWX 9F)

| Metric | BWX 9F | Ouroboros | Delta |
|---|---|---|---|
| int6_sw_bpb | 1.13867894 | 1.13727008 | −0.00141 |
| bytes_total | 15,239,617 | 15,034,550 | −205,067 |
| step_ms | 110.19 | 100.85 | −9.34 |
| steps (600s) | 5446 | 5951 | +505 |

Reproduce

pip install brotli
SEED=444 \
MAX_WALLCLOCK_SECONDS=600 WARMDOWN_ITERS=2000 \
NUM_FLAT_LAYERS=9 NUM_CRAWLER_LAYERS=1 CRAWLER_LOOPS=2 \
USE_CRAWLER=1 COMPILE_FULLGRAPH=1 \
SKIP_GPTQ=0 LOOP_AWARE_GPTQ=1 QK_GAIN_INIT=4.0 \
GPTQ_CAL_SAMPLES=128 GPTQ_CAL_SEQ_LEN=2048 \
CRAWLER_LOOP_ROPE_SCALES=9,1,1 SKIP_EMA=1 \
MODEL_DIM=512 INST_DIM=32 CRAWLER_MLP_MULT=6.0 \
CRAWLER_TAP_DIM=0 ANCHOR_DIM=0 CRAWLER_MLP_CHOKE_DIM=0 \
XSA_LAST_N=11 BIGRAM_VOCAB_SIZE=2048 ROPE_DIMS=16 \
SWA_EVERY=50 MATRIX_LR=0.03 \
MLP_LEAKY_SLOPE=0.5 CRAWLER_MLP_LEAKY_SLOPE=0.5 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-03_Ouroboros_8xH100/train_gpt.py

Octavian and others added 30 commits March 30, 2026 12:50
…ix mlp=6.0 in arms

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove permanently-disabled features:
- DeltaNet (DeltaNetMemory, CanonicalDeltaNet, FLA import)
- MTP heads and loss computation
- LATE_QAT branch
- All GPTQ functions (gptq_calibrate, loop_aware, mixed_quantize_int6_gptq)
- GPTQ Hessian collection hooks in training loop
- Nitrust bridge
- EMA accumulation loop (SKIP_EMA=1 locked)

Naive int6 + zstd compression pipeline, crawler architecture, training loop all intact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tall

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Config verified: dim=512, 4F+1C×3, mlp=6.0, SKIP_GPTQ=1.
Beats CL3 3-seed mean (1.18742) by 0.00126. Seed 444 confirmed good.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…warmdown=0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was only downloading dataset shards, not the tokenizer — caused crash on fresh pod.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces wallclock cap with ITERATIONS= so proxy runs use identical
training compute on any GPU count. Default 500 steps (~6 min on 1xH100).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW ablation never ran BW-00 (4F+1C) as a 500-step proxy arm — all
comparisons were against a full-run anchor. This experiment closes that
gap with two arms:

  BW2-00: 4F+1C, XSA=11 — the missing control
  BW2-01: 5F+1C, XSA=14 — proportional XSA coverage for 18-block model

Also commits BW ablation results log (2026-03-30).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW2-00 (4F+1C control, 500 steps): 1.52365 BPB
BW2-01 (5F+1C, XSA=14):            1.52963 BPB
BW-03  (5F+1C, XSA=11, ref):       1.54404 BPB

4F+1C wins by 0.020 BPB over 5F+1C at equal compute. Raw learning is
identical (~1.424 val_bpb) — delta is entirely quantization robustness.
BW-03's apparent win was an artifact of no proxy control arm.

Secondary: XSA coverage is a quant robustness lever (XSA=14 recovered
0.015 BPB vs XSA=11 for 5F). CL3 config (4F+1C) confirmed correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether wider XSA (cross-block attention bandwidth) reduces the
quantization gap on the optimal 4F+1C model. BW5F established that raw
learning is unaffected by XSA — gain is purely quant robustness.

  BWXSA-01: XSA_LAST_N=13 (87% of 15 blocks)
  BWXSA-02: XSA_LAST_N=15 (100% — ceiling)
  Control:  XSA_LAST_N=11 (73%, BW2-00: 1.52365) carried

Script records step_avg alongside BPB to directly measure speed tradeoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New train_gpt.py with CRAWLER_MLP_LEAKY_SLOPE env var — separates the
crawler block's leaky slope from flat blocks (which stay at 0.5 locked).
Default is 0.5, bit-equivalent to all prior runs when unset.

4 surgical edits to train_gpt.py only (new file, tested scripts untouched):
  - env var parse for CRAWLER_MLP_LEAKY_SLOPE
  - CrawlerGPT.__init__ new param
  - crawler_blocks construction uses crawler_mlp_leaky_slope
  - build_model() threads new param through

5 arms: slope=0.5 (control repin), 0.0, 0.25, 0.75, 1.0
BW3-00 must match BW2-00 (1.52365) to validate code change before reading results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
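
The parse step of that plumbing, as a sketch (matches the described default behavior; the constructor threading is elided):

```python
import os

# CRAWLER_MLP_LEAKY_SLOPE defaults to 0.5: bit-equivalent to prior runs when unset
crawler_mlp_leaky_slope = float(os.environ.get("CRAWLER_MLP_LEAKY_SLOPE", "0.5"))

# threaded through to the crawler block only; flat blocks keep slope=0.5 locked
# model = CrawlerGPT(..., crawler_mlp_leaky_slope=crawler_mlp_leaky_slope)
```
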
BWXSA-01 (XSA=13): 1.51982 BPB, 530ms/step
BWXSA-02 (XSA=15): 1.51431 BPB, 514ms/step  ← PROMOTED
Control  (XSA=11): 1.52365 BPB, 546ms/step

Counter-intuitive: full XSA coverage is 32ms/step FASTER than baseline.
Quant gap shrinks monotonically (0.099→0.095→0.090) — mechanism is
cross-block bandwidth smoothing quantization perturbation.

XSA=15 is both the BPB ceiling and the speed optimum. Gate at 2000 steps
before 8×H100. Pending: combine with crawler_mlp slope winner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrawlerMLP class with loop-specific choke_down/choke_up pairs
(512→3072→act→[C per-loop]→act→512). Flat blocks unchanged. Sweep
covers choke_dim ∈ {0, 32, 128, 256, 512} to find optimal quant
surface reduction. BWC-00 (dim=0) is control repin targeting 1.52365.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
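
One plausible reading of the arrow diagram in this commit, as a sketch (the real CrawlerMLP may differ; here the per-loop choke_up doubles as the down-projection whenever the choke is active):

```python
import torch.nn as nn
import torch.nn.functional as F

class CrawlerMLPSketch(nn.Module):
    """512 -> 3072 -> act -> [choke C, per loop] -> act -> 512 (one reading).
    choke_dim=0 reproduces the plain MLP (the BWC-00 control arm)."""
    def __init__(self, dim=512, hidden=3072, choke_dim=128, loops=3, slope=0.5):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)     # used only when choke is disabled
        self.slope = slope
        self.choke_dim = choke_dim
        if choke_dim:
            # one choke_down/choke_up pair per crawler loop
            self.choke_down = nn.ModuleList(
                nn.Linear(hidden, choke_dim) for _ in range(loops))
            self.choke_up = nn.ModuleList(
                nn.Linear(choke_dim, dim) for _ in range(loops))

    def forward(self, x, loop=0):
        h = F.leaky_relu(self.up(x), self.slope)                  # 512->3072->act
        if self.choke_dim == 0:
            return self.down(h)                                   # control arm
        h = F.leaky_relu(self.choke_down[loop](h), self.slope)   # ->C->act
        return self.choke_up[loop](h)                             # C->512
```
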
…er loops

Adds LoopSmearGate (~512 scalars, no matmuls) that blends each crawler
loop output with the previous loop output. Loop 0 smears against the
encoder output as a stable anchor. Attacks depth-compounding quant error
directly at loop boundaries. BWS-00/01 on/off ablation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
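
A sketch of what a gate like this could look like, assuming a sigmoid-gated per-channel blend; the parameterization and init are guesses, and only the "~512 scalars, no matmuls" constraint comes from the commit:

```python
import torch
import torch.nn as nn

class LoopSmearGate(nn.Module):
    """Per-channel blend of the current loop output with the previous one."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))   # ~512 scalars, even blend init

    def forward(self, current, previous):
        g = torch.sigmoid(self.gate)                 # per-channel, in (0, 1)
        return g * current + (1.0 - g) * previous

# usage inside the crawler loop; loop 0 smears against the encoder output:
# prev = encoder_out
# for loop in range(crawler_loops):
#     x = smear_gate(crawler_block(x), prev)
#     prev = x
```
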
Adds encoder tap infrastructure: frozen intermediate encoder layer outputs
projected to tap_dim and injected per-loop via loop_tap_up[loop]. Mirrors
FLOW pattern but anchors crawler to pre-drift encoder signal rather than
self-referential x. Sweeps tap_dim ∈ {16,32,64}, loop specificity, and
which encoder layers to tap. BWT-02 (dim=32, per-loop, all) is core hypothesis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
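
A sketch of the tap path under the stated design, with detach() standing in for "frozen" (hypothetical names, not the actual implementation):

```python
import torch.nn as nn

class EncoderTapSketch(nn.Module):
    """Project a frozen (detached) encoder-layer output to tap_dim, then
    inject it per crawler loop via a loop-specific up-projection."""
    def __init__(self, dim=512, tap_dim=32, loops=2):
        super().__init__()
        self.tap_down = nn.Linear(dim, tap_dim)
        self.loop_tap_up = nn.ModuleList(
            nn.Linear(tap_dim, dim) for _ in range(loops))

    def forward(self, x, enc_out, loop):
        tap = self.tap_down(enc_out.detach())   # pre-drift anchor, no grads back
        return x + self.loop_tap_up[loop](tap)
```
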
Adds CRAWLER_LOOP_ROPE_SCALES: divides inv_freq per loop to widen attention
range without extra parameters. CausalSelfAttention.forward and Block.forward
accept optional cos_sin tuple; _run_crawler pre-computes per loop.

run_all_ablations.sh runs all 4 series (choke/smear/tap/battery = 20 arms)
sequentially using unified train_gpt.py, prints ranked summary with winners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
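
A sketch of the per-loop cache, assuming conventional RoPE inv_freq and that each CRAWLER_LOOP_ROPE_SCALES entry divides it (consistent with the commit text; function names are illustrative):

```python
import torch

def loop_cos_sin(seq_len, head_dim, scale, base=10000.0):
    """Per-loop RoPE cache: dividing inv_freq by `scale` stretches the
    rotation wavelengths, widening attention range at zero parameter cost."""
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    inv_freq = inv_freq / scale           # one entry of CRAWLER_LOOP_ROPE_SCALES
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)      # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

# a _run_crawler-style precompute for the 9,1,1 schedule (assumed semantics):
# caches = [loop_cos_sin(T, head_dim, s) for s in (9.0, 1.0, 1.0)]
```
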
Octavian and others added 28 commits April 2, 2026 09:05
ONE variable: swap int6+zstd(22) for int6+brotli(q11). 1k step gate
to check artifact size delta and confirm no blowups before full run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
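
The swap itself reduces to a couple of lines; a sketch assuming the artifact is a single packed-int6 byte payload (function names are illustrative, not the real pipeline):

```python
import brotli  # pip install brotli

def pack_artifact(int6_payload: bytes) -> bytes:
    """Compress the packed int6 weights with brotli quality=11,
    the one-variable swap from zstd level=22."""
    return brotli.compress(int6_payload, quality=11)

def unpack_artifact(blob: bytes) -> bytes:
    return brotli.decompress(blob)
```
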
Parent: BWX 9F (1.13868 int6_sw_bpb, 15.24MB)
Changes: brotli compression (approved baseline) + GPTQ enabled (SKIP_GPTQ=0)
Expected: ~-0.002 BPB from GPTQ, artifact stays under 16MB via brotli savings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Best-foot-forward production run on BWX 9F base:
- LOOP_AWARE_GPTQ=1 (confirmed -0.00380 BPB in BW10, NOT standard GPTQ)
- QK_GAIN_INIT=4.0 (high-confidence, -0.006 external signal)
- CRAWLER_LOOPS=2 (BW17 RAPID: -0.054 directional, faster steps)
- Brotli compression (approved from BW20 gate)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy of SLOT_brotli with two eval-only changes:
- delta shape (1,1,dim) -> (bsz,1,dim) per-sample adaptation
- SLOT_STEPS 8 -> 24

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3-seed confirmed crawler SOTA (Bandit Wagon XI):
  seed 444: 1.13727008 BPB, 15,034,550 B
  seed   4: 1.13565882 BPB, 15,042,594 B
  seed 300: 1.13638653 BPB, 15,049,936 B
  mean:     1.13643848 BPB

9F crawler + loop-aware GPTQ + QK4 + 2-loop cadence + brotli

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newjordan (Author) commented:

Closing to resubmit under track_non_record_16mb. This is crawler architecture research, not a leaderboard attempt. Resubmission will include the full research context linking this to our earlier crawler PRs (#579, #990, #1028, #1140, #1208). See the non-record PR for the complete picture.

@newjordan newjordan closed this Apr 3, 2026
