Non-Record: Ouroboros — Crawler Architecture Research (1.1364 BPB)#1308
Open
newjordan wants to merge 412 commits into openai:main from
Conversation
…warmdown=0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was only downloading dataset shards, not the tokenizer — caused crash on fresh pod. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces wallclock cap with ITERATIONS= so proxy runs use identical training compute on any GPU count. Default 500 steps (~6 min on 1xH100). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW ablation never ran BW-00 (4F+1C) as a 500-step proxy arm — all comparisons were against a full-run anchor. This experiment closes that gap with two arms: BW2-00: 4F+1C, XSA=11 — the missing control BW2-01: 5F+1C, XSA=14 — proportional XSA coverage for 18-block model Also commits BW ablation results log (2026-03-30). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW2-00 (4F+1C control, 500 steps): 1.52365 BPB BW2-01 (5F+1C, XSA=14): 1.52963 BPB BW-03 (5F+1C, XSA=11, ref): 1.54404 BPB 4F+1C wins by 0.020 BPB over 5F+1C at equal compute. Raw learning is identical (~1.424 val_bpb) — delta is entirely quantization robustness. BW-03's apparent win was an artifact of no proxy control arm. Secondary: XSA coverage is a quant robustness lever (XSA=14 recovered 0.015 BPB vs XSA=11 for 5F). CL3 config (4F+1C) confirmed correct. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether wider XSA (cross-block attention bandwidth) reduces the quantization gap on the optimal 4F+1C model. BW5F established that raw learning is unaffected by XSA — gain is purely quant robustness. BWXSA-01: XSA_LAST_N=13 (87% of 15 blocks) BWXSA-02: XSA_LAST_N=15 (100% — ceiling) Control: XSA_LAST_N=11 (73%, BW2-00: 1.52365) carried Script records step_avg alongside BPB to directly measure speed tradeoff. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New train_gpt.py with CRAWLER_MLP_LEAKY_SLOPE env var — separates the crawler block's leaky slope from flat blocks (which stay at 0.5 locked). Default is 0.5, bit-equivalent to all prior runs when unset. 4 surgical edits to train_gpt.py only (new file, tested scripts untouched): - env var parse for CRAWLER_MLP_LEAKY_SLOPE - CrawlerGPT.__init__ new param - crawler_blocks construction uses crawler_mlp_leaky_slope - build_model() threads new param through 5 arms: slope=0.5 (control repin), 0.0, 0.25, 0.75, 1.0 BW3-00 must match BW2-00 (1.52365) to validate code change before reading results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
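The slope plumbing amounts to threading one float from the environment into the crawler blocks. A minimal sketch of the parse step, assuming a plain `os.environ` read (the commit does not show the code verbatim), with the 0.5 default matching the stated bit-equivalence guarantee:

```python
import os

# Sketch (assumed parsing, names from the commit message): read
# CRAWLER_MLP_LEAKY_SLOPE once at startup, defaulting to 0.5 so that
# unset runs stay bit-equivalent to all prior runs.
crawler_mlp_leaky_slope = float(os.environ.get("CRAWLER_MLP_LEAKY_SLOPE", "0.5"))
flat_mlp_leaky_slope = 0.5  # flat blocks stay locked at 0.5, unaffected by the env var
```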
BWXSA-01 (XSA=13): 1.51982 BPB, 530ms/step BWXSA-02 (XSA=15): 1.51431 BPB, 514ms/step ← PROMOTED Control (XSA=11): 1.52365 BPB, 546ms/step Counter-intuitive: full XSA coverage is 32ms/step FASTER than baseline. Quant gap shrinks monotonically (0.099→0.095→0.090) — mechanism is cross-block bandwidth smoothing quantization perturbation. XSA=15 is both the BPB ceiling and the speed optimum. Gate at 2000 steps before 8×H100. Pending: combine with crawler_mlp slope winner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrawlerMLP class with loop-specific choke_down/choke_up pairs
(512→3072→act→[C per-loop]→act→512). Flat blocks unchanged. Sweep
covers choke_dim ∈ {0, 32, 128, 256, 512} to find optimal quant
surface reduction. BWC-00 (dim=0) is control repin targeting 1.52365.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
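A minimal numpy sketch of the choked shape flow described above, assuming per-loop `choke_down`/`choke_up` lists indexed by `loop_idx` (class layout, activation, and init are illustrative, not the repo's `train_gpt.py`):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class CrawlerMLP:
    """Sketch of the choked crawler MLP: 512 -> 3072 -> act -> [C, per loop] -> act -> 512.

    The choke_down/choke_up pair is indexed by loop, so each crawler loop gets
    its own narrow bottleneck while the wide 512->3072 stage is shared.
    """
    def __init__(self, dim=512, hidden=3072, choke=128, loops=2):
        self.w_up = rng.standard_normal((dim, hidden)) * 0.02
        self.choke_down = [rng.standard_normal((hidden, choke)) * 0.02 for _ in range(loops)]
        self.choke_up = [rng.standard_normal((choke, dim)) * 0.02 for _ in range(loops)]

    def forward(self, x, loop_idx):
        h = relu(x @ self.w_up)                  # shared wide stage
        c = relu(h @ self.choke_down[loop_idx])  # per-loop narrow choke
        return c @ self.choke_up[loop_idx]       # back to model dim

mlp = CrawlerMLP()
y = mlp.forward(np.zeros((2, 512)), loop_idx=1)
```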
…er loops Adds LoopSmearGate (~512 scalars, no matmuls) that blends each crawler loop output with the previous loop output. Loop 0 smears against the encoder output as a stable anchor. Attacks depth-compounding quant error directly at loop boundaries. BWS-00/01 on/off ablation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
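The gate itself is tiny. A numpy sketch of the blend, assuming a per-channel sigmoid gate (the exact parameterization of LoopSmearGate is an assumption; the ~512-scalar, no-matmul budget is from the commit message):

```python
import numpy as np

def loop_smear(loop_out, prev_out, gate_logits):
    """Blend the current crawler-loop output with the previous loop's output.

    gate_logits: one learned scalar per channel (~dim parameters, no matmuls).
    Loop 0 would smear against the encoder output as a stable anchor.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # per-channel sigmoid gate in (0, 1)
    return g * loop_out + (1.0 - g) * prev_out

# With gate logits at zero, the blend is an even 50/50 mix of the two signals.
x_prev = np.zeros((1, 4, 512))
x_loop = np.ones((1, 4, 512))
out = loop_smear(x_loop, x_prev, np.zeros(512))
```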
Adds encoder tap infrastructure: frozen intermediate encoder layer outputs
projected to tap_dim and injected per-loop via loop_tap_up[loop]. Mirrors
FLOW pattern but anchors crawler to pre-drift encoder signal rather than
self-referential x. Sweeps tap_dim ∈ {16,32,64}, loop specificity, and
which encoder layers to tap. BWT-02 (dim=32, per-loop, all) is core hypothesis.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
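A numpy sketch of the per-loop injection, assuming a zero-initialized `loop_tap_up` so the tap starts as a no-op (the zero-init is an assumption; the commit only specifies the frozen encoder tap and the per-loop up-projection):

```python
import numpy as np

def encoder_tap_inject(x, tapped, loop_tap_up, loop_idx):
    """Inject a frozen encoder tap into the crawler stream for one loop.

    tapped: detached intermediate encoder output already projected to tap_dim;
    loop_tap_up[loop_idx] lifts it back to model dim. Names are illustrative.
    """
    return x + tapped @ loop_tap_up[loop_idx]

dim, tap_dim, loops = 512, 32, 2
loop_tap_up = [np.zeros((tap_dim, dim)) for _ in range(loops)]  # zero-init: no-op at start
x = np.ones((1, 4, dim))
out = encoder_tap_inject(x, np.ones((1, 4, tap_dim)), loop_tap_up, 0)
```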
Adds CRAWLER_LOOP_ROPE_SCALES: divides inv_freq per loop to widen attention range without extra parameters. CausalSelfAttention.forward and Block.forward accept optional cos_sin tuple; _run_crawler pre-computes per loop. run_all_ablations.sh runs all 4 series (choke/smear/tap/battery = 20 arms) sequentially using unified train_gpt.py, prints ranked summary with winners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
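Dividing `inv_freq` per loop is a one-line change; the sketch below shows its effect on the RoPE tables (function shape is illustrative, not the repo's `CausalSelfAttention` code). A larger `loop_scale` lengthens every rotation period, widening the effective attention range with zero extra parameters:

```python
import numpy as np

def rope_cos_sin(seq_len, head_dim, loop_scale=1.0, base=10000.0):
    """Per-loop RoPE tables: inv_freq is divided by loop_scale, slowing all
    rotations. Illustrative sketch of CRAWLER_LOOP_ROPE_SCALES."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim) / loop_scale
    angles = np.outer(np.arange(seq_len), inv_freq)
    return np.cos(angles), np.sin(angles)

cos1, _ = rope_cos_sin(128, 64, loop_scale=1.0)
cos2, _ = rope_cos_sin(128, 64, loop_scale=2.0)  # halved angles = wider range
```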
…t 0.5 All 5 arms complete. No arm cleared the ≥0.005 threshold. Best was slope=0.75 at −0.00065 vs control. Direction: higher slope slightly helps (negative gradient carries cross-loop corrections), pure relu_sq (0.0) is worst. Config unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Block.forward passes loop_idx to self.mlp when crawler loop is active. Standard MLP (used when choke=0) rejected the extra arg. Accept and ignore it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shaped bottleneck ablation informed by flat choke results: - BWC-04 (flat-512) wins at -0.013 BPB vs control - BWC-02 (flat-128) clears at -0.005 - Improvements in raw BPB, not just quant gap Five CrawlerMLP shapes added to train_gpt.py: - flat: existing (control reference) - pyramid: shared 3072→512 stage + cheap per-loop 512→C→512 - pyramid_res: pyramid with free residual (stage1 output = bypass) - grouped: block-diagonal per-loop down-projection, G equal groups - residual: shared bypass proj + per-loop delta at narrow dim 7 arms (BWCS-00 through BWCS-06), ~90 min on 1×H100. Key question: can pyramid-512 match flat-512 cheaper? Can grouped beat flat-512 with block-diagonal balanced routing? Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ONE variable: swap int6+zstd(22) for int6+brotli(q11). 1k step gate to check artifact size delta and confirm no blowups before full run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parent: BWX 9F (1.13868 int6_sw_bpb, 15.24MB) Changes: brotli compression (approved baseline) + GPTQ enabled (SKIP_GPTQ=0) Expected: ~-0.002 BPB from GPTQ, artifact stays under 16MB via brotli savings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Best-foot-forward production run on BWX 9F base: - LOOP_AWARE_GPTQ=1 (confirmed -0.00380 BPB in BW10, NOT standard GPTQ) - QK_GAIN_INIT=4.0 (high-confidence, -0.006 external signal) - CRAWLER_LOOPS=2 (BW17 RAPID: -0.054 directional, faster steps) - Brotli compression (approved from BW20 gate) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy of SLOT_brotli with two eval-only changes: - delta shape (1,1,dim) -> (bsz,1,dim) per-sample adaptation - SLOT_STEPS 8 -> 24 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
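The delta reshape is a pure broadcasting change; a toy demo of the two shapes (illustrative only, not the SLOT code):

```python
import numpy as np

# Shapes from the commit message; the arrays are toy placeholders.
bsz, seq, dim = 4, 16, 512
x = np.zeros((bsz, seq, dim))
shared_delta = np.zeros((1, 1, dim))        # one correction broadcast over the batch
per_sample_delta = np.zeros((bsz, 1, dim))  # independent correction per sample
y_shared = x + shared_delta
y_per_sample = x + per_sample_delta
```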
…aved) New architecture where crawler fires alongside every flat layer with bidirectional cross-injection via 32-dim gated projections. Two strands spiral together — flat builds unique representations while shared crawler continuously refines, cross-pollinating at every step. Gate: 3 arms (control, stride=3, stride=1) at 1k steps. HELIX=0 path identical to Ouroboros — warm start via zero-init. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single variable vs SLOT_brotli: shared delta (1,1,dim) + 24 steps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tiny model (dim=256, seq=512, 200 steps, compile=off) for rapid architecture prototyping. Tests: foundation, depth scaling, stride, bridge width, and interaction with existing crawler features. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SLOT_STEPS 24 with shared delta. Beats Slot Machine by 0.00849 BPB. Seed 444 pending for promotion gate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HELIX_CROSS_ATTN=1 enables content-addressed cross-injection between streams via single-head cross-attention (Q from source, K+V from target) instead of blind linear projection. A causal mask preserves autoregressive decoding. Micro suite now has 23 arms across 6 phases including linear vs cross-attn A/B at multiple dims, strides, and depths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
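A numpy sketch of the content-addressed routing, with the learned Q/K/V projections omitted for brevity (assumed single-head form; not the repo's implementation):

```python
import numpy as np

def helix_cross_inject(src, tgt):
    """Single-head cross-attention sketch: Q from the source stream, K and V
    from the target stream, with a causal mask so position t reads only
    target positions <= t."""
    T, d = src.shape
    scores = (src @ tgt.T) / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))      # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ tgt

rng = np.random.default_rng(0)
src = rng.standard_normal((6, 32))
tgt = rng.standard_normal((6, 32))
out = helix_cross_inject(src, tgt)
# Position 0 can only see target position 0, so out[0] == tgt[0].
```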
16-arm DGX Spark micro suite complete. Key findings: - Bridge width dim=64: -0.0464 BPB vs control (dominant lever) - Stride=5 (rare firing): -0.0107, fastest arm, better than stride=1 - Sequential loops + helix = wasteful (confirmed) - Existing features (inst/anchor/tap) don't stack with helix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arms) Following suite 1 findings (dim=64 breakout, stride=5 efficiency): - Controls: 7F no-helix, 5F 1loop no-helix (fill gaps) - Dim ceiling: 64/96/128/192 (does scaling continue?) - Best combo: stride=5 + dim=64/128, stride=3 + dim=64/128 - Marco-Polo: cross-attn at dim=16/32/64/128, stride=5+dim=64 - Depth scaling: 7F and 9F with dim=64 linear and marco-polo Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Research submission documenting the Frugendorff→Crawler→Ouroboros→Helix arc. Signal hunting through rapid ablation: 50+ arms, loop-aware GPTQ, width-vs- recursion analysis, position-agnostic cross-stream routing (Helix, in dev). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request on Apr 6, 2026
Control: Ouroboros PR openai#1308 baseline (9F+1Cx2, loop-aware GPTQ, QK4, brotli) Arm A: Noisy QAT — int6-calibrated differentiable noise on crawler blocks only (Evangeline Kamin PR openai#363: collapses quant gap 0.37→0.002 BPB on loops) Arm B: CRAWLER_QUANT_INT8=1 — int8 for shared crawler blocks, int6 for flat Arm C: SCORE contractive dt=0.5 on crawler loops (arXiv 2603.10544) x' = (1-dt)*x + dt*F(x) instead of x' = F(x) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
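The Arm C update rule is small enough to sketch directly; `F` below stands in for one crawler loop body (illustrative only):

```python
import numpy as np

def contractive_loop_step(x, F, dt=0.5):
    """SCORE-style contractive crawler update: x' = (1 - dt) * x + dt * F(x)
    instead of x' = F(x). dt < 1 damps depth-compounding perturbations
    (e.g. quantization error) across loops."""
    return (1.0 - dt) * x + dt * F(x)

# A fixed point of F is also a fixed point of the damped update.
x = np.full(8, 3.0)
out = contractive_loop_step(x, lambda v: v, dt=0.5)
```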
Ouroboros — Crawler Architecture Research
Non-record research submission documenting signal hunting through rapid ablation on a novel recurrent architecture. This is not a leaderboard attempt; it documents an ongoing research program exploring bidirectional cross-stream recurrence, which is a fundamentally different implementation from simply double-firing a layer.
Results
Hardware: 8×H100 SXM · 600s · 26.25M params · ~100.85ms/step · No TTT, no SLOT, no eval-time adaptation.
Research Progression
*Medusa S2 BPB is with DeltaNet disabled — the legal baseline after discovering the causality violation.
The Arc: Frugendorff → Crawler → Ouroboros → Helix
This is one checkpoint in a crawler research program spanning 6 PRs and 50+ ablation arms:
Methodology
Test concepts on a local GB10, escalate ablations to 2×/4× H100, then run the final 8×H100 research pass on the target hardware.
Technical Contributions
Crawler Base Architecture - (see crawler PR)
Loop-aware GPTQ — 2-phase Hessian calibration for shared-weight architectures. Standard GPTQ is dangerous on crawler (Frugendorff catastrophe: 1.38 → 5.7 BPB post-quant). Loop-aware recalibrates crawler importance on actual post-flat-quantized activations. Confirmed −0.00380 BPB.
Width-vs-recursion analysis — Quantified that the crawler's advantage is width reallocation, not recursion signal. Redirected research from adding crawler complexity toward maximizing flat depth and improving post-training quantization.
Position-agnostic cross-stream routing (Helix) — The bridge between flat and crawler streams deliberately has no positional encoding. RoPE handles WHERE in each stream; the bridge routes WHAT by content similarity. Distinct from the field's depth recurrence (simple layer replay) — this is bidirectional cross-injection between co-evolving streams.
Reproduce