
Non-Record: Ouroboros — Crawler Architecture Research (1.1364 BPB)#1308

Open
newjordan wants to merge 412 commits into openai:main from newjordan:submission/ouroboros-research

Conversation


@newjordan newjordan commented Apr 3, 2026

Ouroboros — Crawler Architecture Research

crawler

Non-record research submission documenting signal-hunting through rapid ablation on a novel recurrent architecture. This is not a leaderboard attempt — it documents an ongoing research program exploring bidirectional cross-stream recurrence, and is a substantially different implementation from simply firing a layer twice.

Results

| Seed | int6_sw_bpb | Steps | Size |
|------|-------------|-------|------|
| 444  | 1.13727008  | 5951  | 15,034,550 B |
| 4    | 1.13565882  | 5963  | 15,042,594 B |
| 300  | 1.13638653  | 5948  | 15,049,936 B |
| mean | 1.13643848  |       | 15,049,936 B |

Hardware: 8×H100 SXM · 600s · 26.25M params · ~100.85ms/step · No TTT, no SLOT, no eval-time adaptation.

Research Progression

| Date | Name | PR | BPB (mean) | Size | Architecture | Key Finding |
|------|------|----|------------|------|--------------|-------------|
| Mar 22 | Frugendorff | #579 | 1.1478 | 15.2MB | 6×2 symmetric shared | Cadence laws; recursion = 85% width, 15% regularization |
| Mar 23 | Micro Crawler | #579 | 1.1325 | 16.5MB | 4F+2C×2 | Asymmetric flat/crawler split beats symmetric |
| Mar 26 | ClownCar | #990 | 1.1813 | 9.4MB | 4F+1C×3 + DeltaNet | Width is the primary lever (~0.033 BPB) |
| Mar 27 | Medusa | #1028 | 0.9984 | 9.8MB | DeltaNet crawler | Cross-loop state carry violates causality |
| Mar 27 | Medusa S2 | #1047 | 1.1823* | 9.8MB | DeltaNet off | Confirmed DeltaNet was learning to cheat |
| Mar 30 | Crawler | #1140 | 1.1874 | 9.4MB | 4F+1C×3, FLOW, RoPE battery | U-Net + bottleneck; gate discipline established |
| Apr 1 | Nightcrawler | #1208 | 1.1761 | 10.3MB | 5F+1C×3, tap-off | Tap-off > tap-on; depth scaling confirmed |
| Apr 3 | Ouroboros | this PR | 1.1364 | 15.0MB | 9F+1C×2, GPTQ, QK4, brotli | Loop-aware GPTQ; 5 stacked signals |
| WIP | Helix | | micro only | | Dual-stream co-firing | Position-agnostic cross-injection; confirmed signal |

*Medusa S2 BPB is with DeltaNet disabled — the legal baseline after discovering the causality violation.

The Arc: Frugendorff → Crawler → Ouroboros → Helix

This is one checkpoint in a crawler research program spanning 6 PRs and 50+ ablation arms.

Methodology

Test concepts on a local GB10, escalate ablations to 2×/4× H100, then run the final research pass on the 8×H100 target.

Technical Contributions

Crawler Base Architecture - (see crawler PR)

Loop-aware GPTQ — 2-phase Hessian calibration for shared-weight architectures. Standard GPTQ is dangerous on crawler (Frugendorff catastrophe: 1.38 → 5.7 BPB post-quant). Loop-aware recalibrates crawler importance on actual post-flat-quantized activations. Confirmed −0.00380 BPB.
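The two-phase idea can be sketched in a few lines. This is a minimal NumPy illustration, not the PR's implementation: `quantize_int6` and `calibrate_loop_aware` are hypothetical helpers, and the "Hessian" is the standard GPTQ proxy `H = Xᵀ X` accumulated on the activations the shared crawler weight actually sees after the flat stack has been quantized.

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-tensor int6 quantization (hypothetical helper).
    scale = np.abs(w).max() / 31.0 + 1e-12
    return np.round(w / scale).clip(-32, 31) * scale

def calibrate_loop_aware(flat_ws, crawler_w, x, loops=2):
    """Two-phase calibration sketch: quantize flat layers first, then
    accumulate the crawler Hessian proxy on the *post-quantization*
    activations it will see at inference time."""
    # Phase 1: quantize the flat stack and propagate calibration data through it.
    h = x
    q_flat = []
    for w in flat_ws:
        qw = quantize_int6(w)
        q_flat.append(qw)
        h = np.maximum(h @ qw, 0.0)          # simplified flat block
    # Phase 2: H = X^T X on the inputs feeding the shared crawler weight —
    # every loop contributes its own input statistics.
    H = np.zeros((crawler_w.shape[0], crawler_w.shape[0]))
    loop_in = h
    for _ in range(loops):                    # e.g. CRAWLER_LOOPS=2
        H += loop_in.T @ loop_in
        loop_in = np.maximum(loop_in @ crawler_w, 0.0)
    return q_flat, H
```

The key point is that phase 2 never sees float activations: the crawler's importance estimates are conditioned on the quantization error already introduced upstream, which is what standard single-pass GPTQ gets wrong on shared-weight loops.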

Width-vs-recursion analysis — Quantified that the crawler's advantage is width reallocation, not recursion signal. Redirected research from adding crawler complexity toward maximizing flat depth and improving post-training quantization.

Position-agnostic cross-stream routing (Helix) — The bridge between flat and crawler streams deliberately has no positional encoding. RoPE handles WHERE in each stream; the bridge routes WHAT by content similarity. Distinct from the field's depth recurrence (simple layer replay) — this is bidirectional cross-injection between co-evolving streams.
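A minimal sketch of what a position-agnostic content bridge looks like, assuming single-head cross-attention with queries from one stream and keys/values from the other (all names here are hypothetical; the real bridge details live in the Helix code). Note what is absent: no RoPE is applied to `q` or `k`, so the bridge can only route by content similarity.

```python
import numpy as np

def bridge_route(flat_h, crawl_h, wq, wk, wv):
    """Position-agnostic bridge sketch: route WHAT by content similarity.
    No RoPE here — positions only exist inside each stream."""
    q = flat_h @ wq                       # queries from the flat stream
    k = crawl_h @ wk                      # keys/values from the crawler stream
    v = crawl_h @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                # preserve autoregressive causality
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v                          # injected back into the flat stream
```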

Reproduce

```bash
pip install brotli
SEED=444 MAX_WALLCLOCK_SECONDS=600 WARMDOWN_ITERS=2000 \
NUM_FLAT_LAYERS=9 NUM_CRAWLER_LAYERS=1 CRAWLER_LOOPS=2 \
USE_CRAWLER=1 COMPILE_FULLGRAPH=1 \
SKIP_GPTQ=0 LOOP_AWARE_GPTQ=1 QK_GAIN_INIT=4.0 \
GPTQ_CAL_SAMPLES=128 GPTQ_CAL_SEQ_LEN=2048 \
CRAWLER_LOOP_ROPE_SCALES=9,1,1 SKIP_EMA=1 \
MODEL_DIM=512 INST_DIM=32 CRAWLER_MLP_MULT=6.0 \
CRAWLER_TAP_DIM=0 ANCHOR_DIM=0 CRAWLER_MLP_CHOKE_DIM=0 \
XSA_LAST_N=11 BIGRAM_VOCAB_SIZE=2048 ROPE_DIMS=16 \
SWA_EVERY=50 MATRIX_LR=0.03 \
MLP_LEAKY_SLOPE=0.5 CRAWLER_MLP_LEAKY_SLOPE=0.5 \
torchrun --standalone --nproc_per_node=8 \
  records/track_non_record_16mb/2026-04-03_Ouroboros_Crawler_Research_8xH100/train_gpt.py
```

Octavian and others added 30 commits March 30, 2026 13:37
…warmdown=0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was only downloading dataset shards, not the tokenizer — caused crash on fresh pod.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces wallclock cap with ITERATIONS= so proxy runs use identical
training compute on any GPU count. Default 500 steps (~6 min on 1xH100).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW ablation never ran BW-00 (4F+1C) as a 500-step proxy arm — all
comparisons were against a full-run anchor. This experiment closes that
gap with two arms:

  BW2-00: 4F+1C, XSA=11 — the missing control
  BW2-01: 5F+1C, XSA=14 — proportional XSA coverage for 18-block model

Also commits BW ablation results log (2026-03-30).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW2-00 (4F+1C control, 500 steps): 1.52365 BPB
BW2-01 (5F+1C, XSA=14):            1.52963 BPB
BW-03  (5F+1C, XSA=11, ref):       1.54404 BPB

4F+1C wins by 0.020 BPB over 5F+1C at equal compute. Raw learning is
identical (~1.424 val_bpb) — delta is entirely quantization robustness.
BW-03's apparent win was an artifact of no proxy control arm.

Secondary: XSA coverage is a quant robustness lever (XSA=14 recovered
0.015 BPB vs XSA=11 for 5F). CL3 config (4F+1C) confirmed correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether wider XSA (cross-block attention bandwidth) reduces the
quantization gap on the optimal 4F+1C model. BW5F established that raw
learning is unaffected by XSA — gain is purely quant robustness.

  BWXSA-01: XSA_LAST_N=13 (87% of 15 blocks)
  BWXSA-02: XSA_LAST_N=15 (100% — ceiling)
  Control:  XSA_LAST_N=11 (73%, BW2-00: 1.52365) carried

Script records step_avg alongside BPB to directly measure speed tradeoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New train_gpt.py with CRAWLER_MLP_LEAKY_SLOPE env var — separates the
crawler block's leaky slope from flat blocks (which stay at 0.5 locked).
Default is 0.5, bit-equivalent to all prior runs when unset.

4 surgical edits to train_gpt.py only (new file, tested scripts untouched):
  - env var parse for CRAWLER_MLP_LEAKY_SLOPE
  - CrawlerGPT.__init__ new param
  - crawler_blocks construction uses crawler_mlp_leaky_slope
  - build_model() threads new param through

5 arms: slope=0.5 (control repin), 0.0, 0.25, 0.75, 1.0
BW3-00 must match BW2-00 (1.52365) to validate code change before reading results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BWXSA-01 (XSA=13): 1.51982 BPB, 530ms/step
BWXSA-02 (XSA=15): 1.51431 BPB, 514ms/step  ← PROMOTED
Control  (XSA=11): 1.52365 BPB, 546ms/step

Counter-intuitive: full XSA coverage is 32ms/step FASTER than baseline.
Quant gap shrinks monotonically (0.099→0.095→0.090) — mechanism is
cross-block bandwidth smoothing quantization perturbation.

XSA=15 is both the BPB ceiling and the speed optimum. Gate at 2000 steps
before 8×H100. Pending: combine with crawler_mlp slope winner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrawlerMLP class with loop-specific choke_down/choke_up pairs
(512→3072→act→[C per-loop]→act→512). Flat blocks unchanged. Sweep
covers choke_dim ∈ {0, 32, 128, 256, 512} to find optimal quant
surface reduction. BWC-00 (dim=0) is control repin targeting 1.52365.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
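The choke shape described above (512→3072→act→[C per-loop]→act→512) can be sketched as follows. This is an illustrative NumPy version under assumed names, not the repo's `CrawlerMLP`; the point is that only the cheap choke pair is loop-specific while the wide up-projection stays shared.

```python
import numpy as np

class CrawlerMLP:
    """Sketch of the per-loop choke: dim -> hidden -> act -> [choke, per loop]
    -> act -> dim. Only the narrow choke pair differs across loops."""
    def __init__(self, dim=512, hidden=3072, choke=128, loops=2, seed=0):
        rng = np.random.default_rng(seed)
        s = lambda *sh: rng.standard_normal(sh) * 0.02
        self.up = s(dim, hidden)                               # shared
        self.choke_down = [s(hidden, choke) for _ in range(loops)]
        self.choke_up = [s(choke, dim) for _ in range(loops)]  # per-loop

    def __call__(self, x, loop_idx):
        h = np.maximum(x @ self.up, 0.0)                   # simplified act
        h = np.maximum(h @ self.choke_down[loop_idx], 0.0)
        return h @ self.choke_up[loop_idx]
```

The narrow per-loop bottleneck is what shrinks the quantization surface: each loop's specialization lives in a `hidden×C + C×dim` pair instead of a full second MLP.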
…er loops

Adds LoopSmearGate (~512 scalars, no matmuls) that blends each crawler
loop output with the previous loop output. Loop 0 smears against the
encoder output as a stable anchor. Attacks depth-compounding quant error
directly at loop boundaries. BWS-00/01 on/off ablation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
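A gate of ~dim scalars with no matmuls, as described above, reduces to a per-channel sigmoid blend. A minimal sketch (hypothetical names; the real `LoopSmearGate` may differ in init and placement):

```python
import numpy as np

class LoopSmearGate:
    """~dim learnable scalars, no matmuls: blend each loop's output with the
    previous loop's output (loop 0 blends against the encoder anchor)."""
    def __init__(self, dim=512):
        self.gate = np.zeros(dim)  # zero-init => even 50/50 blend at start

    def __call__(self, loop_out, prev_out):
        g = 1.0 / (1.0 + np.exp(-self.gate))   # per-channel sigmoid gate
        return g * loop_out + (1.0 - g) * prev_out
```

Because the blend happens at each loop boundary, quantization error that would otherwise compound multiplicatively through the loops is repeatedly pulled back toward the more stable previous state.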
Adds encoder tap infrastructure: frozen intermediate encoder layer outputs
projected to tap_dim and injected per-loop via loop_tap_up[loop]. Mirrors
FLOW pattern but anchors crawler to pre-drift encoder signal rather than
self-referential x. Sweeps tap_dim ∈ {16,32,64}, loop specificity, and
which encoder layers to tap. BWT-02 (dim=32, per-loop, all) is core hypothesis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CRAWLER_LOOP_ROPE_SCALES: divides inv_freq per loop to widen attention
range without extra parameters. CausalSelfAttention.forward and Block.forward
accept optional cos_sin tuple; _run_crawler pre-computes per loop.

run_all_ablations.sh runs all 4 series (choke/smear/tap/battery = 20 arms)
sequentially using unified train_gpt.py, prints ranked summary with winners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
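Dividing `inv_freq` per loop, as described above, slows the rotation of every RoPE frequency, which stretches the effective attention range for that loop without adding any parameters. A sketch under assumed names:

```python
import numpy as np

def rope_cos_sin(seq_len, head_dim, loop_scale=1.0, base=10000.0):
    """Per-loop RoPE table sketch: dividing inv_freq by loop_scale slows
    rotation, widening that loop's effective attention range
    (CRAWLER_LOOP_ROPE_SCALES=9,1,1 style) with zero extra parameters."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim) / loop_scale
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]
    return np.cos(angles), np.sin(angles)
```

With `loop_scale=9`, position deltas of 9 tokens rotate as far as 1 token does at scale 1, so relative-position discrimination is traded for longer-range reach in that loop only.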
…t 0.5

All 5 arms complete. No arm cleared the ≥0.005 threshold. Best was slope=0.75
at −0.00065 vs control. Direction: higher slope slightly helps (negative gradient
carries cross-loop corrections), pure relu_sq (0.0) is worst. Config unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Block.forward passes loop_idx to self.mlp when crawler loop is active.
Standard MLP (used when choke=0) rejected the extra arg. Accept and ignore it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
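The accept-and-ignore fix amounts to giving the standard MLP the same call signature as the choke path. An illustrative sketch (names hypothetical):

```python
import numpy as np

def mlp_forward(x, w_up, w_down, loop_idx=None):
    """Flat-block MLP sketch. loop_idx is accepted and ignored so
    Block.forward can pass it unconditionally; only the CrawlerMLP
    path (choke > 0) actually consumes it."""
    h = np.maximum(x @ w_up, 0.0)   # simplified activation
    return h @ w_down
```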
Shaped bottleneck ablation informed by flat choke results:
- BWC-04 (flat-512) wins at -0.013 BPB vs control
- BWC-02 (flat-128) clears at -0.005
- Improvements in raw BPB, not just quant gap

Five CrawlerMLP shapes added to train_gpt.py:
- flat: existing (control reference)
- pyramid: shared 3072→512 stage + cheap per-loop 512→C→512
- pyramid_res: pyramid with free residual (stage1 output = bypass)
- grouped: block-diagonal per-loop down-projection, G equal groups
- residual: shared bypass proj + per-loop delta at narrow dim

7 arms (BWCS-00 through BWCS-06), ~90 min on 1×H100.
Key question: can pyramid-512 match flat-512 cheaper? Can grouped
beat flat-512 with block-diagonal balanced routing?

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Octavian and others added 29 commits April 2, 2026 11:52
ONE variable: swap int6+zstd(22) for int6+brotli(q11). 1k step gate
to check artifact size delta and confirm no blowups before full run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parent: BWX 9F (1.13868 int6_sw_bpb, 15.24MB)
Changes: brotli compression (approved baseline) + GPTQ enabled (SKIP_GPTQ=0)
Expected: ~-0.002 BPB from GPTQ, artifact stays under 16MB via brotli savings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Best-foot-forward production run on BWX 9F base:
- LOOP_AWARE_GPTQ=1 (confirmed -0.00380 BPB in BW10, NOT standard GPTQ)
- QK_GAIN_INIT=4.0 (high-confidence, -0.006 external signal)
- CRAWLER_LOOPS=2 (BW17 RAPID: -0.054 directional, faster steps)
- Brotli compression (approved from BW20 gate)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy of SLOT_brotli with two eval-only changes:
- delta shape (1,1,dim) -> (bsz,1,dim) per-sample adaptation
- SLOT_STEPS 8 -> 24

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aved)

New architecture where crawler fires alongside every flat layer with
bidirectional cross-injection via 32-dim gated projections. Two strands
spiral together — flat builds unique representations while shared crawler
continuously refines, cross-pollinating at every step.

Gate: 3 arms (control, stride=3, stride=1) at 1k steps.
HELIX=0 path identical to Ouroboros — warm start via zero-init.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
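One direction of the 32-dim gated cross-injection can be sketched as below (hypothetical names, not the repo's code). The zero-init gate is what makes the HELIX=0 path bit-identical at warm start: until the gate learns to open, the injection contributes nothing.

```python
import numpy as np

def helix_inject(flat_h, crawl_h, w_down, w_up, gate):
    """Sketch of one direction of the gated cross-injection:
    dim -> 32 -> dim bridge, scaled by a learnable gate."""
    msg = (crawl_h @ w_down) @ w_up    # narrow 32-dim bridge
    return flat_h + gate * msg         # gate zero-init => identity at start
```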
Single variable vs SLOT_brotli: shared delta (1,1,dim) + 24 steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tiny model (dim=256, seq=512, 200 steps, compile=off) for rapid
architecture prototyping. Tests: foundation, depth scaling, stride,
bridge width, and interaction with existing crawler features.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SLOT_STEPS 24 with shared delta. Beats Slot Machine by 0.00849 BPB.
Seed 444 pending for promotion gate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HELIX_CROSS_ATTN=1 enables content-addressed cross-injection between
streams via single-head cross-attention (Q from source, K+V from target)
instead of blind linear projection. Causal mask preserves autoregressive.

Micro suite now has 23 arms across 6 phases including linear vs cross-attn
A/B at multiple dims, strides, and depths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
16-arm DGX Spark micro suite complete. Key findings:
- Bridge width dim=64: -0.0464 BPB vs control (dominant lever)
- Stride=5 (rare firing): -0.0107, fastest arm, better than stride=1
- Sequential loops + helix = wasteful (confirmed)
- Existing features (inst/anchor/tap) don't stack with helix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arms)

Following suite 1 findings (dim=64 breakout, stride=5 efficiency):
- Controls: 7F no-helix, 5F 1loop no-helix (fill gaps)
- Dim ceiling: 64/96/128/192 (does scaling continue?)
- Best combo: stride=5 + dim=64/128, stride=3 + dim=64/128
- Marco-Polo: cross-attn at dim=16/32/64/128, stride=5+dim=64
- Depth scaling: 7F and 9F with dim=64 linear and marco-polo

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Research submission documenting the Frugendorff→Crawler→Ouroboros→Helix arc.
Signal hunting through rapid ablation: 50+ arms, loop-aware GPTQ, width-vs-
recursion analysis, position-agnostic cross-stream routing (Helix, in dev).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Apr 6, 2026
Control: Ouroboros PR openai#1308 baseline (9F+1Cx2, loop-aware GPTQ, QK4, brotli)
Arm A: Noisy QAT — int6-calibrated differentiable noise on crawler blocks only
       (Evangeline Kamin PR openai#363: collapses quant gap 0.37→0.002 BPB on loops)
Arm B: CRAWLER_QUANT_INT8=1 — int8 for shared crawler blocks, int6 for flat
Arm C: SCORE contractive dt=0.5 on crawler loops (arXiv 2603.10544)
       x' = (1-dt)*x + dt*F(x) instead of x' = F(x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
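The Arm C update rule above is a one-liner; a sketch for concreteness:

```python
def contractive_step(x, F, dt=0.5):
    """SCORE-style contractive loop update: x' = (1-dt)*x + dt*F(x).
    dt=0.5 damps each crawler loop toward a fixed point of F, limiting
    depth-compounded error versus the plain replay x' = F(x)."""
    return (1.0 - dt) * x + dt * F(x)
```

Any fixed point of `F` is also a fixed point of the damped update, so the contraction changes the loop's dynamics without changing what it can converge to.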