Non-Record: Ouroboros — Crawler Architecture Research (1.1364 BPB)#1308
Open
newjordan wants to merge 412 commits into openai:main from
Conversation
…warmdown=0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was only downloading dataset shards, not the tokenizer — caused crash on fresh pod. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces wallclock cap with ITERATIONS= so proxy runs use identical training compute on any GPU count. Default 500 steps (~6 min on 1xH100). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW ablation never ran BW-00 (4F+1C) as a 500-step proxy arm — all comparisons were against a full-run anchor. This experiment closes that gap with two arms: BW2-00: 4F+1C, XSA=11 — the missing control BW2-01: 5F+1C, XSA=14 — proportional XSA coverage for 18-block model Also commits BW ablation results log (2026-03-30). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW2-00 (4F+1C control, 500 steps): 1.52365 BPB BW2-01 (5F+1C, XSA=14): 1.52963 BPB BW-03 (5F+1C, XSA=11, ref): 1.54404 BPB 4F+1C wins by 0.020 BPB over 5F+1C at equal compute. Raw learning is identical (~1.424 val_bpb) — delta is entirely quantization robustness. BW-03's apparent win was an artifact of no proxy control arm. Secondary: XSA coverage is a quant robustness lever (XSA=14 recovered 0.015 BPB vs XSA=11 for 5F). CL3 config (4F+1C) confirmed correct. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether wider XSA (cross-block attention bandwidth) reduces the quantization gap on the optimal 4F+1C model. BW5F established that raw learning is unaffected by XSA — gain is purely quant robustness. BWXSA-01: XSA_LAST_N=13 (87% of 15 blocks) BWXSA-02: XSA_LAST_N=15 (100% — ceiling) Control: XSA_LAST_N=11 (73%, BW2-00: 1.52365) carried Script records step_avg alongside BPB to directly measure speed tradeoff. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New train_gpt.py with CRAWLER_MLP_LEAKY_SLOPE env var — separates the crawler block's leaky slope from flat blocks (which stay at 0.5 locked). Default is 0.5, bit-equivalent to all prior runs when unset. 4 surgical edits to train_gpt.py only (new file, tested scripts untouched): - env var parse for CRAWLER_MLP_LEAKY_SLOPE - CrawlerGPT.__init__ new param - crawler_blocks construction uses crawler_mlp_leaky_slope - build_model() threads new param through 5 arms: slope=0.5 (control repin), 0.0, 0.25, 0.75, 1.0 BW3-00 must match BW2-00 (1.52365) to validate code change before reading results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
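The slope plumbing amounts to threading one float from the environment into the crawler blocks. A minimal sketch of the parse step, assuming a plain `os.environ` read (the commit does not show the code verbatim), with the 0.5 default matching the stated bit-equivalence guarantee:

```python
import os

# Sketch (assumed parsing, names from the commit message): read
# CRAWLER_MLP_LEAKY_SLOPE once at startup, defaulting to 0.5 so that
# unset runs stay bit-equivalent to all prior runs.
crawler_mlp_leaky_slope = float(os.environ.get("CRAWLER_MLP_LEAKY_SLOPE", "0.5"))
flat_mlp_leaky_slope = 0.5  # flat blocks stay locked at 0.5, unaffected by the env var
```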
BWXSA-01 (XSA=13): 1.51982 BPB, 530ms/step BWXSA-02 (XSA=15): 1.51431 BPB, 514ms/step ← PROMOTED Control (XSA=11): 1.52365 BPB, 546ms/step Counter-intuitive: full XSA coverage is 32ms/step FASTER than baseline. Quant gap shrinks monotonically (0.099→0.095→0.090) — mechanism is cross-block bandwidth smoothing quantization perturbation. XSA=15 is both the BPB ceiling and the speed optimum. Gate at 2000 steps before 8×H100. Pending: combine with crawler_mlp slope winner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrawlerMLP class with loop-specific choke_down/choke_up pairs
(512→3072→act→[C per-loop]→act→512). Flat blocks unchanged. Sweep
covers choke_dim ∈ {0, 32, 128, 256, 512} to find optimal quant
surface reduction. BWC-00 (dim=0) is control repin targeting 1.52365.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
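A minimal numpy sketch of the choked shape flow described above, assuming per-loop `choke_down`/`choke_up` lists indexed by `loop_idx` (class layout, activation, and init are illustrative, not the repo's `train_gpt.py`):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class CrawlerMLP:
    """Sketch of the choked crawler MLP: 512 -> 3072 -> act -> [C, per loop] -> act -> 512.

    The choke_down/choke_up pair is indexed by loop, so each crawler loop gets
    its own narrow bottleneck while the wide 512->3072 stage is shared.
    """
    def __init__(self, dim=512, hidden=3072, choke=128, loops=2):
        self.w_up = rng.standard_normal((dim, hidden)) * 0.02
        self.choke_down = [rng.standard_normal((hidden, choke)) * 0.02 for _ in range(loops)]
        self.choke_up = [rng.standard_normal((choke, dim)) * 0.02 for _ in range(loops)]

    def forward(self, x, loop_idx):
        h = relu(x @ self.w_up)                  # shared wide stage
        c = relu(h @ self.choke_down[loop_idx])  # per-loop narrow choke
        return c @ self.choke_up[loop_idx]       # back to model dim

mlp = CrawlerMLP()
y = mlp.forward(np.zeros((2, 512)), loop_idx=1)
```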
…er loops Adds LoopSmearGate (~512 scalars, no matmuls) that blends each crawler loop output with the previous loop output. Loop 0 smears against the encoder output as a stable anchor. Attacks depth-compounding quant error directly at loop boundaries. BWS-00/01 on/off ablation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
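The gate itself is tiny. A numpy sketch of the blend, assuming a per-channel sigmoid gate (the exact parameterization of LoopSmearGate is an assumption; the ~512-scalar, no-matmul budget is from the commit message):

```python
import numpy as np

def loop_smear(loop_out, prev_out, gate_logits):
    """Blend the current crawler-loop output with the previous loop's output.

    gate_logits: one learned scalar per channel (~dim parameters, no matmuls).
    Loop 0 would smear against the encoder output as a stable anchor.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # per-channel sigmoid gate in (0, 1)
    return g * loop_out + (1.0 - g) * prev_out

# With gate logits at zero, the blend is an even 50/50 mix of the two signals.
x_prev = np.zeros((1, 4, 512))
x_loop = np.ones((1, 4, 512))
out = loop_smear(x_loop, x_prev, np.zeros(512))
```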
Adds encoder tap infrastructure: frozen intermediate encoder layer outputs
projected to tap_dim and injected per-loop via loop_tap_up[loop]. Mirrors
FLOW pattern but anchors crawler to pre-drift encoder signal rather than
self-referential x. Sweeps tap_dim ∈ {16,32,64}, loop specificity, and
which encoder layers to tap. BWT-02 (dim=32, per-loop, all) is core hypothesis.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
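A numpy sketch of the per-loop injection, assuming a zero-initialized `loop_tap_up` so the tap starts as a no-op (the zero-init is an assumption; the commit only specifies the frozen encoder tap and the per-loop up-projection):

```python
import numpy as np

def encoder_tap_inject(x, tapped, loop_tap_up, loop_idx):
    """Inject a frozen encoder tap into the crawler stream for one loop.

    tapped: detached intermediate encoder output already projected to tap_dim;
    loop_tap_up[loop_idx] lifts it back to model dim. Names are illustrative.
    """
    return x + tapped @ loop_tap_up[loop_idx]

dim, tap_dim, loops = 512, 32, 2
loop_tap_up = [np.zeros((tap_dim, dim)) for _ in range(loops)]  # zero-init: no-op at start
x = np.ones((1, 4, dim))
out = encoder_tap_inject(x, np.ones((1, 4, tap_dim)), loop_tap_up, 0)
```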
Adds CRAWLER_LOOP_ROPE_SCALES: divides inv_freq per loop to widen attention range without extra parameters. CausalSelfAttention.forward and Block.forward accept optional cos_sin tuple; _run_crawler pre-computes per loop. run_all_ablations.sh runs all 4 series (choke/smear/tap/battery = 20 arms) sequentially using unified train_gpt.py, prints ranked summary with winners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
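Dividing `inv_freq` per loop is a one-line change; the sketch below shows its effect on the RoPE tables (function shape is illustrative, not the repo's `CausalSelfAttention` code). A larger `loop_scale` lengthens every rotation period, widening the effective attention range with zero extra parameters:

```python
import numpy as np

def rope_cos_sin(seq_len, head_dim, loop_scale=1.0, base=10000.0):
    """Per-loop RoPE tables: inv_freq is divided by loop_scale, slowing all
    rotations. Illustrative sketch of CRAWLER_LOOP_ROPE_SCALES."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim) / loop_scale
    angles = np.outer(np.arange(seq_len), inv_freq)
    return np.cos(angles), np.sin(angles)

cos1, _ = rope_cos_sin(128, 64, loop_scale=1.0)
cos2, _ = rope_cos_sin(128, 64, loop_scale=2.0)  # halved angles = wider range
```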
…t 0.5 All 5 arms complete. No arm cleared the ≥0.005 threshold. Best was slope=0.75 at −0.00065 vs control. Direction: higher slope slightly helps (negative gradient carries cross-loop corrections), pure relu_sq (0.0) is worst. Config unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Block.forward passes loop_idx to self.mlp when crawler loop is active. Standard MLP (used when choke=0) rejected the extra arg. Accept and ignore it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shaped bottleneck ablation informed by flat choke results: - BWC-04 (flat-512) wins at -0.013 BPB vs control - BWC-02 (flat-128) clears at -0.005 - Improvements in raw BPB, not just quant gap Five CrawlerMLP shapes added to train_gpt.py: - flat: existing (control reference) - pyramid: shared 3072→512 stage + cheap per-loop 512→C→512 - pyramid_res: pyramid with free residual (stage1 output = bypass) - grouped: block-diagonal per-loop down-projection, G equal groups - residual: shared bypass proj + per-loop delta at narrow dim 7 arms (BWCS-00 through BWCS-06), ~90 min on 1×H100. Key question: can pyramid-512 match flat-512 cheaper? Can grouped beat flat-512 with block-diagonal balanced routing? Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ONE variable: swap int6+zstd(22) for int6+brotli(q11). 1k step gate to check artifact size delta and confirm no blowups before full run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parent: BWX 9F (1.13868 int6_sw_bpb, 15.24MB) Changes: brotli compression (approved baseline) + GPTQ enabled (SKIP_GPTQ=0) Expected: ~-0.002 BPB from GPTQ, artifact stays under 16MB via brotli savings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Best-foot-forward production run on BWX 9F base: - LOOP_AWARE_GPTQ=1 (confirmed -0.00380 BPB in BW10, NOT standard GPTQ) - QK_GAIN_INIT=4.0 (high-confidence, -0.006 external signal) - CRAWLER_LOOPS=2 (BW17 RAPID: -0.054 directional, faster steps) - Brotli compression (approved from BW20 gate) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy of SLOT_brotli with two eval-only changes: - delta shape (1,1,dim) -> (bsz,1,dim) per-sample adaptation - SLOT_STEPS 8 -> 24 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
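The delta reshape is a pure broadcasting change; a toy demo of the two shapes (illustrative only, not the SLOT code):

```python
import numpy as np

# Shapes from the commit message; the arrays are toy placeholders.
bsz, seq, dim = 4, 16, 512
x = np.zeros((bsz, seq, dim))
shared_delta = np.zeros((1, 1, dim))        # one correction broadcast over the batch
per_sample_delta = np.zeros((bsz, 1, dim))  # independent correction per sample
y_shared = x + shared_delta
y_per_sample = x + per_sample_delta
```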
…aved) New architecture where crawler fires alongside every flat layer with bidirectional cross-injection via 32-dim gated projections. Two strands spiral together — flat builds unique representations while shared crawler continuously refines, cross-pollinating at every step. Gate: 3 arms (control, stride=3, stride=1) at 1k steps. HELIX=0 path identical to Ouroboros — warm start via zero-init. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single variable vs SLOT_brotli: shared delta (1,1,dim) + 24 steps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tiny model (dim=256, seq=512, 200 steps, compile=off) for rapid architecture prototyping. Tests: foundation, depth scaling, stride, bridge width, and interaction with existing crawler features. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SLOT_STEPS 24 with shared delta. Beats Slot Machine by 0.00849 BPB. Seed 444 pending for promotion gate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HELIX_CROSS_ATTN=1 enables content-addressed cross-injection between streams via single-head cross-attention (Q from source, K+V from target) instead of blind linear projection. A causal mask preserves autoregressive decoding. Micro suite now has 23 arms across 6 phases including linear vs cross-attn A/B at multiple dims, strides, and depths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
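A numpy sketch of the content-addressed routing, with the learned Q/K/V projections omitted for brevity (assumed single-head form; not the repo's implementation):

```python
import numpy as np

def helix_cross_inject(src, tgt):
    """Single-head cross-attention sketch: Q from the source stream, K and V
    from the target stream, with a causal mask so position t reads only
    target positions <= t."""
    T, d = src.shape
    scores = (src @ tgt.T) / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))      # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ tgt

rng = np.random.default_rng(0)
src = rng.standard_normal((6, 32))
tgt = rng.standard_normal((6, 32))
out = helix_cross_inject(src, tgt)
# Position 0 can only see target position 0, so out[0] == tgt[0].
```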
16-arm DGX Spark micro suite complete. Key findings: - Bridge width dim=64: -0.0464 BPB vs control (dominant lever) - Stride=5 (rare firing): -0.0107, fastest arm, better than stride=1 - Sequential loops + helix = wasteful (confirmed) - Existing features (inst/anchor/tap) don't stack with helix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arms) Following suite 1 findings (dim=64 breakout, stride=5 efficiency): - Controls: 7F no-helix, 5F 1loop no-helix (fill gaps) - Dim ceiling: 64/96/128/192 (does scaling continue?) - Best combo: stride=5 + dim=64/128, stride=3 + dim=64/128 - Marco-Polo: cross-attn at dim=16/32/64/128, stride=5+dim=64 - Depth scaling: 7F and 9F with dim=64 linear and marco-polo Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Research submission documenting the Frugendorff→Crawler→Ouroboros→Helix arc. Signal hunting through rapid ablation: 50+ arms, loop-aware GPTQ, width-vs- recursion analysis, position-agnostic cross-stream routing (Helix, in dev). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request on Apr 6, 2026
Control: Ouroboros PR openai#1308 baseline (9F+1Cx2, loop-aware GPTQ, QK4, brotli) Arm A: Noisy QAT — int6-calibrated differentiable noise on crawler blocks only (Evangeline Kamin PR openai#363: collapses quant gap 0.37→0.002 BPB on loops) Arm B: CRAWLER_QUANT_INT8=1 — int8 for shared crawler blocks, int6 for flat Arm C: SCORE contractive dt=0.5 on crawler loops (arXiv 2603.10544) x' = (1-dt)*x + dt*F(x) instead of x' = F(x) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
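The Arm C update rule is small enough to sketch directly; `F` below stands in for one crawler loop body (illustrative only):

```python
import numpy as np

def contractive_loop_step(x, F, dt=0.5):
    """SCORE-style contractive crawler update: x' = (1 - dt) * x + dt * F(x)
    instead of x' = F(x). dt < 1 damps depth-compounding perturbations
    (e.g. quantization error) across loops."""
    return (1.0 - dt) * x + dt * F(x)

# A fixed point of F is also a fixed point of the damped update.
x = np.full(8, 3.0)
out = contractive_loop_step(x, lambda v: v, dt=0.5)
```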
Ouroboros — Crawler Architecture Research
Non-record research submission documenting signal hunting through rapid ablation on a novel recurrent architecture. This is not a leaderboard attempt; it documents an ongoing research program exploring bidirectional cross-stream recurrence, which is a fundamentally different implementation from simply double-firing a layer.
Results
Hardware: 8×H100 SXM · 600s · 26.25M params · ~100.85ms/step · No TTT, no SLOT, no eval-time adaptation.
Research Progression
*Medusa S2 BPB is with DeltaNet disabled — the legal baseline after discovering the causality violation.
The Arc: Frugendorff → Crawler → Ouroboros → Helix
This is one checkpoint in a crawler research program spanning 6 PRs and 50+ ablation arms:
Methodology
Test concepts on a local GB10, escalate ablations to 2×/4× H100, then run the final 8×H100 research pass on the target hardware.
Technical Contributions
Crawler Base Architecture - (see crawler PR)
Loop-aware GPTQ — 2-phase Hessian calibration for shared-weight architectures. Standard GPTQ is dangerous on crawler (Frugendorff catastrophe: 1.38 → 5.7 BPB post-quant). Loop-aware recalibrates crawler importance on actual post-flat-quantized activations. Confirmed −0.00380 BPB.
Width-vs-recursion analysis — Quantified that the crawler's advantage is width reallocation, not recursion signal. Redirected research from adding crawler complexity toward maximizing flat depth and improving post-training quantization.
Position-agnostic cross-stream routing (Helix) — The bridge between flat and crawler streams deliberately has no positional encoding. RoPE handles WHERE in each stream; the bridge routes WHAT by content similarity. Distinct from the field's depth recurrence (simple layer replay) — this is bidirectional cross-injection between co-evolving streams.
Reproduce