
Ouroboros — 1.13727008 val_bpb (seed 444)#1283

Closed
newjordan wants to merge 405 commits into openai:main from newjordan:submission/ouroboros

Conversation


@newjordan commented Apr 3, 2026


Ouroboros

9-flat crawler with loop-aware GPTQ, QK gain 4.0, 2-loop cadence, and brotli compression — stacking five research signals on the Bandit Wagon 9F platform.

Results

| Seed | val_bpb (sliding window) | Steps | Size |
|---|---|---|---|
| 444 | 1.13727008 | 5951 | 15,034,550 B |
| 4 | 1.13565882 | 5963 | 15,042,594 B |
| 300 | 1.13638653 | 5948 | 15,049,936 B |
| mean | 1.13643848 | | 15,049,936 B |

Hardware: 8×H100 SXM · 600s wallclock · bytes_code: 121,677

Architecture

9-flat crawler with recurrent refinement: 9 unique flat transformer blocks followed by 1 shared crawler block looping 2× with differentiated RoPE scales (9,1,1). 26.25M parameters, ~100.85ms/step.
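
A minimal sketch of this topology, with `Block` standing in for the real attention+MLP block and the RoPE-scale plumbing abbreviated to a comment; all names here are illustrative, not the actual train_gpt.py API:

```python
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for the real attention+MLP block (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim))

    def forward(self, x, rope_scale=1.0):
        # the real block would divide its RoPE inv_freq by rope_scale
        return x + self.body(x)

class CrawlerSketch(nn.Module):
    """9 unique flat blocks followed by 1 weight-shared crawler block."""
    def __init__(self, dim=512, num_flat=9, crawler_loops=2,
                 loop_rope_scales=(9.0, 1.0, 1.0)):
        super().__init__()
        self.flat_blocks = nn.ModuleList(Block(dim) for _ in range(num_flat))
        self.crawler_block = Block(dim)            # same weights every loop
        self.crawler_loops = crawler_loops
        self.loop_rope_scales = loop_rope_scales   # CRAWLER_LOOP_ROPE_SCALES

    def forward(self, x):
        for block in self.flat_blocks:             # 9 unique transformer blocks
            x = block(x)
        for loop in range(self.crawler_loops):     # 2 recurrent passes
            x = self.crawler_block(x, rope_scale=self.loop_rope_scales[loop])
        return x
```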

Research signals stacked

This submission is the product of a systematic crawler research program that began with the beloved but ill-fated frugendorff; this is where the dream of recursion lives, and it will not die:

  1. Loop-aware GPTQ (confirmed −0.00380 BPB in the BW10 full run) — 2-phase Hessian calibration that re-collects crawler importance scores after flat-layer quantization, addressing the quantization hostility of shared weights (a minimal sketch follows this list)
  2. Brotli compression (approved via the BW20 gate) — quality=11 replaces zstd level=22 for ~5-15% smaller artifacts, freeing headroom under the 16MB cap for GPTQ overhead
  3. QK_GAIN_INIT=4.0 (high-confidence, −0.006 external signal across 45 runs/3 codebases) — sharper initial attention gradients via a per-head q_gain scalar (also sketched after this list)
  4. 2-loop cadence (directional −0.054 in BW17 DGX-Spark RAPID) — fewer loops = faster steps (100.85 vs 110.19ms) = 505 more training steps in 600s budget + smaller quant gap
  5. Optimized warmdown (WARMDOWN_ITERS=2000, confirmed 2000 > 3500 > 5000)
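
For item 1, a minimal sketch of the two-phase calibration, assuming hypothetical helpers `run_calibration` (maps each layer to its calibration input activations) and `gptq_quantize` (a standard GPTQ solve against a Hessian); none of these names come from the actual train_gpt.py:

```python
import torch

def proxy_hessian(acts):
    """GPTQ importance proxy: H = sum over calibration tokens of x x^T."""
    H = torch.zeros(acts[0].shape[-1], acts[0].shape[-1])
    for x in acts:                                   # x: (tokens, in_features)
        H += x.T @ x
    return H

def loop_aware_gptq(model, calib_batches):
    # Phase 1: calibrate once, then quantize the 9 unique flat layers.
    acts = run_calibration(model, calib_batches)     # hypothetical helper
    for layer in model.flat_blocks:
        gptq_quantize(layer, proxy_hessian(acts[layer]))   # hypothetical

    # Phase 2: re-run calibration THROUGH the now-quantized flat stack, so
    # the shared crawler weights see post-quantization activations summed
    # across both loops; only then quantize the crawler block.
    acts = run_calibration(model, calib_batches)
    gptq_quantize(model.crawler_block,
                  proxy_hessian(acts[model.crawler_block]))
```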
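And for item 3, the per-head gain as a sketch; the class name and tensor shapes are illustrative, not the actual attention code:

```python
import torch
import torch.nn as nn

class QKGainSketch(nn.Module):
    """One learnable scalar per head, initialized to QK_GAIN_INIT."""
    def __init__(self, num_heads=8, qk_gain_init=4.0):
        super().__init__()
        self.q_gain = nn.Parameter(torch.full((num_heads, 1, 1), qk_gain_init))

    def forward(self, q):                # q: (batch, heads, seq, head_dim)
        # a larger initial gain sharpens early attention logits and gradients
        return q * self.q_gain
```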

Key finding from our crawler signal analysis: at this configuration, the crawler's advantage is 85% width, 15% implicit regularization. This shifted focus from adding crawler complexity toward maximizing flat depth, reducing loop overhead, and improving post-training quantization. Next up is an inverse kramer resolution with a 6-7 oscillator.

Parent: Bandit Wagon X (BWX 9F)

| Metric | BWX 9F | Ouroboros | Delta |
|---|---|---|---|
| int6_sw_bpb | 1.13867894 | 1.13727008 | −0.00141 |
| bytes_total | 15,239,617 | 15,034,550 | −205,067 |
| step_ms | 110.19 | 100.85 | −9.34 |
| steps (600s) | 5446 | 5951 | +505 |

Reproduce

pip install brotli
SEED=444 \
MAX_WALLCLOCK_SECONDS=600 WARMDOWN_ITERS=2000 \
NUM_FLAT_LAYERS=9 NUM_CRAWLER_LAYERS=1 CRAWLER_LOOPS=2 \
USE_CRAWLER=1 COMPILE_FULLGRAPH=1 \
SKIP_GPTQ=0 LOOP_AWARE_GPTQ=1 QK_GAIN_INIT=4.0 \
GPTQ_CAL_SAMPLES=128 GPTQ_CAL_SEQ_LEN=2048 \
CRAWLER_LOOP_ROPE_SCALES=9,1,1 SKIP_EMA=1 \
MODEL_DIM=512 INST_DIM=32 CRAWLER_MLP_MULT=6.0 \
CRAWLER_TAP_DIM=0 ANCHOR_DIM=0 CRAWLER_MLP_CHOKE_DIM=0 \
XSA_LAST_N=11 BIGRAM_VOCAB_SIZE=2048 ROPE_DIMS=16 \
SWA_EVERY=50 MATRIX_LR=0.03 \
MLP_LEAKY_SLOPE=0.5 CRAWLER_MLP_LEAKY_SLOPE=0.5 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-03_Ouroboros_8xH100/train_gpt.py

Octavian and others added 30 commits March 30, 2026 12:50
…ix mlp=6.0 in arms

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove permanently-disabled features:
- DeltaNet (DeltaNetMemory, CanonicalDeltaNet, FLA import)
- MTP heads and loss computation
- LATE_QAT branch
- All GPTQ functions (gptq_calibrate, loop_aware, mixed_quantize_int6_gptq)
- GPTQ Hessian collection hooks in training loop
- Nitrust bridge
- EMA accumulation loop (SKIP_EMA=1 locked)

Naive int6 + zstd compression pipeline, crawler architecture, training loop all intact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tall

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Config verified: dim=512, 4F+1C×3, mlp=6.0, SKIP_GPTQ=1.
Beats CL3 3-seed mean (1.18742) by 0.00126. Seed 444 confirmed good.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…warmdown=0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was only downloading dataset shards, not the tokenizer — caused crash on fresh pod.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces wallclock cap with ITERATIONS= so proxy runs use identical
training compute on any GPU count. Default 500 steps (~6 min on 1xH100).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW ablation never ran BW-00 (4F+1C) as a 500-step proxy arm — all
comparisons were against a full-run anchor. This experiment closes that
gap with two arms:

  BW2-00: 4F+1C, XSA=11 — the missing control
  BW2-01: 5F+1C, XSA=14 — proportional XSA coverage for 18-block model

Also commits BW ablation results log (2026-03-30).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BW2-00 (4F+1C control, 500 steps): 1.52365 BPB
BW2-01 (5F+1C, XSA=14):            1.52963 BPB
BW-03  (5F+1C, XSA=11, ref):       1.54404 BPB

4F+1C wins by 0.020 BPB over 5F+1C at equal compute. Raw learning is
identical (~1.424 val_bpb) — delta is entirely quantization robustness.
BW-03's apparent win was an artifact of no proxy control arm.

Secondary: XSA coverage is a quant robustness lever (XSA=14 recovered
0.015 BPB vs XSA=11 for 5F). CL3 config (4F+1C) confirmed correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether wider XSA (cross-block attention bandwidth) reduces the
quantization gap on the optimal 4F+1C model. BW5F established that raw
learning is unaffected by XSA — gain is purely quant robustness.

  BWXSA-01: XSA_LAST_N=13 (87% of 15 blocks)
  BWXSA-02: XSA_LAST_N=15 (100% — ceiling)
  Control:  XSA_LAST_N=11 (73%, BW2-00: 1.52365) carried

Script records step_avg alongside BPB to directly measure speed tradeoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New train_gpt.py with CRAWLER_MLP_LEAKY_SLOPE env var — separates the
crawler block's leaky slope from flat blocks (which stay at 0.5 locked).
Default is 0.5, bit-equivalent to all prior runs when unset.

4 surgical edits to train_gpt.py only (new file, tested scripts untouched):
  - env var parse for CRAWLER_MLP_LEAKY_SLOPE
  - CrawlerGPT.__init__ new param
  - crawler_blocks construction uses crawler_mlp_leaky_slope
  - build_model() threads new param through

5 arms: slope=0.5 (control repin), 0.0, 0.25, 0.75, 1.0
BW3-00 must match BW2-00 (1.52365) to validate code change before reading results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
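
The parse step of that plumbing, as a sketch (matches the described default behavior; the constructor threading is elided):

```python
import os

# CRAWLER_MLP_LEAKY_SLOPE defaults to 0.5: bit-equivalent to prior runs when unset
crawler_mlp_leaky_slope = float(os.environ.get("CRAWLER_MLP_LEAKY_SLOPE", "0.5"))

# threaded through to the crawler block only; flat blocks keep slope=0.5 locked
# model = CrawlerGPT(..., crawler_mlp_leaky_slope=crawler_mlp_leaky_slope)
```
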
BWXSA-01 (XSA=13): 1.51982 BPB, 530ms/step
BWXSA-02 (XSA=15): 1.51431 BPB, 514ms/step  ← PROMOTED
Control  (XSA=11): 1.52365 BPB, 546ms/step

Counter-intuitive: full XSA coverage is 32ms/step FASTER than baseline.
Quant gap shrinks monotonically (0.099→0.095→0.090) — mechanism is
cross-block bandwidth smoothing quantization perturbation.

XSA=15 is both the BPB ceiling and the speed optimum. Gate at 2000 steps
before 8×H100. Pending: combine with crawler_mlp slope winner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrawlerMLP class with loop-specific choke_down/choke_up pairs
(512→3072→act→[C per-loop]→act→512). Flat blocks unchanged. Sweep
covers choke_dim ∈ {0, 32, 128, 256, 512} to find optimal quant
surface reduction. BWC-00 (dim=0) is control repin targeting 1.52365.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
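
One plausible reading of the arrow diagram in this commit, as a sketch (the real CrawlerMLP may differ; here the per-loop choke_up doubles as the down-projection whenever the choke is active):

```python
import torch.nn as nn
import torch.nn.functional as F

class CrawlerMLPSketch(nn.Module):
    """512 -> 3072 -> act -> [choke C, per loop] -> act -> 512 (one reading).
    choke_dim=0 reproduces the plain MLP (the BWC-00 control arm)."""
    def __init__(self, dim=512, hidden=3072, choke_dim=128, loops=3, slope=0.5):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)     # used only when choke is disabled
        self.slope = slope
        self.choke_dim = choke_dim
        if choke_dim:
            # one choke_down/choke_up pair per crawler loop
            self.choke_down = nn.ModuleList(
                nn.Linear(hidden, choke_dim) for _ in range(loops))
            self.choke_up = nn.ModuleList(
                nn.Linear(choke_dim, dim) for _ in range(loops))

    def forward(self, x, loop=0):
        h = F.leaky_relu(self.up(x), self.slope)                  # 512->3072->act
        if self.choke_dim == 0:
            return self.down(h)                                   # control arm
        h = F.leaky_relu(self.choke_down[loop](h), self.slope)   # ->C->act
        return self.choke_up[loop](h)                             # C->512
```
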
…er loops

Adds LoopSmearGate (~512 scalars, no matmuls) that blends each crawler
loop output with the previous loop output. Loop 0 smears against the
encoder output as a stable anchor. Attacks depth-compounding quant error
directly at loop boundaries. BWS-00/01 on/off ablation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
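
A sketch of what a gate like this could look like, assuming a sigmoid-gated per-channel blend; the parameterization and init are guesses, and only the "~512 scalars, no matmuls" constraint comes from the commit:

```python
import torch
import torch.nn as nn

class LoopSmearGate(nn.Module):
    """Per-channel blend of the current loop output with the previous one."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))   # ~512 scalars, even blend init

    def forward(self, current, previous):
        g = torch.sigmoid(self.gate)                 # per-channel, in (0, 1)
        return g * current + (1.0 - g) * previous

# usage inside the crawler loop; loop 0 smears against the encoder output:
# prev = encoder_out
# for loop in range(crawler_loops):
#     x = smear_gate(crawler_block(x), prev)
#     prev = x
```
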
Adds encoder tap infrastructure: frozen intermediate encoder layer outputs
projected to tap_dim and injected per-loop via loop_tap_up[loop]. Mirrors
FLOW pattern but anchors crawler to pre-drift encoder signal rather than
self-referential x. Sweeps tap_dim ∈ {16,32,64}, loop specificity, and
which encoder layers to tap. BWT-02 (dim=32, per-loop, all) is core hypothesis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
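
A sketch of the tap path under the stated design, with detach() standing in for "frozen" (hypothetical names, not the actual implementation):

```python
import torch.nn as nn

class EncoderTapSketch(nn.Module):
    """Project a frozen (detached) encoder-layer output to tap_dim, then
    inject it per crawler loop via a loop-specific up-projection."""
    def __init__(self, dim=512, tap_dim=32, loops=2):
        super().__init__()
        self.tap_down = nn.Linear(dim, tap_dim)
        self.loop_tap_up = nn.ModuleList(
            nn.Linear(tap_dim, dim) for _ in range(loops))

    def forward(self, x, enc_out, loop):
        tap = self.tap_down(enc_out.detach())   # pre-drift anchor, no grads back
        return x + self.loop_tap_up[loop](tap)
```
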
Adds CRAWLER_LOOP_ROPE_SCALES: divides inv_freq per loop to widen attention
range without extra parameters. CausalSelfAttention.forward and Block.forward
accept optional cos_sin tuple; _run_crawler pre-computes per loop.

run_all_ablations.sh runs all 4 series (choke/smear/tap/battery = 20 arms)
sequentially using unified train_gpt.py, prints ranked summary with winners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
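
A sketch of the per-loop cache, assuming conventional RoPE inv_freq and that each CRAWLER_LOOP_ROPE_SCALES entry divides it (consistent with the commit text; function names are illustrative):

```python
import torch

def loop_cos_sin(seq_len, head_dim, scale, base=10000.0):
    """Per-loop RoPE cache: dividing inv_freq by `scale` stretches the
    rotation wavelengths, widening attention range at zero parameter cost."""
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    inv_freq = inv_freq / scale           # one entry of CRAWLER_LOOP_ROPE_SCALES
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)      # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

# a _run_crawler-style precompute for the 9,1,1 schedule (assumed semantics):
# caches = [loop_cos_sin(T, head_dim, s) for s in (9.0, 1.0, 1.0)]
```
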
Octavian and others added 28 commits April 2, 2026 09:05
ONE variable: swap int6+zstd(22) for int6+brotli(q11). 1k step gate
to check artifact size delta and confirm no blowups before full run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
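
The swap itself reduces to a couple of lines; a sketch assuming the artifact is a single packed-int6 byte payload (function names are illustrative, not the real pipeline):

```python
import brotli  # pip install brotli

def pack_artifact(int6_payload: bytes) -> bytes:
    """Compress the packed int6 weights with brotli quality=11,
    the one-variable swap from zstd level=22."""
    return brotli.compress(int6_payload, quality=11)

def unpack_artifact(blob: bytes) -> bytes:
    return brotli.decompress(blob)
```
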
Parent: BWX 9F (1.13868 int6_sw_bpb, 15.24MB)
Changes: brotli compression (approved baseline) + GPTQ enabled (SKIP_GPTQ=0)
Expected: ~-0.002 BPB from GPTQ, artifact stays under 16MB via brotli savings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Best-foot-forward production run on BWX 9F base:
- LOOP_AWARE_GPTQ=1 (confirmed -0.00380 BPB in BW10, NOT standard GPTQ)
- QK_GAIN_INIT=4.0 (high-confidence, -0.006 external signal)
- CRAWLER_LOOPS=2 (BW17 RAPID: -0.054 directional, faster steps)
- Brotli compression (approved from BW20 gate)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy of SLOT_brotli with two eval-only changes:
- delta shape (1,1,dim) -> (bsz,1,dim) per-sample adaptation
- SLOT_STEPS 8 -> 24

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3-seed confirmed crawler SOTA (Bandit Wagon XI):
  seed 444: 1.13727008 BPB, 15,034,550 B
  seed   4: 1.13565882 BPB, 15,042,594 B
  seed 300: 1.13638653 BPB, 15,049,936 B
  mean:     1.13643848 BPB

9F crawler + loop-aware GPTQ + QK4 + 2-loop cadence + brotli

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newjordan (Author) commented:

Closing to resubmit under track_non_record_16mb. This is crawler architecture research, not a leaderboard attempt. Resubmission will include the full research context linking this to our earlier crawler PRs (#579, #990, #1028, #1140, #1208). See the non-record PR for the complete picture.

@newjordan newjordan closed this Apr 3, 2026
