Skip to content

Non-record: DepthScale — Parameter-Shared Iterative Transformer (1.1962 BPB)#1509

Open
Lumi-node wants to merge 1 commit intoopenai:mainfrom
Lumi-node:submission/depthscale-iterative-transformer
Open

Non-record: DepthScale — Parameter-Shared Iterative Transformer (1.1962 BPB)#1509
Lumi-node wants to merge 1 commit intoopenai:mainfrom
Lumi-node:submission/depthscale-iterative-transformer

Conversation

@Lumi-node
Copy link
Copy Markdown

@Lumi-node Lumi-node commented Apr 9, 2026

Non-Record: DepthScale + Cumulative Research Program (Parameter-Shared Iterative Transformer + Compression-First Thesis)

Author: Andrew Young (@Lumi-node) — Automate Capture Research
Track: Non-record submission (research contribution, architecture demonstration)


What This Submission Is

This non-record submission documents a six-week cumulative research program on the Parameter Golf challenge, framed from a compression-theory perspective rather than a pure-ML perspective. It contains:

  1. DepthScale — a parameter-shared iterative transformer architecture (val_bpb 1.1962, 3-seed mean, std 0.0005)
  2. The full DMEDI research methodology documenting 30+ controlled experiments, including negative results
  3. A documented thesis on the winning approach that predates its emergence in the leaderboard

All claims below are verifiable against the git commit history of this PR.


DepthScale Architecture

Standard 10-layer transformer:

Layer 0 → Layer 1 → ... → Layer 9        (10 unique weight sets, 10 layers depth)

DepthScale (5 layers × 2 iterations):

Iter 0: Layer 0 → Layer 1 → Layer 2 → Layer 3 → Layer 4
Iter 1: Layer 0 → Layer 1 → Layer 2 → Layer 3 → Layer 4
                                                          (5 weight sets, 10 effective layers)

Key architectural element: iteration-aware RoPE (positional frequencies shifted by ε × iteration), which lets the same physical layer behave differently across iterations. This distinguishes DepthScale from naive depth recurrence.

3-Seed Reproducibility (8×H100 SXM, PyTorch 2.4.1)

Seed val_bpb (int8 roundtrip) Pre-Quant BPB Artifact Size
1337 1.19674 1.1902 30.1 MB
42 1.19595 30.1 MB
2025 1.19581 30.2 MB
Mean 1.19617 (std 0.0005)

Limit: Artifact at int8+zlib is 30 MB, exceeding the 16 MB cap. This is non-record; submitted to demonstrate the architecture, not to score.


Why This Submission Matters Beyond the BPB Number

1. Independent Discovery of Depth Recurrence

DepthScale was developed and committed in this repo on 2026-04-09 (commit b680c78). At that time, parameter-shared depth was not yet the dominant SOTA technique. As of the merged leaderboard on 2026-04-27 (PR #1855, 1.0611 BPB), virtually every record submission uses some form of depth recurrence (loop layers 3-5, 3-layer recurrence, etc.).

This is convergent evidence: a small team operating from compression-theory principles arrived at the same architectural primitive that the broader community converged on through iterative leaderboard climbing.

2. Thesis That Predicted the Winning Approach

The document DMEDI/PARADIGM_SHIFT.md (committed 2026-03-27, commit 0ba15e0) argued that:

"The winning system isn't a better transformer — it's a multi-expert online compressor where the neural model is one component feeding into an adaptive context mixer that learns during eval. PAQ8 and CMIX achieve ~0.9-1.0 BPB on English text with NO pre-training, NO GPU. Add an 8MB neural model and the combination should be significantly better."

On 2026-04-30 (today, 34 days after our doc was committed), PR #1991 was submitted achieving 0.94290 BPB using exactly this approach: a byte-level PPM (Prediction by Partial Matching) mixer combined with a neural model at evaluation time. This is the PAQ-family context-mixing architecture our doc identified as the inevitable winning approach.

We did not build the PPM mixer ourselves — we lacked compute and time after the DepthScale work. But the strategic prediction is on the public record in this repository, time-stamped, before the technique appeared on the leaderboard.

git log DMEDI/PARADIGM_SHIFT.md will confirm.

3. Documented Negative Results

The DMEDI/ folder contains experiments that did not work, with explanations for why:

  • ADRQ (progressive quantization): -0.048 BPB worse — quantization noise steals model capacity
  • MLLA (multi-layer latent attention): ±0.001 noise — speed cost cancels depth gain
  • Progressive architecture growth: -0.016 worse — disrupts optimizer state
  • I4 quantization (4-bit STE): +0.42 BPB penalty — too aggressive for this scale
  • LeakyReLU² on ternary: +0.004 worse — different gradient dynamics
  • EMA on ternary 66M: doubles step time — infeasible

These results are documented in DMEDI/02_MEASURE.md, DMEDI/03_EXPLORE.md, and DMEDI/FULL_JOURNEY_SUMMARY.md. Negative results are valuable to the community.


Other Architectures Explored (Code Available, Not H100-Verified)

HyperScale (patched_scripts/train_hyperscale.py)

Context-conditioned weight generator. A small hypernetwork generates LoRA-style weight deltas conditioned on input context, so the effective model adapts per-document. Architecture is verified for syntax, forward/backward correctness, and torch.compile fullgraph compatibility. Not yet validated on H100 with FineWeb due to compute constraints. To our knowledge, no other PR in this competition has submitted a context-conditioned weight network.

NgramHash (patched_scripts/train_ngramhash.py)

N-gram feature hashing as an additional input signal — an early form of context augmentation in the same family as the now-validated PPM approach.

Curriculum Learning (patched_scripts/train_curriculum.py)

Hard-first document ordering. Validated to produce -0.017 BPB on the naive baseline (3 controlled experiments). On the SOTA stack the gain shrinks to -0.0006 — likely absorbed by better training dynamics. Not a record path on its own.


Cumulative Research Cost

  • 5 GPU sessions, ~$130 total compute
  • ~30 controlled experiments
  • 3 novel architectures designed and implemented
  • 19 research documents written under DMEDI/

Honest Statement of Limits

We do not claim to beat the leaderboard. As of submission, the merged SOTA is 1.0611 BPB and the open frontier is 0.943 BPB. Our best verified score is 1.1962 BPB.

The reasons we did not climb the leaderboard:

  1. We did not reproduce the 15-PR-deep base stack (SP8192, MuonEq-R, SDClip, parallel residuals, etc.)
  2. We did not implement Pre-Quant TTT, the dominant single technique (-0.04 BPB)
  3. We predicted the PPM-mixer approach but did not build it
  4. We chose to invest compute in novel architectures (DepthScale, HyperScale) rather than incremental stack optimization

This was a deliberate choice in research strategy, not a failure to attempt the leaderboard. The trade-off is reflected in the submission: weaker BPB, stronger novel ideas and a documented thesis that anticipated the field's trajectory.


Files in This PR

  • records/track_non_record_16mb/2026-04-09_DepthScale/README.md — full submission write-up
  • records/track_non_record_16mb/2026-04-09_DepthScale/submission.json — metadata
  • records/track_non_record_16mb/2026-04-09_DepthScale/train_gpt.py — DepthScale training script
  • records/track_non_record_16mb/2026-04-09_DepthScale/train_seed*.log — 3-seed train logs

Verification

All commit timestamps are independently verifiable:

git log --format="%h %ai %s" DMEDI/PARADIGM_SHIFT.md
# 0ba15e0 2026-03-27 21:45:07 -0500 Research: curriculum learning + ...

git log --format="%h %ai %s" experiments/submission_depthscale/
# b680c78 2026-04-09 14:25:32 -0500 Research: HyperScale thesis + ...

GitHub commit chain integrity prevents backdating without breaking remote refs.


Acknowledgments


Follow-Up Work

The HyperScale architecture (patched_scripts/train_hyperscale.py) is implementation-complete and verified for forward/backward correctness, DDP, and torch.compile compatibility on local hardware. Full H100 verification on FineWeb is planned and will be added as a follow-up non-record submission once compute is allocated.

…ormer

5 physical layers × 2 iterations = 10 effective depth via parameter sharing
with iteration-aware RoPE. 36.2M params, 768d. 3-seed mean: 1.1962 BPB
(std 0.0005). Artifact exceeds 16MB at int8 (30MB) — needs int6 to fit.

Novel architecture demonstrating parameter-shared depth as viable
compression strategy. Backed by DepthScale (YOCO) and anoLLM research.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 9, 2026 19:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new non-record submission bundle for “DepthScale” (parameter-shared iterative transformer with iteration-aware RoPE), including the training script and accompanying metadata/logs to document the reported ~1.1962 BPB result and the (currently non-compliant) ~30MB artifact size.

Changes:

  • Added a full training/eval script implementing the DepthScale iterative architecture and int8+zlib roundtrip serialization.
  • Added submission metadata (submission.json) plus README and summary logs documenting 3-seed results and artifact size.
  • Added a (currently incorrect/irrelevant) seed log file.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
records/track_non_record_16mb/2026-04-09_DepthScale_ParameterShared_Iterative_5Lx2_768d/train_seed1337.log Adds a seed log file, but content is an unrelated SSH timeout line.
records/track_non_record_16mb/2026-04-09_DepthScale_ParameterShared_Iterative_5Lx2_768d/train_gpt.py Introduces the DepthScale model/training pipeline, quantization, and roundtrip eval.
records/track_non_record_16mb/2026-04-09_DepthScale_ParameterShared_Iterative_5Lx2_768d/submission.json Adds submission metadata (name, BPB, bytes, blurb, author/date).
records/track_non_record_16mb/2026-04-09_DepthScale_ParameterShared_Iterative_5Lx2_768d/results_summary.log Adds summarized multi-seed results and artifact sizes.
records/track_non_record_16mb/2026-04-09_DepthScale_ParameterShared_Iterative_5Lx2_768d/README.md Adds architecture explanation, results table, and compliance notes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +14 to +17
TORCHDYNAMO_DISABLE=1 DEPTH_ITERS=4 python train_depthscale_i4.py

# 8xH100 competition run:
DEPTH_ITERS=4 torchrun --standalone --nproc_per_node=8 train_depthscale_i4.py
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage examples in the module docstring reference train_depthscale_i4.py, but this submission script is named train_gpt.py. Please update the commands so they point to the correct filename (or rename the file) to keep the reproduction instructions accurate.

Suggested change
TORCHDYNAMO_DISABLE=1 DEPTH_ITERS=4 python train_depthscale_i4.py
# 8xH100 competition run:
DEPTH_ITERS=4 torchrun --standalone --nproc_per_node=8 train_depthscale_i4.py
TORCHDYNAMO_DISABLE=1 DEPTH_ITERS=4 python train_gpt.py
# 8xH100 competition run:
DEPTH_ITERS=4 torchrun --standalone --nproc_per_node=8 train_gpt.py

Copilot uses AI. Check for mistakes.
distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grad_accum_steps = 8 // world_size can become 0 (if WORLD_SIZE>8) or silently change the intended global batch (if 8 % WORLD_SIZE != 0). Add the same validation used in the main train_gpt.py (WORLD_SIZE must be positive and must divide 8) to fail fast with a clear error.

Suggested change
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
if world_size <= 0 or 8 % world_size != 0:
raise ValueError(
f"Invalid WORLD_SIZE={world_size}: WORLD_SIZE must be positive and divide 8 "
"so grad_accum_steps = 8 // WORLD_SIZE is a positive integer."
)

Copilot uses AI. Check for mistakes.
Comment on lines +398 to +406
def __init__(self, dim, num_heads, num_kv_heads, rope_base, qk_gain_init):
super().__init__()
self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
self.head_dim = dim // num_heads
kv_dim = num_kv_heads * self.head_dim
self.c_q = I4Linear(dim, dim, bias=False)
self.c_k = I4Linear(dim, kv_dim, bias=False)
self.c_v = I4Linear(dim, kv_dim, bias=False)
self.proj = I4Linear(dim, dim, bias=False)
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CausalSelfAttention is missing the dimension sanity checks present in the repository’s main training script (e.g., model_dim % num_heads == 0, num_heads % num_kv_heads == 0, and head_dim even for RoPE). Without these, misconfigured env vars will fail later with hard-to-debug reshape/rotary errors; add explicit checks and raise a clear ValueError.

Copilot uses AI. Check for mistakes.
@MatoTeziTanka
Copy link
Copy Markdown

Community Review — Non-record: DepthScale — Parameter-Shared Iterative Transformer (1.1962 BPB)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1509 ("DepthScale-I4: Parameter-Shared Iterative Transformer") submits a 722-line pure-neural training script. Full read + targeted pattern search finds none of the flagged violation patterns. N-gram / hash-XOR bug — NOT PRESENT No occurrences of ctx_hash, full_key, primes[, bigram, trigram, or any XOR between a hash and a target token anywhere in the file. The only XOR-adjacent operation is standard input_ids embedding lookups; there is no n-gram lookup table at all. Pre-Quant TTT / val-token gradient update — NOT PRESENT val_tokens appears at lines 263, 267, 278, 551, 643, 713. Every call goes through eval_val() (line 263), which sets model.eval() and wraps everything in torch.inference_mode() (lines 273–274). No optimizer step or .backward() is called inside eval_val. The final roundtrip eval at line 713 also uses eval_val under inference mode. AdamW is not used anywhere; the optimizers are Muon, Adam (tok embeddings), and Adam (scalar params), all applied only to train-set batches. Score-first-per-chunk TTT / Scored-region SLOT — NOT PRESENT No TTT loop of any kind exists. No chunked val-token scoring with optimizer steps. No mask-optimize-score pattern. What the submission actually does: - Architecture: 5 physical transformer layers reused across N depth iterations (YOCO/DepthScale style) with iteration-aware RoPE (line 385). - Training: 4-bit STE quantization during training via I4Linear (lines 97–115), with float32 weight storage and int4-range quantize-in-forward. - Post-training: standard int8 per-row quantization + zlib compression for the submission artifact (lines 302–358, 690–699). - Optimizer: Muon for matrix params, Adam for embeddings and scalars; standard forward/backward on train shards only....

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTankaThe Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants