Record: PR1855/PR1953 base + Progressive context growth (val_bpb: 1.05759, 3-seed)#2014

Open
simonbissonnette wants to merge 2 commits into openai:main from simonbissonnette:submission/final-growth-candidate
Conversation


simonbissonnette commented Apr 30, 2026

Record candidate: SP8192 CaseOps + Progressive 3k Context Growth + Short-Doc Score-First TTT

val_bpb: 1.05759 (3-seed mean, std 0.00034) | val_loss: 2.31441 nats (std 0.00075) | 15.98 MB max | 8xH100 SXM | 600s train / 600s eval

Improvement over merged PR #1855 leaderboard record (1.06107587 BPB):
-0.00348 BPB / -0.00762 nats
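
For reference, BPB and nats are tied together by the original-byte accounting: total loss in nats is converted to bits and divided by the original byte count, so the reported pair implies a fixed bytes-per-token ratio. A quick sanity check (illustrative helper, not code from the submission):

```python
import math

def nats_to_bpb(mean_loss_nats: float, tokens: int, original_bytes: int) -> float:
    """Mean per-token loss in nats -> bits per original byte."""
    total_bits = mean_loss_nats * tokens / math.log(2)  # nats -> bits
    return total_bits / original_bytes

# Back out the implied bytes/token ratio from the reported numbers
# (2.31441 nats <-> 1.05759 BPB):
bytes_per_token = (2.31441 / math.log(2)) / 1.05759
print(f"~{bytes_per_token:.2f} original bytes per SP8192 token")  # ~3.16
```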

This stacks a progressive training-context schedule and a short-document TTT schedule on top of the late-April CaseOps/SP8192/LQER/SparseAttnGate/BOS-fixed SmearGate lineage. The direct leaderboard comparison is PR #1855, which is the current merged leader used here as the baseline.

Results

| Seed | Steps | ms/step | Train ms | Pre-quant BPB | Quant BPB | Post-TTT BPB | TTT eval s | Artifact bytes |
|------|-------|---------|----------|---------------|-----------|--------------|------------|----------------|
| 42   | 4,888 | 121.9 | 596,025 | 1.05993108 | 1.06833072 | 1.05740567 | 572.4 | 15,981,945 |
| 314  | 4,882 | 122.1 | 595,976 | 1.05975470 | 1.06832443 | 1.05730104 | 489.9 | 15,984,387 |
| 0    | 4,884 | 122.0 | 596,022 | 1.06072266 | 1.06902034 | 1.05807084 | 493.5 | 15,981,122 |
| Mean | 4,884.7 | 122.0 | 596,008 | 1.06013615 | 1.06855850 | 1.05759252 | 518.6 | 15,982,485 |

3-seed population std: 0.00034091 BPB / 0.00074604 nats.
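
The reported mean and population std can be reproduced directly from the per-seed post-TTT column above (plain-Python sketch):

```python
# Per-seed post-TTT BPB values from the results table.
seeds = {42: 1.05740567, 314: 1.05730104, 0: 1.05807084}

vals = list(seeds.values())
mean = sum(vals) / len(vals)
var = sum((v - mean) ** 2 for v in vals) / len(vals)  # population variance (ddof=0)
std = var ** 0.5
print(f"mean={mean:.8f} std={std:.8f}")  # matches 1.05759252 / 0.00034091
```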

All included seeds are under the 16,000,000-byte artifact cap and the 600s train/eval budgets as logged. The maximum artifact is 15,984,387 bytes and the maximum validation-data TTT pass is 572.4s.

Full validation coverage

All three logs evaluate the full CaseOps validation shard target set:

| Seed | val_tokens | target_tokens |
|------|------------|---------------|
| 42   | 47,853,343 | 47,853,343 |
| 314  | 47,853,343 | 47,853,343 |
| 0    | 47,853,343 | 47,853,343 |

The training script explicitly keeps the validation tail via EVAL_INCLUDE_TAIL=1. This avoids the older truncation of the validation set to a multiple of the context length, so the standard diagnostic eval and the quantized TTT eval agree on the same target token count.
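
The exact eval loop in train_gpt.py is not shown in this PR body; the sketch below just illustrates the tail-inclusive strided-window idea: each window scores only its new tokens, and one extra window is appended so the tail shorter than the context length is still scored instead of dropped. `eval_windows` is a hypothetical helper, not the submission's code.

```python
def eval_windows(n_tokens: int, seq_len: int, stride: int):
    """Yield (ctx_start, score_start, end) windows covering all n_tokens.

    Tokens in [ctx_start, score_start) are context only; tokens in
    [score_start, end) are scored. A final tail window covers whatever
    a multiple-of-context truncation would have discarded.
    """
    windows = []
    start = 0
    while start + seq_len <= n_tokens:
        # After the first window, only the last `stride` tokens are new.
        score_start = start if start == 0 else start + (seq_len - stride)
        windows.append((start, score_start, start + seq_len))
        start += stride
    last_end = windows[-1][2] if windows else 0
    if last_end < n_tokens:  # EVAL_INCLUDE_TAIL=1 behavior
        windows.append((max(0, n_tokens - seq_len), last_end, n_tokens))
    return windows
```

With seq_len=3072 and stride=1536 every validation token is scored exactly once, which is why val_tokens can equal target_tokens.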

The tokenizer, CaseOps transform, training shards, validation shard, and byte sidecar format are the same canonical HF-hosted CaseOps export used by the merged PR #1855 setup. If a reviewer already has the clean #1855/HF CaseOps data staged, those same staged shards can be reused here. The included tokenizer/prep files are present only to make this submission self-contained; the preferred reproduction path is to download the canonical HF CaseOps export directly.

What changed vs PR #1855

This submission keeps the same overall 11-layer SP8192 CaseOps recurrent-transformer family as PR #1855, then adds the following levers:

| Lever | Setting | Purpose |
|-------|---------|---------|
| Progressive train context | `[email protected],[email protected],[email protected]` | Train cheaply at 1k early, move to 2k for most of training, then finish at 3k context. |
| Final/eval context | `TRAIN_SEQ_LEN=3072`, `EVAL_SEQ_LEN=3072`, `TTT_EVAL_SEQ_LEN=3072`, `EVAL_STRIDE=1536` | Extend the final model and TTT scoring context beyond 2k without the 4k eval-time cost. |
| Long-context TTT mask | `TTT_MASK=no_qv`, `TTT_Q_LORA=0`, `TTT_V_LORA=0` | Keep K/O/MLP LoRA adaptation while removing Q/V adapters that were less helpful at longer context. |
| TTT local LR | `TTT_LOCAL_LR_MULT=0.75` | Slightly softer per-document LoRA adaptation. |
| Short-doc score-first chunks | `TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24`, default chunk 48 | Use smaller score-before-update chunks for short documents, preserving causality while improving adaptation. |
| TTT phases | `PHASED_TTT_NUM_PHASES=1`, `PHASED_TTT_PREFIX_DOCS=2500` | Single score-first phased pass with a 2500-doc prefix budget. |
| QK gain | `QK_GAIN_INIT=5.25` | Public long-context sweep result from the PR #1953 lineage. |
| Compression/quant stack | `COMPRESSOR=pergroup`, AWQ-lite, asymmetric logit rescale | Inherited from public late-April quantization/compression work stacked on the PR #1855 base. |
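
The progressive-context lever is driven by a `SEQLEN@FRAC` schedule string (the actual values are mangled to `[email protected]` by this page's email-protection rewriting). A plausible parser and lookup in wallclock mode might look like the following; the fractions used in the example are made up for illustration, not the submission's real values:

```python
def parse_seq_schedule(spec: str) -> list[tuple[int, float]]:
    """Parse a 'SEQLEN@START_FRAC,...' schedule string into sorted pairs."""
    pairs = []
    for item in spec.split(","):
        seq_len, start_frac = item.split("@")
        pairs.append((int(seq_len), float(start_frac)))
    return sorted(pairs, key=lambda p: p[1])

def seq_len_at(pairs: list[tuple[int, float]], wallclock_frac: float) -> int:
    """Context length in effect at a given fraction of the wallclock budget."""
    current = pairs[0][0]
    for seq_len, start_frac in pairs:
        if wallclock_frac >= start_frac:
            current = seq_len
    return current

# Illustrative fractions only -- the real schedule values are redacted above.
sched = parse_seq_schedule("1024@0.00,2048@0.30,3072@0.80")
```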

The short-doc TTT schedule does not train on future validation tokens. It only changes the chunk granularity used inside the existing score-before-update loop: each chunk is scored first, then the LoRA update is applied for future chunks.
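
A minimal sketch of that score-before-update rule, plus the short-doc chunk-size lookup. The `256:8,2000:24` string is read here as "docs up to 256 tokens use chunk 8, up to 2000 use chunk 24, otherwise the default 48" — an assumption about the flag's format, not something the PR body confirms:

```python
def short_doc_chunk_size(doc_len: int, schedule: str = "256:8,2000:24",
                         default: int = 48) -> int:
    # Assumed format: comma-separated "max_doc_len:chunk_size" pairs, ascending.
    for item in schedule.split(","):
        max_len, chunk = (int(x) for x in item.split(":"))
        if doc_len <= max_len:
            return chunk
    return default

def score_first_pass(doc_tokens, chunk_size, score_fn, update_fn):
    """Score each chunk with the adapter state built from EARLIER chunks
    only, then update; a chunk's update is visible only to later chunks."""
    total, n = 0.0, 0
    state = None  # LoRA adapter state; None = unadapted base model
    for start in range(0, len(doc_tokens), chunk_size):
        chunk = doc_tokens[start:start + chunk_size]
        total += score_fn(chunk, state)   # score BEFORE the update
        n += len(chunk)
        state = update_fn(chunk, state)   # affects future chunks only
    return total / max(n, 1)
```

Shrinking `chunk_size` for short documents gives the adapter more update opportunities per document without ever letting a token's score see an update derived from that token or anything after it.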

Architecture and training stack

| Component | Setting |
|-----------|---------|
| Model | 11 layers, 512d, 8 query heads, 4 KV heads, MLP 4x |
| Tokenizer/data | SP8192 CaseOps lossless caps with byte sidecar accounting |
| RoPE | Partial RoPE, 16 dims |
| Recurrence | Layers 3-5 looped, enabled at frac=0.35 |
| Parallel decoder | Parallel lane from layer 8, mean final lane |
| XSA | All 11 layers |
| Gates | BOS-fixed SmearGate, SparseAttnGate with gate_window=12, scale 0.5 |
| Optimizer | Muon on matrix params, Adam on embedding/scalars, BETA2=0.99 |
| EMA | ema_decay=0.9965 |
| Quantization | GPTQ int6 matrices, int7 embeddings, LQER asymmetric rank-4 correction |
| GPTQ reserve | GPTQ_RESERVE_SECONDS=4.0; logs show gptq:reserving 4s, effective=596000ms |
| Compression | Per-group compression |
| TTT | Quantized phased LoRA TTT, score-first, no_qv mask, short-doc chunk schedule |

Compliance notes

  • Artifact cap: all seeds <= 15,984,387 bytes.
  • Training wallclock: all training loops stop around 596.0s with GPTQ_RESERVE_SECONDS=4.0; GPTQ hessian collection is logged immediately after (67 Hessians in 4.1s) for transparency.
  • Eval wallclock: all validation-data TTT passes are <= 572.4s. The ttt_lora:compile warmup uses random tokens and no validation data; it is logged separately from total_eval_time.
  • Score-before-update: quantized_ttt_phased scores each chunk before applying that chunk's LoRA update. The short-doc schedule only changes chunk size.
  • Full validation targets: val_tokens == target_tokens == 47853343 in all included logs.
  • No validation data in training: training uses only training shards. TTT accesses validation documents left-to-right under the score-first rule.
  • No external cache or direct memorization: no SLOT, n-gram cache, PPM mixture, logit bias table, or validation-derived precomputation.
  • Original-byte BPB: CaseOps byte sidecar accounting is preserved.

Reproduction

Install the dependencies in requirements.txt. FlashAttention 3 and the lrzip system binary are noted there because they require separate install paths.

This submission uses the clean canonical CaseOps SP8192 export hosted on Hugging Face. The logs were produced from a 50,000-document validation split with 80 training shards (train_shards: 80, ttt_phased: total_docs:50000, and val_tokens == target_tokens == 47853343 in every included log).

Preferred data setup:

```bash
python3 - <<'PY'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="romeerp/parameter-golf-caseops-v1",
    repo_type="dataset",
    local_dir="./data/datasets/fineweb10B_sp8192_caseops",
    allow_patterns=[
        "datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/*",
        "datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model",
    ],
    max_workers=8,
)
PY
```

Then set:

```bash
DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved
TOKENIZER_PATH=./data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
```

Fallback local rebuild: if the HF export is unavailable, rebuild from the canonical docs_selected.jsonl with the included prepare_caseops_data.py, lossless_caps.py, and tokenizer. Use --val-docs 50000 and write into a fresh output directory. The prep script now defaults to 50,000 validation docs and refuses to write over existing fineweb_*.bin shards unless --overwrite is passed, to avoid accidentally mixing stale validation shards with a new train split.
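
The overwrite guard described above might look roughly like this (a sketch; the actual prepare_caseops_data.py logic is not included in this excerpt, and `check_output_dir` is a hypothetical helper name):

```python
from pathlib import Path

def check_output_dir(out_dir: str, overwrite: bool = False) -> Path:
    """Refuse to write into a directory that already holds fineweb_*.bin
    shards unless --overwrite was passed, so a fresh train split is never
    mixed with stale validation shards."""
    path = Path(out_dir)
    existing = sorted(path.glob("fineweb_*.bin")) if path.exists() else []
    if existing and not overwrite:
        raise SystemExit(
            f"{out_dir} already contains {len(existing)} fineweb_*.bin "
            "shards; pass --overwrite to replace them"
        )
    path.mkdir(parents=True, exist_ok=True)
    return path
```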

Run one seed at a time, replacing DATA_PATH and TOKENIZER_PATH with the staged CaseOps paths:

```bash
for SEED in 42 314 0; do
  NCCL_NET=Socket \
  DATA_DIR=./data \
  DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  CASEOPS_ENABLED=1 \
  VOCAB_SIZE=8192 \
  ITERATIONS=20000 \
  MAX_WALLCLOCK_SECONDS=600 \
  EVAL_INCLUDE_TAIL=1 \
  TRAIN_SEQ_LEN=3072 \
  ROPE_TRAIN_SEQ_LEN=3072 \
  [email protected],[email protected],[email protected] \
  TRAIN_SEQ_SCHEDULE_MODE=wallclock \
  SEQ_CHANGE_WARMUP_STEPS=32 \
  EVAL_SEQ_LEN=3072 \
  EVAL_STRIDE=1536 \
  TTT_ENABLED=1 \
  TTT_EVAL_SEQ_LEN=3072 \
  TTT_BATCH_SIZE=24 \
  TTT_CHUNK_SIZE=48 \
  TTT_SHORT_SCORE_FIRST_ENABLED=1 \
  TTT_SHORT_DOC_LEN=2000 \
  TTT_SHORT_CHUNK_SIZE=24 \
  TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24 \
  TTT_LORA_RANK=80 \
  TTT_LORA_LR=0.0001 \
  TTT_LOCAL_LR_MULT=0.75 \
  TTT_MASK=no_qv \
  TTT_Q_LORA=0 \
  TTT_V_LORA=0 \
  TTT_WEIGHT_DECAY=0.5 \
  TTT_BETA2=0.99 \
  PHASED_TTT_PREFIX_DOCS=2500 \
  PHASED_TTT_NUM_PHASES=1 \
  WARMDOWN_FRAC=0.85 \
  BETA2=0.99 \
  QK_GAIN_INIT=5.25 \
  SPARSE_ATTN_GATE_ENABLED=1 \
  SPARSE_ATTN_GATE_SCALE=0.5 \
  GATED_ATTN_QUANT_GATE=1 \
  SMEAR_GATE_ENABLED=1 \
  GATE_WINDOW=12 \
  FUSED_CE_ENABLED=1 \
  MATRIX_LR=0.026 \
  MIN_LR=0.1 \
  GRAD_CLIP_NORM=0.3 \
  EMBED_BITS=7 \
  EMBED_CLIP_SIGMAS=14.0 \
  MATRIX_CLIP_SIGMAS=12.85 \
  ATTN_CLIP_SIGMAS=13.0 \
  MLP_CLIP_SIGMAS=11.5 \
  LQER_ENABLED=1 \
  LQER_RANK=4 \
  LQER_TOP_K=3 \
  LQER_FACTOR_BITS=4 \
  LQER_ASYM_ENABLED=1 \
  LQER_ASYM_GROUP=64 \
  AWQ_LITE_ENABLED=1 \
  AWQ_LITE_BITS=8 \
  AWQ_LITE_GROUP_TOP_K=1 \
  AWQ_LITE_GROUP_SIZE=64 \
  ASYM_LOGIT_RESCALE=1 \
  GPTQ_RESERVE_SECONDS=4.0 \
  GPTQ_CALIBRATION_BATCHES=16 \
  COMPRESSOR=pergroup \
  VAL_LOSS_EVERY=0 \
  SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py \
      > train_seed${SEED}.log 2>&1
done
```

Included files

Lineage and credits

This submission is a stack on top of the public CaseOps/SP8192 record lineage (PR #1855, PR #1953).

The new contribution here is the combination of progressive 3k train/eval context growth with the short-document score-first TTT chunk schedule, while preserving the full validation target count and staying under the artifact/eval budgets.

Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
Pull PR openai#2014's record dir from openai/parameter-golf and reproduce its 1.05759
3-seed mean. Key new levers vs openai#1953: EVAL_SEQ_LEN=3072, train_seq_schedule
1024->2048->3072, single-phase TTT (NUM_PHASES=1, PREFIX=2500), short-doc
score-first chunking (TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24).

Even with our infra's ~1.5-2 milli-BPB inflation pattern, reproducing openai#2014
should land ~1.0590 — close enough to record bar to potentially clear it.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Port the 040C 'middle 5x / late 3.4x' allocation onto simonbissonnette's
progressive-3k base (openai#2014) and screen vs uniform 4.0 baseline. Training-only,
4xH100 1200s, single seed. Code on exp/300-040c-on-2014 @ d174313.

Spec flags the column-slice-in-compile hazard from feedback memory and
mandates a compile-sanity check before scaling. PREQUANT_ONLY=1 keeps the
screen cheap by skipping serialize/GPTQ/TTT.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 1, 2026
…ean 1.05831 BPB

Clears record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108),
p < 0.0001. All 3 seeds produce 15.99 MB artifacts under the 16 MB cap, all under
the 600s wallclock budget.

Per-seed:
- 42:   ttt=1.05793  art=15,986,149  eval=572.6s
- 314:  ttt=1.05852  art=15,987,257  eval=553.7s
- 1234: ttt=1.05849  art=15,989,895  eval=574.1s

Submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/
contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a
detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923
-> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition
C1-C4 legality check. submission.json author/github_id are placeholders pending
the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single
8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Idan3011 pushed a commit to Idan3011/parameter-golf that referenced this pull request May 1, 2026
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
AsymLogit Rescale (PR openai#1923) ported as 2 TTT-adaptable scalar params (softcap_pos, softcap_neg).
Pre-quant 1.06160 (slightly worse than S55's 1.06058 — AsymLogit hurts un-adapted model).
TTT recovery -0.01267 (much better than S55's -0.01103) — AsymLogit gives massive adaptive capacity.
Final 1.05759 = -0.00055 vs S55. Single-seed matches PR openai#2014's 3-seed mean.
Eval 521.7s (under 600s cap), Size 15,946,610.
softcap_pos and softcap_neg init to logit_softcap=30.0, adapted per-doc via TTT-LoRA optimizer.
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
User pushed back on openai#2014's LEAK call as too inference-based. Verified directly:
- README says "uses same shards as PR openai#1855. If you don't have them, prepare
  with included prepare_caseops_data.py" — phrasing implies inheritance from
  openai#1855 (LEAK) but doesn't explicitly invoke prep
- No setup.sh, no shell script invoking prep
- No HF download script
- Path /dev/shm/pgolf_caseops_data_80_l17_final is custom flat RAM-disk dir
  (not triple-nested local-prep signature)
- Could be either HF-flattened download OR local-prep copy

Demoted openai#2014 from LEAK to AMBIGUOUS (lean LEAK based on "same shards as openai#1855"
English, but not iron-clad).

Updated tally: CLEAN 9, LEAK 20 (was 21), AMBIGUOUS 4 (was 3), INHERIT 1.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
…E_OUTSIDE=0

Seed 314: pre-quant 1.06128 / quant 1.06962 / final 1.05701 / eval 571.7s
Compliance: ngram_hint_precompute_outside=False, precompute (166.95s) INSIDE timer per PR openai#1514 precedent.
Token-only tilt: within_gate=0, word_gate=0 - legal per PR openai#1514.
Size 15,943,530 bytes.
Single seed beats openai#2014's 3-seed mean (1.05759).
Validating seeds 42 and 1234.
varunneal added a commit to varunneal/parameter-golf that referenced this pull request May 1, 2026
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB.
Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB.
Beats PR openai#2060 (1.05792) by 0.00122 BPB.

Stack:
- Token-only n-gram tilt (PR openai#1514 merged precedent, within/word channels disabled)
- AsymLogit Rescale (2 trainable scalars adapted by global TTT)
- 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5)
- PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014)
- NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE eval timer per PR openai#1514)

Compliance:
- All seeds eval ≤533.1s (cap 600s, 67-80s margin)
- All artifacts ≤15.95MB (cap 16MB)
- Token-only n-gram channel (within_gate=0, word_gate=0)
- Score-first TTT (per PR openai#402)
varunneal added a commit to varunneal/parameter-golf that referenced this pull request May 1, 2026
codemath3000 added a commit to codemath3000/parameter-golf that referenced this pull request May 2, 2026