Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
# Record Candidate: Long-Context no-QV TTT with Rank-56 / Prefix-3000 Compute Reallocation

**val_bpb = 1.05874877** (3-seed mean, population std 0.00091680) | **max artifact 15,980,110 bytes** | **8x H100 SXM** | **strict 600s train / 600s eval logs**

This submission is a clean, score-first Track B refinement of the late-April CaseOps / LQER / SparseAttnGate / phased-TTT stack. It keeps the long-context `no_qv` TTT setup from PR #1953 and reallocates the TTT eval budget by reducing LoRA rank to 56 while increasing the phased global-TTT prefix to 3000 documents.

Leaderboard context at submission time:

- vs merged PR #1855 (`1.06108`): **-0.00233123 BPB**, approximately **-0.00510 nats**.
- vs open PR #1953 (`1.05855370`): **+0.00019507 BPB** worse. This PR is submitted as a clean record candidate against the merged leaderboard and as a reproducible ablation on the #1953 lineage, not as a claim to beat #1953 if that PR is accepted first.

## 3-Seed Results

| Seed | Stop step | Train ms | Pre-quant BPB | Quant no-TTT BPB | Final TTT BPB | Eval ms | Artifact bytes |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 42 | 4900 | 596051 | 1.06108950 | 1.06949683 | **1.05780842** | 519467 | 15,975,989 |
| 0 | 4879 | 596086 | 1.06200352 | 1.07034149 | **1.05844590** | 425873 | 15,976,674 |
| 1234 | 4880 | 596146 | 1.06319088 | 1.07176584 | **1.05999198** | 400555 | 15,980,110 |
| **Mean** | **4886** | **596094** | **1.06209463** | **1.07053472** | **1.05874877** | **448632** | **15,977,591** |

All three seeds satisfy:

- artifact < 16,000,000 bytes
- training wallclock < 600,000 ms
- final TTT eval wallclock < 600,000 ms
- full validation set with CaseOps byte-sidecar BPB accounting

## What Changed

Relative to the PR #1953 long-context `no_qv` stack:

```bash
TTT_LORA_RANK=56 # PR #1953 used rank 80 in its reproduction command
PHASED_TTT_PREFIX_DOCS=3000 # PR #1953 used 2500
PHASED_TTT_NUM_PHASES=3
EVAL_SEQ_LEN=2560
TTT_EVAL_SEQ_LEN=2560
TTT_MASK=no_qv
TTT_Q_LORA=0
TTT_V_LORA=0
TTT_LOCAL_LR_MULT=0.75
QK_GAIN_INIT=5.25
```

The intent is a simple eval-budget tradeoff: smaller per-document LoRA state with a longer score-first global-TTT prefix. The result is seed-sensitive: it improves seeds 42 and 0 relative to the public #1953 numbers, but seed 1234 is weaker, so the 3-seed mean lands behind #1953 while still clearing the merged #1855 record by the 0.005-nat threshold.

## Key Inherited Components

This submission inherits the core stack from the public #1953 / #1945 / #1908 / #1855 lineage:

- CaseOps lossless tokenizer with validation byte sidecar
- BOS-fixed SmearGate and SparseAttnGate
- LQER asymmetric rank-4 quantization
- AWQ-lite mixed-precision GPTQ
- asymmetric logit rescale
- fused softcapped CE
- long-context score-first phased TTT with `no_qv` LoRA mask
- per-group lrzip artifact compression

## Compliance Against Issue #1017 Conditions

- **C1 strict causal dependence:** every token is scored from the submitted artifact plus strict prefix state only. CaseOps is a deterministic lossless transform; the byte sidecar supplies denominator accounting and does not affect predictions.
- **C2 full normalized distribution:** the neural model emits a standard full-vocabulary distribution over the 8192 CaseOps token ids. No byte-level scorer, entropy expert, PPM mixture, n-gram cache, or target-conditioned gate is used.
- **C3 score-before-update:** phased TTT scores chunks/documents before any LoRA or global update can use those scored tokens. Updated state is used only for later tokens/documents.
- **C4 single left-to-right pass:** every validation token is scored once. No rescoring, no multi-pass selection, and no oracle best-of-k over validation outcomes.

Additional exclusions:

- no SLOT
- no byte-PPM or token-PPM mixer
- no pre-quant TTT on validation data
- no eval-time logit bias selected from the observed target
- no validation-token leakage into training weights before scoring

## Reproduction

The CaseOps dataset/tokenizer are the same public data path used by the inherited stack (`romeerp/parameter-golf-caseops-v1`):

```bash
pip install brotli sentencepiece huggingface-hub
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Fetch / prepare the CaseOps SP8192 dataset used by the inherited stack.
# DATA_PATH should contain fineweb_train_*.bin, fineweb_val_*.bin, and fineweb_val_bytes_*.bin.

export DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved
export TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
export CASEOPS_ENABLED=1
export VOCAB_SIZE=8192
export MAX_WALLCLOCK_SECONDS=600
export VAL_LOSS_EVERY=0
export TTT_ENABLED=1
export PHASED_TTT_ENABLED=1
export PHASED_TTT_NUM_PHASES=3
export PHASED_TTT_PREFIX_DOCS=3000
export TTT_LORA_RANK=56
export TTT_MASK=no_qv
export TTT_Q_LORA=0
export TTT_V_LORA=0
export TTT_LOCAL_LR_MULT=0.75
export TTT_BETA2=0.99
export TTT_WEIGHT_DECAY=0.5
export TTT_CHUNK_SIZE=48
export EVAL_SEQ_LEN=2560
export TTT_EVAL_SEQ_LEN=2560
export QK_GAIN_INIT=5.25
export MATRIX_LR=0.026
export MIN_LR=0.1
export WARMDOWN_FRAC=0.85
export BETA2=0.99
export EMBED_BITS=7
export MATRIX_CLIP_SIGMAS=12.85
export ATTN_CLIP_SIGMAS=13.0
export MLP_CLIP_SIGMAS=11.5
export EMBED_CLIP_SIGMAS=14.0
export GRAD_CLIP_NORM=0.3
export FUSED_CE_ENABLED=1
export SMEAR_GATE_ENABLED=1
export GATE_WINDOW=12
export SPARSE_ATTN_GATE_ENABLED=1
export SPARSE_ATTN_GATE_SCALE=0.5
export GATED_ATTN_QUANT_GATE=1
export LQER_ENABLED=1
export LQER_RANK=4
export LQER_TOP_K=3
export LQER_GROUP_SIZE=64
export LQER_ASYM_ENABLED=1
export LQER_ASYM_GROUP=64
export AWQ_LITE_ENABLED=1
export ASYM_LOGIT_RESCALE=1
export GPTQ_RESERVE_SECONDS=4.0
export GPTQ_CALIBRATION_BATCHES=16
export COMPRESSOR=pergroup
export NCCL_NET=Socket

SEED=42 RUN_ID=rank56_prefix3000_seed42 ARTIFACT_DIR=./artifacts/seed42 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=0 RUN_ID=rank56_prefix3000_seed0 ARTIFACT_DIR=./artifacts/seed0 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=1234 RUN_ID=rank56_prefix3000_seed1234 ARTIFACT_DIR=./artifacts/seed1234 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Included Files

- `README.md`
- `submission.json`
- `train_gpt.py`
- `train_seed42.log`
- `train_seed0.log`
- `train_seed1234.log`

## Credits

This is a small compute-allocation refinement on a broad community stack. Credit belongs primarily to the authors of the inherited lineage:

- @codemath3000 for PR #1855
- @alertcat, @romeerp, @jorge-asenjo, and @aquariouseworkman for PR #1945 / #1908 / #1923 components
- @nprime06, @dexhunter, @samacqua, and earlier VarLen/phased-TTT contributors
- @romeerp and @dexhunter for the CaseOps tokenizer and byte-sidecar precedent

The contribution here is the rank-56 / prefix-3000 score-first TTT ablation and the fully logged 3-seed rerun under the standard constraints.
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
{
"author": "Himanshu Dongre",
"github_id": "himanshudongre",
"name": "Long-context no-QV phased TTT with rank-56 / prefix-3000 compute reallocation",
"date": "2026-04-30",
"track": "10min_16mb",
"val_bpb": 1.05874877,
"val_bpb_std": 0.00091680,
"seeds": [
42,
0,
1234
],
"seed_results": {
"42": {
"stop_step": 4900,
"train_ms": 596051,
"prequant_bpb": 1.06108950,
"quant_bpb": 1.06949683,
"val_bpb": 1.05780842,
"eval_ms": 519467,
"artifact_bytes": 15975989
},
"0": {
"stop_step": 4879,
"train_ms": 596086,
"prequant_bpb": 1.06200352,
"quant_bpb": 1.07034149,
"val_bpb": 1.05844590,
"eval_ms": 425873,
"artifact_bytes": 15976674
},
"1234": {
"stop_step": 4880,
"train_ms": 596146,
"prequant_bpb": 1.06319088,
"quant_bpb": 1.07176584,
"val_bpb": 1.05999198,
"eval_ms": 400555,
"artifact_bytes": 15980110
}
},
"hardware": "8x H100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "PR #1953 long-context no-QV CaseOps/LQER/AWQ-lite stack with TTT_LORA_RANK=56 and PHASED_TTT_PREFIX_DOCS=3000",
"comparison": {
"merged_pr_1855_bpb": 1.06108,
"delta_vs_merged_pr_1855_bpb": -0.00233123,
"delta_vs_merged_pr_1855_nats_approx": -0.00510,
"open_pr_1953_bpb": 1.05855370,
"delta_vs_open_pr_1953_bpb": 0.00019507
},
"run_overrides": {
"CASEOPS_ENABLED": 1,
"EVAL_SEQ_LEN": 2560,
"TTT_EVAL_SEQ_LEN": 2560,
"TTT_MASK": "no_qv",
"TTT_Q_LORA": 0,
"TTT_V_LORA": 0,
"TTT_LOCAL_LR_MULT": 0.75,
"TTT_LORA_RANK": 56,
"PHASED_TTT_PREFIX_DOCS": 3000,
"PHASED_TTT_NUM_PHASES": 3,
"QK_GAIN_INIT": 5.25,
"COMPRESSOR": "pergroup"
},
"compliance": {
"train_under_600s": true,
"artifact_under_16mb": true,
"eval_under_600s": true,
"three_seeds": true,
"full_validation_set": true,
"caseops_byte_sidecar_accounting": true,
"score_first_ttt": true,
"no_slot": true,
"no_byte_ppm": true,
"no_token_ppm": true,
"no_ngram_cache": true,
"no_eval_time_logit_bias": true,
"no_pre_quant_ttt_on_val": true,
"single_pass_eval": true
},
"attribution": {
"base_stack": "PR #1953 / #1945 / #1908 / #1855 lineage",
"caseops": "@romeerp and @dexhunter",
"phased_ttt": "@samacqua, @dexhunter, and VarLen/phased-TTT contributors",
"rank56_prefix3000_ablation": "@himanshudongre"
}
}
Loading