openai · jorge-asenjo · Apr 29, 2026 · Apr 30, 2026
diff --git a/...s/track_10min_16mb/2026-04-30_SP8192_AsymLogit_AWQLite_FullVal_1.0597/README.md b/...s/track_10min_16mb/2026-04-30_SP8192_AsymLogit_AWQLite_FullVal_1.0597/README.md
@@ -0,0 +1,126 @@
+# SP8192 PR #1855 Base + Asymmetric Logit Rescale + AWQ-lite
+
+**Score: 1.05971 BPB** (3-seed mean, full val partition, seeds 42 / 0 / 1234)
+
+| Seed | val_bpb | val_loss | train wallclock | eval | artifact |
+|------|--------:|---------:|----------------:|-----:|---------:|
+| 42   | 1.06030 | 2.32033 | 599.6s | 532s | 15,944,044 B |
+| 0    | 1.05970 | 2.31902 | 599.6s | 419s | 15,941,061 B |
+| 1234 | 1.05912 | 2.31776 | 599.5s | 457s | 15,951,087 B |
+| **mean** | **1.05971** | **2.31904** | **599.6s** | **469s** | **15,945,397 B** |
+| std (pop) | 0.000478 | — | — | — | — |
+
+`val_tokens: 47,851,520` on every seed (full validation partition, identical to #1855's measurement).
+
+vs current rank 1 (PR #1855 1.06108): **−0.00137 BPB** — Welch t≈4.96 with 2 dof, p < 0.05.
+
+## Submission history
+
+This PR was originally opened on 2026-04-29 reporting 1.06577 BPB (3-seed mean). On review, @codemath3000 noted that the `val_tokens` line in our seed-42 log read `9,662,464` instead of the standard `47,851,520`, indicating measurement on a truncated validation partition. We confirmed and root-caused: a corrupted `fineweb_val_000000.bin` (19 MB instead of 95.7 MB) in the local network volume. Re-pulling the file directly from `huggingface.co/datasets/romeerp/parameter-golf-caseops-v1` restored the full 47.85M-token val partition.
+
+While re-running on the full val we extended the stack with a second orthogonal addition on top of #1855 (AWQ-lite mixed-precision GPTQ, concurrently identified with @romeerp's PR #1908). The current submission reflects:
+
+- corrected val measurement (full partition);
+- the original Asymmetric Logit Rescale lever; and
+- the AWQ-lite addition.
+
+`submission.json` carries both revisions for transparency.
+
+## Approach
+
+Two **orthogonal eval-path additions** on top of the verbatim PR #1855 stack:
+
+### 1. Asymmetric Logit Rescale
+
+Two learnable scalars `softcap_pos`, `softcap_neg` replace the single `logit_softcap` in `forward_logits` and `forward_ttt`. Logits are split by sign:
+
+```python
+torch.where(logits > 0,
+            softcap_pos * tanh(logits / softcap_pos),
+            softcap_neg * tanh(logits / softcap_neg))
+```
+
+Both initialized to `logit_softcap = 30.0` so eval is identity at start. Trained inside Phased TTT (passed through global SGD as part of `model.parameters()`). They settle at slightly different values for the positive/negative tails of the next-token distribution.
+
+The fused softcapped-CE Triton kernel on the training path is left unchanged (single softcap), so **train-time numerics match #1855 exactly**. The two scalars receive no gradient during training and stay at init until eval.
+
+Artifact cost: 8 bytes (2 fp16 scalars in the passthrough float16 list). Compressed cost is below the lrzip block overhead, effectively free.
+
+### 2. AWQ-lite mixed-precision GPTQ
+
+During GPTQ calibration, collect activation RMS per layer; select the most-salient 64-column group; keep that group at int8 inside the GPTQ solve while the remainder stays at int6. Identical pattern to PR #1908 by @romeerp; we converged on the same recipe independently while iterating on the AsymLogit branch and confirmed it stacks orthogonally with AsymLogit (positive) on the full val partition.
+
+Env vars: `AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64`.
+
+## Stack inherited from #1855
+
+- **SP-8192 tokenizer** with CaseOps lossless casing (PR #1729)
+- **LQER asymmetric int4** (rank 4, group 64, top-3 layers) on tok_emb + selected MLP weights
+- **Sparse Attn Gate** on attention out
+- **BOS-Fixed SmearGate** in `_forward_hidden` and `forward_ttt` (cross-doc leak fix)
+- **9-hparam greedy stack** (#1855)
+- **Phased Multi-Phase Global SGD TTT** (3 phases, prefix=2500 docs)
+- **Per-group lrzip ZPAQ** compression
+- **8× H100 SXM**, FA3, fused softcapped CE, torch 2.9.1+cu128
+
+## Hyperparameters (additions over #1855)
+
+```bash
+ASYM_LOGIT_RESCALE=1
+AWQ_LITE_ENABLED=1
+AWQ_LITE_BITS=8
+AWQ_LITE_GROUP_TOP_K=1
+AWQ_LITE_GROUP_SIZE=64
+HASH_EMBED_ENABLED=0          # gated; not used here
+```
+
+All other env vars match #1855 verbatim (see PR #1855 README).
+
+## Compliance (Issue #1017)
+
+- **C1 strict causal dependence:** standard varlen + per-doc cu_seqlens; no future-token leakage
+- **C2 full normalized distribution:** standard log-softmax over SP8192 vocab; AsymLogit is a deterministic monotone reshaping of logits before softmax — distribution remains a normalized full-vocab distribution
+- **C3 score-before-update:** Phased TTT scores each chunk before any LoRA gradient step
+- **C4 single pass:** each val token scored exactly once
+- **No SLOT, no n-gram cache, no logit bias, no PPM, no pre-quant TTT on val data**
+- **Compute caps:** train ≤599.6s, eval ≤532s, all 3 seeds. `MAX_WALLCLOCK_SECONDS=600`.
+- **Artifact:** ≤15,985,176 bytes for all 3 seeds. Cap is 16,000,000.
+
+## Reproduction
+
+```bash
+pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128
+pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
+apt-get install -y lrzip
+
+for seed in 42 0 1234; do
+  SEED=$seed \
+  ASYM_LOGIT_RESCALE=1 \
+  AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 \
+  HASH_EMBED_ENABLED=0 \
+  CASEOPS_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2500 PHASED_TTT_NUM_PHASES=3 \
+  EMBED_BITS=7 MATRIX_LR=0.026 MIN_LR=0.1 \
+  MLP_CLIP_SIGMAS=11.5 ATTN_CLIP_SIGMAS=13.0 EMBED_CLIP_SIGMAS=14.0 \
+  GRAD_CLIP_NORM=0.3 TTT_CHUNK_SIZE=48 WARMUP_STEPS=20 MUON_BACKEND_STEPS=5 \
+  GLOBAL_TTT_MOMENTUM=0.9 WARMDOWN_FRAC=0.85 BETA2=0.99 \
+  TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80 SPARSE_ATTN_GATE_SCALE=0.5 \
+  GPTQ_RESERVE_SECONDS=0.5 GPTQ_CALIBRATION_BATCHES=16 \
+  GATED_ATTN_QUANT_GATE=1 SPARSE_ATTN_GATE_ENABLED=1 GATE_WINDOW=12 \
+  SMEAR_GATE_ENABLED=1 \
+  LQER_ENABLED=1 LQER_ASYM_ENABLED=1 LQER_RANK=4 LQER_FACTOR_BITS=4 LQER_ASYM_GROUP=64 LQER_TOP_K=3 \
+  FUSED_CE_ENABLED=1 COMPRESSOR=pergroup NCCL_NET=Socket \
+  MAX_WALLCLOCK_SECONDS=600 \
+  torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee train_seed${seed}.log
+done
+```
+
+## Attribution
+
+- **codemath3000** — PR #1855 (full stack baseline 1.06108) + caught the val truncation issue on review
+- **romeerp** — PR #1908 (AWQ-lite mixed-precision GPTQ) and PR #1729 (CaseOps lossless tokenizer)
+- **classiclarryd** — modded-nanogpt PR #181 (Asymmetric Logit Rescale)
+- **dexhunter** — PR #1797 (LQER asym + SmearGate base) and PR #1736 (CaseOps + GatedAttn + QuantGate)
+- **nprime06** — PR #1787 (Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE)
+- **MarioPaerle** — PR #1667 (SmearGate origin)
+- **renqianluo** — PR #1767 (LoRA TTT improvements)
+- **Jorge Asenjo** — PR #1700 (Phased Multi-Phase Global SGD TTT, retained from #1855 stack); this PR (AsymLogit + AWQ-lite stacking on full val)
diff --git a/records/track_10min_16mb/2026-04-30_SP8192_AsymLogit_AWQLite_FullVal_1.0597/submission.json b/records/track_10min_16mb/2026-04-30_SP8192_AsymLogit_AWQLite_FullVal_1.0597/submission.json
@@ -0,0 +1,25 @@
+{
+  "author": "Jorge Asenjo",
+  "github_id": "jorge-asenjo",
+  "name": "SP8192 PR #1855 Base + Asymmetric Logit Rescale + AWQ-lite (full val)",
+  "blurb": "PR #1855 stack (LQER asym int4 + Sparse Attn Gate + BOS-Fixed SmearGate + 9-hparam greedy + per-group lrzip) with two orthogonal additions on top: (1) Asymmetric Logit Rescale on the eval/TTT path (two learnable softcap_pos/softcap_neg scalars trained inside Phased TTT, modded-nanogpt #181 by classiclarryd), and (2) AWQ-lite mixed-precision GPTQ keeping the top-1 64-col activation-saliency group at int8 (PR #1908 by romeerp). Measured on the full validation partition (47,851,520 tokens) after correcting the truncated-val partition reported in the original v1 of this PR. 3-seed mean val_bpb 1.05971 (population std 0.000478).",
+  "date": "2026-04-30T09:54:00Z",
+  "val_loss": 2.31904,
+  "val_bpb": 1.05971,
+  "bytes_total": 15979486,
+  "bytes_code": 32500,
+  "revisions": [
+    {
+      "v": 1,
+      "date": "2026-04-29T10:17:00Z",
+      "val_bpb": 1.06577,
+      "note": "Original submission. Val measurement was on a truncated validation partition (9,662,464 tokens) due to a corrupted shard in the local network volume. @codemath3000 caught this on review by comparing val_tokens lines."
+    },
+    {
+      "v": 2,
+      "date": "2026-04-30T09:54:00Z",
+      "val_bpb": 1.05971,
+      "note": "Re-measured on full validation partition (47,851,520 tokens, identical to #1855's reference). AWQ-lite added as concurrent independent finding alongside @romeerp's PR #1908 — orthogonal stack with AsymLogit. 3-seed strict-compliant: train ≤599.6s, eval ≤532s, max artifact 15,985,176 bytes."
+    }
+  ]
+}