Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1133 (3-seed mean) #1099
Bortlesboat wants to merge 2 commits into openai:main
Conversation
…(3-seed mean). 3-seed results: 1.1136/1.1133/1.1139 (mean 1.1136, std 0.0003). Built on PR openai#549 + PR openai#1060 with an optimized GPTQ reserve (10s vs 14s).
Improved from 1.1136 to 1.1133 by reducing the GPTQ reserve from 10s to 9s. Seeds: 1.1133/1.1132/1.1133 (mean 1.1133, std 0.0001). All artifacts under 16MB.
Single innovation: coprime-stride shard traversal. Instead of reading shards 0,1,2,...,79, it reads 0,7,14,...,77,4,11,..., where stride=7 is coprime to the shard count (80), so every shard is visited once before any repeats. Prevents repeated token sequences across epochs. PR openai#1099 gets 1.1136 with this (vs 1.1217 baseline). 12 lines added. Zero HP changes. Zero architecture changes. Same quantization path. Artifact unchanged. Co-Authored-By: Kevin Tan <kft@lightarchitects.io> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
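A minimal sketch of the traversal order described above; the shard count and stride match this record's numbers, but the function itself is illustrative, not the PR's 12-line patch:

```python
from math import gcd

def coprime_stride_order(num_shards: int = 80, stride: int = 7) -> list[int]:
    # With gcd(stride, num_shards) == 1, the walk i -> (i + stride) % num_shards
    # visits every shard exactly once before returning to shard 0.
    assert gcd(stride, num_shards) == 1, "stride must be coprime to shard count"
    return [(i * stride) % num_shards for i in range(num_shards)]

order = coprime_stride_order()
assert sorted(order) == list(range(80))  # full coverage, no shard repeated
print(order[:12])  # [0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77]
```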
Superseded by #1169 (better score). Closing.
…IP, I overrode to PASS. Subagent found arxiv:2505.15134 (Entropy Minimization at Inference, NeurIPS 2025) and recommended ship. I reversed to PASS after working out the math: EM-INF is equivalent to temperature sharpening, and cross-entropy for a calibrated MLE model is minimized at T=1 by definition, so moving T away from 1 in either direction strictly increases in-distribution NLL. Same class of trap as Patch 14 (entropy-adaptive, already falsified). No push. Better directions logged for next fire: PR openai#1437 N-gram Tilt (multiplicative, not sharpening), BPE-8192 tables, Coprime-Stride from merged record openai#1099. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
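For reference, the calibration argument in one line, using the standard NLL decomposition (a textbook identity, not code from this PR); here $p$ is the data distribution, which a calibrated MLE model matches, and $p_T(x) \propto p(x)^{1/T}$ is its temperature-sharpened version:

$$\mathbb{E}_{x\sim p}\big[-\log p_T(x)\big] \;=\; H(p) \;+\; \mathrm{KL}\big(p \,\|\, p_T\big) \;\ge\; H(p),$$

with equality iff $p_T = p$, i.e. $T = 1$. Any other temperature adds a strictly positive KL term, which is the "strictly increases in-distribution NLL" claim above.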
EL2 cycle-2 = 3.2742 (only +0.0008 above champion 3.2734), reversing the audit fire #1 verdict that EngramLite was falsified. Adding 4 new EL multi-seed experiments to confirm: EL3 (seed 1337), EL4 (seed 999), EL5 (seed 7), and EL6 with L5 weights (0.15/0.20/0.15), a new combination. Removed 15 dead/falsified configs that wasted cycle-2 compute: EA*, BG*, NG*, TH*, MEGA, MTP0/2/3, MTP1_seed999, PR2/3, EL0. Also captured the EMA(0.997) canonical spec from 6 merged records (openai#287, openai#315, openai#414, openai#1019, openai#1099) — DEFERRED the actual Patch 17 ship because EMA only affects final val_bpb (not loop train_loss) and training-loop anchoring is risky without reading train_gpt.py. Queue now cycles in ~100 min (vs 185 min), leaving more compute for the EL family expansion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
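A minimal sketch of the EMA(0.997) weight-averaging pattern the captured spec refers to, assuming a PyTorch-style training loop; the class name and integration point are hypothetical, not the canonical spec itself:

```python
import copy
import torch

class EmaWeights:
    # Shadow copy of model weights, updated as ema <- d*ema + (1-d)*w each step.
    def __init__(self, model: torch.nn.Module, decay: float = 0.997):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for e, p in zip(self.shadow.parameters(), model.parameters()):
            e.lerp_(p, 1.0 - self.decay)  # e <- e + (1 - d) * (p - e)
```

Validation (val_bpb) would run against the shadow weights while the optimizer keeps training the raw ones, which is exactly why the loop's train_loss is unaffected, as the deferral note says.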
…PEC captured. Subagent extracted the percentile-based int6 quantization pattern from PR openai#1099, openai#1019, openai#1444 (3+ merged records). No Hessian needed, ~130 LOC, lzma-22 instead of zlib for ~0.5MB of size headroom. Direct BPB gain is only -0.0003 (within noise) — the real value is the freed size budget, which could fund extra model capacity. DEFERRED the actual Patch 23 ship: same metric problem as Tilt + EMA (loop train_loss unaffected by serialization), plus serialization code is the highest-risk path to break before submission. The captured spec is drop-in ready for the next H100 escalation cycle. Three specs now queued for combined H100 escalation: USE_NGRAM_TILT_EVAL (task openai#53), USE_EMA (task openai#45), USE_INT6_GPTQ (new). Combined estimated gain: +0.003 to +0.008 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
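A sketch of the captured pattern as described (clip at a symmetric percentile, quantize to int6, compress with lzma). The percentile value, the int8 container for the 6-bit codes, and the function names are my assumptions; the spec's ~130 LOC version will differ:

```python
import lzma
import numpy as np

def quantize_int6_percentile(w: np.ndarray, pct: float = 99.9):
    # Clip to the +/- pct percentile of |w|, then map to 64 int6 levels.
    # No Hessian pass: the scale comes from the weight distribution alone.
    scale = np.percentile(np.abs(w), pct) / 31.0  # int6 range: -32..31
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, s = quantize_int6_percentile(w)
blob = lzma.compress(q.tobytes(), preset=9)  # lzma in place of zlib for size headroom
print(len(blob), np.abs(dequantize(q, s) - w).max())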
…eferred (upstream stateless). Two-subagent investigation of the coprime-stride loader from PR openai#1099/openai#1060. First subagent confirmed 26 PRs use it, the top merged record uses it, ~0.01 BPB estimated gain. Second subagent extracted the exact upstream DistributedTokenLoader code: it's COMPLETELY STATELESS (~10 lines, just slices TokenStream). PR openai#1099's implementation is NOT a small patch — it's a fundamental rewrite adding stateful per-shard cursor management. The real implementation is 60-100 LOC and needs to interact with the TokenStream class I haven't read yet. DEFERRED because the data loader is on the critical path — a buggy patch could silently corrupt training data. Better to validate the existing MS3/EL/MR cycle 2+3 results first. Spec captured for the next focused research fire. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
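For contrast with the stateful rewrite, a hypothetical reconstruction of what "~10 lines, just slices TokenStream" could look like. I haven't read the upstream code either, so the class internals, arguments, and slicing scheme are all assumptions:

```python
import numpy as np

class DistributedTokenLoader:
    # Hypothetical stateless loader: no cursors or shard state,
    # every batch is a pure function of (step, rank).
    def __init__(self, tokens: np.ndarray, batch_size: int, seq_len: int,
                 rank: int, world_size: int):
        self.tokens = tokens
        self.batch_size, self.seq_len = batch_size, seq_len
        self.rank, self.world_size = rank, world_size
        self.chunk = batch_size * seq_len

    def get_batch(self, step: int):
        # Slice inputs and next-token targets directly out of the stream.
        start = (step * self.world_size + self.rank) * self.chunk
        buf = self.tokens[start : start + self.chunk + 1]
        x = buf[:-1].reshape(self.batch_size, self.seq_len)
        y = buf[1:].reshape(self.batch_size, self.seq_len)
        return x, y

loader = DistributedTokenLoader(np.arange(10_000), batch_size=4, seq_len=32,
                                rank=0, world_size=8)
x, y = loader.get_batch(step=0)  # x.shape == y.shape == (4, 32)
```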
…prime stride sampling. Inspired by PR openai#1099/openai#1060/openai#1135, which use TOKEN-level coprime stride. Token-level needs a 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL variant instead: modify _advance_file() to use a coprime stride instead of +1, so nearby training steps see topically different shards rather than adjacent similar ones. Implementation: 13 LOC, two anchors in the TokenStream class (none of the existing 24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1, falls back to the stride=1 default. Idempotent via COPRIME_STRIDE_MARKER. Effect: with N shards and gcd(s,N)=1, iterates 0 -> s -> 2s -> ..., covering all shards before repeating. Max spacing diversity = better gradient noise reduction. Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY at near-zero risk vs. a 60+ LOC structural rewrite. 4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram. This is the FIRST data-side patch in our 24-patch stack. Tests a completely new vector after the "neutrality plateau" of architectural/optimizer/training-time patches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
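A hedged sketch of how the shard-level variant could look. _advance_file, TokenStream, USE_COPRIME_STRIDE, and COPRIME_STRIDE_MARKER come from the description above; the stride-picking helper and the class skeleton are my assumptions, not the shipped 13 LOC:

```python
import os
from math import gcd

def pick_coprime_stride(num_files: int, target: int = 7) -> int:
    # Walk up from the target until the stride is coprime to the shard count,
    # so any shard count (not just 80) gets full coverage. Assumed helper.
    s = target
    while gcd(s, num_files) != 1:
        s += 1
    return s

class TokenStream:  # skeleton only; the real upstream class is not shown here
    def __init__(self, files: list[str]):
        self.files = files
        self.file_idx = 0

    def _advance_file(self) -> str:
        # COPRIME_STRIDE_MARKER  (idempotence marker named in the patch notes)
        if os.environ.get("USE_COPRIME_STRIDE") == "1":
            stride = pick_coprime_stride(len(self.files))
        else:
            stride = 1  # upstream default: sequential shard order
        self.file_idx = (self.file_idx + stride) % len(self.files)
        return self.files[self.file_idx]
```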
…ntified as top missing technique. Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested in 150+ open + 20 closed PRs (7 consecutive audits for the original 3; first confirmation for Patch 20 just shipped 3h ago). CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO attention-mask variants. Most-validated missing technique. ~200 LOC moderate port — too big for a single research fire but worth a focused 30-45 min investigation if we can find a minimal variant. SLOT (Score-First TTT) is the #2 missing technique (PR openai#549, ~100 LOC), but it's eval-time, so it joins the H100 escalation bundle category. H100 escalation candidate updated: NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ); OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ). Need CS2 cycle 2+3 for an n=3 mean confirmation before escalating. PR openai#1430 still OPEN, 0 comments, no comp-owner activity for 16h+. Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port from 100+ PRs. From arxiv:2603.09078 + PR openai#1099 (latest merged) + 4+ other merged records. ~12 LOC inline insert in CausalSelfAttention.forward after the GATED_ATTENTION block. 0 new params. Removes the self-value projection from the attention output. 4 XSA experiments queued: alone, seed42, +coprime, full stack. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
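The thread doesn't show the actual ~12 LOC insert, so here is one plausible minimal reading of "removes self-value projection from attention output": causal attention with each token's self weight zeroed and renormalized, adding zero parameters. The function name and the token-0 guard are assumptions; the record's real XSA code may differ:

```python
import torch
import torch.nn.functional as F

def xsa_forward(q, k, v):
    # Causal attention whose output excludes each token's own value vector.
    T = q.size(-2)
    scores = (q @ k.transpose(-2, -1)) * q.size(-1) ** -0.5
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    # Zero the self weight, then renormalize over the remaining context.
    eye = torch.eye(T, dtype=torch.bool, device=q.device)
    attn = attn.masked_fill(eye, 0.0)
    attn[..., 0, 0] = 1.0  # token 0 has no other context; keep its self weight
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return attn @ v

q = k = v = torch.randn(2, 8, 16)
print(xsa_forward(q, k, v).shape)  # torch.Size([2, 8, 16])
```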
Community Review — Record: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1133 (3-seed mean)
BPB: 1.1133 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)
What I found in the code (head SHA …): the TTT path at line 1128 implements the score-first-per-chunk pattern: each chunk is scored before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter trains on it, and that's what the code does here, chunk by chunk.
CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=96935 B, SMOKE_TEST_PASS.
Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.
Auto-classification caveat: this review was drafted by the deterministic AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.
Reviewed by @MatoTeziTanka — The Agora.
See README.md for full details.