# Engineering Log

This is a technical log for reviewers. It explains how this candidate was selected, which alternatives were rejected, and why the final method is a targeted follow-up rather than a broad validation-tuned sweep.

## Starting Point

The starting point was the submitted PR #1915 record:

- SP8192 CaseOps tokenizer/data path.
- PR #1855-style legal frontier architecture and compression stack.
- Stock top-k LQER.
- Legal per-document score-first LoRA TTT.
- 3-seed mean: 1.06504520 BPB.
- Max package size: 15,922,155 bytes.

That submission stayed frozen. Follow-up experiments were isolated so the clean submitted record could not be contaminated by exploratory changes.

## Methodology

The follow-up work used four rules:

1. Keep legality first: official SP8192 distribution, strict causality, score-before-update, no byte-PPM shortcut, no custom tokenizer, no cross-document validation adaptation.
2. Treat package bytes and runtime as part of the result, not post-processing.
3. Prefer paired or same-execution comparisons whenever the expected effect is small.
4. Close weak mechanisms quickly, but keep evidence for each decision.

The same-execution rule became important after a context-mixer experiment showed that separate eval-time TTT runs can differ at the 1e-5 BPB level even when the mathematical prediction path is intended to be identical. Fresh per-document LoRA state, BF16/fused kernels, and distributed scheduling can create slightly different trajectories. For scoring-only transforms, later comparisons therefore shared the same logits, TTT trajectory, document order, hints, tokens, and bytes.
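
As an illustration of what "same execution" means here, the sketch below scores several post-hoc transforms from one shared set of per-token records. The record layout and names are assumptions for illustration, not the package's actual code.

```python
import math

def compare_on_shared_pass(records, variants):
    """Compare scoring-only transforms on one shared TTT/logit trajectory.

    records:  per-token tuples collected once from a single eval pass, so every
              variant sees identical logits, hints, tokens, and document order.
    variants: dict mapping a name to a function record -> target log-probability
              under that scoring rule (control, fixed tilt, hedge, ...).
    """
    bits = {name: 0.0 for name in variants}
    for rec in records:
        for name, fn in variants.items():
            bits[name] += -fn(rec) / math.log(2)   # accumulate code length in bits
    return bits                                     # divide by scored bytes for BPB
```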

## Closed Candidate Mechanisms

Several plausible mechanisms were tested before selecting this candidate:

| Mechanism | Result | Decision |
| --- | --- | --- |
| Weighted LQER | best gain about 0.000013 BPB | closed |
| AWQ/no-embedding rescue | package-safe but neutral | closed |
| D-prime bucketing | slightly better score but too slow | closed |
| First-order TTT-aware training | seed42 worsened to 1.07164685 | closed |
| Random-map adapters | sampled BPB worsened by 0.008918 | closed |
| Long-context 4K | full validation worsened by 0.002438 BPB | closed |
| LeakyReLU-slope retrain | seed42 worsened to 1.13888025 | closed |
| Neural Dirichlet context mixer | same-execution identity check passed, but valid slices regressed and full runtime projected about 9180s | closed |

One small mechanism stayed positive: lower eval-time TTT LR. It improved seed42, seed0, and seed1234 by about 0.0005 to 0.0006 BPB and became the base setting for the final follow-up runs.

## Token-Level Causal N-Gram Tilt

The main late signal came from a normalized token-level causal n-gram tilt over the official SP8192 alphabet.

For a strict-prefix hint `h`, the fixed tilt changes the model distribution by

```text
p'(a) = exp(beta * 1[a == h]) * p(a) / Z
Z = 1 + p(h) * (exp(beta) - 1)
```

The result is a properly normalized distribution over the official SP8192 alphabet. The hint state is updated only after the current token has been scored. This is unlike a byte-PPM or custom-tokenizer path: the scored alphabet remains the official token alphabet.
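
A minimal sketch of this rule in scoring form, assuming the per-token target and hint log-probabilities are already available (names are illustrative):

```python
import math

def tilted_target_logprob(target_logprob, hint_logprob, hint_is_target, beta):
    """Normalized causal n-gram tilt for one scored token.

    Only p(h) is needed to renormalize, because the tilt multiplies exactly one
    probability mass point by exp(beta): Z = 1 + p(h) * (exp(beta) - 1).
    """
    log_z = math.log1p(math.exp(hint_logprob) * (math.exp(beta) - 1.0))
    boost = beta if hint_is_target else 0.0
    return target_logprob + boost - log_z
```

Because only `p(h)` enters the normalizer, the overlay never needs to materialize the full tilted vocabulary vector.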

The first full seed42 validation of this mechanism produced:

- paired lower-LR control: 1.06373091 BPB
- fixed token n-gram tilt: 1.06182936 BPB
- gain: 0.00190155 BPB
- exact counts: 47,851,520 scored tokens / 151,074,499 scored bytes

That was the first follow-up effect large enough to justify packaging and runtime work.

## Adaptive Hedge

Fixed n-gram tilt has one obvious risk: a single boost strength can overpay the normalizer on some hints and underboost others. Adaptive Hedge was tested as a same-execution mixture over boost temperatures. Mechanically, it acts like a small universal code over n-gram boost strengths.
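
The exact hedge used in the package is not reproduced here; the sketch below shows the general shape, an exponential-weights mixture over a small grid of boost strengths, where each expert is the fixed tilt at one beta. The grid, learning rate, and reset policy are assumptions.

```python
import math

class BetaHedge:
    """Exponential-weights (Hedge) mixture over fixed n-gram boost strengths."""

    def __init__(self, betas=(0.0, 0.5, 1.0, 2.0, 4.0), eta=1.0):
        self.betas = betas
        self.eta = eta                      # Hedge learning rate (assumed value)
        self.log_w = [0.0] * len(betas)     # uniform prior over experts

    def score(self, target_logprob, hint_logprob, hint_is_target):
        # Each expert is the fixed tilt at one beta (same formula as above).
        expert_lps = []
        for b in self.betas:
            log_z = math.log1p(math.exp(hint_logprob) * (math.exp(b) - 1.0))
            expert_lps.append(target_logprob + (b if hint_is_target else 0.0) - log_z)
        # Mixture log-probability under the current (normalized) weights.
        norm = math.log(sum(math.exp(lw) for lw in self.log_w))
        mix_lp = math.log(sum(math.exp(lw - norm + lp)
                              for lw, lp in zip(self.log_w, expert_lps)))
        # Weights are updated only after the current token has been scored.
        self.log_w = [lw + self.eta * lp for lw, lp in zip(self.log_w, expert_lps)]
        m = max(self.log_w)
        self.log_w = [lw - m for lw in self.log_w]   # rescale to avoid underflow
        return mix_lp
```

With eta = 1 this reduces to a Bayesian mixture over the beta grid, which is one way to read the "small universal code over boost strengths" description.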

Across full validation, Adaptive Hedge improved fixed n-gram by almost the same amount in every setting tested:

| Setting | Control | Fixed n-gram | Adaptive Hedge | Hedge vs Fixed |
| --- | ---: | ---: | ---: | ---: |
| seed42 | 1.06372791 | 1.06182636 | 1.06143959 | 0.00038677 |
| seed0 | 1.06452124 | 1.06260772 | 1.06221801 | 0.00038971 |
| seed1234 | 1.06526616 | 1.06331328 | 1.06292962 | 0.00038366 |
| no-Q seed42 | 1.06373272 | 1.06182914 | 1.06144170 | 0.00038744 |

Three-seed means from the same-execution counter batch:

- control: 1.06450510 BPB
- fixed n-gram: 1.06258245 BPB
- Adaptive Hedge: 1.06219574 BPB

The consistency was the important evidence. Adaptive Hedge was not a seed42-only fluctuation; it repeatedly saved about 0.000386 BPB beyond fixed n-gram.

## Transferable Mechanism: Adaptive-Beta Hedge

The most reusable finding from this follow-up is not just the final absolute score. It is that Adaptive-Beta Hedge behaved like a base-independent scoring overlay.

Across four different seed42 base configurations, Hedge added nearly the same gain over fixed n-gram:

| Base configuration | Hedge gain vs fixed n-gram |
| --- | ---: |
| default context, Q/V LoRA active | 0.00038677 BPB |
| default context, Q LoRA disabled | 0.00038744 BPB |
| public-frontier diagnostic base, no prefix adaptation | 0.00038690 BPB |
| 2560 context, Q/V LoRA disabled | 0.00038599 BPB |
| spread | 0.00000145 BPB |

These configurations differ across eval context length, LoRA branch selection, and underlying trained artifact. The near-identical Hedge delta suggests it is correcting a systematic n-gram boost calibration error rather than exploiting one model trajectory.

Mechanistically, fixed n-gram tilt pays a normalizer cost when it boosts a hinted token. A single fixed boost can overpay on weak hints and underboost on strong hints. Adaptive Hedge acts as a small universal code over boost temperatures, selecting a better effective strength online while preserving the same strict-prefix, normalized SP8192 scoring rule.
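
In log-score terms, the per-token trade-off behind that calibration error is:

```text
if a == h:  log p'(h) - log p(h) = beta - log Z    (gain when the hint fires correctly)
otherwise:  log p'(a) - log p(a) = -log Z          (cost paid on every miss)
Z = 1 + p(h) * (exp(beta) - 1)
```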

This is interesting because most improvements in this contest are base-coupled: change the architecture, quantization, context, or TTT target modules and the gain can disappear. Hedge remained stable across several such axes. The honest limit is that the cross-base transfer table is seed42-only for each base; the three-seed evidence exists for the submitted configuration family, not for every base configuration in the table.

## Trajectory Interactions

Scoring transforms and TTT trajectory changes were then separated. A scoring transform is applied post hoc to the logits; a trajectory change alters the logits themselves through eval-time context length or LoRA target selection.

Trajectory-changing candidates tested:

| Candidate | BPB | Read |
| --- | ---: | --- |
| tilt-aware TTT objective | 1.06154791 | weak positive |
| 2560 context + no Q/V LoRA, fixed n-gram | 1.06121730 | strong interaction |
| tilted training objective | 1.27728534 | undertrained, not useful under deadline |
| public-frontier diagnostic + Adaptive Hedge | 1.06096543 | useful diagnostic, not best |

The strongest local interaction was 2560 context + no Q/V LoRA combined with Adaptive Hedge:

- paired trajectory control: 1.06306045 BPB
- fixed n-gram: 1.06121689 BPB
- Adaptive Hedge: 1.06083091 BPB
- Adaptive Hedge gain over fixed n-gram: 0.00038599 BPB

This showed that the 2560/no-QV trajectory gain and the Adaptive Hedge scoring gain were mostly complementary.

## Final Runtime And Package Work

The final candidate had to be made self-contained and fast enough to stay under the 600-second wrapper wallclock budget.

Package state:

- max compressed model: 15,874,515 bytes
- counted `train_gpt.py` wrapper: 57,552 bytes
- max total counted bytes: 15,932,067 bytes
- margin under 16,000,000 bytes: 67,933 bytes
- n-gram Python helper and C helper source embedded in counted wrapper
- no uncounted helper files required

Evaluation data safety:

- validation-only data view
- train shards visible during final eval: 0
- validation token shards: 5
- validation byte shards: 5

Runtime tuning was limited to mathematically equivalent implementation changes; no scoring constants were retuned. Batch-size checks showed batch 32 was fastest; batch 48 fit in memory but was slower due to memory pressure and load imbalance.

The final speedup removed a duplicate vocabulary-wide normalization. The previous path computed the token cross-entropy and then separately ran a full log-softmax to obtain the n-gram hint probability. The optimized path recovers the normalizer from the cross-entropy already being computed:

```text
loss = logZ - target_logit
logZ = loss + target_logit
hint_log_prob = hint_logit - logZ
```

This preserves the scoring math while avoiding a second normalization.
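
A minimal PyTorch-style sketch of that reuse follows; tensor names and shapes are illustrative, not the wrapper's actual code.

```python
import torch
import torch.nn.functional as F

def ce_and_hint_logprob(logits, targets, hints):
    """Compute token CE and hint log-probability with one normalization.

    logits:  [T, V] final-layer logits over the SP8192 alphabet
    targets: [T]    scored token ids
    hints:   [T]    strict-prefix n-gram hint token ids
    """
    # Cross-entropy already pays for the log-sum-exp over the vocabulary.
    loss = F.cross_entropy(logits, targets, reduction="none")   # = logZ - target_logit
    target_logit = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    log_z = loss + target_logit                                  # recovered, no second pass
    hint_logit = logits.gather(1, hints.unsqueeze(1)).squeeze(1)
    hint_log_prob = hint_logit - log_z
    return loss, hint_log_prob
```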

Selected proof:

- BPB: 1.06082922
- raw loss sum: 111086708.60261762
- raw bits sum: 160264243.60967380
- inner TTT eval: 544.1s
- total eval wallclock: 566.3s
- wrapper wallclock: 585s
- scored tokens: 47,851,520
- scored bytes: 151,074,499
- doc-order hash: 33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41

## Three-Seed Runtime Evidence

After the seed42 proof, two additional seeds were run through the optimized final package path. All three seed runs finished under a 600 second wrapper wallclock.

| Seed | BPB | Inner TTT eval | Total eval wallclock | Wrapper wallclock | Note |
| ---: | ---: | ---: | ---: | ---: | --- |
| 42 | 1.06082922 | 544.1s | 566.3s | 585s | paired control/fixed/adaptive proof |
| 0 | 1.06158291 | 546.7s | 568.5s | 587s | paired control/fixed/adaptive proof |
| 1234 | 1.06232130 | 545.6s | 568.1s | 586s | selected Adaptive Hedge runtime proof |

The 3-seed mean is 1.06157781 BPB. The headline claim remains the seed42 record-track proof; the mean is included for transparency and is not claimed to beat the displayed leaderboard mean.

For seed1234, paired control/fixed diagnostics were bypassed in the final timing proof to keep the wrapper runtime below 600 seconds. Earlier paired evidence establishes the selected scorer behavior; the seed1234 log should be read as the selected Adaptive Hedge runtime proof.

This submission should be read as a focused, legally constrained follow-up: the previously submitted record stack remains frozen, the selected mechanism is measured with exact official accounting, and the final score improvement comes from a targeted eval-time trajectory plus a normalized causal token-level scoring overlay.
# SP8192 CaseOps + 2560 No-Q/V Adaptive Hedge Token N-Gram

Status: record-track seed42 proof complete, with three runtime-compliant seed proofs.

## Result

Headline seed42 proof and summary metrics:

| Metric | Value |
| --- | ---: |
| Seed42 BPB | 1.06082922 |
| 3-seed mean BPB | 1.06157781 |
| Scored tokens | 47,851,520 |
| Scored bytes | 151,074,499 |
| Docs | 50,000 |
| Max artifact bytes | 15,932,067 |
| Cap margin | 67,933 |

Doc-order hash:

```text
33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41
```

## Per-Seed Proofs

| Seed | BPB | Inner TTT eval | Total eval wallclock | Wrapper wallclock | Runtime status |
| ---: | ---: | ---: | ---: | ---: | --- |
| 42 | 1.06082922 | 544.1s | 566.3s | 585s | under 600s |
| 0 | 1.06158291 | 546.7s | 568.5s | 587s | under 600s |
| 1234 | 1.06232130 | 545.6s | 568.1s | 586s | under 600s |

The headline claim is the seed42 record-track proof plus supporting reproducibility evidence. The 3-seed mean is reported for transparency and is not claimed to beat the displayed leaderboard mean.

## Method

This candidate starts from the PR #1915 quantized artifacts and uses an eval-only follow-up stack:

- lower eval-time per-document TTT LR: `0.000075`
- eval/TTT context length: `2560`
- Q/V LoRA disabled for eval-time TTT
- K/MLP/O/lm_head LoRA active
- normalized token-level causal n-gram Adaptive Hedge scoring overlay

The n-gram overlay scores a normalized distribution over the official SP8192 CaseOps token alphabet. It uses strict-prefix hints only and updates state after scoring the current token.
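
A minimal sketch of that ordering, with hypothetical helper names (`ngram.predict`, `ngram.update`, `hedge.score` are stand-ins, not the wrapper's actual API):

```python
def score_document(tokens, step_logprobs, ngram, hedge):
    """Strict-prefix scoring loop: the hint never depends on the current token."""
    total_logprob = 0.0
    for t, tok in enumerate(tokens):
        hint = ngram.predict()                     # built only from tokens[:t]
        lp = step_logprobs(t)                      # model log-probs for position t
        if hint is None:                           # no usable n-gram context yet
            total_logprob += lp[tok]
        else:
            total_logprob += hedge.score(lp[tok], lp[hint], hint == tok)
        ngram.update(tok)                          # state advances only after scoring
    return total_logprob
```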

## Size Accounting

| Component | Bytes |
| --- | ---: |
| max compressed model | 15,874,515 |
| counted `train_gpt.py` wrapper | 57,552 |
| max total | 15,932,067 |
| cap | 16,000,000 |
| margin | 67,933 |

The custom n-gram Python/C logic is embedded into the counted `train_gpt.py` wrapper. No uncounted helper files are required.

## Evaluation Data Safety

The final package path uses a validation-only data view:

- train shards visible during final eval: `0`
- validation token shards: `5`
- validation byte shards: `5`
- train-shard glob/listing skipped in final eval mode

## Legality Notes

- Official SP8192 CaseOps token alphabet.
- Full normalized token distribution after the n-gram tilt.
- Per-document score-first TTT.
- `TTT_WARM_START_A=0`; LoRA resets per document.
- `PHASED_TTT_PREFIX_DOCS=0`; no validation prefix adaptation.
- No global validation SGD or cross-document adaptive state.
- No byte PPM, custom tokenizer, target-conditioned lookup, or external network dependency.

## Included Files

- `train_gpt.py` - counted self-contained wrapper.
- `train_seed42.log` - optimized seed42 proof log.
- `train_seed0.log` - optimized seed0 proof log.
- `train_seed1234.log` - optimized seed1234 selected-scorer proof log.
- `submission.json` - metadata and per-seed results.
- `package_size.json` - package accounting from the selected proof source.
- `eval_data_manifest.json` - final eval data-view proof.
- `ENGINEERING_LOG.md` - engineering log for reviewers.

## Relationship To PR #1915

PR #1915 remains the conservative submitted anchor at 1.06504520 BPB. This submission is a separate follow-up and intentionally does not modify the PR #1915 record folder.

`eval_data_manifest.json`:

```json
{
  "contains_only_eval_data": true,
  "dataset_root": "/workspace/parameter-golf-warm/path-a-final-package-b2-hedge-20260501T000000Z/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved",
  "tokenizer": "/workspace/parameter-golf-warm/path-a/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model",
  "train_shards_visible_in_eval_view": 0,
  "val_byte_shards": 5,
  "val_token_shards": 5
}
```

`package_size.json`:

```json
{
  "cap_bytes": 16000000,
  "code_path": "train_gpt.py",
  "code_wrapper_sha256": "37e54152983aab20b248e54b5b624b057624a78295035958ab4bbc3feca47320",
  "code_source_sha256": "88b870cbdf784d72d75e652f13336a82bab012b563a2e0acf7f83935f4bcff45",
  "code_uncompressed_bytes": 247230,
  "code_wrapper_brotli_base85_bytes": 57552,
  "external_ngram_files_required": false,
  "self_contained_ngram_code": true,
  "per_seed_model_bytes": {
    "42": 15872234,
    "0": 15871302,
    "1234": 15874515
  },
  "per_seed_model_sha256": {
    "42": "565441a723c0705e7d6d81bfe34271f6a4704fc21ceb1a50b076241c135b762c",
    "0": "489cf25b2c852a79a6f8265d007ee1f18196dbd2e5e34b96ef4a9234774e2622",
    "1234": "8b5521ee68a41df1c1e91ade6a814f68c7cf0034096fc775e1f4583275399a41"
  },
  "artifact_bytes_by_seed": {
    "42": 15929786,
    "0": 15928854,
    "1234": 15932067
  },
  "artifact_bytes_max": 15932067,
  "margin_bytes": 67933
}
```