# Engineering Log

This is a technical log for reviewers. It explains how this candidate was selected, which alternatives were rejected, and why the final method is a targeted follow-up rather than a broad validation-tuned sweep.

## Starting Point

The starting point was the submitted PR #1915 record:

- SP8192 CaseOps tokenizer/data path.
- PR #1855-style legal frontier architecture and compression stack.
- Stock top-k LQER.
- Legal per-document score-first LoRA TTT.
- 3-seed mean: 1.06504520 BPB.
- Max package size: 15,922,155 bytes.

That submission stayed frozen. Follow-up experiments were isolated so the clean submitted record could not be contaminated by exploratory changes.

## Methodology

The follow-up work used four rules:

1. Keep legality first: official SP8192 distribution, strict causality, score-before-update, no byte-PPM shortcut, no custom tokenizer, no cross-document validation adaptation.
2. Treat package bytes and runtime as part of the result, not post-processing.
3. Prefer paired or same-execution comparisons whenever the expected effect is small.
4. Close weak mechanisms quickly, but keep evidence for each decision.

The same-execution rule became important after a context-mixer experiment showed that separate eval-time TTT runs can differ at the 1e-5 BPB level even when the mathematical prediction path is intended to be identical. Fresh per-document LoRA state, BF16/fused kernels, and distributed scheduling can create slightly different trajectories. For scoring-only transforms, later comparisons therefore shared the same logits, TTT trajectory, document order, hints, tokens, and bytes.
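
As an illustration of what "same execution" means here, the sketch below scores several post-hoc transforms from one shared set of per-token records. The record layout and names are assumptions for illustration, not the package's actual code.

```python
import math

def compare_on_shared_pass(records, variants):
    """Compare scoring-only transforms on one shared TTT/logit trajectory.

    records:  per-token tuples collected once from a single eval pass, so every
              variant sees identical logits, hints, tokens, and document order.
    variants: dict mapping a name to a function record -> target log-probability
              under that scoring rule (control, fixed tilt, hedge, ...).
    """
    bits = {name: 0.0 for name in variants}
    for rec in records:
        for name, fn in variants.items():
            bits[name] += -fn(rec) / math.log(2)   # accumulate code length in bits
    return bits                                     # divide by scored bytes for BPB
```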

## Closed Candidate Mechanisms

Several plausible mechanisms were tested before selecting this candidate:

| Mechanism | Result | Decision |
| --- | --- | --- |
| Weighted LQER | best gain about 0.000013 BPB | closed |
| AWQ/no-embedding rescue | package-safe but neutral | closed |
| D-prime bucketing | slightly better score but too slow | closed |
| First-order TTT-aware training | seed42 worsened to 1.07164685 | closed |
| Random-map adapters | sampled BPB worsened by 0.008918 | closed |
| Long-context 4K | full validation worsened by 0.002438 BPB | closed |
| LeakyReLU-slope retrain | seed42 worsened to 1.13888025 | closed |
| Neural Dirichlet context mixer | same-execution identity check passed, but valid slices regressed and full runtime projected about 9180s | closed |

One small mechanism stayed positive: lower eval-time TTT LR. It improved seed42, seed0, and seed1234 by about 0.0005 to 0.0006 BPB and became the base setting for the final follow-up runs.

## Token-Level Causal N-Gram Tilt

The main late signal came from a normalized token-level causal n-gram tilt over the official SP8192 alphabet.

For a strict-prefix hint `h`, the fixed tilt changes the model distribution by

```text
p'(a) = exp(beta * 1[a == h]) * p(a) / Z
Z = 1 + p(h) * (exp(beta) - 1)
```

The result is a properly normalized distribution over the official SP8192 alphabet. The hint state is updated only after the current token has been scored. This is unlike a byte-PPM or custom-tokenizer path: the scored alphabet remains the official token alphabet.
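
A minimal sketch of this rule in scoring form, assuming the per-token target and hint log-probabilities are already available (names are illustrative):

```python
import math

def tilted_target_logprob(target_logprob, hint_logprob, hint_is_target, beta):
    """Normalized causal n-gram tilt for one scored token.

    Only p(h) is needed to renormalize, because the tilt multiplies exactly one
    probability mass point by exp(beta): Z = 1 + p(h) * (exp(beta) - 1).
    """
    log_z = math.log1p(math.exp(hint_logprob) * (math.exp(beta) - 1.0))
    boost = beta if hint_is_target else 0.0
    return target_logprob + boost - log_z
```

Because only `p(h)` enters the normalizer, the overlay never needs to materialize the full tilted vocabulary vector.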

The first full seed42 validation of this mechanism produced:

- paired lower-LR control: 1.06373091 BPB
- fixed token n-gram tilt: 1.06182936 BPB
- gain: 0.00190155 BPB
- exact counts: 47,851,520 scored tokens / 151,074,499 scored bytes

That was the first follow-up effect large enough to justify packaging and runtime work.

## Adaptive Hedge

Fixed n-gram tilt has one obvious risk: a single boost strength can overpay the normalizer on some hints and underboost others. Adaptive Hedge was tested as a same-execution mixture over boost temperatures. Mechanically, it acts like a small universal code over n-gram boost strengths.
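
The exact hedge used in the package is not reproduced here; the sketch below shows the general shape, an exponential-weights mixture over a small grid of boost strengths, where each expert is the fixed tilt at one beta. The grid, learning rate, and reset policy are assumptions.

```python
import math

class BetaHedge:
    """Exponential-weights (Hedge) mixture over fixed n-gram boost strengths."""

    def __init__(self, betas=(0.0, 0.5, 1.0, 2.0, 4.0), eta=1.0):
        self.betas = betas
        self.eta = eta                      # Hedge learning rate (assumed value)
        self.log_w = [0.0] * len(betas)     # uniform prior over experts

    def score(self, target_logprob, hint_logprob, hint_is_target):
        # Each expert is the fixed tilt at one beta (same formula as above).
        expert_lps = []
        for b in self.betas:
            log_z = math.log1p(math.exp(hint_logprob) * (math.exp(b) - 1.0))
            expert_lps.append(target_logprob + (b if hint_is_target else 0.0) - log_z)
        # Mixture log-probability under the current (normalized) weights.
        norm = math.log(sum(math.exp(lw) for lw in self.log_w))
        mix_lp = math.log(sum(math.exp(lw - norm + lp)
                              for lw, lp in zip(self.log_w, expert_lps)))
        # Weights are updated only after the current token has been scored.
        self.log_w = [lw + self.eta * lp for lw, lp in zip(self.log_w, expert_lps)]
        m = max(self.log_w)
        self.log_w = [lw - m for lw in self.log_w]   # rescale to avoid underflow
        return mix_lp
```

With eta = 1 this reduces to a Bayesian mixture over the beta grid, which is one way to read the "small universal code over boost strengths" description.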

Across full validation, Adaptive Hedge improved fixed n-gram by almost the same amount in every setting tested:

| Setting | Control | Fixed n-gram | Adaptive Hedge | Hedge vs Fixed |
| --- | ---: | ---: | ---: | ---: |
| seed42 | 1.06372791 | 1.06182636 | 1.06143959 | 0.00038677 |
| seed0 | 1.06452124 | 1.06260772 | 1.06221801 | 0.00038971 |
| seed1234 | 1.06526616 | 1.06331328 | 1.06292962 | 0.00038366 |
| no-Q seed42 | 1.06373272 | 1.06182914 | 1.06144170 | 0.00038744 |

Three-seed means from the same-execution counter batch:

- control: 1.06450510 BPB
- fixed n-gram: 1.06258245 BPB
- Adaptive Hedge: 1.06219574 BPB

The consistency was the important evidence. Adaptive Hedge was not a seed42-only fluctuation; it repeatedly saved about 0.000386 BPB beyond fixed n-gram.

## Transferable Mechanism: Adaptive-Beta Hedge

The most reusable finding from this follow-up is not just the final absolute score. It is that Adaptive-Beta Hedge behaved like a base-independent scoring overlay.

Across four different seed42 base configurations, Hedge added nearly the same gain over fixed n-gram:

| Base configuration | Hedge gain vs fixed n-gram |
| --- | ---: |
| default context, Q/V LoRA active | 0.00038677 BPB |
| default context, Q LoRA disabled | 0.00038744 BPB |
| public-frontier diagnostic base, no prefix adaptation | 0.00038690 BPB |
| 2560 context, Q/V LoRA disabled | 0.00038599 BPB |
| spread | 0.00000145 BPB |

These configurations differ across eval context length, LoRA branch selection, and underlying trained artifact. The near-identical Hedge delta suggests it is correcting a systematic n-gram boost calibration error rather than exploiting one model trajectory.

Mechanistically, fixed n-gram tilt pays a normalizer cost when it boosts a hinted token. A single fixed boost can overpay on weak hints and underboost on strong hints. Adaptive Hedge acts as a small universal code over boost temperatures, selecting a better effective strength online while preserving the same strict-prefix, normalized SP8192 scoring rule.
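
In log-score terms, the per-token trade-off behind that calibration error is:

```text
if a == h:  log p'(h) - log p(h) = beta - log Z    (gain when the hint fires correctly)
otherwise:  log p'(a) - log p(a) = -log Z          (cost paid on every miss)
Z = 1 + p(h) * (exp(beta) - 1)
```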

This is interesting because most improvements in this contest are base-coupled: change the architecture, quantization, context, or TTT target modules and the gain can disappear. Hedge remained stable across several such axes. The honest limit is that the cross-base transfer table is seed42-only for each base; the three-seed evidence exists for the submitted configuration family, not for every base configuration in the table.

## Trajectory Interactions

Scoring transforms and TTT trajectory changes were then separated. A scoring transform is applied post hoc to the logits; a trajectory change alters the logits themselves through eval-time context length or LoRA target selection.

Trajectory-changing candidates tested:

| Candidate | BPB | Read |
| --- | ---: | --- |
| tilt-aware TTT objective | 1.06154791 | weak positive |
| 2560 context + no Q/V LoRA, fixed n-gram | 1.06121730 | strong interaction |
| tilted training objective | 1.27728534 | undertrained, not useful under deadline |
| public-frontier diagnostic + Adaptive Hedge | 1.06096543 | useful diagnostic, not best |

The strongest local interaction was 2560 context + no Q/V LoRA combined with Adaptive Hedge:

- paired trajectory control: 1.06306045 BPB
- fixed n-gram: 1.06121689 BPB
- Adaptive Hedge: 1.06083091 BPB
- Adaptive Hedge gain over fixed n-gram: 0.00038599 BPB

This showed that the 2560/no-QV trajectory gain and the Adaptive Hedge scoring gain were mostly complementary.

## Final Runtime And Package Work

The final candidate had to be made self-contained and fast enough to stay under the 600-second wrapper wallclock budget.

Package state:

- max compressed model: 15,874,515 bytes
- counted `train_gpt.py` wrapper: 57,552 bytes
- max total counted bytes: 15,932,067 bytes
- margin under 16,000,000 bytes: 67,933 bytes
- n-gram Python helper and C helper source embedded in counted wrapper
- no uncounted helper files required

Evaluation data safety:

- validation-only data view
- train shards visible during final eval: 0
- validation token shards: 5
- validation byte shards: 5

Runtime tuning was limited to mathematically equivalent implementation changes; no scoring constants were retuned. Batch-size checks showed batch 32 was fastest; batch 48 fit in memory but was slower due to memory pressure and load imbalance.

The final speedup removed a duplicate vocabulary-wide normalization. The previous path computed the token cross-entropy and then separately ran a full log-softmax to obtain the n-gram hint probability. The optimized path recovers the normalizer from the cross-entropy already being computed:

```text
loss = logZ - target_logit
logZ = loss + target_logit
hint_log_prob = hint_logit - logZ
```

This preserves the scoring math while avoiding a second normalization.
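
A minimal PyTorch-style sketch of that reuse follows; tensor names and shapes are illustrative, not the wrapper's actual code.

```python
import torch
import torch.nn.functional as F

def ce_and_hint_logprob(logits, targets, hints):
    """Compute token CE and hint log-probability with one normalization.

    logits:  [T, V] final-layer logits over the SP8192 alphabet
    targets: [T]    scored token ids
    hints:   [T]    strict-prefix n-gram hint token ids
    """
    # Cross-entropy already pays for the log-sum-exp over the vocabulary.
    loss = F.cross_entropy(logits, targets, reduction="none")   # = logZ - target_logit
    target_logit = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    log_z = loss + target_logit                                  # recovered, no second pass
    hint_logit = logits.gather(1, hints.unsqueeze(1)).squeeze(1)
    hint_log_prob = hint_logit - log_z
    return loss, hint_log_prob
```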

Selected proof:

- BPB: 1.06082922
- raw loss sum: 111086708.60261762
- raw bits sum: 160264243.60967380
- inner TTT eval: 544.1s
- total eval wallclock: 566.3s
- wrapper wallclock: 585s
- scored tokens: 47,851,520
- scored bytes: 151,074,499
- doc-order hash: 33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41

## Three-Seed Runtime Evidence

After the seed42 proof, two additional seeds were run through the optimized final package path. All three seed runs finished under a 600 second wrapper wallclock.

| Seed | BPB | Inner TTT eval | Total eval wallclock | Wrapper wallclock | Note |
| ---: | ---: | ---: | ---: | ---: | --- |
| 42 | 1.06082922 | 544.1s | 566.3s | 585s | paired control/fixed/adaptive proof |
| 0 | 1.06158291 | 546.7s | 568.5s | 587s | paired control/fixed/adaptive proof |
| 1234 | 1.06232130 | 545.6s | 568.1s | 586s | selected Adaptive Hedge runtime proof |

The 3-seed mean is 1.06157781 BPB. The headline claim remains the seed42 record-track proof; the mean is included for transparency and is not claimed to beat the displayed leaderboard mean.

For seed1234, paired control/fixed diagnostics were bypassed in the final timing proof to keep the wrapper runtime below 600 seconds. Earlier paired evidence establishes the selected scorer behavior; the seed1234 log should be read as the selected Adaptive Hedge runtime proof.

This submission should be read as a focused, legally constrained follow-up: the previously submitted record stack remains frozen, the selected mechanism is measured with exact official accounting, and the final score improvement comes from a targeted eval-time trajectory plus a normalized causal token-level scoring overlay.
# SP8192 CaseOps + 2560 No-Q/V Adaptive Hedge Token N-Gram

Status: record-track seed42 proof complete, with three runtime-compliant seed proofs.

## Result

Headline seed42 proof and summary metrics:

| Metric | Value |
| --- | ---: |
| Seed42 BPB | 1.06082922 |
| 3-seed mean BPB | 1.06157781 |
| Scored tokens | 47,851,520 |
| Scored bytes | 151,074,499 |
| Docs | 50,000 |
| Max artifact bytes | 15,932,067 |
| Cap margin | 67,933 |

Doc-order hash:

```text
33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41
```

## Per-Seed Proofs

| Seed | BPB | Inner TTT eval | Total eval wallclock | Wrapper wallclock | Runtime status |
| ---: | ---: | ---: | ---: | ---: | --- |
| 42 | 1.06082922 | 544.1s | 566.3s | 585s | under 600s |
| 0 | 1.06158291 | 546.7s | 568.5s | 587s | under 600s |
| 1234 | 1.06232130 | 545.6s | 568.1s | 586s | under 600s |

The headline claim is the seed42 record-track proof plus supporting reproducibility evidence. The 3-seed mean is reported for transparency and is not claimed to beat the displayed leaderboard mean.

## Method

This candidate starts from the PR #1915 quantized artifacts and uses an eval-only follow-up stack:

- lower eval-time per-document TTT LR: `0.000075`
- eval/TTT context length: `2560`
- Q/V LoRA disabled for eval-time TTT
- K/MLP/O/lm_head LoRA active
- normalized token-level causal n-gram Adaptive Hedge scoring overlay

The n-gram overlay scores a normalized distribution over the official SP8192 CaseOps token alphabet. It uses strict-prefix hints only and updates state after scoring the current token.
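
A minimal sketch of that ordering, with hypothetical helper names (`ngram.predict`, `ngram.update`, `hedge.score` are stand-ins, not the wrapper's actual API):

```python
def score_document(tokens, step_logprobs, ngram, hedge):
    """Strict-prefix scoring loop: the hint never depends on the current token."""
    total_logprob = 0.0
    for t, tok in enumerate(tokens):
        hint = ngram.predict()                     # built only from tokens[:t]
        lp = step_logprobs(t)                      # model log-probs for position t
        if hint is None:                           # no usable n-gram context yet
            total_logprob += lp[tok]
        else:
            total_logprob += hedge.score(lp[tok], lp[hint], hint == tok)
        ngram.update(tok)                          # state advances only after scoring
    return total_logprob
```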

## Size Accounting

| Component | Bytes |
| --- | ---: |
| max compressed model | 15,874,515 |
| counted `train_gpt.py` wrapper | 57,552 |
| max total | 15,932,067 |
| cap | 16,000,000 |
| margin | 67,933 |

The custom n-gram Python/C logic is embedded into the counted `train_gpt.py` wrapper. No uncounted helper files are required.

## Evaluation Data Safety

The final package path uses a validation-only data view:

- train shards visible during final eval: `0`
- validation token shards: `5`
- validation byte shards: `5`
- train-shard glob/listing skipped in final eval mode

## Legality Notes

- Official SP8192 CaseOps token alphabet.
- Full normalized token distribution after the n-gram tilt.
- Per-document score-first TTT.
- `TTT_WARM_START_A=0`; LoRA resets per document.
- `PHASED_TTT_PREFIX_DOCS=0`; no validation prefix adaptation.
- No global validation SGD or cross-document adaptive state.
- No byte PPM, custom tokenizer, target-conditioned lookup, or external network dependency.

## Included Files

- `train_gpt.py` - counted self-contained wrapper.
- `train_seed42.log` - optimized seed42 proof log.
- `train_seed0.log` - optimized seed0 proof log.
- `train_seed1234.log` - optimized seed1234 selected-scorer proof log.
- `submission.json` - metadata and per-seed results.
- `package_size.json` - package accounting from the selected proof source.
- `eval_data_manifest.json` - final eval data-view proof.
- `ENGINEERING_LOG.md` - engineering log for reviewers.

## Relationship To PR #1915

PR #1915 remains the conservative submitted anchor at 1.06504520 BPB. This submission is a separate follow-up and intentionally does not modify the PR #1915 record folder.

`eval_data_manifest.json`:

```json
{
  "contains_only_eval_data": true,
  "dataset_root": "/workspace/parameter-golf-warm/path-a-final-package-b2-hedge-20260501T000000Z/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved",
  "tokenizer": "/workspace/parameter-golf-warm/path-a/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model",
  "train_shards_visible_in_eval_view": 0,
  "val_byte_shards": 5,
  "val_token_shards": 5
}
```

`package_size.json`:

```json
{
  "cap_bytes": 16000000,
  "code_path": "train_gpt.py",
  "code_wrapper_sha256": "37e54152983aab20b248e54b5b624b057624a78295035958ab4bbc3feca47320",
  "code_source_sha256": "88b870cbdf784d72d75e652f13336a82bab012b563a2e0acf7f83935f4bcff45",
  "code_uncompressed_bytes": 247230,
  "code_wrapper_brotli_base85_bytes": 57552,
  "external_ngram_files_required": false,
  "self_contained_ngram_code": true,
  "per_seed_model_bytes": {
    "42": 15872234,
    "0": 15871302,
    "1234": 15874515
  },
  "per_seed_model_sha256": {
    "42": "565441a723c0705e7d6d81bfe34271f6a4704fc21ceb1a50b076241c135b762c",
    "0": "489cf25b2c852a79a6f8265d007ee1f18196dbd2e5e34b96ef4a9234774e2622",
    "1234": "8b5521ee68a41df1c1e91ade6a814f68c7cf0034096fc775e1f4583275399a41"
  },
  "artifact_bytes_by_seed": {
    "42": 15929786,
    "0": 15928854,
    "1234": 15932067
  },
  "artifact_bytes_max": 15932067,
  "margin_bytes": 67933
}
```