diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/README.md b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/README.md
new file mode 100644
index 0000000000..7fc179485d
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/README.md
@@ -0,0 +1,519 @@
+# Non-record: Cross-Base Regularizer Transferability — A Small Study
+
+**Author**: Bharath @ OptioAI (BharathSShankar) | **Track**: 10min_16mb (non-record / methodological)
+**Date**: 2026-04-30
+
+This is a non-record submission. It contains **20+ single-seed measurement cells** characterizing how seven candidate regularizers behave on two different leaderboard-lineage bases, plus an analysis of how reg-trained embeddings survive different quantization schemes. We submit it as supplementary methodological data — not as a critique of any prior submission, and not as a claim that our reg basis is the right one.
+
+---
+
+## 1. Headline findings
+
+Each finding here is tied to specific cells later in the README. **No claim is unsupported by data; every "tentative" interpretation is explicitly marked.**
+
+1. **Cross-base sign change** (real data, §6). Same regularizer (QAHSP, λ=0.3), same architectural family (PR #1855 lineage), opposite-direction val_bpb effects: −1.55 mBPB on Base A, +2.25 mBPB on Base B. Largest measured swing: 3.80 mBPB.
+
+2. **Pair stacking at 1/√N λ underperforms best single** (real data, §6). All four pre-registered "good" pairs at λ_each = λ\*/√2 measured worse than the best single reg at full λ. Hypothesis 3 (independent-axis composition) inconsistent with our four pairs.
+
+3. **Quant cost is approximately reg-independent on Base A** (real data, §7). Across all 7 regs, quant cost (post-quant minus pre-quant val_bpb) sits in 14.3–14.9 mBPB — the regs change pre-quant val_bpb but not the GPTQ + LQER quant tax. QAHSP's val_bpb advantage comes from a better pre-quant model, *not* from quant-robustness on this pipeline.
+
+4. **PreQuantTTT × ES compounds; PreQuantTTT × QAHSP does not** (real data, §6, gray-track). Adding ES at λ=0.05 to PreQuantTTT delivers val_bpb 1.03942 vs PreQuantTTT-alone 1.03969 (−0.27 mBPB). Adding QAHSP at λ=0.3 produces 1.03985 (+0.16 mBPB, no help). We tentatively interpret this as direction-shaping regs surviving eval-time fine-tuning while codebook-shaping regs are subsumed (§9).
+
+5. **Reg × quant matrix on real LM hidden states** (real data, §8). 6 regs × 7 quant schemes = 42 cells on Base A. Identifies which reg "plays nice" with which quant scheme by smallest L2 distortion / cosine shift / silhouette degradation post-quant.
+
+6. **Regs leave a real but small fingerprint upstream of quantization** (real data, §13). Three independent mechanistic checks (SVD spectrum of weight matrices, hidden-state norm/kurtosis depth trajectory, pairwise CKA between final-block representations) all show: (a) the regs *do* differ from no-reg and from each other — sub-3% Δσᵢ on attention weights, off-diagonal CKA 0.67–0.75; (b) but every difference is below GPTQ int6's per-row noise floor or uniform across regs. This explains §7 mechanistically.
+
+If you read just one section: **§6** for the cross-base val_bpb evidence, **§8** for the real-data reg × quant matrix on real LM hidden states, **§13** for the upstream mechanistic checks.
+
+---
+
+## 2. Companion record submission
+
+This study is paired with one record submission from the same author:
+
+- `A_N9_SimCTG_3LayerRecur_postquantTTT` — Record: SP10240 + SimCTG λ=0.3 + 3-Layer Recurrence + post-quant score-first TTT, val_bpb **1.07502** (3-seed). This is **Base A** for our study.
+
+For Base B, we use the open PR #1965 (himanshudongre, LongCtx no-QV phased TTT). We reproduced PR #1965 on our infrastructure to verify the result and to capture trained models for §8 — this reproduction is documented in the study text but **we are not submitting our reproduction as our own record**. PR #1965 belongs to its original author; we use it here only as a substrate for comparison.
+
+Our earlier `BharathSShankar/PR #1972` (SP10240 + PreQuantTTT, val_bpb 1.03983) is **withdrawn from record consideration** in light of the upstream closure of PR #1958. The PreQuantTTT line's score-after-adapt pattern doesn't satisfy a strict reading of the README's evaluation rule. We retain the artifact internally as documented gray-track data and use it as a *reference implementation* for hypothesis 4 testing in §6.3, but do not contest a record claim for it.
+
+---
+
+## 3. The 7 regularizers
+
+Each operates on a different statistic of either the hidden state stream or the weight tensors:
+
+| Reg | One-line math | Side | Statistic targeted |
+|---|---|---|---|
+| **QAHSP** (Quant-Aware Hidden STE Penalty) | MSE(h, STE-quant(h, int6)) | activation | per-coord int6 grid alignment |
+| **ES** (Embedding Spread) | mean cos²(h_i, h_j)off-diag | activation | angular spread between tokens |
+| **AOS** (Activation Outlier Suppression) | mean(max\|h\| − mean\|h\|) per token | activation | per-token outlier-coord suppression |
+| **HSU** (Hidden State Uniformity) | var(‖h_i‖) | activation | per-token L2 norm uniformity |
+| **WBC** (Weight Bucket-Center) | mean sin²(w/0.05·π) | weight | per-coord int-grid centerline pull |
+| **WOP** (Weight Outlier Penalty) | mean(\|w\| − k·σ)²+, k=4 | weight | weight-row outlier crush |
+| **PCS** (Per-Channel Scale) | var(per-channel max\|w\|) | weight | per-channel scale uniformity |
+
+ES is a hinge-free variant of SimCTG-style contrastive losses, and WOP is per-row weight-outlier suppression. We believe the other five (QAHSP, AOS, HSU, WBC, PCS) do not appear in any prior leaderboard PR; we defined them specifically for this study.
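+
+As a concreteness aid, here is a minimal PyTorch sketch of two of the penalties (the per-row scale choice, the ε floor, and the detached-target reading of "STE-quant" are our illustrative assumptions, not the exact harness code):
+
+```python
+import torch
+import torch.nn.functional as F
+
+def qahsp_penalty(h: torch.Tensor, n_bits: int = 6) -> torch.Tensor:
+    """MSE(h, quant(h)): the quantized values are a detached target, so the
+    gradient flows only through the un-quantized branch and pulls h toward
+    the nearest int grid point (the straight-through view)."""
+    levels = 2 ** (n_bits - 1) - 1                       # 31 for int6
+    scale = h.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / levels
+    h_q = torch.round(h / scale) * scale                 # nearest grid point
+    return F.mse_loss(h, h_q.detach())
+
+def es_penalty(h: torch.Tensor) -> torch.Tensor:
+    """Mean squared off-diagonal cosine similarity between token reps."""
+    hn = F.normalize(h.float(), dim=-1)                  # (T, D) unit rows
+    cos = hn @ hn.t()                                    # (T, T) cosines
+    off_diag = ~torch.eye(h.size(0), dtype=torch.bool, device=h.device)
+    return (cos[off_diag] ** 2).mean()
+```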
+
+---
+
+## 4. The two bases (and why we picked them)
+
+**Base A**: our SP10240 + SimCTG λ=0.3 record stack (= companion submission A). 11L × 512d × 8H, the PR #1855 architectural lineage with our SP10240 tokenizer adoption. Eval: post-quant score-first TTT + sliding-window stride 64. Base 3-seed mean val_bpb: **1.07502** sliding-window.
+
+**Base B**: PR #1965 reproduction (= companion submission B). Same architecture family but SP8192 CaseOps tokenizer + LongCtx no-QV phased TTT (rank=56, prefix=3000) + AWQ-Lite + asymmetric logit rescale + LQER asymmetric rank-4 + lrzip pergroup compression. Single-seed val_bpb: **1.05822** quantized_ttt_phased.
+
+We chose this pair because (a) both are the same architectural family — they share the PR #1855 base — so cross-base differences can't be attributed to fundamentally different model architectures, and (b) Base B is a heavily greedy-tuned descendant of Base A's family, making it a natural test of whether "regs from the parent" transfer to the child.
+
+All cells: SEED=42, MAX_WALLCLOCK_SECONDS=600, 8×H100 SXM. Only the OUR\_\*\_LAMBDA / size knobs vary; everything else is fixed per base.
+
+---
+
+## 5. Pre-registered hypotheses
+
+Frozen at study start (see `REGULARIZATION_ABLATION.md` for the original).
+
+1. **QAHSP wins single-reg on int6-quantized stacks.** Mechanism: STE-quant alignment of activations is direct prep for the actual quantization step. *Confidence: high.*
+2. **WBC has slight-positive or near-neutral effect.** Mechanism: bucketing cooperates with int6 codebook. *Confidence: medium.*
+3. **Pairs at λ_each = λ\*/√N should compose** if the regs operate on independent gradient subspaces (loose generalization of Wang & Isola 2020 alignment+uniformity). *Confidence: medium.*
+4. **Eval-time fine-tuning subsumes training-time prep regs but preserves direction-shaping regs.** *Confidence: low (this was the speculation we most wanted to test).*
+
+In what follows: hypothesis 1 is confirmed; hypotheses 2 and 3 are inconsistent with our data; hypothesis 4 is consistent with our data on Base A, though with only one positive interaction observed.
+
+---
+
+## 6. Cross-base val_bpb measurements (real data)
+
+### 6.1 Base A — single-reg sweep
+
+7 cells, each adds one reg on top of SimCTG λ=0.3.
+
+| Reg config | val_bpb | Δ vs Base A baseline 1.07502 |
+|---|---:|---:|
+| **QAHSP λ=0.3** | **1.07348** | **−1.55 mBPB** ⭐ |
+| WOP λ=0.5 | 1.07376 | −1.26 |
+| HSU λ=0.1 | 1.07403 | −0.99 |
+| ES λ=0.05 | 1.07428 | −0.74 |
+| AOS λ=0.005 | 1.07445 | −0.57 |
+| PCS λ=0.005 | 1.07463 | −0.39 |
+| **WBC λ=0.005** | **1.07522** | **+0.20** |
+
+QAHSP wins single-reg, consistent with hypothesis 1. WBC's slightly negative effect is **inconsistent with hypothesis 2** at this λ. We do not have a confident explanation; one possibility is that the chosen scale (0.05) places grid centroids along regions the optimizer needs to traverse smoothly, so the centerline pull adds friction mid-trajectory. We flag this as an observation, not a finding.
+
+### 6.2 Base A — pre-registered "good" pairs at 1/√2 · λ
+
+| Pair | val_bpb | Δ vs best single (1.07348) |
+|---|---:|---:|
+| QAHSP λ=0.15 + HSU λ=0.05 | 1.07408 | +0.60 |
+| QAHSP λ=0.15 + ES λ=0.03 | 1.07416 | +0.68 |
+| HSU λ=0.05 + ES λ=0.03 | 1.07423 | +0.75 |
+| QAHSP λ=0.15 + PCS λ=0.003 | 1.07475 | +1.27 |
+
+All four pairs at λ_each = λ\*/√2 underperform the best single reg at full λ. **Hypothesis 3 inconsistent** with our data. We offer two possible interpretations in §9.
+
+### 6.3 Base A + PreQuantTTT (gray-track reference)
+
+We ran PreQuantTTT (PR #1958 recipe, 21-epoch AdamW on val tokens) as a reference implementation. PR #1958 was closed upstream, so we treat these as gray-track methodological data — not record-eligible. The cells exist to test hypothesis 4.
+
+| Combo | sliding val_bpb | Δ vs PQT alone (1.03969) |
+|---|---:|---:|
+| **PreQuantTTT alone** | **1.03969** | 0 |
+| **PreQuantTTT + ES λ=0.05** | **1.03942** | **−0.27** ⭐ |
+| PreQuantTTT + QAHSP λ=0.3 | 1.03985 | +0.16 |
+
+ES (direction-shaping) compounds with PreQuantTTT; QAHSP (codebook-shaping) is essentially subsumed. **Consistent with hypothesis 4** for these two cells. We offer a tentative mechanism in §9.3.
+
+### 6.4 Base B — single-reg attempts
+
+| Reg config on Base B | val_bpb | Δ vs Base B baseline 1.05822 |
+|---|---:|---:|
+| **PR #1965 baseline** | **1.05822** ⭐ | 0 |
+| SimCTG λ=0.3 + QAHSP λ=0.3 | 1.06047 | +2.25 |
+| SimCTG λ=0.1 + QAHSP λ=0.1 | 1.05881 | +0.59 |
+| ES λ=0.05 alone | 1.05993 | +1.71 |
+| TripleHash bigram 1024×8 (isolated grad path) | 1.05886 | +0.64 |
+
+**Every variant we tried at our chosen λ values measured worse than baseline on Base B.** We did not exhaustively search smaller λ on Base B — a Base-B-specific small-λ regime might recover positive transfer; we do not claim that's impossible. We claim only that our chosen λ values (which work well on Base A) hurt on Base B.
+
+### 6.5 The cross-base sign-change
+
+| Reg | Base A Δ | Base B Δ | sign change? |
+|---|---:|---:|---|
+| QAHSP λ=0.3 | −1.55 mBPB | +2.25 mBPB | yes (3.80 mBPB swing) |
+| ES λ=0.05 | −0.74 mBPB | +1.71 mBPB | yes (2.45 mBPB swing) |
+| Bigram (TripleHash) | ≈ neutral | +0.64 mBPB | no (both small) |
+
+Same architectural family, same reg, same λ — measurably opposite-direction effects on val_bpb.
+
+`figures/fig1_cross_base_signs.png` shows this as a bar chart.
+
+---
+
+## 7. Pipeline-stage attribution (real data)
+
+For each Base A cell, we extract from the training log the val_bpb at three eval stages: pre-quantization post-EMA, post-int6-quantization (no eval-time tricks), and post-sliding-window (final reported number for non-TTT submissions).
+
+| reg | pre-quant | quantized | quant cost (mBPB) | sliding gain (mBPB) |
+|---|---:|---:|---:|---:|
+| QAHSP λ=0.3 | 1.07493 | 1.08941 | +14.5 | −15.9 |
+| WOP λ=0.5 | 1.07537 | 1.08962 | +14.3 | −15.9 |
+| HSU λ=0.1 | 1.07536 | 1.08993 | +14.6 | −15.9 |
+| ES λ=0.05 | 1.07536 | 1.09023 | +14.9 | −15.9 |
+| AOS λ=0.005 | 1.07580 | 1.09035 | +14.6 | −15.9 |
+| PCS λ=0.005 | 1.07595 | 1.09053 | +14.6 | −15.9 |
+| WBC λ=0.005 | 1.07646 | 1.09112 | +14.7 | −15.9 |
+
+**Quant cost is approximately reg-independent: 14.3–14.9 mBPB across all 7 regs (range 0.6 mBPB, single-seed noise floor).** Sliding gain is also uniform at −15.9 mBPB.
+
+This means: under the GPTQ + LQER + brotli quant pipeline used here, **the relative ranking of post-quant val_bpb is determined almost entirely by the pre-quant ranking**. Different regs change pre-quant val_bpb by ~1.5 mBPB; quant adds a constant 14.5 mBPB tax; sliding subtracts a constant 15.9 mBPB. The ranking is preserved through the pipeline.
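+
+As a sanity check, the decomposition per cell is simple arithmetic on the three stage numbers (the helper name is ours):
+
+```python
+def stage_deltas(pre_quant: float, quantized: float, sliding: float) -> dict:
+    """Split a cell's pipeline into quant cost and sliding gain, in mBPB."""
+    return {
+        "quant_cost_mbpb": (quantized - pre_quant) * 1000,
+        "sliding_gain_mbpb": (sliding - quantized) * 1000,
+    }
+
+# QAHSP row: pre-quant 1.07493 → quantized 1.08941 → sliding 1.07348
+print(stage_deltas(1.07493, 1.08941, 1.07348))
+# ≈ {'quant_cost_mbpb': 14.48, 'sliding_gain_mbpb': -15.93}
+```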
+
+This forces a re-interpretation of QAHSP's win:
+
+> **QAHSP's val_bpb advantage comes from improving the pre-quant model, not from making the model more quant-robust.** It produces a better starting point (1.07493 vs 1.07646 for WBC), and that ~1.5 mBPB pre-quant advantage propagates through a uniform quant tax to the final number.
+
+This is a non-obvious finding for the "quant-aware training" line of work in the leaderboard community. Caveats:
+- Single-seed: 0.6 mBPB range across regs is at the noise floor.
+- Specific quant pipeline: GPTQ + LQER + per-row brotli is sophisticated. On a naïve uniform-quant pipeline, QAHSP might show measurable quant-robustness benefit (untested here).
+- Different base stack: PR #1965's stack uses LQER asymmetric rank-4 quant residuals which themselves do post-hoc compensation; this might absorb most of the per-tensor differences QAHSP introduces.
+
+`figures/fig_pipeline_waterfall.png` shows the per-stage val_bpb propagation as line plots.
+
+### 7.1 PreQuantTTT inverts the quant cost
+
+The cells in §6.3 have a different pipeline shape:
+
+| cell | pre-PQT BF16 | post-quant | quant cost (vs post-PQT BF16) |
+|---|---:|---:|---:|
+| PQT alone | 1.07948 | 1.05176 | +22.8 mBPB |
+| PQT + ES | 1.07516 | 1.05145 | +22.1 mBPB |
+| PQT + QAHSP | 1.07550 | 1.05181 | +22.5 mBPB |
+
+PreQuantTTT (eval-time AdamW on val) overfits the BF16 model to val by ~50 mBPB (1.07948 → 1.02891 BF16 in our P1 run), then quantization re-introduces ~22 mBPB of noise. **Net is still −20 mBPB final improvement** because the BF16 overfit was deep enough to survive the quant noise. This is a different mechanism from training-time regs.
+
+---
+
+## 8. Real-data reg × quant matrix
+
+We trained 6 fresh Base A models (each with one reg config, SEED=42, MAX_WALLCLOCK_SECONDS=600). For each, we ran a forward pass on val tokens to capture last-block hidden states (128 tokens × 512 dims). We then applied 7 quantization schemes per row of hidden states and measured L2 distortion + cosine shift.
+
+This is **real LM hidden states from real trained models**, not synthetic.
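+
+Per (reg, scheme) cell, the measurement reduces to a per-row quantizer plus two distances; a NumPy sketch of the int-sym-per-row case (function names are ours):
+
+```python
+import numpy as np
+
+def quant_int_sym_per_row(H: np.ndarray, n_bits: int) -> np.ndarray:
+    """Symmetric per-row int quantization of hidden states H, shape (T, D)."""
+    levels = 2 ** (n_bits - 1) - 1
+    scale = np.maximum(np.abs(H).max(axis=1, keepdims=True), 1e-8) / levels
+    return np.round(H / scale) * scale
+
+def distortion_metrics(H: np.ndarray, Hq: np.ndarray):
+    l2 = np.linalg.norm(H - Hq, axis=1).mean()       # mean per-token L2 distortion
+    cos = (H * Hq).sum(1) / (np.linalg.norm(H, axis=1)
+                             * np.linalg.norm(Hq, axis=1) + 1e-12)
+    return float(l2), float((1.0 - cos).mean())      # (L2 distortion, cosine shift)
+```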
+
+### 8.1 L2 distortion table (lower = quant preserves geometry better)
+
+| | int4 sym pT | int4 sym pR | int4 asym pR | int6 sym pR | int8 sym pR | AWQ-lite int4 |
+|---|---:|---:|---:|---:|---:|---:|
+| **no-reg** (SimCTG=0) | **8.67** ⭐ | **7.97** ⭐ | **7.04** ⭐ | 4.77 | 1.27 | **2.50** ⭐ |
+| SimCTG λ=0.3 | 8.89 | 7.99 | 7.19 | 5.10 | 1.42 | 2.66 |
+| SimCTG + QAHSP λ=0.3 | 8.90 | 8.40 | 7.77 | 5.72 | 1.61 | 2.81 |
+| **SimCTG + ES λ=0.05** | 8.77 | 8.04 | 7.13 | **4.73** ⭐ | **1.25** ⭐ | 2.51 |
+| SimCTG + HSU λ=0.1 | 9.06 | 8.51 | 7.62 | 5.18 | 1.38 | 2.70 |
+| SimCTG + AOS λ=0.005 | 8.79 | 8.08 | 7.25 | 5.07 | 1.38 | 2.64 |
+
+⭐ = lowest distortion in column; pT = per-tensor scaling, pR = per-row scaling. Mean hidden state L2 norm: ~28 across regs (so int4 distortions of 7-9 are 25-32% of the embedding magnitude; int8 distortions of ~1.3 are ~5%).
+
+GPTQ-lite int4 column omitted from table — our naïve column-by-column implementation produces unphysical distortion (~50, larger than the embedding magnitudes themselves) due to error-propagation blowup. We flag it as an **implementation bug** in our analysis script, not an indictment of GPTQ. Real GPTQ uses Hessian-aware ordering + dynamic scale that we did not implement.
+
+### 8.2 The "plays nice" pattern
+
+**Coarse quant (int4): no regularization wins.** For all int4 schemes (sym per-tensor, sym per-row, asym per-row, AWQ-lite), the **no-reg** cell has lowest L2 distortion. Adding any reg — including just SimCTG — measurably *increases* the int4 quant cost.
+
+**Fine quant (int6 / int8): SimCTG + ES wins.** At int6 and int8, SimCTG + ES has lowest distortion (4.73 / 1.25). The reg's directional shaping helps when there's enough quant resolution.
+
+**SimCTG + QAHSP is consistently the worst** across all quant schemes (rank 6/6 at int4 sym per-tensor, sym per-row, asym per-row, int6, int8, and AWQ). QAHSP's STE penalty, trained toward an int6 grid at λ=0.3, actually moves the embeddings *away* from the per-row scaled int4 grid used at inference time. The mismatch between the training-time grid and the inference-time grids hurts here.
+
+### 8.3 The dissociation: synthetic ≠ real
+
+The synthetic geometric analysis (§10–§12.1) suggested **AOS** is most quant-robust (lowest synthetic L2 distortion). The real-data analysis here says **SimCTG+ES** at fine quant or **no-reg** at coarse quant. AOS is not the winner on real data — it's middle-of-the-pack.
+
+This is the kind of synthetic-real gap §12.2 warned about. **Synthetic geometric analysis is suggestive of mechanism, not predictive of real-data quant performance.**
+
+### 8.4 Reading this in context
+
+Tying back to the §7 finding: quant cost in val_bpb is approximately reg-independent on Base A. The (reg × quant) L2 distortion matrix here shows there *are* differences in how each reg's hidden states survive quant, but those differences (~1-2 in L2 distortion units) translate into single-mBPB val_bpb shifts that get washed out by the much larger constant quant tax (+14.5 mBPB) in the GPTQ + LQER + brotli pipeline.
+
+So: **regs do change quant survival of embeddings, but at the val_bpb level the GPTQ + LQER + brotli pipeline equalizes them.** Different quant pipelines (uniform int4 without LQER) might expose the differences as measurable val_bpb shifts.
+
+`figures/fig_reg_quant_matrix_real.png` — full 4-panel heatmap (L2 distortion, cosine shift, post-quant isoscore, post-quant effective rank) on real Base A LM hidden states.
+
+### 8.5 Caveats specific to §8
+
+- Single seed per cell.
+- Hidden states sampled from a small val batch (128 tokens). Different batches might shift relative orderings within ~10%.
+- We did NOT re-measure post-quant val_bpb for the 6 fresh cells in this study — only L2 distortion of hidden states under quant. The val_bpb numbers banked from these runs (in `parameter-golf/logs/_results.csv`) are sliding-window-only and would show different relative orderings (e.g., on sliding-window val_bpb alone, ES and QAHSP are very close).
+- Our GPTQ-lite implementation is buggy; we present only the 6 quant schemes where the implementation is well-tested.
+- AWQ-lite implements only the per-channel pre-scaling part; full AWQ has additional steps we did not implement.
+
+---
+
+## 9. Tentative mechanisms
+
+We use these mechanisms to organize our observations and to make predictions beyond the cells we ran. **Each is offered as a candidate explanation, not a proven claim.** Readers should treat §6–§8 as the empirical core and §9 as commentary.
+
+### 9.1 Why QAHSP would win on Base A but not Base B
+
+Candidate explanation: Base A's training schedule uses default Polar-Express-NS-Muon LR and standard cosine-warmdown. The end-of-training weight distribution is not heavily tuned, so QAHSP's auxiliary gradient toward the int6 grid functions as useful prep for the downstream quant step.
+
+Base B's schedule (MATRIX_LR=0.026, WARMDOWN_FRAC=0.85, GRAD_CLIP_NORM=0.3, BETA2=0.99, TTT_BETA2=0.99) is the result of accumulated greedy hyperparameter search across multiple lineage PRs. The end-of-training weight distribution is already shaped to interact well with the specific GPTQ + LQER + AWQ-Lite quant pipeline of PR #1965. QAHSP's auxiliary gradient is then largely redundant with the work the schedule already does, *and* perturbs the carefully tuned trajectory.
+
+Empirical support: lambda-monotone deterioration on Base B (+2.25 at λ=0.3, +0.59 at λ=0.1) is consistent with "more reg = more perturbation on a near-locally-optimal trajectory." We cannot rule out other contributing factors (different tokenizer, different TTT eval pipeline, different LQER configuration on Base B vs Base A).
+
+### 9.2 Why pairs underperformed at 1/√2 · λ
+
+The pre-registered variance-budget intuition: for independent-subspace regs, λ_each = λ\*/√N preserves the per-batch reg gradient norm at λ\*. Our four pairs all underperformed the best single reg.
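+
+For reference, the budget arithmetic behind the 1/√N choice (our paraphrase of the pre-registered intuition):
+
+```latex
+% If the per-reg gradients g_i are pairwise orthogonal, cross terms vanish:
+\Big\| \sum_{i=1}^{N} \lambda_i g_i \Big\|^2 = \sum_{i=1}^{N} \lambda_i^2 \|g_i\|^2
+% With \lambda_i = \lambda^* / \sqrt{N} and \|g_i\| \approx \|g\|, the combined
+% reg-gradient norm matches the single-reg budget \lambda^* \|g\|.
+```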
+
+Two possible interpretations, neither tested:
+
+- **Regs share a common gradient pathway.** All seven flow gradient back through the same matrix params (Q, K, V, O, MLP banks). They are not gradient-subspace-independent. A second reg at half λ then dilutes the dominant signal rather than addressing an orthogonal direction.
+- **The 1/√N rescaling is too aggressive.** A pair at full λ for one reg + small λ for the other might compose more successfully — we did not run that experiment.
+
+The §6.3 finding (PQT + ES with ES at its full single-reg λ of 0.05, not a 1/√2-scaled value) is consistent with the second interpretation: when one component carries the dominant signal, a small auxiliary at full λ can add independent value. We have one positive cell, not enough to confirm a rule.
+
+### 9.3 Why ES compounded with PreQuantTTT and QAHSP did not
+
+Candidate explanation:
+
+- **Codebook-shaping regs** (QAHSP, WBC, WOP, PCS) prepare the model's *coordinate-wise* relationship to the int6 grid. Eval-time fine-tuning re-aligns weights against the val distribution, which can overwrite this coordinate-wise prep. The training-time investment in QAHSP becomes essentially a no-op.
+- **Direction-shaping regs** (ES, HSU, AOS) constrain the *angular* or *magnitude* structure of token reps. Eval-time fine-tuning typically updates weights without coordinated flips of many high-magnitude components, so the angular structure is preserved. The well-conditioned manifold remains and small fine-tuning adjustments are more effective on it.
+
+Empirical support: §6.3 has only two PQT × reg cells. The data are consistent with this interpretation but a single positive case is not strong evidence. Predicted but untested:
+- HSU should compound with PQT.
+- WBC, WOP, PCS should be subsumed by PQT.
+- The same compounding pattern might apply on Base B if PreQuantTTT could be added there.
+
+We have not run those cells. We hope this interpretation is testable by independent replication.
+
+---
+
+## 10. Synthetic geometric analysis (mechanism, clearly marked synthetic)
+
+This section, §11, and §12.1 use a controlled synthetic embedding cloud (64 tokens × 32 dims) to **demonstrate what each reg does to embedding geometry**, independent of the noisy val_bpb signal. None of these results should be read as performance numbers; they are mechanism illustrations.
+
+### 10.1 Embedding geometry under each reg
+
+`figures/fig_emb_geometry.png` (synthetic): 64-token × 32-dim cloud. We apply each reg's gradient for 300 SGD steps and visualize the resulting cloud in 2D PCA + L2 norm histograms.
+
+| reg variant | norm var | mean off-diag \|cos\| | max−mean gap | top-1/top-4 sv |
+|---|---:|---:|---:|---:|
+| baseline | 0.204 | 0.163 | 0.736 | 1.18 |
+| QAHSP (int4 STE) | 0.204 | 0.163 | 0.736 | 1.18 |
+| ES (off-diag cos²) | 0.204 | **0.159** | 0.731 | 1.18 |
+| HSU (var of norms) | **0.079** | 0.163 | 0.721 | 1.17 |
+| AOS (max−mean) | 0.185 | 0.158 | **0.559** | 1.19 |
+
+Bold: the column where each reg specifically targets that statistic. **The synthetic gradient steps confirm the regs do what their math says they should do.** QAHSP's effect is small in this synthetic setting at the chosen LR/step count — its STE gradient is small away from grid centroids; with longer training and higher λ it would show measurable grid-pull. We don't read this as a performance comparison.
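+
+The procedure, as a sketch (reusing `es_penalty` from the §3 sketch; the cloud init and LR are illustrative assumptions):
+
+```python
+import torch
+
+torch.manual_seed(0)
+cloud = torch.randn(64, 32, requires_grad=True)    # 64-token × 32-dim cloud
+opt = torch.optim.SGD([cloud], lr=0.05)            # LR is an assumption
+
+for _ in range(300):                               # 300 SGD steps, as above
+    opt.zero_grad()
+    loss = es_penalty(cloud)                       # swap in any reg's penalty
+    loss.backward()
+    opt.step()
+
+with torch.no_grad():                              # "norm var" column, e.g.
+    print(cloud.norm(dim=-1).var().item())
+```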
+
+### 10.2 Cosine similarity heatmap + per-coord distribution
+
+`figures/fig_emb_cosine_coord.png` (synthetic): 64×64 token-token cosine similarity matrix per reg + per-coord activation histogram. Visual story: ES makes the cosine off-diagonal smaller; QAHSP creates a "cleavage" pattern in the per-coord histogram corresponding to the int4 grid centroids. Different regs change different parts of the geometry.
+
+### 10.3 Per-token outlier coordinates
+
+`figures/fig_emb_outliers.png` (synthetic): each token plotted as (mean \|h\|, max \|h\|). Distance above the y=x diagonal = outlier severity. AOS visibly pulls outlier tokens toward the diagonal (max-mean gap closes from 0.74 → 0.56).
+
+### 10.4 Semantic-cluster preservation
+
+`figures/fig_3d_semantic.png` and `figures/fig_semantic_metrics.png` (synthetic): we use 4 clusters × 16 tokens with planted outliers. We apply each reg and measure silhouette score + intra/inter-cluster distance.
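+
+The silhouette measurement is a standard scikit-learn call; a sketch with a simplified version of the planted-cluster construction (exact centers/noise are assumptions):
+
+```python
+import numpy as np
+from sklearn.metrics import silhouette_score
+
+rng = np.random.default_rng(0)
+centers = 5.0 * rng.normal(size=(4, 32))        # 4 cluster centers in 32 dims
+labels = np.repeat(np.arange(4), 16)            # 16 tokens per cluster
+cloud = centers[labels] + rng.normal(size=(64, 32))
+
+print(silhouette_score(cloud, labels))          # higher = clusters better preserved
+```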
+
+| reg | silhouette ↑ | intra-cluster ↓ | inter-cluster ↑ |
+|---|---:|---:|---:|
+| baseline | 0.4168 | 3.285 | 9.840 |
+| QAHSP | 0.4168 | 3.285 | 9.840 |
+| HSU | 0.4171 | 3.247 | 9.838 |
+| AOS | 0.4176 | 3.261 | 9.731 |
+| **ES** | **0.4140** | **3.345** | 9.853 |
+
+In the synthetic setting, ES slightly degrades semantic cluster preservation (silhouette 0.4140 vs baseline 0.4168). This is small — comparable to noise — but directionally interpretable: ES penalizes off-diag cosine, including for *intra-cluster* token pairs that should be similar. The trade-off is "discrimination at the cost of clustering."
+
+We do not claim this synthetic result transfers to real LM hidden states. It is an *intuition-builder* — a future test on real Base A embeddings (which §8 partly addresses) would be the actual evidence.
+
+---
+
+## 11. Canonical metrics from the literature
+
+`figures/fig_canonical_metrics.png` (synthetic): a 4-panel grid with:
+- IsoScore / anisotropy (Ethayarajh 2019)
+- Effective rank (Roy & Vetterli 2007)
+- Quantization-induced distributional shift (KL on cosine distribution)
+- Linear probing classifier (Alain & Bengio 2017)
+
+| reg | isoscore ↓ | eff rank ↑ | spec entropy | quant KL | lin sep |
+|---|---:|---:|---:|---:|---:|
+| baseline | 0.5436 | 18.09 | 0.8355 | 0.00065 | 1.000 |
+| QAHSP | 0.5436 | 18.09 | 0.8355 | 0.00065 | 1.000 |
+| HSU | 0.5436 | 18.10 | 0.8356 | 0.00065 | 1.000 |
+| ES | **0.5364** | **18.31** | **0.8389** | 0.00169 | 1.000 |
+| AOS | 0.5419 | 18.17 | 0.8367 | 0.00352 | 1.000 |
+
+Two observations from the synthetic literature-metric panel:
+
+1. **ES is the most isotropic by all three direction-space measures.** Lower isoscore, higher effective rank, higher spectral entropy. This is consistent with the mechanism: ES literally optimizes for the inverse of isoscore (off-diag cos²).
+
+2. **HSU is identical to baseline on direction-space metrics.** It moves only the L2-norm distribution. Clean dissociation between norm-shaping and direction-shaping regs.
+
+Linear probing accuracy is 1.0 across all regs (the 4-cluster task is too easy to separate them). We re-ran with smaller cluster separation + more noise (`figures/fig_linear_probe_harder.png`) but the test remained too easy to differentiate the regs; designing a harder probing task is left as future work.
+
+`figures/fig_spectral.png` (synthetic): singular value spectrum log-y. ES has the flattest spectrum (highest effective rank); HSU's curve is near-identical to baseline.
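+
+For completeness, the effective-rank panel uses the standard exponentiated spectral entropy (Roy & Vetterli 2007); a sketch (mean-centering first is our choice):
+
+```python
+import numpy as np
+
+def effective_rank(X: np.ndarray) -> float:
+    """exp(Shannon entropy) of the normalized singular-value distribution."""
+    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
+    p = s / s.sum()
+    p = p[p > 1e-12]
+    return float(np.exp(-(p * np.log(p)).sum()))
+```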
+
+---
+
+## 12. Quantization survival (synthetic + real)
+
+### 12.1 Synthetic int6 quantization on the reg-trained cloud
+
+`figures/fig_pre_post_quant.png` and `figures/fig_quant_robustness.png` (synthetic): per-reg cloud, then per-token-row int6 quantization, then per-token L2 distortion + silhouette pre/post.
+
+| reg | mean L2 distortion ↓ | norm Δ% | cos shift | silhouette pre | silhouette post | Δ silhouette |
+|---|---:|---:|---:|---:|---:|---:|
+| baseline | 0.173 | 0.25 | 0.0003 | 0.4168 | 0.4149 | −0.0019 |
+| QAHSP | 0.173 | 0.25 | 0.0003 | 0.4168 | 0.4149 | −0.0019 |
+| HSU | 0.173 | 0.25 | 0.0003 | 0.4171 | 0.4153 | −0.0018 |
+| ES | 0.173 | 0.32 | 0.0003 | 0.4140 | 0.4110 | −0.0030 |
+| **AOS** | **0.162** | 0.28 | **0.0002** | 0.4176 | 0.4166 | **−0.0010** |
+
+Synthetic finding: **AOS has the smallest L2 distortion under int6 quant**, consistent with its mechanism (suppressing per-token max-coord shrinks the per-row scale, reducing rounding step size).
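+
+The mechanism in one identity (standard uniform-quantization error, not specific to our setup):
+
+```latex
+% Symmetric per-row int6 grid step, driven by the row max:
+\Delta = \frac{\max_j |h_j|}{2^{6-1} - 1} = \frac{\max_j |h_j|}{31},
+\qquad
+\mathbb{E}\big[(h - \hat{h})^2\big] \approx \frac{\Delta^2}{12}
+```
+
+So a reg that shrinks the per-token max coordinate (exactly what AOS penalizes) shrinks Δ and with it the expected rounding error.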
+
+### 12.2 Real measured pre vs post quant val_bpb on Base A
+
+`figures/fig_real_pre_post_quant.png` (real, from training logs):
+
+This is the data that motivates the §7 finding — quant cost is approximately reg-independent on Base A. The figure shows the per-reg quant tax as a +14.5 mBPB constant column. Re-stated: **synthetic L2 distortion differences (§12.1) do not propagate into real val_bpb differences** at the GPTQ + LQER + brotli pipeline scale on Base A. Either the synthetic differences are too small to matter, or the LQER asymmetric residual correction absorbs them, or both.
+
+This dissociation between synthetic geometric metrics and real val_bpb effects is itself a finding worth flagging. **Synthetic embedding geometry is suggestive but not predictive of post-quant val_bpb at this scale.**
+
+---
+
+## 13. Mechanistic checks: do the regs change the model upstream of quantization?
+
+The natural reviewer question after §7 ("quant cost is reg-independent") is: *do the regs change anything at all?* If every reg produces post-quant val_bpb within 0.6 mBPB of every other, maybe the regs aren't actually doing different work. We ran three checks to falsify that worry.
+
+### 13.1 Singular-value spectrum of weight matrices
+
+`figures/fig_svd_spectrum.png` (raw spectra, six weight families × six regs, log-scale).
+`figures/fig_svd_flatness.png` (per-family bar chart of mean −log₁₀(σᵢ/σ₁) over the bottom 87.5% of ranks — bigger = flatter spectrum = harder to per-row int6 quantize).
+`figures/fig_svd_differential.png` (each reg's spectrum vs no-reg as a Δ% curve, smoothed window 5, top 90% of ranks, y-clipped to ±4%).
+
+Method: for each of the six trained Base-A variants we compute SVD of every 2-D weight (6 families × 11 layers each) and average σ₁..σₘᵢₙ across layers within a family. We report (a) the raw spectrum, (b) a single flatness scalar per (reg, family), and (c) the differential vs no-reg.
+
+Findings:
+
+- **Flatness is essentially identical across regs** (≤2% relative differences within each family on the flatness scalar; bar chart visually flat).
+- The differential view reveals the structure hidden by the overall similarity:
+
+  | Reg | Family with largest mean Δ% | Mean Δ% (that family) | Max \|Δ\|% |
+  |---|---|---:|---:|
+  | SimCTG (alone) | Attn out | +1.23% | 1.96% |
+  | SimCTG+ES | Attn V | **−1.02%** | 1.55% |
+  | SimCTG+QAHSP | Attn Q | −0.82% | 1.71% |
+  | SimCTG+HSU | Attn out / V | +0.60% | 1.58% |
+  | SimCTG+AOS | Attn K | +0.63% | 1.55% |
+
+- **Regs touch attention more than MLP.** Every reg has max \|Δ\| ≤ 0.6% on all four MLP families; attention swings up to ~2%.
+- **ES is the only reg that reduces attention V σᵢ on average** (mean −1.02%). All others nudge up. Consistent with ES's mechanism (angular spread reduces V's per-row magnitudes via the gradient through the softmax).
+- **All deltas are sub-3% at every rank index.** GPTQ int6 per-row quantization has noise floor ≈4-6% per channel, so the SVD differences are below the quantization-noise threshold — which mechanistically explains why post-quant val_bpb is reg-independent on Base A even though the regs *are* leaving a fingerprint upstream.
+
+### 13.2 Hidden-state norm + kurtosis trajectory through depth
+
+`figures/fig_depth_trajectory.png` — per-block mean ‖h‖ and mean per-coord excess kurtosis, one curve per reg, computed on a deterministic 8×128 sample of synthetic input tokens via a CPU forward pass (flash-attn replaced by an SDPA fallback).
+
+Findings:
+
+- **‖h‖ peaks mid-depth** (~85 at block 3-4) then collapses to ~30 at the final block. Same pattern across all six regs; curves overlap within ~5%.
+- **Kurtosis is near zero through layers 1-9 then explodes to 350-400 at the final block.** Same pattern across all six regs.
+- The reg-induced differences are visible but small. SimCTG variants cluster slightly tighter than no-reg through layers 7-9 (norm-collapse region); SimCTG+ES dips lowest at block 8-9 (~62 vs no-reg's ~63), consistent with its angular-spread role redistributing magnitude.
+- **Outlier emergence (the kurtosis explosion at the final block) is architectural, not reg-driven.** No reg suppresses it — they all sit at 350-390 final-block kurtosis. This is consistent with the residual stream's natural drift toward heavy-tailed pre-logit activations under tied embeddings + RMSNorm + softcap.
+
+This explains why activation-side regs (AOS, HSU, QAHSP) targeted at hidden-state outliers don't dominate quant cost on Base A: the outliers they target only appear at the final block, where the next operation is the tied-embedding logit projection, not a quantized matmul.
+
+### 13.3 Pairwise CKA between final-block representations
+
+`figures/fig_cka_heatmap.png` — linear CKA (Kornblith et al. 2019) between final-block hidden states across the six trained variants, on the same deterministic 8×128 input.
+
+Findings:
+
+- **All off-diagonal CKAs sit in [0.67, 0.75].** Regs produce subtly different representations but all in the same ballpark.
+- The most-similar pairs are no-reg vs SimCTG+QAHSP and SimCTG+ES vs SimCTG+QAHSP (both CKA 0.75); no-reg vs SimCTG+ES sits at 0.70.
+- The two most-dissimilar variants are SimCTG+AOS vs SimCTG (alone) at CKA 0.67.
+- **No reg pulls representations away from the cloud by more than ~10% of the within-cloud spread.**
+
+This is consistent with the SVD finding: regs leave a fingerprint, but the fingerprint is small relative to the variation a single SimCTG vs no-SimCTG choice already induces. CKA confirms that "regs are interchangeable on Base A" is not just true at the post-quant val_bpb level but also at the latent-representation level.
+
+### 13.4 Synthesis
+
+| Layer of analysis | Reg differences? | Magnitude vs noise |
+|---|---|---|
+| Weight spectra (SVD) | Yes, sub-3% per rank | Below int6 per-row quant noise (~5%) |
+| Hidden-state norm/kurtosis | Yes, sub-10% at most depths | Below run-to-run noise from sliding-window eval |
+| Final-block representation (CKA) | Yes, off-diag 0.67-0.75 | Substantial, but uniform across regs |
+| Post-quant val_bpb (§7) | Effectively no, 14.3-14.9 mBPB tax | Below 1-seed val_bpb noise (~2 mBPB) |
+
+The story that emerges: **regs do shape Base A internals, but the shaping happens in a regime that GPTQ + LQER + brotli flattens out.** This reframes the original question from "do these regs work?" to "what would a quant pipeline have to look like for these regs' fingerprints to survive?" — a worthwhile direction we don't pursue here.
+
+---
+
+## 14. Statistical caveats
+
+We ran **single seeds** in this study to keep cell count manageable. Many of the smaller deltas (≤0.5 mBPB) are at or below run-to-run noise (3-seed std ≈ 0.0023 from PR #1855 lineage data). Conclusions we feel reasonably confident about:
+
+- **Sign-change findings (§6) are robust.** A 3.80 mBPB swing is much larger than 1-seed noise, and the direction is consistent across two regs (QAHSP and ES) with a clear candidate mechanism (§9.1).
+- **Quant-cost-uniformity (§7) is robust.** 0.6 mBPB range across 7 regs at the noise floor *is* the finding — no reg differentiates itself in quant cost.
+- **Pair-vs-single ranking (§6.2) is suggestive.** Adjacent ranks within 0.3 mBPB should be treated as approximately tied at single-seed; the overall pattern (every pair worse than best single) holds at 4 cells.
+- **PQT × ES / × QAHSP comparison (§6.3) is suggestive but not statistically confirmed.** A −0.27 mBPB improvement for ES vs a +0.16 regression for QAHSP is on the margin; multi-seed replication would strengthen this.
+- **All synthetic results (§10–§12.1) are mechanism illustrations** of what the regs do to a controlled small-dim cloud. They are not performance forecasts. The dissociation noted in §12.2 cautions against reading them as such.
+
+---
+
+## 15. What we do NOT claim
+
+- We do not claim our 7 regs are the right basis. They were chosen pre-experiment for the 16 MB cap-constrained regime; other reg families (dropout schedules, low-rank weight constraints, activation bottlenecks) might transfer differently.
+- We do not claim Base B is hostile to *all* additions. We did not test eval-time-only side channels (byte-PPM, n-gram tilt) due to ongoing legality discussions about score-after-fit-statistics patterns.
+- We do not claim to have exhausted lambda search on Base B. A configuration at much smaller λ might recover positive transfer; we did not run those cells.
+- We do not claim our cross-base differences are unique to PR #1965. The same study run between any two heavily-tuned bases might show similar transferability gaps; this is a hypothesis for future work.
+- We do not claim the synthetic geometric analysis (§10–§12.1) predicts real-data quant survival on Base A. §12.2 shows the dissociation; we present synthetic results as mechanism intuitions only.
+
+---
+
+## 16. Reproducibility
+
+All Base-A cells (§6.1, 6.2, 6.3): env-gated harness `train_gpt_baseA.py.lzma` (companion submission A's `train_gpt.py` with the 7 reg knobs and bigram + StableMuon as env vars).
+
+Base-B cells (§6.4): four frozen scripts (`train_gpt_baseB_simctg_qahsp.py.lzma`, `train_gpt_baseB_es.py.lzma`, `train_gpt_baseB_es_hsu.py.lzma`, `train_gpt_baseB_bigram.py.lzma`) — each is the PR #1965 reproduction code with the named reg combination grafted in.
+
+Each cell can be reproduced by setting the env vars listed in `ablation_data.csv` on the corresponding script. The 20 cells together took ~7 hr of 8×H100 SXM compute. Logs can be regenerated from the env configs.
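+
+The shape of the env gating inside the harness (a sketch; `OUR_QAHSP_LAMBDA` / `OUR_ES_LAMBDA` illustrate the knob naming, and the penalty functions refer to the §3 sketch — the exact variable list lives in `ablation_data.csv`):
+
+```python
+import os
+
+QAHSP_LAMBDA = float(os.environ.get("OUR_QAHSP_LAMBDA", "0.0"))
+ES_LAMBDA = float(os.environ.get("OUR_ES_LAMBDA", "0.0"))
+
+def total_loss(lm_loss, hidden):
+    # qahsp_penalty / es_penalty as in the §3 sketch. Each reg term is a
+    # no-op at its default lambda of 0, so a single script reproduces
+    # every cell from env configs alone.
+    loss = lm_loss
+    if QAHSP_LAMBDA > 0:
+        loss = loss + QAHSP_LAMBDA * qahsp_penalty(hidden)
+    if ES_LAMBDA > 0:
+        loss = loss + ES_LAMBDA * es_penalty(hidden)
+    return loss
+```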
+
+For §8 (real-data reg × quant matrix): the pipeline is `run_reg_quant_matrix.py` + `build_synergy_figures.py`, triggered by `run_after_trains.sh` after the 6 fresh `EmbStudy_*` cells finish.
+
+---
+
+## 17. Files
+
+- `README.md` — this file
+- `submission.json` — metadata
+- `REGULARIZATION_ABLATION.md` — pre-registered hypotheses, frozen at study start
+- `ablation_data.csv` — raw cell results (config + val_bpb + size + cap-fit) for downstream reuse
+- `pipeline_attribution.json` — extracted pre-quant / quant / sliding / TTT val_bpb per cell
+- `eval_pipeline_breakdown.json` — same data, per-stage breakdown form
+- `run_reg_quant_matrix.py` — analysis pipeline for §8 (real-data reg × quant)
+- `build_synergy_figures.py` — heatmap + synergy detection
+- `build_advanced_figures.py` — analysis pipeline for §13 (SVD spectrum, depth trajectory, CKA)
+- `run_after_trains.sh` — automated trigger after EmbStudy training cells finish
+- `depth_trajectory.json`, `cka_pairwise.json` — extracted §13 numerical tables
+- `figures/` — PNGs: see in-context references in §6–§13
+ - cross-base + pipeline: `fig1_cross_base_signs.png`, `fig_pipeline_waterfall.png`, `fig_real_pre_post_quant.png`, `fig_pqt_compounding.png`, `fig_lambda_budget.png`, `fig_reg_quant_matrix_real.png`
+ - real-data hidden states: `fig_real_3d_pca.png`, `fig_real_canonical_metrics.png`, `fig_real_coord_distribution.png`, `fig_real_l2norm_distribution.png`
+ - mechanistic checks (§13): `fig_svd_spectrum.png`, `fig_svd_flatness.png`, `fig_svd_differential.png`, `fig_depth_trajectory.png`, `fig_cka_heatmap.png`
+
+---
+
+## 18. Credits
+
+Reg-knob design and study: BharathSShankar (this work).
+
+Base-A inherits architecture from PR #1855 lineage with our SP10240 tokenizer adoption. The N9 SimCTG hyperparameters (λ=0.3, margin=0.4) were tuned by us; documented in companion record submission A.
+
+Base-B (PR #1965 lineage): @himanshudongre (PR #1965), @andrewbaggio1 (PR #1953), @alertcat (PR #1945), @codemath3000 (PR #1855), @bigbag (PR #1493), @dexhunter (PR #1413, PR #1331/1437), @clarkkev (PR #1394), @abaybektursun (PR #549). Thanks to these authors for the public PRs we built on.
+
+PreQuantTTT recipe (used in §6.3 only, gray-track): @okezue (PR #1958, since closed). We treated their recipe as a reference implementation for testing hypothesis 4 and respect the closure decision.
+
+Wang & Isola 2020 framing of "alignment + uniformity" decomposition seeded our pre-registered hypothesis 3.
+
+Thanks to OpenAI and the leaderboard organizers for the challenge and for the example PRs that made this study possible.
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_advanced_figures.py b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_advanced_figures.py
new file mode 100644
index 0000000000..31e54d0dee
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_advanced_figures.py
@@ -0,0 +1,403 @@
+"""
+Build 3 additional reviewer-grade figures for Sub C:
+
+ 1. fig_svd_spectrum.png — singular value spectrum per reg, per matrix family.
+ Mechanistic grounding for GPTQ-friendliness:
+ flatter spectrum = more L2 mass in tail dims =
+ harder to per-row int6 quantize.
+
+ 2. fig_depth_trajectory.png — per-layer hidden-state mean ‖h‖ and excess kurtosis
+ through depth, one curve per reg. Shows where
+ outliers emerge in the depth dimension and gives
+ the AOS / HSU / QAHSP motivation real grounding.
+
+ 3. fig_cka_heatmap.png — pairwise CKA (Kornblith 2019) between final-block
+ hidden states of the 6 reg variants. Tests whether
+ the regs produce *meaningfully* different
+ representations or just superficial perturbations.
+
+All work is done on CPU to avoid contending with running training on GPU.
+"""
+
+import os, sys, json, gc
+import numpy as np
+import torch
+import torch.nn.functional as F
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+
+# Monkey-patch flash_attn_3 with SDPA so we can run forward on CPU without
+# touching the GPU (GPU is being used by training runs).
+try:
+ import flash_attn_interface as _fai
+except ImportError:
+ import types as _types
+ _fai = _types.ModuleType("flash_attn_interface")
+ sys.modules["flash_attn_interface"] = _fai
+
+def _sdpa_fallback(q, k, v, causal=True, **_):
+ # q,k,v: (B, T, H, D) → SDPA wants (B, H, T, D)
+ q_ = q.transpose(1, 2)
+ k_ = k.transpose(1, 2)
+ v_ = v.transpose(1, 2)
+ # GQA: expand K/V heads to match Q heads
+ if k_.size(1) != q_.size(1):
+ rep = q_.size(1) // k_.size(1)
+ k_ = k_.repeat_interleave(rep, dim=1)
+ v_ = v_.repeat_interleave(rep, dim=1)
+ out = F.scaled_dot_product_attention(q_, k_, v_, is_causal=causal)
+ return out.transpose(1, 2).contiguous()
+
+_fai.flash_attn_func = _sdpa_fallback
+
+REG_DIRS = {
+ 'no-reg': '/workspace/parameter-golf/candidate_pack/N18_baseA_nosimctg',
+ 'SimCTG': '/workspace/parameter-golf/candidate_pack/N18_baseA_baseline',
+ 'SimCTG+QAHSP': '/workspace/parameter-golf/candidate_pack/N18_baseA_qahsp',
+ 'SimCTG+ES': '/workspace/parameter-golf/candidate_pack/N18_baseA_es',
+ 'SimCTG+HSU': '/workspace/parameter-golf/candidate_pack/N18_baseA_hsu',
+ 'SimCTG+AOS': '/workspace/parameter-golf/candidate_pack/N18_baseA_aos',
+}
+
+OUT_DIR = '/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/figures'
+os.makedirs(OUT_DIR, exist_ok=True)
+
+REG_COLORS = {
+ 'no-reg': '#6b7280',
+ 'SimCTG': '#3b82f6',
+ 'SimCTG+QAHSP': '#10b981',
+ 'SimCTG+ES': '#f59e0b',
+ 'SimCTG+HSU': '#8b5cf6',
+ 'SimCTG+AOS': '#ef4444',
+}
+
+# ───────────────────────────────────────────────────────────────────────────
+# Figure 1: SVD spectrum
+# ───────────────────────────────────────────────────────────────────────────
+WEIGHT_FAMILIES = [
+ ('attn.c_q.weight', 'Attention Q'),
+ ('attn.c_k.weight', 'Attention K'),
+ ('attn.c_v.weight', 'Attention V'),
+ ('attn.proj.weight', 'Attention out'),
+ ('mlp.fc.weight', 'MLP up-proj'),
+ ('mlp.proj.weight', 'MLP down-proj'),
+]
+
+def collect_svd_spectra(state_dict):
+ """For each weight family, compute SVD on each layer's weight,
+ then average the (sorted, normalized) singular value curves across layers.
+ Returns {family: ndarray of length min(in,out)}.
+ """
+ spectra = {fam: [] for fam, _ in WEIGHT_FAMILIES}
+ for k, v in state_dict.items():
+ if v.ndim != 2:
+ continue
+ for fam_substr, _ in WEIGHT_FAMILIES:
+ if k.endswith(fam_substr):
+ w = v.float().cpu()
+ S = torch.linalg.svdvals(w)
+ S = (S / S.max()).numpy() # normalize to spectral norm
+ spectra[fam_substr].append(S)
+ break
+ out = {}
+ for fam, _ in WEIGHT_FAMILIES:
+ if spectra[fam]:
+ stacked = np.stack(spectra[fam], axis=0)
+ out[fam] = stacked.mean(axis=0)
+ return out
+
+def plot_svd_spectrum(svd_per_reg):
+ n_fam = len(WEIGHT_FAMILIES)
+ n_cols = 3
+ n_rows = (n_fam + n_cols - 1) // n_cols
+ fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4.6 * n_rows), squeeze=False)
+ for i, (fam, label) in enumerate(WEIGHT_FAMILIES):
+ ax = axes[i // n_cols][i % n_cols]
+ for reg, spectra in svd_per_reg.items():
+ if fam not in spectra: continue
+ S = spectra[fam]
+ ax.semilogy(np.arange(1, len(S)+1)/len(S), S, color=REG_COLORS[reg],
+ lw=1.7, alpha=0.9, label=reg)
+ ax.set_xlabel('rank index (normalized)')
+ ax.set_ylabel('σᵢ / σ₁ (log)')
+ ax.set_title(label, fontsize=11)
+ ax.grid(True, alpha=0.3, which='both')
+ ax.set_ylim(bottom=1e-2)
+ # legend on first axis
+ axes[0][0].legend(loc='lower left', fontsize=8, framealpha=0.85)
+ # hide unused panels
+ for j in range(n_fam, n_rows * n_cols):
+ axes[j // n_cols][j % n_cols].axis('off')
+ fig.suptitle('Singular-value spectrum per regularizer, averaged across 11 layers',
+ fontsize=13, y=1.00)
+ fig.tight_layout()
+ out = os.path.join(OUT_DIR, 'fig_svd_spectrum.png')
+ fig.savefig(out, dpi=130, bbox_inches='tight')
+ plt.close(fig)
+ print(f" saved {out}")
+ # also save a condensed bar chart of "spectrum flatness" — a single number per (reg, fam).
+ # Flatness metric: -mean(log10(sigma_i / sigma_1)) — bigger = flatter spectrum =
+ # more L2 mass distributed in tail, harder to per-row int6 quantize.
+ fig2, ax2 = plt.subplots(figsize=(11, 5))
+ fams = [fam for fam, _ in WEIGHT_FAMILIES]
+ fam_labels = [lbl for _, lbl in WEIGHT_FAMILIES]
+ regs = list(svd_per_reg.keys())
+ width = 0.8 / max(len(regs), 1)
+ x = np.arange(len(fams))
+ for k, reg in enumerate(regs):
+ vals = []
+ for fam in fams:
+ S = svd_per_reg[reg].get(fam)
+ if S is None or len(S) == 0:
+ vals.append(np.nan); continue
+ tail = S[max(1, len(S)//8):] # exclude top 12.5% (head dims)
+ vals.append(float(-np.log10(np.clip(tail, 1e-8, None)).mean()))
+ ax2.bar(x + (k - (len(regs)-1)/2)*width, vals, width=width,
+ color=REG_COLORS.get(reg, '#999'), label=reg, edgecolor='black', linewidth=0.4)
+ ax2.set_xticks(x)
+ ax2.set_xticklabels(fam_labels, rotation=15, ha='right')
+ ax2.set_ylabel('mean −log₁₀(σᵢ/σ₁) over tail dims\n(higher = flatter, harder to int6 per-row quantize)')
+ ax2.set_title('Spectrum flatness per (regularizer, weight family) — tail dims only', fontsize=12)
+ ax2.legend(loc='upper left', fontsize=8, framealpha=0.85, ncol=2)
+ ax2.grid(True, alpha=0.3, axis='y')
+ fig2.tight_layout()
+ out2 = os.path.join(OUT_DIR, 'fig_svd_flatness.png')
+ fig2.savefig(out2, dpi=130, bbox_inches='tight')
+ plt.close(fig2)
+ print(f" saved {out2}")
+
+# ───────────────────────────────────────────────────────────────────────────
+# Figure 2: per-layer depth trajectory of ‖h‖ and excess kurtosis
+# ───────────────────────────────────────────────────────────────────────────
+def excess_kurtosis(x, dim=-1):
+ """Excess kurtosis (Fisher) along dim; positive = heavy tails."""
+ x = x.float()
+ m = x.mean(dim=dim, keepdim=True)
+ x_c = x - m
+ m4 = (x_c ** 4).mean(dim=dim)
+ m2 = (x_c ** 2).mean(dim=dim)
+ return (m4 / (m2 ** 2 + 1e-12) - 3.0)
+
+def collect_depth_trajectory(model_dir, n_seq=8, seq_len=128):
+ """Run a forward pass and capture hidden state after each block.
+ Returns (mean_norm[L], mean_kurt[L]).
+ """
+ sys.path.insert(0, model_dir)
+ if 'train_gpt' in sys.modules: del sys.modules['train_gpt']
+ import importlib.util
+ spec = importlib.util.spec_from_file_location("train_gpt", os.path.join(model_dir, "train_gpt.py"))
+ train_gpt = importlib.util.module_from_spec(spec)
+ os.environ.setdefault("WORLD_SIZE", "1")
+ os.environ.setdefault("RANK", "0")
+ os.environ.setdefault("LOCAL_RANK", "0")
+ os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
+ os.environ.setdefault("MASTER_PORT", "29500")
+ spec.loader.exec_module(train_gpt)
+ h_cls = train_gpt.Hyperparameters() if hasattr(train_gpt, 'Hyperparameters') else None
+ model = train_gpt.GPT(h_cls)
+ sd = torch.load(os.path.join(model_dir, "final_model.pt"), map_location="cpu", weights_only=False)
+ if isinstance(sd, dict) and 'state_dict' in sd: sd = sd['state_dict']
+ model.load_state_dict(sd, strict=False)
+ model.eval().float() # CPU float32 for stability
+
+ L = len(model.blocks)
+ captured = [None] * L
+ hooks = []
+ for i, blk in enumerate(model.blocks):
+ def make_hook(idx):
+ def h(m, inp, out):
+ captured[idx] = out.detach().float()
+ return h
+ hooks.append(blk.register_forward_hook(make_hook(i)))
+
+ # Use deterministic synthetic tokens; we want consistent comparison across regs.
+ rng = np.random.default_rng(0)
+ toks = rng.integers(0, 10000, size=(n_seq, seq_len)).astype(np.int64)
+ toks = torch.from_numpy(toks)
+
+ with torch.no_grad():
+ try:
+ _ = model.forward_logits(toks)
+ except Exception as e:
+ print(f" forward error: {e}")
+ for h in hooks: h.remove()
+ return None, None
+
+ for h in hooks: h.remove()
+
+ norms, kurts = [], []
+ for i, h in enumerate(captured):
+ if h is None:
+ norms.append(np.nan); kurts.append(np.nan); continue
+ # h: (B, T, D); per-token norm + per-coord kurtosis
+ hn = h.reshape(-1, h.size(-1))
+ norms.append(hn.norm(dim=-1).mean().item())
+ kurts.append(excess_kurtosis(hn, dim=-1).mean().item())
+ del model, captured
+ gc.collect()
+ return np.array(norms), np.array(kurts)
+
+def plot_depth_trajectory(traj_per_reg):
+ fig, (axN, axK) = plt.subplots(1, 2, figsize=(14, 4.8))
+ for reg, (norms, kurts) in traj_per_reg.items():
+ if norms is None: continue
+ x = np.arange(1, len(norms)+1)
+ axN.plot(x, norms, marker='o', ms=4.5, color=REG_COLORS[reg], lw=1.7, label=reg)
+ axK.plot(x, kurts, marker='o', ms=4.5, color=REG_COLORS[reg], lw=1.7, label=reg)
+ axN.set_xlabel('block index (1 = closest to embedding)')
+ axN.set_ylabel('mean ‖h‖₂ across tokens')
+ axN.set_title('hidden-state norm trajectory through depth', fontsize=11)
+ axN.grid(True, alpha=0.3)
+ axK.set_xlabel('block index (1 = closest to embedding)')
+ axK.set_ylabel('mean per-coord excess kurtosis')
+ axK.set_title('hidden-state heavy-tail-ness through depth', fontsize=11)
+ axK.grid(True, alpha=0.3)
+ axK.axhline(0, color='k', lw=0.7, alpha=0.5)
+ axK.legend(loc='best', fontsize=8, framealpha=0.85)
+ fig.suptitle('Where in the depth do regularizers shape outliers?', fontsize=12, y=1.02)
+ fig.tight_layout()
+ out = os.path.join(OUT_DIR, 'fig_depth_trajectory.png')
+ fig.savefig(out, dpi=130, bbox_inches='tight')
+ plt.close(fig)
+ print(f" saved {out}")
+
+# ───────────────────────────────────────────────────────────────────────────
+# Figure 3: pairwise CKA heatmap
+# ───────────────────────────────────────────────────────────────────────────
+def linear_cka(X, Y):
+ """Linear CKA from Kornblith et al. 2019.
+ X: (N, dx) Y: (N, dy) — same N."""
+ X = X - X.mean(0, keepdim=True)
+ Y = Y - Y.mean(0, keepdim=True)
+ XtY = X.t() @ Y
+ num = (XtY ** 2).sum()
+ den = ((X.t() @ X) ** 2).sum().sqrt() * ((Y.t() @ Y) ** 2).sum().sqrt()
+ return float(num / (den + 1e-12))
+
+def get_last_hidden(model_dir, n_seq=8, seq_len=128):
+ """Same setup as depth-trajectory but return only the final-block hidden state."""
+ sys.path.insert(0, model_dir)
+ if 'train_gpt' in sys.modules: del sys.modules['train_gpt']
+ import importlib.util
+ spec = importlib.util.spec_from_file_location("train_gpt", os.path.join(model_dir, "train_gpt.py"))
+ train_gpt = importlib.util.module_from_spec(spec)
+ os.environ.setdefault("WORLD_SIZE", "1")
+ os.environ.setdefault("RANK", "0")
+ os.environ.setdefault("LOCAL_RANK", "0")
+ os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
+ os.environ.setdefault("MASTER_PORT", "29500")
+ spec.loader.exec_module(train_gpt)
+ h_cls = train_gpt.Hyperparameters() if hasattr(train_gpt, 'Hyperparameters') else None
+ model = train_gpt.GPT(h_cls)
+ sd = torch.load(os.path.join(model_dir, "final_model.pt"), map_location="cpu", weights_only=False)
+ if isinstance(sd, dict) and 'state_dict' in sd: sd = sd['state_dict']
+ model.load_state_dict(sd, strict=False)
+ model.eval().float()
+ captured = [None]
+ last = model.blocks[-1]
+ def hook(m, inp, out):
+ captured[0] = out.detach().float()
+ h = last.register_forward_hook(hook)
+ rng = np.random.default_rng(0)
+ toks = rng.integers(0, 10000, size=(n_seq, seq_len)).astype(np.int64)
+ toks = torch.from_numpy(toks)
+ with torch.no_grad():
+ try:
+ _ = model.forward_logits(toks)
+ except Exception as e:
+ print(f" forward error: {e}")
+ h.remove()
+ return None
+ h.remove()
+ out = captured[0]
+ if out is None: return None
+ out = out.reshape(-1, out.size(-1))
+ del model
+ gc.collect()
+ return out
+
+def plot_cka_heatmap(cka_matrix, regs):
+ fig, ax = plt.subplots(figsize=(7.2, 6))
+ im = ax.imshow(cka_matrix, vmin=0.0, vmax=1.0, cmap='magma')
+ ax.set_xticks(range(len(regs)))
+ ax.set_yticks(range(len(regs)))
+ ax.set_xticklabels(regs, rotation=35, ha='right')
+ ax.set_yticklabels(regs)
+ for i in range(len(regs)):
+ for j in range(len(regs)):
+ txt_color = 'white' if cka_matrix[i, j] < 0.55 else 'black'
+ ax.text(j, i, f'{cka_matrix[i, j]:.2f}', ha='center', va='center',
+ color=txt_color, fontsize=9)
+ ax.set_title('Linear CKA between final-block hidden states\n(Kornblith et al. 2019)', fontsize=11)
+ cbar = fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
+ cbar.set_label('CKA (1.0 = same representation)')
+ fig.tight_layout()
+ out = os.path.join(OUT_DIR, 'fig_cka_heatmap.png')
+ fig.savefig(out, dpi=130, bbox_inches='tight')
+ plt.close(fig)
+ print(f" saved {out}")
+
+# ───────────────────────────────────────────────────────────────────────────
+def main():
+ # Figure 1: SVD spectrum (state-dict only, fast)
+ print("[1/3] SVD spectra per reg (matrix-family-wise)")
+ svd_per_reg = {}
+ for reg, d in REG_DIRS.items():
+ sd_path = os.path.join(d, 'final_model.pt')
+ if not os.path.exists(sd_path):
+ print(f" {reg}: skipped (no final_model.pt)"); continue
+ print(f" {reg}: SVD")
+ sd = torch.load(sd_path, map_location='cpu', weights_only=False)
+ if isinstance(sd, dict) and 'state_dict' in sd: sd = sd['state_dict']
+ svd_per_reg[reg] = collect_svd_spectra(sd)
+ del sd; gc.collect()
+ plot_svd_spectrum(svd_per_reg)
+
+ # Figure 2: depth trajectory (forward pass per reg)
+ print("\n[2/3] depth trajectory per reg (per-layer ‖h‖ + kurtosis)")
+ traj_per_reg = {}
+ for reg, d in REG_DIRS.items():
+ if not os.path.exists(os.path.join(d, 'final_model.pt')):
+ traj_per_reg[reg] = (None, None); continue
+ print(f" {reg}: forward")
+ norms, kurts = collect_depth_trajectory(d)
+ traj_per_reg[reg] = (norms, kurts)
+ plot_depth_trajectory(traj_per_reg)
+
+ # Save trajectory numbers as JSON for the README to reference.
+ traj_json = {reg: {'norms': (n.tolist() if n is not None else None),
+ 'kurts': (k.tolist() if k is not None else None)}
+ for reg, (n, k) in traj_per_reg.items()}
+ with open('/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/depth_trajectory.json', 'w') as f:
+ json.dump(traj_json, f, indent=2)
+
+ # Figure 3: CKA heatmap
+ print("\n[3/3] CKA pairwise heatmap")
+ hidden_per_reg = {}
+ for reg, d in REG_DIRS.items():
+ if not os.path.exists(os.path.join(d, 'final_model.pt')):
+ continue
+ print(f" {reg}: forward (last block only)")
+ h = get_last_hidden(d)
+ if h is not None:
+ hidden_per_reg[reg] = h
+ regs = list(hidden_per_reg.keys())
+ n = len(regs)
+ cka = np.zeros((n, n))
+ for i, ri in enumerate(regs):
+ for j, rj in enumerate(regs):
+ if j < i:
+ cka[i, j] = cka[j, i]
+ else:
+ cka[i, j] = linear_cka(hidden_per_reg[ri], hidden_per_reg[rj])
+ plot_cka_heatmap(cka, regs)
+ cka_json = {ri: {rj: float(cka[i, j]) for j, rj in enumerate(regs)} for i, ri in enumerate(regs)}
+ with open('/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/cka_pairwise.json', 'w') as f:
+ json.dump(cka_json, f, indent=2)
+
+ print("\nAll 3 figures + 2 JSON tables written.")
+
+if __name__ == '__main__':
+ main()
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_real_data_figures.py b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_real_data_figures.py
new file mode 100644
index 0000000000..19f2312c66
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_real_data_figures.py
@@ -0,0 +1,231 @@
+"""Build visualizations from REAL captured hidden states across the 6 EmbStudy models."""
+import os, sys
+import numpy as np
+import torch
+import torch.nn.functional as F
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+from mpl_toolkits.mplot3d import Axes3D # noqa
+
+ROOT = '/workspace/parameter-golf'
+FIG_DIR = f'{ROOT}/submissions/C_CrossBase_RegTransfer_Study/figures'
+os.makedirs(FIG_DIR, exist_ok=True)
+
+base_dirs = {
+ 'no-reg': f'{ROOT}/candidate_pack/N18_baseA_nosimctg',
+ 'SimCTG': f'{ROOT}/candidate_pack/N18_baseA_baseline',
+ 'SimCTG+QAHSP': f'{ROOT}/candidate_pack/N18_baseA_qahsp',
+ 'SimCTG+ES': f'{ROOT}/candidate_pack/N18_baseA_es',
+ 'SimCTG+HSU': f'{ROOT}/candidate_pack/N18_baseA_hsu',
+ 'SimCTG+AOS': f'{ROOT}/candidate_pack/N18_baseA_aos',
+}
+
+def load_hidden(model_dir, n_tokens=128):
+ """Load BF16 model + run forward + capture hidden states."""
+ sys.path.insert(0, model_dir)
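+ # Each model dir ships its own train_gpt.py; evict any cached import so this dir's architecture class is the one loaded.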
+ if 'train_gpt' in sys.modules:
+ del sys.modules['train_gpt']
+ import importlib.util
+ spec = importlib.util.spec_from_file_location("train_gpt", os.path.join(model_dir, "train_gpt.py"))
+ train_gpt = importlib.util.module_from_spec(spec)
+ os.environ.setdefault("WORLD_SIZE", "1")
+ os.environ.setdefault("RANK", "0")
+ os.environ.setdefault("LOCAL_RANK", "0")
+ os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
+ os.environ.setdefault("MASTER_PORT", "29500")
+ spec.loader.exec_module(train_gpt)
+
+ h_cls = train_gpt.Hyperparameters() if hasattr(train_gpt, 'Hyperparameters') else None
+ model_cls = None
+ for cls_name in ['GPT', 'FinalMiniLM', 'Model']:
+ if hasattr(train_gpt, cls_name):
+ model_cls = getattr(train_gpt, cls_name)
+ break
+ model = model_cls(h_cls)
+ state_dict = torch.load(os.path.join(model_dir, "final_model.pt"), map_location="cpu", weights_only=False)
+ if isinstance(state_dict, dict) and 'state_dict' in state_dict:
+ state_dict = state_dict['state_dict']
+ model.load_state_dict(state_dict, strict=False)
+ model.eval()
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = model.to(device).bfloat16()
+
+ toks = np.random.randint(0, 8000, size=8 * n_tokens)
+ toks = torch.from_numpy(toks).reshape(8, n_tokens).long().to(device)
+
+ captured = {}
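+ # Hook what should be the final transformer block ('.10' on 11-block lineages, '.9' as a fallback).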
+ for name, mod in model.named_modules():
+ if name.endswith('.10') or name.endswith('.9'):
+ def make_hook(n):
+ def hook(m, inp, out):
+ captured[n] = out.detach().cpu().float()
+ return hook
+ mod.register_forward_hook(make_hook(name))
+
+ with torch.no_grad():
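+ # Hooks fire as each block finishes, so even if forward later raises (e.g. in a loss head), captures survive.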
+ try:
+ _ = model.forward_logits(toks) if hasattr(model, 'forward_logits') else model(toks)
+ except Exception as e:
+ print(f" forward error in {model_dir}: {e}")
+
+ if captured:
+ h = list(captured.values())[-1]
+ return h.reshape(-1, h.size(-1))[:n_tokens]
+ return None
+
+# === Capture hidden states ===
+np.random.seed(42)
+torch.manual_seed(42)
+hidden_per_reg = {}
+for reg, d in base_dirs.items():
+ if os.path.exists(os.path.join(d, "final_model.pt")):
+ h = load_hidden(d, n_tokens=128)
+ if h is not None:
+ hidden_per_reg[reg] = h
+ print(f"{reg:<14}: shape={tuple(h.shape)} mean_L2={h.pow(2).sum(-1).sqrt().mean().item():.2f}")
+
+if not hidden_per_reg:
+ print("No models loaded. Exit.")
+ sys.exit(1)
+
+# Save hidden states for downstream reuse
+torch.save(hidden_per_reg, f'{ROOT}/submissions/C_CrossBase_RegTransfer_Study/real_hidden_states.pt')
+
+# === FIG 1: Real-data 3D PCA scatter ===
+fig = plt.figure(figsize=(20, 8))
+fig.suptitle('Real Base A LM hidden states (128 tokens × 512 dims), 3D PCA per reg', fontsize=13, weight='bold')
+n_tok = 128
+colors = plt.cm.viridis(np.linspace(0, 1, n_tok))
+for col, (name, h) in enumerate(hidden_per_reg.items()):
+ h_np = h.numpy()
+ h_c = h_np - h_np.mean(0, keepdims=True)
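+ # PCA via SVD of the centered matrix: U[:, :3] * S[:3] gives the top-3 principal-component scores.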
+ U, S, _ = np.linalg.svd(h_c, full_matrices=False)
+ proj = U[:, :3] * S[:3]
+ ax = fig.add_subplot(1, len(hidden_per_reg), col+1, projection='3d')
+ ax.scatter(proj[:, 0], proj[:, 1], proj[:, 2], c=colors, s=30, alpha=0.7, edgecolors='black', linewidths=0.3)
+ ax.set_title(name, fontsize=11)
+ ax.tick_params(labelsize=7)
+ ax.view_init(elev=22, azim=45)
+ # Label axes with the top-3 singular values (a proxy for spread along each PC)
+ ax.set_xlabel(f'PC1 (sv={S[0]:.1f})', fontsize=8)
+ ax.set_ylabel(f'PC2 (sv={S[1]:.1f})', fontsize=8)
+ ax.set_zlabel(f'PC3 (sv={S[2]:.1f})', fontsize=8)
+plt.tight_layout()
+plt.savefig(f'{FIG_DIR}/fig_real_3d_pca.png', dpi=130, bbox_inches='tight')
+plt.close()
+print("saved fig_real_3d_pca.png")
+
+# === FIG 2: Real per-coord distribution per reg ===
+fig, axes = plt.subplots(2, 3, figsize=(16, 8))
+fig.suptitle('Real per-coordinate hidden state value distributions (128 tokens × 512 dims)', fontsize=13, weight='bold')
+for ax, (name, h) in zip(axes.flat, hidden_per_reg.items()):
+ flat = h.numpy().flatten()
+ ax.hist(flat, bins=60, color='steelblue', alpha=0.7, edgecolor='black')
+ ax.set_title(f'{name}\nμ={flat.mean():.3f}, σ={flat.std():.3f}, max|h|={np.abs(flat).max():.2f}', fontsize=10)
+ ax.set_xlabel('h_d value')
+ ax.set_ylabel('count')
+ ax.grid(alpha=0.3)
+plt.tight_layout()
+plt.savefig(f'{FIG_DIR}/fig_real_coord_distribution.png', dpi=130)
+plt.close()
+print("saved fig_real_coord_distribution.png")
+
+# === FIG 3: Real-data canonical metrics comparison ===
+def isoscore(h):
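+ # Proxy "isoscore": mean |cos(h_i, h_j)| with the diagonal zeroed; lower = more isotropic directions.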
+ h_n = F.normalize(h, dim=-1)
+ sim = h_n @ h_n.t()
+ n = h_n.size(0)
+ off = sim - torch.eye(n)
+ return off.abs().mean().item()
+
+def eff_rank(h):
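+ # Effective rank = exp(entropy of the normalized singular-value distribution) (Roy & Vetterli, 2007).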
+ h_c = h - h.mean(0, keepdim=True)
+ _, S, _ = torch.linalg.svd(h_c, full_matrices=False)
+ p = S / S.sum()
+ p = p[p > 1e-10]
+ return float(np.exp(-(p * p.log()).sum().item()))
+
+real_metrics = {}
+for name, h in hidden_per_reg.items():
+ real_metrics[name] = {
+ 'isoscore': isoscore(h),
+ 'eff_rank': eff_rank(h),
+ 'norm_var': h.pow(2).sum(-1).sqrt().var().item(),
+ 'norm_mean': h.pow(2).sum(-1).sqrt().mean().item(),
+ 'max_abs': h.abs().max().item(),
+ }
+
+fig, axes = plt.subplots(2, 2, figsize=(13, 10))
+fig.suptitle('Real-data canonical metrics: do regs change real LM hidden states the way synthetic predicts?', fontsize=13, weight='bold')
+names = list(real_metrics.keys())
+colors_p = plt.cm.tab10(np.arange(len(names)))
+
+ax = axes[0, 0]
+isos = [real_metrics[n]['isoscore'] for n in names]
+ax.bar(names, isos, color=colors_p, edgecolor='black', alpha=0.85)
+ax.set_ylabel('mean |cos(h_i, h_j)| off-diag')
+ax.set_title('Isoscore (lower = more isotropic)')
+ax.tick_params(axis='x', rotation=20)
+ax.grid(axis='y', alpha=0.3)
+for i, v in enumerate(isos): ax.text(i, v+0.001, f'{v:.4f}', ha='center', fontsize=9, weight='bold')
+
+ax = axes[0, 1]
+ers = [real_metrics[n]['eff_rank'] for n in names]
+ax.bar(names, ers, color=colors_p, edgecolor='black', alpha=0.85)
+ax.set_ylabel('exp(spectral entropy)')
+ax.set_title('Effective rank (higher = more dimensions used)')
+ax.tick_params(axis='x', rotation=20)
+ax.grid(axis='y', alpha=0.3)
+for i, v in enumerate(ers): ax.text(i, v+0.5, f'{v:.1f}', ha='center', fontsize=9, weight='bold')
+
+ax = axes[1, 0]
+nvs = [real_metrics[n]['norm_var'] for n in names]
+ax.bar(names, nvs, color=colors_p, edgecolor='black', alpha=0.85)
+ax.set_ylabel('variance of L2 norms')
+ax.set_title('Per-token L2 norm variance (lower = more uniform)')
+ax.tick_params(axis='x', rotation=20)
+ax.grid(axis='y', alpha=0.3)
+for i, v in enumerate(nvs): ax.text(i, v+0.05, f'{v:.2f}', ha='center', fontsize=9, weight='bold')
+
+ax = axes[1, 1]
+mxs = [real_metrics[n]['max_abs'] for n in names]
+ax.bar(names, mxs, color=colors_p, edgecolor='black', alpha=0.85)
+ax.set_ylabel('max |h| across all coords')
+ax.set_title('Outlier coord magnitude (lower = AOS-like effect)')
+ax.tick_params(axis='x', rotation=20)
+ax.grid(axis='y', alpha=0.3)
+for i, v in enumerate(mxs): ax.text(i, v+0.5, f'{v:.1f}', ha='center', fontsize=9, weight='bold')
+
+plt.tight_layout()
+plt.savefig(f'{FIG_DIR}/fig_real_canonical_metrics.png', dpi=130)
+plt.close()
+print("saved fig_real_canonical_metrics.png")
+
+# === FIG 4: Real data per-token L2 norm distribution per reg ===
+fig, axes = plt.subplots(2, 3, figsize=(15, 8))
+fig.suptitle('Real per-token L2 norm distributions across 128 captured tokens', fontsize=13, weight='bold')
+for ax, (name, h) in zip(axes.flat, hidden_per_reg.items()):
+ norms = h.pow(2).sum(-1).sqrt().numpy()
+ ax.hist(norms, bins=20, color='darkgreen', alpha=0.7, edgecolor='black')
+ ax.set_title(f'{name}\nμ={norms.mean():.2f}, σ={norms.std():.2f}', fontsize=10)
+ ax.set_xlabel('‖h‖')
+ ax.set_ylabel('count')
+ ax.grid(alpha=0.3)
+plt.tight_layout()
+plt.savefig(f'{FIG_DIR}/fig_real_l2norm_distribution.png', dpi=130)
+plt.close()
+print("saved fig_real_l2norm_distribution.png")
+
+print()
+print("=== Real-data canonical metric table ===")
+print(f"{'reg':<14} {'isoscore':>10} {'eff_rank':>10} {'norm_var':>10} {'norm_mean':>10} {'max|h|':>10}")
+for n in names:
+ m = real_metrics[n]
+ print(f" {n:<14} {m['isoscore']:>10.4f} {m['eff_rank']:>10.2f} {m['norm_var']:>10.3f} {m['norm_mean']:>10.2f} {m['max_abs']:>10.2f}")
+
+# Save metrics as JSON
+import json
+open(f'{ROOT}/submissions/C_CrossBase_RegTransfer_Study/real_canonical_metrics.json', 'w').write(json.dumps(real_metrics, indent=2))
+print("saved real_canonical_metrics.json")
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_synergy_figures.py b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_synergy_figures.py
new file mode 100644
index 0000000000..6c799ee2df
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_synergy_figures.py
@@ -0,0 +1,94 @@
+"""Build heatmap + synergy-detection figures from real_reg_quant_matrix.json."""
+import json, os, sys
+import numpy as np
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+
+with open('/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/real_reg_quant_matrix.json') as f:
+ data = json.load(f)
+
+regs = list(data.keys())
+quants = list(data[regs[0]].keys())
+n_r, n_q = len(regs), len(quants)
+
+# Build matrices for each metric
+metrics = ['l2_distortion', 'cos_shift', 'isoscore_post', 'eff_rank_post']
+mats = {m: np.zeros((n_r, n_q)) for m in metrics}
+for ri, r in enumerate(regs):
+ for qi, q in enumerate(quants):
+ for m in metrics:
+ mats[m][ri, qi] = data[r][q][m]
+
+# === Figure: heatmaps ===
+fig, axes = plt.subplots(2, 2, figsize=(15, 10))
+fig.suptitle('Real (reg × quant) interaction matrix on REAL Base A LM hidden states\n(captured from forward pass on val tokens, 6 trained models)', fontsize=13, weight='bold')
+
+cmaps = {'l2_distortion': 'RdYlGn_r', 'cos_shift': 'RdYlGn_r', 'isoscore_post': 'RdYlGn_r', 'eff_rank_post': 'RdYlGn'}
+titles = {
+ 'l2_distortion': '(a) L2 distortion (lower = better)',
+ 'cos_shift': '(b) Cosine shift (lower = better)',
+ 'isoscore_post': '(c) Post-quant isoscore (lower = better)',
+ 'eff_rank_post': '(d) Post-quant effective rank (higher = better)',
+}
+
+for ax, m in zip(axes.flat, metrics):
+ mat = mats[m]
+ im = ax.imshow(mat, aspect='auto', cmap=cmaps[m])
+ ax.set_xticks(range(n_q)); ax.set_xticklabels(quants, rotation=30, ha='right', fontsize=9)
+ ax.set_yticks(range(n_r)); ax.set_yticklabels(regs, fontsize=10)
+ ax.set_title(titles[m], fontsize=11)
+ # Annotate values
+ for i in range(n_r):
+ for j in range(n_q):
+ ax.text(j, i, f'{mat[i,j]:.4f}' if 'l2' in m or 'shift' in m else f'{mat[i,j]:.3f}',
+ ha='center', va='center', fontsize=8, color='black')
+ plt.colorbar(im, ax=ax, fraction=0.04)
+ # Mark best per quant (column) — for distortion/cos_shift, lowest; for eff_rank, highest
+ for j in range(n_q):
+ col = mat[:, j]
+ best_row = np.argmin(col) if m != 'eff_rank_post' else np.argmax(col)
+ ax.scatter(j, best_row, marker='*', s=200, c='gold', edgecolors='black', linewidths=1, zorder=10)
+
+plt.tight_layout()
+plt.savefig('/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/figures/fig_reg_quant_matrix_real.png', dpi=130, bbox_inches='tight')
+plt.close()
+print("saved fig_reg_quant_matrix_real.png")
+
+# === Synergy detection: which (reg, quant) pairs are unexpectedly good? ===
+# For each metric, center every cell against its row mean (that reg's average) and its column mean (that quant's average)
+# A "synergy" is a cell that beats both its row mean AND its column mean by a margin
+print()
+print("=== SYNERGY detection (cells that play unusually nicely) ===")
+for m in ['l2_distortion', 'cos_shift']:
+ mat = mats[m]
+ row_means = mat.mean(axis=1, keepdims=True)
+ col_means = mat.mean(axis=0, keepdims=True)
+ # Synergy: cell is BELOW both row mean and col mean (lower distortion is better here)
+ rel_row = mat - row_means # negative = better than row average
+ rel_col = mat - col_means
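+ # Flag cells at least 0.3 global-σ better (lower) than both their row mean and their column mean.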
+ print(f"\nMetric: {m}")
+ print(f" Each cell: relative-to-row-mean / relative-to-col-mean")
+ for ri, r in enumerate(regs):
+ for qi, q in enumerate(quants):
+ rr, rc = rel_row[ri, qi], rel_col[ri, qi]
+ if rr < -mat.std()*0.3 and rc < -mat.std()*0.3:
+ print(f" ⭐ SYNERGY: {r:<10} × {q:<22} (row Δ {rr:+.4f}, col Δ {rc:+.4f}) — both reg AND quant outperform their means")
+
+# === "Plays nice" summary table ===
+print()
+print("=== 'Plays nice' summary: best reg per quant + best quant per reg ===")
+print()
+print("For each QUANT scheme, which REG produces the smallest distortion?")
+print(f"{'quant scheme':<22} {'best reg':<10} {'L2 dist':>9}")
+for qi, q in enumerate(quants):
+ col = mats['l2_distortion'][:, qi]
+ best_r = np.argmin(col)
+ print(f" {q:<22} {regs[best_r]:<10} {col[best_r]:.4f}")
+print()
+print("For each REG, which QUANT gives smallest distortion?")
+print(f"{'reg':<10} {'best quant':<22} {'L2 dist':>9}")
+for ri, r in enumerate(regs):
+ row = mats['l2_distortion'][ri, :]
+ best_q = np.argmin(row)
+ print(f" {r:<10} {quants[best_q]:<22} {row[best_q]:.4f}")
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/cka_pairwise.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/cka_pairwise.json
new file mode 100644
index 0000000000..b23086ae07
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/cka_pairwise.json
@@ -0,0 +1,50 @@
+{
+ "no-reg": {
+ "no-reg": 1.0000001192092896,
+ "SimCTG": 0.7081529498100281,
+ "SimCTG+QAHSP": 0.7461575865745544,
+ "SimCTG+ES": 0.7027785181999207,
+ "SimCTG+HSU": 0.7411860227584839,
+ "SimCTG+AOS": 0.7073161005973816
+ },
+ "SimCTG": {
+ "no-reg": 0.7081529498100281,
+ "SimCTG": 1.0,
+ "SimCTG+QAHSP": 0.7191595435142517,
+ "SimCTG+ES": 0.6846227049827576,
+ "SimCTG+HSU": 0.688707709312439,
+ "SimCTG+AOS": 0.6744828224182129
+ },
+ "SimCTG+QAHSP": {
+ "no-reg": 0.7461575865745544,
+ "SimCTG": 0.7191595435142517,
+ "SimCTG+QAHSP": 1.0000001192092896,
+ "SimCTG+ES": 0.7486706972122192,
+ "SimCTG+HSU": 0.691008448600769,
+ "SimCTG+AOS": 0.7166145443916321
+ },
+ "SimCTG+ES": {
+ "no-reg": 0.7027785181999207,
+ "SimCTG": 0.6846227049827576,
+ "SimCTG+QAHSP": 0.7486706972122192,
+ "SimCTG+ES": 1.0000001192092896,
+ "SimCTG+HSU": 0.7097563147544861,
+ "SimCTG+AOS": 0.7295978665351868
+ },
+ "SimCTG+HSU": {
+ "no-reg": 0.7411860227584839,
+ "SimCTG": 0.688707709312439,
+ "SimCTG+QAHSP": 0.691008448600769,
+ "SimCTG+ES": 0.7097563147544861,
+ "SimCTG+HSU": 1.0000001192092896,
+ "SimCTG+AOS": 0.7205949425697327
+ },
+ "SimCTG+AOS": {
+ "no-reg": 0.7073161005973816,
+ "SimCTG": 0.6744828224182129,
+ "SimCTG+QAHSP": 0.7166145443916321,
+ "SimCTG+ES": 0.7295978665351868,
+ "SimCTG+HSU": 0.7205949425697327,
+ "SimCTG+AOS": 0.9999999403953552
+ }
+}
\ No newline at end of file
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/depth_trajectory.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/depth_trajectory.json
new file mode 100644
index 0000000000..7fcf77b329
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/depth_trajectory.json
@@ -0,0 +1,170 @@
+{
+ "no-reg": {
+ "norms": [
+ 82.8610610961914,
+ 74.36043548583984,
+ 84.44847106933594,
+ 84.7039794921875,
+ 64.86913299560547,
+ 73.76284790039062,
+ 69.17152404785156,
+ 63.027923583984375,
+ 52.42226791381836,
+ 35.87632751464844,
+ 28.76905632019043
+ ],
+ "kurts": [
+ 4.855867385864258,
+ 2.8805336952209473,
+ 1.1206727027893066,
+ 1.0582356452941895,
+ 0.8072985410690308,
+ 1.0698959827423096,
+ 1.292937994003296,
+ 4.44473123550415,
+ 23.753156661987305,
+ 183.02464294433594,
+ 388.1587829589844
+ ]
+ },
+ "SimCTG": {
+ "norms": [
+ 82.8995132446289,
+ 75.21341705322266,
+ 83.13368225097656,
+ 82.3014907836914,
+ 64.17926788330078,
+ 73.55484771728516,
+ 68.2824935913086,
+ 64.23097229003906,
+ 53.610286712646484,
+ 36.06589126586914,
+ 27.404142379760742
+ ],
+ "kurts": [
+ 5.22653865814209,
+ 3.193834066390991,
+ 0.8576445579528809,
+ 0.9317208528518677,
+ 0.706852376461029,
+ 0.8229748010635376,
+ 1.9099056720733643,
+ 4.851380348205566,
+ 28.853525161743164,
+ 184.3623809814453,
+ 358.630859375
+ ]
+ },
+ "SimCTG+QAHSP": {
+ "norms": [
+ 81.03509521484375,
+ 74.17752838134766,
+ 83.78406524658203,
+ 82.69522094726562,
+ 63.89334487915039,
+ 74.52857971191406,
+ 67.91127014160156,
+ 64.07562255859375,
+ 52.825660705566406,
+ 35.43454360961914,
+ 30.548107147216797
+ ],
+ "kurts": [
+ 3.835179567337036,
+ 2.4510762691497803,
+ 0.8331541419029236,
+ 0.7151315212249756,
+ 0.7320422530174255,
+ 0.8011575937271118,
+ 0.9048928022384644,
+ 3.4804654121398926,
+ 21.797697067260742,
+ 173.99606323242188,
+ 386.97918701171875
+ ]
+ },
+ "SimCTG+ES": {
+ "norms": [
+ 82.62226104736328,
+ 75.1025390625,
+ 84.94764709472656,
+ 80.84661102294922,
+ 63.999778747558594,
+ 72.64584350585938,
+ 65.707275390625,
+ 61.80288314819336,
+ 49.580055236816406,
+ 34.79338073730469,
+ 27.630577087402344
+ ],
+ "kurts": [
+ 4.948187828063965,
+ 3.426382303237915,
+ 1.003125548362732,
+ 1.0885035991668701,
+ 0.8019349575042725,
+ 1.0049747228622437,
+ 1.1360023021697998,
+ 3.7354226112365723,
+ 17.517728805541992,
+ 174.50112915039062,
+ 381.40606689453125
+ ]
+ },
+ "SimCTG+HSU": {
+ "norms": [
+ 81.51065063476562,
+ 74.14579772949219,
+ 84.77708435058594,
+ 83.02979278564453,
+ 66.24172973632812,
+ 75.18669891357422,
+ 64.73681640625,
+ 62.87533187866211,
+ 52.14603042602539,
+ 35.268707275390625,
+ 28.539628982543945
+ ],
+ "kurts": [
+ 3.804694652557373,
+ 1.8663853406906128,
+ 0.7781698703765869,
+ 0.9039973020553589,
+ 0.8394896984100342,
+ 0.7438814640045166,
+ 1.1972663402557373,
+ 4.4696946144104,
+ 27.126493453979492,
+ 183.8109588623047,
+ 386.6787109375
+ ]
+ },
+ "SimCTG+AOS": {
+ "norms": [
+ 84.75853729248047,
+ 77.02366638183594,
+ 84.03617858886719,
+ 81.2281265258789,
+ 66.06658172607422,
+ 72.4874038696289,
+ 63.07447814941406,
+ 62.116519927978516,
+ 52.27328872680664,
+ 35.48656463623047,
+ 28.157312393188477
+ ],
+ "kurts": [
+ 5.198080062866211,
+ 4.80472469329834,
+ 0.7657652497291565,
+ 0.7305189371109009,
+ 0.642646312713623,
+ 0.6297460794448853,
+ 1.489835500717163,
+ 5.857874870300293,
+ 30.544631958007812,
+ 182.83627319335938,
+ 385.41748046875
+ ]
+ }
+}
\ No newline at end of file
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/eval_pipeline_breakdown.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/eval_pipeline_breakdown.json
new file mode 100644
index 0000000000..101ff3a6f1
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/eval_pipeline_breakdown.json
@@ -0,0 +1,168 @@
+{
+ "cells": [
+ {
+ "name": "QAHSP \u03bb=0.3",
+ "size": 17028688,
+ "metrics": {
+ "pre-quantization post-ema": 1.07492939,
+ "quantized": 1.0894092,
+ "quantized_sliding_window": 1.07347521
+ }
+ },
+ {
+ "name": "ES \u03bb=0.05",
+ "size": 17022208,
+ "metrics": {
+ "pre-quantization post-ema": 1.07535789,
+ "quantized": 1.09022908,
+ "quantized_sliding_window": 1.07428482
+ }
+ },
+ {
+ "name": "AOS \u03bb=0.005",
+ "size": 17029456,
+ "metrics": {
+ "pre-quantization post-ema": 1.07579561,
+ "quantized": 1.09035483,
+ "quantized_sliding_window": 1.07445292
+ }
+ },
+ {
+ "name": "HSU \u03bb=0.1",
+ "size": 17030768,
+ "metrics": {
+ "pre-quantization post-ema": 1.07535949,
+ "quantized": 1.08993364,
+ "quantized_sliding_window": 1.07403453
+ }
+ },
+ {
+ "name": "WBC \u03bb=0.005",
+ "size": 17063164,
+ "metrics": {
+ "pre-quantization post-ema": 1.07646463,
+ "quantized": 1.09112469,
+ "quantized_sliding_window": 1.07521576
+ }
+ },
+ {
+ "name": "WOP \u03bb=0.5",
+ "size": 17029348,
+ "metrics": {
+ "pre-quantization post-ema": 1.07536867,
+ "quantized": 1.08962462,
+ "quantized_sliding_window": 1.07375795
+ }
+ },
+ {
+ "name": "PCS \u03bb=0.005",
+ "size": 17029124,
+ "metrics": {
+ "pre-quantization post-ema": 1.07595394,
+ "quantized": 1.09052669,
+ "quantized_sliding_window": 1.07462584
+ }
+ },
+ {
+ "name": "QAHSP+HSU pair",
+ "size": 17024280,
+ "metrics": {
+ "pre-quantization post-ema": 1.07532708,
+ "quantized": 1.08998156,
+ "quantized_sliding_window": 1.07408126
+ }
+ },
+ {
+ "name": "QAHSP+ES pair",
+ "size": 17027192,
+ "metrics": {
+ "pre-quantization post-ema": 1.07548971,
+ "quantized": 1.09006997,
+ "quantized_sliding_window": 1.07416475
+ }
+ },
+ {
+ "name": "HSU+ES pair",
+ "size": 17030228,
+ "metrics": {
+ "pre-quantization post-ema": 1.07558521,
+ "quantized": 1.09012842,
+ "quantized_sliding_window": 1.07422695
+ }
+ },
+ {
+ "name": "QAHSP+PCS pair",
+ "size": 17032068,
+ "metrics": {
+ "pre-quantization post-ema": 1.07614311,
+ "quantized": 1.09061423,
+ "quantized_sliding_window": 1.07474981
+ }
+ },
+ {
+ "name": "PQT + QAHSP \u03bb=0.3",
+ "size": 17023556,
+ "metrics": {
+ "pre-quantization post-ema": 1.07550024,
+ "post-prequant-ttt": 1.02901401,
+ "quantized": 1.05180787,
+ "quantized_sliding_window": 1.03985482
+ }
+ },
+ {
+ "name": "PQT + ES \u03bb=0.05",
+ "size": 17025048,
+ "metrics": {
+ "pre-quantization post-ema": 1.07516222,
+ "post-prequant-ttt": 1.02867971,
+ "quantized": 1.05144513,
+ "quantized_sliding_window": 1.03942213
+ }
+ },
+ {
+ "name": "Base B baseline (PR #1965)",
+ "size": 15977654,
+ "metrics": {
+ "pre-quantization post-ema": 1.06162051,
+ "quantized": 1.06995915,
+ "quantized_ttt_phased": 1.05822408
+ }
+ },
+ {
+ "name": "Base B + SimCTG+QAHSP \u03bb=0.3",
+ "size": 15972592,
+ "metrics": {
+ "pre-quantization post-ema": 1.06414265,
+ "quantized": 1.07235931,
+ "quantized_ttt_phased": 1.06047215
+ }
+ },
+ {
+ "name": "Base B + SimCTG+QAHSP \u03bb=0.1",
+ "size": 15974388,
+ "metrics": {
+ "pre-quantization post-ema": 1.06235812,
+ "quantized": 1.07065788,
+ "quantized_ttt_phased": 1.05880775
+ }
+ },
+ {
+ "name": "Base B + ES \u03bb=0.05",
+ "size": 15972817,
+ "metrics": {
+ "pre-quantization post-ema": 1.06333929,
+ "quantized": 1.07184449,
+ "quantized_ttt_phased": 1.05993433
+ }
+ },
+ {
+ "name": "Base B + bigram(1024\u00d78)",
+ "size": 16013368,
+ "metrics": {
+ "pre-quantization post-ema": 1.06223567,
+ "quantized": 1.07064692,
+ "quantized_ttt_phased": 1.05886441
+ }
+ }
+ ]
+}
\ No newline at end of file
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig1_cross_base_signs.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig1_cross_base_signs.png
new file mode 100644
index 0000000000..75670d02d4
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig1_cross_base_signs.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_cka_heatmap.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_cka_heatmap.png
new file mode 100644
index 0000000000..fc4f3db23c
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_cka_heatmap.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_depth_trajectory.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_depth_trajectory.png
new file mode 100644
index 0000000000..fcf4733088
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_depth_trajectory.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_lambda_budget.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_lambda_budget.png
new file mode 100644
index 0000000000..76afc93cae
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_lambda_budget.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pipeline_waterfall.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pipeline_waterfall.png
new file mode 100644
index 0000000000..04053b9728
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pipeline_waterfall.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pqt_compounding.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pqt_compounding.png
new file mode 100644
index 0000000000..897bb9956c
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pqt_compounding.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_3d_pca.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_3d_pca.png
new file mode 100644
index 0000000000..515ac31248
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_3d_pca.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_canonical_metrics.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_canonical_metrics.png
new file mode 100644
index 0000000000..90e08c80e4
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_canonical_metrics.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_coord_distribution.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_coord_distribution.png
new file mode 100644
index 0000000000..20e5825f1f
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_coord_distribution.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_l2norm_distribution.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_l2norm_distribution.png
new file mode 100644
index 0000000000..350f177a2a
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_l2norm_distribution.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_pre_post_quant.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_pre_post_quant.png
new file mode 100644
index 0000000000..b10926398c
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_pre_post_quant.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_reg_quant_matrix_real.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_reg_quant_matrix_real.png
new file mode 100644
index 0000000000..b2b668170c
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_reg_quant_matrix_real.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_differential.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_differential.png
new file mode 100644
index 0000000000..3e98fff860
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_differential.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_flatness.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_flatness.png
new file mode 100644
index 0000000000..286fc232f0
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_flatness.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_spectrum.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_spectrum.png
new file mode 100644
index 0000000000..f3d6d69f17
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_spectrum.png differ
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/pipeline_attribution.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/pipeline_attribution.json
new file mode 100644
index 0000000000..7825090ca6
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/pipeline_attribution.json
@@ -0,0 +1,158 @@
+[
+ {
+ "cell": "QAHSP \u03bb=0.3",
+ "base": "A",
+ "pre_quant": 1.07492939,
+ "quantized": 1.0894092,
+ "sliding": 1.07347521,
+ "ttt": null,
+ "quant_cost_mBPB": 14.47980999999987,
+ "sliding_gain_mBPB": -15.933989999999953,
+ "ttt_gain_mBPB": null,
+ "final": 1.07347521
+ },
+ {
+ "cell": "ES \u03bb=0.05",
+ "base": "A",
+ "pre_quant": 1.07535789,
+ "quantized": 1.09022908,
+ "sliding": 1.07428482,
+ "ttt": null,
+ "quant_cost_mBPB": 14.871190000000034,
+ "sliding_gain_mBPB": -15.944260000000154,
+ "ttt_gain_mBPB": null,
+ "final": 1.07428482
+ },
+ {
+ "cell": "AOS \u03bb=0.005",
+ "base": "A",
+ "pre_quant": 1.07579561,
+ "quantized": 1.09035483,
+ "sliding": 1.07445292,
+ "ttt": null,
+ "quant_cost_mBPB": 14.559220000000206,
+ "sliding_gain_mBPB": -15.901910000000186,
+ "ttt_gain_mBPB": null,
+ "final": 1.07445292
+ },
+ {
+ "cell": "HSU \u03bb=0.1",
+ "base": "A",
+ "pre_quant": 1.07535949,
+ "quantized": 1.08993364,
+ "sliding": 1.07403453,
+ "ttt": null,
+ "quant_cost_mBPB": 14.574149999999841,
+ "sliding_gain_mBPB": -15.899109999999883,
+ "ttt_gain_mBPB": null,
+ "final": 1.07403453
+ },
+ {
+ "cell": "WBC \u03bb=0.005",
+ "base": "A",
+ "pre_quant": 1.07646463,
+ "quantized": 1.09112469,
+ "sliding": 1.07521576,
+ "ttt": null,
+ "quant_cost_mBPB": 14.660059999999975,
+ "sliding_gain_mBPB": -15.908929999999932,
+ "ttt_gain_mBPB": null,
+ "final": 1.07521576
+ },
+ {
+ "cell": "WOP \u03bb=0.5",
+ "base": "A",
+ "pre_quant": 1.07536867,
+ "quantized": 1.08962462,
+ "sliding": 1.07375795,
+ "ttt": null,
+ "quant_cost_mBPB": 14.255949999999906,
+ "sliding_gain_mBPB": -15.86666999999986,
+ "ttt_gain_mBPB": null,
+ "final": 1.07375795
+ },
+ {
+ "cell": "PCS \u03bb=0.005",
+ "base": "A",
+ "pre_quant": 1.07595394,
+ "quantized": 1.09052669,
+ "sliding": 1.07462584,
+ "ttt": null,
+ "quant_cost_mBPB": 14.572749999999912,
+ "sliding_gain_mBPB": -15.900849999999966,
+ "ttt_gain_mBPB": null,
+ "final": 1.07462584
+ },
+ {
+ "cell": "PQT + ES \u03bb=0.05",
+ "base": "A-PQT",
+ "pre_quant": 1.07516222,
+ "quantized": 1.05144513,
+ "sliding": 1.03942213,
+ "ttt": null,
+ "quant_cost_mBPB": -23.717089999999885,
+ "sliding_gain_mBPB": -12.023000000000117,
+ "ttt_gain_mBPB": null,
+ "final": 1.03942213
+ },
+ {
+ "cell": "PQT + QAHSP \u03bb=0.3",
+ "base": "A-PQT",
+ "pre_quant": 1.07550024,
+ "quantized": 1.05180787,
+ "sliding": 1.03985482,
+ "ttt": null,
+ "quant_cost_mBPB": -23.692370000000018,
+ "sliding_gain_mBPB": -11.95305000000002,
+ "ttt_gain_mBPB": null,
+ "final": 1.03985482
+ },
+ {
+ "cell": "Base B baseline",
+ "base": "B",
+ "pre_quant": 1.06162051,
+ "quantized": 1.06995915,
+ "sliding": null,
+ "ttt": 1.05822408,
+ "quant_cost_mBPB": 8.338640000000064,
+ "sliding_gain_mBPB": null,
+ "ttt_gain_mBPB": -11.73507000000007,
+ "final": 1.05822408
+ },
+ {
+ "cell": "B + SimCTG+QAHSP \u03bb=0.1",
+ "base": "B",
+ "pre_quant": 1.06235812,
+ "quantized": 1.07065788,
+ "sliding": null,
+ "ttt": 1.05880775,
+ "quant_cost_mBPB": 8.299759999999878,
+ "sliding_gain_mBPB": null,
+ "ttt_gain_mBPB": -11.850130000000014,
+ "final": 1.05880775
+ },
+ {
+ "cell": "B + ES \u03bb=0.05",
+ "base": "B",
+ "pre_quant": 1.06333929,
+ "quantized": 1.07184449,
+ "sliding": null,
+ "ttt": 1.05993433,
+ "quant_cost_mBPB": 8.50519999999988,
+ "sliding_gain_mBPB": null,
+ "ttt_gain_mBPB": -11.910160000000003,
+ "final": 1.05993433
+ },
+ {
+ "cell": "B + bigram 1024\u00d78",
+ "base": "B",
+ "pre_quant": 1.06223567,
+ "quantized": 1.07064692,
+ "sliding": null,
+ "ttt": 1.05886441,
+ "quant_cost_mBPB": 8.411249999999981,
+ "sliding_gain_mBPB": null,
+ "ttt_gain_mBPB": -11.782509999999968,
+ "final": 1.05886441
+ }
+]
\ No newline at end of file
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_canonical_metrics.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_canonical_metrics.json
new file mode 100644
index 0000000000..41ae5614e8
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_canonical_metrics.json
@@ -0,0 +1,44 @@
+{
+ "no-reg": {
+ "isoscore": 0.9196764230728149,
+ "eff_rank": 96.45550204522421,
+ "norm_var": 19.190195083618164,
+ "norm_mean": 27.46596908569336,
+ "max_abs": 36.5
+ },
+ "SimCTG": {
+ "isoscore": 0.8950488567352295,
+ "eff_rank": 94.40702880943573,
+ "norm_var": 28.09220314025879,
+ "norm_mean": 26.198692321777344,
+ "max_abs": 36.5
+ },
+ "SimCTG+QAHSP": {
+ "isoscore": 0.9058330059051514,
+ "eff_rank": 92.1850517498386,
+ "norm_var": 29.38678550720215,
+ "norm_mean": 30.21584701538086,
+ "max_abs": 41.0
+ },
+ "SimCTG+ES": {
+ "isoscore": 0.9128783941268921,
+ "eff_rank": 95.01078049667903,
+ "norm_var": 29.09756851196289,
+ "norm_mean": 29.541982650756836,
+ "max_abs": 43.0
+ },
+ "SimCTG+HSU": {
+ "isoscore": 0.9256589412689209,
+ "eff_rank": 95.99981971300701,
+ "norm_var": 22.477293014526367,
+ "norm_mean": 29.241943359375,
+ "max_abs": 38.25
+ },
+ "SimCTG+AOS": {
+ "isoscore": 0.9227072596549988,
+ "eff_rank": 99.81859154964323,
+ "norm_var": 14.296555519104004,
+ "norm_mean": 28.209720611572266,
+ "max_abs": 35.25
+ }
+}
\ No newline at end of file
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_reg_quant_matrix.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_reg_quant_matrix.json
new file mode 100644
index 0000000000..e1fa47c068
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_reg_quant_matrix.json
@@ -0,0 +1,266 @@
+{
+ "no-reg": {
+ "int4 sym per-tensor": {
+ "l2_distortion": 8.665924072265625,
+ "cos_shift": 0.058444440364837646,
+ "isoscore_post": 0.9379310607910156,
+ "eff_rank_post": 4.676845098324823
+ },
+ "int4 sym per-row": {
+ "l2_distortion": 7.969930648803711,
+ "cos_shift": 0.049631595611572266,
+ "isoscore_post": 0.9429349899291992,
+ "eff_rank_post": 17.07542076078601
+ },
+ "int4 asym per-row": {
+ "l2_distortion": 7.040150165557861,
+ "cos_shift": 0.03654921054840088,
+ "isoscore_post": 0.9111442565917969,
+ "eff_rank_post": 78.67987774765582
+ },
+ "int6 sym per-row": {
+ "l2_distortion": 4.770608901977539,
+ "cos_shift": 0.01566535234451294,
+ "isoscore_post": 0.8909909725189209,
+ "eff_rank_post": 103.73207476658023
+ },
+ "int8 sym per-row": {
+ "l2_distortion": 1.2735472917556763,
+ "cos_shift": 0.0011124610900878906,
+ "isoscore_post": 0.9084160327911377,
+ "eff_rank_post": 97.27140992067748
+ },
+ "AWQ-lite int4": {
+ "l2_distortion": 2.503967761993408,
+ "cos_shift": 0.004290342330932617,
+ "isoscore_post": 0.9011155962944031,
+ "eff_rank_post": 100.39177831746376
+ },
+ "GPTQ-lite int4": {
+ "l2_distortion": 51.821632385253906,
+ "cos_shift": 0.4511241912841797,
+ "isoscore_post": 0.537254810333252,
+ "eff_rank_post": 106.64174919958656
+ }
+ },
+ "SimCTG": {
+ "int4 sym per-tensor": {
+ "l2_distortion": 8.885649681091309,
+ "cos_shift": 0.06010890007019043,
+ "isoscore_post": 0.9370186924934387,
+ "eff_rank_post": 3.1090120294548624
+ },
+ "int4 sym per-row": {
+ "l2_distortion": 7.985650062561035,
+ "cos_shift": 0.04347085952758789,
+ "isoscore_post": 0.9251681566238403,
+ "eff_rank_post": 12.936205182701624
+ },
+ "int4 asym per-row": {
+ "l2_distortion": 7.186392784118652,
+ "cos_shift": 0.032656848430633545,
+ "isoscore_post": 0.9017077684402466,
+ "eff_rank_post": 62.38439015852314
+ },
+ "int6 sym per-row": {
+ "l2_distortion": 5.099906921386719,
+ "cos_shift": 0.014867603778839111,
+ "isoscore_post": 0.8821943998336792,
+ "eff_rank_post": 97.97688454189425
+ },
+ "int8 sym per-row": {
+ "l2_distortion": 1.4151151180267334,
+ "cos_shift": 0.00112152099609375,
+ "isoscore_post": 0.8963391184806824,
+ "eff_rank_post": 91.62891844039687
+ },
+ "AWQ-lite int4": {
+ "l2_distortion": 2.658930778503418,
+ "cos_shift": 0.00402677059173584,
+ "isoscore_post": 0.8892554044723511,
+ "eff_rank_post": 95.11689794763643
+ },
+ "GPTQ-lite int4": {
+ "l2_distortion": 57.328311920166016,
+ "cos_shift": 0.46479684114456177,
+ "isoscore_post": 0.38995546102523804,
+ "eff_rank_post": 99.04585869628714
+ }
+ },
+ "SimCTG+QAHSP": {
+ "int4 sym per-tensor": {
+ "l2_distortion": 8.90207290649414,
+ "cos_shift": 0.04147696495056152,
+ "isoscore_post": 0.9752295017242432,
+ "eff_rank_post": 3.1338453281846426
+ },
+ "int4 sym per-row": {
+ "l2_distortion": 8.396255493164062,
+ "cos_shift": 0.037551701068878174,
+ "isoscore_post": 0.9723882675170898,
+ "eff_rank_post": 10.321481681455186
+ },
+ "int4 asym per-row": {
+ "l2_distortion": 7.7742919921875,
+ "cos_shift": 0.030310511589050293,
+ "isoscore_post": 0.9479402303695679,
+ "eff_rank_post": 61.19273380707454
+ },
+ "int6 sym per-row": {
+ "l2_distortion": 5.7215166091918945,
+ "cos_shift": 0.01521909236907959,
+ "isoscore_post": 0.922627329826355,
+ "eff_rank_post": 103.85328116432875
+ },
+ "int8 sym per-row": {
+ "l2_distortion": 1.6117725372314453,
+ "cos_shift": 0.0011837482452392578,
+ "isoscore_post": 0.9344638586044312,
+ "eff_rank_post": 98.82931581135958
+ },
+ "AWQ-lite int4": {
+ "l2_distortion": 2.8120150566101074,
+ "cos_shift": 0.0035818815231323242,
+ "isoscore_post": 0.9302618503570557,
+ "eff_rank_post": 101.67765613469722
+ },
+ "GPTQ-lite int4": {
+ "l2_distortion": 65.60872650146484,
+ "cos_shift": 0.4830549955368042,
+ "isoscore_post": 0.42430219054222107,
+ "eff_rank_post": 102.3128287695318
+ }
+ },
+ "SimCTG+ES": {
+ "int4 sym per-tensor": {
+ "l2_distortion": 8.765625,
+ "cos_shift": 0.062296152114868164,
+ "isoscore_post": 0.936721682548523,
+ "eff_rank_post": 3.9066363675054414
+ },
+ "int4 sym per-row": {
+ "l2_distortion": 8.043916702270508,
+ "cos_shift": 0.05177617073059082,
+ "isoscore_post": 0.9349973201751709,
+ "eff_rank_post": 18.594726647796865
+ },
+ "int4 asym per-row": {
+ "l2_distortion": 7.1346306800842285,
+ "cos_shift": 0.0385744571685791,
+ "isoscore_post": 0.8988239765167236,
+ "eff_rank_post": 81.55246436667036
+ },
+ "int6 sym per-row": {
+ "l2_distortion": 4.726665019989014,
+ "cos_shift": 0.015722990036010742,
+ "isoscore_post": 0.8786011934280396,
+ "eff_rank_post": 103.6412998748405
+ },
+ "int8 sym per-row": {
+ "l2_distortion": 1.2452131509780884,
+ "cos_shift": 0.0010887980461120605,
+ "isoscore_post": 0.8960879445075989,
+ "eff_rank_post": 96.95184951116325
+ },
+ "AWQ-lite int4": {
+ "l2_distortion": 2.50549054145813,
+ "cos_shift": 0.004423320293426514,
+ "isoscore_post": 0.8884121179580688,
+ "eff_rank_post": 99.88234456906645
+ },
+ "GPTQ-lite int4": {
+ "l2_distortion": 50.17822265625,
+ "cos_shift": 0.4484516978263855,
+ "isoscore_post": 0.4249190390110016,
+ "eff_rank_post": 100.84690672770277
+ }
+ },
+ "SimCTG+HSU": {
+ "int4 sym per-tensor": {
+ "l2_distortion": 9.063629150390625,
+ "cos_shift": 0.0570681095123291,
+ "isoscore_post": 0.9595622420310974,
+ "eff_rank_post": 4.471798045349368
+ },
+ "int4 sym per-row": {
+ "l2_distortion": 8.505067825317383,
+ "cos_shift": 0.04947108030319214,
+ "isoscore_post": 0.9467036128044128,
+ "eff_rank_post": 19.69998277960704
+ },
+ "int4 asym per-row": {
+ "l2_distortion": 7.623517990112305,
+ "cos_shift": 0.03686553239822388,
+ "isoscore_post": 0.9114422798156738,
+ "eff_rank_post": 78.14374941041602
+ },
+ "int6 sym per-row": {
+ "l2_distortion": 5.184071063995361,
+ "cos_shift": 0.01574110984802246,
+ "isoscore_post": 0.8899648785591125,
+ "eff_rank_post": 103.23084138151381
+ },
+ "int8 sym per-row": {
+ "l2_distortion": 1.3814940452575684,
+ "cos_shift": 0.0011126995086669922,
+ "isoscore_post": 0.9071061015129089,
+ "eff_rank_post": 96.6253216326656
+ },
+ "AWQ-lite int4": {
+ "l2_distortion": 2.6982033252716064,
+ "cos_shift": 0.004202067852020264,
+ "isoscore_post": 0.9030233025550842,
+ "eff_rank_post": 100.00491793874347
+ },
+ "GPTQ-lite int4": {
+ "l2_distortion": 56.08349609375,
+ "cos_shift": 0.45256686210632324,
+ "isoscore_post": 0.538358747959137,
+ "eff_rank_post": 106.7523560471735
+ }
+ },
+ "SimCTG+AOS": {
+ "int4 sym per-tensor": {
+ "l2_distortion": 8.788201332092285,
+ "cos_shift": 0.05601924657821655,
+ "isoscore_post": 0.9357870817184448,
+ "eff_rank_post": 3.757452934414734
+ },
+ "int4 sym per-row": {
+ "l2_distortion": 8.076977729797363,
+ "cos_shift": 0.04601740837097168,
+ "isoscore_post": 0.9340577125549316,
+ "eff_rank_post": 14.520423604886107
+ },
+ "int4 asym per-row": {
+ "l2_distortion": 7.245190620422363,
+ "cos_shift": 0.034468114376068115,
+ "isoscore_post": 0.9058271050453186,
+ "eff_rank_post": 70.79133016702093
+ },
+ "int6 sym per-row": {
+ "l2_distortion": 5.06667947769165,
+ "cos_shift": 0.015466868877410889,
+ "isoscore_post": 0.8826003670692444,
+ "eff_rank_post": 100.9460638891694
+ },
+ "int8 sym per-row": {
+ "l2_distortion": 1.3779029846191406,
+ "cos_shift": 0.0011301040649414062,
+ "isoscore_post": 0.8975257873535156,
+ "eff_rank_post": 94.78506932919245
+ },
+ "AWQ-lite int4": {
+ "l2_distortion": 2.6380958557128906,
+ "cos_shift": 0.0041866302490234375,
+ "isoscore_post": 0.8914800882339478,
+ "eff_rank_post": 98.09973908449749
+ },
+ "GPTQ-lite int4": {
+ "l2_distortion": 55.92982864379883,
+ "cos_shift": 0.47122877836227417,
+ "isoscore_post": 0.45693013072013855,
+ "eff_rank_post": 102.4443442124175
+ }
+ }
+}
\ No newline at end of file
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_after_trains.sh b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_after_trains.sh
new file mode 100755
index 0000000000..05f108cf5f
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_after_trains.sh
@@ -0,0 +1,34 @@
+#!/usr/bin/env bash
+# Auto-runs after EmbStudy_aos completes: extract hidden states, run quant matrix, build figures.
+set -e
+ROOT=/workspace/parameter-golf
+LOG=/tmp/embstudy_analysis.log
+
+echo "[$(date -u +%H:%M:%SZ)] watcher start" | tee -a "$LOG"
+
+# Wait for EmbStudy_aos to appear in the consumed list (i.e., the daemon has completed it)
+while ! grep -q "EmbStudy_aos" "${ROOT}/parameter-golf/auto/combo_consumed.txt" 2>/dev/null; do
+ sleep 30
+done
+echo "[$(date -u +%H:%M:%SZ)] EmbStudy_aos consumed; checking final_model.pt files exist" | tee -a "$LOG"
+
+# Give the final_model.pt files time to settle (make sure nothing is still writing them)
+sleep 30
+for reg_dir in nosimctg baseline qahsp es hsu aos; do
+ pt="${ROOT}/candidate_pack/N18_baseA_${reg_dir}/final_model.pt"
+ if [ ! -f "$pt" ]; then
+ echo "[$(date -u +%H:%M:%SZ)] WARN: missing $pt" | tee -a "$LOG"
+ else
+ sz=$(stat -c%s "$pt")
+ echo "[$(date -u +%H:%M:%SZ)] OK $reg_dir final_model.pt = $sz bytes" | tee -a "$LOG"
+ fi
+done
+
+echo "[$(date -u +%H:%M:%SZ)] running run_reg_quant_matrix.py" | tee -a "$LOG"
+cd "${ROOT}"
+python3 submissions/C_CrossBase_RegTransfer_Study/run_reg_quant_matrix.py 2>&1 | tee -a "$LOG"
+
+echo "[$(date -u +%H:%M:%SZ)] running build_synergy_figures.py" | tee -a "$LOG"
+python3 submissions/C_CrossBase_RegTransfer_Study/build_synergy_figures.py 2>&1 | tee -a "$LOG"
+
+echo "[$(date -u +%H:%M:%SZ)] done. results in submissions/C_CrossBase_RegTransfer_Study/" | tee -a "$LOG"
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_reg_quant_matrix.py b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_reg_quant_matrix.py
new file mode 100644
index 0000000000..81a7b3c212
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_reg_quant_matrix.py
@@ -0,0 +1,249 @@
+"""
+Real-data reg × quant matrix analysis.
+
+Runs after all 6 EmbStudy_* training cells complete. For each saved BF16 model:
+ 1. Load via PyTorch
+ 2. Run forward on a small val batch to capture last-block hidden states
+ 3. Apply 6 quantization schemes to those hidden states
+ 4. Compute per-cell metrics (L2 distortion, isoscore, silhouette, effective rank)
+ 5. Identify (reg, quant) pairs that "play nice"
+"""
+
+import os, sys, math
+import numpy as np
+import torch
+import torch.nn.functional as F
+
+# --- 7 quantization schemes ---
+def quant_int_per_tensor(h, bits, asym=False):
+ qmax = (1 << (bits-1)) - 1
+ if asym:
+ h_min, h_max = h.min().item(), h.max().item()
+ scale = max((h_max - h_min) / (2*qmax + 1), 1e-8) # guard: a constant tensor would give scale 0
+ zp = round(-h_min / scale - qmax - 1)
+ h_q = (torch.round(h/scale) + zp).clamp(-qmax-1, qmax)
+ return (h_q - zp) * scale
+ scale = h.abs().max().clamp(min=1e-8) / qmax
+ return torch.round(h/scale).clamp(-qmax-1, qmax) * scale
+
+def quant_int_per_row(h, bits, asym=False):
+ qmax = (1 << (bits-1)) - 1
+ if asym:
+ h_min = h.min(dim=-1, keepdim=True).values
+ h_max = h.max(dim=-1, keepdim=True).values
+ scale = (h_max - h_min).clamp(min=1e-8) / (2*qmax + 1)
+ zp = (-h_min/scale - qmax - 1).round()
+ h_q = (torch.round(h/scale) + zp).clamp(-qmax-1, qmax)
+ return (h_q - zp) * scale
+ scale = h.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-8) / qmax
+ return torch.round(h/scale).clamp(-qmax-1, qmax) * scale
+
+def quant_awq_lite(h, bits=4):
+ """Per-channel scaling (sqrt activation magnitude) before per-row int4."""
+ chan_scale = (h.abs().mean(dim=0, keepdim=True).clamp(min=1e-8))**0.5
+ return quant_int_per_row(h / chan_scale, bits) * chan_scale
+
+def quant_gptq_lite(h, bits=4, damping=0.1):
+ """Per-row scale, column-by-column with running residual (Hessian-free GPTQ proxy)."""
+ qmax = (1 << (bits-1)) - 1
+ n, d = h.shape
+ h_q = torch.zeros_like(h)
+ h_residual = h.clone()
+ scale = h.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-8) / qmax
+ for col in range(d):
+ col_q = torch.round(h_residual[:, col:col+1] / scale).clamp(-qmax-1, qmax) * scale
+ h_q[:, col:col+1] = col_q
+ if col+1 < d:
+ err = h_residual[:, col:col+1] - col_q
+ h_residual[:, col+1:] += err * damping
+ return h_q
+
+QUANT_SCHEMES = [
+ ('int4 sym per-tensor', lambda h: quant_int_per_tensor(h, bits=4, asym=False)),
+ ('int4 sym per-row', lambda h: quant_int_per_row(h, bits=4, asym=False)),
+ ('int4 asym per-row', lambda h: quant_int_per_row(h, bits=4, asym=True)),
+ ('int6 sym per-row', lambda h: quant_int_per_row(h, bits=6, asym=False)),
+ ('int8 sym per-row', lambda h: quant_int_per_row(h, bits=8, asym=False)),
+ ('AWQ-lite int4', lambda h: quant_awq_lite(h, bits=4)),
+ ('GPTQ-lite int4', lambda h: quant_gptq_lite(h, bits=4)),
+]
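+
+# Illustrative sanity check (not executed by the pipeline; shapes are hypothetical):
+#   h = torch.randn(128, 512)
+#   for _name, qfn in QUANT_SCHEMES:
+#       assert qfn(h).shape == h.shape        # every scheme is shape-preserving
+#   # int8 per-row should be near-lossless on Gaussian data (tiny L2 distortion)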
+
+# --- Metrics ---
+def isoscore(h):
+ h_n = F.normalize(h, dim=-1, eps=1e-6)
+ sim = h_n @ h_n.t()
+ n = h_n.size(0)
+ off = sim - torch.eye(n, device=h.device)
+ return off.abs().mean().item()
+
+def effective_rank(h):
+ h_c = h - h.mean(dim=0, keepdim=True)
+ _, S, _ = torch.linalg.svd(h_c, full_matrices=False)
+ p = S / S.sum()
+ p = p[p > 1e-10]
+ return float(np.exp(-(p * p.log()).sum().item()))
+
+def per_token_l2_distortion(h_pre, h_post):
+ return (h_pre - h_post).pow(2).sum(dim=-1).sqrt().mean().item()
+
+def cosine_shift(h_pre, h_post):
+ return 1.0 - F.cosine_similarity(h_pre, h_post, dim=-1).mean().item()
+
+def silhouette(h, labels, n_clusters):
+ """Simplified silhouette (uses sample of pairwise distances)."""
+ h_np = h.cpu().numpy() if isinstance(h, torch.Tensor) else h
+ sil = 0.0
+ for i in range(len(h_np)):
+ same = (labels == labels[i]) & (np.arange(len(h_np)) != i)
+ if not any(same): continue
+ a = np.mean([np.linalg.norm(h_np[i] - x) for x in h_np[same]])
+ b_min = float('inf')
+ for c in range(n_clusters):
+ if c == labels[i]: continue
+ other = labels == c
+ if any(other):
+ b = np.mean([np.linalg.norm(h_np[i] - x) for x in h_np[other]])
+ b_min = min(b_min, b)
+ if max(a, b_min) > 0:
+ sil += (b_min - a) / max(a, b_min)
+ return sil / len(h_np)
+
+# --- Hidden state extraction ---
+def extract_hidden_states(model_dir, n_tokens=128, val_bin_path=None):
+ """Load the trained BF16 model and run forward on val tokens.
+ Captures the post-final-block hidden state for each token.
+ """
+ sys.path.insert(0, model_dir)
+ # Avoid import collision
+ if 'train_gpt' in sys.modules:
+ del sys.modules['train_gpt']
+ import importlib.util
+ spec = importlib.util.spec_from_file_location("train_gpt", os.path.join(model_dir, "train_gpt.py"))
+ train_gpt = importlib.util.module_from_spec(spec)
+ # Stub out distributed init; we just need the model class
+ os.environ.setdefault("WORLD_SIZE", "1")
+ os.environ.setdefault("RANK", "0")
+ os.environ.setdefault("LOCAL_RANK", "0")
+ os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
+ os.environ.setdefault("MASTER_PORT", "29500")
+ spec.loader.exec_module(train_gpt)
+
+ h_cls = train_gpt.Hyperparameters() if hasattr(train_gpt, 'Hyperparameters') else None
+ # The model class might be 'GPT' or 'FinalMiniLM' depending on lineage
+ model_cls = None
+ for cls_name in ['GPT', 'FinalMiniLM', 'Model']:
+ if hasattr(train_gpt, cls_name):
+ model_cls = getattr(train_gpt, cls_name)
+ break
+ if model_cls is None:
+ raise RuntimeError("no model class found in train_gpt.py")
+ model = model_cls(h_cls)
+ state_dict = torch.load(os.path.join(model_dir, "final_model.pt"), map_location="cpu", weights_only=False)
+ if isinstance(state_dict, dict) and 'state_dict' in state_dict:
+ state_dict = state_dict['state_dict']
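+ # strict=False: checkpoints across lineages may carry extra or missing auxiliary keys (e.g. EMA shadows); tolerate them.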
+ model.load_state_dict(state_dict, strict=False)
+ model.eval()
+ # Move to GPU if available
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = model.to(device).bfloat16()
+
+ # Load some val tokens
+ if val_bin_path and os.path.exists(val_bin_path):
+ toks = np.fromfile(val_bin_path, dtype=np.uint16)[:n_tokens*8].astype(np.int64)
+ else:
+ # fallback: random tokens
+ toks = np.random.randint(0, 8000, size=n_tokens*8)
+ toks = torch.from_numpy(toks).reshape(8, n_tokens).to(device)
+
+ # Forward through the model and extract last-block hidden state
+ with torch.no_grad():
+ # We need access to internal hidden states. Easiest: hook on the last block.
+ captured = {}
+ for name, mod in model.named_modules():
+ if 'blocks' in name and isinstance(mod, torch.nn.Module):
+ if name.endswith('.10') or name.endswith('.9'): # last block
+ def make_hook(n):
+ def hook(m, inp, out):
+ captured[n] = out.detach().cpu().float()
+ return hook
+ mod.register_forward_hook(make_hook(name))
+ # Run forward — need to call appropriate method
+ try:
+ _ = model.forward_logits(toks) if hasattr(model, 'forward_logits') else model(toks)
+ except Exception as e:
+ print(f" forward error: {e}")
+
+ # Get hidden states from the last captured layer
+ if captured:
+ h = list(captured.values())[-1]
+ h = h.reshape(-1, h.size(-1))[:n_tokens] # keep the first n_tokens token vectors
+ return h
+ return None
+
+# --- Main matrix computation ---
+def main():
+ base_dirs = {
+ 'no-reg': '/workspace/parameter-golf/candidate_pack/N18_baseA_nosimctg', # SimCTG=0
+ 'SimCTG': '/workspace/parameter-golf/candidate_pack/N18_baseA_baseline', # SimCTG λ=0.3 only
+ 'SimCTG+QAHSP': '/workspace/parameter-golf/candidate_pack/N18_baseA_qahsp',
+ 'SimCTG+ES': '/workspace/parameter-golf/candidate_pack/N18_baseA_es',
+ 'SimCTG+HSU': '/workspace/parameter-golf/candidate_pack/N18_baseA_hsu',
+ 'SimCTG+AOS': '/workspace/parameter-golf/candidate_pack/N18_baseA_aos',
+ }
+
+ # Find a val_bin
+ val_bin = None
+ for p in [
+ '/workspace/parameter-golf/parameter-golf/data/datasets/datasets/fineweb10B_sp10240/fineweb_val_000000.bin',
+ '/workspace/parameter-golf/parameter-golf/data/datasets/fineweb10B_sp10240/fineweb_val_000000.bin',
+ ]:
+ if os.path.exists(p):
+ val_bin = p; break
+ print(f"val_bin: {val_bin}")
+
+ hidden_per_reg = {}
+ for reg, dir_ in base_dirs.items():
+ if not os.path.exists(os.path.join(dir_, "final_model.pt")):
+ print(f" {reg}: no final_model.pt yet — skipping")
+ continue
+ print(f" {reg}: loading...")
+ h = extract_hidden_states(dir_, n_tokens=128, val_bin_path=val_bin)
+ if h is None:
+ print(f" extraction failed")
+ continue
+ hidden_per_reg[reg] = h
+ print(f" shape: {tuple(h.shape)}, mean L2: {h.pow(2).sum(-1).sqrt().mean().item():.3f}")
+
+ if not hidden_per_reg:
+ print("No models loaded. Exiting.")
+ return
+
+ # Compute the (reg × quant) matrix
+ print()
+ print("=== Real-data reg × quant matrix ===")
+ n_tok = list(hidden_per_reg.values())[0].size(0)
+ # Crude clustering: split tokens into 8 contiguous, equal-size groups by batch position
+ labels = np.array([i // (n_tok // 8) for i in range(n_tok)])[:n_tok]
+ n_clusters = 8
+
+ results = {}
+ for reg, h in hidden_per_reg.items():
+ results[reg] = {}
+ for qname, qfn in QUANT_SCHEMES:
+ h_q = qfn(h)
+ results[reg][qname] = {
+ 'l2_distortion': per_token_l2_distortion(h, h_q),
+ 'cos_shift': cosine_shift(h, h_q),
+ 'isoscore_post': isoscore(h_q),
+ 'eff_rank_post': effective_rank(h_q),
+ }
+ print(f" {reg:<10} {qname:<22} l2={results[reg][qname]['l2_distortion']:.4f} cos_shift={results[reg][qname]['cos_shift']:.4f}")
+
+ # Save
+ import json
+ out_path = '/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/real_reg_quant_matrix.json'
+ open(out_path, 'w').write(json.dumps(results, indent=2))
+ print(f"\nsaved: {out_path}")
+
+if __name__ == "__main__":
+ main()