diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/README.md b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/README.md new file mode 100644 index 0000000000..7fc179485d --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/README.md @@ -0,0 +1,519 @@ +# Non-record: Cross-Base Regularizer Transferability — A Small Study

**Author**: Bharath @ OptioAI (BharathSShankar) | **Track**: 10min_16mb (non-record / methodological)
**Date**: 2026-04-30

This is a non-record submission. It contains **20+ single-seed measurement cells** characterizing how seven candidate regularizers behave on two different leaderboard-lineage bases, plus an analysis of how reg-trained embeddings survive different quantization schemes. We submit it as supplementary methodological data — not as a critique of any prior submission, and not as a claim that our reg basis is the right one.

---

## 1. Headline findings

Each finding here is tied to specific cells later in the README. **No claim is unsupported by data; every "tentative" interpretation is explicitly marked.**

1. **Cross-base sign change** (real data, §6). Same regularizer (QAHSP, λ=0.3), same architectural family (PR #1855 lineage), opposite-direction val_bpb effects: −1.55 mBPB on Base A, +2.25 mBPB on Base B. Largest measured swing: 3.80 mBPB.

2. **Pair stacking at 1/√N λ underperforms best single** (real data, §6). All four pre-registered "good" pairs at λ_each = λ\*/√2 measured worse than the best single reg at full λ. Hypothesis 3 (independent-axis composition) inconsistent with our four pairs.

3. **Quant cost is approximately reg-independent on Base A** (real data, §7). Across all 7 regs, quant cost (post-quant minus pre-quant val_bpb) sits in 14.3–14.9 mBPB — the regs change pre-quant val_bpb but not the GPTQ + LQER quant tax. QAHSP's val_bpb advantage comes from a better pre-quant model, *not* from quant-robustness on this pipeline.

4. **PreQuantTTT × ES compounds; PreQuantTTT × QAHSP does not** (real data, §6, gray-track). Adding ES at λ=0.05 to PreQuantTTT delivers val_bpb 1.03942 vs PreQuantTTT-alone 1.03969 (−0.27 mBPB). Adding QAHSP at λ=0.3 produces 1.03985 (+0.16 mBPB, no help). We tentatively interpret this as direction-shaping regs surviving eval-time fine-tuning while codebook-shaping regs are subsumed (§9).

5. **Reg × quant matrix on real LM hidden states** (real data, §8). 6 regs × 7 quant schemes = 42 cells on Base A (6 schemes reported; our GPTQ-lite cell is excluded as buggy, §8.1). Identifies which reg "plays nice" with which quant scheme by smallest L2 distortion / cosine shift / silhouette degradation post-quant. Headline pattern: no-reg survives coarse int4 best, SimCTG + ES survives int6/int8 best, and SimCTG + QAHSP sits at or near the worst everywhere (§8.2).

6. **Regs leave a real but small fingerprint upstream of quantization** (real data, §13). Three independent mechanistic checks (SVD spectrum of weight matrices, hidden-state norm/kurtosis depth trajectory, pairwise CKA between final-block representations) all show: (a) the regs *do* differ from no-reg and from each other — sub-3% Δσᵢ on attention weights, off-diagonal CKA 0.67–0.75; (b) but every difference is below GPTQ int6's per-row noise floor or uniform across regs. This explains §7 mechanistically.

If you read just one section: **§6** for the cross-base val_bpb evidence, **§8** for the real-data reg × quant matrix on real LM hidden states, **§13** for the upstream mechanistic checks.

---

## 2. 
Companion record submission

This study is paired with one record submission from the same author:

- `A_N9_SimCTG_3LayerRecur_postquantTTT` — Record: SP10240 + SimCTG λ=0.3 + 3-Layer Recurrence + post-quant score-first TTT, val_bpb **1.07502** (3-seed). This is **Base A** for our study.

For Base B, we use the open PR #1965 (himanshudongre, LongCtx no-QV phased TTT). We reproduced PR #1965 on our infrastructure to verify the result and to capture trained models for §8 — this reproduction is documented in the study text but **we are not submitting our reproduction as our own record**. PR #1965 belongs to its original author; we use it here only as a substrate for comparison.

Our earlier `BharathSShankar/PR #1972` (SP10240 + PreQuantTTT, val_bpb 1.03983) is **withdrawn from record consideration** in light of the upstream closure of PR #1958. The PreQuantTTT line's score-after-adapt pattern doesn't satisfy a strict reading of the README's evaluation rule. We retain the artifact internally as documented gray-track data and use it as a *reference implementation* for hypothesis 4 testing in §6.3, but do not contest a record claim for it.

---

## 3. The 7 regularizers

Each operates on a different statistic of either the hidden state stream or the weight tensors:

| Reg | One-line math | Side | Statistic targeted |
|---|---|---|---|
| **QAHSP** (Quant-Aware Hidden STE Penalty) | MSE(h, STE-quant(h, int6)) | activation | per-coord int6 grid alignment |
| **ES** (Embedding Spread) | mean off-diag cos²(h_i, h_j) | activation | angular spread between tokens |
| **AOS** (Activation Outlier Suppression) | mean(max\|h\| − mean\|h\|) per token | activation | per-token outlier-coord suppression |
| **HSU** (Hidden State Uniformity) | var(‖h_i‖) | activation | per-token L2 norm uniformity |
| **WBC** (Weight Bucket-Center) | mean sin²(π·w/0.05) | weight | per-coord int-grid centerline pull |
| **WOP** (Weight Outlier Penalty) | mean((\|w\| − k·σ)₊²), k=4 | weight | weight-row outlier crush |
| **PCS** (Per-Channel Scale) | var(per-channel max\|w\|) | weight | per-channel scale uniformity |

ES is a hinge-free variant of SimCTG-style contrastive losses. WOP is per-row weight-outlier suppression. We believe the other five (QAHSP, AOS, HSU, WBC, PCS) do not appear in any prior leaderboard PR; we defined them specifically for this study.

---

## 4. The two bases (and why we picked them)

**Base A**: our SP10240 + SimCTG λ=0.3 record stack (= companion submission A). 11L × 512d × 8H, the PR #1855 architectural lineage with our SP10240 tokenizer adoption. Eval: post-quant score-first TTT + sliding-window stride 64. Base 3-seed mean val_bpb: **1.07502** sliding-window.

**Base B**: our PR #1965 reproduction (see §2; not submitted as our own record). Same architecture family but SP8192 CaseOps tokenizer + LongCtx no-QV phased TTT (rank=56, prefix=3000) + AWQ-Lite + asymmetric logit rescale + LQER asymmetric rank-4 + lrzip pergroup compression. Single-seed val_bpb: **1.05822** quantized_ttt_phased.

We chose this pair because (a) both are the same architectural family — they share the PR #1855 base — so cross-base differences can't be attributed to fundamentally different model architectures, and (b) Base B is a heavily greedy-tuned descendant of Base A's family, making it a natural test of whether "regs from the parent" transfer to the child.

All cells: SEED=42, MAX_WALLCLOCK_SECONDS=600, 8×H100 SXM. Only the OUR\_\*\_LAMBDA / size knobs vary; everything else is fixed per base.

---

## 5. 
Pre-registered hypotheses

Frozen at study start (see `REGULARIZATION_ABLATION.md` for the original).

1. **QAHSP wins single-reg on int6-quantized stacks.** Mechanism: STE-quant alignment of activations is direct prep for the actual quantization step. *Confidence: high.*
2. **WBC has slight-positive or near-neutral effect.** Mechanism: bucketing cooperates with int6 codebook. *Confidence: medium.*
3. **Pairs at λ_each = λ\*/√N should compose** if the regs operate on independent gradient subspaces (loose generalization of Wang & Isola 2020 alignment+uniformity). *Confidence: medium.*
4. **Eval-time fine-tuning subsumes training-time prep regs but preserves direction-shaping regs.** *Confidence: low (this was the speculation we most wanted to test).*

In what follows: hypothesis 1 confirmed, 2 inconsistent with our data, 3 inconsistent with our data, 4 consistent with our data on Base A but only one positive interaction observed.

---

## 6. Cross-base val_bpb measurements (real data)

### 6.1 Base A — single-reg sweep

7 cells, each adds one reg on top of SimCTG λ=0.3.

| Reg config | val_bpb | Δ vs Base A baseline 1.07502 |
|---|---:|---:|
| **QAHSP λ=0.3** | **1.07348** | **−1.55 mBPB** ⭐ |
| WOP λ=0.5 | 1.07376 | −1.26 |
| HSU λ=0.1 | 1.07403 | −0.99 |
| ES λ=0.05 | 1.07428 | −0.74 |
| AOS λ=0.005 | 1.07445 | −0.57 |
| PCS λ=0.005 | 1.07463 | −0.39 |
| **WBC λ=0.005** | **1.07522** | **+0.20** |

QAHSP wins single-reg, consistent with hypothesis 1. WBC's slight-negative observation is **inconsistent with hypothesis 2** at this λ. We do not have a confident explanation; one possibility is that the chosen grid scale (0.05) places grid-centroid attractors along directions the optimizer still needs to traverse smoothly. We flag this as observation, not finding.

### 6.2 Base A — pre-registered "good" pairs at 1/√2 · λ

| Pair | val_bpb | Δ vs best single (1.07348) |
|---|---:|---:|
| QAHSP λ=0.15 + HSU λ=0.05 | 1.07408 | +0.60 |
| QAHSP λ=0.15 + ES λ=0.03 | 1.07416 | +0.68 |
| HSU λ=0.05 + ES λ=0.03 | 1.07423 | +0.75 |
| QAHSP λ=0.15 + PCS λ=0.003 | 1.07475 | +1.27 |

All four pairs at λ_each = λ\*/√2 underperform the best single reg at full λ. **Hypothesis 3 inconsistent** with our data. We offer two possible interpretations in §9.

### 6.3 Base A + PreQuantTTT (gray-track reference)

We ran PreQuantTTT (PR #1958 recipe, 21-epoch AdamW on val tokens) as a reference implementation. PR #1958 was closed upstream, so we treat these as gray-track methodological data — not record-eligible. The cells exist to test hypothesis 4.

| Combo | sliding val_bpb | Δ vs PQT alone (1.03969) |
|---|---:|---:|
| **PreQuantTTT alone** | **1.03969** | 0 |
| **PreQuantTTT + ES λ=0.05** | **1.03942** | **−0.27** ⭐ |
| PreQuantTTT + QAHSP λ=0.3 | 1.03985 | +0.16 |

ES (direction-shaping) compounds with PreQuantTTT; QAHSP (codebook-shaping) is essentially subsumed. **Consistent with hypothesis 4** for these two cells. We offer a tentative mechanism in §9.3. 
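For readers less familiar with the gray-track pattern being tested here, the following is a minimal sketch of the score-after-adapt shape, not the exact PR #1958 recipe; the data loader, quantizer, and scorer are injected placeholders, so nothing in it is a claim about that PR's internals:

```python
import torch
import torch.nn.functional as F

def prequant_ttt(model, val_batches, quantize_fn, score_fn, epochs=21, lr=1e-4):
    """Sketch of the PreQuantTTT ordering tested in §6.3:
    adapt the BF16 model on val tokens FIRST, then quantize, then score.
    `val_batches` yields (inputs, next-token targets); `quantize_fn` and
    `score_fn` stand in for the GPTQ+LQER step and the val_bpb scorer.
    """
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in val_batches:
            # model(x): (B, T, V) logits; flatten for token-level cross-entropy
            loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    model.eval()
    return score_fn(quantize_fn(model))
```

The point of the sketch is the ordering: a training-time reg has to survive this adaptation step to show up in the final number, which is exactly what hypothesis 4 is about.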
### 6.4 Base B — single-reg attempts

| Reg config on Base B | val_bpb | Δ vs Base B baseline 1.05822 |
|---|---:|---:|
| **PR #1965 baseline** | **1.05822** ⭐ | 0 |
| SimCTG λ=0.3 + QAHSP λ=0.3 | 1.06047 | +2.25 |
| SimCTG λ=0.1 + QAHSP λ=0.1 | 1.05881 | +0.59 |
| ES λ=0.05 alone | 1.05993 | +1.71 |
| TripleHash bigram 1024×8 (isolated grad path) | 1.05886 | +0.64 |

**Every variant we tried at our chosen λ values measured worse than baseline on Base B.** We did not exhaustively search smaller λ on Base B — a Base-B-specific small-λ regime might recover positive transfer; we do not claim that's impossible. We claim only that our chosen λ values (which work well on Base A) hurt on Base B.

### 6.5 The cross-base sign-change

| Reg | Base A Δ | Base B Δ | sign change? |
|---|---:|---:|---|
| QAHSP λ=0.3 | −1.55 mBPB | +2.25 mBPB | yes (3.80 mBPB swing) |
| ES λ=0.05 | −0.74 mBPB | +1.71 mBPB | yes (2.45 mBPB swing) |
| Bigram (TripleHash) | ≈ neutral | +0.64 mBPB | n/a (≈ neutral on A) |

Same architectural family, same reg, same λ — measurably opposite-direction effects on val_bpb.

`figures/fig1_cross_base_signs.png` shows this as a bar chart.

---

## 7. Pipeline-stage attribution (real data)

For each Base A cell, we extract from the training log the val_bpb at three eval stages: pre-quantization post-EMA, post-int6-quantization (no eval-time tricks), and post-sliding-window (final reported number for non-TTT submissions).

| reg | pre-quant | quantized | quant cost (mBPB) | sliding gain (mBPB) |
|---|---:|---:|---:|---:|
| QAHSP λ=0.3 | 1.07493 | 1.08941 | +14.5 | −15.9 |
| WOP λ=0.5 | 1.07537 | 1.08962 | +14.3 | −15.9 |
| HSU λ=0.1 | 1.07536 | 1.08993 | +14.6 | −15.9 |
| ES λ=0.05 | 1.07536 | 1.09023 | +14.9 | −15.9 |
| AOS λ=0.005 | 1.07580 | 1.09035 | +14.6 | −15.9 |
| PCS λ=0.005 | 1.07595 | 1.09053 | +14.6 | −15.9 |
| WBC λ=0.005 | 1.07646 | 1.09112 | +14.7 | −15.9 |

**Quant cost is approximately reg-independent: 14.3–14.9 mBPB across all 7 regs (range 0.6 mBPB, single-seed noise floor).** Sliding gain is also uniform at −15.9 mBPB.

This means: under the GPTQ + LQER + brotli quant pipeline used here, **the relative ranking of post-quant val_bpb is determined almost entirely by the pre-quant ranking**. Different regs change pre-quant val_bpb by ~1.5 mBPB; quant adds a constant 14.5 mBPB tax; sliding subtracts a constant 15.9 mBPB. The ranking is preserved through the pipeline.

This forces a re-interpretation of QAHSP's win:

> **QAHSP's val_bpb advantage comes from improving the pre-quant model, not from making the model more quant-robust.** It produces a better starting point (1.07493 vs 1.07646 for WBC), and that ~1.5 mBPB pre-quant advantage propagates through a uniform quant tax to the final number.

This is a non-obvious finding for the "quant-aware training" line of work in the leaderboard community. Caveats:
- Single-seed: 0.6 mBPB range across regs is at the noise floor.
- Specific quant pipeline: GPTQ + LQER + per-row brotli is sophisticated. On a naïve uniform-quant pipeline, QAHSP might show measurable quant-robustness benefit (untested here).
- Compensating quant residuals: the LQER asymmetric residuals in this pipeline (rank-4 in PR #1965's stack) do post-hoc compensation themselves; they might absorb most of the per-tensor differences QAHSP introduces.

`figures/fig_pipeline_waterfall.png` shows the per-stage val_bpb propagation as line plots. 
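The attribution arithmetic itself is trivial, but stating it as code makes the decomposition unambiguous. A minimal sketch, assuming `pipeline_attribution.json` maps each cell to its three per-stage val_bpb values (the key names here are our guess at the shipped schema; check the file):

```python
import json

# Decompose each cell's pipeline into pre-quant quality, quant tax, and
# sliding-window gain (deltas in mBPB), mirroring the §7 table.
with open("pipeline_attribution.json") as f:
    cells = json.load(f)  # assumed: {cell: {"pre_quant": x, "quantized": y, "sliding": z}}

for name, s in sorted(cells.items()):
    quant_cost = (s["quantized"] - s["pre_quant"]) * 1000
    sliding_gain = (s["sliding"] - s["quantized"]) * 1000
    print(f"{name:<18} pre={s['pre_quant']:.5f} "
          f"quant_cost={quant_cost:+6.1f} sliding_gain={sliding_gain:+6.1f}")
```

If §7's uniformity claim holds, the `quant_cost` column comes out flat across regs while the `pre=` column carries all the ranking information.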
### 7.1 PreQuantTTT inverts the quant cost

The cells in §6.3 have a different pipeline shape:

| cell | pre-PQT BF16 | post-quant | quant cost (vs post-PQT BF16) |
|---|---:|---:|---:|
| PQT alone | 1.07948 | 1.05176 | +22.8 mBPB |
| PQT + ES | 1.07516 | 1.05145 | +22.1 mBPB |
| PQT + QAHSP | 1.07550 | 1.05181 | +22.5 mBPB |

PreQuantTTT (eval-time AdamW on val) overfits the BF16 model to val by ~50 mBPB (1.07948 → 1.02891 BF16 in our P1 run), then quantization re-introduces ~22 mBPB of noise. **Net is still −20 mBPB final improvement** because the BF16 overfit was deep enough to survive the quant noise. This is a different mechanism from training-time regs.

---

## 8. Real-data reg × quant matrix

We trained 6 fresh Base A models (each with one reg config, SEED=42, MAX_WALLCLOCK_SECONDS=600). For each, we ran a forward pass on val tokens to capture last-block hidden states (128 tokens × 512 dim). We then applied 7 quantization schemes per row of hidden states and measured L2 distortion + cosine shift.

This is **real LM hidden states from real trained models**, not synthetic.

### 8.1 L2 distortion table (lower = quant preserves geometry better)

| | int4 sym pT | int4 sym pR | int4 asym pR | int6 sym pR | int8 sym pR | AWQ-lite int4 |
|---|---:|---:|---:|---:|---:|---:|
| **no-reg** (SimCTG=0) | **8.67** ⭐ | **7.97** ⭐ | **7.04** ⭐ | 4.77 | 1.27 | **2.50** ⭐ |
| SimCTG λ=0.3 | 8.89 | 7.99 | 7.19 | 5.10 | 1.42 | 2.66 |
| SimCTG + QAHSP λ=0.3 | 8.90 | 8.40 | 7.77 | 5.72 | 1.61 | 2.81 |
| **SimCTG + ES λ=0.05** | 8.77 | 8.04 | 7.13 | **4.73** ⭐ | **1.25** ⭐ | 2.51 |
| SimCTG + HSU λ=0.1 | 9.06 | 8.51 | 7.62 | 5.18 | 1.38 | 2.70 |
| SimCTG + AOS λ=0.005 | 8.79 | 8.08 | 7.25 | 5.07 | 1.38 | 2.64 |

⭐ = lowest distortion in column. Mean hidden state L2 norm: ~28 across regs (so int4 distortions of 7-9 are 25-32% of the embedding magnitude; int8 distortions of 1.3 are ~5%).

GPTQ-lite int4 column omitted from table — our naïve column-by-column implementation produces unphysical distortion (~50, larger than the embedding magnitudes themselves) due to error-propagation blowup. We flag it as an **implementation bug** in our analysis script, not an indictment of GPTQ. Real GPTQ uses Hessian-aware ordering + dynamic scale that we did not implement.

### 8.2 The "plays nice" pattern

**Coarse quant (int4): no regularization wins.** For all int4 schemes (sym per-tensor, sym per-row, asym per-row, AWQ-lite), the **no-reg** cell has lowest L2 distortion. Adding any reg — including just SimCTG — measurably *increases* the int4 quant cost.

**Fine quant (int6 / int8): SimCTG + ES wins.** At int6 and int8, SimCTG + ES has lowest distortion (4.73 / 1.25). The reg's directional shaping helps when there's enough quant resolution.

**SimCTG + QAHSP is consistently at or near the worst** across the quant schemes (rank 6/6 at int4 asym per-row, int6, int8, and AWQ; rank 5/6, behind only HSU, at int4 sym per-tensor and sym per-row). QAHSP's int6-grid STE penalty (§3) trained at λ=0.3 actually moves the embeddings *away* from the per-row scaled grids used at inference time. The training-time grid mismatch hurts here.

### 8.3 The dissociation: synthetic ≠ real

The synthetic geometric analysis (§10–§12.1) suggested **AOS** is most quant-robust (lowest synthetic L2 distortion). The real-data analysis here says **SimCTG+ES** at fine quant or **no-reg** at coarse quant. AOS is not the winner on real data — it's middle-of-the-pack.

This is the kind of synthetic-real gap §12.2 warned about. 
**Synthetic geometric analysis is suggestive of mechanism, not predictive of real-data quant performance.**

### 8.4 Reading this in context

Tying back to the §7 finding: quant cost in val_bpb is approximately reg-independent on Base A. The (reg × quant) L2 distortion matrix here shows there *are* differences in how each reg's hidden states survive quant, but those differences (~1-2 in L2 distortion units) translate into single-mBPB val_bpb shifts that get washed out by the much larger constant quant tax (+14.5 mBPB) in the GPTQ + LQER + brotli pipeline.

So: **regs do change quant survival of embeddings, but at the val_bpb level the GPTQ + LQER + brotli pipeline equalizes them.** Different quant pipelines (uniform int4 without LQER) might expose the differences as measurable val_bpb shifts.

`figures/fig_reg_quant_matrix_real.png` — full 4-panel heatmap (L2 distortion, cosine shift, post-quant isoscore, post-quant effective rank) on real Base A LM hidden states.

### 8.5 Caveats specific to §8

- Single seed per cell.
- Hidden states sampled from a small val batch (128 tokens). Different batches might shift relative orderings within ~10%.
- We did NOT measure val_bpb for the 6 fresh cells in this study — only L2 distortion of hidden states under quant. The val_bpb numbers banked from these runs (in `parameter-golf/logs/_results.csv`) are sliding-window-only and would show different relative orderings (e.g., on val_bpb at sliding-window only, ES and QAHSP are very close).
- Our GPTQ-lite implementation is buggy; we present only the 6 quant schemes where the implementation is well-tested.
- AWQ-lite implements only the per-channel pre-scaling part; full AWQ has additional steps we did not implement.

---

## 9. Tentative mechanisms

We use these mechanisms to organize our observations; they were written down to make predictions for §8 before that data landed. **Each is offered as a candidate explanation, not a proven claim.** Readers should treat §6–§7 as the empirical core and §9 as commentary.

### 9.1 Why QAHSP would win on Base A but not Base B

Candidate explanation: Base A's training schedule uses default Polar-Express-NS-Muon LR and standard cosine-warmdown. The end-of-training weight distribution is not heavily tuned, so QAHSP's auxiliary gradient toward the int6 grid functions as useful prep for the downstream quant step.

Base B's schedule (MATRIX_LR=0.026, WARMDOWN_FRAC=0.85, GRAD_CLIP_NORM=0.3, BETA2=0.99, TTT_BETA2=0.99) is the result of accumulated greedy hyperparameter search across multiple lineage PRs. The end-of-training weight distribution is already shaped to interact well with the specific GPTQ + LQER + AWQ-Lite quant pipeline of PR #1965. QAHSP's auxiliary gradient is then largely redundant with the work the schedule already does, *and* perturbs the carefully tuned trajectory.

Empirical support: lambda-monotone deterioration on Base B (+2.25 at λ=0.3, +0.59 at λ=0.1) is consistent with "more reg = more perturbation on a near-locally-optimal trajectory." We cannot rule out other contributing factors (different tokenizer, different TTT eval pipeline, different LQER configuration on Base B vs Base A).

### 9.2 Why pairs underperformed at 1/√2 · λ

The pre-registered variance-budget intuition: for independent-subspace regs, λ_each = λ\*/√N preserves the per-batch reg gradient norm at λ\*. Our four pairs all underperformed the best single reg. 
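As a one-line reminder of where the √N comes from (our notation; the load-bearing premise is the orthogonality, not the arithmetic): if the per-reg gradients gᵢ are mutually orthogonal with comparable norms ‖gᵢ‖ ≈ ‖g‖, then ‖λ_each · Σᵢ gᵢ‖² = λ_each² · Σᵢ ‖gᵢ‖² ≈ λ_each² · N‖g‖², which equals the single-reg budget λ\*² ‖g‖² exactly when λ_each = λ\*/√N. The §6.2 result therefore indicts the orthogonality premise rather than the budget algebra.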
Two possible interpretations, neither tested:

- **Regs share a common gradient pathway.** All seven flow gradient back through the same matrix params (Q, K, V, O, MLP banks). They are not gradient-subspace-independent. A second reg at half λ then dilutes the dominant signal rather than addressing an orthogonal direction.
- **The 1/√N rescaling is too aggressive.** A pair at full λ for one reg + small λ for the other might compose more successfully — we did not run that experiment.

The §6.3 finding (PQT + ES, with ES kept at the same full λ=0.05 it uses in the single-reg cell) is consistent with the second interpretation: when one component carries dominant signal, a small auxiliary at full λ adds independent value. We have one positive cell, not enough to confirm a rule.

### 9.3 Why ES compounded with PreQuantTTT and QAHSP did not

Candidate explanation:

- **Codebook-shaping regs** (QAHSP, WBC, WOP, PCS) prepare the model's *coordinate-wise* relationship to the int6 grid. Eval-time fine-tuning re-aligns weights against the val distribution, which can overwrite this coordinate-wise prep. The training-time investment in QAHSP becomes essentially a no-op.
- **Direction-shaping regs** (ES, HSU, AOS) constrain the *angular* or *magnitude* structure of token reps. Eval-time fine-tuning typically updates weights without coordinated flips of many high-magnitude components, so the angular structure is preserved. The well-conditioned manifold remains and small fine-tuning adjustments are more effective on it.

Empirical support: §6.3 has only two PQT × reg cells. The data are consistent with this interpretation but a single positive case is not strong evidence. Predicted but untested:
- HSU should compound with PQT.
- WBC, WOP, PCS should be subsumed by PQT.
- The same compounding pattern might apply on Base B if PreQuantTTT could be added there.

We have not run those cells. We hope this interpretation is testable by independent replication.

---

## 10. Synthetic geometric analysis (mechanism, clearly marked synthetic)

The remainder of §10 through §12.1 uses a controlled synthetic embedding cloud (64 tokens × 32 dims) to **demonstrate what each reg does to embedding geometry**, independent of the noisy val_bpb signal. None of these claims should be read as performance numbers; they are mechanism illustrations.

### 10.1 Embedding geometry under each reg

`figures/fig_emb_geometry.png` (synthetic): 64-token × 32-dim cloud. We apply each reg's gradient for 300 SGD steps and visualize the resulting cloud in 2D PCA + L2 norm histograms.

| reg variant | norm var | mean off-diag \|cos\| | max−mean gap | top-1/top-4 sv |
|---|---:|---:|---:|---:|
| baseline | 0.204 | 0.163 | 0.736 | 1.18 |
| QAHSP (int4 STE) | 0.204 | 0.163 | 0.736 | 1.18 |
| ES (off-diag cos²) | 0.204 | **0.159** | 0.731 | 1.18 |
| HSU (var of norms) | **0.079** | 0.163 | 0.721 | 1.17 |
| AOS (max−mean) | 0.185 | 0.158 | **0.559** | 1.19 |

Bold: the column where each reg specifically targets that statistic. **The synthetic gradient steps confirm the regs do what their math says they should do.** QAHSP's effect is small in this synthetic at the chosen LR/steps — its STE gradient is small away from grid centroids; with longer training and higher λ it would show measurable grid-pull. We don't read this as a performance comparison. 
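For concreteness, here is a minimal sketch of the §10.1 protocol for one reg (ES), assuming a plain Gaussian cloud and vanilla SGD; the real script additionally sweeps the other regs and computes the table's four statistics:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
h = torch.randn(64, 32, requires_grad=True)  # synthetic 64-token × 32-dim cloud

def es_penalty(h):
    """ES from §3: mean off-diagonal cos²(h_i, h_j)."""
    hn = F.normalize(h, dim=-1)
    cos = hn @ hn.t()
    off = cos[~torch.eye(cos.size(0), dtype=torch.bool)]  # drop the diagonal
    return (off ** 2).mean()

opt = torch.optim.SGD([h], lr=0.1)
for _ in range(300):  # 300 reg-gradient steps, as in fig_emb_geometry
    loss = es_penalty(h)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    hn = F.normalize(h, dim=-1)
    cos = hn @ hn.t()
    off = cos[~torch.eye(cos.size(0), dtype=torch.bool)]
    print(f"mean off-diag |cos| after ES: {off.abs().mean().item():.3f}")
```

Swapping `es_penalty` for any of the other §3 statistics reproduces the corresponding row of the table above (the LR and step count move the absolute numbers).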
### 10.2 Cosine similarity heatmap + per-coord distribution

`figures/fig_emb_cosine_coord.png` (synthetic): 64×64 token-token cosine similarity matrix per reg + per-coord activation histogram. Visual story: ES makes the cosine off-diagonal smaller; QAHSP creates a "cleavage" pattern in the per-coord histogram corresponding to the int4 grid centroids. Different regs change different parts of the geometry.

### 10.3 Per-token outlier coordinates

`figures/fig_emb_outliers.png` (synthetic): each token plotted as (mean \|h\|, max \|h\|). Distance above the y=x diagonal = outlier severity. AOS visibly pulls outlier tokens toward the diagonal (max-mean gap closes from 0.74 → 0.56).

### 10.4 Semantic-cluster preservation

`figures/fig_3d_semantic.png` and `figures/fig_semantic_metrics.png` (synthetic): we use 4 clusters × 16 tokens with planted outliers. We apply each reg and measure silhouette score + intra/inter-cluster distance.

| reg | silhouette ↑ | intra-cluster ↓ | inter-cluster ↑ |
|---|---:|---:|---:|
| baseline | 0.4168 | 3.285 | 9.840 |
| QAHSP | 0.4168 | 3.285 | 9.840 |
| HSU | 0.4171 | 3.247 | 9.838 |
| AOS | 0.4176 | 3.261 | 9.731 |
| **ES** | **0.4140** | **3.345** | 9.853 |

In the synthetic setting, ES slightly degrades semantic cluster preservation (silhouette 0.4140 vs baseline 0.4168). This is small — comparable to noise — but directionally interpretable: ES penalizes off-diag cosine, including for *intra-cluster* token pairs that should be similar. The trade-off is "discrimination at the cost of clustering."

We do not claim this synthetic result transfers to real LM hidden states. It is an *intuition-builder* — a future test on real Base A embeddings (which §8 partly addresses) would be the actual evidence.

---

## 11. Canonical metrics from the literature

`figures/fig_canonical_metrics.png` (synthetic): a 4-panel grid with:
- IsoScore / anisotropy (Ethayarajh 2019)
- Effective rank (Roy & Vetterli 2007)
- Quantization-induced distributional shift (KL on cosine distribution)
- Linear probing classifier (Alain & Bengio 2017)

| reg | isoscore ↓ | eff rank ↑ | spec entropy | quant KL | lin sep |
|---|---:|---:|---:|---:|---:|
| baseline | 0.5436 | 18.09 | 0.8355 | 0.00065 | 1.000 |
| QAHSP | 0.5436 | 18.09 | 0.8355 | 0.00065 | 1.000 |
| HSU | 0.5436 | 18.10 | 0.8356 | 0.00065 | 1.000 |
| ES | **0.5364** | **18.31** | **0.8389** | 0.00169 | 1.000 |
| AOS | 0.5419 | 18.17 | 0.8367 | 0.00352 | 1.000 |

Two observations from the synthetic literature-metric panel:

1. **ES is the most isotropic by all three direction-space measures.** Lower isoscore, higher effective rank, higher spectral entropy. This is consistent with the mechanism: ES literally optimizes for the inverse of isoscore (off-diag cos²).

2. **HSU is identical to baseline on direction-space metrics.** It moves only the L2-norm distribution. Clean dissociation between norm-shaping and direction-shaping regs.

Linear probing accuracy is 1.0 for every variant (the 4-cluster task is too easy to differentiate the regs). We re-ran with smaller cluster separation + more noise (`figures/fig_linear_probe_harder.png`) but the probe stayed saturated; designing a harder probing task is left as future work.

`figures/fig_spectral.png` (synthetic): singular value spectrum log-y. ES has the flattest spectrum (highest effective rank); HSU's curve is near-identical to baseline.

---

## 12. 
Quantization survival (synthetic + real) + +### 12.1 Synthetic int6 quantization on the reg-trained cloud + +`figures/fig_pre_post_quant.png` and `figures/fig_quant_robustness.png` (synthetic): per-reg cloud, then per-token-row int6 quantization, then per-token L2 distortion + silhouette pre/post. + +| reg | mean L2 distortion ↓ | norm Δ% | cos shift | silhouette pre | silhouette post | Δ silhouette | +|---|---:|---:|---:|---:|---:|---:| +| baseline | 0.173 | 0.25 | 0.0003 | 0.4168 | 0.4149 | −0.0019 | +| QAHSP | 0.173 | 0.25 | 0.0003 | 0.4168 | 0.4149 | −0.0019 | +| HSU | 0.173 | 0.25 | 0.0003 | 0.4171 | 0.4153 | −0.0018 | +| ES | 0.173 | 0.32 | 0.0003 | 0.4140 | 0.4110 | −0.0030 | +| **AOS** | **0.162** | 0.28 | **0.0002** | 0.4176 | 0.4166 | **−0.0010** | + +Synthetic finding: **AOS has the smallest L2 distortion under int6 quant**, consistent with its mechanism (suppressing per-token max-coord shrinks the per-row scale, reducing rounding step size). + +### 12.2 Real measured pre vs post quant val_bpb on Base A + +`figures/fig_real_pre_post_quant.png` (real, from training logs): + +This is the data that motivates the §7 finding — quant cost is approximately reg-independent on Base A. The figure shows the per-reg quant tax as a +14.5 mBPB constant column. Re-stated: **synthetic L2 distortion differences (§12.1) do not propagate into real val_bpb differences** at the GPTQ + LQER + brotli pipeline scale on Base A. Either the synthetic differences are too small to matter, or the LQER asymmetric residual correction absorbs them, or both. + +This dissociation between synthetic geometric metrics and real val_bpb effects is itself a finding worth flagging. **Synthetic embedding geometry is suggestive but not predictive of post-quant val_bpb at this scale.** + +--- + +## 13. Mechanistic checks: do the regs change the model upstream of quantization? + +The natural reviewer question after §7 ("quant cost is reg-independent") is: *do the regs change anything at all?* If every reg produces post-quant val_bpb within 0.6 mBPB of every other, maybe the regs aren't actually doing different work. We ran three checks to falsify that worry. + +### 13.1 Singular-value spectrum of weight matrices + +`figures/fig_svd_spectrum.png` (raw spectra, six weight families × six regs, log-scale). +`figures/fig_svd_flatness.png` (per-family bar chart of mean −log₁₀(σᵢ/σ₁) over the bottom 87.5% of ranks — bigger = flatter spectrum = harder to per-row int6 quantize). +`figures/fig_svd_differential.png` (each reg's spectrum vs no-reg as a Δ% curve, smoothed window 5, top 90% of ranks, y-clipped to ±4%). + +Method: for each of the six trained Base-A variants we compute SVD of every 2-D weight (6 families × 11 layers each) and average σ₁..σₘᵢₙ across layers within a family. We report (a) the raw spectrum, (b) a single flatness scalar per (reg, family), and (c) the differential vs no-reg. + +Findings: + +- **Flatness is essentially identical across regs** (≤2% relative differences within each family on the flatness scalar; bar chart visually flat). 
- The differential view reveals the structure hidden by the overall similarity:

  | Reg | Family with largest mean Δ | Mean Δ% there | Max \|Δ\|% |
  |---|---|---:|---:|
  | SimCTG (alone) | Attn out | +1.23% | 1.96% |
  | SimCTG+ES | Attn V | **−1.02%** | 1.55% |
  | SimCTG+QAHSP | Attn Q | −0.82% | 1.71% |
  | SimCTG+HSU | Attn out / V | +0.60% | 1.58% |
  | SimCTG+AOS | Attn K | +0.63% | 1.55% |

- **Regs touch attention more than MLP.** Every reg has max \|Δ\| ≤ 0.6% on both MLP families; attention swings up to ~2%.
- **ES is the only reg that reduces attention V σᵢ on average** (mean −1.02%). All others nudge up. Consistent with ES's mechanism (angular spread reduces V's per-row magnitudes via the gradient through the softmax).
- **All deltas are sub-3% at every rank index.** GPTQ int6 per-row quantization has noise floor ≈4-6% per channel, so the SVD differences are below the quantization-noise threshold — which mechanistically explains why post-quant val_bpb is reg-independent on Base A even though the regs *are* leaving a fingerprint upstream.

### 13.2 Hidden-state norm + kurtosis trajectory through depth

`figures/fig_depth_trajectory.png` — per-block mean ‖h‖ and mean per-coord excess kurtosis, one curve per reg, computed on a deterministic 8×128 sample of synthetic input tokens through CPU forward (flash-attn replaced by SDPA fallback).

Findings:

- **‖h‖ peaks mid-depth** (~85 at block 3-4) then collapses to ~30 at the final block. Same pattern across all six regs; curves overlap within ~5%.
- **Kurtosis is near zero through layers 1-9 then explodes to 350-400 at the final block.** Same pattern across all six regs.
- The reg-induced differences are visible but small. SimCTG variants cluster slightly tighter than no-reg through layers 7-9 (norm-collapse region); SimCTG+ES dips lowest at block 8-9 (~62 vs no-reg's ~63), consistent with its angular-spread role redistributing magnitude.
- **Outlier emergence (the kurtosis explosion at the final block) is architectural, not reg-driven.** No reg suppresses it — they all sit at 350-390 final-block kurtosis. This is consistent with the residual stream's natural drift toward heavy-tailed pre-logit activations under tied embeddings + RMSNorm + softcap.

This explains why activation-side regs (AOS, HSU, QAHSP) targeted at hidden-state outliers don't dominate quant cost on Base A: the outliers they target only appear at the final block, where the next operation is the tied-embedding logit projection, not a quantized matmul.

### 13.3 Pairwise CKA between final-block representations

`figures/fig_cka_heatmap.png` — linear CKA (Kornblith et al. 2019) between final-block hidden states across the six trained variants, on the same deterministic 8×128 input.

Findings:

- **All off-diagonal CKAs sit in [0.67, 0.75].** Regs produce subtly different representations but all in the same ballpark.
- The most-similar pairs are no-reg vs SimCTG+QAHSP and SimCTG+ES vs SimCTG+QAHSP (both at CKA 0.75); no-reg vs SimCTG+ES sits at 0.70.
- The two most-dissimilar variants are SimCTG+AOS vs SimCTG (alone) at CKA 0.67.
- **No reg pulls representations away from the cloud by more than ~10% of the within-cloud spread.**

This is consistent with the SVD finding: regs leave a fingerprint, but the fingerprint is small relative to the variation a single SimCTG vs no-SimCTG choice already induces. 
CKA confirms that "regs are interchangeable on Base A" is not just true at the post-quant val_bpb level but also at the latent-representation level. + +### 13.4 Synthesis + +| Layer of analysis | Reg differences? | Magnitude vs noise | +|---|---|---| +| Weight spectra (SVD) | Yes, sub-3% per rank | Below int6 per-row quant noise (~5%) | +| Hidden-state norm/kurtosis | Yes, sub-10% at most depths | Below run-to-run noise from sliding-window eval | +| Final-block representation (CKA) | Yes, off-diag 0.67-0.75 | Substantial, but uniform across regs | +| Post-quant val_bpb (§7) | Effectively no, 14.3-14.9 mBPB tax | Below 1-seed val_bpb noise (~2 mBPB) | + +The story that emerges: **regs do shape Base A internals, but the shaping happens in a regime that GPTQ + LQER + brotli flattens out.** This reframes the original question from "do these regs work?" to "what would a quant pipeline have to look like for these regs' fingerprints to survive?" — a worthwhile direction we don't pursue here. + +--- + +## 14. Statistical caveats + +We ran **single seeds** in this study to keep cell count manageable. Many of the smaller deltas (≤0.5 mBPB) are at or below run-to-run noise (3-seed std ≈ 0.0023 from PR #1855 lineage data). Conclusions we feel reasonably confident about: + +- **Sign-change findings (§6) are robust.** A 3.80 mBPB swing is much larger than 1-seed noise, and the direction is consistent across two regs (QAHSP and ES) with a clear candidate mechanism (§9.1). +- **Quant-cost-uniformity (§7) is robust.** 0.6 mBPB range across 7 regs at the noise floor *is* the finding — no reg differentiates itself in quant cost. +- **Pair-vs-single ranking (§6.2) is suggestive.** Adjacent ranks within 0.3 mBPB should be treated as approximately tied at single-seed; the overall pattern (every pair worse than best single) holds at 4 cells. +- **PQT × ES / × QAHSP comparison (§6.3) is suggestive but not statistically confirmed.** A 0.27 mBPB advantage for ES vs −0.16 for QAHSP is on the margin; multi-seed replication would strengthen this. +- **All synthetic results (§10–§12.1) are mechanism illustrations** of what the regs do to a controlled small-dim cloud. They are not performance forecasts. The dissociation noted in §12.2 cautions against reading them as such. + +--- + +## 15. What we do NOT claim + +- We do not claim our 7 regs are the right basis. They were chosen pre-experiment for the 16 MB cap-constrained regime; other reg families (dropout schedules, low-rank weight constraints, activation bottlenecks) might transfer differently. +- We do not claim Base B is hostile to *all* additions. We did not test eval-time-only side channels (byte-PPM, n-gram tilt) due to ongoing legality discussions about score-after-fit-statistics patterns. +- We do not claim to have exhausted lambda search on Base B. A configuration at much smaller λ might recover positive transfer; we did not run those cells. +- We do not claim our cross-base differences are unique to PR #1965. The same study run between any two heavily-tuned bases might show similar transferability gaps; this is a hypothesis for future work. +- We do not claim the synthetic geometric analysis (§10–§12.1) predicts real-data quant survival on Base A. §12.2 shows the dissociation; we present synthetic results as mechanism intuitions only. + +--- + +## 16. Reproducibility + +All Base-A cells (§6.1, 6.2, 6.3): env-gated harness `train_gpt_baseA.py.lzma` (companion submission A's `train_gpt.py` with the 7 reg knobs and bigram + StableMuon as env vars). 
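As noted below, each cell is reproduced by setting the env vars recorded in `ablation_data.csv` on the corresponding script. A hypothetical convenience helper (the `name`/`config`/`val_bpb` column names are our guess at the CSV schema; check the header before use):

```python
import csv

# Emit one reproduction command per ablation cell. Assumes each row carries a
# cell name plus a `config` field of space-separated KEY=VAL env pairs, e.g.
# "OUR_QAHSP_LAMBDA=0.3" (the OUR_*_LAMBDA knobs from §4). The launch line is
# illustrative; see the frozen script's own header for the real invocation.
with open("ablation_data.csv") as f:
    for row in csv.DictReader(f):
        print(f"# cell: {row.get('name', '?')}  val_bpb={row.get('val_bpb', '?')}")
        print(f"{row.get('config', '').strip()} torchrun --nproc_per_node=8 train_gpt_baseA.py")
```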
+ +Base-B cells (§6.4): four frozen scripts (`train_gpt_baseB_simctg_qahsp.py.lzma`, `train_gpt_baseB_es.py.lzma`, `train_gpt_baseB_es_hsu.py.lzma`, `train_gpt_baseB_bigram.py.lzma`) — each is the PR #1965 reproduction code with the named reg combination grafted in. + +Each cell can be reproduced by setting the env vars listed in `ablation_data.csv` on the corresponding script. The 20 cells together took ~7 hr of 8×H100 SXM compute. Logs are reproducible from the env configs. + +For §8 (real-data reg × quant matrix): pipeline is `run_reg_quant_matrix.py` + `build_synergy_figures.py`, runs after the 6 fresh `EmbStudy_*` cells finish. + +--- + +## 17. Files + +- `README.md` — this file +- `submission.json` — metadata +- `REGULARIZATION_ABLATION.md` — pre-registered hypotheses, frozen at study start +- `ablation_data.csv` — raw cell results (config + val_bpb + size + cap-fit) for downstream reuse +- `pipeline_attribution.json` — extracted pre-quant / quant / sliding / TTT val_bpb per cell +- `eval_pipeline_breakdown.json` — same data, per-stage breakdown form +- `run_reg_quant_matrix.py` — analysis pipeline for §8 (real-data reg × quant) +- `build_synergy_figures.py` — heatmap + synergy detection +- `build_advanced_figures.py` — analysis pipeline for §13 (SVD spectrum, depth trajectory, CKA) +- `run_after_trains.sh` — automated trigger after EmbStudy training cells finish +- `depth_trajectory.json`, `cka_pairwise.json` — extracted §13 numerical tables +- `figures/` — PNGs: see in-context references in §6–§13 + - cross-base + pipeline: `fig1_cross_base_signs.png`, `fig_pipeline_waterfall.png`, `fig_real_pre_post_quant.png`, `fig_pqt_compounding.png`, `fig_lambda_budget.png`, `fig_reg_quant_matrix_real.png` + - real-data hidden states: `fig_real_3d_pca.png`, `fig_real_canonical_metrics.png`, `fig_real_coord_distribution.png`, `fig_real_l2norm_distribution.png` + - mechanistic checks (§13): `fig_svd_spectrum.png`, `fig_svd_flatness.png`, `fig_svd_differential.png`, `fig_depth_trajectory.png`, `fig_cka_heatmap.png` + +--- + +## 18. Credits + +Reg-knob design and study: BharathSShankar (this work). + +Base-A inherits architecture from PR #1855 lineage with our SP10240 tokenizer adoption. The N9 SimCTG hyperparameters (λ=0.3, margin=0.4) were tuned by us; documented in companion record submission A. + +Base-B (PR #1965 lineage): @himanshudongre (PR #1965), @andrewbaggio1 (PR #1953), @alertcat (PR #1945), @codemath3000 (PR #1855), @bigbag (PR #1493), @dexhunter (PR #1413, PR #1331/1437), @clarkkev (PR #1394), @abaybektursun (PR #549). Thanks to these authors for the public PRs we built on. + +PreQuantTTT recipe (used in §6.3 only, gray-track): @okezue (PR #1958, since closed). We treated their recipe as a reference implementation for testing hypothesis 4 and respect the closure decision. + +Wang & Isola 2020 framing of "alignment + uniformity" decomposition seeded our pre-registered hypothesis 3. + +Thanks to OpenAI and the leaderboard organizers for the challenge and for the example PRs that made this study possible. diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_advanced_figures.py b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_advanced_figures.py new file mode 100644 index 0000000000..31e54d0dee --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_advanced_figures.py @@ -0,0 +1,403 @@ +""" +Build 3 additional reviewer-grade figures for Sub C: + + 1. 
fig_svd_spectrum.png — singular value spectrum per reg, per matrix family. + Mechanistic grounding for GPTQ-friendliness: + flatter spectrum = more L2 mass in tail dims = + harder to per-row int6 quantize. + + 2. fig_depth_trajectory.png — per-layer hidden-state mean ‖h‖ and excess kurtosis + through depth, one curve per reg. Shows where + outliers emerge in the depth dimension and gives + the AOS / HSU / QAHSP motivation real grounding. + + 3. fig_cka_heatmap.png — pairwise CKA (Kornblith 2019) between final-block + hidden states of the 6 reg variants. Tests whether + the regs produce *meaningfully* different + representations or just superficial perturbations. + +All work is done on CPU to avoid contending with running training on GPU. +""" + +import os, sys, json, math, gc +import numpy as np +import torch +import torch.nn.functional as F +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +from matplotlib import cm + +# Monkey-patch flash_attn_3 with SDPA so we can run forward on CPU without +# touching the GPU (GPU is being used by training runs). +try: + import flash_attn_interface as _fai +except ImportError: + import types as _types + _fai = _types.ModuleType("flash_attn_interface") + sys.modules["flash_attn_interface"] = _fai + +def _sdpa_fallback(q, k, v, causal=True, **_): + # q,k,v: (B, T, H, D) → SDPA wants (B, H, T, D) + q_ = q.transpose(1, 2) + k_ = k.transpose(1, 2) + v_ = v.transpose(1, 2) + # GQA: expand K/V heads to match Q heads + if k_.size(1) != q_.size(1): + rep = q_.size(1) // k_.size(1) + k_ = k_.repeat_interleave(rep, dim=1) + v_ = v_.repeat_interleave(rep, dim=1) + out = F.scaled_dot_product_attention(q_, k_, v_, is_causal=causal) + return out.transpose(1, 2).contiguous() + +_fai.flash_attn_func = _sdpa_fallback + +REG_DIRS = { + 'no-reg': '/workspace/parameter-golf/candidate_pack/N18_baseA_nosimctg', + 'SimCTG': '/workspace/parameter-golf/candidate_pack/N18_baseA_baseline', + 'SimCTG+QAHSP': '/workspace/parameter-golf/candidate_pack/N18_baseA_qahsp', + 'SimCTG+ES': '/workspace/parameter-golf/candidate_pack/N18_baseA_es', + 'SimCTG+HSU': '/workspace/parameter-golf/candidate_pack/N18_baseA_hsu', + 'SimCTG+AOS': '/workspace/parameter-golf/candidate_pack/N18_baseA_aos', +} + +OUT_DIR = '/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/figures' +os.makedirs(OUT_DIR, exist_ok=True) + +REG_COLORS = { + 'no-reg': '#6b7280', + 'SimCTG': '#3b82f6', + 'SimCTG+QAHSP': '#10b981', + 'SimCTG+ES': '#f59e0b', + 'SimCTG+HSU': '#8b5cf6', + 'SimCTG+AOS': '#ef4444', +} + +# ─────────────────────────────────────────────────────────────────────────── +# Figure 1: SVD spectrum +# ─────────────────────────────────────────────────────────────────────────── +WEIGHT_FAMILIES = [ + ('attn.c_q.weight', 'Attention Q'), + ('attn.c_k.weight', 'Attention K'), + ('attn.c_v.weight', 'Attention V'), + ('attn.proj.weight', 'Attention out'), + ('mlp.fc.weight', 'MLP up-proj'), + ('mlp.proj.weight', 'MLP down-proj'), +] + +def collect_svd_spectra(state_dict): + """For each weight family, compute SVD on each layer's weight, + then average the (sorted, normalized) singular value curves across layers. + Returns {family: ndarray of length min(in,out)}. 
+ """ + spectra = {fam: [] for fam, _ in WEIGHT_FAMILIES} + for k, v in state_dict.items(): + if v.ndim != 2: + continue + for fam_substr, _ in WEIGHT_FAMILIES: + if k.endswith(fam_substr): + w = v.float().cpu() + S = torch.linalg.svdvals(w) + S = (S / S.max()).numpy() # normalize to spectral norm + spectra[fam_substr].append(S) + break + out = {} + for fam, _ in WEIGHT_FAMILIES: + if spectra[fam]: + stacked = np.stack(spectra[fam], axis=0) + out[fam] = stacked.mean(axis=0) + return out + +def plot_svd_spectrum(svd_per_reg): + n_fam = len(WEIGHT_FAMILIES) + n_cols = 3 + n_rows = (n_fam + n_cols - 1) // n_cols + fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4.6 * n_rows), squeeze=False) + for i, (fam, label) in enumerate(WEIGHT_FAMILIES): + ax = axes[i // n_cols][i % n_cols] + for reg, spectra in svd_per_reg.items(): + if fam not in spectra: continue + S = spectra[fam] + ax.semilogy(np.arange(1, len(S)+1)/len(S), S, color=REG_COLORS[reg], + lw=1.7, alpha=0.9, label=reg) + ax.set_xlabel('rank index (normalized)') + ax.set_ylabel('σᵢ / σ₁ (log)') + ax.set_title(label, fontsize=11) + ax.grid(True, alpha=0.3, which='both') + ax.set_ylim(bottom=1e-2) + # legend on first axis + axes[0][0].legend(loc='lower left', fontsize=8, framealpha=0.85) + # hide unused panels + for j in range(n_fam, n_rows * n_cols): + axes[j // n_cols][j % n_cols].axis('off') + fig.suptitle('Singular-value spectrum per regularizer, averaged across 11 layers', + fontsize=13, y=1.00) + fig.tight_layout() + out = os.path.join(OUT_DIR, 'fig_svd_spectrum.png') + fig.savefig(out, dpi=130, bbox_inches='tight') + plt.close(fig) + print(f" saved {out}") + # also save a condensed bar chart of "spectrum flatness" — a single number per (reg, fam). + # Flatness metric: -mean(log10(sigma_i / sigma_1)) — bigger = flatter spectrum = + # more L2 mass distributed in tail, harder to per-row int6 quantize. 
+ fig2, ax2 = plt.subplots(figsize=(11, 5)) + fams = [fam for fam, _ in WEIGHT_FAMILIES] + fam_labels = [lbl for _, lbl in WEIGHT_FAMILIES] + regs = list(svd_per_reg.keys()) + width = 0.8 / max(len(regs), 1) + x = np.arange(len(fams)) + for k, reg in enumerate(regs): + vals = [] + for fam in fams: + S = svd_per_reg[reg].get(fam) + if S is None or len(S) == 0: + vals.append(np.nan); continue + tail = S[max(1, len(S)//8):] # exclude top 12.5% (head dims) + vals.append(float(-np.log10(np.clip(tail, 1e-8, None)).mean())) + ax2.bar(x + (k - (len(regs)-1)/2)*width, vals, width=width, + color=REG_COLORS.get(reg, '#999'), label=reg, edgecolor='black', linewidth=0.4) + ax2.set_xticks(x) + ax2.set_xticklabels(fam_labels, rotation=15, ha='right') + ax2.set_ylabel('mean −log₁₀(σᵢ/σ₁) over tail dims\n(higher = flatter, harder to int6 per-row quantize)') + ax2.set_title('Spectrum flatness per (regularizer, weight family) — tail dims only', fontsize=12) + ax2.legend(loc='upper left', fontsize=8, framealpha=0.85, ncol=2) + ax2.grid(True, alpha=0.3, axis='y') + fig2.tight_layout() + out2 = os.path.join(OUT_DIR, 'fig_svd_flatness.png') + fig2.savefig(out2, dpi=130, bbox_inches='tight') + plt.close(fig2) + print(f" saved {out2}") + +# ─────────────────────────────────────────────────────────────────────────── +# Figure 2: per-layer depth trajectory of ‖h‖ and excess kurtosis +# ─────────────────────────────────────────────────────────────────────────── +def excess_kurtosis(x, dim=-1): + """Excess kurtosis (Fisher) along dim; positive = heavy tails.""" + x = x.float() + m = x.mean(dim=dim, keepdim=True) + x_c = x - m + m4 = (x_c ** 4).mean(dim=dim) + m2 = (x_c ** 2).mean(dim=dim) + return (m4 / (m2 ** 2 + 1e-12) - 3.0) + +def collect_depth_trajectory(model_dir, n_seq=8, seq_len=128): + """Run a forward pass and capture hidden state after each block. + Returns (mean_norm[L], mean_kurt[L]). + """ + sys.path.insert(0, model_dir) + if 'train_gpt' in sys.modules: del sys.modules['train_gpt'] + import importlib.util + spec = importlib.util.spec_from_file_location("train_gpt", os.path.join(model_dir, "train_gpt.py")) + train_gpt = importlib.util.module_from_spec(spec) + os.environ.setdefault("WORLD_SIZE", "1") + os.environ.setdefault("RANK", "0") + os.environ.setdefault("LOCAL_RANK", "0") + os.environ.setdefault("MASTER_ADDR", "127.0.0.1") + os.environ.setdefault("MASTER_PORT", "29500") + spec.loader.exec_module(train_gpt) + h_cls = train_gpt.Hyperparameters() if hasattr(train_gpt, 'Hyperparameters') else None + model = train_gpt.GPT(h_cls) + sd = torch.load(os.path.join(model_dir, "final_model.pt"), map_location="cpu", weights_only=False) + if isinstance(sd, dict) and 'state_dict' in sd: sd = sd['state_dict'] + model.load_state_dict(sd, strict=False) + model.eval().float() # CPU float32 for stability + + L = len(model.blocks) + captured = [None] * L + hooks = [] + for i, blk in enumerate(model.blocks): + def make_hook(idx): + def h(m, inp, out): + captured[idx] = out.detach().float() + return h + hooks.append(blk.register_forward_hook(make_hook(i))) + + # Use deterministic synthetic tokens; we want consistent comparison across regs. 
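    # (ids drawn in [0, 10000) stay inside Base A's SP10240 vocab; adjust per model)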
+ rng = np.random.default_rng(0) + toks = rng.integers(0, 10000, size=(n_seq, seq_len)).astype(np.int64) + toks = torch.from_numpy(toks) + + with torch.no_grad(): + try: + _ = model.forward_logits(toks) + except Exception as e: + print(f" forward error: {e}") + for h in hooks: h.remove() + return None, None + + for h in hooks: h.remove() + + norms, kurts = [], [] + for i, h in enumerate(captured): + if h is None: + norms.append(np.nan); kurts.append(np.nan); continue + # h: (B, T, D); per-token norm + per-coord kurtosis + hn = h.reshape(-1, h.size(-1)) + norms.append(hn.norm(dim=-1).mean().item()) + kurts.append(excess_kurtosis(hn, dim=-1).mean().item()) + del model, captured + gc.collect() + return np.array(norms), np.array(kurts) + +def plot_depth_trajectory(traj_per_reg): + fig, (axN, axK) = plt.subplots(1, 2, figsize=(14, 4.8)) + for reg, (norms, kurts) in traj_per_reg.items(): + if norms is None: continue + x = np.arange(1, len(norms)+1) + axN.plot(x, norms, marker='o', ms=4.5, color=REG_COLORS[reg], lw=1.7, label=reg) + axK.plot(x, kurts, marker='o', ms=4.5, color=REG_COLORS[reg], lw=1.7, label=reg) + axN.set_xlabel('block index (1 = closest to embedding)') + axN.set_ylabel('mean ‖h‖₂ across tokens') + axN.set_title('hidden-state norm trajectory through depth', fontsize=11) + axN.grid(True, alpha=0.3) + axK.set_xlabel('block index (1 = closest to embedding)') + axK.set_ylabel('mean per-coord excess kurtosis') + axK.set_title('hidden-state heavy-tail-ness through depth', fontsize=11) + axK.grid(True, alpha=0.3) + axK.axhline(0, color='k', lw=0.7, alpha=0.5) + axK.legend(loc='best', fontsize=8, framealpha=0.85) + fig.suptitle('Where in the depth do regularizers shape outliers?', fontsize=12, y=1.02) + fig.tight_layout() + out = os.path.join(OUT_DIR, 'fig_depth_trajectory.png') + fig.savefig(out, dpi=130, bbox_inches='tight') + plt.close(fig) + print(f" saved {out}") + +# ─────────────────────────────────────────────────────────────────────────── +# Figure 3: pairwise CKA heatmap +# ─────────────────────────────────────────────────────────────────────────── +def linear_cka(X, Y): + """Linear CKA from Kornblith et al. 2019. 
+ X: (N, dx) Y: (N, dy) — same N.""" + X = X - X.mean(0, keepdim=True) + Y = Y - Y.mean(0, keepdim=True) + XtY = X.t() @ Y + num = (XtY ** 2).sum() + den = ((X.t() @ X) ** 2).sum().sqrt() * ((Y.t() @ Y) ** 2).sum().sqrt() + return float(num / (den + 1e-12)) + +def get_last_hidden(model_dir, n_seq=8, seq_len=128): + """Same setup as depth-trajectory but return only the final-block hidden state.""" + sys.path.insert(0, model_dir) + if 'train_gpt' in sys.modules: del sys.modules['train_gpt'] + import importlib.util + spec = importlib.util.spec_from_file_location("train_gpt", os.path.join(model_dir, "train_gpt.py")) + train_gpt = importlib.util.module_from_spec(spec) + os.environ.setdefault("WORLD_SIZE", "1") + os.environ.setdefault("RANK", "0") + os.environ.setdefault("LOCAL_RANK", "0") + os.environ.setdefault("MASTER_ADDR", "127.0.0.1") + os.environ.setdefault("MASTER_PORT", "29500") + spec.loader.exec_module(train_gpt) + h_cls = train_gpt.Hyperparameters() if hasattr(train_gpt, 'Hyperparameters') else None + model = train_gpt.GPT(h_cls) + sd = torch.load(os.path.join(model_dir, "final_model.pt"), map_location="cpu", weights_only=False) + if isinstance(sd, dict) and 'state_dict' in sd: sd = sd['state_dict'] + model.load_state_dict(sd, strict=False) + model.eval().float() + captured = [None] + last = model.blocks[-1] + def hook(m, inp, out): + captured[0] = out.detach().float() + h = last.register_forward_hook(hook) + rng = np.random.default_rng(0) + toks = rng.integers(0, 10000, size=(n_seq, seq_len)).astype(np.int64) + toks = torch.from_numpy(toks) + with torch.no_grad(): + try: + _ = model.forward_logits(toks) + except Exception as e: + print(f" forward error: {e}") + h.remove() + return None + h.remove() + out = captured[0] + if out is None: return None + out = out.reshape(-1, out.size(-1)) + del model + gc.collect() + return out + +def plot_cka_heatmap(cka_matrix, regs): + fig, ax = plt.subplots(figsize=(7.2, 6)) + im = ax.imshow(cka_matrix, vmin=0.0, vmax=1.0, cmap='magma') + ax.set_xticks(range(len(regs))) + ax.set_yticks(range(len(regs))) + ax.set_xticklabels(regs, rotation=35, ha='right') + ax.set_yticklabels(regs) + for i in range(len(regs)): + for j in range(len(regs)): + txt_color = 'white' if cka_matrix[i, j] < 0.55 else 'black' + ax.text(j, i, f'{cka_matrix[i, j]:.2f}', ha='center', va='center', + color=txt_color, fontsize=9) + ax.set_title('Linear CKA between final-block hidden states\n(Kornblith et al. 
2019)', fontsize=11) + cbar = fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04) + cbar.set_label('CKA (1.0 = same representation)') + fig.tight_layout() + out = os.path.join(OUT_DIR, 'fig_cka_heatmap.png') + fig.savefig(out, dpi=130, bbox_inches='tight') + plt.close(fig) + print(f" saved {out}") + +# ─────────────────────────────────────────────────────────────────────────── +def main(): + # Figure 1: SVD spectrum (state-dict only, fast) + print("[1/3] SVD spectra per reg (matrix-family-wise)") + svd_per_reg = {} + for reg, d in REG_DIRS.items(): + sd_path = os.path.join(d, 'final_model.pt') + if not os.path.exists(sd_path): + print(f" {reg}: skipped (no final_model.pt)"); continue + print(f" {reg}: SVD") + sd = torch.load(sd_path, map_location='cpu', weights_only=False) + if isinstance(sd, dict) and 'state_dict' in sd: sd = sd['state_dict'] + svd_per_reg[reg] = collect_svd_spectra(sd) + del sd; gc.collect() + plot_svd_spectrum(svd_per_reg) + + # Figure 2: depth trajectory (forward pass per reg) + print("\n[2/3] depth trajectory per reg (per-layer ‖h‖ + kurtosis)") + traj_per_reg = {} + for reg, d in REG_DIRS.items(): + if not os.path.exists(os.path.join(d, 'final_model.pt')): + traj_per_reg[reg] = (None, None); continue + print(f" {reg}: forward") + norms, kurts = collect_depth_trajectory(d) + traj_per_reg[reg] = (norms, kurts) + plot_depth_trajectory(traj_per_reg) + + # Save trajectory numbers as JSON for the README to reference. + traj_json = {reg: {'norms': (n.tolist() if n is not None else None), + 'kurts': (k.tolist() if k is not None else None)} + for reg, (n, k) in traj_per_reg.items()} + with open('/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/depth_trajectory.json', 'w') as f: + json.dump(traj_json, f, indent=2) + + # Figure 3: CKA heatmap + print("\n[3/3] CKA pairwise heatmap") + hidden_per_reg = {} + for reg, d in REG_DIRS.items(): + if not os.path.exists(os.path.join(d, 'final_model.pt')): + continue + print(f" {reg}: forward (last block only)") + h = get_last_hidden(d) + if h is not None: + hidden_per_reg[reg] = h + regs = list(hidden_per_reg.keys()) + n = len(regs) + cka = np.zeros((n, n)) + for i, ri in enumerate(regs): + for j, rj in enumerate(regs): + if j < i: + cka[i, j] = cka[j, i] + else: + cka[i, j] = linear_cka(hidden_per_reg[ri], hidden_per_reg[rj]) + plot_cka_heatmap(cka, regs) + cka_json = {ri: {rj: float(cka[i, j]) for j, rj in enumerate(regs)} for i, ri in enumerate(regs)} + with open('/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/cka_pairwise.json', 'w') as f: + json.dump(cka_json, f, indent=2) + + print("\nAll 3 figures + 2 JSON tables written.") + +if __name__ == '__main__': + main() diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_real_data_figures.py b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_real_data_figures.py new file mode 100644 index 0000000000..19f2312c66 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_real_data_figures.py @@ -0,0 +1,231 @@ +"""Build visualizations from REAL captured hidden states across the 6 EmbStudy models.""" +import os, sys +import numpy as np +import torch +import torch.nn.functional as F +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +from mpl_toolkits.mplot3d import Axes3D # noqa + +ROOT = '/workspace/parameter-golf' +FIG_DIR = f'{ROOT}/submissions/C_CrossBase_RegTransfer_Study/figures' +os.makedirs(FIG_DIR, 
exist_ok=True) + +base_dirs = { + 'no-reg': f'{ROOT}/candidate_pack/N18_baseA_nosimctg', + 'SimCTG': f'{ROOT}/candidate_pack/N18_baseA_baseline', + 'SimCTG+QAHSP': f'{ROOT}/candidate_pack/N18_baseA_qahsp', + 'SimCTG+ES': f'{ROOT}/candidate_pack/N18_baseA_es', + 'SimCTG+HSU': f'{ROOT}/candidate_pack/N18_baseA_hsu', + 'SimCTG+AOS': f'{ROOT}/candidate_pack/N18_baseA_aos', +} + +def load_hidden(model_dir, n_tokens=128): + """Load BF16 model + run forward + capture hidden states.""" + sys.path.insert(0, model_dir) + if 'train_gpt' in sys.modules: + del sys.modules['train_gpt'] + import importlib.util + spec = importlib.util.spec_from_file_location("train_gpt", os.path.join(model_dir, "train_gpt.py")) + train_gpt = importlib.util.module_from_spec(spec) + os.environ.setdefault("WORLD_SIZE", "1") + os.environ.setdefault("RANK", "0") + os.environ.setdefault("LOCAL_RANK", "0") + os.environ.setdefault("MASTER_ADDR", "127.0.0.1") + os.environ.setdefault("MASTER_PORT", "29500") + spec.loader.exec_module(train_gpt) + + h_cls = train_gpt.Hyperparameters() if hasattr(train_gpt, 'Hyperparameters') else None + model_cls = None + for cls_name in ['GPT', 'FinalMiniLM', 'Model']: + if hasattr(train_gpt, cls_name): + model_cls = getattr(train_gpt, cls_name) + break + model = model_cls(h_cls) + state_dict = torch.load(os.path.join(model_dir, "final_model.pt"), map_location="cpu", weights_only=False) + if isinstance(state_dict, dict) and 'state_dict' in state_dict: + state_dict = state_dict['state_dict'] + model.load_state_dict(state_dict, strict=False) + model.eval() + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + model = model.to(device).bfloat16() + + toks = np.random.randint(0, 8000, size=8 * n_tokens) + toks = torch.from_numpy(toks).reshape(8, n_tokens).long().to(device) + + captured = {} + for name, mod in model.named_modules(): + if name.endswith('.10') or name.endswith('.9'): + def make_hook(n): + def hook(m, inp, out): + captured[n] = out.detach().cpu().float() + return hook + mod.register_forward_hook(make_hook(name)) + + with torch.no_grad(): + try: + _ = model.forward_logits(toks) if hasattr(model, 'forward_logits') else model(toks) + except Exception as e: + print(f" forward error in {model_dir}: {e}") + + if captured: + h = list(captured.values())[-1] + return h.reshape(-1, h.size(-1))[:n_tokens] + return None + +# === Capture hidden states === +np.random.seed(42) +torch.manual_seed(42) +hidden_per_reg = {} +for reg, d in base_dirs.items(): + if os.path.exists(os.path.join(d, "final_model.pt")): + h = load_hidden(d, n_tokens=128) + if h is not None: + hidden_per_reg[reg] = h + print(f"{reg:<14}: shape={tuple(h.shape)} mean_L2={h.pow(2).sum(-1).sqrt().mean().item():.2f}") + +if not hidden_per_reg: + print("No models loaded. 
Exit.") + sys.exit(1) + +# Save hidden states for downstream reuse +torch.save(hidden_per_reg, f'{ROOT}/submissions/C_CrossBase_RegTransfer_Study/real_hidden_states.pt') + +# === FIG 1: Real-data 3D PCA scatter === +fig = plt.figure(figsize=(20, 8)) +fig.suptitle('Real Base A LM hidden states (128 tokens × 512 dims), 3D PCA per reg', fontsize=13, weight='bold') +n_tok = 128 +colors = plt.cm.viridis(np.linspace(0, 1, n_tok)) +for col, (name, h) in enumerate(hidden_per_reg.items()): + h_np = h.numpy() + h_c = h_np - h_np.mean(0, keepdims=True) + U, S, _ = np.linalg.svd(h_c, full_matrices=False) + proj = U[:, :3] * S[:3] + ax = fig.add_subplot(1, len(hidden_per_reg), col+1, projection='3d') + ax.scatter(proj[:, 0], proj[:, 1], proj[:, 2], c=colors, s=30, alpha=0.7, edgecolors='black', linewidths=0.3) + ax.set_title(name, fontsize=11) + ax.tick_params(labelsize=7) + ax.view_init(elev=22, azim=45) + # Annotate spread + spread = (proj.max(0) - proj.min(0)).max() + ax.set_xlabel(f'PC1 (sv={S[0]:.1f})', fontsize=8) + ax.set_ylabel(f'PC2 (sv={S[1]:.1f})', fontsize=8) + ax.set_zlabel(f'PC3 (sv={S[2]:.1f})', fontsize=8) +plt.tight_layout() +plt.savefig(f'{FIG_DIR}/fig_real_3d_pca.png', dpi=130, bbox_inches='tight') +plt.close() +print("saved fig_real_3d_pca.png") + +# === FIG 2: Real per-coord distribution per reg === +fig, axes = plt.subplots(2, 3, figsize=(16, 8)) +fig.suptitle('Real per-coordinate hidden state value distributions (128 tokens × 512 dims)', fontsize=13, weight='bold') +for ax, (name, h) in zip(axes.flat, hidden_per_reg.items()): + flat = h.numpy().flatten() + ax.hist(flat, bins=60, color='steelblue', alpha=0.7, edgecolor='black') + ax.set_title(f'{name}\nμ={flat.mean():.3f}, σ={flat.std():.3f}, max|h|={np.abs(flat).max():.2f}', fontsize=10) + ax.set_xlabel('h_d value') + ax.set_ylabel('count') + ax.grid(alpha=0.3) +plt.tight_layout() +plt.savefig(f'{FIG_DIR}/fig_real_coord_distribution.png', dpi=130) +plt.close() +print("saved fig_real_coord_distribution.png") + +# === FIG 3: Real-data canonical metrics comparison === +def isoscore(h): + h_n = F.normalize(h, dim=-1) + sim = h_n @ h_n.t() + n = h_n.size(0) + off = sim - torch.eye(n) + return off.abs().mean().item() + +def eff_rank(h): + h_c = h - h.mean(0, keepdim=True) + _, S, _ = torch.linalg.svd(h_c, full_matrices=False) + p = S / S.sum() + p = p[p > 1e-10] + return float(np.exp(-(p * p.log()).sum().item())) + +real_metrics = {} +for name, h in hidden_per_reg.items(): + real_metrics[name] = { + 'isoscore': isoscore(h), + 'eff_rank': eff_rank(h), + 'norm_var': h.pow(2).sum(-1).sqrt().var().item(), + 'norm_mean': h.pow(2).sum(-1).sqrt().mean().item(), + 'max_abs': h.abs().max().item(), + } + +fig, axes = plt.subplots(2, 2, figsize=(13, 10)) +fig.suptitle('Real-data canonical metrics: do regs change real LM hidden states the way synthetic predicts?', fontsize=13, weight='bold') +names = list(real_metrics.keys()) +colors_p = plt.cm.tab10(np.arange(len(names))) + +ax = axes[0, 0] +isos = [real_metrics[n]['isoscore'] for n in names] +ax.bar(names, isos, color=colors_p, edgecolor='black', alpha=0.85) +ax.set_ylabel('mean |cos(h_i, h_j)| off-diag') +ax.set_title('Isoscore (lower = more isotropic)') +ax.tick_params(axis='x', rotation=20) +ax.grid(axis='y', alpha=0.3) +for i, v in enumerate(isos): ax.text(i, v+0.001, f'{v:.4f}', ha='center', fontsize=9, weight='bold') + +ax = axes[0, 1] +ers = [real_metrics[n]['eff_rank'] for n in names] +ax.bar(names, ers, color=colors_p, edgecolor='black', alpha=0.85) +ax.set_ylabel('exp(spectral 
entropy)') +ax.set_title('Effective rank (higher = more dimensions used)') +ax.tick_params(axis='x', rotation=20) +ax.grid(axis='y', alpha=0.3) +for i, v in enumerate(ers): ax.text(i, v+0.5, f'{v:.1f}', ha='center', fontsize=9, weight='bold') + +ax = axes[1, 0] +nvs = [real_metrics[n]['norm_var'] for n in names] +ax.bar(names, nvs, color=colors_p, edgecolor='black', alpha=0.85) +ax.set_ylabel('variance of L2 norms') +ax.set_title('Per-token L2 norm variance (lower = more uniform)') +ax.tick_params(axis='x', rotation=20) +ax.grid(axis='y', alpha=0.3) +for i, v in enumerate(nvs): ax.text(i, v+0.05, f'{v:.2f}', ha='center', fontsize=9, weight='bold') + +ax = axes[1, 1] +mxs = [real_metrics[n]['max_abs'] for n in names] +ax.bar(names, mxs, color=colors_p, edgecolor='black', alpha=0.85) +ax.set_ylabel('max |h| across all coords') +ax.set_title('Outlier coord magnitude (lower = AOS-like effect)') +ax.tick_params(axis='x', rotation=20) +ax.grid(axis='y', alpha=0.3) +for i, v in enumerate(mxs): ax.text(i, v+0.5, f'{v:.1f}', ha='center', fontsize=9, weight='bold') + +plt.tight_layout() +plt.savefig(f'{FIG_DIR}/fig_real_canonical_metrics.png', dpi=130) +plt.close() +print("saved fig_real_canonical_metrics.png") + +# === FIG 4: Real data per-token L2 norm distribution per reg === +fig, axes = plt.subplots(2, 3, figsize=(15, 8)) +fig.suptitle('Real per-token L2 norm distributions across 128 captured tokens', fontsize=13, weight='bold') +for ax, (name, h) in zip(axes.flat, hidden_per_reg.items()): + norms = h.pow(2).sum(-1).sqrt().numpy() + ax.hist(norms, bins=20, color='darkgreen', alpha=0.7, edgecolor='black') + ax.set_title(f'{name}\nμ={norms.mean():.2f}, σ={norms.std():.2f}', fontsize=10) + ax.set_xlabel('‖h‖') + ax.set_ylabel('count') + ax.grid(alpha=0.3) +plt.tight_layout() +plt.savefig(f'{FIG_DIR}/fig_real_l2norm_distribution.png', dpi=130) +plt.close() +print("saved fig_real_l2norm_distribution.png") + +print() +print("=== Real-data canonical metric table ===") +print(f"{'reg':<14} {'isoscore':>10} {'eff_rank':>10} {'norm_var':>10} {'norm_mean':>10} {'max|h|':>10}") +for n in names: + m = real_metrics[n] + print(f" {n:<14} {m['isoscore']:>10.4f} {m['eff_rank']:>10.2f} {m['norm_var']:>10.3f} {m['norm_mean']:>10.2f} {m['max_abs']:>10.2f}") + +# Save metrics as JSON +import json +open(f'{ROOT}/submissions/C_CrossBase_RegTransfer_Study/real_canonical_metrics.json', 'w').write(json.dumps(real_metrics, indent=2)) +print("saved real_canonical_metrics.json") diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_synergy_figures.py b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_synergy_figures.py new file mode 100644 index 0000000000..6c799ee2df --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/build_synergy_figures.py @@ -0,0 +1,94 @@ +"""Build heatmap + synergy-detection figures from real_reg_quant_matrix.json.""" +import json, os, sys +import numpy as np +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt + +with open('/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/real_reg_quant_matrix.json') as f: + data = json.load(f) + +regs = list(data.keys()) +quants = list(data[regs[0]].keys()) +n_r, n_q = len(regs), len(quants) + +# Build matrices for each metric +metrics = ['l2_distortion', 'cos_shift', 'isoscore_post', 'eff_rank_post'] +mats = {m: np.zeros((n_r, n_q)) for m in metrics} +for ri, r in enumerate(regs): + for qi, q in 
enumerate(quants):
+        for m in metrics:
+            mats[m][ri, qi] = data[r][q][m]
+
+# === Figure: heatmaps ===
+fig, axes = plt.subplots(2, 2, figsize=(15, 10))
+fig.suptitle('Real (reg × quant) interaction matrix on REAL Base A LM hidden states\n(captured from forward pass on val tokens, 6 trained models)', fontsize=13, weight='bold')
+
+cmaps = {'l2_distortion': 'RdYlGn_r', 'cos_shift': 'RdYlGn_r', 'isoscore_post': 'RdYlGn_r', 'eff_rank_post': 'RdYlGn'}
+titles = {
+    'l2_distortion': '(a) L2 distortion (lower = better)',
+    'cos_shift': '(b) Cosine shift (lower = better)',
+    'isoscore_post': '(c) Post-quant isoscore (lower = better)',
+    'eff_rank_post': '(d) Post-quant effective rank (higher = better)',
+}
+
+for ax, m in zip(axes.flat, metrics):
+    mat = mats[m]
+    im = ax.imshow(mat, aspect='auto', cmap=cmaps[m])
+    ax.set_xticks(range(n_q)); ax.set_xticklabels(quants, rotation=30, ha='right', fontsize=9)
+    ax.set_yticks(range(n_r)); ax.set_yticklabels(regs, fontsize=10)
+    ax.set_title(titles[m], fontsize=11)
+    # Annotate values
+    for i in range(n_r):
+        for j in range(n_q):
+            ax.text(j, i, f'{mat[i,j]:.4f}' if 'l2' in m or 'shift' in m else f'{mat[i,j]:.3f}',
+                    ha='center', va='center', fontsize=8, color='black')
+    plt.colorbar(im, ax=ax, fraction=0.04)
+    # Mark best per quant (column) — for distortion/cos_shift, lowest; for eff_rank, highest
+    for j in range(n_q):
+        col = mat[:, j]
+        best_row = np.argmin(col) if m != 'eff_rank_post' else np.argmax(col)
+        ax.scatter(j, best_row, marker='*', s=200, c='gold', edgecolors='black', linewidths=1, zorder=10)
+
+plt.tight_layout()
+plt.savefig('/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/figures/fig_reg_quant_matrix_real.png', dpi=130, bbox_inches='tight')
+plt.close()
+print("saved fig_reg_quant_matrix_real.png")
+
+# === Synergy detection: which (reg, quant) pairs are unexpectedly good? 
=== +# For each metric, normalize each row (relative to that reg's mean) and each column (relative to that quant's mean) +# A "synergy" is a cell that's better than both its row mean AND column mean by margin +print() +print("=== SYNERGY detection (cells that play unusually nicely) ===") +for m in ['l2_distortion', 'cos_shift']: + mat = mats[m] + row_means = mat.mean(axis=1, keepdims=True) + col_means = mat.mean(axis=0, keepdims=True) + # Synergy: cell is BELOW both row mean and col mean (lower distortion is better here) + rel_row = mat - row_means # negative = better than row average + rel_col = mat - col_means + print(f"\nMetric: {m}") + print(f" Each cell: relative-to-row-mean / relative-to-col-mean") + for ri, r in enumerate(regs): + for qi, q in enumerate(quants): + rr, rc = rel_row[ri, qi], rel_col[ri, qi] + if rr < -mat.std()*0.3 and rc < -mat.std()*0.3: + print(f" ⭐ SYNERGY: {r:<10} × {q:<22} (row Δ {rr:+.4f}, col Δ {rc:+.4f}) — both reg AND quant outperform their means") + +# === "Plays nice" summary table === +print() +print("=== 'Plays nice' summary: best reg per quant + best quant per reg ===") +print() +print("For each QUANT scheme, which REG produces the smallest distortion?") +print(f"{'quant scheme':<22} {'best reg':<10} {'L2 dist':>9}") +for qi, q in enumerate(quants): + col = mats['l2_distortion'][:, qi] + best_r = np.argmin(col) + print(f" {q:<22} {regs[best_r]:<10} {col[best_r]:.4f}") +print() +print("For each REG, which QUANT gives smallest distortion?") +print(f"{'reg':<10} {'best quant':<22} {'L2 dist':>9}") +for ri, r in enumerate(regs): + row = mats['l2_distortion'][ri, :] + best_q = np.argmin(row) + print(f" {r:<10} {quants[best_q]:<22} {row[best_q]:.4f}") diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/cka_pairwise.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/cka_pairwise.json new file mode 100644 index 0000000000..b23086ae07 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/cka_pairwise.json @@ -0,0 +1,50 @@ +{ + "no-reg": { + "no-reg": 1.0000001192092896, + "SimCTG": 0.7081529498100281, + "SimCTG+QAHSP": 0.7461575865745544, + "SimCTG+ES": 0.7027785181999207, + "SimCTG+HSU": 0.7411860227584839, + "SimCTG+AOS": 0.7073161005973816 + }, + "SimCTG": { + "no-reg": 0.7081529498100281, + "SimCTG": 1.0, + "SimCTG+QAHSP": 0.7191595435142517, + "SimCTG+ES": 0.6846227049827576, + "SimCTG+HSU": 0.688707709312439, + "SimCTG+AOS": 0.6744828224182129 + }, + "SimCTG+QAHSP": { + "no-reg": 0.7461575865745544, + "SimCTG": 0.7191595435142517, + "SimCTG+QAHSP": 1.0000001192092896, + "SimCTG+ES": 0.7486706972122192, + "SimCTG+HSU": 0.691008448600769, + "SimCTG+AOS": 0.7166145443916321 + }, + "SimCTG+ES": { + "no-reg": 0.7027785181999207, + "SimCTG": 0.6846227049827576, + "SimCTG+QAHSP": 0.7486706972122192, + "SimCTG+ES": 1.0000001192092896, + "SimCTG+HSU": 0.7097563147544861, + "SimCTG+AOS": 0.7295978665351868 + }, + "SimCTG+HSU": { + "no-reg": 0.7411860227584839, + "SimCTG": 0.688707709312439, + "SimCTG+QAHSP": 0.691008448600769, + "SimCTG+ES": 0.7097563147544861, + "SimCTG+HSU": 1.0000001192092896, + "SimCTG+AOS": 0.7205949425697327 + }, + "SimCTG+AOS": { + "no-reg": 0.7073161005973816, + "SimCTG": 0.6744828224182129, + "SimCTG+QAHSP": 0.7166145443916321, + "SimCTG+ES": 0.7295978665351868, + "SimCTG+HSU": 0.7205949425697327, + "SimCTG+AOS": 0.9999999403953552 + } +} \ No newline at end of file diff --git 
a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/depth_trajectory.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/depth_trajectory.json new file mode 100644 index 0000000000..7fcf77b329 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/depth_trajectory.json @@ -0,0 +1,170 @@ +{ + "no-reg": { + "norms": [ + 82.8610610961914, + 74.36043548583984, + 84.44847106933594, + 84.7039794921875, + 64.86913299560547, + 73.76284790039062, + 69.17152404785156, + 63.027923583984375, + 52.42226791381836, + 35.87632751464844, + 28.76905632019043 + ], + "kurts": [ + 4.855867385864258, + 2.8805336952209473, + 1.1206727027893066, + 1.0582356452941895, + 0.8072985410690308, + 1.0698959827423096, + 1.292937994003296, + 4.44473123550415, + 23.753156661987305, + 183.02464294433594, + 388.1587829589844 + ] + }, + "SimCTG": { + "norms": [ + 82.8995132446289, + 75.21341705322266, + 83.13368225097656, + 82.3014907836914, + 64.17926788330078, + 73.55484771728516, + 68.2824935913086, + 64.23097229003906, + 53.610286712646484, + 36.06589126586914, + 27.404142379760742 + ], + "kurts": [ + 5.22653865814209, + 3.193834066390991, + 0.8576445579528809, + 0.9317208528518677, + 0.706852376461029, + 0.8229748010635376, + 1.9099056720733643, + 4.851380348205566, + 28.853525161743164, + 184.3623809814453, + 358.630859375 + ] + }, + "SimCTG+QAHSP": { + "norms": [ + 81.03509521484375, + 74.17752838134766, + 83.78406524658203, + 82.69522094726562, + 63.89334487915039, + 74.52857971191406, + 67.91127014160156, + 64.07562255859375, + 52.825660705566406, + 35.43454360961914, + 30.548107147216797 + ], + "kurts": [ + 3.835179567337036, + 2.4510762691497803, + 0.8331541419029236, + 0.7151315212249756, + 0.7320422530174255, + 0.8011575937271118, + 0.9048928022384644, + 3.4804654121398926, + 21.797697067260742, + 173.99606323242188, + 386.97918701171875 + ] + }, + "SimCTG+ES": { + "norms": [ + 82.62226104736328, + 75.1025390625, + 84.94764709472656, + 80.84661102294922, + 63.999778747558594, + 72.64584350585938, + 65.707275390625, + 61.80288314819336, + 49.580055236816406, + 34.79338073730469, + 27.630577087402344 + ], + "kurts": [ + 4.948187828063965, + 3.426382303237915, + 1.003125548362732, + 1.0885035991668701, + 0.8019349575042725, + 1.0049747228622437, + 1.1360023021697998, + 3.7354226112365723, + 17.517728805541992, + 174.50112915039062, + 381.40606689453125 + ] + }, + "SimCTG+HSU": { + "norms": [ + 81.51065063476562, + 74.14579772949219, + 84.77708435058594, + 83.02979278564453, + 66.24172973632812, + 75.18669891357422, + 64.73681640625, + 62.87533187866211, + 52.14603042602539, + 35.268707275390625, + 28.539628982543945 + ], + "kurts": [ + 3.804694652557373, + 1.8663853406906128, + 0.7781698703765869, + 0.9039973020553589, + 0.8394896984100342, + 0.7438814640045166, + 1.1972663402557373, + 4.4696946144104, + 27.126493453979492, + 183.8109588623047, + 386.6787109375 + ] + }, + "SimCTG+AOS": { + "norms": [ + 84.75853729248047, + 77.02366638183594, + 84.03617858886719, + 81.2281265258789, + 66.06658172607422, + 72.4874038696289, + 63.07447814941406, + 62.116519927978516, + 52.27328872680664, + 35.48656463623047, + 28.157312393188477 + ], + "kurts": [ + 5.198080062866211, + 4.80472469329834, + 0.7657652497291565, + 0.7305189371109009, + 0.642646312713623, + 0.6297460794448853, + 1.489835500717163, + 5.857874870300293, + 30.544631958007812, + 182.83627319335938, + 385.41748046875 + ] + } +} \ No newline at end of file diff 
--git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/eval_pipeline_breakdown.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/eval_pipeline_breakdown.json new file mode 100644 index 0000000000..101ff3a6f1 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/eval_pipeline_breakdown.json @@ -0,0 +1,168 @@ +{ + "cells": [ + { + "name": "QAHSP \u03bb=0.3", + "size": 17028688, + "metrics": { + "pre-quantization post-ema": 1.07492939, + "quantized": 1.0894092, + "quantized_sliding_window": 1.07347521 + } + }, + { + "name": "ES \u03bb=0.05", + "size": 17022208, + "metrics": { + "pre-quantization post-ema": 1.07535789, + "quantized": 1.09022908, + "quantized_sliding_window": 1.07428482 + } + }, + { + "name": "AOS \u03bb=0.005", + "size": 17029456, + "metrics": { + "pre-quantization post-ema": 1.07579561, + "quantized": 1.09035483, + "quantized_sliding_window": 1.07445292 + } + }, + { + "name": "HSU \u03bb=0.1", + "size": 17030768, + "metrics": { + "pre-quantization post-ema": 1.07535949, + "quantized": 1.08993364, + "quantized_sliding_window": 1.07403453 + } + }, + { + "name": "WBC \u03bb=0.005", + "size": 17063164, + "metrics": { + "pre-quantization post-ema": 1.07646463, + "quantized": 1.09112469, + "quantized_sliding_window": 1.07521576 + } + }, + { + "name": "WOP \u03bb=0.5", + "size": 17029348, + "metrics": { + "pre-quantization post-ema": 1.07536867, + "quantized": 1.08962462, + "quantized_sliding_window": 1.07375795 + } + }, + { + "name": "PCS \u03bb=0.005", + "size": 17029124, + "metrics": { + "pre-quantization post-ema": 1.07595394, + "quantized": 1.09052669, + "quantized_sliding_window": 1.07462584 + } + }, + { + "name": "QAHSP+HSU pair", + "size": 17024280, + "metrics": { + "pre-quantization post-ema": 1.07532708, + "quantized": 1.08998156, + "quantized_sliding_window": 1.07408126 + } + }, + { + "name": "QAHSP+ES pair", + "size": 17027192, + "metrics": { + "pre-quantization post-ema": 1.07548971, + "quantized": 1.09006997, + "quantized_sliding_window": 1.07416475 + } + }, + { + "name": "HSU+ES pair", + "size": 17030228, + "metrics": { + "pre-quantization post-ema": 1.07558521, + "quantized": 1.09012842, + "quantized_sliding_window": 1.07422695 + } + }, + { + "name": "QAHSP+PCS pair", + "size": 17032068, + "metrics": { + "pre-quantization post-ema": 1.07614311, + "quantized": 1.09061423, + "quantized_sliding_window": 1.07474981 + } + }, + { + "name": "PQT + QAHSP \u03bb=0.3", + "size": 17023556, + "metrics": { + "pre-quantization post-ema": 1.07550024, + "post-prequant-ttt": 1.02901401, + "quantized": 1.05180787, + "quantized_sliding_window": 1.03985482 + } + }, + { + "name": "PQT + ES \u03bb=0.05", + "size": 17025048, + "metrics": { + "pre-quantization post-ema": 1.07516222, + "post-prequant-ttt": 1.02867971, + "quantized": 1.05144513, + "quantized_sliding_window": 1.03942213 + } + }, + { + "name": "Base B baseline (PR #1965)", + "size": 15977654, + "metrics": { + "pre-quantization post-ema": 1.06162051, + "quantized": 1.06995915, + "quantized_ttt_phased": 1.05822408 + } + }, + { + "name": "Base B + SimCTG+QAHSP \u03bb=0.3", + "size": 15972592, + "metrics": { + "pre-quantization post-ema": 1.06414265, + "quantized": 1.07235931, + "quantized_ttt_phased": 1.06047215 + } + }, + { + "name": "Base B + SimCTG+QAHSP \u03bb=0.1", + "size": 15974388, + "metrics": { + "pre-quantization post-ema": 1.06235812, + "quantized": 1.07065788, + "quantized_ttt_phased": 1.05880775 + } + }, + { + 
"name": "Base B + ES \u03bb=0.05", + "size": 15972817, + "metrics": { + "pre-quantization post-ema": 1.06333929, + "quantized": 1.07184449, + "quantized_ttt_phased": 1.05993433 + } + }, + { + "name": "Base B + bigram(1024\u00d78)", + "size": 16013368, + "metrics": { + "pre-quantization post-ema": 1.06223567, + "quantized": 1.07064692, + "quantized_ttt_phased": 1.05886441 + } + } + ] +} \ No newline at end of file diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig1_cross_base_signs.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig1_cross_base_signs.png new file mode 100644 index 0000000000..75670d02d4 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig1_cross_base_signs.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_cka_heatmap.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_cka_heatmap.png new file mode 100644 index 0000000000..fc4f3db23c Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_cka_heatmap.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_depth_trajectory.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_depth_trajectory.png new file mode 100644 index 0000000000..fcf4733088 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_depth_trajectory.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_lambda_budget.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_lambda_budget.png new file mode 100644 index 0000000000..76afc93cae Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_lambda_budget.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pipeline_waterfall.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pipeline_waterfall.png new file mode 100644 index 0000000000..04053b9728 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pipeline_waterfall.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pqt_compounding.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pqt_compounding.png new file mode 100644 index 0000000000..897bb9956c Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_pqt_compounding.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_3d_pca.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_3d_pca.png new file mode 100644 index 0000000000..515ac31248 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_3d_pca.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_canonical_metrics.png 
b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_canonical_metrics.png new file mode 100644 index 0000000000..90e08c80e4 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_canonical_metrics.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_coord_distribution.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_coord_distribution.png new file mode 100644 index 0000000000..20e5825f1f Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_coord_distribution.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_l2norm_distribution.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_l2norm_distribution.png new file mode 100644 index 0000000000..350f177a2a Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_l2norm_distribution.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_pre_post_quant.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_pre_post_quant.png new file mode 100644 index 0000000000..b10926398c Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_real_pre_post_quant.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_reg_quant_matrix_real.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_reg_quant_matrix_real.png new file mode 100644 index 0000000000..b2b668170c Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_reg_quant_matrix_real.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_differential.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_differential.png new file mode 100644 index 0000000000..3e98fff860 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_differential.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_flatness.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_flatness.png new file mode 100644 index 0000000000..286fc232f0 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_flatness.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_spectrum.png b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_spectrum.png new file mode 100644 index 0000000000..f3d6d69f17 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/figures/fig_svd_spectrum.png differ diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/pipeline_attribution.json 
b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/pipeline_attribution.json new file mode 100644 index 0000000000..7825090ca6 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/pipeline_attribution.json @@ -0,0 +1,158 @@ +[ + { + "cell": "QAHSP \u03bb=0.3", + "base": "A", + "pre_quant": 1.07492939, + "quantized": 1.0894092, + "sliding": 1.07347521, + "ttt": null, + "quant_cost_mBPB": 14.47980999999987, + "sliding_gain_mBPB": -15.933989999999953, + "ttt_gain_mBPB": null, + "final": 1.07347521 + }, + { + "cell": "ES \u03bb=0.05", + "base": "A", + "pre_quant": 1.07535789, + "quantized": 1.09022908, + "sliding": 1.07428482, + "ttt": null, + "quant_cost_mBPB": 14.871190000000034, + "sliding_gain_mBPB": -15.944260000000154, + "ttt_gain_mBPB": null, + "final": 1.07428482 + }, + { + "cell": "AOS \u03bb=0.005", + "base": "A", + "pre_quant": 1.07579561, + "quantized": 1.09035483, + "sliding": 1.07445292, + "ttt": null, + "quant_cost_mBPB": 14.559220000000206, + "sliding_gain_mBPB": -15.901910000000186, + "ttt_gain_mBPB": null, + "final": 1.07445292 + }, + { + "cell": "HSU \u03bb=0.1", + "base": "A", + "pre_quant": 1.07535949, + "quantized": 1.08993364, + "sliding": 1.07403453, + "ttt": null, + "quant_cost_mBPB": 14.574149999999841, + "sliding_gain_mBPB": -15.899109999999883, + "ttt_gain_mBPB": null, + "final": 1.07403453 + }, + { + "cell": "WBC \u03bb=0.005", + "base": "A", + "pre_quant": 1.07646463, + "quantized": 1.09112469, + "sliding": 1.07521576, + "ttt": null, + "quant_cost_mBPB": 14.660059999999975, + "sliding_gain_mBPB": -15.908929999999932, + "ttt_gain_mBPB": null, + "final": 1.07521576 + }, + { + "cell": "WOP \u03bb=0.5", + "base": "A", + "pre_quant": 1.07536867, + "quantized": 1.08962462, + "sliding": 1.07375795, + "ttt": null, + "quant_cost_mBPB": 14.255949999999906, + "sliding_gain_mBPB": -15.86666999999986, + "ttt_gain_mBPB": null, + "final": 1.07375795 + }, + { + "cell": "PCS \u03bb=0.005", + "base": "A", + "pre_quant": 1.07595394, + "quantized": 1.09052669, + "sliding": 1.07462584, + "ttt": null, + "quant_cost_mBPB": 14.572749999999912, + "sliding_gain_mBPB": -15.900849999999966, + "ttt_gain_mBPB": null, + "final": 1.07462584 + }, + { + "cell": "PQT + ES \u03bb=0.05", + "base": "A-PQT", + "pre_quant": 1.07516222, + "quantized": 1.05144513, + "sliding": 1.03942213, + "ttt": null, + "quant_cost_mBPB": -23.717089999999885, + "sliding_gain_mBPB": -12.023000000000117, + "ttt_gain_mBPB": null, + "final": 1.03942213 + }, + { + "cell": "PQT + QAHSP \u03bb=0.3", + "base": "A-PQT", + "pre_quant": 1.07550024, + "quantized": 1.05180787, + "sliding": 1.03985482, + "ttt": null, + "quant_cost_mBPB": -23.692370000000018, + "sliding_gain_mBPB": -11.95305000000002, + "ttt_gain_mBPB": null, + "final": 1.03985482 + }, + { + "cell": "Base B baseline", + "base": "B", + "pre_quant": 1.06162051, + "quantized": 1.06995915, + "sliding": null, + "ttt": 1.05822408, + "quant_cost_mBPB": 8.338640000000064, + "sliding_gain_mBPB": null, + "ttt_gain_mBPB": -11.73507000000007, + "final": 1.05822408 + }, + { + "cell": "B + SimCTG+QAHSP \u03bb=0.1", + "base": "B", + "pre_quant": 1.06235812, + "quantized": 1.07065788, + "sliding": null, + "ttt": 1.05880775, + "quant_cost_mBPB": 8.299759999999878, + "sliding_gain_mBPB": null, + "ttt_gain_mBPB": -11.850130000000014, + "final": 1.05880775 + }, + { + "cell": "B + ES \u03bb=0.05", + "base": "B", + "pre_quant": 1.06333929, + "quantized": 1.07184449, + "sliding": null, + "ttt": 1.05993433, + 
"quant_cost_mBPB": 8.50519999999988, + "sliding_gain_mBPB": null, + "ttt_gain_mBPB": -11.910160000000003, + "final": 1.05993433 + }, + { + "cell": "B + bigram 1024\u00d78", + "base": "B", + "pre_quant": 1.06223567, + "quantized": 1.07064692, + "sliding": null, + "ttt": 1.05886441, + "quant_cost_mBPB": 8.411249999999981, + "sliding_gain_mBPB": null, + "ttt_gain_mBPB": -11.782509999999968, + "final": 1.05886441 + } +] \ No newline at end of file diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_canonical_metrics.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_canonical_metrics.json new file mode 100644 index 0000000000..41ae5614e8 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_canonical_metrics.json @@ -0,0 +1,44 @@ +{ + "no-reg": { + "isoscore": 0.9196764230728149, + "eff_rank": 96.45550204522421, + "norm_var": 19.190195083618164, + "norm_mean": 27.46596908569336, + "max_abs": 36.5 + }, + "SimCTG": { + "isoscore": 0.8950488567352295, + "eff_rank": 94.40702880943573, + "norm_var": 28.09220314025879, + "norm_mean": 26.198692321777344, + "max_abs": 36.5 + }, + "SimCTG+QAHSP": { + "isoscore": 0.9058330059051514, + "eff_rank": 92.1850517498386, + "norm_var": 29.38678550720215, + "norm_mean": 30.21584701538086, + "max_abs": 41.0 + }, + "SimCTG+ES": { + "isoscore": 0.9128783941268921, + "eff_rank": 95.01078049667903, + "norm_var": 29.09756851196289, + "norm_mean": 29.541982650756836, + "max_abs": 43.0 + }, + "SimCTG+HSU": { + "isoscore": 0.9256589412689209, + "eff_rank": 95.99981971300701, + "norm_var": 22.477293014526367, + "norm_mean": 29.241943359375, + "max_abs": 38.25 + }, + "SimCTG+AOS": { + "isoscore": 0.9227072596549988, + "eff_rank": 99.81859154964323, + "norm_var": 14.296555519104004, + "norm_mean": 28.209720611572266, + "max_abs": 35.25 + } +} \ No newline at end of file diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_reg_quant_matrix.json b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_reg_quant_matrix.json new file mode 100644 index 0000000000..e1fa47c068 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/real_reg_quant_matrix.json @@ -0,0 +1,266 @@ +{ + "no-reg": { + "int4 sym per-tensor": { + "l2_distortion": 8.665924072265625, + "cos_shift": 0.058444440364837646, + "isoscore_post": 0.9379310607910156, + "eff_rank_post": 4.676845098324823 + }, + "int4 sym per-row": { + "l2_distortion": 7.969930648803711, + "cos_shift": 0.049631595611572266, + "isoscore_post": 0.9429349899291992, + "eff_rank_post": 17.07542076078601 + }, + "int4 asym per-row": { + "l2_distortion": 7.040150165557861, + "cos_shift": 0.03654921054840088, + "isoscore_post": 0.9111442565917969, + "eff_rank_post": 78.67987774765582 + }, + "int6 sym per-row": { + "l2_distortion": 4.770608901977539, + "cos_shift": 0.01566535234451294, + "isoscore_post": 0.8909909725189209, + "eff_rank_post": 103.73207476658023 + }, + "int8 sym per-row": { + "l2_distortion": 1.2735472917556763, + "cos_shift": 0.0011124610900878906, + "isoscore_post": 0.9084160327911377, + "eff_rank_post": 97.27140992067748 + }, + "AWQ-lite int4": { + "l2_distortion": 2.503967761993408, + "cos_shift": 0.004290342330932617, + "isoscore_post": 0.9011155962944031, + "eff_rank_post": 100.39177831746376 + }, + "GPTQ-lite int4": { + "l2_distortion": 51.821632385253906, + "cos_shift": 
0.4511241912841797, + "isoscore_post": 0.537254810333252, + "eff_rank_post": 106.64174919958656 + } + }, + "SimCTG": { + "int4 sym per-tensor": { + "l2_distortion": 8.885649681091309, + "cos_shift": 0.06010890007019043, + "isoscore_post": 0.9370186924934387, + "eff_rank_post": 3.1090120294548624 + }, + "int4 sym per-row": { + "l2_distortion": 7.985650062561035, + "cos_shift": 0.04347085952758789, + "isoscore_post": 0.9251681566238403, + "eff_rank_post": 12.936205182701624 + }, + "int4 asym per-row": { + "l2_distortion": 7.186392784118652, + "cos_shift": 0.032656848430633545, + "isoscore_post": 0.9017077684402466, + "eff_rank_post": 62.38439015852314 + }, + "int6 sym per-row": { + "l2_distortion": 5.099906921386719, + "cos_shift": 0.014867603778839111, + "isoscore_post": 0.8821943998336792, + "eff_rank_post": 97.97688454189425 + }, + "int8 sym per-row": { + "l2_distortion": 1.4151151180267334, + "cos_shift": 0.00112152099609375, + "isoscore_post": 0.8963391184806824, + "eff_rank_post": 91.62891844039687 + }, + "AWQ-lite int4": { + "l2_distortion": 2.658930778503418, + "cos_shift": 0.00402677059173584, + "isoscore_post": 0.8892554044723511, + "eff_rank_post": 95.11689794763643 + }, + "GPTQ-lite int4": { + "l2_distortion": 57.328311920166016, + "cos_shift": 0.46479684114456177, + "isoscore_post": 0.38995546102523804, + "eff_rank_post": 99.04585869628714 + } + }, + "SimCTG+QAHSP": { + "int4 sym per-tensor": { + "l2_distortion": 8.90207290649414, + "cos_shift": 0.04147696495056152, + "isoscore_post": 0.9752295017242432, + "eff_rank_post": 3.1338453281846426 + }, + "int4 sym per-row": { + "l2_distortion": 8.396255493164062, + "cos_shift": 0.037551701068878174, + "isoscore_post": 0.9723882675170898, + "eff_rank_post": 10.321481681455186 + }, + "int4 asym per-row": { + "l2_distortion": 7.7742919921875, + "cos_shift": 0.030310511589050293, + "isoscore_post": 0.9479402303695679, + "eff_rank_post": 61.19273380707454 + }, + "int6 sym per-row": { + "l2_distortion": 5.7215166091918945, + "cos_shift": 0.01521909236907959, + "isoscore_post": 0.922627329826355, + "eff_rank_post": 103.85328116432875 + }, + "int8 sym per-row": { + "l2_distortion": 1.6117725372314453, + "cos_shift": 0.0011837482452392578, + "isoscore_post": 0.9344638586044312, + "eff_rank_post": 98.82931581135958 + }, + "AWQ-lite int4": { + "l2_distortion": 2.8120150566101074, + "cos_shift": 0.0035818815231323242, + "isoscore_post": 0.9302618503570557, + "eff_rank_post": 101.67765613469722 + }, + "GPTQ-lite int4": { + "l2_distortion": 65.60872650146484, + "cos_shift": 0.4830549955368042, + "isoscore_post": 0.42430219054222107, + "eff_rank_post": 102.3128287695318 + } + }, + "SimCTG+ES": { + "int4 sym per-tensor": { + "l2_distortion": 8.765625, + "cos_shift": 0.062296152114868164, + "isoscore_post": 0.936721682548523, + "eff_rank_post": 3.9066363675054414 + }, + "int4 sym per-row": { + "l2_distortion": 8.043916702270508, + "cos_shift": 0.05177617073059082, + "isoscore_post": 0.9349973201751709, + "eff_rank_post": 18.594726647796865 + }, + "int4 asym per-row": { + "l2_distortion": 7.1346306800842285, + "cos_shift": 0.0385744571685791, + "isoscore_post": 0.8988239765167236, + "eff_rank_post": 81.55246436667036 + }, + "int6 sym per-row": { + "l2_distortion": 4.726665019989014, + "cos_shift": 0.015722990036010742, + "isoscore_post": 0.8786011934280396, + "eff_rank_post": 103.6412998748405 + }, + "int8 sym per-row": { + "l2_distortion": 1.2452131509780884, + "cos_shift": 0.0010887980461120605, + "isoscore_post": 0.8960879445075989, + 
"eff_rank_post": 96.95184951116325 + }, + "AWQ-lite int4": { + "l2_distortion": 2.50549054145813, + "cos_shift": 0.004423320293426514, + "isoscore_post": 0.8884121179580688, + "eff_rank_post": 99.88234456906645 + }, + "GPTQ-lite int4": { + "l2_distortion": 50.17822265625, + "cos_shift": 0.4484516978263855, + "isoscore_post": 0.4249190390110016, + "eff_rank_post": 100.84690672770277 + } + }, + "SimCTG+HSU": { + "int4 sym per-tensor": { + "l2_distortion": 9.063629150390625, + "cos_shift": 0.0570681095123291, + "isoscore_post": 0.9595622420310974, + "eff_rank_post": 4.471798045349368 + }, + "int4 sym per-row": { + "l2_distortion": 8.505067825317383, + "cos_shift": 0.04947108030319214, + "isoscore_post": 0.9467036128044128, + "eff_rank_post": 19.69998277960704 + }, + "int4 asym per-row": { + "l2_distortion": 7.623517990112305, + "cos_shift": 0.03686553239822388, + "isoscore_post": 0.9114422798156738, + "eff_rank_post": 78.14374941041602 + }, + "int6 sym per-row": { + "l2_distortion": 5.184071063995361, + "cos_shift": 0.01574110984802246, + "isoscore_post": 0.8899648785591125, + "eff_rank_post": 103.23084138151381 + }, + "int8 sym per-row": { + "l2_distortion": 1.3814940452575684, + "cos_shift": 0.0011126995086669922, + "isoscore_post": 0.9071061015129089, + "eff_rank_post": 96.6253216326656 + }, + "AWQ-lite int4": { + "l2_distortion": 2.6982033252716064, + "cos_shift": 0.004202067852020264, + "isoscore_post": 0.9030233025550842, + "eff_rank_post": 100.00491793874347 + }, + "GPTQ-lite int4": { + "l2_distortion": 56.08349609375, + "cos_shift": 0.45256686210632324, + "isoscore_post": 0.538358747959137, + "eff_rank_post": 106.7523560471735 + } + }, + "SimCTG+AOS": { + "int4 sym per-tensor": { + "l2_distortion": 8.788201332092285, + "cos_shift": 0.05601924657821655, + "isoscore_post": 0.9357870817184448, + "eff_rank_post": 3.757452934414734 + }, + "int4 sym per-row": { + "l2_distortion": 8.076977729797363, + "cos_shift": 0.04601740837097168, + "isoscore_post": 0.9340577125549316, + "eff_rank_post": 14.520423604886107 + }, + "int4 asym per-row": { + "l2_distortion": 7.245190620422363, + "cos_shift": 0.034468114376068115, + "isoscore_post": 0.9058271050453186, + "eff_rank_post": 70.79133016702093 + }, + "int6 sym per-row": { + "l2_distortion": 5.06667947769165, + "cos_shift": 0.015466868877410889, + "isoscore_post": 0.8826003670692444, + "eff_rank_post": 100.9460638891694 + }, + "int8 sym per-row": { + "l2_distortion": 1.3779029846191406, + "cos_shift": 0.0011301040649414062, + "isoscore_post": 0.8975257873535156, + "eff_rank_post": 94.78506932919245 + }, + "AWQ-lite int4": { + "l2_distortion": 2.6380958557128906, + "cos_shift": 0.0041866302490234375, + "isoscore_post": 0.8914800882339478, + "eff_rank_post": 98.09973908449749 + }, + "GPTQ-lite int4": { + "l2_distortion": 55.92982864379883, + "cos_shift": 0.47122877836227417, + "isoscore_post": 0.45693013072013855, + "eff_rank_post": 102.4443442124175 + } + } +} \ No newline at end of file diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_after_trains.sh b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_after_trains.sh new file mode 100755 index 0000000000..05f108cf5f --- /dev/null +++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_after_trains.sh @@ -0,0 +1,34 @@ +#!/usr/bin/env bash +# Auto-runs after EmbStudy_aos completes: extract hidden states, run quant matrix, build figures. 
+set -e
+ROOT=/workspace/parameter-golf
+LOG=/tmp/embstudy_analysis.log
+
+echo "[$(date -u +%H:%M:%SZ)] watcher start" | tee -a "$LOG"
+
+# Wait for EmbStudy_aos to appear in combo_consumed.txt (i.e. the daemon has completed it)
+while ! grep -q "EmbStudy_aos" "${ROOT}/parameter-golf/auto/combo_consumed.txt" 2>/dev/null; do
+  sleep 30
+done
+echo "[$(date -u +%H:%M:%SZ)] EmbStudy_aos consumed; checking final_model.pt files exist" | tee -a "$LOG"
+
+# Give the final_model.pt files time to settle (i.e. not still being written)
+sleep 30
+for reg_dir in nosimctg baseline qahsp es hsu aos; do
+  pt="${ROOT}/candidate_pack/N18_baseA_${reg_dir}/final_model.pt"
+  if [ ! -f "$pt" ]; then
+    echo "[$(date -u +%H:%M:%SZ)] WARN: missing $pt" | tee -a "$LOG"
+  else
+    sz=$(stat -c%s "$pt")
+    echo "[$(date -u +%H:%M:%SZ)] OK $reg_dir final_model.pt = $sz bytes" | tee -a "$LOG"
+  fi
+done
+
+echo "[$(date -u +%H:%M:%SZ)] running run_reg_quant_matrix.py" | tee -a "$LOG"
+cd "${ROOT}"
+python3 submissions/C_CrossBase_RegTransfer_Study/run_reg_quant_matrix.py 2>&1 | tee -a "$LOG"
+
+echo "[$(date -u +%H:%M:%SZ)] running build_synergy_figures.py" | tee -a "$LOG"
+python3 submissions/C_CrossBase_RegTransfer_Study/build_synergy_figures.py 2>&1 | tee -a "$LOG"
+
+echo "[$(date -u +%H:%M:%SZ)] done. results in submissions/C_CrossBase_RegTransfer_Study/" | tee -a "$LOG"
diff --git a/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_reg_quant_matrix.py b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_reg_quant_matrix.py
new file mode 100644
index 0000000000..81a7b3c212
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_CrossBase_RegTransfer_Study_OptioAI/run_reg_quant_matrix.py
@@ -0,0 +1,249 @@
+"""
+Real-data reg × quant matrix analysis.
+
+Runs after all 6 EmbStudy_* training cells complete. For each saved BF16 model:
+  1. Load via PyTorch
+  2. Run forward on a small val batch to capture last-block hidden states
+  3. Apply 7 quantization schemes to those hidden states
+  4. Compute per-cell metrics (L2 distortion, cosine shift, isoscore, effective rank)
+  5. Identify (reg, quant) pairs that "play nice" """
+
+import os, sys, math
+import numpy as np
+import torch
+import torch.nn.functional as F
+
+# --- 7 quantization schemes ---
+def quant_int_per_tensor(h, bits, asym=False):
+    qmax = (1 << (bits-1)) - 1
+    if asym:
+        # Asymmetric: affine quant with a zero-point so the full [h_min, h_max] range is used.
+        h_min, h_max = h.min().item(), h.max().item()
+        scale = (h_max - h_min) / (2*qmax + 1)
+        zp = round(-h_min / max(scale, 1e-8) - qmax - 1)
+        h_q = (torch.round(h/scale) + zp).clamp(-qmax-1, qmax)
+        return (h_q - zp) * scale
+    scale = h.abs().max().clamp(min=1e-8) / qmax
+    return torch.round(h/scale).clamp(-qmax-1, qmax) * scale
+
+def quant_int_per_row(h, bits, asym=False):
+    qmax = (1 << (bits-1)) - 1
+    if asym:
+        h_min = h.min(dim=-1, keepdim=True).values
+        h_max = h.max(dim=-1, keepdim=True).values
+        scale = (h_max - h_min).clamp(min=1e-8) / (2*qmax + 1)
+        zp = (-h_min/scale - qmax - 1).round()
+        h_q = (torch.round(h/scale) + zp).clamp(-qmax-1, qmax)
+        return (h_q - zp) * scale
+    scale = h.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-8) / qmax
+    return torch.round(h/scale).clamp(-qmax-1, qmax) * scale
+
+def quant_awq_lite(h, bits=4):
+    """Per-channel scaling (sqrt activation magnitude) before per-row int4."""
+    chan_scale = (h.abs().mean(dim=0, keepdim=True).clamp(min=1e-8))**0.5
+    return quant_int_per_row(h / chan_scale, bits) * chan_scale
+
+def quant_gptq_lite(h, bits=4, damping=0.1):
+    """Per-row scale, column-by-column with running residual (Hessian-free GPTQ proxy)."""
+    qmax = (1 << (bits-1)) - 1
+    n, d = h.shape
+    h_q = torch.zeros_like(h)
+    h_residual = h.clone()
+    scale = h.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-8) / qmax
+    for col in range(d):
+        col_q = torch.round(h_residual[:, col:col+1] / scale).clamp(-qmax-1, qmax) * scale
+        h_q[:, col:col+1] = col_q
+        if col+1 < d:
+            # Spread a damped fraction of this column's rounding error over the remaining columns.
+            err = h_residual[:, col:col+1] - col_q
+            h_residual[:, col+1:] += err * damping
+    return h_q
+
+QUANT_SCHEMES = [
+    ('int4 sym per-tensor', lambda h: quant_int_per_tensor(h, bits=4, asym=False)),
+    ('int4 sym per-row', lambda h: quant_int_per_row(h, bits=4, asym=False)),
+    ('int4 asym per-row', lambda h: quant_int_per_row(h, bits=4, asym=True)),
+    ('int6 sym per-row', lambda h: quant_int_per_row(h, bits=6, asym=False)),
+    ('int8 sym per-row', lambda h: quant_int_per_row(h, bits=8, asym=False)),
+    ('AWQ-lite int4', lambda h: quant_awq_lite(h, bits=4)),
+    ('GPTQ-lite int4', lambda h: quant_gptq_lite(h, bits=4)),
+]
+
+# --- Metrics ---
+def isoscore(h):
+    h_n = F.normalize(h, dim=-1, eps=1e-6)
+    sim = h_n @ h_n.t()
+    n = h_n.size(0)
+    off = sim - torch.eye(n, device=h.device)
+    return off.abs().mean().item()
+
+def effective_rank(h):
+    h_c = h - h.mean(dim=0, keepdim=True)
+    _, S, _ = torch.linalg.svd(h_c, full_matrices=False)
+    p = S / S.sum()
+    p = p[p > 1e-10]
+    return float(np.exp(-(p * p.log()).sum().item()))
+
+def per_token_l2_distortion(h_pre, h_post):
+    return (h_pre - h_post).pow(2).sum(dim=-1).sqrt().mean().item()
+
+def cosine_shift(h_pre, h_post):
+    return 1.0 - F.cosine_similarity(h_pre, h_post, dim=-1).mean().item()
+
+def silhouette(h, labels, n_clusters):
+    """Simplified silhouette over all pairwise distances (exploratory; not written to the saved matrix)."""
+    h_np = h.cpu().numpy() if isinstance(h, torch.Tensor) else h
+    sil = 0.0
+    for i in range(len(h_np)):
+        same = (labels == labels[i]) & (np.arange(len(h_np)) != i)
+        if not any(same): continue
+        a = np.mean([np.linalg.norm(h_np[i] - x) for x in h_np[same]])
+        b_min = float('inf')
+        for c in range(n_clusters):
+            if c == labels[i]: continue
+            other = labels == c
+            if any(other):
+                b = 
np.mean([np.linalg.norm(h_np[i] - x) for x in h_np[other]]) + b_min = min(b_min, b) + if max(a, b_min) > 0: + sil += (b_min - a) / max(a, b_min) + return sil / len(h_np) + +# --- Hidden state extraction --- +def extract_hidden_states(model_dir, n_tokens=128, val_bin_path=None): + """Load the trained BF16 model and run forward on val tokens. + Captures the post-final-block hidden state for each token. + """ + sys.path.insert(0, model_dir) + # Avoid import collision + if 'train_gpt' in sys.modules: + del sys.modules['train_gpt'] + import importlib.util + spec = importlib.util.spec_from_file_location("train_gpt", os.path.join(model_dir, "train_gpt.py")) + train_gpt = importlib.util.module_from_spec(spec) + # Stub out distributed init; we just need the model class + os.environ.setdefault("WORLD_SIZE", "1") + os.environ.setdefault("RANK", "0") + os.environ.setdefault("LOCAL_RANK", "0") + os.environ.setdefault("MASTER_ADDR", "127.0.0.1") + os.environ.setdefault("MASTER_PORT", "29500") + spec.loader.exec_module(train_gpt) + + h_cls = train_gpt.Hyperparameters() if hasattr(train_gpt, 'Hyperparameters') else None + # The model class might be 'GPT' or 'FinalMiniLM' depending on lineage + model_cls = None + for cls_name in ['GPT', 'FinalMiniLM', 'Model']: + if hasattr(train_gpt, cls_name): + model_cls = getattr(train_gpt, cls_name) + break + if model_cls is None: + raise RuntimeError("no model class found in train_gpt.py") + model = model_cls(h_cls) + state_dict = torch.load(os.path.join(model_dir, "final_model.pt"), map_location="cpu", weights_only=False) + if isinstance(state_dict, dict) and 'state_dict' in state_dict: + state_dict = state_dict['state_dict'] + model.load_state_dict(state_dict, strict=False) + model.eval() + # Move to GPU if available + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + model = model.to(device).bfloat16() + + # Load some val tokens + if val_bin_path and os.path.exists(val_bin_path): + toks = np.fromfile(val_bin_path, dtype=np.uint16)[:n_tokens*8].astype(np.int64) + else: + # fallback: random tokens + toks = np.random.randint(0, 8000, size=n_tokens*8) + toks = torch.from_numpy(toks).reshape(8, n_tokens).to(device) + + # Forward through the model and extract last-block hidden state + with torch.no_grad(): + # We need access to internal hidden states. Easiest: hook on the last block. 
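+        # The suffix match below assumes the blocks live in a ModuleList
+        # registered as 'blocks.N': '.10' is the last block of an 11-layer
+        # stack (depth_trajectory.json has 11 per-layer entries), with '.9' as
+        # a fallback for shorter lineages. The hook captures the block output,
+        # i.e. the post-block hidden state.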
+        captured = {}
+        for name, mod in model.named_modules():
+            if 'blocks' in name and isinstance(mod, torch.nn.Module):
+                if name.endswith('.10') or name.endswith('.9'):  # last block
+                    def make_hook(n):
+                        def hook(m, inp, out):
+                            captured[n] = out.detach().cpu().float()
+                        return hook
+                    mod.register_forward_hook(make_hook(name))
+        # Run forward — need to call the appropriate method for this lineage
+        try:
+            _ = model.forward_logits(toks) if hasattr(model, 'forward_logits') else model(toks)
+        except Exception as e:
+            print(f"    forward error: {e}")
+
+    # Get hidden states from the last captured layer
+    if captured:
+        h = list(captured.values())[-1]
+        h = h.reshape(-1, h.size(-1))[:128]  # take 128 tokens
+        return h
+    return None
+
+# --- Main matrix computation ---
+def main():
+    base_dirs = {
+        'no-reg': '/workspace/parameter-golf/candidate_pack/N18_baseA_nosimctg',  # SimCTG=0
+        'SimCTG': '/workspace/parameter-golf/candidate_pack/N18_baseA_baseline',  # SimCTG λ=0.3 only
+        'SimCTG+QAHSP': '/workspace/parameter-golf/candidate_pack/N18_baseA_qahsp',
+        'SimCTG+ES': '/workspace/parameter-golf/candidate_pack/N18_baseA_es',
+        'SimCTG+HSU': '/workspace/parameter-golf/candidate_pack/N18_baseA_hsu',
+        'SimCTG+AOS': '/workspace/parameter-golf/candidate_pack/N18_baseA_aos',
+    }
+
+    # Find a val_bin
+    val_bin = None
+    for p in [
+        '/workspace/parameter-golf/parameter-golf/data/datasets/datasets/fineweb10B_sp10240/fineweb_val_000000.bin',
+        '/workspace/parameter-golf/parameter-golf/data/datasets/fineweb10B_sp10240/fineweb_val_000000.bin',
+    ]:
+        if os.path.exists(p):
+            val_bin = p; break
+    print(f"val_bin: {val_bin}")
+
+    hidden_per_reg = {}
+    for reg, dir_ in base_dirs.items():
+        if not os.path.exists(os.path.join(dir_, "final_model.pt")):
+            print(f"  {reg}: no final_model.pt yet — skipping")
+            continue
+        print(f"  {reg}: loading...")
+        h = extract_hidden_states(dir_, n_tokens=128, val_bin_path=val_bin)
+        if h is None:
+            print("    extraction failed")
+            continue
+        hidden_per_reg[reg] = h
+        print(f"    shape: {tuple(h.shape)}, mean L2: {h.pow(2).sum(-1).sqrt().mean().item():.3f}")
+
+    if not hidden_per_reg:
+        print("No models loaded. Exiting.")
+        return
+
+    # Compute the (reg × quant) matrix
+    print()
+    print("=== Real-data reg × quant matrix ===")
+    n_tok = list(hidden_per_reg.values())[0].size(0)
+    # Crude clustering labels for the optional silhouette metric: 8 equal groups
+    # by token *position* (not token ID); unused in the saved matrix below.
+    labels = np.array([i // (n_tok // 8) for i in range(n_tok)])[:n_tok]
+    n_clusters = 8
+
+    results = {}
+    for reg, h in hidden_per_reg.items():
+        results[reg] = {}
+        for qname, qfn in QUANT_SCHEMES:
+            h_q = qfn(h)
+            results[reg][qname] = {
+                'l2_distortion': per_token_l2_distortion(h, h_q),
+                'cos_shift': cosine_shift(h, h_q),
+                'isoscore_post': isoscore(h_q),
+                'eff_rank_post': effective_rank(h_q),
+            }
+            print(f"  {reg:<10} {qname:<22} l2={results[reg][qname]['l2_distortion']:.4f} cos_shift={results[reg][qname]['cos_shift']:.4f}")
+
+    # Save
+    import json
+    out_path = '/workspace/parameter-golf/submissions/C_CrossBase_RegTransfer_Study/real_reg_quant_matrix.json'
+    open(out_path, 'w').write(json.dumps(results, indent=2))
+    print(f"\nsaved: {out_path}")
+
+if __name__ == "__main__":
+    main()
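+
+# Usage sketch (assumes the six N18_baseA_* dirs above are populated; this is
+# the same sequence run_after_trains.sh runs):
+#   python3 submissions/C_CrossBase_RegTransfer_Study/run_reg_quant_matrix.py
+#   python3 submissions/C_CrossBase_RegTransfer_Study/build_synergy_figures.py
+#
+# Hypothetical smoke test of the quantizers alone, using the names defined above:
+#   h = torch.randn(128, 512)
+#   for name, fn in QUANT_SCHEMES:
+#       print(f"{name:<22} l2={per_token_l2_distortion(h, fn(h)):.4f}")
+# Distortion should fall monotonically from int4 per-tensor to int8 per-row,
+# mirroring the ordering recorded in real_reg_quant_matrix.json.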