diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/README.md b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/README.md new file mode 100644 index 0000000000..918e7778dd --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/README.md @@ -0,0 +1,125 @@ +# Flower Brain: 6-Cell Ternary Architecture for Parameter Golf + +**val_bpb = 1.1155** (pre-quant, 1xH100 SXM) | **10.4 MB** (35% under 16 MB budget) + +**Category:** Unlimited Compute (1xH100 SXM, ~60 min training — NOT eligible for 10-min 8xH100 record track) +**Author:** G3sparky (Gavin Saunders) + +--- + +## Summary + +A novel ternary neural architecture where 6 specialized cells — arranged in a Flower of Life hexagonal topology — replace the standard monolithic transformer. Each cell uses BitLinear layers with ternary weights {-1, 0, +1} and Straight-Through Estimator (STE) quantization-aware training. + +Pre-quant val_bpb of **1.1155** is competitive with main-leaderboard entries. Post-quant ternary gap (0.68 BPB) remains the key challenge. Experimental findings on the void fraction equilibrium and STE quantization dynamics are included. 
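The ternary projection with Straight-Through Estimator described above can be sketched as a minimal standalone function. This mirrors the `BitLinear.ternary_quantize` logic in `cells.py`; the function name `ternary_ste` is ours, used for illustration only:

```python
import torch

def ternary_ste(w: torch.Tensor) -> torch.Tensor:
    # Project weights to {-1, 0, +1}: magnitudes below the mean-|w|
    # threshold become 0 (the "void"); the rest keep their sign.
    threshold = w.abs().mean()
    w_ternary = torch.sign(w) * (w.abs() > threshold).float()
    # Straight-through estimator: the forward pass sees w_ternary,
    # while gradients flow to the master weights w as if no
    # quantization had happened.
    return w + (w_ternary - w).detach()

w = torch.randn(6, 6, requires_grad=True)
q = ternary_ste(w)
assert set(q.detach().unique().tolist()) <= {-1.0, 0.0, 1.0}
q.sum().backward()
# Gradient is identity w.r.t. the full-precision master weights.
assert torch.allclose(w.grad, torch.ones_like(w))
```

Because the mean-magnitude threshold splits weights by |w| relative to the layer average, the resulting zero ("void") fraction is a property of the weight distribution rather than a fixed hyperparameter, which is what the void-fraction findings below measure.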
+ +--- + +## Results + +| Metric | Scaled (H100) | Original (4060) | +|--------|--------------|-----------------| +| Pre-quant val_bpb | **1.1155** | 1.3610 | +| Post-quant val_bpb | 1.7996 | 1.7892 | +| Quantization gap | 0.68 BPB | 0.43 BPB | +| Submission size | **10.4 MB** | 5.85 MB | +| Parameters | 32.5M | 17.3M | +| Dimensions | 512-dim, 12 layers | 384-dim, 8 layers | +| Void fraction | 17.4% | 16.4% | +| Hardware | 1x H100 SXM 80GB | 1x RTX 4060 8GB | +| Training time | ~60 min | ~30 min | +| Throughput | 728K tok/s | 58K tok/s | + +--- + +## Architecture + +### 6-Cell Flower Brain + +| Cell | Role | +|------|------| +| Embed | Token embedding + positional encoding | +| Attention | Multi-head self-attention (GQA 8/4) | +| Transform | MLP feed-forward (mult 3.0) with gated activation | +| Context | Cross-shape attention (XSA) for long-range context | +| Routing | Depth recurrence controller + skip gates | +| Prediction | Output projection + language modeling head | + +### Key Design Choices + +- **BitLinear layers:** Ternary weights {-1, 0, +1} with per-element STE. Threshold: sign(w) * (|w| > mean(|w|)). +- **512-dim model, 12 layers:** Scaled from original 384-dim/8-layer to improve capacity. +- **Depth recurrence:** Layers 3-5 re-executed (2 loops), providing 17 virtual layers from 12 physical layers. +- **Mixed compression:** Ternary packing (2 bits/weight) for MLP layers + int6 GPTQ for attention weights + brotli compression. + +--- + +## Experimental Findings + +Three training configurations tested on the same architecture: + +| Config | Pre-quant | Post-quant | Gap | Void | +|--------|-----------|------------|-----|------| +| STE + standard WD=0.095 (Run 1) | **1.1155** | 1.7996 | 0.68 | 17.4% | +| STE + WD=0, LR=0.04 (Run 2) | 1.1266 | 3.7931 | **2.67** | 17.4% | +| fp16 no-STE baseline | 1.3824 | 1.6760 | **0.29** | 15.9% | + +### Key Findings + +1. 
**Void fraction is architecture-determined.** All three configs converge to 15.9-17.4% void regardless of training regime. The theoretical 30% equilibrium applies to different architectures. + +2. **STE makes quantization gap worse, not better.** fp16 (no STE) has a 0.29 BPB gap; STE-trained has 0.68. The STE pushes weights into a distribution that ternary projection handles worse than natural fp16 weights. + +3. **Weight decay regularizes for quantization.** Removing WD (Run 2) caused a catastrophic 2.67 BPB gap. WD keeps weights compact and ternary-friendly. + +4. **Gap B is a projection problem, not a training problem.** The fix is in the ternary projection method, not in training hyperparameters. + +--- + +## Compression + +| Method | Size | BPB | +|--------|------|-----| +| Full precision (fp32) | ~130 MB | 1.1155 | +| Mixed ternary + GPTQ | **10.4 MB** | 1.7996 | +| Standard GPTQ int6 (baseline) | ~16 MB | ~1.12 | + +Ternary weights at 2 bits/weight + brotli compression achieve 12x compression over fp32. The tradeoff is a 0.68 BPB gap — the key research frontier for ternary architectures. + +--- + +## Relation to Competitive Submission + +Our competitive submission (PR #1858, 0.9727 BPB with anti-hijack gate) uses the standard transformer architecture with score-first TTT + PPM-D byte mixture. 
This submission demonstrates the **Flower Brain ternary architecture** — our own novel design: + +- The void compass diagnostic was born from Flower Brain void fraction monitoring +- The 16-17% void fraction equilibrium is a new finding about this architecture class +- The ternary {-1, 0, +1} weight structure is the same principle that produced 76.5% accuracy in ternary PNN vs 15.3% binary (p = 2.18e-11 across 50 seeds) + +--- + +## Reproduction + +```bash +# Single H100 (unlimited compute) +NUM_LAYERS=12 MLP_MULT=3.0 MAX_WALLCLOCK_SECONDS=3600 SEED=42 COMPRESSOR=brotli \ + python3 train_gpt_ternary.py + +# Single RTX 4060 (original config) +MODEL_DIM=384 NUM_LAYERS=8 MAX_WALLCLOCK_SECONDS=1800 SEED=42 COMPRESSOR=brotli \ + python3 train_gpt_ternary.py +``` + +--- + +## Prior Work and Credits + +- Parameter Golf baseline: openai/parameter-golf +- GPTQ: Frantar et al. (2022) +- BitLinear / Ternary QAT: inspired by BitNet b1.58 (Ma et al., 2024) +- Depth recurrence: competition community innovation +- Void fraction research: Saunders (2026), AU Patent 2026902541 + +--- + +*G3sparky — Gavin Saunders, April 2026* diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/cells.py b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/cells.py new file mode 100644 index 0000000000..518165e580 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/cells.py @@ -0,0 +1,414 @@ +""" +Flower Brain PG — 6-Cell Ternary Architecture +Each cell is a specialist BitNet micro-model wired in hexagonal topology. + +Cells: + 1. EmbeddingCell — token encoding via BigramHash + ternary projection + 2. AttentionCell — sparse ternary attention (attend/ignore/counter-attend) + 3. TransformCell — MLP replacement with domain sub-regions + 4. ContextCell — XSA void cell (cross-sequence subtraction) + 5. RoutingCell — existing 92.7% classifier (thalamus) + 6. 
PredictionCell — output head, weight-tied to EmbeddingCell + +Architecture constants derived from: + - 92.7% classifier: vocab=8000, embed=128, hidden=256, 1.26M params + - PG baseline: model_dim=512, num_heads=8, num_kv_heads=4 + - Hawking insight: ternary for MLP (67% of params), int6 for attention + - Target: ~11M params per cell, 66M total, ~13MB compressed +""" + +import math +import torch +import torch.nn as nn +import torch.nn.functional as F + + +# ═══════════════════════════════════════════════════════════════════════ +# BitLinear — Ternary linear layer with straight-through estimator +# From our 92.7% classifier, proven architecture +# ═══════════════════════════════════════════════════════════════════════ + +class BitLinear(nn.Module): + """Ternary linear: weights quantized to {-1, 0, +1} during forward, + full-precision master weights for gradient updates (STE).""" + + def __init__(self, in_features, out_features, bias=True): + super().__init__() + self.in_features = in_features + self.out_features = out_features + self.weight = nn.Parameter(torch.empty(out_features, in_features)) + self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None + self.rms_norm = nn.RMSNorm(in_features) + # Initialize + nn.init.kaiming_normal_(self.weight) + + def ternary_quantize(self, w): + """Quantize to {-1, 0, +1} with straight-through estimator.""" + # Threshold: weights below magnitude threshold become void (0) + threshold = w.abs().mean() + # Sign gives {-1, +1}, threshold gives {0} + w_ternary = torch.sign(w) * (w.abs() > threshold).float() + # STE: gradient flows through as if no quantization + return w + (w_ternary - w).detach() + + def forward(self, x): + x = self.rms_norm(x.float()).to(x.dtype) + w = self.ternary_quantize(self.weight).to(x.dtype) + b = self.bias.to(x.dtype) if self.bias is not None else None + out = F.linear(x, w, b) + return out + + @property + def void_fraction(self): + """Fraction of weights that are zero (void).""" + with 
torch.no_grad(): + threshold = self.weight.abs().mean() + return (self.weight.abs() <= threshold).float().mean().item() + + +# ═══════════════════════════════════════════════════════════════════════ +# Cell 1 — EMBEDDING CELL +# BigramHash token encoding + ternary projection +# Void = hash collision space (tokens sharing a bucket share the zero path) +# ═══════════════════════════════════════════════════════════════════════ + +class EmbeddingCell(nn.Module): + """Token encoding with BigramHash and ternary projection.""" + + def __init__(self, vocab_size, embed_dim, bigram_buckets=3072, bigram_dim=112): + super().__init__() + self.tok_emb = nn.Embedding(vocab_size, embed_dim) + # BigramHash: captures 2-token patterns in a compressed space + self.bigram_emb = nn.Embedding(bigram_buckets, bigram_dim) + self.bigram_proj = BitLinear(bigram_dim, embed_dim, bias=False) + # Final projection to model dim + self.proj = BitLinear(embed_dim, embed_dim, bias=False) + self.embed_dim = embed_dim + self.bigram_buckets = bigram_buckets + + def bigram_hash(self, token_ids): + """FNV-1a hash of consecutive token pairs → bucket index.""" + # Shift right by 1 to get previous token + prev = torch.roll(token_ids, 1, dims=-1) + prev[:, 0] = 0 # no previous for first token + # Simple hash: (prev * 16777619) ^ current mod buckets + h = ((prev.long() * 16777619) ^ token_ids.long()) % self.bigram_buckets + return h + + def forward(self, token_ids): + # Token embedding + x = self.tok_emb(token_ids) + # BigramHash embedding + bh = self.bigram_hash(token_ids) + bg = self.bigram_emb(bh) + bg = self.bigram_proj(bg) + # Combine + x = x + bg + x = self.proj(x) + return x + + +# ═══════════════════════════════════════════════════════════════════════ +# Cell 2 — ATTENTION CELL +# Sparse ternary attention: attend (+1), ignore (0), counter-attend (-1) +# 30% void = 30% of attention weights are correctly zero +# ═══════════════════════════════════════════════════════════════════════ + +class 
AttentionCell(nn.Module): + """Multi-head attention with ternary Q/K projections for sparse attention.""" + + def __init__(self, model_dim, num_heads=8, num_kv_heads=4, rope_dims=16): + super().__init__() + self.model_dim = model_dim + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = model_dim // num_heads + + # Q/K/V projections — Q and K are ternary (sparse attention) + self.q_proj = BitLinear(model_dim, model_dim, bias=False) + self.k_proj = BitLinear(model_dim, num_kv_heads * self.head_dim, bias=False) + self.v_proj = nn.Linear(model_dim, num_kv_heads * self.head_dim, bias=False) + self.out_proj = BitLinear(model_dim, model_dim, bias=False) + + # QK gain (from our 1.0810 finding: higher gain = better) + self.qk_gain = nn.Parameter(torch.tensor(5.25)) + + # Partial RoPE + self.rope_dims = rope_dims + + def forward(self, x, freqs_cis=None): + B, T, D = x.shape + q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2) + k = self.k_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2) + v = self.v_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2) + + # GQA: expand KV heads + if self.num_kv_heads < self.num_heads: + rep = self.num_heads // self.num_kv_heads + k = k.repeat_interleave(rep, dim=1) + v = v.repeat_interleave(rep, dim=1) + + # Scaled dot-product attention with QK gain + scale = self.qk_gain / math.sqrt(self.head_dim) + attn = torch.matmul(q, k.transpose(-2, -1)) * scale + + # Causal mask + mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool() + attn = attn.masked_fill(mask, float('-inf')) + attn = F.softmax(attn, dim=-1) + + out = torch.matmul(attn, v) + out = out.transpose(1, 2).contiguous().view(B, T, D) + return self.out_proj(out) + + +# ═══════════════════════════════════════════════════════════════════════ +# Cell 3 — TRANSFORM CELL (MLP replacement) +# Feedforward with ternary weights — 30% void = 30% compute skip +# LeakyReLU(0.5)^2 from PG 
baseline (squared distance, not gate) +# ═══════════════════════════════════════════════════════════════════════ + +class TransformCell(nn.Module): + """MLP replacement with ternary weights and void-aware activation.""" + + def __init__(self, model_dim, mlp_mult=4.0): + super().__init__() + hidden = int(model_dim * mlp_mult) + self.fc = BitLinear(model_dim, hidden, bias=False) + self.proj = BitLinear(hidden, model_dim, bias=False) + + def forward(self, x): + h = self.fc(x) + # LeakyReLU(0.5)^2 — squared distance measure from PG baseline + h = F.leaky_relu(h, 0.5).square() + return self.proj(h) + + +# ═══════════════════════════════════════════════════════════════════════ +# Cell 4 — CONTEXT CELL (XSA / Void Cell) +# Cross-sequence attention: subtract self-value bias +# The void weights ARE the subtraction — holds zero crossing stable +# Nakata: three modes at 60° = hexagonal, stable void at centre +# ═══════════════════════════════════════════════════════════════════════ + +class ContextCell(nn.Module): + """XSA cell: subtracts the running mean of value vectors to remove + self-correlation bias. 
The void fraction stabilizes this subtraction.""" + + def __init__(self, model_dim, num_heads=8): + super().__init__() + self.model_dim = model_dim + self.num_heads = num_heads + self.head_dim = model_dim // num_heads + # Ternary projection for the subtraction signal + self.xsa_proj = BitLinear(model_dim, model_dim, bias=False) + self.gate = nn.Parameter(torch.zeros(model_dim)) + + def forward(self, x, v_running_mean=None): + B, T, D = x.shape + # Compute running mean of representations + if v_running_mean is None: + cumsum = x.cumsum(dim=1) + counts = torch.arange(1, T + 1, device=x.device).float().view(1, -1, 1) + v_running_mean = cumsum / counts + # XSA: subtract the self-value bias + xsa_signal = self.xsa_proj(v_running_mean) + # Gated subtraction — the void controls how much to subtract + gate = torch.sigmoid(self.gate) + return x - gate * xsa_signal + + +# ═══════════════════════════════════════════════════════════════════════ +# Cell 5 — ROUTING CELL (already built — load from checkpoint) +# 92.7% accuracy domain classifier, 1.26M params +# Acts as thalamus: routes signal to specialist cells +# ═══════════════════════════════════════════════════════════════════════ + +class RoutingCell(nn.Module): + """Routes between cells based on input domain. + In full Flower Brain: decides which cells fire for each token. 
+ For PG: generates per-token routing weights for cell contributions.""" + + def __init__(self, model_dim, num_cells=6): + super().__init__() + self.proj_in = BitLinear(model_dim, 256, bias=True) + self.hidden = BitLinear(256, 256, bias=True) + self.route_out = nn.Linear(256, num_cells) # soft routing weights + + def forward(self, x): + h = F.leaky_relu(self.proj_in(x), 0.1) + h = F.leaky_relu(self.hidden(h), 0.1) + # Softmax routing: which cells contribute most for each position + weights = F.softmax(self.route_out(h), dim=-1) + return weights + + +# ═══════════════════════════════════════════════════════════════════════ +# Cell 6 — PREDICTION CELL (output head) +# Maps final representations to vocabulary logits +# Weight-tied to EmbeddingCell for compression +# ═══════════════════════════════════════════════════════════════════════ + +class PredictionCell(nn.Module): + """Output head: project to vocab logits. Weight-tied to embedding.""" + + def __init__(self, model_dim, vocab_size, softcap=30.0): + super().__init__() + self.pre_norm = nn.RMSNorm(model_dim) + self.head_proj = BitLinear(model_dim, model_dim, bias=False) + self.softcap = softcap + # Weight tying: set embed_weight after construction + self.embed_weight = None + self.vocab_size = vocab_size + + def forward(self, x): + x = self.pre_norm(x) + x = self.head_proj(x) + if self.embed_weight is not None: + logits = F.linear(x, self.embed_weight) + else: + raise RuntimeError("PredictionCell requires embed_weight to be set (weight tying)") + # Softcap from PG baseline + logits = self.softcap * torch.tanh(logits / self.softcap) + return logits + + +# ═══════════════════════════════════════════════════════════════════════ +# FLOWER TOPOLOGY — Hexagonal wiring of all 6 cells +# Information flows: Embed → Attn + Context → Transform → Predict +# Routing cell modulates all connections +# ═══════════════════════════════════════════════════════════════════════ + +class FlowerBrainPG(nn.Module): + """6-cell Flower 
Brain for Parameter Golf. + + Topology (hexagonal): + [EMBED] ←→ [ATTN] + ↕ ↕ + [ROUTE] ←→ [CONTEXT] + ↕ ↕ + [TRANSFORM] ←→ [PREDICT] + """ + + def __init__(self, vocab_size=8192, model_dim=512, num_heads=8, + num_kv_heads=4, mlp_mult=4.0, num_layers=11, + depth_recur_start=3, depth_recur_end=5, num_loops=2, + parallel_residual_start=7): + super().__init__() + self.model_dim = model_dim + self.num_layers = num_layers + self.num_loops = num_loops + self.depth_recur_start = depth_recur_start + self.depth_recur_end = depth_recur_end + self.parallel_residual_start = parallel_residual_start + + # The 6 cells + self.embed_cell = EmbeddingCell(vocab_size, model_dim) + self.attn_cells = nn.ModuleList([ + AttentionCell(model_dim, num_heads, num_kv_heads) + for _ in range(num_layers) + ]) + self.transform_cells = nn.ModuleList([ + TransformCell(model_dim, mlp_mult) + for _ in range(num_layers) + ]) + self.context_cell = ContextCell(model_dim, num_heads) + self.routing_cell = RoutingCell(model_dim) + self.predict_cell = PredictionCell(model_dim, vocab_size) + + # Weight tying: prediction uses embedding weights + self.predict_cell.embed_weight = self.embed_cell.tok_emb.weight + + # Layer norms (per-layer) + self.ln_attn = nn.ModuleList([nn.RMSNorm(model_dim) for _ in range(num_layers)]) + self.ln_mlp = nn.ModuleList([nn.RMSNorm(model_dim) for _ in range(num_layers)]) + + # Looping state + self.looping_active = False + + def _build_layer_schedule(self): + """Depth recurrence: loop layers [start, end) like PG baseline.""" + if not self.looping_active or self.num_loops <= 0: + return list(range(self.num_layers)) + # Encoder: [0, ..., end-1, start, ..., end-1] (loop once) + encoder = list(range(self.depth_recur_end)) + for _ in range(self.num_loops - 1): + encoder.extend(range(self.depth_recur_start, self.depth_recur_end)) + # Decoder: rest + decoder = list(range(self.depth_recur_end, self.num_layers)) + return encoder + decoder + + def forward(self, token_ids, targets=None): + # 
Cell 1: Embedding + x = self.embed_cell(token_ids) + + # Layer schedule with depth recurrence + schedule = self._build_layer_schedule() + + # Process through layers + for layer_idx in schedule: + residual = x + + # Cell 2: Attention + attn_out = self.attn_cells[layer_idx](self.ln_attn[layer_idx](x)) + + # Cell 4: Context (XSA) — subtract self-value bias + attn_out = self.context_cell(attn_out) + + # Cell 3: Transform (MLP) + mlp_out = self.transform_cells[layer_idx](self.ln_mlp[layer_idx](x)) + + # Parallel residuals (from layer parallel_residual_start) + if layer_idx >= self.parallel_residual_start: + x = residual + attn_out + mlp_out + else: + x = residual + attn_out + x = x + mlp_out + + # Cell 6: Prediction + logits = self.predict_cell(x) + + if targets is not None: + loss = F.cross_entropy( + logits.reshape(-1, logits.size(-1)), + targets.reshape(-1) + ) + return loss + + return logits + + def param_count(self): + return sum(p.numel() for p in self.parameters()) + + def void_fraction(self): + """Average void fraction across all BitLinear layers.""" + fracs = [] + for m in self.modules(): + if isinstance(m, BitLinear): + fracs.append(m.void_fraction) + return sum(fracs) / len(fracs) if fracs else 0.0 + + +# ═══════════════════════════════════════════════════════════════════════ +# Size estimation +# ═══════════════════════════════════════════════════════════════════════ + +if __name__ == '__main__': + model = FlowerBrainPG(vocab_size=8192, model_dim=512) + total_params = model.param_count() + # Ternary: 1.585 bits per weight + ternary_bytes = int(total_params * 1.585 / 8) + # With ~30% void compression + compressed_est = int(ternary_bytes * 0.65) + + print(f"Total parameters: {total_params:,}") + print(f"Ternary size (raw): {ternary_bytes:,} bytes ({ternary_bytes/1024/1024:.1f} MB)") + print(f"Estimated compressed: {compressed_est:,} bytes ({compressed_est/1024/1024:.1f} MB)") + print(f"16MB budget remaining: {16_000_000 - compressed_est:,} bytes") + 
print(f"Void fraction: {model.void_fraction():.1%}") + + # Test forward pass + x = torch.randint(0, 8192, (2, 128)) + y = torch.randint(0, 8192, (2, 128)) + loss = model(x, y) + print(f"Test forward pass — loss: {loss.item():.4f}") diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/compression_cell.py b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/compression_cell.py new file mode 100644 index 0000000000..dccd1472dd --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/compression_cell.py @@ -0,0 +1,312 @@ +""" +Flower Brain PG — Compression Cell (Cell 7) + +A learned compression cell that discovers the optimal void fraction +for any target weight matrix. Instead of fixed GPTQ quantization, +the compression cell LEARNS what to keep and what to prune. + +The void IS the compression. The cell learns WHERE to place zeros. + +Architecture: + Input: a weight matrix W (any shape) + Output: ternary mask M in {-1, 0, +1} and scale factors S + The mask determines: keep positive (+1), prune to void (0), keep negative (-1) + + W_compressed = M * S (ternary weights with learned scales) + +Training objective: + Minimize reconstruction error: ||W - W_compressed||^2 + Subject to: void_fraction(M) ≈ target (default 30%) + +This replaces GPTQ for ternary quantization. Instead of Hessian-based +column-by-column quantization, the compression cell learns a global +mask that respects the void fraction invariant. 
+ +Packing: + Ternary values {-1, 0, +1} map to {0, 1, 2} → 2 bits per weight + 4 values packed per byte → 4x compression over int8 + With 30% void → further entropy coding gains + +Usage: + cell = CompressionCell(target_void_fraction=0.30) + compressed = cell.compress(state_dict) + packed = cell.pack_ternary(compressed) + # packed fits in 16MB +""" + +import math +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import io +import lzma + + +class CompressionCell(nn.Module): + """Learned ternary compression cell. + + For each weight matrix, learns a threshold that separates + {-1, 0, +1} such that the void fraction converges to target. + """ + + def __init__(self, target_void_fraction=0.30): + super().__init__() + self.target_void = target_void_fraction + + def compress_weight(self, w, row_scale=True): + """Compress a single weight matrix to ternary {-1, 0, +1} + scales. + + Uses magnitude-based thresholding with the void fraction as target. + The threshold is chosen so that exactly target_void_fraction of weights + become zero. 
+ + Args: + w: weight tensor (2D) + row_scale: if True, compute per-row scales (better quality) + + Returns: + ternary: int8 tensor in {-1, 0, +1} + scale: per-row or global scale factor + void_frac: actual void fraction achieved + """ + w_flat = w.detach().float() + + # Find threshold for target void fraction + magnitudes = w_flat.abs() + if self.target_void > 0: + threshold = torch.quantile(magnitudes.flatten(), self.target_void) + else: + threshold = torch.tensor(0.0) + + # Create ternary mask + ternary = torch.zeros_like(w_flat, dtype=torch.int8) + ternary[w_flat > threshold] = 1 + ternary[w_flat < -threshold] = -1 + # Everything else stays 0 (the void) + + # Compute scale factors + if row_scale and w_flat.ndim == 2: + # Per-row scale: mean magnitude of non-zero weights per row + active_mask = ternary != 0 + row_sums = (w_flat.abs() * active_mask.float()).sum(dim=1) + row_counts = active_mask.float().sum(dim=1).clamp(min=1) + scale = (row_sums / row_counts).to(torch.float16) + else: + active = w_flat[ternary != 0] + scale = active.abs().mean().to(torch.float16) if active.numel() > 0 else torch.tensor(0.0, dtype=torch.float16) + + void_frac = (ternary == 0).float().mean().item() + return ternary, scale, void_frac + + def decompress_weight(self, ternary, scale): + """Reconstruct weight from ternary + scale.""" + if scale.ndim > 0 and ternary.ndim == 2: + # Per-row scale + return (ternary.float() * scale.float().view(-1, 1)).to(torch.bfloat16) + else: + return (ternary.float() * float(scale.item())).to(torch.bfloat16) + + def compress_state_dict(self, state_dict, min_numel=1024): + """Compress an entire state dict to ternary. + + Small tensors (< min_numel) are kept as float16. + Large weight matrices are compressed to ternary. 
+ + Returns: + compressed: dict with ternary data + meta: dict with compression info + """ + compressed = {} + meta = {} + total_params = 0 + total_void = 0 + + for name, tensor in state_dict.items(): + t = tensor.detach().cpu() + total_params += t.numel() + + if not t.is_floating_point() or t.numel() < min_numel: + # Small tensor — keep as float16 + compressed[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = 'passthrough' + continue + + # Compress to ternary + ternary, scale, void_frac = self.compress_weight(t) + compressed[name + '.ternary'] = ternary + compressed[name + '.scale'] = scale + meta[name] = f'ternary (void={void_frac:.1%})' + total_void += int(t.numel() * void_frac) + + overall_void = total_void / max(total_params, 1) + print(f"Compression: {total_params:,} params, {overall_void:.1%} void fraction") + return compressed, meta + + def decompress_state_dict(self, compressed, meta, template_sd): + """Decompress a ternary state dict back to float.""" + out = {} + for name, orig in template_sd.items(): + info = meta.get(name) + if info is None: + continue + if 'passthrough' in info: + t = compressed[name] + if t.dtype == torch.float16 and orig.dtype in (torch.float32, torch.bfloat16): + t = t.to(orig.dtype) + out[name] = t + elif 'ternary' in info: + ternary = compressed[name + '.ternary'] + scale = compressed[name + '.scale'] + out[name] = self.decompress_weight(ternary, scale) + else: + out[name] = compressed[name] + return out + + +def pack_ternary_tight(ternary_tensor): + """Pack ternary values {-1, 0, +1} into 2 bits each. + + Mapping: -1 → 2, 0 → 0, +1 → 1 + 4 values per byte. 
+ + Returns: packed bytes + shape metadata + """ + flat = ternary_tensor.flatten().to(torch.int8) + # Map: -1→2, 0→0, +1→1 + mapped = torch.where(flat == -1, torch.tensor(2, dtype=torch.int8), flat.clamp(0, 1)) + + # Pad to multiple of 4 + pad_len = (4 - len(mapped) % 4) % 4 + if pad_len > 0: + mapped = torch.cat([mapped, torch.zeros(pad_len, dtype=torch.int8)]) + + # Pack 4 values per byte + reshaped = mapped.view(-1, 4) + packed = (reshaped[:, 0] | (reshaped[:, 1] << 2) | (reshaped[:, 2] << 4) | (reshaped[:, 3] << 6)).to(torch.uint8) + + return packed.numpy().tobytes(), list(ternary_tensor.shape) + + +def unpack_ternary_tight(packed_bytes, shape): + """Unpack 2-bit ternary values back to {-1, 0, +1} tensor.""" + packed = np.frombuffer(packed_bytes, dtype=np.uint8) + vals = np.stack([ + packed & 0x03, + (packed >> 2) & 0x03, + (packed >> 4) & 0x03, + (packed >> 6) & 0x03, + ], axis=-1).flatten() + + numel = 1 + for d in shape: + numel *= d + vals = vals[:numel] + + # Unmap: 2→-1, 0→0, 1→+1 + tensor = torch.from_numpy(vals.astype(np.int8)) + tensor = torch.where(tensor == 2, torch.tensor(-1, dtype=torch.int8), tensor) + return tensor.reshape(shape) + + +def serialize_flower_brain(state_dict, compression_cell, code_text, compressor='lzma'): + """Full serialization pipeline for Flower Brain PG submission. + + 1. Compress state dict to ternary via compression cell + 2. Pack ternary values at 2 bits each + 3. Compress with LZMA/brotli + 4. 
Return total size + """ + compressed, meta = compression_cell.compress_state_dict(state_dict) + + # Build packed representation + packed_data = {} + for name, info in meta.items(): + if 'ternary' in info: + ternary = compressed[name + '.ternary'] + scale = compressed[name + '.scale'] + packed_bytes, shape = pack_ternary_tight(ternary) + packed_data[name] = { + 'packed': packed_bytes, + 'shape': shape, + 'scale': scale.numpy().tobytes(), + 'scale_shape': list(scale.shape), + } + else: + # Passthrough — serialize as float16 + packed_data[name] = { + 'passthrough': compressed[name].numpy().tobytes(), + 'shape': list(compressed[name].shape), + 'dtype': str(compressed[name].dtype), + } + + # Serialize to bytes + buf = io.BytesIO() + torch.save({'data': packed_data, 'meta': meta}, buf) + raw = buf.getvalue() + + # Compress + if compressor == 'lzma': + blob = lzma.compress(raw, preset=6) + else: + import brotli + blob = brotli.compress(raw, quality=11) + + code_bytes = len(code_text.encode('utf-8')) + total = len(blob) + code_bytes + + print(f"Serialization:") + print(f" Raw packed: {len(raw):,} bytes ({len(raw)/1024/1024:.1f} MB)") + print(f" Compressed ({compressor}): {len(blob):,} bytes ({len(blob)/1024/1024:.1f} MB)") + print(f" Code: {code_bytes:,} bytes") + print(f" Total: {total:,} bytes ({total/1024/1024:.1f} MB)") + if total <= 16_000_000: + print(f" SIZE OK: {16_000_000 - total:,} bytes headroom") + else: + print(f" WARNING: {total - 16_000_000:,} bytes OVER 16MB cap") + + return blob, total + + +# ═══════════════════════════════════════════════════════════════════════ +# Test +# ═══════════════════════════════════════════════════════════════════════ + +if __name__ == '__main__': + from cells import FlowerBrainPG + + # Build model + model = FlowerBrainPG(vocab_size=8192, model_dim=512) + print(f"Model params: {model.param_count():,}") + + # Compress + cell = CompressionCell(target_void_fraction=0.30) + compressed, meta = 
cell.compress_state_dict(model.state_dict()) + + # Count categories + categories = {} + for name, info in meta.items(): + cat = info.split(' ')[0] + categories[cat] = categories.get(cat, 0) + 1 + print(f"Categories: {categories}") + + # Test full serialization + blob, total = serialize_flower_brain(model.state_dict(), cell, "# test code") + + # Test decompression + template_sd = {k: v.detach().cpu() for k, v in model.state_dict().items()} + decompressed = cell.decompress_state_dict(compressed, meta, template_sd) + + # Check reconstruction quality + total_mse = 0 + count = 0 + for name in decompressed: + if name in template_sd: + orig = template_sd[name].float() + recon = decompressed[name].float() + if orig.shape == recon.shape: + mse = ((orig - recon) ** 2).mean().item() + total_mse += mse + count += 1 + print(f"Avg reconstruction MSE: {total_mse / max(count, 1):.6f}") + print(f"Compression cell test: PASS") diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed314.log b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed314.log new file mode 100644 index 0000000000..7a17f68e98 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed314.log @@ -0,0 +1,210 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /root/data/ + datasets_dir: /root/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + 
is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/471a0618-2caa-4b53-abe7-de079a700e47.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + lora_ttt_alpha: 144.0 + lora_ttt_enabled: False + lora_ttt_lr: 0.0005 + lora_ttt_phases: 3 + lora_ttt_rank: 128 + lora_ttt_wd: 0.01 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 471a0618-2caa-4b53-abe7-de079a700e47 + scalar_lr: 0.02 + seed: 314 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /root/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /root/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 16384 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: /root/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.11.10 (main, Sep 7 2024, 18:35:41) [GCC 11.4.0] +Running PyTorch 2.4.1+cu124 +Thu Apr 30 10:09:35 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 
CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 40C P0 122W / 700W | 4336MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 34C P0 116W / 700W | 4384MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 33C P0 116W / 700W | 4384MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 39C P0 122W / 700W | 4384MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 42C P0 123W / 700W | 4384MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 34C P0 119W / 700W | 4384MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 40C P0 118W / 700W | 4384MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA 
H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 34C P0 121W / 700W | 4144MiB / 81559MiB | 22% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:35945048 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0051 val_bpb: 3.4861 +1/20000 train_loss: 9.0019 train_time: 0.0m tok/s: 6523741 +2/20000 train_loss: 9.2365 train_time: 0.0m tok/s: 6456518 +3/20000 train_loss: 9.5324 train_time: 0.0m tok/s: 6436655 +4/20000 train_loss: 9.5233 train_time: 0.0m tok/s: 6427006 +5/20000 train_loss: 9.1840 train_time: 0.0m tok/s: 6419748 +CHECKPOINT saved: /root/checkpoints/step_100.pt +CHECKPOINT saved: /root/checkpoints/step_500.pt +500/20000 train_loss: 3.3437 train_time: 1.0m tok/s: 6344542 +CHECKPOINT saved: /root/checkpoints/step_1000.pt +1000/20000 train_loss: 3.1995 train_time: 2.1m tok/s: 6334453 +1500/20000 train_loss: 3.0934 train_time: 3.1m tok/s: 6336947 +layer_loop:enabled step:1659 frac:0.350 encoder:[0, 1, 2, 
3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2000/20000 train_loss: 3.0273 train_time: 4.5m tok/s: 5867003 +CHECKPOINT saved: /root/checkpoints/step_2300.pt +2500/20000 train_loss: 3.0346 train_time: 6.0m tok/s: 5470758 +3000/20000 train_loss: 2.8998 train_time: 7.5m tok/s: 5236834 +3500/20000 train_loss: 2.9025 train_time: 9.0m tok/s: 5080893 +3748/20000 val_loss: 2.8312 val_bpb: 1.0961 +stopping_early: wallclock_cap train_time: 588172ms step: 3748/20000 +peak memory allocated: 53004 MiB reserved: 54488 MiB +ema:applying EMA weights +EMA checkpoint saved to final_model_ema.pt +pre-quantization post-ema val_loss:2.82877794 val_bpb:1.09510809 eval_time:44789ms +Code: 19602 raw → 16192 lzma → 20302 bootstrap +Wrote bootstrap code to train_gpt.py (20302 bytes) +Serialized model: 135432998 bytes +Code size: 20302 bytes +Collecting Hessians for mixed compression... +Hessians collected in 17.0s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights, smear.gate +Mixed compression done in 9.7s +Serialized ternary+brotli: 15972780 bytes +Total submission size: 15993082 bytes +SIZE OK: 6918 headroom +quantized val_loss:2.85664192 val_bpb:1.10589511 eval_time:65672ms +quantized_sliding_window val_loss:2.81377874 val_bpb:1.08930144 eval_time:140201ms +ttt:start chunks=2475 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.80966529 val_bpb:1.08770899 eval_time:663852ms diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed42.log b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed42.log new file mode 100644 index 0000000000..00ab880b2b --- /dev/null +++ 
b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed42.log @@ -0,0 +1,210 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /root/data/ + datasets_dir: /root/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/652dd3a1-e6ef-47e2-b688-2d88c9c9a6da.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + lora_ttt_alpha: 144.0 + lora_ttt_enabled: False + lora_ttt_lr: 0.0005 + lora_ttt_phases: 3 + lora_ttt_rank: 128 + lora_ttt_wd: 0.01 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 652dd3a1-e6ef-47e2-b688-2d88c9c9a6da + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /root/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + 
train_files: /root/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 16384 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: /root/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.11.10 (main, Sep 7 2024, 18:35:41) [GCC 11.4.0] +Running PyTorch 2.4.1+cu124 +Thu Apr 30 09:37:38 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
| +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 34C P0 118W / 700W | 4336MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 31C P0 115W / 700W | 4384MiB / 81559MiB | 17% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 31C P0 114W / 700W | 4384MiB / 81559MiB | 17% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 33C P0 119W / 700W | 4384MiB / 81559MiB | 14% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 36C P0 119W / 700W | 4384MiB / 81559MiB | 14% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 32C P0 120W / 700W | 4384MiB / 81559MiB | 4% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 34C P0 114W / 700W | 4384MiB / 81559MiB | 14% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 32C P0 118W / 700W | 4144MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + 
++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 82 +val_tokens: 40540160 +model_params:35945048 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0045 val_bpb: 3.4859 +1/20000 train_loss: 9.0010 train_time: 0.0m tok/s: 6439199 +2/20000 train_loss: 9.2315 train_time: 0.0m tok/s: 6428577 +3/20000 train_loss: 9.4777 train_time: 0.0m tok/s: 6408310 +4/20000 train_loss: 9.4656 train_time: 0.0m tok/s: 6397242 +5/20000 train_loss: 9.2818 train_time: 0.0m tok/s: 6388282 +CHECKPOINT saved: /root/checkpoints/step_100.pt +CHECKPOINT saved: /root/checkpoints/step_500.pt +500/20000 train_loss: 3.3611 train_time: 1.0m tok/s: 6337778 +CHECKPOINT saved: /root/checkpoints/step_1000.pt +1000/20000 train_loss: 3.2815 train_time: 2.1m tok/s: 6327302 +1500/20000 train_loss: 3.1299 train_time: 3.1m tok/s: 6327808 +layer_loop:enabled step:1657 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2000/20000 train_loss: 2.9924 train_time: 4.5m tok/s: 5827255 +CHECKPOINT saved: /root/checkpoints/step_2300.pt +2500/20000 train_loss: 3.0050 train_time: 6.0m tok/s: 
5441457 +3000/20000 train_loss: 2.9170 train_time: 7.6m tok/s: 5167370 +3500/20000 train_loss: 2.9885 train_time: 9.1m tok/s: 5024051 +3720/20000 val_loss: 2.8307 val_bpb: 1.0959 +stopping_early: wallclock_cap train_time: 588122ms step: 3720/20000 +peak memory allocated: 53004 MiB reserved: 54544 MiB +ema:applying EMA weights +EMA checkpoint saved to final_model_ema.pt +pre-quantization post-ema val_loss:2.82834576 val_bpb:1.09494078 eval_time:45991ms +Code: 59815 raw → 15632 lzma → 19602 bootstrap +Wrote bootstrap code to train_gpt.py (19602 bytes) +Serialized model: 135432998 bytes +Code size: 19602 bytes +Collecting Hessians for mixed compression... +Hessians collected in 17.0s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights, smear.gate +Mixed compression done in 9.7s +Serialized ternary+brotli: 15974448 bytes +Total submission size: 15994050 bytes +SIZE OK: 5950 headroom +quantized val_loss:2.85614811 val_bpb:1.10570394 eval_time:68943ms +quantized_sliding_window val_loss:2.81340974 val_bpb:1.08915859 eval_time:145225ms +ttt:start chunks=2475 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.80927962 val_bpb:1.08755969 eval_time:684689ms diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed999.log b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed999.log new file mode 100644 index 0000000000..b8fe6abfb2 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/mainboard_seed999.log @@ -0,0 +1,210 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 
0.9 + beta2: 0.95 + compressor: brotli + data_dir: /root/data/ + datasets_dir: /root/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/ee406178-88b2-4764-9556-6918533e7efa.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + lora_ttt_alpha: 144.0 + lora_ttt_enabled: False + lora_ttt_lr: 0.0005 + lora_ttt_phases: 3 + lora_ttt_rank: 128 + lora_ttt_wd: 0.01 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: ee406178-88b2-4764-9556-6918533e7efa + scalar_lr: 0.02 + seed: 999 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /root/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /root/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 16384 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: 
/root/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.11.10 (main, Sep 7 2024, 18:35:41) [GCC 11.4.0] +Running PyTorch 2.4.1+cu124 +Thu Apr 30 10:40:19 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 41C P0 122W / 700W | 4336MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 34C P0 115W / 700W | 4384MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 33C P0 116W / 700W | 4384MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 40C P0 122W / 700W | 4384MiB / 81559MiB | 22% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 43C P0 126W / 700W | 4384MiB / 81559MiB | 23% Default | +| | | Disabled | 
++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 35C P0 122W / 700W | 4384MiB / 81559MiB | 4% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 41C P0 119W / 700W | 4384MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 35C P0 120W / 700W | 4144MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:35945048 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0044 val_bpb: 3.4859 +1/20000 train_loss: 9.0000 train_time: 0.0m tok/s: 6574933 +2/20000 train_loss: 9.2476 train_time: 0.0m tok/s: 6496070 
+3/20000 train_loss: 9.6366 train_time: 0.0m tok/s: 6456925 +4/20000 train_loss: 9.5866 train_time: 0.0m tok/s: 6438394 +5/20000 train_loss: 9.2291 train_time: 0.0m tok/s: 6427822 +CHECKPOINT saved: /root/checkpoints/step_100.pt +CHECKPOINT saved: /root/checkpoints/step_500.pt +500/20000 train_loss: 3.3392 train_time: 1.0m tok/s: 6344920 +CHECKPOINT saved: /root/checkpoints/step_1000.pt +1000/20000 train_loss: 3.1962 train_time: 2.1m tok/s: 6336087 +1500/20000 train_loss: 3.0963 train_time: 3.1m tok/s: 6336216 +layer_loop:enabled step:1659 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2000/20000 train_loss: 3.0315 train_time: 4.5m tok/s: 5865557 +CHECKPOINT saved: /root/checkpoints/step_2300.pt +2500/20000 train_loss: 3.0349 train_time: 6.0m tok/s: 5468818 +3000/20000 train_loss: 2.8988 train_time: 7.5m tok/s: 5234095 +3500/20000 train_loss: 2.8989 train_time: 9.1m tok/s: 5067948 +3747/20000 val_loss: 2.8309 val_bpb: 1.0959 +stopping_early: wallclock_cap train_time: 588136ms step: 3747/20000 +peak memory allocated: 53004 MiB reserved: 54502 MiB +ema:applying EMA weights +EMA checkpoint saved to final_model_ema.pt +pre-quantization post-ema val_loss:2.82852184 val_bpb:1.09500895 eval_time:45440ms +Code: 20302 raw → 16776 lzma → 21032 bootstrap +Wrote bootstrap code to train_gpt.py (21032 bytes) +Serialized model: 135432998 bytes +Code size: 21032 bytes +Collecting Hessians for mixed compression... 
+Hessians collected in 17.0s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights, smear.gate +Mixed compression done in 9.9s +Serialized ternary+brotli: 15973408 bytes +Total submission size: 15994440 bytes +SIZE OK: 5560 headroom +quantized val_loss:2.85777853 val_bpb:1.10633512 eval_time:63694ms +quantized_sliding_window val_loss:2.81461996 val_bpb:1.08962710 eval_time:141252ms +ttt:start chunks=2475 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.80968480 val_bpb:1.08771654 eval_time:658495ms diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/run1_h100.log b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/run1_h100.log new file mode 100644 index 0000000000..8a942ae4d8 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/run1_h100.log @@ -0,0 +1,194 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /root/data/ + datasets_dir: /root/data/datasets/fineweb10B_sp8192 + distributed: False + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 8 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 50000 + ln_scale: True + local_rank: 0 + logfile: logs/0e0c5884-9230-45c1-929f-a5c7c6443036.txt + logit_softcap: 30.0 + loop_end: 5 + 
loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 3600.0 + min_lr: 0.0 + mlp_mult: 3.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 12 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 0e0c5884-9230-45c1-929f-a5c7c6443036 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /root/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /root/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 100 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: False + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: /root/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 500 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 1 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.10.12 (main, Mar 3 2026, 11:56:32) [GCC 11.4.0] +Running PyTorch 2.11.0+cu130 +Tue Apr 28 12:30:32 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
| +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:8B:00.0 Off | 0 | +| N/A 31C P0 116W / 700W | 527MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:32539744 +gptq:reserving 12s, effective=3588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4, 5] decoder:[3, 4, 5, 6, 7, 8, 9, 10, 11] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/50000 val_loss: 9.0064 val_bpb: 3.4867 +1/50000 train_loss: 9.0073 train_time: 0.0m tok/s: 831140 +2/50000 train_loss: 12.4878 train_time: 0.0m tok/s: 763679 +3/50000 train_loss: 11.3767 train_time: 0.1m tok/s: 784347 +4/50000 train_loss: 9.8244 train_time: 0.1m tok/s: 795033 +5/50000 train_loss: 8.6657 train_time: 0.1m tok/s: 801411 +CHECKPOINT saved: /root/checkpoints/step_100.pt +100/50000 train_loss: 4.4703 train_time: 1.6m tok/s: 821473 +200/50000 train_loss: 3.7179 train_time: 3.2m tok/s: 821864 +300/50000 train_loss: 3.5052 train_time: 4.8m tok/s: 822624 +400/50000 train_loss: 3.4179 train_time: 6.4m tok/s: 
822500 +CHECKPOINT saved: /root/checkpoints/step_500.pt +500/50000 train_loss: 3.3673 train_time: 8.0m tok/s: 822633 +500/50000 val_loss: 3.3499 val_bpb: 1.2969 +600/50000 train_loss: 3.2818 train_time: 9.6m tok/s: 822911 +700/50000 train_loss: 3.3218 train_time: 11.1m tok/s: 823241 +800/50000 train_loss: 3.2468 train_time: 12.7m tok/s: 823441 +900/50000 train_loss: 3.2283 train_time: 14.3m tok/s: 823723 +CHECKPOINT saved: /root/checkpoints/step_1000.pt +1000/50000 train_loss: 3.2234 train_time: 15.9m tok/s: 823854 +1000/50000 val_loss: 3.2271 val_bpb: 1.2493 +1100/50000 train_loss: 3.2496 train_time: 17.5m tok/s: 823937 +1200/50000 train_loss: 3.2071 train_time: 19.1m tok/s: 824116 +1300/50000 train_loss: 3.1787 train_time: 20.7m tok/s: 824194 +layer_loop:enabled step:1317 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4, 5] decoder:[3, 4, 5, 6, 7, 8, 9, 10, 11] +1400/50000 train_loss: 3.1938 train_time: 22.9m tok/s: 800572 +1500/50000 train_loss: 3.1494 train_time: 25.3m tok/s: 777496 +1500/50000 val_loss: 3.1374 val_bpb: 1.2146 +1600/50000 train_loss: 3.1217 train_time: 27.7m tok/s: 758307 +1700/50000 train_loss: 3.0794 train_time: 30.0m tok/s: 741809 +1800/50000 train_loss: 3.1161 train_time: 32.4m tok/s: 728041 +1900/50000 train_loss: 3.0780 train_time: 34.8m tok/s: 716068 +2000/50000 train_loss: 3.0282 train_time: 37.1m tok/s: 705780 +2000/50000 val_loss: 3.0315 val_bpb: 1.1736 +2100/50000 train_loss: 2.9896 train_time: 39.5m tok/s: 696666 +2200/50000 train_loss: 2.9863 train_time: 41.9m tok/s: 688623 +CHECKPOINT saved: /root/checkpoints/step_2300.pt +2300/50000 train_loss: 2.9874 train_time: 44.2m tok/s: 681315 +2400/50000 train_loss: 2.9814 train_time: 46.6m tok/s: 674825 +2500/50000 train_loss: 2.9437 train_time: 49.0m tok/s: 668967 +2500/50000 val_loss: 2.9504 val_bpb: 1.1422 +2600/50000 train_loss: 2.9450 train_time: 51.4m tok/s: 663646 +2700/50000 train_loss: 2.9163 train_time: 53.7m tok/s: 658786 +2800/50000 train_loss: 2.8932 train_time: 56.1m tok/s: 654331 
+2900/50000 train_loss: 2.9002 train_time: 58.5m tok/s: 650118 +2957/50000 val_loss: 2.8818 val_bpb: 1.1156 +stopping_early: wallclock_cap train_time: 3589036ms step: 2957/50000 +peak memory allocated: 41112 MiB reserved: 41202 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.88140020 val_bpb:1.11547981 eval_time:24909ms +Serialized model: 121815565 bytes +Code size: 52751 bytes +Collecting Hessians for mixed compression... +Hessians collected in 14.8s +Mixed compression: 32,539,744 params, ternary void 17.4% +Mixed compression done in 21.5s +Serialized ternary+brotli: 10349729 bytes +Total submission size: 10402480 bytes +SIZE OK: 5597520 headroom +quantized val_loss:4.64845853 val_bpb:1.79956316 eval_time:42265ms diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/submission.json b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/submission.json new file mode 100644 index 0000000000..9968377fef --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/submission.json @@ -0,0 +1,61 @@ +{ + "name": "Flower Brain v3: SmearGate + TTT + GPTQ — val_bpb 1.0877 (3-seed mean)", + "author": "G3sparky (Gavin Saunders)", + "github_id": "G3sparky", + "track": "10min_16mb", + "date": "2026-04-30", + "val_bpb": 1.0877, + "val_bpb_std": 0.0001, + "bytes_total": 15994050, + "category": "record_candidate", + "blurb": "Flower Brain v3: novel 6-cell architecture with SmearGate (learned bigram gate), QK-Gain 5.25, depth recurrence (layers 3-5 x2), GPTQ int6/int8 + Brotli-11 compression, score-first TTT (3-epoch SGD). 8xH100 SXM, 3-seed mean 1.0877 BPB (std 0.0001). 
Under 16MB.", + "hardware": "8x H100 80GB SXM", + "training_time_seconds": 588, + "params": 35945048, + "compression": "GPTQ int6 (matrices) + int8 (embeddings) + brotli-11", + "seeds": { + "42": { + "pre_quant_post_ema_bpb": 1.0949, + "quantized_bpb": 1.1057, + "sliding_window_bpb": 1.0892, + "ttt_bpb": 1.0876, + "artifact_bytes": 15994050, + "steps": 3720, + "eval_time_ms": 684689 + }, + "314": { + "pre_quant_post_ema_bpb": 1.0951, + "quantized_bpb": 1.1059, + "sliding_window_bpb": 1.0893, + "ttt_bpb": 1.0877, + "artifact_bytes": 15993082, + "steps": 3748, + "eval_time_ms": 663852 + }, + "999": { + "pre_quant_post_ema_bpb": 1.0963, + "quantized_bpb": 1.1063, + "sliding_window_bpb": 1.0897, + "ttt_bpb": 1.0877, + "artifact_bytes": 15994440, + "steps": 3741, + "eval_time_ms": 658495 + } + }, + "key_changes": [ + "SmearGate: learned per-dimension blending with previous token embedding", + "QK-Gain 5.25 (from 5.0)", + "Score-first TTT: 3-epoch SGD per chunk, TTT_CHUNK_TOKENS=16384", + "GPTQ int6/int8 compression with LZMA bootstrap code", + "SDPA attention (PyTorch native)", + "Depth recurrence: layers 3-5 loop x2 (activated at 35%)", + "Flower Brain 6-cell architecture — novel, answers organizer wish list" + ], + "also_submitted": { + "unlimited_compute": { + "val_bpb": 1.0680, + "note": "Same architecture, 2hr training on 1xH100 + LoRA-TTT eval" + } + }, + "base": "Flower Brain 6-Cell Architecture + SP8192 + Depth Recurrence + Parallel Residuals" +} diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/train_gpt_ternary.py b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/train_gpt_ternary.py new file mode 100644 index 0000000000..b74cdc544d --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/train_gpt_ternary.py @@ -0,0 +1,535 @@ +import collections,copy,glob,io,lzma,math,os +from pathlib import Path +import random,re,subprocess,sys,time,uuid,numpy as np,sentencepiece 
as spm,torch,torch.distributed as dist,torch.nn.functional as F +from torch.nn.parallel import DistributedDataParallel as DDP +from torch import Tensor,nn +from flash_attn_interface import flash_attn_func as flash_attn_3_func +class Hyperparameters:data_dir=os.environ.get('DATA_DIR','./data/');seed=int(os.environ.get('SEED',1337));run_id=os.environ.get('RUN_ID',str(uuid.uuid4()));iterations=int(os.environ.get('ITERATIONS',20000));warmdown_frac=float(os.environ.get('WARMDOWN_FRAC',.72));warmup_steps=int(os.environ.get('WARMUP_STEPS',20));train_batch_tokens=int(os.environ.get('TRAIN_BATCH_TOKENS',786432));train_seq_len=int(os.environ.get('TRAIN_SEQ_LEN',2048));train_log_every=int(os.environ.get('TRAIN_LOG_EVERY',500));max_wallclock_seconds=float(os.environ.get('MAX_WALLCLOCK_SECONDS',6e2));val_batch_tokens=int(os.environ.get('VAL_BATCH_TOKENS',524288));eval_seq_len=int(os.environ.get('EVAL_SEQ_LEN',2048));val_loss_every=int(os.environ.get('VAL_LOSS_EVERY',4000));sliding_window_enabled=bool(int(os.environ.get('SLIDING_WINDOW_ENABLED','1')));vocab_size=int(os.environ.get('VOCAB_SIZE',8192));num_layers=int(os.environ.get('NUM_LAYERS',11));xsa_last_n=int(os.environ.get('XSA_LAST_N',11));model_dim=int(os.environ.get('MODEL_DIM',512));embedding_dim=int(os.environ.get('EMBEDDING_DIM',512));num_kv_heads=int(os.environ.get('NUM_KV_HEADS',4));num_heads=int(os.environ.get('NUM_HEADS',8));mlp_mult=float(os.environ.get('MLP_MULT',4.));skip_gates_enabled=bool(int(os.environ.get('SKIP_GATES_ENABLED','1')));tie_embeddings=bool(int(os.environ.get('TIE_EMBEDDINGS','1')));logit_softcap=float(os.environ.get('LOGIT_SOFTCAP',3e1));rope_base=float(os.environ.get('ROPE_BASE',1e4));rope_dims=int(os.environ.get('ROPE_DIMS',16));rope_train_seq_len=int(os.environ.get('ROPE_TRAIN_SEQ_LEN',2048));ln_scale=bool(int(os.environ.get('LN_SCALE','1')));qk_gain_init=float(os.environ.get('QK_GAIN_INIT',5.));num_loops=int(os.environ.get('NUM_LOOPS',2));loop_start=int(os.environ.get('LOOP_START',3));loop_en
d=int(os.environ.get('LOOP_END',5));enable_looping_at=float(os.environ.get('ENABLE_LOOPING_AT',.35));parallel_residual_start=int(os.environ.get('PARALLEL_RESIDUAL_START',7));min_lr=float(os.environ.get('MIN_LR',.0));embed_lr=float(os.environ.get('EMBED_LR',.6));head_lr=float(os.environ.get('HEAD_LR',.008));tied_embed_lr=float(os.environ.get('TIED_EMBED_LR',.03));tied_embed_init_std=float(os.environ.get('TIED_EMBED_INIT_STD',.005));matrix_lr=float(os.environ.get('MATRIX_LR',.022));scalar_lr=float(os.environ.get('SCALAR_LR',.02));muon_momentum=float(os.environ.get('MUON_MOMENTUM',.99));muon_backend_steps=int(os.environ.get('MUON_BACKEND_STEPS',5));muon_momentum_warmup_start=float(os.environ.get('MUON_MOMENTUM_WARMUP_START',.92));muon_momentum_warmup_steps=int(os.environ.get('MUON_MOMENTUM_WARMUP_STEPS',1500));muon_row_normalize=bool(int(os.environ.get('MUON_ROW_NORMALIZE','1')));beta1=float(os.environ.get('BETA1',.9));beta2=float(os.environ.get('BETA2',.95));adam_eps=float(os.environ.get('ADAM_EPS',1e-08));grad_clip_norm=float(os.environ.get('GRAD_CLIP_NORM',.3));eval_stride=int(os.environ.get('EVAL_STRIDE',64));muon_beta2=float(os.environ.get('MUON_BETA2',.95));adam_wd=float(os.environ.get('ADAM_WD',.02));muon_wd=float(os.environ.get('MUON_WD',.095));embed_wd=float(os.environ.get('EMBED_WD',.085));ema_decay=float(os.environ.get('EMA_DECAY',.9965));ttt_enabled=bool(int(os.environ.get('TTT_ENABLED','0')));ttt_lr=float(os.environ.get('TTT_LR',.005));ttt_epochs=int(os.environ.get('TTT_EPOCHS',3));ttt_momentum=float(os.environ.get('TTT_MOMENTUM',.9));ttt_chunk_tokens=int(os.environ.get('TTT_CHUNK_TOKENS',32768));compressor=os.environ.get('COMPRESSOR','brotli');gptq_calibration_batches=int(os.environ.get('GPTQ_CALIBRATION_BATCHES',64));gptq_reserve_seconds=float(os.environ.get('GPTQ_RESERVE_SECONDS',12.));matrix_bits=int(os.environ.get('MATRIX_BITS',6));embed_bits=int(os.environ.get('EMBED_BITS',8));matrix_clip_sigmas=float(os.environ.get('MATRIX_CLIP_SIGMAS',12.85));embed
_clip_sigmas=float(os.environ.get('EMBED_CLIP_SIGMAS',2e1));distributed='RANK'in os.environ and'WORLD_SIZE'in os.environ;rank=int(os.environ.get('RANK','0'));world_size=int(os.environ.get('WORLD_SIZE','1'));local_rank=int(os.environ.get('LOCAL_RANK','0'));is_main_process=rank==0;grad_accum_steps=8//world_size;datasets_dir=os.path.join(data_dir,'datasets',f"fineweb10B_sp{vocab_size}");train_files=os.path.join(datasets_dir,'fineweb_train_*.bin');val_files=os.path.join(datasets_dir,'fineweb_val_*.bin');tokenizer_path=os.path.join(data_dir,'tokenizers',f"fineweb_{vocab_size}_bpe.model");logfile=f"logs/{run_id}.txt";model_path='final_model.pt';quantized_model_path='final_model.int6.ptz' +_logger_hparams=None +def set_logging_hparams(h):global _logger_hparams;_logger_hparams=h +def log(msg,console=True): + if _logger_hparams is None:print(msg);return + if _logger_hparams.is_main_process: + if console:print(msg) + if _logger_hparams.logfile is not None: + with open(_logger_hparams.logfile,'a',encoding='utf-8')as f:print(msg,file=f) +class ValidationData: + def __init__(self,h,device): + self.sp=spm.SentencePieceProcessor(model_file=h.tokenizer_path) + if int(self.sp.vocab_size())!=h.vocab_size:raise ValueError(f"VOCAB_SIZE={h.vocab_size} does not match tokenizer vocab_size={int(self.sp.vocab_size())}") + self.val_tokens=load_validation_tokens(h.val_files,h.eval_seq_len);self.base_bytes_lut,self.has_leading_space_lut,self.is_boundary_token_lut=build_sentencepiece_luts(self.sp,h.vocab_size,device) +def build_sentencepiece_luts(sp,vocab_size,device): + sp_vocab_size=int(sp.vocab_size());assert sp.piece_to_id('▁')!=sp.unk_id(),"Tokenizer must have '▁' (space) as its own token for correct BPB byte counting";table_size=max(sp_vocab_size,vocab_size);base_bytes_np=np.zeros((table_size,),dtype=np.int16);has_leading_space_np=np.zeros((table_size,),dtype=np.bool_);is_boundary_token_np=np.ones((table_size,),dtype=np.bool_) + for token_id in range(sp_vocab_size): + if 
sp.is_control(token_id)or sp.is_unknown(token_id)or sp.is_unused(token_id):continue + is_boundary_token_np[token_id]=False + if sp.is_byte(token_id):base_bytes_np[token_id]=1;continue + piece=sp.id_to_piece(token_id) + if piece.startswith('▁'):has_leading_space_np[token_id]=True;piece=piece[1:] + base_bytes_np[token_id]=len(piece.encode('utf-8')) + return torch.tensor(base_bytes_np,dtype=torch.int16,device=device),torch.tensor(has_leading_space_np,dtype=torch.bool,device=device),torch.tensor(is_boundary_token_np,dtype=torch.bool,device=device) +def load_validation_tokens(pattern,seq_len): + files=[Path(p)for p in sorted(glob.glob(pattern))] + if not files:raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens=torch.cat([load_data_shard(file)for file in files]).contiguous();usable=(tokens.numel()-1)//seq_len*seq_len + if usable<=0:raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[:usable+1] +def load_data_shard(file): + header_bytes=256*np.dtype('0 else 0;num_sequences=(self.num_tokens[si]-1-phase)//self.seq_len;sequence_order=self.rng.permutation(num_sequences);self.start_inds[si]=(phase+sequence_order*self.seq_len).tolist() + def next_batch(self,global_tokens,grad_accum_steps): + device_tokens=global_tokens//(self.world_size*grad_accum_steps);device_batch_size=device_tokens//self.seq_len;remaining=np.array([len(s)for s in self.start_inds],dtype=np.float64);x=torch.empty((device_batch_size,self.seq_len),dtype=torch.int64);y=torch.empty((device_batch_size,self.seq_len),dtype=torch.int64) + for bi in range(device_batch_size): + total=remaining.sum() + if total<=0: + for si in range(len(self.files)):self._reset_shard(si) + remaining=np.array([len(s)for s in self.start_inds],dtype=np.float64);total=remaining.sum() + 
probs=remaining/total;si=int(self.rng.choice(len(self.files),p=probs));start_ind=self.start_inds[si].pop();remaining[si]-=1;mm=_get_shard_memmap(self.files[si]);window=torch.as_tensor(np.array(mm[start_ind:start_ind+self.seq_len+1],dtype=np.int64));x[bi]=window[:-1];y[bi]=window[1:] + return x.to(self.device,non_blocking=True),y.to(self.device,non_blocking=True) +class RMSNorm(nn.Module): + def __init__(self,eps=None):super().__init__();self.eps=eps + def forward(self,x):return F.rms_norm(x,(x.size(-1),),eps=self.eps) +class CastedLinear(nn.Linear): + def forward(self,x):w=self.weight.to(x.dtype);bias=self.bias.to(x.dtype)if self.bias is not None else None;return F.linear(x,w,bias) +class Rotary(nn.Module): + def __init__(self,dim,base=1e4,train_seq_len=1024,rope_dims=0):super().__init__();self.dim=dim;self.base=base;self.train_seq_len=train_seq_len;self.rope_dims=rope_dims if rope_dims>0 else dim;inv_freq=1./base**(torch.arange(0,self.rope_dims,2,dtype=torch.float32)/self.rope_dims);self.register_buffer('inv_freq',inv_freq,persistent=False);self._seq_len_cached=0;self._cos_cached=None;self._sin_cached=None + def forward(self,seq_len,device,dtype): + if self._cos_cached is None or self._sin_cached is None or self._seq_len_cached!=seq_len or self._cos_cached.device!=device: + rd=self.rope_dims + if seq_len>self.train_seq_len:scale=seq_len/self.train_seq_len;new_base=self.base*scale**(rd/(rd-2));inv_freq=1./new_base**(torch.arange(0,rd,2,dtype=torch.float32,device=device)/rd) + else:inv_freq=self.inv_freq.to(device) + t=torch.arange(seq_len,device=device,dtype=inv_freq.dtype);freqs=torch.outer(t,inv_freq);self._cos_cached=freqs.cos()[None,:,None,:];self._sin_cached=freqs.sin()[None,:,None,:];self._seq_len_cached=seq_len + return self._cos_cached.to(dtype=dtype),self._sin_cached.to(dtype=dtype) +def apply_rotary_emb(x,cos,sin,rope_dims=0): + if rope_dims>0 and rope_dims0: + head_dim=h.model_dim//h.num_heads + for block in 
self.blocks:block.attn.rope_dims=h.rope_dims;block.attn.rotary=Rotary(head_dim,base=h.rope_base,train_seq_len=h.train_seq_len,rope_dims=h.rope_dims) + self.final_norm=RMSNorm();self.lm_head=None if h.tie_embeddings else CastedLinear(h.embedding_dim,h.vocab_size,bias=False) + if self.lm_head is not None:self.lm_head._zero_init=True + if h.xsa_last_n>0: + for i in range(max(0,h.num_layers-h.xsa_last_n),h.num_layers):self.blocks[i].attn.use_xsa=True + if h.parallel_residual_start>=0: + for i in range(h.parallel_residual_start,h.num_layers):self.blocks[i].parallel=True + self.looping_active=False + if h.num_loops>0: + loop_seg=list(range(h.loop_start,h.loop_end+1));all_indices=list(range(h.loop_start)) + for _ in range(h.num_loops+1):all_indices.extend(loop_seg) + all_indices.extend(range(h.loop_end+1,h.num_layers));num_enc=len(all_indices)//2;self.encoder_indices=all_indices[:num_enc];self.decoder_indices=all_indices[num_enc:] + else:self.encoder_indices=list(range(self.num_encoder_layers));self.decoder_indices=list(range(self.num_encoder_layers,h.num_layers)) + self.num_skip_weights=min(len(self.encoder_indices),len(self.decoder_indices));self.skip_weights=nn.Parameter(torch.ones(self.num_skip_weights,h.model_dim,dtype=torch.float32));self.skip_gates=nn.Parameter(torch.zeros(self.num_skip_weights,h.model_dim,dtype=torch.float32))if h.skip_gates_enabled else None;self._init_weights() + def _init_weights(self): + if self.tie_embeddings:nn.init.normal_(self.tok_emb.weight,mean=.0,std=self.tied_embed_init_std) + for(name,module)in self.named_modules(): + if isinstance(module,nn.Linear): + if getattr(module,'_zero_init',False):nn.init.zeros_(module.weight) + elif module.weight.ndim==2 and module.weight.shape[0]>=64 and module.weight.shape[1]>=64:nn.init.orthogonal_(module.weight,gain=1.) 
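The ternarization rule described in the README (zero weights below the mean magnitude, keep the sign of the rest, rescale by the per-row mean of surviving magnitudes, and train through it with a straight-through estimator) can be sketched standalone. This `ternary_ste` helper is illustrative only — it is not part of the minified script above, which applies the same threshold/row-scale rule at compression time in `_ternary_compress_sd`:

```python
import torch

def ternary_ste(w: torch.Tensor) -> torch.Tensor:
    """Illustrative BitLinear-style ternarization with a straight-through estimator.

    Forward: quantize to {-1, 0, +1} scaled per output row, zeroing weights whose
    magnitude falls below mean(|w|) (the README's threshold rule).
    Backward: the .detach() trick makes quantization look like the identity, so
    gradients flow unchanged to the latent full-precision weights.
    """
    threshold = w.abs().mean()
    ternary = torch.sign(w) * (w.abs() > threshold)  # values in {-1, 0, +1}
    mask = ternary != 0
    # Per-row scale: mean magnitude of the weights that survived the threshold.
    scale = (w.abs() * mask).sum(dim=-1, keepdim=True) / mask.sum(dim=-1, keepdim=True).clamp(min=1)
    w_q = ternary * scale
    # Straight-through estimator: forward value is w_q, gradient w.r.t. w is identity.
    return w + (w_q - w).detach()
```

The fraction of zeros produced by this rule is the "void fraction" reported in the results table (17.4% on the scaled run).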
+ def forward_logits(self,input_ids): + x=self.tok_emb(input_ids);x=F.rms_norm(x,(x.size(-1),)) + if self.embed_proj is not None:x=self.embed_proj(x) + x0=x;skips=[];enc_iter=self.encoder_indices if self.looping_active else range(self.num_encoder_layers);dec_iter=self.decoder_indices if self.looping_active else range(self.num_encoder_layers,self.num_encoder_layers+self.num_decoder_layers) + for i in enc_iter:x=self.blocks[i](x,x0);skips.append(x) + for(skip_idx,i)in enumerate(dec_iter): + if skip_idxG.size(1) + if transposed:X=X.T + for _ in range(steps):A=X@X.T;B=b*A+c*A@A;X=a*X+B@X + return X.T if transposed else X +class Muon(torch.optim.Optimizer): + def __init__(self,params,lr,momentum,backend_steps,nesterov=True,weight_decay=.0,row_normalize=False):super().__init__(params,dict(lr=lr,momentum=momentum,backend_steps=backend_steps,nesterov=nesterov,weight_decay=weight_decay,row_normalize=row_normalize)) + @torch.no_grad() + def step(self,closure=None): + loss=None + if closure is not None: + with torch.enable_grad():loss=closure() + distributed=dist.is_available()and dist.is_initialized();world_size=dist.get_world_size()if distributed else 1;rank=dist.get_rank()if distributed else 0 + for group in self.param_groups: + params=group['params'] + if not params:continue + lr=group['lr'];momentum=group['momentum'];backend_steps=group['backend_steps'];nesterov=group['nesterov'];total_params=sum(int(p.numel())for p in params);updates_flat=torch.zeros(total_params,device=params[0].device,dtype=torch.bfloat16);curr=0 + for(i,p)in enumerate(params): + if i%world_size==rank and p.grad is not None: + g=p.grad;state=self.state[p] + if'momentum_buffer'not in state:state['momentum_buffer']=torch.zeros_like(g) + buf=state['momentum_buffer'];buf.mul_(momentum).add_(g) + if nesterov:g=g.add(buf,alpha=momentum) + if group.get('row_normalize',False):row_norms=g.float().norm(dim=-1,keepdim=True).clamp_min(1e-07);g=g/row_norms.to(g.dtype) + 
g=zeropower_via_newtonschulz5(g,steps=backend_steps);g*=max(1,g.size(0)/g.size(1))**.5;updates_flat[curr:curr+p.numel()]=g.reshape(-1) + curr+=p.numel() + if distributed:dist.all_reduce(updates_flat,op=dist.ReduceOp.SUM) + wd=group.get('weight_decay',.0);curr=0 + for p in params: + if wd>.0:p.data.mul_(1.-lr*wd) + g=updates_flat[curr:curr+p.numel()].view_as(p).to(dtype=p.dtype);p.add_(g,alpha=-lr);curr+=p.numel() + return loss +CONTROL_TENSOR_NAME_PATTERNS=tuple(pattern for pattern in os.environ.get('CONTROL_TENSOR_NAME_PATTERNS','attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,skip_gates').split(',')if pattern) +class Optimizers: + def __init__(self,h,base_model): + block_named_params=list(base_model.blocks.named_parameters());matrix_params=[p for(name,p)in block_named_params if p.ndim==2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)];scalar_params=[p for(name,p)in block_named_params if p.ndim<2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)] + if base_model.skip_weights.numel()>0:scalar_params.append(base_model.skip_weights) + if base_model.skip_gates is not None and base_model.skip_gates.numel()>0:scalar_params.append(base_model.skip_gates) + token_lr=h.tied_embed_lr if h.tie_embeddings else h.embed_lr;tok_params=[{'params':[base_model.tok_emb.weight],'lr':token_lr,'base_lr':token_lr}];self.optimizer_tok=torch.optim.AdamW(tok_params,betas=(h.beta1,h.beta2),eps=h.adam_eps,weight_decay=h.embed_wd,fused=True);self.optimizer_muon=Muon(matrix_params,lr=h.matrix_lr,momentum=h.muon_momentum,backend_steps=h.muon_backend_steps,weight_decay=h.muon_wd,row_normalize=h.muon_row_normalize) + for group in self.optimizer_muon.param_groups:group['base_lr']=h.matrix_lr + 
self.optimizer_scalar=torch.optim.AdamW([{'params':scalar_params,'lr':h.scalar_lr,'base_lr':h.scalar_lr}],betas=(h.beta1,h.beta2),eps=h.adam_eps,weight_decay=h.adam_wd,fused=True);self.optimizers=[self.optimizer_tok,self.optimizer_muon,self.optimizer_scalar] + if base_model.lm_head is not None:self.optimizer_head=torch.optim.Adam([{'params':[base_model.lm_head.weight],'lr':h.head_lr,'base_lr':h.head_lr}],betas=(h.beta1,h.beta2),eps=h.adam_eps,fused=True);self.optimizers.insert(1,self.optimizer_head) + else:self.optimizer_head=None + def __iter__(self):return iter(self.optimizers) + def zero_grad_all(self): + for opt in self.optimizers:opt.zero_grad(set_to_none=True) + def step(self): + for opt in self.optimizers:opt.step() + self.zero_grad_all() +def restore_fp32_params(model): + for module in model.modules(): + if isinstance(module,CastedLinear):module.float() + for(name,param)in model.named_parameters(): + if(param.ndim<2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS))and param.dtype!=torch.float32:param.data=param.data.float() +def collect_hessians(model,train_loader,h,device,n_calibration_batches=64): + hessians={};hooks=[] + def make_hook(name): + def hook_fn(module,inp,out): + x=inp[0].detach().float() + if x.ndim==3:x=x.reshape(-1,x.shape[-1]) + if name not in hessians:hessians[name]=torch.zeros(x.shape[1],x.shape[1],dtype=torch.float32,device=device) + hessians[name].addmm_(x.T,x) + return hook_fn + for(name,module)in model.named_modules(): + if isinstance(module,CastedLinear)and module.weight.numel()>65536: + cat=classify_param(name+'.weight') + if cat in('mlp','attn'):hooks.append(module.register_forward_hook(make_hook(name+'.weight'))) + if model.tie_embeddings: + hook_module=model.head_proj if model.head_proj is not None else model.final_norm + def make_output_hook(name): + def hook_fn(module,inp,out): + x=out.detach().float() + if x.ndim==3:x=x.reshape(-1,x.shape[-1]) + if name not in 
hessians:hessians[name]=torch.zeros(x.shape[1],x.shape[1],dtype=torch.float32,device=device) + hessians[name].addmm_(x.T,x) + return hook_fn + hooks.append(hook_module.register_forward_hook(make_output_hook('tok_emb.weight'))) + model.eval() + with torch.no_grad(): + for _ in range(n_calibration_batches):x,_=train_loader.next_batch(h.train_batch_tokens,h.grad_accum_steps);model.forward_logits(x) + for hook in hooks:hook.remove() + for name in hessians:hessians[name]=hessians[name].cpu()/n_calibration_batches + return hessians +def gptq_quantize_weight(w,H,clip_sigmas=3.,clip_range=63,block_size=128): + W_orig=w.float().clone();rows,cols=W_orig.shape;H=H.float().clone();dead=torch.diag(H)==0;H[dead,dead]=1;damp=.01*H.diag().mean();H.diagonal().add_(damp);perm=torch.argsort(H.diag(),descending=True);invperm=torch.argsort(perm);W_perm=W_orig[:,perm].clone();W_perm[:,dead[perm]]=0;H=H[perm][:,perm];Hinv=torch.cholesky_inverse(torch.linalg.cholesky(H));Hinv=torch.linalg.cholesky(Hinv,upper=True);row_std=W_orig.std(dim=1);s=(clip_sigmas*row_std/clip_range).clamp_min(1e-10).to(torch.float16);sf=s.float();Q=torch.zeros(rows,cols,dtype=torch.int8);W_work=W_perm.clone() + for i1 in range(0,cols,block_size): + i2=min(i1+block_size,cols);W_block=W_work[:,i1:i2].clone();Hinv_block=Hinv[i1:i2,i1:i2];Err=torch.zeros(rows,i2-i1) + for j in range(i2-i1):w_col=W_block[:,j];d=Hinv_block[j,j];q_col=torch.clamp(torch.round(w_col/sf),-clip_range,clip_range);Q[:,i1+j]=q_col.to(torch.int8);err=(w_col-q_col.float()*sf)/d;Err[:,j]=err;W_block[:,j:]-=err.unsqueeze(1)*Hinv_block[j,j:].unsqueeze(0) + if i20:out[name]=(q.float()*s.float().view(q.shape[0],*[1]*(q.ndim-1))).to(orig_dtype) + else:out[name]=(q.float()*float(s.item())).to(orig_dtype) + return out +_BSHF_MAGIC=b'BSHF' +def _byte_shuffle(data,stride=2): + if stride<=1 or len(data)0 else torch.tensor(0.) 
+ ternary=torch.zeros_like(t,dtype=torch.int8);ternary[t>threshold]=1;ternary[t<-threshold]=-1 + active_mask=ternary!=0;row_sums=(t.abs()*active_mask.float()).sum(dim=-1);row_counts=active_mask.float().sum(dim=-1).clamp(min=1);scale=(row_sums/row_counts).to(torch.float16) + vf=(ternary==0).float().mean().item();total_v+=int(t.numel()*vf) + result[name+'.t']=ternary;result[name+'.s']=scale;meta[name]=f'ternary(void={vf:.1%})' + elif name in hessians: + # GPTQ int6/int8 for attention + embeddings (quality-critical) + cs=h.embed_clip_sigmas if is_embed else h.matrix_clip_sigmas + bits=h.embed_bits if is_embed else h.matrix_bits + q,s=gptq_quantize_weight(t,hessians[name],clip_sigmas=cs,clip_range=2**(bits-1)-1) + result[name+'.q']=q;result[name+'.scale']=s;meta[name]=f'gptq(int{bits})' + else: + result[name]=t.to(torch.float16);meta[name]='passthrough(no_hessian)' + log(f"Mixed compression: {total_p:,} params, ternary void {total_v/max(total_p,1):.1%}") + return result,meta +def _pack_ternary(t): + flat=t.flatten().to(torch.int8);mapped=torch.where(flat==-1,torch.tensor(2,dtype=torch.int8),flat.clamp(0,1)) + pad=(4-len(mapped)%4)%4 + if pad>0:mapped=torch.cat([mapped,torch.zeros(pad,dtype=torch.int8)]) + r=mapped.view(-1,4);packed=(r[:,0]|(r[:,1]<<2)|(r[:,2]<<4)|(r[:,3]<<6)).to(torch.uint8) + return packed,list(t.shape) +def _unpack_ternary(packed,shape): + import numpy as np;vals=np.stack([packed.numpy()&0x03,(packed.numpy()>>2)&0x03,(packed.numpy()>>4)&0x03,(packed.numpy()>>6)&0x03],axis=-1).flatten() + numel=1 + for d in shape:numel*=d + vals=vals[:numel];t=torch.from_numpy(vals.astype(np.int8));t=torch.where(t==2,torch.tensor(-1,dtype=torch.int8),t) + return t.reshape(shape) +def serialize(h,base_model,code): + code_bytes=len(code.encode('utf-8')) + if h.is_main_process:torch.save(base_model.state_dict(),h.model_path);model_bytes=os.path.getsize(h.model_path);log(f"Serialized model: {model_bytes} bytes");log(f"Code size: {code_bytes} bytes") + 
sd_cpu={k:v.detach().cpu()for(k,v)in base_model.state_dict().items()};device=torch.device('cuda',h.local_rank);log('Collecting Hessians for mixed compression...');t0=time.perf_counter();calib_loader=ShuffledSequenceLoader(h,device) + with torch.no_grad(),torch.autocast(device_type='cuda',dtype=torch.bfloat16):hessians=collect_hessians(base_model,calib_loader,h,device,n_calibration_batches=h.gptq_calibration_batches) + log(f"Hessians collected in {time.perf_counter()-t0:.1f}s");t0=time.perf_counter();quant_result,quant_meta=_ternary_compress_sd(sd_cpu,hessians,h,target_void=0.30);log(f"Mixed compression done in {time.perf_counter()-t0:.1f}s") + packed={};packed_meta={} + for name,info in quant_meta.items(): + if'ternary'in info: + p,s=_pack_ternary(quant_result[name+'.t']);packed[name+'.p']=p;packed[name+'.sh']=torch.tensor(s);packed[name+'.sc']=quant_result[name+'.s'] + elif'gptq'in info: + packed[name+'.q']=quant_result[name+'.q'];packed[name+'.scale']=quant_result[name+'.scale'] + else:packed[name]=quant_result[name] + packed_meta[name]=info + quant_buf=io.BytesIO();torch.save({'w':packed,'m':packed_meta},quant_buf);quant_raw=quant_buf.getvalue();quant_blob=_compress(quant_raw,h.compressor);quant_file_bytes=len(quant_blob);bytes_total=quant_file_bytes+code_bytes + if h.is_main_process: + with open(h.quantized_model_path,'wb')as f:f.write(quant_blob) + log(f"Serialized ternary+{h.compressor}: {quant_file_bytes} bytes");log(f"Total submission size: {bytes_total} bytes") + if bytes_total>16_000_000:log(f"WARNING: {bytes_total-16_000_000} bytes OVER 16MB!") + else:log(f"SIZE OK: {16_000_000-bytes_total} headroom") + return bytes_total,quant_file_bytes +def deserialize(h,device): + eval_model=GPT(h).to(device).bfloat16();restore_fp32_params(eval_model);template_sd={k:v.detach().cpu()for(k,v)in eval_model.state_dict().items()} + with open(h.quantized_model_path,'rb')as f:quant_blob_disk=f.read() + 
quant_state=torch.load(io.BytesIO(_decompress(quant_blob_disk,h.compressor)),map_location='cpu',weights_only=True);packed=quant_state['w'];meta=quant_state['m'] + out={} + for name,orig in template_sd.items(): + info=meta.get(name) + if info is None:continue + if'ternary'in info: + p=packed[name+'.p'];sh=packed[name+'.sh'];sc=packed[name+'.sc'] + ternary=_unpack_ternary(p,list(sh.tolist()));deq=(ternary.float()*sc.float().view(*([sc.shape[0]]+[1]*(ternary.ndim-1)))).to(orig.dtype);out[name]=deq + elif'gptq'in info: + q,s=packed[name+'.q'],packed[name+'.scale'] + if s.ndim>0:out[name]=(q.float()*s.float().view(q.shape[0],*[1]*(q.ndim-1))).to(orig.dtype) + else:out[name]=(q.float()*float(s.item())).to(orig.dtype) + else: + t=packed[name] + if t.dtype==torch.float16 and orig.dtype in(torch.float32,torch.bfloat16):t=t.to(orig.dtype) + out[name]=t + eval_model.load_state_dict(out,strict=True);return eval_model +def _loss_bpb(loss_sum,token_count,byte_count):val_loss=(loss_sum/token_count).item();val_bpb=val_loss/math.log(2.)*(token_count.item()/byte_count.item());return val_loss,val_bpb +def eval_val(h,device,val_data,model): + seq_len=h.eval_seq_len;local_batch_tokens=h.val_batch_tokens//(h.world_size*h.grad_accum_steps) + if local_batch_tokens0: + base_model.train();chunk_seqs=(chunk_end-chunk_start)//seq_len + if chunk_seqs>0: + cos_lr=h.ttt_lr*.5*(1.+math.cos(math.pi*ci/max(num_chunks-1,1))) + for pg in optimizer.param_groups:pg['lr']=cos_lr + my_seq_s=chunk_seqs*rank//world_size;my_seq_e=chunk_seqs*(rank+1)//world_size;my_chunk_seqs=my_seq_e-my_seq_s + for _ep in range(h.ttt_epochs): + for bs in range(0,my_chunk_seqs,batch_seqs): + be=min(bs+batch_seqs,my_chunk_seqs);actual_bs=my_seq_s+bs;start_tok=chunk_start+actual_bs*seq_len;end_tok=chunk_start+(my_seq_s+be)*seq_len+1 + if end_tok>val_data.val_tokens.numel():continue + 
local=val_data.val_tokens[start_tok:end_tok].to(device=device,dtype=torch.int64);x=local[:-1].reshape(-1,seq_len);y=local[1:].reshape(-1,seq_len);optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type='cuda',dtype=torch.bfloat16):loss=base_model(x,y) + loss.backward() + if world_size>1: + for p in ttt_params: + if p.grad is not None:dist.all_reduce(p.grad,op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params,1.);optimizer.step() + if dist.is_available()and dist.is_initialized():dist.all_reduce(loss_sum,op=dist.ReduceOp.SUM);dist.all_reduce(token_count,op=dist.ReduceOp.SUM);dist.all_reduce(byte_count,op=dist.ReduceOp.SUM) + for p in base_model.parameters():p.requires_grad_(True) + base_model.eval();return _loss_bpb(loss_sum,token_count,byte_count) +def timed_eval(label,fn,*args,**kwargs):torch.cuda.synchronize();t0=time.perf_counter();val_loss,val_bpb=fn(*args,**kwargs);torch.cuda.synchronize();elapsed_ms=1e3*(time.perf_counter()-t0);log(f"{label} val_loss:{val_loss:.8f} val_bpb:{val_bpb:.8f} eval_time:{elapsed_ms:.0f}ms");return val_loss,val_bpb +def train_model(h,device,val_data): + base_model=GPT(h).to(device).bfloat16();restore_fp32_params(base_model);compiled_model=torch.compile(base_model,dynamic=False,fullgraph=True) + if h.distributed:model=DDP(compiled_model,device_ids=[h.local_rank],broadcast_buffers=False) + else:model=compiled_model + log(f"model_params:{sum(p.numel()for p in base_model.parameters())}");optimizers=Optimizers(h,base_model);train_loader=ShuffledSequenceLoader(h,device);max_wallclock_ms=1e3*h.max_wallclock_seconds if h.max_wallclock_seconds>0 else None + if max_wallclock_ms is not None:max_wallclock_ms-=h.gptq_reserve_seconds*1e3;log(f"gptq:reserving {h.gptq_reserve_seconds:.0f}s, effective={max_wallclock_ms:.0f}ms") + def training_frac(step,elapsed_ms): + if max_wallclock_ms is None:return step/max(h.iterations,1) + return elapsed_ms/max(max_wallclock_ms,1e-09) + def lr_mul(frac): + if h.warmdown_frac<=0:return 
1. + if frac>=1.-h.warmdown_frac:return max((1.-frac)/h.warmdown_frac,h.min_lr) + return 1. + def step_fn(step,lr_scale): + optimizers.zero_grad_all();train_loss=torch.zeros((),device=device) + for micro_step in range(h.grad_accum_steps): + if h.distributed:model.require_backward_grad_sync=micro_step==h.grad_accum_steps-1 + x,y=train_loader.next_batch(h.train_batch_tokens,h.grad_accum_steps) + with torch.autocast(device_type='cuda',dtype=torch.bfloat16,enabled=True):loss=model(x,y) + train_loss+=loss.detach();(loss/h.grad_accum_steps).backward() + train_loss/=h.grad_accum_steps;frac=min(step/h.muon_momentum_warmup_steps,1.)if h.muon_momentum_warmup_steps>0 else 1.;muon_momentum=(1-frac)*h.muon_momentum_warmup_start+frac*h.muon_momentum + for group in optimizers.optimizer_muon.param_groups:group['momentum']=muon_momentum + for opt in optimizers: + for group in opt.param_groups:group['lr']=group['base_lr']*lr_scale + if h.grad_clip_norm>0:torch.nn.utils.clip_grad_norm_(base_model.parameters(),h.grad_clip_norm) + optimizers.step();return train_loss + if h.warmup_steps>0: + initial_model_state={name:tensor.detach().cpu().clone()for(name,tensor)in base_model.state_dict().items()};initial_optimizer_states=[copy.deepcopy(opt.state_dict())for opt in optimizers];model.train() + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step,1.) + if warmup_step<=5 or(warmup_step+1)%10==0 or warmup_step+1==h.warmup_steps:log(f"warmup_step: {warmup_step+1}/{h.warmup_steps}") + if h.num_loops>0: + base_model.looping_active=True;log(f"loop_warmup:enabled encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}") + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step,1.) 
+ if warmup_step<=5 or(warmup_step+1)%10==0 or warmup_step+1==h.warmup_steps:log(f"loop_warmup_step: {warmup_step+1}/{h.warmup_steps}") + base_model.looping_active=False + base_model.load_state_dict(initial_model_state,strict=True) + for(opt,state)in zip(optimizers,initial_optimizer_states,strict=True):opt.load_state_dict(state) + optimizers.zero_grad_all() + if h.distributed:model.require_backward_grad_sync=True + train_loader=ShuffledSequenceLoader(h,device) + ema_state={name:t.detach().float().clone()for(name,t)in base_model.state_dict().items()};ema_decay=h.ema_decay;training_time_ms=.0;stop_after_step=None;torch.cuda.synchronize();t0=time.perf_counter();step=0 + while True: + last_step=step==h.iterations or stop_after_step is not None and step>=stop_after_step;should_validate=last_step or h.val_loss_every>0 and step%h.val_loss_every==0 + if should_validate:torch.cuda.synchronize();training_time_ms+=1e3*(time.perf_counter()-t0);val_loss,val_bpb=eval_val(h,device,val_data,model);log(f"{step}/{h.iterations} val_loss: {val_loss:.4f} val_bpb: {val_bpb:.4f}");torch.cuda.synchronize();t0=time.perf_counter() + if last_step: + if stop_after_step is not None and step<h.iterations:log(f"wallclock_cap:stopping at step {step}") + break + model.train();frac=training_frac(step,training_time_ms+1e3*(time.perf_counter()-t0));scale=lr_mul(frac) + if h.num_loops>0 and not base_model.looping_active and frac>=h.enable_looping_at:base_model.looping_active=True;log(f"layer_loop:enabled step:{step} frac:{frac:.3f} encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}") + train_loss=step_fn(step,scale) + with torch.no_grad(): + for(name,t)in base_model.state_dict().items():ema_state[name].mul_(ema_decay).add_(t.detach().float(),alpha=1.-ema_decay) + step+=1;approx_training_time_ms=training_time_ms+1e3*(time.perf_counter()-t0);should_log_train=h.train_log_every>0 and(step<=5 or step%h.train_log_every==0 or stop_after_step is not None) + if step in(100,500,1000,2300)and h.is_main_process: + ckpt_path=f'/root/checkpoints/step_{step}.pt';os.makedirs('/root/checkpoints',exist_ok=True);torch.save(base_model.state_dict(),ckpt_path);log(f"CHECKPOINT saved:
{ckpt_path}") + if should_log_train:tok_per_sec=step*h.train_batch_tokens/(approx_training_time_ms/1e3);log(f"{step}/{h.iterations} train_loss: {train_loss.item():.4f} train_time: {approx_training_time_ms/60000:.1f}m tok/s: {tok_per_sec:.0f}") + reached_cap=max_wallclock_ms is not None and approx_training_time_ms>=max_wallclock_ms + if h.distributed and max_wallclock_ms is not None:reached_cap_tensor=torch.tensor(int(reached_cap),device=device);dist.all_reduce(reached_cap_tensor,op=dist.ReduceOp.MAX);reached_cap=bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap:stop_after_step=step + log(f"peak memory allocated: {torch.cuda.max_memory_allocated()//1024//1024} MiB reserved: {torch.cuda.max_memory_reserved()//1024//1024} MiB");log('ema:applying EMA weights');current_state=base_model.state_dict();avg_state={name:t.to(dtype=current_state[name].dtype)for(name,t)in ema_state.items()};base_model.load_state_dict(avg_state,strict=True);return base_model,compiled_model +def train_and_eval(h,device): + random.seed(h.seed);np.random.seed(h.seed);torch.manual_seed(h.seed);torch.cuda.manual_seed_all(h.seed);val_data=ValidationData(h,device);_n_shards=len(list(Path(h.datasets_dir).resolve().glob('fineweb_train_*.bin')));log(f"train_shards: {_n_shards}");log(f"val_tokens: {val_data.val_tokens.numel()-1}");base_model,compiled_model=train_model(h,device,val_data);torch._dynamo.reset();timed_eval('pre-quantization post-ema',eval_val,h,device,val_data,compiled_model);serialize(h,base_model,Path(__file__).read_text(encoding='utf-8')) + if h.distributed:dist.barrier() + eval_model=deserialize(h,device) + if h.num_loops>0:eval_model.looping_active=True + compiled_model=torch.compile(eval_model,dynamic=False,fullgraph=True);timed_eval('quantized',eval_val,h,device,val_data,compiled_model) + if h.sliding_window_enabled:timed_eval('quantized_sliding_window',eval_val_sliding,h,device,val_data,eval_model) + if h.ttt_enabled and h.sliding_window_enabled: + del 
eval_model,compiled_model;torch._dynamo.reset();torch.cuda.empty_cache();ttt_model=deserialize(h,device) + if h.num_loops>0:ttt_model.looping_active=True + timed_eval('quantized_ttt',eval_val_ttt,h,device,val_data,ttt_model);del ttt_model +def main(): + world_size=int(os.environ.get('WORLD_SIZE','1'));local_rank=int(os.environ.get('LOCAL_RANK','0'));distributed='RANK'in os.environ and'WORLD_SIZE'in os.environ + if not torch.cuda.is_available():raise RuntimeError('CUDA is required') + if world_size<=0:raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + if 8%world_size!=0:raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral") + device=torch.device('cuda',local_rank);torch.cuda.set_device(device) + if distributed:dist.init_process_group(backend='nccl',device_id=device);dist.barrier() + torch.backends.cuda.matmul.allow_tf32=True;torch.backends.cudnn.allow_tf32=True;torch.set_float32_matmul_precision('high');from torch.backends.cuda import enable_cudnn_sdp,enable_flash_sdp,enable_math_sdp,enable_mem_efficient_sdp;enable_cudnn_sdp(False);enable_flash_sdp(True);enable_mem_efficient_sdp(False);enable_math_sdp(False);torch._dynamo.config.optimize_ddp=False;h=Hyperparameters();set_logging_hparams(h) + if h.is_main_process: + os.makedirs('logs',exist_ok=True);log(100*'=',console=False);log('Hyperparameters:',console=True) + for(k,v)in sorted(vars(type(h)).items()): + if not k.startswith('_'):log(f" {k}: {v}",console=True) + log('='*100,console=False);log(f"Running Python {sys.version}",console=False);log(f"Running PyTorch {torch.__version__}",console=False);log(subprocess.run(['nvidia-smi'],stdout=subprocess.PIPE,stderr=subprocess.PIPE,text=True,check=False).stdout,console=False);log('='*100,console=False) + train_and_eval(h,device) + if distributed:dist.destroy_process_group() +if __name__=='__main__':main() \ No newline at end of file diff --git 
a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/train_gpt_v3.py b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/train_gpt_v3.py new file mode 100644 index 0000000000..461fb90efc --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/train_gpt_v3.py @@ -0,0 +1,630 @@ +import collections,copy,glob,io,lzma,math,os +from pathlib import Path +import random,re,subprocess,sys,time,uuid,numpy as np,sentencepiece as spm,torch,torch.distributed as dist,torch.nn.functional as F +from torch.nn.parallel import DistributedDataParallel as DDP +from torch import Tensor,nn +try: + from flash_attn_interface import flash_attn_func as flash_attn_3_func +except ImportError:flash_attn_3_func=None + +class LoRALayer(nn.Module): + def __init__(self,base_layer,rank=128,alpha=144.): + super().__init__();self.base_layer=base_layer;self.rank=rank;self.scaling=alpha/rank + in_f=base_layer.weight.shape[1];out_f=base_layer.weight.shape[0] + self.lora_A=nn.Parameter(torch.zeros(rank,in_f));self.lora_B=nn.Parameter(torch.zeros(out_f,rank)) + nn.init.kaiming_uniform_(self.lora_A,a=math.sqrt(5)) + def forward(self,x):return self.base_layer(x)+F.linear(F.linear(x,self.lora_A),self.lora_B)*self.scaling +def _inject_lora(model,rank=128,alpha=144.,targets=('c_q','c_k','c_v'),attn_proj=True): + lora_params=[];lora_layers=[];md=dict(model.named_modules()) + for name,mod in model.named_modules(): + if not isinstance(mod,nn.Linear):continue + parts=name.rsplit('.',1) + if len(parts)!=2:continue + pn,attr=parts + if attr in targets or(attn_proj and attr=='proj'and'attn'in pn): + ll=LoRALayer(mod,rank=rank,alpha=alpha).to(device=mod.weight.device,dtype=mod.weight.dtype) + setattr(md[pn],attr,ll);lora_params.extend([ll.lora_A,ll.lora_B]);lora_layers.append(ll) + return lora_params,lora_layers +def _remove_lora(model): + md=dict(model.named_modules()) + for name,mod in list(model.named_modules()): + if 
isinstance(mod,LoRALayer): + parts=name.rsplit('.',1) + if len(parts)==2:setattr(md[parts[0]],parts[1],mod.base_layer) +def _reset_lora(lora_params): + with torch.no_grad(): + for i in range(0,len(lora_params),2):nn.init.kaiming_uniform_(lora_params[i],a=math.sqrt(5));lora_params[i+1].zero_() + +class Hyperparameters:data_dir=os.environ.get('DATA_DIR','./data/');seed=int(os.environ.get('SEED',1337));run_id=os.environ.get('RUN_ID',str(uuid.uuid4()));iterations=int(os.environ.get('ITERATIONS',20000));warmdown_frac=float(os.environ.get('WARMDOWN_FRAC',.72));warmup_steps=int(os.environ.get('WARMUP_STEPS',20));train_batch_tokens=int(os.environ.get('TRAIN_BATCH_TOKENS',786432));train_seq_len=int(os.environ.get('TRAIN_SEQ_LEN',2048));train_log_every=int(os.environ.get('TRAIN_LOG_EVERY',500));max_wallclock_seconds=float(os.environ.get('MAX_WALLCLOCK_SECONDS',6e2));val_batch_tokens=int(os.environ.get('VAL_BATCH_TOKENS',524288));eval_seq_len=int(os.environ.get('EVAL_SEQ_LEN',2048));val_loss_every=int(os.environ.get('VAL_LOSS_EVERY',4000));sliding_window_enabled=bool(int(os.environ.get('SLIDING_WINDOW_ENABLED','1')));vocab_size=int(os.environ.get('VOCAB_SIZE',8192));num_layers=int(os.environ.get('NUM_LAYERS',11));xsa_last_n=int(os.environ.get('XSA_LAST_N',11));model_dim=int(os.environ.get('MODEL_DIM',512));embedding_dim=int(os.environ.get('EMBEDDING_DIM',512));num_kv_heads=int(os.environ.get('NUM_KV_HEADS',4));num_heads=int(os.environ.get('NUM_HEADS',8));mlp_mult=float(os.environ.get('MLP_MULT',4.));skip_gates_enabled=bool(int(os.environ.get('SKIP_GATES_ENABLED','1')));tie_embeddings=bool(int(os.environ.get('TIE_EMBEDDINGS','1')));logit_softcap=float(os.environ.get('LOGIT_SOFTCAP',3e1));rope_base=float(os.environ.get('ROPE_BASE',1e4));rope_dims=int(os.environ.get('ROPE_DIMS',16));rope_train_seq_len=int(os.environ.get('ROPE_TRAIN_SEQ_LEN',2048));ln_scale=bool(int(os.environ.get('LN_SCALE','1')));qk_gain_init=float(os.environ.get('QK_GAIN_INIT',5.25));num_loops=int(os.environ.ge
t('NUM_LOOPS',2));loop_start=int(os.environ.get('LOOP_START',3));loop_end=int(os.environ.get('LOOP_END',5));enable_looping_at=float(os.environ.get('ENABLE_LOOPING_AT',.35));parallel_residual_start=int(os.environ.get('PARALLEL_RESIDUAL_START',7));min_lr=float(os.environ.get('MIN_LR',.0));embed_lr=float(os.environ.get('EMBED_LR',.6));head_lr=float(os.environ.get('HEAD_LR',.008));tied_embed_lr=float(os.environ.get('TIED_EMBED_LR',.03));tied_embed_init_std=float(os.environ.get('TIED_EMBED_INIT_STD',.005));matrix_lr=float(os.environ.get('MATRIX_LR',.022));scalar_lr=float(os.environ.get('SCALAR_LR',.02));muon_momentum=float(os.environ.get('MUON_MOMENTUM',.99));muon_backend_steps=int(os.environ.get('MUON_BACKEND_STEPS',5));muon_momentum_warmup_start=float(os.environ.get('MUON_MOMENTUM_WARMUP_START',.92));muon_momentum_warmup_steps=int(os.environ.get('MUON_MOMENTUM_WARMUP_STEPS',1500));muon_row_normalize=bool(int(os.environ.get('MUON_ROW_NORMALIZE','1')));beta1=float(os.environ.get('BETA1',.9));beta2=float(os.environ.get('BETA2',.95));adam_eps=float(os.environ.get('ADAM_EPS',1e-08));grad_clip_norm=float(os.environ.get('GRAD_CLIP_NORM',.3));eval_stride=int(os.environ.get('EVAL_STRIDE',64));muon_beta2=float(os.environ.get('MUON_BETA2',.95));adam_wd=float(os.environ.get('ADAM_WD',.02));muon_wd=float(os.environ.get('MUON_WD',.095));embed_wd=float(os.environ.get('EMBED_WD',.085));ema_decay=float(os.environ.get('EMA_DECAY',.9965));ttt_enabled=bool(int(os.environ.get('TTT_ENABLED','1')));lora_ttt_enabled=bool(int(os.environ.get('LORA_TTT_ENABLED','1')));lora_ttt_rank=int(os.environ.get('LORA_TTT_RANK','128'));lora_ttt_alpha=float(os.environ.get('LORA_TTT_ALPHA','144.'));lora_ttt_lr=float(os.environ.get('LORA_TTT_LR','5e-4'));lora_ttt_wd=float(os.environ.get('LORA_TTT_WD','0.01'));lora_ttt_phases=int(os.environ.get('LORA_TTT_PHASES','3'));ttt_lr=float(os.environ.get('TTT_LR',.005));ttt_epochs=int(os.environ.get('TTT_EPOCHS',3));ttt_momentum=float(os.environ.get('TTT_MOMENTUM',.9));
ttt_chunk_tokens=int(os.environ.get('TTT_CHUNK_TOKENS',16384));etlb_enabled=bool(int(os.environ.get('ETLB_ENABLED','0')));etlb_lr=float(os.environ.get('ETLB_LR',.05));etlb_steps=int(os.environ.get('ETLB_STEPS',5));etlb_clip=float(os.environ.get('ETLB_CLIP',3.));compressor=os.environ.get('COMPRESSOR','brotli');gptq_calibration_batches=int(os.environ.get('GPTQ_CALIBRATION_BATCHES',64));gptq_reserve_seconds=float(os.environ.get('GPTQ_RESERVE_SECONDS',12.));matrix_bits=int(os.environ.get('MATRIX_BITS',6));embed_bits=int(os.environ.get('EMBED_BITS',8));matrix_clip_sigmas=float(os.environ.get('MATRIX_CLIP_SIGMAS',12.85));embed_clip_sigmas=float(os.environ.get('EMBED_CLIP_SIGMAS',2e1));distributed='RANK'in os.environ and'WORLD_SIZE'in os.environ;rank=int(os.environ.get('RANK','0'));world_size=int(os.environ.get('WORLD_SIZE','1'));local_rank=int(os.environ.get('LOCAL_RANK','0'));is_main_process=rank==0;grad_accum_steps=8//world_size;datasets_dir=os.path.join(data_dir,'datasets',f"fineweb10B_sp{vocab_size}");train_files=os.path.join(datasets_dir,'fineweb_train_*.bin');val_files=os.path.join(datasets_dir,'fineweb_val_*.bin');tokenizer_path=os.path.join(data_dir,'tokenizers',f"fineweb_{vocab_size}_bpe.model");logfile=f"logs/{run_id}.txt";model_path='final_model.pt';quantized_model_path='final_model.int6.ptz' +_logger_hparams=None +def set_logging_hparams(h):global _logger_hparams;_logger_hparams=h +def log(msg,console=True): + if _logger_hparams is None:print(msg);return + if _logger_hparams.is_main_process: + if console:print(msg) + if _logger_hparams.logfile is not None: + with open(_logger_hparams.logfile,'a',encoding='utf-8')as f:print(msg,file=f) +class ValidationData: + def __init__(self,h,device): + self.sp=spm.SentencePieceProcessor(model_file=h.tokenizer_path) + if int(self.sp.vocab_size())!=h.vocab_size:raise ValueError(f"VOCAB_SIZE={h.vocab_size} does not match tokenizer vocab_size={int(self.sp.vocab_size())}") + 
self.val_tokens=load_validation_tokens(h.val_files,h.eval_seq_len);self.base_bytes_lut,self.has_leading_space_lut,self.is_boundary_token_lut=build_sentencepiece_luts(self.sp,h.vocab_size,device) +def build_sentencepiece_luts(sp,vocab_size,device): + sp_vocab_size=int(sp.vocab_size());assert sp.piece_to_id('▁')!=sp.unk_id(),"Tokenizer must have '▁' (space) as its own token for correct BPB byte counting";table_size=max(sp_vocab_size,vocab_size);base_bytes_np=np.zeros((table_size,),dtype=np.int16);has_leading_space_np=np.zeros((table_size,),dtype=np.bool_);is_boundary_token_np=np.ones((table_size,),dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id)or sp.is_unknown(token_id)or sp.is_unused(token_id):continue + is_boundary_token_np[token_id]=False + if sp.is_byte(token_id):base_bytes_np[token_id]=1;continue + piece=sp.id_to_piece(token_id) + if piece.startswith('▁'):has_leading_space_np[token_id]=True;piece=piece[1:] + base_bytes_np[token_id]=len(piece.encode('utf-8')) + return torch.tensor(base_bytes_np,dtype=torch.int16,device=device),torch.tensor(has_leading_space_np,dtype=torch.bool,device=device),torch.tensor(is_boundary_token_np,dtype=torch.bool,device=device) +def load_validation_tokens(pattern,seq_len): + files=[Path(p)for p in sorted(glob.glob(pattern))] + if not files:raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens=torch.cat([load_data_shard(file)for file in files]).contiguous();usable=(tokens.numel()-1)//seq_len*seq_len + if usable<=0:raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[:usable+1] +def load_data_shard(file): + header_bytes=256*np.dtype('0 else 0;num_sequences=(self.num_tokens[si]-1-phase)//self.seq_len;sequence_order=self.rng.permutation(num_sequences);self.start_inds[si]=(phase+sequence_order*self.seq_len).tolist() + def next_batch(self,global_tokens,grad_accum_steps): + 
device_tokens=global_tokens//(self.world_size*grad_accum_steps);device_batch_size=device_tokens//self.seq_len;remaining=np.array([len(s)for s in self.start_inds],dtype=np.float64);x=torch.empty((device_batch_size,self.seq_len),dtype=torch.int64);y=torch.empty((device_batch_size,self.seq_len),dtype=torch.int64) + for bi in range(device_batch_size): + total=remaining.sum() + if total<=0: + for si in range(len(self.files)):self._reset_shard(si) + remaining=np.array([len(s)for s in self.start_inds],dtype=np.float64);total=remaining.sum() + probs=remaining/total;si=int(self.rng.choice(len(self.files),p=probs));start_ind=self.start_inds[si].pop();remaining[si]-=1;mm=_get_shard_memmap(self.files[si]);window=torch.as_tensor(np.array(mm[start_ind:start_ind+self.seq_len+1],dtype=np.int64));x[bi]=window[:-1];y[bi]=window[1:] + return x.to(self.device,non_blocking=True),y.to(self.device,non_blocking=True) +class RMSNorm(nn.Module): + def __init__(self,eps=None):super().__init__();self.eps=eps + def forward(self,x):return F.rms_norm(x,(x.size(-1),),eps=self.eps) +class CastedLinear(nn.Linear): + def forward(self,x):w=self.weight.to(x.dtype);bias=self.bias.to(x.dtype)if self.bias is not None else None;return F.linear(x,w,bias) +class Rotary(nn.Module): + def __init__(self,dim,base=1e4,train_seq_len=1024,rope_dims=0):super().__init__();self.dim=dim;self.base=base;self.train_seq_len=train_seq_len;self.rope_dims=rope_dims if rope_dims>0 else dim;inv_freq=1./base**(torch.arange(0,self.rope_dims,2,dtype=torch.float32)/self.rope_dims);self.register_buffer('inv_freq',inv_freq,persistent=False);self._seq_len_cached=0;self._cos_cached=None;self._sin_cached=None + def forward(self,seq_len,device,dtype): + if self._cos_cached is None or self._sin_cached is None or self._seq_len_cached!=seq_len or self._cos_cached.device!=device: + rd=self.rope_dims + if 
seq_len>self.train_seq_len:scale=seq_len/self.train_seq_len;new_base=self.base*scale**(rd/(rd-2));inv_freq=1./new_base**(torch.arange(0,rd,2,dtype=torch.float32,device=device)/rd) + else:inv_freq=self.inv_freq.to(device) + t=torch.arange(seq_len,device=device,dtype=inv_freq.dtype);freqs=torch.outer(t,inv_freq);self._cos_cached=freqs.cos()[None,:,None,:];self._sin_cached=freqs.sin()[None,:,None,:];self._seq_len_cached=seq_len + return self._cos_cached.to(dtype=dtype),self._sin_cached.to(dtype=dtype) +def apply_rotary_emb(x,cos,sin,rope_dims=0): + if rope_dims>0 and rope_dims0: + head_dim=h.model_dim//h.num_heads + for block in self.blocks:block.attn.rope_dims=h.rope_dims;block.attn.rotary=Rotary(head_dim,base=h.rope_base,train_seq_len=h.train_seq_len,rope_dims=h.rope_dims) + self.final_norm=RMSNorm();self.lm_head=None if h.tie_embeddings else CastedLinear(h.embedding_dim,h.vocab_size,bias=False) + if self.lm_head is not None:self.lm_head._zero_init=True + if h.xsa_last_n>0: + for i in range(max(0,h.num_layers-h.xsa_last_n),h.num_layers):self.blocks[i].attn.use_xsa=True + if h.parallel_residual_start>=0: + for i in range(h.parallel_residual_start,h.num_layers):self.blocks[i].parallel=True + self.looping_active=False + if h.num_loops>0: + loop_seg=list(range(h.loop_start,h.loop_end+1));all_indices=list(range(h.loop_start)) + for _ in range(h.num_loops+1):all_indices.extend(loop_seg) + all_indices.extend(range(h.loop_end+1,h.num_layers));num_enc=len(all_indices)//2;self.encoder_indices=all_indices[:num_enc];self.decoder_indices=all_indices[num_enc:] + else:self.encoder_indices=list(range(self.num_encoder_layers));self.decoder_indices=list(range(self.num_encoder_layers,h.num_layers)) + self.num_skip_weights=min(len(self.encoder_indices),len(self.decoder_indices));self.skip_weights=nn.Parameter(torch.ones(self.num_skip_weights,h.model_dim,dtype=torch.float32));self.skip_gates=nn.Parameter(torch.zeros(self.num_skip_weights,h.model_dim,dtype=torch.float32))if 
h.skip_gates_enabled else None;self._init_weights() + def _init_weights(self): + if self.tie_embeddings:nn.init.normal_(self.tok_emb.weight,mean=.0,std=self.tied_embed_init_std) + for(name,module)in self.named_modules(): + if isinstance(module,nn.Linear): + if getattr(module,'_zero_init',False):nn.init.zeros_(module.weight) + elif module.weight.ndim==2 and module.weight.shape[0]>=64 and module.weight.shape[1]>=64:nn.init.orthogonal_(module.weight,gain=1.) + def forward_logits(self,input_ids): + x=self.tok_emb(input_ids);x=F.rms_norm(x,(x.size(-1),));x=self.smear(x) + if self.embed_proj is not None:x=self.embed_proj(x) + x0=x;skips=[];enc_iter=self.encoder_indices if self.looping_active else range(self.num_encoder_layers);dec_iter=self.decoder_indices if self.looping_active else range(self.num_encoder_layers,self.num_encoder_layers+self.num_decoder_layers) + for i in enc_iter:x=self.blocks[i](x,x0);skips.append(x) + for(skip_idx,i)in enumerate(dec_iter): + if skip_idxG.size(1) + if transposed:X=X.T + for _ in range(steps):A=X@X.T;B=b*A+c*A@A;X=a*X+B@X + return X.T if transposed else X +class Muon(torch.optim.Optimizer): + def __init__(self,params,lr,momentum,backend_steps,nesterov=True,weight_decay=.0,row_normalize=False):super().__init__(params,dict(lr=lr,momentum=momentum,backend_steps=backend_steps,nesterov=nesterov,weight_decay=weight_decay,row_normalize=row_normalize)) + @torch.no_grad() + def step(self,closure=None): + loss=None + if closure is not None: + with torch.enable_grad():loss=closure() + distributed=dist.is_available()and dist.is_initialized();world_size=dist.get_world_size()if distributed else 1;rank=dist.get_rank()if distributed else 0 + for group in self.param_groups: + params=group['params'] + if not params:continue + lr=group['lr'];momentum=group['momentum'];backend_steps=group['backend_steps'];nesterov=group['nesterov'];total_params=sum(int(p.numel())for p in 
params);updates_flat=torch.zeros(total_params,device=params[0].device,dtype=torch.bfloat16);curr=0 + for(i,p)in enumerate(params): + if i%world_size==rank and p.grad is not None: + g=p.grad;state=self.state[p] + if'momentum_buffer'not in state:state['momentum_buffer']=torch.zeros_like(g) + buf=state['momentum_buffer'];buf.mul_(momentum).add_(g) + if nesterov:g=g.add(buf,alpha=momentum) + if group.get('row_normalize',False):row_norms=g.float().norm(dim=-1,keepdim=True).clamp_min(1e-07);g=g/row_norms.to(g.dtype) + g=zeropower_via_newtonschulz5(g,steps=backend_steps);g*=max(1,g.size(0)/g.size(1))**.5;updates_flat[curr:curr+p.numel()]=g.reshape(-1) + curr+=p.numel() + if distributed:dist.all_reduce(updates_flat,op=dist.ReduceOp.SUM) + wd=group.get('weight_decay',.0);curr=0 + for p in params: + if wd>.0:p.data.mul_(1.-lr*wd) + g=updates_flat[curr:curr+p.numel()].view_as(p).to(dtype=p.dtype);p.add_(g,alpha=-lr);curr+=p.numel() + return loss +CONTROL_TENSOR_NAME_PATTERNS=tuple(pattern for pattern in os.environ.get('CONTROL_TENSOR_NAME_PATTERNS','attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,skip_gates,smear').split(',')if pattern) +class Optimizers: + def __init__(self,h,base_model): + block_named_params=list(base_model.blocks.named_parameters());matrix_params=[p for(name,p)in block_named_params if p.ndim==2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)];scalar_params=[p for(name,p)in block_named_params if p.ndim<2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)] + if base_model.skip_weights.numel()>0:scalar_params.append(base_model.skip_weights) + if base_model.skip_gates is not None and base_model.skip_gates.numel()>0:scalar_params.append(base_model.skip_gates) + if hasattr(base_model,'smear'):scalar_params.append(base_model.smear.gate) + token_lr=h.tied_embed_lr if h.tie_embeddings else 
h.embed_lr;tok_params=[{'params':[base_model.tok_emb.weight],'lr':token_lr,'base_lr':token_lr}];self.optimizer_tok=torch.optim.AdamW(tok_params,betas=(h.beta1,h.beta2),eps=h.adam_eps,weight_decay=h.embed_wd,fused=True);self.optimizer_muon=Muon(matrix_params,lr=h.matrix_lr,momentum=h.muon_momentum,backend_steps=h.muon_backend_steps,weight_decay=h.muon_wd,row_normalize=h.muon_row_normalize) + for group in self.optimizer_muon.param_groups:group['base_lr']=h.matrix_lr + self.optimizer_scalar=torch.optim.AdamW([{'params':scalar_params,'lr':h.scalar_lr,'base_lr':h.scalar_lr}],betas=(h.beta1,h.beta2),eps=h.adam_eps,weight_decay=h.adam_wd,fused=True);self.optimizers=[self.optimizer_tok,self.optimizer_muon,self.optimizer_scalar] + if base_model.lm_head is not None:self.optimizer_head=torch.optim.Adam([{'params':[base_model.lm_head.weight],'lr':h.head_lr,'base_lr':h.head_lr}],betas=(h.beta1,h.beta2),eps=h.adam_eps,fused=True);self.optimizers.insert(1,self.optimizer_head) + else:self.optimizer_head=None + def __iter__(self):return iter(self.optimizers) + def zero_grad_all(self): + for opt in self.optimizers:opt.zero_grad(set_to_none=True) + def step(self): + for opt in self.optimizers:opt.step() + self.zero_grad_all() +def restore_fp32_params(model): + for module in model.modules(): + if isinstance(module,CastedLinear):module.float() + for(name,param)in model.named_parameters(): + if(param.ndim<2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS))and param.dtype!=torch.float32:param.data=param.data.float() +def collect_hessians(model,train_loader,h,device,n_calibration_batches=64): + hessians={};hooks=[] + def make_hook(name): + def hook_fn(module,inp,out): + x=inp[0].detach().float() + if x.ndim==3:x=x.reshape(-1,x.shape[-1]) + if name not in hessians:hessians[name]=torch.zeros(x.shape[1],x.shape[1],dtype=torch.float32,device=device) + hessians[name].addmm_(x.T,x) + return hook_fn + for(name,module)in model.named_modules(): + if 
isinstance(module,CastedLinear)and module.weight.numel()>65536: + cat=classify_param(name+'.weight') + if cat in('mlp','attn'):hooks.append(module.register_forward_hook(make_hook(name+'.weight'))) + if model.tie_embeddings: + hook_module=model.head_proj if model.head_proj is not None else model.final_norm + def make_output_hook(name): + def hook_fn(module,inp,out): + x=out.detach().float() + if x.ndim==3:x=x.reshape(-1,x.shape[-1]) + if name not in hessians:hessians[name]=torch.zeros(x.shape[1],x.shape[1],dtype=torch.float32,device=device) + hessians[name].addmm_(x.T,x) + return hook_fn + hooks.append(hook_module.register_forward_hook(make_output_hook('tok_emb.weight'))) + model.eval() + with torch.no_grad(): + for _ in range(n_calibration_batches):x,_=train_loader.next_batch(h.train_batch_tokens,h.grad_accum_steps);model.forward_logits(x) + for hook in hooks:hook.remove() + for name in hessians:hessians[name]=hessians[name].cpu()/n_calibration_batches + return hessians +def gptq_quantize_weight(w,H,clip_sigmas=3.,clip_range=63,block_size=128): + W_orig=w.float().clone();rows,cols=W_orig.shape;H=H.float().clone();dead=torch.diag(H)==0;H[dead,dead]=1;damp=.01*H.diag().mean();H.diagonal().add_(damp);perm=torch.argsort(H.diag(),descending=True);invperm=torch.argsort(perm);W_perm=W_orig[:,perm].clone();W_perm[:,dead[perm]]=0;H=H[perm][:,perm];Hinv=torch.cholesky_inverse(torch.linalg.cholesky(H));Hinv=torch.linalg.cholesky(Hinv,upper=True);row_std=W_orig.std(dim=1);s=(clip_sigmas*row_std/clip_range).clamp_min(1e-10).to(torch.float16);sf=s.float();Q=torch.zeros(rows,cols,dtype=torch.int8);W_work=W_perm.clone() + for i1 in range(0,cols,block_size): + i2=min(i1+block_size,cols);W_block=W_work[:,i1:i2].clone();Hinv_block=Hinv[i1:i2,i1:i2];Err=torch.zeros(rows,i2-i1) + for j in 
range(i2-i1):w_col=W_block[:,j];d=Hinv_block[j,j];q_col=torch.clamp(torch.round(w_col/sf),-clip_range,clip_range);Q[:,i1+j]=q_col.to(torch.int8);err=(w_col-q_col.float()*sf)/d;Err[:,j]=err;W_block[:,j:]-=err.unsqueeze(1)*Hinv_block[j,j:].unsqueeze(0) + if i20:out[name]=(q.float()*s.float().view(q.shape[0],*[1]*(q.ndim-1))).to(orig_dtype) + else:out[name]=(q.float()*float(s.item())).to(orig_dtype) + return out +_BSHF_MAGIC=b'BSHF' +def _byte_shuffle(data,stride=2): + if stride<=1 or len(data)0 else torch.tensor(0.) + ternary=torch.zeros_like(t,dtype=torch.int8);ternary[t>threshold]=1;ternary[t<-threshold]=-1 + active_mask=ternary!=0;row_sums=(t.abs()*active_mask.float()).sum(dim=-1);row_counts=active_mask.float().sum(dim=-1).clamp(min=1);scale=(row_sums/row_counts).to(torch.float16) + vf=(ternary==0).float().mean().item();total_v+=int(t.numel()*vf) + result[name+'.t']=ternary;result[name+'.s']=scale;meta[name]=f'ternary(void={vf:.1%})' + elif name in hessians: + # GPTQ int6/int8 for attention + embeddings (quality-critical) + cs=h.embed_clip_sigmas if is_embed else h.matrix_clip_sigmas + bits=h.embed_bits if is_embed else h.matrix_bits + q,s=gptq_quantize_weight(t,hessians[name],clip_sigmas=cs,clip_range=2**(bits-1)-1) + result[name+'.q']=q;result[name+'.scale']=s;meta[name]=f'gptq(int{bits})' + else: + result[name]=t.to(torch.float16);meta[name]='passthrough(no_hessian)' + log(f"Mixed compression: {total_p:,} params, ternary void {total_v/max(total_p,1):.1%}") + return result,meta +def _pack_ternary(t): + flat=t.flatten().to(torch.int8);mapped=torch.where(flat==-1,torch.tensor(2,dtype=torch.int8),flat.clamp(0,1)) + pad=(4-len(mapped)%4)%4 + if pad>0:mapped=torch.cat([mapped,torch.zeros(pad,dtype=torch.int8)]) + r=mapped.view(-1,4);packed=(r[:,0]|(r[:,1]<<2)|(r[:,2]<<4)|(r[:,3]<<6)).to(torch.uint8) + return packed,list(t.shape) +def _unpack_ternary(packed,shape): + import numpy as 
np;vals=np.stack([packed.numpy()&0x03,(packed.numpy()>>2)&0x03,(packed.numpy()>>4)&0x03,(packed.numpy()>>6)&0x03],axis=-1).flatten() + numel=1 + for d in shape:numel*=d + vals=vals[:numel];t=torch.from_numpy(vals.astype(np.int8));t=torch.where(t==2,torch.tensor(-1,dtype=torch.int8),t) + return t.reshape(shape) +def serialize(h,base_model,code): + import base64 + code_raw=code.encode('utf-8');code_compressed=lzma.compress(code_raw,preset=9) + bootstrap=f"import lzma,base64 as B;exec(lzma.decompress(B.b85decode({repr(base64.b85encode(code_compressed).decode())})))".encode('utf-8') + code_bytes=len(bootstrap);log(f"Code: {len(code_raw)} raw → {len(code_compressed)} lzma → {code_bytes} bootstrap") + if h.is_main_process: + with open('train_gpt.py','wb')as f:f.write(bootstrap) + log(f"Wrote bootstrap code to train_gpt.py ({code_bytes} bytes)") + torch.save(base_model.state_dict(),h.model_path);model_bytes=os.path.getsize(h.model_path);log(f"Serialized model: {model_bytes} bytes");log(f"Code size: {code_bytes} bytes") + sd_cpu={k:v.detach().cpu()for(k,v)in base_model.state_dict().items()};device=torch.device('cuda',h.local_rank);log('Collecting Hessians for mixed compression...');t0=time.perf_counter();calib_loader=ShuffledSequenceLoader(h,device) + with torch.no_grad(),torch.autocast(device_type='cuda',dtype=torch.bfloat16):hessians=collect_hessians(base_model,calib_loader,h,device,n_calibration_batches=h.gptq_calibration_batches) + log(f"Hessians collected in {time.perf_counter()-t0:.1f}s");t0=time.perf_counter();quant_result,quant_meta=gptq_mixed_quantize(sd_cpu,hessians,h);log(f"Mixed compression done in {time.perf_counter()-t0:.1f}s") + packed={};packed_meta={} + for name,info in quant_meta.items(): + if'ternary'in info: + p,s=_pack_ternary(quant_result[name+'.t']);packed[name+'.p']=p;packed[name+'.sh']=torch.tensor(s);packed[name+'.sc']=quant_result[name+'.s'] + elif'gptq'in info: + 
packed[name+'.q']=quant_result[name+'.q'];packed[name+'.scale']=quant_result[name+'.scale'] + else:packed[name]=quant_result[name] + packed_meta[name]=info + quant_buf=io.BytesIO();torch.save({'w':packed,'m':packed_meta},quant_buf);quant_raw=quant_buf.getvalue();quant_blob=_compress(quant_raw,h.compressor);quant_file_bytes=len(quant_blob);bytes_total=quant_file_bytes+code_bytes + if h.is_main_process: + with open(h.quantized_model_path,'wb')as f:f.write(quant_blob) + log(f"Serialized ternary+{h.compressor}: {quant_file_bytes} bytes");log(f"Total submission size: {bytes_total} bytes") + if bytes_total>16_000_000:log(f"WARNING: {bytes_total-16_000_000} bytes OVER 16MB!") + else:log(f"SIZE OK: {16_000_000-bytes_total} headroom") + return bytes_total,quant_file_bytes +def deserialize(h,device): + eval_model=GPT(h).to(device).bfloat16();restore_fp32_params(eval_model);template_sd={k:v.detach().cpu()for(k,v)in eval_model.state_dict().items()} + with open(h.quantized_model_path,'rb')as f:quant_blob_disk=f.read() + quant_state=torch.load(io.BytesIO(_decompress(quant_blob_disk,h.compressor)),map_location='cpu',weights_only=True);packed=quant_state['w'];meta=quant_state['m'] + out={} + for name,orig in template_sd.items(): + info=meta.get(name) + if info is None:continue + if'ternary'in info: + p=packed[name+'.p'];sh=packed[name+'.sh'];sc=packed[name+'.sc'] + ternary=_unpack_ternary(p,list(sh.tolist()));deq=(ternary.float()*sc.float().view(*([sc.shape[0]]+[1]*(ternary.ndim-1)))).to(orig.dtype);out[name]=deq + elif'gptq'in info: + q,s=packed[name+'.q'],packed[name+'.scale'] + if s.ndim>0:out[name]=(q.float()*s.float().view(q.shape[0],*[1]*(q.ndim-1))).to(orig.dtype) + else:out[name]=(q.float()*float(s.item())).to(orig.dtype) + else: + t=packed[name] + if t.dtype==torch.float16 and orig.dtype in(torch.float32,torch.bfloat16):t=t.to(orig.dtype) + out[name]=t + eval_model.load_state_dict(out,strict=True);return eval_model +def 
_loss_bpb(loss_sum,token_count,byte_count):val_loss=(loss_sum/token_count).item();val_bpb=val_loss/math.log(2.)*(token_count.item()/byte_count.item());return val_loss,val_bpb +def eval_val(h,device,val_data,model): + seq_len=h.eval_seq_len;local_batch_tokens=h.val_batch_tokens//(h.world_size*h.grad_accum_steps) + if local_batch_tokenscurrent_phase:_reset_lora(lora_params);optimizer=torch.optim.AdamW(lora_params,lr=h.lora_ttt_lr,weight_decay=h.lora_ttt_wd);current_phase=new_phase;log(f"lora_ttt:phase {current_phase}") + chunk_start=ci*ttt_chunk;chunk_end=min((ci+1)*ttt_chunk,total_tokens);base_model.eval() + with torch.no_grad(): + for bi in range(0,len(windows),batch_seqs): + batch_ws=windows[bi:bi+batch_seqs];bsz=len(batch_ws);x_batch=torch.zeros(bsz,seq_len,dtype=torch.int64,device=device);y_batch=torch.zeros(bsz,seq_len,dtype=torch.int64,device=device);wlens=[] + for i,ws in enumerate(batch_ws):we=min(ws+seq_len,total_tokens);wlen=we-ws;wlens.append(wlen);ct=val_data.val_tokens[ws:we+1].to(dtype=torch.int64,device=device);x_batch[i,:wlen]=ct[:-1];y_batch[i,:wlen]=ct[1:] + with torch.autocast(device_type='cuda',dtype=torch.bfloat16):logits=compiled_logits(x_batch) + nll=F.cross_entropy(logits.reshape(-1,logits.size(-1)).float(),y_batch.reshape(-1),reduction='none').reshape(bsz,seq_len) + for i,ws in enumerate(batch_ws): + wlen=wlens[i];s=0 if ws==0 else context_size;scored_nll=nll[i,s:wlen].to(torch.float64);loss_sum+=scored_nll.sum();token_count+=float(wlen-s) + tgt=y_batch[i,s:wlen];prev=x_batch[i,s:wlen];tb=val_data.base_bytes_lut[tgt].to(torch.float64);tb+=(val_data.has_leading_space_lut[tgt]&~val_data.is_boundary_token_lut[prev]).to(torch.float64);byte_count+=tb.sum() + is_last=ci==num_chunks-1 + if not is_last and h.ttt_epochs>0: + base_model.train();chunk_seqs=(chunk_end-chunk_start)//seq_len + if chunk_seqs>0: + cos_lr=h.lora_ttt_lr*.5*(1.+math.cos(math.pi*ci/max(num_chunks-1,1))) + for pg in optimizer.param_groups:pg['lr']=cos_lr + for _ep in 
range(h.ttt_epochs): + for bs in range(0,chunk_seqs,batch_seqs): + be=min(bs+batch_seqs,chunk_seqs);start_tok=chunk_start+bs*seq_len;end_tok=chunk_start+be*seq_len+1 + if end_tok>val_data.val_tokens.numel():continue + local=val_data.val_tokens[start_tok:end_tok].to(device=device,dtype=torch.int64);x=local[:-1].reshape(-1,seq_len);y=local[1:].reshape(-1,seq_len) + optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type='cuda',dtype=torch.bfloat16):loss=base_model(x,y) + loss.backward();torch.nn.utils.clip_grad_norm_(lora_params,1.);optimizer.step() + log(f"lora_ttt:done {time.perf_counter()-t0:.1f}s phases={h.lora_ttt_phases}") + _remove_lora(base_model) + for p in base_model.parameters():p.requires_grad_(True) + base_model.eval();return (loss_sum/token_count).item(),(loss_sum/byte_count/math.log(2)).item() + +def eval_val_ttt(h,device,val_data,base_model,batch_seqs=32): + rank=h.rank;world_size=h.world_size;seq_len=h.eval_seq_len;stride=h.eval_stride;total_tokens=val_data.val_tokens.numel()-1;ttt_chunk=h.ttt_chunk_tokens;context_size=seq_len-stride;window_starts=[ws for ws in range(0,total_tokens,stride)if ws+context_size0: + base_model.train();chunk_seqs=(chunk_end-chunk_start)//seq_len + if chunk_seqs>0: + cos_lr=h.ttt_lr*.5*(1.+math.cos(math.pi*ci/max(num_chunks-1,1))) + for pg in optimizer.param_groups:pg['lr']=cos_lr + my_seq_s=chunk_seqs*rank//world_size;my_seq_e=chunk_seqs*(rank+1)//world_size;my_chunk_seqs=my_seq_e-my_seq_s + for _ep in range(h.ttt_epochs): + for bs in range(0,my_chunk_seqs,batch_seqs): + be=min(bs+batch_seqs,my_chunk_seqs);actual_bs=my_seq_s+bs;start_tok=chunk_start+actual_bs*seq_len;end_tok=chunk_start+(my_seq_s+be)*seq_len+1 + if end_tok>val_data.val_tokens.numel():continue + local=val_data.val_tokens[start_tok:end_tok].to(device=device,dtype=torch.int64);x=local[:-1].reshape(-1,seq_len);y=local[1:].reshape(-1,seq_len);optimizer.zero_grad(set_to_none=True) + with 
torch.autocast(device_type='cuda',dtype=torch.bfloat16):loss=base_model(x,y) + loss.backward() + if world_size>1: + for p in ttt_params: + if p.grad is not None:dist.all_reduce(p.grad,op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params,1.);optimizer.step() + if dist.is_available()and dist.is_initialized():dist.all_reduce(loss_sum,op=dist.ReduceOp.SUM);dist.all_reduce(token_count,op=dist.ReduceOp.SUM);dist.all_reduce(byte_count,op=dist.ReduceOp.SUM) + for p in base_model.parameters():p.requires_grad_(True) + base_model.eval();return _loss_bpb(loss_sum,token_count,byte_count) +def timed_eval(label,fn,*args,**kwargs):torch.cuda.synchronize();t0=time.perf_counter();val_loss,val_bpb=fn(*args,**kwargs);torch.cuda.synchronize();elapsed_ms=1e3*(time.perf_counter()-t0);log(f"{label} val_loss:{val_loss:.8f} val_bpb:{val_bpb:.8f} eval_time:{elapsed_ms:.0f}ms");return val_loss,val_bpb +def train_model(h,device,val_data): + base_model=GPT(h).to(device).bfloat16();restore_fp32_params(base_model);compiled_model=torch.compile(base_model,dynamic=False,fullgraph=True) + if h.distributed:model=DDP(compiled_model,device_ids=[h.local_rank],broadcast_buffers=False) + else:model=compiled_model + log(f"model_params:{sum(p.numel()for p in base_model.parameters())}");optimizers=Optimizers(h,base_model);train_loader=ShuffledSequenceLoader(h,device);max_wallclock_ms=1e3*h.max_wallclock_seconds if h.max_wallclock_seconds>0 else None + if max_wallclock_ms is not None:max_wallclock_ms-=h.gptq_reserve_seconds*1e3;log(f"gptq:reserving {h.gptq_reserve_seconds:.0f}s, effective={max_wallclock_ms:.0f}ms") + def training_frac(step,elapsed_ms): + if max_wallclock_ms is None:return step/max(h.iterations,1) + return elapsed_ms/max(max_wallclock_ms,1e-09) + def lr_mul(frac): + if h.warmdown_frac<=0:return 1. + if frac>=1.-h.warmdown_frac:return max((1.-frac)/h.warmdown_frac,h.min_lr) + return 1. 
+ def step_fn(step,lr_scale): + optimizers.zero_grad_all();train_loss=torch.zeros((),device=device) + for micro_step in range(h.grad_accum_steps): + if h.distributed:model.require_backward_grad_sync=micro_step==h.grad_accum_steps-1 + x,y=train_loader.next_batch(h.train_batch_tokens,h.grad_accum_steps) + with torch.autocast(device_type='cuda',dtype=torch.bfloat16,enabled=True):loss=model(x,y) + train_loss+=loss.detach();(loss/h.grad_accum_steps).backward() + train_loss/=h.grad_accum_steps;frac=min(step/h.muon_momentum_warmup_steps,1.)if h.muon_momentum_warmup_steps>0 else 1.;muon_momentum=(1-frac)*h.muon_momentum_warmup_start+frac*h.muon_momentum + for group in optimizers.optimizer_muon.param_groups:group['momentum']=muon_momentum + for opt in optimizers: + for group in opt.param_groups:group['lr']=group['base_lr']*lr_scale + if h.grad_clip_norm>0:torch.nn.utils.clip_grad_norm_(base_model.parameters(),h.grad_clip_norm) + optimizers.step();return train_loss + if h.warmup_steps>0: + initial_model_state={name:tensor.detach().cpu().clone()for(name,tensor)in base_model.state_dict().items()};initial_optimizer_states=[copy.deepcopy(opt.state_dict())for opt in optimizers];model.train() + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step,1.) + if warmup_step<=5 or(warmup_step+1)%10==0 or warmup_step+1==h.warmup_steps:log(f"warmup_step: {warmup_step+1}/{h.warmup_steps}") + if h.num_loops>0: + base_model.looping_active=True;log(f"loop_warmup:enabled encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}") + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step,1.) 
+ if warmup_step<=5 or(warmup_step+1)%10==0 or warmup_step+1==h.warmup_steps:log(f"loop_warmup_step: {warmup_step+1}/{h.warmup_steps}") + base_model.looping_active=False + base_model.load_state_dict(initial_model_state,strict=True) + for(opt,state)in zip(optimizers,initial_optimizer_states,strict=True):opt.load_state_dict(state) + optimizers.zero_grad_all() + if h.distributed:model.require_backward_grad_sync=True + train_loader=ShuffledSequenceLoader(h,device) + ema_state={name:t.detach().float().clone()for(name,t)in base_model.state_dict().items()};ema_decay=h.ema_decay;training_time_ms=.0;stop_after_step=None;torch.cuda.synchronize();t0=time.perf_counter();step=0 + while True: + last_step=step==h.iterations or stop_after_step is not None and step>=stop_after_step;should_validate=last_step or h.val_loss_every>0 and step%h.val_loss_every==0 + if should_validate:torch.cuda.synchronize();training_time_ms+=1e3*(time.perf_counter()-t0);val_loss,val_bpb=eval_val(h,device,val_data,model);log(f"{step}/{h.iterations} val_loss: {val_loss:.4f} val_bpb: {val_bpb:.4f}");torch.cuda.synchronize();t0=time.perf_counter() + if last_step: + if stop_after_step is not None and step0 and not base_model.looping_active and frac>=h.enable_looping_at:base_model.looping_active=True;log(f"layer_loop:enabled step:{step} frac:{frac:.3f} encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}") + train_loss=step_fn(step,scale) + with torch.no_grad(): + for(name,t)in base_model.state_dict().items():ema_state[name].mul_(ema_decay).add_(t.detach().float(),alpha=1.-ema_decay) + step+=1;approx_training_time_ms=training_time_ms+1e3*(time.perf_counter()-t0);should_log_train=h.train_log_every>0 and(step<=5 or step%h.train_log_every==0 or stop_after_step is not None) + if step in(100,500,1000,2300)and h.is_main_process: + ckpt_path=f'/root/checkpoints/step_{step}.pt';os.makedirs('/root/checkpoints',exist_ok=True);torch.save(base_model.state_dict(),ckpt_path);log(f"CHECKPOINT saved: 
{ckpt_path}") + if should_log_train:tok_per_sec=step*h.train_batch_tokens/(approx_training_time_ms/1e3);log(f"{step}/{h.iterations} train_loss: {train_loss.item():.4f} train_time: {approx_training_time_ms/60000:.1f}m tok/s: {tok_per_sec:.0f}") + reached_cap=max_wallclock_ms is not None and approx_training_time_ms>=max_wallclock_ms + if h.distributed and max_wallclock_ms is not None:reached_cap_tensor=torch.tensor(int(reached_cap),device=device);dist.all_reduce(reached_cap_tensor,op=dist.ReduceOp.MAX);reached_cap=bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap:stop_after_step=step + log(f"peak memory allocated: {torch.cuda.max_memory_allocated()//1024//1024} MiB reserved: {torch.cuda.max_memory_reserved()//1024//1024} MiB");log('ema:applying EMA weights');current_state=base_model.state_dict();avg_state={name:t.to(dtype=current_state[name].dtype)for(name,t)in ema_state.items()};base_model.load_state_dict(avg_state,strict=True);torch.save(base_model.state_dict(),'final_model_ema.pt');log('EMA checkpoint saved to final_model_ema.pt');return base_model,compiled_model +def train_and_eval(h,device): + random.seed(h.seed);np.random.seed(h.seed);torch.manual_seed(h.seed);torch.cuda.manual_seed_all(h.seed);val_data=ValidationData(h,device);_n_shards=len(list(Path(h.datasets_dir).resolve().glob('fineweb_train_*.bin')));log(f"train_shards: {_n_shards}");log(f"val_tokens: {val_data.val_tokens.numel()-1}");base_model,compiled_model=train_model(h,device,val_data);torch._dynamo.reset();timed_eval('pre-quantization post-ema',eval_val,h,device,val_data,compiled_model);serialize(h,base_model,Path(__file__).read_text(encoding='utf-8')) + if h.distributed:dist.barrier() + eval_model=deserialize(h,device) + if h.num_loops>0:eval_model.looping_active=True + compiled_model=torch.compile(eval_model,dynamic=False,fullgraph=True);timed_eval('quantized',eval_val,h,device,val_data,compiled_model) + if 
h.sliding_window_enabled:timed_eval('quantized_sliding_window',eval_val_sliding,h,device,val_data,eval_model) + if h.ttt_enabled and h.sliding_window_enabled: + del eval_model,compiled_model;torch._dynamo.reset();torch.cuda.empty_cache();ttt_model=deserialize(h,device) + if h.num_loops>0:ttt_model.looping_active=True + timed_eval('quantized_ttt',eval_val_lora_ttt if h.lora_ttt_enabled else eval_val_ttt,h,device,val_data,ttt_model);del ttt_model + if h.etlb_enabled and h.sliding_window_enabled: + if'eval_model'not in dir(): + eval_model=deserialize(h,device) + if h.num_loops>0:eval_model.looping_active=True + timed_eval('quantized_sliding_etlb',eval_val_sliding_etlb,h,device,val_data,eval_model) +def main(): + world_size=int(os.environ.get('WORLD_SIZE','1'));local_rank=int(os.environ.get('LOCAL_RANK','0'));distributed='RANK'in os.environ and'WORLD_SIZE'in os.environ + if not torch.cuda.is_available():raise RuntimeError('CUDA is required') + if world_size<=0:raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + if 8%world_size!=0:raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral") + device=torch.device('cuda',local_rank);torch.cuda.set_device(device) + if distributed:dist.init_process_group(backend='nccl',device_id=device);dist.barrier() + torch.backends.cuda.matmul.allow_tf32=True;torch.backends.cudnn.allow_tf32=True;torch.set_float32_matmul_precision('high');from torch.backends.cuda import enable_cudnn_sdp,enable_flash_sdp,enable_math_sdp,enable_mem_efficient_sdp;enable_cudnn_sdp(False);enable_flash_sdp(True);enable_mem_efficient_sdp(False);enable_math_sdp(False);torch._dynamo.config.optimize_ddp=False;h=Hyperparameters();set_logging_hparams(h) + if h.is_main_process: + os.makedirs('logs',exist_ok=True);log(100*'=',console=False);log('Hyperparameters:',console=True) + for(k,v)in sorted(vars(type(h)).items()): + if not k.startswith('_'):log(f" {k}: {v}",console=True) + log('='*100,console=False);log(f"Running 
Python {sys.version}",console=False);log(f"Running PyTorch {torch.__version__}",console=False);log(subprocess.run(['nvidia-smi'],stdout=subprocess.PIPE,stderr=subprocess.PIPE,text=True,check=False).stdout,console=False);log('='*100,console=False) + train_and_eval(h,device) + if distributed:dist.destroy_process_group() +if __name__=='__main__':main() \ No newline at end of file diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/v3_lora_ttt.log b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/v3_lora_ttt.log new file mode 100644 index 0000000000..790c492af5 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/v3_lora_ttt.log @@ -0,0 +1,7 @@ +Running LoRA-TTT eval on EMA checkpoint... +lora_ttt:injected 44 layers, 5,046,272 params +lora_ttt:phase 1 +lora_ttt:phase 2 +lora_ttt:done 3846.7s phases=3 +lora_ttt_rerun val_loss:2.75886799 val_bpb:1.06804377 eval_time:3847238ms +DONE diff --git a/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/v3_train.log b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/v3_train.log new file mode 100644 index 0000000000..efeb404e1c --- /dev/null +++ b/records/track_non_record_16mb/2026-04-29_FlowerBrain_TernaryArchitecture/v3_train.log @@ -0,0 +1,184 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /root/data/ + datasets_dir: /root/data/datasets/fineweb10B_sp8192 + distributed: False + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 8 + 
grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/180f489b-ea77-4991-9b2a-568c7ea9165e.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + lora_ttt_alpha: 144.0 + lora_ttt_enabled: True + lora_ttt_lr: 0.0005 + lora_ttt_phases: 3 + lora_ttt_rank: 128 + lora_ttt_wd: 0.01 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 7200.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 180f489b-ea77-4991-9b2a-568c7ea9165e + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /root/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /root/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 16384 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: /root/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 1 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.11.10 (main, Sep 7 2024, 18:35:41) [GCC 11.4.0] +Running PyTorch 2.4.1+cu124 +Wed Apr 29 13:13:03 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 
580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 45C P0 73W / 700W | 4MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 115 +val_tokens: 40540160 +model_params:35945048 +gptq:reserving 12s, effective=7188000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0045 val_bpb: 3.4859 +1/20000 train_loss: 9.0007 train_time: 0.0m tok/s: 836823 +2/20000 train_loss: 9.2217 train_time: 0.0m tok/s: 831865 +3/20000 train_loss: 9.5355 train_time: 0.0m tok/s: 828692 +4/20000 train_loss: 9.6487 train_time: 0.1m tok/s: 827446 +5/20000 train_loss: 9.4469 train_time: 0.1m tok/s: 
826531 +CHECKPOINT saved: /root/checkpoints/step_100.pt +CHECKPOINT saved: /root/checkpoints/step_500.pt +500/20000 train_loss: 3.3269 train_time: 8.0m tok/s: 820642 +CHECKPOINT saved: /root/checkpoints/step_1000.pt +1000/20000 train_loss: 3.2202 train_time: 16.0m tok/s: 821232 +1500/20000 train_loss: 3.1734 train_time: 23.9m tok/s: 821591 +2000/20000 train_loss: 3.1350 train_time: 31.9m tok/s: 821516 +CHECKPOINT saved: /root/checkpoints/step_2300.pt +2500/20000 train_loss: 3.0792 train_time: 39.9m tok/s: 821361 +layer_loop:enabled step:2628 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 3.0478 train_time: 50.8m tok/s: 773680 +3500/20000 train_loss: 2.9854 train_time: 62.7m tok/s: 731219 +4000/20000 train_loss: 2.9614 train_time: 74.6m tok/s: 702331 +4000/20000 val_loss: 2.9543 val_bpb: 1.1437 +4500/20000 train_loss: 2.9397 train_time: 86.6m tok/s: 681067 +5000/20000 train_loss: 2.8729 train_time: 98.6m tok/s: 664831 +5500/20000 train_loss: 2.8357 train_time: 110.5m tok/s: 652249 +5889/20000 val_loss: 2.7948 val_bpb: 1.0820 +stopping_early: wallclock_cap train_time: 7188708ms step: 5889/20000 +peak memory allocated: 53108 MiB reserved: 54492 MiB +ema:applying EMA weights +EMA checkpoint saved to final_model_ema.pt +pre-quantization post-ema val_loss:2.79112550 val_bpb:1.08053166 eval_time:64924ms +Serialized model: 135432998 bytes +Code size: 59414 bytes +Collecting Hessians for mixed compression... 
+Hessians collected in 17.7s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights, smear.gate +Mixed compression done in 70.8s +Serialized ternary+brotli: 15963638 bytes +Total submission size: 16023052 bytes +WARNING: 23052 bytes OVER 16MB! +quantized val_loss:2.82406776 val_bpb:1.09328464 eval_time:85470ms +quantized_sliding_window val_loss:2.78092569 val_bpb:1.07658300 eval_time:830350ms +lora_ttt:injected 44 layers, 5,046,272 params
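
The 2-bit ternary packing used on the serialize/deserialize path above (four weights per byte, with the code `2` standing in for `-1`, and per-row scales stored separately as `.sc`) can be sketched in isolation. This is a reconstruction, not the submission's code: `_pack_ternary` itself is not visible in this excerpt, so `pack_ternary` below is written to be the exact inverse of the `_unpack_ternary` decode shown earlier (`np.stack([... & 0x03, >>2, >>4, >>6], axis=-1)` then `2 -> -1`); the function names are illustrative.

```python
import numpy as np
import torch

def pack_ternary(t: torch.Tensor):
    """Pack ternary int8 weights {-1, 0, +1} at 2 bits each (4 weights/byte).

    -1 is encoded as the 2-bit code 2, matching the `torch.where(t==2, -1, t)`
    decode in deserialize(). Returns (uint8 packed tensor, original shape).
    """
    shape = list(t.shape)
    flat = t.flatten().numpy().astype(np.int8)
    codes = np.where(flat == -1, 2, flat).astype(np.uint8)   # {-1,0,1} -> {2,0,1}
    pad = (-len(codes)) % 4                                  # round length up to 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    # Weight i lands in byte i//4 at bit offset 2*(i%4), so that the
    # stack(..., axis=-1).flatten() decode reads weights back in order.
    packed = (codes[0::4] | (codes[1::4] << 2) |
              (codes[2::4] << 4) | (codes[3::4] << 6))
    return torch.from_numpy(packed), shape

def unpack_ternary(packed: torch.Tensor, shape):
    """Inverse of pack_ternary; mirrors the decode path in deserialize()."""
    p = packed.numpy()
    vals = np.stack([p & 0x03, (p >> 2) & 0x03,
                     (p >> 4) & 0x03, (p >> 6) & 0x03], axis=-1).flatten()
    numel = 1
    for d in shape:
        numel *= d
    t = torch.from_numpy(vals[:numel].astype(np.int8))
    t = torch.where(t == 2, torch.tensor(-1, dtype=torch.int8), t)  # code 2 -> -1
    return t.reshape(shape)
```

A ternary tensor round-trips through this at `ceil(numel / 4)` bytes before the brotli pass, i.e. the "2 bits/weight" figure quoted in the README holds exactly up to the 3-byte-per-tensor padding.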