diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md
new file mode 100644
index 0000000000..e121c6c254
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md
@@ -0,0 +1,162 @@
+# Notable Non-Record Submission: 1.1239 BPB — 106.2M-Parameter Asymmetric Binary U-Net Transformer
+
+**1-bit Quantisation + 15L (7 Encoder - 8 Decoder) + NeoMuon + 4x relu² MLP + SmearGate + Factored Tied Embedding + Poly5 Softcap + YaRN 2048 + 8192 BPE + FP8 QAT + LZMA + Stride-16 Sliding Eval**
+
+**val_bpb: 1.1239** (sliding, seed=42) | **15.67 MB** artifact | 8×H100 SXM, 50k steps (~2.15h)
+
+> **This is a non-record submission**: training exceeds the 10-minute wallclock constraint (50,000 steps / ~2.15 hours). Submitted to demonstrate the compression frontier: 106.2M parameters in 15.67MB via 1-bit quantisation; over 120M parameters would be possible with FP4 (implemented), at a worse bpb. Full experiment log: [RESULTS.md](RESULTS.md). Complete training logs: [logs/](https://github.com/CiprianFlorin-Ifrim/openai-parameter-golf-submission/tree/main/logs/cuda).
+
+## Results (seed=42, 8×H100 SXM)
+
+| Metric | Value |
+|--------|-------|
+| Sliding BPB (s16) | **1.1239** |
+| val_bpb | 1.1497 |
+| RT bpb | 1.1516 |
+| Steps | 50,000 |
+| ms/step | 155.3 |
+| Training time | 7,763s (~2.15h) |
+| optimal_T | 0.90 |
+| Artifact | 15,670,651 bytes (15.67MB) |
+| Parameters | 106,154,616 |
+
+### Comparison to Ternary Submission
+
+Binary reaches better absolute quality but requires roughly 13× more training time. Within the 10-minute budget, binary's best-fitting run (14L, 4,820 steps) scores 1.1824 sliding, 0.025 bpb worse than ternary (my previous record PR). Under the wallclock constraint, the zero state is worth more than binary's 60% parameter-density advantage.
+
+The results document linked here and in my repo covers all methods and sweeps applied to both the binary and ternary BitNets. Both quantisation schemes are unfortunately incompatible with many techniques, including Tversky layers, EMA, Muon weight decay, LM logit head rank factorisation, and more.
+
+## Architecture
+
+- 15 transformer layers, dim=768, 8 heads, 4 KV heads (GQA), head_dim=96
+- Binary quantisation: weights {-1, +1}, 1 bit/param, per-group (128) absmean scaling
+- 4x MLP expansion (hidden=3072) with **relu²** activation, fused gate+up projection
+- U-Net encoder/decoder with learned skip weights (ones-init) and per-block residual mix from input embedding
+- **SmearGate:** causal cumulative mean blending with learned tanh gate, zero-init for safe residual start
+- Factored tied embedding: 8192×254 bottleneck with learned projections
+- Polynomial softcap (degree 5, cap=10) with Z-loss regularisation (1e-4)
+- YaRN positional encoding (max_len=2048, ROPE_BASE=5000)
+- Fused QKV projection
+- FlashAttention-3 (Hopper native kernels)
+- 106.2M parameters, 15.67MB artifact (97.3M binary + 2.5M fp8 + 70KB code)
+
+## Key Techniques
+
+### Architecture
+- **Binary quantisation:** 1 bit/param packs 60% more parameters per MB than ternary (1.6 bits/param), allowing 15 layers vs 10 within similar budget
+- **4x relu² MLP:** relu² strictly dominates relu; 4x width outperforms 3x even with fewer layers at matched budget
+- **SmearGate:** blends each position with causal cumulative mean; adds 22ms/step overhead but provides -0.007 bpb at scale. Viable here because the run is not wallclock-constrained
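+
+A minimal sketch of the SmearGate described above (the blend form, the gate shape, and the `SmearGate` name itself are illustrative assumptions, not the exact repo code):
+
+```python
+import torch
+import torch.nn as nn
+
+class SmearGate(nn.Module):
+    """Blend each position with the causal cumulative mean of the sequence,
+    through a learned tanh gate that is zero-initialised (identity at init)."""
+    def __init__(self, dim: int):
+        super().__init__()
+        self.gate = nn.Parameter(torch.zeros(dim))  # zero-init: tanh(0) = 0
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # x: [batch, seq, dim]; running mean of positions 0..t at each step t
+        counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
+        cummean = x.cumsum(dim=1) / counts
+        g = torch.tanh(self.gate)            # in (-1, 1), exactly 0 at init
+        return (1 - g) * x + g * cummean     # starts as identity: safe residual
+```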
+
+### Training
+- **NeoMuon** optimizer with 3 Newton-Schulz steps
+- **50,000 steps unconstrained:** binary converges more slowly than ternary (my other submission, #640); at the 10-minute-equivalent step count (~4,800 steps) binary lags by 0.025 bpb. Extended training closes the gap and surpasses ternary, showing that with unconstrained compute the model becomes quite strong.
+- **524k batch tokens:** the optimal operating point from the batch-size sweep; halving it degrades quality through noisier gradients, doubling it costs gradient updates (see RESULTS.md)
+
+### Evaluation
+- **Temperature scaling (T=0.90):** auto-calibrated by grid search over T ∈ [0.80, 1.20] on training tokens
+- **Sliding window (stride=16):** evaluation protocol
+
+### Compression
+- **Bit-packing + LZMA (preset=9):** binary weights pack at exactly 1 bit/param before LZMA entropy coding
+- **FP8 QAT (e4m3):** for non-binary parameters. Clean roundtrip: binary has no zero state, so `mean(|Q|)=1.0` always and no shrinkage correction is needed
+- **No EMA:** despite clean binary roundtrip math, EMA still hurts quality by 0.03 bpb in practice
+
+## Setup and Run
+
+```bash
+# Environment setup (conda + Python 3.13 + PyTorch + FlashAttention-3 + Triton + dataset)
+bash setup.sh
+
+# Activate and run
+conda activate golf
+SEED=42 bash run_cuda_binary.sh
+```
+
+
+**Full run command:**
+
+```bash
+RUN_ID=binary_run \
+DATA_PATH=./data/datasets/fineweb10B_sp8192 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
+ATTN_PROJ_TYPE=standard \
+LOGIT_HEAD_TYPE=standard \
+TVERSKY_MEMBERSHIP=sigmoid \
+TVERSKY_NUM_FEATURES=0 \
+TVERSKY_FEATURE_POOLS=0 \
+VOCAB_SIZE=8192 \
+BITNET_GROUP_SIZE=128 \
+BIGRAM_HASH=0 \
+EMBED_DIM=254 \
+TRAINING_DEPTH_RECURRENCE=0 \
+EVAL_DEPTH_RECURRENCE=0 \
+NUM_LAYERS=15 \
+MODEL_DIM=768 \
+NUM_KV_HEADS=4 \
+NUM_HEADS=8 \
+DIFF_ATTN=0 \
+MLP_MULT=4 \
+MLP_GROUPS=0 \
+MATRIX_OPTIMIZER=muon \
+ADAM_LR=0.05 \
+ADAM_WD=0.05 \
+MUON_BACKEND_STEPS=3 \
+MUON_MOMENTUM=0.95 \
+MUON_MOMENTUM_WARMUP_START=0.85 \
+MUON_MOMENTUM_WARMUP_STEPS=500 \
+MUON_WD=0.0 \
+MATRIX_LR=0.04 \
+SCALAR_LR=0.02 \
+TIED_EMBED_LR=0.02 \
+WARMDOWN_FRACTION=0.2 \
+LOGIT_SOFTCAP=10 \
+QK_GAIN_INIT=2.25 \
+ROPE_TYPE=yarn \
+YARN_MAX_LEN=2048 \
+ROPE_BASE=5000 \
+BATCH_TOKENS_START=0 \
+BATCH_SCHEDULE_FRACTION=0.33 \
+TRAIN_BATCH_TOKENS=524288 \
+SEQ_LEN_START=0 \
+SEQ_SCHEDULE_FRACTION=0.0 \
+TRAIN_SEQ_LEN=1024 \
+SMEAR=1 \
+ITERATIONS=50000 \
+WARMUP_STEPS=5 \
+MAX_WALLCLOCK_SECONDS=0 \
+VAL_LOSS_EVERY=0 \
+TRAIN_LOG_EVERY=500 \
+CHURN_LOG_EVERY=1000 \
+VAL_MAX_TOKENS=0 \
+TIE_EMBEDDINGS=1 \
+UNTIE_AT_FRACTION=0.00 \
+HEAD_LR=0.02 \
+CORR_WEIGHT_LR=0.02 \
+ACTIVATION=relu2 \
+SOFTCAP_TYPE=poly \
+MTP_HEADS=0 \
+REFINER=0 \
+REFINER_KERNEL=3 \
+SLIDING_EVAL=1 \
+SLIDING_EVAL_STRIDE=16 \
+SLIDING_BATCH_SIZE=256 \
+TEMP_SCALING=1 \
+FP_STORAGE=FP8 \
+EMA=0 \
+EMA_DECAY=0.995 \
+EMA_START_FRACTION=0.5 \
+SEED=42 \
+COMPILE_MODE=default \
+OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 train_gpt_cuda_binary.py
+```
+
+
+
+## Compliance
+
+- [x] Artifact <=16,000,000 bytes (15,670,651)
+- [x] Sliding window eval stride=16
+- [x] No test-time training on validation data
+- [x] No network calls during evaluation
+- [x] No external compute
+- [x] Train time: **non-record submission** (7,763s / ~2.15h / 50,000 steps)
diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/RESULTS.md b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/RESULTS.md
new file mode 100644
index 0000000000..82fcd581f0
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/RESULTS.md
@@ -0,0 +1,1236 @@
+# Parameter Golf — Complete Experiment Log
+
+**Author:** Ciprian-Florin Ifrim
+**Date:** March 2026
+
+---
+
+## Challenge Overview
+
+Train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8×H100 SXM GPUs, evaluated by tokenizer-agnostic bits-per-byte (BPB) compression on the FineWeb validation set.
+
+- **Baseline:** 1.2244 bpb (9L 512d int8+zlib, 1k vocab)
+- **Our best (ternary, valid):** 1.1565 bpb sliding (P2, 10L 768d relu² 4×MLP fp8, EMBED_DIM=254, seed=42, 16.00MB)
+- **Our best (binary, unconstrained):** 1.1239 bpb sliding (15L 768d binary relu² 4×MLP fp8, 50k steps / ~2h compute, 15.67MB)
+- **Our best (quality, over budget):** 1.1771 bpb (F59, 12L 768d swiglu 3×MLP, 21.96MB)
+- **Challenge period:** March 18 – April 30, 2026
+- **Compute sponsor:** OpenAI ($1M in compute credits)
+
+The challenge is framed as L(N) optimisation — minimising loss given fixed parameter count N, unconstrained by data, compute, steps, or architecture. Related challenges include NanoGPT Speedrun (L(T): lowest loss given constrained time) and NanoGPT Slowrun (L(D): lowest loss given constrained dataset).
+
+---
+
+## Run Numbering Convention
+
+| Prefix | Description |
+|--------|-------------|
+| Plain (1–100) | Dev runs on RTX 5090, 100 steps |
+| R prefix (R1...) | Record runs — 600s on 8×H100, leaderboard-targeted |
+| S prefix (S1...) | Scaling runs — 1500 steps or 300s on 8×H100, controlled sweeps |
+| SB prefix (SB1...) | Binary scaling runs |
+| F prefix (F1...) | Final runs — 600s on 8×H100, official submissions |
+| P prefix (P1...) | Pushed/submission runs — final config pushed to GitHub |
+
+Additionally, 20 early architecture iterations were performed on MLX (Mac Studio M1 Ultra, 32GB unified memory) and 2 on MPS (MacBook Pro M1 Pro, 32GB unified memory) for rapid prototyping before GPU scaling.
+
+> **Note:** This document covers ~85 named runs (F, S, R series). An additional ~165 dev runs (plain numbered 1–100, repeated sweeps, smoke tests) were conducted but are not individually listed. Key findings from those runs are incorporated into the sweep tables and decision rationale. Separate synthetic-data notebooks were used to isolate the behaviour of specific techniques (Tversky similarity, linear alternatives, grouped projections) before committing H100 compute.
+
+---
+
+## Hardware
+
+| System | Spec | Notes |
+|--------|------|-------|
+| Dev | RTX 5090 32GB, single GPU | Triton smem ceiling 101KB/SM; blocks value embeddings and some kernels |
+| Mac (MLX) | Mac Studio M1 Ultra 32GB | MLX early iteration, 20 runs |
+| Mac (MPS) | MacBook Pro M1 Pro 32GB | MPS early iteration, 2 runs |
+| Final | 8×H100 SXM 80GB | Primary training platform |
+
+**Step times at 768d (12L):** relu² 2x: 89ms | relu² 3x: 99ms | relu² 4x: 91ms | swiglu 3x: 127ms | leaky relu 3x: 103ms
+
+**Step times at 512d:** 26L baseline: 149ms → 136ms with FA3 → 127ms with FA3 + fusions + EMBED=256 at 25L
+
+**FlashAttention-3** reduced step time by ~9% (~380 free training steps per 600s run).
+
+**Kernel fusion optimisations** (fused QKV + fused SwiGLU + dataloader + softcap) saved a further ~7-10ms/step.
+
+**Width vs depth discovery:** 12L 768d at 106ms/step gets ~5640 steps in 600s vs ~4720 steps for 25L 512d — 920 extra steps from the faster per-step time of wider/shallower models. Final 10L 768d 4×MLP at 91.8ms/step gets ~6530 steps.
+
+---
+
+## Architecture: Ternary U-Net Transformer
+
+### Quantisation Scheme
+
+BitNet b1.58 ternary quantisation — weights constrained to {−1, 0, +1} with per-group absmean scaling. Approximately 1.6 bits per parameter.
+
+**Compression pipeline:** Base-3 packing (5 trits/byte) or bitmask packing → LZMA (preset=9). Best method auto-selected per run. Bitmask wins when zero fraction is high.
+
+**Quantisation shrinkage fix:** When ternary Q contains zeros, `mean(|Q|) < 1.0`, causing scale mismatch on reload. Fix: inflate by `1/mean(|Q|)` during dequantisation. Eliminates all roundtrip gaps.
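+
+A minimal sketch of the quantise/dequantise roundtrip with this fix (grouping, epsilons, and the function name are illustrative; assumes `w.numel()` is divisible by the group size):
+
+```python
+import torch
+
+def ternary_quant_dequant(w: torch.Tensor, group: int = 128) -> torch.Tensor:
+    g = w.reshape(-1, group)
+    scale = g.abs().mean(dim=1, keepdim=True)        # per-group absmean scale
+    q = (g / (scale + 1e-8)).round().clamp(-1, 1)    # ternary {-1, 0, +1}
+    # Zeros in q make mean(|q|) < 1, so reloaded weights would shrink;
+    # inflate by 1/mean(|q|) during dequantisation to compensate.
+    inflate = 1.0 / q.abs().mean(dim=1, keepdim=True).clamp_min(1e-8)
+    return (q * scale * inflate).reshape(w.shape)
+```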
+
+### U-Net Skip Connections
+
+The model uses a U-Net style encoder/decoder structure with learned skip connections. The first `num_layers // 2` blocks (encoder) store their outputs; the second half (decoder) receives these via `x = x + skip_weight[i] * skips.pop()`. This allows the decoder to simultaneously access high-level semantic representations (from deep processing) and low-level token-level features (from early processing), without requiring the decoder to reconstruct low-level information from the compressed residual stream.
+
+Additionally, each block receives `x0` (the original input embedding) via a learned residual mix: `x = mix[0] * x + mix[1] * x0`, giving every layer direct access to the raw token representation regardless of accumulated residual drift.
+
+For odd layer counts, the decoder receives the larger half (e.g. 27L → 13 encoder + 14 decoder), which is the standard U-Net convention — more processing power applied after skip injection.
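+
+A condensed sketch of this forward pass (block internals and the exact placement of the residual mix relative to the skip injection are illustrative):
+
+```python
+def unet_forward(blocks, skip_weight, mix, x0):
+    x, skips = x0, []
+    n_enc = len(blocks) // 2                 # encoder gets the smaller half
+    for i, block in enumerate(blocks):
+        x = mix[i][0] * x + mix[i][1] * x0   # per-block mix from input embedding
+        if i < n_enc:
+            x = block(x)
+            skips.append(x)                  # encoder: store output for decoder
+        else:
+            if skips:
+                x = x + skip_weight[i - n_enc] * skips.pop()  # learned skip
+            x = block(x)
+    return x
+```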
+
+### Factored Embedding
+
+With `EMBED_DIM=254`, token embedding is `[8192, 254]` instead of `[8192, 768]`, with learned projections `embed_proj` (254→768) and `embed_proj_rev` (768→254) for the tied output head.
+
+**EMBED_DIM history:** Started at 128 (dev runs), upgraded to 256 after an optimizer coverage fix revealed that the projection layers had not been receiving gradients (−0.024 bpb improvement vs 128 once trained), then trimmed to 254 to fit artifact+code under the 16,000,000 byte budget (~0.0004 bpb cost, 0.00018/dim from 128→256 scaling data).
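+
+A minimal sketch of the factored tied embedding (class and method names are illustrative):
+
+```python
+import torch.nn as nn
+
+class FactoredTiedEmbedding(nn.Module):
+    def __init__(self, vocab=8192, embed_dim=254, model_dim=768):
+        super().__init__()
+        self.tok_emb = nn.Embedding(vocab, embed_dim)         # [8192, 254]
+        self.embed_proj = nn.Linear(embed_dim, model_dim, bias=False)      # 254 -> 768
+        self.embed_proj_rev = nn.Linear(model_dim, embed_dim, bias=False)  # 768 -> 254
+
+    def embed(self, idx):
+        return self.embed_proj(self.tok_emb(idx))
+
+    def logits(self, x):
+        # Tied output head: project back to 254, score against the shared table
+        return self.embed_proj_rev(x) @ self.tok_emb.weight.t()
+```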
+
+### Fused Operations
+
+**Fused QKV:** Single `TernaryLinear(dim, dim + 2*kv_dim)`. **Fused SwiGLU/relu²:** Gate and up projections combined into single wide matrix. Combined saving: ~4-6ms/step.
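+
+A sketch of the fusion pattern, with a plain `nn.Linear` standing in for the repo's `TernaryLinear` (dimensions shown for the 768d / 4 KV-head config):
+
+```python
+import torch
+import torch.nn as nn
+
+dim, kv_dim, hidden = 768, 4 * 96, 3072
+qkv = nn.Linear(dim, dim + 2 * kv_dim, bias=False)   # fused Q, K, V
+gate_up = nn.Linear(dim, 2 * hidden, bias=False)     # fused gate + up
+
+x = torch.randn(2, 16, dim)
+q, k, v = qkv(x).split([dim, kv_dim, kv_dim], dim=-1)
+gate, up = gate_up(x).chunk(2, dim=-1)  # same parameters, fewer kernel launches
+```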
+
+### Z-Loss Regularisation
+
+`1e-4 * logsumexp(logits)²` (from PaLM/Gemma) anchors logits near zero, keeping gradients sharp through the ternary STE.
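+
+As a sketch (coefficient per the text; the mean reduction over positions is an assumption):
+
+```python
+import torch
+
+def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
+    lse = torch.logsumexp(logits.float(), dim=-1)  # log partition function
+    return coeff * (lse ** 2).mean()               # anchors logits near zero
+```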
+
+---
+
+## Compression Scheme
+
+### Base-3 + LZMA (Primary)
+
+5 trits per byte (1.585 bits/trit), lossless. LZMA at preset=9 achieves ~39% reduction over int8+zlib. Ternary distribution at convergence: ~20–29% zeros, ~35–40% each ±1. The skewed distribution (more zeros) is exploited by LZMA's entropy coding.
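+
+A sketch of the packing step (NumPy; exact padding and layout are illustrative). Since 3^5 = 243 ≤ 256, five trits fit in one byte; the packed bytes then pass through `lzma.compress(packed.tobytes(), preset=9)`:
+
+```python
+import numpy as np
+
+def pack_base3(trits: np.ndarray) -> np.ndarray:
+    t = (trits.astype(np.int64) + 1).reshape(-1)      # map {-1,0,+1} -> {0,1,2}
+    pad = (-len(t)) % 5
+    t = np.pad(t, (0, pad)).reshape(-1, 5)
+    powers = 3 ** np.arange(5)                        # [1, 3, 9, 27, 81]
+    return (t * powers).sum(axis=1).astype(np.uint8)  # max 242, fits a byte
+
+def unpack_base3(packed: np.ndarray, n: int) -> np.ndarray:
+    digits = (packed[:, None] // 3 ** np.arange(5)) % 3
+    return digits.reshape(-1)[:n].astype(np.int8) - 1  # back to {-1,0,+1}
+```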
+
+### Bitmask Compression (Alternative)
+
+Encodes "is this weight zero?" and "if nonzero, is it +1?" as separate bitmasks. Both methods are tried and the smaller is selected automatically. In practice, bitmask and base-3+LZMA produce nearly identical artifact sizes — bitmask wins marginally in some runs (e.g. S72: 15.84MB vs 15.87MB). Zero fraction would need to drop below ~5% for bitmask to provide a clear advantage; our zero fraction ranges from 17–29% at convergence, making bitmask non-competitive.
+
+### 3D Tensor Support
+
+Conv1d weights (`[dim, dim, kernel]`) are reshaped to 2D before ternary quantisation and restored to original shape on load.
+
+### FP8 QAT
+
+Non-ternary parameters (embeddings, projections) stored at fp8 (e4m3) with Quantisation-Aware Training via STE. Halves fp_params storage (~5MB → ~2.5MB). Typical roundtrip gap: 0.001–0.002 bpb.
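+
+A minimal sketch of the QAT roundtrip via a straight-through estimator (requires a PyTorch build with `torch.float8_e4m3fn`; real code would also handle scaling/clamping to the e4m3 range):
+
+```python
+import torch
+
+def fp8_qat_ste(w: torch.Tensor) -> torch.Tensor:
+    w_q = w.to(torch.float8_e4m3fn).to(w.dtype)  # fp8 (e4m3) roundtrip
+    return w + (w_q - w).detach()                # forward: fp8; backward: identity
+```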
+
+---
+
+## Submission Runs (P prefix) — Ternary
+
+Configuration: F88 (10L 768d relu² 4×MLP fp8, WD=0, EMBED_DIM=254, 599s wallclock, TEMP=0.90)
+
+| Seed | Steps | val_bpb | RT bpb | Sliding bpb | Train Time | Eval Time | Artifact | Budget |
+|------|-------|---------|--------|-------------|------------|-----------|----------|--------|
+| 1337 | 6520 | 1.1825 | 1.1839 | **1.1568** | 599.1s | 428.7s | 15.92MB | 16.00/16.00MB |
+| 42 | 6530 | 1.1816 | 1.1837 | **1.1565** | 599.7s | 429.3s | 15.92MB | 15.99/16.00MB |
+| 7 | 6530 | 1.1823 | 1.1850 | **1.1578** | 599.6s | 429.0s | 15.92MB | 15.99/16.00MB |
+| **Mean** | **6527** | **1.1821** | **1.1842** | **1.1570** | **599.5s** | **429.0s** | **15.92MB** | |
+| **Std** | **5** | **0.0005** | **0.0007** | **0.0007** | **0.3s** | **0.3s** | **0.00MB** | |
+
+All three seeds fit within the 16,000,000 byte budget. The standard deviation of 0.0007 bpb across seeds confirms high reproducibility. All runs achieve p < 0.001 improvement over the 1.2244 bpb baseline.
+
+### Batch Size Sensitivity (Ternary, 599s wallclock)
+
+| Batch Tokens | Steps | ms/step | val_bpb | Sliding bpb | Tokens Seen | Fits Budget |
+|-------------|-------|---------|---------|-------------|-------------|-------------|
+| 262,144 | 10,000 | 49 | 1.2413 | — | 2.6B | No |
+| **524,288** | **6,530** | **92** | **1.1850** | **1.1578** | **3.4B** | **Yes** |
+| 1,048,576 | 3,480 | 172 | 1.1925 | 1.1659 | 3.5B | No |
+
+524k batch tokens is the optimal operating point. Halving the batch (262k) doubles the step count but degrades quality by 0.056 bpb due to noisier gradients interacting poorly with the ternary STE. Doubling it (1M) sees similar total tokens but fewer gradient updates, costing 0.008 bpb.
+
+---
+
+## Current Best Configuration
+
+### Ternary: 10L 768d relu² 4×MLP fp8, WD=0, EMBED_DIM=254
+
+```bash
+NUM_LAYERS=10 MODEL_DIM=768 NUM_HEADS=8
+NUM_KV_HEADS=4 MLP_MULT=4 VOCAB_SIZE=8192
+ACTIVATION=relu2 LOGIT_SOFTCAP=10 SOFTCAP_TYPE=poly
+QK_GAIN_INIT=2.25 ROPE_BASE=5000 ROPE_TYPE=yarn
+YARN_MAX_LEN=2048 EMBED_DIM=254 TIE_EMBEDDINGS=1
+BITNET_GROUP_SIZE=128 FP_STORAGE=FP8 MUON_WD=0.0
+MATRIX_LR=0.04 SCALAR_LR=0.02 TIED_EMBED_LR=0.02
+MUON_BACKEND_STEPS=3 MUON_MOMENTUM=0.95 WARMDOWN_FRACTION=0.2
+MAX_WALLCLOCK_SECONDS=599
+SLIDING_EVAL=1 SLIDING_EVAL_STRIDE=16 TEMP_SCALING=1
+TRAIN_BATCH_TOKENS=524288
+```
+
+| Metric | Value |
+|--------|-------|
+| val_bpb (mean) | 1.1821 |
+| RT bpb (mean) | 1.1842 |
+| Sliding bpb (mean) | 1.1570 |
+| Artifact + code | 15,992,753–15,995,705 / 16,000,000 bytes |
+| Steps | 6520–6530 |
+| ms/step | 91.8 |
+| zero_frac | 0.335–0.336 |
+| optimal_T | 0.90 |
+| Params | 73,685,840 |
+
+---
+
+## Dev Runs (RTX 5090, 100–500 steps)
+
+### Phase 0 — Ternary vs Binary (500 steps, 16L 512d, 1k vocab)
+
+| Run | Config | val_bpb | RT bpb | Artifact | ms/step |
+|-----|--------|---------|--------|----------|---------|
+| 17 | Ternary baseline | 1.7110 | 1.7300 | 23.95MB | 1312 |
+| 18 | Binary {−1,+1} | 1.7121 | 1.7316 | 23.93MB | 1309 |
+
+Ternary wins by 0.0016 bpb. The zero state provides representational benefit.
+
+---
+
+### Phase 1 — Training Techniques (100 steps, 9L 512d, 1k vocab)
+
+| Run | Config | val_bpb | RT bpb | Artifact | Notes |
+|-----|--------|---------|--------|----------|-------|
+| 19 | Ternary 16L 512d baseline | 2.3371 | 2.3793 | 7.33MB | |
+| 20 | + Untie lm_head at 2/3 | 2.3569 | 2.3983 | 8.13MB | Deferred — needs wallclock fix |
+| 21 | + Value embeddings | — | — | — | Blocked: RTX 5090 Triton smem |
+| 22 | + Smear module | 2.3593 | 2.3985 | 7.33MB | Deferred — gate needs many steps |
+| 23 | Baseline 9L 512d | 2.4483 | 2.4768 | 4.45MB | Switched from 16L |
+| 24 | + Polynomial softcap | 2.3981 | 2.4438 | 4.45MB | **−0.033 rt** |
+| 25 | + Seq length schedule | 2.4633 | 2.5106 | 4.45MB | Deferred — recompile cost |
+| 26 | + NorMuon | 2.4018 | 2.4104 | 4.40MB | **−0.033 rt**, 5× smaller RT gap |
+| 27 | + Grad accum delay | 2.6298 | 2.6571 | 4.40MB | Deferred — needs 2000+ steps |
+
+---
+
+### Vocabulary Sweep (100 steps, 9L 512d)
+
+| Run | Vocab | val_bpb | RT bpb | Artifact | Notes |
+|-----|-------|---------|--------|----------|-------|
+| 23 | 1024 | 2.4483 | 2.4768 | 4.45MB | Baseline |
+| 28 | 4096 | 2.0930 | 2.0974 | 6.68MB | −0.32 vs 1k |
+| **29** | **8192** | **1.9946** | **1.9990** | **9.64MB** | **−0.42 vs 1k — largest single win** |
+
+8192 vocab locked. The tokeniser merges ~1.57× more aggressively than 1k, directly reducing BPB. Val token count drops from 63.8M (sp1024) to 40.5M (sp8192) for the same 50k documents.
+
+---
+
+### Activation Sweep (100 steps, 9L 512d, 8k vocab)
+
+| Run | Activation | val_bpb | RT bpb | Artifact | ms/step |
+|-----|-----------|---------|--------|----------|---------|
+| 29 | relu2 | 1.9946 | 1.9990 | 9.64MB | 838 |
+| 30 | relu | 1.9846 | 1.9879 | 9.63MB | 830 |
+| **31** | **SwiGLU** | **1.9704** | **1.9743** | **10.70MB** | **960** |
+| 32 | SwiGLU + MTP(2) | 1.9627 | 1.9672 | 10.69MB | 1111 |
+
+SwiGLU with MTP auxiliary loss gives −0.032 bpb but +16% slower. SwiGLU alone gives −0.025 bpb. MTP deferred.
+
+---
+
+### Embedding Factorization Sweep (100 steps, 9L 512d, 8k vocab)
+
+| Run | EMBED_DIM | val_bpb | RT bpb | RT gap | Artifact |
+|-----|-----------|---------|--------|--------|----------|
+| 33a | 0 (=512) | 1.9931 | 1.9962 | 0.003 | 9.63MB |
+| **33d** | **128** | **1.9656** | **1.9656** | **0.000** | **9.12MB** |
+| 33c | 256 | 2.0538 | 2.1339 | 0.080 | 6.68MB |
+| 33e | 64 | 2.0936 | 2.0968 | 0.003 | 4.49MB |
+| 33f | 1024 | 2.0709 | 2.1845 | 0.114 | 15.60MB |
+
+128 was optimal at dev scale. After an optimizer fix revealed the projection layers had not been training, 256 became optimal at full convergence — see EMBED_DIM Sweep at full convergence.
+
+---
+
+### Tversky Neural Network Investigation
+
+Based on Doumbouya et al. (2025). Three-term Tversky similarity: `S = theta * f(A intersection B) - alpha * f(A - B) - beta * f(B - A)` with learned membership functions.
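+
+A sketch of one way to instantiate this form (sigmoid memberships, elementwise min as the fuzzy intersection, sum as the salience function f; these are assumptions about the general shape, not the repo's exact layer):
+
+```python
+import torch
+
+def tversky_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5):
+    ma, mb = torch.sigmoid(a), torch.sigmoid(b)  # soft memberships in [0, 1]
+    common = torch.minimum(ma, mb)
+    s_common = common.sum(-1)                    # f(A intersection B)
+    s_a_only = (ma - common).sum(-1)             # f(A - B)
+    s_b_only = (mb - common).sum(-1)             # f(B - A)
+    return theta * s_common - alpha * s_a_only - beta * s_b_only
+```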
+
+**Feature count sweep (FP16 features, ternary prototypes, 100 steps, 9L 512d):**
+
+| Run | Features | val_bpb | RT bpb | RT gap | Artifact |
+|-----|----------|---------|--------|--------|----------|
+| — | No Tversky | 1.9751 | 1.9751 | 0.000 | 5.33MB |
+| 38 | 16 | 1.9877 | 2.0186 | 0.031 | 5.46MB |
+| 39 | 32 | 1.9843 | 2.0133 | 0.029 | 5.57MB |
+| 40 | 64 | 1.9790 | 2.0097 | 0.031 | 5.79MB |
+| **41** | **128** | **1.9427** | **1.9865** | **0.044** | **6.20MB** |
+| 42 | 256 | 1.9737 | 2.0863 | 0.113 | 5.63MB |
+| 43 | 512 | 2.0036 | 2.0965 | 0.093 | 5.90MB |
+| 44 | 128 + shrinkage fix | 1.9425 | **1.9425** | **0.000** | 6.20MB |
+
+Tversky showed genuine quality benefit (~-0.017 bpb) at dev scale with 128 features and fp16 prototype storage. However, subsequent investigation at full convergence (12L 768d) and with corrected prototype storage showed all Tversky variants within noise of the linear baseline. Additional experiments included full ternary prototypes, shared feature pools across layers, no-features mode, logit-head application, and different membership functions (sigmoid, poly, tanh). A synthetic-data notebook confirmed that Tversky's asymmetric similarity only helps on tasks with genuine directional feature relationships (hypernym/hyponym, cause/effect); next-token prediction on FineWeb web text is not such a task.
+
+At the 768d architecture with relu², Tversky also incurred a 19ms/step overhead because the smaller MLP no longer masked the compute cost.
+
+**Conclusion:** Tversky is quality-neutral on FineWeb language modelling regardless of configuration. Not a quantisation issue, not an optimizer issue — the task simply does not benefit from asymmetric similarity.
+
+---
+
+### Key Hyperparameter Sweeps (100 steps, 9L 512d, 8k vocab)
+
+**QK_GAIN_INIT sweep:**
+
+| Run | QK_GAIN | val_bpb | Delta |
+|-----|---------|---------|-------|
+| 75 | 1.0 | 2.0007 | +0.0076 |
+| 73 | 1.5 | 1.9931 | baseline |
+| 81 | 2.15 | 1.9913 | −0.0018 |
+| **79** | **2.25** | **1.9898** | **−0.0033** |
+| 77 | 2.5 | 1.9915 | −0.0016 |
+| 80 | 2.75 | 1.9975 | +0.0044 |
+| 78 | 3.0 | 2.0011 | +0.0080 |
+
+Clear inverted-U response. **QK_GAIN_INIT=2.25 locked.**
+
+**LOGIT_SOFTCAP sweep:**
+
+| Run | SOFTCAP | val_bpb | Delta |
+|-----|---------|---------|-------|
+| 74 | 5 | 1.9942 | −0.0013 |
+| **73** | **10** | **1.9931** | **−0.0024** |
+| 72 | 20 | 1.9935 | −0.0020 |
+| 71 | 50 | 1.9957 | +0.0003 |
+
+**LOGIT_SOFTCAP=10 locked.**
+
+**Softcap type (poly vs tanh):**
+
+| Run | Type | val_bpb | Notes |
+|-----|------|---------|-------|
+| S23 | poly | 1.3680 | |
+| S24 | tanh | 1.3693 | |
+| S28/S29 | both at EMBED=1024 | 1.3460–1.3462 | Identical at convergence |
+
+Zero effect. Polynomial retained as default.
+
+**ROPE_BASE sweep:**
+
+| Run | ROPE_BASE | val_bpb | Notes |
+|-----|-----------|---------|-------|
+| **70** | **5000** | **1.9959** | Best at short training |
+| 73 | 10000 | 1.9931 | Close second |
+| 69 | 20000 | 2.0008 | |
+| 68 | 50000 | 2.0017 | |
+
+**KV Heads:**
+
+| Run | KV_HEADS | val_bpb | Artifact |
+|-----|----------|---------|----------|
+| **58** | **4 (GQA)** | **1.9955** | **7.75MB** |
+| 66 | 8 (MHA) | 2.0148 | 8.46MB |
+
+**MLP_MULT:**
+
+| Run | MLP_MULT | val_bpb | Artifact |
+|-----|----------|---------|----------|
+| **58** | **2** | **1.9955** | **7.75MB** |
+| 64 | 3 | 2.0004 | 9.09MB |
+| 65 | 4 | 1.9992 | 10.39MB |
+
+**Storage precision:**
+
+| Run | Storage | val_bpb | RT bpb | RT gap | Artifact |
+|-----|---------|---------|--------|--------|----------|
+| **90** | **fp16** | **1.9656** | **1.9656** | **0.000** | **9.06MB** |
+| 91 | fp8 | 1.9662 | 1.9702 | 0.004 | 7.83MB |
+| 92 | fp4 | 1.9661 | 1.9955 | 0.029 | 7.11MB |
+
+**TTT-LoRA sweep (100 steps, ROPE=5000):**
+
+| Run | Rank | LR | TTT bpb | Delta |
+|-----|------|-----|---------|-------|
+| **85** | **8** | **0.01** | **1.9368** | **−0.0315** |
+| 86 | 8 | 0.005 | 1.9378 | −0.0312 |
+| 87 | 8 | 0.02 | 1.9644 | −0.0038 |
+| **88** | **4** | **0.01** | **1.9371** | **−0.0285** |
+| 89 | 16 | 0.01 | OOM | — |
+
+TTT confirmed working at dev scale (−0.0315 bpb). Incompatible at convergence — see TTT investigation.
+
+**EMBED_DIM sweep at 512d (12L, 100 steps):**
+
+| Run | EMBED_DIM | Tversky feat | RT bpb | Artifact | bpb/MB efficiency |
+|-----|-----------|-------------|--------|----------|-------------------|
+| 95 | 64 | 128 | 2.1961 | 8.40MB | worst |
+| 98 | 96 | 128 | 2.0356 | 8.74MB | |
+| 97 | 128 | 128 | 1.9656 | 9.12MB | best |
+| 99 | 192 | 128 | 2.0409 | 10.07MB | |
+| 94 | 256 | 128 | 2.0703 | 10.93MB | |
+| 100 | 256 | 256 | 2.0340 | 10.09MB | RT gap 0.021 |
+| 96 | 512 (off) | 128 | 2.0642 | 13.50MB | |
+
+128 confirmed optimal at dev scale.
+
+---
+
+### Architecture Sizing Table (Ternary, EMBED_DIM=128, standard proj)
+
+| Config | Layers | Artifact | Under 16MB? | RT gap | Headroom |
+|--------|--------|----------|-------------|--------|----------|
+| fp16 | 20 | 14.23MB | Yes | 0.0001 | 1.77MB |
+| **fp16** | **22** | **15.48MB** | **Yes** | **0.0001** | **0.52MB** |
+| fp16 | 24 | 16.74MB | No | — | −0.74MB |
+| fp8 QAT | 24 | 14.63MB | Yes | 0.028 | 1.37MB |
+| fp8 QAT | 26 | 15.77MB | Yes | 0.066 | 0.23MB |
+| **fp8 QAT** | **27** | **15.42MB** | **Yes** | **0.0025** | **0.58MB** |
+| fp8 QAT | 28 | 15.92MB + code | Marginal | 0.0029 | ~0MB |
+| fp8 QAT | 30 | 16.92MB | No | 0.0029 | −0.92MB |
+
+---
+
+## H100 Record Runs (R prefix)
+
+**Hardware:** 8×H100 SXM 80GB | **Time limit:** 600 seconds
+
+| Run | Config | Steps | val_bpb | RT bpb | Artifact | Notes |
+|-----|--------|-------|---------|--------|---------|-------|
+| R1 | 22L Tversky fp16 | 4299 | 1.2789 | 1.2792 | 15.80MB | |
+| R2 | 26L standard fp16 | 3973 | 1.2649 | 1.2650 | 15.85MB | Pre-LR tuning best |
+| R3 | 16L Tversky fp16 | 5949 | 1.2900 | 1.2904 | 11.95MB | Too shallow |
+| R4 | 9L Tversky fp16 | 10112 | 1.3374 | 1.3394 | 7.48MB | Way too shallow |
+| R5 | 30L fp8 | 2852 | 1.2689 | 1.2815 | 17.22MB | Over budget |
+| R6 | 26L fp16, 2× LR | ~4003 | 1.2991 | — | ~15.85MB | LR overshot |
+| **R7** | **26L fp16, LR=0.02** | **4008** | **1.2608** | **1.2610** | **15.83MB** | **Best pre-FA3** |
+| R8 | 26L fp16, LR=0.01 | 4017 | 1.2853 | 1.2855 | 15.72MB | LR too low |
+| R9 | 26L BigramHash | 4010 | 1.2804 | 1.2802 | 15.81MB | BigramHash negative |
+| R10 | 26L untie@66% | 3706 | 1.2754 | 1.2753 | 23.15MB | Over budget |
+| R11 | 26L tied, updated code | 4009 | 1.2806 | 1.2808 | 15.81MB | Code regression |
+
+**LR sweep (R-series):**
+
+| LR | val_bpb | Notes |
+|----|---------|-------|
+| 0.08 | 1.2991 | Overshoots — ternary STE amplifies gradient noise |
+| **0.02** | **1.2608** | **Optimal** |
+| 0.01 | 1.2853 | Too slow |
+
+---
+
+## Scaling Runs (S prefix)
+
+**Hardware:** 8×H100 SXM 80GB | **Steps:** 1500 | **Timer:** disabled (MAX_WALLCLOCK_SECONDS=0)
+**Base config:** 26L 512d, EMBED_DIM=128, ROPE=5000, QK_GAIN=2.25, SOFTCAP=10, LR=0.02 all, VOCAB=8192, SwiGLU, SEED=1337
+
+---
+
+### Warmdown Sweep
+
+| Run | Fraction | val_bpb |
+|-----|----------|---------|
+| S3 | 10% | 1.3467 |
+| **S1** | **20%** | **1.3438** |
+| S2 | 30% | 1.3443 |
+| S4 | 30% repeat | 1.3458 |
+| S5 | 40% | 1.3501 |
+
+S2 vs S4 (identical config): 0.0015 bpb spread — confirmed seed variance floor.
+
+### Muon Backend Steps
+
+| Run | Steps | ms/step | val_bpb |
+|-----|-------|---------|---------|
+| S8 | 3 | 144.87 | 1.3491 |
+| S9 | 4 | 146.61 | 1.3448 |
+| **S1** | **5** | **149.19** | **1.3438** |
+| S7 | 8 | 164.31 | 1.3441 |
+| S6 | 10 | 157.95 | 1.3456 |
+
+At full convergence (F6 vs F1): 3 steps matches 5 due to +190 extra training steps. Locked at 3.
+
+### Muon Momentum
+
+| Run | Momentum | val_bpb | zero_frac | Artifact |
+|-----|----------|---------|-----------|---------|
+| S11 | 0.90 | 1.3680 | 0.179 | 15.39MB |
+| **S1** | **0.95** | **1.3438** | **0.205** | **15.56MB** |
+| S10 | 0.99 | 1.3505 | 0.259 | 15.78MB |
+
+Higher momentum increases zero_frac, inflating artifact size.
+
+### Architecture Experiments
+
+| Run | Config | ms/step | val_bpb | Notes |
+|-----|--------|---------|---------|-------|
+| S12 | 20L 640d (80M params) | 160.58 | 1.6676 | 17.75MB — over budget |
+| **S1** | **26L 512d baseline** | **149.19** | **1.3438** | **Reference** |
+| S13 | 26L, TRAINING_DR=2 | 281.63 | 1.3727 | ~795 effective steps, OOM at DR=3 |
+
+### Eval Depth Recurrence Sweep
+
+| Run | EVAL_DR | val_bpb |
+|-----|---------|---------|
+| S15 | 0/1 | 1.3685–1.3690 |
+| S16 | 2 | 1.3688 |
+| S17 | 3 | 1.3681 |
+| S18 | 4 | 1.3690 |
+| S19 | 5 | 1.3683 |
+
+Total range: 0.0009 bpb — pure noise.
+
+### Weight Decay (1500 steps)
+
+| Run | MUON_WD | val_bpb | zero_frac | Artifact |
+|-----|---------|---------|-----------|---------|
+| **S15** | **0.00** | **1.3685** | **0.179** | **15.39MB** |
+| S20 | 0.04 | 1.3722 | 0.145 | 15.12MB |
+
+WD hurts at 1500 steps but saves 0.27MB. Reversed at full convergence — see Final Ternary Record Runs.
+
+### BigramHash
+
+| Run | Config | Steps | val_bpb | Artifact |
+|-----|--------|-------|---------|---------|
+| S21 | 26L + BigramHash | 1500 | 1.3681 | 15.45MB |
+| R9 | 26L + BigramHash | 4010 | 1.2804 | 15.81MB |
+
+At full convergence: 0.020 bpb worse than R7. The 2.1MB fp16 cost of the bigram table displaces ternary layer depth at convergence. **Not viable within budget.**
+
+### Tied Embedding / Correction Weight / Untie Investigation
+
+| TIE_EMBEDDINGS | UNTIE_AT_FRACTION | LM_HEAD_RANK | Behaviour |
+|---------------|-------------------|--------------|-----------|
+| 0 | any | any | Untied from start — unstable, loss = log(8192) = 9.01 |
+| 1 | 0.0 | 0 | Always tied — current best |
+| 1 | 0.66 | 0 | Tied → full-rank untie at 66% of wallclock |
+| 2 | 0.0 | 0 | Tied + correction weight residual on tok_emb |
+| 2 | 0.66 | 0 | Tied + correction → full-rank untie at 66% |
+| 2 | 0.66 | r | Tied + correction → SVD rank-r untie at 66% |
+
+**1500-step results:**
+
+| Run | TIE | UNTIE | RANK | val_bpb | Artifact |
+|-----|-----|-------|------|---------|---------|
+| S15 | 1 | 0.00 | 0 | 1.3685 | 15.39MB |
+| S30 | 2 | 0.00 | 0 | 1.3678 | 15.39MB |
+| S36 | 1 | 0.66 | 0 | 1.3648 | 22.83MB |
+| **S37** | **2** | **0.66** | **0** | **1.3642** | **22.84MB** |
+| S38 | 1 | 0.66 | 0 | 1.3667 | 22.84MB |
+| S39 | 0 | 0.66 | 0 | 3.4890 | 10.88MB |
+
+Untie gives +0.005 bpb gain but adds 7.3MB — over budget. **TIE=1, no untie locked.**
+
+### LM Head Factorization (SVD-at-Untie)
+
+| Run | RANK | val_bpb | Artifact | Delta vs baseline |
+|-----|------|---------|---------|-------------------|
+| S37 | 0 (full) | 1.3642 | 22.84MB | +0.004 — over budget |
+| S43 | 32 | 1.4873 | 17.27MB | −0.119 |
+| S41 | 64 | 1.4243 | 17.60MB | −0.056 |
+| S42 | 128 | 1.3889 | 18.40MB | −0.020 |
+
+SVD factorization does not recover within the remaining 34% of training. The model requires full-rank lm_head for 8192-class separability in 512-dimensional space.
+
+### Tied Embed LR Sweep
+
+| Run | TIED_EMBED_LR | MATRIX_LR | SCALAR_LR | val_bpb |
+|-----|--------------|-----------|-----------|---------|
+| S33 | 0.01 | 0.02 | 0.02 | 1.3723 |
+| **S15** | **0.02** | **0.02** | **0.02** | **1.3685** |
+| S34 | 0.03 | 0.02 | 0.02 | 1.3742 |
+
+Symmetric degradation. **TIED_EMBED_LR=0.02 locked.**
+
+### TTT-LoRA Investigation
+
+Test-time training with per-document LoRA adapters. Confirmed working at dev scale (−0.0315 bpb). Incompatible at convergence across 6 diagnostic runs.
+
+| Run | Config | val_bpb | TTT bpb | Notes |
+|-----|--------|---------|---------|-------|
+| S22 | TTT_LR=0.01 | 1.3690 | 1.5065 | TTT hurts |
+| S23 | No lm_head_lora | 1.3690 | 1.4993 | Still hurts |
+| S24 | tanh softcap | 1.3693 | 1.4982 | No improvement |
+| S25 | Q/V loras only | 1.3692 | 1.5193 | Worse |
+| S26 | EMBED_DIM=1024 | 1.3473 | 1.4746 | Bottleneck not cause |
+| S27 | 9L (original depth) | 1.4039 | 1.5189 | Still incompatible at 9L |
+
+**Root cause:** Every `TernaryLinear` applies RMSNorm to its input before the weight multiply. The LoRA adapter delta is computed on the pre-normalised representation, but injected into a forward pass where base weights operate on a differently-normalised space. At 100 steps the model is poorly calibrated and LoRA signal dominates. At convergence, the base model's representations are precisely calibrated to this normalised space, and any LoRA delta corrupts rather than adapts. This incompatibility is architectural. **TTT permanently disabled.**
+
+### MTP (Multi-Token Prediction)
+
+| Run | MTP_HEADS | ms/step | val_bpb | Notes |
+|-----|-----------|---------|---------|-------|
+| **S47** | **0** | **149** | **1.3693** | **Baseline** |
+| S45 | 2 | 157 | 1.3704 | +0.0011 worse |
+| S62 | 2 | 144 | 1.3727 | +0.0034 worse |
+
+Confirmed at both 1500 steps and full convergence (post-fix retest: 0.006 bpb worse at both MTP=1 and MTP=2). A 60M+ parameter, 1.58-bit model does not have the parameter bandwidth for auxiliary future-planning objectives.
+
+### Smear Module
+
+| Run | SMEAR | val_bpb | ms/step |
+|-----|-------|---------|---------|
+| **S48** | **0** | **1.3687** | **149** |
+| S49 | 1 | 1.3675 | 182 |
+
++22% slower, −0.0012 bpb at 1500 steps. At full 600s wallclock, smear costs ~740 fewer training steps. Not viable within the ternary 10-minute budget but explored further in the binary track.
+
+### Sequence Length Schedule
+
+| Run | Config | val_bpb | ms/step avg |
+|-----|--------|---------|-------------|
+| S48 | baseline | 1.3687 | 149 |
+| S51 | smear + seq@33% | 1.3660 | ~240 |
+| S52 | smear + seq@33% repeat | 1.3640 | ~221 |
+| **S58** | **smear + seq@33% + YaRN** | **1.3628** | **~221** |
+
+Real gain at 1500 steps but severe step penalty at full 600s. **Disabled for final runs.**
+
+### Batch Size Schedule
+
+| Run | Config | val_bpb |
+|-----|--------|---------|
+| S48 | baseline | 1.3687 |
+| S50 | smear + batch | 1.3698 |
+| S53 | smear + seq + batch | 1.3667 |
+
+Noisier gradients interfere with ternary STE convergence. **Not viable.**
+
+### YaRN Positional Encoding
+
+| Run | Config | val_bpb |
+|-----|--------|---------|
+| S48 | RoPE baseline | 1.3687 |
+| S54 | YaRN 4096 | 1.3705 |
+| S55 | YaRN 2048 | 1.3679 |
+| S56 | YaRN 2048 + seq@33% | 1.3672 |
+| S57 | YaRN 2048 + seq@50% + smear | 1.3637 |
+| **S58** | **YaRN 2048 + seq@33% + smear** | **1.3628** |
+
+YaRN 4096 hurts (scale=0.25 too aggressive). YaRN 2048 marginally better. **YaRN 2048 retained; seq schedule disabled.**
+
+ROPE_BASE with YaRN: S63 (10000) = 1.3692, **S61 (5000) = 1.3686**. ROPE_BASE=5000 locked.
+
+### Sliding Window Evaluation
+
+| Run | Stride | Sliding bpb | Eval time |
+|-----|--------|-------------|-----------|
+| S60 | 16 | 1.3452* | >600s |
+| S67 | 24 | 1.3146 | 592s |
+| **S61/S66** | **32** | **1.3139–1.3452*** | **~350s** |
+
+*S60/S61 used incorrect momentum=0.90. At full convergence (F1): stride=32 gives 1.2312 sliding bpb in 280s.
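+
+For reference, a sketch of what stride-s sliding evaluation computes (edge handling and batching elided; the function and its arguments are illustrative). Each window of length `seq_len` scores only its last `stride` tokens, so nearly every token is predicted with close-to-full left context:
+
+```python
+import math
+import torch
+
+@torch.no_grad()
+def sliding_bpb(model, tokens, total_bytes, seq_len=1024, stride=16):
+    nll_nats = 0.0
+    for start in range(0, tokens.numel() - seq_len - 1, stride):
+        window = tokens[start : start + seq_len + 1]
+        logits = model(window[:-1].unsqueeze(0))[0]          # [seq_len, vocab]
+        logp = torch.log_softmax(logits[-stride:].float(), dim=-1)
+        targets = window[-stride:].unsqueeze(-1)
+        nll_nats -= logp.gather(-1, targets).sum().item()
+    return nll_nats / math.log(2) / total_bytes              # bits per byte
+```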
+
+### Temperature Scaling
+
+Grid search over T in [0.80, 1.20] on 65,536 training tokens. 5-point grid. Optimal T was consistently 1.00 at convergence for the 512d SwiGLU architecture. At the 768d relu² architecture, T=0.90 was consistently optimal (relu² logits slightly underconfident). **TEMP_SCALING=1 in all final runs.**
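+
+A sketch of the calibration (uniform grid points over the stated range are an assumption; logits are divided by T before the softmax):
+
+```python
+import torch
+
+@torch.no_grad()
+def calibrate_temperature(model, tokens, grid=(0.80, 0.90, 1.00, 1.10, 1.20)):
+    logits = model(tokens[:-1].unsqueeze(0))[0].float()   # one pass, reuse logits
+    best_t, best_nll = 1.0, float("inf")
+    for t in grid:
+        logp = torch.log_softmax(logits / t, dim=-1)
+        nll = -logp.gather(-1, tokens[1:].unsqueeze(-1)).sum().item()
+        if nll < best_nll:
+            best_t, best_nll = t, nll
+    return best_t
+```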
+
+### Group Size Sweep (S73–S76, 2000 steps, 27L)
+
+| Run | Group Size | Layers | val_bpb | Artifact | Total |
+|-----|-----------|--------|---------|----------|-------|
+| S76 | 32 | 27 | 1.2739 | 17.64MB | 17.73MB |
+| S75 | 64 | 27 | 1.2683 | 16.22MB | 16.31MB |
+| **S73** | **128** | **27** | **1.2677** | **15.53MB** | **15.62MB** |
+| S74 | 256 | 27 | 1.2699 | 15.19MB | 15.28MB |
+
+128 wins on both quality and compression.
+
+### Skip Weights Init — Zero vs Ones (S77)
+
+| Run | Init | val_bpb | artifact |
+|-----|------|---------|---------|
+| S73 | ones | 1.2677 | 15.62MB |
+| S77 | zeros | 1.2781 | 15.62MB |
+
+Zero-init is **0.0104 bpb worse**. Decoder needs skip signal from step 0.
+
+### FP8/FP4 Storage with QAT
+
+**FP8 sweep:**
+
+| Run | Config | val_bpb | RT bpb | RT gap | Sliding bpb | Artifact |
+|-----|--------|---------|--------|--------|-------------|---------|
+| S64 | 26L fp16 | 1.3390 | 1.3390 | 0.000 | 1.3150 | 15.58MB |
+| S65 | 30L fp8, no QAT | 1.3346 | 1.3394 | 0.0048 | 1.3150 | 16.92MB |
+| S66 | 30L fp8, QAT | 1.3351 | 1.3380 | 0.0029 | **1.3139** | 16.92MB |
+| S71 | 27L fp8, QAT | 1.3380 | 1.3405 | 0.0025 | 1.3164 | 15.42MB |
+| S72 | 28L fp8, QAT | 1.3377 | 1.3406 | 0.0029 | 1.3166 | 15.92MB |
+
+QAT reduces fp8 RT gap from 0.0048 to 0.0029 (40% improvement). However at full convergence (F3), 28L fp8 QAT (1.2353 sliding) loses to 26L fp16 (1.2312 sliding).
+
+**FP4 sweep:**
+
+| Run | Config | val_bpb | RT bpb | RT gap | Sliding bpb | Artifact |
+|-----|--------|---------|--------|--------|-------------|---------|
+| S68 | 30L fp4 QAT | 1.3377 | 1.3643 | **0.0266** | 1.3404 | 16.49MB |
+| S69 | 26L fp4 Tversky QAT | 1.3543 | 1.3835 | **0.0292** | 1.3606 | 15.01MB |
+| S70 | 28L fp4 QAT | 1.3405 | 1.3666 | **0.0261** | 1.3424 | 15.43MB |
+
+FP4 RT gap of ~0.026–0.029 even with QAT is unrecoverable. **FP4 not viable at any layer count.**
+
+### EMBED_DIM Sweep (Full Convergence, 25L)
+
+| Config | EMBED_DIM | Steps | val_bpb | sliding_bpb | artifact | Notes |
+|--------|-----------|-------|---------|-------------|---------|-------|
+| S80 | 0 (=512) | 4500 | 1.1902 | ~1.168 est | 19.78MB | OOM on sliding eval |
+| **F22** | **256** | **4720** | **1.2012** | **1.1739 (s16)** | **16.21MB** | **Best 512d result** |
+| F16-era | 128 | 4310 | 1.2245 | — | 16.19MB | Pre-fix baseline |
+
+**EMBED_DIM=256 locked.** Budget impact: fp_params ~4.85MB vs ~2.48MB at 128 (+2.37MB).
+
+---
+
+## Final Ternary Record Runs (F prefix)
+
+**Hardware:** 8×H100 SXM 80GB | **FlashAttention-3 enabled** | **Time limit:** 600 seconds
+
+| Run | Config | Steps | val_bpb | RT bpb | Sliding bpb | Eval time | Artifact |
+|-----|--------|-------|---------|--------|-------------|-----------|---------|
+| **F1** | **26L fp16, no smear, no seq** | **4362** | **1.2560** | **1.2560** | **1.2312** | **280s** | **15.85MB** |
+| F2 | 26L fp16, smear + seq@33% | 3044 | 1.2779 | 1.2778 | 1.2535 | 390s | 15.85MB |
+| F3 | 28L fp8 QAT, no smear, no seq | 4019 | 1.2571 | 1.2601 | 1.2353 (s24) | 385s | 16.14MB |
+| F4 | 26L fp16, EMA=1 | 4145 | 1.2589 | 2.3307 | — | — | 14.52MB |
+| F5 | 26L fp16, EMA fix v1 (smoke) | 407 | 1.5483 | 2.3642 | — | — | 14.90MB |
+| F6 | 26L fp16, MUON_BACKEND_STEPS=3 | 4552 | 1.2558 | 1.2558 | 1.2311 (s24) | 362s | 15.81MB |
+| F7 | 26L fp16, WD=0.04, steps=3 | 4499 | 1.2552 | 1.2551 | 1.2302 (s24) | 362s | 15.60MB |
+| F8 | 28L fp16, WD=0.04, steps=2, LR=0.02 | 4219 | 1.2799 | 1.2801 | 1.2558 (s16) | 577s | 15.92MB |
+| F9 | 28L fp16, WD=0.04, steps=2, LR=0.03 | 4231 | 1.2673 | 1.2676 | 1.2431 (s16) | 577s | 16.00MB |
+| F10 | 28L fp16, WD=0.04, steps=2, LR=0.04 | 4226 | 1.2636 | 1.2636 | 1.2391 (s16) | 578s | 16.01MB |
+| F11 | 28L fp16, WD=0.04, steps=3, LR=0.04 | 4137 | 1.2489 | 1.2488 | — | — | 16.69MB |
+| F12 | 28L fp16, WD=0.04, steps=4, LR=0.04 | 4047 | 1.2496 | 1.2500 | — | — | 16.71MB |
+| F13 | 28L fp16, WD=0.04, steps=3, LR=0.05 | 4048 | 1.2512 | 1.2510 | — | — | 16.73MB |
+| F14 | 28L fp16, WD=0.04, steps=3, LR=0.08 | 4036 | 1.2576 | 1.2574 | — | — | 16.75MB |
+| F15 | 27L fp16, AdamW matrix, LR=0.01 | 4676 | 1.2943 | 1.2942 | — | — | 15.71MB |
+| F16 | 27L fp16, Muon, LR=0.04, WD=0.04 | 4310 | 1.2245 | — | — | — | 16.19MB |
+| **F22** | **25L fp16, EMBED=256, steps=3, WD=0.04** | **4720** | **1.2012** | **1.2011** | **1.1739 (s16)** | **493s** | **16.21MB** |
+
+**Key findings:** F22 with EMBED_DIM=256 and corrected optimizer achieves 0.055 bpb improvement over F1 (the best pre-fix config). 28L extensively attempted (F8–F14) but artifact always over budget at competitive LR. AdamW for matrix params (F15) is clearly worse than Muon.
+
+---
+
+## Phase 2 — Post-Optimizer-Fix Experiments (25L 512d EMBED=256)
+
+### EMA (Exponential Moving Average)
+
+| Run | Config | Steps | val_bpb | RT bpb | Artifact |
+|-----|--------|-------|---------|--------|----------|
+| F4 | EMA=1, decay=0.999 | 4145 | 1.2589 | 2.3307 | 14.52MB |
+| — | Full run with EMA | 4144 | 1.2584 | 1.3776 | 14.94MB |
+
+**EMA is fundamentally incompatible with ternary quantisation.** EMA averaging in fp32 produces smoother, more zero-centred weights. More latent weights near zero → more rounded to 0 in ternary → scale-factor mismatch → ~0.12 bpb RT gap. **Permanently disabled.**
+
+### Muon Backend Steps — Full Convergence
+
+| Run | Steps | step_avg | val_bpb | sliding_bpb | artifact |
+|-----|-------|----------|---------|-------------|---------|
+| F1 (steps=5) | 4362 | 137ms | 1.2560 | 1.2312 | 15.85MB |
+| F6 (steps=3) | 4552 | 131ms | 1.2558 | 1.2311 | 15.81MB |
+
+6ms/step saving → 190 extra steps → quality equivalent. **MUON_BACKEND_STEPS=3 locked.**
+
+### Weight Decay — Full Convergence
+
+| Run | WD | Steps | val_bpb | sliding_bpb | zero_frac | artifact |
+|-----|-----|-------|---------|-------------|-----------|---------|
+| F6 | 0.00 | 4552 | 1.2558 | 1.2311 | 0.294 | 15.81MB |
+| F7 | 0.04 | 4499 | 1.2552 | 1.2302 | 0.221 | 15.60MB |
+
+WD=0.04 wins at full convergence on the 26L architecture. However at 10L 4×MLP (Phase 4), WD=0.00 was better — wider MLP needs full weight freedom.
+
+### MTP Retest (Post-Fix)
+
+| Run | MTP_HEADS | Steps | step_avg | val_bpb | artifact |
+|-----|-----------|-------|----------|---------|---------|
+| F22 baseline | 0 | 4720 | 127ms | 1.2012 | 16.29MB |
+| Run 26 | 1 | 4560 | 131ms | 1.2074 | 16.30MB |
+| Run 27 | 2 | 4420 | 135ms | 1.2074 | 16.29MB |
+
+**MTP confirmed not viable post-fix.** 0.006 bpb worse at both heads. **MTP_HEADS=0 permanently locked.**
+
+### Tversky Phase 2 (Post-Fix, 12L 768d, fp16 Prototypes)
+
+Comprehensive retest with corrected optimizer and fp16 prototype storage:
+
+| Run | Config | Features | Pools | val_bpb | RT gap |
+|-----|--------|----------|-------|---------|--------|
+| 49 | No Tversky | — | — | **1.1888** | 0.0002 |
+| 50 | Attn proj only | 128 | 1 | 1.1893 | 0.0000 |
+| 51 | Attn proj only | 256 | 1 | 1.1894 | 0.0001 |
+| 52 | Attn proj only | 32 | 1 | 1.1898 | 0.0001 |
+| 53 | Attn + head | 128 | 1 | 1.1892 | — |
+| 54 | Attn + head | 128 | 0 (local) | 1.1897 | +0.0006 |
+
+All variants within 0.001–0.002 bpb of baseline — pure noise. Confirmed by synthetic-data analysis that Tversky's asymmetric similarity only helps on tasks with directional feature relationships, which next-token prediction on web text is not.
+
+---
+
+## Phase 3 — Architecture Exploration (Post-Optimizer-Fix)
+
+### Width vs Depth
+
+The central Phase 3 finding: wider models with fewer layers beat deeper models.
+
+#### 768d Scaling Curve
+
+| Run | Layers | Steps | step_avg | val_bpb | Artifact |
+|-----|--------|-------|----------|---------|----------|
+| 34 | 8 | 8110 | 74ms | 1.2894 | 12.94MB |
+| 30 | 12 | 5640 | 106ms | 1.1893 | 17.50MB |
+| 38 | 14 | 4900 | 122ms | 1.1870 | 19.79MB |
+| 33/37 | 16 | 4320 | 139ms | 1.1825–1.1837 | 22.08MB |
+| 39 | 18 | 3870 | 155ms | 1.1801 | 24.39MB |
+| 36 | 20 | 3510 | 171ms | 1.1854 | 26.67MB |
+
+Peak at 18L, then step penalty dominates. 8L collapses (U-Net encoder too shallow). Seed variance: Run 33 vs 37 = 0.0012 bpb.
+
+#### Cross-Architecture Comparison
+
+| Config | Layers | Dim | Steps | val_bpb |
+|--------|--------|-----|-------|---------|
+| F22 | 25 | 512 | 4720 | 1.2012 |
+| Run 30 | 12 | 768 | 5640 | 1.1893 |
+| Run 40 | 8 | 1024 | 5870 | 1.1858 |
+| Run 41 | 10 | 896 | 5400 | 1.1862 |
+| Run 35 | 20 | 640 | 4170 | 1.1927 |
+| Run 42 | 6 | 896 | 8510 | 1.2157 |
+
+Width beats depth: 12L 768d (1.1893) beats 25L 512d (1.2012). Minimum viable depth: 768d ~10–12L, 896d ~10L, 1024d ~8L.
+
+### FP8 at 768d
+
+| Run | Layers | Storage | val_bpb | RT bpb | RT gap |
+|-----|--------|---------|---------|--------|--------|
+| 49 | 12 | fp16 | 1.1888 | 1.1886 | 0.0002 |
+| 42 | 13 | fp8 | 1.1879 | 1.1900 | 0.0021 |
+
+FP8 RT gap acceptable at 768d. Enables extra layers within budget.
+
+### LM_HEAD_RANK Investigation (Post-Fix, 768d)
+
+| Run | Config | val_bpb | RT bpb | Total | Notes |
+|-----|--------|---------|--------|-------|-------|
+| Run 49 | baseline | 1.1888 | 1.1886 | 17.50MB | Reference |
+| Run 43 | TIE=2, rank=256, fp8 | 1.2021 | 1.2028 | 20.41MB | Artifact bloated |
+| Run 44 | TIE=0, rank=512, untie=0.0 | 1.3196 | 1.3195 | 16.92MB | Random head, no learning |
+| Run 45 | TIE=2, rank=512, fp16 | 1.2312 | 1.2317 | 26.87MB | Catastrophic artifact blowup |
+
+Root cause: the SVD factors U and V require fp16/fp8 precision to maintain approximation quality. At any viable compression level, the two new matrices cost more storage than the original tied embedding saves. **Not viable.**
+
+---
+
+## Phase 4 — Final Architecture Search
+
+### Activation Sweep (12L 768d 3×MLP, 600s)
+
+| Run | Activation | MLP | ms/step | Steps | val_bpb | Artifact |
+|-----|-----------|-----|---------|-------|---------|----------|
+| F55 | relu | 2× | 88.7 | 6760 | 1.2284 | 14.49MB |
+| **F56** | **relu²** | **2×** | **89.5** | **6700** | **1.2042** | **14.48MB** |
+| F60 | leaky relu | 3× | 102.6 | 5840 | 1.2094 | 17.50MB |
+| **F57** | **relu²** | **3×** | **101.5** | **5910** | **1.1878** | **17.51MB** |
+| F58 | swiglu | 3× | 127.4 | 4700 | 1.1786 | 22.05MB |
+| **F59** | **swiglu** | **3×** | **127.3** | **4710** | **1.1771** | **21.96MB** |
+
+relu² beats relu by 0.024 bpb at no cost — strictly dominant. relu² locked for budget-constrained path.
+
+### MLP Width Sweep (600s)
+
+| Run | Activation | MLP | Layers | ms/step | Steps | val_bpb | Artifact |
+|-----|-----------|-----|--------|---------|-------|---------|----------|
+| F56 | relu² | 2× | 12 | 89.5 | 6700 | 1.2042 | 14.48MB |
+| F64 | relu² | 3× | 12 | 99.4 | 6030 | 1.1873 | 17.50MB |
+| F75 | relu² | 4× | 12 | 91.6 | 6550 | 1.1795 | 20.54MB |
+| F82 | relu² | 4× | 10 | 91.6 | 6550 | 1.1861 | 16.04MB |
+
+4× MLP at 10L beats 3× at 12L within similar budget.
+
+### Layer Count vs MLP Width (fp8, 600s)
+
+| Run | Config | Layers | ms/step | Steps | val_bpb | RT bpb | Artifact |
+|-----|--------|--------|---------|-------|---------|--------|----------|
+| F78 | relu² 3× fp8 | 12 | 99.3 | 6040 | 1.1884 | 1.1898 | 15.80MB |
+| F77 | relu² 3× fp8 | 13 | 106.6 | 5630 | 1.2065 | 1.2077 | 16.96MB |
+| F80 | relu² 2× fp8 | 15 | 106.9 | 5610 | 1.2120 | 1.2136 | 15.45MB |
+| F81 | relu² 2× fp8 | 16 | 113.9 | 5270 | 1.1996 | 1.2009 | 16.33MB |
+| F79 | relu² 3× fp8 | 11 | 91.5 | 6560 | 1.1920 | 1.1933 | 14.66MB |
+| **F82** | **relu² 4× fp8** | **10** | **91.6** | **6550** | **1.1861** | **1.1877** | **16.04MB** |
+| F83 | swiglu 3× fp8 | 10 | 105.5 | 5690 | 1.1842 | 1.1853 | 17.29MB |
+
+### Weight Decay at 10L 4×MLP fp8
+
+| Run | WD | val_bpb | RT bpb | Artifact |
+|-----|-----|---------|--------|----------|
+| F82 | 0.04 | 1.1861 | 1.1877 | 16.04MB |
+| F84 | 0.08 | 1.1983 | 1.1998 | 16.04MB |
+| **F85** | **0.00** | **1.1828** | **1.1844** | **16.02MB** |
+| S87 | 0.00 | 1.1831 | 1.1843 | 16.01MB |
+| **F88** | **0.00 (EMBED=254)** | **1.1820** | **1.1839** | **16.00MB — FITS** |
+
+WD=0 optimal at 10L 4× — opposite to 26L result. Wider MLP needs full weight freedom.
+
+---
+
+## Binary Quantisation Track
+
+### Motivation
+
+Binary quantisation constrains weights to {-1, +1} with no zero state. At 1 bit/param vs ternary's 1.6 bits/param, binary packs approximately 60% more parameters per MB. The hypothesis was that additional depth could compensate for the loss of the zero state.
+
+Starting point: the ternary best config (10L, 768d, 8h, 4kv, 4× relu², FP8, 524k batch, 599s) scoring 1.1578 sliding bpb.
+
+### Binary Scaling Runs
+
+| Run | Layers | MLP | FP | Other | Steps | ms/step | Sliding bpb | Artifact | Fits |
+|-----|--------|-----|-----|-------|-------|---------|-------------|----------|------|
+| F17 | 17 | 4× | FP8 | — | 4010 | 149 | 1.2022 | 17.45MB | No |
+| **F1** | **14** | **4×** | **FP8** | **—** | **4820** | **124** | **1.1824** | **14.74MB** | **Yes** |
+| F2 | 14 | 4× | FP8 | EMA | 4800 | 125 | 1.2110 | 14.56MB | Yes |
+| S3 | 15 | 4× | FP8 | — | 1000 | 133 | 1.3114 | 15.65MB | Yes |
+| S4 | 20 | 3× | FP8 | — | 1000 | 160 | 1.3077 | 16.90MB | No |
+| S5 | 21 | 3× | FP4 | — | 1000 | 167 | 1.3676 | 16.64MB | No |
+| S6 | 19 | 3× | FP8 | — | 1000 | 152 | 1.3130 | 16.16MB | No |
+| S7 | 15 | 4× | FP8 | refiner | 1000 | 135 | 1.3123 | 15.89MB | Yes |
+| S8 | 15 | 4× | FP8 | smear | 1000 | 155 | 1.3043 | 15.67MB | Yes |
+| S9 | 15 | 4× | FP8 | tversky_attn | 1000 | 179 | 1.4016 | 15.74MB | Yes |
+
+### Key Decisions from Binary Scaling
+
+**MLP width (4× vs 3×):** 4× won even when 3× received 4–5 extra layers. S3 (15L 4×) outperformed S6 (19L 3×) at matched steps. Width matters more than depth past a minimum viable layer count.
+
+**FP storage (FP8 vs FP4):** FP4 added a 0.06 bpb roundtrip penalty and was immediately ruled out. FP8 used for all non-binary tensors.
+
+**Layer count:** 17L was the theoretical maximum at 4× FP8 but landed 1.45MB over budget. 15L at 15.65MB was the maximum that fit. 14L left 1.26MB headroom.
+
+**EMA:** Mathematically sound for binary (no zero bucket means `mean(|Q|)=1.0` always, clean roundtrip). In practice, 0.03 bpb worse — the smoothed weights apparently hurt binary's learning dynamics despite the clean quantisation math.
+
+**Smear:** 0.007 bpb gain at 1000 steps but added 22ms/step overhead (133→155ms). Retained for the extended binary run to test whether the gain survives the step penalty at longer training.
+
+**Refiner (causal conv):** Neutral at 1000 steps, added 2ms/step. Not justified.
+
+**Tversky attention projection:** 0.09 bpb worse. Completely incompatible with binary weights.
+
+**Activation:** relu² inherited from ternary sweeps, not retested for binary. SwiGLU would cost ~4MB extra across 15 layers, eliminating the layer budget advantage.
+
+### Extended Binary Run (Unconstrained Compute)
+
+To measure the binary architecture's convergence ceiling without the 10-minute wallclock constraint, a single extended run was conducted at 50,000 steps (~2 hours on 8×H100).
+
+**Configuration:** 15L 768d, 4× relu², FP8, smear, 524k batch tokens, seed=42, MUON_WD=0.0
+
+```
+step:50000/50000 val_loss:2.9692 val_bpb:1.1497 train_time:7763s
+artifact:15.60MB binary:97320960(13685760B) fp:2542200(2585072B) code:70399
+budget:15670651/16000000 (15.67/16.00MB) FITS
+final_binary_roundtrip val_loss:2.9743 val_bpb:1.1516
+temp_scaling optimal_T:0.90
+final_sliding val_loss:2.9027 val_bpb:1.1239 (stride=16, T=0.90)
+```
+
+| Metric | Value |
+|--------|-------|
+| val_bpb | 1.1497 |
+| RT bpb | 1.1516 |
+| Sliding bpb | **1.1239** |
+| Artifact | 15.60MB (15.67MB total) |
+| Binary params | 97,320,960 |
+| Steps | 50,000 |
+| ms/step | 155.3 |
+| Training time | ~2.15 hours |
+
+The 1.1239 sliding bpb demonstrates that with sufficient compute the binary architecture reaches strong quality. This validates the compression approach — nearly 100M parameters in 15.67MB via 1-bit quantisation — though the 50k steps required far exceeds the competition's 10-minute budget.
+
+### Binary vs Ternary at Equal Architecture (Dev Scale)
+
+| Metric | Binary | Ternary | Delta |
+|--------|--------|---------|-------|
+| val_bpb | 1.8609 | 1.8113 | Ternary wins by 0.050 |
+| Artifact | 9.14MB | 11.56MB | Binary saves 2.42MB |
+| ms/step | 918 | 924 | Identical |
+| RT gap | 0.000 | 0.000 | Both clean |
+
+Ternary is better at equal architecture. Binary's only advantage is fitting more layers in the same budget.
+
+### Binary Conclusion
+
+Binary lost the depth-for-sparsity trade. The 5 extra layers (15L binary vs 10L ternary) could not overcome ternary's representational advantage from the zero state. The 0.0016 bpb gap measured at 500 dev steps significantly understated the true difference at convergence. Ternary at 1.1578 sliding bpb (10-minute budget) outperforms binary's best-fitting run (F1: 1.1824 at 14L without smear) by 0.025 bpb. Even the over-budget 17L binary run (1.2022) could not match ternary.
+
+The extended 50k-step binary run reaching 1.1239 sliding bpb shows that binary has a competitive convergence ceiling, but it requires approximately 8× more training steps to approach competitive quality — well beyond the competition constraints.
+
+---
+
+## Grouped MLP Investigation
+
+Tested GroupedTernaryLinear: splits MLP into independent groups for parameter/speed savings.
+
+### Real Model Results (relu² 3×, 768d, 600s)
+
+| Run | Config | Layers | ms/step | Steps | val_bpb | Artifact |
+|-----|--------|--------|---------|-------|---------|----------|
+| F64 | standard | 12 | 99.4 | 6030 | 1.1873 | 17.50MB |
+| F72 | g=2 | 12 | 87.4 | 6870 | 1.2180 | 12.97MB |
+| F71 | g=4 | 12 | 83.5 | 7190 | 1.2429 | 10.74MB |
+| F73 | g=2 | 16 | 114.2 | 5260 | 1.2037 | 16.04MB |
+| F74 | swiglu g=2 | 12 | 113.3 | 5300 | 1.2084 | 15.24MB |
+
+Cross-group isolation costs 0.031–0.056 bpb. Even with 4 extra layers (F73), only recovers 0.014 of the deficit. **Not viable for language modelling.**
+
+---
+
+## Differential Attention
+
+Microsoft (2024): computes two attention maps from split Q/K and takes their difference.
+
+| Run | Config | ms/step | Steps | val_bpb |
+|-----|--------|---------|-------|---------|
+| F64 | standard | 99.4 | 6030 | 1.1873 |
+| F68 | diff_attn | 109.3 | 5480 | 1.2094 |
+
+Splits 96-dim heads into 48-dim sub-heads — insufficient dimensionality for meaningful attention patterns at this model scale.
+
+---
+
+## Sequence Refiner (CausalConvRefiner)
+
+| Run | Config | ms/step | Steps | val_bpb | Artifact |
+|-----|--------|---------|-------|---------|----------|
+| F64 | none | 99.4 | 6030 | 1.1873 | 17.50MB |
+| F69 | k=3 | 102.2 | 5860 | 1.1885 | 19.92MB |
+| F70 | k=5 | 103.0 | 5820 | 1.2018 | 18.13MB |
+
+Noise-level quality improvement with storage bloat. 12 attention layers already saturate local pattern capture.
+
+---
+
+## ByteCNN Vocabulary Generator
+
+Replaces `nn.Embedding(8192, 256)` with a CNN that generates the embedding matrix from byte spellings.
+
+```
+step:500 loss:9.0471 — step:2000 loss:9.0471 (flat, no learning)
+```
+
+All 8192 CNN-generated embeddings converge to near-identical vectors at initialisation. The CNN's inductive bias (byte-similar tokens → similar embeddings) destroys the initial diversity needed for gradient signal.
+
+---
+
+## Asymmetric Tokenizer Investigation
+
+8k BPE input with 256-byte output to eliminate large output projection.
+
+| Model | BPB | Notes |
+|-------|-----|-------|
+| Standard (tied, emb=256) | 3.10 | reference |
+| Asymmetric parallel (emb=256) | 8.65 | byte independence assumption fails |
+| Asymmetric autoregressive (emb=256) | 8.17 | tiny GRU insufficient capacity |
+
+Multi-byte parallel heads assume conditional independence between bytes within a token — mathematically incorrect. Sequence-length mismatch (7 BPE tokens → 70 bytes) also incompatible with the evaluation framework.
+
+---
+
+## Linear Alternative Exploration
+
+Systematic notebook testing of linear layer alternatives at real model dimensions (768d).
+
+### Projection Benchmark (DIM → DIM, H100)
+
+| Model | Params | ms | vs Linear |
+|-------|--------|-----|-----------|
+| Linear | 589,824 | 0.07ms | 1.00× |
+| LowRank r=64 | 98,304 | 0.03ms | 0.44× |
+| BlockDiag b=4 | 147,456 | 0.03ms | 0.40× |
+| Grouped g=4 | 147,456 | 0.03ms | 0.40× |
+| BD4 + mix32 | 196,608 | 0.07ms | 0.97× |
+| Hash 65536 | 65,536 | 0.08ms | 1.13× |
+
+BlockDiag/Grouped offer speed advantages but cross-group isolation degrades LM quality in practice.
+
+---
+
+## H100 Microbenchmark Results
+
+Standalone kernel timing vs torch.compile behaviour (critical lesson: standalone microbenchmarks can mislead when torch.compile fuses operations).
+
+### STE Speed
+
+| Variant | ms/call |
+|---------|---------|
+| Current | 0.041 |
+| Reciprocal | 0.043 |
+
+No gain — 48 STE calls/step = ~2ms overhead (unavoidable).
+
+### Contiguous Checks
+
+Q and K are contiguous after RoPE. V is non-contiguous (view into fused QKV). V's `.contiguous()` costs 0.065ms/call = 0.78ms/step (necessary for flash_attn).
+
+### RoPE Variants
+
+Current (half-split + cat) is fastest at 0.52ms/call.
+
+### Softcap: Poly5 vs Tanh
+
+| Variant | ms/call |
+|---------|---------|
+| Poly5 (current) | 8.43 |
+| Poly3 | 5.98 |
+| Tanh | 2.12 |
+| Hardtanh | 0.71 |
+
+**Critical finding:** Tanh is 4× faster standalone due to H100 hardware transcendental units. However in the real training loop, torch.compile fuses poly5 with surrounding ops into a single kernel. **Switching to tanh broke fusion — F63 was 16ms/step slower.** Poly5 retained.
+
+### CE + Z-Loss Fusion
+
+| Variant | ms/call (fwd+bwd) |
+|---------|-------------------|
+| Separate (current) | 16.56 |
+| Fused (shared LSE) | 12.33 |
+
+**Same lesson:** 4.2ms saving standalone, but torch.compile already optimises `F.cross_entropy`. Manual gather+logsumexp prevents optimisation. Current approach retained.
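+
+A sketch of the two variants compared (z-loss weight 1e-4 as used in the script):
+
+```python
+import torch
+import torch.nn.functional as F
+
+def ce_zloss_separate(logits, targets, z_weight=1e-4):
+    # Two logsumexp computations: one inside F.cross_entropy, one explicit.
+    ce = F.cross_entropy(logits, targets)
+    z = (torch.logsumexp(logits.float(), dim=-1) ** 2).mean()
+    return ce + z_weight * z
+
+def ce_zloss_fused(logits, targets, z_weight=1e-4):
+    # One shared logsumexp; CE rewritten as lse - target_logit.
+    logits_f = logits.float()
+    lse = torch.logsumexp(logits_f, dim=-1)
+    tgt = logits_f.gather(1, targets.unsqueeze(1)).squeeze(1)
+    return (lse - tgt).mean() + z_weight * (lse ** 2).mean()
+```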
+
+---
+
+## Efficiency Analysis
+
+### BPB Gained Per Component
+
+| Component | BPB gain | Source |
+|-----------|----------|--------|
+| relu → relu² | −0.024 | F55 vs F56 |
+| MLP 2× → 3× (relu²) | −0.017 | F56 vs F64 |
+| MLP 3× → 4× (relu²) | −0.008 | F64 vs F75 |
+| relu² → swiglu (at 3×) | −0.010 | F64 vs F59 |
+| +1 layer (average) | −0.0012 | scaling data |
+| fp16 → fp8 (RT penalty) | +0.002 | run 42 vs 49 |
+| Sliding eval stride=16 | −0.025 | F22 data |
+| WD=0.04 vs WD=0 (at 26L) | −0.001 | F7 vs F6 |
+
+### MB Cost Per Component
+
+| Component | MB/layer |
+|-----------|----------|
+| relu² 2× layer | 0.767 |
+| relu² 3× layer | 1.003 |
+| relu² 4× layer | 1.220 |
+| swiglu 3× layer | 1.357 |
+| fp16 → fp8 (fixed saving) | −2.51 |
+
+### Efficiency Ratio (BPB Gained Per MB Spent)
+
+| Change | BPB gain | MB cost | BPB/MB |
+|--------|----------|---------|--------|
+| relu → relu² | −0.024 | 0.00 | infinite (free) |
+| Sliding eval | −0.025 | 0.00 | infinite (free) |
+| MLP 2× → 3× | −0.017 | +2.83 (12L) | −0.0060/MB |
+| MLP 3× → 4× | −0.008 | +2.83 (12L) | −0.0028/MB |
+| relu² → swiglu | −0.010 | +4.25 (12L) | −0.0024/MB |
+| +1 layer (relu² 2×) | −0.0012 | +0.767 | −0.0016/MB |
+| +1 layer (relu² 3×) | −0.0012 | +1.003 | −0.0012/MB |
+
+MLP 2×→3× is the most efficient paid upgrade. relu² and sliding eval are free wins.
+
+### Layer Budget at 768d
+
+| Config | Max Layers | Est ms/step |
+|--------|-----------|-------------|
+| relu² 2× fp16 | 14L | ~95ms |
+| relu² 2× fp8 | 17L | ~97ms |
+| relu² 3× fp16 | 10L | ~99ms |
+| relu² 3× fp8 | 13L | ~106ms |
+| relu² 4× fp8 | 10L | ~92ms |
+| swiglu 3× fp8 | 9L | ~105ms |
+
+---
+
+## Ternary-Incompatible Techniques
+
+These are not merely unhelpful but structurally incompatible with 1.58-bit quantisation:
+
+| Technique | Mechanism of failure |
+|-----------|---------------------|
+| **EMA** | Weight averaging → values cluster near zero → ternary rounds most to 0 → 0.12 bpb RT gap |
+| **TTT-LoRA** | LoRA delta computed outside RMSNorm space that TernaryLinear normalises into. Corrupts calibrated representations at convergence |
+| **Ternary prototypes + sigmoid** | Sigmoid membership needs continuous values. Ternary {-1,0,+1} collapses membership patterns → 0.077 RT gap |
+| **LM head rank factorisation** | SVD factors U,V need fp16 precision. Storage exceeds original tied embedding |
+
+---
+
+## Software Optimisations
+
+| Optimisation | Saving | Notes |
+|---|---|---|
+| Fused QKV (c_q+c_k+c_v → single matmul) | ~2ms/step | Safe: in_features divisible by all group sizes |
+| Fused SwiGLU/relu² (gate+up → single wide matmul) | ~2-4ms/step | Same params, fewer kernel launches |
+| Z-loss regularisation (1e-4 x logsumexp²) | quality | Anchors logits, keeps STE gradients sharp |
+| DataLoader int16 transfer (pin then cast on GPU) | ~1ms/step | 4× less PCIe bandwidth; see sketch below |
+| FlashAttention-3 | ~13ms/step | ~9% speedup, ~380 free training steps |
+| TernaryLinear bf16 weights, cleaner STE | ~1ms/step | Eliminates fp32 roundtrip |
+| DDP static_graph + gradient_as_bucket_view | ~1ms/step | Free when find_unused=False |
+| Fused optimizer loop (LR set + step in one pass) | ~0.5ms/step | Fewer Python-level iterations |
+| Removed CUBLAS determinism tax | ~1ms/step | Not required for competition |
+| Temperature grid: 5 points instead of 21 | ~1s total | T=0.90 consistently with relu² |
+| Temp scaling moved to eval phase | ~3 steps gained | No longer steals training time |
+| `_e()` helper for Hyperparameters | -1.8KB code | Eliminates env var boilerplate |
+| 3D tensor ternary quantisation | storage fix | Conv1d weights reshaped to 2D for ternary |
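+
+The DataLoader row above in code (the pattern used in `DistributedTokenLoader.next_batch` in the training script): ship the raw 2-byte tokens across PCIe first, and widen to int64 only on the GPU.
+
+```python
+# chunk: int16 CPU tensor sliced from the shard stream (2 bytes/token)
+local = chunk.pin_memory().to(device, non_blocking=True)  # async DMA at 2 bytes/token
+local = local.to(torch.int64)                             # widen on-GPU for the embedding lookup
+```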
+
+---
+
+## Rejected Techniques (Summary)
+
+| Technique | Reason |
+|-----------|--------|
+| Tversky (all variants) | Quality-neutral on FineWeb LM — confirmed via synthetic data analysis; speed penalty with relu² |
+| Differential attention | Halved head_dim (96→48) degrades quality at this model scale |
+| Grouped MLP (g=2, g=4) | Cross-group isolation costs 0.031–0.056 bpb; not recoverable with extra layers |
+| CausalConvRefiner | Noise-level quality; storage bloat from Conv1d weights |
+| ByteCNN vocabulary generator | Embedding collapse — CNN inductive bias destroys initial diversity |
+| Asymmetric tokenizer | Byte independence assumption incorrect; sequence mismatch with eval framework |
+| EMA | Incompatible with ternary — weight averaging causes 0.12 bpb RT gap |
+| TTT-LoRA | Architectural incompatibility with RMSNorm space in TernaryLinear |
+| LM head factorisation | SVD factors bloat artifact beyond budget; unrecoverable quality loss |
+| MTP | 0.006 bpb worse — model capacity too limited for auxiliary objectives |
+| BigramHash | 0.020 bpb worse at convergence; fp16 table displaces ternary layers |
+| Seq/batch schedule | Recompile and step penalties dominate at 600s wallclock |
+| SmearModule | +22% step cost for −0.001 gain within ternary 10-minute budget |
+| Depth recurrence | Halves effective steps; OOM at DR=3 |
+| AdamW for matrix params | Clearly inferior to Muon for ternary weights |
+| FP4 storage | 0.026–0.029 RT gap even with QAT — unrecoverable |
+| Tanh softcap | Faster standalone but breaks torch.compile kernel fusion |
+| Fused CE+Z-loss | Same — breaks compile optimisation |
+| 16 heads at 768d | 48-dim head_dim insufficient for meaningful attention |
+| relu (plain) | Strictly dominated by relu² |
+| leaky relu | Strictly dominated by relu² |
+| Distillation (in-run) | Train-from-scratch teacher always worse than supervised |
+| reduce-overhead compile | Rotary + embed_proj_rev incompatible with CUDA graphs |
+| max-autotune compile | 30+ minute kernel search prohibitive for 600s runs |
+| Skip weights zero-init | 0.010 bpb worse — decoder needs skip signal from step 0 |
+| EMBED_DIM=0 (full 512) | 19.78MB artifact — 3.78MB over budget |
+| Untie lm_head full-rank | 7.3MB budget overrun not justified by 0.005 bpb gain |
+
+---
+
+## Decision Log
+
+| Decision | Rationale |
+|----------|-----------|
+| 8k vocabulary | −0.42 bpb, largest single win |
+| relu² activation | −0.024 bpb vs relu, free (no cost) |
+| 4×MLP width | Best BPB within budget at 10L; 0.008 better than 3× |
+| 10L 768d | Minimum viable depth at 768d with maximum MLP width |
+| WD=0.0 at 10L 4× | Opposite to deep models — wider MLP needs full weight freedom |
+| fp8 storage | Halves fp_params (5MB→2.5MB), enables wider MLP within budget |
+| EMBED_DIM=254 | 256-2 dims to fit artifact+code under 16,000,000 byte budget; ~0.0004 bpb cost |
+| BITNET_GROUP_SIZE=128 | Same quality as 64; saves 0.69MB |
+| 8 heads, 4 KV, 96-dim head_dim | 16 heads at 48-dim insufficient; MHA only +0.0012 at +1.5MB |
+| Poly softcap | Fuses with torch.compile; tanh breaks fusion |
+| ROPE_BASE=5000 + YaRN 2048 | Best frequency calibration |
+| Muon optimizer | Newton-Schulz normalisation compensates for ternary STE gradient attenuation |
+| MUON_BACKEND_STEPS=3 | Equivalent to 5 at convergence; +190 extra steps |
+| MUON_MOMENTUM=0.95 | Both directions degrade; affects artifact via zero_frac |
+| WARMDOWN=20% | Asymmetric — too little hurts more than too much |
+| MATRIX_LR=0.04 | Higher LR compensates for ternary STE gradient attenuation |
+| SCALAR_LR=0.02 | Optimal — scalars do not pass through STE |
+| TIED_EMBED_LR=0.02 | Optimal |
+| TRAIN_BATCH_TOKENS=524k | Optimal tradeoff between gradient quality and step count |
+| Base-3 + LZMA | 39% reduction over int8+zlib |
+| Shrinkage fix | Eliminates all RT gaps universally |
+| Skip weights ones-init | Decoder needs skip signal from step 0; zeros costs 0.010 bpb |
+| Tied embeddings | Untie costs 7.3MB; not justified |
+| Standard attn projection | Tversky quality-neutral; grouped destroys quality |
+| No EMA | Fundamentally incompatible with ternary |
+| No TTT | RMSNorm space incompatibility confirmed across 6 runs |
+| No MTP | Confirmed post-fix: 0.006 bpb worse |
+| Temperature scaling T=0.90 | relu² logits slightly underconfident; auto-calibrated |
+| Fused QKV + relu² | ~130-180 free training steps per run |
+| Z-loss regularisation | Anchors logits; keeps STE gradients sharp |
+| FlashAttention-3 | Free ~380 extra training steps per 600s run |
+| Sliding eval stride=16 | Best quality when eval budget unconstrained |
+| Optimizer coverage fix | embed_proj/embed_proj_rev now train; 0.055 bpb improvement |
+| MAX_WALLCLOCK_SECONDS=599 | 1s leeway for safety margin |
+| Binary 15L 768d 4× fp8 | 97M params in 15.67MB — maximum parameter density; convergence ceiling validated at 50k steps |
diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/binary_log.txt b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/binary_log.txt
new file mode 100644
index 0000000000..f75377dcdf
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/binary_log.txt
@@ -0,0 +1,1518 @@
+"""Binary training script for OpenAI's Parameter Golf Challenge. Ciprian-Florin Ifrim - 24 March 2026"""
+
+import copy
+import glob
+import io
+import math
+import os
+import random
+import sys
+import time
+import lzma
+from pathlib import Path
+import numpy as np
+import sentencepiece as spm
+import torch
+import torch.distributed as dist
+import torch.nn.functional as F
+from torch import Tensor, nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+from flash_attn_interface import flash_attn_func
+
+# ---------------------------------------------------------------------------
+# Hyperparameters (all configurable via environment variables)
+# ---------------------------------------------------------------------------
+def _e(k, d, t=str):
+ v = os.environ.get(k, str(d))
+ if t == bool: return bool(int(v))
+ return t(v)
+
+class Hyperparameters:
+ data_path = _e("DATA_PATH", "./data/datasets/fineweb10B_sp1024")
+ train_files = os.path.join(data_path, "fineweb_train_*.bin")
+ val_files = os.path.join(data_path, "fineweb_val_*.bin")
+ tokenizer_path = _e("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model")
+ run_id = os.environ.get("RUN_ID", f"run_{int(time.time())}")
+ seed = _e("SEED", 1337, int)
+ compile_mode = _e("COMPILE_MODE", "default")
+ val_batch_size = _e("VAL_BATCH_SIZE", 524288, int)
+ val_loss_every = _e("VAL_LOSS_EVERY", 500, int)
+ train_log_every = _e("TRAIN_LOG_EVERY", 10, int)
+ iterations = _e("ITERATIONS", 2000, int)
+ warmdown_fraction = _e("WARMDOWN_FRACTION", 0.2, float)
+ warmup_steps = _e("WARMUP_STEPS", 20, int)
+ train_batch_tokens = _e("TRAIN_BATCH_TOKENS", 524288, int)
+ train_seq_len = _e("TRAIN_SEQ_LEN", 1024, int)
+ max_wallclock_seconds = _e("MAX_WALLCLOCK_SECONDS", 0.0, float)
+ vocab_size = _e("VOCAB_SIZE", 1024, int)
+ num_layers = _e("NUM_LAYERS", 16, int)
+ num_kv_heads = _e("NUM_KV_HEADS", 4, int)
+ model_dim = _e("MODEL_DIM", 512, int)
+ num_heads = _e("NUM_HEADS", 8, int)
+ mlp_mult = _e("MLP_MULT", 2, int)
+ tie_embeddings = _e("TIE_EMBEDDINGS", 1, int)
+ rope_base = _e("ROPE_BASE", 10000.0, float)
+ rope_type = _e("ROPE_TYPE", "rope")
+ yarn_max_len = _e("YARN_MAX_LEN", 4096, int)
+ logit_softcap = _e("LOGIT_SOFTCAP", 30.0, float)
+ softcap_type = _e("SOFTCAP_TYPE", "poly")
+ tied_embed_init_std = _e("TIED_EMBED_INIT_STD", 0.005, float)
+ qk_gain_init = _e("QK_GAIN_INIT", 1.5, float)
+ activation_type = _e("ACTIVATION", "swiglu")
+ embed_dim = _e("EMBED_DIM", 0, int)
+ bigram_hash = _e("BIGRAM_HASH", 0, bool)
+ mtp_heads_count = _e("MTP_HEADS", 0, int)
+ training_depth_recurrence = _e("TRAINING_DEPTH_RECURRENCE", 1, int)
+ eval_depth_recurrence = _e("EVAL_DEPTH_RECURRENCE", 1, int)
+ attn_proj_type = _e("ATTN_PROJ_TYPE", "standard")
+ logit_head_type = _e("LOGIT_HEAD_TYPE", "standard")
+ tversky_num_features = _e("TVERSKY_NUM_FEATURES", 16, int)
+ tversky_feature_pools = _e("TVERSKY_FEATURE_POOLS", 0, int)
+ tversky_membership = _e("TVERSKY_MEMBERSHIP", "sigmoid")
+ diff_attn = _e("DIFF_ATTN", 0, bool)
+ refiner = _e("REFINER", 0, bool)
+ refiner_kernel = _e("REFINER_KERNEL", 3, int)
+ mlp_groups = _e("MLP_GROUPS", 0, int)
+ embed_lr = _e("EMBED_LR", 0.6, float)
+ head_lr = _e("HEAD_LR", 0.008, float)
+ adam_lr = _e("ADAM_LR", 1e-3, float)
+ adam_wd = _e("ADAM_WD", 0.05, float)
+ untie_at_fraction = _e("UNTIE_AT_FRACTION", 0.0, float)
+ tied_embed_lr = _e("TIED_EMBED_LR", 0.05, float)
+ corr_weight_lr = _e("CORR_WEIGHT_LR", 0.05, float)
+ smear = _e("SMEAR", 0, bool)
+ seq_len_start = _e("SEQ_LEN_START", 0, int)
+ seq_schedule_fraction = _e("SEQ_SCHEDULE_FRACTION", 0.33, float)
+ batch_tokens_start = _e("BATCH_TOKENS_START", 0, int)
+ batch_schedule_fraction = _e("BATCH_SCHEDULE_FRACTION", 0.33, float)
+ churn_log_every = _e("CHURN_LOG_EVERY", 500, int)
+ matrix_lr = _e("MATRIX_LR", 0.04, float)
+ scalar_lr = _e("SCALAR_LR", 0.04, float)
+ muon_momentum = _e("MUON_MOMENTUM", 0.95, float)
+ muon_backend_steps = _e("MUON_BACKEND_STEPS", 5, int)
+ muon_wd = _e("MUON_WD", 0.0, float)
+ matrix_optimizer = _e("MATRIX_OPTIMIZER", "muon")
+ muon_momentum_warmup_start = _e("MUON_MOMENTUM_WARMUP_START", 0.85, float)
+ muon_momentum_warmup_steps = _e("MUON_MOMENTUM_WARMUP_STEPS", 500, int)
+ beta1 = _e("BETA1", 0.9, float)
+ beta2 = _e("BETA2", 0.95, float)
+ adam_eps = _e("ADAM_EPS", 1e-8, float)
+ grad_clip_norm = _e("GRAD_CLIP_NORM", 0.0, float)
+ bitnet_group_size = _e("BITNET_GROUP_SIZE", 64, int)
+ sliding_eval = _e("SLIDING_EVAL", 0, bool)
+ sliding_eval_stride = _e("SLIDING_EVAL_STRIDE", 64, int)
+ sliding_batch_size = _e("SLIDING_BATCH_SIZE", 64, int)
+ temp_scaling = _e("TEMP_SCALING", 0, bool)
+ _fp_raw = os.environ.get("FP_STORAGE", "0")
+ fp_storage = True if _fp_raw == "FP8" else ("fp4" if _fp_raw == "FP4" else False)
+ ema = _e("EMA", 0, bool)
+ ema_decay = _e("EMA_DECAY", 0.995, float)
+ ema_start_fraction = _e("EMA_START_FRACTION", 0.5, float)
+
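+# Name fragments of low-dim / per-channel parameters: kept in fp32, trained with the
+# scalar Adam group, and excluded from the Muon matrix group.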
+CTP = ("attn_scale","attn_scales","mlp_scale","mlp_scales","resid_mix","resid_mixes","q_gain","diff_lambda","skip_weight","skip_weights","vocab_bias","refiner.gate")
+
+# ---------------------------------------------------------------------------
+# Binary packing — bitpacking (8 weights/byte = 1 bit/param, lossless)
+# ---------------------------------------------------------------------------
+def pack_binary(q: Tensor) -> tuple[bytes, int]:
+ bits = ((q.reshape(-1).to(torch.int8) + 1) // 2).numpy().astype(np.uint8)
+ n = len(bits)
+ pad = (8 - n % 8) % 8
+ if pad:
+ bits = np.concatenate([bits, np.zeros(pad, dtype=np.uint8)])
+ groups = bits.reshape(-1, 8)
+ packed = np.zeros(len(groups), dtype=np.uint8)
+ for i in range(8):
+ packed |= groups[:, i] << i
+ return packed.tobytes(), n
+
+def unpack_binary(data: bytes, n: int) -> Tensor:
+ packed = np.frombuffer(data, dtype=np.uint8)
+ bits = np.zeros((len(packed), 8), dtype=np.int8)
+ for i in range(8):
+ bits[:, i] = (packed >> i) & 1
+ flat = bits.reshape(-1)[:n]
+ return torch.from_numpy(flat.astype(np.int8) * 2 - 1)
+
+# ---------------------------------------------------------------------------
+# FP4 quantization (per-row absmax, 2 values packed per byte)
+# ---------------------------------------------------------------------------
+def quantize_to_int4(t: Tensor) -> tuple[Tensor, Tensor, list]:
+ t32 = t.float()
+ orig_shape = t32.shape
+ if t32.ndim < 2:
+ t32 = t32.unsqueeze(0)
+ absmax = t32.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
+ scale = absmax / 7.0
+ q = torch.clamp(torch.round(t32 / scale), -7, 7).to(torch.int8)
+ flat = q.reshape(-1)
+ if flat.numel() % 2 != 0:
+ flat = F.pad(flat, (0, 1))
+ low = (flat[0::2] + 8).to(torch.uint8)
+ high = (flat[1::2] + 8).to(torch.uint8)
+ return low | (high << 4), scale.half().squeeze(-1), list(orig_shape)
+
+def dequantize_from_int4(packed: Tensor, scale: Tensor, shape: list) -> Tensor:
+ low = (packed & 0x0F).to(torch.int8) - 8
+ high = ((packed >> 4) & 0x0F).to(torch.int8) - 8
+ flat = torch.zeros(packed.numel() * 2, dtype=torch.int8)
+ flat[0::2] = low
+ flat[1::2] = high
+ numel = 1
+ for s in shape:
+ numel *= s
+ flat = flat[:numel].float()
+ if len(shape) <= 1:
+ return (flat * scale.float().squeeze()).reshape(shape)
+ return (flat.reshape(-1, shape[-1]) * scale.float().unsqueeze(-1)).reshape(shape)
+
+# ---------------------------------------------------------------------------
+# State dict serialization (binary + fp16/fp8/fp4)
+# ---------------------------------------------------------------------------
+def q_sd(state_dict: dict, group_size: int = 64, fp_storage=False, binary_override_names: set | None = None) -> tuple[dict, dict]:
+ "Binary for large 2D weight matrices, fp16/fp8/fp4 for everything else."
+ quantized = {}
+ stats = {"binary_params": 0, "binary_bytes": 0, "fp_params": 0, "fp_bytes": 0}
+ for name, tensor in state_dict.items():
+ if "mtp_heads" in name:
+ continue
+ t = tensor.detach().cpu().float().contiguous()
+ t_orig_shape = list(t.shape)
+ if t.ndim == 3:
+ t = t.reshape(t.shape[0], -1)
+ is_binary_candidate = (
+ t.ndim == 2 and t.numel() > 65_536
+ and "tok_emb" not in name and "lm_head" not in name and "embed_proj" not in name and "bigram_emb" not in name and "lm_head_correction" not in name and "lm_head_U" not in name and "lm_head_V" not in name
+ and "prototypes" not in name and "tversky" not in name
+ ) or (binary_override_names is not None and name in binary_override_names)
+ if is_binary_candidate:
+ pad = (group_size - t.shape[1] % group_size) % group_size
+ t_padded = F.pad(t, (0, pad)) if pad > 0 else t
+ t_grouped = t_padded.reshape(-1, group_size)
+ scale = t_grouped.abs().mean(-1, keepdim=True).clamp(min=1e-8).half().float()
+ q = torch.where(t_grouped >= 0,
+ torch.ones_like(t_grouped, dtype=torch.int8),
+ -torch.ones_like(t_grouped, dtype=torch.int8))
+ packed_bytes, n_bits = pack_binary(q)
+ quantized[name] = {
+ "type": "binary", "packed": packed_bytes,
+ "scale": scale.half().squeeze(-1),
+ "shape": list(t.shape), "padded_cols": t_padded.shape[1],
+ "group_size": group_size, "n_bits": n_bits,
+ "orig_shape": t_orig_shape,
+ }
+ stats["binary_params"] += t.numel()
+ stats["binary_bytes"] += len(packed_bytes) + scale.numel() * 2
+ elif fp_storage == "fp4" and t.ndim == 2:
+ packed, scale, orig_shape = quantize_to_int4(t)
+ quantized[name] = {"type": "fp4", "packed": packed, "scale": scale, "shape": orig_shape}
+ stats["fp_params"] += t.numel()
+ stats["fp_bytes"] += packed.numel() + scale.numel() * 2
+ elif fp_storage and t.ndim == 2:
+ quantized[name] = {"type": "fp8", "data": t.to(torch.float8_e4m3fn)}
+ stats["fp_params"] += t.numel()
+ stats["fp_bytes"] += t.numel()
+ else:
+ quantized[name] = {"type": "fp16", "data": t.half()}
+ stats["fp_params"] += t.numel()
+ stats["fp_bytes"] += t.numel() * 2
+ return quantized, stats
+
+def deq_sd(quantized: dict, target_dtype=torch.bfloat16):
+ "Reconstruct full-precision state dict from quantized representation."
+ out = {}
+ for name, entry in quantized.items():
+ if entry["type"] == "binary":
+ q = unpack_binary(entry["packed"], entry["n_bits"])
+ q = q.float().reshape(-1, entry["group_size"])
+ scale = entry["scale"].float().unsqueeze(-1)
+ # No shrinkage correction needed: binary has no zeros, q.abs().mean() == 1.0 always
+ t = (q * scale).reshape(-1, entry["padded_cols"])
+ shape = entry["shape"]
+ result = t[:shape[0], :shape[1]].to(target_dtype)
+ orig = entry.get("orig_shape")
+ out[name] = result.reshape(orig).contiguous() if orig and orig != shape else result.contiguous()
+ elif entry["type"] == "fp8":
+ out[name] = entry["data"].to(torch.float32).to(target_dtype).contiguous()
+ elif entry["type"] == "fp4":
+ out[name] = dequantize_from_int4(entry["packed"], entry["scale"], entry["shape"]).to(target_dtype).contiguous()
+ else:
+ out[name] = entry["data"].to(target_dtype).contiguous()
+ return out
+
+# ---------------------------------------------------------------------------
+# Binary diagnostics (logged during training)
+# ---------------------------------------------------------------------------
+_prev_committed: dict = {}
+def churn_fn(model: nn.Module, group_size: int = 64):
+ global _prev_committed
+ total = flipped = 0
+ with torch.no_grad():
+ for name, p in model.named_parameters():
+ if p.ndim == 2 and ("weight" in name or "prototypes" in name) and p.shape[0] > 1:
+ w = p.detach().float().reshape(-1, group_size)
+ q = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w)).cpu().numpy()
+ if name in _prev_committed:
+ flipped += int(np.sum(q != _prev_committed[name]))
+ total += q.size
+ _prev_committed[name] = q
+ return flipped / max(total, 1)
+
+# ---------------------------------------------------------------------------
+# Muon optimizer (Newton-Schulz orthogonalized momentum)
+# ---------------------------------------------------------------------------
+def ns_orth(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor:
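+    # Quintic Newton-Schulz iteration (Muon): X <- a*X + (b*A + c*A^2) @ X with A = X @ X^T
+    # pushes the singular values of X toward 1, approximately orthogonalising the update.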
+ a, b, c = (3.4445, -4.7750, 2.0315)
+ X = G.bfloat16()
+ X /= X.norm() + eps
+ transposed = G.size(0) > G.size(1)
+ if transposed:
+ X = X.T
+ for _ in range(steps):
+ A = X @ X.T
+ B = b * A + c * A @ A
+ X = a * X + B @ X
+ return X.T if transposed else X
+
+class Muon(torch.optim.Optimizer):
+ def __init__(self, params, lr: float, momentum: float, backend_steps: int, nesterov: bool = True, wd: float = 0.0):
+ super().__init__(params, dict(lr=lr, momentum=momentum, backend_steps=backend_steps, nesterov=nesterov, wd=wd))
+ @torch.no_grad()
+ def step(self, closure=None):
+ loss = None
+ if closure is not None:
+ with torch.enable_grad():
+ loss = closure()
+ distributed = dist.is_available() and dist.is_initialized()
+ world_size = dist.get_world_size() if distributed else 1
+ rank = dist.get_rank() if distributed else 0
+ for group in self.param_groups:
+ params = group["params"]
+ if not params:
+ continue
+ lr, momentum = group["lr"], group["momentum"]
+ backend_steps, nesterov = group["backend_steps"], group["nesterov"]
+ total_params = sum(int(p.numel()) for p in params)
+ updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
+ curr = 0
+ for i, p in enumerate(params):
+ if i % world_size == rank and p.grad is not None:
+ g = p.grad
+ state = self.state[p]
+ if "momentum_buffer" not in state:
+ state["momentum_buffer"] = torch.zeros_like(g)
+ buf = state["momentum_buffer"]
+ buf.mul_(momentum).add_(g)
+ if nesterov:
+ g = g.add(buf, alpha=momentum)
+ g = F.rms_norm(g.float(), (g.size(-1),)).bfloat16()
+ g = ns_orth(g, steps=backend_steps)
+ g *= max(1, g.size(0) / g.size(1)) ** 0.5
+ updates_flat[curr:curr + p.numel()] = g.reshape(-1)
+ curr += p.numel()
+ if distributed:
+ dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM)
+ wd = group.get("wd", 0.0)
+ curr = 0
+ for p in params:
+ g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype)
+ if wd > 0:
+ p.mul_(1 - lr * wd)
+ p.add_(g, alpha=-lr)
+ curr += p.numel()
+ return loss
+
+# ---------------------------------------------------------------------------
+# Data loading
+# ---------------------------------------------------------------------------
+def ld_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    header = np.fromfile(file, dtype="<i4", count=256)
+    num_tokens = int(header[2])
+    with file.open("rb", buffering=0) as f:
+        f.seek(header_bytes)
+        tokens = np.fromfile(f, dtype="<u2", count=num_tokens)
+    return torch.from_numpy(tokens.astype(np.int16))  # int16 keeps PCIe transfer at 2 bytes/token
+
+class TokenStream:
+    "Sequential reader over token shards; wraps back to the first shard at end of data."
+    def __init__(self, pattern: str):
+        self.files = sorted(Path(p) for p in glob.glob(pattern))
+        assert self.files, f"No files: {pattern}"
+        self.file_idx = 0
+        self.tokens = ld_shard(self.files[0])
+        self.pos = 0
+    def _advance_file(self):
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        self.tokens = ld_shard(self.files[self.file_idx])
+        self.pos = 0
+    def take(self, n: int) -> Tensor:
+ chunks = []
+ remaining = n
+ while remaining > 0:
+ avail = self.tokens.numel() - self.pos
+ if avail <= 0:
+ self._advance_file()
+ continue
+ k = min(remaining, avail)
+ chunks.append(self.tokens[self.pos:self.pos + k])
+ self.pos += k
+ remaining -= k
+ return chunks[0] if len(chunks) == 1 else torch.cat(chunks)
+
+class DistributedTokenLoader:
+ def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device):
+ self.rank, self.world_size, self.device = rank, world_size, device
+ self.stream = TokenStream(pattern)
+ def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]:
+ local_tokens = global_tokens // (self.world_size * grad_accum_steps)
+ per_rank_span = local_tokens + 1
+ chunk = self.stream.take(per_rank_span * self.world_size)
+ start = self.rank * per_rank_span
+ local = chunk[start:start + per_rank_span].pin_memory().to(self.device, non_blocking=True).to(torch.int64)
+ x = local[:-1].reshape(-1, seq_len)
+ y = local[1:].reshape(-1, seq_len)
+ return x, y
+# ---------------------------------------------------------------------------
+# Model
+# ---------------------------------------------------------------------------
+class RMSNorm(nn.Module):
+ def __init__(self, eps: float | None = None):
+ super().__init__()
+ self.eps = eps
+ def forward(self, x: Tensor) -> Tensor:
+ return F.rms_norm(x, (x.size(-1),), eps=self.eps)
+
+def apply_qat_ste(w: Tensor, fp_storage: str | bool) -> Tensor:
+ """Applies Straight-Through Estimator (STE) for FP4 or FP8 simulated quantization."""
+ if not fp_storage:
+ return w
+ if fp_storage == "fp4":
+ absmax = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
+ scale = absmax / 7.0
+ q = torch.clamp(torch.round(w / scale), -7.0, 7.0)
+ w_sim = q * scale
+ return (w_sim - w).detach() + w
+ elif fp_storage is True or fp_storage == "fp8":
+ w_sim = w.to(torch.float8_e4m3fn).to(w.dtype)
+ return (w_sim - w).detach() + w
+ return w
+
+class QATLinear(nn.Linear):
+ def __init__(self, in_features: int, out_features: int, bias: bool = False, fp_storage: str | bool = False):
+ super().__init__(in_features, out_features, bias=bias)
+ self.fp_storage = fp_storage
+ def forward(self, x: Tensor) -> Tensor:
+ w_qat = apply_qat_ste(self.weight, self.fp_storage)
+ return F.linear(x, w_qat.to(x.dtype), self.bias.to(x.dtype) if self.bias is not None else None)
+
+class QATEmbedding(nn.Embedding):
+ def __init__(self, num_embeddings: int, embedding_dim: int, fp_storage: str | bool = False):
+ super().__init__(num_embeddings, embedding_dim)
+ self.fp_storage = fp_storage
+ def forward(self, input: Tensor) -> Tensor:
+ w_qat = apply_qat_ste(self.weight, self.fp_storage)
+ return F.embedding(input, w_qat, self.padding_idx, self.max_norm,
+ self.norm_type, self.scale_grad_by_freq, self.sparse)
+
+class BinaryLinear(nn.Linear):
+ def __init__(self, in_features, out_features, bias=False, group_size=64):
+ super().__init__(in_features, out_features, bias=bias)
+ self.group_size = group_size
+ def forward(self, x: Tensor) -> Tensor:
+ w = self.weight.bfloat16()
+ g = self.group_size
+ w_g = w.reshape(-1, g)
+ scale = w_g.abs().mean(-1, keepdim=True).clamp(min=1e-8)
+ q = torch.where(w_g >= 0, torch.ones_like(w_g), -torch.ones_like(w_g))
+ w_binary = w + ((q * scale).reshape(w.shape) - w).detach()
+ return F.linear(x, w_binary,
+ self.bias.to(x.dtype) if self.bias is not None else None)
+
+class NormedBinaryLinear(BinaryLinear):
+ "Binary linear with RMSNorm on input — for output projections receiving un-normalized activations."
+ def forward(self, x: Tensor) -> Tensor:
+ return super().forward(F.rms_norm(x, (x.size(-1),)))
+
+class GroupedBinaryLinear(nn.Module):
+ "Grouped linear with binary STE. Weight stored as 2D [groups*group_out, group_in] for binary quantization compatibility."
+ def __init__(self, in_features, out_features, groups=4, group_size=64, normed=False):
+ super().__init__()
+ assert in_features % groups == 0 and out_features % groups == 0
+ self.groups = groups
+ self.group_in = in_features // groups
+ self.group_out = out_features // groups
+ self.group_size = group_size
+ self.normed = normed
+ self.weight = nn.Parameter(torch.randn(groups * self.group_out, self.group_in) * 0.02)
+ def forward(self, x: Tensor) -> Tensor:
+ if self.normed:
+ x = F.rms_norm(x, (x.size(-1),))
+ w = self.weight.bfloat16()
+ g = self.group_size
+ w_g = w.reshape(-1, g)
+ scale = w_g.abs().mean(-1, keepdim=True).clamp(min=1e-8)
+ q = torch.where(w_g >= 0, torch.ones_like(w_g), -torch.ones_like(w_g))
+ w_binary = w + ((q * scale).reshape(w.shape) - w).detach()
+ w_grouped = w_binary.reshape(self.groups, self.group_out, self.group_in)
+ bsz = x.shape[:-1]
+ x_g = x.reshape(*bsz, self.groups, self.group_in)
+ out = torch.einsum('...gi,goi->...go', x_g, w_grouped)
+ return out.reshape(*bsz, self.groups * self.group_out)
+
+class TverskyProjection(nn.Module):
+ "Tversky similarity: S = θ·f(A∩B) - α·f(A\\B) - β·f(B\\A). Three modes."
+ def __init__(self, in_features: int, out_features: int, num_features: int = 16,
+ group_size: int = 64, use_shared_features: bool = False,
+ membership: str = "sigmoid"):
+ super().__init__()
+ self.group_size = group_size
+ self.num_features = num_features
+ self.membership_type = membership
+ self.no_features_mode = (num_features == 0)
+ if not self.no_features_mode and not use_shared_features:
+ self.features = nn.Parameter(torch.empty(num_features, in_features).uniform_(-0.02, 0.02))
+ else:
+ self.register_parameter('features', None)
+ self.prototypes = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.02, 0.02))
+ self.theta = nn.Parameter(torch.tensor(1.0))
+ self.alpha = nn.Parameter(torch.tensor(0.5))
+ self.beta = nn.Parameter(torch.tensor(0.5))
+
+ def _binary_ste(self, w: Tensor) -> Tensor:
+ w_bf16 = w.bfloat16()
+ g = self.group_size
+ w_grouped = w_bf16.reshape(-1, g)
+ scale = w_grouped.abs().mean(-1, keepdim=True).clamp(min=1e-8)
+ q = torch.where(w_grouped >= 0, torch.ones_like(w_grouped), -torch.ones_like(w_grouped))
+ w_binary = w_bf16 + ((q * scale).reshape(w_bf16.shape) - w_bf16).detach()
+ return w_binary.reshape(w.shape)
+
+ def _membership(self, t: Tensor) -> Tensor:
+ if self.membership_type == "poly":
+ return torch.clamp(t * 5.0 / 4.0 + 0.5, 0.0, 1.0)
+ elif self.membership_type == "tanh":
+ return (torch.tanh(t * 5.0) + 1.0) * 0.5
+ else:
+ return torch.sigmoid(t * 5.0)
+
+ def forward(self, x: Tensor, shared_features: Tensor | None = None) -> Tensor:
+ proto = self._binary_ste(self.prototypes)
+ if self.no_features_mode:
+ x_f = x @ proto.t()
+ p_norm = F.normalize(proto, dim=-1)
+ p_f = p_norm @ p_norm.t()
+ else:
+ feat = (shared_features if shared_features is not None else self.features).float()
+ x_f = x @ feat.t()
+ p_f = proto @ feat.t()
+ x_s = self._membership(x_f)
+ p_s = self._membership(p_f)
+ x_a = x_f * x_s
+ p_a = p_f * p_s
+ t, a, b = self.theta.abs(), self.alpha.abs(), self.beta.abs()
+ return t * (x_a @ p_a.t()) - a * (x_a @ (1 - p_s).t()) - b * ((1 - x_s) @ p_a.t())
+
+def restore_low_dim_params_to_fp32(module: nn.Module) -> None:
+ with torch.no_grad():
+ for name, param in module.named_parameters():
+ if (param.ndim < 2 or any(p in name for p in CTP)) and param.dtype != torch.float32:
+ param.data = param.data.float()
+
+class Rotary(nn.Module):
+ def __init__(self, dim: int, base: float = 10000.0, no_cache: bool = False,
+ rope_type: str = "rope", yarn_max_len: int = 4096, train_seq_len: int = 1024):
+ super().__init__()
+ self.no_cache = no_cache
+ inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+ if rope_type == "yarn":
+ scale = train_seq_len / yarn_max_len
+ freq_idx = torch.arange(0, dim, 2, dtype=torch.float32)
+ ramp = torch.clamp((freq_idx / dim - 0.25) / 0.75, 0.0, 1.0)
+ inv_freq = inv_freq / (ramp * (1.0 / scale - 1.0) + 1.0)
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
+ self._seq_len_cached = 0
+ self._cos_cached: Tensor | None = None
+ self._sin_cached: Tensor | None = None
+ def forward(self, seq_len, device, dtype):
+ if self.no_cache:
+ t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+ freqs = torch.outer(t, self.inv_freq.to(device))
+ return freqs.cos()[None, :, None, :].to(dtype=dtype), freqs.sin()[None, :, None, :].to(dtype=dtype)
+ if (
+ self._cos_cached is None
+ or self._sin_cached is None
+ or self._seq_len_cached != seq_len
+ or self._cos_cached.device != device
+ ):
+ t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+ freqs = torch.outer(t, self.inv_freq.to(device))
+ self._cos_cached = freqs.cos()[None, :, None, :]
+ self._sin_cached = freqs.sin()[None, :, None, :]
+ self._seq_len_cached = seq_len
+ return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype)
+
+def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor:
+ half = x.size(-1) // 2
+ x1, x2 = x[..., :half], x[..., half:]
+ return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1)
+
+class CausalSelfAttention(nn.Module):
+ def __init__(self, dim, num_heads, num_kv_heads, rope_base, qk_gain_init,
+ group_size=64, attn_proj_type="standard", tversky_num_features=16,
+ tversky_feature_pools=0, no_cache=False, rope_type="rope",
+ yarn_max_len=4096, train_seq_len=1024, tversky_membership="sigmoid",
+ diff_attn=False):
+ super().__init__()
+ self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
+ self.head_dim = dim // num_heads
+ self.diff_attn = diff_attn
+ self.q_size = self.num_heads * self.head_dim
+ self.kv_size = self.num_kv_heads * self.head_dim
+ self.c_qkv = BinaryLinear(dim, self.q_size + 2 * self.kv_size, bias=False, group_size=group_size)
+ self.proj = NormedBinaryLinear(dim, dim, bias=False, group_size=group_size) if attn_proj_type != "tversky" else None
+ if self.proj is not None:
+ self.proj._zero_init = True
+ self.tversky_proj = TverskyProjection(
+ dim, dim, num_features=tversky_num_features, group_size=group_size,
+ use_shared_features=(tversky_feature_pools > 0),
+ membership=tversky_membership,
+ ) if attn_proj_type == "tversky" else None
+ self.shared_features = None
+ self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
+ if diff_attn:
+ self.diff_lambda = nn.Parameter(torch.full((num_heads,), 0.5, dtype=torch.float32))
+ self.rotary = Rotary(self.head_dim, base=rope_base, no_cache=no_cache,
+ rope_type=rope_type, yarn_max_len=yarn_max_len,
+ train_seq_len=train_seq_len)
+ def forward(self, x: Tensor) -> Tensor:
+ bsz, seqlen, dim = x.shape
+ qkv_out = self.c_qkv(x)
+ q_out, k_out, v_out = qkv_out.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
+ q = q_out.reshape(bsz, seqlen, self.num_heads, self.head_dim)
+ k = k_out.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim)
+ v = v_out.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim)
+ q, k = F.rms_norm(q, (q.size(-1),)), F.rms_norm(k, (k.size(-1),))
+ cos, sin = self.rotary(seqlen, x.device, q.dtype)
+ q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin)
+ q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None]
+ if self.diff_attn:
+ half = self.head_dim // 2
+ q1, q2 = q[..., :half], q[..., half:]
+ k1, k2 = k[..., :half], k[..., half:]
+ v1, v2 = v[..., :half], v[..., half:]
+ y1 = flash_attn_func(q1.contiguous(), k1.contiguous(), v1.contiguous(), causal=True)
+ y2 = flash_attn_func(q2.contiguous(), k2.contiguous(), v2.contiguous(), causal=True)
+ lam = self.diff_lambda.to(dtype=y1.dtype)[None, None, :, None]
+ y = torch.cat([y1 - lam * y2, y1 + lam * y2], dim=-1)
+ else:
+ y = flash_attn_func(
+ q.contiguous(),
+ k.contiguous(),
+ v.contiguous(),
+ causal=True
+ )
+ y = y.reshape(bsz, seqlen, dim)
+ return self.tversky_proj(y, self.shared_features) if self.tversky_proj is not None else self.proj(y)
+
+class MLP(nn.Module):
+ def __init__(self, dim, mlp_mult, group_size=64, activation="swiglu", mlp_groups=0):
+ super().__init__()
+ hidden = mlp_mult * dim
+ self.activation = activation
+ if mlp_groups > 0:
+ if activation == "swiglu":
+ self.gate_up = GroupedBinaryLinear(dim, hidden * 2, groups=mlp_groups, group_size=group_size)
+ else:
+ self.fc = GroupedBinaryLinear(dim, hidden, groups=mlp_groups, group_size=group_size)
+ self.proj = GroupedBinaryLinear(hidden, dim, groups=mlp_groups, group_size=group_size, normed=True)
+ else:
+ if activation == "swiglu":
+ self.gate_up = BinaryLinear(dim, hidden * 2, bias=False, group_size=group_size)
+ else:
+ self.fc = BinaryLinear(dim, hidden, bias=False, group_size=group_size)
+ self.proj = NormedBinaryLinear(hidden, dim, bias=False, group_size=group_size)
+ self.proj._zero_init = True
+ def forward(self, x: Tensor) -> Tensor:
+ if self.activation == "swiglu":
+ gu = self.gate_up(x)
+ gate, up = gu.chunk(2, dim=-1)
+ return self.proj(F.silu(gate) * up)
+ elif self.activation == "relu":
+ return self.proj(torch.relu(self.fc(x)))
+ elif self.activation == "leaky_relu":
+ return self.proj(F.leaky_relu(self.fc(x), negative_slope=0.01))
+ else: # relu2
+ return self.proj(torch.relu(self.fc(x)).square())
+
+class SmearModule(nn.Module):
+ def __init__(self, dim: int):
+ super().__init__()
+ self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32))
+ def forward(self, x: Tensor) -> Tensor:
+ cumsum = x.cumsum(dim=1)
+ counts = torch.arange(1, x.size(1) + 1, device=x.device, dtype=x.dtype).view(1, -1, 1)
+ smeared = cumsum / counts
+ gate = torch.tanh(self.gate.to(dtype=x.dtype))
+ return x + gate * (smeared - x)
+
+class CausalConvRefiner(nn.Module):
+ "Causal Conv1d that refines hidden states using local n-gram context."
+ def __init__(self, dim: int, kernel_size: int = 3):
+ super().__init__()
+ self.kernel_size = kernel_size
+ self.conv = nn.Conv1d(dim, dim, kernel_size, padding=0, bias=False)
+ self.gate = nn.Parameter(torch.zeros(1, dtype=torch.float32))
+ def forward(self, x: Tensor) -> Tensor:
+ h = x.permute(0, 2, 1)
+ h = F.pad(h, (self.kernel_size - 1, 0))
+ h = self.conv(h)
+ h = h.permute(0, 2, 1)
+ return x + torch.tanh(self.gate.to(dtype=x.dtype)) * F.rms_norm(h, (h.size(-1),))
+
+class Block(nn.Module):
+ def __init__(self, dim: int, num_heads: int, num_kv_heads: int, mlp_mult: int,
+ rope_base: float, qk_gain_init: float, group_size: int=64,
+ activation: str="swiglu", attn_proj_type: str="standard",
+ tversky_num_features: int=16, tversky_feature_pools: int=0, no_cache: bool=False,
+ smear: bool=False, rope_type: str="rope", yarn_max_len: int=4096,
+ train_seq_len: int=1024, tversky_membership: str="sigmoid",
+ diff_attn: bool=False, mlp_groups: int=0):
+ super().__init__()
+ self.attn_norm = RMSNorm()
+ self.mlp_norm = RMSNorm()
+ self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init,
+ group_size, attn_proj_type, tversky_num_features,
+ tversky_feature_pools, no_cache, rope_type, yarn_max_len,
+ train_seq_len, tversky_membership, diff_attn)
+ self.mlp = MLP(dim, mlp_mult, group_size, activation, mlp_groups)
+ self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+ self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+ self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+ self.smear = SmearModule(dim) if smear else None
+ def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+ mix = self.resid_mix.to(dtype=x.dtype)
+ x = mix[0] * x + mix[1] * x0
+ n = self.attn_norm(x)
+ x = x + self.attn_scale.to(dtype=x.dtype) * self.attn(n)
+ x = x + self.mlp_scale.to(dtype=x.dtype) * self.mlp(self.mlp_norm(x))
+ if self.smear is not None:
+ x = self.smear(x)
+ return x
+
+class GPT(nn.Module):
+ def __init__(self, vocab_size, num_layers, model_dim, num_heads, num_kv_heads, mlp_mult,
+ tie_embeddings, tied_embed_init_std, logit_softcap, rope_base, qk_gain_init,
+ group_size: int = 64, activation: str = "swiglu", mtp_heads_count: int = 0,
+ embed_dim: int = 0, attn_proj_type: str = "standard", logit_head_type: str = "standard",
+ tversky_num_features: int = 16, tversky_feature_pools: int = 0,
+ training_depth_recurrence: int=1, fp_storage=False, bigram_hash: bool=False,
+ softcap_type: str="poly", no_cache: bool=False,
+ smear: bool=False, rope_type: str="rope", yarn_max_len: int=4096,
+ train_seq_len: int=1024, tversky_membership: str="sigmoid",
+ diff_attn=False, mlp_groups=0, refiner=False, refiner_kernel=3):
+ super().__init__()
+ self.training_depth_recurrence = training_depth_recurrence
+ self.fp_storage = fp_storage
+ self.tie_embeddings = tie_embeddings
+ self.logit_softcap = logit_softcap
+ self.softcap_type = softcap_type
+ self.embed_dim = embed_dim if embed_dim > 0 else model_dim
+ self.tok_emb = QATEmbedding(vocab_size, self.embed_dim, fp_storage=fp_storage)
+ self.bigram_emb = QATEmbedding(vocab_size, self.embed_dim, fp_storage=fp_storage) if bigram_hash else None
+ if self.bigram_emb is not None:
+ nn.init.zeros_(self.bigram_emb.weight)
+ self.lm_head_correction = nn.Parameter(
+ torch.zeros(vocab_size, self.embed_dim)) if tie_embeddings == 2 else None
+ self.embed_proj = QATLinear(self.embed_dim, model_dim, bias=False, fp_storage=fp_storage) if self.embed_dim != model_dim else None
+ self.embed_proj_rev = QATLinear(model_dim, self.embed_dim, bias=False, fp_storage=fp_storage) if (
+ self.embed_dim != model_dim and logit_head_type != "tversky") else None
+ self.num_encoder_layers = num_layers // 2
+ self.num_decoder_layers = num_layers - self.num_encoder_layers
+ self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
+ self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
+ # Shared Tversky feature pools (if enabled and num_features > 0)
+ if attn_proj_type == "tversky" and tversky_feature_pools > 0 and tversky_num_features > 0:
+ self.tversky_feature_pools_list = nn.ParameterList([
+ nn.Parameter(torch.empty(tversky_num_features, model_dim).uniform_(-0.02, 0.02))
+ for _ in range(tversky_feature_pools)
+ ])
+ else:
+ self.tversky_feature_pools_list = None
+ self.blocks = nn.ModuleList([
+ Block(model_dim, num_heads, num_kv_heads, mlp_mult, rope_base, qk_gain_init,
+ group_size, activation, attn_proj_type, tversky_num_features, tversky_feature_pools,
+ no_cache, smear, rope_type, yarn_max_len, train_seq_len, tversky_membership,
+ diff_attn, mlp_groups)
+ for _ in range(num_layers)
+ ])
+ # Inject shared feature pool references into attention layers
+ if self.tversky_feature_pools_list is not None:
+ for i, block in enumerate(self.blocks):
+ pool_idx = (i * tversky_feature_pools) // num_layers
+ block.attn.shared_features = self.tversky_feature_pools_list[pool_idx]
+ self.final_norm = RMSNorm()
+ self.refiner = CausalConvRefiner(model_dim, kernel_size=refiner_kernel) if refiner else None
+ self.mtp_heads = nn.ModuleList([
+ nn.Linear(model_dim, vocab_size, bias=False) for _ in range(mtp_heads_count)
+ ])
+ for h in self.mtp_heads:
+ nn.init.zeros_(h.weight)
+ self.logit_head_type = logit_head_type
+ if logit_head_type == "tversky" and tversky_num_features == 0 and vocab_size > 1024:
+ raise ValueError(
+ f"Tversky logit head with no-features mode creates O(V^2) = {vocab_size}x{vocab_size} "
+ f"matrix per forward pass. Use tversky_num_features > 0 or a smaller vocab."
+ )
+ self.tversky_head = TverskyProjection(
+ model_dim, vocab_size, num_features=tversky_num_features,
+ membership=tversky_membership,
+ ) if logit_head_type == "tversky" else None
+ self.lm_head = QATLinear(model_dim, vocab_size, bias=False, fp_storage=fp_storage)
+ self.lm_head._zero_init = True
+ if self.lm_head is not None and (tie_embeddings or logit_head_type == "tversky"):
+ self.lm_head.weight.requires_grad_(False)
+ self.vocab_bias = nn.Parameter(torch.zeros(vocab_size, dtype=torch.float32))
+ self._init_weights(tied_embed_init_std)
+ def _init_weights(self, tied_embed_init_std: float) -> None:
+ if self.tie_embeddings:
+ nn.init.normal_(self.tok_emb.weight, mean=0.0, std=tied_embed_init_std)
+ for module in self.modules():
+ if isinstance(module, BinaryLinear) and not getattr(module, "_zero_init", False):
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
+ elif isinstance(module, nn.Linear) and getattr(module, "_zero_init", False):
+ nn.init.zeros_(module.weight)
+ def _compute_logits(self, x: Tensor) -> Tensor:
+ if self.tversky_head is not None:
+ logits_raw = self.tversky_head(x)
+ elif self.tie_embeddings:
+ if self.embed_proj_rev is not None:
+ proj = self.embed_proj_rev(x)
+ else:
+ proj = x
+ weight = self.tok_emb.weight
+ if self.lm_head_correction is not None:
+ weight = weight + self.lm_head_correction
+ logits_raw = F.linear(proj, weight.to(x.dtype))
+ else:
+ logits_raw = self.lm_head(x)
+ return logits_raw + self.vocab_bias.to(x.dtype)
+ def _softcap(self, logits: Tensor) -> Tensor:
+ s = self.logit_softcap
+ if self.softcap_type == "tanh":
+ return s * torch.tanh(logits / s)
+ x_sc = torch.clamp(logits / s, -2.0, 2.0)
+ x2 = x_sc * x_sc
+ return s * torch.clamp(x_sc * (1.0 - x2 / 3.0 + x2 * x2 / 15.0), -1.0, 1.0)
+ def forward(self, input_ids: Tensor, target_ids: Tensor, reduction: str = "mean", temperature: float = 1.0) -> Tensor:
+ x = self.tok_emb(input_ids).float()
+ if self.bigram_emb is not None:
+ prev = F.pad(input_ids[:, :-1], (1, 0), value=0)
+ x = x + self.bigram_emb(prev).float()
+ if self.embed_proj is not None:
+ x = self.embed_proj(x)
+ x = F.rms_norm(x, (x.size(-1),))
+ x0 = x
+ # U-Net style encoder/decoder with skip connections
+ skips = []
+ for i in range(self.num_encoder_layers):
+ for _ in range(max(1, self.training_depth_recurrence)):
+ x = self.blocks[i](x, x0)
+ skips.append(x)
+ for i in range(self.num_decoder_layers):
+ bi = self.num_encoder_layers + i
+ if skips:
+ x = x + self.skip_weights[i].to(dtype=x.dtype) * skips.pop()
+ for _ in range(max(1, self.training_depth_recurrence)):
+ x = self.blocks[bi](x, x0)
+ x_normed = self.final_norm(x)
+ if self.refiner is not None:
+ x_normed = self.refiner(x_normed)
+ # Standard training/eval path
+ x_flat = x_normed.reshape(-1, x_normed.size(-1))
+ targets = target_ids.reshape(-1)
+        logits = self._softcap(self._compute_logits(x_flat))
+        if temperature != 1.0:
+            logits = logits / temperature  # post-hoc calibration; temperature found by find_temp at eval
+ if reduction == "none":
+ return F.cross_entropy(logits.float(), targets, reduction="none").reshape(input_ids.shape)
+ # Fused CE + Z-loss: single logsumexp computation
+ logits_f = logits.float()
+ lse = torch.logsumexp(logits_f, dim=-1)
+ target_logits = logits_f.gather(1, targets.unsqueeze(1)).squeeze(1)
+ main_loss = (lse - target_logits).mean() + 1e-4 * (lse ** 2).mean()
+ # Multi-token prediction auxiliary loss (training only)
+ if self.training and len(self.mtp_heads) > 0:
+ mtp_loss = torch.zeros((), device=main_loss.device)
+ for k, head in enumerate(self.mtp_heads):
+ shift = k + 2
+ if target_ids.shape[1] > shift:
+ mtp_tgt = target_ids[:, shift:].reshape(-1)
+ mtp_in = x_normed[:, :target_ids.shape[1] - shift, :].reshape(-1, x_normed.shape[-1])
+ mtp_loss = mtp_loss + F.cross_entropy(head(mtp_in).float(), mtp_tgt, reduction="mean")
+ main_loss = main_loss + 0.1 * mtp_loss / len(self.mtp_heads)
+ return main_loss
+
+# ---------------------------------------------------------------------------
+# Validation
+# ---------------------------------------------------------------------------
+def build_luts(sp, vocab_size: int, device: torch.device):
+ sp_vocab_size = int(sp.vocab_size())
+ table_size = max(sp_vocab_size, vocab_size)
+ base_bytes_np = np.zeros((table_size,), dtype=np.int16)
+ has_leading_space_np = np.zeros((table_size,), dtype=np.bool_)
+ is_boundary_token_np = np.ones((table_size,), dtype=np.bool_)
+ for token_id in range(sp_vocab_size):
+ if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
+ continue
+ is_boundary_token_np[token_id] = False
+ if sp.is_byte(token_id):
+ base_bytes_np[token_id] = 1
+ continue
+ piece = sp.id_to_piece(token_id)
+ if piece.startswith("\u2581"):
+ has_leading_space_np[token_id] = True
+ piece = piece[1:]
+ base_bytes_np[token_id] = len(piece.encode("utf-8"))
+ return (
+ torch.tensor(base_bytes_np, dtype=torch.int16, device=device),
+ torch.tensor(has_leading_space_np, dtype=torch.bool, device=device),
+ torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device),
+ )
+
+def ld_val(pattern, seq_len, max_tok=int(os.environ.get("VAL_MAX_TOKENS", 500000))):
+ files = sorted(glob.glob(pattern))
+ assert files, f"No files: {pattern}"
+ tok = torch.cat([ld_shard(Path(p)) for p in files]).contiguous()
+ if max_tok > 0: tok = tok[:max_tok + 1]
+ u = ((tok.numel() - 1) // seq_len) * seq_len
+ return tok[:u + 1]
+
+def eval_val(args, model, rank, world_size, device, grad_accum_steps, val_tokens,
+ base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, temperature: float = 1.0):
+ local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps)
+ local_batch_seqs = max(1, local_batch_tokens // args.train_seq_len)
+ total_seqs = (val_tokens.numel() - 1) // args.train_seq_len
+ seq_start = (total_seqs * rank) // world_size
+ seq_end = (total_seqs * (rank + 1)) // world_size
+ loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+ token_count = torch.zeros((), device=device, dtype=torch.float64)
+ byte_count = torch.zeros((), device=device, dtype=torch.float64)
+ model.eval()
+ with torch.inference_mode():
+ for batch_start in range(seq_start, seq_end, local_batch_seqs):
+ batch_end = min(batch_start + local_batch_seqs, seq_end)
+ raw_start = batch_start * args.train_seq_len
+ raw_end = batch_end * args.train_seq_len + 1
+ local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64)
+ x, y = local[:-1].reshape(-1, args.train_seq_len), local[1:].reshape(-1, args.train_seq_len)
+ with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+ batch_loss = model(x, y, temperature=temperature).detach()
+ n = float(y.numel())
+ loss_sum += batch_loss.to(torch.float64) * n
+ token_count += n
+ prev_ids, tgt_ids = x.reshape(-1), y.reshape(-1)
+ tok_bytes = base_bytes_lut[tgt_ids].to(torch.int16)
+ tok_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(torch.int16)
+ byte_count += tok_bytes.to(torch.float64).sum()
+ if dist.is_available() and dist.is_initialized():
+ for t in (loss_sum, token_count, byte_count):
+ dist.all_reduce(t, op=dist.ReduceOp.SUM)
+ val_loss = loss_sum / token_count
+ bpb = (val_loss.item() / math.log(2.0)) * (token_count.item() / byte_count.item())
+ model.train()
+ return float(val_loss.item()), float(bpb)
+
+def eval_val_sliding(args, model, rank, world_size, device, grad_accum_steps, val_tokens,
+ base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+ stride: int = 64, temperature: float = 1.0):
+ seq_len = args.train_seq_len
+ batch_size = args.sliding_batch_size
+ total_tokens = val_tokens.numel() - 1
+ loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+ token_count = torch.zeros((), device=device, dtype=torch.float64)
+ byte_count = torch.zeros((), device=device, dtype=torch.float64)
+ all_starts = list(range(0, total_tokens - seq_len, stride))
+ my_starts = all_starts[rank::world_size]
+ model.eval()
+ with torch.inference_mode():
+ for i in range(0, len(my_starts), batch_size):
+ batch_starts = my_starts[i:i + batch_size]
+ starts_t = torch.tensor(batch_starts, dtype=torch.int64)
+ offsets = torch.arange(seq_len + 1, dtype=torch.int64)
+ indices = starts_t.unsqueeze(1) + offsets.unsqueeze(0)
+ local_batch = val_tokens[indices].to(device=device, dtype=torch.int64, non_blocking=True)
+ x = local_batch[:, :-1]
+ y = local_batch[:, 1:]
+ with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+ per_token_loss = model(x, y, reduction="none", temperature=temperature).detach()
+ for b, start in enumerate(batch_starts):
+ score_from = 0 if start == 0 else seq_len - stride
+ scored = per_token_loss[b, score_from:]
+ sx, sy = x[b, score_from:], y[b, score_from:]
+ loss_sum += scored.to(torch.float64).sum()
+ token_count += scored.numel()
+ tok_bytes = base_bytes_lut[sy].to(torch.int16)
+ tok_bytes += (has_leading_space_lut[sy] & ~is_boundary_token_lut[sx]).to(torch.int16)
+ byte_count += tok_bytes.to(torch.float64).sum()
+ if dist.is_available() and dist.is_initialized():
+ for t in (loss_sum, token_count, byte_count):
+ dist.all_reduce(t, op=dist.ReduceOp.SUM)
+ val_loss = loss_sum / token_count
+ bpb = (val_loss.item() / math.log(2.0)) * (token_count.item() / byte_count.item())
+ model.train()
+ return float(val_loss.item()), float(bpb)
+
+# ---------------------------------------------------------------------------
+# Temperature scaling
+# ---------------------------------------------------------------------------
+def find_temp(args, base_model, rank, world_size, device, grad_accum_steps,
+ calibration_tokens, base_bytes_lut, has_leading_space_lut,
+ is_boundary_token_lut):
+ best_t, best_loss = 1.0, float("inf")
+ for t in [0.90, 0.95, 1.00, 1.05, 1.10]:
+ loss, _ = eval_val(args, base_model, rank, world_size, device, grad_accum_steps,
+ calibration_tokens, base_bytes_lut, has_leading_space_lut,
+ is_boundary_token_lut, temperature=t)
+ if loss < best_loss:
+ best_loss = loss
+ best_t = t
+ return best_t
+
+# ---------------------------------------------------------------------------
+# Training
+# ---------------------------------------------------------------------------
+def main() -> None:
+ args = Hyperparameters()
+ code = Path(__file__).read_text(encoding="utf-8")
+ if args.matrix_optimizer != "adamw":
+ global ns_orth
+ ns_orth = torch.compile(ns_orth)
+ distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
+ rank = int(os.environ.get("RANK", "0"))
+ world_size = int(os.environ.get("WORLD_SIZE", "1"))
+ local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+ grad_accum_steps = max(1, 8 // world_size)
+ grad_scale = 1.0 / grad_accum_steps
+ if not torch.cuda.is_available():
+ raise RuntimeError("CUDA is required")
+ device = torch.device("cuda", local_rank)
+ torch.cuda.set_device(device)
+ if distributed:
+ dist.init_process_group(backend="nccl", device_id=device)
+ dist.barrier()
+ master_process = rank == 0
+ torch.backends.cuda.matmul.allow_tf32 = True
+ torch.backends.cudnn.allow_tf32 = True
+ os.makedirs("logs/cuda/", exist_ok=True)
+ logfile = f"logs/cuda/{args.run_id}.txt" if master_process else None
+ if master_process:
+ print(logfile)
+ def log0(msg: str, console: bool = True) -> None:
+ if not master_process:
+ return
+ if console:
+ print(msg)
+ if logfile:
+ with open(logfile, "a", encoding="utf-8") as f:
+ print(msg, file=f)
+ log0(code, console=False)
+ log0("=" * 100, console=False)
+ log0(f"Python {sys.version}", console=False)
+ log0(f"PyTorch {torch.__version__}", console=False)
+ random.seed(args.seed)
+ np.random.seed(args.seed)
+ torch.manual_seed(args.seed)
+ torch.cuda.manual_seed_all(args.seed)
+ sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+ val_tokens = ld_val(args.val_files, args.train_seq_len)
+ base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_luts(
+ sp, args.vocab_size, device)
+
+ # --- Model ---
+ base_model = GPT(
+ vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim,
+ num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult,
+ tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std,
+ logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init,
+ group_size=args.bitnet_group_size, activation=args.activation_type, mtp_heads_count=args.mtp_heads_count,
+ embed_dim=args.embed_dim, attn_proj_type=args.attn_proj_type, logit_head_type=args.logit_head_type,
+ tversky_num_features=args.tversky_num_features, tversky_feature_pools=args.tversky_feature_pools,
+ training_depth_recurrence=args.training_depth_recurrence, fp_storage=args.fp_storage,
+ bigram_hash=args.bigram_hash, softcap_type=args.softcap_type, no_cache=(args.compile_mode == "reduce-overhead"),
+ smear=args.smear, rope_type=args.rope_type, yarn_max_len=args.yarn_max_len, train_seq_len=args.train_seq_len,
+ tversky_membership=args.tversky_membership, diff_attn=args.diff_attn,
+ refiner=args.refiner, refiner_kernel=args.refiner_kernel, mlp_groups=args.mlp_groups,
+ ).to(device).bfloat16()
+ for module in base_model.modules():
+ if isinstance(module, nn.Linear):
+ module.float()
+ restore_low_dim_params_to_fp32(base_model)
+ if base_model.lm_head is not None and (args.tie_embeddings or args.logit_head_type == "tversky"):
+ base_model.lm_head.weight.requires_grad_(False)
+ torch._dynamo.config.optimize_ddp = False
+ compiled_model = torch.compile(base_model, mode=args.compile_mode if args.compile_mode != "default" else None)
+ use_find_unused = args.untie_at_fraction > 0 or args.mtp_heads_count > 0 or not args.tie_embeddings
+ model = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False,
+ find_unused_parameters=use_find_unused,
+ static_graph=not use_find_unused,
+ gradient_as_bucket_view=True) if distributed else compiled_model
+
+ # --- Optimizers ---
+ _excl = {"tok_emb.weight", "lm_head.weight", "lm_head_correction"}
+ all_other_params = [(n, p) for n, p in base_model.named_parameters()
+ if not any(eh in n for eh in _excl)]
+ matrix_params = [p for n, p in all_other_params
+ if p.ndim == 2 and not any(pat in n for pat in CTP)]
+ scalar_params = [p for n, p in all_other_params
+ if p.ndim < 2 or any(pat in n for pat in CTP)]
+ token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr
+ opt_tok = torch.optim.Adam(
+ [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True)
+ if args.matrix_optimizer == "adamw":
+ opt_muon = torch.optim.AdamW(
+ [{"params": matrix_params, "lr": args.adam_lr, "base_lr": args.adam_lr}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, weight_decay=args.adam_wd, fused=True)
+ else:
+ opt_muon = Muon(matrix_params, lr=args.matrix_lr, momentum=args.muon_momentum,
+ backend_steps=args.muon_backend_steps, wd=args.muon_wd)
+ for g in opt_muon.param_groups:
+ g["base_lr"] = args.matrix_lr
+ opt_scalar = torch.optim.Adam(
+ [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True)
+ opt_head = None
+ if base_model.lm_head is not None:
+ opt_head = torch.optim.Adam(
+ [{"params": [base_model.lm_head.weight], "lr": 0.0, "base_lr": 0.0}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True)
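+ # The head optimizer starts at lr=0: while the embedding is tied the head
+ # weight is frozen (requires_grad_(False) above), and this group only
+ # becomes active if the head is untied later in training.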
+ optimizers = [opt for opt in [opt_tok, opt_muon, opt_scalar, opt_head] if opt is not None]
+ if base_model.lm_head_correction is not None:
+ opt_corr = torch.optim.Adam(
+ [{"params": [base_model.lm_head_correction],
+ "lr": args.corr_weight_lr, "base_lr": args.corr_weight_lr}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True)
+ optimizers.append(opt_corr)
+
+ # --- Log all hyperparameters ---
+ log0("--- Hyperparameters ---", console=False)
+ log0(" ".join(f"{a}={getattr(args,a)}" for a in sorted(dir(args)) if not a.startswith("_") and a not in ("train_files","val_files") and not callable(getattr(args,a))), console=False)
+ n_params = sum(p.numel() for p in base_model.parameters())
+ log0(f"params:{n_params} L:{args.num_layers} d:{args.model_dim} h:{args.num_heads} kv:{args.num_kv_heads} ws:{world_size} ga:{grad_accum_steps} s:{args.seed}")
+ # --- Data loader & helpers ---
+ train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+ def zero_grad_all():
+ for opt in optimizers:
+ opt.zero_grad(set_to_none=True)
+ max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None
+ def lr_mul(step: int, elapsed_ms: float):
+ if args.warmdown_fraction <= 0:
+ return 1.0
+ if max_wallclock_ms is None:
+ warmdown_start = int(args.iterations * (1.0 - args.warmdown_fraction))
+ return max((args.iterations - step) / max(args.iterations * args.warmdown_fraction, 1), 0.0) if step >= warmdown_start else 1.0
+ warmdown_ms = max_wallclock_ms * args.warmdown_fraction
+ remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
+ return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0
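+ # Schedule sketch: lr_mul stays at 1.0 until the final warmdown_fraction of
+ # the run, then decays linearly to 0 (by step count, or by remaining
+ # wallclock when a cap is set). With this run's iterations=50000 and
+ # warmdown_fraction=0.2, LR is constant to step 40000 and hits 0 at 50000.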
+ _seq_switched = False
+ _batch_switched = False
+ active_seq_len = args.seq_len_start if args.seq_len_start > 0 else args.train_seq_len
+ active_batch_tokens = args.batch_tokens_start if args.batch_tokens_start > 0 else args.train_batch_tokens
+ # --- Compiler warmup ---
+ if args.warmup_steps > 0:
+ _ms = {n: t.detach().cpu().clone() for n, t in base_model.state_dict().items()}
+ _os = [copy.deepcopy(o.state_dict()) for o in optimizers]
+ model.train()
+ for ws in range(args.warmup_steps):
+ zero_grad_all()
+ for mi in range(grad_accum_steps):
+ if distributed: model.require_backward_grad_sync = mi == grad_accum_steps - 1
+ x, y = train_loader.next_batch(active_batch_tokens, active_seq_len, grad_accum_steps)
+ torch.compiler.cudagraph_mark_step_begin()
+ with torch.autocast(device_type="cuda", dtype=torch.bfloat16): loss = model(x, y)
+ (loss * grad_scale).backward()
+ for o in optimizers: o.step()
+ zero_grad_all()
+ log0(f"warmup:{ws+1}/{args.warmup_steps}")
+ base_model.load_state_dict(_ms, strict=True)
+ for o, s in zip(optimizers, _os): o.load_state_dict(s)
+ zero_grad_all()
+ train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
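+ # The warmup steps above exist only to trigger torch.compile / CUDA-graph
+ # capture; model and optimizer state are then restored and the token stream
+ # recreated so the measured run starts from the true initial state.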
+
+ # --- EMA model ---
+ ema_model = None
+ _ema_started = False
+ _ema_steps = 0
+ if args.ema:
+ ema_model = copy.deepcopy(base_model)
+ for p in ema_model.parameters():
+ p.requires_grad_(False)
+
+ # --- Main training loop ---
+ training_time_ms = 0.0
+ stop_after_step: int | None = None
+ _untied = False
+ train_loss = torch.zeros((), device=device)
+ torch.cuda.synchronize()
+ t0 = time.perf_counter()
+ step = 0
+ while True:
+ last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
+ if last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0):
+ torch.cuda.synchronize()
+ training_time_ms += 1000.0 * (time.perf_counter() - t0)
+ val_loss, val_bpb = eval_val(args, model, rank, world_size, device, grad_accum_steps,
+ val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut)
+ log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
+ f"train_time:{training_time_ms:.0f}ms")
+ torch.cuda.synchronize()
+ t0 = time.perf_counter()
+ if last_step:
+ if stop_after_step is not None and step < args.iterations:
+ log0(f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms step:{step}/{args.iterations}")
+ break
+ elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+ scale = lr_mul(step, elapsed_ms)
+ # Sequence length schedule
+ if args.seq_len_start > 0 and not _seq_switched:
+ if max_wallclock_ms is not None:
+ should_switch_seq = elapsed_ms >= args.seq_schedule_fraction * max_wallclock_ms
+ else:
+ should_switch_seq = step >= int(args.iterations * args.seq_schedule_fraction)
+ if should_switch_seq:
+ active_seq_len = args.train_seq_len
+ _seq_switched = True
+ torch._dynamo.reset()
+ train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+ log0(f"step:{step} seq_len_switch:{args.seq_len_start}->{active_seq_len}")
+
+ # Batch size schedule
+ if args.batch_tokens_start > 0 and not _batch_switched:
+ if max_wallclock_ms is not None:
+ should_switch_batch = elapsed_ms >= args.batch_schedule_fraction * max_wallclock_ms
+ else:
+ should_switch_batch = step >= int(args.iterations * args.batch_schedule_fraction)
+ if should_switch_batch:
+ active_batch_tokens = args.train_batch_tokens
+ _batch_switched = True
+ log0(f"step:{step} batch_switch:{args.batch_tokens_start}->{active_batch_tokens}")
+ zero_grad_all()
+ train_loss.zero_()
+ for micro in range(grad_accum_steps):
+ if distributed:
+ model.require_backward_grad_sync = micro == grad_accum_steps - 1
+ x, y = train_loader.next_batch(active_batch_tokens, active_seq_len, grad_accum_steps)
+ torch.compiler.cudagraph_mark_step_begin()
+ with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+ loss = model(x, y)
+ train_loss.add_(loss.detach())
+ (loss * grad_scale).backward()
+ train_loss /= grad_accum_steps
+
+ # Untie lm_head at configured fraction of training
+ if args.untie_at_fraction > 0:
+ if max_wallclock_ms is not None:
+ should_untie = not _untied and elapsed_ms >= args.untie_at_fraction * max_wallclock_ms
+ else:
+ should_untie = not _untied and step >= int(args.iterations * args.untie_at_fraction)
+ if should_untie and base_model.tie_embeddings:
+ with torch.no_grad():
+ base_weight = base_model.tok_emb.weight.float()
+ if base_model.lm_head_correction is not None:
+ base_weight = base_weight + base_model.lm_head_correction.float()
+ if base_model.embed_proj_rev is not None:
+ full_weight = base_weight @ base_model.embed_proj_rev.weight.float()
+ else:
+ full_weight = base_weight
+ base_model.lm_head.weight.copy_(full_weight)
+ base_model.tie_embeddings = False
+ base_model.lm_head.weight.requires_grad_(True)
+ for g in opt_head.param_groups:
+ g["lr"] = g["base_lr"] = args.head_lr
+ _untied = True
+ torch._dynamo.reset()
+ log0(f"step:{step} untied lm_head (head_lr={args.head_lr})")
+
+ # Muon momentum warmup
+ if args.matrix_optimizer != "adam":
+ frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0
+ for g in opt_muon.param_groups:
+ g["momentum"] = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum
+
+ # LR scheduling
+ for opt in optimizers:
+ for g in opt.param_groups:
+ g["lr"] = g["base_lr"] * scale
+ opt.step()
+ zero_grad_all()
+ # EMA update
+ if ema_model is not None:
+ if not _ema_started:
+ if max_wallclock_ms is not None:
+ should_start_ema = elapsed_ms >= args.ema_start_fraction * max_wallclock_ms
+ else:
+ should_start_ema = step >= int(args.iterations * args.ema_start_fraction)
+ if should_start_ema:
+ _ema_started = True
+ _ema_steps = 0
+ with torch.no_grad():
+ for ep, bp in zip(ema_model.parameters(), base_model.parameters()):
+ ep.data.copy_(bp.data)
+ log0(f"step:{step} ema_started")
+ if _ema_started:
+ _ema_steps += 1
+ decay = min(args.ema_decay, (1.0 + _ema_steps) / (10.0 + _ema_steps))
+ with torch.no_grad():
+ for ep, bp in zip(ema_model.parameters(), base_model.parameters()):
+ ep.data.mul_(decay).add_(bp.data, alpha=1.0 - decay)
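+ # The decay ramps in as min(ema_decay, (1+n)/(10+n)), so early EMA steps
+ # weight the live model heavily before settling at args.ema_decay.
+ # (The logged run sets ema=False, so this branch stayed inactive.)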
+ step += 1
+ approx_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+
+ if args.train_log_every > 0 and step % args.train_log_every == 0:
+ log0(f"step:{step}/{args.iterations} loss:{train_loss.item():.4f} t:{approx_ms:.0f}ms avg:{approx_ms/step:.1f}ms")
+ if args.churn_log_every > 0 and step % args.churn_log_every == 0:
+ log0(f"step:{step} churn:{churn_fn(base_model, args.bitnet_group_size):.4f}")
+ # Wallclock cap sync
+ if stop_after_step is None and max_wallclock_ms is not None and step % 10 == 0:
+ reached_cap = approx_ms >= max_wallclock_ms
+ if distributed:
+ cap_t = torch.tensor(int(reached_cap), device=device)
+ dist.all_reduce(cap_t, op=dist.ReduceOp.MAX)
+ reached_cap = bool(cap_t.item())
+ if reached_cap:
+ stop_after_step = step
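+ # The cap flag is all-reduced with MAX so every rank agrees on the same
+ # stop step; a rank-local decision could leave some ranks blocked inside
+ # collectives after others exit the loop.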
+
+ # --- Serialization ---
+ if master_process:
+ sd = (ema_model if ema_model is not None and _ema_started else base_model).state_dict()
+ if base_model.tie_embeddings or args.logit_head_type == "tversky":
+ sd.pop("lm_head.weight", None)
+
+ # Compute binary overrides for no-features Tversky prototypes
+ binary_overrides = set()
+ for n, m in base_model.named_modules():
+ if isinstance(m, TverskyProjection) and m.no_features_mode:
+ binary_overrides.add(n + ".prototypes")
+ binary_overrides = binary_overrides or None
+ q_obj, q_stats = q_sd(sd, group_size=args.bitnet_group_size, fp_storage=args.fp_storage, binary_override_names=binary_overrides)
+ buf = io.BytesIO()
+ torch.save(q_obj, buf)
+ final_blob = lzma.compress(buf.getvalue(), preset=9)
+ with open("final_model.binary.ptz", "wb") as f:
+ f.write(final_blob)
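+ # Artifact pipeline: q_sd quantises the state dict (1-bit weights plus
+ # per-group scales, with a small floating-point remainder), torch.save
+ # writes it to an in-memory buffer, and LZMA preset=9 squeezes the
+ # residual redundancy in scales and metadata before hitting disk.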
+ artifact_bytes = len(final_blob)
+ code_bytes = len(code.encode("utf-8"))
+ total = artifact_bytes + code_bytes
+ log0(f"artifact:{artifact_bytes/1e6:.2f}MB binary:{q_stats['binary_params']}({q_stats['binary_bytes']}B) fp:{q_stats['fp_params']}({q_stats['fp_bytes']}B) code:{code_bytes}")
+ log0(f"budget:{total}/{16000000} ({total/1e6:.2f}/{16.00:.2f}MB) {'FITS' if total <= 16000000 else 'OVER'}")
+ if args.eval_depth_recurrence > 0:
+ base_model.training_depth_recurrence = args.eval_depth_recurrence
+ log0(f"eval_depth_recurrence:{args.eval_depth_recurrence}")
+
+ # --- All ranks load roundtrip weights and evaluate ---
+ if distributed:
+ dist.barrier()
+ with open("final_model.binary.ptz", "rb") as f:
+ loaded = torch.load(io.BytesIO(lzma.decompress(f.read())), map_location="cpu", weights_only=False)
+ base_model.load_state_dict(deq_sd(loaded), strict=False)
+ if ema_model is not None:
+ ema_model.load_state_dict(deq_sd(loaded), strict=False)
+ torch._dynamo.reset()
+ q_val_loss, q_val_bpb = eval_val(args, model, rank, world_size, device, grad_accum_steps,
+ val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut)
+ log0(f"final_binary_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f}")
+
+ opt_temp = 1.0
+ if args.temp_scaling:
+ torch.cuda.synchronize()
+ t_temp = time.perf_counter()
+ calibration_tokens = train_loader.stream.take(65536).to(device)
+ opt_temp = find_temp(args, base_model, rank, world_size, device, grad_accum_steps,
+ calibration_tokens, base_bytes_lut, has_leading_space_lut,
+ is_boundary_token_lut)
+ torch.cuda.synchronize()
+ temp_time_ms = 1000.0 * (time.perf_counter() - t_temp)
+ log0(f"temp_scaling optimal_T:{opt_temp:.2f} eval_time:{temp_time_ms:.0f}ms")
+
+ if args.sliding_eval:
+ torch.cuda.synchronize()
+ t_sliding = time.perf_counter()
+ sw_loss, sw_bpb = eval_val_sliding(args, base_model, rank, world_size, device, grad_accum_steps,
+ val_tokens, base_bytes_lut, has_leading_space_lut,
+ is_boundary_token_lut, stride=args.sliding_eval_stride,
+ temperature=opt_temp)
+ torch.cuda.synchronize()
+ sliding_time_ms = 1000.0 * (time.perf_counter() - t_sliding)
+ log0(f"final_sliding val_loss:{sw_loss:.4f} val_bpb:{sw_bpb:.4f} "
+ f"(stride={args.sliding_eval_stride}, T={opt_temp:.2f}) eval_time:{sliding_time_ms:.0f}ms")
+
+ if distributed:
+ dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+ main()
+
+====================================================================================================
+Python 3.13.12 | packaged by Anaconda, Inc. | (main, Feb 24 2026, 16:13:31) [GCC 14.3.0]
+PyTorch 2.10.0+cu128
+--- Hyperparameters ---
+activation_type=relu2 adam_eps=1e-08 adam_lr=0.05 adam_wd=0.05 attn_proj_type=standard batch_schedule_fraction=0.33 batch_tokens_start=0 beta1=0.9 beta2=0.95 bigram_hash=False bitnet_group_size=128 churn_log_every=1000 compile_mode=default corr_weight_lr=0.02 data_path=./data/datasets/fineweb10B_sp8192 diff_attn=False ema=False ema_decay=0.995 ema_start_fraction=0.5 embed_dim=254 embed_lr=0.6 eval_depth_recurrence=0 fp_storage=True grad_clip_norm=0.0 head_lr=0.02 iterations=50000 logit_head_type=standard logit_softcap=10.0 matrix_lr=0.04 matrix_optimizer=muon max_wallclock_seconds=0.0 mlp_groups=0 mlp_mult=4 model_dim=768 mtp_heads_count=0 muon_backend_steps=3 muon_momentum=0.95 muon_momentum_warmup_start=0.85 muon_momentum_warmup_steps=500 muon_wd=0.0 num_heads=8 num_kv_heads=4 num_layers=15 qk_gain_init=2.25 refiner=False refiner_kernel=3 rope_base=5000.0 rope_type=yarn run_id=pushing_run_binary_1 scalar_lr=0.02 seed=42 seq_len_start=0 seq_schedule_fraction=0.0 sliding_batch_size=256 sliding_eval=True sliding_eval_stride=16 smear=True softcap_type=poly temp_scaling=True tie_embeddings=1 tied_embed_init_std=0.005 tied_embed_lr=0.02 tokenizer_path=./data/tokenizers/fineweb_8192_bpe.model train_batch_tokens=524288 train_log_every=500 train_seq_len=1024 training_depth_recurrence=0 tversky_feature_pools=0 tversky_membership=sigmoid tversky_num_features=0 untie_at_fraction=0.0 val_batch_size=524288 val_loss_every=0 vocab_size=8192 warmdown_fraction=0.2 warmup_steps=5 yarn_max_len=2048
+params:106154616 L:15 d:768 h:8 kv:4 ws:8 ga:1 s:42
+warmup:1/5
+warmup:2/5
+warmup:3/5
+warmup:4/5
+warmup:5/5
+step:500/50000 loss:3.6805 t:77540ms avg:155.1ms
+step:1000/50000 loss:3.3485 t:155075ms avg:155.1ms
+step:1000 churn:0.0000
+step:1500/50000 loss:3.3714 t:232880ms avg:155.3ms
+step:2000/50000 loss:3.3187 t:310516ms avg:155.3ms
+step:2000 churn:0.1984
+step:2500/50000 loss:3.2573 t:388417ms avg:155.4ms
+step:3000/50000 loss:3.1844 t:465980ms avg:155.3ms
+step:3000 churn:0.1457
+step:3500/50000 loss:3.3885 t:543772ms avg:155.4ms
+step:4000/50000 loss:3.3496 t:621381ms avg:155.3ms
+step:4000 churn:0.1252
+step:4500/50000 loss:3.3527 t:699211ms avg:155.4ms
+step:5000/50000 loss:3.2171 t:776797ms avg:155.4ms
+step:5000 churn:0.1151
+step:5500/50000 loss:3.0536 t:854512ms avg:155.4ms
+step:6000/50000 loss:3.1355 t:932007ms avg:155.3ms
+step:6000 churn:0.1087
+step:6500/50000 loss:3.1928 t:1009731ms avg:155.3ms
+step:7000/50000 loss:3.2378 t:1087253ms avg:155.3ms
+step:7000 churn:0.1041
+step:7500/50000 loss:3.1585 t:1164994ms avg:155.3ms
+step:8000/50000 loss:3.1436 t:1242513ms avg:155.3ms
+step:8000 churn:0.1009
+step:8500/50000 loss:3.0573 t:1320248ms avg:155.3ms
+step:9000/50000 loss:3.0523 t:1397837ms avg:155.3ms
+step:9000 churn:0.0982
+step:9500/50000 loss:3.3082 t:1475596ms avg:155.3ms
+step:10000/50000 loss:3.3521 t:1553112ms avg:155.3ms
+step:10000 churn:0.0964
+step:10500/50000 loss:3.1877 t:1630835ms avg:155.3ms
+step:11000/50000 loss:2.7388 t:1708388ms avg:155.3ms
+step:11000 churn:0.0948
+step:11500/50000 loss:3.2052 t:1786100ms avg:155.3ms
+step:12000/50000 loss:3.2859 t:1863613ms avg:155.3ms
+step:12000 churn:0.0935
+step:12500/50000 loss:3.0326 t:1941282ms avg:155.3ms
+step:13000/50000 loss:3.2551 t:2018764ms avg:155.3ms
+step:13000 churn:0.0924
+step:13500/50000 loss:3.1339 t:2096463ms avg:155.3ms
+step:14000/50000 loss:3.0606 t:2173965ms avg:155.3ms
+step:14000 churn:0.0915
+step:14500/50000 loss:3.1752 t:2251634ms avg:155.3ms
+step:15000/50000 loss:3.0206 t:2329140ms avg:155.3ms
+step:15000 churn:0.0907
+step:15500/50000 loss:3.2017 t:2406858ms avg:155.3ms
+step:16000/50000 loss:3.1705 t:2484387ms avg:155.3ms
+step:16000 churn:0.0900
+step:16500/50000 loss:3.0774 t:2562139ms avg:155.3ms
+step:17000/50000 loss:3.2494 t:2639671ms avg:155.3ms
+step:17000 churn:0.0894
+step:17500/50000 loss:3.2024 t:2717393ms avg:155.3ms
+step:18000/50000 loss:3.1627 t:2794977ms avg:155.3ms
+step:18000 churn:0.0888
+step:18500/50000 loss:3.1733 t:2872744ms avg:155.3ms
+step:19000/50000 loss:3.2055 t:2950389ms avg:155.3ms
+step:19000 churn:0.0885
+step:19500/50000 loss:3.2026 t:3028137ms avg:155.3ms
+step:20000/50000 loss:2.9144 t:3105704ms avg:155.3ms
+step:20000 churn:0.0880
+step:20500/50000 loss:3.2154 t:3183466ms avg:155.3ms
+step:21000/50000 loss:3.1016 t:3261044ms avg:155.3ms
+step:21000 churn:0.0878
+step:21500/50000 loss:3.2065 t:3338791ms avg:155.3ms
+step:22000/50000 loss:3.1611 t:3416326ms avg:155.3ms
+step:22000 churn:0.0875
+step:22500/50000 loss:3.2578 t:3494047ms avg:155.3ms
+step:23000/50000 loss:3.0689 t:3571604ms avg:155.3ms
+step:23000 churn:0.0871
+step:23500/50000 loss:3.2047 t:3649319ms avg:155.3ms
+step:24000/50000 loss:3.0689 t:3726856ms avg:155.3ms
+step:24000 churn:0.0868
+step:24500/50000 loss:3.2355 t:3804562ms avg:155.3ms
+step:25000/50000 loss:3.2085 t:3882065ms avg:155.3ms
+step:25000 churn:0.0865
+step:25500/50000 loss:3.2235 t:3959778ms avg:155.3ms
+step:26000/50000 loss:3.2484 t:4037303ms avg:155.3ms
+step:26000 churn:0.0863
+step:26500/50000 loss:3.2419 t:4114994ms avg:155.3ms
+step:27000/50000 loss:3.1215 t:4192502ms avg:155.3ms
+step:27000 churn:0.0861
+step:27500/50000 loss:3.1305 t:4270187ms avg:155.3ms
+step:28000/50000 loss:3.2679 t:4347697ms avg:155.3ms
+step:28000 churn:0.0858
+step:28500/50000 loss:3.1768 t:4425383ms avg:155.3ms
+step:29000/50000 loss:3.1519 t:4502876ms avg:155.3ms
+step:29000 churn:0.0857
+step:29500/50000 loss:3.1614 t:4580510ms avg:155.3ms
+step:30000/50000 loss:3.2341 t:4658001ms avg:155.3ms
+step:30000 churn:0.0855
+step:30500/50000 loss:3.1673 t:4735648ms avg:155.3ms
+step:31000/50000 loss:3.0884 t:4813158ms avg:155.3ms
+step:31000 churn:0.0854
+step:31500/50000 loss:3.0147 t:4890803ms avg:155.3ms
+step:32000/50000 loss:3.1793 t:4968281ms avg:155.3ms
+step:32000 churn:0.0853
+step:32500/50000 loss:3.1626 t:5045990ms avg:155.3ms
+step:33000/50000 loss:3.3086 t:5123506ms avg:155.3ms
+step:33000 churn:0.0851
+step:33500/50000 loss:2.9607 t:5201190ms avg:155.3ms
+step:34000/50000 loss:3.1584 t:5278703ms avg:155.3ms
+step:34000 churn:0.0850
+step:34500/50000 loss:3.2311 t:5356349ms avg:155.3ms
+step:35000/50000 loss:3.0574 t:5433881ms avg:155.3ms
+step:35000 churn:0.0848
+step:35500/50000 loss:3.1880 t:5511613ms avg:155.3ms
+step:36000/50000 loss:3.0474 t:5589157ms avg:155.3ms
+step:36000 churn:0.0848
+step:36500/50000 loss:3.1925 t:5666894ms avg:155.3ms
+step:37000/50000 loss:3.0935 t:5744417ms avg:155.3ms
+step:37000 churn:0.0847
+step:37500/50000 loss:3.1454 t:5822114ms avg:155.3ms
+step:38000/50000 loss:2.9914 t:5899675ms avg:155.3ms
+step:38000 churn:0.0846
+step:38500/50000 loss:3.1192 t:5977449ms avg:155.3ms
+step:39000/50000 loss:3.1994 t:6055002ms avg:155.3ms
+step:39000 churn:0.0845
+step:39500/50000 loss:3.1586 t:6132704ms avg:155.3ms
+step:40000/50000 loss:3.1402 t:6210265ms avg:155.3ms
+step:40000 churn:0.0845
+step:40500/50000 loss:3.2176 t:6287989ms avg:155.3ms
+step:41000/50000 loss:3.1743 t:6365543ms avg:155.3ms
+step:41000 churn:0.0831
+step:41500/50000 loss:3.1811 t:6443269ms avg:155.3ms
+step:42000/50000 loss:3.0934 t:6520796ms avg:155.3ms
+step:42000 churn:0.0810
+step:42500/50000 loss:3.0804 t:6598538ms avg:155.3ms
+step:43000/50000 loss:3.1341 t:6676105ms avg:155.3ms
+step:43000 churn:0.0788
+step:43500/50000 loss:3.0942 t:6753855ms avg:155.3ms
+step:44000/50000 loss:3.0144 t:6831414ms avg:155.3ms
+step:44000 churn:0.0769
+step:44500/50000 loss:2.8582 t:6909098ms avg:155.3ms
+step:45000/50000 loss:3.3925 t:6986654ms avg:155.3ms
+step:45000 churn:0.0745
+step:45500/50000 loss:3.0488 t:7064379ms avg:155.3ms
+step:46000/50000 loss:2.9942 t:7141950ms avg:155.3ms
+step:46000 churn:0.0721
+step:46500/50000 loss:3.0737 t:7219653ms avg:155.3ms
+step:47000/50000 loss:3.1052 t:7297260ms avg:155.3ms
+step:47000 churn:0.0688
+step:47500/50000 loss:3.1031 t:7375013ms avg:155.3ms
+step:48000/50000 loss:3.0978 t:7452604ms avg:155.3ms
+step:48000 churn:0.0648
+step:48500/50000 loss:3.0704 t:7530338ms avg:155.3ms
+step:49000/50000 loss:3.0631 t:7607877ms avg:155.3ms
+step:49000 churn:0.0586
+step:49500/50000 loss:2.9547 t:7685573ms avg:155.3ms
+step:50000/50000 loss:3.0994 t:7763153ms avg:155.3ms
+step:50000 churn:0.0453
+step:50000/50000 val_loss:2.9692 val_bpb:1.1497 train_time:7763355ms
+artifact:15.60MB binary:97320960(13685760B) fp:2542200(2585072B) code:70399
+budget:15670651/16000000 (15.67/16.00MB) FITS
+final_binary_roundtrip val_loss:2.9743 val_bpb:1.1516
+temp_scaling optimal_T:0.90 eval_time:245ms
+final_sliding val_loss:2.9027 val_bpb:1.1239 (stride=16, T=0.90) eval_time:768782ms
diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/fineweb_8192_bpe.model b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/fineweb_8192_bpe.model
new file mode 100644
index 0000000000..6574784f5f
Binary files /dev/null and b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/fineweb_8192_bpe.model differ
diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/fineweb_8192_bpe.vocab b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/fineweb_8192_bpe.vocab
new file mode 100644
index 0000000000..6e194bf03c
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/fineweb_8192_bpe.vocab
@@ -0,0 +1,8192 @@
+ 0
+ 0
+ 0
+ 0
+<0x00> 0
+<0x01> 0
+<0x02> 0
+<0x03> 0
+<0x04> 0
+<0x05> 0
+<0x06> 0
+<0x07> 0
+<0x08> 0
+<0x09> 0
+<0x0A> 0
+<0x0B> 0
+<0x0C> 0
+<0x0D> 0
+<0x0E> 0
+<0x0F> 0
+<0x10> 0
+<0x11> 0
+<0x12> 0
+<0x13> 0
+<0x14> 0
+<0x15> 0
+<0x16> 0
+<0x17> 0
+<0x18> 0
+<0x19> 0
+<0x1A> 0
+<0x1B> 0
+<0x1C> 0
+<0x1D> 0
+<0x1E> 0
+<0x1F> 0
+<0x20> 0
+<0x21> 0
+<0x22> 0
+<0x23> 0
+<0x24> 0
+<0x25> 0
+<0x26> 0
+<0x27> 0
+<0x28> 0
+<0x29> 0
+<0x2A> 0
+<0x2B> 0
+<0x2C> 0
+<0x2D> 0
+<0x2E> 0
+<0x2F> 0
+<0x30> 0
+<0x31> 0
+<0x32> 0
+<0x33> 0
+<0x34> 0
+<0x35> 0
+<0x36> 0
+<0x37> 0
+<0x38> 0
+<0x39> 0
+<0x3A> 0
+<0x3B> 0
+<0x3C> 0
+<0x3D> 0
+<0x3E> 0
+<0x3F> 0
+<0x40> 0
+<0x41> 0
+<0x42> 0
+<0x43> 0
+<0x44> 0
+<0x45> 0
+<0x46> 0
+<0x47> 0
+<0x48> 0
+<0x49> 0
+<0x4A> 0
+<0x4B> 0
+<0x4C> 0
+<0x4D> 0
+<0x4E> 0
+<0x4F> 0
+<0x50> 0
+<0x51> 0
+<0x52> 0
+<0x53> 0
+<0x54> 0
+<0x55> 0
+<0x56> 0
+<0x57> 0
+<0x58> 0
+<0x59> 0
+<0x5A> 0
+<0x5B> 0
+<0x5C> 0
+<0x5D> 0
+<0x5E> 0
+<0x5F> 0
+<0x60> 0
+<0x61> 0
+<0x62> 0
+<0x63> 0
+<0x64> 0
+<0x65> 0
+<0x66> 0
+<0x67> 0
+<0x68> 0
+<0x69> 0
+<0x6A> 0
+<0x6B> 0
+<0x6C> 0
+<0x6D> 0
+<0x6E> 0
+<0x6F> 0
+<0x70> 0
+<0x71> 0
+<0x72> 0
+<0x73> 0
+<0x74> 0
+<0x75> 0
+<0x76> 0
+<0x77> 0
+<0x78> 0
+<0x79> 0
+<0x7A> 0
+<0x7B> 0
+<0x7C> 0
+<0x7D> 0
+<0x7E> 0
+<0x7F> 0
+<0x80> 0
+<0x81> 0
+<0x82> 0
+<0x83> 0
+<0x84> 0
+<0x85> 0
+<0x86> 0
+<0x87> 0
+<0x88> 0
+<0x89> 0
+<0x8A> 0
+<0x8B> 0
+<0x8C> 0
+<0x8D> 0
+<0x8E> 0
+<0x8F> 0
+<0x90> 0
+<0x91> 0
+<0x92> 0
+<0x93> 0
+<0x94> 0
+<0x95> 0
+<0x96> 0
+<0x97> 0
+<0x98> 0
+<0x99> 0
+<0x9A> 0
+<0x9B> 0
+<0x9C> 0
+<0x9D> 0
+<0x9E> 0
+<0x9F> 0
+<0xA0> 0
+<0xA1> 0
+<0xA2> 0
+<0xA3> 0
+<0xA4> 0
+<0xA5> 0
+<0xA6> 0
+<0xA7> 0
+<0xA8> 0
+<0xA9> 0
+<0xAA> 0
+<0xAB> 0
+<0xAC> 0
+<0xAD> 0
+<0xAE> 0
+<0xAF> 0
+<0xB0> 0
+<0xB1> 0
+<0xB2> 0
+<0xB3> 0
+<0xB4> 0
+<0xB5> 0
+<0xB6> 0
+<0xB7> 0
+<0xB8> 0
+<0xB9> 0
+<0xBA> 0
+<0xBB> 0
+<0xBC> 0
+<0xBD> 0
+<0xBE> 0
+<0xBF> 0
+<0xC0> 0
+<0xC1> 0
+<0xC2> 0
+<0xC3> 0
+<0xC4> 0
+<0xC5> 0
+<0xC6> 0
+<0xC7> 0
+<0xC8> 0
+<0xC9> 0
+<0xCA> 0
+<0xCB> 0
+<0xCC> 0
+<0xCD> 0
+<0xCE> 0
+<0xCF> 0
+<0xD0> 0
+<0xD1> 0
+<0xD2> 0
+<0xD3> 0
+<0xD4> 0
+<0xD5> 0
+<0xD6> 0
+<0xD7> 0
+<0xD8> 0
+<0xD9> 0
+<0xDA> 0
+<0xDB> 0
+<0xDC> 0
+<0xDD> 0
+<0xDE> 0
+<0xDF> 0
+<0xE0> 0
+<0xE1> 0
+<0xE2> 0
+<0xE3> 0
+<0xE4> 0
+<0xE5> 0
+<0xE6> 0
+<0xE7> 0
+<0xE8> 0
+<0xE9> 0
+<0xEA> 0
+<0xEB> 0
+<0xEC> 0
+<0xED> 0
+<0xEE> 0
+<0xEF> 0
+<0xF0> 0
+<0xF1> 0
+<0xF2> 0
+<0xF3> 0
+<0xF4> 0
+<0xF5> 0
+<0xF6> 0
+<0xF7> 0
+<0xF8> 0
+<0xF9> 0
+<0xFA> 0
+<0xFB> 0
+<0xFC> 0
+<0xFD> 0
+<0xFE> 0
+<0xFF> 0
+▁t -0
+▁a -1
+in -2
+he -3
+re -4
+on -5
+er -6
+▁the -7
+▁s -8
+▁w -9
+or -10
+at -11
+nd -12
+ou -13
+▁c -14
+it -15
+es -16
+▁f -17
+is -18
+ing -19
+en -20
+▁b -21
+▁p -22
+▁o -23
+an -24
+ed -25
+▁to -26
+al -27
+▁m -28
+ar -29
+▁and -30
+▁in -31
+▁of -32
+▁d -33
+le -34
+ic -35
+as -36
+▁h -37
+om -38
+ion -39
+▁th -40
+il -41
+▁T -42
+▁l -43
+ent -44
+ve -45
+▁I -46
+ro -47
+st -48
+▁y -49
+▁e -50
+▁re -51
+▁n -52
+▁S -53
+▁g -54
+et -55
+ct -56
+▁A -57
+▁C -58
+▁you -59
+ly -60
+ay -61
+id -62
+▁for -63
+▁on -64
+▁is -65
+ot -66
+▁be -67
+ow -68
+ol -69
+am -70
+ac -71
+ig -72
+us -73
+ad -74
+el -75
+▁M -76
+im -77
+ver -78
+ith -79
+ut -80
+▁st -81
+▁P -82
+ation -83
+▁with -84
+ur -85
+▁B -86
+▁that -87
+ir -88
+▁W -89
+ch -90
+▁he -91
+▁it -92
+▁The -93
+ce -94
+ill -95
+ers -96
+un -97
+▁al -98
+▁D -99
+ul -100
+▁an -101
+▁H -102
+▁F -103
+out -104
+ra -105
+ke -106
+▁pro -107
+▁wh -108
+▁as -109
+▁are -110
+se -111
+ter -112
+▁we -113
+▁ha -114
+▁R -115
+oo -116
+if -117
+ge -118
+our -119
+pp -120
+▁at -121
+ate -122
+ess -123
+▁com -124
+▁or -125
+▁con -126
+▁L -127
+her -128
+ore -129
+est -130
+▁fr -131
+ment -132
+igh -133
+▁- -134
+ab -135
+▁N -136
+▁se -137
+▁ne -138
+ld -139
+ort -140
+▁G -141
+▁E -142
+ri -143
+ist -144
+▁( -145
+▁your -146
+op -147
+▁O -148
+▁ex -149
+em -150
+ure -151
+ity -152
+▁r -153
+ant -154
+qu -155
+▁v -156
+▁was -157
+art -158
+ust -159
+▁have -160
+ive -161
+um -162
+▁this -163
+▁from -164
+pe -165
+▁de -166
+oc -167
+▁sh -168
+th -169
+ain -170
+up -171
+ies -172
+▁will -173
+▁by -174
+ight -175
+▁ch -176
+and -177
+os -178
+▁can -179
+ie -180
+nt -181
+all -182
+▁us -183
+ome -184
+▁not -185
+ard -186
+ud -187
+▁le -188
+res -189
+▁J -190
+ast -191
+.. -192
+ost -193
+▁pl -194
+ear -195
+▁ab -196
+ack -197
+▁su -198
+iv -199
+▁wor -200
+gh -201
+▁all -202
+rou -203
+ide -204
+ould -205
+▁j -206
+ell -207
+ial -208
+te -209
+ak -210
+ine -211
+od -212
+ag -213
+are -214
+▁has -215
+ice -216
+▁U -217
+▁Th -218
+▁do -219
+age -220
+▁k -221
+ook -222
+fe -223
+▁ad -224
+▁me -225
+ip -226
+▁In -227
+▁comp -228
+▁but -229
+▁up -230
+▁out -231
+ake -232
+per -233
+red -234
+▁whe -235
+ions -236
+ally -237
+pt -238
+ry -239
+og -240
+one -241
+▁more -242
+ail -243
+able -244
+ind -245
+▁my -246
+ite -247
+▁our -248
+ther -249
+▁en -250
+▁“ -251
+very -252
+▁Y -253
+▁sa -254
+▁so -255
+ich -256
+ime -257
+cc -258
+▁cl -259
+ong -260
+▁their -261
+▁K -262
+ated -263
+ood -264
+ame -265
+orm -266
+▁St -267
+▁they -268
+▁one -269
+▁te -270
+ber -271
+ace -272
+ike -273
+iz -274
+▁about -275
+so -276
+ous -277
+du -278
+ick -279
+ase -280
+ans -281
+▁" -282
+▁V -283
+pl -284
+▁cont -285
+act -286
+ia -287
+▁im -288
+▁work -289
+▁un -290
+▁who -291
+ree -292
+cl -293
+ire -294
+▁fe -295
+ign -296
+▁off -297
+▁his -298
+▁man -299
+ue -300
+ff -301
+ance -302
+▁go -303
+ll -304
+ach -305
+▁year -306
+▁new -307
+▁tr -308
+ays -309
+ne -310
+reat -311
+▁It -312
+ction -313
+ub -314
+ib -315
+ult -316
+▁app -317
+erv -318
+und -319
+▁We -320
+ap -321
+▁Ch -322
+ass -323
+▁qu -324
+ep -325
+▁res -326
+ary -327
+ark -328
+▁sp -329
+▁per -330
+ations -331
+ile -332
+ove -333
+form -334
+▁int -335
+▁get -336
+▁also -337
+▁time -338
+▁which -339
+ount -340
+ven -341
+▁like -342
+own -343
+▁other -344
+ents -345
+▁some -346
+ond -347
+ord -348
+▁any -349
+ings -350
+vel -351
+av -352
+▁been -353
+ical -354
+▁over -355
+▁part -356
+ress -357
+▁This -358
+▁dis -359
+ks -360
+▁He -361
+ors -362
+ence -363
+▁said -364
+▁sc -365
+▁rec -366
+▁ar -367
+ition -368
+▁them -369
+▁ag -370
+▁when -371
+▁pe -372
+ild -373
+port -374
+▁her -375
+ound -376
+ough -377
+▁kn -378
+ose -379
+ob -380
+irst -381
+low -382
+▁just -383
+mer -384
+int -385
+▁ro -386
+ov -387
+ck -388
+ish -389
+▁what -390
+oy -391
+▁pr -392
+ru -393
+▁spe -394
+▁pre -395
+▁there -396
+ens -397
+wn -398
+▁acc -399
+day -400
+▁if -401
+ren -402
+▁than -403
+▁would -404
+▁need -405
+▁Re -406
+▁had -407
+vers -408
+▁its -409
+▁were -410
+ink -411
+fter -412
+ning -413
+▁am -414
+ater -415
+... -416
+▁des -417
+old -418
+itt -419
+clud -420
+ade -421
+rough -422
+▁tw -423
+▁into -424
+lp -425
+ory -426
+use -427
+ople -428
+ool -429
+ang -430
+▁first -431
+▁how -432
+▁bec -433
+▁help -434
+lic -435
+hed -436
+ons -437
+▁add -438
+anc -439
+ft -440
+▁make -441
+amp -442
+gr -443
+▁bl -444
+▁look -445
+▁– -446
+▁Wh -447
+▁prov -448
+▁col -449
+▁includ -450
+▁people -451
+▁comm -452
+▁produ -453
+▁You -454
+▁Ne -455
+ual -456
+▁know -457
+ful -458
+▁she -459
+ian -460
+ments -461
+ates -462
+iew -463
+round -464
+▁em -465
+▁every -466
+▁back -467
+▁only -468
+▁serv -469
+tern -470
+les -471
+ious -472
+▁no -473
+▁may -474
+rent -475
+▁through -476
+▁bu -477
+ict -478
+▁most -479
+cts -480
+ating -481
+▁see -482
+▁want -483
+▁two -484
+▁ph -485
+com -486
+pport -487
+▁As -488
+xt -489
+we -490
+ities -491
+ices -492
+iss -493
+▁use -494
+▁well -495
+ont -496
+▁bet -497
+▁after -498
+▁If -499
+ise -500
+hing -501
+▁ind -502
+ause -503
+▁play -504
+▁Se -505
+ph -506
+▁und -507
+je -508
+▁& -509
+▁co -510
+ife -511
+▁| -512
+ock -513
+ily -514
+▁stud -515
+lect -516
+row -517
+▁act -518
+ting -519
+iness -520
+▁fl -521
+hen -522
+▁years -523
+▁Com -524
+▁Un -525
+urn -526
+ts -527
+▁$ -528
+enc -529
+aw -530
+▁these -531
+▁tra -532
+▁An -533
+fore -534
+▁cons -535
+▁under -536
+als -537
+cial -538
+ange -539
+▁exper -540
+bs -541
+aking -542
+▁ke -543
+oth -544
+▁now -545
+ures -546
+ational -547
+▁very -548
+▁Pro -549
+▁wee -550
+▁bus -551
+▁good -552
+▁gu -553
+ased -554
+vent -555
+▁And -556
+formation -557
+▁many -558
+▁sm -559
+get -560
+▁way -561
+any -562
+▁reg -563
+erson -564
+oint -565
+ific -566
+ward -567
+▁De -568
+ert -569
+ility -570
+▁start -571
+▁fin -572
+▁dif -573
+▁could -574
+rit -575
+lease -576
+▁great -577
+▁imp -578
+ork -579
+uch -580
+▁day -581
+fect -582
+▁rem -583
+▁Sh -584
+yst -585
+▁rel -586
+ience -587
+ible -588
+▁even -589
+▁For -590
+uring -591
+ty -592
+▁show -593
+▁high -594
+oss -595
+ics -596
+▁sec -597
+ull -598
+▁own -599
+nds -600
+velop -601
+▁inv -602
+▁where -603
+▁here -604
+▁don -605
+▁inc -606
+▁down -607
+). -608
+▁ent -609
+ident -610
+hes -611
+olog -612
+cess -613
+▁loc -614
+arch -615
+▁right -616
+ble -617
+▁then -618
+chool -619
+▁home -620
+▁should -621
+▁Al -622
+▁New -623
+elf -624
+alth -625
+The -626
+▁ass -627
+ied -628
+▁br -629
+its -630
+ited -631
+▁find -632
+ath -633
+air -634
+ular -635
+▁read -636
+▁too -637
+▁ac -638
+hip -639
+▁av -640
+▁set -641
+ix -642
+▁car -643
+▁fam -644
+ner -645
+▁information -646
+▁mon -647
+gan -648
+line -649
+▁best -650
+▁last -651
+ys -652
+▁min -653
+gram -654
+▁take -655
+io -656
+▁design -657
+▁Cl -658
+pect -659
+ract -660
+▁long -661
+ason -662
+▁did -663
+▁inst -664
+▁much -665
+omet -666
+▁che -667
+|| -668
+erm -669
+▁Be -670
+▁business -671
+ystem -672
+▁because -673
+▁before -674
+other -675
+ank -676
+▁dec -677
+ues -678
+▁But -679
+▁att -680
+▁ins -681
+▁Fr -682
+.” -683
+▁made -684
+▁team -685
+ative -686
+▁call -687
+▁Le -688
+▁him -689
+pr -690
+▁sur -691
+pen -692
+atch -693
+▁cre -694
+rib -695
+me -696
+▁think -697
+ject -698
+ollow -699
+az -700
+▁again -701
+▁world -702
+way -703
+ax -704
+ale -705
+ug -706
+▁Ad -707
+▁art -708
+▁mem -709
+▁does -710
+alk -711
+), -712
+▁vis -713
+arket -714
+▁being -715
+▁pres -716
+ave -717
+▁develop -718
+▁person -719
+oun -720
+▁requ -721
+arn -722
+ustom -723
+ower -724
+chn -725
+rest -726
+▁inte -727
+arm -728
+ient -729
+▁life -730
+▁those -731
+ener -732
+▁diffe -733
+▁such -734
+ins -735
+▁med -736
+ng -737
+ivers -738
+ince -739
+ouse -740
+▁support -741
+ving -742
+▁while -743
+ash -744
+irect -745
+▁Ar -746
+▁pol -747
+view -748
+land -749
+▁sk -750
+▁provid -751
+ss -752
+unity -753
+ier -754
+▁lead -755
+▁ra -756
+▁Te -757
+▁each -758
+▁around -759
+▁book -760
+der -761
+▁love -762
+▁free -763
+▁used -764
+ced -765
+akes -766
+▁care -767
+▁end -768
+read -769
+▁mod -770
+ailable -771
+▁ser -772
+▁comple -773
+▁post -774
+▁run -775
+▁gr -776
+ather -777
+▁disc -778
+▁sim -779
+ric -780
+▁program -781
+ality -782
+▁ret -783
+▁pub -784
+ces -785
+ional -786
+ages -787
+ually -788
+▁bo -789
+▁cur -790
+▁ed -791
+ines -792
+imes -793
+ton -794
+ives -795
+▁All -796
+▁det -797
+▁really -798
+roup -799
+ple -800
+oad -801
+ars -802
+▁eas -803
+ets -804
+▁On -805
+▁child -806
+▁system -807
+▁There -808
+▁So -809
+▁num -810
+iel -811
+au -812
+ize -813
+▁follow -814
+▁trans -815
+." -816
+led -817
+ene -818
+▁count -819
+▁going -820
+▁found -821
+,” -822
+▁top -823
+ah -824
+▁form -825
+▁char -826
+▁somet -827
+iet -828
+▁three -829
+ittle -830
+▁inter -831
+▁list -832
+▁cour -833
+ames -834
+man -835
+▁still -836
+▁Bl -837
+▁fun -838
+▁How -839
+▁month -840
+▁available -841
+▁place -842
+▁del -843
+ature -844
+▁Pl -845
+▁custom -846
+ute -847
+ness -848
+▁though -849
+▁They -850
+▁feel -851
+ways -852
+▁prof -853
+▁cle -854
+▁both -855
+▁To -856
+▁few -857
+▁sub -858
+cept -859
+▁aut -860
+orn -861
+meric -862
+▁str -863
+▁happ -864
+▁week -865
+▁sign -866
+▁open -867
+▁hand -868
+ved -869
+▁gl -870
+▁pur -871
+▁say -872
+uc -873
+▁report -874
+▁health -875
+▁game -876
+▁adv -877
+att -878
+▁rep -879
+▁market -880
+ital -881
+▁different -882
+oot -883
+ired -884
+orth -885
+▁frie -886
+bers -887
+▁keep -888
+▁same -889
+ering -890
+tt -891
+▁lot -892
+▁Ex -893
+▁She -894
+▁point -895
+▁Col -896
+ween -897
+▁techn -898
+▁family -899
+▁ev -900
+▁i -901
+ology -902
+▁exp -903
+iqu -904
+▁ext -905
+▁school -906
+ining -907
+▁little -908
+▁using -909
+," -910
+▁process -911
+ished -912
+atur -913
+▁company -914
+▁lar -915
+ata -916
+▁including -917
+▁Sc -918
+ross -919
+iving -920
+oh -921
+ants -922
+▁next -923
+▁plan -924
+▁win -925
+▁Americ -926
+ott -927
+▁fil -928
+▁real -929
+▁during -930
+▁Tr -931
+▁between -932
+thing -933
+ized -934
+▁water -935
+ger -936
+▁sol -937
+▁Ph -938
+▁import -939
+▁Q -940
+ody -941
+cent -942
+▁state -943
+▁What -944
+gg -945
+ield -946
+▁things -947
+ik -948
+ves -949
+▁met -950
+arly -951
+els -952
+▁come -953
+aut -954
+ists -955
+be -956
+▁allow -957
+▁big -958
+less -959
+aint -960
+reen -961
+▁mus -962
+▁put -963
+▁contin -964
+uss -965
+▁Or -966
+▁rece -967
+▁experience -968
+ware -969
+▁service -970
+▁opt -971
+▁build -972
+cer -973
+self -974
+▁small -975
+▁dri -976
+▁days -977
+▁appro -978
+ined -979
+iversity -980
+ex -981
+▁organ -982
+▁full -983
+ling -984
+▁since -985
+▁cent -986
+▁always -987
+▁rest -988
+▁try -989
+▁phot -990
+▁better -991
+▁cr -992
+▁sure -993
+▁When -994
+ution -995
+▁pat -996
+▁online -997
+▁pri -998
+▁quest -999
+▁ref -1000
+▁Ind -1001
+▁second -1002
+▁pass -1003
+▁something -1004
+▁var -1005
+illion -1006
+▁bel -1007
+▁interest -1008
+rand -1009
+ever -1010
+over -1011
+▁iss -1012
+▁partic -1013
+▁class -1014
+▁poss -1015
+▁gener -1016
+▁def -1017
+▁group -1018
+▁tri -1019
+▁mov -1020
+ffect -1021
+▁perform -1022
+▁hard -1023
+▁direct -1024
+▁Z -1025
+▁pay -1026
+pping -1027
+ours -1028
+▁With -1029
+▁result -1030
+▁bro -1031
+▁today -1032
+▁head -1033
+▁special -1034
+gy -1035
+▁— -1036
+▁sl -1037
+ps -1038
+▁ty -1039
+▁ve -1040
+ploy -1041
+ER -1042
+▁At -1043
+joy -1044
+▁stand -1045
+ms -1046
+work -1047
+ared -1048
+outh -1049
+▁another -1050
+▁ide -1051
+▁give -1052
+br -1053
+▁ann -1054
+▁Con -1055
+▁wom -1056
+▁provide -1057
+uck -1058
+▁got -1059
+▁cor -1060
+ccess -1061
+ior -1062
+▁Chr -1063
+ote -1064
+oor -1065
+▁Res -1066
+oney -1067
+▁meet -1068
+▁students -1069
+▁resp -1070
+istr -1071
+▁current -1072
+ense -1073
+ately -1074
+▁wr -1075
+▁without -1076
+ision -1077
+▁conf -1078
+▁Our -1079
+ients -1080
+rence -1081
+ok -1082
+ium -1083
+▁old -1084
+▁area -1085
+ley -1086
+ope -1087
+ards -1088
+▁number -1089
+▁four -1090
+▁bre -1091
+▁cost -1092
+aj -1093
+ems -1094
+ered -1095
+▁able -1096
+ically -1097
+▁soc -1098
+▁val -1099
+▁Sp -1100
+▁invest -1101
+▁must -1102
+con -1103
+▁access -1104
+▁services -1105
+▁unt -1106
+raph -1107
+ats -1108
+ird -1109
+▁ask -1110
+▁working -1111
+▁never -1112
+▁US -1113
+▁Cent -1114
+iver -1115
+▁No -1116
+stand -1117
+ww -1118
+▁webs -1119
+▁proble -1120
+▁public -1121
+▁vide -1122
+ission -1123
+▁visit -1124
+▁important -1125
+ann -1126
+▁light -1127
+pped -1128
+▁fact -1129
+let -1130
+▁sal -1131
+▁level -1132
+▁order -1133
+▁fac -1134
+ged -1135
+▁Comm -1136
+▁My -1137
+▁test -1138
+▁might -1139
+▁exc -1140
+ral -1141
+▁rese -1142
+▁product -1143
+▁local -1144
+▁night -1145
+▁season -1146
+inal -1147
+▁el -1148
+▁incre -1149
+ember -1150
+▁site -1151
+rol -1152
+▁That -1153
+▁sing -1154
+ruct -1155
+ample -1156
+▁expl -1157
+▁Mar -1158
+▁spec -1159
+▁grow -1160
+▁let -1161
+▁ca -1162
+▁proper -1163
+▁less -1164
+ording -1165
+▁enjoy -1166
+▁ob -1167
+▁past -1168
+▁event -1169
+▁products -1170
+▁Man -1171
+▁' -1172
+▁inf -1173
+▁May -1174
+▁looking -1175
+▁food -1176
+here -1177
+lection -1178
+▁within -1179
+▁profess -1180
+▁Fe -1181
+▁Is -1182
+▁data -1183
+▁making -1184
+▁pop -1185
+ertain -1186
+▁until -1187
+ases -1188
+ories -1189
+ffic -1190
+enn -1191
+ency -1192
+▁children -1193
+ently -1194
+▁University -1195
+We -1196
+gin -1197
+sh -1198
+▁job -1199
+▁offer -1200
+▁law -1201
+ery -1202
+ains -1203
+ney -1204
+urs -1205
+▁pos -1206
+eng -1207
+utes -1208
+▁power -1209
+▁view -1210
+▁turn -1211
+▁eng -1212
+▁email -1213
+ential -1214
+tend -1215
+▁oper -1216
+▁sit -1217
+▁check -1218
+▁against -1219
+ieve -1220
+▁est -1221
+▁Pr -1222
+ream -1223
+ised -1224
+▁Br -1225
+ina -1226
+▁prote -1227
+ids -1228
+ode -1229
+▁room -1230
+▁contact -1231
+IN -1232
+▁community -1233
+med -1234
+to -1235
+▁addition -1236
+▁prom -1237
+▁says -1238
+▁intern -1239
+load -1240
+▁toget -1241
+▁together -1242
+▁Fl -1243
+▁away -1244
+ivid -1245
+▁impro -1246
+▁quality -1247
+▁leg -1248
+ator -1249
+▁dist -1250
+▁creat -1251
+ills -1252
+irl -1253
+hor -1254
+▁indust -1255
+▁complete -1256
+▁news -1257
+aring -1258
+iron -1259
+ique -1260
+ret -1261
+▁App -1262
+icle -1263
+iday -1264
+agement -1265
+ified -1266
+oci -1267
+▁supp -1268
+osed -1269
+ability -1270
+▁project -1271
+▁website -1272
+▁Car -1273
+iety -1274
+ane -1275
+por -1276
+!! -1277
+▁change -1278
+co -1279
+▁success -1280
+▁dep -1281
+bo -1282
+▁learn -1283
+▁include -1284
+▁Co -1285
+pend -1286
+▁fav -1287
+▁chang -1288
+ym -1289
+▁Ste -1290
+▁detail -1291
+ism -1292
+▁offic -1293
+▁Can -1294
+▁members -1295
+▁dr -1296
+arent -1297
+son -1298
+▁buy -1299
+▁easy -1300
+▁please -1301
+rap -1302
+▁Me -1303
+aster -1304
+▁applic -1305
+ising -1306
+ury -1307
+▁name -1308
+▁pract -1309
+▁times -1310
+atures -1311
+▁along -1312
+▁equ -1313
+▁present -1314
+▁One -1315
+▁large -1316
+▁money -1317
+▁beaut -1318
+atter -1319
+augh -1320
+▁Am -1321
+aterial -1322
+the -1323
+▁Cont -1324
+iting -1325
+▁activ -1326
+vern -1327
+RE -1328
+▁employ -1329
+▁la -1330
+aff -1331
+une -1332
+▁house -1333
+ready -1334
+Th -1335
+▁course -1336
+▁expect -1337
+▁. -1338
+▁needs -1339
+ored -1340
+▁air -1341
+▁left -1342
+▁Christ -1343
+▁thing -1344
+itions -1345
+ift -1346
+sc -1347
+ably -1348
+▁cap -1349
+ider -1350
+ived -1351
+lish -1352
+▁music -1353
+▁dra -1354
+min -1355
+▁why -1356
+▁En -1357
+yle -1358
+ohn -1359
+ump -1360
+ify -1361
+▁hist -1362
+ec -1363
+ron -1364
+by -1365
+▁bas -1366
+ern -1367
+▁hum -1368
+▁video -1369
+rie -1370
+▁sw -1371
+▁account -1372
+ON -1373
+ffe -1374
+alf -1375
+ocus -1376
+veral -1377
+▁below -1378
+▁soft -1379
+▁hot -1380
+▁These -1381
+▁short -1382
+ries -1383
+▁Eng -1384
+▁line -1385
+▁live -1386
+pecial -1387
+▁opport -1388
+enef -1389
+▁create -1390
+book -1391
+▁cond -1392
+▁beh -1393
+▁... -1394
+▁perfect -1395
+uly -1396
+▁ce -1397
+▁page -1398
+▁word -1399
+▁/ -1400
+▁writ -1401
+AT -1402
+▁dem -1403
+ots -1404
+▁Med -1405
+▁mar -1406
+▁Please -1407
+fort -1408
+side -1409
+ows -1410
+mber -1411
+▁govern -1412
+▁pa -1413
+artment -1414
+▁already -1415
+▁Che -1416
+▁kind -1417
+▁After -1418
+▁enough -1419
+▁ever -1420
+▁research -1421
+ured -1422
+▁makes -1423
+▁following -1424
+▁million -1425
+▁Do -1426
+▁review -1427
+▁getting -1428
+▁dev -1429
+ten -1430
+itive -1431
+ush -1432
+▁friends -1433
+▁cut -1434
+▁conne -1435
+▁trad -1436
+ee -1437
+., -1438
+▁record -1439
+room -1440
+▁treat -1441
+▁side -1442
+▁const -1443
+vious -1444
+▁Ass -1445
+▁case -1446
+▁having -1447
+ajor -1448
+▁tell -1449
+▁Count -1450
+▁personal -1451
+▁move -1452
+▁based -1453
+▁story -1454
+viron -1455
+ention -1456
+▁John -1457
+rop -1458
+▁Your -1459
+▁Serv -1460
+▁won -1461
+unch -1462
+ips -1463
+▁Des -1464
+▁minutes -1465
+uper -1466
+▁become -1467
+uture -1468
+▁possible -1469
+osp -1470
+oice -1471
+iam -1472
+▁talk -1473
+▁city -1474
+ights -1475
+▁across -1476
+▁vers -1477
+▁share -1478
+ization -1479
+▁done -1480
+▁bit -1481
+▁camp -1482
+▁pack -1483
+▁didn -1484
+▁comes -1485
+▁men -1486
+▁understand -1487
+ead -1488
+▁several -1489
+▁-- -1490
+yn -1491
+▁: -1492
+▁country -1493
+▁Tw -1494
+▁hours -1495
+▁effect -1496
+▁cou -1497
+▁purch -1498
+iven -1499
+▁benef -1500
+ES -1501
+▁mil -1502
+▁women -1503
+uff -1504
+▁net -1505
+ividual -1506
+app -1507
+aces -1508
+▁percent -1509
+▁Comp -1510
+▁educ -1511
+wards -1512
+▁focus -1513
+▁often -1514
+▁material -1515
+ball -1516
+▁social -1517
+aim -1518
+▁elect -1519
+▁Wor -1520
+idd -1521
+ances -1522
+ination -1523
+uro -1524
+ides -1525
+ober -1526
+▁quick -1527
+▁Not -1528
+▁development -1529
+▁es -1530
+▁bring -1531
+▁return -1532
+orts -1533
+▁American -1534
+ister -1535
+ienc -1536
+▁doing -1537
+▁Bro -1538
+▁School -1539
+ript -1540
+▁pie -1541
+▁X -1542
+▁far -1543
+▁hold -1544
+arl -1545
+▁mult -1546
+ted -1547
+▁body -1548
+arr -1549
+err -1550
+▁Gr -1551
+of -1552
+mend -1553
+▁pot -1554
+ference -1555
+iful -1556
+ones -1557
+AN -1558
+▁wa -1559
+ners -1560
+▁fund -1561
+▁took -1562
+ograph -1563
+▁Here -1564
+▁tre -1565
+ource -1566
+lished -1567
+▁blog -1568
+oose -1569
+itc -1570
+AR -1571
+▁State -1572
+▁doesn -1573
+reet -1574
+conom -1575
+▁jo -1576
+vironment -1577
+▁deal -1578
+lement -1579
+▁others -1580
+▁City -1581
+▁Rep -1582
+▁came -1583
+▁called -1584
+▁started -1585
+▁sum -1586
+▁rele -1587
+org -1588
+▁Inst -1589
+nder -1590
+▁least -1591
+▁months -1592
+▁Intern -1593
+▁space -1594
+acy -1595
+▁Gu -1596
+▁mom -1597
+▁future -1598
+▁orig -1599
+▁compet -1600
+▁individual -1601
+oon -1602
+lege -1603
+▁went -1604
+▁occ -1605
+▁yet -1606
+▁young -1607
+rodu -1608
+▁clean -1609
+▁non -1610
+▁mind -1611
+▁told -1612
+ai -1613
+▁five -1614
+▁early -1615
+▁series -1616
+▁control -1617
+af -1618
+utions -1619
+▁term -1620
+▁major -1621
+oll -1622
+hers -1623
+ille -1624
+ape -1625
+▁games -1626
+ained -1627
+▁comb -1628
+▁means -1629
+▁pict -1630
+▁industry -1631
+▁chall -1632
+yl -1633
+▁tool -1634
+anks -1635
+▁Min -1636
+▁ens -1637
+▁lim -1638
+▁cover -1639
+ctor -1640
+▁fore -1641
+▁ago -1642
+AS -1643
+▁low -1644
+sw -1645
+▁key -1646
+fer -1647
+ama -1648
+▁x -1649
+▁heart -1650
+▁features -1651
+▁Ed -1652
+ilt -1653
+▁tem -1654
+rew -1655
+▁price -1656
+unic -1657
+▁store -1658
+fact -1659
+jects -1660
+▁offers -1661
+▁Ab -1662
+itor -1663
+back -1664
+▁once -1665
+▁specific -1666
+come -1667
+▁range -1668
+▁thought -1669
+ges -1670
+urity -1671
+ither -1672
+ateg -1673
+▁Bo -1674
+▁Jan -1675
+sel -1676
+▁pick -1677
+illed -1678
+▁Now -1679
+eral -1680
+▁God -1681
+▁Dr -1682
+▁favor -1683
+▁appear -1684
+year -1685
+▁More -1686
+▁York -1687
+ilities -1688
+▁Ke -1689
+▁Im -1690
+▁hope -1691
+▁redu -1692
+▁discuss -1693
+OR -1694
+ibr -1695
+▁happen -1696
+▁require -1697
+yr -1698
+▁Pe -1699
+▁However -1700
+atic -1701
+It -1702
+▁mean -1703
+▁single -1704
+nes -1705
+▁step -1706
+▁close -1707
+▁upd -1708
+▁land -1709
+▁break -1710
+▁ey -1711
+▁main -1712
+▁invol -1713
+most -1714
+anies -1715
+▁Pres -1716
+ourn -1717
+▁stay -1718
+▁government -1719
+▁Em -1720
+isk -1721
+isc -1722
+// -1723
+▁Sm -1724
+ony -1725
+▁field -1726
+de -1727
+▁priv -1728
+▁United -1729
+▁beautiful -1730
+resh -1731
+cle -1732
+▁Per -1733
+▁friend -1734
+▁everything -1735
+▁Qu -1736
+▁walk -1737
+ched -1738
+▁questions -1739
+▁added -1740
+▁hig -1741
+▁Cal -1742
+▁tax -1743
+aken -1744
+▁customers -1745
+▁strong -1746
+now -1747
+▁taking -1748
+▁install -1749
+for -1750
+:// -1751
+aps -1752
+ging -1753
+▁Pol -1754
+▁charact -1755
+▁wond -1756
+▁South -1757
+▁begin -1758
+▁study -1759
+ources -1760
+▁North -1761
+▁Just -1762
+▁announ -1763
+ief -1764
+ensive -1765
+▁miss -1766
+▁recom -1767
+▁travel -1768
+▁certain -1769
+▁Park -1770
+▁address -1771
+▁problem -1772
+▁By -1773
+▁County -1774
+▁actually -1775
+play -1776
+▁staff -1777
+▁tot -1778
+▁half -1779
+▁mess -1780
+▁z -1781
+aur -1782
+ew -1783
+inc -1784
+ians -1785
+▁search -1786
+▁technology -1787
+▁girl -1788
+▁media -1789
+urther -1790
+time -1791
+▁watch -1792
+▁typ -1793
+▁known -1794
+▁official -1795
+▁manag -1796
+▁National -1797
+▁six -1798
+irm -1799
+▁Pre -1800
+▁wind -1801
+▁enc -1802
+gle -1803
+atural -1804
+ural -1805
+▁front -1806
+ublic -1807
+▁Add -1808
+▁sound -1809
+▁improve -1810
+▁Post -1811
+wh -1812
+▁dig -1813
+irt -1814
+▁lat -1815
+▁content -1816
+▁Su -1817
+▁Stud -1818
+▁anal -1819
+▁track -1820
+itted -1821
+▁Mc -1822
+▁face -1823
+▁training -1824
+▁link -1825
+▁click -1826
+icy -1827
+▁ste -1828
+▁web -1829
+▁someone -1830
+ison -1831
+▁Oct -1832
+arning -1833
+▁works -1834
+▁author -1835
+▁later -1836
+▁building -1837
+not -1838
+lebr -1839
+▁host -1840
+ocu -1841
+▁Gl -1842
+▁environment -1843
+abor -1844
+cted -1845
+▁Center -1846
+▁mor -1847
+▁log -1848
+▁unique -1849
+▁everyone -1850
+▁Reg -1851
+raft -1852
+▁port -1853
+▁provides -1854
+IS -1855
+gest -1856
+▁ener -1857
+▁fall -1858
+▁cred -1859
+▁seen -1860
+▁Dep -1861
+▁film -1862
+ask -1863
+▁Day -1864
+▁prep -1865
+▁oil -1866
+▁particular -1867
+▁professional -1868
+▁aud -1869
+fully -1870
+▁Aug -1871
+▁Euro -1872
+ests -1873
+▁particip -1874
+lex -1875
+ided -1876
+unities -1877
+▁bar -1878
+ibility -1879
+▁results -1880
+▁ident -1881
+▁recommend -1882
+roll -1883
+▁press -1884
+ED -1885
+▁card -1886
+▁While -1887
+▁Will -1888
+▁whole -1889
+▁Don -1890
+aturday -1891
+▁World -1892
+rain -1893
+▁companies -1894
+ino -1895
+▁Ge -1896
+▁High -1897
+urch -1898
+▁Friday -1899
+▁office -1900
+IT -1901
+pper -1902
+▁Bar -1903
+▁March -1904
+▁color -1905
+▁events -1906
+▁anything -1907
+▁issues -1908
+EN -1909
+ancial -1910
+▁mot -1911
+▁eff -1912
+▁prob -1913
+▁mag -1914
+▁areas -1915
+▁pret -1916
+resent -1917
+▁vol -1918
+▁Some -1919
+▁comput -1920
+▁respons -1921
+ops -1922
+▁points -1923
+▁Acc -1924
+▁performance -1925
+▁near -1926
+▁pain -1927
+ster -1928
+obile -1929
+▁red -1930
+▁print -1931
+▁cook -1932
+▁Apr -1933
+itch -1934
+umb -1935
+▁given -1936
+▁history -1937
+▁econom -1938
+pecially -1939
+crib -1940
+obal -1941
+.... -1942
+▁feature -1943
+go -1944
+ili -1945
+ands -1946
+▁sell -1947
+▁designed -1948
+▁above -1949
+ches -1950
+▁maint -1951
+▁skin -1952
+▁text -1953
+▁aff -1954
+▁simple -1955
+eth -1956
+▁assist -1957
+IC -1958
+my -1959
+ued -1960
+▁age -1961
+icult -1962
+▁reason -1963
+inks -1964
+In -1965
+▁size -1966
+▁question -1967
+▁dou -1968
+imate -1969
+▁according -1970
+▁repl -1971
+iod -1972
+ply -1973
+▁Sec -1974
+nding -1975
+▁black -1976
+▁Aust -1977
+head -1978
+▁htt -1979
+edd -1980
+▁pretty -1981
+▁foot -1982
+▁believe -1983
+▁Saturday -1984
+oved -1985
+ables -1986
+▁due -1987
+▁Part -1988
+▁among -1989
+▁select -1990
+AL -1991
+itter -1992
+▁Sund -1993
+▁fire -1994
+cript -1995
+▁phys -1996
+omes -1997
+ental -1998
+ledge -1999
+▁idea -2000
+ety -2001
+▁latest -2002
+▁details -2003
+▁ant -2004
+▁popular -2005
+ole -2006
+▁third -2007
+▁et -2008
+ators -2009
+▁Mr -2010
+pro -2011
+val -2012
+▁management -2013
+aining -2014
+itional -2015
+▁includes -2016
+ruction -2017
+asing -2018
+▁July -2019
+▁energy -2020
+▁items -2021
+ze -2022
+▁weeks -2023
+ouch -2024
+onday -2025
+▁sent -2026
+▁Feb -2027
+▁living -2028
+ites -2029
+▁cult -2030
+▁receive -2031
+▁fre -2032
+▁continue -2033
+▁bad -2034
+▁June -2035
+▁relations -2036
+▁Europe -2037
+vert -2038
+astic -2039
+idence -2040
+▁human -2041
+▁parent -2042
+ulation -2043
+▁Val -2044
+▁His -2045
+▁claim -2046
+aily -2047
+▁Sept -2048
+ufact -2049
+ctions -2050
+elt -2051
+▁Dav -2052
+▁sex -2053
+▁prop -2054
+▁soon -2055
+ung -2056
+▁property -2057
+▁hon -2058
+nov -2059
+▁currently -2060
+▁amount -2061
+▁entire -2062
+new -2063
+▁West -2064
+uation -2065
+▁coming -2066
+ese -2067
+though -2068
+ana -2069
+ogn -2070
+▁Off -2071
+▁kids -2072
+▁TH -2073
+▁Tra -2074
+▁From -2075
+itting -2076
+▁phone -2077
+This -2078
+cast -2079
+▁final -2080
+▁consum -2081
+▁ess -2082
+▁happy -2083
+▁taken -2084
+▁celebr -2085
+▁docu -2086
+▁member -2087
+icro -2088
+.) -2089
+▁answ -2090
+▁meas -2091
+AC -2092
+▁wanted -2093
+▁type -2094
+▁software -2095
+selves -2096
+▁experienc -2097
+▁forward -2098
+▁diff -2099
+eds -2100
+▁whether -2101
+▁Us -2102
+▁wide -2103
+▁Read -2104
+▁either -2105
+▁Bu -2106
+ires -2107
+▁El -2108
+▁value -2109
+▁concer -2110
+▁deb -2111
+▁further -2112
+ux -2113
+ilar -2114
+ival -2115
+▁isn -2116
+▁coll -2117
+used -2118
+ams -2119
+aced -2120
+▁par -2121
+▁almost -2122
+▁required -2123
+▁crit -2124
+▁held -2125
+▁white -2126
+arter -2127
+▁date -2128
+▁comfort -2129
+▁quite -2130
+▁trying -2131
+▁provided -2132
+▁summer -2133
+▁Sw -2134
+▁fit -2135
+▁Pa -2136
+▁sugg -2137
+▁needed -2138
+▁favorite -2139
+▁tit -2140
+St -2141
+ees -2142
+▁Sunday -2143
+▁opportunity -2144
+▁Jo -2145
+▁ach -2146
+aching -2147
+uary -2148
+ek -2149
+▁Cor -2150
+▁via -2151
+▁extra -2152
+▁players -2153
+▁April -2154
+▁books -2155
+▁Monday -2156
+▁network -2157
+▁cop -2158
+amer -2159
+ler -2160
+▁example -2161
+▁box -2162
+▁users -2163
+▁, -2164
+itten -2165
+▁seem -2166
+▁period -2167
+▁various -2168
+▁Health -2169
+▁options -2170
+where -2171
+▁running -2172
+gress -2173
+▁style -2174
+▁especially -2175
+▁consider -2176
+▁yourself -2177
+▁Art -2178
+▁dam -2179
+▁safe -2180
+▁previous -2181
+▁swe -2182
+▁ways -2183
+▁version -2184
+▁created -2185
+▁sle -2186
+▁Mon -2187
+▁recently -2188
+▁potential -2189
+OU -2190
+▁issue -2191
+▁common -2192
+ises -2193
+▁di -2194
+▁Inc -2195
+▁stri -2196
+▁ready -2197
+▁attend -2198
+▁morning -2199
+▁regular -2200
+▁insp -2201
+▁else -2202
+▁road -2203
+▁nice -2204
+▁throughout -2205
+▁probably -2206
+▁ensure -2207
+-- -2208
+▁veh -2209
+▁received -2210
+earch -2211
+▁ball -2212
+▁Associ -2213
+▁President -2214
+▁clear -2215
+▁download -2216
+par -2217
+icles -2218
+▁engine -2219
+▁sho -2220
+erc -2221
+▁song -2222
+azing -2223
+▁lo -2224
+▁brand -2225
+▁relationship -2226
+▁takes -2227
+▁reading -2228
+mit -2229
+▁natural -2230
+▁Aut -2231
+▁States -2232
+ades -2233
+amed -2234
+▁park -2235
+▁House -2236
+ively -2237
+▁shows -2238
+▁asked -2239
+▁medical -2240
+istration -2241
+ague -2242
+▁inj -2243
+▁hit -2244
+▁choose -2245
+▁collect -2246
+▁Direct -2247
+▁Mich -2248
+▁original -2249
+▁cool -2250
+▁spr -2251
+▁couple -2252
+angu -2253
+reme -2254
+ipping -2255
+▁represent -2256
+▁bott -2257
+▁init -2258
+▁release -2259
+▁goal -2260
+▁behind -2261
+ny -2262
+apt -2263
+oid -2264
+▁Face -2265
+▁wonder -2266
+▁Soc -2267
+▁recent -2268
+▁sales -2269
+eter -2270
+▁clients -2271
+▁financial -2272
+aging -2273
+overed -2274
+▁accom -2275
+▁fresh -2276
+▁fast -2277
+▁super -2278
+▁leave -2279
+▁problems -2280
+▁anyone -2281
+▁role -2282
+face -2283
+▁Get -2284
+gs -2285
+hib -2286
+▁Ser -2287
+▁career -2288
+uge -2289
+▁Fin -2290
+bor -2291
+▁Black -2292
+ume -2293
+▁cup -2294
+ried -2295
+ville -2296
+▁model -2297
+▁article -2298
+oura -2299
+▁ful -2300
+uesday -2301
+▁meth -2302
+arth -2303
+▁ground -2304
+▁programs -2305
+▁Up -2306
+▁hol -2307
+▁fail -2308
+na -2309
+▁sun -2310
+aving -2311
+▁weeke -2312
+▁accept -2313
+▁flow -2314
+ada -2315
+ursday -2316
+▁base -2317
+medi -2318
+▁customer -2319
+▁difficult -2320
+OT -2321
+atform -2322
+▁writing -2323
+anced -2324
+urance -2325
+▁looks -2326
+▁PM -2327
+▁tour -2328
+▁polit -2329
+▁likely -2330
+ox -2331
+hel -2332
+oogle -2333
+▁paper -2334
+▁ap -2335
+▁abs -2336
+▁simply -2337
+cing -2338
+name -2339
+verage -2340
+▁inside -2341
+▁manufact -2342
+▁TV -2343
+clus -2344
+▁etc -2345
+▁mix -2346
+▁total -2347
+▁included -2348
+▁po -2349
+idge -2350
+ming -2351
+▁Int -2352
+▁risk -2353
+▁Wed -2354
+adem -2355
+aker -2356
+▁increase -2357
+▁party -2358
+▁changes -2359
+▁ele -2360
+ashing -2361
+▁board -2362
+▁education -2363
+oud -2364
+▁Her -2365
+▁October -2366
+▁action -2367
+▁former -2368
+▁meeting -2369
+Wh -2370
+▁however -2371
+▁News -2372
+▁outside -2373
+ification -2374
+uit -2375
+iple -2376
+▁match -2377
+▁Ac -2378
+▁America -2379
+▁Act -2380
+▁nothing -2381
+▁security -2382
+▁self -2383
+ground -2384
+▁contrib -2385
+▁stop -2386
+ester -2387
+▁town -2388
+▁August -2389
+▁matter -2390
+▁position -2391
+▁Af -2392
+▁ple -2393
+▁bed -2394
+▁late -2395
+istrict -2396
+▁Ob -2397
+▁systems -2398
+▁Every -2399
+icated -2400
+adu -2401
+ules -2402
+▁Bus -2403
+▁words -2404
+▁playing -2405
+▁cir -2406
+▁pan -2407
+ST -2408
+▁UK -2409
+wood -2410
+▁sat -2411
+▁impact -2412
+▁anim -2413
+▁mark -2414
+▁private -2415
+▁application -2416
+▁police -2417
+▁knowledge -2418
+▁exist -2419
+▁photos -2420
+▁method -2421
+▁longer -2422
+▁coun -2423
+▁worked -2424
+iddle -2425
+▁national -2426
+▁projects -2427
+ederal -2428
+▁ord -2429
+▁Are -2430
+▁necess -2431
+ude -2432
+▁table -2433
+▁stra -2434
+off -2435
+▁Ag -2436
+empt -2437
+elcome -2438
+▁September -2439
+ecut -2440
+▁activities -2441
+▁worth -2442
+▁recogn -2443
+▁production -2444
+str -2445
+nesday -2446
+▁Department -2447
+based -2448
+aby -2449
+iff -2450
+▁comment -2451
+▁compl -2452
+▁skills -2453
+▁true -2454
+▁general -2455
+▁Austral -2456
+▁January -2457
+iol -2458
+▁round -2459
+▁lives -2460
+▁learning -2461
+▁Tuesday -2462
+▁Thursday -2463
+ID -2464
+che -2465
+▁Then -2466
+▁introdu -2467
+ky -2468
+arden -2469
+▁signific -2470
+ING -2471
+oom -2472
+▁Sal -2473
+▁ill -2474
+▁student -2475
+▁Pat -2476
+▁lay -2477
+▁hair -2478
+▁Free -2479
+▁Nove -2480
+▁computer -2481
+▁squ -2482
+▁purchase -2483
+▁tal -2484
+ham -2485
+▁Also -2486
+ession -2487
+ett -2488
+▁Mus -2489
+▁death -2490
+▁defin -2491
+▁seems -2492
+▁Of -2493
+ci -2494
+▁hands -2495
+izing -2496
+▁communic -2497
+mon -2498
+▁rad -2499
+▁choice -2500
+▁screen -2501
+AM -2502
+▁draw -2503
+▁concern -2504
+▁leading -2505
+▁additional -2506
+▁First -2507
+▁rights -2508
+attle -2509
+▁cell -2510
+▁credit -2511
+▁located -2512
+▁variety -2513
+▁leaders -2514
+▁Facebook -2515
+▁stat -2516
+▁tick -2517
+▁drive -2518
+▁movie -2519
+▁San -2520
+arget -2521
+oring -2522
+▁file -2523
+▁fig -2524
+ipment -2525
+▁hy -2526
+▁bud -2527
+▁image -2528
+▁determ -2529
+▁amazing -2530
+aign -2531
+▁Sim -2532
+▁suggest -2533
+mercial -2534
+▁chance -2535
+▁Red -2536
+▁associ -2537
+▁rather -2538
+▁practice -2539
+▁built -2540
+▁plans -2541
+▁function -2542
+oph -2543
+▁Har -2544
+▁providing -2545
+iter -2546
+▁cal -2547
+ached -2548
+airs -2549
+light -2550
+ought -2551
+urg -2552
+pm -2553
+▁War -2554
+▁vict -2555
+▁court -2556
+▁aw -2557
+▁saf -2558
+▁cand -2559
+example -2560
+▁Out -2561
+▁touch -2562
+▁Air -2563
+▁teac -2564
+cil -2565
+▁exam -2566
+▁autom -2567
+▁Street -2568
+▁international -2569
+▁loss -2570
+▁weekend -2571
+▁Wind -2572
+▁infl -2573
+▁prior -2574
+▁prevent -2575
+▁allows -2576
+▁arri -2577
+▁Calif -2578
+▁Click -2579
+irth -2580
+ibrary -2581
+▁character -2582
+▁piece -2583
+▁treatment -2584
+cember -2585
+itchen -2586
+olution -2587
+▁http -2588
+ma -2589
+▁similar -2590
+▁Most -2591
+▁moment -2592
+gar -2593
+oke -2594
+ruary -2595
+▁clos -2596
+▁Design -2597
+▁investig -2598
+▁rate -2599
+▁AM -2600
+reg -2601
+▁commit -2602
+▁growth -2603
+imum -2604
+▁norm -2605
+OM -2606
+iber -2607
+▁Dis -2608
+ivery -2609
+▁estab -2610
+▁cause -2611
+▁user -2612
+sp -2613
+▁deg -2614
+▁lost -2615
+▁display -2616
+▁collection -2617
+▁myself -2618
+▁Cr -2619
+▁op -2620
+▁enter -2621
+▁Wednesday -2622
+unt -2623
+▁rout -2624
+ault -2625
+▁decided -2626
+▁decision -2627
+▁sil -2628
+▁inde -2629
+▁Any -2630
+▁higher -2631
+cy -2632
+▁bal -2633
+▁daily -2634
+ha -2635
+ournal -2636
+▁digital -2637
+▁November -2638
+▁purp -2639
+▁Group -2640
+▁released -2641
+▁significant -2642
+▁reported -2643
+LE -2644
+▁Home -2645
+▁woman -2646
+▁Cour -2647
+▁easily -2648
+▁cannot -2649
+▁goes -2650
+▁International -2651
+▁excell -2652
+lin -2653
+▁wall -2654
+▁Thanks -2655
+▁quickly -2656
+▁College -2657
+▁usually -2658
+amb -2659
+▁bag -2660
+▁apply -2661
+▁floor -2662
+▁expected -2663
+iant -2664
+▁involved -2665
+▁Law -2666
+▁dom -2667
+▁attack -2668
+just -2669
+▁boy -2670
+illing -2671
+▁regard -2672
+▁platform -2673
+▁capt -2674
+▁iP -2675
+▁Net -2676
+▁encoura -2677
+▁protect -2678
+ondon -2679
+▁Cons -2680
+▁agree -2681
+ael -2682
+▁serious -2683
+▁December -2684
+▁safety -2685
+▁roll -2686
+▁saw -2687
+▁dress -2688
+▁Google -2689
+▁gen -2690
+▁parents -2691
+▁mach -2692
+idents -2693
+▁played -2694
+▁Service -2695
+▁immedi -2696
+▁surpr -2697
+mas -2698
+▁warm -2699
+zz -2700
+▁integr -2701
+▁mobile -2702
+▁tast -2703
+ica -2704
+▁February -2705
+▁sn -2706
+▁club -2707
+▁langu -2708
+▁president -2709
+▁sche -2710
+▁related -2711
+hern -2712
+▁shoot -2713
+▁finish -2714
+▁ideas -2715
+▁global -2716
+▁marketing -2717
+▁tools -2718
+▁ep -2719
+▁expert -2720
+band -2721
+▁code -2722
+▁exact -2723
+ospital -2724
+asons -2725
+▁mass -2726
+▁note -2727
+avy -2728
+▁photo -2729
+izes -2730
+▁save -2731
+▁source -2732
+▁ut -2733
+▁option -2734
+▁respect -2735
+▁Brit -2736
+▁Let -2737
+▁feed -2738
+enge -2739
+iding -2740
+▁arch -2741
+▁deep -2742
+▁corre -2743
+▁Ang -2744
+▁announced -2745
+ilies -2746
+▁appe -2747
+edding -2748
+▁Well -2749
+cription -2750
+▁La -2751
+www -2752
+hood -2753
+reng -2754
+▁stock -2755
+▁sens -2756
+▁admin -2757
+▁location -2758
+▁ri -2759
+ellow -2760
+▁gets -2761
+▁David -2762
+▁costs -2763
+▁helps -2764
+▁Av -2765
+ples -2766
+▁materials -2767
+ength -2768
+▁Je -2769
+ipe -2770
+rab -2771
+▁Tex -2772
+▁huge -2773
+▁published -2774
+agn -2775
+like -2776
+AP -2777
+▁send -2778
+▁mother -2779
+▁benefits -2780
+▁English -2781
+enior -2782
+mission -2783
+ography -2784
+▁lab -2785
+oday -2786
+▁Play -2787
+▁fight -2788
+▁Over -2789
+▁hear -2790
+▁weight -2791
+rown -2792
+▁Spr -2793
+ornia -2794
+uel -2795
+vey -2796
+iction -2797
+▁images -2798
+rought -2799
+▁restaur -2800
+key -2801
+▁gar -2802
+▁Book -2803
+▁earn -2804
+ald -2805
+▁ability -2806
+▁interview -2807
+add -2808
+▁Check -2809
+▁Business -2810
+atory -2811
+▁London -2812
+ructure -2813
+▁written -2814
+akers -2815
+▁challeng -2816
+▁standard -2817
+▁gives -2818
+▁giving -2819
+▁ones -2820
+▁legal -2821
+▁sense -2822
+▁campaign -2823
+▁Sch -2824
+▁dest -2825
+▁innov -2826
+erved -2827
+▁door -2828
+▁patients -2829
+rom -2830
+▁mid -2831
+▁trust -2832
+urt -2833
+▁sus -2834
+▁wasn -2835
+▁Services -2836
+▁center -2837
+▁instead -2838
+aged -2839
+▁Produ -2840
+▁fab -2841
+▁Coun -2842
+▁heat -2843
+▁neg -2844
+▁fine -2845
+▁item -2846
+▁Great -2847
+▁target -2848
+erous -2849
+▁prem -2850
+erve -2851
+▁sold -2852
+▁White -2853
+aught -2854
+▁wish -2855
+▁Trans -2856
+▁parts -2857
+▁write -2858
+▁levels -2859
+▁lic -2860
+▁award -2861
+iring -2862
+arant -2863
+aves -2864
+▁cases -2865
+▁describ -2866
+▁picture -2867
+▁pers -2868
+▁partners -2869
+▁Web -2870
+▁dry -2871
+▁neigh -2872
+irit -2873
+▁Mod -2874
+▁Prof -2875
+▁stuff -2876
+ashington -2877
+ida -2878
+▁pull -2879
+▁conditions -2880
+▁ded -2881
+atives -2882
+▁green -2883
+▁California -2884
+▁broad -2885
+▁effic -2886
+▁Hol -2887
+board -2888
+▁Hall -2889
+put -2890
+rows -2891
+▁Program -2892
+ivity -2893
+▁began -2894
+▁sale -2895
+▁upon -2896
+istic -2897
+▁highly -2898
+▁interesting -2899
+TM -2900
+bit -2901
+OS -2902
+▁vot -2903
+▁fans -2904
+▁stories -2905
+inner -2906
+▁request -2907
+▁contract -2908
+▁remember -2909
+▁slow -2910
+▁Cle -2911
+▁emer -2912
+▁subs -2913
+▁answer -2914
+▁Techn -2915
+anch -2916
+▁comments -2917
+acing -2918
+ocol -2919
+▁bra -2920
+▁Phot -2921
+▁wood -2922
+▁Other -2923
+▁lower -2924
+▁sym -2925
+▁dead -2926
+orge -2927
+▁prim -2928
+orage -2929
+▁modern -2930
+▁player -2931
+▁cat -2932
+coming -2933
+bum -2934
+▁interested -2935
+ooth -2936
+▁reports -2937
+aches -2938
+▁except -2939
+ara -2940
+lev -2941
+▁dise -2942
+▁trip -2943
+▁teams -2944
+▁Jack -2945
+▁Texas -2946
+▁attention -2947
+▁equipment -2948
+▁paint -2949
+sy -2950
+▁fully -2951
+▁wrong -2952
+▁directly -2953
+▁starting -2954
+▁completely -2955
+▁organization -2956
+▁types -2957
+uk -2958
+wide -2959
+▁Green -2960
+mm -2961
+▁resources -2962
+▁Last -2963
+▁www -2964
+ET -2965
+urb -2966
+ager -2967
+▁document -2968
+▁themselves -2969
+apan -2970
+▁dru -2971
+▁solutions -2972
+▁stru -2973
+▁viol -2974
+ashion -2975
+▁bank -2976
+▁Washington -2977
+▁Loc -2978
+▁Rem -2979
+ament -2980
+▁multiple -2981
+▁Association -2982
+▁band -2983
+▁achieve -2984
+▁condition -2985
+▁gold -2986
+▁businesses -2987
+▁Twitter -2988
+uses -2989
+▁wait -2990
+ule -2991
+▁Go -2992
+ening -2993
+udd -2994
+▁Each -2995
+▁affect -2996
+▁opportunities -2997
+▁vac -2998
+▁Gener -2999
+urer -3000
+▁hop -3001
+EC -3002
+▁sett -3003
+▁policy -3004
+▁Par -3005
+▁led -3006
+ension -3007
+▁thinking -3008
+▁dream -3009
+▁Once -3010
+raz -3011
+rel -3012
+▁groups -3013
+▁planning -3014
+▁commercial -3015
+EO -3016
+He -3017
+ffee -3018
+olf -3019
+▁Spe -3020
+▁separ -3021
+▁applications -3022
+▁qual -3023
+▁streng -3024
+▁approach -3025
+▁families -3026
+▁solution -3027
+▁Del -3028
+▁firm -3029
+▁Class -3030
+▁express -3031
+ores -3032
+▁gave -3033
+▁Found -3034
+enty -3035
+iles -3036
+▁offe -3037
+▁consult -3038
+▁Year -3039
+▁gift -3040
+▁subject -3041
+▁Mem -3042
+AD -3043
+▁Afric -3044
+▁prices -3045
+▁successful -3046
+ties -3047
+▁positive -3048
+▁employees -3049
+arlier -3050
+▁blood -3051
+▁AN -3052
+▁race -3053
+itute -3054
+▁deliver -3055
+oul -3056
+▁join -3057
+ares -3058
+▁itself -3059
+▁King -3060
+▁shot -3061
+▁advice -3062
+▁cert -3063
+▁THE -3064
+▁eye -3065
+riend -3066
+▁hour -3067
+▁defe -3068
+▁saying -3069
+▁healthy -3070
+▁glass -3071
+▁creating -3072
+▁Sub -3073
+▁According -3074
+▁dark -3075
+ration -3076
+▁spent -3077
+▁div -3078
+▁Even -3079
+▁Why -3080
+field -3081
+▁cy -3082
+itely -3083
+ford -3084
+▁Best -3085
+▁cancer -3086
+▁Christmas -3087
+▁effective -3088
+▁serve -3089
+omen -3090
+▁sites -3091
+▁budget -3092
+▁Whe -3093
+▁Road -3094
+▁lif -3095
+▁goals -3096
+▁message -3097
+king -3098
+▁Vis -3099
+▁reve -3100
+mb -3101
+down -3102
+▁Paul -3103
+▁fair -3104
+▁India -3105
+▁average -3106
+▁Dan -3107
+▁fix -3108
+▁circ -3109
+▁Office -3110
+▁Pri -3111
+▁condu -3112
+▁East -3113
+▁reach -3114
+elling -3115
+▁Since -3116
+▁cross -3117
+aughter -3118
+▁traditional -3119
+▁extreme -3120
+▁organiz -3121
+▁director -3122
+PS -3123
+▁Hot -3124
+▁implement -3125
+Ch -3126
+▁sometimes -3127
+▁physical -3128
+▁obs -3129
+ipped -3130
+▁camer -3131
+ords -3132
+vis -3133
+▁Oh -3134
+▁opp -3135
+▁adult -3136
+▁terms -3137
+iable -3138
+▁Germ -3139
+▁plant -3140
+▁wonderful -3141
+US -3142
+rote -3143
+▁hor -3144
+▁Many -3145
+▁Rec -3146
+▁aim -3147
+▁attempt -3148
+▁limited -3149
+▁pictures -3150
+tee -3151
+▁Japan -3152
+▁See -3153
+▁Develop -3154
+▁excellent -3155
+▁dro -3156
+urning -3157
+ysis -3158
+▁mount -3159
+BC -3160
+▁emb -3161
+▁Work -3162
+imately -3163
+onse -3164
+▁brought -3165
+uth -3166
+yond -3167
+▁Ann -3168
+▁quarter -3169
+hest -3170
+▁title -3171
+▁section -3172
+ecutive -3173
+▁block -3174
+▁delivery -3175
+▁Mor -3176
+▁became -3177
+▁farm -3178
+▁arr -3179
+▁carry -3180
+▁effort -3181
+▁IN -3182
+▁kitchen -3183
+▁mention -3184
+▁developed -3185
+▁imm -3186
+inary -3187
+▁Use -3188
+iance -3189
+yright -3190
+reci -3191
+▁jud -3192
+▁fish -3193
+▁China -3194
+▁Inter -3195
+▁countries -3196
+estern -3197
+▁progress -3198
+▁necessary -3199
+▁ge -3200
+▁suppl -3201
+▁sweet -3202
+pendent -3203
+▁complex -3204
+ocks -3205
+▁baby -3206
+vest -3207
+▁felt -3208
+mitted -3209
+▁feeling -3210
+▁System -3211
+▁nation -3212
+▁promot -3213
+▁Top -3214
+▁Make -3215
+▁Dem -3216
+▁Good -3217
+hold -3218
+iced -3219
+▁birth -3220
+▁sleep -3221
+▁growing -3222
+▁impress -3223
+porate -3224
+▁Public -3225
+▁places -3226
+ocr -3227
+▁seven -3228
+▁IT -3229
+▁Flor -3230
+ffects -3231
+venue -3232
+▁Mac -3233
+▁war -3234
+▁heard -3235
+itation -3236
+gu -3237
+pite -3238
+▁weather -3239
+▁Lear -3240
+▁Open -3241
+▁region -3242
+▁Michael -3243
+haps -3244
+▁billion -3245
+▁son -3246
+itary -3247
+▁star -3248
+▁Sur -3249
+duc -3250
+▁Today -3251
+▁hotel -3252
+▁wants -3253
+Re -3254
+▁Thank -3255
+▁stick -3256
+▁college -3257
+▁construction -3258
+IL -3259
+▁bi -3260
+▁album -3261
+▁spend -3262
+▁mat -3263
+▁cold -3264
+▁medic -3265
+▁stage -3266
+▁ver -3267
+▁Port -3268
+▁Director -3269
+▁individuals -3270
+▁double -3271
+nded -3272
+▁Canada -3273
+▁Market -3274
+): -3275
+EL -3276
+aries -3277
+▁Down -3278
+▁convers -3279
+▁Russ -3280
+▁profession -3281
+ying -3282
+▁ble -3283
+▁speed -3284
+▁distrib -3285
+pects -3286
+▁exerc -3287
+rup -3288
+▁ST -3289
+aled -3290
+▁finished -3291
+fl -3292
+▁gas -3293
+istry -3294
+▁suit -3295
+ils -3296
+▁pages -3297
+▁statement -3298
+pre -3299
+ancy -3300
+▁charge -3301
+▁ing -3302
+▁spot -3303
+▁ult -3304
+▁requirements -3305
+▁finally -3306
+▁schools -3307
+▁vehicle -3308
+▁smart -3309
+▁annual -3310
+▁Windows -3311
+". -3312
+ado -3313
+wor -3314
+▁eat -3315
+useum -3316
+▁feet -3317
+▁Board -3318
+▁advant -3319
+ibly -3320
+▁blue -3321
+▁load -3322
+▁aware -3323
+unk -3324
+▁Gold -3325
+▁Research -3326
+▁straight -3327
+▁appl -3328
+arc -3329
+▁Mark -3330
+▁nearly -3331
+ato -3332
+▁Bel -3333
+▁Tom -3334
+▁tried -3335
+▁hous -3336
+▁avoid -3337
+aling -3338
+ports -3339
+▁difference -3340
+▁wrote -3341
+▁William -3342
+▁Sol -3343
+▁pattern -3344
+owl -3345
+ened -3346
+▁James -3347
+▁respond -3348
+▁challenge -3349
+▁Bre -3350
+▁dog -3351
+▁beginning -3352
+ION -3353
+▁Educ -3354
+▁About -3355
+▁helping -3356
+:|| -3357
+▁benefit -3358
+▁insurance -3359
+▁situation -3360
+iment -3361
+▁essential -3362
+▁imag -3363
+ancing -3364
+unte -3365
+▁device -3366
+ceed -3367
+▁Obama -3368
+rast -3369
+▁shop -3370
+ological -3371
+▁Care -3372
+▁Indian -3373
+▁political -3374
+box -3375
+uted -3376
+▁Time -3377
+▁loved -3378
+▁Review -3379
+ube -3380
+▁nut -3381
+▁pow -3382
+overn -3383
+▁wear -3384
+▁Apple -3385
+▁Sl -3386
+▁Mag -3387
+olute -3388
+▁Find -3389
+▁activity -3390
+▁devices -3391
+▁moving -3392
+▁Met -3393
+▁lik -3394
+▁paid -3395
+▁enh -3396
+▁Club -3397
+▁Hel -3398
+▁uses -3399
+▁eight -3400
+▁exhib -3401
+▁Court -3402
+▁turned -3403
+oms -3404
+oses -3405
+▁posted -3406
+▁towards -3407
+”. -3408
+▁nature -3409
+▁Sk -3410
+▁partner -3411
+asy -3412
+▁investment -3413
+ourney -3414
+▁appreci -3415
+▁offering -3416
+▁temper -3417
+▁contain -3418
+▁largest -3419
+ivil -3420
+▁knew -3421
+▁ahead -3422
+oves -3423
+rench -3424
+idered -3425
+▁retail -3426
+▁hus -3427
+▁eyes -3428
+▁owners -3429
+▁language -3430
+▁Ant -3431
+inger -3432
+▁expand -3433
+house -3434
+ey -3435
+rences -3436
+ios -3437
+▁rent -3438
+ned -3439
+▁cas -3440
+▁connect -3441
+▁wife -3442
+ampions -3443
+▁advert -3444
+▁Rel -3445
+▁Rich -3446
+▁reduce -3447
+▁European -3448
+▁guarant -3449
+ago -3450
+cause -3451
+▁Look -3452
+▁sports -3453
+▁correct -3454
+aly -3455
+anta -3456
+▁categ -3457
+▁client -3458
+▁states -3459
+▁consist -3460
+pri -3461
+▁maybe -3462
+▁named -3463
+▁definitely -3464
+hips -3465
+▁influ -3466
+▁entertain -3467
+erry -3468
+hens -3469
+▁accur -3470
+▁concept -3471
+osing -3472
+ounds -3473
+▁runs -3474
+▁grand -3475
+▁stress -3476
+IP -3477
+change -3478
+▁Super -3479
+▁guide -3480
+▁homes -3481
+▁Have -3482
+▁thous -3483
+last -3484
+▁jobs -3485
+▁offered -3486
+estival -3487
+▁earlier -3488
+▁immediately -3489
+▁doll -3490
+▁numbers -3491
+sych -3492
+▁conc -3493
+iers -3494
+▁decl -3495
+▁Fam -3496
+esome -3497
+▁Rob -3498
+▁rates -3499
+▁Council -3500
+azine -3501
+▁rev -3502
+▁Community -3503
+▁path -3504
+▁collabor -3505
+lying -3506
+roud -3507
+▁Cop -3508
+You -3509
+alt -3510
+orrow -3511
+▁candid -3512
+▁interact -3513
+ails -3514
+▁remain -3515
+▁II -3516
+more -3517
+▁bottom -3518
+sec -3519
+dule -3520
+▁Sum -3521
+▁Cong -3522
+▁belie -3523
+▁drink -3524
+▁pieces -3525
+▁exactly -3526
+asc -3527
+lim -3528
+▁tips -3529
+▁Micro -3530
+▁View -3531
+iation -3532
+▁overall -3533
+▁max -3534
+▁federal -3535
+▁storage -3536
+vin -3537
+icious -3538
+▁Custom -3539
+▁opening -3540
+▁demand -3541
+▁Two -3542
+place -3543
+▁surround -3544
+▁Cur -3545
+▁histor -3546
+▁Bay -3547
+orial -3548
+▁Rober -3549
+▁adjust -3550
+ulations -3551
+▁shipping -3552
+▁strateg -3553
+▁Internet -3554
+▁active -3555
+▁threat -3556
+ram -3557
+▁Win -3558
+▁looked -3559
+oma -3560
+▁ten -3561
+▁occas -3562
+▁length -3563
+inated -3564
+▁served -3565
+▁conference -3566
+ico -3567
+iny -3568
+▁IS -3569
+▁guys -3570
+▁rock -3571
+▁button -3572
+▁garden -3573
+▁Florida -3574
+▁acqu -3575
+▁Police -3576
+▁easier -3577
+▁Angel -3578
+yd -3579
+order -3580
+undred -3581
+▁Island -3582
+▁father -3583
+oly -3584
+▁bath -3585
+▁speak -3586
+▁attract -3587
+If -3588
+▁normal -3589
+▁thanks -3590
+dom -3591
+umn -3592
+▁Love -3593
+▁thank -3594
+▁bill -3595
+▁People -3596
+▁background -3597
+illa -3598
+rial -3599
+▁born -3600
+arily -3601
+▁girls -3602
+rig -3603
+▁Ev -3604
+▁Det -3605
+▁wedding -3606
+care -3607
+▁lots -3608
+▁damage -3609
+roid -3610
+▁Big -3611
+▁fat -3612
+▁pet -3613
+bl -3614
+ses -3615
+▁Ty -3616
+▁culture -3617
+▁replace -3618
+▁creative -3619
+▁internet -3620
+▁completed -3621
+▁assess -3622
+OL -3623
+▁Call -3624
+▁prec -3625
+aduate -3626
+atever -3627
+mod -3628
+que -3629
+▁Life -3630
+▁Team -3631
+▁wine -3632
+▁Company -3633
+▁husband -3634
+ij -3635
+▁coach -3636
+▁beyond -3637
+aith -3638
+▁cards -3639
+ipp -3640
+▁cash -3641
+▁Child -3642
+▁haven -3643
+▁altern -3644
+ota -3645
+▁Matt -3646
+▁guy -3647
+phone -3648
+▁depend -3649
+▁setting -3650
+leg -3651
+▁bul -3652
+▁Back -3653
+▁Show -3654
+▁miles -3655
+▁er -3656
+antly -3657
+force -3658
+▁transport -3659
+▁Management -3660
+ustain -3661
+body -3662
+ston -3663
+wise -3664
+▁emot -3665
+▁behav -3666
+▁driving -3667
+▁cream -3668
+▁response -3669
+iling -3670
+▁pred -3671
+▁estate -3672
+ously -3673
+het -3674
+▁USA -3675
+oving -3676
+isions -3677
+▁owner -3678
+▁Australia -3679
+friend -3680
+▁Pet -3681
+▁Sun -3682
+▁cho -3683
+error -3684
+▁Contact -3685
+izz -3686
+▁excited -3687
+▁selection -3688
+▁Ir -3689
+ales -3690
+anging -3691
+▁Ret -3692
+▁middle -3693
+▁efforts -3694
+▁particularly -3695
+▁Plan -3696
+▁Pal -3697
+itect -3698
+icks -3699
+▁Dri -3700
+▁helped -3701
+door -3702
+ustr -3703
+▁Lake -3704
+▁doub -3705
+▁colors -3706
+▁inform -3707
+▁Ve -3708
+aper -3709
+▁files -3710
+▁allowed -3711
+▁lines -3712
+▁existing -3713
+▁Bank -3714
+▁satis -3715
+▁patient -3716
+▁comfortable -3717
+istered -3718
+▁welcome -3719
+▁considered -3720
+▁responsible -3721
+▁clot -3722
+▁drop -3723
+▁truly -3724
+▁coffee -3725
+▁understanding -3726
+DA -3727
+▁plus -3728
+▁Govern -3729
+▁Thom -3730
+▁measure -3731
+set -3732
+▁economic -3733
+▁Yes -3734
+oming -3735
+▁frame -3736
+▁slight -3737
+▁journey -3738
+isl -3739
+▁Dec -3740
+▁indic -3741
+▁degree -3742
+▁ingred -3743
+▁himself -3744
+bon -3745
+▁purpose -3746
+▁tom -3747
+▁surv -3748
+▁changed -3749
+▁liter -3750
+▁mission -3751
+free -3752
+nown -3753
+ences -3754
+onstr -3755
+ona -3756
+▁Although -3757
+EM -3758
+▁pen -3759
+ologies -3760
+▁models -3761
+reed -3762
+▁train -3763
+▁winter -3764
+▁prot -3765
+▁stream -3766
+▁highest -3767
+ads -3768
+see -3769
+encies -3770
+▁prefer -3771
+▁seeing -3772
+▁strugg -3773
+▁evening -3774
+press -3775
+▁Take -3776
+▁artist -3777
+▁talking -3778
+OW -3779
+▁Camp -3780
+▁Phil -3781
+▁afford -3782
+▁Information -3783
+▁Str -3784
+▁sty -3785
+▁Smith -3786
+▁fashion -3787
+▁Republic -3788
+▁gun -3789
+▁disease -3790
+▁pool -3791
+▁absolute -3792
+OV -3793
+▁Sen -3794
+▁shopping -3795
+raw -3796
+oman -3797
+apter -3798
+▁River -3799
+▁Church -3800
+met -3801
+soft -3802
+▁Mart -3803
+▁lack -3804
+▁appoint -3805
+▁heavy -3806
+▁letter -3807
+rem -3808
+▁Color -3809
+▁British -3810
+▁daughter -3811
+▁fem -3812
+▁Rock -3813
+▁cast -3814
+▁brother -3815
+rey -3816
+▁Sing -3817
+▁flav -3818
+porary -3819
+▁occur -3820
+▁smooth -3821
+▁opin -3822
+▁increased -3823
+▁Jes -3824
+▁Music -3825
+▁moved -3826
+▁proud -3827
+▁couldn -3828
+▁launch -3829
+▁analysis -3830
+▁organizations -3831
+dd -3832
+▁PC -3833
+tion -3834
+▁mer -3835
+fit -3836
+▁links -3837
+gery -3838
+▁obt -3839
+▁Water -3840
+▁craft -3841
+▁church -3842
+▁compon -3843
+▁Blue -3844
+▁fill -3845
+▁rules -3846
+▁shared -3847
+▁spring -3848
+eria -3849
+uled -3850
+▁mail -3851
+▁Under -3852
+▁sched -3853
+▁Because -3854
+ronic -3855
+chan -3856
+▁Special -3857
+▁reviews -3858
+▁senior -3859
+▁hundred -3860
+IM -3861
+▁onto -3862
+▁whose -3863
+bed -3864
+▁Brown -3865
+net -3866
+▁fan -3867
+icing -3868
+▁Power -3869
+▁decor -3870
+▁secure -3871
+▁machine -3872
+imal -3873
+▁spread -3874
+▁u -3875
+▁frequ -3876
+▁score -3877
+ocolate -3878
+▁spirit -3879
+▁residents -3880
+amic -3881
+▁Hum -3882
+▁trade -3883
+▁science -3884
+vant -3885
+▁fra -3886
+▁Wood -3887
+▁appropri -3888
+▁officials -3889
+▁Sam -3890
+▁unit -3891
+▁died -3892
+hone -3893
+▁gone -3894
+▁manager -3895
+▁pressure -3896
+▁Like -3897
+▁challenges -3898
+TS -3899
+ady -3900
+▁clin -3901
+▁extend -3902
+▁instruct -3903
+▁dedicated -3904
+▁competition -3905
+▁Mount -3906
+▁Char -3907
+▁session -3908
+▁fant -3909
+▁Follow -3910
+▁happened -3911
+rian -3912
+▁Food -3913
+▁Mary -3914
+▁sort -3915
+ulated -3916
+▁initial -3917
+▁Fire -3918
+▁trou -3919
+▁Media -3920
+▁District -3921
+BA -3922
+icon -3923
+▁characters -3924
+▁basic -3925
+▁camera -3926
+▁holiday -3927
+azon -3928
+ategy -3929
+▁Enter -3930
+▁powerful -3931
+▁Institute -3932
+▁produce -3933
+▁beg -3934
+istics -3935
+▁Press -3936
+osition -3937
+▁dating -3938
+ette -3939
+asp -3940
+▁Hist -3941
+▁reasons -3942
+▁increasing -3943
+icken -3944
+▁shown -3945
+▁sugar -3946
+▁incred -3947
+▁extremely -3948
+▁rob -3949
+▁chem -3950
+▁Education -3951
+oos -3952
+▁AC -3953
+inese -3954
+▁volunte -3955
+▁disp -3956
+▁package -3957
+▁payment -3958
+RA -3959
+▁eval -3960
+▁guests -3961
+▁aren -3962
+▁snow -3963
+▁leader -3964
+▁biggest -3965
+▁TO -3966
+▁alone -3967
+▁object -3968
+▁proced -3969
+▁Sa -3970
+rowd -3971
+▁basis -3972
+▁disapp -3973
+▁supply -3974
+▁General -3975
+orney -3976
+▁Star -3977
+ifying -3978
+olic -3979
+▁laws -3980
+▁breat -3981
+▁graph -3982
+▁solid -3983
+▁forget -3984
+▁continues -3985
+LC -3986
+▁cars -3987
+▁guid -3988
+▁voice -3989
+▁experienced -3990
+▁Lou -3991
+▁mis -3992
+▁brows -3993
+rapy -3994
+▁arrest -3995
+▁passed -3996
+▁schedule -3997
+ken -3998
+omb -3999
+uing -4000
+▁egg -4001
+▁passion -4002
+▁dang -4003
+▁fear -4004
+▁guess -4005
+▁scene -4006
+esterday -4007
+BS -4008
+▁bur -4009
+▁steps -4010
+cel -4011
+▁Mal -4012
+▁beat -4013
+▁military -4014
+Sh -4015
+▁PR -4016
+▁Miss -4017
+gal -4018
+▁gra -4019
+▁names -4020
+▁approx -4021
+▁update -4022
+▁subst -4023
+▁During -4024
+▁protection -4025
+▁Att -4026
+▁Franc -4027
+▁French -4028
+annel -4029
+▁peace -4030
+▁conven -4031
+term -4032
+▁Who -4033
+▁ton -4034
+▁advantage -4035
+state -4036
+▁placed -4037
+▁Commission -4038
+▁pair -4039
+▁notice -4040
+▁strength -4041
+ero -4042
+What -4043
+incip -4044
+using -4045
+▁academ -4046
+▁Arch -4047
+▁epis -4048
+▁adding -4049
+▁waiting -4050
+▁although -4051
+ags -4052
+ideo -4053
+▁League -4054
+IV -4055
+▁Ben -4056
+clusive -4057
+▁Mot -4058
+▁reb -4059
+▁Alex -4060
+▁beauty -4061
+▁scient -4062
+ula -4063
+▁Dig -4064
+▁calls -4065
+▁relax -4066
+▁demonstr -4067
+▁regarding -4068
+amin -4069
+mark -4070
+ovel -4071
+▁income -4072
+▁covered -4073
+▁effects -4074
+ari -4075
+ixt -4076
+▁Sign -4077
+▁Online -4078
+uty -4079
+imin -4080
+▁copy -4081
+iverse -4082
+▁initi -4083
+▁experts -4084
+▁standards -4085
+▁technical -4086
+ros -4087
+okes -4088
+▁Atl -4089
+▁Vol -4090
+ading -4091
+▁manage -4092
+▁Chic -4093
+▁knows -4094
+▁winning -4095
+▁hospital -4096
+▁certainly -4097
+▁Real -4098
+▁batter -4099
+▁workers -4100
+▁connection -4101
+osh -4102
+▁compared -4103
+As -4104
+oe -4105
+▁RE -4106
+▁hom -4107
+ga -4108
+oop -4109
+▁Ins -4110
+▁Form -4111
+▁Development -4112
+▁wild -4113
+▁dinner -4114
+▁fabric -4115
+▁associated -4116
+▁experiences -4117
+▁Pay -4118
+▁doctor -4119
+▁master -4120
+▁cit -4121
+▁cru -4122
+▁wat -4123
+ograp -4124
+▁vote -4125
+▁posts -4126
+▁finding -4127
+▁Foundation -4128
+▁opened -4129
+▁Profess -4130
+▁reflect -4131
+IG -4132
+▁Carol -4133
+amm -4134
+▁audience -4135
+▁friendly -4136
+cell -4137
+unning -4138
+atically -4139
+mail -4140
+ctors -4141
+▁surface -4142
+▁den -4143
+▁Science -4144
+▁pm -4145
+▁Cap -4146
+itude -4147
+▁trail -4148
+▁artists -4149
+▁traffic -4150
+▁critical -4151
+▁communities -4152
+AA -4153
+uce -4154
+▁NY -4155
+▁Valley -4156
+works -4157
+▁remind -4158
+▁victim -4159
+▁Step -4160
+▁salt -4161
+▁followed -4162
+la -4163
+well -4164
+▁Rad -4165
+iques -4166
+▁Elect -4167
+▁football -4168
+tr -4169
+aming -4170
+▁electric -4171
+aven -4172
+▁Beach -4173
+▁facility -4174
+▁cry -4175
+gency -4176
+▁Disc -4177
+▁keeping -4178
+▁meaning -4179
+▁luck -4180
+▁pros -4181
+▁figure -4182
+▁learned -4183
+yer -4184
+ander -4185
+ulate -4186
+▁tickets -4187
+▁professionals -4188
+antic -4189
+▁laun -4190
+▁taste -4191
+▁instit -4192
+gen -4193
+▁bright -4194
+ech -4195
+arge -4196
+▁produced -4197
+▁watching -4198
+▁flex -4199
+▁catch -4200
+▁monitor -4201
+▁contains -4202
+lor -4203
+▁ter -4204
+There -4205
+ooper -4206
+▁entry -4207
+▁Project -4208
+▁Society -4209
+▁classic -4210
+▁department -4211
+edy -4212
+itar -4213
+▁diagn -4214
+▁lock -4215
+▁classes -4216
+rees -4217
+▁closed -4218
+▁starts -4219
+▁continued -4220
+▁dire -4221
+▁jump -4222
+▁awesome -4223
+▁kept -4224
+▁bought -4225
+▁listed -4226
+▁Christian -4227
+▁Wil -4228
+osure -4229
+▁Whether -4230
+▁neighbor -4231
+▁selected -4232
+▁Town -4233
+▁explore -4234
+▁testing -4235
+▁harm -4236
+▁Date -4237
+▁larger -4238
+▁videos -4239
+▁Another -4240
+▁presented -4241
+fast -4242
+▁Ber -4243
+▁ice -4244
+▁Times -4245
+▁transfer -4246
+▁thousands -4247
+▁developing -4248
+fin -4249
+▁capital -4250
+▁OF -4251
+iller -4252
+▁teaching -4253
+▁Mel -4254
+▁Nov -4255
+▁Long -4256
+▁force -4257
+▁grant -4258
+▁minute -4259
+▁talent -4260
+▁established -4261
+▁fol -4262
+▁Hill -4263
+▁desk -4264
+standing -4265
+▁England -4266
+▁AP -4267
+enses -4268
+▁announce -4269
+▁exciting -4270
+end -4271
+▁Vir -4272
+acity -4273
+▁Family -4274
+▁street -4275
+▁furn -4276
+▁facilities -4277
+▁Jim -4278
+▁brings -4279
+▁Tim -4280
+▁buying -4281
+▁records -4282
+▁articles -4283
+gn -4284
+▁sto -4285
+▁drug -4286
+▁ideal -4287
+▁library -4288
+▁requires -4289
+noon -4290
+itors -4291
+enance -4292
+▁Scott -4293
+▁micro -4294
+▁Chicago -4295
+win -4296
+rief -4297
+▁sup -4298
+▁rich -4299
+▁virt -4300
+▁novel -4301
+▁Chinese -4302
+▁sharing -4303
+▁updated -4304
+▁mo -4305
+part -4306
+sequ -4307
+▁Start -4308
+▁butter -4309
+▁driver -4310
+▁greater -4311
+riage -4312
+▁Sand -4313
+▁ship -4314
+▁crowd -4315
+▁wouldn -4316
+▁restaurant -4317
+imb -4318
+▁ir -4319
+lands -4320
+▁vision -4321
+▁Note -4322
+▁Exper -4323
+▁ingredients -4324
+ray -4325
+unately -4326
+▁List -4327
+▁poor -4328
+▁Stand -4329
+▁studies -4330
+▁Cup -4331
+overy -4332
+▁loan -4333
+▁Build -4334
+▁Grand -4335
+▁handle -4336
+▁plenty -4337
+▁resident -4338
+outs -4339
+▁bird -4340
+illage -4341
+ka -4342
+▁tree -4343
+▁economy -4344
+▁Central -4345
+▁leaving -4346
+▁serving -4347
+▁Div -4348
+▁sem -4349
+▁Support -4350
+SP -4351
+word -4352
+▁Mex -4353
+iture -4354
+▁beach -4355
+▁famous -4356
+ini -4357
+inn -4358
+▁Mil -4359
+lastname -4360
+▁manufacturer -4361
+▁faith -4362
+▁rooms -4363
+▁shall -4364
+▁recipe -4365
+▁Congress -4366
+CH -4367
+▁station -4368
+UR -4369
+▁react -4370
+▁shape -4371
+pective -4372
+▁origin -4373
+night -4374
+▁Amazon -4375
+▁injury -4376
+▁missing -4377
+reek -4378
+semb -4379
+▁Sil -4380
+▁upgr -4381
+▁Social -4382
+do -4383
+▁Pub -4384
+isher -4385
+▁motor -4386
+▁claims -4387
+▁medium -4388
+▁Bill -4389
+▁Posted -4390
+▁orders -4391
+▁maintain -4392
+rd -4393
+▁Fun -4394
+asure -4395
+▁brain -4396
+▁notes -4397
+▁views -4398
+▁Download -4399
+▁appropriate -4400
+▁boo -4401
+ishes -4402
+point -4403
+▁Offic -4404
+▁meant -4405
+▁older -4406
+▁spons -4407
+▁window -4408
+▁sustain -4409
+atab -4410
+▁Jesus -4411
+▁signed -4412
+berg -4413
+▁remove -4414
+cks -4415
+▁ended -4416
+▁changing -4417
+▁strategy -4418
+fr -4419
+cles -4420
+look -4421
+▁map -4422
+▁Union -4423
+outhern -4424
+▁happens -4425
+▁efficient -4426
+▁uns -4427
+going -4428
+▁advance -4429
+▁journal -4430
+ervation -4431
+▁plastic -4432
+▁Fore -4433
+▁stores -4434
+▁independent -4435
+▁iPhone -4436
+iest -4437
+▁useful -4438
+top -4439
+▁CD -4440
+umber -4441
+▁Organ -4442
+▁forms -4443
+▁leaves -4444
+▁Jul -4445
+craft -4446
+▁Light -4447
+▁Academ -4448
+acks -4449
+▁Award -4450
+▁advent -4451
+no -4452
+▁sand -4453
+▁shut -4454
+rehens -4455
+▁agency -4456
+▁repair -4457
+▁evidence -4458
+▁spending -4459
+▁afternoon -4460
+▁tim -4461
+apers -4462
+odes -4463
+rooms -4464
+▁throw -4465
+▁AND -4466
+▁menu -4467
+essions -4468
+▁secret -4469
+▁whatever -4470
+▁Fil -4471
+▁fee -4472
+estic -4473
+iliar -4474
+▁core -4475
+▁pray -4476
+▁sport -4477
+▁operations -4478
+▁combination -4479
+allery -4480
+▁Chris -4481
+▁Before -4482
+▁helpful -4483
+▁reality -4484
+atively -4485
+▁Where -4486
+▁multi -4487
+▁district -4488
+▁prepared -4489
+men -4490
+oyal -4491
+eless -4492
+icted -4493
+▁Week -4494
+▁cris -4495
+▁cab -4496
+ption -4497
+▁adop -4498
+▁tend -4499
+▁Democr -4500
+▁Series -4501
+▁status -4502
+▁balance -4503
+▁Mad -4504
+▁YOU -4505
+▁scen -4506
+▁estim -4507
+alls -4508
+▁flu -4509
+▁Both -4510
+▁flat -4511
+▁Author -4512
+▁joined -4513
+▁designs -4514
+▁remains -4515
+▁ID -4516
+▁Los -4517
+▁ride -4518
+▁corner -4519
+▁rank -4520
+▁eating -4521
+▁memory -4522
+Cl -4523
+mp -4524
+itz -4525
+▁Bet -4526
+▁Mont -4527
+▁caused -4528
+▁operating -4529
+▁Ma -4530
+aser -4531
+▁mist -4532
+▁George -4533
+▁discount -4534
+▁slightly -4535
+▁teachers -4536
+eed -4537
+▁IP -4538
+▁Women -4539
+▁esc -4540
+▁perhaps -4541
+▁primary -4542
+▁numerous -4543
+hem -4544
+▁funds -4545
+▁worry -4546
+▁survey -4547
+▁winner -4548
+▁enjoyed -4549
+▁showing -4550
+▁exercise -4551
+een -4552
+▁unc -4553
+▁Card -4554
+▁fourth -4555
+▁showed -4556
+▁spl -4557
+uries -4558
+▁anti -4559
+▁Francis -4560
+▁surgery -4561
+▁becoming -4562
+▁properties -4563
+pan -4564
+▁gain -4565
+▁recip -4566
+▁veget -4567
+▁Engine -4568
+▁markets -4569
+▁obvious -4570
+▁committed -4571
+▁suff -4572
+▁theme -4573
+▁focused -4574
+vere -4575
+▁plants -4576
+▁direction -4577
+ius -4578
+▁Tor -4579
+▁listen -4580
+▁managed -4581
+▁kick -4582
+iences -4583
+▁forum -4584
+▁chocolate -4585
+▁shel -4586
+▁limit -4587
+gers -4588
+lets -4589
+iency -4590
+▁legisl -4591
+aked -4592
+▁Its -4593
+▁Jun -4594
+▁busy -4595
+▁rain -4596
+issions -4597
+▁mechan -4598
+▁movement -4599
+▁encourage -4600
+▁rap -4601
+▁cloud -4602
+▁resist -4603
+▁putting -4604
+▁communication -4605
+OP -4606
+cher -4607
+▁bon -4608
+▁Their -4609
+▁raised -4610
+▁animals -4611
+▁assistance -4612
+?? -4613
+obe -4614
+oles -4615
+▁Bob -4616
+▁CEO -4617
+▁Full -4618
+▁Frank -4619
+▁lunch -4620
+▁defense -4621
+ita -4622
+▁analy -4623
+▁relig -4624
+life -4625
+rael -4626
+▁poll -4627
+▁corporate -4628
+▁practices -4629
+▁Technology -4630
+”, -4631
+itness -4632
+▁discover -4633
+▁Microsoft -4634
+", -4635
+gl -4636
+!!! -4637
+▁Mike -4638
+▁civil -4639
+▁reached -4640
+▁sources -4641
+bert -4642
+▁util -4643
+igation -4644
+vention -4645
+▁society -4646
+▁yesterday -4647
+orter -4648
+▁mill -4649
+▁chair -4650
+▁Wr -4651
+▁scr -4652
+▁youth -4653
+▁central -4654
+abilities -4655
+▁advanced -4656
+▁Ham -4657
+▁cart -4658
+▁architect -4659
+▁determine -4660
+REE -4661
+▁Fort -4662
+arrant -4663
+▁cleaning -4664
+▁vehicles -4665
+▁firstname -4666
+ena -4667
+ror -4668
+west -4669
+▁Tri -4670
+▁tea -4671
+▁dete -4672
+▁rare -4673
+▁AS -4674
+▁NOT -4675
+▁Mass -4676
+▁actual -4677
+yan -4678
+▁psych -4679
+▁Robert -4680
+▁tables -4681
+▁worksh -4682
+▁methods -4683
+▁leadership -4684
+▁Bur -4685
+▁ath -4686
+▁structure -4687
+kin -4688
+▁vs -4689
+▁pock -4690
+aturing -4691
+▁Commit -4692
+CC -4693
+MS -4694
+iled -4695
+▁Log -4696
+▁Set -4697
+▁fell -4698
+▁register -4699
+?” -4700
+▁repe -4701
+▁battle -4702
+▁format -4703
+▁becomes -4704
+▁willing -4705
+bre -4706
+ifts -4707
+▁colle -4708
+▁charges -4709
+▁funding -4710
+▁updates -4711
+▁thoughts -4712
+▁ju -4713
+▁Tre -4714
+ordin -4715
+▁toward -4716
+▁appears -4717
+▁visitors -4718
+▁fees -4719
+▁incor -4720
+▁sector -4721
+▁Copyright -4722
+▁absolutely -4723
+▁temperature -4724
+▁lose -4725
+▁locations -4726
+▁Keep -4727
+▁Next -4728
+▁colour -4729
+▁filled -4730
+▁songs -4731
+▁Network -4732
+▁Old -4733
+▁instru -4734
+levision -4735
+▁Wall -4736
+▁Trump -4737
+▁brown -4738
+▁Spring -4739
+▁century -4740
+▁extensive -4741
+▁Conference -4742
+kins -4743
+▁Land -4744
+▁Learn -4745
+▁Louis -4746
+▁asking -4747
+▁environmental -4748
+ola -4749
+ship -4750
+▁Way -4751
+▁topic -4752
+▁favour -4753
+▁transl -4754
+▁courses -4755
+▁profile -4756
+▁AL -4757
+▁Ol -4758
+while -4759
+▁Test -4760
+▁south -4761
+▁dur -4762
+▁Medic -4763
+▁Report -4764
+▁documents -4765
+▁previously -4766
+coh -4767
+▁Dou -4768
+▁Oper -4769
+▁adapt -4770
+▁north -4771
+ception -4772
+ipl -4773
+▁Plus -4774
+▁bowl -4775
+▁swim -4776
+ivered -4777
+▁guest -4778
+▁refer -4779
+▁visual -4780
+▁readers -4781
+▁anywhere -4782
+▁kid -4783
+▁registered -4784
+otton -4785
+▁Jeff -4786
+▁France -4787
+For -4788
+▁Cre -4789
+▁Lim -4790
+▁lux -4791
+▁sch -4792
+▁polic -4793
+▁charged -4794
+▁expertise -4795
+New -4796
+water -4797
+▁task -4798
+iration -4799
+▁upcoming -4800
+▁UN -4801
+▁wire -4802
+▁allowing -4803
+FL -4804
+▁Ok -4805
+▁selling -4806
+po -4807
+bour -4808
+▁bask -4809
+▁recommended -4810
+▁stre -4811
+▁Hotel -4812
+▁plays -4813
+▁Android -4814
+▁coverage -4815
+icip -4816
+▁Lat -4817
+▁fuel -4818
+▁neck -4819
+▁audio -4820
+▁sounds -4821
+▁Library -4822
+▁population -4823
+list -4824
+umin -4825
+▁Only -4826
+▁Conne -4827
+▁featured -4828
+▁Saf -4829
+▁pal -4830
+▁joint -4831
+▁Medical -4832
+▁princip -4833
+▁smaller -4834
+▁walking -4835
+▁ur -4836
+ulty -4837
+▁thr -4838
+▁Prov -4839
+▁seat -4840
+▁mental -4841
+▁establish -4842
+▁discussion -4843
+▁Jew -4844
+▁tun -4845
+▁apart -4846
+▁trial -4847
+▁parties -4848
+▁NE -4849
+istan -4850
+▁dance -4851
+ferences -4852
+IA -4853
+azz -4854
+ora -4855
+osis -4856
+▁Somet -4857
+▁Watch -4858
+igan -4859
+prise -4860
+▁Main -4861
+▁dogs -4862
+▁radio -4863
+▁despite -4864
+On -4865
+▁Lord -4866
+▁Walk -4867
+▁fold -4868
+▁truck -4869
+▁Africa -4870
+▁Virgin -4871
+▁scheduled -4872
+▁maintenance -4873
+▁Head -4874
+▁inspired -4875
+▁ON -4876
+▁diet -4877
+▁nine -4878
+▁restr -4879
+SA -4880
+▁writer -4881
+▁outdoor -4882
+▁Security -4883
+▁accommod -4884
+▁combined -4885
+▁van -4886
+ki -4887
+▁CA -4888
+▁har -4889
+▁citiz -4890
+▁scored -4891
+aks -4892
+alog -4893
+▁Western -4894
+rehensive -4895
+▁techniques -4896
+OO -4897
+▁Game -4898
+▁Admin -4899
+▁decide -4900
+▁seconds -4901
+▁Soft -4902
+▁Museum -4903
+▁values -4904
+▁removed -4905
+▁provider -4906
+▁sav -4907
+▁earth -4908
+▁raise -4909
+▁accompl -4910
+ownt -4911
+▁metal -4912
+▁stret -4913
+▁researc -4914
+eal -4915
+▁Place -4916
+▁spect -4917
+▁elements -4918
+▁purchased -4919
+▁joy -4920
+▁calc -4921
+▁purs -4922
+▁trees -4923
+▁launched -4924
+zen -4925
+▁Hy -4926
+▁Mer -4927
+▁sea -4928
+▁honest -4929
+▁movies -4930
+▁innovative -4931
+An -4932
+IF -4933
+▁panel -4934
+idering -4935
+▁counter -4936
+▁shooting -4937
+▁delicious -4938
+▁approximately -4939
+▁sitting -4940
+gment -4941
+▁killed -4942
+▁separate -4943
+▁edge -4944
+▁Video -4945
+▁Digital -4946
+▁teacher -4947
+▁relevant -4948
+ano -4949
+▁matt -4950
+▁approved -4951
+gage -4952
+▁lovely -4953
+▁parking -4954
+▁consumers -4955
+▁executive -4956
+My -4957
+nel -4958
+van -4959
+▁steel -4960
+▁Israel -4961
+▁Angeles -4962
+▁Manager -4963
+▁magazine -4964
+rs -4965
+ye -4966
+orry -4967
+▁hearing -4968
+▁concerns -4969
+bu -4970
+appy -4971
+igned -4972
+ushed -4973
+▁Charl -4974
+▁Person -4975
+pet -4976
+ellig -4977
+known -4978
+▁chat -4979
+▁conv -4980
+▁Georg -4981
+▁Peter -4982
+ensions -4983
+▁mostly -4984
+▁agreement -4985
+ears -4986
+▁eth -4987
+▁milk -4988
+▁rise -4989
+▁occasion -4990
+ups -4991
+▁Aud -4992
+▁tow -4993
+olars -4994
+▁Cook -4995
+▁Data -4996
+▁Join -4997
+isation -4998
+▁cheese -4999
+▁highlight -5000
+▁generation -5001
+VD -5002
+▁Ext -5003
+▁Ill -5004
+▁Penn -5005
+▁Word -5006
+▁Const -5007
+osit -5008
+▁mur -5009
+▁rid -5010
+▁Room -5011
+▁Thomas -5012
+▁identify -5013
+▁Gal -5014
+▁Pac -5015
+▁Centre -5016
+▁connected -5017
+▁intended -5018
+▁appearance -5019
+TV -5020
+fol -5021
+ring -5022
+orthern -5023
+▁controll -5024
+PA -5025
+ris -5026
+apes -5027
+▁sets -5028
+▁Prote -5029
+▁feels -5030
+▁waste -5031
+▁described -5032
+▁operation -5033
+▁commitment -5034
+▁Mo -5035
+▁Ver -5036
+irmed -5037
+▁truth -5038
+▁Master -5039
+▁academic -5040
+▁delivered -5041
+▁participate -5042
+cm -5043
+▁sympt -5044
+▁Through -5045
+ournament -5046
+!) -5047
+ENT -5048
+▁Men -5049
+oston -5050
+▁Lead -5051
+▁push -5052
+▁stars -5053
+▁Indust -5054
+▁Invest -5055
+▁server -5056
+▁Children -5057
+▁familiar -5058
+▁marriage -5059
+osen -5060
+▁Bas -5061
+▁nom -5062
+▁Arts -5063
+▁tough -5064
+▁enhance -5065
+▁capacity -5066
+▁relationships -5067
+UT -5068
+ycl -5069
+▁Upd -5070
+reens -5071
+▁cooking -5072
+▁promote -5073
+den -5074
+elines -5075
+▁landsc -5076
+ker -5077
+alend -5078
+nergy -5079
+▁cells -5080
+▁campus -5081
+▁editor -5082
+mond -5083
+▁mort -5084
+▁optim -5085
+▁cities -5086
+▁Journal -5087
+▁decisions -5088
+▁generally -5089
+▁Fair -5090
+▁signs -5091
+▁Access -5092
+▁wearing -5093
+▁therefore -5094
+▁introduced -5095
+arsh -5096
+berry -5097
+▁Vict -5098
+▁breast -5099
+▁accident -5100
+▁properly -5101
+▁processes -5102
+▁Er -5103
+prene -5104
+▁educational -5105
+▁Ul -5106
+▁Cam -5107
+cohol -5108
+eline -5109
+▁situ -5110
+▁majority -5111
+▁investigation -5112
+anda -5113
+inch -5114
+▁jew -5115
+▁minor -5116
+ya -5117
+burg -5118
+▁arm -5119
+ishing -5120
+▁opinion -5121
+▁detailed -5122
+▁Government -5123
+▁Dev -5124
+▁fly -5125
+▁Hand -5126
+▁Rest -5127
+reprene -5128
+▁technologies -5129
+▁teen -5130
+▁Chief -5131
+▁Earth -5132
+atabase -5133
+▁Global -5134
+▁minimum -5135
+▁category -5136
+▁presence -5137
+IR -5138
+▁Lab -5139
+▁ban -5140
+▁Live -5141
+▁label -5142
+▁calling -5143
+▁returned -5144
+▁emergency -5145
+▁expensive -5146
+▁mentioned -5147
+ef -5148
+▁Tur -5149
+▁feedback -5150
+fortunately -5151
+▁responsibility -5152
+▁Ari -5153
+▁Fund -5154
+▁Ohio -5155
+▁Wild -5156
+ression -5157
+▁Committee -5158
+▁installed -5159
+DF -5160
+▁Mur -5161
+▁ring -5162
+▁square -5163
+▁Johnson -5164
+▁foreign -5165
+▁bringing -5166
+▁hundreds -5167
+▁websites -5168
+▁Americans -5169
+▁installation -5170
+col -5171
+▁Que -5172
+▁plug -5173
+▁female -5174
+▁ourselves -5175
+rag -5176
+razy -5177
+▁Boston -5178
+▁entertainment -5179
+otten -5180
+ternal -5181
+▁invent -5182
+▁arrange -5183
+▁behavior -5184
+▁exchange -5185
+▁performed -5186
+▁episode -5187
+▁factors -5188
+▁consumer -5189
+▁advertising -5190
+ien -5191
+▁Pack -5192
+▁sizes -5193
+▁begins -5194
+▁satisf -5195
+hab -5196
+text -5197
+▁appeared -5198
+▁Di -5199
+▁Kn -5200
+aded -5201
+▁brief -5202
+▁sides -5203
+▁veter -5204
+▁Squ -5205
+▁flo -5206
+▁teach -5207
+▁units -5208
+▁studio -5209
+uts -5210
+▁Den -5211
+▁coast -5212
+ictions -5213
+emporary -5214
+▁MP -5215
+rist -5216
+▁Adv -5217
+▁Sup -5218
+▁Human -5219
+▁Federal -5220
+AY -5221
+▁elig -5222
+▁icon -5223
+▁tight -5224
+▁caught -5225
+▁transform -5226
+▁confidence -5227
+icians -5228
+▁chief -5229
+▁sauce -5230
+▁thick -5231
+ae -5232
+When -5233
+iser -5234
+▁Tour -5235
+▁fruit -5236
+▁Colorado -5237
+▁honor -5238
+▁holding -5239
+▁reserved -5240
+lock -5241
+▁Wal -5242
+▁Those -5243
+▁adults -5244
+▁topics -5245
+▁policies -5246
+▁supporting -5247
+spe -5248
+uke -5249
+▁https -5250
+▁Contin -5251
+▁ven -5252
+OC -5253
+hew -5254
+cean -5255
+▁alle -5256
+▁meat -5257
+▁ment -5258
+▁achie -5259
+▁chicken -5260
+▁windows -5261
+▁confident -5262
+▁HD -5263
+acle -5264
+▁vary -5265
+▁Price -5266
+rastructure -5267
+▁administration -5268
+▁Pan -5269
+▁motiv -5270
+▁animal -5271
+ifications -5272
+▁supported -5273
+with -5274
+▁Jud -5275
+▁cro -5276
+▁fantastic -5277
+ushing -5278
+▁mouth -5279
+▁sexual -5280
+▁seeking -5281
+SS -5282
+▁meal -5283
+▁Creat -5284
+▁alternative -5285
+arp -5286
+iat -5287
+arks -5288
+oted -5289
+▁Maybe -5290
+▁victory -5291
+ait -5292
+how -5293
+▁Bi -5294
+▁Search -5295
+▁Carolina -5296
+▁Australian -5297
+kes -5298
+ancer -5299
+▁Germany -5300
+▁components -5301
+▁importance -5302
+▁competitive -5303
+vy -5304
+▁sy -5305
+▁Prem -5306
+▁quiet -5307
+▁basket -5308
+▁edition -5309
+paper -5310
+▁tele -5311
+▁sister -5312
+▁dollars -5313
+rier -5314
+▁cheap -5315
+▁leads -5316
+▁thread -5317
+▁apparent -5318
+ste -5319
+▁Jon -5320
+▁rom -5321
+▁rub -5322
+unting -5323
+▁Canad -5324
+▁Sports -5325
+▁switch -5326
+▁guarantee -5327
+▁Academy -5328
+▁conduct -5329
+▁confirm -5330
+▁transact -5331
+▁conversation -5332
+inct -5333
+▁Lin -5334
+ighter -5335
+▁distance -5336
+▁Tit -5337
+▁Young -5338
+▁recru -5339
+▁centre -5340
+▁measures -5341
+▁worldwide -5342
+Com -5343
+▁Gar -5344
+▁Gen -5345
+▁info -5346
+▁Festival -5347
+▁Students -5348
+.| -5349
+etic -5350
+▁Bal -5351
+▁fif -5352
+▁picked -5353
+iability -5354
+▁remaining -5355
+▁photograph -5356
+weet -5357
+▁Jose -5358
+weight -5359
+▁bread -5360
+▁license -5361
+away -5362
+ucks -5363
+▁impl -5364
+▁flight -5365
+▁totally -5366
+▁Nor -5367
+▁rat -5368
+▁Meet -5369
+▁doubt -5370
+▁prison -5371
+▁unless -5372
+▁tack -5373
+▁Martin -5374
+inations -5375
+NA -5376
+atre -5377
+▁Sar -5378
+▁ang -5379
+▁vir -5380
+achel -5381
+uable -5382
+▁species -5383
+How -5384
+elly -5385
+ersey -5386
+▁restaurants -5387
+▁comprehensive -5388
+asks -5389
+▁seek -5390
+▁doors -5391
+▁contest -5392
+▁agencies -5393
+ailability -5394
+▁Champions -5395
+iano -5396
+verse -5397
+▁Quest -5398
+▁tests -5399
+▁faster -5400
+▁delight -5401
+▁maximum -5402
+▁celebrate -5403
+uzz -5404
+eries -5405
+▁league -5406
+▁clearly -5407
+▁musical -5408
+▁visiting -5409
+▁photograp -5410
+RC -5411
+TH -5412
+Our -5413
+▁Type -5414
+▁forg -5415
+itable -5416
+▁depart -5417
+▁painting -5418
+▁eventually -5419
+pass -5420
+▁Did -5421
+▁dyn -5422
+▁wel -5423
+estyle -5424
+▁noted -5425
+▁planned -5426
+▁election -5427
+▁revealed -5428
+▁considering -5429
+TC -5430
+otic -5431
+▁Inte -5432
+▁propos -5433
+▁prepare -5434
+▁depending -5435
+▁Cred -5436
+▁Using -5437
+▁Energy -5438
+▁arrived -5439
+▁housing -5440
+▁married -5441
+▁university -5442
+igr -5443
+▁Ro -5444
+usion -5445
+▁burn -5446
+▁lived -5447
+▁ticket -5448
+▁Hospital -5449
+▁bike -5450
+▁mine -5451
+▁Jackson -5452
+▁sessions -5453
+erg -5454
+▁Ce -5455
+▁inn -5456
+iminal -5457
+ixture -5458
+orough -5459
+▁scale -5460
+▁Assist -5461
+▁SP -5462
+wing -5463
+▁McC -5464
+▁ign -5465
+▁ris -5466
+ulous -5467
+▁FREE -5468
+▁apps -5469
+▁otherwise -5470
+▁discovered -5471
+▁Mid -5472
+▁Cost -5473
+▁compar -5474
+▁gather -5475
+▁officer -5476
+mes -5477
+▁Secret -5478
+▁climate -5479
+▁monthly -5480
+▁Japanese -5481
+▁chemical -5482
+▁neighborhood -5483
+▁boys -5484
+▁ends -5485
+▁liqu -5486
+▁evalu -5487
+▁turns -5488
+▁inches -5489
+▁spokes -5490
+▁struct -5491
+▁commission -5492
+▁Kore -5493
+▁weap -5494
+▁symptoms -5495
+ht -5496
+▁Bul -5497
+▁Cat -5498
+agram -5499
+▁freed -5500
+▁missed -5501
+▁cutting -5502
+▁accounts -5503
+▁internal -5504
+▁reliable -5505
+ias -5506
+▁ran -5507
+tered -5508
+▁pump -5509
+▁surf -5510
+related -5511
+▁brands -5512
+▁lights -5513
+▁seemed -5514
+▁appreciate -5515
+▁participants -5516
+otes -5517
+alian -5518
+▁Know -5519
+▁battery -5520
+▁organic -5521
+▁affordable -5522
+edia -5523
+▁hyd -5524
+▁Cert -5525
+▁corn -5526
+▁twice -5527
+▁Applic -5528
+▁Columb -5529
+▁Georgia -5530
+▁cultural -5531
+▁resource -5532
+▁featuring -5533
+hi -5534
+▁Second -5535
+▁automatically -5536
+They -5537
+ician -5538
+▁valid -5539
+▁athlet -5540
+▁paying -5541
+▁submit -5542
+▁African -5543
+▁meetings -5544
+iors -5545
+▁Code -5546
+▁Jones -5547
+▁Andrew -5548
+EE -5549
+▁emp -5550
+▁Share -5551
+▁bigger -5552
+▁regularly -5553
+); -5554
+Ex -5555
+but -5556
+▁Hard -5557
+▁Qual -5558
+▁debt -5559
+▁Middle -5560
+▁failed -5561
+▁supposed -5562
+▁Ep -5563
+▁Help -5564
+▁Steve -5565
+▁storm -5566
+▁accurate -5567
+▁possibly -5568
+GB -5569
+ua -5570
+ban -5571
+▁mel -5572
+▁pod -5573
+▁boost -5574
+▁deals -5575
+▁labor -5576
+▁volume -5577
+▁television -5578
+▁presentation -5579
+cont -5580
+▁fro -5581
+▁draft -5582
+▁fellow -5583
+▁realize -5584
+▁manufacturing -5585
+Pro -5586
+▁Ut -5587
+▁fle -5588
+▁Daniel -5589
+▁concent -5590
+▁Virginia -5591
+▁messages -5592
+?" -5593
+▁SH -5594
+ennis -5595
+idden -5596
+pected -5597
+▁fields -5598
+▁revenue -5599
+▁affected -5600
+▁recovery -5601
+EST -5602
+rupt -5603
+▁Boy -5604
+▁Blog -5605
+▁German -5606
+▁covers -5607
+▁shares -5608
+▁proposed -5609
+▁researchers -5610
+No -5611
+roy -5612
+eper -5613
+mosp -5614
+▁die -5615
+rical -5616
+▁Page -5617
+iamond -5618
+alendar -5619
+oration -5620
+▁Rights -5621
+ployment -5622
+▁returns -5623
+▁engineering -5624
+▁Lee -5625
+▁Tem -5626
+▁Farm -5627
+▁Travel -5628
+▁birthday -5629
+▁AD -5630
+case -5631
+▁Rom -5632
+▁aid -5633
+▁ages -5634
+▁Little -5635
+▁confirmed -5636
+▁instructions -5637
+▁amb -5638
+cious -5639
+▁Cast -5640
+▁Trust -5641
+▁dates -5642
+▁tells -5643
+▁answers -5644
+▁creation -5645
+▁interior -5646
+▁protected -5647
+ca -5648
+ters -5649
+▁Tech -5650
+▁breakfast -5651
+▁sad -5652
+▁wal -5653
+▁dish -5654
+▁chart -5655
+▁warrant -5656
+▁industrial -5657
+▁infrastructure -5658
+iner -5659
+▁nor -5660
+which -5661
+▁Orig -5662
+▁Games -5663
+▁Visit -5664
+▁loves -5665
+▁Mexico -5666
+▁county -5667
+▁applied -5668
+▁browser -5669
+▁employee -5670
+ario -5671
+▁nurs -5672
+▁agent -5673
+▁pregn -5674
+▁specifically -5675
+▁Opt -5676
+▁mir -5677
+▁poly -5678
+▁route -5679
+▁desire -5680
+▁issued -5681
+▁choices -5682
+▁decades -5683
+▁drivers -5684
+▁NC -5685
+▁Hen -5686
+▁hook -5687
+▁rapid -5688
+▁furniture -5689
+▁chain -5690
+▁foods -5691
+fection -5692
+▁flowers -5693
+▁reference -5694
+▁twe -5695
+▁hero -5696
+▁jack -5697
+▁affili -5698
+▁element -5699
+▁perfectly -5700
+▁WH -5701
+gend -5702
+▁Joe -5703
+erves -5704
+▁thus -5705
+lights -5706
+▁attorney -5707
+▁standing -5708
+▁exclusive -5709
+ansas -5710
+▁tail -5711
+▁plate -5712
+▁chosen -5713
+▁earned -5714
+▁supports -5715
+upp -5716
+▁CH -5717
+▁anc -5718
+▁yes -5719
+anger -5720
+odies -5721
+▁Made -5722
+▁bond -5723
+▁Broad -5724
+▁talks -5725
+▁Control -5726
+▁Francisco -5727
+▁employment -5728
+hand -5729
+rick -5730
+▁Ken -5731
+hetic -5732
+oking -5733
+▁mode -5734
+▁vent -5735
+▁Brand -5736
+▁remote -5737
+ibilities -5738
+▁Executive -5739
+anna -5740
+irms -5741
+▁Dom -5742
+▁End -5743
+ospit -5744
+▁Enjoy -5745
+▁agreed -5746
+▁purposes -5747
+▁apartment -5748
+▁incredible -5749
+Al -5750
+▁AT -5751
+▁Lo -5752
+lymp -5753
+▁Bon -5754
+▁wid -5755
+▁Expl -5756
+▁broken -5757
+▁improved -5758
+▁strategies -5759
+UN -5760
+can -5761
+▁DVD -5762
+▁nav -5763
+▁Does -5764
+▁logo -5765
+▁Store -5766
+▁Williams -5767
+▁processing -5768
+▁Hope -5769
+▁Pass -5770
+▁Sher -5771
+▁Current -5772
+▁illustr -5773
+▁hardware -5774
+▁surrounding -5775
+▁Sy -5776
+anges -5777
+▁cake -5778
+▁cute -5779
+▁whom -5780
+▁advis -5781
+▁Product -5782
+▁recorded -5783
+▁disappoint -5784
+BI -5785
+MA -5786
+▁Id -5787
+ench -5788
+hent -5789
+▁Equ -5790
+▁Haw -5791
+▁lit -5792
+▁Coast -5793
+▁quant -5794
+▁reput -5795
+▁rough -5796
+▁premium -5797
+aped -5798
+▁Mic -5799
+adium -5800
+▁golf -5801
+ampion -5802
+▁holds -5803
+▁judge -5804
+▁pleased -5805
+▁accepted -5806
+▁suitable -5807
+umes -5808
+idays -5809
+▁boat -5810
+▁Point -5811
+▁downt -5812
+▁losing -5813
+▁Instead -5814
+▁male -5815
+▁pure -5816
+▁grade -5817
+▁trouble -5818
+uous -5819
+▁rule -5820
+▁Three -5821
+▁wheel -5822
+▁administr -5823
+▁buildings -5824
+lyn -5825
+oga -5826
+uits -5827
+▁usual -5828
+▁History -5829
+▁explain -5830
+▁domestic -5831
+▁concerned -5832
+!” -5833
+xy -5834
+itage -5835
+▁telling -5836
+▁Minister -5837
+▁violence -5838
+▁candidates -5839
+gas -5840
+ums -5841
+▁moist -5842
+▁licens -5843
+▁aspects -5844
+▁Communic -5845
+▁injuries -5846
+▁favourite -5847
+tra -5848
+▁ok -5849
+what -5850
+▁Girl -5851
+person -5852
+▁moments -5853
+▁typically -5854
+otal -5855
+▁pun -5856
+▁tur -5857
+▁Party -5858
+▁error -5859
+▁causes -5860
+▁styles -5861
+▁Italian -5862
+▁awareness -5863
+▁registration -5864
+▁vit -5865
+▁arts -5866
+▁phil -5867
+▁Night -5868
+▁Print -5869
+▁Perform -5870
+rim -5871
+road -5872
+lines -5873
+▁oven -5874
+▁grown -5875
+▁enable -5876
+▁island -5877
+▁greatest -5878
+vell -5879
+▁Harr -5880
+▁rand -5881
+orable -5882
+▁abuse -5883
+▁shoes -5884
+▁forces -5885
+▁stated -5886
+fficient -5887
+▁surprise -5888
+va -5889
+▁FOR -5890
+▁Key -5891
+▁tag -5892
+▁taxes -5893
+▁photography -5894
+ERS -5895
+hors -5896
+▁jun -5897
+anish -5898
+cluding -5899
+▁closer -5900
+▁citizens -5901
+▁negative -5902
+▁influence -5903
+CA -5904
+bur -5905
+writ -5906
+▁Four -5907
+▁circum -5908
+▁actions -5909
+ria -5910
+▁Def -5911
+▁Dog -5912
+tters -5913
+ulture -5914
+▁retire -5915
+▁script -5916
+▁stopped -5917
+▁stretch -5918
+▁broadcast -5919
+▁Wi -5920
+pond -5921
+▁Drive -5922
+▁Local -5923
+▁gradu -5924
+▁resol -5925
+▁Division -5926
+▁wet -5927
+▁crew -5928
+▁powder -5929
+▁database -5930
+▁tomorrow -5931
+▁sam -5932
+astern -5933
+▁Olymp -5934
+▁leather -5935
+▁practical -5936
+ribe -5937
+▁Bra -5938
+▁Ell -5939
+▁Max -5940
+▁adm -5941
+▁argu -5942
+Un -5943
+▁serves -5944
+▁weekly -5945
+▁alleged -5946
+iami -5947
+udden -5948
+▁shock -5949
+▁Pacific -5950
+▁payments -5951
+▁functions -5952
+▁inspiration -5953
+DS -5954
+▁Gra -5955
+stone -5956
+▁acid -5957
+▁bound -5958
+▁faculty -5959
+And -5960
+yers -5961
+▁tro -5962
+alled -5963
+▁mini -5964
+▁funny -5965
+▁Awards -5966
+▁speech -5967
+▁receiving -5968
+▁authorities -5969
+ava -5970
+hus -5971
+▁Mat -5972
+merce -5973
+▁Ryan -5974
+▁sequ -5975
+▁thin -5976
+lywood -5977
+▁column -5978
+▁designer -5979
+ucle -5980
+▁hits -5981
+▁cable -5982
+forcement -5983
+▁supplies -5984
+▁Available -5985
+▁electronic -5986
+TA -5987
+ERE -5988
+▁rot -5989
+atholic -5990
+▁config -5991
+▁pepper -5992
+▁village -5993
+▁identified -5994
+▁tut -5995
+▁gear -5996
+▁Cross -5997
+▁random -5998
+poration -5999
+▁everyday -6000
+▁committee -6001
+GE -6002
+bol -6003
+oup -6004
+irty -6005
+▁Hor -6006
+▁Oil -6007
+under -6008
+profit -6009
+▁Econom -6010
+▁perman -6011
+▁recognized -6012
+ache -6013
+▁Aff -6014
+itate -6015
+never -6016
+right -6017
+▁Coll -6018
+▁Need -6019
+▁grab -6020
+▁atmosp -6021
+▁degrees -6022
+▁printed -6023
+▁convenient -6024
+▁healthcare -6025
+▁impressive -6026
+PM -6027
+mar -6028
+inet -6029
+▁crime -6030
+▁keeps -6031
+▁lessons -6032
+▁Michigan -6033
+Pl -6034
+So -6035
+rip -6036
+▁tab -6037
+▁Bell -6038
+▁Cond -6039
+isters -6040
+▁essay -6041
+▁flour -6042
+▁crisis -6043
+▁height -6044
+▁emotional -6045
+▁determined -6046
+▁Cas -6047
+▁Ref -6048
+▁Tay -6049
+▁voc -6050
+atoes -6051
+etime -6052
+▁Ariz -6053
+▁films -6054
+▁imagine -6055
+▁treated -6056
+▁Sometimes -6057
+▁dangerous -6058
+▁happening -6059
+▁Lt -6060
+▁PS -6061
+aren -6062
+phas -6063
+▁Dun -6064
+▁Try -6065
+▁Small -6066
+▁crazy -6067
+▁Comple -6068
+▁ongoing -6069
+▁champions -6070
+▁explained -6071
+iate -6072
+hered -6073
+inter -6074
+▁Jenn -6075
+▁Mean -6076
+uction -6077
+▁Santa -6078
+▁fixed -6079
+▁sheet -6080
+▁entreprene -6081
+Ar -6082
+▁Run -6083
+▁Sus -6084
+urban -6085
+▁Safety -6086
+▁dropped -6087
+▁Marketing -6088
+cue -6089
+rum -6090
+▁Fed -6091
+▁patterns -6092
+▁resolution -6093
+▁du -6094
+pret -6095
+▁Mach -6096
+▁Canadian -6097
+▁investors -6098
+LS -6099
+All -6100
+aid -6101
+eler -6102
+made -6103
+▁row -6104
+▁worse -6105
+▁Victor -6106
+▁dining -6107
+iversary -6108
+▁subscrib -6109
+▁gro -6110
+anged -6111
+arian -6112
+▁Writ -6113
+▁rear -6114
+▁Guide -6115
+▁command -6116
+▁trading -6117
+▁conducted -6118
+▁tradition -6119
+LA -6120
+mary -6121
+anche -6122
+osoph -6123
+▁Rose -6124
+▁soul -6125
+▁taught -6126
+▁arrested -6127
+▁attended -6128
+▁officers -6129
+▁appointment -6130
+▁collaboration -6131
+Bl -6132
+Con -6133
+▁GM -6134
+▁Kh -6135
+enced -6136
+▁lift -6137
+▁simpl -6138
+▁extended -6139
+lete -6140
+▁der -6141
+▁Priv -6142
+▁cock -6143
+▁grad -6144
+▁roof -6145
+▁Chair -6146
+▁hoping -6147
+▁alcohol -6148
+▁positions -6149
+▁Environment -6150
+▁successfully -6151
+ppers -6152
+oosing -6153
+▁native -6154
+▁tournament -6155
+Don -6156
+inson -6157
+▁grew -6158
+▁wash -6159
+▁depth -6160
+▁flood -6161
+▁Account -6162
+▁freedom -6163
+▁ordered -6164
+▁eligible -6165
+▁incident -6166
+▁sick -6167
+▁folks -6168
+▁Senate -6169
+▁versions -6170
+iana -6171
+▁Inf -6172
+▁kne -6173
+▁Mult -6174
+▁spin -6175
+▁Richard -6176
+ello -6177
+rate -6178
+▁obtain -6179
+▁severe -6180
+▁Sat -6181
+aints -6182
+▁Turn -6183
+▁Photo -6184
+▁cycle -6185
+▁guard -6186
+▁teeth -6187
+▁noticed -6188
+iki -6189
+▁bat -6190
+▁Area -6191
+▁Paris -6192
+▁advoc -6193
+▁belong -6194
+▁forced -6195
+▁massive -6196
+▁graduate -6197
+▁construct -6198
+Be -6199
+ala -6200
+cers -6201
+essed -6202
+racts -6203
+▁adds -6204
+▁dram -6205
+▁none -6206
+▁houses -6207
+▁improvement -6208
+hire -6209
+real -6210
+rics -6211
+▁Daily -6212
+▁trend -6213
+iveness -6214
+▁Summer -6215
+▁tested -6216
+▁failure -6217
+▁Building -6218
+▁valuable -6219
+▁innovation -6220
+tle -6221
+▁ol -6222
+▁Kent -6223
+▁Which -6224
+▁mixed -6225
+▁shots -6226
+▁yards -6227
+▁cotton -6228
+▁regional -6229
+ayer -6230
+utch -6231
+▁Ash -6232
+▁Die -6233
+rease -6234
+▁Carl -6235
+▁Clean -6236
+▁Right -6237
+▁council -6238
+Is -6239
+▁MS -6240
+▁Box -6241
+▁Rev -6242
+▁thorough -6243
+▁integrated -6244
+▁DC -6245
+▁syn -6246
+▁Size -6247
+▁tiny -6248
+hentic -6249
+▁output -6250
+za -6251
+▁ec -6252
+inem -6253
+▁tank -6254
+▁owned -6255
+▁concert -6256
+▁knowing -6257
+▁routine -6258
+▁turning -6259
+▁efficiency -6260
+erse -6261
+▁drugs -6262
+▁Avenue -6263
+▁facing -6264
+▁guitar -6265
+▁diverse -6266
+▁therapy -6267
+▁clothing -6268
+▁providers -6269
+▁MO -6270
+▁Sn -6271
+▁Ent -6272
+▁Tool -6273
+acking -6274
+▁Select -6275
+▁publish -6276
+▁reduced -6277
+▁interface -6278
+CE -6279
+▁fo -6280
+▁Hon -6281
+osite -6282
+secut -6283
+▁Asia -6284
+▁Though -6285
+▁yellow -6286
+▁follows -6287
+▁description -6288
+▁distribution -6289
+illy -6290
+▁LLC -6291
+▁ped -6292
+abled -6293
+ansion -6294
+▁Training -6295
+▁settings -6296
+▁surprised -6297
+▁effectively -6298
+▁EU -6299
+print -6300
+▁auto -6301
+▁dial -6302
+sembly -6303
+▁Miami -6304
+▁silver -6305
+▁mixture -6306
+▁contemporary -6307
+▁expectations -6308
+▁:) -6309
+abet -6310
+▁Ball -6311
+intage -6312
+▁baking -6313
+▁enthus -6314
+▁unable -6315
+▁carried -6316
+▁circumst -6317
+▁intellig -6318
+▁accessible -6319
+▁challenging -6320
+▁perspective -6321
+▁Ira -6322
+▁Low -6323
+▁Want -6324
+letter -6325
+▁bonus -6326
+▁risks -6327
+▁upper -6328
+quality -6329
+▁nearby -6330
+▁pulled -6331
+▁protein -6332
+▁stunning -6333
+▁candidate -6334
+CT -6335
+PR -6336
+▁af -6337
+iece -6338
+ATION -6339
+▁Phys -6340
+▁Italy -6341
+▁stands -6342
+ev -6343
+aze -6344
+claim -6345
+▁Lind -6346
+ington -6347
+▁Beaut -6348
+▁matters -6349
+▁tonight -6350
+▁significantly -6351
+rowse -6352
+▁Nick -6353
+▁laugh -6354
+▁Proper -6355
+▁excess -6356
+▁garlic -6357
+▁univers -6358
+▁witness -6359
+▁approval -6360
+▁medicine -6361
+▁carefully -6362
+sm -6363
+zy -6364
+▁hur -6365
+▁Shop -6366
+▁chapter -6367
+▁complic -6368
+▁joining -6369
+obs -6370
+flow -6371
+oral -6372
+▁Cir -6373
+oured -6374
+▁fulf -6375
+▁equal -6376
+▁kinds -6377
+▁awarded -6378
+▁bedroom -6379
+▁channel -6380
+▁hosting -6381
+▁guidance -6382
+▁vacation -6383
+▁adventure -6384
+▁increases -6385
+▁recording -6386
+▁availability -6387
+▁SU -6388
+▁Dub -6389
+▁Requ -6390
+▁sole -6391
+▁Never -6392
+▁Works -6393
+▁likes -6394
+▁emphas -6395
+▁festival -6396
+▁accessories -6397
+bal -6398
+zer -6399
+▁glad -6400
+▁iron -6401
+▁tall -6402
+▁Heart -6403
+▁loans -6404
+▁Spanish -6405
+UL -6406
+rete -6407
+▁ease -6408
+riends -6409
+▁filed -6410
+▁renew -6411
+clusion -6412
+▁cooper -6413
+▁Republican -6414
+▁exhibition -6415
+▁partnership -6416
+stal -6417
+▁hopes -6418
+▁Credit -6419
+▁Mobile -6420
+▁SE -6421
+▁Rub -6422
+acked -6423
+ether -6424
+folio -6425
+▁bags -6426
+nesota -6427
+orgeous -6428
+▁creates -6429
+▁speaking -6430
+▁lifestyle -6431
+HA -6432
+sen -6433
+you -6434
+▁diss -6435
+▁hang -6436
+▁vend -6437
+▁Connect -6438
+▁Student -6439
+To -6440
+▁) -6441
+▁AR -6442
+adow -6443
+▁unf -6444
+▁legs -6445
+▁occup -6446
+▁Disney -6447
+▁appeal -6448
+▁assets -6449
+▁motion -6450
+▁trends -6451
+▁clothes -6452
+▁context -6453
+▁reporting -6454
+▁replacement -6455
+FC -6456
+yth -6457
+onto -6458
+yard -6459
+agues -6460
+▁Email -6461
+▁spaces -6462
+▁entirely -6463
+▁scholars -6464
+▁constantly -6465
+!" -6466
+anny -6467
+ican -6468
+long -6469
+▁arms -6470
+orders -6471
+▁shift -6472
+▁stamp -6473
+▁forest -6474
+▁Members -6475
+▁certific -6476
+▁searching -6477
+▁sustainable -6478
+▁OS -6479
+irts -6480
+onym -6481
+rition -6482
+▁spark -6483
+▁Number -6484
+▁Taylor -6485
+▁engage -6486
+▁manner -6487
+▁conflic -6488
+▁believes -6489
+▁submitted -6490
+II -6491
+bi -6492
+▁LED -6493
+comes -6494
+eding -6495
+▁kill -6496
+▁luxury -6497
+▁Studies -6498
+▁streets -6499
+▁procedures -6500
+ml -6501
+▁pil -6502
+▁fort -6503
+▁Still -6504
+▁sudden -6505
+▁outstanding -6506
+rid -6507
+▁Rh -6508
+foot -6509
+▁odd -6510
+▁cuts -6511
+▁Field -6512
+▁goods -6513
+▁negot -6514
+▁awards -6515
+▁criminal -6516
+▁monitoring -6517
+▁originally -6518
+▁SC -6519
+▁Kim -6520
+ially -6521
+▁Russian -6522
+▁invited -6523
+▁trained -6524
+▁Southern -6525
+▁millions -6526
+▁seriously -6527
+▁performing -6528
+▁transition -6529
+erts -6530
+ikes -6531
+▁Pot -6532
+▁eleg -6533
+▁weak -6534
+▁walls -6535
+▁recycl -6536
+▁refund -6537
+▁unlike -6538
+▁Arizona -6539
+▁capture -6540
+osc -6541
+asts -6542
+emic -6543
+izer -6544
+▁Pop -6545
+▁dim -6546
+▁rac -6547
+athan -6548
+ented -6549
+▁ille -6550
+▁zone -6551
+▁factor -6552
+▁prompt -6553
+▁reward -6554
+friendly -6555
+PC -6556
+ih -6557
+pat -6558
+bing -6559
+▁mal -6560
+▁Very -6561
+▁entr -6562
+▁horse -6563
+▁quote -6564
+▁museum -6565
+▁Mountain -6566
+Le -6567
+Ph -6568
+ba -6569
+▁Ra -6570
+▁Far -6571
+▁anx -6572
+▁vul -6573
+▁Jersey -6574
+▁conver -6575
+▁relief -6576
+▁illness -6577
+▁fighting -6578
+ATE -6579
+icket -6580
+▁blow -6581
+▁remov -6582
+▁Despite -6583
+▁Seattle -6584
+▁Standard -6585
+▁interests -6586
+▁foundation -6587
+▁cm -6588
+izza -6589
+front -6590
+▁Braz -6591
+▁Kenn -6592
+▁Pract -6593
+▁Should -6594
+▁herself -6595
+▁virtual -6596
+▁younger -6597
+HS -6598
+born -6599
+elry -6600
+▁tip -6601
+▁Easy -6602
+▁Ford -6603
+▁Iraq -6604
+▁moves -6605
+▁pocket -6606
+▁involve -6607
+▁examples -6608
+ani -6609
+rell -6610
+▁rose -6611
+▁smile -6612
+▁pounds -6613
+▁wealth -6614
+▁offices -6615
+▁flexible -6616
+▁Minnesota -6617
+▁transportation -6618
+▁Fre -6619
+▁Ire -6620
+▁Fall -6621
+▁gifts -6622
+▁input -6623
+▁Senior -6624
+▁upload -6625
+▁bathroom -6626
+▁assessment -6627
+▁capabilities -6628
+▁Jr -6629
+▁Ray -6630
+▁Rod -6631
+▁Stat -6632
+▁eggs -6633
+▁hole -6634
+▁pink -6635
+▁directed -6636
+▁identity -6637
+anes -6638
+ifer -6639
+iler -6640
+uter -6641
+▁Luc -6642
+▁Sav -6643
+▁beer -6644
+▁rein -6645
+▁bottle -6646
+▁Finally -6647
+▁airport -6648
+▁founded -6649
+▁clinical -6650
+▁ultimate -6651
+RS -6652
+sey -6653
+▁Army -6654
+▁debut -6655
+aturally -6656
+▁scientific -6657
+At -6658
+▁Ha -6659
+aron -6660
+▁Ask -6661
+▁Jac -6662
+▁sac -6663
+▁Bible -6664
+▁Royal -6665
+▁worst -6666
+illiant -6667
+▁distinct -6668
+▁improving -6669
+car -6670
+ilst -6671
+quir -6672
+▁Est -6673
+▁Kat -6674
+▁Vers -6675
+▁Event -6676
+▁elimin -6677
+▁figures -6678
+▁fishing -6679
+▁forever -6680
+▁copyright -6681
+da -6682
+▁Put -6683
+▁bab -6684
+ashed -6685
+▁Supp -6686
+▁faces -6687
+▁hospit -6688
+▁Country -6689
+▁Software -6690
+▁? -6691
+▁Non -6692
+ingly -6693
+▁garage -6694
+▁Instagram -6695
+▁tie -6696
+arrow -6697
+icate -6698
+▁Come -6699
+▁Site -6700
+▁Again -6701
+▁spoke -6702
+▁rating -6703
+▁Charles -6704
+▁visited -6705
+▁residential -6706
+▁Cab -6707
+ylvan -6708
+▁Arab -6709
+▁Fact -6710
+▁hasn -6711
+▁blank -6712
+▁stone -6713
+aration -6714
+▁entered -6715
+▁objects -6716
+▁rig -6717
+▁split -6718
+▁contribute -6719
+▁Unfortunately -6720
+RI -6721
+awn -6722
+uine -6723
+▁Bed -6724
+▁Dist -6725
+season -6726
+▁liked -6727
+▁spots -6728
+▁murder -6729
+▁Atlanta -6730
+▁developers -6731
+▁implementation -6732
+eah -6733
+With -6734
+▁coc -6735
+▁san -6736
+▁sky -6737
+▁Term -6738
+▁pitc -6739
+cluded -6740
+▁Radio -6741
+▁shower -6742
+▁Looking -6743
+▁Systems -6744
+▁baseball -6745
+▁calendar -6746
+▁Professor -6747
+▁procedure -6748
+oes -6749
+▁Ms -6750
+That -6751
+▁Save -6752
+▁cups -6753
+▁vital -6754
+resents -6755
+▁Member -6756
+▁linked -6757
+▁historical -6758
+▁possibility -6759
+Se -6760
+omy -6761
+umps -6762
+▁Mom -6763
+▁Foot -6764
+▁vibr -6765
+▁pitch -6766
+▁flavor -6767
+▁liquid -6768
+▁drawing -6769
+▁fitness -6770
+▁password -6771
+▁household -6772
+▁programme -6773
+▁atmosphere -6774
+▁reputation -6775
+andy -6776
+hell -6777
+ossible -6778
+▁enroll -6779
+▁papers -6780
+▁recipes -6781
+▁attached -6782
+▁mountain -6783
+▁organized -6784
+▁LA -6785
+▁Pow -6786
+▁hall -6787
+▁soph -6788
+▁tiss -6789
+asters -6790
+▁liber -6791
+▁Having -6792
+▁critic -6793
+▁muscle -6794
+▁talked -6795
+▁Administration -6796
+LY -6797
+One -6798
+host -6799
+▁Sem -6800
+▁Van -6801
+▁empt -6802
+▁seed -6803
+Americ -6804
+▁Brazil -6805
+▁Russia -6806
+▁carbon -6807
+▁passing -6808
+▁privacy -6809
+▁seasons -6810
+▁victims -6811
+▁frequently -6812
+▁institutions -6813
+.' -6814
+MP -6815
+But -6816
+rad -6817
+▁CO -6818
+▁PA -6819
+▁Space -6820
+▁chose -6821
+▁Living -6822
+▁theory -6823
+▁Shipping -6824
+▁MA -6825
+Read -6826
+▁ads -6827
+enger -6828
+ordan -6829
+▁rail -6830
+▁tech -6831
+▁regul -6832
+▁profit -6833
+▁managing -6834
+▁circumstances -6835
+ras -6836
+adel -6837
+tain -6838
+▁Son -6839
+▁Barb -6840
+▁hurt -6841
+▁proven -6842
+▁Justice -6843
+▁historic -6844
+▁networks -6845
+▁permission -6846
+▁legislation -6847
+▁publication -6848
+phy -6849
+▁Ba -6850
+bury -6851
+▁Cru -6852
+▁Cut -6853
+rible -6854
+▁butt -6855
+▁inch -6856
+▁Image -6857
+▁Express -6858
+▁regulations -6859
+dy -6860
+neys -6861
+ucky -6862
+▁err -6863
+uling -6864
+▁counsel -6865
+ta -6866
+ura -6867
+▁BE -6868
+▁Ur -6869
+olis -6870
+▁Fac -6871
+worth -6872
+▁Prom -6873
+▁skill -6874
+unction -6875
+▁Source -6876
+▁debate -6877
+▁Further -6878
+▁exposure -6879
+ubs -6880
+▁($ -6881
+▁Mir -6882
+▁Nic -6883
+▁Tax -6884
+▁cos -6885
+▁west -6886
+▁Garden -6887
+▁tracks -6888
+▁operate -6889
+RL -6890
+nders -6891
+▁Link -6892
+▁Name -6893
+▁lets -6894
+ffered -6895
+▁breath -6896
+▁qualified -6897
+▁represents -6898
+▁Leg -6899
+▁Oak -6900
+▁Brad -6901
+▁delay -6902
+▁finds -6903
+▁Season -6904
+▁walked -6905
+▁technique -6906
+▁NAS -6907
+▁bow -6908
+▁obl -6909
+▁tou -6910
+▁Anth -6911
+uclear -6912
+▁Choose -6913
+▁saving -6914
+▁authors -6915
+▁Learning -6916
+▁contrast -6917
+ella -6918
+ione -6919
+pons -6920
+▁Ltd -6921
+▁lad -6922
+icial -6923
+▁Scot -6924
+▁Brian -6925
+▁normally -6926
+▁realized -6927
+▁authentic -6928
+zes -6929
+urse -6930
+▁Rog -6931
+eller -6932
+▁fifth -6933
+▁merch -6934
+▁sight -6935
+▁tasks -6936
+▁hosted -6937
+▁reader -6938
+▁causing -6939
+▁savings -6940
+▁downtown -6941
+▁instance -6942
+By -6943
+odd -6944
+▁OR -6945
+▁Tony -6946
+▁mold -6947
+▁casual -6948
+▁execut -6949
+igration -6950
+ographic -6951
+▁anticip -6952
+▁justice -6953
+▁promise -6954
+▁somewhere -6955
+▁Professional -6956
+▁architecture -6957
+ingu -6958
+stra -6959
+entle -6960
+▁coat -6961
+▁smell -6962
+▁templ -6963
+ultural -6964
+▁sample -6965
+▁consequ -6966
+▁portion -6967
+▁estimated -6968
+Sc -6969
+idi -6970
+▁Pict -6971
+▁trib -6972
+remony -6973
+▁Labor -6974
+▁agric -6975
+▁trick -6976
+▁coordin -6977
+▁default -6978
+▁sending -6979
+▁upgrade -6980
+▁priority -6981
+▁interpret -6982
+▁surprising -6983
+▁volunteers -6984
+ults -6985
+cknow -6986
+▁batt -6987
+▁soil -6988
+▁mainly -6989
+▁manual -6990
+▁matches -6991
+▁gorgeous -6992
+▁shoulder -6993
+▁certified -6994
+▁apparently -6995
+▁continuing -6996
+▁situations -6997
+law -6998
+▁Es -6999
+▁exec -7000
+▁warn -7001
+arters -7002
+▁Stock -7003
+▁banks -7004
+▁bench -7005
+▁facil -7006
+▁lucky -7007
+ylvania -7008
+▁Golden -7009
+▁planet -7010
+▁posting -7011
+▁immediate -7012
+▁guidelines -7013
+bel -7014
+▁PH -7015
+star -7016
+▁Buy -7017
+▁Hou -7018
+words -7019
+▁Wilson -7020
+▁blocks -7021
+▁Financial -7022
+▁discussed -7023
+owa -7024
+ulf -7025
+ulpt -7026
+▁Mix -7027
+▁Mrs -7028
+▁USB -7029
+class -7030
+▁bear -7031
+▁hate -7032
+earing -7033
+▁firms -7034
+▁shops -7035
+▁Policy -7036
+▁Spirit -7037
+▁drinks -7038
+▁scheme -7039
+▁Customer -7040
+▁Medicine -7041
+▁Lar -7042
+anned -7043
+▁fasc -7044
+ealand -7045
+▁charm -7046
+ogether -7047
+respond -7048
+▁ending -7049
+▁terror -7050
+▁attacks -7051
+▁singles -7052
+▁workshop -7053
+▁Engineering -7054
+▁FA -7055
+iger -7056
+▁Ron -7057
+uster -7058
+▁Stay -7059
+▁magn -7060
+▁Sales -7061
+▁layer -7062
+▁prove -7063
+▁teasp -7064
+▁fairly -7065
+▁vulner -7066
+▁Ireland -7067
+▁external -7068
+nam -7069
+▁Yet -7070
+▁hat -7071
+▁vice -7072
+ingers -7073
+▁aspect -7074
+▁capable -7075
+▁Catholic -7076
+▁retirement -7077
+from -7078
+icit -7079
+unes -7080
+▁Cro -7081
+inder -7082
+▁scan -7083
+bridge -7084
+▁Motor -7085
+▁Order -7086
+▁Phone -7087
+▁stuck -7088
+eration -7089
+▁loving -7090
+▁Toronto -7091
+▁closely -7092
+▁injured -7093
+▁listing -7094
+▁Memorial -7095
+▁clicking -7096
+▁programming -7097
+aping -7098
+▁bare -7099
+▁Linux -7100
+▁climb -7101
+▁saved -7102
+▁orange -7103
+▁Zealand -7104
+▁proceed -7105
+▁believed -7106
+▁listening -7107
+▁industries -7108
+▁destination -7109
+▁Cy -7110
+▁EV -7111
+rich -7112
+▁Exp -7113
+▁wra -7114
+uting -7115
+▁Conf -7116
+▁Eric -7117
+▁juice -7118
+▁casino -7119
+▁breaking -7120
+▁memories -7121
+▁collected -7122
+▁landscape -7123
+SE -7124
+lo -7125
+▁Ca -7126
+▁FL -7127
+alle -7128
+aska -7129
+▁Ram -7130
+otted -7131
+▁Band -7132
+▁Tenn -7133
+▁terr -7134
+angers -7135
+▁reform -7136
+▁strike -7137
+▁Welcome -7138
+▁doctors -7139
+▁Material -7140
+▁enjoying -7141
+▁religious -7142
+▁spiritual -7143
+▁suggested -7144
+ati -7145
+▁MD -7146
+▁OK -7147
+Tube -7148
+aste -7149
+odge -7150
+▁hell -7151
+▁Roman -7152
+▁blend -7153
+▁forth -7154
+▁meets -7155
+▁assign -7156
+▁winners -7157
+▁machines -7158
+▁alongside -7159
+▁relatively -7160
+equ -7161
+ghan -7162
+▁Fox -7163
+▁Ide -7164
+oster -7165
+cludes -7166
+▁index -7167
+faction -7168
+▁riding -7169
+▁choosing -7170
+▁pleasure -7171
+▁strategic -7172
+▁anniversary -7173
+Ad -7174
+gypt -7175
+▁Dur -7176
+▁gym -7177
+child -7178
+imize -7179
+▁Line -7180
+▁yard -7181
+▁Smart -7182
+▁Think -7183
+▁aside -7184
+▁boxes -7185
+▁newly -7186
+▁prize -7187
+▁treatments -7188
+▁celebration -7189
+▁Subsc -7190
+▁bodies -7191
+▁writers -7192
+▁requests -7193
+▁designers -7194
+▁engagement -7195
+bro -7196
+inte -7197
+amber -7198
+▁Dave -7199
+▁east -7200
+▁Davis -7201
+▁Happy -7202
+▁bunch -7203
+▁pharm -7204
+▁belief -7205
+▁covering -7206
+▁extension -7207
+▁performances -7208
+▁WW -7209
+days -7210
+▁Sky -7211
+▁arg -7212
+▁Bang -7213
+▁elev -7214
+▁Camer -7215
+▁buyers -7216
+▁Meanwhile -7217
+▁brilliant -7218
+De -7219
+ls -7220
+agon -7221
+obby -7222
+▁Dar -7223
+▁NFL -7224
+▁Sep -7225
+ormal -7226
+▁enem -7227
+ensity -7228
+giving -7229
+▁birds -7230
+▁broke -7231
+▁giant -7232
+▁proof -7233
+▁franch -7234
+▁division -7235
+nic -7236
+inos -7237
+▁Pak -7238
+ashes -7239
+osophy -7240
+▁Asian -7241
+▁Kevin -7242
+lements -7243
+▁acknow -7244
+▁symbol -7245
+▁titles -7246
+sylvania -7247
+▁packaging -7248
+▁platforms -7249
+▁instrument -7250
+▁differences -7251
+oty -7252
+▁raw -7253
+▁unw -7254
+iders -7255
+ureau -7256
+▁Adam -7257
+▁iPad -7258
+esides -7259
+▁meals -7260
+▁river -7261
+▁compat -7262
+▁enables -7263
+▁drinking -7264
+▁volunteer -7265
+’. -7266
+▁PDF -7267
+inton -7268
+▁mile -7269
+▁slic -7270
+▁solo -7271
+▁superv -7272
+▁letters -7273
+▁authority -7274
+.’ -7275
+wan -7276
+▁PL -7277
+alse -7278
+rage -7279
+wart -7280
+▁pip -7281
+▁Bush -7282
+▁Iran -7283
+lisher -7284
+parent -7285
+▁Story -7286
+▁urban -7287
+ainless -7288
+▁consistent -7289
+pes -7290
+▁Uk -7291
+▁|| -7292
+bles -7293
+wich -7294
+▁kit -7295
+ronics -7296
+▁Chall -7297
+▁Model -7298
+▁centers -7299
+▁charity -7300
+▁typical -7301
+▁explains -7302
+▁replaced -7303
+▁newspaper -7304
+▁communications -7305
+GA -7306
+OVID -7307
+▁rug -7308
+▁acts -7309
+▁lapt -7310
+▁vacc -7311
+▁vast -7312
+ateful -7313
+jection -7314
+▁infect -7315
+▁YouTube -7316
+▁mortgage -7317
+▁CN -7318
+leep -7319
+oker -7320
+▁Jay -7321
+▁stim -7322
+▁tape -7323
+▁trim -7324
+▁tooth -7325
+▁dreams -7326
+▁falling -7327
+▁handling -7328
+▁holidays -7329
+▁swimming -7330
+cons -7331
+iley -7332
+page -7333
+▁stir -7334
+▁Return -7335
+▁decade -7336
+▁domain -7337
+▁singer -7338
+▁Perhaps -7339
+▁destroy -7340
+▁dynamic -7341
+▁lighting -7342
+▁proposal -7343
+▁categories -7344
+▁encouraged -7345
+▁membership -7346
+▁personally -7347
+Fi -7348
+acious -7349
+▁Jason -7350
+▁Jordan -7351
+▁Columbia -7352
+▁forecast -7353
+▁informed -7354
+▁wireless -7355
+▁classroom -7356
+▁accomplish -7357
+▁initiative -7358
+▁suggestions -7359
+▁Po -7360
+▁mut -7361
+erman -7362
+▁Bird -7363
+▁Mill -7364
+▁Swed -7365
+▁slee -7366
+▁susp -7367
+▁Egypt -7368
+▁Staff -7369
+▁Treat -7370
+▁recre -7371
+▁solve -7372
+▁agents -7373
+▁combine -7374
+▁founder -7375
+▁percentage -7376
+▁Advis -7377
+▁Cancer -7378
+▁arrive -7379
+▁headed -7380
+▁expansion -7381
+▁sensitive -7382
+▁manufacturers -7383
+TER -7384
+uis -7385
+athy -7386
+▁Bad -7387
+▁Ess -7388
+▁magic -7389
+▁penal -7390
+▁Agency -7391
+▁Miller -7392
+▁Gallery -7393
+ounce -7394
+▁bars -7395
+▁embr -7396
+▁tied -7397
+▁Being -7398
+▁crash -7399
+▁flash -7400
+▁filter -7401
+▁Classic -7402
+▁Houston -7403
+▁shouldn -7404
+▁Remember -7405
+▁Transport -7406
+▁participating -7407
+▁ast -7408
+▁Talk -7409
+▁dust -7410
+▁Annual -7411
+▁Recent -7412
+▁slowly -7413
+▁Airport -7414
+▁Kingdom -7415
+▁pricing -7416
+▁travell -7417
+▁Northern -7418
+▁enterprise -7419
+ko -7420
+▁Josh -7421
+▁evol -7422
+▁mood -7423
+▁unus -7424
+▁facts -7425
+▁phones -7426
+▁Consult -7427
+▁ancient -7428
+▁presents -7429
+▁printing -7430
+▁Secretary -7431
+▁permanent -7432
+wis -7433
+onna -7434
+level -7435
+▁hire -7436
+amsung -7437
+rovers -7438
+▁Brook -7439
+▁venue -7440
+▁Joseph -7441
+▁gender -7442
+▁extract -7443
+▁intense -7444
+ervations -7445
+▁Pennsylvania -7446
+▁DI -7447
+..... -7448
+abeth -7449
+▁Base -7450
+▁assum -7451
+▁dealing -7452
+▁gallery -7453
+▁genuine -7454
+▁portfolio -7455
+▁enforcement -7456
+FA -7457
+esy -7458
+site -7459
+▁suc -7460
+igate -7461
+uties -7462
+▁Film -7463
+▁gall -7464
+ership -7465
+▁Level -7466
+▁roles -7467
+ologist -7468
+▁Create -7469
+▁watched -7470
+▁producing -7471
+▁IC -7472
+lers -7473
+wear -7474
+▁Dam -7475
+asted -7476
+mates -7477
+▁fest -7478
+making -7479
+▁scenes -7480
+▁constit -7481
+▁carrying -7482
+▁suffered -7483
+▁traveling -7484
+▁attractive -7485
+OD -7486
+Tr -7487
+▁Own -7488
+▁Sea -7489
+iking -7490
+oices -7491
+▁Webs -7492
+▁vari -7493
+ardens -7494
+▁Grant -7495
+ulating -7496
+▁Silver -7497
+▁border -7498
+▁assault -7499
+▁Continue -7500
+▁generate -7501
+▁assistant -7502
+▁Collection -7503
+▁guaranteed -7504
+▁recommendations -7505
+Do -7506
+axy -7507
+bar -7508
+pir -7509
+Book -7510
+▁Sym -7511
+▁Stan -7512
+▁trig -7513
+▁wins -7514
+▁Books -7515
+▁absor -7516
+▁stake -7517
+▁Studio -7518
+▁Quality -7519
+▁chances -7520
+▁Personal -7521
+▁equipped -7522
+▁Ter -7523
+Press -7524
+books -7525
+active -7526
+▁grass -7527
+▁opens -7528
+▁solar -7529
+inating -7530
+▁compens -7531
+▁heading -7532
+▁Everyone -7533
+▁diseases -7534
+▁reducing -7535
+▁Hollywood -7536
+▁languages -7537
+▁professor -7538
+▁incredibly -7539
+boy -7540
+▁rh -7541
+aine -7542
+ilty -7543
+raid -7544
+burgh -7545
+▁Fred -7546
+▁actor -7547
+▁formed -7548
+▁Eastern -7549
+▁booking -7550
+▁podcast -7551
+▁speaker -7552
+▁Experience -7553
+▁interactive -7554
+SC -7555
+Te -7556
+rm -7557
+amel -7558
+▁hel -7559
+▁anyway -7560
+▁lawyer -7561
+▁neighb -7562
+▁cookies -7563
+▁Magazine -7564
+▁Therefore -7565
+acc -7566
+ila -7567
+▁CL -7568
+▁Deb -7569
+asant -7570
+ctive -7571
+▁Bern -7572
+▁lect -7573
+▁Force -7574
+▁Henry -7575
+▁Would -7576
+▁formal -7577
+▁string -7578
+▁filling -7579
+▁Products -7580
+▁purchasing -7581
+▁connections -7582
+alo -7583
+run -7584
+▁Gi -7585
+etch -7586
+game -7587
+phia -7588
+shire -7589
+▁narr -7590
+▁alive -7591
+▁pride -7592
+graduate -7593
+▁preferred -7594
+▁Hi -7595
+ials -7596
+▁Ath -7597
+▁Hun -7598
+▁Mov -7599
+stein -7600
+▁Clin -7601
+▁Emer -7602
+▁Guard -7603
+▁Major -7604
+▁phase -7605
+▁limits -7606
+▁marked -7607
+▁writes -7608
+▁defined -7609
+▁deposit -7610
+▁visible -7611
+▁suggests -7612
+oto -7613
+swe -7614
+roke -7615
+▁Tel -7616
+▁Kids -7617
+▁seats -7618
+▁shell -7619
+▁accused -7620
+▁aggress -7621
+▁expressed -7622
+▁basketball -7623
+Fr -7624
+▁EN -7625
+onic -7626
+allas -7627
+▁bact -7628
+lessly -7629
+▁empty -7630
+▁Estate -7631
+▁hotels -7632
+▁nights -7633
+▁racing -7634
+▁Comment -7635
+▁jewelry -7636
+▁substant -7637
+▁primarily -7638
+esh -7639
+imp -7640
+▁CP -7641
+bell -7642
+▁bid -7643
+▁gay -7644
+utter -7645
+▁Past -7646
+▁aims -7647
+▁lady -7648
+▁habit -7649
+▁Father -7650
+▁Histor -7651
+▁Mother -7652
+▁Things -7653
+▁rental -7654
+▁shapes -7655
+▁weapons -7656
+itionally -7657
+▁accuracy -7658
+▁resulting -7659
+▁creativity -7660
+▁specialist -7661
+▁vegetables -7662
+AV -7663
+▁oz -7664
+ogue -7665
+▁Has -7666
+▁lie -7667
+ifies -7668
+inity -7669
+▁cycl -7670
+intend -7671
+▁Based -7672
+▁bills -7673
+limited -7674
+▁remark -7675
+▁rising -7676
+▁engaged -7677
+▁instant -7678
+▁organis -7679
+▁politics -7680
+▁Published -7681
+▁recognition -7682
+ns -7683
+hour -7684
+▁Las -7685
+inois -7686
+uters -7687
+▁Give -7688
+▁Iowa -7689
+▁Marc -7690
+▁Tele -7691
+abetes -7692
+▁Vegas -7693
+▁criteria -7694
+▁suffering -7695
+▁compliance -7696
+essee -7697
+▁rice -7698
+▁marks -7699
+adelphia -7700
+▁Officer -7701
+▁compare -7702
+▁desired -7703
+▁component -7704
+▁highlights -7705
+▁TR -7706
+uana -7707
+▁tub -7708
+oween -7709
+▁dism -7710
+▁Prime -7711
+▁brush -7712
+▁Kansas -7713
+▁dollar -7714
+▁Britain -7715
+▁crucial -7716
+▁graphic -7717
+▁recover -7718
+▁achieved -7719
+▁literally -7720
+▁interviews -7721
+jo -7722
+igs -7723
+lee -7724
+▁Ap -7725
+greg -7726
+▁Map -7727
+▁tap -7728
+▁Fast -7729
+▁HERE -7730
+▁duty -7731
+makers -7732
+▁Among -7733
+▁Steel -7734
+▁knock -7735
+▁healing -7736
+▁illegal -7737
+▁admitted -7738
+▁describe -7739
+▁entering -7740
+▁releases -7741
+▁speakers -7742
+▁Solutions -7743
+▁functional -7744
+des -7745
+▁pra -7746
+▁Roll -7747
+▁Cover -7748
+▁Kelly -7749
+athered -7750
+▁intent -7751
+▁Edition -7752
+▁massage -7753
+▁packages -7754
+▁Following -7755
+▁attending -7756
+▁obviously -7757
+li -7758
+uan -7759
+▁EX -7760
+mers -7761
+▁Meth -7762
+▁keys -7763
+▁heads -7764
+holders -7765
+▁Change -7766
+▁Orange -7767
+▁matching -7768
+▁displayed -7769
+▁recognize -7770
+▁wondering -7771
+▁correspond -7772
+isa -7773
+▁CC -7774
+▁IM -7775
+Cont -7776
+orous -7777
+▁Diego -7778
+▁dough -7779
+▁trips -7780
+▁signal -7781
+▁developer -7782
+▁exceptional -7783
+▁increasingly -7784
+%. -7785
+ja -7786
+htt -7787
+▁Ros -7788
+athon -7789
+heast -7790
+▁Dead -7791
+▁puts -7792
+▁till -7793
+▁Nation -7794
+▁alumin -7795
+▁struck -7796
+novation -7797
+▁claimed -7798
+▁farmers -7799
+▁hitting -7800
+▁whenever -7801
+▁officially -7802
+▁introduction -7803
+pson -7804
+▁Isl -7805
+found -7806
+▁Auto -7807
+▁Body -7808
+▁king -7809
+▁mand -7810
+inding -7811
+▁Table -7812
+▁Forest -7813
+▁Valent -7814
+▁narrow -7815
+▁colours -7816
+▁Attorney -7817
+▁networking -7818
+▁necessarily -7819
+▁improvements -7820
+tail -7821
+▁bug -7822
+▁clar -7823
+▁Civil -7824
+utional -7825
+▁hidden -7826
+▁Theatre -7827
+▁texture -7828
+▁checking -7829
+▁constant -7830
+▁licensed -7831
+▁Cry -7832
+▁cust -7833
+▁root -7834
+ickets -7835
+terior -7836
+▁Youth -7837
+▁loose -7838
+▁setup -7839
+▁acting -7840
+▁Chapter -7841
+▁Reading -7842
+▁occurred -7843
+▁struggling -7844
+TP -7845
+tw -7846
+AND -7847
+▁ -7848
+e -7849
+t -7850
+a -7851
+o -7852
+i -7853
+n -7854
+s -7855
+r -7856
+h -7857
+l -7858
+d -7859
+c -7860
+u -7861
+m -7862
+p -7863
+g -7864
+f -7865
+y -7866
+w -7867
+b -7868
+. -7869
+v -7870
+, -7871
+k -7872
+T -7873
+I -7874
+S -7875
+A -7876
+- -7877
+C -7878
+0 -7879
+1 -7880
+M -7881
+P -7882
+B -7883
+x -7884
+2 -7885
+W -7886
+D -7887
+R -7888
+E -7889
+H -7890
+F -7891
+L -7892
+O -7893
+N -7894
+’ -7895
+' -7896
+: -7897
+G -7898
+j -7899
+) -7900
+3 -7901
+( -7902
+z -7903
+5 -7904
+q -7905
+" -7906
+U -7907
+4 -7908
+J -7909
+9 -7910
+6 -7911
+8 -7912
+V -7913
+Y -7914
+K -7915
+7 -7916
+! -7917
+| -7918
+/ -7919
+? -7920
+“ -7921
+” -7922
+; -7923
+– -7924
+& -7925
+$ -7926
+— -7927
+Q -7928
+X -7929
+% -7930
+Z -7931
diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/requirements.txt b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/requirements.txt
new file mode 100644
index 0000000000..0c5eedce7b
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/requirements.txt
@@ -0,0 +1,10 @@
+numpy
+tqdm
+torch==2.10
+huggingface-hub
+kernels
+setuptools
+typing-extensions==4.15.0
+datasets
+tiktoken
+sentencepiece
\ No newline at end of file
diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/run_cuda_binary.sh b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/run_cuda_binary.sh
new file mode 100644
index 0000000000..473b3388e3
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/run_cuda_binary.sh
@@ -0,0 +1,72 @@
+RUN_ID=pushing_run_binary_1 \
+DATA_PATH=./data/datasets/fineweb10B_sp8192 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
+ATTN_PROJ_TYPE=standard \
+LOGIT_HEAD_TYPE=standard \
+TVERSKY_MEMBERSHIP=sigmoid \
+TVERSKY_NUM_FEATURES=0 \
+TVERSKY_FEATURE_POOLS=0 \
+VOCAB_SIZE=8192 \
+BITNET_GROUP_SIZE=128 \
+BIGRAM_HASH=0 \
+EMBED_DIM=254 \
+TRAINING_DEPTH_RECURRENCE=0 \
+EVAL_DEPTH_RECURRENCE=0 \
+NUM_LAYERS=15 \
+MODEL_DIM=768 \
+NUM_KV_HEADS=4 \
+NUM_HEADS=8 \
+DIFF_ATTN=0 \
+MLP_MULT=4 \
+MLP_GROUPS=0 \
+MATRIX_OPTIMIZER=muon \
+ADAM_LR=0.05 \
+ADAM_WD=0.05 \
+MUON_BACKEND_STEPS=3 \
+MUON_MOMENTUM=0.95 \
+MUON_MOMENTUM_WARMUP_START=0.85 \
+MUON_MOMENTUM_WARMUP_STEPS=500 \
+MUON_WD=0.0 \
+MATRIX_LR=0.04 \
+SCALAR_LR=0.02 \
+TIED_EMBED_LR=0.02 \
+WARMDOWN_FRACTION=0.2 \
+LOGIT_SOFTCAP=10 \
+QK_GAIN_INIT=2.25 \
+ROPE_TYPE=yarn \
+YARN_MAX_LEN=2048 \
+ROPE_BASE=5000 \
+BATCH_TOKENS_START=0 \
+BATCH_SCHEDULE_FRACTION=0.33 \
+TRAIN_BATCH_TOKENS=524288 \
+SEQ_LEN_START=0 \
+SEQ_SCHEDULE_FRACTION=0.0 \
+TRAIN_SEQ_LEN=1024 \
+SMEAR=1 \
+ITERATIONS=50000 \
+WARMUP_STEPS=5 \
+MAX_WALLCLOCK_SECONDS=0 \
+VAL_LOSS_EVERY=0 \
+TRAIN_LOG_EVERY=500 \
+CHURN_LOG_EVERY=1000 \
+VAL_MAX_TOKENS=0 \
+TIE_EMBEDDINGS=1 \
+UNTIE_AT_FRACTION=0.00 \
+HEAD_LR=0.02 \
+CORR_WEIGHT_LR=0.02 \
+ACTIVATION=relu2 \
+SOFTCAP_TYPE=poly \
+MTP_HEADS=0 \
+REFINER=0 \
+REFINER_KERNEL=3 \
+SLIDING_EVAL=1 \
+SLIDING_EVAL_STRIDE=16 \
+SLIDING_BATCH_SIZE=256 \
+TEMP_SCALING=1 \
+FP_STORAGE=FP8 \
+EMA=0 \
+EMA_DECAY=0.995 \
+EMA_START_FRACTION=0.5 \
+SEED=42 \
+COMPILE_MODE=default \
+OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 train_gpt_cuda_binary.py
diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/setup.sh b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/setup.sh
new file mode 100644
index 0000000000..93f1c41fea
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/setup.sh
@@ -0,0 +1,143 @@
+#!/bin/bash
+# -------------------------------------------------------------------------------
+# Parameter Golf -- Complete Environment Setup Script
+# Drop this into the project root and run: bash setup.sh
+# -------------------------------------------------------------------------------
+
+set -e
+
+echo "----------------------------------------------"
+echo " Parameter Golf -- Environment Setup"
+echo "----------------------------------------------"
+
+# -------------------------------------------------------------------------------
+# 1. Miniconda
+# -------------------------------------------------------------------------------
+echo ""
+echo "[1/5] Miniconda..."
+
+if [ -d "$HOME/miniconda3" ]; then
+ echo " Already installed -- skipping."
+else
+ wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
+ bash /tmp/miniconda.sh -b
+ rm /tmp/miniconda.sh
+ ~/miniconda3/bin/conda init bash
+ echo " Installed."
+fi
+
+export PATH="$HOME/miniconda3/bin:$PATH"
+source ~/miniconda3/etc/profile.d/conda.sh
+
+echo " Accepting conda TOS..."
+~/miniconda3/bin/conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
+~/miniconda3/bin/conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
+echo " TOS accepted."
+
+# -------------------------------------------------------------------------------
+# 2. Python Environment
+# -------------------------------------------------------------------------------
+echo ""
+echo "[2/5] Python 3.13 environment..."
+
+if conda env list | grep -q "^golf "; then
+ echo " Environment 'golf' already exists -- skipping."
+else
+ conda create -n golf python=3.13 -y
+ echo " Created."
+fi
+
+conda activate golf
+echo " Activated."
+
+# -------------------------------------------------------------------------------
+# 3. Requirements
+# -------------------------------------------------------------------------------
+echo ""
+echo "[3/5] Requirements..."
+
+if python3 -c "import torch, sentencepiece, numpy" 2>/dev/null; then
+ echo " Core packages already installed -- skipping."
+else
+ pip install --upgrade pip -q
+ pip install -r requirements.txt -q
+ echo " Installed."
+fi
+
+# -------------------------------------------------------------------------------
+# 4. FlashAttention-3
+# -------------------------------------------------------------------------------
+echo ""
+echo "[4/5] FlashAttention-3..."
+
+if python3 -c "import flash_attn" 2>/dev/null || python3 -c "import flash_attn_interface" 2>/dev/null; then
+ echo " Already installed -- skipping."
+else
+ # abi3 wheel -- Python 3.9+ compatible, installs in seconds, no compilation
+ pip install --no-cache-dir "https://download.pytorch.org/whl/cu128/flash_attn_3-3.0.0-cp39-abi3-manylinux_2_28_x86_64.whl"
+ echo " Installed."
+fi
+
+# -------------------------------------------------------------------------------
+# 5. Dataset
+# -------------------------------------------------------------------------------
+echo ""
+echo "[5/5] FineWeb dataset (sp8192, 10 shards)..."
+
+echo " Downloading... ($TRAIN_COUNT/10 train shards found)"
+hf download sproos/parameter-golf-tokenizers --include "datasets/fineweb10B_sp8192/*" --local-dir ./data
+echo " Downloaded."
+
+# -------------------------------------------------------------------------------
+# Verification
+# -------------------------------------------------------------------------------
+echo ""
+echo "----------------------------------------------"
+echo " Verification"
+echo "----------------------------------------------"
+
+python3 - << 'EOF'
+import sys
+import torch
+import numpy as np
+import glob
+
+print(f"Python : {sys.version.split()[0]}")
+print(f"PyTorch : {torch.__version__}")
+print(f"CUDA : {torch.cuda.is_available()}")
+print(f"GPUs : {torch.cuda.device_count()}")
+
+if torch.cuda.is_available():
+ for i in range(torch.cuda.device_count()):
+ props = torch.cuda.get_device_properties(i)
+ print(f" GPU {i} : {props.name} ({props.total_memory // 1024**3}GB)")
+
+try:
+ import flash_attn
+ print(f"FlashAttn : {flash_attn.__version__}")
+except ImportError:
+ try:
+ import flash_attn_interface
+ print(f"FlashAttn3 : available")
+ except ImportError:
+ print(f"FlashAttn : NOT found")
+
+train_files = sorted(glob.glob("./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin"))
+val_files = sorted(glob.glob("./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin"))
+print(f"Train shards : {len(train_files)}")
+print(f"Val shards : {len(val_files)}")
+
+if val_files:
+    total = sum(
+        int(np.fromfile(f, dtype="<i4", count=256)[2])  # header word 2 = token count (assumed shard layout)
+        for f in val_files
+    )
+    print(f"Val tokens   : {total:,}")
+EOF
diff --git a/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/train_gpt_cuda_binary.py b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/train_gpt_cuda_binary.py
new file mode 100644
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/train_gpt_cuda_binary.py
+import os
+import glob
+from pathlib import Path
+
+import numpy as np
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+import torch.nn.functional as F
+from torch import Tensor
+
+try:
+    from flash_attn_interface import flash_attn_func  # FlashAttention-3
+except ImportError:
+    from flash_attn import flash_attn_func
+
+# ---------------------------------------------------------------------------
+# Binary quantization (sign bits packed 8 per byte)
+# ---------------------------------------------------------------------------
+def pack_binary(q: Tensor) -> tuple[bytes, int]:
+ bits = ((q.reshape(-1).to(torch.int8) + 1) // 2).numpy().astype(np.uint8)
+ n = len(bits)
+ pad = (8 - n % 8) % 8
+ if pad:
+ bits = np.concatenate([bits, np.zeros(pad, dtype=np.uint8)])
+ groups = bits.reshape(-1, 8)
+ packed = np.zeros(len(groups), dtype=np.uint8)
+ for i in range(8):
+ packed |= groups[:, i] << i
+ return packed.tobytes(), n
+
+def unpack_binary(data: bytes, n: int) -> Tensor:
+ packed = np.frombuffer(data, dtype=np.uint8)
+ bits = np.zeros((len(packed), 8), dtype=np.int8)
+ for i in range(8):
+ bits[:, i] = (packed >> i) & 1
+ flat = bits.reshape(-1)[:n]
+ return torch.from_numpy(flat.astype(np.int8) * 2 - 1)
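+
+# Round-trip sketch: one sign bit per weight, packed LSB-first within bytes.
+#   q = torch.where(torch.randn(4, 128) >= 0, 1, -1).to(torch.int8)
+#   data, n = pack_binary(q)                     # 512 signs -> 64 bytes
+#   assert torch.equal(unpack_binary(data, n), q.reshape(-1))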
+
+# ---------------------------------------------------------------------------
+# FP4 quantization (per-row absmax, 2 values packed per byte)
+# ---------------------------------------------------------------------------
+def quantize_to_int4(t: Tensor) -> tuple[Tensor, Tensor, list]:
+ t32 = t.float()
+ orig_shape = t32.shape
+ if t32.ndim < 2:
+ t32 = t32.unsqueeze(0)
+ absmax = t32.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
+ scale = absmax / 7.0
+ q = torch.clamp(torch.round(t32 / scale), -7, 7).to(torch.int8)
+ flat = q.reshape(-1)
+ if flat.numel() % 2 != 0:
+ flat = F.pad(flat, (0, 1))
+ low = (flat[0::2] + 8).to(torch.uint8)
+ high = (flat[1::2] + 8).to(torch.uint8)
+ return low | (high << 4), scale.half().squeeze(-1), list(orig_shape)
+
+def dequantize_from_int4(packed: Tensor, scale: Tensor, shape: list) -> Tensor:
+ low = (packed & 0x0F).to(torch.int8) - 8
+ high = ((packed >> 4) & 0x0F).to(torch.int8) - 8
+ flat = torch.zeros(packed.numel() * 2, dtype=torch.int8)
+ flat[0::2] = low
+ flat[1::2] = high
+ numel = 1
+ for s in shape:
+ numel *= s
+ flat = flat[:numel].float()
+ if len(shape) <= 1:
+ return (flat * scale.float().squeeze()).reshape(shape)
+ return (flat.reshape(-1, shape[-1]) * scale.float().unsqueeze(-1)).reshape(shape)
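+
+# FP4 round-trip sketch: int4 codes with a per-row fp16 absmax/7 scale, so the
+# worst-case per-element reconstruction error is roughly absmax/14 (half a step).
+#   w = torch.randn(16, 64)
+#   packed, scale, shape = quantize_to_int4(w)
+#   w_hat = dequantize_from_int4(packed, scale, shape)
+#   assert (w - w_hat).abs().max() < w.abs().max() / 14 + 1e-2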
+
+# ---------------------------------------------------------------------------
+# State dict serialization (binary + fp16/fp8/fp4)
+# ---------------------------------------------------------------------------
+def q_sd(state_dict: dict, group_size: int = 64, fp_storage=False, binary_override_names: set | None = None) -> tuple[dict, dict]:
+ "Binary for large 2D weight matrices, fp16/fp8/fp4 for everything else."
+ quantized = {}
+ stats = {"binary_params": 0, "binary_bytes": 0, "fp_params": 0, "fp_bytes": 0}
+ for name, tensor in state_dict.items():
+ if "mtp_heads" in name:
+ continue
+ t = tensor.detach().cpu().float().contiguous()
+ t_orig_shape = list(t.shape)
+ if t.ndim == 3:
+ t = t.reshape(t.shape[0], -1)
+        is_binary_candidate = (
+            t.ndim == 2 and t.numel() > 65_536
+            and "tok_emb" not in name and "lm_head" not in name
+            and "embed_proj" not in name and "bigram_emb" not in name
+            and "lm_head_correction" not in name
+            and "lm_head_U" not in name and "lm_head_V" not in name
+            and "prototypes" not in name and "tversky" not in name
+        ) or (binary_override_names is not None and name in binary_override_names)
+ if is_binary_candidate:
+ pad = (group_size - t.shape[1] % group_size) % group_size
+ t_padded = F.pad(t, (0, pad)) if pad > 0 else t
+ t_grouped = t_padded.reshape(-1, group_size)
+ scale = t_grouped.abs().mean(-1, keepdim=True).clamp(min=1e-8).half().float()
+ q = torch.where(t_grouped >= 0,
+ torch.ones_like(t_grouped, dtype=torch.int8),
+ -torch.ones_like(t_grouped, dtype=torch.int8))
+ packed_bytes, n_bits = pack_binary(q)
+ quantized[name] = {
+ "type": "binary", "packed": packed_bytes,
+ "scale": scale.half().squeeze(-1),
+ "shape": list(t.shape), "padded_cols": t_padded.shape[1],
+ "group_size": group_size, "n_bits": n_bits,
+ "orig_shape": t_orig_shape,
+ }
+ stats["binary_params"] += t.numel()
+ stats["binary_bytes"] += len(packed_bytes) + scale.numel() * 2
+ elif fp_storage == "fp4" and t.ndim == 2:
+ packed, scale, orig_shape = quantize_to_int4(t)
+ quantized[name] = {"type": "fp4", "packed": packed, "scale": scale, "shape": orig_shape}
+ stats["fp_params"] += t.numel()
+ stats["fp_bytes"] += packed.numel() + scale.numel() * 2
+ elif fp_storage and t.ndim == 2:
+ quantized[name] = {"type": "fp8", "data": t.to(torch.float8_e4m3fn)}
+ stats["fp_params"] += t.numel()
+ stats["fp_bytes"] += t.numel()
+ else:
+ quantized[name] = {"type": "fp16", "data": t.half()}
+ stats["fp_params"] += t.numel()
+ stats["fp_bytes"] += t.numel() * 2
+ return quantized, stats
+
+def deq_sd(quantized: dict, target_dtype=torch.bfloat16):
+ "Reconstruct full-precision state dict from quantized representation."
+ out = {}
+ for name, entry in quantized.items():
+ if entry["type"] == "binary":
+ q = unpack_binary(entry["packed"], entry["n_bits"])
+ q = q.float().reshape(-1, entry["group_size"])
+ scale = entry["scale"].float().unsqueeze(-1)
+ # No shrinkage correction needed: binary has no zeros, q.abs().mean() == 1.0 always
+ t = (q * scale).reshape(-1, entry["padded_cols"])
+ shape = entry["shape"]
+ result = t[:shape[0], :shape[1]].to(target_dtype)
+ orig = entry.get("orig_shape")
+ out[name] = result.reshape(orig).contiguous() if orig and orig != shape else result.contiguous()
+ elif entry["type"] == "fp8":
+ out[name] = entry["data"].to(torch.float32).to(target_dtype).contiguous()
+ elif entry["type"] == "fp4":
+ out[name] = dequantize_from_int4(entry["packed"], entry["scale"], entry["shape"]).to(target_dtype).contiguous()
+ else:
+ out[name] = entry["data"].to(target_dtype).contiguous()
+ return out
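+
+# Artifact round-trip sketch (group size per BITNET_GROUP_SIZE=128 in the run
+# script; MTP_HEADS=0, so the skipped mtp_heads entries don't matter here):
+#   qd, stats = q_sd(model.state_dict(), group_size=128, fp_storage="fp8")
+#   model.load_state_dict(deq_sd(qd, target_dtype=torch.bfloat16))
+#   # stats["binary_bytes"] + stats["fp_bytes"] approximates the artifact size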
+
+# ---------------------------------------------------------------------------
+# Binary diagnostics (logged during training)
+# ---------------------------------------------------------------------------
+_prev_committed: dict = {}
+def churn_fn(model: nn.Module, group_size: int = 64):
+ global _prev_committed
+ total = flipped = 0
+ with torch.no_grad():
+ for name, p in model.named_parameters():
+ if p.ndim == 2 and ("weight" in name or "prototypes" in name) and p.shape[0] > 1:
+ w = p.detach().float().reshape(-1, group_size)
+ q = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w)).cpu().numpy()
+ if name in _prev_committed:
+ flipped += int(np.sum(q != _prev_committed[name]))
+ total += q.size
+ _prev_committed[name] = q
+ return flipped / max(total, 1)
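+
+# Diagnostic only: churn_fn returns the fraction of committed {-1, +1} signs
+# that flipped since the previous call (logged every CHURN_LOG_EVERY=1000
+# steps); it should decay toward zero as the binary weights converge.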
+
+# ---------------------------------------------------------------------------
+# Muon optimizer (Newton-Schulz orthogonalized momentum)
+# ---------------------------------------------------------------------------
+def ns_orth(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor:
+ a, b, c = (3.4445, -4.7750, 2.0315)
+ X = G.bfloat16()
+ X /= X.norm() + eps
+ transposed = G.size(0) > G.size(1)
+ if transposed:
+ X = X.T
+ for _ in range(steps):
+ A = X @ X.T
+ B = b * A + c * A @ A
+ X = a * X + B @ X
+ return X.T if transposed else X
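+
+# ns_orth approximates the orthogonal polar factor of G: a few quintic
+# Newton-Schulz iterations drive X @ X.T toward the identity, so the momentum
+# update fed to each matrix has roughly uniform singular values.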
+
+class Muon(torch.optim.Optimizer):
+ def __init__(self, params, lr: float, momentum: float, backend_steps: int, nesterov: bool = True, wd: float = 0.0):
+ super().__init__(params, dict(lr=lr, momentum=momentum, backend_steps=backend_steps, nesterov=nesterov, wd=wd))
+ @torch.no_grad()
+ def step(self, closure=None):
+ loss = None
+ if closure is not None:
+ with torch.enable_grad():
+ loss = closure()
+ distributed = dist.is_available() and dist.is_initialized()
+ world_size = dist.get_world_size() if distributed else 1
+ rank = dist.get_rank() if distributed else 0
+ for group in self.param_groups:
+ params = group["params"]
+ if not params:
+ continue
+ lr, momentum = group["lr"], group["momentum"]
+ backend_steps, nesterov = group["backend_steps"], group["nesterov"]
+ total_params = sum(int(p.numel()) for p in params)
+ updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
+ curr = 0
+ for i, p in enumerate(params):
+ if i % world_size == rank and p.grad is not None:
+ g = p.grad
+ state = self.state[p]
+ if "momentum_buffer" not in state:
+ state["momentum_buffer"] = torch.zeros_like(g)
+ buf = state["momentum_buffer"]
+ buf.mul_(momentum).add_(g)
+ if nesterov:
+ g = g.add(buf, alpha=momentum)
+ g = F.rms_norm(g.float(), (g.size(-1),)).bfloat16()
+ g = ns_orth(g, steps=backend_steps)
+ g *= max(1, g.size(0) / g.size(1)) ** 0.5
+ updates_flat[curr:curr + p.numel()] = g.reshape(-1)
+ curr += p.numel()
+ if distributed:
+ dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM)
+ wd = group.get("wd", 0.0)
+ curr = 0
+ for p in params:
+ g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype)
+ if wd > 0:
+ p.mul_(1 - lr * wd)
+ p.add_(g, alpha=-lr)
+ curr += p.numel()
+ return loss
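+
+# Wiring sketch (hypothetical variable names; hyperparameters taken from
+# run_cuda_binary.sh): 2D matrices go to Muon, everything else to AdamW.
+#   matrix_params = [p for p in model.parameters() if p.ndim == 2]
+#   other_params  = [p for p in model.parameters() if p.ndim < 2]
+#   opt_m = Muon(matrix_params, lr=0.04, momentum=0.95, backend_steps=3)
+#   opt_s = torch.optim.AdamW(other_params, lr=0.05, weight_decay=0.05)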
+
+# ---------------------------------------------------------------------------
+# Data loading
+# ---------------------------------------------------------------------------
+def ld_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    tokens = np.fromfile(file, dtype=np.uint16, offset=header_bytes)  # assumed uint16 token storage
+    return torch.from_numpy(tokens.astype(np.int32))
+
+class TokenStream:
+    "Streams tokens sequentially across shard files, wrapping at the end."
+    def __init__(self, pattern: str):
+        self.files = sorted(glob.glob(pattern))
+        assert self.files, f"No files: {pattern}"
+        self.file_idx = 0
+        self.tokens = ld_shard(Path(self.files[0]))
+        self.pos = 0
+    def _advance_file(self):
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        self.tokens = ld_shard(Path(self.files[self.file_idx]))
+        self.pos = 0
+    def take(self, n: int) -> Tensor:
+ chunks = []
+ remaining = n
+ while remaining > 0:
+ avail = self.tokens.numel() - self.pos
+ if avail <= 0:
+ self._advance_file()
+ continue
+ k = min(remaining, avail)
+ chunks.append(self.tokens[self.pos:self.pos + k])
+ self.pos += k
+ remaining -= k
+ return chunks[0] if len(chunks) == 1 else torch.cat(chunks)
+
+class DistributedTokenLoader:
+ def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device):
+ self.rank, self.world_size, self.device = rank, world_size, device
+ self.stream = TokenStream(pattern)
+ def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]:
+ local_tokens = global_tokens // (self.world_size * grad_accum_steps)
+ per_rank_span = local_tokens + 1
+ chunk = self.stream.take(per_rank_span * self.world_size)
+ start = self.rank * per_rank_span
+ local = chunk[start:start + per_rank_span].pin_memory().to(self.device, non_blocking=True).to(torch.int64)
+ x = local[:-1].reshape(-1, seq_len)
+ y = local[1:].reshape(-1, seq_len)
+ return x, y
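+
+# Throughput sketch: with TRAIN_BATCH_TOKENS=524288 on 8 ranks and no gradient
+# accumulation, each rank takes a disjoint 65,537-token span per step and
+# reshapes it into 64 sequences of TRAIN_SEQ_LEN=1024.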
+# ---------------------------------------------------------------------------
+# Model
+# ---------------------------------------------------------------------------
+class RMSNorm(nn.Module):
+ def __init__(self, eps: float | None = None):
+ super().__init__()
+ self.eps = eps
+ def forward(self, x: Tensor) -> Tensor:
+ return F.rms_norm(x, (x.size(-1),), eps=self.eps)
+
+def apply_qat_ste(w: Tensor, fp_storage: str | bool) -> Tensor:
+ """Applies Straight-Through Estimator (STE) for FP4 or FP8 simulated quantization."""
+ if not fp_storage:
+ return w
+ if fp_storage == "fp4":
+ absmax = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
+ scale = absmax / 7.0
+ q = torch.clamp(torch.round(w / scale), -7.0, 7.0)
+ w_sim = q * scale
+ return (w_sim - w).detach() + w
+ elif fp_storage is True or fp_storage == "fp8":
+ w_sim = w.to(torch.float8_e4m3fn).to(w.dtype)
+ return (w_sim - w).detach() + w
+ return w
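+
+# STE sketch: the forward pass sees fp8/fp4-rounded weights while the gradient
+# passes through as identity, matching the storage format of the non-binary
+# tensors in the artifact.
+#   w = torch.randn(4, 8, requires_grad=True)
+#   apply_qat_ste(w, "fp8").sum().backward()
+#   assert torch.allclose(w.grad, torch.ones_like(w))  # identity gradient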
+
+class QATLinear(nn.Linear):
+ def __init__(self, in_features: int, out_features: int, bias: bool = False, fp_storage: str | bool = False):
+ super().__init__(in_features, out_features, bias=bias)
+ self.fp_storage = fp_storage
+ def forward(self, x: Tensor) -> Tensor:
+ w_qat = apply_qat_ste(self.weight, self.fp_storage)
+ return F.linear(x, w_qat.to(x.dtype), self.bias.to(x.dtype) if self.bias is not None else None)
+
+class QATEmbedding(nn.Embedding):
+ def __init__(self, num_embeddings: int, embedding_dim: int, fp_storage: str | bool = False):
+ super().__init__(num_embeddings, embedding_dim)
+ self.fp_storage = fp_storage
+ def forward(self, input: Tensor) -> Tensor:
+ w_qat = apply_qat_ste(self.weight, self.fp_storage)
+ return F.embedding(input, w_qat, self.padding_idx, self.max_norm,
+ self.norm_type, self.scale_grad_by_freq, self.sparse)
+
+class BinaryLinear(nn.Linear):
+ def __init__(self, in_features, out_features, bias=False, group_size=64):
+ super().__init__(in_features, out_features, bias=bias)
+ self.group_size = group_size
+ def forward(self, x: Tensor) -> Tensor:
+ w = self.weight.bfloat16()
+ g = self.group_size
+ w_g = w.reshape(-1, g)
+ scale = w_g.abs().mean(-1, keepdim=True).clamp(min=1e-8)
+ q = torch.where(w_g >= 0, torch.ones_like(w_g), -torch.ones_like(w_g))
+ w_binary = w + ((q * scale).reshape(w.shape) - w).detach()
+ return F.linear(x, w_binary,
+ self.bias.to(x.dtype) if self.bias is not None else None)
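+
+# Binary STE sketch: the forward matmul uses per-group ±absmean weights, while
+# gradients flow to the latent full-precision weights underneath.
+#   lin = BinaryLinear(768, 768, group_size=128)
+#   lin(torch.randn(2, 4, 768).bfloat16()).sum().backward()
+#   assert lin.weight.grad is not None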
+
+class NormedBinaryLinear(BinaryLinear):
+ "Binary linear with RMSNorm on input — for output projections receiving un-normalized activations."
+ def forward(self, x: Tensor) -> Tensor:
+ return super().forward(F.rms_norm(x, (x.size(-1),)))
+
+class GroupedBinaryLinear(nn.Module):
+ "Grouped linear with binary STE. Weight stored as 2D [groups*group_out, group_in] for binary quantization compatibility."
+ def __init__(self, in_features, out_features, groups=4, group_size=64, normed=False):
+ super().__init__()
+ assert in_features % groups == 0 and out_features % groups == 0
+ self.groups = groups
+ self.group_in = in_features // groups
+ self.group_out = out_features // groups
+ self.group_size = group_size
+ self.normed = normed
+ self.weight = nn.Parameter(torch.randn(groups * self.group_out, self.group_in) * 0.02)
+ def forward(self, x: Tensor) -> Tensor:
+ if self.normed:
+ x = F.rms_norm(x, (x.size(-1),))
+ w = self.weight.bfloat16()
+ g = self.group_size
+ w_g = w.reshape(-1, g)
+ scale = w_g.abs().mean(-1, keepdim=True).clamp(min=1e-8)
+ q = torch.where(w_g >= 0, torch.ones_like(w_g), -torch.ones_like(w_g))
+ w_binary = w + ((q * scale).reshape(w.shape) - w).detach()
+ w_grouped = w_binary.reshape(self.groups, self.group_out, self.group_in)
+ bsz = x.shape[:-1]
+ x_g = x.reshape(*bsz, self.groups, self.group_in)
+ out = torch.einsum('...gi,goi->...go', x_g, w_grouped)
+ return out.reshape(*bsz, self.groups * self.group_out)
+
+class TverskyProjection(nn.Module):
+ "Tversky similarity: S = θ·f(A∩B) - α·f(A\\B) - β·f(B\\A). Three modes."
+ def __init__(self, in_features: int, out_features: int, num_features: int = 16,
+ group_size: int = 64, use_shared_features: bool = False,
+ membership: str = "sigmoid"):
+ super().__init__()
+ self.group_size = group_size
+ self.num_features = num_features
+ self.membership_type = membership
+ self.no_features_mode = (num_features == 0)
+ if not self.no_features_mode and not use_shared_features:
+ self.features = nn.Parameter(torch.empty(num_features, in_features).uniform_(-0.02, 0.02))
+ else:
+ self.register_parameter('features', None)
+ self.prototypes = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.02, 0.02))
+ self.theta = nn.Parameter(torch.tensor(1.0))
+ self.alpha = nn.Parameter(torch.tensor(0.5))
+ self.beta = nn.Parameter(torch.tensor(0.5))
+
+ def _binary_ste(self, w: Tensor) -> Tensor:
+ w_bf16 = w.bfloat16()
+ g = self.group_size
+ w_grouped = w_bf16.reshape(-1, g)
+ scale = w_grouped.abs().mean(-1, keepdim=True).clamp(min=1e-8)
+ q = torch.where(w_grouped >= 0, torch.ones_like(w_grouped), -torch.ones_like(w_grouped))
+ w_binary = w_bf16 + ((q * scale).reshape(w_bf16.shape) - w_bf16).detach()
+ return w_binary.reshape(w.shape)
+
+ def _membership(self, t: Tensor) -> Tensor:
+ if self.membership_type == "poly":
+ return torch.clamp(t * 5.0 / 4.0 + 0.5, 0.0, 1.0)
+ elif self.membership_type == "tanh":
+ return (torch.tanh(t * 5.0) + 1.0) * 0.5
+ else:
+ return torch.sigmoid(t * 5.0)
+
+ def forward(self, x: Tensor, shared_features: Tensor | None = None) -> Tensor:
+ proto = self._binary_ste(self.prototypes)
+ if self.no_features_mode:
+ x_f = x @ proto.t()
+ p_norm = F.normalize(proto, dim=-1)
+ p_f = p_norm @ p_norm.t()
+ else:
+ feat = (shared_features if shared_features is not None else self.features).float()
+ x_f = x @ feat.t()
+ p_f = proto @ feat.t()
+ x_s = self._membership(x_f)
+ p_s = self._membership(p_f)
+ x_a = x_f * x_s
+ p_a = p_f * p_s
+ t, a, b = self.theta.abs(), self.alpha.abs(), self.beta.abs()
+ return t * (x_a @ p_a.t()) - a * (x_a @ (1 - p_s).t()) - b * ((1 - x_s) @ p_a.t())
+
+def restore_low_dim_params_to_fp32(module: nn.Module) -> None:
+ with torch.no_grad():
+ for name, param in module.named_parameters():
+ if (param.ndim < 2 or any(p in name for p in CTP)) and param.dtype != torch.float32:
+ param.data = param.data.float()
+
+class Rotary(nn.Module):
+ def __init__(self, dim: int, base: float = 10000.0, no_cache: bool = False,
+ rope_type: str = "rope", yarn_max_len: int = 4096, train_seq_len: int = 1024):
+ super().__init__()
+ self.no_cache = no_cache
+ inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+ if rope_type == "yarn":
+ scale = train_seq_len / yarn_max_len
+ freq_idx = torch.arange(0, dim, 2, dtype=torch.float32)
+ ramp = torch.clamp((freq_idx / dim - 0.25) / 0.75, 0.0, 1.0)
+ inv_freq = inv_freq / (ramp * (1.0 / scale - 1.0) + 1.0)
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
+ self._seq_len_cached = 0
+ self._cos_cached: Tensor | None = None
+ self._sin_cached: Tensor | None = None
+ def forward(self, seq_len, device, dtype):
+ if self.no_cache:
+ t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+ freqs = torch.outer(t, self.inv_freq.to(device))
+ return freqs.cos()[None, :, None, :].to(dtype=dtype), freqs.sin()[None, :, None, :].to(dtype=dtype)
+ if (
+ self._cos_cached is None
+ or self._sin_cached is None
+ or self._seq_len_cached != seq_len
+ or self._cos_cached.device != device
+ ):
+ t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+ freqs = torch.outer(t, self.inv_freq.to(device))
+ self._cos_cached = freqs.cos()[None, :, None, :]
+ self._sin_cached = freqs.sin()[None, :, None, :]
+ self._seq_len_cached = seq_len
+ return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype)
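+
+# YaRN sketch: with s = TRAIN_SEQ_LEN / YARN_MAX_LEN = 1024/2048 = 0.5, each
+# inverse frequency is divided by ramp * (1/s - 1) + 1, where ramp rises from
+# 0 to 1 across dimension indices [0.25 * dim, dim]. High-frequency bands
+# (ramp = 0) keep their training-time rotation; low-frequency bands (ramp = 1)
+# are stretched by 1/s = 2x, extending usable context toward YARN_MAX_LEN.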
+
+def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor:
+ half = x.size(-1) // 2
+ x1, x2 = x[..., :half], x[..., half:]
+ return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1)
+
+class CausalSelfAttention(nn.Module):
+ def __init__(self, dim, num_heads, num_kv_heads, rope_base, qk_gain_init,
+ group_size=64, attn_proj_type="standard", tversky_num_features=16,
+ tversky_feature_pools=0, no_cache=False, rope_type="rope",
+ yarn_max_len=4096, train_seq_len=1024, tversky_membership="sigmoid",
+ diff_attn=False):
+ super().__init__()
+ self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
+ self.head_dim = dim // num_heads
+ self.diff_attn = diff_attn
+ self.q_size = self.num_heads * self.head_dim
+ self.kv_size = self.num_kv_heads * self.head_dim
+ self.c_qkv = BinaryLinear(dim, self.q_size + 2 * self.kv_size, bias=False, group_size=group_size)
+ self.proj = NormedBinaryLinear(dim, dim, bias=False, group_size=group_size) if attn_proj_type != "tversky" else None
+ if self.proj is not None:
+ self.proj._zero_init = True
+ self.tversky_proj = TverskyProjection(
+ dim, dim, num_features=tversky_num_features, group_size=group_size,
+ use_shared_features=(tversky_feature_pools > 0),
+ membership=tversky_membership,
+ ) if attn_proj_type == "tversky" else None
+ self.shared_features = None
+ self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
+ if diff_attn:
+ self.diff_lambda = nn.Parameter(torch.full((num_heads,), 0.5, dtype=torch.float32))
+ self.rotary = Rotary(self.head_dim, base=rope_base, no_cache=no_cache,
+ rope_type=rope_type, yarn_max_len=yarn_max_len,
+ train_seq_len=train_seq_len)
+ def forward(self, x: Tensor) -> Tensor:
+ bsz, seqlen, dim = x.shape
+ qkv_out = self.c_qkv(x)
+ q_out, k_out, v_out = qkv_out.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
+ q = q_out.reshape(bsz, seqlen, self.num_heads, self.head_dim)
+ k = k_out.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim)
+ v = v_out.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim)
+ q, k = F.rms_norm(q, (q.size(-1),)), F.rms_norm(k, (k.size(-1),))
+ cos, sin = self.rotary(seqlen, x.device, q.dtype)
+ q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin)
+ q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None]
+ if self.diff_attn:
+ half = self.head_dim // 2
+ q1, q2 = q[..., :half], q[..., half:]
+ k1, k2 = k[..., :half], k[..., half:]
+ v1, v2 = v[..., :half], v[..., half:]
+ y1 = flash_attn_func(q1.contiguous(), k1.contiguous(), v1.contiguous(), causal=True)
+ y2 = flash_attn_func(q2.contiguous(), k2.contiguous(), v2.contiguous(), causal=True)
+ lam = self.diff_lambda.to(dtype=y1.dtype)[None, None, :, None]
+ y = torch.cat([y1 - lam * y2, y1 + lam * y2], dim=-1)
+ else:
+ y = flash_attn_func(
+ q.contiguous(),
+ k.contiguous(),
+ v.contiguous(),
+ causal=True
+ )
+ y = y.reshape(bsz, seqlen, dim)
+ return self.tversky_proj(y, self.shared_features) if self.tversky_proj is not None else self.proj(y)
+
+class MLP(nn.Module):
+ def __init__(self, dim, mlp_mult, group_size=64, activation="swiglu", mlp_groups=0):
+ super().__init__()
+ hidden = mlp_mult * dim
+ self.activation = activation
+ if mlp_groups > 0:
+ if activation == "swiglu":
+ self.gate_up = GroupedBinaryLinear(dim, hidden * 2, groups=mlp_groups, group_size=group_size)
+ else:
+ self.fc = GroupedBinaryLinear(dim, hidden, groups=mlp_groups, group_size=group_size)
+ self.proj = GroupedBinaryLinear(hidden, dim, groups=mlp_groups, group_size=group_size, normed=True)
+ else:
+ if activation == "swiglu":
+ self.gate_up = BinaryLinear(dim, hidden * 2, bias=False, group_size=group_size)
+ else:
+ self.fc = BinaryLinear(dim, hidden, bias=False, group_size=group_size)
+ self.proj = NormedBinaryLinear(hidden, dim, bias=False, group_size=group_size)
+ self.proj._zero_init = True
+ def forward(self, x: Tensor) -> Tensor:
+ if self.activation == "swiglu":
+ gu = self.gate_up(x)
+ gate, up = gu.chunk(2, dim=-1)
+ return self.proj(F.silu(gate) * up)
+ elif self.activation == "relu":
+ return self.proj(torch.relu(self.fc(x)))
+ elif self.activation == "leaky_relu":
+ return self.proj(F.leaky_relu(self.fc(x), negative_slope=0.01))
+ else: # relu2
+ return self.proj(torch.relu(self.fc(x)).square())
+
+class SmearModule(nn.Module):
+ def __init__(self, dim: int):
+ super().__init__()
+ self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32))
+ def forward(self, x: Tensor) -> Tensor:
+ cumsum = x.cumsum(dim=1)
+ counts = torch.arange(1, x.size(1) + 1, device=x.device, dtype=x.dtype).view(1, -1, 1)
+ smeared = cumsum / counts
+ gate = torch.tanh(self.gate.to(dtype=x.dtype))
+ return x + gate * (smeared - x)
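+
+# Identity-at-init sketch: the gate is zero-initialized, so tanh(gate) = 0 and
+# the module starts as a no-op; training learns a per-channel blend between
+# each position and its causal running mean.
+#   sm = SmearModule(768)
+#   x = torch.randn(2, 8, 768)
+#   assert torch.allclose(sm(x), x)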
+
+class CausalConvRefiner(nn.Module):
+ "Causal Conv1d that refines hidden states using local n-gram context."
+ def __init__(self, dim: int, kernel_size: int = 3):
+ super().__init__()
+ self.kernel_size = kernel_size
+ self.conv = nn.Conv1d(dim, dim, kernel_size, padding=0, bias=False)
+ self.gate = nn.Parameter(torch.zeros(1, dtype=torch.float32))
+ def forward(self, x: Tensor) -> Tensor:
+ h = x.permute(0, 2, 1)
+ h = F.pad(h, (self.kernel_size - 1, 0))
+ h = self.conv(h)
+ h = h.permute(0, 2, 1)
+ return x + torch.tanh(self.gate.to(dtype=x.dtype)) * F.rms_norm(h, (h.size(-1),))
+
+class Block(nn.Module):
+ def __init__(self, dim: int, num_heads: int, num_kv_heads: int, mlp_mult: int,
+ rope_base: float, qk_gain_init: float, group_size: int=64,
+ activation: str="swiglu", attn_proj_type: str="standard",
+ tversky_num_features: int=16, tversky_feature_pools: int=0, no_cache: bool=False,
+ smear: bool=False, rope_type: str="rope", yarn_max_len: int=4096,
+ train_seq_len: int=1024, tversky_membership: str="sigmoid",
+ diff_attn: bool=False, mlp_groups: int=0):
+ super().__init__()
+ self.attn_norm = RMSNorm()
+ self.mlp_norm = RMSNorm()
+ self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init,
+ group_size, attn_proj_type, tversky_num_features,
+ tversky_feature_pools, no_cache, rope_type, yarn_max_len,
+ train_seq_len, tversky_membership, diff_attn)
+ self.mlp = MLP(dim, mlp_mult, group_size, activation, mlp_groups)
+ self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+ self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+ self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+ self.smear = SmearModule(dim) if smear else None
+ def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+ mix = self.resid_mix.to(dtype=x.dtype)
+ x = mix[0] * x + mix[1] * x0
+ n = self.attn_norm(x)
+ x = x + self.attn_scale.to(dtype=x.dtype) * self.attn(n)
+ x = x + self.mlp_scale.to(dtype=x.dtype) * self.mlp(self.mlp_norm(x))
+ if self.smear is not None:
+ x = self.smear(x)
+ return x
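+
+# Residual-mix sketch: resid_mix initializes to (1, 0) per channel, so each
+# block starts as a plain residual stream; training can re-inject the input
+# embedding x0 channel-wise (the "per-block residual mix" from the README).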
+
+class GPT(nn.Module):
+ def __init__(self, vocab_size, num_layers, model_dim, num_heads, num_kv_heads, mlp_mult,
+ tie_embeddings, tied_embed_init_std, logit_softcap, rope_base, qk_gain_init,
+ group_size: int = 64, activation: str = "swiglu", mtp_heads_count: int = 0,
+ embed_dim: int = 0, attn_proj_type: str = "standard", logit_head_type: str = "standard",
+ tversky_num_features: int = 16, tversky_feature_pools: int = 0,
+ training_depth_recurrence: int=1, fp_storage=False, bigram_hash: bool=False,
+ softcap_type: str="poly", no_cache: bool=False,
+ smear: bool=False, rope_type: str="rope", yarn_max_len: int=4096,
+ train_seq_len: int=1024, tversky_membership: str="sigmoid",
+ diff_attn=False, mlp_groups=0, refiner=False, refiner_kernel=3):
+ super().__init__()
+ self.training_depth_recurrence = training_depth_recurrence
+ self.fp_storage = fp_storage
+ self.tie_embeddings = tie_embeddings
+ self.logit_softcap = logit_softcap
+ self.softcap_type = softcap_type
+ self.embed_dim = embed_dim if embed_dim > 0 else model_dim
+ self.tok_emb = QATEmbedding(vocab_size, self.embed_dim, fp_storage=fp_storage)
+ self.bigram_emb = QATEmbedding(vocab_size, self.embed_dim, fp_storage=fp_storage) if bigram_hash else None
+ if self.bigram_emb is not None:
+ nn.init.zeros_(self.bigram_emb.weight)
+ self.lm_head_correction = nn.Parameter(
+ torch.zeros(vocab_size, self.embed_dim)) if tie_embeddings == 2 else None
+ self.embed_proj = QATLinear(self.embed_dim, model_dim, bias=False, fp_storage=fp_storage) if self.embed_dim != model_dim else None
+ self.embed_proj_rev = QATLinear(model_dim, self.embed_dim, bias=False, fp_storage=fp_storage) if (
+ self.embed_dim != model_dim and logit_head_type != "tversky") else None
+ self.num_encoder_layers = num_layers // 2
+ self.num_decoder_layers = num_layers - self.num_encoder_layers
+ self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
+ self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
+ # Shared Tversky feature pools (if enabled and num_features > 0)
+ if attn_proj_type == "tversky" and tversky_feature_pools > 0 and tversky_num_features > 0:
+ self.tversky_feature_pools_list = nn.ParameterList([
+ nn.Parameter(torch.empty(tversky_num_features, model_dim).uniform_(-0.02, 0.02))
+ for _ in range(tversky_feature_pools)
+ ])
+ else:
+ self.tversky_feature_pools_list = None
+ self.blocks = nn.ModuleList([
+ Block(model_dim, num_heads, num_kv_heads, mlp_mult, rope_base, qk_gain_init,
+ group_size, activation, attn_proj_type, tversky_num_features, tversky_feature_pools,
+ no_cache, smear, rope_type, yarn_max_len, train_seq_len, tversky_membership,
+ diff_attn, mlp_groups)
+ for _ in range(num_layers)
+ ])
+ # Inject shared feature pool references into attention layers
+ if self.tversky_feature_pools_list is not None:
+ for i, block in enumerate(self.blocks):
+ pool_idx = (i * tversky_feature_pools) // num_layers
+ block.attn.shared_features = self.tversky_feature_pools_list[pool_idx]
+ self.final_norm = RMSNorm()
+ self.refiner = CausalConvRefiner(model_dim, kernel_size=refiner_kernel) if refiner else None
+ self.mtp_heads = nn.ModuleList([
+ nn.Linear(model_dim, vocab_size, bias=False) for _ in range(mtp_heads_count)
+ ])
+ for h in self.mtp_heads:
+ nn.init.zeros_(h.weight)
+ self.logit_head_type = logit_head_type
+ if logit_head_type == "tversky" and tversky_num_features == 0 and vocab_size > 1024:
+ raise ValueError(
+ f"Tversky logit head with no-features mode creates O(V^2) = {vocab_size}x{vocab_size} "
+ f"matrix per forward pass. Use tversky_num_features > 0 or a smaller vocab."
+ )
+ self.tversky_head = TverskyProjection(
+ model_dim, vocab_size, num_features=tversky_num_features,
+ membership=tversky_membership,
+ ) if logit_head_type == "tversky" else None
+ self.lm_head = QATLinear(model_dim, vocab_size, bias=False, fp_storage=fp_storage)
+ self.lm_head._zero_init = True
+ if self.lm_head is not None and (tie_embeddings or logit_head_type == "tversky"):
+ self.lm_head.weight.requires_grad_(False)
+ self.vocab_bias = nn.Parameter(torch.zeros(vocab_size, dtype=torch.float32))
+ self._init_weights(tied_embed_init_std)
+ def _init_weights(self, tied_embed_init_std: float) -> None:
+ if self.tie_embeddings:
+ nn.init.normal_(self.tok_emb.weight, mean=0.0, std=tied_embed_init_std)
+ for module in self.modules():
+ if isinstance(module, BinaryLinear) and not getattr(module, "_zero_init", False):
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
+ elif isinstance(module, nn.Linear) and getattr(module, "_zero_init", False):
+ nn.init.zeros_(module.weight)
+ def _compute_logits(self, x: Tensor) -> Tensor:
+ if self.tversky_head is not None:
+ logits_raw = self.tversky_head(x)
+ elif self.tie_embeddings:
+ if self.embed_proj_rev is not None:
+ proj = self.embed_proj_rev(x)
+ else:
+ proj = x
+ weight = self.tok_emb.weight
+ if self.lm_head_correction is not None:
+ weight = weight + self.lm_head_correction
+ logits_raw = F.linear(proj, weight.to(x.dtype))
+ else:
+ logits_raw = self.lm_head(x)
+ return logits_raw + self.vocab_bias.to(x.dtype)
+ def _softcap(self, logits: Tensor) -> Tensor:
+ s = self.logit_softcap
+ if self.softcap_type == "tanh":
+ return s * torch.tanh(logits / s)
+ x_sc = torch.clamp(logits / s, -2.0, 2.0)
+ x2 = x_sc * x_sc
+ return s * torch.clamp(x_sc * (1.0 - x2 / 3.0 + x2 * x2 / 15.0), -1.0, 1.0)
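+    # (_softcap) Poly5 sketch: an odd degree-5 polynomial s*x*(1 - x^2/3 +
+    # x^4/15) on x = clamp(logits/s, -2, 2); tanh-like and approximately
+    # linear near zero, with the outer clamp hard-saturating logits at
+    # +/- s (LOGIT_SOFTCAP=10) while staying cheaper than tanh.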
+ def forward(self, input_ids: Tensor, target_ids: Tensor, reduction: str = "mean", temperature: float = 1.0) -> Tensor:
+ x = self.tok_emb(input_ids).float()
+ if self.bigram_emb is not None:
+ prev = F.pad(input_ids[:, :-1], (1, 0), value=0)
+ x = x + self.bigram_emb(prev).float()
+ if self.embed_proj is not None:
+ x = self.embed_proj(x)
+ x = F.rms_norm(x, (x.size(-1),))
+ x0 = x
+ # U-Net style encoder/decoder with skip connections
+ skips = []
+ for i in range(self.num_encoder_layers):
+ for _ in range(max(1, self.training_depth_recurrence)):
+ x = self.blocks[i](x, x0)
+ skips.append(x)
+ for i in range(self.num_decoder_layers):
+ bi = self.num_encoder_layers + i
+ if skips:
+ x = x + self.skip_weights[i].to(dtype=x.dtype) * skips.pop()
+ for _ in range(max(1, self.training_depth_recurrence)):
+ x = self.blocks[bi](x, x0)
+ x_normed = self.final_norm(x)
+ if self.refiner is not None:
+ x_normed = self.refiner(x_normed)
+ # Standard training/eval path
+ x_flat = x_normed.reshape(-1, x_normed.size(-1))
+ targets = target_ids.reshape(-1)
+ logits = self._softcap(self._compute_logits(x_flat))
+ if reduction == "none":
+ return F.cross_entropy(logits.float(), targets, reduction="none").reshape(input_ids.shape)
+ # Fused CE + Z-loss: single logsumexp computation
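+ # The z-loss term 1e-4*(lse**2) pulls logsumexp toward 0, stabilising the logit
+ # scale; reusing the lse already computed for the CE makes it essentially free.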
+ logits_f = logits.float()
+ lse = torch.logsumexp(logits_f, dim=-1)
+ target_logits = logits_f.gather(1, targets.unsqueeze(1)).squeeze(1)
+ main_loss = (lse - target_logits).mean() + 1e-4 * (lse ** 2).mean()
+ # Multi-token prediction auxiliary loss (training only): head k predicts the target k+2 positions beyond the standard next-token target; head losses are averaged and mixed in at weight 0.1
+ if self.training and len(self.mtp_heads) > 0:
+ mtp_loss = torch.zeros((), device=main_loss.device)
+ for k, head in enumerate(self.mtp_heads):
+ shift = k + 2
+ if target_ids.shape[1] > shift:
+ mtp_tgt = target_ids[:, shift:].reshape(-1)
+ mtp_in = x_normed[:, :target_ids.shape[1] - shift, :].reshape(-1, x_normed.shape[-1])
+ mtp_loss = mtp_loss + F.cross_entropy(head(mtp_in).float(), mtp_tgt, reduction="mean")
+ main_loss = main_loss + 0.1 * mtp_loss / len(self.mtp_heads)
+ return main_loss
+
+# ---------------------------------------------------------------------------
+# Validation
+# ---------------------------------------------------------------------------
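+# Byte-accounting LUTs for bpb: each target token costs the UTF-8 byte length of
+# its piece, plus one byte for a "\u2581" (SentencePiece space) prefix unless the
+# previous token is a boundary (control/unk/unused) token; byte-fallback tokens
+# count as exactly 1 byte.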
+def build_luts(sp, vocab_size: int, device: torch.device):
+ sp_vocab_size = int(sp.vocab_size())
+ table_size = max(sp_vocab_size, vocab_size)
+ base_bytes_np = np.zeros((table_size,), dtype=np.int16)
+ has_leading_space_np = np.zeros((table_size,), dtype=np.bool_)
+ is_boundary_token_np = np.ones((table_size,), dtype=np.bool_)
+ for token_id in range(sp_vocab_size):
+ if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
+ continue
+ is_boundary_token_np[token_id] = False
+ if sp.is_byte(token_id):
+ base_bytes_np[token_id] = 1
+ continue
+ piece = sp.id_to_piece(token_id)
+ if piece.startswith("\u2581"):
+ has_leading_space_np[token_id] = True
+ piece = piece[1:]
+ base_bytes_np[token_id] = len(piece.encode("utf-8"))
+ return (
+ torch.tensor(base_bytes_np, dtype=torch.int16, device=device),
+ torch.tensor(has_leading_space_np, dtype=torch.bool, device=device),
+ torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device),
+ )
+
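+# Load validation shards, optionally truncate to VAL_MAX_TOKENS, and trim to a
+# whole number of seq_len sequences (plus one token for the shifted targets).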
+def ld_val(pattern, seq_len, max_tok=int(os.environ.get("VAL_MAX_TOKENS", 500000))):
+ files = sorted(glob.glob(pattern))
+ assert files, f"No files: {pattern}"
+ tok = torch.cat([ld_shard(Path(p)) for p in files]).contiguous()
+ if max_tok > 0: tok = tok[:max_tok + 1]
+ u = ((tok.numel() - 1) // seq_len) * seq_len
+ return tok[:u + 1]
+
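+# Chunked validation: contiguous sequences are sharded across ranks, then loss,
+# token and byte counts are all-reduced. bpb = (nats/token / ln 2) * (tokens/bytes),
+# i.e. bits per token rescaled by the measured tokens-per-byte ratio.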
+def eval_val(args, model, rank, world_size, device, grad_accum_steps, val_tokens,
+ base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, temperature: float = 1.0):
+ local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps)
+ local_batch_seqs = max(1, local_batch_tokens // args.train_seq_len)
+ total_seqs = (val_tokens.numel() - 1) // args.train_seq_len
+ seq_start = (total_seqs * rank) // world_size
+ seq_end = (total_seqs * (rank + 1)) // world_size
+ loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+ token_count = torch.zeros((), device=device, dtype=torch.float64)
+ byte_count = torch.zeros((), device=device, dtype=torch.float64)
+ model.eval()
+ with torch.inference_mode():
+ for batch_start in range(seq_start, seq_end, local_batch_seqs):
+ batch_end = min(batch_start + local_batch_seqs, seq_end)
+ raw_start = batch_start * args.train_seq_len
+ raw_end = batch_end * args.train_seq_len + 1
+ local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64)
+ x, y = local[:-1].reshape(-1, args.train_seq_len), local[1:].reshape(-1, args.train_seq_len)
+ with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+ batch_loss = model(x, y, temperature=temperature).detach()
+ n = float(y.numel())
+ loss_sum += batch_loss.to(torch.float64) * n
+ token_count += n
+ prev_ids, tgt_ids = x.reshape(-1), y.reshape(-1)
+ tok_bytes = base_bytes_lut[tgt_ids].to(torch.int16)
+ tok_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(torch.int16)
+ byte_count += tok_bytes.to(torch.float64).sum()
+ if dist.is_available() and dist.is_initialized():
+ for t in (loss_sum, token_count, byte_count):
+ dist.all_reduce(t, op=dist.ReduceOp.SUM)
+ val_loss = loss_sum / token_count
+ bpb = (val_loss.item() / math.log(2.0)) * (token_count.item() / byte_count.item())
+ model.train()
+ return float(val_loss.item()), float(bpb)
+
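+# Sliding-window eval: windows advance by `stride` tokens and only the final
+# `stride` positions of each window are scored, each with at least
+# seq_len - stride tokens of left context; the first window is scored in full.
+# Smaller strides give more context per scored token at higher eval cost.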
+def eval_val_sliding(args, model, rank, world_size, device, grad_accum_steps, val_tokens,
+ base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+ stride: int = 64, temperature: float = 1.0):
+ seq_len = args.train_seq_len
+ batch_size = args.sliding_batch_size
+ total_tokens = val_tokens.numel() - 1
+ loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+ token_count = torch.zeros((), device=device, dtype=torch.float64)
+ byte_count = torch.zeros((), device=device, dtype=torch.float64)
+ all_starts = list(range(0, total_tokens - seq_len, stride))
+ my_starts = all_starts[rank::world_size]
+ model.eval()
+ with torch.inference_mode():
+ for i in range(0, len(my_starts), batch_size):
+ batch_starts = my_starts[i:i + batch_size]
+ starts_t = torch.tensor(batch_starts, dtype=torch.int64)
+ offsets = torch.arange(seq_len + 1, dtype=torch.int64)
+ indices = starts_t.unsqueeze(1) + offsets.unsqueeze(0)
+ local_batch = val_tokens[indices].to(device=device, dtype=torch.int64, non_blocking=True)
+ x = local_batch[:, :-1]
+ y = local_batch[:, 1:]
+ with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+ per_token_loss = model(x, y, reduction="none", temperature=temperature).detach()
+ for b, start in enumerate(batch_starts):
+ score_from = 0 if start == 0 else seq_len - stride
+ scored = per_token_loss[b, score_from:]
+ sx, sy = x[b, score_from:], y[b, score_from:]
+ loss_sum += scored.to(torch.float64).sum()
+ token_count += scored.numel()
+ tok_bytes = base_bytes_lut[sy].to(torch.int16)
+ tok_bytes += (has_leading_space_lut[sy] & ~is_boundary_token_lut[sx]).to(torch.int16)
+ byte_count += tok_bytes.to(torch.float64).sum()
+ if dist.is_available() and dist.is_initialized():
+ for t in (loss_sum, token_count, byte_count):
+ dist.all_reduce(t, op=dist.ReduceOp.SUM)
+ val_loss = loss_sum / token_count
+ bpb = (val_loss.item() / math.log(2.0)) * (token_count.item() / byte_count.item())
+ model.train()
+ return float(val_loss.item()), float(bpb)
+
+# ---------------------------------------------------------------------------
+# Temperature scaling
+# ---------------------------------------------------------------------------
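+# Post-hoc temperature scaling: grid-search T on calibration tokens and keep the
+# value with the lowest CE. Dividing logits by T < 1 sharpens the predictive
+# distribution, which can lower bpb for an under-confident model.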
+def find_temp(args, base_model, rank, world_size, device, grad_accum_steps,
+ calibration_tokens, base_bytes_lut, has_leading_space_lut,
+ is_boundary_token_lut):
+ best_t, best_loss = 1.0, float("inf")
+ for t in [0.90, 0.95, 1.00, 1.05, 1.10]:
+ loss, _ = eval_val(args, base_model, rank, world_size, device, grad_accum_steps,
+ calibration_tokens, base_bytes_lut, has_leading_space_lut,
+ is_boundary_token_lut, temperature=t)
+ if loss < best_loss:
+ best_loss = loss
+ best_t = t
+ return best_t
+
+# ---------------------------------------------------------------------------
+# Training
+# ---------------------------------------------------------------------------
+def main() -> None:
+ args = Hyperparameters()
+ code = Path(__file__).read_text(encoding="utf-8")
+ if args.matrix_optimizer != "adamw":
+ global ns_orth
+ ns_orth = torch.compile(ns_orth)
+ distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
+ rank = int(os.environ.get("RANK", "0"))
+ world_size = int(os.environ.get("WORLD_SIZE", "1"))
+ local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+ grad_accum_steps = max(1, 8 // world_size)
+ grad_scale = 1.0 / grad_accum_steps
+ if not torch.cuda.is_available():
+ raise RuntimeError("CUDA is required")
+ device = torch.device("cuda", local_rank)
+ torch.cuda.set_device(device)
+ if distributed:
+ dist.init_process_group(backend="nccl", device_id=device)
+ dist.barrier()
+ master_process = rank == 0
+ torch.backends.cuda.matmul.allow_tf32 = True
+ torch.backends.cudnn.allow_tf32 = True
+ os.makedirs("logs/cuda/", exist_ok=True)
+ logfile = f"logs/cuda/{args.run_id}.txt" if master_process else None
+ if master_process:
+ print(logfile)
+ def log0(msg: str, console: bool = True) -> None:
+ if not master_process:
+ return
+ if console:
+ print(msg)
+ if logfile:
+ with open(logfile, "a", encoding="utf-8") as f:
+ print(msg, file=f)
+ log0(code, console=False)
+ log0("=" * 100, console=False)
+ log0(f"Python {sys.version}", console=False)
+ log0(f"PyTorch {torch.__version__}", console=False)
+ random.seed(args.seed)
+ np.random.seed(args.seed)
+ torch.manual_seed(args.seed)
+ torch.cuda.manual_seed_all(args.seed)
+ sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+ val_tokens = ld_val(args.val_files, args.train_seq_len)
+ base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_luts(
+ sp, args.vocab_size, device)
+
+ # --- Model ---
+ base_model = GPT(
+ vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim,
+ num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult,
+ tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std,
+ logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init,
+ group_size=args.bitnet_group_size, activation=args.activation_type, mtp_heads_count=args.mtp_heads_count,
+ embed_dim=args.embed_dim, attn_proj_type=args.attn_proj_type, logit_head_type=args.logit_head_type,
+ tversky_num_features=args.tversky_num_features, tversky_feature_pools=args.tversky_feature_pools,
+ training_depth_recurrence=args.training_depth_recurrence, fp_storage=args.fp_storage,
+ bigram_hash=args.bigram_hash, softcap_type=args.softcap_type, no_cache=(args.compile_mode == "reduce-overhead"),
+ smear=args.smear, rope_type=args.rope_type, yarn_max_len=args.yarn_max_len, train_seq_len=args.train_seq_len,
+ tversky_membership=args.tversky_membership, diff_attn=args.diff_attn,
+ refiner=args.refiner, refiner_kernel=args.refiner_kernel, mlp_groups=args.mlp_groups,
+ ).to(device).bfloat16()
+ for module in base_model.modules():
+ if isinstance(module, nn.Linear):
+ module.float()
+ restore_low_dim_params_to_fp32(base_model)
+ if base_model.lm_head is not None and (args.tie_embeddings or args.logit_head_type == "tversky"):
+ base_model.lm_head.weight.requires_grad_(False)
+ torch._dynamo.config.optimize_ddp = False
+ compiled_model = torch.compile(base_model, mode=args.compile_mode if args.compile_mode != "default" else None)
+ use_find_unused = args.untie_at_fraction > 0 or args.mtp_heads_count > 0 or not args.tie_embeddings
+ model = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False,
+ find_unused_parameters=use_find_unused,
+ static_graph=not use_find_unused,
+ gradient_as_bucket_view=True) if distributed else compiled_model
+
+ # --- Optimizers ---
+ _excl = {"tok_emb.weight", "lm_head.weight", "lm_head_correction"}
+ all_other_params = [(n, p) for n, p in base_model.named_parameters()
+ if not any(eh in n for eh in _excl)]
+ matrix_params = [p for n, p in all_other_params
+ if p.ndim == 2 and not any(pat in n for pat in CTP)]
+ scalar_params = [p for n, p in all_other_params
+ if p.ndim < 2 or any(pat in n for pat in CTP)]
+ token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr
+ opt_tok = torch.optim.Adam(
+ [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True)
+ if args.matrix_optimizer == "adamw":
+ opt_muon = torch.optim.AdamW(
+ [{"params": matrix_params, "lr": args.adam_lr, "base_lr": args.adam_lr}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, weight_decay=args.adam_wd, fused=True)
+ else:
+ opt_muon = Muon(matrix_params, lr=args.matrix_lr, momentum=args.muon_momentum,
+ backend_steps=args.muon_backend_steps, wd=args.muon_wd)
+ for g in opt_muon.param_groups:
+ g["base_lr"] = args.matrix_lr
+ opt_scalar = torch.optim.Adam(
+ [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True)
+ opt_head = torch.optim.Adam(
+ [{"params": [base_model.lm_head.weight], "lr": 0.0, "base_lr": 0.0}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True)
+ optimizers = [opt for opt in [opt_tok, opt_muon, opt_scalar, opt_head] if opt is not None]
+ if base_model.lm_head_correction is not None:
+ opt_corr = torch.optim.Adam(
+ [{"params": [base_model.lm_head_correction],
+ "lr": args.corr_weight_lr, "base_lr": args.corr_weight_lr}],
+ betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True)
+ optimizers.append(opt_corr)
+
+ # --- Log all hyperparameters ---
+ log0("--- Hyperparameters ---", console=False)
+ log0(" ".join(f"{a}={getattr(args,a)}" for a in sorted(dir(args)) if not a.startswith("_") and a not in ("train_files","val_files") and not callable(getattr(args,a))), console=False)
+ n_params = sum(p.numel() for p in base_model.parameters())
+ log0(f"params:{n_params} L:{args.num_layers} d:{args.model_dim} h:{args.num_heads} kv:{args.num_kv_heads} ws:{world_size} ga:{grad_accum_steps} s:{args.seed}")
+ # --- Data loader & helpers ---
+ train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+ def zero_grad_all():
+ for opt in optimizers:
+ opt.zero_grad(set_to_none=True)
+ max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None
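+ # LR schedule: multiplier 1.0, then a linear warmdown to 0 over the final
+ # warmdown_fraction of the run, measured in steps or, when a wallclock cap is
+ # set, in elapsed milliseconds.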
+ def lr_mul(step: int, elapsed_ms: float):
+ if args.warmdown_fraction <= 0:
+ return 1.0
+ if max_wallclock_ms is None:
+ warmdown_start = int(args.iterations * (1.0 - args.warmdown_fraction))
+ return max((args.iterations - step) / max(args.iterations * args.warmdown_fraction, 1), 0.0) if step >= warmdown_start else 1.0
+ warmdown_ms = max_wallclock_ms * args.warmdown_fraction
+ remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
+ return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0
+ _seq_switched = False
+ _batch_switched = False
+ active_seq_len = args.seq_len_start if args.seq_len_start > 0 else args.train_seq_len
+ active_batch_tokens = args.batch_tokens_start if args.batch_tokens_start > 0 else args.train_batch_tokens
+ # --- Compiler warmup ---
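+ # A few real optimizer steps trigger torch.compile / CUDA-graph capture; model
+ # and optimizer state are then restored and the loader rebuilt, so warmup
+ # consumes no training signal or data.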
+ if args.warmup_steps > 0:
+ _ms = {n: t.detach().cpu().clone() for n, t in base_model.state_dict().items()}
+ _os = [copy.deepcopy(o.state_dict()) for o in optimizers]
+ model.train()
+ for ws in range(args.warmup_steps):
+ zero_grad_all()
+ for mi in range(grad_accum_steps):
+ if distributed: model.require_backward_grad_sync = mi == grad_accum_steps - 1
+ x, y = train_loader.next_batch(active_batch_tokens, active_seq_len, grad_accum_steps)
+ torch.compiler.cudagraph_mark_step_begin()
+ with torch.autocast(device_type="cuda", dtype=torch.bfloat16): loss = model(x, y)
+ (loss * grad_scale).backward()
+ for o in optimizers: o.step()
+ zero_grad_all()
+ log0(f"warmup:{ws+1}/{args.warmup_steps}")
+ base_model.load_state_dict(_ms, strict=True)
+ for o, s in zip(optimizers, _os): o.load_state_dict(s)
+ zero_grad_all()
+ train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+
+ # --- EMA model ---
+ ema_model = None
+ _ema_started = False
+ _ema_steps = 0
+ if args.ema:
+ ema_model = copy.deepcopy(base_model)
+ for p in ema_model.parameters():
+ p.requires_grad_(False)
+
+ # --- Main training loop ---
+ training_time_ms = 0.0
+ stop_after_step: int | None = None
+ _untied = False
+ train_loss = torch.zeros((), device=device)
+ torch.cuda.synchronize()
+ t0 = time.perf_counter()
+ step = 0
+ while True:
+ last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
+ if last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0):
+ torch.cuda.synchronize()
+ training_time_ms += 1000.0 * (time.perf_counter() - t0)
+ val_loss, val_bpb = eval_val(args, model, rank, world_size, device, grad_accum_steps,
+ val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut)
+ log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
+ f"train_time:{training_time_ms:.0f}ms")
+ torch.cuda.synchronize()
+ t0 = time.perf_counter()
+ if last_step:
+ if stop_after_step is not None and step < args.iterations:
+ log0(f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms step:{step}/{args.iterations}")
+ break
+ elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+ scale = lr_mul(step, elapsed_ms)
+ # Sequence length schedule
+ if args.seq_len_start > 0 and not _seq_switched:
+ if max_wallclock_ms is not None:
+ should_switch_seq = elapsed_ms >= args.seq_schedule_fraction * max_wallclock_ms
+ else:
+ should_switch_seq = step >= int(args.iterations * args.seq_schedule_fraction)
+ if should_switch_seq:
+ active_seq_len = args.train_seq_len
+ _seq_switched = True
+ torch._dynamo.reset()
+ train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
+ log0(f"step:{step} seq_len_switch:{args.seq_len_start}->{active_seq_len}")
+
+ # Batch size schedule
+ if args.batch_tokens_start > 0 and not _batch_switched:
+ if max_wallclock_ms is not None:
+ should_switch_batch = elapsed_ms >= args.batch_schedule_fraction * max_wallclock_ms
+ else:
+ should_switch_batch = step >= int(args.iterations * args.batch_schedule_fraction)
+ if should_switch_batch:
+ active_batch_tokens = args.train_batch_tokens
+ _batch_switched = True
+ log0(f"step:{step} batch_switch:{args.batch_tokens_start}->{active_batch_tokens}")
+ zero_grad_all()
+ train_loss.zero_()
+ for micro in range(grad_accum_steps):
+ if distributed:
+ model.require_backward_grad_sync = micro == grad_accum_steps - 1
+ x, y = train_loader.next_batch(active_batch_tokens, active_seq_len, grad_accum_steps)
+ torch.compiler.cudagraph_mark_step_begin()
+ with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+ loss = model(x, y)
+ train_loss.add_(loss.detach())
+ (loss * grad_scale).backward()
+ train_loss /= grad_accum_steps
+
+ # Untie lm_head at configured fraction of training
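+ # Untying materialises the effective tied head, (tok_emb [+ correction]) @
+ # embed_proj_rev, into lm_head.weight and trains it directly at head_lr;
+ # torch._dynamo.reset() forces recompilation of the changed logit path.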
+ if args.untie_at_fraction > 0:
+ if max_wallclock_ms is not None:
+ should_untie = not _untied and elapsed_ms >= args.untie_at_fraction * max_wallclock_ms
+ else:
+ should_untie = not _untied and step >= int(args.iterations * args.untie_at_fraction)
+ if should_untie and base_model.tie_embeddings:
+ with torch.no_grad():
+ base_weight = base_model.tok_emb.weight.float()
+ if base_model.lm_head_correction is not None:
+ base_weight = base_weight + base_model.lm_head_correction.float()
+ if base_model.embed_proj_rev is not None:
+ full_weight = base_weight @ base_model.embed_proj_rev.weight.float()
+ else:
+ full_weight = base_weight
+ base_model.lm_head.weight.copy_(full_weight)
+ base_model.tie_embeddings = False
+ base_model.lm_head.weight.requires_grad_(True)
+ for g in opt_head.param_groups:
+ g["lr"] = g["base_lr"] = args.head_lr
+ _untied = True
+ torch._dynamo.reset()
+ log0(f"step:{step} untied lm_head (head_lr={args.head_lr})")
+
+ # Muon momentum warmup
+ if args.matrix_optimizer != "adam":
+ frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0
+ for g in opt_muon.param_groups:
+ g["momentum"] = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum
+
+ # LR scheduling
+ for opt in optimizers:
+ for g in opt.param_groups:
+ g["lr"] = g["base_lr"] * scale
+ opt.step()
+ zero_grad_all()
+ # EMA update
+ if ema_model is not None:
+ if not _ema_started:
+ if max_wallclock_ms is not None:
+ should_start_ema = elapsed_ms >= args.ema_start_fraction * max_wallclock_ms
+ else:
+ should_start_ema = step >= int(args.iterations * args.ema_start_fraction)
+ if should_start_ema:
+ _ema_started = True
+ _ema_steps = 0
+ with torch.no_grad():
+ for ep, bp in zip(ema_model.parameters(), base_model.parameters()):
+ ep.data.copy_(bp.data)
+ log0(f"step:{step} ema_started")
+ if _ema_started:
+ _ema_steps += 1
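+ # bias-corrected decay: (1+n)/(10+n) ramps toward args.ema_decay so the EMA
+ # tracks the live weights closely right after it starts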
+ decay = min(args.ema_decay, (1.0 + _ema_steps) / (10.0 + _ema_steps))
+ with torch.no_grad():
+ for ep, bp in zip(ema_model.parameters(), base_model.parameters()):
+ ep.data.mul_(decay).add_(bp.data, alpha=1.0 - decay)
+ step += 1
+ approx_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+
+ if args.train_log_every > 0 and step % args.train_log_every == 0:
+ log0(f"step:{step}/{args.iterations} loss:{train_loss.item():.4f} t:{approx_ms:.0f}ms avg:{approx_ms/step:.1f}ms")
+ if args.churn_log_every > 0 and step % args.churn_log_every == 0:
+ log0(f"step:{step} churn:{churn_fn(base_model, args.bitnet_group_size):.4f}")
+ # Wallclock cap sync
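+ # every 10 steps all ranks agree (all-reduce MAX) on whether the cap was hit,
+ # so distributed processes stop at the same step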
+ if stop_after_step is None and max_wallclock_ms is not None and step % 10 == 0:
+ reached_cap = approx_ms >= max_wallclock_ms
+ if distributed:
+ cap_t = torch.tensor(int(reached_cap), device=device)
+ dist.all_reduce(cap_t, op=dist.ReduceOp.MAX)
+ reached_cap = bool(cap_t.item())
+ if reached_cap:
+ stop_after_step = step
+
+ # --- Serialization ---
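+ # q_sd quantises the state dict (1-bit packing for binary weights, fp_storage
+ # for the rest), torch.save writes to an in-memory buffer, and LZMA preset 9
+ # compresses the blob; the 16MB budget counts compressed artifact + source bytes.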
+ if master_process:
+ sd = (ema_model if ema_model is not None and _ema_started else base_model).state_dict()
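+ # tied and tversky configs never train lm_head.weight, so it is dropped from
+ # the artifact; load_state_dict(strict=False) tolerates its absence on load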
+ if base_model.tie_embeddings or args.logit_head_type == "tversky":
+ sd.pop("lm_head.weight", None)
+
+ # Compute binary overrides for no-features Tversky prototypes
+ binary_overrides = set()
+ for n, m in base_model.named_modules():
+ if isinstance(m, TverskyProjection) and m.no_features_mode:
+ binary_overrides.add(n + ".prototypes")
+ binary_overrides = binary_overrides or None
+ q_obj, q_stats = q_sd(sd, group_size=args.bitnet_group_size, fp_storage=args.fp_storage, binary_override_names=binary_overrides)
+ buf = io.BytesIO()
+ torch.save(q_obj, buf)
+ final_blob = lzma.compress(buf.getvalue(), preset=9)
+ with open("final_model.binary.ptz", "wb") as f:
+ f.write(final_blob)
+ artifact_bytes = len(final_blob)
+ code_bytes = len(code.encode("utf-8"))
+ total = artifact_bytes + code_bytes
+ log0(f"artifact:{artifact_bytes/1e6:.2f}MB binary:{q_stats['binary_params']}({q_stats['binary_bytes']}B) fp:{q_stats['fp_params']}({q_stats['fp_bytes']}B) code:{code_bytes}")
+ log0(f"budget:{total}/{16000000} ({total/1e6:.2f}/{16.00:.2f}MB) {'FITS' if total <= 16000000 else 'OVER'}")
+ if args.eval_depth_recurrence > 0:
+ base_model.training_depth_recurrence = args.eval_depth_recurrence
+ log0(f"eval_depth_recurrence:{args.eval_depth_recurrence}")
+
+ # --- All ranks load roundtrip weights and evaluate ---
+ if distributed:
+ dist.barrier()
+ with open("final_model.binary.ptz", "rb") as f:
+ loaded = torch.load(io.BytesIO(lzma.decompress(f.read())), map_location="cpu", weights_only=False)
+ base_model.load_state_dict(deq_sd(loaded), strict=False)
+ if ema_model is not None:
+ ema_model.load_state_dict(deq_sd(loaded), strict=False)
+ torch._dynamo.reset()
+ q_val_loss, q_val_bpb = eval_val(args, model, rank, world_size, device, grad_accum_steps,
+ val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut)
+ log0(f"final_binary_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f}")
+
+ opt_temp = 1.0
+ if args.temp_scaling:
+ torch.cuda.synchronize()
+ t_temp = time.perf_counter()
+ calibration_tokens = train_loader.stream.take(65536).to(device)
+ opt_temp = find_temp(args, base_model, rank, world_size, device, grad_accum_steps,
+ calibration_tokens, base_bytes_lut, has_leading_space_lut,
+ is_boundary_token_lut)
+ torch.cuda.synchronize()
+ temp_time_ms = 1000.0 * (time.perf_counter() - t_temp)
+ log0(f"temp_scaling optimal_T:{opt_temp:.2f} eval_time:{temp_time_ms:.0f}ms")
+
+ if args.sliding_eval:
+ torch.cuda.synchronize()
+ t_sliding = time.perf_counter()
+ sw_loss, sw_bpb = eval_val_sliding(args, base_model, rank, world_size, device, grad_accum_steps,
+ val_tokens, base_bytes_lut, has_leading_space_lut,
+ is_boundary_token_lut, stride=args.sliding_eval_stride,
+ temperature=opt_temp)
+ torch.cuda.synchronize()
+ sliding_time_ms = 1000.0 * (time.perf_counter() - t_sliding)
+ log0(f"final_sliding val_loss:{sw_loss:.4f} val_bpb:{sw_bpb:.4f} "
+ f"(stride={args.sliding_eval_stride}, T={opt_temp:.2f}) eval_time:{sliding_time_ms:.0f}ms")
+
+ if distributed:
+ dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+ main()