diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/README.md b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/README.md new file mode 100644 index 0000000000..e75915a81c --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/README.md @@ -0,0 +1,165 @@ +# Record: CaseOps Tokenizer + Recurrence Depth Curriculum + Base Arch Stack — val_bpb 1.06505 + +**val_bpb: 1.06505** (3-seed mean, std=0.00081) | **val_loss: 2.33073 nats/token** (std=0.00178) | **~15.98 MB** | 8xH100 SXM | Phased TTT + +## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, phased TTT, 10-min train / 10-min eval budgets) + +### Core table (phased TTT) + +| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | TTT gain | TTT time | Artifact (bytes) | +|------|------:|------------:|-------------:|---------:|---------:|-----------------:| +| 0 | 4599 | 1.07689 | 1.06417 | -0.01272 | 470.6s | 15,984,426 | +| 42 | 4603 | 1.07792 | 1.06521 | -0.01271 | 513.8s | 15,986,579 | +| 1234 | 4604 | 1.07836 | 1.06578 | -0.01258 | 470.6s | 15,982,914 | +| **Mean** | **4602** | **1.07772** | **1.06505** | **-0.01267** | **485.0s** | **15,984,640** | +| **Std** | | 0.00076 | **0.00081** | | 24.9s | 1,842 | + +### Supplemental diagnostics + +| Seed | Post-EMA BPB (pre-quant) | Quantized BPB (no TTT) | Sliding/TTT BPB | val_loss (nats) | Train time | Eval time | +|------|-------------------------:|-----------------------:|----------------:|----------------:|-----------:|----------:| +| 0 | 1.06779 | 1.07689 | 1.06417 | 2.32880 | 596.10s | 470.6s | +| 42 | 1.06872 | 1.07792 | 1.06521 | 2.33108 | 596.15s | 513.8s | +| 1234 | 1.06934 | 1.07836 | 1.06578 | 2.33231 | 596.14s | 470.6s | + +Compared with PR #1736's 3-seed mean of **1.06549**, this curriculum improves the final endpoint by **0.00043 BPB** while staying under both the 600s eval budget and the 16,000,000-byte decimal artifact cap. + +## Specific contribution in this record + +The core new idea here is **curriculum recurrence depth**. + +The base stack already existed: + +- SP8192 base architecture / looped stack from earlier merged work +- CaseOps tokenizer + original-byte sidecar accounting from PR #1729 +- phased TTT from the prior stack +- gated attention / quant-gate components from earlier work + +This record's contribution is to change how recurrence depth is used during training and evaluation. + +Instead of training with one fixed recurrent depth after loop activation, this submission uses a **deterministic equal-thirds recurrence curriculum**: + +- once the loop path is enabled, train at total recurrence depth `1` +- then switch to total recurrence depth `3` +- then switch to total recurrence depth `4` +- evaluate and run phased TTT at fixed depth `4` + +Depth here is counted as the **total number of passes through the recurrent loop block**. So `1` is the shallowest loop-enabled path, `3` is the standard middle-depth path, and `4` is one extra refinement pass at the endpoint. 
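+
+For concreteness, a minimal sketch of that schedule is shown below (illustrative only; the real phase selection lives in `train_gpt.py` and is driven by `TRAIN_LOOP_PHASE_DEPTHS`, `TRAIN_LOOP_PHASE_FRACTIONS`, and `ENABLE_LOOPING_AT`). It assumes the three phases split the loop-enabled portion of training into equal thirds, with the default `ENABLE_LOOPING_AT=0.35`:
+
+```python
+# Illustrative sketch, not the training code: equal-thirds phase selection
+# over the loop-enabled portion of training.
+def depth_for_step(step: int, total_steps: int,
+                   enable_looping_at: float = 0.35,
+                   phase_depths: tuple[int, ...] = (1, 3, 4)) -> int | None:
+    frac = step / total_steps
+    if frac < enable_looping_at:
+        return None  # loop path not yet enabled
+    # Position within the loop-enabled portion, split into equal phases.
+    loop_frac = (frac - enable_looping_at) / (1.0 - enable_looping_at)
+    phase = min(int(loop_frac * len(phase_depths)), len(phase_depths) - 1)
+    return phase_depths[phase]  # eval / phased TTT always run at depth 4
+```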
+ +The intended mechanism is: + +- early in the loop-enabled regime, force the recurrent block to learn a useful shallow refinement operator +- then expand to the normal depth so the model keeps strong baseline behavior +- only in the final phase ask the same shared recurrent block to support a deeper refinement chain +- at eval / phased TTT, cash in that extra learned depth by running the model at depth `4` + +So the hypothesis is not "train deeper everywhere." It is "teach the recurrent block to scale its refinement depth over training, then evaluate at the deepest trained depth." Empirically, that improves the final phased-TTT endpoint even though one seed (`1234`) is slightly worse than PR #1736; the mean improves because seeds `0` and `42` improve more strongly. + +## CaseOps tokenizer and legality + +CaseOps (`lossless_caps_caseops_v1`) is a **bijective**, character-level text transform applied before SentencePiece training. It removes English capitalization from the body of the text and records it as four operator tokens that become part of the BPE vocabulary as SentencePiece `user_defined_symbols`: + +- `TITLE` — next word is TitleCase +- `ALLCAPS` — next word or region is UPPERCASE +- `CAPNEXT` — next letter is capitalized +- `ESC` — escape for a literal operator-looking sequence + +Because the transform is fully invertible, no information is lost. Reconstruction is exact by replaying these capitalization operators over the lowercase lexical stream. + +**BPB is still charged on the original raw UTF-8 bytes**, not on the transformed representation. The validation export emits a per-token byte sidecar (`fineweb_val_bytes_XXXXXX.bin`) parallel to the transformed token stream. Eval sums those byte counts for the scored positions, so the denominator remains the original FineWeb byte count. + +That means: + +- extra CaseOps control tokens are **not free** +- they still contribute prediction loss +- but the BPB denominator stays anchored to the original corpus bytes + +So the submission remains legality-preserving: it changes representation, not the underlying text being compressed. + +## Rule compliance + +- **Artifact <= 16,000,000 bytes DECIMAL**: all 3 seeds <= 15,986,579 bytes. +- **train_time <= 600s**: all 3 seeds are 596.10-596.15s. +- **total_eval_time <= 600s**: all 3 seeds are 470.6-513.8s. +- **Score-first TTT**: phased TTT snapshots the pre-update score on each chunk before the LoRA adapter step. +- **BPB on original bytes**: per-token byte sidecar encodes the canonical UTF-8 byte count of each val position. +- **Reversibility**: `decode_lossless_caps_v2(encode_lossless_caps_v2(x)) == x`. +- **No val data in training**: training uses only `fineweb_train_*.bin` shards. +- **No external network during eval**: self-contained; tokenizer + transform ship with the submission. + +## Requirements + +```bash +pip install torch --index-url https://download.pytorch.org/whl/cu128 +pip install flash-attn-interface sentencepiece triton numpy +``` + +Python >= 3.12 is recommended. + +Run all commands below from this record directory. + +## Data setup (run once) + +The submission ships with the trained CaseOps SentencePiece model and the bijective transform module. Train/val shards and the byte sidecar are rebuilt from the canonical FineWeb-10B doc stream: + +```bash +# 1. Ensure docs_selected.jsonl exists (standard repo setup step). +python3 ../../data/download_hf_docs_and_tokenize.py + +# 2. Build CaseOps-transformed shards + val byte sidecar. 
+# This reproduces the original CaseOps export format: +# one BOS token per doc, and a matching leading 0 byte-count entry. +python3 prepare_caseops_data.py \ + --docs ./fineweb10B_raw/docs_selected.jsonl \ + --out ./data/datasets/fineweb10B_sp8192_caseops/datasets \ + --sp ./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \ + --val-docs 50000 +``` + +## Run command (3-seed reproduction) + +```bash +for SEED in 42 0 1234; do + NCCL_NET=Socket \ + DATA_DIR=./data \ + CASEOPS_ENABLED=1 \ + PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \ + MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \ + EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \ + MATRIX_LR=0.026 \ + GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \ + GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \ + TRAIN_LOOP_PHASE_DEPTHS=1,3,4 \ + TRAIN_LOOP_PREWARM_DEPTHS=3,4 \ + EVAL_LOOP_DEPTH=4 \ + SEED=$SEED \ + torchrun --standalone --nproc_per_node=8 train_gpt.py \ + > train_seed${SEED}.log 2>&1 +done +``` + +## Lineage + +- **PR #1530** contributed the core SP8192 base architecture / looped-stack foundation. +- **PR #1626** contributed the phased-TTT schedule that this stack continues to use. +- **PR #1729** contributed the **CaseOps tokenizer**, lossless capitalization transform, and original-byte sidecar BPB accounting. +- **PR #1667** contributed the attention out-gate pattern used in this family of runs. +- **PR #1736** assembled those ingredients into one competitive stack. +- **This record's novel change** is the deterministic `1 -> 3 -> 4` recurrence-depth curriculum with fixed-depth-`4` eval. + +## Credits + +- @romeerp — CaseOps tokenizer, byte-sidecar accounting, and this recurrence-curriculum contribution. +- @samacqua — SP8192 base architecture / looped-stack foundation from PR #1530. +- @MarioPaerle — attention gate pattern. +- prior phased-TTT contributors in the PR #1626 line. + +## Included files + +- `train_gpt.py` — main training script. +- `submission.json` — metadata. +- `README.md` — this file. +- `train_seed42.log`, `train_seed0.log`, `train_seed1234.log` — 3-seed run logs. +- `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` — CaseOps SentencePiece model. +- `lossless_caps.py` — bijective CaseOps transform. +- `prepare_caseops_data.py` — one-time data prep script that emits the per-token byte sidecar. diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/lossless_caps.py b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/lossless_caps.py new file mode 100644 index 0000000000..98e472f824 --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/lossless_caps.py @@ -0,0 +1,833 @@ +"""Lossless capitalization pre-encoding helpers. + +This module provides a narrow, reversible transform that only touches +ASCII capital letters `A-Z`. Each uppercase ASCII letter is rewritten as +``, where `sentinel` is a private-use Unicode +character that is escaped by doubling if it appears literally in the +input text. 
+ +Example with the default sentinel `\\uE000`: + + "The NASA Launch" -> "\\uE000the \\uE000n\\uE000a\\uE000s\\uE000a \\uE000launch" + +The transform is intentionally simple for v1: + +- lowercase ASCII letters are unchanged +- uppercase ASCII letters become sentinel + lowercase letter +- non-ASCII characters are left untouched +- literal sentinel characters are escaped as sentinel + sentinel + +This makes the transform exactly invertible while allowing a downstream +tokenizer to reuse lowercase subwords across case variants. +""" + +from __future__ import annotations + +import json +from pathlib import Path +from typing import Callable, Iterable + +LOSSLESS_CAPS_V1 = "lossless_caps_v1" +LOSSLESS_CAPS_V2 = "lossless_caps_v2" +LOSSLESS_CAPS_V3 = "lossless_caps_v3" +LOSSLESS_CAPS_V4 = "lossless_caps_v4" +LOSSLESS_CAPS_V5 = "lossless_caps_v5" +LOSSLESS_CAPS_V6 = "lossless_caps_v6" +LOSSLESS_CAPS_V7 = "lossless_caps_v7" +LOSSLESS_CAPS_CASEOPS_V1 = "lossless_caps_caseops_v1" +IDENTITY = "identity" +DEFAULT_SENTINEL = "\uE000" +DEFAULT_V2_TITLE = "\uE001" +DEFAULT_V2_ALLCAPS = "\uE002" +DEFAULT_V2_CAPNEXT = "\uE003" +DEFAULT_V2_ESC = "\uE004" +DEFAULT_V5_TITLE_MIN_LEN = 7 +DEFAULT_V6_ALLCAPS_MIN_LEN = 3 +DEFAULT_V7_ALLCAPS_MIN_LEN = 4 + + +class LosslessCapsError(ValueError): + """Raised when a transformed string is malformed.""" + + +def _is_ascii_upper(ch: str) -> bool: + return "A" <= ch <= "Z" + + +def _is_ascii_lower(ch: str) -> bool: + return "a" <= ch <= "z" + + +def _is_ascii_alpha(ch: str) -> bool: + return _is_ascii_lower(ch) or _is_ascii_upper(ch) + + +def _validate_distinct_single_chars(*chars: str) -> None: + if any(len(ch) != 1 for ch in chars): + raise ValueError("all control characters must be exactly one character") + if len(set(chars)) != len(chars): + raise ValueError("control characters must be distinct") + + +def encode_lossless_caps_v1(text: str, *, sentinel: str = DEFAULT_SENTINEL) -> str: + """Encode ASCII capitals reversibly using a one-character sentinel.""" + if len(sentinel) != 1: + raise ValueError("sentinel must be exactly one character") + out: list[str] = [] + for ch in text: + if ch == sentinel: + out.append(sentinel) + out.append(sentinel) + elif _is_ascii_upper(ch): + out.append(sentinel) + out.append(ch.lower()) + else: + out.append(ch) + return "".join(out) + + +def decode_lossless_caps_v1(text: str, *, sentinel: str = DEFAULT_SENTINEL) -> str: + """Decode the `lossless_caps_v1` transform back to the original text.""" + if len(sentinel) != 1: + raise ValueError("sentinel must be exactly one character") + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch != sentinel: + out.append(ch) + i += 1 + continue + if i + 1 >= n: + raise LosslessCapsError("dangling capitalization sentinel at end of string") + nxt = text[i + 1] + if nxt == sentinel: + out.append(sentinel) + elif _is_ascii_lower(nxt): + out.append(nxt.upper()) + else: + raise LosslessCapsError( + f"invalid sentinel escape sequence {sentinel + nxt!r}; " + "expected doubled sentinel or sentinel + lowercase ASCII letter" + ) + i += 2 + return "".join(out) + + +def encode_lossless_caps_v2( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + capnext: str = DEFAULT_V2_CAPNEXT, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Encode ASCII word capitalization with cheap word-level markers. 
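+
+    For example, with the markers written symbolically, `"NASA Launch"`
+    becomes `ALLCAPS + "nasa" + " " + TITLE + "launch"`, and decoding simply
+    replays those markers over the lowercase stream.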
+ + Rules over maximal ASCII alphabetic runs: + - lowercase words stay unchanged + - TitleCase words become `title + lowercase(word)` + - ALLCAPS words become `allcaps + lowercase(word)` + - mixed-case words use: + - optional `title` when the first letter is uppercase + - `capnext + lowercase(letter)` for subsequent uppercase letters + - literal control characters are escaped as `esc + literal` + """ + _validate_distinct_single_chars(title, allcaps, capnext, esc) + controls = {title, allcaps, capnext, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + lower_word = word.lower() + + if word.islower(): + out.append(word) + elif len(word) >= 2 and word.isupper(): + out.append(allcaps) + out.append(lower_word) + elif _is_ascii_upper(word[0]) and word[1:].islower(): + out.append(title) + out.append(lower_word) + else: + if _is_ascii_upper(word[0]): + out.append(title) + out.append(lower_word[0]) + for orig_ch, lower_ch in zip(word[1:], lower_word[1:], strict=True): + if _is_ascii_upper(orig_ch): + out.append(capnext) + out.append(lower_ch) + i = j + return "".join(out) + + +def decode_lossless_caps_v2( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + capnext: str = DEFAULT_V2_CAPNEXT, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v2` transform back to the original text.""" + _validate_distinct_single_chars(title, allcaps, capnext, esc) + out: list[str] = [] + pending_escape = False + pending_word_mode: str | None = None + active_allcaps = False + pending_capnext = False + in_ascii_word = False + + for ch in text: + if pending_escape: + if pending_word_mode is not None and not _is_ascii_alpha(ch): + raise LosslessCapsError("escaped control char cannot satisfy pending word capitalization mode") + out.append(ch) + pending_escape = False + if _is_ascii_alpha(ch): + in_ascii_word = True + else: + in_ascii_word = False + active_allcaps = False + continue + + if ch == esc: + pending_escape = True + continue + if ch == title: + if pending_word_mode is not None or in_ascii_word or pending_capnext: + raise LosslessCapsError("invalid title marker placement") + pending_word_mode = "title" + continue + if ch == allcaps: + if pending_word_mode is not None or in_ascii_word or pending_capnext: + raise LosslessCapsError("invalid allcaps marker placement") + pending_word_mode = "allcaps" + continue + if ch == capnext: + if pending_capnext: + raise LosslessCapsError("duplicate capnext marker") + pending_capnext = True + continue + + if _is_ascii_alpha(ch): + at_word_start = not in_ascii_word + if at_word_start: + if pending_word_mode == "allcaps": + out.append(ch.upper()) + active_allcaps = True + elif pending_word_mode == "title": + out.append(ch.upper()) + elif pending_capnext: + out.append(ch.upper()) + else: + out.append(ch) + pending_word_mode = None + pending_capnext = False + in_ascii_word = True + continue + + if pending_word_mode is not None: + raise LosslessCapsError("word capitalization marker leaked into the middle of a word") + if active_allcaps: + out.append(ch.upper()) + elif pending_capnext: + out.append(ch.upper()) + else: + out.append(ch) + pending_capnext = False + continue + + if pending_word_mode is not None or pending_capnext: + raise LosslessCapsError("capitalization 
marker not followed by an ASCII letter") + out.append(ch) + in_ascii_word = False + active_allcaps = False + + if pending_escape: + raise LosslessCapsError("dangling escape marker at end of string") + if pending_word_mode is not None or pending_capnext: + raise LosslessCapsError("dangling capitalization marker at end of string") + return "".join(out) + + +def encode_lossless_caps_v3( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Encode only common word-level capitalization patterns. + + Rules over maximal ASCII alphabetic runs: + - lowercase words stay unchanged + - TitleCase words become `title + lowercase(word)` + - ALLCAPS words become `allcaps + lowercase(word)` + - all other mixed-case words are left unchanged + - literal control characters are escaped as `esc + literal` + """ + _validate_distinct_single_chars(title, allcaps, esc) + controls = {title, allcaps, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + + if word.islower(): + out.append(word) + elif len(word) >= 2 and word.isupper(): + out.append(allcaps) + out.append(word.lower()) + elif _is_ascii_upper(word[0]) and word[1:].islower(): + out.append(title) + out.append(word.lower()) + else: + out.append(word) + i = j + return "".join(out) + + +def decode_lossless_caps_v3( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v3` transform back to the original text.""" + _validate_distinct_single_chars(title, allcaps, esc) + out: list[str] = [] + pending_escape = False + pending_word_mode: str | None = None + active_allcaps = False + in_ascii_word = False + + for ch in text: + if pending_escape: + if pending_word_mode is not None and not _is_ascii_alpha(ch): + raise LosslessCapsError("escaped control char cannot satisfy pending word capitalization mode") + out.append(ch) + pending_escape = False + if _is_ascii_alpha(ch): + in_ascii_word = True + else: + in_ascii_word = False + active_allcaps = False + continue + + if ch == esc: + pending_escape = True + continue + if ch == title: + if pending_word_mode is not None or in_ascii_word: + raise LosslessCapsError("invalid title marker placement") + pending_word_mode = "title" + continue + if ch == allcaps: + if pending_word_mode is not None or in_ascii_word: + raise LosslessCapsError("invalid allcaps marker placement") + pending_word_mode = "allcaps" + continue + + if _is_ascii_alpha(ch): + at_word_start = not in_ascii_word + if at_word_start: + if pending_word_mode == "allcaps": + out.append(ch.upper()) + active_allcaps = True + elif pending_word_mode == "title": + out.append(ch.upper()) + else: + out.append(ch) + pending_word_mode = None + in_ascii_word = True + continue + + if pending_word_mode is not None: + raise LosslessCapsError("word capitalization marker leaked into the middle of a word") + out.append(ch.upper() if active_allcaps else ch) + continue + + if pending_word_mode is not None: + raise LosslessCapsError("capitalization marker not followed by an ASCII letter") + out.append(ch) + in_ascii_word = False + active_allcaps = False + + if pending_escape: + raise LosslessCapsError("dangling escape marker at end of string") + if 
pending_word_mode is not None: + raise LosslessCapsError("dangling capitalization marker at end of string") + return "".join(out) + + +def encode_lossless_caps_v4( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Encode only ALLCAPS ASCII words, leaving all other case untouched.""" + _validate_distinct_single_chars(allcaps, esc) + controls = {allcaps, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + if len(word) >= 2 and word.isupper(): + out.append(allcaps) + out.append(word.lower()) + else: + out.append(word) + i = j + return "".join(out) + + +def decode_lossless_caps_v4( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v4` transform back to the original text.""" + _validate_distinct_single_chars(allcaps, esc) + out: list[str] = [] + pending_escape = False + pending_allcaps = False + in_ascii_word = False + active_allcaps = False + + for ch in text: + if pending_escape: + if pending_allcaps and not _is_ascii_alpha(ch): + raise LosslessCapsError("escaped control char cannot satisfy pending allcaps mode") + out.append(ch) + pending_escape = False + if _is_ascii_alpha(ch): + in_ascii_word = True + else: + in_ascii_word = False + active_allcaps = False + continue + + if ch == esc: + pending_escape = True + continue + if ch == allcaps: + if pending_allcaps or in_ascii_word: + raise LosslessCapsError("invalid allcaps marker placement") + pending_allcaps = True + continue + + if _is_ascii_alpha(ch): + if not in_ascii_word: + active_allcaps = pending_allcaps + pending_allcaps = False + in_ascii_word = True + out.append(ch.upper() if active_allcaps else ch) + continue + + if pending_allcaps: + raise LosslessCapsError("allcaps marker not followed by an ASCII letter") + out.append(ch) + in_ascii_word = False + active_allcaps = False + + if pending_escape: + raise LosslessCapsError("dangling escape marker at end of string") + if pending_allcaps: + raise LosslessCapsError("dangling allcaps marker at end of string") + return "".join(out) + + +def encode_lossless_caps_v5( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, + title_min_len: int = DEFAULT_V5_TITLE_MIN_LEN, +) -> str: + """Encode ALLCAPS words and only sufficiently long TitleCase words.""" + _validate_distinct_single_chars(title, allcaps, esc) + controls = {title, allcaps, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + if len(word) >= 2 and word.isupper(): + out.append(allcaps) + out.append(word.lower()) + elif len(word) >= title_min_len and _is_ascii_upper(word[0]) and word[1:].islower(): + out.append(title) + out.append(word.lower()) + else: + out.append(word) + i = j + return "".join(out) + + +def decode_lossless_caps_v5( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v5` transform back to the original text.""" + return 
decode_lossless_caps_v3(text, title=title, allcaps=allcaps, esc=esc) + + +def encode_lossless_caps_v6( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, + allcaps_min_len: int = DEFAULT_V6_ALLCAPS_MIN_LEN, +) -> str: + """Encode only ALLCAPS words with length >= allcaps_min_len.""" + _validate_distinct_single_chars(allcaps, esc) + controls = {allcaps, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + if len(word) >= allcaps_min_len and word.isupper(): + out.append(allcaps) + out.append(word.lower()) + else: + out.append(word) + i = j + return "".join(out) + + +def decode_lossless_caps_v6( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v6` transform back to the original text.""" + return decode_lossless_caps_v4(text, allcaps=allcaps, esc=esc) + + +def encode_lossless_caps_v7( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, + allcaps_min_len: int = DEFAULT_V7_ALLCAPS_MIN_LEN, +) -> str: + """Encode only ALLCAPS words with length >= 4.""" + return encode_lossless_caps_v6( + text, + allcaps=allcaps, + esc=esc, + allcaps_min_len=allcaps_min_len, + ) + + +def decode_lossless_caps_v7( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v7` transform back to the original text.""" + return decode_lossless_caps_v6(text, allcaps=allcaps, esc=esc) + + +def get_text_transform(name: str | None) -> Callable[[str], str]: + """Return the forward text transform for the given config name.""" + normalized = IDENTITY if name in {None, "", IDENTITY} else str(name) + if normalized == IDENTITY: + return lambda text: text + if normalized == LOSSLESS_CAPS_V1: + return encode_lossless_caps_v1 + if normalized == LOSSLESS_CAPS_V2: + return encode_lossless_caps_v2 + if normalized == LOSSLESS_CAPS_V3: + return encode_lossless_caps_v3 + if normalized == LOSSLESS_CAPS_V4: + return encode_lossless_caps_v4 + if normalized == LOSSLESS_CAPS_V5: + return encode_lossless_caps_v5 + if normalized == LOSSLESS_CAPS_V6: + return encode_lossless_caps_v6 + if normalized == LOSSLESS_CAPS_V7: + return encode_lossless_caps_v7 + if normalized == LOSSLESS_CAPS_CASEOPS_V1: + return encode_lossless_caps_v2 + raise ValueError(f"unsupported text_transform={name!r}") + + +def get_text_inverse_transform(name: str | None) -> Callable[[str], str]: + """Return the inverse transform for the given config name.""" + normalized = IDENTITY if name in {None, "", IDENTITY} else str(name) + if normalized == IDENTITY: + return lambda text: text + if normalized == LOSSLESS_CAPS_V1: + return decode_lossless_caps_v1 + if normalized == LOSSLESS_CAPS_V2: + return decode_lossless_caps_v2 + if normalized == LOSSLESS_CAPS_V3: + return decode_lossless_caps_v3 + if normalized == LOSSLESS_CAPS_V4: + return decode_lossless_caps_v4 + if normalized == LOSSLESS_CAPS_V5: + return decode_lossless_caps_v5 + if normalized == LOSSLESS_CAPS_V6: + return decode_lossless_caps_v6 + if normalized == LOSSLESS_CAPS_V7: + return decode_lossless_caps_v7 + if normalized == LOSSLESS_CAPS_CASEOPS_V1: + return decode_lossless_caps_v2 + raise ValueError(f"unsupported text_transform={name!r}") + + +def 
normalize_text_transform_name(name: str | None) -> str: + """Normalize empty/None transform names to the identity transform.""" + return IDENTITY if name in {None, "", IDENTITY} else str(name) + + +def get_text_transform_control_symbols(name: str | None) -> list[str]: + """Return reserved control symbols used by a transform, if any.""" + normalized = normalize_text_transform_name(name) + if normalized == IDENTITY: + return [] + if normalized == LOSSLESS_CAPS_V1: + return [DEFAULT_SENTINEL] + if normalized == LOSSLESS_CAPS_V2: + return [DEFAULT_V2_TITLE, DEFAULT_V2_ALLCAPS, DEFAULT_V2_CAPNEXT, DEFAULT_V2_ESC] + if normalized == LOSSLESS_CAPS_CASEOPS_V1: + return [DEFAULT_V2_TITLE, DEFAULT_V2_ALLCAPS, DEFAULT_V2_CAPNEXT, DEFAULT_V2_ESC] + if normalized in {LOSSLESS_CAPS_V3, LOSSLESS_CAPS_V5}: + return [DEFAULT_V2_TITLE, DEFAULT_V2_ALLCAPS, DEFAULT_V2_ESC] + if normalized in {LOSSLESS_CAPS_V4, LOSSLESS_CAPS_V6, LOSSLESS_CAPS_V7}: + return [DEFAULT_V2_ALLCAPS, DEFAULT_V2_ESC] + raise ValueError(f"unsupported text_transform={name!r}") + + +def infer_text_transform_from_manifest(tokenizer_path: str | Path) -> str: + """Best-effort lookup of a tokenizer's text transform from a local manifest.""" + tokenizer_path = Path(tokenizer_path).expanduser().resolve() + manifest_candidates = [ + tokenizer_path.parent.parent / "manifest.json", + tokenizer_path.parent / "manifest.json", + ] + for manifest_path in manifest_candidates: + if not manifest_path.is_file(): + continue + try: + payload = json.loads(manifest_path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + continue + tokenizers = payload.get("tokenizers") + if not isinstance(tokenizers, list): + continue + for tokenizer_meta in tokenizers: + if not isinstance(tokenizer_meta, dict): + continue + model_path = tokenizer_meta.get("model_path") or tokenizer_meta.get("path") + if not model_path: + continue + candidate = (manifest_path.parent / str(model_path)).resolve() + if candidate == tokenizer_path: + return normalize_text_transform_name(tokenizer_meta.get("text_transform")) + return IDENTITY + + +def surface_piece_original_byte_counts( + surfaces: Iterable[str], + *, + text_transform_name: str | None = None, + sentinel: str = DEFAULT_SENTINEL, +) -> list[int]: + """Return exact original UTF-8 byte counts contributed by each surface piece. + + `surfaces` must be the exact decoded text fragments emitted by SentencePiece + in order, e.g. `piece.surface` from `encode_as_immutable_proto`. 
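+
+    When the surfaces cover an entire transformed document, the returned
+    counts sum to the UTF-8 byte length of the original (pre-transform)
+    text, which keeps the BPB denominator anchored to the raw corpus bytes.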
+ """ + normalized = normalize_text_transform_name(text_transform_name) + if normalized == IDENTITY: + return [len(surface.encode("utf-8")) for surface in surfaces] + if normalized == LOSSLESS_CAPS_V1: + if len(sentinel) != 1: + raise ValueError("sentinel must be exactly one character") + sentinel_bytes = len(sentinel.encode("utf-8")) + pending_sentinel = False + counts: list[int] = [] + for surface in surfaces: + piece_bytes = 0 + for ch in surface: + if pending_sentinel: + if ch == sentinel: + piece_bytes += sentinel_bytes + elif _is_ascii_lower(ch): + piece_bytes += 1 + else: + raise LosslessCapsError( + f"invalid continuation {ch!r} after capitalization sentinel" + ) + pending_sentinel = False + continue + if ch == sentinel: + pending_sentinel = True + else: + piece_bytes += len(ch.encode("utf-8")) + counts.append(piece_bytes) + if pending_sentinel: + raise LosslessCapsError("dangling capitalization sentinel across piece boundary") + return counts + if normalized not in {LOSSLESS_CAPS_V2, LOSSLESS_CAPS_V3, LOSSLESS_CAPS_V4, LOSSLESS_CAPS_V5, LOSSLESS_CAPS_V6, LOSSLESS_CAPS_V7, LOSSLESS_CAPS_CASEOPS_V1}: + raise ValueError(f"unsupported text_transform={text_transform_name!r}") + + title = DEFAULT_V2_TITLE + allcaps = DEFAULT_V2_ALLCAPS + capnext = DEFAULT_V2_CAPNEXT + esc = DEFAULT_V2_ESC + if normalized in {LOSSLESS_CAPS_V2, LOSSLESS_CAPS_CASEOPS_V1}: + _validate_distinct_single_chars(title, allcaps, capnext, esc) + elif normalized in {LOSSLESS_CAPS_V4, LOSSLESS_CAPS_V6, LOSSLESS_CAPS_V7}: + _validate_distinct_single_chars(allcaps, esc) + else: + _validate_distinct_single_chars(title, allcaps, esc) + pending_escape = False + pending_word_mode: str | None = None + active_allcaps = False + pending_capnext = False + in_ascii_word = False + counts: list[int] = [] + for surface in surfaces: + piece_bytes = 0 + for ch in surface: + if pending_escape: + if pending_word_mode is not None and not _is_ascii_alpha(ch): + raise LosslessCapsError("escaped control char cannot satisfy pending word capitalization mode") + piece_bytes += len(ch.encode("utf-8")) + pending_escape = False + if _is_ascii_alpha(ch): + in_ascii_word = True + else: + in_ascii_word = False + active_allcaps = False + continue + if ch == esc: + pending_escape = True + continue + if normalized in {LOSSLESS_CAPS_V2, LOSSLESS_CAPS_V3, LOSSLESS_CAPS_V5, LOSSLESS_CAPS_CASEOPS_V1} and ch == title: + if pending_word_mode is not None or in_ascii_word or pending_capnext: + raise LosslessCapsError("invalid title marker placement") + pending_word_mode = "title" + continue + if ch == allcaps: + if pending_word_mode is not None or in_ascii_word or pending_capnext: + raise LosslessCapsError("invalid allcaps marker placement") + pending_word_mode = "allcaps" + continue + if normalized in {LOSSLESS_CAPS_V2, LOSSLESS_CAPS_CASEOPS_V1} and ch == capnext: + if pending_capnext: + raise LosslessCapsError("duplicate capnext marker") + pending_capnext = True + continue + + if _is_ascii_alpha(ch): + at_word_start = not in_ascii_word + if at_word_start: + piece_bytes += 1 + active_allcaps = pending_word_mode == "allcaps" + pending_word_mode = None + pending_capnext = False + in_ascii_word = True + continue + if pending_word_mode is not None: + raise LosslessCapsError("word capitalization marker leaked into the middle of a word") + piece_bytes += 1 + pending_capnext = False + continue + + if pending_word_mode is not None or pending_capnext: + raise LosslessCapsError("capitalization marker not followed by an ASCII letter") + piece_bytes += 
len(ch.encode("utf-8")) + in_ascii_word = False + active_allcaps = False + counts.append(piece_bytes) + if pending_escape: + raise LosslessCapsError("dangling escape marker across piece boundary") + if pending_word_mode is not None or pending_capnext: + raise LosslessCapsError("dangling capitalization marker across piece boundary") + return counts diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/prepare_caseops_data.py b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/prepare_caseops_data.py new file mode 100644 index 0000000000..44ac60eb7b --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/prepare_caseops_data.py @@ -0,0 +1,196 @@ +"""Prepare CaseOps-tokenized FineWeb shards + per-token byte sidecar. + +CaseOps (``lossless_caps_caseops_v1``) is a bijective, character-level text +transform that introduces four operator tokens in place of explicit +capitalization: TITLE, ALLCAPS, CAPNEXT, ESC. The transform is fully +reversible — no information is lost relative to the untransformed UTF-8 +text, so BPB stays computable on TRUE byte counts. + +Forward pipeline: + 1. Read the canonical FineWeb-10B doc stream (``docs_selected.jsonl`` + produced by ``data/download_hf_docs_and_tokenize.py`` in the root repo). + 2. Apply ``encode_lossless_caps_v2`` (the caseops_v1 alias) to each doc. + 3. Tokenize with the shipped SP model + ``tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model`` + (reserves TITLE/ALLCAPS/CAPNEXT/ESC + sentinel as user_defined_symbols). + 4. Write uint16 train/val shards (``fineweb_{train,val}_XXXXXX.bin``). + 5. For the VAL stream only, emit per-token byte sidecar shards + (``fineweb_val_bytes_XXXXXX.bin``, uint16 parallel arrays) that record + each token's ORIGINAL pre-transform UTF-8 byte count. BPB is computed + from these canonical bytes so the score is on the untransformed text + (not the transformed representation). + +Output layout — matches what ``train_gpt.py`` expects under +``DATA_DIR=./data`` with ``CASEOPS_ENABLED=1``: + + data/datasets/fineweb10B_sp8192_caseops/datasets/ + tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/ + fineweb_train_000000.bin + fineweb_train_000001.bin + ... + fineweb_val_000000.bin + fineweb_val_bytes_000000.bin + +Usage: + + python3 prepare_caseops_data.py \\ + --docs ./fineweb10B_raw/docs_selected.jsonl \\ + --out ./data/datasets/fineweb10B_sp8192_caseops/datasets \\ + --sp ./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + +This script is intended to reproduce the actual shard format used by the +original CaseOps export path from PR #1729 / the HF-hosted dataset: + +- every document is prepended with ``bos_id`` +- validation byte sidecars include a matching leading ``0`` byte count +- the default validation split is the canonical 50,000-doc challenge split + +Requirements: sentencepiece, numpy. CPU-only. Runs once; reused across seeds. +""" +from __future__ import annotations + +import argparse +import json +import pathlib +import struct +import sys + +import numpy as np +import sentencepiece as spm + +# Local import — lossless_caps.py ships next to this script. 
+sys.path.insert(0, str(pathlib.Path(__file__).resolve().parent)) +from lossless_caps import ( # noqa: E402 + LOSSLESS_CAPS_CASEOPS_V1, + encode_lossless_caps_v2, + surface_piece_original_byte_counts, +) + + +SHARD_MAGIC = 20240520 +SHARD_VERSION = 1 +SHARD_TOKENS = 100_000_000 # tokens per shard — matches the original CaseOps export path + + +def _write_shard(out_path: pathlib.Path, arr: np.ndarray) -> None: + """Write a uint16 shard in the standard header-prefixed format.""" + assert arr.dtype == np.uint16 + header = np.zeros(256, dtype=np.int32) + header[0] = SHARD_MAGIC + header[1] = SHARD_VERSION + header[2] = int(arr.size) + with out_path.open("wb") as fh: + fh.write(header.tobytes()) + fh.write(arr.tobytes()) + + +def _iter_docs(docs_path: pathlib.Path): + """Yield doc strings from a jsonl file (one json object per line).""" + with docs_path.open("r", encoding="utf-8") as fh: + for line in fh: + line = line.strip() + if not line: + continue + obj = json.loads(line) + # Support both {"text": ...} and raw strings. + yield obj["text"] if isinstance(obj, dict) else obj + + +def _encode_with_original_byte_counts( + sp: spm.SentencePieceProcessor, text: str +) -> tuple[np.ndarray, np.ndarray]: + """Match the original CaseOps exporter exactly. + + The original PR #1729 export path tokenized via + ``encode_as_immutable_proto`` and computed canonical byte counts from the + exact piece surfaces using ``surface_piece_original_byte_counts``. Reuse + that logic here so the rebuilt validation sidecar matches the true + CaseOps dataset format byte-for-byte. + """ + transformed = encode_lossless_caps_v2(text) + proto = sp.encode_as_immutable_proto(transformed) + token_ids = np.fromiter((piece.id for piece in proto.pieces), dtype=np.int32) + byte_counts = np.asarray( + surface_piece_original_byte_counts( + (piece.surface for piece in proto.pieces), + text_transform_name=LOSSLESS_CAPS_CASEOPS_V1, + ), + dtype=np.uint16, + ) + if token_ids.shape[0] != byte_counts.shape[0]: + raise ValueError("token id count and byte count length disagree") + return token_ids, byte_counts + + +def main() -> None: + ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument("--docs", required=True, type=pathlib.Path, help="Path to docs_selected.jsonl") + ap.add_argument("--out", required=True, type=pathlib.Path, help="Output datasets dir") + ap.add_argument("--sp", required=True, type=pathlib.Path, help="Path to CaseOps SP model") + ap.add_argument("--val-docs", type=int, default=50_000, help="Validation docs count") + args = ap.parse_args() + + sp = spm.SentencePieceProcessor(model_file=str(args.sp)) + bos_id = int(sp.bos_id()) + if bos_id < 0: + raise ValueError("tokenizer must define a valid bos_id") + print(f"loaded sp: vocab={sp.vocab_size()} bos_id={bos_id}", flush=True) + + train_out = args.out / "datasets" / "fineweb10B_sp8192_lossless_caps_caseops_v1_reserved" + train_out.mkdir(parents=True, exist_ok=True) + + val_buf_tokens: list[int] = [] + val_buf_bytes: list[int] = [] + train_buf: list[int] = [] + val_written = 0 + train_written = 0 + n_docs = 0 + + for text in _iter_docs(args.docs): + piece_ids, piece_byte_counts = _encode_with_original_byte_counts(sp, text) + token_ids = np.empty(piece_ids.size + 1, dtype=np.int32) + token_ids[0] = bos_id + token_ids[1:] = piece_ids + if n_docs < args.val_docs: + # Validation doc — also compute byte sidecar + if piece_byte_counts.shape[0] != piece_ids.shape[0]: + raise ValueError("token id count and original 
byte count length disagree") + byte_counts = np.zeros(token_ids.shape[0], dtype=np.int32) + byte_counts[1:] = piece_byte_counts.astype(np.int32, copy=False) + val_buf_tokens.extend(int(t) for t in token_ids) + val_buf_bytes.extend(int(b) for b in byte_counts) + if len(val_buf_tokens) >= SHARD_TOKENS: + _write_shard(train_out / f"fineweb_val_{val_written:06d}.bin", + np.array(val_buf_tokens[:SHARD_TOKENS], dtype=np.uint16)) + _write_shard(train_out / f"fineweb_val_bytes_{val_written:06d}.bin", + np.array(val_buf_bytes[:SHARD_TOKENS], dtype=np.uint16)) + val_buf_tokens = val_buf_tokens[SHARD_TOKENS:] + val_buf_bytes = val_buf_bytes[SHARD_TOKENS:] + val_written += 1 + else: + train_buf.extend(int(t) for t in token_ids) + if len(train_buf) >= SHARD_TOKENS: + _write_shard(train_out / f"fineweb_train_{train_written:06d}.bin", + np.array(train_buf[:SHARD_TOKENS], dtype=np.uint16)) + train_buf = train_buf[SHARD_TOKENS:] + train_written += 1 + n_docs += 1 + if n_docs % 10_000 == 0: + print(f" processed {n_docs} docs train_shards={train_written} val_shards={val_written}", flush=True) + + # Flush tail buffers into final (possibly short) shards. + if val_buf_tokens: + _write_shard(train_out / f"fineweb_val_{val_written:06d}.bin", + np.array(val_buf_tokens, dtype=np.uint16)) + _write_shard(train_out / f"fineweb_val_bytes_{val_written:06d}.bin", + np.array(val_buf_bytes, dtype=np.uint16)) + if train_buf: + _write_shard(train_out / f"fineweb_train_{train_written:06d}.bin", + np.array(train_buf, dtype=np.uint16)) + + print(f"done. docs={n_docs} train_shards={train_written + (1 if train_buf else 0)} val_shards={val_written + (1 if val_buf_tokens else 0)}") + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/submission.json b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/submission.json new file mode 100644 index 0000000000..03090b7ec0 --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/submission.json @@ -0,0 +1,23 @@ +{ + "author": "dexhunter", + "github_id": "dexhunter", + "name": "CaseOps Tokenizer + Recurrence Depth Curriculum + Base Arch Stack", + "blurb": "Combines the inherited SP8192 base architecture stack with the CaseOps tokenizer / original-byte sidecar path, and adds a deterministic 1->3->4 recurrence-depth curriculum after loop activation. Eval and phased TTT run at fixed depth 4; BPB is still scored on original pre-transform UTF-8 bytes.", + "date": "2026-04-20", + "track": "10min_16mb", + "val_loss": 2.33073, + "val_bpb": 1.06505, + "val_bpb_std": 0.00081, + "val_loss_std": 0.00178, + "seeds": [42, 0, 1234], + "seed_results": { + "42": {"val_loss": 2.33108, "val_bpb": 1.06521, "artifact_bytes": 15986579, "steps": 4603}, + "0": {"val_loss": 2.32880, "val_bpb": 1.06417, "artifact_bytes": 15984426, "steps": 4599}, + "1234": {"val_loss": 2.33231, "val_bpb": 1.06578, "artifact_bytes": 15982914, "steps": 4604} + }, + "artifact_bytes_mean": 15984640, + "train_time_s_mean": 596.13, + "eval_time_s_mean": 484.98, + "hardware": "8xH100 80GB SXM", + "reproducibility_notes": "Run prepare_caseops_data.py once to tokenize the CaseOps-transformed FineWeb into the expected shards and validation byte sidecar, then run train_gpt.py per seed as documented in README.md with TRAIN_LOOP_PHASE_DEPTHS=1,3,4, TRAIN_LOOP_PREWARM_DEPTHS=3,4, and EVAL_LOOP_DEPTH=4." 
+} diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model new file mode 100644 index 0000000000..fffc8bb306 Binary files /dev/null and b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model differ diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_gpt.py b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_gpt.py new file mode 100644 index 0000000000..1dac1be6aa --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_gpt.py @@ -0,0 +1,3374 @@ +import base64, collections, copy, fcntl, glob, io, lzma, math, os +from pathlib import Path +import random, re, subprocess, sys, time, uuid, numpy as np, sentencepiece as spm, torch, torch.distributed as dist, torch.nn.functional as F +from torch import nn +from flash_attn_interface import ( + flash_attn_func as flash_attn_3_func, + flash_attn_varlen_func, +) +from concurrent.futures import ThreadPoolExecutor +import triton +import triton.language as tl +from triton.tools.tensor_descriptor import TensorDescriptor + + +def _parse_loop_depth_list(raw_value): + raw_value = raw_value.strip() + if not raw_value: + return [] + depths = [] + for part in raw_value.split(","): + part = part.strip() + if not part: + continue + depths.append(int(part)) + return depths + + +def _parse_float_list(raw_value): + raw_value = raw_value.strip() + if not raw_value: + return [] + vals = [] + for part in raw_value.split(","): + part = part.strip() + if not part: + continue + vals.append(float(part)) + return vals + + +class Hyperparameters: + data_dir = os.environ.get("DATA_DIR", "./data/") + seed = int(os.environ.get("SEED", 1337)) + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_frac = float(os.environ.get("WARMDOWN_FRAC", 0.75)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 6e2)) + val_batch_tokens = int(os.environ.get("VAL_BATCH_TOKENS", 524288)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + vocab_size = int(os.environ.get("VOCAB_SIZE", 8192)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 11)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + mlp_mult = float(os.environ.get("MLP_MULT", 4.0)) + skip_gates_enabled = bool(int(os.environ.get("SKIP_GATES_ENABLED", "1"))) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 3e1)) + rope_base = float(os.environ.get("ROPE_BASE", 1e4)) + rope_dims = 
int(os.environ.get("ROPE_DIMS", 16)) + rope_train_seq_len = int(os.environ.get("ROPE_TRAIN_SEQ_LEN", 2048)) + rope_yarn = bool(int(os.environ.get("ROPE_YARN", "0"))) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 5.0)) + num_loops = int(os.environ.get("NUM_LOOPS", 2)) + loop_start = int(os.environ.get("LOOP_START", 3)) + loop_end = int(os.environ.get("LOOP_END", 5)) + enable_looping_at = float(os.environ.get("ENABLE_LOOPING_AT", 0.35)) + # Recurrence depth is expressed as the number of passes through the looped + # segment (so depth=3 corresponds to NUM_LOOPS=2). Training can optionally + # sample depths from a small precompiled set while eval stays fixed. + train_loop_min_depth = int( + os.environ.get("TRAIN_LOOP_MIN_DEPTH", str(num_loops + 1)) + ) + train_loop_max_depth = int( + os.environ.get("TRAIN_LOOP_MAX_DEPTH", str(num_loops + 1)) + ) + train_loop_depth_dist = os.environ.get("TRAIN_LOOP_DEPTH_DIST", "fixed").lower() + train_loop_depth_set = _parse_loop_depth_list( + os.environ.get("TRAIN_LOOP_DEPTH_SET", "") + ) + train_loop_phase_depths = _parse_loop_depth_list( + os.environ.get("TRAIN_LOOP_PHASE_DEPTHS", "") + ) + train_loop_phase_fractions = _parse_float_list( + os.environ.get("TRAIN_LOOP_PHASE_FRACTIONS", "") + ) + train_loop_prewarm_depths = _parse_loop_depth_list( + os.environ.get("TRAIN_LOOP_PREWARM_DEPTHS", "") + ) + eval_loop_depth = int(os.environ.get("EVAL_LOOP_DEPTH", str(num_loops + 1))) + parallel_start_layer = int(os.environ.get("PARALLEL_START_LAYER", 8)) + parallel_final_lane = os.environ.get("PARALLEL_FINAL_LANE", "mean") + min_lr = float(os.environ.get("MIN_LR", 0.0)) + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.03)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.026)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.02)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.97)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float( + os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92) + ) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + muon_row_normalize = bool(int(os.environ.get("MUON_ROW_NORMALIZE", "1"))) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-08)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + adam_wd = float(os.environ.get("ADAM_WD", 0.02)) + muon_wd = float(os.environ.get("MUON_WD", 0.095)) + embed_wd = float(os.environ.get("EMBED_WD", 0.085)) + ema_decay = float(os.environ.get("EMA_DECAY", 0.9965)) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1"))) + ttt_lora_rank = int(os.environ.get("TTT_LORA_RANK", 96)) + ttt_lora_lr = float(os.environ.get("TTT_LORA_LR", 0.0001)) + ttt_chunk_size = int(os.environ.get("TTT_CHUNK_SIZE", 48)) + ttt_eval_seq_len = int(os.environ.get("TTT_EVAL_SEQ_LEN", 2048)) + ttt_batch_size = int(os.environ.get("TTT_BATCH_SIZE", 64)) + ttt_grad_steps = int(os.environ.get("TTT_GRAD_STEPS", 1)) + ttt_weight_decay = float(os.environ.get("TTT_WEIGHT_DECAY", 0.5)) + ttt_beta1 = float(os.environ.get("TTT_BETA1", 0)) + ttt_beta2 = float(os.environ.get("TTT_BETA2", 0.999)) + ttt_k_lora = bool(int(os.environ.get("TTT_K_LORA", "1"))) + ttt_mlp_lora = 
bool(int(os.environ.get("TTT_MLP_LORA", "1"))) + ttt_o_lora = bool(int(os.environ.get("TTT_O_LORA", "1"))) + ttt_optimizer = os.environ.get("TTT_OPTIMIZER", "adam") + ttt_eval_batches = os.environ.get("TTT_EVAL_BATCHES", "") + val_doc_fraction = float(os.environ.get("VAL_DOC_FRACTION", 1.0)) + compressor = os.environ.get("COMPRESSOR", "brotli") + gptq_calibration_batches = int(os.environ.get("GPTQ_CALIBRATION_BATCHES", 16)) + gptq_reserve_seconds = float(os.environ.get("GPTQ_RESERVE_SECONDS", 4.0)) + phased_ttt_prefix_docs = int(os.environ.get("PHASED_TTT_PREFIX_DOCS", 2000)) + phased_ttt_num_phases = int(os.environ.get("PHASED_TTT_NUM_PHASES", 1)) + global_ttt_lr = float(os.environ.get("GLOBAL_TTT_LR", 0.001)) + global_ttt_momentum = float(os.environ.get("GLOBAL_TTT_MOMENTUM", 0.9)) + global_ttt_epochs = int(os.environ.get("GLOBAL_TTT_EPOCHS", 1)) + global_ttt_chunk_tokens = int(os.environ.get("GLOBAL_TTT_CHUNK_TOKENS", 32768)) + global_ttt_batch_seqs = int(os.environ.get("GLOBAL_TTT_BATCH_SEQS", 32)) + global_ttt_warmup_start_lr = float(os.environ.get("GLOBAL_TTT_WARMUP_START_LR", 0.0)) + global_ttt_warmup_chunks = int(os.environ.get("GLOBAL_TTT_WARMUP_CHUNKS", 0)) + global_ttt_grad_clip = float(os.environ.get("GLOBAL_TTT_GRAD_CLIP", 1.0)) + global_ttt_respect_doc_boundaries = bool(int(os.environ.get("GLOBAL_TTT_RESPECT_DOC_BOUNDARIES", "1"))) + matrix_bits = int(os.environ.get("MATRIX_BITS", 6)) + embed_bits = int(os.environ.get("EMBED_BITS", 8)) + matrix_clip_sigmas = float(os.environ.get("MATRIX_CLIP_SIGMAS", 12.85)) + embed_clip_sigmas = float(os.environ.get("EMBED_CLIP_SIGMAS", 2e1)) + mlp_clip_sigmas = float(os.environ.get("MLP_CLIP_SIGMAS", 10.0)) + attn_clip_sigmas = float(os.environ.get("ATTN_CLIP_SIGMAS", 13.0)) + # AttnOutGate (per-head multiplicative output gate, PR #1667 MarioPaerle). + # Zero-init weight: 2*sigmoid(0)=1 -> transparent at start. Source defaults to + # block input x ('proj'); 'q' uses raw Q projection output. + attn_out_gate_enabled = bool(int(os.environ.get("ATTN_OUT_GATE_ENABLED", "0"))) + attn_out_gate_src = os.environ.get("ATTN_OUT_GATE_SRC", "proj") + # SmearGate (input-dependent forward-1 token smear, modded-nanogpt @classiclarryd + # via PR #1667). x_t <- x_t + lam * sigmoid(W*x_t[:gate_window]) * x_{t-1}. + # lam=0 + W=0 -> transparent at init. + smear_gate_enabled = bool(int(os.environ.get("SMEAR_GATE_ENABLED", "0"))) + # Window: first GATE_WINDOW dims of the source feed the gate projection. + gate_window = int(os.environ.get("GATE_WINDOW", 12)) + # Gated Attention (Qwen, NeurIPS 2025 Best Paper, arXiv:2505.06708; + # qiuzh20/gated_attention). Per-head sigmoid gate on SDPA output, BEFORE + # out_proj. Gate input = full block input x (paper's headwise G1 variant + # driven from hidden_states). W_g shape (num_heads, dim), plain sigmoid. + # Near-zero init gives g~0.5 at step 0 (half attention output); per-block + # attn_scale (init 1.0) compensates during training. Name contains + # "attn_gate" so CONTROL_TENSOR_NAME_PATTERNS routes it to scalar AdamW. + gated_attn_enabled = bool(int(os.environ.get("GATED_ATTN_ENABLED", "0"))) + gated_attn_init_std = float(os.environ.get("GATED_ATTN_INIT_STD", 0.01)) + # Dedicated int8-per-row quantization for `attn_gate_w` tensors. These are + # small ((num_heads, dim) = (8, 512) = 4096 params) and bypass GPTQ via the + # numel<=65536 passthrough branch -> stored as fp16 (8 KB/layer, ~65 KB total + # compressed). 
int8-per-row cuts the raw tensor in half with negligible BPB + # impact: scales per head (8 values), symmetric quant over [-127, 127]. + # No Hessian needed (gate weights not in collect_hessians()). + gated_attn_quant_gate = bool(int(os.environ.get("GATED_ATTN_QUANT_GATE", "0"))) + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + is_main_process = rank == 0 + grad_accum_steps = 8 // world_size + # CaseOps integration: optional override of dataset root + tokenizer path. + # When CASEOPS_ENABLED=1, the wrapper loads a per-token byte sidecar + # (fineweb_val_bytes_*.bin, identical shard layout to val_*.bin) and uses + # it as the canonical raw-byte budget for BPB accounting. The sidecar + # REPLACES the build_sentencepiece_luts byte-counting path entirely. + caseops_enabled = bool(int(os.environ.get("CASEOPS_ENABLED", "0"))) + _default_caseops_data = os.path.join( + data_dir, + "datasets", + "fineweb10B_sp8192_caseops", + "datasets", + "datasets", + "fineweb10B_sp8192_lossless_caps_caseops_v1_reserved", + ) + _default_caseops_tok = os.path.join( + data_dir, + "datasets", + "fineweb10B_sp8192_caseops", + "datasets", + "tokenizers", + "fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model", + ) + if caseops_enabled: + datasets_dir = os.environ.get("DATA_PATH", _default_caseops_data) + tokenizer_path = os.environ.get("TOKENIZER_PATH", _default_caseops_tok) + else: + datasets_dir = os.environ.get( + "DATA_PATH", + os.path.join(data_dir, "datasets", f"fineweb10B_sp{vocab_size}"), + ) + tokenizer_path = os.environ.get( + "TOKENIZER_PATH", + os.path.join(data_dir, "tokenizers", f"fineweb_{vocab_size}_bpe.model"), + ) + train_files = os.path.join(datasets_dir, "fineweb_train_*.bin") + # Keep the validation-token glob disjoint from the byte-sidecar files + # (`fineweb_val_bytes_*.bin`) used by CaseOps scoring. 
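+    # For example, `fineweb_val_000000.bin` matches the six-digit pattern
+    # below, while `fineweb_val_bytes_000000.bin` does not.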
+ val_files = os.path.join(datasets_dir, "fineweb_val_[0-9][0-9][0-9][0-9][0-9][0-9].bin") + val_bytes_files = os.path.join(datasets_dir, "fineweb_val_bytes_*.bin") + artifact_dir = os.environ.get("ARTIFACT_DIR", "") + logfile = ( + os.path.join(artifact_dir, f"{run_id}.txt") + if artifact_dir + else f"logs/{run_id}.txt" + ) + model_path = ( + os.path.join(artifact_dir, "final_model.pt") + if artifact_dir + else "final_model.pt" + ) + quantized_model_path = ( + os.path.join(artifact_dir, "final_model.int6.ptz") + if artifact_dir + else "final_model.int6.ptz" + ) + + +_logger_hparams = None + + +def set_logging_hparams(h): + global _logger_hparams + _logger_hparams = h + + +def log(msg, console=True): + if _logger_hparams is None: + print(msg) + return + if _logger_hparams.is_main_process: + if console: + print(msg) + if _logger_hparams.logfile is not None: + with open(_logger_hparams.logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + + +def _depth_to_repeats(depth): + return max(int(depth) - 1, 0) + + +def _repeats_to_depth(repeats): + return max(int(repeats), 0) + 1 + + +def _loop_depth_weights(min_depth, max_depth, dist_name, center_depth): + depths = list(range(min_depth, max_depth + 1)) + if not depths: + return [center_depth], [1] + if dist_name == "fixed" or min_depth == max_depth: + fixed_depth = min(max(center_depth, min_depth), max_depth) + return [fixed_depth], [1] + if dist_name == "uniform": + return depths, [1] * len(depths) + if dist_name in {"triangular", "bell"}: + center = min(max(center_depth, min_depth), max_depth) + weights = [ + max_depth - min_depth + 1 - abs(depth - center) + for depth in depths + ] + return depths, weights + raise ValueError( + f"unsupported TRAIN_LOOP_DEPTH_DIST={dist_name!r}; expected fixed, uniform, or triangular" + ) + + +class ValidationData: + def __init__(self, h, device): + self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) + if int(self.sp.vocab_size()) != h.vocab_size: + raise ValueError( + f"VOCAB_SIZE={h.vocab_size} does not match tokenizer vocab_size={int(self.sp.vocab_size())}" + ) + self.val_tokens = load_validation_tokens(h.val_files, h.eval_seq_len) + ( + self.base_bytes_lut, + self.has_leading_space_lut, + self.is_boundary_token_lut, + ) = build_sentencepiece_luts(self.sp, h.vocab_size, device) + # CaseOps: when enabled, load per-token byte sidecar and stash it as a + # CPU tensor aligned 1:1 with self.val_tokens. eval_val/eval_val_ttt + # branches use this as the canonical raw-byte budget per token. 
+ self.caseops_enabled = bool(getattr(h, "caseops_enabled", False)) + self.val_bytes = None + if self.caseops_enabled: + self.val_bytes = load_validation_byte_sidecar( + h.val_bytes_files, h.eval_seq_len, self.val_tokens.numel() + ) + + +def build_sentencepiece_luts(sp, vocab_size, device): + sp_vocab_size = int(sp.vocab_size()) + assert ( + sp.piece_to_id("▁") != sp.unk_id() + ), "Tokenizer must have '▁' (space) as its own token for correct BPB byte counting" + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("▁"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) + + +def load_validation_tokens(pattern, seq_len): + # Filter out CaseOps byte sidecar shards which share the val_*.bin glob. + files = [ + Path(p) + for p in sorted(glob.glob(pattern)) + if "_bytes_" not in Path(p).name + ] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = (tokens.numel() - 1) // seq_len * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] + + +def load_validation_byte_sidecar(pattern, seq_len, expected_len): + """Load CaseOps per-token byte sidecar(s). Same shard layout as token shards + (256 int32 header + uint16 array). Each entry = canonical raw-text byte + budget for that token in the corresponding val shard. Returns a CPU + int16 tensor sliced to match expected_len (i.e. val_tokens length).""" + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No byte sidecar files for pattern: {pattern}") + shards = [load_data_shard(file) for file in files] + # load_data_shard returns uint16 — that's exactly what the sidecar stores. 
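+    # Each per-token byte count is small (fits in uint16); the tensor is widened
+    # to int32 on return so summing byte budgets over long validation spans
+    # cannot overflow a 16-bit accumulator.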
+ bytes_full = torch.cat(shards).contiguous() + if bytes_full.numel() < expected_len: + raise ValueError( + f"Byte sidecar too short: {bytes_full.numel()} < val_tokens {expected_len}" + ) + return bytes_full[:expected_len].to(torch.int32) + + +def load_data_shard(file): + header_bytes = 256 * np.dtype(" 0: + pos = start + while pos < end: + seg_starts.append(pos) + pos += max_doc_len + else: + seg_starts.append(start) + boundaries = seg_starts + [total_len] + padded_len = get_next_multiple_of_n(len(boundaries), bucket_size) + cu = torch.full((padded_len,), total_len, dtype=torch.int32, device=device) + cu[: len(boundaries)] = torch.tensor(boundaries, dtype=torch.int32, device=device) + seg_ends = seg_starts[1:] + [total_len] + max_seqlen = max(end - start for start, end in zip(seg_starts, seg_ends)) + return cu, max_seqlen + +class DocumentPackingLoader: + _shard_pool = ThreadPoolExecutor(1) + + def __init__(self, h, device, cu_bucket_size=64): + self.rank = h.rank + self.world_size = h.world_size + self.device = device + self.cu_bucket_size = cu_bucket_size + self.max_seq_len = h.train_seq_len + all_files = [Path(p) for p in sorted(glob.glob(h.train_files))] + if not all_files: + raise FileNotFoundError(f"No files found for pattern: {h.train_files}") + self.files = all_files + self.file_iter = iter(self.files) + self._init_shard(load_data_shard(next(self.file_iter))) + self._next_shard = self._submit_next_shard() + self._batch_pool = ThreadPoolExecutor(1) + self._next_batch = None + + def _init_shard(self, tokens): + global BOS_ID + self.tokens = tokens + self.shard_size = tokens.numel() + if BOS_ID is None: + BOS_ID = 1 + self.bos_idx = ( + (tokens == BOS_ID).nonzero(as_tuple=True)[0].to(torch.int64).cpu().numpy() + ) + if self.bos_idx.size == 0: + self.bos_idx = np.array([0], dtype=np.int64) + self.cursor = int(self.bos_idx[0]) + + def _submit_next_shard(self): + try: + path = next(self.file_iter) + return self._shard_pool.submit(load_data_shard, path) + except StopIteration: + return None + + def _advance_shard(self): + if self._next_shard is None: + self.file_iter = iter(self.files) + self._next_shard = self._shard_pool.submit( + load_data_shard, next(self.file_iter) + ) + self._init_shard(self._next_shard.result()) + self._next_shard = self._submit_next_shard() + + def _local_doc_starts(self, local_start, total_len): + lo = np.searchsorted(self.bos_idx, local_start, side="left") + hi = np.searchsorted(self.bos_idx, local_start + total_len, side="left") + return (self.bos_idx[lo:hi] - local_start).tolist() + + def _prepare_batch(self, num_tokens_local, max_seq_len): + per_rank_span = num_tokens_local + 1 + global_span = per_rank_span * self.world_size + while self.cursor + global_span > self.shard_size: + self._advance_shard() + local_start = self.cursor + self.rank * per_rank_span + buf = self.tokens[local_start : local_start + per_rank_span] + inputs = buf[:-1].to(dtype=torch.int64).pin_memory() + targets = buf[1:].to(dtype=torch.int64).pin_memory() + starts = self._local_doc_starts(local_start, inputs.numel()) + cu_seqlens, max_seqlen = _build_cu_seqlens( + starts, inputs.numel(), inputs.device, max_seq_len, self.cu_bucket_size + ) + cu_seqlens = cu_seqlens.pin_memory() + self.cursor += global_span + return inputs, targets, cu_seqlens, max_seqlen + + def next_batch(self, global_tokens, grad_accum_steps): + num_tokens_local = global_tokens // (self.world_size * grad_accum_steps) + if self._next_batch is not None: + inputs, targets, cu_seqlens, max_seqlen = self._next_batch.result() 
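+        # First call: nothing has been prefetched yet (_next_batch starts as None),
+        # so fall through and build this batch synchronously before kicking off the
+        # background prefetch for the next call below.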
+ else: + inputs, targets, cu_seqlens, max_seqlen = self._prepare_batch( + num_tokens_local, self.max_seq_len + ) + self._next_batch = self._batch_pool.submit( + self._prepare_batch, num_tokens_local, self.max_seq_len + ) + return ( + inputs[None].to(self.device, non_blocking=True), + targets[None].to(self.device, non_blocking=True), + cu_seqlens.to(self.device, non_blocking=True), + max_seqlen, + ) + + +class ShuffledSequenceLoader: + def __init__(self, h, device): + self.world_size = h.world_size + self.seq_len = h.train_seq_len + self.device = device + all_files = [Path(p) for p in sorted(glob.glob(h.train_files))] + if not all_files: + raise FileNotFoundError(f"No files found for pattern: {h.train_files}") + self.files = all_files[h.rank :: h.world_size] + self.rng = np.random.Generator(np.random.PCG64(h.rank)) + self.num_tokens = [_read_num_tokens(f) for f in self.files] + self.start_inds = [[] for _ in self.files] + for si in range(len(self.files)): + self._reset_shard(si) + + def _reset_shard(self, si): + max_phase = min( + self.seq_len - 1, max(0, self.num_tokens[si] - self.seq_len - 1) + ) + phase = int(self.rng.integers(max_phase + 1)) if max_phase > 0 else 0 + num_sequences = (self.num_tokens[si] - 1 - phase) // self.seq_len + sequence_order = self.rng.permutation(num_sequences) + self.start_inds[si] = (phase + sequence_order * self.seq_len).tolist() + + def next_batch(self, global_tokens, grad_accum_steps): + device_tokens = global_tokens // (self.world_size * grad_accum_steps) + device_batch_size = device_tokens // self.seq_len + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + x = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + y = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + for bi in range(device_batch_size): + total = remaining.sum() + if total <= 0: + for si in range(len(self.files)): + self._reset_shard(si) + remaining = np.array( + [len(s) for s in self.start_inds], dtype=np.float64 + ) + total = remaining.sum() + probs = remaining / total + si = int(self.rng.choice(len(self.files), p=probs)) + start_ind = self.start_inds[si].pop() + remaining[si] -= 1 + mm = _get_shard_memmap(self.files[si]) + window = torch.as_tensor( + np.array(mm[start_ind : start_ind + self.seq_len + 1], dtype=np.int64) + ) + x[bi] = window[:-1] + y[bi] = window[1:] + return x.to(self.device, non_blocking=True), y.to( + self.device, non_blocking=True + ) + + +class RMSNorm(nn.Module): + def __init__(self, eps=None): + super().__init__() + self.eps = eps + + def forward(self, x): + return F.rms_norm(x, (x.size(-1),), eps=self.eps) + + +class CastedLinear(nn.Linear): + def forward(self, x): + w = self.weight.to(x.dtype) + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) + + +@triton.jit +def linear_leaky_relu_square_kernel( + a_desc, + b_desc, + c_desc, + aux_desc, + M, + N, + K, + BLOCK_SIZE_M: tl.constexpr, + BLOCK_SIZE_N: tl.constexpr, + BLOCK_SIZE_K: tl.constexpr, + NUM_SMS: tl.constexpr, + FORWARD: tl.constexpr, +): + dtype = tl.bfloat16 + start_pid = tl.program_id(axis=0) + num_pid_m = tl.cdiv(M, BLOCK_SIZE_M) + num_pid_n = tl.cdiv(N, BLOCK_SIZE_N) + k_tiles = tl.cdiv(K, BLOCK_SIZE_K) + num_tiles = num_pid_m * num_pid_n + tile_id_c = start_pid - NUM_SMS + for tile_id in tl.range(start_pid, num_tiles, NUM_SMS, flatten=True): + pid_m = tile_id // num_pid_n + pid_n = tile_id % num_pid_n + offs_am = pid_m * BLOCK_SIZE_M + offs_bn = pid_n * BLOCK_SIZE_N + accumulator = 
tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + for ki in range(k_tiles): + offs_k = ki * BLOCK_SIZE_K + a = a_desc.load([offs_am, offs_k]) + b = b_desc.load([offs_bn, offs_k]) + accumulator = tl.dot(a, b.T, accumulator) + tile_id_c += NUM_SMS + offs_am_c = offs_am + offs_bn_c = offs_bn + acc = tl.reshape(accumulator, (BLOCK_SIZE_M, 2, BLOCK_SIZE_N // 2)) + acc = tl.permute(acc, (0, 2, 1)) + acc0, acc1 = tl.split(acc) + c0 = acc0.to(dtype) + c1 = acc1.to(dtype) + if not FORWARD: + pre0 = aux_desc.load([offs_am_c, offs_bn_c]) + pre1 = aux_desc.load([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2]) + c0 = c0 * tl.where(pre0 > 0, 2.0 * pre0, 0.5 * pre0) + c1 = c1 * tl.where(pre1 > 0, 2.0 * pre1, 0.5 * pre1) + c_desc.store([offs_am_c, offs_bn_c], c0) + c_desc.store([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2], c1) + if FORWARD: + aux0 = tl.where(c0 > 0, c0, 0.5 * c0) + aux1 = tl.where(c1 > 0, c1, 0.5 * c1) + aux_desc.store([offs_am_c, offs_bn_c], aux0 * aux0) + aux_desc.store([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2], aux1 * aux1) + + +def linear_leaky_relu_square(a, b, aux=None): + M, K = a.shape + N, K2 = b.shape + assert K == K2 + c = torch.empty((M, N), device=a.device, dtype=a.dtype) + forward = aux is None + if aux is None: + aux = torch.empty((M, N), device=a.device, dtype=a.dtype) + num_sms = torch.cuda.get_device_properties(a.device).multi_processor_count + BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K = 128, 256, 64 + num_stages = 4 if forward else 3 + a_desc = TensorDescriptor.from_tensor(a, [BLOCK_SIZE_M, BLOCK_SIZE_K]) + b_desc = TensorDescriptor.from_tensor(b, [BLOCK_SIZE_N, BLOCK_SIZE_K]) + c_desc = TensorDescriptor.from_tensor(c, [BLOCK_SIZE_M, BLOCK_SIZE_N // 2]) + aux_desc = TensorDescriptor.from_tensor(aux, [BLOCK_SIZE_M, BLOCK_SIZE_N // 2]) + grid = lambda _meta: ( + min(num_sms, triton.cdiv(M, BLOCK_SIZE_M) * triton.cdiv(N, BLOCK_SIZE_N)), + ) + linear_leaky_relu_square_kernel[grid]( + a_desc, + b_desc, + c_desc, + aux_desc, + M, + N, + K, + BLOCK_SIZE_M=BLOCK_SIZE_M, + BLOCK_SIZE_N=BLOCK_SIZE_N, + BLOCK_SIZE_K=BLOCK_SIZE_K, + NUM_SMS=num_sms, + FORWARD=forward, + num_stages=num_stages, + num_warps=8, + ) + if forward: + return c, aux + return c + + +class FusedLinearLeakyReLUSquareFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, x, w1, w2): + x_flat = x.reshape(-1, x.shape[-1]) + pre, post = linear_leaky_relu_square(x_flat, w1) + out = F.linear(post, w2) + ctx.save_for_backward(x, w1, w2, pre, post) + return out.view(*x.shape[:-1], out.shape[-1]) + + @staticmethod + def backward(ctx, grad_output): + x, w1, w2, pre, post = ctx.saved_tensors + x_flat = x.reshape(-1, x.shape[-1]) + grad_output_flat = grad_output.reshape(-1, grad_output.shape[-1]) + dw2 = grad_output_flat.T @ post + dpre = linear_leaky_relu_square(grad_output_flat, w2.T.contiguous(), aux=pre) + dw1 = dpre.T @ x_flat + dx = dpre @ w1 + return dx.view_as(x), dw1, dw2 + + +FusedLeakyReLUSquareMLP = FusedLinearLeakyReLUSquareFunction.apply + + +class Rotary(nn.Module): + def __init__(self, dim, base=1e4, train_seq_len=1024, rope_dims=0, yarn=True): + super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.yarn = yarn + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / base ** ( + torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims + ) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached = None + self._sin_cached = None + + def forward(self, seq_len, 
device, dtype): + if ( + self._cos_cached is None + or self._sin_cached is None + or self._seq_len_cached < seq_len + or self._cos_cached.device != device + ): + rd = self.rope_dims + if self.yarn and seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * scale ** (rd / (rd - 2)) + inv_freq = 1.0 / new_base ** ( + torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd + ) + else: + inv_freq = self.inv_freq.float().to(device) + t = torch.arange(seq_len, device=device, dtype=torch.float32) + freqs = torch.outer(t, inv_freq) + self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return self._cos_cached[:, :seq_len].to(dtype=dtype), self._sin_cached[:, :seq_len].to(dtype=dtype) + + +def apply_rotary_emb(x, cos, sin, rope_dims=0): + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * -sin + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * -sin + x2 * cos), dim=-1) + + +class CausalSelfAttention(nn.Module): + def __init__( + self, dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len, yarn=True, + attn_out_gate=False, attn_out_gate_src="proj", gate_window=12, + gated_attn=False, gated_attn_init_std=0.01, + ): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + self.q_gain = nn.Parameter( + torch.full((num_heads,), qk_gain_init, dtype=torch.float32) + ) + self.rope_dims = 0 + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=train_seq_len, yarn=yarn) + self.use_xsa = False + # AttnOutGate (PR #1667 MarioPaerle): per-head multiplicative gate on attention + # output. CastedLinear so restore_fp32_params casts back to fp32 for GPTQ. + # _zero_init -> 2*sigmoid(0)=1 -> transparent at init. + self.attn_out_gate = attn_out_gate + self.attn_out_gate_src = attn_out_gate_src + self.gate_window = gate_window + if attn_out_gate: + self.attn_gate_proj = CastedLinear(gate_window, num_heads, bias=False) + self.attn_gate_proj._zero_init = True + # Gated Attention (arXiv:2505.06708, Qwen, NeurIPS 2025). Per-head sigmoid + # gate on SDPA output, BEFORE out_proj. Gate projection W_g: (num_heads, dim). + # Name "attn_gate_w" contains "attn_gate" substring so it matches + # CONTROL_TENSOR_NAME_PATTERNS and routes to the scalar AdamW group. + # fp32 Parameter -> restore_fp32_params path covers it via the ndim<2 OR + # name-pattern check (name matches "attn_gate"). Cast to x.dtype on use. 
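+        # Shape sketch (illustrative sizes, not this record's actual dims): with
+        # dim=1024 and num_heads=8, W_g is (8, 1024) = 8192 fp32 params per layer;
+        # g = sigmoid(x @ W_g.T) has shape (B, T, 8) and broadcasts over head_dim
+        # via g[..., None], so the gate costs one small GEMM per attention call.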
+ self.gated_attn = gated_attn + if gated_attn: + W = torch.empty(num_heads, dim, dtype=torch.float32) + nn.init.normal_(W, mean=0.0, std=gated_attn_init_std) + self.attn_gate_w = nn.Parameter(W) + + def _xsa_efficient(self, y, v): + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) + vn = F.normalize(v, dim=-1).unsqueeze(-2) + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + + def forward(self, x, q_w, k_w, v_w, out_w, cu_seqlens=None, max_seqlen=0): + bsz, seqlen, dim = x.shape + # q_raw kept around as a tap point for attn_out_gate_src='q' (post-projection, + # pre-reshape, pre-RoPE). + q_raw = F.linear(x, q_w.to(x.dtype)) + q = q_raw.reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = F.linear(x, v_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + if cu_seqlens is not None: + y = flash_attn_varlen_func( + q[0], + k[0], + v[0], + cu_seqlens_q=cu_seqlens, + cu_seqlens_k=cu_seqlens, + max_seqlen_q=max_seqlen, + max_seqlen_k=max_seqlen, + causal=True, + window_size=(-1, -1), + )[None] + else: + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + # AttnOutGate inlined (PR #1667). Inline + .contiguous() barrier so torch.compile + # fullgraph=True is happy (this avoids the @torch.compiler.disable trap that + # crashed gates v3). Per-head gate on (B,T,H,D) tensor: g shape [B,T,H], broadcast + # over D via [..., None]. zero-init weight -> 2*sigmoid(0)=1 -> transparent. + if self.attn_out_gate: + gate_src = q_raw if self.attn_out_gate_src == "q" else x + gate_in = gate_src[..., : self.gate_window].contiguous() + g = 2.0 * torch.sigmoid(self.attn_gate_proj(gate_in)) + y = y * g[..., None] + # Gated Attention (arXiv:2505.06708 G1). Inline + .contiguous() barrier so + # torch.compile fullgraph=True is happy. Per-head gate on (B,T,H,D): g shape + # [B,T,H], broadcast over D via [..., None]. Paper: g = sigmoid(x @ W_g.T) + # where W_g: (H, dim). .to(x.dtype) on fp32 param before broadcast with bf16. 
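+        # Note: unlike the zero-init AttnOutGate above (which opens at exactly 1),
+        # this gate starts near sigmoid(0) = 0.5 when gated_attn_init_std is small,
+        # so at init it roughly halves the SDPA output rather than passing it through.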
+ if self.gated_attn: + x_c = x.contiguous() + g = torch.sigmoid(F.linear(x_c, self.attn_gate_w.to(x.dtype))) + y = y * g[..., None] + y = y.reshape(bsz, seqlen, dim) + self._last_proj_input = y.detach() if getattr(self, "_calib", False) else None + return F.linear(y, out_w.to(x.dtype)) + + +class MLP(nn.Module): + def __init__(self, dim, mlp_mult): + super().__init__() + self.use_fused = True + + def forward(self, x, up_w, down_w): + if self.training and self.use_fused: + return FusedLeakyReLUSquareMLP(x, up_w.to(x.dtype), down_w.to(x.dtype)) + hidden = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5).square() + self._last_down_input = hidden.detach() if getattr(self, "_calib", False) else None + return F.linear(hidden, down_w.to(x.dtype)) + + +class Block(nn.Module): + def __init__( + self, + dim, + num_heads, + num_kv_heads, + mlp_mult, + rope_base, + qk_gain_init, + train_seq_len, + layer_idx=0, + ln_scale=False, + yarn=True, + attn_out_gate=False, + attn_out_gate_src="proj", + gate_window=12, + gated_attn=False, + gated_attn_init_std=0.01, + ): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention( + dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len, yarn=yarn, + attn_out_gate=attn_out_gate, attn_out_gate_src=attn_out_gate_src, gate_window=gate_window, + gated_attn=gated_attn, gated_attn_init_std=gated_attn_init_std, + ) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter( + torch.stack((torch.ones(dim), torch.zeros(dim))).float() + ) + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + + def forward(self, x, x0, q_w, k_w, v_w, out_w, up_w, down_w, cu_seqlens=None, max_seqlen=0): + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out = self.attn( + self.attn_norm(x_in) * self.ln_scale_factor, + q_w, k_w, v_w, out_w, + cu_seqlens=cu_seqlens, + max_seqlen=max_seqlen, + ) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[ + None, None, : + ] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + return x_out + +class GPT(nn.Module): + def __init__(self, h): + super().__init__() + self.h = h + if h.logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {h.logit_softcap}") + self.tie_embeddings = h.tie_embeddings + self.tied_embed_init_std = h.tied_embed_init_std + self.logit_softcap = h.logit_softcap + self.tok_emb = nn.Embedding(h.vocab_size, h.model_dim) + self.num_layers = h.num_layers + head_dim = h.model_dim // h.num_heads + kv_dim = h.num_kv_heads * head_dim + hidden_dim = int(h.mlp_mult * h.model_dim) + self.qo_bank = nn.Parameter(torch.empty(2 * h.num_layers, h.model_dim, h.model_dim)) + self.kv_bank = nn.Parameter(torch.empty(2 * h.num_layers, kv_dim, h.model_dim)) + self.mlp_up_bank = nn.Parameter(torch.empty(h.num_layers, hidden_dim, h.model_dim)) + self.mlp_down_bank = nn.Parameter(torch.empty(h.num_layers, h.model_dim, hidden_dim)) + self.num_encoder_layers = h.num_layers // 2 + self.num_decoder_layers = h.num_layers - self.num_encoder_layers + self.blocks = nn.ModuleList( + [ + Block( + h.model_dim, + h.num_heads, + h.num_kv_heads, + h.mlp_mult, + h.rope_base, + h.qk_gain_init, + h.train_seq_len, + layer_idx=i, + 
ln_scale=h.ln_scale, + yarn=h.rope_yarn, + attn_out_gate=h.attn_out_gate_enabled, + attn_out_gate_src=h.attn_out_gate_src, + gate_window=h.gate_window, + gated_attn=h.gated_attn_enabled, + gated_attn_init_std=h.gated_attn_init_std, + ) + for i in range(h.num_layers) + ] + ) + if h.rope_dims > 0: + head_dim = h.model_dim // h.num_heads + for block in self.blocks: + block.attn.rope_dims = h.rope_dims + block.attn.rotary = Rotary( + head_dim, + base=h.rope_base, + train_seq_len=h.train_seq_len, + rope_dims=h.rope_dims, + yarn=h.rope_yarn, + ) + self.final_norm = RMSNorm() + self.lm_head = ( + None + if h.tie_embeddings + else CastedLinear(h.model_dim, h.vocab_size, bias=False) + ) + if self.lm_head is not None: + self.lm_head._zero_init = True + if h.xsa_last_n > 0: + for i in range(max(0, h.num_layers - h.xsa_last_n), h.num_layers): + self.blocks[i].attn.use_xsa = True + self.looping_active = False + self.base_encoder_indices = list(range(self.num_encoder_layers)) + self.base_decoder_indices = list(range(self.num_encoder_layers, h.num_layers)) + self.loop_index_cache = {} + max_loop_repeats = max( + _depth_to_repeats(h.num_loops + 1), + _depth_to_repeats(h.train_loop_max_depth), + _depth_to_repeats(h.eval_loop_depth), + ) + for repeats in range(max_loop_repeats + 1): + self.loop_index_cache[repeats] = self._build_loop_indices(repeats) + self.active_loop_repeats = _depth_to_repeats(h.num_loops + 1) + self.encoder_indices, self.decoder_indices = self.loop_index_cache[ + self.active_loop_repeats + ] + self.num_skip_weights = min( + len(self.encoder_indices), len(self.decoder_indices) + ) + self.skip_weights = nn.Parameter( + torch.ones(self.num_skip_weights, h.model_dim, dtype=torch.float32) + ) + self.skip_gates = ( + nn.Parameter( + torch.zeros(self.num_skip_weights, h.model_dim, dtype=torch.float32) + ) + if h.skip_gates_enabled + else None + ) + self.parallel_start_layer = h.parallel_start_layer + self.parallel_final_lane = h.parallel_final_lane.lower() + self.parallel_post_lambdas = nn.Parameter( + torch.ones(h.num_layers, 2, 2, dtype=torch.float32) + ) + self.parallel_resid_lambdas = nn.Parameter( + torch.full((h.num_layers, 2), 1.1, dtype=torch.float32) + ) + # SmearGate (PR #1667 / modded-nanogpt @classiclarryd): + # x_t <- x_t + lam * sigmoid(W * x_t[:gate_window]) * x_{t-1}. + # Per-token forward-1 smear of the embedding lane. W zero-init + lam=0 -> + # transparent at init. Uses CastedLinear so restore_fp32_params handles dtype. 
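+        # With both W and lambda at zero the smear term is exactly 0 at init;
+        # the gradient w.r.t. lambda is sigmoid(0) * x_{t-1}, which is nonzero,
+        # while the gradient w.r.t. W carries a factor of lambda, so lambda has to
+        # move off zero before the gate weights receive any signal.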
+ self.smear_gate_enabled = h.smear_gate_enabled + if self.smear_gate_enabled: + self.smear_window = h.gate_window + self.smear_gate = CastedLinear(self.smear_window, 1, bias=False) + self.smear_gate._zero_init = True + self.smear_lambda = nn.Parameter(torch.zeros(1, dtype=torch.float32)) + self._init_weights() + + def _build_loop_indices(self, repeats): + repeats = max(int(repeats), 0) + if repeats == 0: + return self.base_encoder_indices, self.base_decoder_indices + loop_seg = list(range(self.h.loop_start, self.h.loop_end + 1)) + all_indices = list(range(self.h.loop_start)) + for _ in range(repeats + 1): + all_indices.extend(loop_seg) + all_indices.extend(range(self.h.loop_end + 1, self.h.num_layers)) + num_enc = len(all_indices) // 2 + return all_indices[:num_enc], all_indices[num_enc:] + + def set_loop_repeats(self, repeats): + repeats = max(int(repeats), 0) + if repeats not in self.loop_index_cache: + self.loop_index_cache[repeats] = self._build_loop_indices(repeats) + self.active_loop_repeats = repeats + self.encoder_indices, self.decoder_indices = self.loop_index_cache[repeats] + + def _init_weights(self): + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + n = self.num_layers + proj_scale = 1.0 / math.sqrt(2 * n) + for i in range(n): + nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) + nn.init.zeros_(self.qo_bank.data[n + i]) + self.qo_bank.data[n + i].mul_(proj_scale) + nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) + nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) + for i in range(n): + nn.init.orthogonal_(self.mlp_up_bank.data[i], gain=1.0) + nn.init.zeros_(self.mlp_down_bank.data[i]) + self.mlp_down_bank.data[i].mul_(proj_scale) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif ( + module.weight.ndim == 2 + and module.weight.shape[0] >= 64 + and module.weight.shape[1] >= 64 + ): + nn.init.orthogonal_(module.weight, gain=1.0) + + def _bank_weights(self, i): + n = self.num_layers + return ( + self.qo_bank[i], + self.kv_bank[i], + self.kv_bank[n + i], + self.qo_bank[n + i], + self.mlp_up_bank[i], + self.mlp_down_bank[i], + ) + + def _parallel_block( + self, block_idx, lane0, lane1, x0, + q_w, k_w, v_w, out_w, up_w, down_w, + cu_seqlens=None, max_seqlen=0, + ): + block = self.blocks[block_idx] + mix = block.resid_mix.to(dtype=lane0.dtype) + attn_read = mix[0][None, None, :] * lane0 + mix[1][None, None, :] * x0 + attn_out = block.attn( + block.attn_norm(attn_read) * block.ln_scale_factor, + q_w, k_w, v_w, out_w, + cu_seqlens=cu_seqlens, max_seqlen=max_seqlen, + ) + attn_out = block.attn_scale.to(dtype=attn_out.dtype)[None, None, :] * attn_out + mlp_read = lane1 + mlp_out = block.mlp_scale.to(dtype=lane1.dtype)[None, None, :] * block.mlp( + block.mlp_norm(mlp_read) * block.ln_scale_factor, up_w, down_w + ) + attn_resid = self.parallel_resid_lambdas[block_idx, 0].to(dtype=lane0.dtype) + attn_post = self.parallel_post_lambdas[block_idx, 0].to(dtype=lane0.dtype) + mlp_resid = self.parallel_resid_lambdas[block_idx, 1].to(dtype=lane0.dtype) + mlp_post = self.parallel_post_lambdas[block_idx, 1].to(dtype=lane0.dtype) + lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out + lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out + return lane0, lane1 + + def _final_parallel_hidden(self, lane0, lane1): + if self.parallel_final_lane == "mlp": + return lane1 + if 
self.parallel_final_lane == "attn": + return lane0 + return 0.5 * (lane0 + lane1) + + def forward_logits(self, input_ids, cu_seqlens=None, max_seqlen=0): + x = self.tok_emb(input_ids) + # SmearGate (PR #1667). Inline gate compute with .contiguous() on the slice fed + # to the projection so torch.compile fullgraph is happy. lam=0 + W=0 -> identity + # at init. This block runs unconditionally on the smear path; the cat keeps + # position 0 untouched so causality holds. + if self.smear_gate_enabled: + sl = self.smear_lambda.to(dtype=x.dtype) + gate_in = x[:, 1:, : self.smear_window].contiguous() + g = sl * torch.sigmoid(self.smear_gate(gate_in)) + x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1]], dim=1) + x = F.rms_norm(x, (x.size(-1),)) + x0 = x + skips = [] + enc_iter = ( + self.encoder_indices + if self.looping_active + else range(self.num_encoder_layers) + ) + dec_iter = ( + self.decoder_indices + if self.looping_active + else range( + self.num_encoder_layers, + self.num_encoder_layers + self.num_decoder_layers, + ) + ) + for i in enc_iter: + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + x = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen) + skips.append(x) + psl = self.parallel_start_layer + lane0 = None + lane1 = None + for skip_idx, i in enumerate(dec_iter): + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + if i >= psl and psl > 0: + if lane0 is None: + lane0 = x + lane1 = x + if skip_idx < self.num_skip_weights and skips: + skip = skips.pop() + w = self.skip_weights[skip_idx].to(dtype=lane0.dtype)[None, None, :] + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=lane0.dtype))[None, None, :] + lane0 = torch.lerp(w * skip, lane0, g) + else: + lane0 = lane0 + w * skip + lane0, lane1 = self._parallel_block( + i, lane0, lane1, x0, q_w, k_w, v_w, out_w, up_w, down_w, + cu_seqlens=cu_seqlens, max_seqlen=max_seqlen, + ) + else: + if skip_idx < self.num_skip_weights and skips: + scaled_skip = ( + self.skip_weights[skip_idx].to(dtype=x.dtype)[None, None, :] + * skips.pop() + ) + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=x.dtype))[None, None, :] + x = torch.lerp(scaled_skip, x, g) + else: + x = x + scaled_skip + x = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen) + if lane0 is not None: + x = self._final_parallel_hidden(lane0, lane1) + x = self.final_norm(x) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + + def forward(self, input_ids, target_ids, cu_seqlens=None, max_seqlen=0): + logits = self.forward_logits( + input_ids, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen + ) + return F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + target_ids.reshape(-1), + reduction="mean", + ) + + def forward_ttt(self, input_ids, target_ids, lora): + x = self.tok_emb(input_ids) + # SmearGate on the TTT path — same inline compute as forward_logits. 
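+        # forward_ttt mirrors forward_logits block for block but threads (lora, slot)
+        # through every layer and returns per-position losses (reduction="none"),
+        # which is what lets phased TTT score each chunk before applying its update.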
+ if self.smear_gate_enabled: + sl = self.smear_lambda.to(dtype=x.dtype) + gate_in = x[:, 1:, : self.smear_window].contiguous() + g = sl * torch.sigmoid(self.smear_gate(gate_in)) + x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1]], dim=1) + x = F.rms_norm(x, (x.size(-1),)) + x0 = x + skips = [] + enc_iter = ( + self.encoder_indices + if self.looping_active + else list(range(self.num_encoder_layers)) + ) + dec_iter = ( + self.decoder_indices + if self.looping_active + else list( + range( + self.num_encoder_layers, + self.num_encoder_layers + self.num_decoder_layers, + ) + ) + ) + slot = 0 + for i in enc_iter: + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + x = self._block_with_lora(self.blocks[i], x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w) + slot += 1 + skips.append(x) + psl = self.parallel_start_layer + lane0 = None + lane1 = None + for skip_idx, i in enumerate(dec_iter): + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + if i >= psl and psl > 0: + if lane0 is None: + lane0 = x + lane1 = x + if skip_idx < self.num_skip_weights and skips: + skip = skips.pop() + w = self.skip_weights[skip_idx].to(dtype=lane0.dtype)[None, None, :] + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=lane0.dtype))[None, None, :] + lane0 = torch.lerp(w * skip, lane0, g) + else: + lane0 = lane0 + w * skip + lane0, lane1 = self._parallel_block_with_lora( + i, lane0, lane1, x0, lora, slot, + q_w, k_w, v_w, out_w, up_w, down_w, + ) + else: + if skip_idx < self.num_skip_weights and skips: + scaled_skip = ( + self.skip_weights[skip_idx].to(dtype=x.dtype)[None, None, :] + * skips.pop() + ) + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=x.dtype))[None, None, :] + x = torch.lerp(scaled_skip, x, g) + else: + x = x + scaled_skip + x = self._block_with_lora(self.blocks[i], x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w) + slot += 1 + if lane0 is not None: + x = self._final_parallel_hidden(lane0, lane1) + x = self.final_norm(x) + if self.tie_embeddings: + logits = F.linear(x, self.tok_emb.weight) + else: + logits = self.lm_head(x) + logits = logits + lora.lm_head_lora(x) + logits = self.logit_softcap * torch.tanh(logits / self.logit_softcap) + bsz, sl, V = logits.shape + return F.cross_entropy( + logits.float().reshape(-1, V), target_ids.reshape(-1), reduction="none" + ).reshape(bsz, sl) + + def _block_with_lora(self, block, x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w): + mix = block.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + n = block.attn_norm(x_in) * block.ln_scale_factor + attn = block.attn + bsz, seqlen, dim = n.shape + # Keep raw Q for AttnOutGate src='q' (matches forward path semantics). 
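+        # Subtle difference vs. the plain forward path: here q_raw already includes
+        # the per-slot Q LoRA delta, so an attn_out_gate_src="q" gate on the TTT
+        # path sees the adapted query rather than the frozen base projection.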
+ q_raw = F.linear(n, q_w.to(n.dtype)) + lora.q_loras[slot](n) + q = q_raw.reshape(bsz, seqlen, attn.num_heads, attn.head_dim) + k = F.linear(n, k_w.to(n.dtype)) + if lora.k_loras is not None: + k = k + lora.k_loras[slot](n) + k = k.reshape(bsz, seqlen, attn.num_kv_heads, attn.head_dim) + v = (F.linear(n, v_w.to(n.dtype)) + lora.v_loras[slot](n)).reshape( + bsz, seqlen, attn.num_kv_heads, attn.head_dim + ) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = attn.rotary(seqlen, n.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, attn.rope_dims) + k = apply_rotary_emb(k, cos, sin, attn.rope_dims) + q = q * attn.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if attn.use_xsa: + y = attn._xsa_efficient(y, v) + # AttnOutGate (TTT path) — inline + .contiguous() barrier, same as the eval path. + if attn.attn_out_gate: + gate_src = q_raw if attn.attn_out_gate_src == "q" else n + gate_in = gate_src[..., : attn.gate_window].contiguous() + g = 2.0 * torch.sigmoid(attn.attn_gate_proj(gate_in)) + y = y * g[..., None] + # Gated Attention (TTT path). Gate input is n (post-norm block input), same + # as eval path. .to(n.dtype) on fp32 param before bf16 broadcast. + if attn.gated_attn: + n_c = n.contiguous() + g = torch.sigmoid(F.linear(n_c, attn.attn_gate_w.to(n.dtype))) + y = y * g[..., None] + y = y.reshape(bsz, seqlen, dim) + attn_out = F.linear(y, out_w.to(n.dtype)) + if lora.o_loras is not None: + attn_out = attn_out + lora.o_loras[slot](n) + x_out = x_in + block.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + mlp_n = block.mlp_norm(x_out) * block.ln_scale_factor + mlp_out = block.mlp(mlp_n, up_w, down_w) + if lora.mlp_loras is not None: + mlp_out = mlp_out + lora.mlp_loras[slot](mlp_n) + x_out = x_out + block.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * mlp_out + return x_out + + def _parallel_block_with_lora( + self, block_idx, lane0, lane1, x0, lora, slot, + q_w, k_w, v_w, out_w, up_w, down_w, + ): + block = self.blocks[block_idx] + mix = block.resid_mix.to(dtype=lane0.dtype) + attn_read = mix[0][None, None, :] * lane0 + mix[1][None, None, :] * x0 + n = block.attn_norm(attn_read) * block.ln_scale_factor + attn = block.attn + bsz, seqlen, dim = n.shape + q_raw = F.linear(n, q_w.to(n.dtype)) + lora.q_loras[slot](n) + q = q_raw.reshape(bsz, seqlen, attn.num_heads, attn.head_dim) + k = F.linear(n, k_w.to(n.dtype)) + if lora.k_loras is not None: + k = k + lora.k_loras[slot](n) + k = k.reshape(bsz, seqlen, attn.num_kv_heads, attn.head_dim) + v = (F.linear(n, v_w.to(n.dtype)) + lora.v_loras[slot](n)).reshape( + bsz, seqlen, attn.num_kv_heads, attn.head_dim + ) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = attn.rotary(seqlen, n.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, attn.rope_dims) + k = apply_rotary_emb(k, cos, sin, attn.rope_dims) + q = q * attn.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if attn.use_xsa: + y = attn._xsa_efficient(y, v) + # AttnOutGate (TTT parallel path) — inline + .contiguous() barrier. + if attn.attn_out_gate: + gate_src = q_raw if attn.attn_out_gate_src == "q" else n + gate_in = gate_src[..., : attn.gate_window].contiguous() + g = 2.0 * torch.sigmoid(attn.attn_gate_proj(gate_in)) + y = y * g[..., None] + # Gated Attention (TTT parallel path). Gate input is n (post-norm block input). 
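+        # The gate weights themselves get no LoRA adapter on either TTT path: the
+        # sigmoid gate is computed from the frozen attn_gate_w, and TTT only adapts
+        # the LoRA-wrapped projections (Q/K/V/O, MLP, lm_head) around it.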
+ if attn.gated_attn: + n_c = n.contiguous() + g = torch.sigmoid(F.linear(n_c, attn.attn_gate_w.to(n.dtype))) + y = y * g[..., None] + y = y.reshape(bsz, seqlen, dim) + attn_out = F.linear(y, out_w.to(n.dtype)) + if lora.o_loras is not None: + attn_out = attn_out + lora.o_loras[slot](n) + attn_out = block.attn_scale.to(dtype=attn_out.dtype)[None, None, :] * attn_out + mlp_read = lane1 + mlp_n = block.mlp_norm(mlp_read) * block.ln_scale_factor + mlp_out = block.mlp(mlp_n, up_w, down_w) + if lora.mlp_loras is not None: + mlp_out = mlp_out + lora.mlp_loras[slot](mlp_n) + mlp_out = block.mlp_scale.to(dtype=lane1.dtype)[None, None, :] * mlp_out + attn_resid = self.parallel_resid_lambdas[block_idx, 0].to(dtype=lane0.dtype) + attn_post = self.parallel_post_lambdas[block_idx, 0].to(dtype=lane0.dtype) + mlp_resid = self.parallel_resid_lambdas[block_idx, 1].to(dtype=lane0.dtype) + mlp_post = self.parallel_post_lambdas[block_idx, 1].to(dtype=lane0.dtype) + lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out + lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out + return lane0, lane1 + + +class BatchedLinearLoRA(nn.Module): + def __init__(self, bsz, in_features, out_features, rank): + super().__init__() + self._bound = 1.0 / math.sqrt(in_features) + self.A = nn.Parameter( + torch.empty(bsz, rank, in_features).uniform_(-self._bound, self._bound) + ) + self.B = nn.Parameter(torch.zeros(bsz, out_features, rank)) + + def reset(self): + with torch.no_grad(): + self.A.uniform_(-self._bound, self._bound) + self.B.zero_() + + def forward(self, x): + return (x @ self.A.transpose(1, 2)) @ self.B.transpose(1, 2) + + +class BatchedTTTLoRA(nn.Module): + def __init__(self, bsz, model, rank, k_lora=True, mlp_lora=True, o_lora=True): + super().__init__() + self.bsz = bsz + dim = model.qo_bank.shape[-1] + vocab = model.tok_emb.num_embeddings + if getattr(model, "looping_active", False): + num_slots = len(model.encoder_indices) + len(model.decoder_indices) + else: + num_slots = len(model.blocks) + kv_dim = model.blocks[0].attn.num_kv_heads * ( + dim // model.blocks[0].attn.num_heads + ) + embed_dim = model.tok_emb.embedding_dim + self.lm_head_lora = BatchedLinearLoRA(bsz, embed_dim, vocab, rank) + self.q_loras = nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + self.v_loras = nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, kv_dim, rank) for _ in range(num_slots)] + ) + self.k_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, kv_dim, rank) for _ in range(num_slots)] + ) + if k_lora + else None + ) + self.mlp_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + if mlp_lora + else None + ) + self.o_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + if o_lora + else None + ) + + def reset(self): + with torch.no_grad(): + self.lm_head_lora.reset() + for loras in [self.q_loras, self.v_loras, self.k_loras, + self.mlp_loras, self.o_loras]: + if loras is not None: + for lora in loras: + lora.reset() + + +@torch.compile +def zeropower_via_newtonschulz5(G, steps=10, eps=1e-07): + a, b, c = 3.4445, -4.775, 2.0315 + was_2d = G.ndim == 2 + if was_2d: + G = G.unsqueeze(0) + X = G.bfloat16() + transposed = X.size(-2) > X.size(-1) + if transposed: + X = X.mT + X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps) + for _ in range(steps): + A = X @ X.mT + B = b * A + c * (A @ A) + X = a * X + B @ X + if transposed: + X = X.mT + if was_2d: + X = 
X.squeeze(0) + return X + + +class Muon(torch.optim.Optimizer): + def __init__( + self, + params, + lr, + momentum, + backend_steps, + nesterov=True, + weight_decay=0.0, + row_normalize=False, + ): + super().__init__( + params, + dict( + lr=lr, + momentum=momentum, + backend_steps=backend_steps, + nesterov=nesterov, + weight_decay=weight_decay, + row_normalize=row_normalize, + ), + ) + self._built = False + + def _build(self): + self._distributed = dist.is_available() and dist.is_initialized() + self._world_size = dist.get_world_size() if self._distributed else 1 + self._rank = dist.get_rank() if self._distributed else 0 + ws = self._world_size + self._bank_meta = [] + for group in self.param_groups: + for p in group["params"]: + B = p.shape[0] + padded_B = ((B + ws - 1) // ws) * ws + shard_B = padded_B // ws + tail = p.shape[1:] + dev = p.device + self._bank_meta.append({ + "p": p, + "B": B, + "padded_grad": torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + "shard": torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + "shard_mom": torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + "full_update": torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + "scale": max(1, p.shape[-2] / p.shape[-1]) ** 0.5, + }) + self._bank_meta.sort(key=lambda m: -m["p"].numel()) + self._built = True + + def launch_reduce_scatters(self): + if not self._built: + self._build() + if not self._distributed: + return + self._rs_futures = [] + for m in self._bank_meta: + p = m["p"] + if p.grad is None: + self._rs_futures.append(None) + continue + pg = m["padded_grad"] + pg[: m["B"]].copy_(p.grad.bfloat16()) + if pg.shape[0] > m["B"]: + pg[m["B"] :].zero_() + fut = dist.reduce_scatter_tensor( + m["shard"], pg, op=dist.ReduceOp.AVG, async_op=True + ) + self._rs_futures.append(fut) + + @torch.no_grad() + def step(self, closure=None): + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + if not self._built: + self._build() + for group in self.param_groups: + lr = group["lr"] + momentum = group["momentum"] + backend_steps = group["backend_steps"] + nesterov = group["nesterov"] + wd = group.get("weight_decay", 0.0) + row_normalize = group.get("row_normalize", False) + prev_ag_handle = None + prev_m = None + sharded = self._distributed and hasattr(self, "_rs_futures") + for idx, m in enumerate(self._bank_meta): + p = m["p"] + if p.grad is None: + continue + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m["p"] + upd = prev_m["full_update"][: prev_m["B"]] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m["scale"]) + if sharded and self._rs_futures[idx] is not None: + self._rs_futures[idx].wait() + g = m["shard"] + buf = m["shard_mom"] + else: + g = p.grad.bfloat16() + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + buf.mul_(momentum).add_(g) + if nesterov: + update = g.add(buf, alpha=momentum) + else: + update = buf + if row_normalize: + rn = update.float().norm(dim=-1, keepdim=True).clamp_min(1e-07) + update = update / rn.to(update.dtype) + update = zeropower_via_newtonschulz5(update, steps=backend_steps) + if sharded: + prev_ag_handle = dist.all_gather_into_tensor( + m["full_update"], update, async_op=True + ) + prev_m = m + else: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) + p.add_(update.to(dtype=p.dtype), alpha=-lr * m["scale"]) + if prev_ag_handle is not None: + 
prev_ag_handle.wait() + pp = prev_m["p"] + upd = prev_m["full_update"][: prev_m["B"]] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m["scale"]) + if hasattr(self, "_rs_futures"): + del self._rs_futures + return loss + + +CONTROL_TENSOR_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,skip_gates,parallel_post_lambdas,parallel_resid_lambdas,attn_gate_proj,attn_gate_w,smear_gate,smear_lambda", + ).split(",") + if pattern +) + + +PACKED_REPLICATED_GRAD_MAX_NUMEL = 1 << 15 + + +class Optimizers: + def __init__(self, h, base_model): + matrix_params = [ + base_model.qo_bank, + base_model.kv_bank, + base_model.mlp_up_bank, + base_model.mlp_down_bank, + ] + block_named_params = list(base_model.blocks.named_parameters()) + scalar_params = [ + p + for (name, p) in block_named_params + if p.ndim < 2 + or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + if base_model.skip_gates is not None and base_model.skip_gates.numel() > 0: + scalar_params.append(base_model.skip_gates) + if base_model.parallel_post_lambdas is not None: + scalar_params.append(base_model.parallel_post_lambdas) + if base_model.parallel_resid_lambdas is not None: + scalar_params.append(base_model.parallel_resid_lambdas) + # SmearGate params live on GPT root (not in .blocks), so add them by hand. + # Both are tiny (gate_window scalars + 1 lambda). Optimized via scalar Adam. + if getattr(base_model, "smear_gate_enabled", False): + scalar_params.append(base_model.smear_gate.weight) + scalar_params.append(base_model.smear_lambda) + token_lr = h.tied_embed_lr if h.tie_embeddings else h.embed_lr + tok_params = [ + {"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr} + ] + self.optimizer_tok = torch.optim.AdamW( + tok_params, + betas=(h.beta1, h.beta2), + eps=h.adam_eps, + weight_decay=h.embed_wd, + fused=True, + ) + self.optimizer_muon = Muon( + matrix_params, + lr=h.matrix_lr, + momentum=h.muon_momentum, + backend_steps=h.muon_backend_steps, + weight_decay=h.muon_wd, + row_normalize=h.muon_row_normalize, + ) + for group in self.optimizer_muon.param_groups: + group["base_lr"] = h.matrix_lr + self.optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": h.scalar_lr, "base_lr": h.scalar_lr}], + betas=(h.beta1, h.beta2), + eps=h.adam_eps, + weight_decay=h.adam_wd, + fused=True, + ) + self.optimizers = [ + self.optimizer_tok, + self.optimizer_muon, + self.optimizer_scalar, + ] + self.replicated_params = list(tok_params[0]["params"]) + self.replicated_params.extend(scalar_params) + self.replicated_large_params = [] + self.replicated_packed_params = [] + for p in self.replicated_params: + if p.numel() <= PACKED_REPLICATED_GRAD_MAX_NUMEL: + self.replicated_packed_params.append(p) + else: + self.replicated_large_params.append(p) + + def __iter__(self): + return iter(self.optimizers) + + def zero_grad_all(self): + for opt in self.optimizers: + opt.zero_grad(set_to_none=True) + + def _all_reduce_packed_grads(self): + grads_by_key = collections.defaultdict(list) + for p in self.replicated_packed_params: + if p.grad is not None: + grads_by_key[(p.grad.device, p.grad.dtype)].append(p.grad) + for grads in grads_by_key.values(): + flat = torch.empty( + sum(g.numel() for g in grads), + device=grads[0].device, 
+ dtype=grads[0].dtype, + ) + offset = 0 + for g in grads: + n = g.numel() + flat[offset : offset + n].copy_(g.contiguous().view(-1)) + offset += n + dist.all_reduce(flat, op=dist.ReduceOp.AVG) + offset = 0 + for g in grads: + n = g.numel() + g.copy_(flat[offset : offset + n].view_as(g)) + offset += n + + def step(self, distributed=False): + self.optimizer_muon.launch_reduce_scatters() + if distributed: + reduce_handles = [ + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG, async_op=True) + for p in self.replicated_large_params + if p.grad is not None + ] + self._all_reduce_packed_grads() + for handle in reduce_handles: + handle.wait() + self.optimizer_tok.step() + self.optimizer_scalar.step() + self.optimizer_muon.step() + self.zero_grad_all() + + +def restore_fp32_params(model): + for module in model.modules(): + if isinstance(module, CastedLinear): + module.float() + for name, param in model.named_parameters(): + if ( + param.ndim < 2 + or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ) and param.dtype != torch.float32: + param.data = param.data.float() + if hasattr(model, "qo_bank") and model.qo_bank is not None: + model.qo_bank.data = model.qo_bank.data.float() + model.kv_bank.data = model.kv_bank.data.float() + model.mlp_up_bank.data = model.mlp_up_bank.data.float() + model.mlp_down_bank.data = model.mlp_down_bank.data.float() + + +def collect_hessians(model, train_loader, h, device, n_calibration_batches=64): + hessians = {} + hooks = [] + for i, block in enumerate(model.blocks): + block.attn._calib = True + block.mlp._calib = True + block.mlp.use_fused = False + + def make_attn_hook(layer_idx): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + for suffix in ["c_q", "c_k", "c_v"]: + name = f"blocks.{layer_idx}.attn.{suffix}.weight" + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + y = module._last_proj_input + if y is not None: + y = y.float() + if y.ndim == 3: + y = y.reshape(-1, y.shape[-1]) + name = f"blocks.{layer_idx}.attn.proj.weight" + if name not in hessians: + hessians[name] = torch.zeros( + y.shape[1], y.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(y.T, y) + return hook_fn + + def make_mlp_hook(layer_idx): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + name = f"blocks.{layer_idx}.mlp.fc.weight" + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + h_act = module._last_down_input + if h_act is not None: + h_act = h_act.float() + if h_act.ndim == 3: + h_act = h_act.reshape(-1, h_act.shape[-1]) + name = f"blocks.{layer_idx}.mlp.proj.weight" + if name not in hessians: + hessians[name] = torch.zeros( + h_act.shape[1], h_act.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(h_act.T, h_act) + return hook_fn + + for i, block in enumerate(model.blocks): + hooks.append(block.attn.register_forward_hook(make_attn_hook(i))) + hooks.append(block.mlp.register_forward_hook(make_mlp_hook(i))) + + # Hessian hooks for embedding factorization projection layers + def make_linear_input_hook(weight_name): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + if weight_name not in hessians: + hessians[weight_name] = 
torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[weight_name].addmm_(x.T, x) + return hook_fn + + if model.tie_embeddings: + hook_module = model.final_norm + + def make_output_hook(name): + def hook_fn(module, inp, out): + x = out.detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + return hook_fn + + hooks.append( + hook_module.register_forward_hook(make_output_hook("tok_emb.weight")) + ) + model.eval() + with torch.no_grad(): + for _ in range(n_calibration_batches): + x, _ = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps) + model.forward_logits(x) + for hook in hooks: + hook.remove() + for i, block in enumerate(model.blocks): + block.attn._calib = False + block.mlp._calib = False + block.mlp.use_fused = True + for name in hessians: + hessians[name] = hessians[name].cpu() / n_calibration_batches + return hessians + + +def gptq_quantize_weight(w, H, clip_sigmas=3.0, clip_range=63, block_size=128): + W_orig = w.float().clone() + rows, cols = W_orig.shape + H = H.float().clone() + dead = torch.diag(H) == 0 + H[dead, dead] = 1 + damp = 0.01 * H.diag().mean() + H.diagonal().add_(damp) + perm = torch.argsort(H.diag(), descending=True) + invperm = torch.argsort(perm) + W_perm = W_orig[:, perm].clone() + W_perm[:, dead[perm]] = 0 + H = H[perm][:, perm] + Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H)) + Hinv = torch.linalg.cholesky(Hinv, upper=True) + row_std = W_orig.std(dim=1) + s = (clip_sigmas * row_std / clip_range).clamp_min(1e-10).to(torch.float16) + sf = s.float() + Q = torch.zeros(rows, cols, dtype=torch.int8) + W_work = W_perm.clone() + for i1 in range(0, cols, block_size): + i2 = min(i1 + block_size, cols) + W_block = W_work[:, i1:i2].clone() + Hinv_block = Hinv[i1:i2, i1:i2] + Err = torch.zeros(rows, i2 - i1) + for j in range(i2 - i1): + w_col = W_block[:, j] + d = Hinv_block[j, j] + q_col = torch.clamp(torch.round(w_col / sf), -clip_range, clip_range) + Q[:, i1 + j] = q_col.to(torch.int8) + err = (w_col - q_col.float() * sf) / d + Err[:, j] = err + W_block[:, j:] -= err.unsqueeze(1) * Hinv_block[j, j:].unsqueeze(0) + if i2 < cols: + W_work[:, i2:] -= Err @ Hinv[i1:i2, i2:] + return Q[:, invperm], s + + +def _quantize_gate_int8_row(w): + # Symmetric int8-per-row quantization for small gate tensors. w shape + # (R, C) -> (R,) scales in fp16, int8 values in [-127, 127]. Single scale + # per row keeps accuracy high while halving storage vs fp16. + W = w.float().contiguous() + row_max = W.abs().amax(dim=1).clamp_min(1e-10) + s = (row_max / 127.0).to(torch.float16) + sf = s.float().view(-1, 1) + q = torch.clamp(torch.round(W / sf), -127, 127).to(torch.int8) + return q, s + + +def gptq_mixed_quantize(state_dict, hessians, h): + result = {} + meta = {} + quant_gate = bool(getattr(h, "gated_attn_quant_gate", False)) + for (name, tensor) in state_dict.items(): + t = tensor.detach().cpu().contiguous() + # Dedicated int8-per-row path for attn_gate_w (bypasses both GPTQ and + # fp16 passthrough). Applied BEFORE the numel<=65536 passthrough check + # so the gate tensor is routed here instead of to fp16. 
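+        # Size sketch (illustrative dims): an (8, 1024) gate stored as fp16 costs
+        # 16,384 bytes; int8-per-row costs 8,192 bytes of values plus 8 fp16 row
+        # scales (16 bytes), i.e. roughly half, before the lzma/brotli pass shrinks
+        # the blob further.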
+ if ( + quant_gate + and t.is_floating_point() + and t.ndim == 2 + and name.endswith(".attn_gate_w") + and 1024 <= t.numel() <= 8192 + ): + gq, gs = _quantize_gate_int8_row(t) + result[name + ".gq"] = gq + result[name + ".gs"] = gs + meta[name] = "gate_int8_row" + continue + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough (float16)" + continue + if "tok_emb" in name: + cs = h.embed_clip_sigmas + elif ".mlp." in name: + cs = h.mlp_clip_sigmas + elif ".attn." in name: + cs = h.attn_clip_sigmas + else: + cs = h.matrix_clip_sigmas + bits = h.embed_bits if "tok_emb" in name else h.matrix_bits + clip_range = 2 ** (bits - 1) - 1 + ret = gptq_quantize_weight( + t, hessians[name], clip_sigmas=cs, clip_range=clip_range + ) + q, s = ret + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = f"gptq (int{bits})" + categories = collections.defaultdict(set) + for (name, cat) in meta.items(): + short = re.sub("\\.\\d+$", "", re.sub("blocks\\.\\d+", "blocks", name)) + categories[cat].add(short) + log("Quantized weights:") + for cat in sorted(categories): + log(f" {cat}: {', '.join(sorted(categories[cat]))}") + return result, meta + +def dequantize_mixed(result, meta, template_sd): + out = {} + for (name, orig) in template_sd.items(): + info = meta.get(name) + if info is None: + continue + orig_dtype = orig.dtype + if "passthrough" in info: + t = result[name] + if t.dtype == torch.float16 and orig_dtype in ( + torch.float32, + torch.bfloat16, + ): + t = t.to(orig_dtype) + out[name] = t + continue + if info == "gate_int8_row": + gq = result[name + ".gq"] + gs = result[name + ".gs"] + out[name] = (gq.float() * gs.float().view(-1, 1)).to(orig_dtype) + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + out[name] = ( + q.float() * s.float().view(q.shape[0], *[1] * (q.ndim - 1)) + ).to(orig_dtype) + else: + out[name] = (q.float() * float(s.item())).to(orig_dtype) + return out + + +_BSHF_MAGIC = b"BSHF" + + +def _byte_shuffle(data, stride=2): + if stride <= 1 or len(data) < stride: + return data + src = np.frombuffer(data, dtype=np.uint8) + n = len(src) + out = np.empty(n, dtype=np.uint8) + dest_off = 0 + for pos in range(stride): + chunk = src[pos::stride] + out[dest_off : dest_off + len(chunk)] = chunk + dest_off += len(chunk) + return _BSHF_MAGIC + bytes([stride]) + out.tobytes() + + +def _byte_unshuffle(data): + if len(data) < 5 or data[:4] != _BSHF_MAGIC: + return data + stride = data[4] + if stride < 2: + return data[5:] + payload = np.frombuffer(data, dtype=np.uint8, offset=5) + n = len(payload) + out = np.empty(n, dtype=np.uint8) + src_off = 0 + for pos in range(stride): + chunk_len = n // stride + (1 if pos < n % stride else 0) + out[pos::stride][:chunk_len] = payload[src_off : src_off + chunk_len] + src_off += chunk_len + return out.tobytes() + + +def _compress(data, compressor): + data = _byte_shuffle(data) + if compressor == "lzma": + return lzma.compress(data, preset=6) + elif compressor == "brotli": + import brotli + + return brotli.compress(data, quality=11) + raise ValueError(f"Unknown compressor: {compressor!r}") + + +def _decompress(data, compressor): + if compressor == "lzma": + raw = lzma.decompress(data) + elif compressor == "brotli": + import brotli + + raw = brotli.decompress(data) + else: + raise ValueError(f"Unknown compressor: {compressor!r}") + raw = _byte_unshuffle(raw) + return raw + + +def _unbank_state_dict(state_dict, num_layers): + sd 
= {} + n = num_layers + for k, v in state_dict.items(): + t = v.detach().cpu() if v is not None else None + if k == "qo_bank": + for i in range(n): + sd[f"blocks.{i}.attn.c_q.weight"] = t[i] + sd[f"blocks.{i}.attn.proj.weight"] = t[n + i] + elif k == "kv_bank": + for i in range(n): + sd[f"blocks.{i}.attn.c_k.weight"] = t[i] + sd[f"blocks.{i}.attn.c_v.weight"] = t[n + i] + elif k == "mlp_up_bank": + for i in range(n): + sd[f"blocks.{i}.mlp.fc.weight"] = t[i] + elif k == "mlp_down_bank": + for i in range(n): + sd[f"blocks.{i}.mlp.proj.weight"] = t[i] + else: + if t is not None: + sd[k] = t + return sd + + +def _rebank_state_dict(flat_sd, num_layers, model_dim, kv_dim, hidden_dim): + sd = {} + n = num_layers + sd["qo_bank"] = torch.zeros(2 * n, model_dim, model_dim) + sd["kv_bank"] = torch.zeros(2 * n, kv_dim, model_dim) + for i in range(n): + sd["qo_bank"][i] = flat_sd[f"blocks.{i}.attn.c_q.weight"] + sd["qo_bank"][n + i] = flat_sd[f"blocks.{i}.attn.proj.weight"] + sd["kv_bank"][i] = flat_sd[f"blocks.{i}.attn.c_k.weight"] + sd["kv_bank"][n + i] = flat_sd[f"blocks.{i}.attn.c_v.weight"] + sd["mlp_up_bank"] = torch.zeros(n, hidden_dim, model_dim) + sd["mlp_down_bank"] = torch.zeros(n, model_dim, hidden_dim) + for i in range(n): + sd["mlp_up_bank"][i] = flat_sd[f"blocks.{i}.mlp.fc.weight"] + sd["mlp_down_bank"][i] = flat_sd[f"blocks.{i}.mlp.proj.weight"] + for k, v in flat_sd.items(): + if not ( + k.startswith("blocks.") + and any( + p in k + for p in [ + ".attn.c_q.", ".attn.c_k.", ".attn.c_v.", + ".attn.proj.", ".mlp.fc.", ".mlp.proj.", + ] + ) + ): + sd[k] = v + return sd + + + +def _compressed_code_size(code): + code_raw = code.encode("utf-8") + try: + minified = subprocess.run( + ["pyminify", "--no-rename-locals", "--no-hoist-literals", "--remove-literal-statements", "-"], + input=code_raw, capture_output=True, check=True, + ).stdout + except (FileNotFoundError, subprocess.CalledProcessError): + minified = code_raw + compressed = lzma.compress(minified) + encoded = base64.b85encode(compressed) + wrapper = b'import lzma as L,base64 as B\nexec(L.decompress(B.b85decode("' + encoded + b'")))\n' + return len(code_raw), len(wrapper) + + +def serialize(h, base_model, code): + code_bytes_uncompressed, code_bytes = _compressed_code_size(code) + if h.is_main_process: + torch.save(base_model.state_dict(), h.model_path) + model_bytes = os.path.getsize(h.model_path) + log(f"Serialized model: {model_bytes} bytes") + log(f"Code size (uncompressed): {code_bytes_uncompressed} bytes") + log(f"Code size (compressed): {code_bytes} bytes") + sd_cpu = _unbank_state_dict(base_model.state_dict(), h.num_layers) + device = torch.device("cuda", h.local_rank) + t0 = time.perf_counter() + calib_loader = ShuffledSequenceLoader(h, device) + log("GPTQ:collecting Hessians from calibration data...") + hessians = collect_hessians( + base_model, + calib_loader, + h, + device, + n_calibration_batches=h.gptq_calibration_batches, + ) + log(f"GPTQ:collected {len(hessians)} Hessians in {time.perf_counter()-t0:.1f}s") + quant_result, quant_meta = gptq_mixed_quantize(sd_cpu, hessians, h) + quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_raw = quant_buf.getvalue() + quant_blob = _compress(quant_raw, h.compressor) + quant_file_bytes = len(quant_blob) + bytes_total = quant_file_bytes + code_bytes + if h.is_main_process: + with open(h.quantized_model_path, "wb") as f: + f.write(quant_blob) + log(f"Serialized model quantized+{h.compressor}: {quant_file_bytes} bytes") + log(f"Total submission 
size quantized+{h.compressor}: {bytes_total} bytes") + return bytes_total, quant_file_bytes + + +def deserialize(h, device): + eval_model = GPT(h).to(device).bfloat16() + restore_fp32_params(eval_model) + flat_template = _unbank_state_dict(eval_model.state_dict(), h.num_layers) + with open(h.quantized_model_path, "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load( + io.BytesIO(_decompress(quant_blob_disk, h.compressor)), map_location="cpu" + ) + deq_flat = dequantize_mixed(quant_state["w"], quant_state["m"], flat_template) + head_dim = h.model_dim // h.num_heads + kv_dim = h.num_kv_heads * head_dim + hidden_dim = int(h.mlp_mult * h.model_dim) + deq_state = _rebank_state_dict(deq_flat, h.num_layers, h.model_dim, kv_dim, hidden_dim) + eval_model.load_state_dict(deq_state, strict=True) + return eval_model + + +def _loss_bpb(loss_sum, token_count, byte_count): + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + return val_loss, val_bpb + + +def eval_val(h, device, val_data, model, forward_logits_fn=None): + seq_len = h.eval_seq_len + local_batch_tokens = h.val_batch_tokens // (h.world_size * h.grad_accum_steps) + if local_batch_tokens < seq_len: + raise ValueError( + f"VAL_BATCH_SIZE must provide at least one sequence per rank; got VAL_BATCH_SIZE={h.val_batch_tokens}, WORLD_SIZE={h.world_size}, GRAD_ACCUM_STEPS={h.grad_accum_steps}, seq_len={seq_len}" + ) + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_data.val_tokens.numel() - 1) // seq_len + seq_start = total_seqs * h.rank // h.world_size + seq_end = total_seqs * (h.rank + 1) // h.world_size + + # TODO: Don't truncate this. + seq_end = seq_start + ((seq_end - seq_start) // local_batch_seqs) * local_batch_seqs + + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + run_forward_logits = ( + (model.module.forward_logits if hasattr(model, "module") else model.forward_logits) + if forward_logits_fn is None + else forward_logits_fn + ) + model.eval() + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + with torch.no_grad(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 + local = val_data.val_tokens[raw_start:raw_end].to( + device=device, dtype=torch.int64, non_blocking=True + ) + x = local[:-1] + y = local[1:] + bos_pos = (x == BOS_ID).nonzero(as_tuple=True)[0].tolist() + cu_seqlens, max_seqlen = _build_cu_seqlens( + bos_pos, x.numel(), x.device, h.eval_seq_len, 64 + ) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + logits = run_forward_logits( + x[None], cu_seqlens=cu_seqlens, max_seqlen=max_seqlen + ).detach() + per_token_loss = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y.reshape(-1), + reduction="none", + ) + val_loss_sum += per_token_loss.to(torch.float64).sum() + val_token_count += float(y.numel()) + prev_ids = x + tgt_ids = y + if val_data.caseops_enabled and val_data.val_bytes is not None: + # CaseOps: read per-token byte budget from sidecar at the same + # global positions as the target tokens y. raw_start/raw_end + # span [raw_start, raw_end), x = local[:-1], y = local[1:], + # so y is at sidecar positions [raw_start + 1, raw_end). 
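+                    # Worked example (illustrative, derived from the indexing above):
+                    # with seq_len = 2048 and the first batch, raw_start = 0 and
+                    # raw_end = 2049, so x covers token positions 0..2047, y covers
+                    # 1..2048, and the byte budget is val_bytes[1:2049] summed over
+                    # the 2048 scored positions.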
+ sidecar_slice = val_data.val_bytes[raw_start + 1 : raw_end].to( + device=device, dtype=torch.int32, non_blocking=True + ) + val_byte_count += sidecar_slice.to(torch.float64).sum() + else: + token_bytes = val_data.base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += ( + val_data.has_leading_space_lut[tgt_ids] + & ~val_data.is_boundary_token_lut[prev_ids] + ).to(dtype=torch.int16) + val_byte_count += token_bytes.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + model.train() + return _loss_bpb(val_loss_sum, val_token_count, val_byte_count) + + +def _find_docs(all_tokens): + bos_positions = (all_tokens == BOS_ID).nonzero(as_tuple=True)[0].numpy() + docs = [] + for i in range(len(bos_positions)): + start = int(bos_positions[i]) + end = ( + int(bos_positions[i + 1]) + if i + 1 < len(bos_positions) + else all_tokens.numel() + ) + if i + 1 < len(bos_positions): + end += 1 + assert end - start >= 2 + docs.append((start, end - start)) + return docs + + +def _build_ttt_global_batches(doc_entries, h, ascending=False): + batch_size = h.ttt_batch_size + global_doc_entries = sorted(doc_entries, key=lambda x: x[1][1]) + global_batches = [ + global_doc_entries[i : i + batch_size] + for i in range(0, len(global_doc_entries), batch_size) + ] + indexed = list(enumerate(global_batches)) + if not ascending: + indexed.sort(key=lambda ib: -max(dl for _, (_, dl) in ib[1])) + return indexed + + +def _init_batch_counter(path): + with open(path, "wb") as f: + f.write((0).to_bytes(4, "little")) + + +def _claim_next_batch(counter_path, queue_len): + try: + with open(counter_path, "r+b") as f: + fcntl.flock(f, fcntl.LOCK_EX) + idx = int.from_bytes(f.read(4), "little") + f.seek(0) + f.write((idx + 1).to_bytes(4, "little")) + f.flush() + except FileNotFoundError: + return queue_len + return idx + + +def _compute_chunk_window(ci, pred_len, num_chunks, chunk_size, eval_seq_len): + chunk_end = pred_len if ci == num_chunks - 1 else (ci + 1) * chunk_size + win_start = max(0, chunk_end - eval_seq_len) + win_len = chunk_end - win_start + chunk_start = ci * chunk_size + chunk_offset = chunk_start - win_start + chunk_len = chunk_end - chunk_start + return win_start, win_len, chunk_offset, chunk_len + + +def _accumulate_bpb( + ptl, + x, + y, + chunk_offsets, + chunk_lens, + pos_idx, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + loss_sum, + byte_sum, + token_count, + y_bytes=None, +): + pos = pos_idx[: x.size(1)].unsqueeze(0) + mask = ( + (chunk_lens.unsqueeze(1) > 0) + & (pos >= chunk_offsets.unsqueeze(1)) + & (pos < (chunk_offsets + chunk_lens).unsqueeze(1)) + ) + mask_f64 = mask.to(torch.float64) + if y_bytes is not None: + tok_bytes = y_bytes.to(torch.float64) + else: + tok_bytes = base_bytes_lut[y].to(torch.float64) + tok_bytes += (has_leading_space_lut[y] & ~is_boundary_token_lut[x]).to( + torch.float64 + ) + loss_sum += (ptl.to(torch.float64) * mask_f64).sum() + byte_sum += (tok_bytes * mask_f64).sum() + token_count += chunk_lens.to(torch.float64).sum() + + +def _loss_bpb_from_sums(loss_sum, token_count, byte_sum): + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_sum.item()) + return val_loss, val_bpb + + +def _add_to_counter(path, delta): + try: + with open(path, "r+b") as f: + fcntl.flock(f, fcntl.LOCK_EX) + cur = 
int.from_bytes(f.read(8), "little", signed=True) + cur += int(delta) + f.seek(0) + f.write(int(cur).to_bytes(8, "little", signed=True)) + f.flush() + return cur + except FileNotFoundError: + return int(delta) + + +def _init_int64_counter(path): + with open(path, "wb") as f: + f.write((0).to_bytes(8, "little", signed=True)) + + +def _select_ttt_doc_entries(docs, h): + doc_entries = list(enumerate(docs)) + if h.val_doc_fraction < 1.0: + sample_n = max(1, int(round(len(docs) * h.val_doc_fraction))) + sampled_indices = sorted( + random.Random(h.seed).sample(range(len(docs)), sample_n) + ) + return [(i, docs[i]) for i in sampled_indices] + return doc_entries + + +def train_val_ttt_global_sgd_distributed(h, device, val_data, base_model, val_tokens, batch_seqs=None): + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + base_model.eval() + seq_len = h.eval_seq_len + total_tokens = val_tokens.numel() - 1 + ttt_chunk = h.global_ttt_chunk_tokens + batch_seqs = h.global_ttt_batch_seqs if batch_seqs is None else batch_seqs + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + ttt_params = [p for p in base_model.parameters()] + for p in ttt_params: + p.requires_grad_(True) + optimizer = torch.optim.SGD( + ttt_params, lr=h.global_ttt_lr, momentum=h.global_ttt_momentum + ) + t_start = time.perf_counter() + for ci in range(num_chunks): + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + is_last_chunk = ci == num_chunks - 1 + if is_last_chunk or h.global_ttt_epochs <= 0: + continue + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs <= 0: + continue + warmup_chunks = max(0, min(h.global_ttt_warmup_chunks, num_chunks - 1)) + if warmup_chunks > 0 and ci < warmup_chunks: + warmup_denom = max(warmup_chunks - 1, 1) + warmup_t = ci / warmup_denom + lr_now = ( + h.global_ttt_warmup_start_lr + + (h.global_ttt_lr - h.global_ttt_warmup_start_lr) * warmup_t + ) + else: + decay_steps = max(num_chunks - 1 - warmup_chunks, 1) + decay_ci = max(ci - warmup_chunks, 0) + lr_now = h.global_ttt_lr * 0.5 * ( + 1.0 + math.cos(math.pi * decay_ci / decay_steps) + ) + for pg in optimizer.param_groups: + pg["lr"] = lr_now + my_seq_s = chunk_seqs * h.rank // h.world_size + my_seq_e = chunk_seqs * (h.rank + 1) // h.world_size + my_chunk_seqs = my_seq_e - my_seq_s + for _ in range(h.global_ttt_epochs): + for bs in range(0, my_chunk_seqs, batch_seqs): + be = min(bs + batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = chunk_start + (my_seq_s + be) * seq_len + 1 + if end_tok > val_tokens.numel(): + continue + local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64) + x_flat = local[:-1] + y_flat = local[1:] + optimizer.zero_grad(set_to_none=True) + with torch.enable_grad(): + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + if h.global_ttt_respect_doc_boundaries: + bos_pos = (x_flat == BOS_ID).nonzero(as_tuple=True)[0].tolist() + cu_seqlens, max_seqlen = _build_cu_seqlens( + bos_pos, x_flat.numel(), x_flat.device, h.eval_seq_len, 64 + ) + loss = base_model( + x_flat[None], + y_flat[None], + cu_seqlens=cu_seqlens, + max_seqlen=max_seqlen, + ) + else: + x = x_flat.reshape(-1, seq_len) + y = y_flat.reshape(-1, seq_len) + loss = base_model(x, y) + loss.backward() + if dist.is_available() and dist.is_initialized(): + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.SUM) + p.grad.mul_(1.0 / h.world_size) + if h.global_ttt_grad_clip 
> 0: + torch.nn.utils.clip_grad_norm_(ttt_params, h.global_ttt_grad_clip) + optimizer.step() + base_model.eval() + if h.rank == 0: + elapsed = time.perf_counter() - t_start + log( + f"tttg: c{ci+1}/{num_chunks} lr:{lr_now:.6f} t:{elapsed:.1f}s" + ) + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + + +def eval_val_ttt_phased(h, base_model, device, val_data, forward_ttt_train): + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + base_model.eval() + for p in base_model.parameters(): + p.requires_grad_(False) + all_tokens = val_data.val_tokens + all_tokens_idx = all_tokens.to(torch.int32) + docs = _find_docs(all_tokens) + doc_entries = _select_ttt_doc_entries(docs, h) + prefix_doc_limit = max(0, min(len(doc_entries), int(h.phased_ttt_prefix_docs))) + num_phases = max(1, int(h.phased_ttt_num_phases)) + phase_boundaries = [] + for pi in range(num_phases): + boundary = prefix_doc_limit * (pi + 1) // num_phases + phase_boundaries.append(boundary) + current_phase = 0 + current_phase_boundary = phase_boundaries[0] + log( + "ttt_phased:" + f" total_docs:{len(doc_entries)} prefix_docs:{prefix_doc_limit} " + f"suffix_docs:{len(doc_entries) - prefix_doc_limit}" + f" num_phases:{num_phases} boundaries:{phase_boundaries}" + ) + chunk_size, eval_seq_len = h.ttt_chunk_size, h.ttt_eval_seq_len + eval_batch_set = None + if h.ttt_eval_batches: + eval_batch_set = set(int(x) for x in h.ttt_eval_batches.split(",") if x.strip()) + use_ascending = eval_batch_set is not None + global_batches_sorted = _build_ttt_global_batches( + doc_entries, h, ascending=use_ascending + ) + queue_len = len(global_batches_sorted) + counter_path = f"/tmp/ttt_counter_{h.run_id}" + prefix_counter_path = f"/tmp/ttt_prefix_counter_{h.run_id}" + pause_flag_path = f"/tmp/ttt_pause_flag_{h.run_id}" + if h.rank == 0: + _init_batch_counter(counter_path) + _init_int64_counter(prefix_counter_path) + try: + os.remove(pause_flag_path) + except FileNotFoundError: + pass + if dist.is_available() and dist.is_initialized(): + path_list = [counter_path, prefix_counter_path, pause_flag_path] + dist.broadcast_object_list(path_list, src=0) + counter_path, prefix_counter_path, pause_flag_path = path_list + dist.barrier() + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + byte_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + t_start = time.perf_counter() + reusable_lora = BatchedTTTLoRA( + h.ttt_batch_size, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + + def _build_opt(lora): + if h.ttt_optimizer == "sgd": + return torch.optim.SGD( + lora.parameters(), lr=h.ttt_lora_lr, + momentum=h.ttt_beta1, weight_decay=h.ttt_weight_decay, + ) + return torch.optim.AdamW( + lora.parameters(), lr=h.ttt_lora_lr, + betas=(h.ttt_beta1, h.ttt_beta2), + eps=1e-10, weight_decay=h.ttt_weight_decay, fused=True, + ) + + reusable_opt = _build_opt(reusable_lora) + local_scored_docs = [] + global_ttt_done = prefix_doc_limit == 0 + try: + while True: + queue_idx = _claim_next_batch(counter_path, queue_len) + if queue_idx >= queue_len: + break + orig_batch_idx, batch_entries = global_batches_sorted[queue_idx] + batch = [doc for _, doc in batch_entries] + bsz = len(batch) + prev_loss = loss_sum.item() + prev_bytes = byte_sum.item() + prev_tokens = token_count.item() + if bsz == reusable_lora.bsz: + reusable_lora.reset() + for s in reusable_opt.state.values(): + for k, v in s.items(): + if 
isinstance(v, torch.Tensor): + v.zero_() + elif k == "step": + s[k] = 0 + cur_lora = reusable_lora + cur_opt = reusable_opt + else: + cur_lora = BatchedTTTLoRA( + bsz, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + cur_opt = _build_opt(cur_lora) + pred_lens = [doc_len - 1 for _, doc_len in batch] + num_chunks = [(pl + chunk_size - 1) // chunk_size for pl in pred_lens] + max_nc = max(num_chunks) + num_chunks_t = torch.tensor(num_chunks, dtype=torch.int64, device=device) + for ci in range(max_nc): + active = [ci < nc for nc in num_chunks] + needs_train = any(ci < nc - 1 for nc in num_chunks) + tok_starts = torch.zeros(bsz, dtype=torch.int64) + tok_wls = torch.zeros(bsz, dtype=torch.int64) + chunk_offsets_cpu = torch.zeros(bsz, dtype=torch.int64) + chunk_lens_cpu = torch.zeros(bsz, dtype=torch.int64) + for b in range(bsz): + if not active[b]: + continue + doc_start, doc_len = batch[b] + win_start, win_len, chunk_offset, chunk_len = _compute_chunk_window( + ci, pred_lens[b], num_chunks[b], chunk_size, eval_seq_len + ) + tok_starts[b] = doc_start + win_start + tok_wls[b] = win_len + chunk_offsets_cpu[b] = chunk_offset + chunk_lens_cpu[b] = chunk_len + _, context_size, chunk_offset, _ = _compute_chunk_window( + ci, (ci + 1) * chunk_size, ci + 1, chunk_size, eval_seq_len + ) + col_idx = torch.arange(context_size + 1) + idx = tok_starts.unsqueeze(1) + col_idx.unsqueeze(0) + idx.clamp_(max=all_tokens.numel() - 1) + gathered_gpu = all_tokens_idx[idx].to( + device=device, dtype=torch.int64, non_blocking=True + ) + valid = (col_idx[:context_size].unsqueeze(0) < tok_wls.unsqueeze(1)).to( + device, non_blocking=True + ) + chunk_offsets = chunk_offsets_cpu.to(device, non_blocking=True) + chunk_lens = chunk_lens_cpu.to(device, non_blocking=True) + x = torch.where(valid, gathered_gpu[:, :context_size], 0) + y = torch.where(valid, gathered_gpu[:, 1 : context_size + 1], 0) + ctx_pos = torch.arange(context_size, device=device, dtype=torch.int64) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + per_tok_loss = forward_ttt_train(x, y, lora=cur_lora) + # CaseOps sidecar-driven byte budget. Mirror the index pattern + # used to build y from all_tokens: y[b, j] corresponds to the + # token at global position tok_starts[b] + 1 + j (when valid). + y_bytes_arg = None + if val_data.caseops_enabled and val_data.val_bytes is not None: + y_idx = ( + tok_starts.unsqueeze(1) + + 1 + + col_idx[:context_size].unsqueeze(0) + ) + y_idx = y_idx.clamp_(max=val_data.val_bytes.numel() - 1) + y_bytes_arg = val_data.val_bytes[y_idx].to( + device=device, dtype=torch.int32, non_blocking=True + ) + # Mirror the `valid` masking used for y so out-of-range tokens + # contribute zero bytes (matches y=0 substitution above). 
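+                    # Illustrative: if tok_wls[b] = 100 with context_size = 2048,
+                    # columns 100..2047 of row b are invalid; the torch.where below
+                    # zeroes their byte counts, mirroring the y = 0 substitution for
+                    # those same positions.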
+ y_bytes_arg = torch.where( + valid, y_bytes_arg, torch.zeros_like(y_bytes_arg) + ) + with torch.no_grad(): + _accumulate_bpb( + per_tok_loss, + x, + y, + chunk_offsets, + chunk_lens, + ctx_pos, + val_data.base_bytes_lut, + val_data.has_leading_space_lut, + val_data.is_boundary_token_lut, + loss_sum, + byte_sum, + token_count, + y_bytes=y_bytes_arg, + ) + if needs_train: + activate_chunk_mask = (num_chunks_t - 1 > ci).float() + for gi in range(h.ttt_grad_steps): + if gi > 0: + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + per_tok_loss = forward_ttt_train(x, y, lora=cur_lora) + per_doc = per_tok_loss[ + :, chunk_offset : chunk_offset + chunk_size + ].mean(dim=-1) + cur_opt.zero_grad(set_to_none=True) + (per_doc * activate_chunk_mask).sum().backward() + cur_opt.step() + else: + del per_tok_loss + batch_num = orig_batch_idx + 1 + doc_lens = [dl for _, dl in batch] + should_report = batch_num in eval_batch_set if eval_batch_set is not None else True + if should_report: + cur_tokens = token_count.item() + cur_loss_val = loss_sum.item() + cur_bytes_val = byte_sum.item() + dt = cur_tokens - prev_tokens + db = cur_bytes_val - prev_bytes + if dt > 0 and db > 0: + b_loss = (cur_loss_val - prev_loss) / dt + b_bpb = b_loss / math.log(2.0) * (dt / db) + else: + b_loss = b_bpb = 0.0 + r_loss = cur_loss_val / max(cur_tokens, 1) + r_bpb = r_loss / math.log(2.0) * (cur_tokens / max(cur_bytes_val, 1)) + elapsed = time.perf_counter() - t_start + log( + f"ttp: b{batch_num}/{queue_len} bl:{b_loss:.4f} bb:{b_bpb:.4f} " + f"rl:{r_loss:.4f} rb:{r_bpb:.4f} dl:{min(doc_lens)}-{max(doc_lens)} " + f"gd:{int(global_ttt_done)}" + ) + if not global_ttt_done: + local_scored_docs.extend( + (orig_batch_idx, pos, doc_start, doc_len) + for pos, (doc_start, doc_len) in enumerate(batch) + ) + prefix_done = _add_to_counter(prefix_counter_path, len(batch_entries)) + if prefix_done >= current_phase_boundary: + try: + with open(pause_flag_path, "x"): + pass + except FileExistsError: + pass + should_pause = os.path.exists(pause_flag_path) + if should_pause: + if dist.is_available() and dist.is_initialized(): + dist.barrier() + gathered_scored_docs = [None] * h.world_size + if dist.is_available() and dist.is_initialized(): + dist.all_gather_object(gathered_scored_docs, local_scored_docs) + else: + gathered_scored_docs = [local_scored_docs] + scored_docs_for_global = [] + for rank_docs in gathered_scored_docs: + if rank_docs: + scored_docs_for_global.extend(rank_docs) + scored_docs_for_global.sort(key=lambda x: (x[0], x[1])) + scored_docs_for_global = scored_docs_for_global[:current_phase_boundary] + scored_token_chunks = [ + val_data.val_tokens[doc_start : doc_start + doc_len] + for _, _, doc_start, doc_len in scored_docs_for_global + ] + if scored_token_chunks: + global_ttt_tokens = torch.cat(scored_token_chunks) + else: + global_ttt_tokens = val_data.val_tokens[:0] + if h.rank == 0: + prefix_done = 0 + try: + with open(prefix_counter_path, "rb") as f: + prefix_done = int.from_bytes( + f.read(8), "little", signed=True + ) + except FileNotFoundError: + pass + log( + f"ttpp: phase:{current_phase + 1}/{num_phases} pd:{prefix_done} " + f"gd:{len(scored_docs_for_global)} " + f"t:{time.perf_counter() - t_start:.1f}s" + ) + train_val_ttt_global_sgd_distributed( + h, device, val_data, base_model, global_ttt_tokens + ) + for p in base_model.parameters(): + p.requires_grad_(False) + reusable_lora = BatchedTTTLoRA( + h.ttt_batch_size, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, 
o_lora=h.ttt_o_lora, + ).to(device) + reusable_opt = _build_opt(reusable_lora) + current_phase += 1 + if current_phase >= num_phases: + global_ttt_done = True + else: + current_phase_boundary = phase_boundaries[current_phase] + if h.rank == 0: + try: + os.remove(pause_flag_path) + except FileNotFoundError: + pass + if dist.is_available() and dist.is_initialized(): + dist.barrier() + if h.rank == 0: + log(f"ttpr: phase:{current_phase}/{num_phases} t:{time.perf_counter() - t_start:.1f}s") + del cur_lora, cur_opt + finally: + pass + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.train() + return _loss_bpb_from_sums(loss_sum, token_count, byte_sum) + + +def timed_eval(label, fn, *args, **kwargs): + torch.cuda.synchronize() + t0 = time.perf_counter() + val_loss, val_bpb = fn(*args, **kwargs) + torch.cuda.synchronize() + elapsed_ms = 1e3 * (time.perf_counter() - t0) + log( + f"{label} val_loss:{val_loss:.8f} val_bpb:{val_bpb:.8f} eval_time:{elapsed_ms:.0f}ms" + ) + return val_loss, val_bpb + + +def train_model(h, device, val_data): + base_model = GPT(h).to(device).bfloat16() + restore_fp32_params(base_model) + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + compiled_forward_logits = torch.compile( + base_model.forward_logits, dynamic=False, fullgraph=True + ) + model = compiled_model + log(f"model_params:{sum(p.numel()for p in base_model.parameters())}") + optimizers = Optimizers(h, base_model) + train_loader = DocumentPackingLoader(h, device) + max_wallclock_ms = ( + 1e3 * h.max_wallclock_seconds if h.max_wallclock_seconds > 0 else None + ) + if max_wallclock_ms is not None: + max_wallclock_ms -= h.gptq_reserve_seconds * 1e3 + log( + f"gptq:reserving {h.gptq_reserve_seconds:.0f}s, effective={max_wallclock_ms:.0f}ms" + ) + + phased_train_loop = bool(h.train_loop_phase_depths) + if phased_train_loop: + train_loop_depths = list(h.train_loop_phase_depths) + train_loop_depth_weights = [1] * len(train_loop_depths) + if h.train_loop_phase_fractions: + if len(h.train_loop_phase_fractions) != len(train_loop_depths): + raise ValueError( + "TRAIN_LOOP_PHASE_FRACTIONS must match TRAIN_LOOP_PHASE_DEPTHS" + ) + total = sum(max(x, 0.0) for x in h.train_loop_phase_fractions) + if total <= 0.0: + raise ValueError("TRAIN_LOOP_PHASE_FRACTIONS must sum to > 0") + phase_weights = [max(x, 0.0) / total for x in h.train_loop_phase_fractions] + phase_boundaries = [] + running = 0.0 + for w in phase_weights[:-1]: + running += w + phase_boundaries.append(running) + else: + phase_weights = None + phase_boundaries = [] + elif h.train_loop_depth_set: + train_loop_depths = sorted(set(h.train_loop_depth_set)) + train_loop_depth_weights = [1] * len(train_loop_depths) + phase_weights = None + phase_boundaries = [] + else: + train_loop_depths, train_loop_depth_weights = _loop_depth_weights( + h.train_loop_min_depth, + h.train_loop_max_depth, + h.train_loop_depth_dist, + _repeats_to_depth(h.num_loops), + ) + phase_weights = None + phase_boundaries = [] + train_loop_enabled = any(depth > 1 for depth in train_loop_depths) + eval_loop_repeats = _depth_to_repeats(h.eval_loop_depth) + if phased_train_loop: + default_prewarm_depths = sorted( + {depth for depth in train_loop_depths if depth > 1} | {h.eval_loop_depth} + ) + else: + default_prewarm_depths = sorted( + { + 
_repeats_to_depth(h.num_loops), + h.eval_loop_depth, + } + ) + train_loop_prewarm_depths = ( + sorted(set(h.train_loop_prewarm_depths)) + if h.train_loop_prewarm_depths + else default_prewarm_depths + ) + + def _set_train_loop_depth(depth): + depth = max(int(depth), 1) + base_model.looping_active = depth > 1 + base_model.set_loop_repeats(_depth_to_repeats(depth)) + + def _sample_train_loop_depth(): + if not train_loop_enabled: + return 1 + if len(train_loop_depths) == 1: + return train_loop_depths[0] + return random.choices( + train_loop_depths, weights=train_loop_depth_weights, k=1 + )[0] + + def _phased_train_loop_depth(frac): + if not phased_train_loop or not train_loop_depths: + return 1 + frac = min(max(float(frac), 0.0), 1.0 - 1e-12) + if phase_boundaries: + idx = 0 + while idx < len(phase_boundaries) and frac >= phase_boundaries[idx]: + idx += 1 + else: + idx = min(int(frac * len(train_loop_depths)), len(train_loop_depths) - 1) + return train_loop_depths[idx] + + def _set_eval_loop_depth(model_obj): + model_obj.looping_active = h.eval_loop_depth > 1 + if hasattr(model_obj, "set_loop_repeats"): + model_obj.set_loop_repeats(eval_loop_repeats) + + def _run_with_eval_loop_depth(fn, *args, **kwargs): + prev_active = base_model.looping_active + prev_repeats = getattr(base_model, "active_loop_repeats", 0) + _set_eval_loop_depth(base_model) + try: + return fn(*args, **kwargs) + finally: + base_model.looping_active = prev_active + base_model.set_loop_repeats(prev_repeats) + + log( + "loop_depth_schedule:" + f" train={train_loop_depths}" + f" dist={'phased' if phased_train_loop else h.train_loop_depth_dist}" + f" phase_fracs={phase_weights if phased_train_loop else None}" + f" prewarm={train_loop_prewarm_depths}" + f" eval_depth={h.eval_loop_depth}" + ) + + def training_frac(step, elapsed_ms): + if max_wallclock_ms is None: + return step / max(h.iterations, 1) + return elapsed_ms / max(max_wallclock_ms, 1e-09) + + def lr_mul(frac): + if h.warmdown_frac <= 0: + return 1.0 + if frac >= 1.0 - h.warmdown_frac: + return max((1.0 - frac) / h.warmdown_frac, h.min_lr) + return 1.0 + + def step_fn(step, lr_scale): + if base_model.looping_active and not phased_train_loop: + base_model.set_loop_repeats(_depth_to_repeats(_sample_train_loop_depth())) + optimizers.zero_grad_all() + train_loss = torch.zeros((), device=device) + for micro_step in range(h.grad_accum_steps): + x, y, cu_seqlens, _max_seqlen = train_loader.next_batch( + h.train_batch_tokens, h.grad_accum_steps + ) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y, cu_seqlens=cu_seqlens, max_seqlen=h.train_seq_len) + train_loss += loss.detach() + (loss / h.grad_accum_steps).backward() + train_loss /= h.grad_accum_steps + frac = ( + min(step / h.muon_momentum_warmup_steps, 1.0) + if h.muon_momentum_warmup_steps > 0 + else 1.0 + ) + muon_momentum = ( + 1 - frac + ) * h.muon_momentum_warmup_start + frac * h.muon_momentum + for group in optimizers.optimizer_muon.param_groups: + group["momentum"] = muon_momentum + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * lr_scale + if h.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(base_model.parameters(), h.grad_clip_norm) + optimizers.step(distributed=h.distributed) + return train_loss + + if h.warmup_steps > 0: + initial_model_state = { + name: tensor.detach().cpu().clone() + for (name, tensor) in base_model.state_dict().items() + } + initial_optimizer_states = [ + copy.deepcopy(opt.state_dict()) for opt in 
optimizers + ] + model.train() + num_tokens_local = h.train_batch_tokens // h.world_size + for blk in base_model.blocks: + blk.attn.rotary(num_tokens_local, device, torch.bfloat16) + cu_bucket_size = train_loader.cu_bucket_size + warmup_cu_buckets = tuple(cu_bucket_size * i for i in range(1, 5)) + warmup_cu_iters = 3 + x, y, cu_seqlens, _ = train_loader.next_batch( + h.train_batch_tokens, h.grad_accum_steps + ) + log(f"warmup_cu_buckets:{','.join(str(b) for b in warmup_cu_buckets)} iters_each:{warmup_cu_iters}") + def _run_cu_bucket_warmup(): + for bucket_len in warmup_cu_buckets: + boundaries = list(range(0, x.size(1), max(h.train_seq_len, 1))) + if boundaries[-1] != x.size(1): + boundaries.append(x.size(1)) + cu = torch.full((bucket_len,), x.size(1), dtype=torch.int32, device=device) + cu[: len(boundaries)] = torch.tensor(boundaries, dtype=torch.int32, device=device) + for _ in range(warmup_cu_iters): + optimizers.zero_grad_all() + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + wloss = model(x, y, cu_seqlens=cu, max_seqlen=h.train_seq_len) + (wloss / h.grad_accum_steps).backward() + optimizers.zero_grad_all() + _run_cu_bucket_warmup() + if train_loop_enabled or len(set(train_loop_prewarm_depths)) > 1: + seen_train_loop_depths = set() + for loop_depth in train_loop_prewarm_depths: + if loop_depth in seen_train_loop_depths: + continue + seen_train_loop_depths.add(loop_depth) + _set_train_loop_depth(loop_depth) + _run_cu_bucket_warmup() + base_model.looping_active = False + base_model.set_loop_repeats(_depth_to_repeats(h.num_loops + 1)) + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step, 1.0) + if ( + warmup_step <= 5 + or (warmup_step + 1) % 10 == 0 + or warmup_step + 1 == h.warmup_steps + ): + log(f"warmup_step: {warmup_step+1}/{h.warmup_steps}") + if train_loop_enabled and not phased_train_loop: + _set_train_loop_depth(_repeats_to_depth(h.num_loops)) + log( + f"loop_warmup:enabled encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}" + ) + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step, 1.0) + if ( + warmup_step <= 5 + or (warmup_step + 1) % 10 == 0 + or warmup_step + 1 == h.warmup_steps + ): + log(f"loop_warmup_step: {warmup_step+1}/{h.warmup_steps}") + base_model.looping_active = False + base_model.set_loop_repeats(_depth_to_repeats(h.num_loops + 1)) + base_model.load_state_dict(initial_model_state, strict=True) + for (opt, state) in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + optimizers.zero_grad_all() + train_loader = DocumentPackingLoader(h, device) + ema_state = { + name: t.detach().float().clone() + for (name, t) in base_model.state_dict().items() + } + ema_decay = h.ema_decay + training_time_ms = 0.0 + stop_after_step = None + current_train_loop_depth = 1 + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + while True: + last_step = ( + step == h.iterations + or stop_after_step is not None + and step >= stop_after_step + ) + should_validate = ( + last_step or h.val_loss_every > 0 and step % h.val_loss_every == 0 + ) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1e3 * (time.perf_counter() - t0) + val_loss, val_bpb = _run_with_eval_loop_depth( + eval_val, + h, device, val_data, model, compiled_forward_logits + ) + log( + f"{step}/{h.iterations} val_loss: {val_loss:.4f} val_bpb: {val_bpb:.4f}" + ) + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < 
h.iterations: + log( + f"stopping_early: wallclock_cap train_time: {training_time_ms:.0f}ms step: {step}/{h.iterations}" + ) + break + elapsed_ms = training_time_ms + 1e3 * (time.perf_counter() - t0) + frac = training_frac(step, elapsed_ms) + scale = lr_mul(frac) + if phased_train_loop: + target_train_loop_depth = _phased_train_loop_depth(frac) + if target_train_loop_depth != current_train_loop_depth: + _set_train_loop_depth(target_train_loop_depth) + current_train_loop_depth = target_train_loop_depth + log( + f"layer_loop:phase step:{step} frac:{frac:.3f} " + f"depth:{target_train_loop_depth} phases:{train_loop_depths} " + f"eval_depth:{h.eval_loop_depth}" + ) + elif ( + train_loop_enabled + and not base_model.looping_active + and frac >= h.enable_looping_at + ): + base_model.looping_active = True + base_model.set_loop_repeats( + _depth_to_repeats(_sample_train_loop_depth()) + ) + log( + f"layer_loop:enabled step:{step} frac:{frac:.3f} " + f"train_depths:{train_loop_depths} eval_depth:{h.eval_loop_depth}" + ) + train_loss = step_fn(step, scale) + with torch.no_grad(): + for (name, t) in base_model.state_dict().items(): + ema_state[name].mul_(ema_decay).add_( + t.detach().float(), alpha=1.0 - ema_decay + ) + step += 1 + approx_training_time_ms = training_time_ms + 1e3 * (time.perf_counter() - t0) + should_log_train = h.train_log_every > 0 and ( + step <= 5 or step % h.train_log_every == 0 or stop_after_step is not None + ) + if should_log_train: + tok_per_sec = step * h.train_batch_tokens / (approx_training_time_ms / 1e3) + log( + f"{step}/{h.iterations} train_loss: {train_loss.item():.4f} train_time: {approx_training_time_ms/60000:.1f}m tok/s: {tok_per_sec:.0f}" + ) + reached_cap = ( + max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms + ) + if h.distributed and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + log( + f"peak memory allocated: {torch.cuda.max_memory_allocated()//1024//1024} MiB reserved: {torch.cuda.max_memory_reserved()//1024//1024} MiB" + ) + log("ema:applying EMA weights") + current_state = base_model.state_dict() + avg_state = { + name: t.to(dtype=current_state[name].dtype) for (name, t) in ema_state.items() + } + base_model.load_state_dict(avg_state, strict=True) + return base_model, compiled_model, compiled_forward_logits + + +def train_and_eval(h, device): + random.seed(h.seed) + np.random.seed(h.seed) + torch.manual_seed(h.seed) + torch.cuda.manual_seed_all(h.seed) + if h.artifact_dir and h.is_main_process: + os.makedirs(h.artifact_dir, exist_ok=True) + val_data = ValidationData(h, device) + log( + f"train_shards: {len(list(Path(h.datasets_dir).resolve().glob('fineweb_train_*.bin')))}" + ) + log(f"val_tokens: {val_data.val_tokens.numel()-1}") + base_model, compiled_model, compiled_forward_logits = train_model( + h, device, val_data + ) + torch._dynamo.reset() + if h.eval_loop_depth > 1: + base_model.looping_active = True + base_model.set_loop_repeats(_depth_to_repeats(h.eval_loop_depth)) + timed_eval( + "diagnostic pre-quantization post-ema", + eval_val, + h, + device, + val_data, + compiled_model, + compiled_forward_logits, + ) + serialize(h, base_model, Path(__file__).read_text(encoding="utf-8")) + if h.distributed: + dist.barrier() + eval_model = deserialize(h, device) + if h.eval_loop_depth > 1: + 
eval_model.looping_active = True + eval_model.set_loop_repeats(_depth_to_repeats(h.eval_loop_depth)) + compiled_model = torch.compile(eval_model, dynamic=False, fullgraph=True) + compiled_forward_logits = torch.compile( + eval_model.forward_logits, dynamic=False, fullgraph=True + ) + timed_eval( + "diagnostic quantized", + eval_val, + h, + device, + val_data, + compiled_model, + compiled_forward_logits, + ) + if h.ttt_enabled: + del eval_model, compiled_model + torch._dynamo.reset() + torch.cuda.empty_cache() + ttt_model = deserialize(h, device) + if h.eval_loop_depth > 1: + ttt_model.looping_active = True + ttt_model.set_loop_repeats(_depth_to_repeats(h.eval_loop_depth)) + for p in ttt_model.parameters(): + p.requires_grad_(False) + + if h.rope_yarn: + _yarn_seqlen = h.train_batch_tokens // h.grad_accum_steps + for block in ttt_model.blocks: + block.attn.rotary(_yarn_seqlen, device, torch.bfloat16) + else: + for block in ttt_model.blocks: + block.attn.rotary._cos_cached = None + block.attn.rotary._sin_cached = None + block.attn.rotary._seq_len_cached = 0 + block.attn.rotary(h.ttt_eval_seq_len, device, torch.bfloat16) + + def _fwd_ttt_inner(input_ids, target_ids, lora): + return ttt_model.forward_ttt(input_ids, target_ids, lora=lora) + + _fwd_ttt_compiled_inner = None + + def _fwd_ttt(input_ids, target_ids, lora): + nonlocal _fwd_ttt_compiled_inner + if _fwd_ttt_compiled_inner is None: + _fwd_ttt_compiled_inner = torch.compile(_fwd_ttt_inner, dynamic=True) + return _fwd_ttt_compiled_inner(input_ids, target_ids, lora=lora) + + fwd_ttt_compiled = _fwd_ttt + log(f"ttt_lora:warming up compile (random tokens, no val data)") + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + t_warmup = time.perf_counter() + warmup_bszes = [h.ttt_batch_size] + for bsz in warmup_bszes: + wl = BatchedTTTLoRA( + bsz, ttt_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + wo = torch.optim.AdamW( + wl.parameters(), + lr=h.ttt_lora_lr, + betas=(h.ttt_beta1, h.ttt_beta2), + eps=1e-10, + weight_decay=h.ttt_weight_decay, + fused=True, + ) + for ctx_len in (h.ttt_chunk_size, h.ttt_eval_seq_len): + xw = torch.randint(0, h.vocab_size, (bsz, ctx_len), device=device, dtype=torch.int64) + yw = torch.randint(0, h.vocab_size, (bsz, ctx_len), device=device, dtype=torch.int64) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + ptl = fwd_ttt_compiled(xw, yw, lora=wl) + ptl[:, : min(h.ttt_chunk_size, ctx_len)].mean(dim=-1).sum().backward() + wo.step() + wo.zero_grad(set_to_none=True) + del wl, wo + torch.cuda.empty_cache() + compile_elapsed = time.perf_counter() - t_warmup + log(f"ttt_lora:compile warmup done ({compile_elapsed:.1f}s)") + log("\nbeginning TTT eval timer") + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_val_loss, ttt_val_bpb = eval_val_ttt_phased( + h, ttt_model, device, val_data, forward_ttt_train=fwd_ttt_compiled + ) + torch.cuda.synchronize() + ttt_eval_elapsed = time.perf_counter() - t_ttt + log( + "quantized_ttt_phased " + f"val_loss:{ttt_val_loss:.8f} val_bpb:{ttt_val_bpb:.8f} " + f"eval_time:{1e3*ttt_eval_elapsed:.0f}ms" + ) + log(f"total_eval_time:{ttt_eval_elapsed:.1f}s") + del ttt_model + + +def main(): + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be 
positive, got {world_size}") + if 8 % world_size != 0: + raise ValueError( + f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral" + ) + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + torch.set_float32_matmul_precision("high") + from torch.backends.cuda import ( + enable_cudnn_sdp, + enable_flash_sdp, + enable_math_sdp, + enable_mem_efficient_sdp, + ) + + enable_cudnn_sdp(False) + enable_flash_sdp(True) + enable_mem_efficient_sdp(False) + enable_math_sdp(False) + torch._dynamo.config.optimize_ddp = False + torch._dynamo.config.cache_size_limit = 16 + h = Hyperparameters() + set_logging_hparams(h) + if h.is_main_process: + os.makedirs(h.artifact_dir if h.artifact_dir else "logs", exist_ok=True) + log(100 * "=", console=False) + log("Hyperparameters:", console=True) + for (k, v) in sorted(vars(type(h)).items()): + if not k.startswith("_"): + log(f" {k}: {v}", console=True) + log("=" * 100, console=False) + log("Source code:", console=False) + log("=" * 100, console=False) + with open(__file__, "r", encoding="utf-8") as _src: + log(_src.read(), console=False) + log("=" * 100, console=False) + log(f"Running Python {sys.version}", console=False) + log(f"Running PyTorch {torch.__version__}", console=False) + log("=" * 100, console=False) + train_and_eval(h, device) + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed0.log b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed0.log new file mode 100644 index 0000000000..9603209ed4 --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed0.log @@ -0,0 +1,846 @@ +Running: env | egrep '^(RUN_ID|SEED|CASEOPS_ENABLED|TRAIN_LOOP_(PHASE_DEPTHS|PREWARM_DEPTHS)|EVAL_LOOP_DEPTH|DATA_PATH|TOKENIZER_PATH|EMBED_BITS|GATED_ATTN_)' | sort +CASEOPS_ENABLED=1 +DATA_PATH=/workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved +EMBED_BITS=7 +EVAL_LOOP_DEPTH=4 +GATED_ATTN_ENABLED=1 +GATED_ATTN_INIT_STD=0.005 +GATED_ATTN_QUANT_GATE=1 +RUN_ID=pr1736_eq134_eval4_seed0_rerun +SEED=0 +TOKENIZER_PATH=/workspace/parameter-golf-pr1736/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model +TRAIN_LOOP_PHASE_DEPTHS=1,3,4 +TRAIN_LOOP_PREWARM_DEPTHS=3,4 +W0420 20:20:11.894000 188285 torch/distributed/run.py:803] +W0420 20:20:11.894000 188285 torch/distributed/run.py:803] ***************************************** +W0420 20:20:11.894000 188285 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0420 20:20:11.894000 188285 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + artifact_dir: /workspace/fullruns/pr1736_eq134_eval4_seed0_rerun/artifact + attn_clip_sigmas: 13.0 + attn_out_gate_enabled: False + attn_out_gate_src: proj + beta1: 0.9 + beta2: 0.95 + caseops_enabled: True + compressor: brotli + data_dir: /workspace/parameter-golf-pr1736/data + datasets_dir: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 15.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_loop_depth: 4 + eval_seq_len: 2048 + eval_stride: 64 + gate_window: 12 + gated_attn_enabled: True + gated_attn_init_std: 0.005 + gated_attn_quant_gate: True + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 4.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: /workspace/fullruns/pr1736_eq134_eval4_seed0_rerun/artifact/pr1736_eq134_eval4_seed0_rerun.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_clip_sigmas: 12.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: /workspace/fullruns/pr1736_eq134_eval4_seed0_rerun/artifact/final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2000 + qk_gain_init: 5.0 + quantized_model_path: /workspace/fullruns/pr1736_eq134_eval4_seed0_rerun/artifact/final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: pr1736_eq134_eval4_seed0_rerun + scalar_lr: 0.02 + seed: 0 + skip_gates_enabled: True + smear_gate_enabled: False + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/parameter-golf-pr1736/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + train_batch_tokens: 786432 + train_files: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin + train_log_every: 500 + train_loop_depth_dist: fixed + train_loop_depth_set: [] + train_loop_max_depth: 3 + train_loop_min_depth: 3 + train_loop_phase_depths: [1, 3, 4] + train_loop_phase_fractions: [] + train_loop_prewarm_depths: [3, 4] + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.999 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 96 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_bytes_files: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin + val_doc_fraction: 1.0 + val_files: 
/workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_[0-9][0-9][0-9][0-9][0-9][0-9].bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.75 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 80 +val_tokens: 47851520 +model_params:35989658 +gptq:reserving 4s, effective=596000ms +loop_depth_schedule: train=[1, 3, 4] dist=phased phase_fracs=None prewarm=[3, 4] eval_depth=4 +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +0/20000 val_loss: 9.0314 val_bpb: 4.1267 +1/20000 train_loss: 9.0333 train_time: 0.0m tok/s: 12802762 +2/20000 train_loss: 12.9062 train_time: 0.0m tok/s: 11516724 +3/20000 train_loss: 10.1622 train_time: 0.0m tok/s: 10239296 +4/20000 train_loss: 8.5887 train_time: 0.0m tok/s: 9694146 +5/20000 train_loss: 7.7916 train_time: 0.0m tok/s: 9378852 +500/20000 train_loss: 2.5802 train_time: 0.8m tok/s: 8054443 +1000/20000 train_loss: 2.8064 train_time: 1.6m tok/s: 8004960 +1500/20000 train_loss: 2.6347 train_time: 2.5m tok/s: 7991913 +2000/20000 train_loss: 2.6662 train_time: 3.3m tok/s: 7989371 +layer_loop:phase step:2019 frac:0.333 depth:3 phases:[1, 3, 4] eval_depth:4 +2500/20000 train_loss: 2.5495 train_time: 4.5m tok/s: 7340726 +3000/20000 train_loss: 2.5604 train_time: 5.7m tok/s: 6945840 +layer_loop:phase step:3402 frac:0.667 depth:4 phases:[1, 3, 4] eval_depth:4 +3500/20000 train_loss: 2.5577 train_time: 6.9m tok/s: 6652412 +4000/20000 train_loss: 2.3885 train_time: 8.3m tok/s: 6333361 +4000/20000 val_loss: 2.4112 val_bpb: 1.1018 +4500/20000 train_loss: 2.2440 train_time: 9.7m tok/s: 6105878 +4599/20000 val_loss: 2.3371 val_bpb: 1.0679 +stopping_early: wallclock_cap train_time: 596095ms step: 4599/20000 +peak memory allocated: 46573 MiB reserved: 50344 MiB +ema:applying EMA weights +diagnostic pre-quantization post-ema val_loss:2.33686505 val_bpb:1.06778679 eval_time:8067ms +Serialized model: 135592891 bytes +Code size (uncompressed): 141428 bytes +Code size (compressed): 35795 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 67 Hessians in 4.0s +Quantized weights: + gate_int8_row: blocks.attn.attn_gate_w + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights +Serialized model quantized+brotli: 15948631 bytes +Total submission size quantized+brotli: 15984426 bytes +diagnostic quantized val_loss:2.35677860 val_bpb:1.07688592 eval_time:12973ms +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (119.8s) + +beginning TTT eval timer +ttt_phased: total_docs:50000 prefix_docs:2000 suffix_docs:48000 num_phases:3 boundaries:[666, 1333, 2000] +ttp: b781/782 bl:2.1596 bb:1.0567 rl:2.1596 rb:1.0567 dl:17258-30330 gd:0 +ttpp: phase:1/3 pd:1104 gd:666 t:199.5s +tttg: c1/111 lr:0.001000 t:0.3s +tttg: c2/111 lr:0.001000 t:0.4s +tttg: c3/111 lr:0.000999 t:0.5s +tttg: c4/111 lr:0.000998 t:0.6s +tttg: c5/111 lr:0.000997 t:0.7s +tttg: c6/111 lr:0.000995 t:0.8s +tttg: c7/111 lr:0.000993 t:0.9s +tttg: c8/111 lr:0.000990 t:1.0s +tttg: c9/111 lr:0.000987 t:1.1s +tttg: c10/111 lr:0.000984 t:1.1s +tttg: c11/111 lr:0.000980 t:1.2s +tttg: c12/111 lr:0.000976 t:1.3s +tttg: c13/111 lr:0.000971 t:1.4s +tttg: c14/111 lr:0.000966 t:1.5s +tttg: c15/111 lr:0.000961 t:1.6s +tttg: c16/111 lr:0.000955 t:1.7s +tttg: c17/111 lr:0.000949 t:1.8s +tttg: c18/111 lr:0.000942 t:1.9s +tttg: c19/111 lr:0.000935 t:1.9s +tttg: c20/111 lr:0.000928 t:2.0s +tttg: c21/111 lr:0.000921 t:2.1s +tttg: c22/111 lr:0.000913 t:2.2s +tttg: c23/111 lr:0.000905 t:2.3s +tttg: c24/111 lr:0.000896 t:2.4s +tttg: c25/111 lr:0.000887 t:2.5s +tttg: c26/111 lr:0.000878 t:2.5s +tttg: c27/111 lr:0.000868 t:2.6s +tttg: c28/111 lr:0.000859 t:2.7s +tttg: c29/111 lr:0.000848 t:2.8s +tttg: c30/111 lr:0.000838 t:2.9s +tttg: c31/111 lr:0.000827 t:3.0s +tttg: c32/111 lr:0.000817 t:3.1s +tttg: c33/111 lr:0.000805 t:3.1s +tttg: c34/111 lr:0.000794 t:3.2s +tttg: c35/111 lr:0.000782 t:3.3s +tttg: c36/111 lr:0.000770 t:3.4s +tttg: c37/111 lr:0.000758 t:3.5s +tttg: c38/111 lr:0.000746 t:3.6s +tttg: c39/111 lr:0.000733 t:3.7s +tttg: c40/111 lr:0.000721 t:3.8s +tttg: c41/111 lr:0.000708 t:3.9s +tttg: c42/111 lr:0.000695 t:3.9s +tttg: c43/111 lr:0.000681 t:4.0s +tttg: c44/111 lr:0.000668 t:4.1s +tttg: c45/111 lr:0.000655 t:4.2s +tttg: c46/111 lr:0.000641 t:4.3s +tttg: c47/111 lr:0.000627 t:4.4s +tttg: c48/111 lr:0.000613 t:4.5s +tttg: c49/111 lr:0.000599 t:4.5s +tttg: c50/111 lr:0.000585 t:4.6s +tttg: c51/111 lr:0.000571 t:4.7s +tttg: c52/111 lr:0.000557 t:4.8s +tttg: c53/111 lr:0.000543 t:4.9s +tttg: c54/111 lr:0.000529 t:5.0s +tttg: c55/111 lr:0.000514 t:5.1s +tttg: c56/111 lr:0.000500 t:5.2s +tttg: c57/111 lr:0.000486 t:5.3s +tttg: c58/111 lr:0.000471 t:5.4s +tttg: c59/111 lr:0.000457 t:5.4s +tttg: c60/111 lr:0.000443 t:5.5s +tttg: c61/111 lr:0.000429 t:5.6s +tttg: c62/111 lr:0.000415 t:5.7s +tttg: c63/111 lr:0.000401 t:5.8s +tttg: c64/111 lr:0.000387 t:5.9s +tttg: c65/111 lr:0.000373 t:6.0s +tttg: c66/111 lr:0.000359 t:6.1s +tttg: c67/111 lr:0.000345 t:6.1s +tttg: c68/111 lr:0.000332 t:6.2s +tttg: c69/111 lr:0.000319 t:6.3s +tttg: c70/111 lr:0.000305 t:6.4s +tttg: c71/111 lr:0.000292 t:6.5s +tttg: c72/111 lr:0.000279 t:6.6s +tttg: c73/111 lr:0.000267 t:6.7s +tttg: c74/111 lr:0.000254 t:6.8s +tttg: c75/111 lr:0.000242 t:6.9s +tttg: c76/111 
lr:0.000230 t:7.0s +tttg: c77/111 lr:0.000218 t:7.0s +tttg: c78/111 lr:0.000206 t:7.1s +tttg: c79/111 lr:0.000195 t:7.2s +tttg: c80/111 lr:0.000183 t:7.3s +tttg: c81/111 lr:0.000173 t:7.4s +tttg: c82/111 lr:0.000162 t:7.5s +tttg: c83/111 lr:0.000152 t:7.6s +tttg: c84/111 lr:0.000141 t:7.6s +tttg: c85/111 lr:0.000132 t:7.7s +tttg: c86/111 lr:0.000122 t:7.8s +tttg: c87/111 lr:0.000113 t:7.9s +tttg: c88/111 lr:0.000104 t:8.0s +tttg: c89/111 lr:0.000095 t:8.1s +tttg: c90/111 lr:0.000087 t:8.2s +tttg: c91/111 lr:0.000079 t:8.3s +tttg: c92/111 lr:0.000072 t:8.3s +tttg: c93/111 lr:0.000065 t:8.4s +tttg: c94/111 lr:0.000058 t:8.5s +tttg: c95/111 lr:0.000051 t:8.6s +tttg: c96/111 lr:0.000045 t:8.7s +tttg: c97/111 lr:0.000039 t:8.8s +tttg: c98/111 lr:0.000034 t:8.9s +tttg: c99/111 lr:0.000029 t:9.0s +tttg: c100/111 lr:0.000024 t:9.0s +tttg: c101/111 lr:0.000020 t:9.1s +tttg: c102/111 lr:0.000016 t:9.2s +tttg: c103/111 lr:0.000013 t:9.3s +tttg: c104/111 lr:0.000010 t:9.4s +tttg: c105/111 lr:0.000007 t:9.5s +tttg: c106/111 lr:0.000005 t:9.6s +tttg: c107/111 lr:0.000003 t:9.6s +tttg: c108/111 lr:0.000002 t:9.7s +tttg: c109/111 lr:0.000001 t:9.8s +tttg: c110/111 lr:0.000000 t:9.9s +ttpr: phase:1/3 t:211.8s +ttp: b759/782 bl:2.3798 bb:1.0836 rl:2.1918 rb:1.0608 dl:3741-3817 gd:0 +ttp: b754/782 bl:2.2927 bb:1.0605 rl:2.2035 rb:1.0608 dl:3345-3397 gd:0 +ttpp: phase:2/3 pd:1808 gd:1333 t:291.2s +tttg: c1/185 lr:0.001000 t:0.1s +tttg: c2/185 lr:0.001000 t:0.2s +tttg: c3/185 lr:0.001000 t:0.3s +tttg: c4/185 lr:0.000999 t:0.3s +tttg: c5/185 lr:0.000999 t:0.4s +tttg: c6/185 lr:0.000998 t:0.5s +tttg: c7/185 lr:0.000997 t:0.6s +tttg: c8/185 lr:0.000996 t:0.7s +tttg: c9/185 lr:0.000995 t:0.8s +tttg: c10/185 lr:0.000994 t:0.9s +tttg: c11/185 lr:0.000993 t:0.9s +tttg: c12/185 lr:0.000991 t:1.0s +tttg: c13/185 lr:0.000990 t:1.1s +tttg: c14/185 lr:0.000988 t:1.2s +tttg: c15/185 lr:0.000986 t:1.3s +tttg: c16/185 lr:0.000984 t:1.4s +tttg: c17/185 lr:0.000981 t:1.5s +tttg: c18/185 lr:0.000979 t:1.5s +tttg: c19/185 lr:0.000977 t:1.6s +tttg: c20/185 lr:0.000974 t:1.7s +tttg: c21/185 lr:0.000971 t:1.8s +tttg: c22/185 lr:0.000968 t:1.9s +tttg: c23/185 lr:0.000965 t:2.0s +tttg: c24/185 lr:0.000962 t:2.1s +tttg: c25/185 lr:0.000959 t:2.2s +tttg: c26/185 lr:0.000955 t:2.2s +tttg: c27/185 lr:0.000952 t:2.3s +tttg: c28/185 lr:0.000948 t:2.4s +tttg: c29/185 lr:0.000944 t:2.5s +tttg: c30/185 lr:0.000940 t:2.6s +tttg: c31/185 lr:0.000936 t:2.7s +tttg: c32/185 lr:0.000932 t:2.7s +tttg: c33/185 lr:0.000927 t:2.8s +tttg: c34/185 lr:0.000923 t:2.9s +tttg: c35/185 lr:0.000918 t:3.0s +tttg: c36/185 lr:0.000913 t:3.1s +tttg: c37/185 lr:0.000908 t:3.2s +tttg: c38/185 lr:0.000904 t:3.3s +tttg: c39/185 lr:0.000898 t:3.4s +tttg: c40/185 lr:0.000893 t:3.5s +tttg: c41/185 lr:0.000888 t:3.5s +tttg: c42/185 lr:0.000882 t:3.6s +tttg: c43/185 lr:0.000877 t:3.7s +tttg: c44/185 lr:0.000871 t:3.8s +tttg: c45/185 lr:0.000865 t:3.9s +tttg: c46/185 lr:0.000860 t:4.0s +tttg: c47/185 lr:0.000854 t:4.1s +tttg: c48/185 lr:0.000847 t:4.2s +tttg: c49/185 lr:0.000841 t:4.2s +tttg: c50/185 lr:0.000835 t:4.3s +tttg: c51/185 lr:0.000829 t:4.4s +tttg: c52/185 lr:0.000822 t:4.5s +tttg: c53/185 lr:0.000816 t:4.6s +tttg: c54/185 lr:0.000809 t:4.7s +tttg: c55/185 lr:0.000802 t:4.8s +tttg: c56/185 lr:0.000795 t:4.9s +tttg: c57/185 lr:0.000788 t:5.0s +tttg: c58/185 lr:0.000781 t:5.0s +tttg: c59/185 lr:0.000774 t:5.1s +tttg: c60/185 lr:0.000767 t:5.2s +tttg: c61/185 lr:0.000760 t:5.3s +tttg: c62/185 lr:0.000752 t:5.4s +tttg: c63/185 lr:0.000745 t:5.5s +tttg: c64/185 
lr:0.000738 t:5.6s +tttg: c65/185 lr:0.000730 t:5.7s +tttg: c66/185 lr:0.000722 t:5.7s +tttg: c67/185 lr:0.000715 t:5.8s +tttg: c68/185 lr:0.000707 t:5.9s +tttg: c69/185 lr:0.000699 t:6.0s +tttg: c70/185 lr:0.000691 t:6.1s +tttg: c71/185 lr:0.000683 t:6.2s +tttg: c72/185 lr:0.000675 t:6.3s +tttg: c73/185 lr:0.000667 t:6.3s +tttg: c74/185 lr:0.000659 t:6.4s +tttg: c75/185 lr:0.000651 t:6.5s +tttg: c76/185 lr:0.000643 t:6.6s +tttg: c77/185 lr:0.000635 t:6.7s +tttg: c78/185 lr:0.000627 t:6.8s +tttg: c79/185 lr:0.000618 t:6.9s +tttg: c80/185 lr:0.000610 t:7.0s +tttg: c81/185 lr:0.000602 t:7.0s +tttg: c82/185 lr:0.000593 t:7.1s +tttg: c83/185 lr:0.000585 t:7.2s +tttg: c84/185 lr:0.000577 t:7.3s +tttg: c85/185 lr:0.000568 t:7.4s +tttg: c86/185 lr:0.000560 t:7.5s +tttg: c87/185 lr:0.000551 t:7.6s +tttg: c88/185 lr:0.000543 t:7.7s +tttg: c89/185 lr:0.000534 t:7.7s +tttg: c90/185 lr:0.000526 t:7.8s +tttg: c91/185 lr:0.000517 t:7.9s +tttg: c92/185 lr:0.000509 t:8.0s +tttg: c93/185 lr:0.000500 t:8.1s +tttg: c94/185 lr:0.000491 t:8.2s +tttg: c95/185 lr:0.000483 t:8.3s +tttg: c96/185 lr:0.000474 t:8.4s +tttg: c97/185 lr:0.000466 t:8.4s +tttg: c98/185 lr:0.000457 t:8.5s +tttg: c99/185 lr:0.000449 t:8.6s +tttg: c100/185 lr:0.000440 t:8.7s +tttg: c101/185 lr:0.000432 t:8.8s +tttg: c102/185 lr:0.000423 t:8.9s +tttg: c103/185 lr:0.000415 t:9.0s +tttg: c104/185 lr:0.000407 t:9.0s +tttg: c105/185 lr:0.000398 t:9.1s +tttg: c106/185 lr:0.000390 t:9.2s +tttg: c107/185 lr:0.000382 t:9.3s +tttg: c108/185 lr:0.000373 t:9.4s +tttg: c109/185 lr:0.000365 t:9.5s +tttg: c110/185 lr:0.000357 t:9.6s +tttg: c111/185 lr:0.000349 t:9.7s +tttg: c112/185 lr:0.000341 t:9.7s +tttg: c113/185 lr:0.000333 t:9.8s +tttg: c114/185 lr:0.000325 t:9.9s +tttg: c115/185 lr:0.000317 t:10.0s +tttg: c116/185 lr:0.000309 t:10.1s +tttg: c117/185 lr:0.000301 t:10.2s +tttg: c118/185 lr:0.000293 t:10.3s +tttg: c119/185 lr:0.000285 t:10.4s +tttg: c120/185 lr:0.000278 t:10.4s +tttg: c121/185 lr:0.000270 t:10.5s +tttg: c122/185 lr:0.000262 t:10.6s +tttg: c123/185 lr:0.000255 t:10.7s +tttg: c124/185 lr:0.000248 t:10.8s +tttg: c125/185 lr:0.000240 t:10.9s +tttg: c126/185 lr:0.000233 t:11.0s +tttg: c127/185 lr:0.000226 t:11.0s +tttg: c128/185 lr:0.000219 t:11.1s +tttg: c129/185 lr:0.000212 t:11.2s +tttg: c130/185 lr:0.000205 t:11.3s +tttg: c131/185 lr:0.000198 t:11.4s +tttg: c132/185 lr:0.000191 t:11.5s +tttg: c133/185 lr:0.000184 t:11.6s +tttg: c134/185 lr:0.000178 t:11.7s +tttg: c135/185 lr:0.000171 t:11.7s +tttg: c136/185 lr:0.000165 t:11.8s +tttg: c137/185 lr:0.000159 t:11.9s +tttg: c138/185 lr:0.000153 t:12.0s +tttg: c139/185 lr:0.000146 t:12.1s +tttg: c140/185 lr:0.000140 t:12.2s +tttg: c141/185 lr:0.000135 t:12.3s +tttg: c142/185 lr:0.000129 t:12.4s +tttg: c143/185 lr:0.000123 t:12.4s +tttg: c144/185 lr:0.000118 t:12.5s +tttg: c145/185 lr:0.000112 t:12.6s +tttg: c146/185 lr:0.000107 t:12.7s +tttg: c147/185 lr:0.000102 t:12.8s +tttg: c148/185 lr:0.000096 t:12.9s +tttg: c149/185 lr:0.000092 t:13.0s +tttg: c150/185 lr:0.000087 t:13.0s +tttg: c151/185 lr:0.000082 t:13.1s +tttg: c152/185 lr:0.000077 t:13.2s +tttg: c153/185 lr:0.000073 t:13.3s +tttg: c154/185 lr:0.000068 t:13.4s +tttg: c155/185 lr:0.000064 t:13.5s +tttg: c156/185 lr:0.000060 t:13.6s +tttg: c157/185 lr:0.000056 t:13.7s +tttg: c158/185 lr:0.000052 t:13.7s +tttg: c159/185 lr:0.000048 t:13.8s +tttg: c160/185 lr:0.000045 t:13.9s +tttg: c161/185 lr:0.000041 t:14.0s +tttg: c162/185 lr:0.000038 t:14.1s +tttg: c163/185 lr:0.000035 t:14.2s +tttg: c164/185 lr:0.000032 t:14.3s +tttg: c165/185 
lr:0.000029 t:14.3s +tttg: c166/185 lr:0.000026 t:14.4s +tttg: c167/185 lr:0.000023 t:14.5s +tttg: c168/185 lr:0.000021 t:14.6s +tttg: c169/185 lr:0.000019 t:14.7s +tttg: c170/185 lr:0.000016 t:14.8s +tttg: c171/185 lr:0.000014 t:14.9s +tttg: c172/185 lr:0.000012 t:15.0s +tttg: c173/185 lr:0.000010 t:15.0s +tttg: c174/185 lr:0.000009 t:15.1s +tttg: c175/185 lr:0.000007 t:15.2s +tttg: c176/185 lr:0.000006 t:15.3s +tttg: c177/185 lr:0.000005 t:15.4s +tttg: c178/185 lr:0.000004 t:15.5s +tttg: c179/185 lr:0.000003 t:15.6s +tttg: c180/185 lr:0.000002 t:15.7s +tttg: c181/185 lr:0.000001 t:15.7s +tttg: c182/185 lr:0.000001 t:15.8s +tttg: c183/185 lr:0.000000 t:15.9s +tttg: c184/185 lr:0.000000 t:16.0s +ttpr: phase:2/3 t:309.6s +ttp: b746/782 bl:2.4184 bb:1.0656 rl:2.2229 rb:1.0613 dl:2884-2943 gd:0 +ttp: b745/782 bl:2.2423 bb:1.0266 rl:2.2245 rb:1.0583 dl:2842-2883 gd:0 +ttpp: phase:3/3 pd:2448 gd:2000 t:329.4s +tttg: c1/250 lr:0.001000 t:0.1s +tttg: c2/250 lr:0.001000 t:0.2s +tttg: c3/250 lr:0.001000 t:0.3s +tttg: c4/250 lr:0.001000 t:0.3s +tttg: c5/250 lr:0.000999 t:0.4s +tttg: c6/250 lr:0.000999 t:0.5s +tttg: c7/250 lr:0.000999 t:0.6s +tttg: c8/250 lr:0.000998 t:0.7s +tttg: c9/250 lr:0.000997 t:0.8s +tttg: c10/250 lr:0.000997 t:0.9s +tttg: c11/250 lr:0.000996 t:1.0s +tttg: c12/250 lr:0.000995 t:1.1s +tttg: c13/250 lr:0.000994 t:1.1s +tttg: c14/250 lr:0.000993 t:1.2s +tttg: c15/250 lr:0.000992 t:1.3s +tttg: c16/250 lr:0.000991 t:1.4s +tttg: c17/250 lr:0.000990 t:1.5s +tttg: c18/250 lr:0.000989 t:1.6s +tttg: c19/250 lr:0.000987 t:1.7s +tttg: c20/250 lr:0.000986 t:1.7s +tttg: c21/250 lr:0.000984 t:1.8s +tttg: c22/250 lr:0.000983 t:1.9s +tttg: c23/250 lr:0.000981 t:2.0s +tttg: c24/250 lr:0.000979 t:2.1s +tttg: c25/250 lr:0.000977 t:2.2s +tttg: c26/250 lr:0.000975 t:2.3s +tttg: c27/250 lr:0.000973 t:2.4s +tttg: c28/250 lr:0.000971 t:2.4s +tttg: c29/250 lr:0.000969 t:2.5s +tttg: c30/250 lr:0.000967 t:2.6s +tttg: c31/250 lr:0.000965 t:2.7s +tttg: c32/250 lr:0.000962 t:2.8s +tttg: c33/250 lr:0.000960 t:2.9s +tttg: c34/250 lr:0.000957 t:3.0s +tttg: c35/250 lr:0.000955 t:3.1s +tttg: c36/250 lr:0.000952 t:3.1s +tttg: c37/250 lr:0.000949 t:3.2s +tttg: c38/250 lr:0.000947 t:3.3s +tttg: c39/250 lr:0.000944 t:3.4s +tttg: c40/250 lr:0.000941 t:3.5s +tttg: c41/250 lr:0.000938 t:3.6s +tttg: c42/250 lr:0.000935 t:3.7s +tttg: c43/250 lr:0.000931 t:3.8s +tttg: c44/250 lr:0.000928 t:3.8s +tttg: c45/250 lr:0.000925 t:3.9s +tttg: c46/250 lr:0.000922 t:4.0s +tttg: c47/250 lr:0.000918 t:4.1s +tttg: c48/250 lr:0.000915 t:4.2s +tttg: c49/250 lr:0.000911 t:4.3s +tttg: c50/250 lr:0.000907 t:4.4s +tttg: c51/250 lr:0.000904 t:4.5s +tttg: c52/250 lr:0.000900 t:4.6s +tttg: c53/250 lr:0.000896 t:4.6s +tttg: c54/250 lr:0.000892 t:4.7s +tttg: c55/250 lr:0.000888 t:4.8s +tttg: c56/250 lr:0.000884 t:4.9s +tttg: c57/250 lr:0.000880 t:5.0s +tttg: c58/250 lr:0.000876 t:5.1s +tttg: c59/250 lr:0.000872 t:5.2s +tttg: c60/250 lr:0.000868 t:5.3s +tttg: c61/250 lr:0.000863 t:5.3s +tttg: c62/250 lr:0.000859 t:5.4s +tttg: c63/250 lr:0.000855 t:5.5s +tttg: c64/250 lr:0.000850 t:5.6s +tttg: c65/250 lr:0.000846 t:5.7s +tttg: c66/250 lr:0.000841 t:5.8s +tttg: c67/250 lr:0.000836 t:5.9s +tttg: c68/250 lr:0.000832 t:6.0s +tttg: c69/250 lr:0.000827 t:6.1s +tttg: c70/250 lr:0.000822 t:6.1s +tttg: c71/250 lr:0.000817 t:6.2s +tttg: c72/250 lr:0.000812 t:6.3s +tttg: c73/250 lr:0.000807 t:6.4s +tttg: c74/250 lr:0.000803 t:6.5s +tttg: c75/250 lr:0.000797 t:6.6s +tttg: c76/250 lr:0.000792 t:6.7s +tttg: c77/250 lr:0.000787 t:6.7s +tttg: c78/250 lr:0.000782 
t:6.8s +tttg: c79/250 lr:0.000777 t:6.9s +tttg: c80/250 lr:0.000772 t:7.0s +tttg: c81/250 lr:0.000766 t:7.1s +tttg: c82/250 lr:0.000761 t:7.2s +tttg: c83/250 lr:0.000755 t:7.3s +tttg: c84/250 lr:0.000750 t:7.4s +tttg: c85/250 lr:0.000745 t:7.5s +tttg: c86/250 lr:0.000739 t:7.6s +tttg: c87/250 lr:0.000733 t:7.6s +tttg: c88/250 lr:0.000728 t:7.7s +tttg: c89/250 lr:0.000722 t:7.8s +tttg: c90/250 lr:0.000717 t:7.9s +tttg: c91/250 lr:0.000711 t:8.0s +tttg: c92/250 lr:0.000705 t:8.1s +tttg: c93/250 lr:0.000699 t:8.2s +tttg: c94/250 lr:0.000694 t:8.2s +tttg: c95/250 lr:0.000688 t:8.3s +tttg: c96/250 lr:0.000682 t:8.4s +tttg: c97/250 lr:0.000676 t:8.5s +tttg: c98/250 lr:0.000670 t:8.6s +tttg: c99/250 lr:0.000664 t:8.7s +tttg: c100/250 lr:0.000658 t:8.8s +tttg: c101/250 lr:0.000652 t:8.9s +tttg: c102/250 lr:0.000646 t:8.9s +tttg: c103/250 lr:0.000640 t:9.0s +tttg: c104/250 lr:0.000634 t:9.1s +tttg: c105/250 lr:0.000628 t:9.2s +tttg: c106/250 lr:0.000622 t:9.3s +tttg: c107/250 lr:0.000616 t:9.4s +tttg: c108/250 lr:0.000610 t:9.5s +tttg: c109/250 lr:0.000603 t:9.6s +tttg: c110/250 lr:0.000597 t:9.6s +tttg: c111/250 lr:0.000591 t:9.7s +tttg: c112/250 lr:0.000585 t:9.8s +tttg: c113/250 lr:0.000579 t:9.9s +tttg: c114/250 lr:0.000572 t:10.0s +tttg: c115/250 lr:0.000566 t:10.1s +tttg: c116/250 lr:0.000560 t:10.2s +tttg: c117/250 lr:0.000554 t:10.3s +tttg: c118/250 lr:0.000547 t:10.3s +tttg: c119/250 lr:0.000541 t:10.4s +tttg: c120/250 lr:0.000535 t:10.5s +tttg: c121/250 lr:0.000528 t:10.6s +tttg: c122/250 lr:0.000522 t:10.7s +tttg: c123/250 lr:0.000516 t:10.8s +tttg: c124/250 lr:0.000509 t:10.9s +tttg: c125/250 lr:0.000503 t:11.0s +tttg: c126/250 lr:0.000497 t:11.1s +tttg: c127/250 lr:0.000491 t:11.1s +tttg: c128/250 lr:0.000484 t:11.2s +tttg: c129/250 lr:0.000478 t:11.3s +tttg: c130/250 lr:0.000472 t:11.4s +tttg: c131/250 lr:0.000465 t:11.5s +tttg: c132/250 lr:0.000459 t:11.6s +tttg: c133/250 lr:0.000453 t:11.7s +tttg: c134/250 lr:0.000446 t:11.7s +tttg: c135/250 lr:0.000440 t:11.8s +tttg: c136/250 lr:0.000434 t:11.9s +tttg: c137/250 lr:0.000428 t:12.0s +tttg: c138/250 lr:0.000421 t:12.1s +tttg: c139/250 lr:0.000415 t:12.2s +tttg: c140/250 lr:0.000409 t:12.3s +tttg: c141/250 lr:0.000403 t:12.4s +tttg: c142/250 lr:0.000397 t:12.5s +tttg: c143/250 lr:0.000390 t:12.5s +tttg: c144/250 lr:0.000384 t:12.6s +tttg: c145/250 lr:0.000378 t:12.7s +tttg: c146/250 lr:0.000372 t:12.8s +tttg: c147/250 lr:0.000366 t:12.9s +tttg: c148/250 lr:0.000360 t:13.0s +tttg: c149/250 lr:0.000354 t:13.1s +tttg: c150/250 lr:0.000348 t:13.2s +tttg: c151/250 lr:0.000342 t:13.2s +tttg: c152/250 lr:0.000336 t:13.3s +tttg: c153/250 lr:0.000330 t:13.4s +tttg: c154/250 lr:0.000324 t:13.5s +tttg: c155/250 lr:0.000318 t:13.6s +tttg: c156/250 lr:0.000312 t:13.7s +tttg: c157/250 lr:0.000306 t:13.8s +tttg: c158/250 lr:0.000301 t:13.8s +tttg: c159/250 lr:0.000295 t:13.9s +tttg: c160/250 lr:0.000289 t:14.0s +tttg: c161/250 lr:0.000283 t:14.1s +tttg: c162/250 lr:0.000278 t:14.2s +tttg: c163/250 lr:0.000272 t:14.3s +tttg: c164/250 lr:0.000267 t:14.4s +tttg: c165/250 lr:0.000261 t:14.5s +tttg: c166/250 lr:0.000255 t:14.6s +tttg: c167/250 lr:0.000250 t:14.6s +tttg: c168/250 lr:0.000245 t:14.7s +tttg: c169/250 lr:0.000239 t:14.8s +tttg: c170/250 lr:0.000234 t:14.9s +tttg: c171/250 lr:0.000228 t:15.0s +tttg: c172/250 lr:0.000223 t:15.1s +tttg: c173/250 lr:0.000218 t:15.2s +tttg: c174/250 lr:0.000213 t:15.3s +tttg: c175/250 lr:0.000208 t:15.3s +tttg: c176/250 lr:0.000203 t:15.4s +tttg: c177/250 lr:0.000197 t:15.5s +tttg: c178/250 lr:0.000193 t:15.6s 
+tttg: c179/250 lr:0.000188 t:15.7s +tttg: c180/250 lr:0.000183 t:15.8s +tttg: c181/250 lr:0.000178 t:15.9s +tttg: c182/250 lr:0.000173 t:16.0s +tttg: c183/250 lr:0.000168 t:16.0s +tttg: c184/250 lr:0.000164 t:16.1s +tttg: c185/250 lr:0.000159 t:16.2s +tttg: c186/250 lr:0.000154 t:16.3s +tttg: c187/250 lr:0.000150 t:16.4s +tttg: c188/250 lr:0.000145 t:16.5s +tttg: c189/250 lr:0.000141 t:16.6s +tttg: c190/250 lr:0.000137 t:16.6s +tttg: c191/250 lr:0.000132 t:16.7s +tttg: c192/250 lr:0.000128 t:16.8s +tttg: c193/250 lr:0.000124 t:16.9s +tttg: c194/250 lr:0.000120 t:17.0s +tttg: c195/250 lr:0.000116 t:17.1s +tttg: c196/250 lr:0.000112 t:17.1s +tttg: c197/250 lr:0.000108 t:17.2s +tttg: c198/250 lr:0.000104 t:17.3s +tttg: c199/250 lr:0.000100 t:17.4s +tttg: c200/250 lr:0.000096 t:17.5s +tttg: c201/250 lr:0.000093 t:17.6s +tttg: c202/250 lr:0.000089 t:17.7s +tttg: c203/250 lr:0.000085 t:17.7s +tttg: c204/250 lr:0.000082 t:17.8s +tttg: c205/250 lr:0.000078 t:17.9s +tttg: c206/250 lr:0.000075 t:18.0s +tttg: c207/250 lr:0.000072 t:18.1s +tttg: c208/250 lr:0.000069 t:18.2s +tttg: c209/250 lr:0.000065 t:18.3s +tttg: c210/250 lr:0.000062 t:18.3s +tttg: c211/250 lr:0.000059 t:18.4s +tttg: c212/250 lr:0.000056 t:18.5s +tttg: c213/250 lr:0.000053 t:18.6s +tttg: c214/250 lr:0.000051 t:18.7s +tttg: c215/250 lr:0.000048 t:18.8s +tttg: c216/250 lr:0.000045 t:18.9s +tttg: c217/250 lr:0.000043 t:19.0s +tttg: c218/250 lr:0.000040 t:19.0s +tttg: c219/250 lr:0.000038 t:19.1s +tttg: c220/250 lr:0.000035 t:19.2s +tttg: c221/250 lr:0.000033 t:19.3s +tttg: c222/250 lr:0.000031 t:19.4s +tttg: c223/250 lr:0.000029 t:19.5s +tttg: c224/250 lr:0.000027 t:19.6s +tttg: c225/250 lr:0.000025 t:19.6s +tttg: c226/250 lr:0.000023 t:19.7s +tttg: c227/250 lr:0.000021 t:19.8s +tttg: c228/250 lr:0.000019 t:19.9s +tttg: c229/250 lr:0.000017 t:20.0s +tttg: c230/250 lr:0.000016 t:20.1s +tttg: c231/250 lr:0.000014 t:20.2s +tttg: c232/250 lr:0.000013 t:20.3s +tttg: c233/250 lr:0.000011 t:20.3s +tttg: c234/250 lr:0.000010 t:20.4s +tttg: c235/250 lr:0.000009 t:20.5s +tttg: c236/250 lr:0.000008 t:20.6s +tttg: c237/250 lr:0.000007 t:20.7s +tttg: c238/250 lr:0.000006 t:20.8s +tttg: c239/250 lr:0.000005 t:20.9s +tttg: c240/250 lr:0.000004 t:21.0s +tttg: c241/250 lr:0.000003 t:21.1s +tttg: c242/250 lr:0.000003 t:21.1s +tttg: c243/250 lr:0.000002 t:21.2s +tttg: c244/250 lr:0.000001 t:21.3s +tttg: c245/250 lr:0.000001 t:21.4s +tttg: c246/250 lr:0.000001 t:21.5s +tttg: c247/250 lr:0.000000 t:21.6s +tttg: c248/250 lr:0.000000 t:21.6s +tttg: c249/250 lr:0.000000 t:21.7s +ttpr: phase:3/3 t:353.5s +ttp: b740/782 bl:2.2682 bb:1.0412 rl:2.2276 rb:1.0571 dl:2653-2686 gd:1 +ttp: b731/782 bl:2.3454 bb:1.0460 rl:2.2347 rb:1.0564 dl:2377-2414 gd:1 +ttp: b724/782 bl:2.3252 bb:1.0617 rl:2.2394 rb:1.0567 dl:2203-2231 gd:1 +ttp: b712/782 bl:2.3435 bb:1.0628 rl:2.2441 rb:1.0570 dl:1984-2002 gd:1 +ttp: b704/782 bl:2.2888 bb:1.0400 rl:2.2459 rb:1.0562 dl:1872-1885 gd:1 +ttp: b696/782 bl:2.3151 bb:1.0543 rl:2.2485 rb:1.0562 dl:1779-1790 gd:1 +ttp: b688/782 bl:2.4061 bb:1.0772 rl:2.2539 rb:1.0569 dl:1696-1706 gd:1 +ttp: b680/782 bl:2.2881 bb:1.0304 rl:2.2550 rb:1.0560 dl:1618-1628 gd:1 +ttp: b672/782 bl:2.3338 bb:1.0502 rl:2.2573 rb:1.0559 dl:1553-1562 gd:1 +ttp: b664/782 bl:2.3474 bb:1.0302 rl:2.2598 rb:1.0551 dl:1493-1499 gd:1 +ttp: b656/782 bl:2.3375 bb:1.1151 rl:2.2618 rb:1.0566 dl:1439-1445 gd:1 +ttp: b648/782 bl:2.2914 bb:1.0112 rl:2.2625 rb:1.0555 dl:1387-1392 gd:1 +ttp: b640/782 bl:2.3129 bb:1.0536 rl:2.2637 rb:1.0554 dl:1337-1343 gd:1 +ttp: b632/782 
bl:2.3582 bb:1.0375 rl:2.2657 rb:1.0550 dl:1290-1297 gd:1 +ttp: b625/782 bl:2.4178 bb:1.0549 rl:2.2689 rb:1.0550 dl:1255-1260 gd:1 +ttp: b617/782 bl:2.3196 bb:1.0250 rl:2.2698 rb:1.0544 dl:1211-1216 gd:1 +ttp: b609/782 bl:2.2794 bb:1.0212 rl:2.2700 rb:1.0538 dl:1172-1177 gd:1 +ttp: b601/782 bl:2.3415 bb:1.0251 rl:2.2713 rb:1.0532 dl:1137-1141 gd:1 +ttp: b595/782 bl:2.3601 bb:1.0653 rl:2.2728 rb:1.0534 dl:1110-1115 gd:1 +ttp: b586/782 bl:2.2646 bb:1.0355 rl:2.2727 rb:1.0531 dl:1073-1076 gd:1 +ttp: b578/782 bl:2.3636 bb:1.0376 rl:2.2741 rb:1.0529 dl:1041-1044 gd:1 +ttp: b569/782 bl:2.3129 bb:1.0458 rl:2.2746 rb:1.0528 dl:1007-1010 gd:1 +ttp: b561/782 bl:2.2554 bb:1.0174 rl:2.2744 rb:1.0523 dl:979-983 gd:1 +ttp: b554/782 bl:2.4407 bb:1.0988 rl:2.2766 rb:1.0529 dl:955-959 gd:1 +ttp: b550/782 bl:2.3730 bb:1.0617 rl:2.2779 rb:1.0530 dl:943-946 gd:1 +ttp: b544/782 bl:2.3515 bb:1.0716 rl:2.2788 rb:1.0533 dl:924-927 gd:1 +ttp: b536/782 bl:2.3271 bb:1.0479 rl:2.2794 rb:1.0532 dl:899-902 gd:1 +ttp: b529/782 bl:2.3194 bb:1.0189 rl:2.2799 rb:1.0528 dl:878-882 gd:1 +ttp: b521/782 bl:2.3641 bb:1.0715 rl:2.2808 rb:1.0530 dl:854-858 gd:1 +ttp: b514/782 bl:2.3143 bb:1.0684 rl:2.2812 rb:1.0532 dl:835-838 gd:1 +ttp: b507/782 bl:2.3034 bb:1.0313 rl:2.2814 rb:1.0529 dl:814-817 gd:1 +ttp: b500/782 bl:2.3316 bb:1.0671 rl:2.2820 rb:1.0531 dl:796-799 gd:1 +ttp: b492/782 bl:2.2853 bb:1.0379 rl:2.2820 rb:1.0529 dl:776-778 gd:1 +ttp: b484/782 bl:2.3775 bb:1.0535 rl:2.2829 rb:1.0529 dl:756-759 gd:1 +ttp: b476/782 bl:2.2828 bb:1.0346 rl:2.2829 rb:1.0528 dl:738-740 gd:1 +ttp: b468/782 bl:2.3727 bb:1.0680 rl:2.2837 rb:1.0529 dl:719-721 gd:1 +ttp: b460/782 bl:2.2593 bb:1.0570 rl:2.2835 rb:1.0529 dl:701-703 gd:1 +ttp: b452/782 bl:2.2690 bb:1.0155 rl:2.2834 rb:1.0526 dl:685-687 gd:1 +ttp: b444/782 bl:2.3188 bb:1.0683 rl:2.2837 rb:1.0527 dl:668-670 gd:1 +ttp: b436/782 bl:2.2825 bb:1.0543 rl:2.2836 rb:1.0527 dl:651-653 gd:1 +ttp: b428/782 bl:2.3156 bb:1.0552 rl:2.2839 rb:1.0528 dl:636-638 gd:1 +ttp: b420/782 bl:2.3646 bb:1.0555 rl:2.2845 rb:1.0528 dl:620-622 gd:1 +ttp: b413/782 bl:2.3841 bb:1.0685 rl:2.2852 rb:1.0529 dl:607-609 gd:1 +ttp: b405/782 bl:2.3659 bb:1.0617 rl:2.2857 rb:1.0530 dl:592-593 gd:1 +ttp: b397/782 bl:2.3672 bb:1.0498 rl:2.2863 rb:1.0529 dl:577-579 gd:1 +ttp: b386/782 bl:2.3485 bb:1.1029 rl:2.2867 rb:1.0533 dl:557-559 gd:1 +ttp: b378/782 bl:2.4368 bb:1.0574 rl:2.2876 rb:1.0533 dl:544-545 gd:1 +ttp: b370/782 bl:2.3755 bb:1.0875 rl:2.2882 rb:1.0535 dl:530-532 gd:1 +ttp: b362/782 bl:2.3669 bb:1.0818 rl:2.2886 rb:1.0537 dl:517-518 gd:1 +ttp: b354/782 bl:2.3141 bb:1.0706 rl:2.2888 rb:1.0538 dl:503-504 gd:1 +ttp: b346/782 bl:2.3872 bb:1.0778 rl:2.2893 rb:1.0539 dl:491-492 gd:1 +ttp: b338/782 bl:2.3670 bb:1.1025 rl:2.2897 rb:1.0541 dl:478-480 gd:1 +ttp: b332/782 bl:2.3179 bb:1.0492 rl:2.2899 rb:1.0541 dl:469-471 gd:1 +ttp: b324/782 bl:2.3253 bb:1.0872 rl:2.2900 rb:1.0543 dl:458-459 gd:1 +ttp: b316/782 bl:2.3731 bb:1.0826 rl:2.2904 rb:1.0544 dl:445-446 gd:1 +ttp: b308/782 bl:2.4142 bb:1.0950 rl:2.2910 rb:1.0546 dl:433-435 gd:1 +ttp: b298/782 bl:2.4261 bb:1.1048 rl:2.2916 rb:1.0548 dl:418-420 gd:1 +ttp: b291/782 bl:2.2726 bb:1.0161 rl:2.2916 rb:1.0547 dl:407-409 gd:1 +ttp: b283/782 bl:2.3770 bb:1.1303 rl:2.2919 rb:1.0550 dl:396-398 gd:1 +ttp: b275/782 bl:2.3623 bb:1.0644 rl:2.2922 rb:1.0550 dl:385-386 gd:1 +ttp: b267/782 bl:2.4234 bb:1.1453 rl:2.2927 rb:1.0554 dl:375-376 gd:1 +ttp: b260/782 bl:2.3787 bb:1.0837 rl:2.2931 rb:1.0555 dl:366-367 gd:1 +ttp: b253/782 bl:2.3403 bb:1.1116 rl:2.2932 rb:1.0557 dl:357-358 gd:1 
+ttp: b246/782 bl:2.3564 bb:1.1014 rl:2.2935 rb:1.0559 dl:349-350 gd:1 +ttp: b239/782 bl:2.3882 bb:1.1089 rl:2.2938 rb:1.0560 dl:340-341 gd:1 +ttp: b232/782 bl:2.3022 bb:1.0851 rl:2.2938 rb:1.0561 dl:331-333 gd:1 +ttp: b224/782 bl:2.3828 bb:1.0919 rl:2.2941 rb:1.0563 dl:322-323 gd:1 +ttp: b216/782 bl:2.4789 bb:1.1495 rl:2.2947 rb:1.0566 dl:313-314 gd:1 +ttp: b208/782 bl:2.3969 bb:1.1346 rl:2.2951 rb:1.0568 dl:304-305 gd:1 +ttp: b200/782 bl:2.3790 bb:1.0999 rl:2.2953 rb:1.0569 dl:296-297 gd:1 +ttp: b192/782 bl:2.3827 bb:1.1572 rl:2.2956 rb:1.0572 dl:286-288 gd:1 +ttp: b184/782 bl:2.4025 bb:1.1326 rl:2.2959 rb:1.0574 dl:278-279 gd:1 +ttp: b176/782 bl:2.3249 bb:1.1292 rl:2.2960 rb:1.0576 dl:270-271 gd:1 +ttp: b169/782 bl:2.3849 bb:1.1209 rl:2.2962 rb:1.0578 dl:263-264 gd:1 +ttp: b161/782 bl:2.3580 bb:1.1350 rl:2.2964 rb:1.0580 dl:256-256 gd:1 +ttp: b152/782 bl:2.4027 bb:1.1508 rl:2.2966 rb:1.0582 dl:247-248 gd:1 +ttp: b144/782 bl:2.3698 bb:1.1139 rl:2.2968 rb:1.0583 dl:239-240 gd:1 +ttp: b139/782 bl:2.4444 bb:1.1387 rl:2.2972 rb:1.0585 dl:234-235 gd:1 +ttp: b131/782 bl:2.4116 bb:1.1644 rl:2.2974 rb:1.0587 dl:227-228 gd:1 +ttp: b123/782 bl:2.3972 bb:1.1656 rl:2.2976 rb:1.0590 dl:219-220 gd:1 +ttp: b115/782 bl:2.4768 bb:1.1722 rl:2.2980 rb:1.0592 dl:212-213 gd:1 +ttp: b107/782 bl:2.4464 bb:1.1716 rl:2.2983 rb:1.0594 dl:205-206 gd:1 +ttp: b99/782 bl:2.5029 bb:1.1788 rl:2.2987 rb:1.0597 dl:198-199 gd:1 +ttp: b91/782 bl:2.4689 bb:1.1572 rl:2.2990 rb:1.0598 dl:190-191 gd:1 +ttp: b84/782 bl:2.5294 bb:1.2027 rl:2.2995 rb:1.0601 dl:184-185 gd:1 +ttp: b76/782 bl:2.5041 bb:1.1761 rl:2.2998 rb:1.0603 dl:177-178 gd:1 +ttp: b68/782 bl:2.5176 bb:1.1750 rl:2.3002 rb:1.0605 dl:170-171 gd:1 +ttp: b60/782 bl:2.4795 bb:1.1918 rl:2.3005 rb:1.0607 dl:163-164 gd:1 +ttp: b52/782 bl:2.6833 bb:1.2525 rl:2.3011 rb:1.0610 dl:155-156 gd:1 +ttp: b44/782 bl:2.5690 bb:1.1988 rl:2.3015 rb:1.0612 dl:147-148 gd:1 +ttp: b35/782 bl:2.6376 bb:1.2795 rl:2.3019 rb:1.0615 dl:138-139 gd:1 +ttp: b27/782 bl:2.5974 bb:1.2279 rl:2.3023 rb:1.0617 dl:130-131 gd:1 +ttp: b20/782 bl:2.6006 bb:1.2453 rl:2.3027 rb:1.0619 dl:122-123 gd:1 +ttp: b12/782 bl:2.5778 bb:1.1923 rl:2.3030 rb:1.0620 dl:110-112 gd:1 +ttp: b4/782 bl:2.7561 bb:1.2349 rl:2.3034 rb:1.0622 dl:93-96 gd:1 +quantized_ttt_phased val_loss:2.32879846 val_bpb:1.06417002 eval_time:470601ms +total_eval_time:470.6s diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed1234.log b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed1234.log new file mode 100644 index 0000000000..8ab5a0963c --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed1234.log @@ -0,0 +1,843 @@ +Running: env | egrep '^(RUN_ID|SEED|CASEOPS_ENABLED|TRAIN_LOOP_(PHASE_DEPTHS|PREWARM_DEPTHS)|EVAL_LOOP_DEPTH|DATA_PATH|TOKENIZER_PATH|EMBED_BITS|GATED_ATTN_)' | sort +CASEOPS_ENABLED=1 +DATA_PATH=/workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved +EMBED_BITS=7 +EVAL_LOOP_DEPTH=4 +GATED_ATTN_ENABLED=1 +GATED_ATTN_INIT_STD=0.005 +GATED_ATTN_QUANT_GATE=1 +RUN_ID=pr1736_eq134_eval4_seed1234 +SEED=1234 +TOKENIZER_PATH=/workspace/parameter-golf-pr1736/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model +TRAIN_LOOP_PHASE_DEPTHS=1,3,4 +TRAIN_LOOP_PREWARM_DEPTHS=3,4 +W0420 21:43:17.323000 340378 torch/distributed/run.py:803] +W0420 21:43:17.323000 340378 
torch/distributed/run.py:803] ***************************************** +W0420 21:43:17.323000 340378 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0420 21:43:17.323000 340378 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + artifact_dir: /workspace/fullruns/pr1736_eq134_eval4_seed1234/artifact + attn_clip_sigmas: 13.0 + attn_out_gate_enabled: False + attn_out_gate_src: proj + beta1: 0.9 + beta2: 0.95 + caseops_enabled: True + compressor: brotli + data_dir: ./data/ + datasets_dir: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 15.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_loop_depth: 4 + eval_seq_len: 2048 + eval_stride: 64 + gate_window: 12 + gated_attn_enabled: True + gated_attn_init_std: 0.005 + gated_attn_quant_gate: True + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 4.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: /workspace/fullruns/pr1736_eq134_eval4_seed1234/artifact/pr1736_eq134_eval4_seed1234.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_clip_sigmas: 12.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: /workspace/fullruns/pr1736_eq134_eval4_seed1234/artifact/final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2000 + qk_gain_init: 5.0 + quantized_model_path: /workspace/fullruns/pr1736_eq134_eval4_seed1234/artifact/final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: pr1736_eq134_eval4_seed1234 + scalar_lr: 0.02 + seed: 1234 + skip_gates_enabled: True + smear_gate_enabled: False + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/parameter-golf-pr1736/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + train_batch_tokens: 786432 + train_files: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin + train_log_every: 500 + train_loop_depth_dist: fixed + train_loop_depth_set: [] + train_loop_max_depth: 3 + train_loop_min_depth: 3 + train_loop_phase_depths: [1, 3, 4] + train_loop_phase_fractions: [] + train_loop_prewarm_depths: [3, 4] + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.999 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 96 + ttt_mlp_lora: 
True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_bytes_files: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin + val_doc_fraction: 1.0 + val_files: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_[0-9][0-9][0-9][0-9][0-9][0-9].bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.75 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 80 +val_tokens: 47851520 +model_params:35989658 +gptq:reserving 4s, effective=596000ms +loop_depth_schedule: train=[1, 3, 4] dist=phased phase_fracs=None prewarm=[3, 4] eval_depth=4 +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +0/20000 val_loss: 9.0189 val_bpb: 4.1210 +1/20000 train_loss: 9.0208 train_time: 0.0m tok/s: 12763062 +2/20000 train_loss: 12.9441 train_time: 0.0m tok/s: 11455627 +3/20000 train_loss: 10.1532 train_time: 0.0m tok/s: 10191367 +4/20000 train_loss: 8.6237 train_time: 0.0m tok/s: 9670458 +5/20000 train_loss: 7.8118 train_time: 0.0m tok/s: 9366475 +500/20000 train_loss: 2.5819 train_time: 0.8m tok/s: 8039055 +1000/20000 train_loss: 2.8128 train_time: 1.6m tok/s: 8004817 +1500/20000 train_loss: 2.6396 train_time: 2.5m tok/s: 7997081 +2000/20000 train_loss: 2.6733 train_time: 3.3m tok/s: 7992830 +layer_loop:phase step:2020 frac:0.333 depth:3 phases:[1, 3, 4] eval_depth:4 +2500/20000 train_loss: 2.5505 train_time: 4.5m tok/s: 7347426 +3000/20000 train_loss: 2.5610 train_time: 5.7m tok/s: 6953961 +layer_loop:phase step:3406 frac:0.667 depth:4 phases:[1, 3, 4] eval_depth:4 +3500/20000 train_loss: 2.5635 train_time: 6.9m tok/s: 6664567 +4000/20000 train_loss: 2.3935 train_time: 8.3m tok/s: 6343819 +4000/20000 val_loss: 2.4148 val_bpb: 1.1034 +4500/20000 train_loss: 2.2511 train_time: 9.6m tok/s: 6115429 +4604/20000 val_loss: 2.3406 val_bpb: 1.0695 +stopping_early: wallclock_cap train_time: 596139ms step: 4604/20000 +peak memory allocated: 46573 MiB reserved: 50344 MiB +ema:applying EMA weights +diagnostic pre-quantization post-ema val_loss:2.34025616 val_bpb:1.06933630 eval_time:8090ms +Serialized model: 135592891 bytes +Code size (uncompressed): 141428 bytes +Code size (compressed): 35795 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 67 Hessians in 4.0s +Quantized weights: + gate_int8_row: blocks.attn.attn_gate_w + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights +Serialized model quantized+brotli: 15947119 bytes +Total submission size quantized+brotli: 15982914 bytes +diagnostic quantized val_loss:2.35999702 val_bpb:1.07835652 eval_time:13063ms +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (117.4s) + +beginning TTT eval timer +ttt_phased: total_docs:50000 prefix_docs:2000 suffix_docs:48000 num_phases:3 boundaries:[666, 1333, 2000] +ttp: b777/782 bl:2.3233 bb:1.0888 rl:2.3233 rb:1.0888 dl:8452-9229 gd:0 +ttp: b772/782 bl:2.3372 bb:1.1015 rl:2.3289 rb:1.0939 dl:5762-6095 gd:0 +ttp: b767/782 bl:2.2789 bb:1.0783 rl:2.3167 rb:1.0901 dl:4681-4858 gd:0 +ttpp: phase:1/3 pd:1104 gd:666 t:199.3s +tttg: c1/111 lr:0.001000 t:0.3s +tttg: c2/111 lr:0.001000 t:0.4s +tttg: c3/111 lr:0.000999 t:0.5s +tttg: c4/111 lr:0.000998 t:0.6s +tttg: c5/111 lr:0.000997 t:0.7s +tttg: c6/111 lr:0.000995 t:0.8s +tttg: c7/111 lr:0.000993 t:0.9s +tttg: c8/111 lr:0.000990 t:1.0s +tttg: c9/111 lr:0.000987 t:1.0s +tttg: c10/111 lr:0.000984 t:1.1s +tttg: c11/111 lr:0.000980 t:1.2s +tttg: c12/111 lr:0.000976 t:1.3s +tttg: c13/111 lr:0.000971 t:1.4s +tttg: c14/111 lr:0.000966 t:1.5s +tttg: c15/111 lr:0.000961 t:1.5s +tttg: c16/111 lr:0.000955 t:1.6s +tttg: c17/111 lr:0.000949 t:1.7s +tttg: c18/111 lr:0.000942 t:1.8s +tttg: c19/111 lr:0.000935 t:1.9s +tttg: c20/111 lr:0.000928 t:2.0s +tttg: c21/111 lr:0.000921 t:2.1s +tttg: c22/111 lr:0.000913 t:2.1s +tttg: c23/111 lr:0.000905 t:2.2s +tttg: c24/111 lr:0.000896 t:2.3s +tttg: c25/111 lr:0.000887 t:2.4s +tttg: c26/111 lr:0.000878 t:2.5s +tttg: c27/111 lr:0.000868 t:2.6s +tttg: c28/111 lr:0.000859 t:2.6s +tttg: c29/111 lr:0.000848 t:2.7s +tttg: c30/111 lr:0.000838 t:2.8s +tttg: c31/111 lr:0.000827 t:2.9s +tttg: c32/111 lr:0.000817 t:3.0s +tttg: c33/111 lr:0.000805 t:3.1s +tttg: c34/111 lr:0.000794 t:3.1s +tttg: c35/111 lr:0.000782 t:3.2s +tttg: c36/111 lr:0.000770 t:3.3s +tttg: c37/111 lr:0.000758 t:3.4s +tttg: c38/111 lr:0.000746 t:3.5s +tttg: c39/111 lr:0.000733 t:3.6s +tttg: c40/111 lr:0.000721 t:3.7s +tttg: c41/111 lr:0.000708 t:3.7s +tttg: c42/111 lr:0.000695 t:3.8s +tttg: c43/111 lr:0.000681 t:3.9s +tttg: c44/111 lr:0.000668 t:4.0s +tttg: c45/111 lr:0.000655 t:4.1s +tttg: c46/111 lr:0.000641 t:4.2s +tttg: c47/111 lr:0.000627 t:4.2s +tttg: c48/111 lr:0.000613 t:4.3s +tttg: c49/111 lr:0.000599 t:4.4s +tttg: c50/111 lr:0.000585 t:4.5s +tttg: c51/111 lr:0.000571 t:4.6s +tttg: c52/111 lr:0.000557 t:4.7s +tttg: c53/111 lr:0.000543 t:4.7s +tttg: c54/111 lr:0.000529 t:4.8s +tttg: c55/111 lr:0.000514 t:4.9s +tttg: c56/111 lr:0.000500 t:5.0s +tttg: c57/111 lr:0.000486 t:5.1s +tttg: c58/111 lr:0.000471 t:5.2s +tttg: c59/111 lr:0.000457 t:5.3s +tttg: c60/111 lr:0.000443 t:5.3s +tttg: c61/111 lr:0.000429 t:5.4s +tttg: c62/111 lr:0.000415 t:5.5s +tttg: c63/111 lr:0.000401 t:5.6s +tttg: c64/111 lr:0.000387 t:5.7s +tttg: c65/111 lr:0.000373 t:5.8s +tttg: c66/111 lr:0.000359 t:5.9s +tttg: c67/111 lr:0.000345 t:5.9s +tttg: c68/111 lr:0.000332 t:6.0s +tttg: c69/111 lr:0.000319 t:6.1s +tttg: c70/111 lr:0.000305 t:6.2s +tttg: c71/111 lr:0.000292 t:6.3s +tttg: c72/111 
lr:0.000279 t:6.4s +tttg: c73/111 lr:0.000267 t:6.4s +tttg: c74/111 lr:0.000254 t:6.5s +tttg: c75/111 lr:0.000242 t:6.6s +tttg: c76/111 lr:0.000230 t:6.7s +tttg: c77/111 lr:0.000218 t:6.8s +tttg: c78/111 lr:0.000206 t:6.9s +tttg: c79/111 lr:0.000195 t:7.0s +tttg: c80/111 lr:0.000183 t:7.1s +tttg: c81/111 lr:0.000173 t:7.1s +tttg: c82/111 lr:0.000162 t:7.2s +tttg: c83/111 lr:0.000152 t:7.3s +tttg: c84/111 lr:0.000141 t:7.4s +tttg: c85/111 lr:0.000132 t:7.5s +tttg: c86/111 lr:0.000122 t:7.6s +tttg: c87/111 lr:0.000113 t:7.6s +tttg: c88/111 lr:0.000104 t:7.7s +tttg: c89/111 lr:0.000095 t:7.8s +tttg: c90/111 lr:0.000087 t:7.9s +tttg: c91/111 lr:0.000079 t:8.0s +tttg: c92/111 lr:0.000072 t:8.1s +tttg: c93/111 lr:0.000065 t:8.2s +tttg: c94/111 lr:0.000058 t:8.3s +tttg: c95/111 lr:0.000051 t:8.3s +tttg: c96/111 lr:0.000045 t:8.4s +tttg: c97/111 lr:0.000039 t:8.5s +tttg: c98/111 lr:0.000034 t:8.6s +tttg: c99/111 lr:0.000029 t:8.7s +tttg: c100/111 lr:0.000024 t:8.8s +tttg: c101/111 lr:0.000020 t:8.9s +tttg: c102/111 lr:0.000016 t:8.9s +tttg: c103/111 lr:0.000013 t:9.0s +tttg: c104/111 lr:0.000010 t:9.1s +tttg: c105/111 lr:0.000007 t:9.2s +tttg: c106/111 lr:0.000005 t:9.3s +tttg: c107/111 lr:0.000003 t:9.4s +tttg: c108/111 lr:0.000002 t:9.4s +tttg: c109/111 lr:0.000001 t:9.5s +tttg: c110/111 lr:0.000000 t:9.6s +ttpr: phase:1/3 t:211.4s +ttp: b761/782 bl:2.4240 bb:1.1176 rl:2.3349 rb:1.0948 dl:3916-4032 gd:0 +ttpp: phase:2/3 pd:1808 gd:1333 t:290.5s +tttg: c1/185 lr:0.001000 t:0.1s +tttg: c2/185 lr:0.001000 t:0.2s +tttg: c3/185 lr:0.001000 t:0.3s +tttg: c4/185 lr:0.000999 t:0.4s +tttg: c5/185 lr:0.000999 t:0.4s +tttg: c6/185 lr:0.000998 t:0.5s +tttg: c7/185 lr:0.000997 t:0.6s +tttg: c8/185 lr:0.000996 t:0.7s +tttg: c9/185 lr:0.000995 t:0.8s +tttg: c10/185 lr:0.000994 t:0.9s +tttg: c11/185 lr:0.000993 t:1.0s +tttg: c12/185 lr:0.000991 t:1.1s +tttg: c13/185 lr:0.000990 t:1.1s +tttg: c14/185 lr:0.000988 t:1.2s +tttg: c15/185 lr:0.000986 t:1.3s +tttg: c16/185 lr:0.000984 t:1.4s +tttg: c17/185 lr:0.000981 t:1.5s +tttg: c18/185 lr:0.000979 t:1.6s +tttg: c19/185 lr:0.000977 t:1.7s +tttg: c20/185 lr:0.000974 t:1.7s +tttg: c21/185 lr:0.000971 t:1.8s +tttg: c22/185 lr:0.000968 t:1.9s +tttg: c23/185 lr:0.000965 t:2.0s +tttg: c24/185 lr:0.000962 t:2.1s +tttg: c25/185 lr:0.000959 t:2.2s +tttg: c26/185 lr:0.000955 t:2.3s +tttg: c27/185 lr:0.000952 t:2.3s +tttg: c28/185 lr:0.000948 t:2.4s +tttg: c29/185 lr:0.000944 t:2.5s +tttg: c30/185 lr:0.000940 t:2.6s +tttg: c31/185 lr:0.000936 t:2.7s +tttg: c32/185 lr:0.000932 t:2.8s +tttg: c33/185 lr:0.000927 t:2.9s +tttg: c34/185 lr:0.000923 t:2.9s +tttg: c35/185 lr:0.000918 t:3.0s +tttg: c36/185 lr:0.000913 t:3.1s +tttg: c37/185 lr:0.000908 t:3.2s +tttg: c38/185 lr:0.000904 t:3.3s +tttg: c39/185 lr:0.000898 t:3.4s +tttg: c40/185 lr:0.000893 t:3.5s +tttg: c41/185 lr:0.000888 t:3.5s +tttg: c42/185 lr:0.000882 t:3.6s +tttg: c43/185 lr:0.000877 t:3.7s +tttg: c44/185 lr:0.000871 t:3.8s +tttg: c45/185 lr:0.000865 t:3.9s +tttg: c46/185 lr:0.000860 t:4.0s +tttg: c47/185 lr:0.000854 t:4.1s +tttg: c48/185 lr:0.000847 t:4.2s +tttg: c49/185 lr:0.000841 t:4.3s +tttg: c50/185 lr:0.000835 t:4.4s +tttg: c51/185 lr:0.000829 t:4.4s +tttg: c52/185 lr:0.000822 t:4.5s +tttg: c53/185 lr:0.000816 t:4.6s +tttg: c54/185 lr:0.000809 t:4.7s +tttg: c55/185 lr:0.000802 t:4.8s +tttg: c56/185 lr:0.000795 t:4.9s +tttg: c57/185 lr:0.000788 t:5.0s +tttg: c58/185 lr:0.000781 t:5.0s +tttg: c59/185 lr:0.000774 t:5.1s +tttg: c60/185 lr:0.000767 t:5.2s +tttg: c61/185 lr:0.000760 t:5.3s +tttg: c62/185 lr:0.000752 
t:5.4s +tttg: c63/185 lr:0.000745 t:5.5s +tttg: c64/185 lr:0.000738 t:5.6s +tttg: c65/185 lr:0.000730 t:5.6s +tttg: c66/185 lr:0.000722 t:5.7s +tttg: c67/185 lr:0.000715 t:5.8s +tttg: c68/185 lr:0.000707 t:5.9s +tttg: c69/185 lr:0.000699 t:6.0s +tttg: c70/185 lr:0.000691 t:6.1s +tttg: c71/185 lr:0.000683 t:6.2s +tttg: c72/185 lr:0.000675 t:6.2s +tttg: c73/185 lr:0.000667 t:6.3s +tttg: c74/185 lr:0.000659 t:6.4s +tttg: c75/185 lr:0.000651 t:6.5s +tttg: c76/185 lr:0.000643 t:6.6s +tttg: c77/185 lr:0.000635 t:6.7s +tttg: c78/185 lr:0.000627 t:6.8s +tttg: c79/185 lr:0.000618 t:6.9s +tttg: c80/185 lr:0.000610 t:6.9s +tttg: c81/185 lr:0.000602 t:7.0s +tttg: c82/185 lr:0.000593 t:7.1s +tttg: c83/185 lr:0.000585 t:7.2s +tttg: c84/185 lr:0.000577 t:7.3s +tttg: c85/185 lr:0.000568 t:7.4s +tttg: c86/185 lr:0.000560 t:7.4s +tttg: c87/185 lr:0.000551 t:7.5s +tttg: c88/185 lr:0.000543 t:7.6s +tttg: c89/185 lr:0.000534 t:7.7s +tttg: c90/185 lr:0.000526 t:7.8s +tttg: c91/185 lr:0.000517 t:7.9s +tttg: c92/185 lr:0.000509 t:8.0s +tttg: c93/185 lr:0.000500 t:8.1s +tttg: c94/185 lr:0.000491 t:8.1s +tttg: c95/185 lr:0.000483 t:8.2s +tttg: c96/185 lr:0.000474 t:8.3s +tttg: c97/185 lr:0.000466 t:8.4s +tttg: c98/185 lr:0.000457 t:8.5s +tttg: c99/185 lr:0.000449 t:8.6s +tttg: c100/185 lr:0.000440 t:8.7s +tttg: c101/185 lr:0.000432 t:8.7s +tttg: c102/185 lr:0.000423 t:8.8s +tttg: c103/185 lr:0.000415 t:8.9s +tttg: c104/185 lr:0.000407 t:9.0s +tttg: c105/185 lr:0.000398 t:9.1s +tttg: c106/185 lr:0.000390 t:9.2s +tttg: c107/185 lr:0.000382 t:9.3s +tttg: c108/185 lr:0.000373 t:9.3s +tttg: c109/185 lr:0.000365 t:9.4s +tttg: c110/185 lr:0.000357 t:9.5s +tttg: c111/185 lr:0.000349 t:9.6s +tttg: c112/185 lr:0.000341 t:9.7s +tttg: c113/185 lr:0.000333 t:9.8s +tttg: c114/185 lr:0.000325 t:9.9s +tttg: c115/185 lr:0.000317 t:10.0s +tttg: c116/185 lr:0.000309 t:10.0s +tttg: c117/185 lr:0.000301 t:10.1s +tttg: c118/185 lr:0.000293 t:10.2s +tttg: c119/185 lr:0.000285 t:10.3s +tttg: c120/185 lr:0.000278 t:10.4s +tttg: c121/185 lr:0.000270 t:10.5s +tttg: c122/185 lr:0.000262 t:10.5s +tttg: c123/185 lr:0.000255 t:10.6s +tttg: c124/185 lr:0.000248 t:10.7s +tttg: c125/185 lr:0.000240 t:10.8s +tttg: c126/185 lr:0.000233 t:10.9s +tttg: c127/185 lr:0.000226 t:11.0s +tttg: c128/185 lr:0.000219 t:11.1s +tttg: c129/185 lr:0.000212 t:11.1s +tttg: c130/185 lr:0.000205 t:11.2s +tttg: c131/185 lr:0.000198 t:11.3s +tttg: c132/185 lr:0.000191 t:11.4s +tttg: c133/185 lr:0.000184 t:11.5s +tttg: c134/185 lr:0.000178 t:11.6s +tttg: c135/185 lr:0.000171 t:11.7s +tttg: c136/185 lr:0.000165 t:11.7s +tttg: c137/185 lr:0.000159 t:11.8s +tttg: c138/185 lr:0.000153 t:11.9s +tttg: c139/185 lr:0.000146 t:12.0s +tttg: c140/185 lr:0.000140 t:12.1s +tttg: c141/185 lr:0.000135 t:12.2s +tttg: c142/185 lr:0.000129 t:12.3s +tttg: c143/185 lr:0.000123 t:12.3s +tttg: c144/185 lr:0.000118 t:12.4s +tttg: c145/185 lr:0.000112 t:12.5s +tttg: c146/185 lr:0.000107 t:12.6s +tttg: c147/185 lr:0.000102 t:12.7s +tttg: c148/185 lr:0.000096 t:12.8s +tttg: c149/185 lr:0.000092 t:12.9s +tttg: c150/185 lr:0.000087 t:12.9s +tttg: c151/185 lr:0.000082 t:13.0s +tttg: c152/185 lr:0.000077 t:13.1s +tttg: c153/185 lr:0.000073 t:13.2s +tttg: c154/185 lr:0.000068 t:13.3s +tttg: c155/185 lr:0.000064 t:13.4s +tttg: c156/185 lr:0.000060 t:13.4s +tttg: c157/185 lr:0.000056 t:13.5s +tttg: c158/185 lr:0.000052 t:13.6s +tttg: c159/185 lr:0.000048 t:13.7s +tttg: c160/185 lr:0.000045 t:13.8s +tttg: c161/185 lr:0.000041 t:13.9s +tttg: c162/185 lr:0.000038 t:14.0s +tttg: c163/185 lr:0.000035 t:14.0s 
+tttg: c164/185 lr:0.000032 t:14.1s +tttg: c165/185 lr:0.000029 t:14.2s +tttg: c166/185 lr:0.000026 t:14.3s +tttg: c167/185 lr:0.000023 t:14.4s +tttg: c168/185 lr:0.000021 t:14.5s +tttg: c169/185 lr:0.000019 t:14.5s +tttg: c170/185 lr:0.000016 t:14.6s +tttg: c171/185 lr:0.000014 t:14.7s +tttg: c172/185 lr:0.000012 t:14.8s +tttg: c173/185 lr:0.000010 t:14.9s +tttg: c174/185 lr:0.000009 t:15.0s +tttg: c175/185 lr:0.000007 t:15.1s +tttg: c176/185 lr:0.000006 t:15.2s +tttg: c177/185 lr:0.000005 t:15.3s +tttg: c178/185 lr:0.000004 t:15.3s +tttg: c179/185 lr:0.000003 t:15.4s +tttg: c180/185 lr:0.000002 t:15.5s +tttg: c181/185 lr:0.000001 t:15.6s +tttg: c182/185 lr:0.000001 t:15.7s +tttg: c183/185 lr:0.000000 t:15.8s +tttg: c184/185 lr:0.000000 t:15.8s +ttpr: phase:2/3 t:308.9s +ttp: b753/782 bl:2.2278 bb:1.0057 rl:2.3216 rb:1.0834 dl:3284-3344 gd:0 +ttpp: phase:3/3 pd:2448 gd:2000 t:328.6s +tttg: c1/250 lr:0.001000 t:0.1s +tttg: c2/250 lr:0.001000 t:0.2s +tttg: c3/250 lr:0.001000 t:0.3s +tttg: c4/250 lr:0.001000 t:0.3s +tttg: c5/250 lr:0.000999 t:0.4s +tttg: c6/250 lr:0.000999 t:0.5s +tttg: c7/250 lr:0.000999 t:0.6s +tttg: c8/250 lr:0.000998 t:0.7s +tttg: c9/250 lr:0.000997 t:0.8s +tttg: c10/250 lr:0.000997 t:0.8s +tttg: c11/250 lr:0.000996 t:0.9s +tttg: c12/250 lr:0.000995 t:1.0s +tttg: c13/250 lr:0.000994 t:1.1s +tttg: c14/250 lr:0.000993 t:1.2s +tttg: c15/250 lr:0.000992 t:1.3s +tttg: c16/250 lr:0.000991 t:1.4s +tttg: c17/250 lr:0.000990 t:1.4s +tttg: c18/250 lr:0.000989 t:1.5s +tttg: c19/250 lr:0.000987 t:1.6s +tttg: c20/250 lr:0.000986 t:1.7s +tttg: c21/250 lr:0.000984 t:1.8s +tttg: c22/250 lr:0.000983 t:1.9s +tttg: c23/250 lr:0.000981 t:1.9s +tttg: c24/250 lr:0.000979 t:2.0s +tttg: c25/250 lr:0.000977 t:2.1s +tttg: c26/250 lr:0.000975 t:2.2s +tttg: c27/250 lr:0.000973 t:2.3s +tttg: c28/250 lr:0.000971 t:2.4s +tttg: c29/250 lr:0.000969 t:2.5s +tttg: c30/250 lr:0.000967 t:2.5s +tttg: c31/250 lr:0.000965 t:2.6s +tttg: c32/250 lr:0.000962 t:2.7s +tttg: c33/250 lr:0.000960 t:2.8s +tttg: c34/250 lr:0.000957 t:2.9s +tttg: c35/250 lr:0.000955 t:3.0s +tttg: c36/250 lr:0.000952 t:3.0s +tttg: c37/250 lr:0.000949 t:3.1s +tttg: c38/250 lr:0.000947 t:3.2s +tttg: c39/250 lr:0.000944 t:3.3s +tttg: c40/250 lr:0.000941 t:3.4s +tttg: c41/250 lr:0.000938 t:3.5s +tttg: c42/250 lr:0.000935 t:3.6s +tttg: c43/250 lr:0.000931 t:3.6s +tttg: c44/250 lr:0.000928 t:3.7s +tttg: c45/250 lr:0.000925 t:3.8s +tttg: c46/250 lr:0.000922 t:3.9s +tttg: c47/250 lr:0.000918 t:4.0s +tttg: c48/250 lr:0.000915 t:4.1s +tttg: c49/250 lr:0.000911 t:4.2s +tttg: c50/250 lr:0.000907 t:4.2s +tttg: c51/250 lr:0.000904 t:4.3s +tttg: c52/250 lr:0.000900 t:4.4s +tttg: c53/250 lr:0.000896 t:4.5s +tttg: c54/250 lr:0.000892 t:4.6s +tttg: c55/250 lr:0.000888 t:4.7s +tttg: c56/250 lr:0.000884 t:4.7s +tttg: c57/250 lr:0.000880 t:4.8s +tttg: c58/250 lr:0.000876 t:4.9s +tttg: c59/250 lr:0.000872 t:5.0s +tttg: c60/250 lr:0.000868 t:5.1s +tttg: c61/250 lr:0.000863 t:5.2s +tttg: c62/250 lr:0.000859 t:5.3s +tttg: c63/250 lr:0.000855 t:5.3s +tttg: c64/250 lr:0.000850 t:5.4s +tttg: c65/250 lr:0.000846 t:5.5s +tttg: c66/250 lr:0.000841 t:5.6s +tttg: c67/250 lr:0.000836 t:5.7s +tttg: c68/250 lr:0.000832 t:5.8s +tttg: c69/250 lr:0.000827 t:5.9s +tttg: c70/250 lr:0.000822 t:5.9s +tttg: c71/250 lr:0.000817 t:6.0s +tttg: c72/250 lr:0.000812 t:6.1s +tttg: c73/250 lr:0.000807 t:6.2s +tttg: c74/250 lr:0.000803 t:6.3s +tttg: c75/250 lr:0.000797 t:6.4s +tttg: c76/250 lr:0.000792 t:6.5s +tttg: c77/250 lr:0.000787 t:6.5s +tttg: c78/250 lr:0.000782 t:6.6s +tttg: c79/250 
lr:0.000777 t:6.7s +tttg: c80/250 lr:0.000772 t:6.8s +tttg: c81/250 lr:0.000766 t:6.9s +tttg: c82/250 lr:0.000761 t:7.0s +tttg: c83/250 lr:0.000755 t:7.1s +tttg: c84/250 lr:0.000750 t:7.1s +tttg: c85/250 lr:0.000745 t:7.2s +tttg: c86/250 lr:0.000739 t:7.3s +tttg: c87/250 lr:0.000733 t:7.4s +tttg: c88/250 lr:0.000728 t:7.5s +tttg: c89/250 lr:0.000722 t:7.6s +tttg: c90/250 lr:0.000717 t:7.6s +tttg: c91/250 lr:0.000711 t:7.7s +tttg: c92/250 lr:0.000705 t:7.8s +tttg: c93/250 lr:0.000699 t:7.9s +tttg: c94/250 lr:0.000694 t:8.0s +tttg: c95/250 lr:0.000688 t:8.1s +tttg: c96/250 lr:0.000682 t:8.2s +tttg: c97/250 lr:0.000676 t:8.2s +tttg: c98/250 lr:0.000670 t:8.3s +tttg: c99/250 lr:0.000664 t:8.4s +tttg: c100/250 lr:0.000658 t:8.5s +tttg: c101/250 lr:0.000652 t:8.6s +tttg: c102/250 lr:0.000646 t:8.7s +tttg: c103/250 lr:0.000640 t:8.8s +tttg: c104/250 lr:0.000634 t:8.8s +tttg: c105/250 lr:0.000628 t:8.9s +tttg: c106/250 lr:0.000622 t:9.0s +tttg: c107/250 lr:0.000616 t:9.1s +tttg: c108/250 lr:0.000610 t:9.2s +tttg: c109/250 lr:0.000603 t:9.3s +tttg: c110/250 lr:0.000597 t:9.4s +tttg: c111/250 lr:0.000591 t:9.4s +tttg: c112/250 lr:0.000585 t:9.5s +tttg: c113/250 lr:0.000579 t:9.6s +tttg: c114/250 lr:0.000572 t:9.7s +tttg: c115/250 lr:0.000566 t:9.8s +tttg: c116/250 lr:0.000560 t:9.9s +tttg: c117/250 lr:0.000554 t:10.0s +tttg: c118/250 lr:0.000547 t:10.0s +tttg: c119/250 lr:0.000541 t:10.1s +tttg: c120/250 lr:0.000535 t:10.2s +tttg: c121/250 lr:0.000528 t:10.3s +tttg: c122/250 lr:0.000522 t:10.4s +tttg: c123/250 lr:0.000516 t:10.5s +tttg: c124/250 lr:0.000509 t:10.6s +tttg: c125/250 lr:0.000503 t:10.7s +tttg: c126/250 lr:0.000497 t:10.7s +tttg: c127/250 lr:0.000491 t:10.8s +tttg: c128/250 lr:0.000484 t:10.9s +tttg: c129/250 lr:0.000478 t:11.0s +tttg: c130/250 lr:0.000472 t:11.1s +tttg: c131/250 lr:0.000465 t:11.2s +tttg: c132/250 lr:0.000459 t:11.3s +tttg: c133/250 lr:0.000453 t:11.3s +tttg: c134/250 lr:0.000446 t:11.4s +tttg: c135/250 lr:0.000440 t:11.5s +tttg: c136/250 lr:0.000434 t:11.6s +tttg: c137/250 lr:0.000428 t:11.7s +tttg: c138/250 lr:0.000421 t:11.8s +tttg: c139/250 lr:0.000415 t:11.8s +tttg: c140/250 lr:0.000409 t:11.9s +tttg: c141/250 lr:0.000403 t:12.0s +tttg: c142/250 lr:0.000397 t:12.1s +tttg: c143/250 lr:0.000390 t:12.2s +tttg: c144/250 lr:0.000384 t:12.3s +tttg: c145/250 lr:0.000378 t:12.4s +tttg: c146/250 lr:0.000372 t:12.5s +tttg: c147/250 lr:0.000366 t:12.5s +tttg: c148/250 lr:0.000360 t:12.6s +tttg: c149/250 lr:0.000354 t:12.7s +tttg: c150/250 lr:0.000348 t:12.8s +tttg: c151/250 lr:0.000342 t:12.9s +tttg: c152/250 lr:0.000336 t:13.0s +tttg: c153/250 lr:0.000330 t:13.1s +tttg: c154/250 lr:0.000324 t:13.1s +tttg: c155/250 lr:0.000318 t:13.2s +tttg: c156/250 lr:0.000312 t:13.3s +tttg: c157/250 lr:0.000306 t:13.4s +tttg: c158/250 lr:0.000301 t:13.5s +tttg: c159/250 lr:0.000295 t:13.6s +tttg: c160/250 lr:0.000289 t:13.7s +tttg: c161/250 lr:0.000283 t:13.7s +tttg: c162/250 lr:0.000278 t:13.8s +tttg: c163/250 lr:0.000272 t:13.9s +tttg: c164/250 lr:0.000267 t:14.0s +tttg: c165/250 lr:0.000261 t:14.1s +tttg: c166/250 lr:0.000255 t:14.2s +tttg: c167/250 lr:0.000250 t:14.3s +tttg: c168/250 lr:0.000245 t:14.3s +tttg: c169/250 lr:0.000239 t:14.4s +tttg: c170/250 lr:0.000234 t:14.5s +tttg: c171/250 lr:0.000228 t:14.6s +tttg: c172/250 lr:0.000223 t:14.7s +tttg: c173/250 lr:0.000218 t:14.8s +tttg: c174/250 lr:0.000213 t:14.9s +tttg: c175/250 lr:0.000208 t:14.9s +tttg: c176/250 lr:0.000203 t:15.0s +tttg: c177/250 lr:0.000197 t:15.1s +tttg: c178/250 lr:0.000193 t:15.2s +tttg: c179/250 lr:0.000188 
t:15.3s +tttg: c180/250 lr:0.000183 t:15.4s +tttg: c181/250 lr:0.000178 t:15.5s +tttg: c182/250 lr:0.000173 t:15.5s +tttg: c183/250 lr:0.000168 t:15.6s +tttg: c184/250 lr:0.000164 t:15.7s +tttg: c185/250 lr:0.000159 t:15.8s +tttg: c186/250 lr:0.000154 t:15.9s +tttg: c187/250 lr:0.000150 t:16.0s +tttg: c188/250 lr:0.000145 t:16.1s +tttg: c189/250 lr:0.000141 t:16.1s +tttg: c190/250 lr:0.000137 t:16.2s +tttg: c191/250 lr:0.000132 t:16.3s +tttg: c192/250 lr:0.000128 t:16.4s +tttg: c193/250 lr:0.000124 t:16.5s +tttg: c194/250 lr:0.000120 t:16.6s +tttg: c195/250 lr:0.000116 t:16.6s +tttg: c196/250 lr:0.000112 t:16.7s +tttg: c197/250 lr:0.000108 t:16.8s +tttg: c198/250 lr:0.000104 t:16.9s +tttg: c199/250 lr:0.000100 t:17.0s +tttg: c200/250 lr:0.000096 t:17.1s +tttg: c201/250 lr:0.000093 t:17.2s +tttg: c202/250 lr:0.000089 t:17.3s +tttg: c203/250 lr:0.000085 t:17.3s +tttg: c204/250 lr:0.000082 t:17.4s +tttg: c205/250 lr:0.000078 t:17.5s +tttg: c206/250 lr:0.000075 t:17.6s +tttg: c207/250 lr:0.000072 t:17.7s +tttg: c208/250 lr:0.000069 t:17.8s +tttg: c209/250 lr:0.000065 t:17.9s +tttg: c210/250 lr:0.000062 t:17.9s +tttg: c211/250 lr:0.000059 t:18.0s +tttg: c212/250 lr:0.000056 t:18.1s +tttg: c213/250 lr:0.000053 t:18.2s +tttg: c214/250 lr:0.000051 t:18.3s +tttg: c215/250 lr:0.000048 t:18.4s +tttg: c216/250 lr:0.000045 t:18.4s +tttg: c217/250 lr:0.000043 t:18.5s +tttg: c218/250 lr:0.000040 t:18.6s +tttg: c219/250 lr:0.000038 t:18.7s +tttg: c220/250 lr:0.000035 t:18.8s +tttg: c221/250 lr:0.000033 t:18.9s +tttg: c222/250 lr:0.000031 t:19.0s +tttg: c223/250 lr:0.000029 t:19.0s +tttg: c224/250 lr:0.000027 t:19.1s +tttg: c225/250 lr:0.000025 t:19.2s +tttg: c226/250 lr:0.000023 t:19.3s +tttg: c227/250 lr:0.000021 t:19.4s +tttg: c228/250 lr:0.000019 t:19.5s +tttg: c229/250 lr:0.000017 t:19.6s +tttg: c230/250 lr:0.000016 t:19.6s +tttg: c231/250 lr:0.000014 t:19.7s +tttg: c232/250 lr:0.000013 t:19.8s +tttg: c233/250 lr:0.000011 t:19.9s +tttg: c234/250 lr:0.000010 t:20.0s +tttg: c235/250 lr:0.000009 t:20.1s +tttg: c236/250 lr:0.000008 t:20.2s +tttg: c237/250 lr:0.000007 t:20.2s +tttg: c238/250 lr:0.000006 t:20.3s +tttg: c239/250 lr:0.000005 t:20.4s +tttg: c240/250 lr:0.000004 t:20.5s +tttg: c241/250 lr:0.000003 t:20.6s +tttg: c242/250 lr:0.000003 t:20.7s +tttg: c243/250 lr:0.000002 t:20.8s +tttg: c244/250 lr:0.000001 t:20.8s +tttg: c245/250 lr:0.000001 t:20.9s +tttg: c246/250 lr:0.000001 t:21.0s +tttg: c247/250 lr:0.000000 t:21.1s +tttg: c248/250 lr:0.000000 t:21.2s +tttg: c249/250 lr:0.000000 t:21.3s +ttpr: phase:3/3 t:352.4s +ttp: b743/782 bl:2.3403 bb:1.0663 rl:2.3234 rb:1.0818 dl:2762-2805 gd:1 +ttp: b728/782 bl:2.3668 bb:1.0836 rl:2.3265 rb:1.0819 dl:2306-2324 gd:1 +ttp: b720/782 bl:2.3687 bb:1.0713 rl:2.3292 rb:1.0812 dl:2125-2144 gd:1 +ttp: b718/782 bl:2.3015 bb:1.0330 rl:2.3276 rb:1.0783 dl:2089-2106 gd:1 +ttp: b707/782 bl:2.3683 bb:1.0525 rl:2.3296 rb:1.0770 dl:1910-1923 gd:1 +ttp: b698/782 bl:2.2632 bb:1.0360 rl:2.3266 rb:1.0751 dl:1803-1814 gd:1 +ttp: b694/782 bl:2.3222 bb:1.0619 rl:2.3264 rb:1.0745 dl:1758-1769 gd:1 +ttp: b685/782 bl:2.3080 bb:1.0329 rl:2.3257 rb:1.0729 dl:1665-1675 gd:1 +ttp: b679/782 bl:2.3162 bb:1.0635 rl:2.3254 rb:1.0725 dl:1610-1618 gd:1 +ttp: b670/782 bl:2.3600 bb:1.0739 rl:2.3265 rb:1.0726 dl:1537-1544 gd:1 +ttp: b658/782 bl:2.2686 bb:1.0270 rl:2.3248 rb:1.0712 dl:1452-1459 gd:1 +ttp: b652/782 bl:2.2597 bb:1.0272 rl:2.3229 rb:1.0699 dl:1411-1419 gd:1 +ttp: b645/782 bl:2.3184 bb:1.0373 rl:2.3228 rb:1.0690 dl:1367-1375 gd:1 +ttp: b637/782 bl:2.3774 bb:1.0842 rl:2.3242 
rb:1.0694 dl:1320-1325 gd:1 +ttp: b629/782 bl:2.3604 bb:1.0158 rl:2.3250 rb:1.0680 dl:1276-1280 gd:1 +ttp: b621/782 bl:2.3068 bb:1.0534 rl:2.3246 rb:1.0677 dl:1231-1237 gd:1 +ttp: b613/782 bl:2.3473 bb:1.0451 rl:2.3251 rb:1.0672 dl:1190-1195 gd:1 +ttp: b607/782 bl:2.3646 bb:1.0578 rl:2.3259 rb:1.0670 dl:1164-1168 gd:1 +ttp: b599/782 bl:2.3784 bb:1.0759 rl:2.3269 rb:1.0672 dl:1129-1133 gd:1 +ttp: b591/782 bl:2.3206 bb:1.0385 rl:2.3268 rb:1.0666 dl:1093-1098 gd:1 +ttp: b583/782 bl:2.3335 bb:1.0369 rl:2.3269 rb:1.0661 dl:1060-1064 gd:1 +ttp: b572/782 bl:2.3194 bb:1.0432 rl:2.3268 rb:1.0657 dl:1017-1021 gd:1 +ttp: b564/782 bl:2.2962 bb:1.0217 rl:2.3263 rb:1.0650 dl:990-993 gd:1 +ttp: b557/782 bl:2.3528 bb:1.0569 rl:2.3267 rb:1.0648 dl:965-968 gd:1 +ttp: b550/782 bl:2.3796 bb:1.0647 rl:2.3275 rb:1.0648 dl:943-946 gd:1 +ttp: b542/782 bl:2.3395 bb:1.0447 rl:2.3277 rb:1.0645 dl:918-921 gd:1 +ttp: b535/782 bl:2.3869 bb:1.0353 rl:2.3285 rb:1.0641 dl:896-899 gd:1 +ttp: b528/782 bl:2.3450 bb:1.0482 rl:2.3287 rb:1.0639 dl:875-878 gd:1 +ttp: b521/782 bl:2.3662 bb:1.0725 rl:2.3292 rb:1.0640 dl:854-858 gd:1 +ttp: b513/782 bl:2.3807 bb:1.0451 rl:2.3298 rb:1.0638 dl:832-835 gd:1 +ttp: b493/782 bl:2.3751 bb:1.0484 rl:2.3303 rb:1.0636 dl:778-780 gd:1 +ttp: b485/782 bl:2.3022 bb:1.0371 rl:2.3300 rb:1.0633 dl:759-761 gd:1 +ttp: b477/782 bl:2.4146 bb:1.0399 rl:2.3309 rb:1.0631 dl:740-742 gd:1 +ttp: b469/782 bl:2.3381 bb:1.0282 rl:2.3310 rb:1.0627 dl:721-724 gd:1 +ttp: b460/782 bl:2.2590 bb:1.0568 rl:2.3303 rb:1.0626 dl:701-703 gd:1 +ttp: b452/782 bl:2.2757 bb:1.0185 rl:2.3297 rb:1.0622 dl:685-687 gd:1 +ttp: b444/782 bl:2.3233 bb:1.0704 rl:2.3297 rb:1.0623 dl:668-670 gd:1 +ttp: b437/782 bl:2.3118 bb:1.0637 rl:2.3295 rb:1.0623 dl:653-655 gd:1 +ttp: b429/782 bl:2.2547 bb:1.0283 rl:2.3289 rb:1.0620 dl:638-640 gd:1 +ttp: b421/782 bl:2.3019 bb:1.0078 rl:2.3287 rb:1.0615 dl:622-624 gd:1 +ttp: b416/782 bl:2.3886 bb:1.0502 rl:2.3292 rb:1.0614 dl:613-615 gd:1 +ttp: b407/782 bl:2.2825 bb:1.0449 rl:2.3288 rb:1.0613 dl:595-597 gd:1 +ttp: b399/782 bl:2.2919 bb:1.0344 rl:2.3285 rb:1.0611 dl:581-582 gd:1 +ttp: b390/782 bl:2.3576 bb:1.0622 rl:2.3287 rb:1.0611 dl:564-566 gd:1 +ttp: b383/782 bl:2.2912 bb:1.0505 rl:2.3285 rb:1.0610 dl:552-554 gd:1 +ttp: b375/782 bl:2.4246 bb:1.0814 rl:2.3291 rb:1.0612 dl:538-540 gd:1 +ttp: b367/782 bl:2.3080 bb:1.0892 rl:2.3290 rb:1.0614 dl:525-527 gd:1 +ttp: b360/782 bl:2.3130 bb:1.0820 rl:2.3289 rb:1.0615 dl:513-515 gd:1 +ttp: b352/782 bl:2.4358 bb:1.1022 rl:2.3295 rb:1.0618 dl:499-501 gd:1 +ttp: b344/782 bl:2.3966 bb:1.0681 rl:2.3299 rb:1.0618 dl:488-489 gd:1 +ttp: b337/782 bl:2.3296 bb:1.0601 rl:2.3299 rb:1.0618 dl:477-478 gd:1 +ttp: b330/782 bl:2.2515 bb:1.0729 rl:2.3295 rb:1.0618 dl:466-468 gd:1 +ttp: b321/782 bl:2.3733 bb:1.0834 rl:2.3297 rb:1.0620 dl:453-455 gd:1 +ttp: b313/782 bl:2.3043 bb:1.0857 rl:2.3296 rb:1.0621 dl:440-442 gd:1 +ttp: b305/782 bl:2.3469 bb:1.0909 rl:2.3297 rb:1.0622 dl:429-430 gd:1 +ttp: b296/782 bl:2.4001 bb:1.1051 rl:2.3300 rb:1.0624 dl:415-417 gd:1 +ttp: b288/782 bl:2.2508 bb:1.0245 rl:2.3297 rb:1.0623 dl:403-405 gd:1 +ttp: b280/782 bl:2.3481 bb:1.0948 rl:2.3297 rb:1.0624 dl:392-394 gd:1 +ttp: b272/782 bl:2.3785 bb:1.0986 rl:2.3300 rb:1.0626 dl:382-383 gd:1 +ttp: b265/782 bl:2.3755 bb:1.1053 rl:2.3302 rb:1.0627 dl:372-374 gd:1 +ttp: b257/782 bl:2.4513 bb:1.1150 rl:2.3307 rb:1.0630 dl:362-364 gd:1 +ttp: b249/782 bl:2.4572 bb:1.1068 rl:2.3312 rb:1.0632 dl:352-354 gd:1 +ttp: b241/782 bl:2.3575 bb:1.0957 rl:2.3313 rb:1.0633 dl:342-344 gd:1 +ttp: b233/782 bl:2.3689 
bb:1.1318 rl:2.3314 rb:1.0635 dl:333-334 gd:1 +ttp: b225/782 bl:2.4456 bb:1.1197 rl:2.3319 rb:1.0637 dl:323-324 gd:1 +ttp: b217/782 bl:2.3890 bb:1.1406 rl:2.3321 rb:1.0640 dl:314-315 gd:1 +ttp: b209/782 bl:2.4221 bb:1.1329 rl:2.3324 rb:1.0642 dl:305-306 gd:1 +ttp: b201/782 bl:2.3103 bb:1.1021 rl:2.3323 rb:1.0644 dl:297-298 gd:1 +ttp: b193/782 bl:2.3768 bb:1.1398 rl:2.3325 rb:1.0646 dl:288-289 gd:1 +ttp: b186/782 bl:2.4230 bb:1.1325 rl:2.3327 rb:1.0648 dl:280-281 gd:1 +ttp: b178/782 bl:2.3529 bb:1.1007 rl:2.3328 rb:1.0649 dl:272-273 gd:1 +ttp: b171/782 bl:2.4812 bb:1.1441 rl:2.3332 rb:1.0651 dl:266-266 gd:1 +ttp: b159/782 bl:2.4783 bb:1.1498 rl:2.3337 rb:1.0654 dl:254-255 gd:1 +ttp: b152/782 bl:2.4007 bb:1.1498 rl:2.3338 rb:1.0656 dl:247-248 gd:1 +ttp: b143/782 bl:2.4236 bb:1.1745 rl:2.3341 rb:1.0659 dl:238-239 gd:1 +ttp: b138/782 bl:2.3986 bb:1.1159 rl:2.3342 rb:1.0660 dl:233-234 gd:1 +ttp: b130/782 bl:2.5797 bb:1.1822 rl:2.3349 rb:1.0663 dl:226-227 gd:1 +ttp: b121/782 bl:2.4450 bb:1.1159 rl:2.3351 rb:1.0664 dl:218-219 gd:1 +ttp: b114/782 bl:2.4817 bb:1.1508 rl:2.3355 rb:1.0666 dl:211-212 gd:1 +ttp: b106/782 bl:2.4561 bb:1.1822 rl:2.3357 rb:1.0669 dl:204-205 gd:1 +ttp: b98/782 bl:2.6020 bb:1.2209 rl:2.3363 rb:1.0672 dl:197-198 gd:1 +ttp: b89/782 bl:2.5073 bb:1.1586 rl:2.3367 rb:1.0674 dl:189-190 gd:1 +ttp: b81/782 bl:2.4914 bb:1.1307 rl:2.3370 rb:1.0675 dl:182-183 gd:1 +ttp: b74/782 bl:2.4772 bb:1.1496 rl:2.3372 rb:1.0676 dl:175-176 gd:1 +ttp: b65/782 bl:2.4804 bb:1.1764 rl:2.3375 rb:1.0678 dl:167-169 gd:1 +ttp: b58/782 bl:2.5288 bb:1.2273 rl:2.3378 rb:1.0681 dl:161-162 gd:1 +ttp: b50/782 bl:2.4036 bb:1.1648 rl:2.3379 rb:1.0682 dl:153-154 gd:1 +ttp: b42/782 bl:2.4739 bb:1.2046 rl:2.3382 rb:1.0685 dl:145-146 gd:1 +ttp: b32/782 bl:2.6219 bb:1.2225 rl:2.3386 rb:1.0687 dl:135-136 gd:1 +ttp: b24/782 bl:2.4666 bb:1.1633 rl:2.3388 rb:1.0688 dl:127-128 gd:1 +ttp: b16/782 bl:2.6299 bb:1.2602 rl:2.3391 rb:1.0690 dl:117-118 gd:1 +ttp: b8/782 bl:2.8180 bb:1.3080 rl:2.3397 rb:1.0693 dl:103-105 gd:1 +quantized_ttt_phased val_loss:2.33231144 val_bpb:1.06577531 eval_time:470588ms +total_eval_time:470.6s diff --git a/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed42.log b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed42.log new file mode 100644 index 0000000000..bceabf9aea --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_CaseOps_GatedAttn_QuantGate_Loop134_Curriculum_PhasedTTT/train_seed42.log @@ -0,0 +1,842 @@ +Running: env | egrep '^(RUN_ID|SEED|CASEOPS_ENABLED|TRAIN_LOOP_(PHASE_DEPTHS|PREWARM_DEPTHS)|EVAL_LOOP_DEPTH|DATA_PATH|TOKENIZER_PATH|EMBED_BITS|GATED_ATTN_)' | sort +CASEOPS_ENABLED=1 +DATA_PATH=/workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved +EMBED_BITS=7 +EVAL_LOOP_DEPTH=4 +GATED_ATTN_ENABLED=1 +GATED_ATTN_INIT_STD=0.005 +GATED_ATTN_QUANT_GATE=1 +RUN_ID=pr1736_eq134_eval4_seed42 +SEED=42 +TOKENIZER_PATH=/workspace/parameter-golf-pr1736/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model +TRAIN_LOOP_PHASE_DEPTHS=1,3,4 +TRAIN_LOOP_PREWARM_DEPTHS=3,4 +W0420 20:46:33.650000 218708 torch/distributed/run.py:803] +W0420 20:46:33.650000 218708 torch/distributed/run.py:803] ***************************************** +W0420 20:46:33.650000 218708 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, 
please further tune the variable for optimal performance in your application as needed. +W0420 20:46:33.650000 218708 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + artifact_dir: /workspace/fullruns/pr1736_eq134_eval4_seed42/artifact + attn_clip_sigmas: 13.0 + attn_out_gate_enabled: False + attn_out_gate_src: proj + beta1: 0.9 + beta2: 0.95 + caseops_enabled: True + compressor: brotli + data_dir: /workspace/parameter-golf-pr1736/data + datasets_dir: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 15.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_loop_depth: 4 + eval_seq_len: 2048 + eval_stride: 64 + gate_window: 12 + gated_attn_enabled: True + gated_attn_init_std: 0.005 + gated_attn_quant_gate: True + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 4.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: /workspace/fullruns/pr1736_eq134_eval4_seed42/artifact/pr1736_eq134_eval4_seed42.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_clip_sigmas: 12.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: /workspace/fullruns/pr1736_eq134_eval4_seed42/artifact/final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2000 + qk_gain_init: 5.0 + quantized_model_path: /workspace/fullruns/pr1736_eq134_eval4_seed42/artifact/final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: pr1736_eq134_eval4_seed42 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + smear_gate_enabled: False + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /workspace/parameter-golf-pr1736/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + train_batch_tokens: 786432 + train_files: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin + train_log_every: 500 + train_loop_depth_dist: fixed + train_loop_depth_set: [] + train_loop_max_depth: 3 + train_loop_min_depth: 3 + train_loop_phase_depths: [1, 3, 4] + train_loop_phase_fractions: [] + train_loop_prewarm_depths: [3, 4] + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.999 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 96 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_bytes_files: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin 
+ val_doc_fraction: 1.0
+ val_files: /workspace/parameter-golf-pr1736/data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_[0-9][0-9][0-9][0-9][0-9][0-9].bin
+ val_loss_every: 4000
+ vocab_size: 8192
+ warmdown_frac: 0.75
+ warmup_steps: 20
+ world_size: 8
+ xsa_last_n: 11
+train_shards: 80
+val_tokens: 47851520
+model_params:35989658
+gptq:reserving 4s, effective=596000ms
+loop_depth_schedule: train=[1, 3, 4] dist=phased phase_fracs=None prewarm=[3, 4] eval_depth=4
+warmup_cu_buckets:64,128,192,256 iters_each:3
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+0/20000 val_loss: 9.0176 val_bpb: 4.1204
+1/20000 train_loss: 9.0180 train_time: 0.0m tok/s: 12866658
+2/20000 train_loss: 12.7206 train_time: 0.0m tok/s: 11422450
+3/20000 train_loss: 10.1119 train_time: 0.0m tok/s: 10148772
+4/20000 train_loss: 8.5451 train_time: 0.0m tok/s: 9636150
+5/20000 train_loss: 7.7591 train_time: 0.0m tok/s: 9343651
+500/20000 train_loss: 2.5850 train_time: 0.8m tok/s: 8059061
+1000/20000 train_loss: 2.8208 train_time: 1.6m tok/s: 8011915
+1500/20000 train_loss: 2.6400 train_time: 2.5m tok/s: 7998889
+2000/20000 train_loss: 2.6689 train_time: 3.3m tok/s: 7995739
+layer_loop:phase step:2020 frac:0.333 depth:3 phases:[1, 3, 4] eval_depth:4
+2500/20000 train_loss: 2.5496 train_time: 4.5m tok/s: 7347682
+3000/20000 train_loss: 2.5616 train_time: 5.7m tok/s: 6953080
+layer_loop:phase step:3405 frac:0.667 depth:4 phases:[1, 3, 4] eval_depth:4
+3500/20000 train_loss: 2.5593 train_time: 6.9m tok/s: 6662079
+4000/20000 train_loss: 2.3889 train_time: 8.3m tok/s: 6341348
+4000/20000 val_loss: 2.4137 val_bpb: 1.1029
+4500/20000 train_loss: 2.2473 train_time: 9.6m tok/s: 6112324
+4603/20000 val_loss: 2.3393 val_bpb: 1.0689
+stopping_early: wallclock_cap train_time: 596152ms step: 4603/20000
+peak memory allocated: 46573 MiB reserved: 50344 MiB
+ema:applying EMA weights
+diagnostic pre-quantization post-ema val_loss:2.33890789 val_bpb:1.06872023 eval_time:8147ms
+Serialized model: 135592891 bytes
+Code size (uncompressed): 141428 bytes
+Code size (compressed): 35795 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 4.0s +Quantized weights: + gate_int8_row: blocks.attn.attn_gate_w + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights +Serialized model quantized+brotli: 15950784 bytes +Total submission size quantized+brotli: 15986579 bytes +diagnostic quantized val_loss:2.35904908 val_bpb:1.07792337 eval_time:13178ms +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (117.9s) + +beginning TTT eval timer +ttt_phased: total_docs:50000 prefix_docs:2000 suffix_docs:48000 num_phases:3 boundaries:[666, 1333, 2000] +ttp: b781/782 bl:2.1615 bb:1.0576 rl:2.1615 rb:1.0576 dl:17258-30330 gd:0 +ttpp: phase:1/3 pd:1104 gd:666 t:241.5s +tttg: c1/111 lr:0.001000 t:0.3s +tttg: c2/111 lr:0.001000 t:0.4s +tttg: c3/111 lr:0.000999 t:0.5s +tttg: c4/111 lr:0.000998 t:0.6s +tttg: c5/111 lr:0.000997 t:0.7s +tttg: c6/111 lr:0.000995 t:0.8s +tttg: c7/111 lr:0.000993 t:0.9s +tttg: c8/111 lr:0.000990 t:0.9s +tttg: c9/111 lr:0.000987 t:1.0s +tttg: c10/111 lr:0.000984 t:1.1s +tttg: c11/111 lr:0.000980 t:1.2s +tttg: c12/111 lr:0.000976 t:1.3s +tttg: c13/111 lr:0.000971 t:1.4s +tttg: c14/111 lr:0.000966 t:1.5s +tttg: c15/111 lr:0.000961 t:1.6s +tttg: c16/111 lr:0.000955 t:1.6s +tttg: c17/111 lr:0.000949 t:1.7s +tttg: c18/111 lr:0.000942 t:1.8s +tttg: c19/111 lr:0.000935 t:1.9s +tttg: c20/111 lr:0.000928 t:2.0s +tttg: c21/111 lr:0.000921 t:2.1s +tttg: c22/111 lr:0.000913 t:2.2s +tttg: c23/111 lr:0.000905 t:2.3s +tttg: c24/111 lr:0.000896 t:2.3s +tttg: c25/111 lr:0.000887 t:2.4s +tttg: c26/111 lr:0.000878 t:2.5s +tttg: c27/111 lr:0.000868 t:2.6s +tttg: c28/111 lr:0.000859 t:2.7s +tttg: c29/111 lr:0.000848 t:2.8s +tttg: c30/111 lr:0.000838 t:2.9s +tttg: c31/111 lr:0.000827 t:2.9s +tttg: c32/111 lr:0.000817 t:3.0s +tttg: c33/111 lr:0.000805 t:3.1s +tttg: c34/111 lr:0.000794 t:3.2s +tttg: c35/111 lr:0.000782 t:3.3s +tttg: c36/111 lr:0.000770 t:3.4s +tttg: c37/111 lr:0.000758 t:3.5s +tttg: c38/111 lr:0.000746 t:3.6s +tttg: c39/111 lr:0.000733 t:3.6s +tttg: c40/111 lr:0.000721 t:3.7s +tttg: c41/111 lr:0.000708 t:3.8s +tttg: c42/111 lr:0.000695 t:3.9s +tttg: c43/111 lr:0.000681 t:4.0s +tttg: c44/111 lr:0.000668 t:4.1s +tttg: c45/111 lr:0.000655 t:4.2s +tttg: c46/111 lr:0.000641 t:4.2s +tttg: c47/111 lr:0.000627 t:4.3s +tttg: c48/111 lr:0.000613 t:4.4s +tttg: c49/111 lr:0.000599 t:4.5s +tttg: c50/111 lr:0.000585 t:4.6s +tttg: c51/111 lr:0.000571 t:4.7s +tttg: c52/111 lr:0.000557 t:4.8s +tttg: c53/111 lr:0.000543 t:4.8s +tttg: c54/111 lr:0.000529 t:4.9s +tttg: c55/111 lr:0.000514 t:5.0s +tttg: c56/111 lr:0.000500 t:5.1s +tttg: c57/111 lr:0.000486 t:5.2s +tttg: c58/111 lr:0.000471 t:5.3s +tttg: c59/111 lr:0.000457 t:5.4s +tttg: c60/111 lr:0.000443 t:5.5s +tttg: c61/111 lr:0.000429 t:5.5s +tttg: c62/111 lr:0.000415 t:5.6s +tttg: c63/111 lr:0.000401 t:5.7s +tttg: c64/111 lr:0.000387 t:5.8s +tttg: c65/111 lr:0.000373 t:5.9s +tttg: c66/111 lr:0.000359 t:6.0s +tttg: c67/111 lr:0.000345 t:6.1s +tttg: c68/111 lr:0.000332 t:6.1s +tttg: c69/111 lr:0.000319 t:6.2s +tttg: c70/111 lr:0.000305 t:6.3s +tttg: c71/111 lr:0.000292 t:6.4s +tttg: c72/111 lr:0.000279 t:6.5s +tttg: c73/111 lr:0.000267 t:6.6s +tttg: c74/111 lr:0.000254 t:6.7s +tttg: c75/111 lr:0.000242 t:6.8s +tttg: c76/111 
lr:0.000230 t:6.8s +tttg: c77/111 lr:0.000218 t:6.9s +tttg: c78/111 lr:0.000206 t:7.0s +tttg: c79/111 lr:0.000195 t:7.1s +tttg: c80/111 lr:0.000183 t:7.2s +tttg: c81/111 lr:0.000173 t:7.3s +tttg: c82/111 lr:0.000162 t:7.4s +tttg: c83/111 lr:0.000152 t:7.5s +tttg: c84/111 lr:0.000141 t:7.5s +tttg: c85/111 lr:0.000132 t:7.6s +tttg: c86/111 lr:0.000122 t:7.7s +tttg: c87/111 lr:0.000113 t:7.8s +tttg: c88/111 lr:0.000104 t:7.9s +tttg: c89/111 lr:0.000095 t:8.0s +tttg: c90/111 lr:0.000087 t:8.1s +tttg: c91/111 lr:0.000079 t:8.1s +tttg: c92/111 lr:0.000072 t:8.2s +tttg: c93/111 lr:0.000065 t:8.3s +tttg: c94/111 lr:0.000058 t:8.4s +tttg: c95/111 lr:0.000051 t:8.5s +tttg: c96/111 lr:0.000045 t:8.6s +tttg: c97/111 lr:0.000039 t:8.7s +tttg: c98/111 lr:0.000034 t:8.8s +tttg: c99/111 lr:0.000029 t:8.8s +tttg: c100/111 lr:0.000024 t:8.9s +tttg: c101/111 lr:0.000020 t:9.0s +tttg: c102/111 lr:0.000016 t:9.1s +tttg: c103/111 lr:0.000013 t:9.2s +tttg: c104/111 lr:0.000010 t:9.3s +tttg: c105/111 lr:0.000007 t:9.4s +tttg: c106/111 lr:0.000005 t:9.5s +tttg: c107/111 lr:0.000003 t:9.5s +tttg: c108/111 lr:0.000002 t:9.6s +tttg: c109/111 lr:0.000001 t:9.7s +tttg: c110/111 lr:0.000000 t:9.8s +ttpr: phase:1/3 t:253.8s +ttp: b764/782 bl:2.3002 bb:1.0774 rl:2.1843 rb:1.0610 dl:4284-4392 gd:0 +ttpp: phase:2/3 pd:1808 gd:1333 t:333.2s +tttg: c1/185 lr:0.001000 t:0.1s +tttg: c2/185 lr:0.001000 t:0.2s +tttg: c3/185 lr:0.001000 t:0.3s +tttg: c4/185 lr:0.000999 t:0.3s +tttg: c5/185 lr:0.000999 t:0.4s +tttg: c6/185 lr:0.000998 t:0.5s +tttg: c7/185 lr:0.000997 t:0.6s +tttg: c8/185 lr:0.000996 t:0.7s +tttg: c9/185 lr:0.000995 t:0.8s +tttg: c10/185 lr:0.000994 t:0.9s +tttg: c11/185 lr:0.000993 t:0.9s +tttg: c12/185 lr:0.000991 t:1.0s +tttg: c13/185 lr:0.000990 t:1.1s +tttg: c14/185 lr:0.000988 t:1.2s +tttg: c15/185 lr:0.000986 t:1.3s +tttg: c16/185 lr:0.000984 t:1.4s +tttg: c17/185 lr:0.000981 t:1.5s +tttg: c18/185 lr:0.000979 t:1.6s +tttg: c19/185 lr:0.000977 t:1.6s +tttg: c20/185 lr:0.000974 t:1.7s +tttg: c21/185 lr:0.000971 t:1.8s +tttg: c22/185 lr:0.000968 t:1.9s +tttg: c23/185 lr:0.000965 t:2.0s +tttg: c24/185 lr:0.000962 t:2.1s +tttg: c25/185 lr:0.000959 t:2.2s +tttg: c26/185 lr:0.000955 t:2.2s +tttg: c27/185 lr:0.000952 t:2.3s +tttg: c28/185 lr:0.000948 t:2.4s +tttg: c29/185 lr:0.000944 t:2.5s +tttg: c30/185 lr:0.000940 t:2.6s +tttg: c31/185 lr:0.000936 t:2.7s +tttg: c32/185 lr:0.000932 t:2.8s +tttg: c33/185 lr:0.000927 t:2.9s +tttg: c34/185 lr:0.000923 t:2.9s +tttg: c35/185 lr:0.000918 t:3.0s +tttg: c36/185 lr:0.000913 t:3.1s +tttg: c37/185 lr:0.000908 t:3.2s +tttg: c38/185 lr:0.000904 t:3.3s +tttg: c39/185 lr:0.000898 t:3.4s +tttg: c40/185 lr:0.000893 t:3.5s +tttg: c41/185 lr:0.000888 t:3.5s +tttg: c42/185 lr:0.000882 t:3.6s +tttg: c43/185 lr:0.000877 t:3.7s +tttg: c44/185 lr:0.000871 t:3.8s +tttg: c45/185 lr:0.000865 t:3.9s +tttg: c46/185 lr:0.000860 t:4.0s +tttg: c47/185 lr:0.000854 t:4.1s +tttg: c48/185 lr:0.000847 t:4.1s +tttg: c49/185 lr:0.000841 t:4.2s +tttg: c50/185 lr:0.000835 t:4.3s +tttg: c51/185 lr:0.000829 t:4.4s +tttg: c52/185 lr:0.000822 t:4.5s +tttg: c53/185 lr:0.000816 t:4.6s +tttg: c54/185 lr:0.000809 t:4.7s +tttg: c55/185 lr:0.000802 t:4.8s +tttg: c56/185 lr:0.000795 t:4.8s +tttg: c57/185 lr:0.000788 t:4.9s +tttg: c58/185 lr:0.000781 t:5.0s +tttg: c59/185 lr:0.000774 t:5.1s +tttg: c60/185 lr:0.000767 t:5.2s +tttg: c61/185 lr:0.000760 t:5.3s +tttg: c62/185 lr:0.000752 t:5.4s +tttg: c63/185 lr:0.000745 t:5.4s +tttg: c64/185 lr:0.000738 t:5.5s +tttg: c65/185 lr:0.000730 t:5.6s +tttg: c66/185 lr:0.000722 
t:5.7s +tttg: c67/185 lr:0.000715 t:5.8s +tttg: c68/185 lr:0.000707 t:5.9s +tttg: c69/185 lr:0.000699 t:6.0s +tttg: c70/185 lr:0.000691 t:6.0s +tttg: c71/185 lr:0.000683 t:6.1s +tttg: c72/185 lr:0.000675 t:6.2s +tttg: c73/185 lr:0.000667 t:6.3s +tttg: c74/185 lr:0.000659 t:6.4s +tttg: c75/185 lr:0.000651 t:6.5s +tttg: c76/185 lr:0.000643 t:6.6s +tttg: c77/185 lr:0.000635 t:6.7s +tttg: c78/185 lr:0.000627 t:6.7s +tttg: c79/185 lr:0.000618 t:6.8s +tttg: c80/185 lr:0.000610 t:6.9s +tttg: c81/185 lr:0.000602 t:7.0s +tttg: c82/185 lr:0.000593 t:7.1s +tttg: c83/185 lr:0.000585 t:7.2s +tttg: c84/185 lr:0.000577 t:7.3s +tttg: c85/185 lr:0.000568 t:7.3s +tttg: c86/185 lr:0.000560 t:7.4s +tttg: c87/185 lr:0.000551 t:7.5s +tttg: c88/185 lr:0.000543 t:7.6s +tttg: c89/185 lr:0.000534 t:7.7s +tttg: c90/185 lr:0.000526 t:7.8s +tttg: c91/185 lr:0.000517 t:7.9s +tttg: c92/185 lr:0.000509 t:7.9s +tttg: c93/185 lr:0.000500 t:8.0s +tttg: c94/185 lr:0.000491 t:8.1s +tttg: c95/185 lr:0.000483 t:8.2s +tttg: c96/185 lr:0.000474 t:8.3s +tttg: c97/185 lr:0.000466 t:8.4s +tttg: c98/185 lr:0.000457 t:8.5s +tttg: c99/185 lr:0.000449 t:8.5s +tttg: c100/185 lr:0.000440 t:8.6s +tttg: c101/185 lr:0.000432 t:8.7s +tttg: c102/185 lr:0.000423 t:8.8s +tttg: c103/185 lr:0.000415 t:8.9s +tttg: c104/185 lr:0.000407 t:9.0s +tttg: c105/185 lr:0.000398 t:9.1s +tttg: c106/185 lr:0.000390 t:9.1s +tttg: c107/185 lr:0.000382 t:9.2s +tttg: c108/185 lr:0.000373 t:9.3s +tttg: c109/185 lr:0.000365 t:9.4s +tttg: c110/185 lr:0.000357 t:9.5s +tttg: c111/185 lr:0.000349 t:9.6s +tttg: c112/185 lr:0.000341 t:9.7s +tttg: c113/185 lr:0.000333 t:9.7s +tttg: c114/185 lr:0.000325 t:9.8s +tttg: c115/185 lr:0.000317 t:9.9s +tttg: c116/185 lr:0.000309 t:10.0s +tttg: c117/185 lr:0.000301 t:10.1s +tttg: c118/185 lr:0.000293 t:10.2s +tttg: c119/185 lr:0.000285 t:10.3s +tttg: c120/185 lr:0.000278 t:10.3s +tttg: c121/185 lr:0.000270 t:10.4s +tttg: c122/185 lr:0.000262 t:10.5s +tttg: c123/185 lr:0.000255 t:10.6s +tttg: c124/185 lr:0.000248 t:10.7s +tttg: c125/185 lr:0.000240 t:10.8s +tttg: c126/185 lr:0.000233 t:10.9s +tttg: c127/185 lr:0.000226 t:11.0s +tttg: c128/185 lr:0.000219 t:11.0s +tttg: c129/185 lr:0.000212 t:11.1s +tttg: c130/185 lr:0.000205 t:11.2s +tttg: c131/185 lr:0.000198 t:11.3s +tttg: c132/185 lr:0.000191 t:11.4s +tttg: c133/185 lr:0.000184 t:11.5s +tttg: c134/185 lr:0.000178 t:11.6s +tttg: c135/185 lr:0.000171 t:11.6s +tttg: c136/185 lr:0.000165 t:11.7s +tttg: c137/185 lr:0.000159 t:11.8s +tttg: c138/185 lr:0.000153 t:11.9s +tttg: c139/185 lr:0.000146 t:12.0s +tttg: c140/185 lr:0.000140 t:12.1s +tttg: c141/185 lr:0.000135 t:12.2s +tttg: c142/185 lr:0.000129 t:12.2s +tttg: c143/185 lr:0.000123 t:12.3s +tttg: c144/185 lr:0.000118 t:12.4s +tttg: c145/185 lr:0.000112 t:12.5s +tttg: c146/185 lr:0.000107 t:12.6s +tttg: c147/185 lr:0.000102 t:12.7s +tttg: c148/185 lr:0.000096 t:12.8s +tttg: c149/185 lr:0.000092 t:12.8s +tttg: c150/185 lr:0.000087 t:12.9s +tttg: c151/185 lr:0.000082 t:13.0s +tttg: c152/185 lr:0.000077 t:13.1s +tttg: c153/185 lr:0.000073 t:13.2s +tttg: c154/185 lr:0.000068 t:13.3s +tttg: c155/185 lr:0.000064 t:13.4s +tttg: c156/185 lr:0.000060 t:13.5s +tttg: c157/185 lr:0.000056 t:13.5s +tttg: c158/185 lr:0.000052 t:13.6s +tttg: c159/185 lr:0.000048 t:13.7s +tttg: c160/185 lr:0.000045 t:13.8s +tttg: c161/185 lr:0.000041 t:13.9s +tttg: c162/185 lr:0.000038 t:14.0s +tttg: c163/185 lr:0.000035 t:14.1s +tttg: c164/185 lr:0.000032 t:14.1s +tttg: c165/185 lr:0.000029 t:14.2s +tttg: c166/185 lr:0.000026 t:14.3s +tttg: c167/185 lr:0.000023 
t:14.4s +tttg: c168/185 lr:0.000021 t:14.5s +tttg: c169/185 lr:0.000019 t:14.6s +tttg: c170/185 lr:0.000016 t:14.6s +tttg: c171/185 lr:0.000014 t:14.7s +tttg: c172/185 lr:0.000012 t:14.8s +tttg: c173/185 lr:0.000010 t:14.9s +tttg: c174/185 lr:0.000009 t:15.0s +tttg: c175/185 lr:0.000007 t:15.1s +tttg: c176/185 lr:0.000006 t:15.2s +tttg: c177/185 lr:0.000005 t:15.3s +tttg: c178/185 lr:0.000004 t:15.4s +tttg: c179/185 lr:0.000003 t:15.4s +tttg: c180/185 lr:0.000002 t:15.5s +tttg: c181/185 lr:0.000001 t:15.6s +tttg: c182/185 lr:0.000001 t:15.7s +tttg: c183/185 lr:0.000000 t:15.8s +tttg: c184/185 lr:0.000000 t:15.9s +ttpr: phase:2/3 t:351.6s +ttp: b753/782 bl:2.2249 bb:1.0044 rl:2.1888 rb:1.0542 dl:3284-3344 gd:0 +ttpp: phase:3/3 pd:2448 gd:2000 t:371.4s +tttg: c1/250 lr:0.001000 t:0.1s +tttg: c2/250 lr:0.001000 t:0.2s +tttg: c3/250 lr:0.001000 t:0.3s +tttg: c4/250 lr:0.001000 t:0.3s +tttg: c5/250 lr:0.000999 t:0.4s +tttg: c6/250 lr:0.000999 t:0.5s +tttg: c7/250 lr:0.000999 t:0.6s +tttg: c8/250 lr:0.000998 t:0.7s +tttg: c9/250 lr:0.000997 t:0.8s +tttg: c10/250 lr:0.000997 t:0.9s +tttg: c11/250 lr:0.000996 t:1.0s +tttg: c12/250 lr:0.000995 t:1.0s +tttg: c13/250 lr:0.000994 t:1.1s +tttg: c14/250 lr:0.000993 t:1.2s +tttg: c15/250 lr:0.000992 t:1.3s +tttg: c16/250 lr:0.000991 t:1.4s +tttg: c17/250 lr:0.000990 t:1.5s +tttg: c18/250 lr:0.000989 t:1.6s +tttg: c19/250 lr:0.000987 t:1.7s +tttg: c20/250 lr:0.000986 t:1.7s +tttg: c21/250 lr:0.000984 t:1.8s +tttg: c22/250 lr:0.000983 t:1.9s +tttg: c23/250 lr:0.000981 t:2.0s +tttg: c24/250 lr:0.000979 t:2.1s +tttg: c25/250 lr:0.000977 t:2.2s +tttg: c26/250 lr:0.000975 t:2.3s +tttg: c27/250 lr:0.000973 t:2.3s +tttg: c28/250 lr:0.000971 t:2.4s +tttg: c29/250 lr:0.000969 t:2.5s +tttg: c30/250 lr:0.000967 t:2.6s +tttg: c31/250 lr:0.000965 t:2.7s +tttg: c32/250 lr:0.000962 t:2.8s +tttg: c33/250 lr:0.000960 t:2.9s +tttg: c34/250 lr:0.000957 t:3.0s +tttg: c35/250 lr:0.000955 t:3.0s +tttg: c36/250 lr:0.000952 t:3.1s +tttg: c37/250 lr:0.000949 t:3.2s +tttg: c38/250 lr:0.000947 t:3.3s +tttg: c39/250 lr:0.000944 t:3.4s +tttg: c40/250 lr:0.000941 t:3.5s +tttg: c41/250 lr:0.000938 t:3.6s +tttg: c42/250 lr:0.000935 t:3.6s +tttg: c43/250 lr:0.000931 t:3.7s +tttg: c44/250 lr:0.000928 t:3.8s +tttg: c45/250 lr:0.000925 t:3.9s +tttg: c46/250 lr:0.000922 t:4.0s +tttg: c47/250 lr:0.000918 t:4.1s +tttg: c48/250 lr:0.000915 t:4.2s +tttg: c49/250 lr:0.000911 t:4.3s +tttg: c50/250 lr:0.000907 t:4.3s +tttg: c51/250 lr:0.000904 t:4.4s +tttg: c52/250 lr:0.000900 t:4.5s +tttg: c53/250 lr:0.000896 t:4.6s +tttg: c54/250 lr:0.000892 t:4.7s +tttg: c55/250 lr:0.000888 t:4.8s +tttg: c56/250 lr:0.000884 t:4.9s +tttg: c57/250 lr:0.000880 t:4.9s +tttg: c58/250 lr:0.000876 t:5.0s +tttg: c59/250 lr:0.000872 t:5.1s +tttg: c60/250 lr:0.000868 t:5.2s +tttg: c61/250 lr:0.000863 t:5.3s +tttg: c62/250 lr:0.000859 t:5.4s +tttg: c63/250 lr:0.000855 t:5.5s +tttg: c64/250 lr:0.000850 t:5.6s +tttg: c65/250 lr:0.000846 t:5.6s +tttg: c66/250 lr:0.000841 t:5.7s +tttg: c67/250 lr:0.000836 t:5.8s +tttg: c68/250 lr:0.000832 t:5.9s +tttg: c69/250 lr:0.000827 t:6.0s +tttg: c70/250 lr:0.000822 t:6.1s +tttg: c71/250 lr:0.000817 t:6.2s +tttg: c72/250 lr:0.000812 t:6.3s +tttg: c73/250 lr:0.000807 t:6.3s +tttg: c74/250 lr:0.000803 t:6.4s +tttg: c75/250 lr:0.000797 t:6.5s +tttg: c76/250 lr:0.000792 t:6.6s +tttg: c77/250 lr:0.000787 t:6.7s +tttg: c78/250 lr:0.000782 t:6.8s +tttg: c79/250 lr:0.000777 t:6.9s +tttg: c80/250 lr:0.000772 t:6.9s +tttg: c81/250 lr:0.000766 t:7.0s +tttg: c82/250 lr:0.000761 t:7.1s +tttg: c83/250 
lr:0.000755 t:7.2s +tttg: c84/250 lr:0.000750 t:7.3s +tttg: c85/250 lr:0.000745 t:7.4s +tttg: c86/250 lr:0.000739 t:7.5s +tttg: c87/250 lr:0.000733 t:7.6s +tttg: c88/250 lr:0.000728 t:7.6s +tttg: c89/250 lr:0.000722 t:7.7s +tttg: c90/250 lr:0.000717 t:7.8s +tttg: c91/250 lr:0.000711 t:7.9s +tttg: c92/250 lr:0.000705 t:8.0s +tttg: c93/250 lr:0.000699 t:8.1s +tttg: c94/250 lr:0.000694 t:8.2s +tttg: c95/250 lr:0.000688 t:8.2s +tttg: c96/250 lr:0.000682 t:8.3s +tttg: c97/250 lr:0.000676 t:8.4s +tttg: c98/250 lr:0.000670 t:8.5s +tttg: c99/250 lr:0.000664 t:8.6s +tttg: c100/250 lr:0.000658 t:8.7s +tttg: c101/250 lr:0.000652 t:8.8s +tttg: c102/250 lr:0.000646 t:8.9s +tttg: c103/250 lr:0.000640 t:8.9s +tttg: c104/250 lr:0.000634 t:9.0s +tttg: c105/250 lr:0.000628 t:9.1s +tttg: c106/250 lr:0.000622 t:9.2s +tttg: c107/250 lr:0.000616 t:9.3s +tttg: c108/250 lr:0.000610 t:9.4s +tttg: c109/250 lr:0.000603 t:9.5s +tttg: c110/250 lr:0.000597 t:9.6s +tttg: c111/250 lr:0.000591 t:9.6s +tttg: c112/250 lr:0.000585 t:9.7s +tttg: c113/250 lr:0.000579 t:9.8s +tttg: c114/250 lr:0.000572 t:9.9s +tttg: c115/250 lr:0.000566 t:10.0s +tttg: c116/250 lr:0.000560 t:10.1s +tttg: c117/250 lr:0.000554 t:10.2s +tttg: c118/250 lr:0.000547 t:10.2s +tttg: c119/250 lr:0.000541 t:10.3s +tttg: c120/250 lr:0.000535 t:10.4s +tttg: c121/250 lr:0.000528 t:10.5s +tttg: c122/250 lr:0.000522 t:10.6s +tttg: c123/250 lr:0.000516 t:10.7s +tttg: c124/250 lr:0.000509 t:10.8s +tttg: c125/250 lr:0.000503 t:10.9s +tttg: c126/250 lr:0.000497 t:10.9s +tttg: c127/250 lr:0.000491 t:11.0s +tttg: c128/250 lr:0.000484 t:11.1s +tttg: c129/250 lr:0.000478 t:11.2s +tttg: c130/250 lr:0.000472 t:11.3s +tttg: c131/250 lr:0.000465 t:11.4s +tttg: c132/250 lr:0.000459 t:11.5s +tttg: c133/250 lr:0.000453 t:11.5s +tttg: c134/250 lr:0.000446 t:11.6s +tttg: c135/250 lr:0.000440 t:11.7s +tttg: c136/250 lr:0.000434 t:11.8s +tttg: c137/250 lr:0.000428 t:11.9s +tttg: c138/250 lr:0.000421 t:12.0s +tttg: c139/250 lr:0.000415 t:12.1s +tttg: c140/250 lr:0.000409 t:12.2s +tttg: c141/250 lr:0.000403 t:12.2s +tttg: c142/250 lr:0.000397 t:12.3s +tttg: c143/250 lr:0.000390 t:12.4s +tttg: c144/250 lr:0.000384 t:12.5s +tttg: c145/250 lr:0.000378 t:12.6s +tttg: c146/250 lr:0.000372 t:12.7s +tttg: c147/250 lr:0.000366 t:12.8s +tttg: c148/250 lr:0.000360 t:12.8s +tttg: c149/250 lr:0.000354 t:12.9s +tttg: c150/250 lr:0.000348 t:13.0s +tttg: c151/250 lr:0.000342 t:13.1s +tttg: c152/250 lr:0.000336 t:13.2s +tttg: c153/250 lr:0.000330 t:13.3s +tttg: c154/250 lr:0.000324 t:13.4s +tttg: c155/250 lr:0.000318 t:13.5s +tttg: c156/250 lr:0.000312 t:13.5s +tttg: c157/250 lr:0.000306 t:13.6s +tttg: c158/250 lr:0.000301 t:13.7s +tttg: c159/250 lr:0.000295 t:13.8s +tttg: c160/250 lr:0.000289 t:13.9s +tttg: c161/250 lr:0.000283 t:14.0s +tttg: c162/250 lr:0.000278 t:14.1s +tttg: c163/250 lr:0.000272 t:14.2s +tttg: c164/250 lr:0.000267 t:14.2s +tttg: c165/250 lr:0.000261 t:14.3s +tttg: c166/250 lr:0.000255 t:14.4s +tttg: c167/250 lr:0.000250 t:14.5s +tttg: c168/250 lr:0.000245 t:14.6s +tttg: c169/250 lr:0.000239 t:14.7s +tttg: c170/250 lr:0.000234 t:14.8s +tttg: c171/250 lr:0.000228 t:14.9s +tttg: c172/250 lr:0.000223 t:14.9s +tttg: c173/250 lr:0.000218 t:15.0s +tttg: c174/250 lr:0.000213 t:15.1s +tttg: c175/250 lr:0.000208 t:15.2s +tttg: c176/250 lr:0.000203 t:15.3s +tttg: c177/250 lr:0.000197 t:15.4s +tttg: c178/250 lr:0.000193 t:15.5s +tttg: c179/250 lr:0.000188 t:15.6s +tttg: c180/250 lr:0.000183 t:15.6s +tttg: c181/250 lr:0.000178 t:15.7s +tttg: c182/250 lr:0.000173 t:15.8s +tttg: c183/250 
lr:0.000168 t:15.9s +tttg: c184/250 lr:0.000164 t:16.0s +tttg: c185/250 lr:0.000159 t:16.1s +tttg: c186/250 lr:0.000154 t:16.2s +tttg: c187/250 lr:0.000150 t:16.2s +tttg: c188/250 lr:0.000145 t:16.3s +tttg: c189/250 lr:0.000141 t:16.4s +tttg: c190/250 lr:0.000137 t:16.5s +tttg: c191/250 lr:0.000132 t:16.6s +tttg: c192/250 lr:0.000128 t:16.7s +tttg: c193/250 lr:0.000124 t:16.8s +tttg: c194/250 lr:0.000120 t:16.8s +tttg: c195/250 lr:0.000116 t:16.9s +tttg: c196/250 lr:0.000112 t:17.0s +tttg: c197/250 lr:0.000108 t:17.1s +tttg: c198/250 lr:0.000104 t:17.2s +tttg: c199/250 lr:0.000100 t:17.3s +tttg: c200/250 lr:0.000096 t:17.4s +tttg: c201/250 lr:0.000093 t:17.5s +tttg: c202/250 lr:0.000089 t:17.6s +tttg: c203/250 lr:0.000085 t:17.6s +tttg: c204/250 lr:0.000082 t:17.7s +tttg: c205/250 lr:0.000078 t:17.8s +tttg: c206/250 lr:0.000075 t:17.9s +tttg: c207/250 lr:0.000072 t:18.0s +tttg: c208/250 lr:0.000069 t:18.1s +tttg: c209/250 lr:0.000065 t:18.2s +tttg: c210/250 lr:0.000062 t:18.2s +tttg: c211/250 lr:0.000059 t:18.3s +tttg: c212/250 lr:0.000056 t:18.4s +tttg: c213/250 lr:0.000053 t:18.5s +tttg: c214/250 lr:0.000051 t:18.6s +tttg: c215/250 lr:0.000048 t:18.7s +tttg: c216/250 lr:0.000045 t:18.8s +tttg: c217/250 lr:0.000043 t:18.9s +tttg: c218/250 lr:0.000040 t:18.9s +tttg: c219/250 lr:0.000038 t:19.0s +tttg: c220/250 lr:0.000035 t:19.1s +tttg: c221/250 lr:0.000033 t:19.2s +tttg: c222/250 lr:0.000031 t:19.3s +tttg: c223/250 lr:0.000029 t:19.4s +tttg: c224/250 lr:0.000027 t:19.5s +tttg: c225/250 lr:0.000025 t:19.5s +tttg: c226/250 lr:0.000023 t:19.6s +tttg: c227/250 lr:0.000021 t:19.7s +tttg: c228/250 lr:0.000019 t:19.8s +tttg: c229/250 lr:0.000017 t:19.9s +tttg: c230/250 lr:0.000016 t:20.0s +tttg: c231/250 lr:0.000014 t:20.1s +tttg: c232/250 lr:0.000013 t:20.2s +tttg: c233/250 lr:0.000011 t:20.2s +tttg: c234/250 lr:0.000010 t:20.3s +tttg: c235/250 lr:0.000009 t:20.4s +tttg: c236/250 lr:0.000008 t:20.5s +tttg: c237/250 lr:0.000007 t:20.6s +tttg: c238/250 lr:0.000006 t:20.7s +tttg: c239/250 lr:0.000005 t:20.8s +tttg: c240/250 lr:0.000004 t:20.8s +tttg: c241/250 lr:0.000003 t:20.9s +tttg: c242/250 lr:0.000003 t:21.0s +tttg: c243/250 lr:0.000002 t:21.1s +tttg: c244/250 lr:0.000001 t:21.2s +tttg: c245/250 lr:0.000001 t:21.3s +tttg: c246/250 lr:0.000001 t:21.4s +tttg: c247/250 lr:0.000000 t:21.5s +tttg: c248/250 lr:0.000000 t:21.5s +tttg: c249/250 lr:0.000000 t:21.6s +ttpr: phase:3/3 t:395.5s +ttp: b743/782 bl:2.3391 bb:1.0657 rl:2.2017 rb:1.0553 dl:2762-2805 gd:1 +ttp: b728/782 bl:2.3668 bb:1.0835 rl:2.2127 rb:1.0572 dl:2306-2324 gd:1 +ttp: b720/782 bl:2.3660 bb:1.0701 rl:2.2215 rb:1.0580 dl:2125-2144 gd:1 +ttp: b719/782 bl:2.3302 bb:1.0493 rl:2.2274 rb:1.0575 dl:2106-2125 gd:1 +ttp: b704/782 bl:2.2934 bb:1.0420 rl:2.2305 rb:1.0568 dl:1872-1885 gd:1 +ttp: b696/782 bl:2.3161 bb:1.0548 rl:2.2340 rb:1.0567 dl:1779-1790 gd:1 +ttp: b692/782 bl:2.3019 bb:1.0333 rl:2.2367 rb:1.0557 dl:1737-1746 gd:1 +ttp: b682/782 bl:2.3550 bb:1.0627 rl:2.2409 rb:1.0560 dl:1638-1646 gd:1 +ttp: b672/782 bl:2.3378 bb:1.0520 rl:2.2441 rb:1.0558 dl:1553-1562 gd:1 +ttp: b664/782 bl:2.3496 bb:1.0311 rl:2.2473 rb:1.0550 dl:1493-1499 gd:1 +ttp: b656/782 bl:2.3361 bb:1.1145 rl:2.2498 rb:1.0567 dl:1439-1445 gd:1 +ttp: b648/782 bl:2.2954 bb:1.0129 rl:2.2510 rb:1.0555 dl:1387-1392 gd:1 +ttp: b640/782 bl:2.3172 bb:1.0556 rl:2.2527 rb:1.0555 dl:1337-1343 gd:1 +ttp: b632/782 bl:2.3597 bb:1.0382 rl:2.2552 rb:1.0550 dl:1290-1297 gd:1 +ttp: b624/782 bl:2.3653 bb:1.0707 rl:2.2577 rb:1.0554 dl:1249-1255 gd:1 +ttp: b617/782 bl:2.3207 bb:1.0255 
rl:2.2590 rb:1.0547 dl:1211-1216 gd:1 +ttp: b609/782 bl:2.2866 bb:1.0244 rl:2.2596 rb:1.0541 dl:1172-1177 gd:1 +ttp: b601/782 bl:2.3433 bb:1.0259 rl:2.2612 rb:1.0535 dl:1137-1141 gd:1 +ttp: b593/782 bl:2.3041 bb:1.0171 rl:2.2620 rb:1.0528 dl:1103-1107 gd:1 +ttp: b585/782 bl:2.2982 bb:1.0424 rl:2.2626 rb:1.0526 dl:1069-1073 gd:1 +ttp: b577/782 bl:2.3006 bb:1.0355 rl:2.2632 rb:1.0523 dl:1037-1041 gd:1 +ttp: b569/782 bl:2.3163 bb:1.0473 rl:2.2641 rb:1.0522 dl:1007-1010 gd:1 +ttp: b561/782 bl:2.2579 bb:1.0185 rl:2.2640 rb:1.0517 dl:979-983 gd:1 +ttp: b554/782 bl:2.4391 bb:1.0980 rl:2.2665 rb:1.0524 dl:955-959 gd:1 +ttp: b547/782 bl:2.3430 bb:1.0531 rl:2.2676 rb:1.0524 dl:934-937 gd:1 +ttp: b540/782 bl:2.3602 bb:1.0781 rl:2.2689 rb:1.0528 dl:912-915 gd:1 +ttp: b533/782 bl:2.3815 bb:1.0713 rl:2.2703 rb:1.0530 dl:890-892 gd:1 +ttp: b525/782 bl:2.3613 bb:1.0234 rl:2.2715 rb:1.0526 dl:866-869 gd:1 +ttp: b517/782 bl:2.3628 bb:1.0310 rl:2.2726 rb:1.0524 dl:843-846 gd:1 +ttp: b510/782 bl:2.3919 bb:1.0776 rl:2.2740 rb:1.0527 dl:823-826 gd:1 +ttp: b502/782 bl:2.3305 bb:1.0327 rl:2.2746 rb:1.0524 dl:802-804 gd:1 +ttp: b494/782 bl:2.3345 bb:1.0640 rl:2.2752 rb:1.0526 dl:780-783 gd:1 +ttp: b486/782 bl:2.4182 bb:1.0865 rl:2.2767 rb:1.0529 dl:761-764 gd:1 +ttp: b478/782 bl:2.3480 bb:1.0812 rl:2.2774 rb:1.0532 dl:742-744 gd:1 +ttp: b470/782 bl:2.3636 bb:1.0637 rl:2.2783 rb:1.0533 dl:724-726 gd:1 +ttp: b462/782 bl:2.3446 bb:1.0406 rl:2.2789 rb:1.0532 dl:706-708 gd:1 +ttp: b454/782 bl:2.3955 bb:1.0880 rl:2.2800 rb:1.0535 dl:689-691 gd:1 +ttp: b446/782 bl:2.3094 bb:1.0855 rl:2.2802 rb:1.0538 dl:672-674 gd:1 +ttp: b437/782 bl:2.3070 bb:1.0615 rl:2.2804 rb:1.0538 dl:653-655 gd:1 +ttp: b429/782 bl:2.2545 bb:1.0282 rl:2.2802 rb:1.0536 dl:638-640 gd:1 +ttp: b421/782 bl:2.3026 bb:1.0081 rl:2.2804 rb:1.0533 dl:622-624 gd:1 +ttp: b414/782 bl:2.2192 bb:1.0161 rl:2.2799 rb:1.0530 dl:609-611 gd:1 +ttp: b406/782 bl:2.3236 bb:1.0700 rl:2.2803 rb:1.0531 dl:593-595 gd:1 +ttp: b398/782 bl:2.2556 bb:1.0073 rl:2.2801 rb:1.0528 dl:579-581 gd:1 +ttp: b390/782 bl:2.3560 bb:1.0615 rl:2.2806 rb:1.0528 dl:564-566 gd:1 +ttp: b382/782 bl:2.3040 bb:1.0886 rl:2.2808 rb:1.0530 dl:550-552 gd:1 +ttp: b374/782 bl:2.3133 bb:1.0428 rl:2.2810 rb:1.0530 dl:537-538 gd:1 +ttp: b366/782 bl:2.3468 bb:1.0751 rl:2.2814 rb:1.0531 dl:524-525 gd:1 +ttp: b358/782 bl:2.4140 bb:1.0834 rl:2.2822 rb:1.0533 dl:510-512 gd:1 +ttp: b350/782 bl:2.3377 bb:1.0624 rl:2.2825 rb:1.0534 dl:497-498 gd:1 +ttp: b342/782 bl:2.3878 bb:1.1296 rl:2.2831 rb:1.0538 dl:485-486 gd:1 +ttp: b334/782 bl:2.3934 bb:1.0758 rl:2.2838 rb:1.0539 dl:472-474 gd:1 +ttp: b326/782 bl:2.3330 bb:1.0684 rl:2.2840 rb:1.0540 dl:461-462 gd:1 +ttp: b319/782 bl:2.4110 bb:1.0872 rl:2.2847 rb:1.0542 dl:450-451 gd:1 +ttp: b312/782 bl:2.3217 bb:1.0575 rl:2.2849 rb:1.0542 dl:439-440 gd:1 +ttp: b304/782 bl:2.3537 bb:1.0796 rl:2.2852 rb:1.0543 dl:427-429 gd:1 +ttp: b296/782 bl:2.3986 bb:1.1043 rl:2.2858 rb:1.0546 dl:415-417 gd:1 +ttp: b286/782 bl:2.3870 bb:1.1134 rl:2.2862 rb:1.0548 dl:400-402 gd:1 +ttp: b278/782 bl:2.2783 bb:1.0672 rl:2.2862 rb:1.0549 dl:389-391 gd:1 +ttp: b270/782 bl:2.3285 bb:1.0654 rl:2.2864 rb:1.0549 dl:379-380 gd:1 +ttp: b263/782 bl:2.4014 bb:1.0863 rl:2.2869 rb:1.0551 dl:370-371 gd:1 +ttp: b255/782 bl:2.3744 bb:1.0950 rl:2.2872 rb:1.0552 dl:360-361 gd:1 +ttp: b248/782 bl:2.4694 bb:1.1918 rl:2.2879 rb:1.0557 dl:351-352 gd:1 +ttp: b241/782 bl:2.3536 bb:1.0939 rl:2.2882 rb:1.0559 dl:342-344 gd:1 +ttp: b233/782 bl:2.3682 bb:1.1315 rl:2.2885 rb:1.0562 dl:333-334 gd:1 +ttp: b225/782 bl:2.4454 
bb:1.1197 rl:2.2891 rb:1.0564 dl:323-324 gd:1
+ttp: b217/782 bl:2.3897 bb:1.1409 rl:2.2894 rb:1.0567 dl:314-315 gd:1
+ttp: b209/782 bl:2.4212 bb:1.1325 rl:2.2898 rb:1.0569 dl:305-306 gd:1
+ttp: b201/782 bl:2.3041 bb:1.0991 rl:2.2899 rb:1.0571 dl:297-298 gd:1
+ttp: b193/782 bl:2.3712 bb:1.1371 rl:2.2901 rb:1.0573 dl:288-289 gd:1
+ttp: b187/782 bl:2.4657 bb:1.1394 rl:2.2907 rb:1.0576 dl:281-282 gd:1
+ttp: b179/782 bl:2.3763 bb:1.1328 rl:2.2909 rb:1.0578 dl:273-274 gd:1
+ttp: b171/782 bl:2.4823 bb:1.1447 rl:2.2915 rb:1.0580 dl:266-266 gd:1
+ttp: b163/782 bl:2.3928 bb:1.1273 rl:2.2918 rb:1.0582 dl:257-259 gd:1
+ttp: b155/782 bl:2.4126 bb:1.1155 rl:2.2921 rb:1.0584 dl:250-251 gd:1
+ttp: b147/782 bl:2.4826 bb:1.1291 rl:2.2926 rb:1.0585 dl:242-243 gd:1
+ttp: b138/782 bl:2.3929 bb:1.1133 rl:2.2928 rb:1.0587 dl:233-234 gd:1
+ttp: b129/782 bl:2.4011 bb:1.1503 rl:2.2931 rb:1.0589 dl:225-226 gd:1
+ttp: b122/782 bl:2.4273 bb:1.1492 rl:2.2934 rb:1.0591 dl:219-219 gd:1
+ttp: b114/782 bl:2.4861 bb:1.1529 rl:2.2939 rb:1.0593 dl:211-212 gd:1
+ttp: b106/782 bl:2.4572 bb:1.1828 rl:2.2942 rb:1.0596 dl:204-205 gd:1
+ttp: b98/782 bl:2.6044 bb:1.2221 rl:2.2949 rb:1.0599 dl:197-198 gd:1
+ttp: b89/782 bl:2.5011 bb:1.1558 rl:2.2953 rb:1.0601 dl:189-190 gd:1
+ttp: b81/782 bl:2.4903 bb:1.1302 rl:2.2956 rb:1.0602 dl:182-183 gd:1
+ttp: b74/782 bl:2.4908 bb:1.1559 rl:2.2960 rb:1.0604 dl:175-176 gd:1
+ttp: b65/782 bl:2.4670 bb:1.1701 rl:2.2963 rb:1.0606 dl:167-169 gd:1
+ttp: b58/782 bl:2.5305 bb:1.2281 rl:2.2967 rb:1.0609 dl:161-162 gd:1
+ttp: b50/782 bl:2.4112 bb:1.1685 rl:2.2969 rb:1.0610 dl:153-154 gd:1
+ttp: b42/782 bl:2.4895 bb:1.2122 rl:2.2972 rb:1.0612 dl:145-146 gd:1
+ttp: b30/782 bl:2.6049 bb:1.2701 rl:2.2976 rb:1.0615 dl:133-134 gd:1
+ttp: b23/782 bl:2.6148 bb:1.2281 rl:2.2980 rb:1.0617 dl:126-127 gd:1
+ttp: b15/782 bl:2.6528 bb:1.2320 rl:2.2984 rb:1.0619 dl:115-117 gd:1
+ttp: b7/782 bl:2.7632 bb:1.2436 rl:2.2989 rb:1.0621 dl:101-103 gd:1
+quantized_ttt_phased val_loss:2.33108005 val_bpb:1.06521262 eval_time:513753ms
+total_eval_time:513.8s