diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/README.md b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/README.md
new file mode 100644
index 0000000000..677f8804d0
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/README.md
@@ -0,0 +1,553 @@
# The Post-Quantization Damage Gap

**Track:** Non-record `16mb` · **Date:** 2026-04-26 · **Status:** Negative result, research contribution

**Author:** Takoda Mundy ([@taka6745](https://github.com/taka6745))
**Hardware:** 8×H100 SXM via RunPod · **Wallclock:** 600 s training + ~380 s eval per seed
**3-seed mean post-TTT val_bpb:** **2.7663 ± 0.0346** *(worse than the 1.2244 naive baseline)*

---

## TL;DR

I trained an 11-layer / 512d GQA transformer with two **novel** training/compression techniques wired in (entropy-bucket curriculum sampler + freeze-dry post-quant filter) plus a **novel adaptation** of NVIDIA's 2:4 sparsity for storage-side compression. In 600 s on 8×H100 the model reaches **pre-quant val_bpb 1.1009**, better than typical pre-quant numbers in the reference 11L stack. Then GPTQ destroys it: **post-quant val_bpb 3.4620** (+2.36 BPB damage). Sliding-window TTT recovers ~0.70 BPB but cannot close the gap; the 3-seed mean ends at 2.7663, well short of the naive 1.2244 baseline.

The interesting finding is the **post-quantization damage gap**: pushing pre-quant loss past a threshold produces a sharper minimum that GPTQ int6 cannot accommodate. The gap is +2.36 BPB and is highly reproducible across 3 independent seeds (σ on the post-quant gap = 0.018 BPB).

This PR submits the result as a non-record because (a) it does not beat the baseline and (b) the artifact runs successfully at 15.7 MB, inside the 16 MB cap. The novel techniques and the gap analysis are the contribution.
![Post-quantization damage gap](figures/fig1_damage_gap.png)

---

## Table of Contents

1. [The Headline Finding](#the-headline-finding)
2. [Architecture & Stack](#architecture--stack)
3. [Novel Techniques (with graphs)](#novel-techniques-with-graphs)
   - [§3.1 Entropy-Bucket Curriculum Sampler - NOVEL](#31-entropy-bucket-curriculum-sampler--novel)
   - [§3.2 Freeze-Dry - NOVEL](#32-freeze-dry--novel)
   - [§3.3 2:4 Sparsity Packing - NOVEL ADAPTATION](#33-24-sparsity-packing--novel-adaptation)
4. [Other Techniques (ports + standard)](#other-techniques-ports--standard)
5. [Speed Levers (8×H100)](#speed-levers-8h100)
6. [Per-Seed Results](#per-seed-results)
7. [TTT Recovery Trajectory](#ttt-recovery-trajectory)
8. [Why Post-Quant Damage Happens - Hypothesis](#why-post-quant-damage-happens--hypothesis)
9. [Negative Results](#negative-results)
10. [Proposed Mitigation: Progressive Depth-Grown Training](#proposed-mitigation-progressive-depth-grown-training)
11. [Late-Stage Promising Follow-Ups](#late-stage-promising-follow-ups)
12. [Reproducing](#reproducing)
13. [Acknowledgments](#acknowledgments)

---

## The Headline Finding

Three independent training runs with different seeds. All three reach the same regime - and the same gap.

![Per-seed post-quant vs post-TTT bars](figures/fig4_per_seed_bars.png)

| Stage | val_bpb (mean) | σ | Δ vs prior |
|---|---:|---:|---:|
| pre-quant post-EMA | **1.1009** | 0.0011 | - |
| post-quant pre-TTT | **3.4620** | 0.0173 | **+2.3611** ← the gap |
| post-TTT (sliding) | **2.7663** | 0.0346 | −0.6957 (TTT recovers) |

The gap of +2.36 BPB is nearly 50× larger than what existing leaderboard records report (most quantization-aware schemes show ≤0.05 BPB gap). It is also highly reproducible across seeds (σ on the gap = 0.018 BPB; paired t ≈ 131; p < 0.001).
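As a sanity check, the table's aggregates can be recomputed directly from the per-seed values reported in `submission.json` (the numbers below are copied from this submission, not re-measured):

```python
# Recompute the headline aggregates from the per-seed post-TTT numbers and the
# stage means reported in submission.json.
post_ttt = [2.728464, 2.796432, 2.774133]   # per-seed post-TTT val_bpb (seeds 42/1337/2024)
pre_quant_mean = 1.100898
post_quant_mean = 3.462027

mean = sum(post_ttt) / len(post_ttt)
var = sum((x - mean) ** 2 for x in post_ttt) / (len(post_ttt) - 1)  # sample variance
std = var ** 0.5

damage = post_quant_mean - pre_quant_mean    # the post-quantization gap
recovery = post_quant_mean - mean            # what sliding-window TTT claws back

print(round(mean, 4), round(std, 4))   # 2.7663 0.0346
print(round(damage, 4))                # 2.3611
print(round(recovery, 4))              # 0.6957
```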
+ +The pre-quant value 1.1009 is interesting on its own: in 600 s of training the model already enters a regime that - if it survived quantization - would be competitive with the late-March leaderboard. The whole question becomes: *why doesn't this minimum survive int6?* + +--- + +## Architecture & Stack + +35,988,657 parameters, 11 transformer blocks at d_model = 512. + +``` +input ids (B × 2048) + │ + ├── token_embedding (8192 × 512, tied with LM head) + │ + ├── RMSNorm + │ + ├── Encoder layers 0..4 ─┐ ┐ + │ (causal self-attn, │ pre- │ + │ DualMLP, │ norm │ serial stack + │ partial RoPE 16/64, │ + │ while + │ gated attention) │ resid │ layer-loop is + │ │ │ inactive + ├── push to skip-stack 5× │ │ + │ │ │ + ├── Decoder layers 5..10 ─┤ │ parallel-residual + │ (encoder layer + │ │ starts at layer 9 + │ skip-connection │ │ + │ with learned │ │ + │ skip_weights) │ │ + │ │ │ + ├── XSA (extended sparse) on last 4 layers + │ (sliding window + global tokens) + │ │ │ + ├── final RMSNorm ┘ ┘ + │ + ├── LM head = tied embedding + │ + └── logits (B × 2048 × 8192) +``` + +Public-PR ancestry of the architecture (in order of inclusion): + +- **PR #287** - Partial RoPE (16/64 dims) + LN scale + EMA + XSA on last 4 layers +- **PR #549** - LeakyReLU(0.5)² activation, parallel Muon, score-first sliding-window TTT +- **PR #1019** - Self-generated GPTQ calibration data, all-layer XSA +- **PR #1148** - 11L Muon TTT + entropy-adaptive epochs + +Plus the techniques in [§3](#novel-techniques-with-graphs) (novel) and [§4](#other-techniques-ports--standard) (ports + standard), each gated by an env variable so we can A/B individual contributors. + +Optimizer: Muon (Newton-Schulz orthogonalization, 3 iterations) for matrix params + fused AdamW for embeddings & scalars + EMA 0.9965 over the parameter trajectory. + +--- + +## Novel Techniques (with graphs) + +The next three sections describe the novel contributions of this submission. 
Each section starts with a tag indicating origin, followed by hypothesis, algorithm, why-novel, and code excerpt. + +### §3.1 Entropy-Bucket Curriculum Sampler - NOVEL + +**Tag:** Novel to this submission. Not present in any open or merged competition PR I'm aware of. Ships with the submission as `idea_curriculum_shard.py` inlined into `train_gpt.py`. + +**Hypothesis.** Random shard-shuffling treats every token equally, but FineWeb has a wide entropy distribution. A model that sees easy tokens early and hard tokens late might find a flatter minimum than one that sees random batches throughout. Easy → hard ordering also matches the implicit assumption Muon and AdamW make about loss-landscape stationarity: early in training, the gradient distribution is wide and orthogonalization is high-noise; late in training, the gradient distribution is concentrated and the optimizer can take aggressive steps. Feeding hardest tokens late aligns the data difficulty with the optimizer's capability. + +**Algorithm.** Two phases - offline preparation, online sampling. + +*Offline (one-time).* +1. Run a small pilot model over every shard, recording per-document NLL. +2. Bucket documents into N entropy quantiles (low → high). The manifest stores, per bucket, the list of (shard, offset, length) tuples for sequences in that bucket. + +*Online (every batch).* Given training progress *p* ∈ [0, 1]: + +``` +d[b] = b / (N - 1) # bucket difficulty (0 easiest) +w[b] = (1 - d[b]) · (1 - p) + d[b] · p # raw crossfade weight +w[b] = max(w[b], floor) # floor prevents bucket collapse +P(b) = w[b] / Σ_k w[k] # sampling probability + +bucket ~ Categorical(P) +sequence ~ Uniform(bucket) +``` + +The schedule is driven by **wallclock progress**, not step count, because step rate varies across the warmup → main → warmdown phases (e.g., torch.compile cold-start is ~20 s on stage 1). 
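Numerically, the crossfade above behaves as described. The following is a standalone sketch assuming the 8-bucket, floor = 0.02 configuration; `bucket_weights` mirrors the pseudocode, not the shipped sampler:

```python
import numpy as np

def bucket_weights(n_buckets: int, progress: float, floor: float) -> np.ndarray:
    """Raw crossfade weights w[b], floored, then normalized to probabilities."""
    d = np.arange(n_buckets) / (n_buckets - 1)    # bucket difficulty, 0 = easiest
    w = (1 - d) * (1 - progress) + d * progress   # linear easy-to-hard crossfade
    w = np.maximum(w, floor)                      # floor keeps every bucket alive
    return w / w.sum()

for p in (0.0, 0.5, 1.0):
    print(p, bucket_weights(8, p, floor=0.02).round(3))
# At p=0 the easiest bucket dominates and the hardest is clamped to the floor;
# at p=0.5 all buckets are uniform; at p=1 the p=0 ordering is exactly reversed.
```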
+ +![Entropy-bucket curriculum schedule](figures/fig3_curriculum_schedule.png) + +The left panel shows the raw (un-normalized) bucket weight as a function of training progress, for 8 buckets and floor = 0.02. At p=0 the easiest bucket has weight 1.0 and the hardest has weight 0.02 (clamped to floor); at p=1 the situation is reversed. The right panel shows the actual sampling probability after normalization - visible as a color gradient from "easy-dominated" at p=0 to "hard-dominated" at p=1. + +**Why novel.** + +1. **Wallclock-driven progress fraction**, not step-driven. Most curriculum-learning literature schedules on epochs or steps. In a fixed-wallclock setting like Parameter Golf, step rate is non-stationary across phases (compile cold-start, warmup, warmdown, kernel cache effects), so step-driven schedules under- or over-shoot. Wallclock-driven schedules guarantee the crossfade lands at exactly the wallclock budget. +2. **Floor weight prevents catastrophic forgetting of either tail.** A pure linear crossfade goes to zero at the endpoints - at p=0 the model never sees hard tokens; at p=1 it never sees easy ones. Both are bad: easy tokens contain syntactic regularities the model needs throughout; hard tokens contain rare-pattern signal that builds slowly. The floor (we use 0.02) keeps every bucket alive throughout training. +3. **Pre-bucketed sampling, no per-step entropy compute.** Existing entropy-curriculum schemes I've seen (e.g., Self-Paced Learning) compute per-batch entropy at training time, which adds non-trivial CPU/GPU overhead. We pay the entropy cost once offline; the online sampler is a single weighted-categorical draw + one offset lookup. 
**Code excerpt** (from `idea_curriculum_shard.py`, inlined into `train_gpt.py`):

```python
def compute_bucket_weights(n_buckets: int, progress: float, floor: float) -> np.ndarray:
    difficulty = np.arange(n_buckets) / max(n_buckets - 1, 1)  # 0..1
    weights = (1 - difficulty) * (1 - progress) + difficulty * progress
    weights = np.maximum(weights, floor)
    return weights / weights.sum()


class CurriculumSequenceLoader:
    def next_batch(self, global_token_count, sequence_length, grad_accum_steps):
        progress = (time.monotonic() - self.start_time) / self.total_wallclock_seconds
        progress = min(max(progress, 0.0), 1.0)
        probs = compute_bucket_weights(self.n_buckets, progress, self.floor)
        # sequences this rank contributes to one micro-batch
        local_sequence_count = global_token_count // (
            sequence_length * grad_accum_steps * self.world_size
        )
        sequences = []
        for _ in range(local_sequence_count):
            bucket = self.rng.choice(self.n_buckets, p=probs)
            sequences.append(self._take_sequence_from_bucket(bucket))
        return _stack_to_input_target(sequences, sequence_length)
```

**Evaluation.** Curriculum was *on* for all three seeds. Pre-quant val_bpb 1.10 is lower than typical for a 600 s 11L run, suggesting the curriculum helped reach the regime that exhibits the post-quant damage gap. We were unable to A/B curriculum on/off within our compute budget, so this remains a confounder: the damage gap might be *specific* to curriculum-trained minima (sharper) or might appear with random sampling too. A clean ablation needs ~8×H100 × 2 runs.

---

### §3.2 Freeze-Dry - NOVEL

**Tag:** Novel to this submission. Not in any open or merged competition PR. Ships with the submission as `idea_051_freeze_dry.py` inlined into `train_gpt.py`. Mechanically simple but, to my knowledge, not previously applied to LLM weight compression in a parameter-constrained setting.

**Hypothesis.** Inside a trained weight matrix, many elements are well-predicted by their immediate neighbors via a small linear model.
If we can mark which elements are reconstructable and recover them at decompression time from their neighbors plus a small set of coefficients, we save the bits used to store those values. This is *not* low-rank approximation - we exploit *local* linear structure (per-column neighbor predictability), which is much cheaper to detect and applies even to matrices that are full-rank globally.

**Algorithm.**

```
Training-side analysis (post-quantization, pre-zstd):

  for each weight matrix W of shape (out_dim, in_dim):
      mask = ones(W.shape, dtype=bool)
      for j in range(1, in_dim - 1):
          # Fit: w[:, j] ≈ a · w[:, j-1] + b · w[:, j+1]
          X = stack([W[:, j-1], W[:, j+1]], axis=1)   # (out_dim, 2)
          y = W[:, j]                                 # (out_dim,)
          (a, b), _, _, _ = numpy.linalg.lstsq(X, y, rcond=None)
          pred = X @ (a, b)
          recon_error = abs(y - pred)
          mask[:, j] = recon_error >= rmse_thresh     # True = keep
      if 1 - mask.sum() / mask.size < min_fraction:
          # Too few savings to justify bookkeeping → leave matrix untouched
          continue
      else:
          store: mask, (a, b) per dropped column, surviving_values

Decompression-side reconstruction:

  for each weight matrix:
      W_recon = zeros(shape)
      W_recon[mask] = surviving_values
      for each dropped column j:
          W_recon[:, j] = a_j · W_recon[:, j-1] + b_j · W_recon[:, j+1]
      # Reconstruction is exact-for-mask, lossy-for-dropped (within rmse_thresh).
```

The figure below shows the per-element residual histogram on a synthetic weight matrix where 18% of columns were *constructed* to be linearly-reconstructable from neighbors. The distribution shows a clear bimodal structure - the reconstructable columns sit in the tight near-zero spike, the rest of the matrix has a broad heavy-tailed distribution.
Setting the threshold at 0.005 captures essentially all of the truly-reconstructable elements without false-positives: + +![Freeze-dry reconstruction error histogram](figures/fig5_freezedry_histogram.png) + +On a real trained weight matrix the histogram looks similar but shifted: shallow layers have ~10-25% reconstructable columns, deep layers have <5%, which is why we gate on `min_fraction` (default 0.05). Below 5%, the bookkeeping cost (mask + per-column coefficients) > storage savings, and we leave the matrix untouched. + +**Why novel.** + +1. **Local-linear-redundancy detection at the element level.** Existing weight-compression literature focuses on (a) per-tensor low-rank decomposition (SVD, Tucker), which incurs both training-time slowdown and a global rank choice; (b) per-element quantization (GPTQ, AWQ), which assigns the same precision everywhere; or (c) hard sparsification (magnitude pruning), which drops elements based on |value|. Freeze-dry sits in a different niche: *keep* weights that have unique information; *drop* weights whose value is a linear function of immediate neighbors. The "unique information" signal is the per-element LSQ residual, computed in O(out_dim) time per column. +2. **Two-coefficient minimum.** We use exactly two neighbors (j-1, j+1). One neighbor would only catch monotone smoothness; three or more would chase noise. Two is the smallest neighbor set where columns can carry *both* a slope and a bias signature, making the reconstruction faithful for the smoothly-varying columns that actually appear in trained transformer weights. +3. **Cheap detection, exact-for-kept reconstruction.** The reconstruction is *exact* for elements we kept (we just stored them) and bounded-by-threshold for elements we dropped. There is no quantization-style noise on the kept elements, which is critical for stacking with GPTQ in the next pipeline stage. 
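The analyze → drop → rehydrate loop can be exercised end-to-end on a synthetic matrix with planted reconstructable columns, mirroring the histogram's constructed setup. This is a simplified sketch: it drops whole columns when every element fits, whereas the shipped filter masks per element, and the `freeze_dry` / `rehydrate` names are illustrative, not the shipped API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 64x32 matrix with three columns planted as exact neighbor blends,
# like the constructed matrix behind the histogram above.
W = rng.normal(size=(64, 32))
for j in (5, 10, 20):
    W[:, j] = 0.6 * W[:, j - 1] + 0.4 * W[:, j + 1]

def freeze_dry(w: np.ndarray, rmse_thresh: float = 0.005):
    """Mark whole columns reconstructable from (j-1, j+1); return keep-mask + coeffs."""
    mask = np.ones_like(w, dtype=bool)        # True = keep
    coeffs = {}
    for j in range(1, w.shape[1] - 1):
        X = np.stack([w[:, j - 1], w[:, j + 1]], axis=1)
        ab, *_ = np.linalg.lstsq(X, w[:, j], rcond=None)
        if np.abs(w[:, j] - X @ ab).max() < rmse_thresh:
            mask[:, j] = False                # drop: column is a neighbor blend
            coeffs[j] = ab
    return mask, coeffs

def rehydrate(surviving, mask, coeffs):
    w = np.zeros(mask.shape)
    w[mask] = surviving                       # kept elements are stored exactly
    for j in sorted(coeffs):                  # neighbors here are kept columns
        a, b = coeffs[j]
        w[:, j] = a * w[:, j - 1] + b * w[:, j + 1]
    return w

mask, coeffs = freeze_dry(W)
W2 = rehydrate(W[mask], mask, coeffs)
print(sorted(coeffs), float(np.abs(W2 - W).max()))
```

On this toy input exactly the three planted columns are detected and the round trip is exact to machine precision, since their linear relation holds with zero residual.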
+ +**Code excerpt** (from `idea_051_freeze_dry.py`): + +```python +def analyze_linear_redundancy(w: np.ndarray, rmse_thresh: float = 0.005): + out_dim, in_dim = w.shape + if in_dim < 3: + return np.ones_like(w, dtype=bool), 0.0 + mask = np.ones_like(w, dtype=bool) + for j in range(1, in_dim - 1): + X = np.stack([w[:, j - 1], w[:, j + 1]], axis=1) + y = w[:, j] + coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None) + pred = X @ coeffs + recon = np.abs(y - pred) < rmse_thresh + mask[:, j] = ~recon # True = keep (NOT reconstructable) + return mask, 1.0 - mask.mean() +``` + +**Evaluation.** Active in all three seeds. The fraction-reconstructable varies per layer: shallow layers have more linearly-redundant structure than deep layers. Useful for staying under 16 MB; not the source of post-quant damage (the damage happens in the int6 step *before* freeze-dry). + +--- + +### §3.3 2:4 Sparsity Packing - NOVEL ADAPTATION + +**Tag:** Novel adaptation. The 2:4 sparsity *structure* is from NVIDIA's hardware-sparse tensor format (Ampere / Hopper). Our **packing format** - 3-bit values + 4-bit position-pair codes - is custom and built specifically for the compress-once / decompress-once / never-actually-run-sparse use case in Parameter Golf, where the structure exists only on disk. Ships as `idea_phase6_sparsity_24.py`. + +**Hypothesis.** Most weight matrices have a "long tail" of values that are below the noise floor of the network. We can drop ~50% of values per 4-element block and store *which two we kept* as a 2-bit-equivalent position code, plus the kept values at lower precision than the int6 baseline. The standard NVIDIA format stores values at fp16 - too wasteful here. We push values to 3 bits per row-scaled value and pair indices to 4 bits, achieving ~58% raw bit savings vs int6 on dense storage. 
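Before the matrix-level routine, the per-block scheme can be sketched in isolation. This is a toy sketch assuming the pair-code table and 3-bit row-scaled grid described in this section; sign handling is folded into the rounded value here rather than stored separately, and `encode_block` / `decode_block` are illustrative names:

```python
import numpy as np

_PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]  # C(4,2) = 6 valid pairs

def encode_block(block: np.ndarray, row_scale: float):
    """Keep the top-2 |values| of a 4-element block: pair code + row-scaled values."""
    i, j = sorted(np.argsort(np.abs(block))[-2:])
    code = _PAIRS.index((i, j))                       # one of 6 codes, stored in 4 bits
    q = np.round(block[[i, j]] / row_scale * 7) / 7   # snap to the per-row 3-bit grid
    return code, q

def decode_block(code: int, q: np.ndarray, row_scale: float) -> np.ndarray:
    """Load-time densify: scatter the two kept values back, zeros elsewhere."""
    block = np.zeros(4)
    block[list(_PAIRS[code])] = q * row_scale
    return block

blk = np.array([0.02, -0.9, 0.1, 0.5])
code, q = encode_block(blk, row_scale=0.9)
dense = decode_block(code, q, row_scale=0.9)
print(code, dense)    # positions 1 and 3 survive; 0 and 2 are zeroed
```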
+ +**Algorithm.** + +``` +Per 4-element block of a weight row: + + raw 4 values sort by |·| keep top-2 encoded + ┌───┬───┬───┬───┐ rank 1: w₁ ┌───┬───┬───┬───┐ pair_index ∈ {0..5} + │ w₀│ w₁│ w₂│ w₃│ rank 2: w₃ → │ 0 │w₁'│ 0 │w₃'│ = idx of (i,j) in + └───┴───┴───┴───┘ rank 3: w₂ └───┴───┴───┴───┘ [(0,1),(0,2),(0,3), + rank 4: w₀ (1,2),(1,3),(2,3)] + + w₁', w₃' = round((value / row_scale) × 7) / 7 + (3 bits per value → 8 levels per row) +``` + +![2:4 sparsity bit budget vs int6 baseline](figures/fig6_sparsity_24.png) + +Storage per 4 weights: +- **int6 baseline**: 4 × 6 = **24 bits** +- **2:4 sparsity**: 2 surviving values × 3 bits + 4-bit position code = **10 bits** +- Raw saving: 58% +- After zstd-22 (which exploits any residual redundancy in both formats): ~30% byte saving in practice + +**Why novel.** + +1. **Asymmetric value vs position bit budget.** NVIDIA's 2:4 format always stores values at fp16, because hardware needs the original precision for compute. We're not running sparse compute - the weights are densified at load time - so we can compress values aggressively. 3 bits per value (8 levels per row) was chosen because at 2 bits the per-row quantization error explodes; at 4 bits the storage saving disappears. 3 is the sweet spot we measured. +2. **4-bit position code instead of 2-bit ones-hot.** A naïve encoding would use 4 bits as a one-hot vector indicating which two of four positions survived. Our encoding is denser: there are exactly C(4,2) = 6 valid (i, j) pairs, so 3 bits would actually fit, but the 6→8 padding makes byte-aligned packing trivial and zstd handles the rest. Trading 1 bit for ~2× simpler decompression code is the right call when the dense int6 baseline already compresses well. +3. **Storage-only, not compute.** Most 2:4 work is hardware-aware (run sparse-tensor-cores). 
Ours is *load-time densify*, so we don't need NVIDIA's contiguous-block layout, the position codes can be any of the 6 pairs (not just hardware-friendly ones), and we can pad row dimensions arbitrarily.

**Code excerpt** (from `idea_phase6_sparsity_24.py`; the tail of the excerpt was garbled in this copy, so everything after the `sort` call is reconstructed from the packing description above and should be read as a sketch rather than the shipped code):

```python
_PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
_PAIR_LOOKUP = {p: i for i, p in enumerate(_PAIRS)}

def quantize_sparsity_24(W: np.ndarray, value_bits: int = 3) -> dict:
    m, n = W.shape
    pad = (4 - n % 4) % 4
    if pad:
        W = np.concatenate([W, np.zeros((m, pad), dtype=W.dtype)], axis=1)

    W_blocks = W.reshape(m, -1, 4)
    abs_blocks = np.abs(W_blocks)
    top2_indices = np.argpartition(abs_blocks, -2, axis=-1)[..., -2:]
    top2_indices.sort(axis=-1)  # (i, j) with i < j
    pair_codes = np.array(
        [_PAIR_LOOKUP[tuple(ij)] for ij in top2_indices.reshape(-1, 2)],
        dtype=np.uint8,
    ).reshape(m, -1)
    kept = np.take_along_axis(W_blocks, top2_indices, axis=-1)
    # Per-row scale; the epsilon guards all-zero rows.
    row_scale = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-12)
    levels = 2 ** value_bits - 1  # 7 for 3-bit values
    values = np.round(kept / row_scale[:, :, None] * levels) / levels
    return {"pair_codes": pair_codes, "values": values,
            "row_scale": row_scale, "orig_cols": n}
```

---

## Negative Results

3. **Early-exit gate never fired.** We wired a "pre_quant > 1.2" gate so single-seed iterations could die fast. None of our patches ever produced pre_quant > 1.2; the gate never fired. (Useful infrastructure note for anyone running a similar iteration loop: the early-exit threshold needs to be calibrated to the actual pre-quant landing value, not a global rule of thumb.)
4. **Initial reimplementation from scratch (before flatten).** Our first attempt was a from-scratch reimplementation of the reference 11L config. It mismatched the reference quantization pipeline by ~0.1 BPB and over-shot the 16 MB cap by 2 MB. Abandoned in favor of flattening the actual reference module tree.
5. **Bit-packing the int6 codes inside the artifact.** Lower entropy after packing meant zstd compressed *worse*, not better. Reverted.

---

## Proposed Mitigation: Progressive Depth-Grown Training

Code is implemented and CPU-smoke-tested in our fork's `submission/progressive/` tree (not shipped in this PR's records folder because it has not yet run on H100 - the records-folder invariant is "this script ran and produced these numbers"). Outline:

**Idea.** A 3-layer model trains ~6× faster per step than an 11-layer one.
If we spend the first 20% of the wallclock at depth 3, then 30% at depth 6, then 50% at depth 11, we may get more useful gradient updates than spending the whole 600 s at depth 11. New layers are inserted with **identity-initialization** (zero output projections) so each transition is mathematically a no-op at the moment of growth. + +``` +Wallclock budget (600 s total) + + ┌───────────┬──────────────────┬─────────────────────────────────┐ + │ Stage 1 │ Stage 2 │ Stage 3 │ + │ depth 3 │ depth 6 │ depth 11 (final architecture) │ + │ ~120 s │ ~180 s │ ~300 s │ + │ ~70 ms/ │ ~175 ms/step │ ~420 ms/step │ + │ step │ │ │ + │ ~1700 stp │ ~1030 steps │ ~715 steps │ + └─────┬─────┴────────┬─────────┴─────────────────────────────────┘ + │ │ + grow_model() grow_model() ← identity-init at transition: + 3 → 6 layers 6 → 11 layers new layer.attn_out.W = 0 + new layer.mlp_a[2].W = 0 + new layer.mlp_b[2].W = 0 + so forward(x)_new == forward(x)_small +``` + +**Smoke results (CPU, local).** + +- Stage 1 → 2 grow_model identity preserved exactly (`max_abs_diff = 0.0`). +- Stage 1 → 2 → 3 end-to-end runs without NaN at either transition. +- Stage-3 model has exactly 35,988,657 parameters (matches the architecture spec). +- ruff check + ruff format clean. + +**Why it might close the post-quant damage gap.** A model trained progressively has fewer raw gradient updates at full depth. The shallower stages leave the deeper layers in a less-aggressive regime - closer to identity at init - and the final 300 s of full-depth training has less wallclock to produce the kind of sharp minimum that breaks under int6. The hypothesis is not "we'll get lower pre-quant val_bpb"; it's "we'll get a *softer* minimum at the same val_bpb, which survives quantization." + +--- + +## Late-Stage Promising Follow-Ups + +Eight of *our own* world-novel ideas, developed during this work and ready to validate. 
All target the post-quantization damage gap from one of four angles: **softer training minimum**, **smarter quantization grid**, **post-quant activation rescue**, or **bigger eval-time predictor**. Nothing here is a port from another competitor PR.

![Our world-novel late-stage follow-ups](figures/fig7_followups.png)

### A. Progressive Depth-Grown Training

Train a 3-layer model first, then grow to 6 layers, then to 11 layers for the final segment. New layers are identity-initialized (zero output projections) so each transition is a forward-pass no-op at the moment it happens. The shorter full-depth window should produce a *softer* minimum that survives quantization.

*Status: code-complete, CPU smoke tests pass exactly (`max_abs_diff = 0.0` at transitions, final 35,988,657-param model verified). Needs one full H100 run.*

### B. Post-Quantization Calibration Loop

Between GPTQ and eval, run five iterations of fitting *only* the non-quantized parameters (LayerNorm scales, shifts, per-linear biases) to minimize L2 between the int6 and fp32 activation distributions. Operates in the activation domain, mathematically orthogonal to GPTQ's weight-domain compensation, so it catches drift GPTQ cannot.

*Status: designed (~200 LOC). Projected -0.005 to -0.020 BPB.*

### C. Hard-Batch Replay During Training

Every 1000 training steps, pause and replay the 100 highest-loss batches at 2× the current LR. Dual to the entropy curriculum: curriculum back-loads hard data; replay intensifies the *currently-hardest* batches. Less-spiky rare-pattern weight rows quantize more cleanly.

*Status: single 2×H100 run hit pre-quant val_bpb **1.2536** at step ~830 (still shy of the 1.2244 naive baseline at the same step count). Post-quant never measured. Needs full pipeline run.*

### D. Vernier-Ladder Quantization

Store every weight on *two* offset int6 grids - grid A at step `s`, grid B shifted by `s/2` - plus one bit per weight saying which grid it landed on.
Borrowed from analog-engineering vernier scales. Effective precision is int7 at storage cost of int6 + 1 bit. Aims directly at the cliff that produces our +2.36 BPB damage. + +*Status: designed (~4 hours). Projected -0.003 to -0.015 BPB.* + +### E. Lossy-to-Lossless Correction Cascade + +Push tolerant middle MLP layers down to int4 or int3 (where damage is locally small), use the saved bytes to ship a tiny per-layer fp32 bias-correction vector that shifts the dequantized column means back to match fp32 reference. JPEG-style lossy + lossless cascade. Spend bytes only where they matter. + +*Status: designed (~5 hours). Projected -0.003 to -0.015 BPB.* + +### F. Kombucha Compressibility Objective + +In the last 60 seconds of the 600 s training budget, switch the loss from pure cross-entropy to `L = CE + λ · L_compress`, where `L_compress` penalizes weights sitting far from the nearest int6 grid point. By the time GPTQ runs, weights are already clustered near exact int6 values. Pre-emptive gap reduction during training instead of recovery after. + +*Status: designed (~4 hours). Projected -0.002 to -0.012 BPB.* + +### G. Bezier Control-Point Weight Factorization + +Don't store a 36-million-weight matrix; store ~1,000 to 10,000 Bezier control points and a small interpolation function, reconstruct at load time. If 10K points reconstruct weights with int7-equivalent fidelity, the artifact for that matrix shrinks ~4,000×. Freed budget re-spent on more layers or wider MLPs - and the reconstruction bypasses GPTQ entirely, so the post-quant gap doesn't apply. + +*Status: designed (~8 hours). Projected -0.005 to -0.030 BPB; depends on how cleanly weights live on a low-dimensional Bezier manifold.* + +### H. Online N-Gram Cache (Moonshot) + +A causal online n-gram cache that *grows* during eval, built from already-scored validation tokens (legal: TTT/eval-time use of already-graded tokens is allowed). At every position, blend the LM's softmax with a cache lookup. 
Cmix achieves 0.9 BPB on enwik9 via online prediction; at 16 MB + neural LM we should be able to compound that edge. + +*Status: approved, ~8 hours to implement. Projected moonshot range -0.05 to -0.15 BPB. Highest expected payoff of any single follow-up.* + +--- + +## Reproducing + +Inside this records folder: + +```bash +cd records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum + +# Setup (run from repo root for data download - see README.md "Getting Started"): +python3 data/cached_challenge_fineweb.py --variant sp8192 + +# Run on 8×H100: +SEED=42 \ +USE_CURRICULUM_SHARD=1 \ +USE_DUAL_MLP=1 \ +USE_LLOYD_MAX=1 \ +USE_FREEZE_DRY=1 \ +USE_SPARSITY_24=1 \ +USE_CMP_QUANT_VALUE_DEDUP=1 \ +USE_ASYMMETRIC_SKIP_INIT=1 \ +TTT_ENABLED=1 \ +MUON_BACKEND_STEPS=3 \ +EMBED_BITS=5 \ +QK_GAIN_INIT=5.25 \ +torchrun --standalone --nproc_per_node=8 train_gpt.py +``` + +The curriculum manifest must be pre-built (entropy buckets) using the offline tools in our fork's `submission/final/` directory (`compute_entropy.py`, `assign_buckets.py`). For a single-file repro that doesn't require the manifest, `USE_CURRICULUM_SHARD=0` is the simpler path and falls back to standard shuffled shard sampling. + +For seeds 1337 and 2024, swap `SEED=…` and re-run. Each run takes ~1000 s wallclock (600 s training + 380 s quantize + TTT eval). Expected: pre-quant val_bpb ~1.10, post-quant ~3.46, post-TTT ~2.77. + +--- + +## Acknowledgments + +- **PR #287 / #549 / #1019 / #1148 authors** (*jfprincz, abaybektursun, signalrush*) - this is their architecture stack, reimplemented and re-traced. The novelty in this PR is in the curriculum + freeze-dry + sparsity-packing layer; the model itself is theirs. +- **Frantar et al. (2023)** - GPTQ. Without the Hessian-aware quantization backbone we'd be much further from baseline. +- **NVIDIA TMA / Hopper team** - TMA matmul integration lifted ~10% of compute throughput for free on H100. 
- **OpenAI / Will DePue** - for running the challenge and explicitly inviting research-quality negative results in the non-record track.
- **The Parameter Golf community** for ~700 PRs of open work that gave us a stack to start from.

---

## Footnote

The 3-seed mean of 2.7663 is worse than the 1.2244 naive baseline. This submission is not competitive on the leaderboard. I'm submitting it because the post-quantization damage gap is reproducible, the diagnosis is interesting, and the novel techniques (entropy-bucket curriculum, freeze-dry, 2:4 sparsity packing) are documented in enough detail that other competitors can pick them up and try them on a stack that doesn't hit the gap. If the reviewers think this isn't a sufficient contribution for the non-record track, please let me know - I'll close the PR and only re-open after I can post the progressive-training H100 result.
diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig1_damage_gap.png b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig1_damage_gap.png
new file mode 100644
index 0000000000..f010f288c7
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig1_damage_gap.png differ
diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig2_ttt_recovery.png b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig2_ttt_recovery.png
new file mode 100644
index 0000000000..759436ee92
Binary files /dev/null and b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig2_ttt_recovery.png differ
diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig3_curriculum_schedule.png
b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig3_curriculum_schedule.png new file mode 100644 index 0000000000..c9a7e06f03 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig3_curriculum_schedule.png differ diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig4_per_seed_bars.png b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig4_per_seed_bars.png new file mode 100644 index 0000000000..9c7eaa4172 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig4_per_seed_bars.png differ diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig5_freezedry_histogram.png b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig5_freezedry_histogram.png new file mode 100644 index 0000000000..75a398f978 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig5_freezedry_histogram.png differ diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig6_sparsity_24.png b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig6_sparsity_24.png new file mode 100644 index 0000000000..5e125c9221 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig6_sparsity_24.png differ diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig7_followups.png b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig7_followups.png new file mode 100644 index 0000000000..38d3e1008a Binary files /dev/null and 
b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/figures/fig7_followups.png differ diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/requirements.txt b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/requirements.txt new file mode 100644 index 0000000000..9711fe0eea --- /dev/null +++ b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/requirements.txt @@ -0,0 +1,11 @@ +# c22 submission — runtime deps. +# torch 2.9.1+cu128 is installed by setup.sh to match the FA3 wheel ABI. +torch==2.9.1 +torchvision +torchaudio +sentencepiece>=0.2.0 +zstandard>=0.22.0 +huggingface_hub>=0.20.0 +numpy +# flash-attn-3 is optional — pre-installed in runpod/pytorch image. +# If absent, c22_train.py falls back to torch SDPA (math-identical, ~15-25% slower). diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/submission.json b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/submission.json new file mode 100644 index 0000000000..caabac5a26 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/submission.json @@ -0,0 +1,67 @@ +{ + "track": "non_record_16mb", + "date": "2026-04-26", + "name": "Post-Quantization Damage Gap - 11L GPTQ Int6 + Entropy Curriculum + Sliding TTT (Negative Result)", + "author": "Takoda Mundy", + "github_id": "taka6745", + "val_bpb": 2.766343, + "val_bpb_std": 0.034647, + "val_bpb_stderr": 0.020004, + "val_loss": 7.145696, + "n_seeds": 3, + "seeds": [42, 1337, 2024], + "val_bpb_per_seed": { + "42": 2.728464, + "1337": 2.796432, + "2024": 2.774133 + }, + "metric_label": "post_TTT val_bpb (sliding window stride=64) - 3-seed mean", + "intermediate_metrics": { + "pre_quant_val_bpb_mean": 1.100898, + "pre_quant_val_bpb_std": 0.001133, + "post_quant_pre_ttt_val_bpb_mean": 3.462027, + 
"post_quant_pre_ttt_val_bpb_std": 0.017323, + "quantization_damage_bpb": 2.361129, + "ttt_recovery_bpb": 0.695684 + }, + "artifact_bytes_per_seed": { + "42": 15720987, + "1337": 15652160, + "2024": 15715938 + }, + "artifact_bytes_mean": 15696362, + "artifact_bytes_max": 15720987, + "code_bytes": 151448, + "total_bytes_max": 15872435, + "params": 35988657, + "model": { + "num_layers": 11, + "model_dim": 512, + "embedding_dim": 512, + "num_heads": 8, + "num_kv_heads": 4, + "mlp_mult": 4.0, + "tie_embeddings": true, + "vocab_size": 8192, + "rotary_dim": 16, + "tokenizer": "SentencePiece BPE 8192 on FineWeb" + }, + "training": { + "wallclock_seconds": 600, + "train_seq_len": 2048, + "train_global_batch_tokens": 524288, + "optimizer": "Muon (NS-3) + AdamW (fused) + EMA 0.9965", + "warmup_steps": 20, + "warmdown_wallclock_fraction": 0.72, + "curriculum": "entropy-bucket weighted shard sampler, easy-to-hard time-based crossfade with 0.02 floor weight" + }, + "quantization": "GPTQ int6 matrix + int5 embedding + 2:4 sparsity (3-bit values + position codes) + freeze-dry + zstd-22", + "ttt": "test-time training, sliding window, 3 chunks, cosine LR, score-first; recovers 0.70 BPB of post-quant damage", + "compute": { + "gpus": "8xH100 SXM", + "training_wallclock_seconds": 600, + "eval_wallclock_seconds_per_seed_approx": 380, + "spend_usd_approx": 60 + }, + "honest_summary": "Reimplemented an 11-layer GQA transformer with curriculum sampling and a stack of speed levers, reaching pre-quant val_bpb=1.1009 in 600 s on 8xH100 - better than typical pre-quant numbers in the leaderboard reference stack. However, GPTQ int6 quantization catastrophically damages this sharper minimum: post-quant val_bpb=3.4620 (+2.36 BPB damage). Test-time training partially recovers to 2.7663 - still below the 1.2244 naive baseline. Submitted as a non-record research contribution documenting the post-quantization damage gap and proposing progressive depth-grown training as a candidate mitigation." 
+} diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_gpt.py b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_gpt.py new file mode 100644 index 0000000000..5d53446786 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_gpt.py @@ -0,0 +1,2440 @@ +"""Self-contained training script for the parameter-golf 16 MB / 600 s track. + +Architecture: 11-layer transformer, model dim 512, 8/4 GQA, DualMLP (4x expansion +split half-width), tied 8192-vocab SentencePiece BPE, partial RoPE (16/64 dims), +gated attention, EMA 0.9965, Muon (NS-3) + fused AdamW hybrid. 35,988,657 params. +Quantization: GPTQ int6 matrix + int5 embedding + 2:4 sparsity + freeze-dry + zstd-22. +""" + +import os as _bootstrap_os +import sys as _bootstrap_sys +import types as _bootstrap_types + +try: + import torch as _bootstrap_torch + + _bootstrap_torch._inductor.config.triton.enable_persistent_tma_matmul = True +except Exception: + pass +_bootstrap_os.environ.setdefault("USE_LLOYD_MAX", "1") +_bootstrap_os.environ.setdefault("USE_PARALLEL_GPTQ", "1") +_bootstrap_os.environ.setdefault("USE_FREEZE_DRY", "1") +_bootstrap_os.environ.setdefault("COMPRESSOR", "zstd") +_bootstrap_os.environ.setdefault("USE_DUAL_MLP", "1") +_bootstrap_os.environ.setdefault("DUAL_MLP_RATIO", "0.375") +_bootstrap_os.environ.setdefault("USE_ASYMMETRIC_SKIP_INIT", "1") +_bootstrap_os.environ.setdefault("USE_CROSS_WINDOW", "1") +_bootstrap_os.environ.setdefault("ITERATIONS", "20000") +_bootstrap_os.environ.setdefault("MAX_WALLCLOCK_SECONDS", "600") +_bootstrap_os.environ.setdefault("TTT_ENABLED", "1") +_bootstrap_os.environ.setdefault("TTT_LR", "0.005") +_bootstrap_os.environ.setdefault("TTT_EPOCHS", "3") +_bootstrap_os.environ.setdefault("TTT_CHUNK_TOKENS", "32768") +_bootstrap_os.environ.setdefault("SLIDING_WINDOW_ENABLED", "1") +_bootstrap_os.environ.setdefault("USE_NGRAM_BIAS", 
"0") +_bootstrap_os.environ.setdefault("GPTQ_CALIB_USE_VAL", "0") +_bootstrap_os.environ.setdefault("PREQUANT_TTT_ENABLED", "0") +_bootstrap_os.environ.setdefault("MATRIX_BITS", "6") +_bootstrap_os.environ.setdefault("TRAIN_BATCH_TOKENS", "524288") +_bootstrap_os.environ.setdefault("VAL_BATCH_TOKENS", "262144") +_bootstrap_os.environ.setdefault("SEED", "42") +_bootstrap_os.environ.setdefault("USE_NGRAM_BF16", "1") +_bootstrap_os.environ.setdefault("TTT_BATCH_SEQS", "16") +_bootstrap_os.environ.setdefault("PREQUANT_TTT_BATCH_SEQS", "16") +_bootstrap_os.environ.setdefault("USE_SPARSITY_24", "1") +_bootstrap_os.environ.setdefault("USE_CMP_QUANT_VALUE_DEDUP", "1") +_bootstrap_os.environ.setdefault("EMBED_BITS", "5") +_bootstrap_os.environ.setdefault("QK_GAIN_INIT", "5.25") +_bootstrap_os.environ.setdefault("EMA_DECAY", "0.9965") +_bootstrap_os.environ.setdefault("ADAM_WD", "0.095") +_bootstrap_os.environ.setdefault("MATRIX_LR", "0.022") +_bootstrap_os.environ.setdefault("WARMDOWN_FRAC", "0.72") +_bootstrap_os.environ.setdefault("PARALLEL_RESIDUAL_START", "7") +_bootstrap_os.environ.setdefault("MUON_WD", "0.12") +_bootstrap_os.environ.setdefault("MUON_MOMENTUM", "0.98") +_bootstrap_os.environ.setdefault("USE_CURRICULUM_SHARD", "1") +_bootstrap_os.environ.setdefault("CURRICULUM_MANIFEST_PATH", "./data/curriculum_manifest.npz") +_bootstrap_os.environ.setdefault("CURRICULUM_BUCKET_FLOOR_WEIGHT", "0.02") +_bootstrap_os.environ.setdefault("MUON_BACKEND_STEPS", "3") +_bootstrap_os.environ.setdefault("USE_PREFETCH_LOADER", "1") +_bootstrap_os.environ.setdefault("PREFETCH_DEPTH", "4") +_bootstrap_os.environ.setdefault("PREFETCH_PIN_MEMORY", "1") +for _pkg in ("submission", "submission.ideas"): + if _pkg not in _bootstrap_sys.modules: + _bootstrap_sys.modules[_pkg] = _bootstrap_types.ModuleType(_pkg) +_idea_module_idea_phase6_sparsity_24 = _bootstrap_types.ModuleType("submission.ideas.idea_phase6_sparsity_24") +_idea_module_idea_phase6_sparsity_24.__file__ = "" 
+_idea_source_idea_phase6_sparsity_24 = '"""PHASE6 - 2:4 structured sparsity compression.\n\nGoal: cut bytes by ~30% versus the int6 GPTQ baseline by exploiting the fact\nthat (a) weight matrices carry a lot of "below the noise floor" values and\n(b) we can encode which two-of-four positions we kept with a 4-bit code per\nblock of 4 elements (~2.6 bits of real entropy, since C(4,2)=6).\nStandard NVIDIA sparse-tensor format, but used here as\na pure storage-side compression trick - we never actually run in sparse mode,\njust round-trip the surviving values through serialize/deserialize.\n\nAlgorithm (per 2D weight matrix):\n 1. Reshape along axis 1 into contiguous blocks of 4 elements.\n (If the last axis isn\'t a multiple of 4, pad with zeros - padding count\n is stored in the packed dict so dequantize can trim.)\n 2. In each block, find the indices of the 2 largest-|value| elements.\n 3. Keep their values, zero the other two.\n 4. Store:\n - positions: a 4-bit code per block saying which-of-6 pairs survived\n (C(4,2)=6 possible pairs; fits in 3 bits but we use 4 so\n the packing is trivial and brotli handles the rest)\n - values: the 2 surviving values per block, SPARSITY_24_BITS-bit\n quantized, per-row scaled.\n - scale: per-row float32 dequant factor.\n - shape: original (n_rows, n_cols).\n - pad: int32 count of zero columns added to reach multiple-of-4.\n\nExpected savings versus int6 baseline:\n - int6 = 6 bits/weight\n - ours = (2 values × SPARSITY_24_BITS bits + 4 bits position) / 4 weights\n = (6 + 4) / 4 = 2.5 bits/weight at SPARSITY_24_BITS=3\n - that\'s a ~58% byte reduction vs int6, though brotli eats into the gap\n because int6 compresses well too. Real-world we expect ~30%.\n\nEnv vars:\n USE_SPARSITY_24=0|1 (default 0)\n SPARSITY_24_BITS=3 (bits used to store each surviving value)\n\nHook point: tournament/train.py serialize() - replaces .q/.scale with\n.__sparsity_24_packed. for eligible tensors; marks\nquant_meta[name]=\'sparsity_24\'. 
The deserialize path reads those keys back\nand reconstructs the dense matrix (with the non-surviving positions as 0s).\n"""\nfrom __future__ import annotations\n\nimport os\nfrom typing import Dict\n\nimport numpy as np\n\n\n# ─── Env gates ──────────────────────────────────────────────────────────────\n\ndef is_enabled() -> bool:\n return bool(int(os.environ.get("USE_SPARSITY_24", "0")))\n\n\ndef _value_bits() -> int:\n # How many bits per surviving value. 3 bits = 8 levels (uint3),\n # 4 bits = 16 levels, etc. Clamped to [2, 8].\n b = int(os.environ.get("SPARSITY_24_BITS", "3"))\n return max(2, min(8, b))\n\n\n# ─── Pair-index tables ──────────────────────────────────────────────────────\n# The 6 unique (i, j) pairs over {0,1,2,3} with i < j, in canonical order.\n_PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]\n# Maps an (i, j) pair back to its index in _PAIRS.\n_PAIR_LOOKUP = {p: i for i, p in enumerate(_PAIRS)}\n\n\ndef _select_top2_pair_idx(block: np.ndarray) -> int:\n """Given a length-4 array, return the index in _PAIRS of the two positions\n with largest magnitude."""\n mags = np.abs(block)\n # argpartition gives us the two largest without a full sort\n top2 = np.argpartition(mags, -2)[-2:]\n a, b = sorted(int(x) for x in top2)\n return _PAIR_LOOKUP[(a, b)]\n\n\n# ─── Core pack ──────────────────────────────────────────────────────────────\n\ndef quantize_sparsity_24(W: np.ndarray, value_bits: int | None = None) -> Dict[str, np.ndarray]:\n """2:4 structured-sparsity quantization of a 2D weight matrix.\n\n Parameters\n ----------\n W : np.ndarray of shape (m, n), float\n value_bits : int, bits per surviving value (default from env)\n\n Returns\n -------\n packed : dict with keys \'values\', \'positions\', \'scale\', \'shape\', \'pad\', \'bits\'.\n """\n if value_bits is None:\n value_bits = _value_bits()\n if W.ndim != 2:\n raise ValueError(f"sparsity_24 needs a 2D matrix, got shape {W.shape}")\n\n m, n = W.shape\n pad = (4 - n % 4) % 4\n if pad:\n W = np.concatenate([W, np.zeros((m, pad), dtype=W.dtype)], axis=1)\n n_padded = W.shape[1]\n assert n_padded % 4 == 0\n 
n_blocks_per_row = n_padded // 4\n\n # Reshape into (m, n_blocks, 4) so each last-axis block is a 2:4 group.\n W3 = W.reshape(m, n_blocks_per_row, 4)\n\n # Compute per-block pair index (one uint8 per block; actual entropy ≤3 bits).\n mags = np.abs(W3)\n # For each block, sort indices desc by magnitude; take the two largest.\n # argsort along axis=2, keep the top 2 positions.\n top2_idx = np.argpartition(mags, -2, axis=2)[:, :, -2:]\n top2_idx_sorted = np.sort(top2_idx, axis=2) # (m, nblk, 2), positions with i < j\n # Map each sorted (i, j) pair to its code via the lookup table.\n pair_codes = np.zeros((m, n_blocks_per_row), dtype=np.uint8)\n for pi, (i, j) in enumerate(_PAIRS):\n pair_codes[(top2_idx_sorted[:, :, 0] == i) & (top2_idx_sorted[:, :, 1] == j)] = pi\n\n # Gather the two surviving values per block.\n rows_idx = np.arange(m).reshape(m, 1, 1)\n blk_idx = np.arange(n_blocks_per_row).reshape(1, n_blocks_per_row, 1)\n surviving_vals = W3[rows_idx, blk_idx, top2_idx_sorted] # (m, nblk, 2)\n # Flatten to (m, nblk*2) for per-row quantization.\n vals_flat = surviving_vals.reshape(m, n_blocks_per_row * 2).astype(np.float32)\n\n # Per-row symmetric quant with value_bits levels.\n levels = 1 << value_bits\n qmax = levels - 1\n row_max = np.max(np.abs(vals_flat), axis=1, keepdims=True)\n row_max = np.where(row_max == 0, 1.0, row_max)\n scale = row_max / (qmax / 2)\n q = np.round(vals_flat / scale + (qmax / 2)).astype(np.int32)\n q = np.clip(q, 0, qmax).astype(np.uint8)\n\n return {\n "values": q, # uint8, shape (m, 2*nblk)\n "positions": pair_codes, # uint8, shape (m, nblk), 0..5\n "scale": scale.astype(np.float32), # (m, 1)\n "shape": np.array([m, n], dtype=np.int32),\n "pad": np.int32(pad),\n "bits": np.int32(value_bits),\n }\n\n\ndef dequantize(packed: Dict[str, np.ndarray]) -> np.ndarray:\n """Inverse of quantize_sparsity_24. 
Pure numpy."""\n q = packed["values"].astype(np.float32) # (m, 2*nblk)\n pair_codes = packed["positions"].astype(np.int32) # (m, nblk)\n scale = packed["scale"].astype(np.float32) # (m, 1)\n m_orig, n_orig = (int(x) for x in packed["shape"])\n pad = int(packed["pad"])\n bits = int(packed["bits"])\n qmax = (1 << bits) - 1\n\n m = q.shape[0]\n n_vals = q.shape[1]\n n_blocks_per_row = n_vals // 2\n n_padded = n_orig + pad\n assert n_blocks_per_row * 4 == n_padded\n\n # Dequantize values.\n vals_flat = (q - (qmax / 2)) * scale # (m, 2*nblk)\n vals = vals_flat.reshape(m, n_blocks_per_row, 2) # (m, nblk, 2)\n\n # Scatter back into a (m, nblk, 4) dense block layout.\n dense = np.zeros((m, n_blocks_per_row, 4), dtype=np.float32)\n # For each pair code, scatter both survivors.\n for pi, (i, j) in enumerate(_PAIRS):\n mask = (pair_codes == pi) # (m, nblk)\n # mask is (m, nblk); vals[...,0] is (m, nblk). Assigning into dense[mask, i]\n # only writes to the selected (row, block) pairs, which matches vals[mask, 0].\n dense[mask, i] = vals[mask, 0]\n dense[mask, j] = vals[mask, 1]\n\n W = dense.reshape(m, n_padded)\n if pad:\n W = W[:, :n_orig]\n assert W.shape == (m_orig, n_orig)\n return W.astype(np.float32)\n\n\n# ─── Size estimation ────────────────────────────────────────────────────────\n\ndef estimate_compressed_bytes(W: np.ndarray, bits: int) -> int:\n m, n = W.shape\n n_padded = n + (4 - n % 4) % 4\n n_blocks = (m * n_padded) // 4\n # positions: 1 byte per block (entropy ≤3 bits; brotli handles the rest).\n pos_bytes = n_blocks\n # values: 2 * bits per block, rounded up to bytes.\n val_bytes = int(np.ceil(2 * n_blocks * bits / 8))\n meta_bytes = m * 4\n return pos_bytes + val_bytes + meta_bytes\n\n\nif __name__ == "__main__":\n # Smoke test\n rng = np.random.default_rng(42)\n W = rng.standard_normal((128, 512)).astype(np.float32)\n\n packed = quantize_sparsity_24(W, value_bits=3)\n W_rec = dequantize(packed)\n rmse = float(np.sqrt(np.mean((W - W_rec) ** 2)))\n\n # Count 
zeros in the reconstruction - 2:4 means exactly half are zero.\n zero_frac = float(np.mean(W_rec == 0))\n n_correct_zero_frac = abs(zero_frac - 0.5) < 0.02\n\n print(f"original: {W.size * 4} bytes (fp32)")\n print(f"int6 baseline: {int(W.size * 6 / 8)} bytes")\n print(f"sparsity_24 @3: {estimate_compressed_bytes(W, 3)} bytes")\n print(f"shape: {packed[\'shape\'].tolist()}")\n print(f"RMSE: {rmse:.5f}")\n print(f"zero fraction: {zero_frac:.3f} (expect ≈0.5; pass={n_correct_zero_frac})")\n # Spot-check round-trip at extreme sparsity: the block with the biggest magnitude\n # should survive in full (modulo quantization).\n biggest_idx = np.unravel_index(np.argmax(np.abs(W)), W.shape)\n surviving = abs(W_rec[biggest_idx]) > 1e-4\n print(f"biggest-mag survived: {surviving}")\n\n # Also sanity-check handling of non-multiple-of-4 widths.\n W2 = rng.standard_normal((16, 37)).astype(np.float32)\n p2 = quantize_sparsity_24(W2, value_bits=3)\n W2_rec = dequantize(p2)\n print(f"odd-width {W2.shape} round-trip RMSE: {float(np.sqrt(np.mean((W2 - W2_rec) ** 2))):.5f}")\n assert W2_rec.shape == W2.shape\n print("OK")\n' +exec(_idea_source_idea_phase6_sparsity_24, _idea_module_idea_phase6_sparsity_24.__dict__) +_bootstrap_sys.modules["submission.ideas.idea_phase6_sparsity_24"] = _idea_module_idea_phase6_sparsity_24 +_idea_module_idea_051_freeze_dry = _bootstrap_types.ModuleType("submission.ideas.idea_051_freeze_dry") +_idea_module_idea_051_freeze_dry.__file__ = "" +_idea_source_idea_051_freeze_dry = '"""IDEA-051 - Freeze-drying: detect & drop weights that are linear combos of neighbors.\n\nAfter training, for each weight w_{i,j}, fit a linear model predicting it from\nits row/column neighbors. If fit RMSE < threshold, mark as "reconstructable" and\ndrop it. 
Store only a bitmask + shared reconstruction coefficients.\n\nEnv vars:\n USE_FREEZE_DRY=0|1 (default 0)\n FREEZE_DRY_RMSE_THRESH=0.005 (default 0.005 - max RMSE for reconstruction)\n FREEZE_DRY_MIN_FRACTION=0.05 (default 0.05 - only apply if >5% reconstructable)\n\nHook point: submission/train.py serialize() - after GPTQ, analyze weight structure\nand drop linearly-reconstructable weights before brotli compression.\n"""\n\nimport os\nfrom typing import Dict, Tuple, Optional\n\nimport numpy as np\n\n\ndef is_enabled() -> bool:\n return bool(int(os.environ.get("USE_FREEZE_DRY", "0")))\n\n\ndef get_rmse_thresh() -> float:\n return float(os.environ.get("FREEZE_DRY_RMSE_THRESH", "0.005"))\n\n\ndef get_min_fraction() -> float:\n return float(os.environ.get("FREEZE_DRY_MIN_FRACTION", "0.05"))\n\n\ndef analyze_linear_redundancy(\n w: np.ndarray,\n rmse_thresh: float = 0.005,\n) -> Tuple[np.ndarray, float]:\n """Analyze a weight matrix for linear redundancy along rows.\n\n For each element w[i,j], fit: w[i,j] ≈ a*w[i,j-1] + b*w[i,j+1]\n (2-neighbor linear prediction). 
If RMSE < threshold, mark as reconstructable.\n\n Args:\n w: [out_dim, in_dim] float32 weight matrix\n rmse_thresh: max RMSE for a weight to be considered reconstructable\n\n Returns:\n mask: [out_dim, in_dim] bool - True = keep, False = reconstructable (can drop)\n fraction_reconstructable: float\n """\n out_dim, in_dim = w.shape\n if in_dim < 3:\n return np.ones_like(w, dtype=bool), 0.0\n\n # For each interior column j (1..in_dim-2), predict from j-1 and j+1\n # via least-squares: w[:,j] ≈ a * w[:,j-1] + b * w[:,j+1]\n mask = np.ones_like(w, dtype=bool)\n total_checked = 0\n total_reconstructable = 0\n\n for j in range(1, in_dim - 1):\n # Stack neighbors as [out_dim, 2] design matrix\n X = np.stack([w[:, j - 1], w[:, j + 1]], axis=1) # [out_dim, 2]\n y = w[:, j] # [out_dim]\n\n # Solve least squares for coefficients a, b\n try:\n coeffs, residuals, _, _ = np.linalg.lstsq(X, y, rcond=None)\n except np.linalg.LinAlgError:\n continue\n\n # Compute per-element RMSE\n pred = X @ coeffs\n errors = np.abs(y - pred)\n\n # Mark elements with error < threshold as reconstructable\n recon_mask = errors < rmse_thresh\n mask[:, j] = ~recon_mask # True = keep (NOT reconstructable)\n total_checked += out_dim\n total_reconstructable += recon_mask.sum()\n\n fraction = total_reconstructable / max(total_checked, 1)\n return mask, fraction\n\n\ndef freeze_dry_state_dict(\n state_dict: Dict[str, "torch.Tensor"],\n rmse_thresh: float = None,\n min_fraction: float = None,\n) -> Dict[str, "torch.Tensor"]:\n """Apply freeze-drying to all weight matrices in state_dict.\n\n Weights identified as linearly reconstructable are set to zero.\n Zeros compress efficiently under brotli.\n\n Returns: modified state_dict (in-place)\n """\n import torch\n\n if not is_enabled():\n return state_dict\n\n rmse_thresh = rmse_thresh or get_rmse_thresh()\n min_fraction = min_fraction or get_min_fraction()\n\n total_removed = 0\n total_weights = 0\n\n for name, tensor in state_dict.items():\n if tensor.dim() 
!= 2 or tensor.numel() < 65536:\n continue\n if not tensor.is_floating_point():\n continue\n\n w_np = tensor.detach().cpu().float().numpy()\n mask, frac = analyze_linear_redundancy(w_np, rmse_thresh)\n\n if frac < min_fraction:\n continue\n\n # Zero out reconstructable weights\n removed = (~mask).sum()\n state_dict[name] = tensor * torch.from_numpy(mask.astype(np.float32)).to(tensor.device).to(tensor.dtype)\n\n total_removed += removed\n total_weights += tensor.numel()\n\n print(\n f"[IDEA-051 freeze_dry] {name}: {removed}/{tensor.numel()} "\n f"({100*frac:.1f}%) weights zeroed",\n flush=True,\n )\n\n if total_weights > 0:\n print(\n f"[IDEA-051 freeze_dry] total: {total_removed}/{total_weights} "\n f"({100*total_removed/total_weights:.2f}%) weights zeroed",\n flush=True,\n )\n\n return state_dict\n' +exec(_idea_source_idea_051_freeze_dry, _idea_module_idea_051_freeze_dry.__dict__) +_bootstrap_sys.modules["submission.ideas.idea_051_freeze_dry"] = _idea_module_idea_051_freeze_dry +_idea_module_idea_064_parallel_gptq = _bootstrap_types.ModuleType("submission.ideas.idea_064_parallel_gptq") +_idea_module_idea_064_parallel_gptq.__file__ = "" +_idea_source_idea_064_parallel_gptq = '"""IDEA-064 - Parallel-search GPTQ: try 50+ clip percentiles using all 208 CPUs.\n\nPR #414 tries 5 clip percentiles per row. We have 208 vCPUs. Run 50+ candidates\nin parallel, pick per-row optimum. 
Strictly better GPTQ at zero GPU cost.\n\nEnv vars:\n USE_PARALLEL_GPTQ=0|1 (default 0)\n PARALLEL_GPTQ_N_CLIPS=50 (default 50 clip candidates)\n PARALLEL_GPTQ_WORKERS=0 (default 0 - auto-detect CPU count)\n\nHook point: submission/train.py gptq_quantize_weight() - replace with\nparallel_gptq_quantize_weight when enabled.\n"""\n\nimport os\nimport math\nfrom concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor\nfrom typing import Tuple, List, Optional\n\nimport numpy as np\n\n\ndef is_enabled() -> bool:\n return bool(int(os.environ.get("USE_PARALLEL_GPTQ", "0")))\n\n\ndef get_n_clips() -> int:\n return int(os.environ.get("PARALLEL_GPTQ_N_CLIPS", "50"))\n\n\ndef get_workers() -> int:\n n = int(os.environ.get("PARALLEL_GPTQ_WORKERS", "0"))\n if n <= 0:\n n = min(os.cpu_count() or 8, 64)\n return n\n\n\ndef generate_clip_candidates(n: int = 50, sigma_min: float = 2.0, sigma_max: float = 20.0) -> List[float]:\n """Generate n clip-sigma candidates log-spaced between sigma_min and sigma_max."""\n log_min = math.log(sigma_min)\n log_max = math.log(sigma_max)\n return [math.exp(log_min + (log_max - log_min) * i / max(n - 1, 1)) for i in range(n)]\n\n\ndef _quantize_row_at_clip(args):\n """Worker function: quantize a single row at a given clip_sigma.\n\n Returns (row_idx, clip_sigma, reconstruction_error).\n """\n row_idx, w_row, h_diag_row, clip_sigma, clip_range = args\n w = w_row.copy()\n n = len(w)\n # Clip outliers\n std = np.std(w)\n clip_val = clip_sigma * std\n w = np.clip(w, -clip_val, clip_val)\n # Scale to int range\n w_max = np.abs(w).max()\n if w_max < 1e-10:\n return row_idx, clip_sigma, 0.0\n scale = w_max / clip_range\n q = np.round(w / scale).clip(-clip_range, clip_range)\n recon = q * scale\n # Hessian-weighted reconstruction error\n err = (w_row - recon) ** 2\n if h_diag_row is not None and len(h_diag_row) == len(err):\n err *= (1.0 + np.abs(h_diag_row))\n return row_idx, clip_sigma, float(err.sum())\n\n\ndef parallel_search_best_clips(\n 
weight: "np.ndarray",\n hessian_diag: "Optional[np.ndarray]",\n clip_range: int = 31,\n n_clips: int = None,\n n_workers: int = None,\n) -> "np.ndarray":\n """Find optimal clip_sigma per row using parallel search.\n\n Args:\n weight: [out_dim, in_dim] float32 numpy array\n hessian_diag: [out_dim] or [out_dim, in_dim] Hessian diagonal\n clip_range: max int value (31 for int6)\n n_clips: number of clip candidates to try\n n_workers: parallel workers\n\n Returns:\n best_clips: [out_dim] best clip_sigma per row\n """\n if not is_enabled():\n return None\n\n n_clips = n_clips or get_n_clips()\n n_workers = n_workers or get_workers()\n clips = generate_clip_candidates(n_clips)\n out_dim = weight.shape[0]\n\n # Prepare per-row Hessian\n if hessian_diag is not None:\n if hessian_diag.ndim == 1:\n h_rows = [None] * out_dim # per-output-dim scalar, not per-element\n else:\n h_rows = [hessian_diag[i] for i in range(out_dim)]\n else:\n h_rows = [None] * out_dim\n\n # Build task list: (row_idx, w_row, h_row, clip_sigma, clip_range)\n tasks = []\n for row_idx in range(out_dim):\n for clip_sigma in clips:\n tasks.append((row_idx, weight[row_idx].copy(), h_rows[row_idx], clip_sigma, clip_range))\n\n # Run in parallel (use threads not processes to avoid pickle overhead)\n best_clips = np.full(out_dim, clips[len(clips) // 2]) # default to median\n best_errors = np.full(out_dim, float("inf"))\n\n with ThreadPoolExecutor(max_workers=n_workers) as pool:\n for row_idx, clip_sigma, err in pool.map(_quantize_row_at_clip, tasks):\n if err < best_errors[row_idx]:\n best_errors[row_idx] = err\n best_clips[row_idx] = clip_sigma\n\n total_improvement = (best_errors.sum()) / max(out_dim, 1)\n print(\n f"[IDEA-064 parallel_gptq] searched {n_clips} clips × {out_dim} rows "\n f"using {n_workers} workers, avg_best_err={total_improvement:.6f}",\n flush=True,\n )\n return best_clips\n' +exec(_idea_source_idea_064_parallel_gptq, _idea_module_idea_064_parallel_gptq.__dict__) 
+_bootstrap_sys.modules["submission.ideas.idea_064_parallel_gptq"] = _idea_module_idea_064_parallel_gptq +_idea_module_tournament_quant_01_lloyd_max = _bootstrap_types.ModuleType( + "submission.ideas.tournament_quant_01_lloyd_max" +) +_idea_module_tournament_quant_01_lloyd_max.__file__ = "" +_idea_source_tournament_quant_01_lloyd_max = '"""Tournament Quant 01 -- Lloyd-Max codebook quantization for int6 weights.\n\nReplace standard uniform int6 quantization with optimal non-uniform\nquantization based on a pre-computed 64-level Lloyd-Max codebook. The\ncodebook is trained offline to minimise MSE for the empirical weight\ndistribution (Gaussian-like, heavy tails).\n\nFor each weight value, find the nearest codebook centroid and store its\n6-bit index (0-63). Dequantize by table lookup: weight_approx =\ncodebook[index]. Because the codebook places more centroids near\nzero (where most weights live), reconstruction error drops vs uniform\nspacing at the same 6 bits per weight.\n\nCodebook path: data/lloyd_max_codebook_64.npy (64 float32 values,\nsorted ascending, pre-trained via the Lloyd-Max algorithm on a sample\nof trained model weights).\n\nEnv vars:\n USE_LLOYD_MAX=0|1 (default 0)\n LLOYD_MAX_CODEBOOK= (default data/lloyd_max_codebook_64.npy)\n\nHook point: submission/train.py quantize() -- replace uniform int6\nround-and-clip with ``lloyd_max_quantize()``; at inference call\n``lloyd_max_dequantize()`` to recover approximate float values.\n"""\n\nimport os\nfrom pathlib import Path\nfrom typing import Tuple\n\nimport numpy as np\n\n\n# ---------------------------------------------------------------------------\n# Env-var gate\n# ---------------------------------------------------------------------------\n\ndef is_enabled() -> bool:\n return bool(int(os.environ.get("USE_LLOYD_MAX", "0")))\n\n\ndef _codebook_path() -> str:\n default = str(Path(__file__).resolve().parents[2] / "data" / "lloyd_max_codebook_64.npy")\n return os.environ.get("LLOYD_MAX_CODEBOOK", 
default)\n\n\n# ---------------------------------------------------------------------------\n# Codebook loading (cached)\n# ---------------------------------------------------------------------------\n\n_CODEBOOK_CACHE = None\n\n\ndef load_codebook() -> np.ndarray:\n """Load and cache the 64-level Lloyd-Max codebook (sorted ascending).\n\n Returns:\n 1-D float32 array of 64 centroid values.\n """\n global _CODEBOOK_CACHE\n if _CODEBOOK_CACHE is None:\n cb = np.load(_codebook_path()).astype(np.float32).ravel()\n assert cb.shape[0] == 64, f"Expected 64 codebook entries, got {cb.shape[0]}"\n cb.sort()\n _CODEBOOK_CACHE = cb\n return _CODEBOOK_CACHE\n\n\n# ---------------------------------------------------------------------------\n# Quantize / dequantize\n# ---------------------------------------------------------------------------\n\ndef lloyd_max_quantize(\n tensor: np.ndarray,\n codebook: np.ndarray = None,\n) -> Tuple[np.ndarray, np.ndarray]:\n """Quantize a weight tensor using Lloyd-Max codebook lookup.\n\n For every element, find the nearest codebook centroid and store its\n 6-bit index.\n\n Args:\n tensor: arbitrary-shape float32 weight array.\n codebook: 1-D sorted array of 64 centroids (default: load from disk).\n\n Returns:\n (indices, codebook) where indices is uint8 array (0-63) with the\n same shape as tensor, and codebook is the 64-entry centroid array.\n """\n if codebook is None:\n codebook = load_codebook()\n\n flat = tensor.astype(np.float32).ravel()\n # Nearest-centroid assignment via binary search on sorted codebook\n insert_idx = np.searchsorted(codebook, flat)\n # Clamp to valid range\n insert_idx = np.clip(insert_idx, 0, len(codebook) - 1)\n # Check if the left neighbour is closer\n left_idx = np.clip(insert_idx - 1, 0, len(codebook) - 1)\n dist_right = np.abs(flat - codebook[insert_idx])\n dist_left = np.abs(flat - codebook[left_idx])\n indices = np.where(dist_left < dist_right, left_idx, insert_idx)\n indices = 
indices.astype(np.uint8).reshape(tensor.shape)\n return indices, codebook\n\n\ndef lloyd_max_dequantize(\n indices: np.ndarray,\n codebook: np.ndarray = None,\n) -> np.ndarray:\n """Dequantize 6-bit indices back to float32 via codebook lookup.\n\n Args:\n indices: uint8 array of codebook indices (0-63), any shape.\n codebook: 1-D sorted array of 64 centroids (default: load from disk).\n\n Returns:\n float32 array of the same shape with approximate weight values.\n """\n if codebook is None:\n codebook = load_codebook()\n\n return codebook[indices.ravel()].reshape(indices.shape)\n' +exec(_idea_source_tournament_quant_01_lloyd_max, _idea_module_tournament_quant_01_lloyd_max.__dict__) +_bootstrap_sys.modules["submission.ideas.tournament_quant_01_lloyd_max"] = _idea_module_tournament_quant_01_lloyd_max +_idea_module_tournament_mlp_01_dual_mlp = _bootstrap_types.ModuleType("submission.ideas.tournament_mlp_01_dual_mlp") +_idea_module_tournament_mlp_01_dual_mlp.__file__ = "" +_idea_source_tournament_mlp_01_dual_mlp = '"""TOURNAMENT-MLP-01 -- Dual MLP: two parallel half-width MLPs per layer.\n\nReplace each layer\'s single MLP with two parallel MLPs, each with half the\nnormal hidden size, then average their outputs:\n\n mlp_out = 0.5 * (mlp_a(x) + mlp_b(x))\n\nTotal parameter count is the same as a single full-width MLP (since each\nhalf-width MLP has half the parameters). The benefit is implicit ensemble\naveraging: the two MLPs can specialise on different patterns and their\naverage is a smoother, more robust transformation than a single MLP.\n\nThis is analogous to dropout-as-ensemble (Srivastava et al. 2014) but\nstructural rather than stochastic: two independent paths that are always\nboth active. 
It also relates to mixture-of-experts with uniform routing\n(every expert always active, equal weight).\n\nEnv vars:\n USE_DUAL_MLP=0|1 (default 0)\n\nHook point: submission/train.py GPT.__init__() -- after blocks are created,\ncall apply_dual_mlp(model, h) to replace each block\'s MLP.\n\nIntegration:\n from submission.ideas.tournament_mlp_01_dual_mlp import (\n is_enabled, apply_dual_mlp\n )\n if is_enabled():\n apply_dual_mlp(model, h)\n"""\n\nimport os\n\nimport torch\nimport torch.nn as nn\n\n\n# ---------------------------------------------------------------------------\n# Configuration\n# ---------------------------------------------------------------------------\n\ndef is_enabled() -> bool:\n return bool(int(os.environ.get("USE_DUAL_MLP", "0")))\n\n\n# ---------------------------------------------------------------------------\n# Dual MLP module\n# ---------------------------------------------------------------------------\n\nclass DualMLP(nn.Module):\n """Two parallel half-width MLPs whose outputs are averaged.\n\n Each sub-MLP has hidden_dim = original_hidden_dim / 2, so total params\n match a single full-width MLP.\n """\n\n def __init__(self, model_dim: int, mlp_mult: float):\n super().__init__()\n # Full hidden dim that a single MLP would use\n full_hidden = int(model_dim * mlp_mult)\n # Each sub-MLP gets half\n half_hidden = full_hidden // 2\n\n self.mlp_a = nn.Sequential(\n nn.Linear(model_dim, half_hidden, bias=False),\n nn.SiLU(),\n nn.Linear(half_hidden, model_dim, bias=False),\n )\n self.mlp_b = nn.Sequential(\n nn.Linear(model_dim, half_hidden, bias=False),\n nn.SiLU(),\n nn.Linear(half_hidden, model_dim, bias=False),\n )\n\n def forward(self, x: torch.Tensor) -> torch.Tensor:\n return 0.5 * (self.mlp_a(x) + self.mlp_b(x))\n\n\n# ---------------------------------------------------------------------------\n# Integration\n# ---------------------------------------------------------------------------\n\ndef apply_dual_mlp(model, h):\n """Replace 
each block\'s MLP with a DualMLP.\n\n Args:\n model: GPT model with .blocks ModuleList\n h: hyperparameters namespace (needs model_dim, mlp_mult)\n """\n if not is_enabled():\n return\n\n model_dim = getattr(h, "model_dim", 512)\n mlp_mult = getattr(h, "mlp_mult", 3.0)\n\n replaced = 0\n for block in model.blocks:\n # Handle both direct blocks and wrapped blocks (e.g. HyperConnectedBlock)\n target = block\n if hasattr(block, "block"):\n target = block.block\n\n if hasattr(target, "mlp"):\n target.mlp = DualMLP(model_dim, mlp_mult)\n replaced += 1\n\n print(\n f"[TOURNAMENT-MLP-01 dual_mlp] replaced {replaced} MLPs with DualMLP "\n f"(2x half-width={int(model_dim * mlp_mult) // 2}, averaged)",\n flush=True,\n )\n' +exec(_idea_source_tournament_mlp_01_dual_mlp, _idea_module_tournament_mlp_01_dual_mlp.__dict__) +_bootstrap_sys.modules["submission.ideas.tournament_mlp_01_dual_mlp"] = _idea_module_tournament_mlp_01_dual_mlp +_idea_module_tournament_embed_03_asymmetric_skip = _bootstrap_types.ModuleType( + "submission.ideas.tournament_embed_03_asymmetric_skip" +) +_idea_module_tournament_embed_03_asymmetric_skip.__file__ = "" +_idea_source_tournament_embed_03_asymmetric_skip = '"""TOURNAMENT-EMBED-03 -- Asymmetric skip-connection initialization.\n\nInitialize U-Net skip_weights at 0.5 instead of the default 1.0. 
This\ncreates an information bottleneck at the encoder-decoder boundary: the\ndecoder receives a halved skip signal, forcing it to learn its own\nrepresentations rather than simply copying encoder outputs.\n\nAt 1.0, skip connections pass encoder activations through unchanged.\nAt 0.5, the decoder must reconstruct half the signal from its own\ncomputation, encouraging more expressive decoder layers.\n\nEnv vars:\n USE_ASYMMETRIC_SKIP_INIT=0|1 (default 0)\n\nHook point: model initialization -- after creating skip_weights, override\ntheir values with 0.5.\n"""\n\nimport os\nimport torch\n\n\ndef is_enabled() -> bool:\n return bool(int(os.environ.get("USE_ASYMMETRIC_SKIP_INIT", "0")))\n\n\ndef apply_asymmetric_skip_init(skip_weights: torch.Tensor) -> torch.Tensor:\n """Re-initialize skip weights to 0.5.\n\n Args:\n skip_weights: the model\'s skip connection weights, any shape.\n\n Returns:\n New tensor filled with 0.5, same shape and device.\n """\n return torch.full_like(skip_weights, 0.5)\n' +exec(_idea_source_tournament_embed_03_asymmetric_skip, _idea_module_tournament_embed_03_asymmetric_skip.__dict__) +_bootstrap_sys.modules["submission.ideas.tournament_embed_03_asymmetric_skip"] = ( + _idea_module_tournament_embed_03_asymmetric_skip +) +_idea_module_idea_curriculum_shard = _bootstrap_types.ModuleType("submission.ideas.idea_curriculum_shard") +_idea_module_idea_curriculum_shard.__file__ = "" +_idea_source_idea_curriculum_shard = '"""Entropy-bucket curriculum shard loader - drop-in replacement for\nShuffledSequenceLoader\'s `next_batch` API.\n\nWhen USE_CURRICULUM_SHARD=1, training samples sequences from pre-computed\nentropy buckets with a time-varying weight schedule that crossfades from easy\n(low-entropy) to hard (high-entropy) as training progresses. 
A floor weight\nprevents any bucket from going to zero (avoids catastrophic forgetting of\neither tail).\n\nSchedule per sequence:\n    d[b] = b / (N-1)  # bucket difficulty, 0 easiest\n    w[b] = (1 - d[b]) * (1 - p) + d[b] * p,  # p = training progress fraction\n    w[b] <- max(w[b], floor)\n    sample bucket ~ w / sum(w), then sample a sequence uniformly from that bucket.\n\nEnv vars:\n    USE_CURRICULUM_SHARD=0|1 (default 0)\n    CURRICULUM_MANIFEST_PATH=./data/curriculum_manifest.npz\n    CURRICULUM_BUCKET_FLOOR_WEIGHT=0.02\n\nExpects a manifest built by submission/final/assign_buckets.py (output of\nsubmission/final/compute_entropy.py). This module only defines the loader -\nthe host script is responsible for substituting it at ShuffledSequenceLoader\ncall sites when is_enabled() returns True.\n"""\nfrom __future__ import annotations\n\nimport glob\nimport os\nimport time\nfrom pathlib import Path\n\nimport numpy as np\nimport torch\n\n\ndef is_enabled() -> bool:\n    return bool(int(os.environ.get("USE_CURRICULUM_SHARD", "0")))\n\n\ndef get_manifest_path() -> str:\n    return os.environ.get("CURRICULUM_MANIFEST_PATH", "./data/curriculum_manifest.npz")\n\n\ndef get_bucket_floor_weight() -> float:\n    return float(os.environ.get("CURRICULUM_BUCKET_FLOOR_WEIGHT", "0.02"))\n\n\n_SHARD_HEADER_BYTES = 256 * np.dtype("<i4").itemsize\n\n\ndef _read_num_tokens(file: Path) -> int:\n    # Assumed standard fineweb .bin layout: 256 little-endian int32 header\n    # words, with header[2] holding the token count, then uint16 token ids.\n    header = np.fromfile(file, dtype="<i4", count=256)\n    return int(header[2])\n\n\ndef _load_token_shard(file: Path) -> torch.Tensor:\n    num_tokens = _read_num_tokens(file)\n    tokens = np.fromfile(\n        file, dtype="<u2", count=num_tokens, offset=_SHARD_HEADER_BYTES\n    ).astype(np.int32)\n    return torch.from_numpy(tokens)\n\n\nclass CurriculumSequenceLoader:\n    """Entropy-bucket curriculum sequence loader.\n\n    API:\n        .next_batch(global_tokens, grad_accum_steps) -> (x, y) torch.Tensor pair\n        .prefill(global_tokens, grad_accum_steps[, target_depth][, timeout_s])\n    """\n\n    def __init__(self, h, device: torch.device) -> None:\n        self.h = h\n        self.device = device\n        self.seq_len = h.train_seq_len\n        self.rank = h.rank\n        self.world_size = h.world_size\n        all_files = [Path(p) for p in sorted(glob.glob(h.train_files))]\n        if not all_files:\n            raise FileNotFoundError(f"curriculum: no files for {h.train_files!r}")\n\n        manifest_path = Path(get_manifest_path())\n        if not manifest_path.exists():\n            raise FileNotFoundError(\n 
f"curriculum manifest missing: {manifest_path} "\n "(run submission/final/compute_entropy.py + assign_buckets.py first)",\n )\n manifest = np.load(manifest_path, allow_pickle=True)\n shard_paths = list(manifest["shard_paths"])\n seq_starts = list(manifest["seq_starts"])\n bucket_ids = list(manifest["bucket_ids"])\n self.num_buckets = int(manifest["num_buckets"])\n manifest_seq_len = int(manifest["seq_len"])\n if manifest_seq_len != self.seq_len:\n raise ValueError(\n f"manifest seq_len={manifest_seq_len} != train_seq_len={self.seq_len}",\n )\n\n assigned_paths = all_files[self.rank :: self.world_size]\n by_basename = {p.name: p for p in assigned_paths}\n\n per_bucket: list[list[tuple[Path, int]]] = [[] for _ in range(self.num_buckets)]\n for mpath, mstarts, mbuckets in zip(shard_paths, seq_starts, bucket_ids, strict=True):\n mpath_str = str(mpath)\n basename = Path(mpath_str).name\n if basename not in by_basename:\n continue\n resolved = by_basename[basename]\n starts_arr = np.asarray(mstarts, dtype=np.int64)\n buckets_arr = np.asarray(mbuckets, dtype=np.int8)\n for start, bucket in zip(starts_arr.tolist(), buckets_arr.tolist(), strict=True):\n per_bucket[int(bucket)].append((resolved, int(start)))\n\n if sum(len(b) for b in per_bucket) == 0:\n raise RuntimeError(\n f"curriculum: rank {self.rank}/{self.world_size} has no matching shards",\n )\n\n self._per_bucket = per_bucket\n self._floor_weight = get_bucket_floor_weight()\n self._rng = np.random.Generator(np.random.PCG64(self.rank))\n self._shard_cache: dict[Path, torch.Tensor] = {}\n self._start_time: float | None = None\n self._max_wallclock_seconds = max(1.0, float(h.max_wallclock_seconds))\n print(\n f"[curriculum] rank={self.rank}/{self.world_size} "\n f"buckets={self.num_buckets} total_seqs={sum(len(b) for b in per_bucket)} "\n f"floor={self._floor_weight}",\n flush=True,\n )\n\n def _progress_fraction(self) -> float:\n if self._start_time is None:\n self._start_time = time.monotonic()\n return 0.0\n 
elapsed = time.monotonic() - self._start_time\n return min(1.0, elapsed / self._max_wallclock_seconds)\n\n def _bucket_weights(self, progress: float) -> np.ndarray:\n n = self.num_buckets\n difficulty = np.arange(n, dtype=np.float64) / max(n - 1, 1)\n weights = (1.0 - difficulty) * (1.0 - progress) + difficulty * progress\n has_entries = np.array([len(b) > 0 for b in self._per_bucket], dtype=bool)\n weights = np.where(has_entries, np.maximum(weights, self._floor_weight), 0.0)\n total = float(weights.sum())\n if total <= 0:\n raise RuntimeError("curriculum: all buckets empty for this rank")\n return weights / total\n\n def _get_shard_tokens(self, shard_path: Path) -> torch.Tensor:\n tokens = self._shard_cache.get(shard_path)\n if tokens is None:\n tokens = _load_token_shard(shard_path)\n self._shard_cache[shard_path] = tokens\n return tokens\n\n def _take_sequence(self) -> torch.Tensor:\n weights = self._bucket_weights(self._progress_fraction())\n bucket = int(self._rng.choice(self.num_buckets, p=weights))\n entries = self._per_bucket[bucket]\n idx = int(self._rng.integers(len(entries)))\n shard_path, start = entries[idx]\n tokens = self._get_shard_tokens(shard_path)\n end = start + self.seq_len + 1\n if end > tokens.numel():\n start = max(0, tokens.numel() - self.seq_len - 1)\n end = start + self.seq_len + 1\n return tokens[start:end]\n\n def _build_batch_cpu(self, global_tokens: int, grad_accum_steps: int) -> tuple[torch.Tensor, torch.Tensor]:\n device_tokens = global_tokens // (self.world_size * grad_accum_steps)\n device_batch_size = device_tokens // self.seq_len\n sequences = [self._take_sequence() for _ in range(device_batch_size)]\n stacked = torch.stack(sequences, dim=0).to(dtype=torch.int64)\n pinned = stacked.pin_memory() if self.device.type == "cuda" else stacked\n x = pinned[:, :-1].contiguous()\n y = pinned[:, 1:].contiguous()\n return x, y\n\n def next_batch(self, global_tokens: int, grad_accum_steps: int) -> tuple[torch.Tensor, torch.Tensor]:\n x, y = 
self._build_batch_cpu(global_tokens, grad_accum_steps)\n if self.device.type == "cuda":\n x = x.to(self.device, non_blocking=True)\n y = y.to(self.device, non_blocking=True)\n return x, y\n\n def prefill(self, *args, **kwargs) -> None: # noqa: ARG002\n # The curriculum loader does not use a prefetch thread. prefill is a no-op.\n return None\n\n def prefetch_queue_depth(self) -> int:\n return 0\n' +exec(_idea_source_idea_curriculum_shard, _idea_module_idea_curriculum_shard.__dict__) +_bootstrap_sys.modules["submission.ideas.idea_curriculum_shard"] = _idea_module_idea_curriculum_shard +for _bootstrap_key in list(globals()): + if _bootstrap_key.startswith(("_bootstrap_", "_idea_module_", "_idea_source_")): + del globals()[_bootstrap_key] +del _bootstrap_key +import collections, copy, glob, io, lzma, math, os +from pathlib import Path +import random, re, subprocess, sys, time, uuid, numpy as np, sentencepiece as spm, torch, torch.distributed as dist, torch.nn.functional as F +from torch.nn.parallel import DistributedDataParallel as DDP +from torch import Tensor, nn + +try: + from flash_attn_interface import flash_attn_func as _fa3_raw + + def flash_attn_3_func(q, k, v, causal=True): + return _fa3_raw(q, k, v, causal=causal) +except ImportError: + + def flash_attn_3_func(q, k, v, causal=True): + qt = q.transpose(1, 2) + kt = k.transpose(1, 2) + vt = v.transpose(1, 2) + n_q = qt.size(1) + n_kv = kt.size(1) + if n_q != n_kv: + n_rep = n_q // n_kv + kt = kt.repeat_interleave(n_rep, dim=1) + vt = vt.repeat_interleave(n_rep, dim=1) + return F.scaled_dot_product_attention(qt, kt, vt, is_causal=causal).transpose(1, 2).contiguous() + + +class Hyperparameters: + data_dir = os.environ.get("DATA_DIR", "./data/") + seed = int(os.environ.get("SEED", 1337)) + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_frac = float(os.environ.get("WARMDOWN_FRAC", 0.667)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 
20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) + val_batch_tokens = int(os.environ.get("VAL_BATCH_TOKENS", 524288)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + sliding_window_enabled = bool(int(os.environ.get("SLIDING_WINDOW_ENABLED", "1"))) + vocab_size = int(os.environ.get("VOCAB_SIZE", 8192)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 11)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + embedding_dim = int(os.environ.get("EMBEDDING_DIM", 512)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + mlp_mult = float(os.environ.get("MLP_MULT", 4.0)) + skip_gates_enabled = bool(int(os.environ.get("SKIP_GATES_ENABLED", "1"))) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) + rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + rope_train_seq_len = int(os.environ.get("ROPE_TRAIN_SEQ_LEN", 2048)) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 4.0)) + num_loops = int(os.environ.get("NUM_LOOPS", 2)) + loop_start = int(os.environ.get("LOOP_START", 4)) + loop_end = int(os.environ.get("LOOP_END", 5)) + enable_looping_at = float(os.environ.get("ENABLE_LOOPING_AT", 0.5)) + parallel_residual_start = int(os.environ.get("PARALLEL_RESIDUAL_START", "-1")) + min_lr = float(os.environ.get("MIN_LR", 0.0)) + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + head_lr = float(os.environ.get("HEAD_LR", 0.008)) + tied_embed_lr = 
float(os.environ.get("TIED_EMBED_LR", 0.03)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.02)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.02)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92)) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + muon_row_normalize = bool(int(os.environ.get("MUON_ROW_NORMALIZE", "1"))) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-08)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + muon_beta2 = float(os.environ.get("MUON_BETA2", 0.95)) + adam_wd = float(os.environ.get("ADAM_WD", 0.02)) + muon_wd = float(os.environ.get("MUON_WD", 0.085)) + embed_wd = float(os.environ.get("EMBED_WD", 0.085)) + ema_decay = float(os.environ.get("EMA_DECAY", 0.997)) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "0"))) + ttt_lr = float(os.environ.get("TTT_LR", 0.005)) + ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3)) + ttt_chunk_tokens = int(os.environ.get("TTT_CHUNK_TOKENS", 32768)) + ttt_freeze_blocks = int(os.environ.get("TTT_FREEZE_BLOCKS", 0)) + ttt_momentum = float(os.environ.get("TTT_MOMENTUM", 0.9)) + ttt_batch_seqs = int(os.environ.get("TTT_BATCH_SEQS", 32)) + ttt_grad_clip = float(os.environ.get("TTT_GRAD_CLIP", 1.0)) + prequant_ttt_enabled = bool(int(os.environ.get("PREQUANT_TTT_ENABLED", "0"))) + prequant_ttt_lr = float(os.environ.get("PREQUANT_TTT_LR", 0.00045)) + prequant_ttt_epochs = int(os.environ.get("PREQUANT_TTT_EPOCHS", 8)) + prequant_ttt_freeze_blocks = int(os.environ.get("PREQUANT_TTT_FREEZE_BLOCKS", 1)) + prequant_ttt_batch_seqs = 
int(os.environ.get("PREQUANT_TTT_BATCH_SEQS", 32)) + prequant_ttt_grad_clip = float(os.environ.get("PREQUANT_TTT_GRAD_CLIP", 1.0)) + prequant_ttt_cosine_decay = bool(int(os.environ.get("PREQUANT_TTT_COSINE_DECAY", "1"))) + compressor = os.environ.get("COMPRESSOR", "brotli") + gptq_calibration_batches = int(os.environ.get("GPTQ_CALIBRATION_BATCHES", 64)) + gptq_reserve_seconds = float(os.environ.get("GPTQ_RESERVE_SECONDS", 12.0)) + matrix_bits = int(os.environ.get("MATRIX_BITS", 6)) + embed_bits = int(os.environ.get("EMBED_BITS", 8)) + matrix_clip_sigmas = float(os.environ.get("MATRIX_CLIP_SIGMAS", 12.85)) + embed_clip_sigmas = float(os.environ.get("EMBED_CLIP_SIGMAS", 20.0)) + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + is_main_process = rank == 0 + grad_accum_steps = 8 // world_size + datasets_dir = os.path.join(data_dir, "datasets", f"fineweb10B_sp{vocab_size}") + train_files = os.path.join(datasets_dir, "fineweb_train_*.bin") + val_files = os.path.join(datasets_dir, "fineweb_val_*.bin") + tokenizer_path = os.path.join(data_dir, "tokenizers", f"fineweb_{vocab_size}_bpe.model") + logfile = f"logs/{run_id}.txt" + model_path = os.environ.get("MODEL_PATH", "final_model.pt") + quantized_model_path = os.environ.get("QUANTIZED_MODEL_PATH", "final_model.int6.ptz") + + +_logger_hparams = None + + +def set_logging_hparams(h): + global _logger_hparams + _logger_hparams = h + + +def log(msg, console=True): + if _logger_hparams is None: + print(msg) + return + if _logger_hparams.is_main_process: + if console: + print(msg) + if _logger_hparams.logfile is not None: + with open(_logger_hparams.logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + + +class ValidationData: + def __init__(self, h, device): + self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) + if int(self.sp.vocab_size()) 
!= h.vocab_size: + raise ValueError( + f"VOCAB_SIZE={h.vocab_size} does not match tokenizer vocab_size={int(self.sp.vocab_size())}" + ) + self.val_tokens = load_validation_tokens(h.val_files, h.eval_seq_len) + self.base_bytes_lut, self.has_leading_space_lut, self.is_boundary_token_lut = build_sentencepiece_luts( + self.sp, h.vocab_size, device + ) + self.boundary_mask = None + self.pmi_matrix = None + + +def build_sentencepiece_luts(sp, vocab_size, device): + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("▁"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) + + +def load_validation_tokens(pattern, seq_len): + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = (tokens.numel() - 1) // seq_len * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] + + +def load_data_shard(file): + header_bytes = 256 * np.dtype(" 0 else 0 + num_sequences = (self.num_tokens[si] - 1 - phase) // self.seq_len + sequence_order = 
self.rng.permutation(num_sequences) + self.start_inds[si] = (phase + sequence_order * self.seq_len).tolist() + + def _build_batch_cpu(self, global_tokens, grad_accum_steps): + """Build one (x, y) batch on CPU. Returns pinned tensors if + PREFETCH_PIN_MEMORY=1 (default). Thread-safe for single-worker use.""" + device_tokens = global_tokens // (self.world_size * grad_accum_steps) + device_batch_size = device_tokens // self.seq_len + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + if self._prefetch_use_pinned: + x = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64, pin_memory=True) + y = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64, pin_memory=True) + else: + x = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + y = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + for bi in range(device_batch_size): + total = remaining.sum() + if total <= 0: + for si in range(len(self.files)): + self._reset_shard(si) + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + total = remaining.sum() + probs = remaining / total + si = int(self.rng.choice(len(self.files), p=probs)) + start_ind = self.start_inds[si].pop() + remaining[si] -= 1 + mm = _get_shard_memmap(self.files[si]) + window = torch.as_tensor(np.array(mm[start_ind : start_ind + self.seq_len + 1], dtype=np.int64)) + x[bi] = window[:-1] + y[bi] = window[1:] + return (x, y) + + def _prefetch_worker(self): + """Background daemon thread: loops forever, pushing batches into the + queue. 
Any exception is surfaced to the main thread via a sentinel + tuple ('__ERROR__', exc).""" + try: + while True: + x, y = self._build_batch_cpu(*self._prefetch_args) + self._prefetch_queue.put((x, y)) + except Exception as exc: + try: + self._prefetch_queue.put(("__ERROR__", exc)) + except Exception: + pass + + def _ensure_prefetch_started(self, global_tokens, grad_accum_steps): + if self._prefetch_queue is not None: + return + import queue as _queue + import threading as _threading + + self._prefetch_queue = _queue.Queue(maxsize=self._prefetch_depth) + self._prefetch_args = (global_tokens, grad_accum_steps) + self._prefetch_thread = _threading.Thread( + target=self._prefetch_worker, daemon=True, name="ShuffledSequenceLoader-prefetch" + ) + self._prefetch_thread.start() + print(f"[prefetch] daemon started: depth={self._prefetch_depth} pinned={self._prefetch_use_pinned}", flush=True) + + def next_batch(self, global_tokens, grad_accum_steps): + if self._use_prefetch: + self._ensure_prefetch_started(global_tokens, grad_accum_steps) + if self._prefetch_queue.empty(): + self._prefetch_stats["queue_waits_empty"] += 1 + item = self._prefetch_queue.get() + if isinstance(item, tuple) and len(item) >= 1 and (item[0] == "__ERROR__"): + raise item[1] + x, y = item + self._prefetch_stats["batches_served"] += 1 + return (x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)) + device_tokens = global_tokens // (self.world_size * grad_accum_steps) + device_batch_size = device_tokens // self.seq_len + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + x = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + y = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + for bi in range(device_batch_size): + total = remaining.sum() + if total <= 0: + for si in range(len(self.files)): + self._reset_shard(si) + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + total = remaining.sum() + probs 
= remaining / total + si = int(self.rng.choice(len(self.files), p=probs)) + start_ind = self.start_inds[si].pop() + remaining[si] -= 1 + mm = _get_shard_memmap(self.files[si]) + window = torch.as_tensor(np.array(mm[start_ind : start_ind + self.seq_len + 1], dtype=np.int64)) + x[bi] = window[:-1] + y[bi] = window[1:] + return (x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)) + + def prefetch_queue_depth(self): + """Current depth of the prefetch queue (for telemetry). Returns -1 if + prefetch is disabled.""" + if self._prefetch_queue is None: + return -1 + return self._prefetch_queue.qsize() + + def prefill(self, global_tokens, grad_accum_steps, target_depth=None, timeout_s=120.0): + """Pre-fill the prefetch queue during pretime so training starts with + a full queue. Front-loads CPU work into pretime (free) so the CPU is + nearly idle during the 600s training budget (available for metric + logging, optimizer offload, async checkpoint writes, etc.). + + Blocks until the queue has `target_depth` batches, or until timeout. + Only runs if USE_PREFETCH_LOADER=1. + + Env var override: PREFETCH_PREFILL_BATCHES (default = PREFETCH_DEPTH). 
+ """ + if not self._use_prefetch: + print("[prefetch] prefill: USE_PREFETCH_LOADER=0, skipping", flush=True) + return + if target_depth is None: + target_depth = int(os.environ.get("PREFETCH_PREFILL_BATCHES", str(self._prefetch_depth))) + target_depth = min(target_depth, self._prefetch_depth) + self._ensure_prefetch_started(global_tokens, grad_accum_steps) + import time as _time + + t0 = _time.perf_counter() + last_log = t0 + print( + f"[prefetch] prefill: target_depth={target_depth}, maxsize={self._prefetch_depth}, timeout={timeout_s}s", + flush=True, + ) + while True: + current = self._prefetch_queue.qsize() + if current >= target_depth: + elapsed = _time.perf_counter() - t0 + print(f"[prefetch] prefill: reached depth {current}/{target_depth} in {elapsed:.2f}s", flush=True) + return + elapsed = _time.perf_counter() - t0 + if elapsed >= timeout_s: + print(f"[prefetch] prefill: TIMEOUT at depth {current}/{target_depth} after {elapsed:.1f}s", flush=True) + return + if _time.perf_counter() - last_log > 5.0: + print(f"[prefetch] prefill progress: {current}/{target_depth} at {elapsed:.1f}s", flush=True) + last_log = _time.perf_counter() + _time.sleep(0.1) + + +def _make_shard_loader(h, device): + """Dispatch shard loading: curriculum if USE_CURRICULUM_SHARD=1, else + the standard ShuffledSequenceLoader (rank-partitioned shuffle).""" + try: + from submission.ideas.idea_curriculum_shard import ( + CurriculumSequenceLoader as _CurriculumShard, + is_enabled as _curriculum_enabled, + ) + + if _curriculum_enabled(): + return _CurriculumShard(h, device) + except Exception: + pass + return ShuffledSequenceLoader(h, device) + + +class RMSNorm(nn.Module): + def __init__(self, eps=None): + super().__init__() + self.eps = eps + + def forward(self, x): + return F.rms_norm(x, (x.size(-1),), eps=self.eps) + + +class CastedLinear(nn.Linear): + def forward(self, x): + w = self.weight.to(x.dtype) + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, 
bias) + + +class Rotary(nn.Module): + def __init__(self, dim, base=10000.0, train_seq_len=1024, rope_dims=0): + super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / base ** (torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims) + self.register_buffer("inv_freq", inv_freq, persistent=False) + t = torch.arange(self.train_seq_len, dtype=torch.float32) + freqs = torch.outer(t, inv_freq) + self.register_buffer("_cos_pre", freqs.cos()[None, :, None, :], persistent=False) + self.register_buffer("_sin_pre", freqs.sin()[None, :, None, :], persistent=False) + self._max_pre_seq_len = self.train_seq_len + self._seq_len_cached = 0 + self._cos_cached = None + self._sin_cached = None + + def forward(self, seq_len, device, dtype): + if seq_len <= self._max_pre_seq_len: + return (self._cos_pre[:, :seq_len].to(dtype=dtype), self._sin_pre[:, :seq_len].to(dtype=dtype)) + if self._cos_cached is None or self._seq_len_cached != seq_len or self._cos_cached.device != device: + rd = self.rope_dims + scale = seq_len / self.train_seq_len + new_base = self.base * scale ** (rd / (rd - 2)) + inv_freq = 1.0 / new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd) + t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) + freqs = torch.outer(t, inv_freq) + self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return (self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype)) + + +def apply_rotary_emb(x, cos, sin, rope_dims=0): + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = (x[..., :rope_dims], x[..., rope_dims:]) + half = rope_dims // 2 + x1, x2 = (x_rope[..., :half], x_rope[..., half:]) + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * -sin + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 
+ x1, x2 = (x[..., :half], x[..., half:]) + return torch.cat((x1 * cos + x2 * sin, x1 * -sin + x2 * cos), dim=-1) + + +class CausalSelfAttention(nn.Module): + def __init__(self, dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + kv_dim = self.num_kv_heads * self.head_dim + self.c_q = CastedLinear(dim, dim, bias=False) + self.c_k = CastedLinear(dim, kv_dim, bias=False) + self.c_v = CastedLinear(dim, kv_dim, bias=False) + self.proj = CastedLinear(dim, dim, bias=False) + self.proj._zero_init = True + self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) + self.rope_dims = 0 + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=train_seq_len) + self.use_xsa = False + self.gate_proj = CastedLinear(dim, num_heads, bias=True) + with torch.no_grad(): + self.gate_proj.weight.zero_() + if self.gate_proj.bias is not None: + self.gate_proj.bias.fill_(2.94) + self.use_gated_attention = bool(int(os.environ.get("USE_GATED_ATTENTION", "0"))) + + def _xsa_efficient(self, y, v): + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) + vn = F.normalize(v, dim=-1).unsqueeze(-2) + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + + def forward(self, x): + bsz, seqlen, dim = x.shape + q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + q = F.rms_norm(q, (q.size(-1),)) + k = 
F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + if self.use_gated_attention: + gate = torch.sigmoid(self.gate_proj(x).float()).to(dtype=y.dtype) + y = y * gate.unsqueeze(-1) + y = y.reshape(bsz, seqlen, dim) + return self.proj(y) + + +class MLP(nn.Module): + def __init__(self, dim, mlp_mult): + super().__init__() + hidden = int(mlp_mult * dim) + self.fc = CastedLinear(dim, hidden, bias=False) + self.proj = CastedLinear(hidden, dim, bias=False) + self.proj._zero_init = True + self.use_norm_pct_dropout = bool(int(os.environ.get("USE_NORM_PCT_DROPOUT", "0"))) + self.norm_pct_thresh = float(os.environ.get("NORM_PCT_THRESH", "0.99")) + + def forward(self, x): + x = F.leaky_relu(self.fc(x), negative_slope=0.5).square() + if self.training and self.use_norm_pct_dropout: + orig_shape = x.shape + x_flat = x.reshape(-1, orig_shape[-1]) + row_norms = x_flat.float().norm(dim=-1) + kth = torch.quantile(row_norms, self.norm_pct_thresh) + keep = (row_norms < kth).to(dtype=x.dtype).unsqueeze(-1) + x_flat = x_flat * keep + x = x_flat.reshape(orig_shape) + return self.proj(x) + + +class Block(nn.Module): + def __init__( + self, + dim, + num_heads, + num_kv_heads, + mlp_mult, + rope_base, + qk_gain_init, + train_seq_len, + layer_idx=0, + ln_scale=False, + ): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), 
torch.zeros(dim))).float()) + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + self.layer_idx = layer_idx + self._parallel_residuals = bool(int(os.environ.get("USE_PARALLEL_RESIDUALS", "0"))) or ( + int(os.environ.get("PARALLEL_RESIDUAL_START", "-1")) >= 0 + and layer_idx >= int(os.environ.get("PARALLEL_RESIDUAL_START", "-1")) + ) + + def forward(self, x, x0): + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out = self.attn(self.attn_norm(x_in) * self.ln_scale_factor) + if self._parallel_residuals: + mlp_out = self.mlp(self.mlp_norm(x_in) * self.ln_scale_factor) + return ( + x_in + + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + + self.mlp_scale.to(dtype=x_in.dtype)[None, None, :] * mlp_out + ) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp( + self.mlp_norm(x_out) * self.ln_scale_factor + ) + return x_out + + def forward_attn(self, x, x0): + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out = self.attn(self.attn_norm(x_in) * self.ln_scale_factor) + return x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + + def forward_mlp(self, x): + return x + self.mlp_scale.to(dtype=x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x) * self.ln_scale_factor) + + +class GPT(nn.Module): + def __init__(self, h): + super().__init__() + if h.logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {h.logit_softcap}") + self.tie_embeddings = h.tie_embeddings + self.tied_embed_init_std = h.tied_embed_init_std + self.logit_softcap = h.logit_softcap + self.tok_emb = nn.Embedding(h.vocab_size, h.embedding_dim) + if h.embedding_dim != h.model_dim: + self.embed_proj = CastedLinear(h.embedding_dim, h.model_dim, bias=False) + self.head_proj = CastedLinear(h.model_dim, 
h.embedding_dim, bias=False) + else: + self.embed_proj = None + self.head_proj = None + self.num_encoder_layers = h.num_layers // 2 + self.num_decoder_layers = h.num_layers - self.num_encoder_layers + _per_layer_mlp = [h.mlp_mult] * h.num_layers + self.blocks = nn.ModuleList( + [ + Block( + h.model_dim, + h.num_heads, + h.num_kv_heads, + _per_layer_mlp[i], + h.rope_base, + h.qk_gain_init, + h.train_seq_len, + layer_idx=i, + ln_scale=h.ln_scale, + ) + for i in range(h.num_layers) + ] + ) + self._huffman_remapper = None + self._smeargate = None + self._bigramhash_learned = None + self._fused_ops = None + self._fused_int6 = None + if h.rope_dims > 0: + head_dim = h.model_dim // h.num_heads + for block in self.blocks: + block.attn.rope_dims = h.rope_dims + block.attn.rotary = Rotary( + head_dim, base=h.rope_base, train_seq_len=h.train_seq_len, rope_dims=h.rope_dims + ) + self.final_norm = RMSNorm() + self.lm_head = None if h.tie_embeddings else CastedLinear(h.embedding_dim, h.vocab_size, bias=False) + if self.lm_head is not None: + self.lm_head._zero_init = True + if h.xsa_last_n > 0: + for i in range(max(0, h.num_layers - h.xsa_last_n), h.num_layers): + self.blocks[i].attn.use_xsa = True + self.looping_active = False + if h.num_loops > 0: + loop_seg = list(range(h.loop_start, h.loop_end + 1)) + all_indices = list(range(h.loop_start)) + for _ in range(h.num_loops + 1): + all_indices.extend(loop_seg) + all_indices.extend(range(h.loop_end + 1, h.num_layers)) + num_enc = len(all_indices) // 2 + self.encoder_indices = all_indices[:num_enc] + self.decoder_indices = all_indices[num_enc:] + else: + self.encoder_indices = list(range(self.num_encoder_layers)) + self.decoder_indices = list(range(self.num_encoder_layers, h.num_layers)) + self.num_skip_weights = min(len(self.encoder_indices), len(self.decoder_indices)) + self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, h.model_dim, dtype=torch.float32)) + self.skip_gates = ( + 
nn.Parameter(torch.zeros(self.num_skip_weights, h.model_dim, dtype=torch.float32)) + if h.skip_gates_enabled + else None + ) + self.parallel_start_layer = int(os.environ.get("PARALLEL_START_LAYER", "7")) + self.lane_merge = nn.Parameter(torch.tensor(0.5)) if self.parallel_start_layer > 0 else None + self._ngram_enabled = bool(int(os.environ.get("USE_NGRAM_BIAS", "0"))) + self._ngram_w_bigram = float(os.environ.get("NGRAM_W_BIGRAM", "0.20")) + self._ngram_w_trigram = float(os.environ.get("NGRAM_W_TRIGRAM", "0.15")) + self._ngram_w_fourgram = float(os.environ.get("NGRAM_W_FOURGRAM", "0.10")) + self._ngram_hash = int(os.environ.get("NGRAM_HASH_BUCKETS", "16384")) + self._ngram_backoff = bool(int(os.environ.get("USE_NGRAM_BACKOFF", "0"))) + self._ngram_backoff_t4 = float(os.environ.get("NGRAM_BACKOFF_THRESH4", "1.0")) + self._ngram_backoff_t3 = float(os.environ.get("NGRAM_BACKOFF_THRESH3", "1.0")) + self._ngram_backoff_alpha = float(os.environ.get("NGRAM_BACKOFF_ALPHA", "0.4")) + self._nlfi_enabled = bool(int(os.environ.get("USE_NGR_LOG_FREQ_INV", "0"))) + self._nlfi_applied = False + self.register_buffer("_nlfi_bigram_mult", torch.ones(self._ngram_hash, dtype=torch.float32), persistent=False) + self.register_buffer("_nlfi_trigram_mult", torch.ones(self._ngram_hash, dtype=torch.float32), persistent=False) + self.register_buffer("_nlfi_fourgram_mult", torch.ones(self._ngram_hash, dtype=torch.float32), persistent=False) + self.register_buffer("_nlfi_stored_flag", torch.zeros(1, dtype=torch.int64), persistent=False) + self._ctx_part_tab_enabled = bool(int(os.environ.get("USE_CTX_PARTITIONED_TAB", "0"))) + self._ctx_part_slices = int(os.environ.get("CTX_PARTITION_SLICES", "16")) + self.register_buffer("_bigram_tab", torch.zeros(1, dtype=torch.float32), persistent=False) + self.register_buffer("_trigram_tab", torch.zeros(1, dtype=torch.float32), persistent=False) + self.register_buffer("_fourgram_tab", torch.zeros(1, dtype=torch.float32), persistent=False) + if 
self._ngram_enabled: + vs = h.vocab_size + _ngram_bf16 = bool(int(os.environ.get("USE_NGRAM_BF16", "0"))) + _ngram_dtype = torch.bfloat16 if _ngram_bf16 else torch.float32 + _ngram_bigram_only = bool(int(os.environ.get("USE_NGRAM_BIGRAM_ONLY", "0"))) + for tab_attr, fname, label in [ + ("_bigram_tab", f"data/bigram_tab_{vs}v.npy", "bigram"), + ("_trigram_tab", f"data/trigram_logprobs_{vs}v.npy", "trigram"), + ("_fourgram_tab", f"data/fourgram_logprobs_{vs}v.npy", "fourgram"), + ]: + if _ngram_bigram_only and tab_attr != "_bigram_tab": + print(f"NGRAM_BIAS: {label} SKIPPED (USE_NGRAM_BIGRAM_ONLY=1)", flush=True) + continue + try: + _arr = np.load(fname) + _tab = torch.from_numpy(_arr).to(dtype=_ngram_dtype) + setattr(self, tab_attr, _tab) + print(f"NGRAM_BIAS: loaded {label} {_arr.shape} from {fname} dtype={_ngram_dtype}", flush=True) + except Exception as _e: + print(f"NGRAM_BIAS: {label} load failed ({fname}): {_e}", flush=True) + self._init_weights() + + def _init_weights(self): + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif module.weight.ndim == 2 and module.weight.shape[0] >= 64 and (module.weight.shape[1] >= 64): + nn.init.orthogonal_(module.weight, gain=1.0) + + def forward_logits(self, input_ids): + x = self.tok_emb(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + if self.embed_proj is not None: + x = self.embed_proj(x) + if getattr(self, "_smeargate", None) is not None: + x = self._smeargate(x) + x0 = x + skips = [] + enc_iter = self.encoder_indices if self.looping_active else range(self.num_encoder_layers) + dec_iter = ( + self.decoder_indices + if self.looping_active + else range(self.num_encoder_layers, self.num_encoder_layers + self.num_decoder_layers) + ) + for i in enc_iter: + x = self.blocks[i](x, x0) + skips.append(x) + psl = 
self.parallel_start_layer + lane0 = None + lane1 = None + for skip_idx, i in enumerate(dec_iter): + if lane0 is None: + if skip_idx < self.num_skip_weights and skips: + scaled_skip = self.skip_weights[skip_idx].to(dtype=x.dtype)[None, None, :] * skips.pop() + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=x.dtype))[None, None, :] + x = torch.lerp(scaled_skip, x, g) + else: + x = x + scaled_skip + if i >= psl and psl > 0: + lane0 = x + lane1 = x.clone() + lane0 = self.blocks[i].forward_attn(lane0, x0) + lane1 = self.blocks[i].forward_mlp(lane1) + else: + x = self.blocks[i](x, x0) + else: + if skip_idx < self.num_skip_weights and skips: + scaled_skip = self.skip_weights[skip_idx].to(dtype=lane0.dtype)[None, None, :] * skips.pop() + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=lane0.dtype))[None, None, :] + lane0 = torch.lerp(scaled_skip, lane0, g) + else: + lane0 = lane0 + scaled_skip + lane0 = self.blocks[i].forward_attn(lane0, x0) + lane1 = self.blocks[i].forward_mlp(lane1) + if lane0 is not None: + lm = self.lane_merge.to(dtype=lane0.dtype) + x = lm * lane0 + (1.0 - lm) * lane1 + x = self.final_norm(x) + if self.head_proj is not None: + x = self.head_proj(x) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + if getattr(self, "_bigramhash_learned", None) is not None: + logits = logits + self._bigramhash_learned(input_ids).to(dtype=logits.dtype) + if self._ngram_enabled and self._bigram_tab.numel() > 1: + B, S = input_ids.shape + _zeros1 = torch.zeros(B, 1, device=input_ids.device, dtype=input_ids.dtype) + _zeros2 = torch.zeros(B, 2, device=input_ids.device, dtype=input_ids.dtype) + _ids_flat = input_ids.reshape(-1).long() + _prev2 = torch.cat([_zeros1, input_ids[:, :-1]], dim=1).reshape(-1).long() + _prev3 = torch.cat([_zeros2, input_ids[:, 
:-2]], dim=1).reshape(-1).long() + H = self._ngram_hash + _h_bi = _ids_flat * 36313 % H + if self._ctx_part_tab_enabled: + _S_slices = self._ctx_part_slices + _zone = _ids_flat % _S_slices * (H // _S_slices) + _h_bi = (_h_bi + _zone) % H + _h_tri = (_prev2 * 36313 + _ids_flat * 27191) % H + _h_four = (_prev3 * 36313 + _prev2 * 27191 + _ids_flat * 51497) % H + _bi = self._bigram_tab[_h_bi].reshape(B, S, -1) + if self._ngram_backoff and self._trigram_tab.numel() > 1 and (self._fourgram_tab.numel() > 1): + _tri = self._trigram_tab[_h_tri].reshape(B, S, -1) + _four = self._fourgram_tab[_h_four].reshape(B, S, -1) + _peak4 = _four.amax(dim=-1, keepdim=True) + _peak3 = _tri.amax(dim=-1, keepdim=True) + _use_4 = (_peak4 > self._ngram_backoff_t4).to(_four.dtype) + _use_3 = (1 - _use_4) * (_peak3 > self._ngram_backoff_t3).to(_tri.dtype) + _use_bi = 1 - _use_4 - _use_3 + _alpha = self._ngram_backoff_alpha + _ng = _use_4 * _four + _use_3 * _tri * _alpha + _use_bi * _bi * (_alpha * _alpha) + logits = logits + _ng.to(dtype=logits.dtype) + else: + _bias = self._ngram_w_bigram * _bi + if self._trigram_tab.numel() > 1: + _bias = _bias + self._ngram_w_trigram * self._trigram_tab[_h_tri].reshape(B, S, -1) + if self._fourgram_tab.numel() > 1: + _bias = _bias + self._ngram_w_fourgram * self._fourgram_tab[_h_four].reshape(B, S, -1) + logits = logits + _bias.to(dtype=logits.dtype) + return logits + + @torch.no_grad() + def _apply_nlfi_once(self, input_ids): + if self._nlfi_applied or not self._nlfi_enabled: + return + if not (self._ngram_enabled and self._bigram_tab.numel() > 1): + return + try: + _ids_flat = input_ids.reshape(-1).long() + H = self._ngram_hash + if int(self._nlfi_stored_flag.item()) == 1: + _bg_mult = self._nlfi_bigram_mult + _tg_mult = self._nlfi_trigram_mult + _fg_mult = self._nlfi_fourgram_mult + print("NGR_LOG_FREQ_INV: restored multipliers from state_dict", flush=True) + else: + _bg_h_init = _ids_flat * 36313 % H + _bg_counts = torch.zeros(H, dtype=torch.float32, 
device=_ids_flat.device) + _bg_counts.scatter_add_(0, _bg_h_init, torch.ones_like(_bg_h_init, dtype=torch.float32)) + _bg_mult = 1.0 / torch.log(2.0 + _bg_counts) + _tg_h_init = (_ids_flat * 36313 ^ _ids_flat * 39979 >> 1) % H + _tg_counts = torch.zeros(H, dtype=torch.float32, device=_ids_flat.device) + _tg_counts.scatter_add_(0, _tg_h_init, torch.ones_like(_tg_h_init, dtype=torch.float32)) + _tg_mult = 1.0 / torch.log(2.0 + _tg_counts) + _fg_h_init = (_ids_flat * 36313 ^ _ids_flat * 39979 >> 1 ^ _ids_flat * 41077 >> 2) % H + _fg_counts = torch.zeros(H, dtype=torch.float32, device=_ids_flat.device) + _fg_counts.scatter_add_(0, _fg_h_init, torch.ones_like(_fg_h_init, dtype=torch.float32)) + _fg_mult = 1.0 / torch.log(2.0 + _fg_counts) + self._nlfi_bigram_mult.data = _bg_mult.detach().to(self._nlfi_bigram_mult.dtype) + self._nlfi_trigram_mult.data = _tg_mult.detach().to(self._nlfi_trigram_mult.dtype) + self._nlfi_fourgram_mult.data = _fg_mult.detach().to(self._nlfi_fourgram_mult.dtype) + self._nlfi_stored_flag.data = torch.ones(1, dtype=torch.int64, device=self._nlfi_stored_flag.device) + print("NGR_LOG_FREQ_INV: computed + saved multipliers from current batch", flush=True) + if self._bigram_tab.numel() > 1: + if self._bigram_tab.dim() == 2: + self._bigram_tab.mul_(_bg_mult.to(self._bigram_tab.dtype).unsqueeze(1)) + else: + self._bigram_tab.mul_(_bg_mult.to(self._bigram_tab.dtype)) + if self._trigram_tab.numel() > 1: + if self._trigram_tab.dim() == 2: + self._trigram_tab.mul_(_tg_mult.to(self._trigram_tab.dtype).unsqueeze(1)) + else: + self._trigram_tab.mul_(_tg_mult.to(self._trigram_tab.dtype)) + if self._fourgram_tab.numel() > 1: + if self._fourgram_tab.dim() == 2: + self._fourgram_tab.mul_(_fg_mult.to(self._fourgram_tab.dtype).unsqueeze(1)) + else: + self._fourgram_tab.mul_(_fg_mult.to(self._fourgram_tab.dtype)) + print("NGR_LOG_FREQ_INV: applied mutation to n-gram tables (one-time per process)", flush=True) + except Exception as _e: + print(f"NGR_LOG_FREQ_INV: 
mutation failed ({_e})", flush=True) + self._nlfi_applied = True + + def forward(self, input_ids, target_ids): + logits = self.forward_logits(input_ids) + return F.cross_entropy(logits.reshape(-1, logits.size(-1)).float(), target_ids.reshape(-1), reduction="mean") + + +def classify_param(name): + if "tok_emb" in name or "lm_head" in name: + return "embed" + if ".mlp." in name: + return "mlp" + if ".attn." in name or (".proj." in name and ".mlp." not in name): + return "attn" + return "other" + + +@torch.compile +def zeropower_via_newtonschulz5(G, steps=10, eps=1e-07): + a, b, c = (3.4445, -4.775, 2.0315) + X = G.bfloat16() + if X.dim() == 2: + X = X / (X.norm() + eps) + else: + X = X / (X.flatten(start_dim=-2).norm(dim=-1, keepdim=True).unsqueeze(-1) + eps) + transposed = X.size(-2) > X.size(-1) + if transposed: + X = X.transpose(-2, -1) + for _ in range(steps): + A = X @ X.transpose(-2, -1) + B = b * A + c * A @ A + X = a * X + B @ X + return X.transpose(-2, -1) if transposed else X + + +class Muon(torch.optim.Optimizer): + def __init__(self, params, lr, momentum, backend_steps, nesterov=True, weight_decay=0.0, row_normalize=False): + super().__init__( + params, + dict( + lr=lr, + momentum=momentum, + backend_steps=backend_steps, + nesterov=nesterov, + weight_decay=weight_decay, + row_normalize=row_normalize, + ), + ) + + @torch.no_grad() + def step(self, closure=None): + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + distributed = dist.is_available() and dist.is_initialized() + world_size = dist.get_world_size() if distributed else 1 + rank = dist.get_rank() if distributed else 0 + _use_parallel_muon = int(os.environ.get("USE_PARALLEL_MUON", "0")) + _use_normuon = int(os.environ.get("USE_NORMUON", "0")) + for group in self.param_groups: + params = group["params"] + if not params: + continue + lr = group["lr"] + momentum = group["momentum"] + backend_steps = group["backend_steps"] + nesterov = group["nesterov"] + 
total_params = sum((int(p.numel()) for p in params)) + updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16) + _offsets = [0] + for p in params: + _offsets.append(_offsets[-1] + p.numel()) + if _use_parallel_muon: + _shape_groups = {} + for i, p in enumerate(params): + if i % world_size != rank: + continue + if p.grad is None: + continue + sh = tuple(p.grad.shape) + _shape_groups.setdefault(sh, []).append((i, p)) + for sh, grp in _shape_groups.items(): + _grads = [] + for i, p in grp: + g = p.grad + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + buf.mul_(momentum).add_(g) + if nesterov: + g = g.add(buf, alpha=momentum) + if group.get("row_normalize", False): + _rn = g.float().norm(dim=-1, keepdim=True).clamp_min(1e-07) + g = g / _rn.to(g.dtype) + _grads.append(g) + _stacked = torch.stack(_grads, dim=0) + _result = zeropower_via_newtonschulz5(_stacked, steps=backend_steps) + for _bi, (i, p) in enumerate(grp): + g = _result[_bi] + if _use_normuon: + _post_norm = g.float().norm(dim=-1, keepdim=True).clamp(min=1e-08) + g = g / _post_norm.to(g.dtype) + g = g * max(1, g.size(0) / g.size(1)) ** 0.5 + updates_flat[_offsets[i] : _offsets[i + 1]] = g.reshape(-1) + else: + curr = 0 + for i, p in enumerate(params): + if i % world_size == rank and p.grad is not None: + g = p.grad + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + buf.mul_(momentum).add_(g) + if nesterov: + g = g.add(buf, alpha=momentum) + if group.get("row_normalize", False): + row_norms = g.float().norm(dim=-1, keepdim=True).clamp_min(1e-07) + g = g / row_norms.to(g.dtype) + g = zeropower_via_newtonschulz5(g, steps=backend_steps) + if _use_normuon: + _post_norm = g.float().norm(dim=-1, keepdim=True).clamp(min=1e-08) + g = g / _post_norm.to(g.dtype) + g *= max(1, g.size(0) / g.size(1)) ** 
0.5 + updates_flat[curr : curr + p.numel()] = g.reshape(-1) + curr += p.numel() + if distributed: + dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM) + wd = group.get("weight_decay", 0.0) + curr = 0 + for p in params: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) + g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype) + p.add_(g, alpha=-lr) + curr += p.numel() + return loss + + +CONTROL_TENSOR_NAME_PATTERNS = tuple( + ( + pattern + for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,skip_gates,lane_merge", + ).split(",") + if pattern + ) +) + + +class Optimizers: + def __init__(self, h, base_model): + block_named_params = list(base_model.blocks.named_parameters()) + matrix_params = [ + p + for name, p in block_named_params + if p.ndim == 2 and (not any((pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS))) + ] + scalar_params = [ + p + for name, p in block_named_params + if p.ndim < 2 or any((pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) + ] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + if base_model.skip_gates is not None and base_model.skip_gates.numel() > 0: + scalar_params.append(base_model.skip_gates) + if base_model.lane_merge is not None: + scalar_params.append(base_model.lane_merge) + token_lr = h.tied_embed_lr if h.tie_embeddings else h.embed_lr + tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}] + self.optimizer_tok = torch.optim.AdamW( + tok_params, betas=(h.beta1, h.beta2), eps=h.adam_eps, weight_decay=h.embed_wd, fused=True + ) + self.optimizer_muon = Muon( + matrix_params, + lr=h.matrix_lr, + momentum=h.muon_momentum, + backend_steps=h.muon_backend_steps, + weight_decay=h.muon_wd, + row_normalize=h.muon_row_normalize, + ) + for group in self.optimizer_muon.param_groups: + group["base_lr"] = h.matrix_lr + 
self.optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": h.scalar_lr, "base_lr": h.scalar_lr}], + betas=(h.beta1, h.beta2), + eps=h.adam_eps, + weight_decay=h.adam_wd, + fused=True, + ) + self.optimizers = [self.optimizer_tok, self.optimizer_muon, self.optimizer_scalar] + if base_model.lm_head is not None: + self.optimizer_head = torch.optim.Adam( + [{"params": [base_model.lm_head.weight], "lr": h.head_lr, "base_lr": h.head_lr}], + betas=(h.beta1, h.beta2), + eps=h.adam_eps, + fused=True, + ) + self.optimizers.insert(1, self.optimizer_head) + else: + self.optimizer_head = None + + def __iter__(self): + return iter(self.optimizers) + + def zero_grad_all(self): + for opt in self.optimizers: + opt.zero_grad(set_to_none=True) + + def step(self): + for opt in self.optimizers: + opt.step() + self.zero_grad_all() + + +def restore_fp32_params(model): + for module in model.modules(): + if isinstance(module, CastedLinear): + module.float() + for name, param in model.named_parameters(): + if ( + param.ndim < 2 or any((pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) + ) and param.dtype != torch.float32: + param.data = param.data.float() + + +def collect_hessians(model, train_loader, h, device, n_calibration_batches=64): + hessians = {} + hooks = [] + + def make_hook(name): + + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + if name not in hessians: + hessians[name] = torch.zeros(x.shape[1], x.shape[1], dtype=torch.float32, device=device) + hessians[name].addmm_(x.T, x) + + return hook_fn + + for name, module in model.named_modules(): + if isinstance(module, CastedLinear) and module.weight.numel() > 65536: + cat = classify_param(name + ".weight") + if cat in ("mlp", "attn"): + hooks.append(module.register_forward_hook(make_hook(name + ".weight"))) + if model.tie_embeddings: + hook_module = model.head_proj if model.head_proj is not None else model.final_norm + + def 
make_output_hook(name): + + def hook_fn(module, inp, out): + x = out.detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + if name not in hessians: + hessians[name] = torch.zeros(x.shape[1], x.shape[1], dtype=torch.float32, device=device) + hessians[name].addmm_(x.T, x) + + return hook_fn + + hooks.append(hook_module.register_forward_hook(make_output_hook("tok_emb.weight"))) + model.eval() + with torch.no_grad(): + for _ in range(n_calibration_batches): + x, _ = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps) + model.forward_logits(x) + for hook in hooks: + hook.remove() + for name in hessians: + hessians[name] = hessians[name].cpu() / n_calibration_batches + return hessians + + +def gptq_quantize_weight(w, H, clip_sigmas=3.0, clip_range=63, block_size=128): + W_orig = w.float().clone() + rows, cols = W_orig.shape + H = H.float().clone() + dead = torch.diag(H) == 0 + H[dead, dead] = 1 + damp = 0.01 * H.diag().mean() + H.diagonal().add_(damp) + perm = torch.argsort(H.diag(), descending=True) + invperm = torch.argsort(perm) + W_perm = W_orig[:, perm].clone() + W_perm[:, dead[perm]] = 0 + H = H[perm][:, perm] + Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H)) + Hinv = torch.linalg.cholesky(Hinv, upper=True) + row_std = W_orig.std(dim=1) + s = (clip_sigmas * row_std / clip_range).clamp_min(1e-10).to(torch.float16) + sf = s.float() + Q = torch.zeros(rows, cols, dtype=torch.int8) + W_work = W_perm.clone() + for i1 in range(0, cols, block_size): + i2 = min(i1 + block_size, cols) + W_block = W_work[:, i1:i2].clone() + Hinv_block = Hinv[i1:i2, i1:i2] + Err = torch.zeros(rows, i2 - i1) + for j in range(i2 - i1): + w_col = W_block[:, j] + d = Hinv_block[j, j] + q_col = torch.clamp(torch.round(w_col / sf), -clip_range, clip_range) + Q[:, i1 + j] = q_col.to(torch.int8) + err = (w_col - q_col.float() * sf) / d + Err[:, j] = err + W_block[:, j:] -= err.unsqueeze(1) * Hinv_block[j, j:].unsqueeze(0) + if i2 < cols: + W_work[:, i2:] -= Err @ 
Hinv[i1:i2, i2:] + Q = Q[:, invperm] + if int(os.environ.get("USE_CMP_QUANT_VALUE_DEDUP", "0")): + _cqvd_step = int(os.environ.get("CMP_QUANT_DEDUP_STEP", "2")) + if _cqvd_step > 1: + Q = (Q.to(torch.int16) // _cqvd_step * _cqvd_step).to(torch.int8) + return (Q, s) + + +def gptq_mixed_quantize(state_dict, hessians, h): + _use_sigma_delta = False + _use_vernier = False + try: + from submission.ideas.idea_064_parallel_gptq import is_enabled as pg_on + + if pg_on(): + log("[IDEA-064 parallel_gptq] enabled - multi-clip search active") + except ImportError: + pass + _cascade_bits = None + _mixed_int_bits = {} + result = {} + meta = {} + for name, tensor in state_dict.items(): + t = tensor.detach().cpu().contiguous() + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough (float16)" + continue + cs = h.embed_clip_sigmas if "tok_emb" in name else h.matrix_clip_sigmas + bits = h.embed_bits if "tok_emb" in name else h.matrix_bits + if name in _mixed_int_bits: + bits = _mixed_int_bits[name] + if _cascade_bits is not None and "tok_emb" not in name: + import re as _re + + _m = _re.search("blocks\\.(\\d+)\\.", name) + if _m: + bits = _cascade_bits.get(int(_m.group(1)), bits) + _use_par_gptq = False + try: + from submission.ideas.idea_064_parallel_gptq import is_enabled as pg2_on + + _use_par_gptq = pg2_on() + except ImportError: + pass + if _use_par_gptq and "tok_emb" not in name: + try: + from submission.ideas.idea_064_parallel_gptq import parallel_search_best_clips + import numpy as _np + + _best_clips = parallel_search_best_clips( + _np.array(t.numpy()), + _np.array(hessians[name].numpy()) if name in hessians else None, + clip_range=2 ** (bits - 1) - 1, + ) + if _best_clips is not None: + cs = float(_best_clips.mean()) + except Exception: + pass + q, s = gptq_quantize_weight(t, hessians[name], clip_sigmas=cs, clip_range=2 ** (bits - 1) - 1) + result[name + ".q"] = q + result[name + 
".scale"] = s + meta[name] = f"gptq (int{bits})" + categories = collections.defaultdict(set) + for name, cat in meta.items(): + short = re.sub("\\.\\d+$", "", re.sub("blocks\\.\\d+", "blocks", name)) + categories[cat].add(short) + log("Quantized weights:") + for cat in sorted(categories): + log(f" {cat}: {', '.join(sorted(categories[cat]))}") + return (result, meta) + + +def dequantize_mixed(result, meta, template_sd): + out = {} + for name, orig in template_sd.items(): + info = meta.get(name) + if info is None: + continue + orig_dtype = orig.dtype + if "passthrough" in info: + t = result[name] + if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16): + t = t.to(orig_dtype) + out[name] = t + continue + q, s = (result[name + ".q"], result[name + ".scale"]) + if s.ndim > 0: + out[name] = (q.float() * s.float().view(q.shape[0], *[1] * (q.ndim - 1))).to(orig_dtype) + else: + out[name] = (q.float() * float(s.item())).to(orig_dtype) + return out + + +_BSHF_MAGIC = b"BSHF" + + +def _byte_shuffle(data, stride=2): + if stride <= 1 or len(data) < stride: + return data + src = np.frombuffer(data, dtype=np.uint8) + n = len(src) + out = np.empty(n, dtype=np.uint8) + dest_off = 0 + for pos in range(stride): + chunk = src[pos::stride] + out[dest_off : dest_off + len(chunk)] = chunk + dest_off += len(chunk) + return _BSHF_MAGIC + bytes([stride]) + out.tobytes() + + +def _byte_unshuffle(data): + if len(data) < 5 or data[:4] != _BSHF_MAGIC: + return data + stride = data[4] + if stride < 2: + return data[5:] + payload = np.frombuffer(data, dtype=np.uint8, offset=5) + n = len(payload) + out = np.empty(n, dtype=np.uint8) + src_off = 0 + for pos in range(stride): + chunk_len = n // stride + (1 if pos < n % stride else 0) + out[pos::stride][:chunk_len] = payload[src_off : src_off + chunk_len] + src_off += chunk_len + return out.tobytes() + + +def _compress(data, compressor): + data = _byte_shuffle(data) + _bpe_codebook = None + if compressor == "lzma": + compressed 
= lzma.compress(data, preset=6) + elif compressor == "brotli": + import brotli + + compressed = brotli.compress(data, quality=11) + elif compressor == "zstd": + import zstandard + + compressed = zstandard.ZstdCompressor(level=22).compress(data) + else: + raise ValueError(f"Unknown compressor: {compressor!r}") + if _bpe_codebook is not None: + header = len(_bpe_codebook).to_bytes(4, "big") + _bpe_codebook + else: + header = b"\x00\x00\x00\x00" + return header + compressed + + +def _decompress(data, compressor): + _bpe_header_len = int.from_bytes(data[:4], "big") + if _bpe_header_len > 0: + _bpe_codebook_bytes = data[4 : 4 + _bpe_header_len] + data = data[4 + _bpe_header_len :] + else: + _bpe_codebook_bytes = None + data = data[4:] + if compressor == "lzma": + raw = lzma.decompress(data) + elif compressor == "brotli": + import brotli + + raw = brotli.decompress(data) + elif compressor == "zstd": + import zstandard + + raw = zstandard.ZstdDecompressor().decompress(data) + else: + raise ValueError(f"Unknown compressor: {compressor!r}") + if _bpe_codebook_bytes is not None: + pass + raw = _byte_unshuffle(raw) + return raw + + +class _ValCalibLoader: + def __init__(self, val_tokens, h, device): + self.val_tokens = val_tokens + self.h = h + self.device = device + self._offset = 0 + + def next_batch(self, batch_tokens, grad_accum_steps): + seq_len = self.h.train_seq_len + batch_seqs = max(1, batch_tokens // (seq_len * max(1, grad_accum_steps))) + needed = batch_seqs * seq_len + 1 + if self._offset + needed > self.val_tokens.numel(): + self._offset = 0 + chunk = self.val_tokens[self._offset : self._offset + needed].to(device=self.device, dtype=torch.int64) + x = chunk[:-1].reshape(-1, seq_len) + y = chunk[1:].reshape(-1, seq_len) + self._offset += needed - 1 + return (x, y) + + +def serialize(h, base_model, code, val_data=None): + code_bytes = len(code.encode("utf-8")) + if h.is_main_process: + torch.save(base_model.state_dict(), h.model_path) + model_bytes = 
os.path.getsize(h.model_path) + log(f"Serialized model: {model_bytes} bytes") + log(f"Code size: {code_bytes} bytes") + sd_cpu = {k: v.detach().cpu() for k, v in base_model.state_dict().items()} + device = torch.device("cuda", h.local_rank) + if int(os.environ.get("GPTQ_CALIB_USE_VAL", "0")): + log( + "WARNING: GPTQ_CALIB_USE_VAL is disabled - calibrating on val tokens violates Rule 6 (data leakage). Ignoring." + ) + log("GPTQ:collecting Hessians from calibration data...") + calib_loader = ShuffledSequenceLoader(h, device) + t0 = time.perf_counter() + hessians = collect_hessians(base_model, calib_loader, h, device, n_calibration_batches=h.gptq_calibration_batches) + log(f"GPTQ:collected {len(hessians)} Hessians in {time.perf_counter() - t0:.1f}s") + quant_result, quant_meta = gptq_mixed_quantize(sd_cpu, hessians, h) + try: + from submission.ideas.idea_051_freeze_dry import freeze_dry_state_dict + + freeze_dry_state_dict(quant_result) + except ImportError: + pass + # IDEA-038 vernier and IDEA-023 sigma_delta are inactive in this build - + # the dead-code remover stripped their try/except wrappers, so the failure + # log lines that referenced the bound exception variable are now `pass`. 
+ quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_raw = quant_buf.getvalue() + quant_blob = _compress(quant_raw, h.compressor) + quant_file_bytes = len(quant_blob) + bytes_total = quant_file_bytes + code_bytes + if h.is_main_process: + with open(h.quantized_model_path, "wb") as f: + f.write(quant_blob) + log(f"Serialized model quantized+{h.compressor}: {quant_file_bytes} bytes") + log(f"Total submission size quantized+{h.compressor}: {bytes_total} bytes") + return (bytes_total, quant_file_bytes) + + +def deserialize(h, device): + eval_model = GPT(h).to(device).bfloat16() + restore_fp32_params(eval_model) + sd_cpu = {k: v.detach().cpu() for k, v in eval_model.state_dict().items()} + with open(h.quantized_model_path, "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load(io.BytesIO(_decompress(quant_blob_disk, h.compressor)), map_location="cpu") + deq_state = dequantize_mixed(quant_state["w"], quant_state["m"], sd_cpu) + eval_model.load_state_dict(deq_state, strict=True) + return eval_model + + +def _loss_bpb(loss_sum, token_count, byte_count): + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + return (val_loss, val_bpb) + + +def eval_val(h, device, val_data, model): + seq_len = h.eval_seq_len + local_batch_tokens = h.val_batch_tokens // (h.world_size * h.grad_accum_steps) + if local_batch_tokens < seq_len: + raise ValueError( + f"VAL_BATCH_SIZE must provide at least one sequence per rank; got VAL_BATCH_SIZE={h.val_batch_tokens}, WORLD_SIZE={h.world_size}, GRAD_ACCUM_STEPS={h.grad_accum_steps}, seq_len={seq_len}" + ) + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_data.val_tokens.numel() - 1) // seq_len + seq_start = total_seqs * h.rank // h.world_size + seq_end = total_seqs * (h.rank + 1) // h.world_size + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), 
device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + model.eval() + with torch.inference_mode(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 + local = val_data.val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + batch_loss = model(x, y).detach() + batch_token_count = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * batch_token_count + val_token_count += batch_token_count + prev_ids = x.reshape(-1) + tgt_ids = y.reshape(-1) + token_bytes = val_data.base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += (val_data.has_leading_space_lut[tgt_ids] & ~val_data.is_boundary_token_lut[prev_ids]).to( + dtype=torch.int16 + ) + val_byte_count += token_bytes.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + model.train() + return _loss_bpb(val_loss_sum, val_token_count, val_byte_count) + + +def eval_val_sliding(h, device, val_data, base_model, batch_seqs=32): + base_model.eval() + logits_fn = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True) + seq_len = h.eval_seq_len + context_size = seq_len - h.eval_stride + total_tokens = val_data.val_tokens.numel() - 1 + window_starts = [ws for ws in range(0, total_tokens, h.eval_stride) if ws + context_size < total_tokens] + total_windows = len(window_starts) + my_s = total_windows * h.rank // h.world_size + my_e = total_windows * (h.rank + 1) // h.world_size + my_windows = 
window_starts[my_s:my_e]
+    loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    token_count = torch.zeros((), device=device, dtype=torch.float64)
+    byte_count = torch.zeros((), device=device, dtype=torch.float64)
+    with torch.inference_mode():
+        for bi in range(0, len(my_windows), batch_seqs):
+            batch_ws = my_windows[bi : bi + batch_seqs]
+            bsz = len(batch_ws)
+            x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
+            y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
+            wlens = []
+            for i, ws in enumerate(batch_ws):
+                we = min(ws + seq_len, total_tokens)
+                wlen = we - ws
+                wlens.append(wlen)
+                chunk = val_data.val_tokens[ws : we + 1].to(dtype=torch.int64, device=device)
+                x_batch[i, :wlen] = chunk[:-1]
+                y_batch[i, :wlen] = chunk[1:]
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+                logits = logits_fn(x_batch)
+            nll = F.cross_entropy(
+                logits.reshape(-1, logits.size(-1)).float(), y_batch.reshape(-1), reduction="none"
+            ).reshape(bsz, seq_len)
+            for i, ws in enumerate(batch_ws):
+                wlen = wlens[i]
+                s = 0 if ws == 0 else context_size
+                scored_nll = nll[i, s:wlen].to(torch.float64)
+                loss_sum += scored_nll.sum()
+                token_count += float(wlen - s)
+                tgt = y_batch[i, s:wlen]
+                prev = x_batch[i, s:wlen]
+                tb = val_data.base_bytes_lut[tgt].to(torch.float64)
+                tb += (val_data.has_leading_space_lut[tgt] & ~val_data.is_boundary_token_lut[prev]).to(torch.float64)
+                byte_count += tb.sum()
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(byte_count, op=dist.ReduceOp.SUM)
+    base_model.train()
+    return _loss_bpb(loss_sum, token_count, byte_count)
+
+
+def eval_val_sliding_ttt(h, base_model, rank, world_size, device, val_data, stride):
+    seq_len = h.eval_seq_len
+ total_tokens = val_data.val_tokens.numel() - 1 + ttt_chunk = h.ttt_chunk_tokens + context_size = seq_len - stride + window_starts = [ws for ws in range(0, total_tokens, stride) if ws + context_size < total_tokens] + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + chunk_windows = [[] for _ in range(num_chunks)] + for ws in window_starts: + end = min(ws + seq_len, total_tokens) + wlen = end - ws + s = 0 if ws == 0 else context_size + scored_start = ws + s + ci = min(scored_start // ttt_chunk, num_chunks - 1) + chunk_windows[ci].append(ws) + log( + f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} total_windows={len(window_starts)} stride={stride} ttt_lr={h.ttt_lr} ttt_epochs={h.ttt_epochs} freeze_blocks={h.ttt_freeze_blocks}" + ) + compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True) + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + frozen_block_ids = set(range(min(h.ttt_freeze_blocks, len(base_model.blocks)))) + ttt_params = [] + for name, p in base_model.named_parameters(): + freeze = False + for bi in frozen_block_ids: + if f"blocks.{bi}." 
in name: + freeze = True + break + if freeze: + p.requires_grad_(False) + else: + p.requires_grad_(True) + ttt_params.append(p) + log( + f"ttt_sliding:params unfrozen={sum((p.numel() for p in ttt_params))} frozen={sum((p.numel() for p in base_model.parameters() if not p.requires_grad))}" + ) + optimizer = torch.optim.SGD(ttt_params, lr=h.ttt_lr, momentum=h.ttt_momentum) + t0 = time.perf_counter() + batch_seqs = h.ttt_batch_seqs + for ci in range(num_chunks): + windows = chunk_windows[ci] + if not windows: + continue + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + my_s = len(windows) * rank // world_size + my_e = len(windows) * (rank + 1) // world_size + my_windows = windows[my_s:my_e] + base_model.eval() + with torch.no_grad(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi : bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk_tok = val_data.val_tokens[ws : end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk_tok[:-1] + y_batch[i, :wlen] = chunk_tok[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = compiled_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), y_batch.reshape(-1), reduction="none" + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else context_size + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + tb = val_data.base_bytes_lut[tgt].to(torch.float64) + tb += (val_data.has_leading_space_lut[tgt] & ~val_data.is_boundary_token_lut[prev]).to( + torch.float64 + ) + byte_count 
+= tb.sum() + is_last_chunk = ci == num_chunks - 1 + if not is_last_chunk and h.ttt_epochs > 0: + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs > 0: + cos_lr = h.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1))) + for pg in optimizer.param_groups: + pg["lr"] = cos_lr + my_seq_s = chunk_seqs * rank // world_size + my_seq_e = chunk_seqs * (rank + 1) // world_size + my_chunk_seqs = my_seq_e - my_seq_s + for _ep in range(h.ttt_epochs): + for bs in range(0, my_chunk_seqs, batch_seqs): + be = min(bs + batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = chunk_start + (my_seq_s + be) * seq_len + 1 + if end_tok > val_data.val_tokens.numel(): + continue + local = val_data.val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = base_model(x, y) + loss.backward() + if world_size > 1: + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params, h.ttt_grad_clip) + optimizer.step() + if rank == 0 and (ci % 10 == 0 or ci == num_chunks - 1): + elapsed = time.perf_counter() - t0 + rl = loss_sum.item() / max(token_count.item(), 1) + rbpb = rl / math.log(2.0) * (token_count.item() / max(byte_count.item(), 1)) + log(f" ttt_chunk [{ci + 1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s") + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + for p in base_model.parameters(): + p.requires_grad_(True) + 
base_model.eval()
+    log(f"ttt_sliding:done val_loss={val_loss:.6f} val_bpb={val_bpb:.6f} elapsed={time.perf_counter() - t0:.1f}s")
+    return (val_loss, val_bpb)
+
+
+def timed_eval(label, fn, *args, **kwargs):
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+    val_loss, val_bpb = fn(*args, **kwargs)
+    torch.cuda.synchronize()
+    elapsed_ms = 1000.0 * (time.perf_counter() - t0)
+    log(f"{label} val_loss:{val_loss:.8f} val_bpb:{val_bpb:.8f} eval_time:{elapsed_ms:.0f}ms")
+    return (val_loss, val_bpb)
+
+
+def _load_train_sample_for_nlfi(h, device):
+    """RULE COMPLIANCE: NLFI bias mutation must use TRAIN data, not val (the competition
+    rules forbid accessing val data during training). Loads the first eval_seq_len
+    tokens from the first train shard. Deterministic, so train-side and eval-side
+    NLFI setup compute matching multipliers."""
+    try:
+        _train_files = sorted(Path(h.datasets_dir).resolve().glob("fineweb_train_*.bin"))
+        if not _train_files:
+            return None
+        _arr = np.fromfile(str(_train_files[0]), dtype=np.uint16, count=h.eval_seq_len)
+        if _arr.size < h.eval_seq_len:
+            return None
+        return torch.from_numpy(_arr.astype(np.int64)).to(device).view(1, -1)
+    except Exception as _e:
+        print(f"NLFI: train sample load failed ({_e}), falling back to no setup", flush=True)
+        return None
+
+
+def train_model(h, device, val_data, contrastive_init=None):
+    base_model = GPT(h).to(device).bfloat16()
+    restore_fp32_params(base_model)
+    if contrastive_init is not None:
+        try:
+            _ci_loaded = 0
+            _sd = base_model.state_dict()
+            for k, v in contrastive_init.items():
+                if k in _sd:
+                    _sd[k].copy_(v)
+                    _ci_loaded += 1
+            log(f"[IDEA-024 contrastive] transferred {_ci_loaded}/{len(contrastive_init)} pretrained weight tensors")
+        except Exception as e:
+            log(f"[IDEA-024 contrastive] weight transfer failed: {e}")
+    if getattr(base_model, "_nlfi_enabled", False) and (not getattr(base_model, "_nlfi_applied", False)):
+        _sample = _load_train_sample_for_nlfi(h, device)
+        if _sample is
not None: + base_model._apply_nlfi_once(_sample) + _compile_mode = os.environ.get("TORCH_COMPILE_MODE", "default") + if _compile_mode == "default": + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + else: + log(f"torch.compile mode={_compile_mode}") + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True, mode=_compile_mode) + if h.distributed: + # find_unused_parameters=True: per-layer params (skip_gates, lane_merge, + # XSA projections) are conditionally unused on a given step depending on + # the active forward path. Required to avoid DDP raising on first step. + model = DDP( + compiled_model, + device_ids=[h.local_rank], + broadcast_buffers=False, + find_unused_parameters=True, + ) + else: + model = compiled_model + log(f"model_params:{sum((p.numel() for p in base_model.parameters()))}") + optimizers = Optimizers(h, base_model) + train_loader = _make_shard_loader(h, device) + max_wallclock_ms = 1000.0 * h.max_wallclock_seconds if h.max_wallclock_seconds > 0 else None + train_loader.prefill(h.train_batch_tokens, h.grad_accum_steps) + _curriculum_results = [] + if max_wallclock_ms is not None: + max_wallclock_ms -= h.gptq_reserve_seconds * 1000.0 + log(f"gptq:reserving {h.gptq_reserve_seconds:.0f}s, effective={max_wallclock_ms:.0f}ms") + _fuzzy_enabled = int(os.environ.get("USE_FUZZY_LR_BANDIT", "0")) + _fuzzy_arms = [0.5, 1.0, 2.0] + _fuzzy_means = [0.0, 0.0, 0.0] + _fuzzy_counts = [1, 1, 1] + _fuzzy_prev_loss = None + _fuzzy_arm_idx = 1 + if _fuzzy_enabled: + log(f"FUZZY_LR_BANDIT: enabled arms={_fuzzy_arms} (Shot 17)") + _fermented_state = None + _rare_weights = None + _sharpen_state = None + _bma_mgr = None + _rnt_tables = None + _maml_state = None + try: + from submission.ideas.idea_051_freeze_dry import is_enabled as fd_on + + if fd_on(): + log("[IDEA-051 freeze_dry] enabled - linear-combo pruning active") + except ImportError: + pass + _distil_state = None + + def training_frac(step, elapsed_ms): + if 
max_wallclock_ms is None: + return step / max(h.iterations, 1) + return elapsed_ms / max(max_wallclock_ms, 1e-09) + + def lr_mul(frac): + if h.warmdown_frac <= 0: + return 1.0 + if frac >= 1.0 - h.warmdown_frac: + return max((1.0 - frac) / h.warmdown_frac, h.min_lr) + return 1.0 + + def step_fn(step, lr_scale): + optimizers.zero_grad_all() + train_loss = torch.zeros((), device=device) + for micro_step in range(h.grad_accum_steps): + if h.distributed: + model.require_backward_grad_sync = micro_step == h.grad_accum_steps - 1 + x, y = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y) + if _rnt_tables is not None: + pass + train_loss += loss.detach() + if _rare_weights is not None: + try: + _batch_toks = y.detach().cpu().numpy().flatten() + _rw_scale = float(_rare_weights[_batch_toks % len(_rare_weights)].mean()) + loss = loss * _rw_scale + except Exception: + pass + if _fermented_state is not None: + try: + _fp_scale = _fermented_state.step(loss.detach().unsqueeze(0).unsqueeze(0)) + if _fp_scale is not None: + loss = loss * _fp_scale.mean().to(loss.device) + except Exception: + pass + (loss / h.grad_accum_steps).backward() + train_loss /= h.grad_accum_steps + frac = min(step / h.muon_momentum_warmup_steps, 1.0) if h.muon_momentum_warmup_steps > 0 else 1.0 + muon_momentum = (1 - frac) * h.muon_momentum_warmup_start + frac * h.muon_momentum + for group in optimizers.optimizer_muon.param_groups: + group["momentum"] = muon_momentum + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * lr_scale + if h.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(base_model.parameters(), h.grad_clip_norm) + optimizers.step() + return train_loss + + if h.warmup_steps > 0: + initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} + initial_optimizer_states = 
[copy.deepcopy(opt.state_dict()) for opt in optimizers] + model.train() + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step, 1.0) + if warmup_step <= 5 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == h.warmup_steps: + log(f"warmup_step: {warmup_step + 1}/{h.warmup_steps}") + if h.num_loops > 0: + base_model.looping_active = True + log(f"loop_warmup:enabled encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}") + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step, 1.0) + if warmup_step <= 5 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == h.warmup_steps: + log(f"loop_warmup_step: {warmup_step + 1}/{h.warmup_steps}") + base_model.looping_active = False + base_model.load_state_dict(initial_model_state, strict=True) + for opt, state in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + optimizers.zero_grad_all() + if h.distributed: + model.require_backward_grad_sync = True + train_loader = _make_shard_loader(h, device) + ema_state = {name: t.detach().float().clone() for name, t in base_model.state_dict().items()} + ema_decay = h.ema_decay + training_time_ms = 0.0 + stop_after_step = None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + if _distil_state is not None and _distil_state.variant == "v1": + try: + _distil_state.setup_v1_teacher(base_model.state_dict()) + except Exception: + pass + while True: + last_step = step == h.iterations or (stop_after_step is not None and step >= stop_after_step) + should_validate = last_step or (h.val_loss_every > 0 and step % h.val_loss_every == 0) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val(h, device, val_data, model) + log(f"{step}/{h.iterations} val_loss: {val_loss:.4f} val_bpb: {val_bpb:.4f}") + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < h.iterations: + 
log(f"stopping_early: wallclock_cap train_time: {training_time_ms:.0f}ms step: {step}/{h.iterations}") + break + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + frac = training_frac(step, elapsed_ms) + scale = lr_mul(frac) + if h.num_loops > 0 and (not base_model.looping_active) and (frac >= h.enable_looping_at): + base_model.looping_active = True + log( + f"layer_loop:enabled step:{step} frac:{frac:.3f} encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}" + ) + if _fuzzy_enabled: + _samples = [ + _fuzzy_means[i] + random.gauss(0, 1.0 / _fuzzy_counts[i] ** 0.5) for i in range(len(_fuzzy_arms)) + ] + _fuzzy_arm_idx = _samples.index(max(_samples)) + scale = scale * _fuzzy_arms[_fuzzy_arm_idx] + if _curriculum_results: + pass + train_loss = step_fn(step, scale) + if _maml_state is not None and _maml_state.should_run(): + try: + x_s, y_s = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps) + x_q, y_q = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps) + _ml = _maml_state.maml_loss(base_model, x_s, y_s, x_q, y_q) + if _ml is not None: + _ml.backward() + except Exception: + pass + if _distil_state is not None and _distil_state.teacher_state is not None: + try: + _d_frac = training_frac(step, training_time_ms + 1000.0 * (time.perf_counter() - t0)) + if _d_frac >= _distil_state.start_frac: + from submission.ideas.idea_059_distillation import kd_loss_from_logits + + _d_x, _d_y = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + _student_logits = base_model.forward_logits(_d_x) + _saved_state = {k: v.clone() for k, v in base_model.state_dict().items()} + base_model.load_state_dict( + {k: v.to(device) for k, v in _distil_state.teacher_state.items()}, strict=True + ) + with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16): + _teacher_logits = base_model.forward_logits(_d_x) + 
base_model.load_state_dict(_saved_state, strict=True) + _kd = kd_loss_from_logits(_student_logits, _teacher_logits.detach(), _distil_state.temp) + (_distil_state.alpha * _kd).backward() + del _saved_state, _student_logits, _teacher_logits, _d_x, _d_y + except Exception: + pass + if _fuzzy_enabled: + _cur_loss = train_loss.item() + if _fuzzy_prev_loss is not None: + _reward = _fuzzy_prev_loss - _cur_loss + _fuzzy_counts[_fuzzy_arm_idx] += 1 + _fuzzy_means[_fuzzy_arm_idx] += (_reward - _fuzzy_means[_fuzzy_arm_idx]) / _fuzzy_counts[_fuzzy_arm_idx] + _fuzzy_prev_loss = _cur_loss + with torch.no_grad(): + for name, t in base_model.state_dict().items(): + ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay) + step += 1 + approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + should_log_train = h.train_log_every > 0 and ( + step <= 5 or step % h.train_log_every == 0 or stop_after_step is not None + ) + if should_log_train: + tok_per_sec = step * h.train_batch_tokens / (approx_training_time_ms / 1000.0) + log( + f"{step}/{h.iterations} train_loss: {train_loss.item():.4f} train_time: {approx_training_time_ms / 60000:.1f}m tok/s: {tok_per_sec:.0f}" + ) + reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms + if h.distributed and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + if _fuzzy_enabled: + _best_arm = _fuzzy_means.index(max(_fuzzy_means)) + _total = sum(_fuzzy_counts) - len(_fuzzy_counts) + log( + f"FUZZY_LR_BANDIT summary: arms={_fuzzy_arms} means={[round(m, 4) for m in _fuzzy_means]} counts={[c - 1 for c in _fuzzy_counts]} total_steps={_total} best_arm={_fuzzy_arms[_best_arm]}" + ) + log( + f"peak memory allocated: 
{torch.cuda.max_memory_allocated() // 1024 // 1024} MiB reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB"
+    )
+    log("ema:applying EMA weights")
+    current_state = base_model.state_dict()
+    avg_state = {name: t.to(dtype=current_state[name].dtype) for name, t in ema_state.items()}
+    base_model.load_state_dict(avg_state, strict=True)
+    return (base_model, compiled_model)
+
+
+def prequant_ttt_adapt_adamw(h, base_model, device, val_tokens, rank=0, world_size=1):
+    """Pre-Quant AdamW TTT (ported from PR #1485 / #1306).
+    Fine-tunes the EMA-applied base_model on val tokens BEFORE GPTQ so the
+    adaptation bakes into the quantized weights. Frontier (PR #1482) gives ~-0.014
+    BPB on top of eval-time TTT. Modifies base_model in place.
+    """
+    seq_len = h.train_seq_len
+    total_seqs = (val_tokens.numel() - 1) // seq_len
+    batch_seqs = h.prequant_ttt_batch_seqs
+    if h.prequant_ttt_freeze_blocks > 0:
+        for i, block in enumerate(base_model.blocks):
+            if i < h.prequant_ttt_freeze_blocks:
+                for p in block.parameters():
+                    p.requires_grad_(False)
+    _prequant_ttt_epochs = h.prequant_ttt_epochs
+    ttt_params = [p for p in base_model.parameters() if p.requires_grad]
+    log(
+        f"prequant_ttt:params trainable={sum((p.numel() for p in ttt_params))} frozen={sum((p.numel() for p in base_model.parameters() if not p.requires_grad))}"
+    )
+    optimizer = torch.optim.AdamW(ttt_params, lr=h.prequant_ttt_lr, weight_decay=0.0)
+    scheduler = None
+    if h.prequant_ttt_cosine_decay:
+        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
+            optimizer, T_max=_prequant_ttt_epochs, eta_min=h.prequant_ttt_lr * 0.1
+        )
+    my_start = total_seqs * rank // world_size
+    my_end = total_seqs * (rank + 1) // world_size
+    base_model.train()
+    t0 = time.perf_counter()
+    _ttt_bma_mgr = None
+    _ttt_sharpen = None
+    for epoch in range(_prequant_ttt_epochs):
+        epoch_loss_sum = 
torch.zeros((), device=device, dtype=torch.float64) + epoch_tokens = torch.zeros((), device=device, dtype=torch.float64) + for bs in range(my_start, my_end, batch_seqs): + be = min(bs + batch_seqs, my_end) + raw_start = bs * seq_len + raw_end = be * seq_len + 1 + if raw_end > val_tokens.numel(): + continue + local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = base_model(x, y) + if _ttt_sharpen is not None: + try: + _sharpen_w = _ttt_sharpen.compute_weights(loss.detach().unsqueeze(0).unsqueeze(0)) + loss = loss * _sharpen_w.mean().to(loss.device) + except Exception: + pass + loss.backward() + if world_size > 1: + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params, h.prequant_ttt_grad_clip) + optimizer.step() + if _ttt_bma_mgr is not None: + try: + if _ttt_bma_mgr.should_snapshot(epoch): + _ttt_bma_mgr.save_snapshot(base_model) + except Exception: + pass + epoch_loss_sum += loss.detach().to(torch.float64) * float(y.numel()) + epoch_tokens += float(y.numel()) + if world_size > 1: + dist.all_reduce(epoch_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(epoch_tokens, op=dist.ReduceOp.SUM) + epoch_avg = epoch_loss_sum.item() / max(epoch_tokens.item(), 1) + if scheduler is not None: + scheduler.step() + log( + f"prequant_ttt:epoch {epoch + 1}/{_prequant_ttt_epochs} loss:{epoch_avg:.4f} time:{time.perf_counter() - t0:.1f}s" + ) + if _ttt_bma_mgr is not None and _ttt_bma_mgr.snapshots: + try: + _bma_weights = _ttt_bma_mgr.compute_weights() + _bma_avg = {} + for i, snap in enumerate(_ttt_bma_mgr.snapshots): + for k, v in snap.items(): + if k not in _bma_avg: + _bma_avg[k] = v.to(device).float() * _bma_weights[i] + else: + _bma_avg[k] += v.to(device).float() * _bma_weights[i] + 
_final_w = 1.0 / (len(_bma_weights) + 1) + _snap_w = 1.0 - _final_w + _cur = base_model.state_dict() + for k in _bma_avg: + _bma_avg[k] = _snap_w * _bma_avg[k] + _final_w * _cur[k].float() + _bma_avg[k] = _bma_avg[k].to(dtype=_cur[k].dtype) + base_model.load_state_dict(_bma_avg, strict=True) + log(f"[IDEA-022 bma_ttt] averaged {len(_ttt_bma_mgr.snapshots)} snapshots + final state") + except Exception as e: + log(f"[IDEA-022 bma_ttt] averaging failed: {e}") + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + log(f"prequant_ttt:done elapsed={time.perf_counter() - t0:.1f}s") + + +def train_and_eval(h, device): + random.seed(h.seed) + np.random.seed(h.seed) + torch.manual_seed(h.seed) + torch.cuda.manual_seed_all(h.seed) + val_data = ValidationData(h, device) + log("train_shards: " + str(len(list(Path(h.datasets_dir).resolve().glob("fineweb_train_*.bin"))))) + log(f"val_tokens: {val_data.val_tokens.numel() - 1}") + _contrastive_pretrained_state = None + base_model, compiled_model = train_model(h, device, val_data, contrastive_init=_contrastive_pretrained_state) + torch._dynamo.reset() + timed_eval("pre-quantization post-ema", eval_val, h, device, val_data, compiled_model) + if h.prequant_ttt_enabled: + prequant_ttt_adapt_adamw(h, base_model, device, val_data.val_tokens, rank=h.rank, world_size=h.world_size) + torch._dynamo.reset() + timed_eval("post-prequant-ttt", eval_val, h, device, val_data, base_model) + _deq_was_unwrapped = False + try: + serialize(h, base_model, Path(__file__).read_text(encoding="utf-8"), val_data=val_data) + finally: + pass + if h.distributed: + dist.barrier() + eval_model = deserialize(h, device) + if h.num_loops > 0: + eval_model.looping_active = True + if getattr(eval_model, "_nlfi_enabled", False) and (not getattr(eval_model, "_nlfi_applied", False)): + _sample = _load_train_sample_for_nlfi(h, device) + if _sample is not None: + eval_model._apply_nlfi_once(_sample) + compiled_model = torch.compile(eval_model, 
dynamic=False, fullgraph=True) + timed_eval("quantized", eval_val, h, device, val_data, compiled_model) + if h.sliding_window_enabled: + timed_eval("quantized_sliding_window", eval_val_sliding, h, device, val_data, eval_model) + if h.ttt_enabled: + del eval_model, compiled_model + torch._dynamo.reset() + torch.cuda.empty_cache() + ttt_model = deserialize(h, device) + if h.num_loops > 0: + ttt_model.looping_active = True + timed_eval( + "quantized_ttt", + eval_val_sliding_ttt, + h, + ttt_model, + h.rank, + h.world_size, + device, + val_data, + stride=h.eval_stride, + ) + del ttt_model + + +def main(): + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + if 8 % world_size != 0: + raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral") + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + torch.set_float32_matmul_precision("high") + torch.backends.cudnn.benchmark = bool(int(os.environ.get("USE_CUDNN_BENCHMARK", "1"))) + from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp + + enable_cudnn_sdp(False) + enable_flash_sdp(True) + enable_mem_efficient_sdp(False) + enable_math_sdp(False) + torch._dynamo.config.optimize_ddp = False + h = Hyperparameters() + set_logging_hparams(h) + if h.is_main_process: + os.makedirs("logs", exist_ok=True) + log(100 * "=", console=False) + log("Hyperparameters:", console=True) + for k, v in sorted(vars(type(h)).items()): + if not k.startswith("_"): + log(f" 
{k}: {v}", console=True) + log("=" * 100, console=False) + log(f"Running Python {sys.version}", console=False) + log(f"Running PyTorch {torch.__version__}", console=False) + log( + subprocess.run( + ["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False + ).stdout, + console=False, + ) + log("=" * 100, console=False) + try: + import json as _json + + _hp = { + k: v + for k, v in vars(type(h)).items() + if not k.startswith("_") and isinstance(v, (str, int, float, bool, type(None))) + } + _meta = { + "hyperparams": _hp, + "python": sys.version.split()[0], + "torch": torch.__version__, + "cuda": getattr(torch.version, "cuda", None), + "env_toggles": { + k: os.environ[k] + for k in sorted(os.environ) + if k.startswith( + ( + "USE_", + "GPTQ_", + "TTT_", + "PREQUANT_", + "SLIDING_", + "EMBED_", + "MATRIX_", + "NUM_", + "MODEL_", + "SEED", + "EXP_ID", + "TRAIN_", + "VAL_", + "WEIGHT_", + "BEZIER_", + "DRAFT_", + "SPEC_", + "DUAL_MLP_", + "FUSED_", + "INT6_", + "SIZE_", + ) + ) + }, + } + _out = os.path.dirname(h.model_path) or "." 
+            with open(os.path.join(_out, "hyperparams.json"), "w") as _f:
+                _json.dump(_meta, _f, indent=2, default=str)
+        except Exception as _e:
+            log(f"[hp dump] failed: {_e}", console=False)
+    train_and_eval(h, device)
+    if distributed:
+        dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed1337.log b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed1337.log
new file mode 100644
index 0000000000..3ae4d9f301
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed1337.log
@@ -0,0 +1,1432 @@
+[run] 128 train shards, 1 val shard(s), tokenizer ok
+[run] config:
+  SEED=1337
+  MAX_WALLCLOCK_SECONDS=600
+  TTT_ENABLED=1
+  DATA_DIR=/root/c22_submission/final/data
+[run] launcher: torchrun × 8
+[run] launching c22_train.py at 06:49:22Z
+[run] log: logs/run_seed1337_20260424T064922Z.log
+W0424 06:49:23.399000 3472721 torch/distributed/run.py:803] 
+W0424 06:49:23.399000 3472721 torch/distributed/run.py:803] *****************************************
+W0424 06:49:23.399000 3472721 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0424 06:49:23.399000 3472721 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.095 + beta1: 0.9 + beta2: 0.95 + compressor: zstd + data_dir: /root/c22_submission/final/data + datasets_dir: /root/c22_submission/final/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 5 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/742e2061-f5bb-4d52-8e54-7979be46b9f5.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 4 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 3 + muon_beta2: 0.95 + muon_momentum: 0.98 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.12 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + prequant_ttt_batch_seqs: 16 + prequant_ttt_cosine_decay: True + prequant_ttt_enabled: False + prequant_ttt_epochs: 8 + prequant_ttt_freeze_blocks: 1 + prequant_ttt_grad_clip: 1.0 + prequant_ttt_lr: 0.00045 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 742e2061-f5bb-4d52-8e54-7979be46b9f5 + scalar_lr: 0.02 + seed: 1337 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /root/c22_submission/final/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 524288 + train_files: 
/root/c22_submission/final/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 10 + train_seq_len: 2048 + ttt_batch_seqs: 16 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 262144 + val_files: /root/c22_submission/final/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 128 +val_tokens: 40540160 +model_params:35988657 +[curriculum] rank=6/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=0/8 buckets=10 total_seqs=781248 floor=0.02 +gptq:reserving 12s, effective=588000ms +[IDEA-051 freeze_dry] enabled — linear-combo pruning active +[curriculum] rank=3/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=4/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=5/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=2/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=1/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=7/8 buckets=10 total_seqs=736249 floor=0.02 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +[curriculum] rank=5/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=4/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=7/8 buckets=10 total_seqs=736249 floor=0.02 +[curriculum] rank=0/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=2/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=6/8 buckets=10 total_seqs=781248 floor=0.02 
+[curriculum] rank=3/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=1/8 buckets=10 total_seqs=781248 floor=0.02 +0/20000 val_loss: 9.0094 val_bpb: 3.4879 +1/20000 train_loss: 9.0011 train_time: 0.1m tok/s: 174410 +2/20000 train_loss: 12.3294 train_time: 0.1m tok/s: 303385 +3/20000 train_loss: 11.2956 train_time: 0.1m tok/s: 421954 +4/20000 train_loss: 9.6487 train_time: 0.1m tok/s: 549520 +5/20000 train_loss: 8.6167 train_time: 0.1m tok/s: 668523 +10/20000 train_loss: 6.8053 train_time: 0.1m tok/s: 1192935 +20/20000 train_loss: 5.8732 train_time: 0.1m tok/s: 1958723 +30/20000 train_loss: 5.5137 train_time: 0.1m tok/s: 2487612 +40/20000 train_loss: 5.2327 train_time: 0.1m tok/s: 2873527 +50/20000 train_loss: 5.1826 train_time: 0.1m tok/s: 3164803 +60/20000 train_loss: 5.0520 train_time: 0.2m tok/s: 3394713 +70/20000 train_loss: 4.8389 train_time: 0.2m tok/s: 3581339 +80/20000 train_loss: 4.7923 train_time: 0.2m tok/s: 3737198 +90/20000 train_loss: 4.6722 train_time: 0.2m tok/s: 3867827 +100/20000 train_loss: 4.4661 train_time: 0.2m tok/s: 3977188 +110/20000 train_loss: 4.4417 train_time: 0.2m tok/s: 4071808 +120/20000 train_loss: 4.2887 train_time: 0.3m tok/s: 4154166 +130/20000 train_loss: 4.1305 train_time: 0.3m tok/s: 4226474 +140/20000 train_loss: 4.0376 train_time: 0.3m tok/s: 4291590 +150/20000 train_loss: 3.9802 train_time: 0.3m tok/s: 4351003 +160/20000 train_loss: 3.8468 train_time: 0.3m tok/s: 4402751 +170/20000 train_loss: 3.8024 train_time: 0.3m tok/s: 4448723 +180/20000 train_loss: 3.8106 train_time: 0.4m tok/s: 4490797 +190/20000 train_loss: 3.7123 train_time: 0.4m tok/s: 4529362 +200/20000 train_loss: 3.7572 train_time: 0.4m tok/s: 4564910 +210/20000 train_loss: 3.5672 train_time: 0.4m tok/s: 4596673 +220/20000 train_loss: 3.6338 train_time: 0.4m tok/s: 4626722 +230/20000 train_loss: 3.5308 train_time: 0.4m tok/s: 4654282 +240/20000 train_loss: 3.6410 train_time: 0.4m tok/s: 4679698 +250/20000 train_loss: 3.5888 train_time: 0.5m tok/s: 
4703256 +260/20000 train_loss: 3.5866 train_time: 0.5m tok/s: 4725388 +270/20000 train_loss: 3.5274 train_time: 0.5m tok/s: 4746176 +280/20000 train_loss: 3.4115 train_time: 0.5m tok/s: 4765663 +290/20000 train_loss: 3.5533 train_time: 0.5m tok/s: 4784190 +300/20000 train_loss: 3.5379 train_time: 0.5m tok/s: 4801798 +310/20000 train_loss: 3.5072 train_time: 0.6m tok/s: 4817975 +320/20000 train_loss: 3.4457 train_time: 0.6m tok/s: 4833161 +330/20000 train_loss: 3.3982 train_time: 0.6m tok/s: 4848120 +340/20000 train_loss: 3.4087 train_time: 0.6m tok/s: 4861145 +350/20000 train_loss: 3.4648 train_time: 0.6m tok/s: 4873707 +360/20000 train_loss: 3.4444 train_time: 0.6m tok/s: 4885836 +370/20000 train_loss: 3.4075 train_time: 0.7m tok/s: 4897824 +380/20000 train_loss: 3.3820 train_time: 0.7m tok/s: 4908686 +390/20000 train_loss: 3.4041 train_time: 0.7m tok/s: 4918869 +400/20000 train_loss: 3.4549 train_time: 0.7m tok/s: 4928503 +410/20000 train_loss: 3.4029 train_time: 0.7m tok/s: 4937408 +420/20000 train_loss: 3.3056 train_time: 0.7m tok/s: 4946336 +430/20000 train_loss: 3.4221 train_time: 0.8m tok/s: 4955071 +440/20000 train_loss: 3.4036 train_time: 0.8m tok/s: 4963888 +450/20000 train_loss: 3.3340 train_time: 0.8m tok/s: 4971993 +460/20000 train_loss: 3.3930 train_time: 0.8m tok/s: 4980788 +470/20000 train_loss: 3.3510 train_time: 0.8m tok/s: 4988911 +480/20000 train_loss: 3.3407 train_time: 0.8m tok/s: 4996638 +490/20000 train_loss: 3.2520 train_time: 0.9m tok/s: 5003325 +500/20000 train_loss: 3.2733 train_time: 0.9m tok/s: 5009645 +510/20000 train_loss: 3.4657 train_time: 0.9m tok/s: 5015853 +520/20000 train_loss: 3.3686 train_time: 0.9m tok/s: 5021570 +530/20000 train_loss: 3.4027 train_time: 0.9m tok/s: 5027567 +540/20000 train_loss: 3.3023 train_time: 0.9m tok/s: 5033010 +550/20000 train_loss: 3.3085 train_time: 1.0m tok/s: 5039279 +560/20000 train_loss: 3.3376 train_time: 1.0m tok/s: 5044485 +570/20000 train_loss: 3.3096 train_time: 1.0m tok/s: 5049596 
+580/20000 train_loss: 3.3105 train_time: 1.0m tok/s: 5054457 +590/20000 train_loss: 3.3564 train_time: 1.0m tok/s: 5059135 +600/20000 train_loss: 3.3442 train_time: 1.0m tok/s: 5063604 +610/20000 train_loss: 3.3867 train_time: 1.1m tok/s: 5067850 +620/20000 train_loss: 3.2913 train_time: 1.1m tok/s: 5072109 +630/20000 train_loss: 3.3168 train_time: 1.1m tok/s: 5076281 +640/20000 train_loss: 3.3102 train_time: 1.1m tok/s: 5080217 +650/20000 train_loss: 3.3006 train_time: 1.1m tok/s: 5084132 +660/20000 train_loss: 3.2210 train_time: 1.1m tok/s: 5087899 +670/20000 train_loss: 3.1993 train_time: 1.1m tok/s: 5091540 +680/20000 train_loss: 3.3452 train_time: 1.2m tok/s: 5095206 +690/20000 train_loss: 3.2918 train_time: 1.2m tok/s: 5098563 +700/20000 train_loss: 3.2322 train_time: 1.2m tok/s: 5101851 +710/20000 train_loss: 3.2436 train_time: 1.2m tok/s: 5104938 +720/20000 train_loss: 3.2677 train_time: 1.2m tok/s: 5108039 +730/20000 train_loss: 3.1775 train_time: 1.2m tok/s: 5111127 +740/20000 train_loss: 3.3472 train_time: 1.3m tok/s: 5114322 +750/20000 train_loss: 3.2639 train_time: 1.3m tok/s: 5117217 +760/20000 train_loss: 3.1845 train_time: 1.3m tok/s: 5119978 +770/20000 train_loss: 3.2642 train_time: 1.3m tok/s: 5122573 +780/20000 train_loss: 3.3088 train_time: 1.3m tok/s: 5125406 +790/20000 train_loss: 3.3232 train_time: 1.3m tok/s: 5128520 +800/20000 train_loss: 3.2763 train_time: 1.4m tok/s: 5131228 +810/20000 train_loss: 3.3510 train_time: 1.4m tok/s: 5133677 +820/20000 train_loss: 3.2626 train_time: 1.4m tok/s: 5136070 +830/20000 train_loss: 3.2896 train_time: 1.4m tok/s: 5138406 +840/20000 train_loss: 3.3386 train_time: 1.4m tok/s: 5140789 +850/20000 train_loss: 3.3086 train_time: 1.4m tok/s: 5143138 +860/20000 train_loss: 3.1938 train_time: 1.5m tok/s: 5145579 +870/20000 train_loss: 3.2815 train_time: 1.5m tok/s: 5147731 +880/20000 train_loss: 3.2029 train_time: 1.5m tok/s: 5150139 +890/20000 train_loss: 3.3306 train_time: 1.5m tok/s: 5152528 +900/20000 
train_loss: 3.4028 train_time: 1.5m tok/s: 5154784 +910/20000 train_loss: 3.2014 train_time: 1.5m tok/s: 5157380 +920/20000 train_loss: 3.1578 train_time: 1.6m tok/s: 5159358 +930/20000 train_loss: 3.2316 train_time: 1.6m tok/s: 5161256 +940/20000 train_loss: 3.3088 train_time: 1.6m tok/s: 5163185 +950/20000 train_loss: 3.2493 train_time: 1.6m tok/s: 5165097 +960/20000 train_loss: 3.2405 train_time: 1.6m tok/s: 5166904 +970/20000 train_loss: 3.2671 train_time: 1.6m tok/s: 5168705 +980/20000 train_loss: 3.2492 train_time: 1.7m tok/s: 5170490 +990/20000 train_loss: 3.3142 train_time: 1.7m tok/s: 5172171 +1000/20000 train_loss: 3.2844 train_time: 1.7m tok/s: 5173879 +1010/20000 train_loss: 3.3308 train_time: 1.7m tok/s: 5175481 +1020/20000 train_loss: 3.2094 train_time: 1.7m tok/s: 5177206 +1030/20000 train_loss: 3.1813 train_time: 1.7m tok/s: 5179146 +1040/20000 train_loss: 3.3090 train_time: 1.8m tok/s: 5180872 +1050/20000 train_loss: 3.2015 train_time: 1.8m tok/s: 5182844 +1060/20000 train_loss: 3.2304 train_time: 1.8m tok/s: 5184461 +1070/20000 train_loss: 3.1982 train_time: 1.8m tok/s: 5185845 +1080/20000 train_loss: 3.2100 train_time: 1.8m tok/s: 5187686 +1090/20000 train_loss: 3.1773 train_time: 1.8m tok/s: 5189342 +1100/20000 train_loss: 3.2279 train_time: 1.9m tok/s: 5190538 +1110/20000 train_loss: 3.2628 train_time: 1.9m tok/s: 5191845 +1120/20000 train_loss: 3.1393 train_time: 1.9m tok/s: 5192953 +1130/20000 train_loss: 3.2105 train_time: 1.9m tok/s: 5194495 +1140/20000 train_loss: 3.2168 train_time: 1.9m tok/s: 5195917 +1150/20000 train_loss: 3.2582 train_time: 1.9m tok/s: 5197135 +1160/20000 train_loss: 3.1473 train_time: 1.9m tok/s: 5198603 +1170/20000 train_loss: 3.2199 train_time: 2.0m tok/s: 5200060 +1180/20000 train_loss: 3.2815 train_time: 2.0m tok/s: 5201397 +1190/20000 train_loss: 3.2238 train_time: 2.0m tok/s: 5202801 +1200/20000 train_loss: 3.1735 train_time: 2.0m tok/s: 5203988 +1210/20000 train_loss: 3.1627 train_time: 2.0m tok/s: 5205335 
+1220/20000 train_loss: 3.2048 train_time: 2.0m tok/s: 5206626 +1230/20000 train_loss: 3.1775 train_time: 2.1m tok/s: 5207687 +1240/20000 train_loss: 3.2627 train_time: 2.1m tok/s: 5208845 +1250/20000 train_loss: 3.2887 train_time: 2.1m tok/s: 5210227 +1260/20000 train_loss: 3.2342 train_time: 2.1m tok/s: 5211397 +1270/20000 train_loss: 3.2247 train_time: 2.1m tok/s: 5212774 +1280/20000 train_loss: 3.1964 train_time: 2.1m tok/s: 5213682 +1290/20000 train_loss: 3.2707 train_time: 2.2m tok/s: 5214645 +1300/20000 train_loss: 3.1574 train_time: 2.2m tok/s: 5215663 +1310/20000 train_loss: 3.2014 train_time: 2.2m tok/s: 5216654 +1320/20000 train_loss: 3.2740 train_time: 2.2m tok/s: 5217544 +1330/20000 train_loss: 3.2177 train_time: 2.2m tok/s: 5218277 +1340/20000 train_loss: 3.2194 train_time: 2.2m tok/s: 5219729 +1350/20000 train_loss: 3.1796 train_time: 2.3m tok/s: 5220755 +1360/20000 train_loss: 3.3002 train_time: 2.3m tok/s: 5221680 +1370/20000 train_loss: 3.2998 train_time: 2.3m tok/s: 5222453 +1380/20000 train_loss: 3.1680 train_time: 2.3m tok/s: 5223371 +1390/20000 train_loss: 3.1650 train_time: 2.3m tok/s: 5224549 +1400/20000 train_loss: 3.3035 train_time: 2.3m tok/s: 5225440 +1410/20000 train_loss: 3.1941 train_time: 2.4m tok/s: 5226294 +1420/20000 train_loss: 3.2631 train_time: 2.4m tok/s: 5227045 +1430/20000 train_loss: 3.1340 train_time: 2.4m tok/s: 5227964 +1440/20000 train_loss: 3.2878 train_time: 2.4m tok/s: 5228979 +1450/20000 train_loss: 3.2949 train_time: 2.4m tok/s: 5229749 +1460/20000 train_loss: 3.3422 train_time: 2.4m tok/s: 5230544 +1470/20000 train_loss: 3.2212 train_time: 2.5m tok/s: 5231465 +1480/20000 train_loss: 3.2373 train_time: 2.5m tok/s: 5232422 +1490/20000 train_loss: 3.2417 train_time: 2.5m tok/s: 5233280 +1500/20000 train_loss: 3.1938 train_time: 2.5m tok/s: 5234011 +1510/20000 train_loss: 3.2313 train_time: 2.5m tok/s: 5234635 +1520/20000 train_loss: 3.2062 train_time: 2.5m tok/s: 5235306 +1530/20000 train_loss: 3.2094 train_time: 
2.6m tok/s: 5236087 +1540/20000 train_loss: 3.2229 train_time: 2.6m tok/s: 5237061 +1550/20000 train_loss: 3.1860 train_time: 2.6m tok/s: 5238028 +1560/20000 train_loss: 3.2356 train_time: 2.6m tok/s: 5238626 +1570/20000 train_loss: 3.1670 train_time: 2.6m tok/s: 5239368 +1580/20000 train_loss: 3.2862 train_time: 2.6m tok/s: 5240047 +1590/20000 train_loss: 3.2303 train_time: 2.7m tok/s: 5240904 +1600/20000 train_loss: 3.1065 train_time: 2.7m tok/s: 5241544 +1610/20000 train_loss: 3.0948 train_time: 2.7m tok/s: 5242168 +1620/20000 train_loss: 3.2246 train_time: 2.7m tok/s: 5242861 +1630/20000 train_loss: 3.2477 train_time: 2.7m tok/s: 5243446 +1640/20000 train_loss: 3.2158 train_time: 2.7m tok/s: 5244029 +1650/20000 train_loss: 3.1950 train_time: 2.7m tok/s: 5244648 +1660/20000 train_loss: 3.1848 train_time: 2.8m tok/s: 5245208 +1670/20000 train_loss: 2.9855 train_time: 2.8m tok/s: 5246079 +1680/20000 train_loss: 3.1386 train_time: 2.8m tok/s: 5246798 +1690/20000 train_loss: 3.3293 train_time: 2.8m tok/s: 5247364 +1700/20000 train_loss: 3.1869 train_time: 2.8m tok/s: 5247832 +1710/20000 train_loss: 3.2502 train_time: 2.8m tok/s: 5248691 +1720/20000 train_loss: 3.1699 train_time: 2.9m tok/s: 5249341 +1730/20000 train_loss: 3.1772 train_time: 2.9m tok/s: 5249961 +1740/20000 train_loss: 3.0751 train_time: 2.9m tok/s: 5250606 +1750/20000 train_loss: 3.1196 train_time: 2.9m tok/s: 5251263 +1760/20000 train_loss: 3.2574 train_time: 2.9m tok/s: 5251972 +1770/20000 train_loss: 3.2880 train_time: 2.9m tok/s: 5252675 +1780/20000 train_loss: 3.1880 train_time: 3.0m tok/s: 5253108 +1790/20000 train_loss: 3.2375 train_time: 3.0m tok/s: 5253626 +1800/20000 train_loss: 3.1973 train_time: 3.0m tok/s: 5254233 +1810/20000 train_loss: 3.2515 train_time: 3.0m tok/s: 5254743 +1820/20000 train_loss: 3.1401 train_time: 3.0m tok/s: 5255254 +1830/20000 train_loss: 3.1648 train_time: 3.0m tok/s: 5255738 +1840/20000 train_loss: 3.2188 train_time: 3.1m tok/s: 5256203 +1850/20000 train_loss: 
3.1984 train_time: 3.1m tok/s: 5256697 +1860/20000 train_loss: 3.1726 train_time: 3.1m tok/s: 5257244 +1870/20000 train_loss: 3.2612 train_time: 3.1m tok/s: 5257764 +1880/20000 train_loss: 3.1406 train_time: 3.1m tok/s: 5258242 +1890/20000 train_loss: 3.1196 train_time: 3.1m tok/s: 5258757 +1900/20000 train_loss: 3.1684 train_time: 3.2m tok/s: 5259353 +1910/20000 train_loss: 3.1283 train_time: 3.2m tok/s: 5259832 +1920/20000 train_loss: 3.0981 train_time: 3.2m tok/s: 5260277 +1930/20000 train_loss: 3.2053 train_time: 3.2m tok/s: 5260726 +1940/20000 train_loss: 3.1091 train_time: 3.2m tok/s: 5260988 +1950/20000 train_loss: 3.1173 train_time: 3.2m tok/s: 5261466 +1960/20000 train_loss: 3.0977 train_time: 3.3m tok/s: 5262049 +1970/20000 train_loss: 3.2155 train_time: 3.3m tok/s: 5262653 +1980/20000 train_loss: 3.1422 train_time: 3.3m tok/s: 5263188 +1990/20000 train_loss: 3.1870 train_time: 3.3m tok/s: 5263572 +2000/20000 train_loss: 3.1168 train_time: 3.3m tok/s: 5264047 +2010/20000 train_loss: 3.1508 train_time: 3.3m tok/s: 5264496 +2020/20000 train_loss: 3.1895 train_time: 3.4m tok/s: 5264898 +2030/20000 train_loss: 3.1060 train_time: 3.4m tok/s: 5265409 +2040/20000 train_loss: 3.2113 train_time: 3.4m tok/s: 5265898 +2050/20000 train_loss: 3.1506 train_time: 3.4m tok/s: 5266316 +2060/20000 train_loss: 3.1167 train_time: 3.4m tok/s: 5266646 +2070/20000 train_loss: 3.0822 train_time: 3.4m tok/s: 5267020 +2080/20000 train_loss: 3.1458 train_time: 3.5m tok/s: 5267390 +2090/20000 train_loss: 3.1525 train_time: 3.5m tok/s: 5267797 +2100/20000 train_loss: 3.1294 train_time: 3.5m tok/s: 5268225 +2110/20000 train_loss: 3.1722 train_time: 3.5m tok/s: 5268612 +2120/20000 train_loss: 3.1002 train_time: 3.5m tok/s: 5268952 +2130/20000 train_loss: 3.0064 train_time: 3.5m tok/s: 5269358 +2140/20000 train_loss: 3.2313 train_time: 3.5m tok/s: 5269729 +2150/20000 train_loss: 3.1449 train_time: 3.6m tok/s: 5270272 +2160/20000 train_loss: 3.1575 train_time: 3.6m tok/s: 5270678 
+2170/20000 train_loss: 3.2059 train_time: 3.6m tok/s: 5270974 +2180/20000 train_loss: 3.1254 train_time: 3.6m tok/s: 5271351 +2190/20000 train_loss: 3.1400 train_time: 3.6m tok/s: 5271802 +2200/20000 train_loss: 3.2313 train_time: 3.6m tok/s: 5272245 +2210/20000 train_loss: 3.1351 train_time: 3.7m tok/s: 5272556 +2220/20000 train_loss: 3.1253 train_time: 3.7m tok/s: 5272830 +2230/20000 train_loss: 3.1003 train_time: 3.7m tok/s: 5273229 +2240/20000 train_loss: 3.1790 train_time: 3.7m tok/s: 5273573 +2250/20000 train_loss: 3.0548 train_time: 3.7m tok/s: 5273884 +2260/20000 train_loss: 3.1413 train_time: 3.7m tok/s: 5274230 +2270/20000 train_loss: 3.2009 train_time: 3.8m tok/s: 5274707 +2280/20000 train_loss: 3.1714 train_time: 3.8m tok/s: 5274994 +2290/20000 train_loss: 3.1520 train_time: 3.8m tok/s: 5275398 +2300/20000 train_loss: 3.2165 train_time: 3.8m tok/s: 5275749 +2310/20000 train_loss: 2.9885 train_time: 3.8m tok/s: 5276093 +2320/20000 train_loss: 3.1383 train_time: 3.8m tok/s: 5276555 +2330/20000 train_loss: 3.1888 train_time: 3.9m tok/s: 5276927 +2340/20000 train_loss: 3.0652 train_time: 3.9m tok/s: 5277291 +2350/20000 train_loss: 3.1423 train_time: 3.9m tok/s: 5277543 +2360/20000 train_loss: 3.1953 train_time: 3.9m tok/s: 5277805 +2370/20000 train_loss: 3.1349 train_time: 3.9m tok/s: 5278172 +2380/20000 train_loss: 3.0799 train_time: 3.9m tok/s: 5278529 +2390/20000 train_loss: 3.0591 train_time: 4.0m tok/s: 5278857 +2400/20000 train_loss: 3.1911 train_time: 4.0m tok/s: 5279164 +2410/20000 train_loss: 3.1634 train_time: 4.0m tok/s: 5279419 +2420/20000 train_loss: 3.0694 train_time: 4.0m tok/s: 5279751 +2430/20000 train_loss: 3.1598 train_time: 4.0m tok/s: 5280056 +2440/20000 train_loss: 3.1960 train_time: 4.0m tok/s: 5280266 +2450/20000 train_loss: 3.2626 train_time: 4.1m tok/s: 5280582 +2460/20000 train_loss: 3.1590 train_time: 4.1m tok/s: 5280912 +2470/20000 train_loss: 3.1675 train_time: 4.1m tok/s: 5281222 +2480/20000 train_loss: 3.3248 train_time: 
4.1m tok/s: 5281523 +2490/20000 train_loss: 3.2332 train_time: 4.1m tok/s: 5281784 +2500/20000 train_loss: 3.1167 train_time: 4.1m tok/s: 5282001 +2510/20000 train_loss: 3.2494 train_time: 4.2m tok/s: 5282312 +2520/20000 train_loss: 3.2439 train_time: 4.2m tok/s: 5282637 +2530/20000 train_loss: 3.1994 train_time: 4.2m tok/s: 5282955 +2540/20000 train_loss: 3.0294 train_time: 4.2m tok/s: 5283283 +2550/20000 train_loss: 3.1559 train_time: 4.2m tok/s: 5283532 +2560/20000 train_loss: 3.0246 train_time: 4.2m tok/s: 5283802 +2570/20000 train_loss: 3.1583 train_time: 4.2m tok/s: 5284050 +2580/20000 train_loss: 3.2215 train_time: 4.3m tok/s: 5284343 +2590/20000 train_loss: 3.1329 train_time: 4.3m tok/s: 5284654 +2600/20000 train_loss: 3.1280 train_time: 4.3m tok/s: 5284958 +2610/20000 train_loss: 3.1403 train_time: 4.3m tok/s: 5285210 +2620/20000 train_loss: 3.1419 train_time: 4.3m tok/s: 5285450 +2630/20000 train_loss: 3.0689 train_time: 4.3m tok/s: 5285731 +2640/20000 train_loss: 3.2020 train_time: 4.4m tok/s: 5286023 +2650/20000 train_loss: 3.0470 train_time: 4.4m tok/s: 5286201 +2660/20000 train_loss: 3.1679 train_time: 4.4m tok/s: 5286482 +2670/20000 train_loss: 3.1352 train_time: 4.4m tok/s: 5286771 +2680/20000 train_loss: 3.0942 train_time: 4.4m tok/s: 5287054 +2690/20000 train_loss: 3.0957 train_time: 4.4m tok/s: 5287299 +2700/20000 train_loss: 3.1023 train_time: 4.5m tok/s: 5287654 +2710/20000 train_loss: 3.2364 train_time: 4.5m tok/s: 5287936 +2720/20000 train_loss: 3.1001 train_time: 4.5m tok/s: 5288197 +2730/20000 train_loss: 3.1649 train_time: 4.5m tok/s: 5288478 +2740/20000 train_loss: 3.0935 train_time: 4.5m tok/s: 5288670 +2750/20000 train_loss: 3.0709 train_time: 4.5m tok/s: 5288932 +2760/20000 train_loss: 3.1864 train_time: 4.6m tok/s: 5289193 +2770/20000 train_loss: 3.1287 train_time: 4.6m tok/s: 5289425 +2780/20000 train_loss: 3.1339 train_time: 4.6m tok/s: 5289675 +2790/20000 train_loss: 3.2110 train_time: 4.6m tok/s: 5289902 +2800/20000 train_loss: 
3.2080 train_time: 4.6m tok/s: 5290165 +2810/20000 train_loss: 3.1893 train_time: 4.6m tok/s: 5290436 +2820/20000 train_loss: 3.0961 train_time: 4.7m tok/s: 5290648 +2830/20000 train_loss: 3.1743 train_time: 4.7m tok/s: 5290889 +2840/20000 train_loss: 3.1790 train_time: 4.7m tok/s: 5291166 +2850/20000 train_loss: 3.1939 train_time: 4.7m tok/s: 5291402 +2860/20000 train_loss: 3.0920 train_time: 4.7m tok/s: 5291560 +2870/20000 train_loss: 3.1292 train_time: 4.7m tok/s: 5291866 +2880/20000 train_loss: 3.1123 train_time: 4.8m tok/s: 5292172 +2890/20000 train_loss: 3.1837 train_time: 4.8m tok/s: 5292424 +2900/20000 train_loss: 3.1294 train_time: 4.8m tok/s: 5292657 +2910/20000 train_loss: 3.1509 train_time: 4.8m tok/s: 5292874 +2920/20000 train_loss: 3.1563 train_time: 4.8m tok/s: 5293066 +2930/20000 train_loss: 3.1186 train_time: 4.8m tok/s: 5293289 +2940/20000 train_loss: 3.0129 train_time: 4.9m tok/s: 5293530 +2950/20000 train_loss: 3.1578 train_time: 4.9m tok/s: 5293744 +2960/20000 train_loss: 3.1676 train_time: 4.9m tok/s: 5293949 +layer_loop:enabled step:2969 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +2970/20000 train_loss: 4.3754 train_time: 4.9m tok/s: 5293633 +2980/20000 train_loss: 3.2440 train_time: 4.9m tok/s: 5289444 +2990/20000 train_loss: 3.1413 train_time: 4.9m tok/s: 5285272 +3000/20000 train_loss: 3.2112 train_time: 5.0m tok/s: 5281169 +3010/20000 train_loss: 3.2315 train_time: 5.0m tok/s: 5277101 +3020/20000 train_loss: 3.1631 train_time: 5.0m tok/s: 5273035 +3030/20000 train_loss: 3.1521 train_time: 5.0m tok/s: 5269011 +3040/20000 train_loss: 3.1149 train_time: 5.0m tok/s: 5265059 +3050/20000 train_loss: 3.1273 train_time: 5.1m tok/s: 5261029 +3060/20000 train_loss: 3.0503 train_time: 5.1m tok/s: 5257093 +3070/20000 train_loss: 3.1702 train_time: 5.1m tok/s: 5253184 +3080/20000 train_loss: 3.1386 train_time: 5.1m tok/s: 5249333 +3090/20000 train_loss: 3.1302 train_time: 5.1m tok/s: 5245534 +3100/20000 train_loss: 
3.1764 train_time: 5.2m tok/s: 5241709 +3110/20000 train_loss: 3.1486 train_time: 5.2m tok/s: 5237965 +3120/20000 train_loss: 3.2177 train_time: 5.2m tok/s: 5234204 +3130/20000 train_loss: 3.1617 train_time: 5.2m tok/s: 5230484 +3140/20000 train_loss: 3.0861 train_time: 5.2m tok/s: 5226844 +3150/20000 train_loss: 2.9957 train_time: 5.3m tok/s: 5223211 +3160/20000 train_loss: 3.1093 train_time: 5.3m tok/s: 5219586 +3170/20000 train_loss: 3.0585 train_time: 5.3m tok/s: 5215937 +3180/20000 train_loss: 3.1983 train_time: 5.3m tok/s: 5212308 +3190/20000 train_loss: 3.0386 train_time: 5.4m tok/s: 5208752 +3200/20000 train_loss: 3.1336 train_time: 5.4m tok/s: 5205190 +3210/20000 train_loss: 3.1449 train_time: 5.4m tok/s: 5201670 +3220/20000 train_loss: 3.1507 train_time: 5.4m tok/s: 5198216 +3230/20000 train_loss: 3.1460 train_time: 5.4m tok/s: 5194778 +3240/20000 train_loss: 3.2459 train_time: 5.5m tok/s: 5191376 +3250/20000 train_loss: 3.1512 train_time: 5.5m tok/s: 5187985 +3260/20000 train_loss: 3.0361 train_time: 5.5m tok/s: 5184604 +3270/20000 train_loss: 3.1088 train_time: 5.5m tok/s: 5181265 +3280/20000 train_loss: 3.1102 train_time: 5.5m tok/s: 5177956 +3290/20000 train_loss: 3.2754 train_time: 5.6m tok/s: 5174652 +3300/20000 train_loss: 3.0571 train_time: 5.6m tok/s: 5171398 +3310/20000 train_loss: 3.0327 train_time: 5.6m tok/s: 5168131 +3320/20000 train_loss: 3.0900 train_time: 5.6m tok/s: 5164932 +3330/20000 train_loss: 3.0706 train_time: 5.6m tok/s: 5161749 +3340/20000 train_loss: 3.0716 train_time: 5.7m tok/s: 5158605 +3350/20000 train_loss: 3.1130 train_time: 5.7m tok/s: 5155496 +3360/20000 train_loss: 3.0762 train_time: 5.7m tok/s: 5152361 +3370/20000 train_loss: 3.0029 train_time: 5.7m tok/s: 5149243 +3380/20000 train_loss: 3.1588 train_time: 5.7m tok/s: 5146163 +3390/20000 train_loss: 3.1359 train_time: 5.8m tok/s: 5143036 +3400/20000 train_loss: 3.0646 train_time: 5.8m tok/s: 5140083 +3410/20000 train_loss: 3.0981 train_time: 5.8m tok/s: 5137143 
+3420/20000 train_loss: 3.2233 train_time: 5.8m tok/s: 5134133 +3430/20000 train_loss: 3.0341 train_time: 5.8m tok/s: 5131132 +3440/20000 train_loss: 3.0797 train_time: 5.9m tok/s: 5128396 +3450/20000 train_loss: 3.0600 train_time: 5.9m tok/s: 5125462 +3460/20000 train_loss: 3.2136 train_time: 5.9m tok/s: 5122531 +3470/20000 train_loss: 3.1470 train_time: 5.9m tok/s: 5119601 +3480/20000 train_loss: 3.0919 train_time: 5.9m tok/s: 5116690 +3490/20000 train_loss: 3.0904 train_time: 6.0m tok/s: 5113819 +3500/20000 train_loss: 3.0950 train_time: 6.0m tok/s: 5111005 +3510/20000 train_loss: 3.1043 train_time: 6.0m tok/s: 5108197 +3520/20000 train_loss: 3.0822 train_time: 6.0m tok/s: 5105398 +3530/20000 train_loss: 3.1374 train_time: 6.0m tok/s: 5102595 +3540/20000 train_loss: 3.1640 train_time: 6.1m tok/s: 5099855 +3550/20000 train_loss: 3.0527 train_time: 6.1m tok/s: 5097153 +3560/20000 train_loss: 3.0622 train_time: 6.1m tok/s: 5094435 +3570/20000 train_loss: 3.0759 train_time: 6.1m tok/s: 5091767 +3580/20000 train_loss: 3.0535 train_time: 6.1m tok/s: 5089109 +3590/20000 train_loss: 3.0328 train_time: 6.2m tok/s: 5086470 +3600/20000 train_loss: 3.1020 train_time: 6.2m tok/s: 5083854 +3610/20000 train_loss: 3.0381 train_time: 6.2m tok/s: 5081195 +3620/20000 train_loss: 3.1677 train_time: 6.2m tok/s: 5078552 +3630/20000 train_loss: 3.0011 train_time: 6.2m tok/s: 5075996 +3640/20000 train_loss: 3.0386 train_time: 6.3m tok/s: 5073462 +3650/20000 train_loss: 3.1982 train_time: 6.3m tok/s: 5070813 +3660/20000 train_loss: 3.0222 train_time: 6.3m tok/s: 5068252 +3670/20000 train_loss: 3.1082 train_time: 6.3m tok/s: 5065721 +3680/20000 train_loss: 3.0798 train_time: 6.4m tok/s: 5063213 +3690/20000 train_loss: 3.1186 train_time: 6.4m tok/s: 5060701 +3700/20000 train_loss: 3.0973 train_time: 6.4m tok/s: 5058186 +3710/20000 train_loss: 3.1232 train_time: 6.4m tok/s: 5055698 +3720/20000 train_loss: 3.0996 train_time: 6.4m tok/s: 5053201 +3730/20000 train_loss: 3.0411 train_time: 
6.5m tok/s: 5050716 +3740/20000 train_loss: 2.9908 train_time: 6.5m tok/s: 5048303 +3750/20000 train_loss: 3.0357 train_time: 6.5m tok/s: 5045916 +3760/20000 train_loss: 2.9708 train_time: 6.5m tok/s: 5043528 +3770/20000 train_loss: 3.0799 train_time: 6.5m tok/s: 5041142 +3780/20000 train_loss: 3.0896 train_time: 6.6m tok/s: 5038829 +3790/20000 train_loss: 3.1641 train_time: 6.6m tok/s: 5036471 +3800/20000 train_loss: 3.0656 train_time: 6.6m tok/s: 5034109 +3810/20000 train_loss: 3.1455 train_time: 6.6m tok/s: 5031784 +3820/20000 train_loss: 3.0445 train_time: 6.6m tok/s: 5029479 +3830/20000 train_loss: 2.9917 train_time: 6.7m tok/s: 5027227 +3840/20000 train_loss: 3.0490 train_time: 6.7m tok/s: 5025002 +3850/20000 train_loss: 3.0962 train_time: 6.7m tok/s: 5022761 +3860/20000 train_loss: 3.0892 train_time: 6.7m tok/s: 5020524 +3870/20000 train_loss: 3.0408 train_time: 6.7m tok/s: 5018327 +3880/20000 train_loss: 3.0982 train_time: 6.8m tok/s: 5016053 +3890/20000 train_loss: 3.1080 train_time: 6.8m tok/s: 5013848 +3900/20000 train_loss: 2.9916 train_time: 6.8m tok/s: 5011670 +3910/20000 train_loss: 3.0866 train_time: 6.8m tok/s: 5009510 +3920/20000 train_loss: 3.2058 train_time: 6.8m tok/s: 5007402 +3930/20000 train_loss: 3.1480 train_time: 6.9m tok/s: 5005268 +3940/20000 train_loss: 3.1116 train_time: 6.9m tok/s: 5003137 +3950/20000 train_loss: 2.9972 train_time: 6.9m tok/s: 5001019 +3960/20000 train_loss: 3.0631 train_time: 6.9m tok/s: 4998932 +3970/20000 train_loss: 3.0657 train_time: 6.9m tok/s: 4996858 +3980/20000 train_loss: 3.1644 train_time: 7.0m tok/s: 4994765 +3990/20000 train_loss: 3.1027 train_time: 7.0m tok/s: 4992631 +4000/20000 train_loss: 2.9853 train_time: 7.0m tok/s: 4990562 +4000/20000 val_loss: 3.0172 val_bpb: 1.1681 +4010/20000 train_loss: 2.9880 train_time: 7.0m tok/s: 4989460 +4020/20000 train_loss: 3.0736 train_time: 7.0m tok/s: 4987559 +4030/20000 train_loss: 3.0392 train_time: 7.1m tok/s: 4985612 +4040/20000 train_loss: 3.0258 train_time: 
7.1m tok/s: 4983589 +4050/20000 train_loss: 3.1073 train_time: 7.1m tok/s: 4981622 +4060/20000 train_loss: 3.0571 train_time: 7.1m tok/s: 4979602 +4070/20000 train_loss: 3.0790 train_time: 7.1m tok/s: 4977619 +4080/20000 train_loss: 3.0507 train_time: 7.2m tok/s: 4975768 +4090/20000 train_loss: 3.0665 train_time: 7.2m tok/s: 4973875 +4100/20000 train_loss: 3.1025 train_time: 7.2m tok/s: 4971908 +4110/20000 train_loss: 3.1017 train_time: 7.2m tok/s: 4969975 +4120/20000 train_loss: 3.0276 train_time: 7.2m tok/s: 4968045 +4130/20000 train_loss: 3.0749 train_time: 7.3m tok/s: 4966144 +4140/20000 train_loss: 3.0631 train_time: 7.3m tok/s: 4964279 +4150/20000 train_loss: 3.0260 train_time: 7.3m tok/s: 4962410 +4160/20000 train_loss: 3.0911 train_time: 7.3m tok/s: 4960571 +4170/20000 train_loss: 3.2009 train_time: 7.3m tok/s: 4958723 +4180/20000 train_loss: 2.9794 train_time: 7.4m tok/s: 4956918 +4190/20000 train_loss: 3.0088 train_time: 7.4m tok/s: 4955111 +4200/20000 train_loss: 2.9449 train_time: 7.4m tok/s: 4953334 +4210/20000 train_loss: 3.1409 train_time: 7.4m tok/s: 4951527 +4220/20000 train_loss: 3.0997 train_time: 7.4m tok/s: 4949716 +4230/20000 train_loss: 3.0840 train_time: 7.5m tok/s: 4947868 +4240/20000 train_loss: 3.0441 train_time: 7.5m tok/s: 4946012 +4250/20000 train_loss: 3.1961 train_time: 7.5m tok/s: 4944184 +4260/20000 train_loss: 3.0396 train_time: 7.5m tok/s: 4942402 +4270/20000 train_loss: 3.0311 train_time: 7.6m tok/s: 4940683 +4280/20000 train_loss: 3.0587 train_time: 7.6m tok/s: 4938931 +4290/20000 train_loss: 2.9515 train_time: 7.6m tok/s: 4937212 +4300/20000 train_loss: 3.0032 train_time: 7.6m tok/s: 4935477 +4310/20000 train_loss: 3.0619 train_time: 7.6m tok/s: 4933749 +4320/20000 train_loss: 3.1025 train_time: 7.7m tok/s: 4932087 +4330/20000 train_loss: 3.0661 train_time: 7.7m tok/s: 4930442 +4340/20000 train_loss: 3.1466 train_time: 7.7m tok/s: 4928775 +4350/20000 train_loss: 3.0577 train_time: 7.7m tok/s: 4927135 +4360/20000 train_loss: 
3.0541 train_time: 7.7m tok/s: 4925417
+4370/20000 train_loss: 3.0317 train_time: 7.8m tok/s: 4923769
+4380/20000 train_loss: 3.0354 train_time: 7.8m tok/s: 4922098
+4390/20000 train_loss: 2.9119 train_time: 7.8m tok/s: 4920389
+4400/20000 train_loss: 3.0881 train_time: 7.8m tok/s: 4915898
+4410/20000 train_loss: 3.0636 train_time: 7.8m tok/s: 4911389
+4420/20000 train_loss: 2.9293 train_time: 7.9m tok/s: 4909809
+4430/20000 train_loss: 3.0156 train_time: 7.9m tok/s: 4908193
+4440/20000 train_loss: 3.2117 train_time: 7.9m tok/s: 4906680
+4450/20000 train_loss: 2.9845 train_time: 7.9m tok/s: 4905106
+4460/20000 train_loss: 3.1112 train_time: 7.9m tok/s: 4903527
+4470/20000 train_loss: 3.0030 train_time: 8.0m tok/s: 4901986
+4480/20000 train_loss: 3.1100 train_time: 8.0m tok/s: 4900438
+4490/20000 train_loss: 3.0177 train_time: 8.0m tok/s: 4898879
+4500/20000 train_loss: 3.1484 train_time: 8.0m tok/s: 4897379
+4510/20000 train_loss: 3.0019 train_time: 8.0m tok/s: 4895874
+4520/20000 train_loss: 2.9475 train_time: 8.1m tok/s: 4894375
+4530/20000 train_loss: 3.0009 train_time: 8.1m tok/s: 4892872
+4540/20000 train_loss: 3.1043 train_time: 8.1m tok/s: 4891359
+4550/20000 train_loss: 3.0363 train_time: 8.1m tok/s: 4889841
+4560/20000 train_loss: 3.0484 train_time: 8.2m tok/s: 4888353
+4570/20000 train_loss: 3.0336 train_time: 8.2m tok/s: 4886876
+4580/20000 train_loss: 3.0913 train_time: 8.2m tok/s: 4885373
+4590/20000 train_loss: 2.9636 train_time: 8.2m tok/s: 4883916
+4600/20000 train_loss: 3.0311 train_time: 8.2m tok/s: 4882499
+4610/20000 train_loss: 3.0789 train_time: 8.3m tok/s: 4881062
+4620/20000 train_loss: 3.0569 train_time: 8.3m tok/s: 4879578
+4630/20000 train_loss: 2.9899 train_time: 8.3m tok/s: 4878149
+4640/20000 train_loss: 3.0477 train_time: 8.3m tok/s: 4876705
+4650/20000 train_loss: 2.9747 train_time: 8.3m tok/s: 4875280
+4660/20000 train_loss: 3.0151 train_time: 8.4m tok/s: 4873885
+4670/20000 train_loss: 3.0111 train_time: 8.4m tok/s: 4872454
+4680/20000 train_loss: 3.0860 train_time: 8.4m tok/s: 4871064
+4690/20000 train_loss: 3.0087 train_time: 8.4m tok/s: 4869669
+4700/20000 train_loss: 3.0105 train_time: 8.4m tok/s: 4868267
+4710/20000 train_loss: 2.9331 train_time: 8.5m tok/s: 4866886
+4720/20000 train_loss: 3.0388 train_time: 8.5m tok/s: 4865457
+4730/20000 train_loss: 2.9704 train_time: 8.5m tok/s: 4864142
+4740/20000 train_loss: 3.0664 train_time: 8.5m tok/s: 4862831
+4750/20000 train_loss: 2.9146 train_time: 8.5m tok/s: 4861490
+4760/20000 train_loss: 3.0514 train_time: 8.6m tok/s: 4860152
+4770/20000 train_loss: 2.9462 train_time: 8.6m tok/s: 4858834
+4780/20000 train_loss: 3.0801 train_time: 8.6m tok/s: 4857528
+4790/20000 train_loss: 3.0397 train_time: 8.6m tok/s: 4856188
+4800/20000 train_loss: 3.0320 train_time: 8.6m tok/s: 4854874
+4810/20000 train_loss: 3.0093 train_time: 8.7m tok/s: 4853536
+4820/20000 train_loss: 3.0018 train_time: 8.7m tok/s: 4852157
+4830/20000 train_loss: 2.9813 train_time: 8.7m tok/s: 4850849
+4840/20000 train_loss: 2.9492 train_time: 8.7m tok/s: 4849537
+4850/20000 train_loss: 3.0252 train_time: 8.7m tok/s: 4848265
+4860/20000 train_loss: 3.0462 train_time: 8.8m tok/s: 4846956
+4870/20000 train_loss: 2.9422 train_time: 8.8m tok/s: 4845625
+4880/20000 train_loss: 3.0017 train_time: 8.8m tok/s: 4844367
+4890/20000 train_loss: 3.0437 train_time: 8.8m tok/s: 4843069
+4900/20000 train_loss: 3.0350 train_time: 8.8m tok/s: 4841770
+4910/20000 train_loss: 3.0219 train_time: 8.9m tok/s: 4840507
+4920/20000 train_loss: 3.0828 train_time: 8.9m tok/s: 4839227
+4930/20000 train_loss: 3.0331 train_time: 8.9m tok/s: 4837923
+4940/20000 train_loss: 2.9841 train_time: 8.9m tok/s: 4836665
+4950/20000 train_loss: 2.9730 train_time: 8.9m tok/s: 4835398
+4960/20000 train_loss: 2.9632 train_time: 9.0m tok/s: 4834099
+4970/20000 train_loss: 3.0974 train_time: 9.0m tok/s: 4832867
+4980/20000 train_loss: 3.0871 train_time: 9.0m tok/s: 4831635
+4990/20000 train_loss: 2.9948 train_time: 9.0m tok/s: 4830443
+5000/20000 train_loss: 3.0452 train_time: 9.0m tok/s: 4829219
+5010/20000 train_loss: 3.0634 train_time: 9.1m tok/s: 4827989
+5020/20000 train_loss: 2.9839 train_time: 9.1m tok/s: 4826800
+5030/20000 train_loss: 3.0245 train_time: 9.1m tok/s: 4825626
+5040/20000 train_loss: 3.0310 train_time: 9.1m tok/s: 4824448
+5050/20000 train_loss: 2.9454 train_time: 9.1m tok/s: 4823223
+5060/20000 train_loss: 3.1266 train_time: 9.2m tok/s: 4822033
+5070/20000 train_loss: 2.9731 train_time: 9.2m tok/s: 4820879
+5080/20000 train_loss: 2.9285 train_time: 9.2m tok/s: 4819676
+5090/20000 train_loss: 2.9922 train_time: 9.2m tok/s: 4818497
+5100/20000 train_loss: 2.9560 train_time: 9.3m tok/s: 4817387
+5110/20000 train_loss: 2.9583 train_time: 9.3m tok/s: 4816262
+5120/20000 train_loss: 2.9323 train_time: 9.3m tok/s: 4815128
+5130/20000 train_loss: 2.9251 train_time: 9.3m tok/s: 4813995
+5140/20000 train_loss: 2.9989 train_time: 9.3m tok/s: 4812879
+5150/20000 train_loss: 3.0533 train_time: 9.4m tok/s: 4811759
+5160/20000 train_loss: 2.8994 train_time: 9.4m tok/s: 4810656
+5170/20000 train_loss: 2.9263 train_time: 9.4m tok/s: 4809493
+5180/20000 train_loss: 3.0045 train_time: 9.4m tok/s: 4808399
+5190/20000 train_loss: 2.9321 train_time: 9.4m tok/s: 4807263
+5200/20000 train_loss: 2.9679 train_time: 9.5m tok/s: 4806172
+5210/20000 train_loss: 2.9017 train_time: 9.5m tok/s: 4805078
+5220/20000 train_loss: 2.9413 train_time: 9.5m tok/s: 4803984
+5230/20000 train_loss: 2.9737 train_time: 9.5m tok/s: 4802894
+5240/20000 train_loss: 3.0099 train_time: 9.5m tok/s: 4801783
+5250/20000 train_loss: 2.9688 train_time: 9.6m tok/s: 4800662
+5260/20000 train_loss: 2.9975 train_time: 9.6m tok/s: 4799540
+5270/20000 train_loss: 2.9486 train_time: 9.6m tok/s: 4798398
+5280/20000 train_loss: 3.0083 train_time: 9.6m tok/s: 4797287
+5290/20000 train_loss: 3.0262 train_time: 9.6m tok/s: 4796231
+5300/20000 train_loss: 3.0615 train_time: 9.7m tok/s: 4795124
+5310/20000 train_loss: 2.9194 train_time: 9.7m tok/s: 4794071
+5320/20000 train_loss: 2.9113 train_time: 9.7m tok/s: 4793060
+5330/20000 train_loss: 2.9864 train_time: 9.7m tok/s: 4792014
+5340/20000 train_loss: 2.8677 train_time: 9.7m tok/s: 4790955
+5350/20000 train_loss: 2.9069 train_time: 9.8m tok/s: 4789920
+5360/20000 train_loss: 2.9976 train_time: 9.8m tok/s: 4788821
+5370/20000 train_loss: 3.0932 train_time: 9.8m tok/s: 4787769
+5370/20000 val_loss: 2.8496 val_bpb: 1.1032
+stopping_early: wallclock_cap train_time: 588046ms step: 5370/20000
+peak memory allocated: 25639 MiB reserved: 25652 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.84707472 val_bpb:1.10220420 eval_time:6510ms
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+Serialized model: 135615079 bytes
+Code size: 151202 bytes
+GPTQ:collecting Hessians from calibration data...
+[prefetch] daemon started: depth=4 pinned=True
+GPTQ:collected 67 Hessians in 8.2s
+[IDEA-064 parallel_gptq] enabled — multi-clip search active
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.617861
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.593897
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.618516
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.604933
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.634869
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.605958
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.611886
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.621480
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.859955
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.827493
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.835605
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.872845
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.809754
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.801073
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.821743
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.871624
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.779060
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.773111
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.752426
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.767419
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.789825
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.761685
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.773999
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.783729
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.726159
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.730881
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.720762
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.730441
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.731945
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.720289
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.735800
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.728994
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=104.556138
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=104.490570
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=104.557396
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=104.374110
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=104.453239
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=104.644480
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=104.374704
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=104.145801
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.424587
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.422445
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.429791
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.428892
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.425254
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.425597
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.425594
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.434211
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.373428
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.379567
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.359279
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.366664
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.373662
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.362286
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.368340
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.389201
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.357116
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.362168
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.366091
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.362778
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.366431
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.352409
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.355331
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.371057
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.064914
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.063421
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.065871
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.064545
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.064716
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.064086
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.068302
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.062484
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=42.027545
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=41.982762
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=42.032957
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=41.941409
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=41.947841
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=41.983801
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=42.024306
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=41.896217
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.590375
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.584420
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.579434
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.589501
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.589900
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.585263
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.585510
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.592694
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.621535
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.626743
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.623480
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.623669
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.624615
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.625691
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.624488
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.626836
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.208780
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.213873
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.203703
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.209313
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.218126
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.210486
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.224093
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.203526
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.135917
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.138105
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.135675
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.136622
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.134591
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.137627
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.134671
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.136437
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.834761
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.797404
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.800121
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.768756
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.809492
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.787555
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.826828
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.771397
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.832781
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.828821
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.832256
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.832132
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.836373
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.829907
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.831248
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.834483
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.126204
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.109683
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.108543
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.118701
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.116168
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.121879
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.114068
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.118032
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.868898
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.863499
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.868695
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.861788
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.869611
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.867753
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.867839
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.864431
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.646005
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.647154
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.644719
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.646081
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.642346
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.642754
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.642293
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.646651
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=9.341514
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=9.326949
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=9.335270
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=9.307977
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=9.336748
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=9.334996
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=9.334739
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=9.322287
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.820258
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.788829
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.803138
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.811478
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.811087
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.810018
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.772093
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.814538
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=15.137974
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=15.126779
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=15.121104
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=15.104484
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=15.124388
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=15.105775
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=15.123096
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=15.149712
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.818636
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.813943
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.814159
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.808884
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.810259
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.809871
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.802230
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.817305
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.620888
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.618308
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.607514
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.617062
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.611678
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.614206
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.619605
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.621163
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=15.126809
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=15.123929
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=15.118630
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=15.096999
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=15.083303
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=15.103712
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=15.080526
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=15.122399
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.245397
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.239988
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.252655
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.245661
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.237966
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.241978
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.230988
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.260389
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=16.228453
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=16.238840
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=16.223420
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=16.236746
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=16.227280
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=16.209261
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=16.228905
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=16.237635
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.601036
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.590265
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.590939
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.606532
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.595963
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.590308
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.593658
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.607646
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.055939
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.058773
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.055796
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.064307
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.057324
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.058500
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.061039
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.062214
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.071416
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.083524
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.070637
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.054822
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.013238
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.064571
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.031781
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.057688
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.818569
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.820032
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.819583
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.819741
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.817583
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.817404
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.822298
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.822416
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.518599
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.519101
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.516894
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.517726
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.517486
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.516273
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.530037
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.521990
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.145243
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.138110
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.144704
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.137836
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.136791
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.144994
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.147857
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.140694
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.269307
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.268665
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.271102
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.267141
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.268026
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.270789
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.269774
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.270928
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.944834
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.944272
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.941063
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.945412
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.938450
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.945096
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.938551
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.943490
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.464811
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.464963
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.465277
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.462145
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.463442
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.463850
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.465857
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.465801
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.902346
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.903805
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.902216
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.899287
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.901886
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.899539
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.904693
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.903737
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.796695
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.798192
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.799029
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.793577
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.790500
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.793657
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.798031
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.796399
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.649199
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.647600
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.652928
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.649884
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.640362
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.642913
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.640480
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.652702
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.986007
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.986542
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.983739
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.988399
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.985664
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.982669
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.979439
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.983086
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.877358
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.877508
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.878038
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.876720
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.876599
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.876368
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.878175
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.877060
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.725368
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.722611
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.726653
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.723082
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.725772
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.722189
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.727829
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.723205
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.455752
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.454386
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.452497
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.451742
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.451085
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.451512
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.454573
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.454175
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.412624
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.416454
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.418826
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.411074
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.414864
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.409435
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.416546
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.417357
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.815954
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.819317
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows
using 64 workers, avg_best_err=0.814555 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.816925 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.815410 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.813788 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.809243 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.813644 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.524507 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.524465 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.524227 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.525753 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.525847 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.525873 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.526777 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.524252 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.077959 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.077225 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.075292 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.078181 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.077824 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.079565 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.080803 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows 
using 64 workers, avg_best_err=1.075158 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.270762 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.268853 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.270049 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.273083 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.272109 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.273152 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.274289 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.268573 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.216153 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.215384 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.214613 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.214710 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.215152 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.214913 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.216143 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.215352 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.666108 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.665123 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.670058 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.665428 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=0.667338 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.664573 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.660224 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.663693 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.505306 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.505230 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.505617 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.505114 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.505275 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.504713 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.505313 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.505900 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.781855 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.780749 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.780413 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.780912 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.781896 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.781944 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.781566 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.783212 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.079344 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows 
using 64 workers, avg_best_err=1.074946 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.073953 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.074729 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.076743 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.076520 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.078105 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.078227 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.498058 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.499032 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.498536 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.498058 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.498666 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.498861 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.498908 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.498271 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.153869 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.153400 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.150999 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.151583 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.158776 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.155426 +Quantized weights: + gptq (int5): tok_emb.weight + gptq 
(int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + passthrough (float16): blocks.attn.gate_proj.bias, blocks.attn.gate_proj.weight, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, lane_merge, skip_gates, skip_weights +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.144449 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.150792 +Serialized model quantized+zstd: 15652160 bytes +Total submission size quantized+zstd: 15803362 bytes +quantized val_loss:8.89149118 val_bpb:3.44221347 eval_time:2513ms +quantized_sliding_window val_loss:8.89351441 val_bpb:3.44299673 eval_time:92313ms +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=35988657 frozen=0 + ttt_chunk [1/1238] bpb=3.365611 time=4.5s + ttt_chunk [11/1238] bpb=3.409257 time=6.8s + ttt_chunk [21/1238] bpb=3.386949 time=9.1s + ttt_chunk [31/1238] bpb=3.282760 time=11.5s + ttt_chunk [41/1238] bpb=3.204331 time=13.8s + ttt_chunk [51/1238] bpb=3.157261 time=16.1s + ttt_chunk [61/1238] bpb=3.125232 time=18.4s + ttt_chunk [71/1238] bpb=3.109921 time=20.8s + ttt_chunk [81/1238] bpb=3.084452 time=23.3s + ttt_chunk [91/1238] bpb=3.079288 time=25.7s + ttt_chunk [101/1238] bpb=3.058914 time=28.3s + ttt_chunk [111/1238] bpb=3.047715 time=30.6s + ttt_chunk [121/1238] bpb=3.034444 time=32.9s + ttt_chunk [131/1238] bpb=3.027298 time=35.5s + ttt_chunk [141/1238] bpb=3.021372 time=37.8s + ttt_chunk [151/1238] bpb=3.016944 time=40.1s + ttt_chunk [161/1238] bpb=3.009452 time=42.4s + ttt_chunk [171/1238] bpb=3.003949 time=44.7s + ttt_chunk [181/1238] bpb=2.993958 time=47.0s + ttt_chunk [191/1238] bpb=2.983369 time=49.4s + ttt_chunk [201/1238] bpb=2.978027 time=51.7s + ttt_chunk [211/1238] bpb=2.976425 time=54.0s + ttt_chunk 
[221/1238] bpb=2.968146 time=56.3s + ttt_chunk [231/1238] bpb=2.965927 time=58.6s + ttt_chunk [241/1238] bpb=2.965441 time=61.0s + ttt_chunk [251/1238] bpb=2.961043 time=63.3s + ttt_chunk [261/1238] bpb=2.952580 time=65.6s + ttt_chunk [271/1238] bpb=2.949106 time=67.9s + ttt_chunk [281/1238] bpb=2.943184 time=70.2s + ttt_chunk [291/1238] bpb=2.940331 time=72.6s + ttt_chunk [301/1238] bpb=2.933771 time=74.9s + ttt_chunk [311/1238] bpb=2.926076 time=77.2s + ttt_chunk [321/1238] bpb=2.922708 time=79.4s + ttt_chunk [331/1238] bpb=2.920991 time=81.7s + ttt_chunk [341/1238] bpb=2.917723 time=84.0s + ttt_chunk [351/1238] bpb=2.915395 time=86.3s + ttt_chunk [361/1238] bpb=2.910181 time=88.7s + ttt_chunk [371/1238] bpb=2.904643 time=91.0s + ttt_chunk [381/1238] bpb=2.902097 time=93.4s + ttt_chunk [391/1238] bpb=2.899067 time=95.7s + ttt_chunk [401/1238] bpb=2.895151 time=98.0s + ttt_chunk [411/1238] bpb=2.892183 time=100.4s + ttt_chunk [421/1238] bpb=2.888580 time=102.7s + ttt_chunk [431/1238] bpb=2.884977 time=105.0s + ttt_chunk [441/1238] bpb=2.882353 time=107.3s + ttt_chunk [451/1238] bpb=2.884060 time=109.7s + ttt_chunk [461/1238] bpb=2.878129 time=112.0s + ttt_chunk [471/1238] bpb=2.875480 time=114.5s + ttt_chunk [481/1238] bpb=2.871336 time=116.8s + ttt_chunk [491/1238] bpb=2.869536 time=119.1s + ttt_chunk [501/1238] bpb=2.866313 time=121.4s + ttt_chunk [511/1238] bpb=2.864436 time=123.8s + ttt_chunk [521/1238] bpb=2.866890 time=126.3s + ttt_chunk [531/1238] bpb=2.871555 time=128.6s + ttt_chunk [541/1238] bpb=2.871767 time=130.9s + ttt_chunk [551/1238] bpb=2.873119 time=133.3s + ttt_chunk [561/1238] bpb=2.873514 time=135.8s + ttt_chunk [571/1238] bpb=2.872533 time=138.2s + ttt_chunk [581/1238] bpb=2.874026 time=140.5s + ttt_chunk [591/1238] bpb=2.874825 time=142.9s + ttt_chunk [601/1238] bpb=2.873115 time=145.2s + ttt_chunk [611/1238] bpb=2.872130 time=147.6s + ttt_chunk [621/1238] bpb=2.869743 time=149.9s + ttt_chunk [631/1238] bpb=2.866842 time=152.2s + ttt_chunk 
[641/1238] bpb=2.865392 time=154.5s + ttt_chunk [651/1238] bpb=2.863419 time=157.1s + ttt_chunk [661/1238] bpb=2.860288 time=159.5s + ttt_chunk [671/1238] bpb=2.856041 time=161.8s + ttt_chunk [681/1238] bpb=2.853567 time=164.2s + ttt_chunk [691/1238] bpb=2.852565 time=166.5s + ttt_chunk [701/1238] bpb=2.858828 time=168.8s + ttt_chunk [711/1238] bpb=2.864353 time=171.2s + ttt_chunk [721/1238] bpb=2.866801 time=173.5s + ttt_chunk [731/1238] bpb=2.864614 time=175.8s + ttt_chunk [741/1238] bpb=2.863885 time=178.1s + ttt_chunk [751/1238] bpb=2.861181 time=180.5s + ttt_chunk [761/1238] bpb=2.857535 time=182.8s + ttt_chunk [771/1238] bpb=2.854620 time=185.2s + ttt_chunk [781/1238] bpb=2.852311 time=187.5s + ttt_chunk [791/1238] bpb=2.853732 time=189.8s + ttt_chunk [801/1238] bpb=2.853398 time=192.1s + ttt_chunk [811/1238] bpb=2.850774 time=194.5s + ttt_chunk [821/1238] bpb=2.849350 time=196.8s + ttt_chunk [831/1238] bpb=2.848808 time=199.1s + ttt_chunk [841/1238] bpb=2.848067 time=201.5s + ttt_chunk [851/1238] bpb=2.845528 time=204.1s + ttt_chunk [861/1238] bpb=2.843590 time=206.4s + ttt_chunk [871/1238] bpb=2.841139 time=208.8s + ttt_chunk [881/1238] bpb=2.838969 time=211.1s + ttt_chunk [891/1238] bpb=2.837492 time=213.5s + ttt_chunk [901/1238] bpb=2.838535 time=216.0s + ttt_chunk [911/1238] bpb=2.837024 time=218.4s + ttt_chunk [921/1238] bpb=2.836696 time=220.7s + ttt_chunk [931/1238] bpb=2.835846 time=223.3s + ttt_chunk [941/1238] bpb=2.834901 time=225.9s + ttt_chunk [951/1238] bpb=2.834780 time=228.2s + ttt_chunk [961/1238] bpb=2.834047 time=230.6s + ttt_chunk [971/1238] bpb=2.834818 time=232.9s + ttt_chunk [981/1238] bpb=2.833917 time=235.2s + ttt_chunk [991/1238] bpb=2.832526 time=237.5s + ttt_chunk [1001/1238] bpb=2.832656 time=239.8s + ttt_chunk [1011/1238] bpb=2.831882 time=242.2s + ttt_chunk [1021/1238] bpb=2.830955 time=244.5s + ttt_chunk [1031/1238] bpb=2.830003 time=246.8s + ttt_chunk [1041/1238] bpb=2.828643 time=249.1s + ttt_chunk [1051/1238] bpb=2.826725 
time=251.5s + ttt_chunk [1061/1238] bpb=2.825306 time=253.8s + ttt_chunk [1071/1238] bpb=2.823694 time=256.1s + ttt_chunk [1081/1238] bpb=2.821484 time=258.4s + ttt_chunk [1091/1238] bpb=2.819155 time=260.7s + ttt_chunk [1101/1238] bpb=2.817776 time=263.1s + ttt_chunk [1111/1238] bpb=2.816481 time=265.5s + ttt_chunk [1121/1238] bpb=2.815231 time=267.8s + ttt_chunk [1131/1238] bpb=2.813170 time=270.1s + ttt_chunk [1141/1238] bpb=2.811292 time=272.5s + ttt_chunk [1151/1238] bpb=2.809763 time=274.8s + ttt_chunk [1161/1238] bpb=2.808313 time=277.2s + ttt_chunk [1171/1238] bpb=2.806366 time=279.5s + ttt_chunk [1181/1238] bpb=2.804649 time=281.8s + ttt_chunk [1191/1238] bpb=2.802868 time=284.2s + ttt_chunk [1201/1238] bpb=2.802078 time=286.5s + ttt_chunk [1211/1238] bpb=2.801401 time=288.8s + ttt_chunk [1221/1238] bpb=2.799275 time=291.2s + ttt_chunk [1231/1238] bpb=2.798276 time=293.5s + ttt_chunk [1238/1238] bpb=2.797722 time=295.0s +ttt_sliding:done val_loss=7.223391 val_bpb=2.796432 elapsed=296.9s +quantized_ttt val_loss:7.22339057 val_bpb:2.79643221 eval_time:297082ms +[W424 07:15:17.102789397 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:15:17.294777792 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:15:17.568974675 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:15:17.584517565 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:15:17.589753570 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:15:18.706377801 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 
07:15:18.710555365 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:15:18.753743915 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:15:19.966498275 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) + +[run] DONE 07:15:19Z +[run] === val_bpb lines === +0/20000 val_loss: 9.0094 val_bpb: 3.4879 +4000/20000 val_loss: 3.0172 val_bpb: 1.1681 +5370/20000 val_loss: 2.8496 val_bpb: 1.1032 +pre-quantization post-ema val_loss:2.84707472 val_bpb:1.10220420 eval_time:6510ms +quantized val_loss:8.89149118 val_bpb:3.44221347 eval_time:2513ms +quantized_sliding_window val_loss:8.89351441 val_bpb:3.44299673 eval_time:92313ms + ttt_chunk [1/1238] bpb=3.365611 time=4.5s + ttt_chunk [11/1238] bpb=3.409257 time=6.8s + ttt_chunk [21/1238] bpb=3.386949 time=9.1s + ttt_chunk [31/1238] bpb=3.282760 time=11.5s + ttt_chunk [41/1238] bpb=3.204331 time=13.8s + ttt_chunk [51/1238] bpb=3.157261 time=16.1s + ttt_chunk [61/1238] bpb=3.125232 time=18.4s + ttt_chunk [71/1238] bpb=3.109921 time=20.8s + ttt_chunk [81/1238] bpb=3.084452 time=23.3s + ttt_chunk [91/1238] bpb=3.079288 time=25.7s + ttt_chunk [101/1238] bpb=3.058914 time=28.3s + ttt_chunk [111/1238] bpb=3.047715 time=30.6s + ttt_chunk [121/1238] bpb=3.034444 time=32.9s + ttt_chunk [131/1238] bpb=3.027298 time=35.5s + ttt_chunk [141/1238] bpb=3.021372 time=37.8s + ttt_chunk [151/1238] bpb=3.016944 time=40.1s + ttt_chunk [161/1238] bpb=3.009452 time=42.4s + ttt_chunk [171/1238] bpb=3.003949 time=44.7s + ttt_chunk [181/1238] bpb=2.993958 time=47.0s + ttt_chunk [191/1238] bpb=2.983369 time=49.4s + ttt_chunk [201/1238] bpb=2.978027 time=51.7s + ttt_chunk [211/1238] bpb=2.976425 time=54.0s + ttt_chunk [221/1238] bpb=2.968146 time=56.3s + ttt_chunk [231/1238] bpb=2.965927 time=58.6s + ttt_chunk 
[241/1238] bpb=2.965441 time=61.0s + ttt_chunk [251/1238] bpb=2.961043 time=63.3s + ttt_chunk [261/1238] bpb=2.952580 time=65.6s + ttt_chunk [271/1238] bpb=2.949106 time=67.9s + ttt_chunk [281/1238] bpb=2.943184 time=70.2s + ttt_chunk [291/1238] bpb=2.940331 time=72.6s + ttt_chunk [301/1238] bpb=2.933771 time=74.9s + ttt_chunk [311/1238] bpb=2.926076 time=77.2s + ttt_chunk [321/1238] bpb=2.922708 time=79.4s + ttt_chunk [331/1238] bpb=2.920991 time=81.7s + ttt_chunk [341/1238] bpb=2.917723 time=84.0s + ttt_chunk [351/1238] bpb=2.915395 time=86.3s + ttt_chunk [361/1238] bpb=2.910181 time=88.7s + ttt_chunk [371/1238] bpb=2.904643 time=91.0s + ttt_chunk [381/1238] bpb=2.902097 time=93.4s + ttt_chunk [391/1238] bpb=2.899067 time=95.7s + ttt_chunk [401/1238] bpb=2.895151 time=98.0s + ttt_chunk [411/1238] bpb=2.892183 time=100.4s + ttt_chunk [421/1238] bpb=2.888580 time=102.7s + ttt_chunk [431/1238] bpb=2.884977 time=105.0s + ttt_chunk [441/1238] bpb=2.882353 time=107.3s + ttt_chunk [451/1238] bpb=2.884060 time=109.7s + ttt_chunk [461/1238] bpb=2.878129 time=112.0s + ttt_chunk [471/1238] bpb=2.875480 time=114.5s + ttt_chunk [481/1238] bpb=2.871336 time=116.8s + ttt_chunk [491/1238] bpb=2.869536 time=119.1s + ttt_chunk [501/1238] bpb=2.866313 time=121.4s + ttt_chunk [511/1238] bpb=2.864436 time=123.8s + ttt_chunk [521/1238] bpb=2.866890 time=126.3s + ttt_chunk [531/1238] bpb=2.871555 time=128.6s + ttt_chunk [541/1238] bpb=2.871767 time=130.9s + ttt_chunk [551/1238] bpb=2.873119 time=133.3s + ttt_chunk [561/1238] bpb=2.873514 time=135.8s + ttt_chunk [571/1238] bpb=2.872533 time=138.2s + ttt_chunk [581/1238] bpb=2.874026 time=140.5s + ttt_chunk [591/1238] bpb=2.874825 time=142.9s + ttt_chunk [601/1238] bpb=2.873115 time=145.2s + ttt_chunk [611/1238] bpb=2.872130 time=147.6s + ttt_chunk [621/1238] bpb=2.869743 time=149.9s + ttt_chunk [631/1238] bpb=2.866842 time=152.2s + ttt_chunk [641/1238] bpb=2.865392 time=154.5s + ttt_chunk [651/1238] bpb=2.863419 time=157.1s + ttt_chunk 
[661/1238] bpb=2.860288 time=159.5s + ttt_chunk [671/1238] bpb=2.856041 time=161.8s + ttt_chunk [681/1238] bpb=2.853567 time=164.2s + ttt_chunk [691/1238] bpb=2.852565 time=166.5s + ttt_chunk [701/1238] bpb=2.858828 time=168.8s + ttt_chunk [711/1238] bpb=2.864353 time=171.2s + ttt_chunk [721/1238] bpb=2.866801 time=173.5s + ttt_chunk [731/1238] bpb=2.864614 time=175.8s + ttt_chunk [741/1238] bpb=2.863885 time=178.1s + ttt_chunk [751/1238] bpb=2.861181 time=180.5s + ttt_chunk [761/1238] bpb=2.857535 time=182.8s + ttt_chunk [771/1238] bpb=2.854620 time=185.2s + ttt_chunk [781/1238] bpb=2.852311 time=187.5s + ttt_chunk [791/1238] bpb=2.853732 time=189.8s + ttt_chunk [801/1238] bpb=2.853398 time=192.1s + ttt_chunk [811/1238] bpb=2.850774 time=194.5s + ttt_chunk [821/1238] bpb=2.849350 time=196.8s + ttt_chunk [831/1238] bpb=2.848808 time=199.1s + ttt_chunk [841/1238] bpb=2.848067 time=201.5s + ttt_chunk [851/1238] bpb=2.845528 time=204.1s + ttt_chunk [861/1238] bpb=2.843590 time=206.4s + ttt_chunk [871/1238] bpb=2.841139 time=208.8s + ttt_chunk [881/1238] bpb=2.838969 time=211.1s + ttt_chunk [891/1238] bpb=2.837492 time=213.5s + ttt_chunk [901/1238] bpb=2.838535 time=216.0s + ttt_chunk [911/1238] bpb=2.837024 time=218.4s + ttt_chunk [921/1238] bpb=2.836696 time=220.7s + ttt_chunk [931/1238] bpb=2.835846 time=223.3s + ttt_chunk [941/1238] bpb=2.834901 time=225.9s + ttt_chunk [951/1238] bpb=2.834780 time=228.2s + ttt_chunk [961/1238] bpb=2.834047 time=230.6s + ttt_chunk [971/1238] bpb=2.834818 time=232.9s + ttt_chunk [981/1238] bpb=2.833917 time=235.2s + ttt_chunk [991/1238] bpb=2.832526 time=237.5s + ttt_chunk [1001/1238] bpb=2.832656 time=239.8s + ttt_chunk [1011/1238] bpb=2.831882 time=242.2s + ttt_chunk [1021/1238] bpb=2.830955 time=244.5s + ttt_chunk [1031/1238] bpb=2.830003 time=246.8s + ttt_chunk [1041/1238] bpb=2.828643 time=249.1s + ttt_chunk [1051/1238] bpb=2.826725 time=251.5s + ttt_chunk [1061/1238] bpb=2.825306 time=253.8s + ttt_chunk [1071/1238] bpb=2.823694 
time=256.1s + ttt_chunk [1081/1238] bpb=2.821484 time=258.4s + ttt_chunk [1091/1238] bpb=2.819155 time=260.7s + ttt_chunk [1101/1238] bpb=2.817776 time=263.1s + ttt_chunk [1111/1238] bpb=2.816481 time=265.5s + ttt_chunk [1121/1238] bpb=2.815231 time=267.8s + ttt_chunk [1131/1238] bpb=2.813170 time=270.1s + ttt_chunk [1141/1238] bpb=2.811292 time=272.5s + ttt_chunk [1151/1238] bpb=2.809763 time=274.8s + ttt_chunk [1161/1238] bpb=2.808313 time=277.2s + ttt_chunk [1171/1238] bpb=2.806366 time=279.5s + ttt_chunk [1181/1238] bpb=2.804649 time=281.8s + ttt_chunk [1191/1238] bpb=2.802868 time=284.2s + ttt_chunk [1201/1238] bpb=2.802078 time=286.5s + ttt_chunk [1211/1238] bpb=2.801401 time=288.8s + ttt_chunk [1221/1238] bpb=2.799275 time=291.2s + ttt_chunk [1231/1238] bpb=2.798276 time=293.5s + ttt_chunk [1238/1238] bpb=2.797722 time=295.0s +ttt_sliding:done val_loss=7.223391 val_bpb=2.796432 elapsed=296.9s +quantized_ttt val_loss:7.22339057 val_bpb:2.79643221 eval_time:297082ms + +[run] === artifact === + final_model.int6.ptz: 15652160 bytes diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed2024.log b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed2024.log new file mode 100644 index 0000000000..d3ef0557a7 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed2024.log @@ -0,0 +1,1431 @@ +[run] 128 train shards, 1 val shard(s), tokenizer ok +[run] config: + SEED=2024 + MAX_WALLCLOCK_SECONDS=600 + TTT_ENABLED=1 + DATA_DIR=/root/c22_submission/final/data +[run] launcher: torchrun × 8 +[run] launching c22_train.py at 07:15:19Z +[run] log: logs/run_seed2024_20260424T071519Z.log +W0424 07:15:20.974000 3501937 torch/distributed/run.py:803] +W0424 07:15:20.974000 3501937 torch/distributed/run.py:803] ***************************************** +W0424 07:15:20.974000 3501937 torch/distributed/run.py:803] 
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0424 07:15:20.974000 3501937 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.095 + beta1: 0.9 + beta2: 0.95 + compressor: zstd + data_dir: /root/c22_submission/final/data + datasets_dir: /root/c22_submission/final/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 5 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/e2156326-92a6-4afd-831f-48be07e5b128.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 4 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 3 + muon_beta2: 0.95 + muon_momentum: 0.98 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.12 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + prequant_ttt_batch_seqs: 16 + prequant_ttt_cosine_decay: True + prequant_ttt_enabled: False + prequant_ttt_epochs: 8 + prequant_ttt_freeze_blocks: 1 + prequant_ttt_grad_clip: 1.0 + prequant_ttt_lr: 0.00045 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: e2156326-92a6-4afd-831f-48be07e5b128 + scalar_lr: 0.02 + seed: 2024 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + 
tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /root/c22_submission/final/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 524288 + train_files: /root/c22_submission/final/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 10 + train_seq_len: 2048 + ttt_batch_seqs: 16 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 262144 + val_files: /root/c22_submission/final/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 128 +val_tokens: 40540160 +model_params:35988657 +[curriculum] rank=4/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=7/8 buckets=10 total_seqs=736249 floor=0.02 +[curriculum] rank=0/8 buckets=10 total_seqs=781248 floor=0.02 +gptq:reserving 12s, effective=588000ms +[IDEA-051 freeze_dry] enabled — linear-combo pruning active +[curriculum] rank=3/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=6/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=5/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=2/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=1/8 buckets=10 total_seqs=781248 floor=0.02 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +[curriculum] rank=4/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=7/8 buckets=10 total_seqs=736249 floor=0.02 +[curriculum] rank=5/8 buckets=10 total_seqs=781248 floor=0.02 
+[curriculum] rank=2/8 buckets=10 total_seqs=781248 floor=0.02
+[curriculum] rank=1/8 buckets=10 total_seqs=781248 floor=0.02
+[curriculum] rank=3/8 buckets=10 total_seqs=781248 floor=0.02
+[curriculum] rank=6/8 buckets=10 total_seqs=781248 floor=0.02
+[curriculum] rank=0/8 buckets=10 total_seqs=781248 floor=0.02
+0/20000 val_loss: 9.0088 val_bpb: 3.4876
+1/20000 train_loss: 9.0003 train_time: 0.0m tok/s: 179005
+2/20000 train_loss: 12.3492 train_time: 0.1m tok/s: 309299
+3/20000 train_loss: 11.2962 train_time: 0.1m tok/s: 429372
+4/20000 train_loss: 9.6188 train_time: 0.1m tok/s: 559036
+5/20000 train_loss: 8.5578 train_time: 0.1m tok/s: 682597
+10/20000 train_loss: 6.8068 train_time: 0.1m tok/s: 1219560
+20/20000 train_loss: 5.8403 train_time: 0.1m tok/s: 1987877
+30/20000 train_loss: 5.4990 train_time: 0.1m tok/s: 2513679
+40/20000 train_loss: 5.2299 train_time: 0.1m tok/s: 2897218
+50/20000 train_loss: 5.1759 train_time: 0.1m tok/s: 3189538
+60/20000 train_loss: 5.0532 train_time: 0.2m tok/s: 3419461
+70/20000 train_loss: 4.8524 train_time: 0.2m tok/s: 3603556
+80/20000 train_loss: 4.8046 train_time: 0.2m tok/s: 3755959
+90/20000 train_loss: 4.6890 train_time: 0.2m tok/s: 3884679
+100/20000 train_loss: 4.4670 train_time: 0.2m tok/s: 3992997
+110/20000 train_loss: 4.4494 train_time: 0.2m tok/s: 4087738
+120/20000 train_loss: 4.3053 train_time: 0.3m tok/s: 4167720
+130/20000 train_loss: 4.1652 train_time: 0.3m tok/s: 4240348
+140/20000 train_loss: 4.0802 train_time: 0.3m tok/s: 4302680
+150/20000 train_loss: 4.0015 train_time: 0.3m tok/s: 4359465
+160/20000 train_loss: 3.8765 train_time: 0.3m tok/s: 4410988
+170/20000 train_loss: 3.8325 train_time: 0.3m tok/s: 4457818
+180/20000 train_loss: 3.8128 train_time: 0.3m tok/s: 4500204
+190/20000 train_loss: 3.7272 train_time: 0.4m tok/s: 4538740
+200/20000 train_loss: 3.7637 train_time: 0.4m tok/s: 4574473
+210/20000 train_loss: 3.5852 train_time: 0.4m tok/s: 4606449
+220/20000 train_loss: 3.6564 train_time: 0.4m tok/s: 4635303
+230/20000 train_loss: 3.5442 train_time: 0.4m tok/s: 4662846
+240/20000 train_loss: 3.6444 train_time: 0.4m tok/s: 4688553
+250/20000 train_loss: 3.5884 train_time: 0.5m tok/s: 4712686
+260/20000 train_loss: 3.5843 train_time: 0.5m tok/s: 4734922
+270/20000 train_loss: 3.5178 train_time: 0.5m tok/s: 4754533
+280/20000 train_loss: 3.3968 train_time: 0.5m tok/s: 4773581
+290/20000 train_loss: 3.5387 train_time: 0.5m tok/s: 4791821
+300/20000 train_loss: 3.5240 train_time: 0.5m tok/s: 4808669
+310/20000 train_loss: 3.4982 train_time: 0.6m tok/s: 4824792
+320/20000 train_loss: 3.4269 train_time: 0.6m tok/s: 4839470
+330/20000 train_loss: 3.3841 train_time: 0.6m tok/s: 4853502
+340/20000 train_loss: 3.4009 train_time: 0.6m tok/s: 4866908
+350/20000 train_loss: 3.4648 train_time: 0.6m tok/s: 4878953
+360/20000 train_loss: 3.4400 train_time: 0.6m tok/s: 4891191
+370/20000 train_loss: 3.4035 train_time: 0.7m tok/s: 4902552
+380/20000 train_loss: 3.3657 train_time: 0.7m tok/s: 4913898
+390/20000 train_loss: 3.3979 train_time: 0.7m tok/s: 4923976
+400/20000 train_loss: 3.4366 train_time: 0.7m tok/s: 4933807
+410/20000 train_loss: 3.3907 train_time: 0.7m tok/s: 4943235
+420/20000 train_loss: 3.3005 train_time: 0.7m tok/s: 4951745
+430/20000 train_loss: 3.4117 train_time: 0.8m tok/s: 4960618
+440/20000 train_loss: 3.4008 train_time: 0.8m tok/s: 4968674
+450/20000 train_loss: 3.3284 train_time: 0.8m tok/s: 4976583
+460/20000 train_loss: 3.3913 train_time: 0.8m tok/s: 4984374
+470/20000 train_loss: 3.3444 train_time: 0.8m tok/s: 4991679
+480/20000 train_loss: 3.3354 train_time: 0.8m tok/s: 4998712
+490/20000 train_loss: 3.2410 train_time: 0.9m tok/s: 5005123
+500/20000 train_loss: 3.2618 train_time: 0.9m tok/s: 5011661
+510/20000 train_loss: 3.4598 train_time: 0.9m tok/s: 5018136
+520/20000 train_loss: 3.3676 train_time: 0.9m tok/s: 5023819
+530/20000 train_loss: 3.3943 train_time: 0.9m tok/s: 5028939
+540/20000 train_loss: 3.2905 train_time: 0.9m tok/s: 5034165
+550/20000 train_loss: 3.3014 train_time: 1.0m tok/s: 5039554
+560/20000 train_loss: 3.3260 train_time: 1.0m tok/s: 5044658
+570/20000 train_loss: 3.3011 train_time: 1.0m tok/s: 5049950
+580/20000 train_loss: 3.2999 train_time: 1.0m tok/s: 5054975
+590/20000 train_loss: 3.3530 train_time: 1.0m tok/s: 5059853
+600/20000 train_loss: 3.3345 train_time: 1.0m tok/s: 5064678
+610/20000 train_loss: 3.3840 train_time: 1.1m tok/s: 5069053
+620/20000 train_loss: 3.2941 train_time: 1.1m tok/s: 5073314
+630/20000 train_loss: 3.3236 train_time: 1.1m tok/s: 5077217
+640/20000 train_loss: 3.3136 train_time: 1.1m tok/s: 5081217
+650/20000 train_loss: 3.2965 train_time: 1.1m tok/s: 5084996
+660/20000 train_loss: 3.2067 train_time: 1.1m tok/s: 5088778
+670/20000 train_loss: 3.1880 train_time: 1.1m tok/s: 5092386
+680/20000 train_loss: 3.3295 train_time: 1.2m tok/s: 5096276
+690/20000 train_loss: 3.2924 train_time: 1.2m tok/s: 5099898
+700/20000 train_loss: 3.2250 train_time: 1.2m tok/s: 5103196
+710/20000 train_loss: 3.2357 train_time: 1.2m tok/s: 5106454
+720/20000 train_loss: 3.2661 train_time: 1.2m tok/s: 5109690
+730/20000 train_loss: 3.1635 train_time: 1.2m tok/s: 5112610
+740/20000 train_loss: 3.3324 train_time: 1.3m tok/s: 5115776
+750/20000 train_loss: 3.2649 train_time: 1.3m tok/s: 5118803
+760/20000 train_loss: 3.1790 train_time: 1.3m tok/s: 5121686
+770/20000 train_loss: 3.2511 train_time: 1.3m tok/s: 5124512
+780/20000 train_loss: 3.2990 train_time: 1.3m tok/s: 5127240
+790/20000 train_loss: 3.3104 train_time: 1.3m tok/s: 5130031
+800/20000 train_loss: 3.2679 train_time: 1.4m tok/s: 5132623
+810/20000 train_loss: 3.3507 train_time: 1.4m tok/s: 5135002
+820/20000 train_loss: 3.2524 train_time: 1.4m tok/s: 5137485
+830/20000 train_loss: 3.2762 train_time: 1.4m tok/s: 5139423
+840/20000 train_loss: 3.3196 train_time: 1.4m tok/s: 5141826
+850/20000 train_loss: 3.3000 train_time: 1.4m tok/s: 5144242
+860/20000 train_loss: 3.1857 train_time: 1.5m tok/s: 5146585
+870/20000 train_loss: 3.2781 train_time: 1.5m tok/s: 5148971
+880/20000 train_loss: 3.2076 train_time: 1.5m tok/s: 5151187
+890/20000 train_loss: 3.3216 train_time: 1.5m tok/s: 5153329
+900/20000 train_loss: 3.3858 train_time: 1.5m tok/s: 5155453
+910/20000 train_loss: 3.2010 train_time: 1.5m tok/s: 5157250
+920/20000 train_loss: 3.1570 train_time: 1.6m tok/s: 5159176
+930/20000 train_loss: 3.2283 train_time: 1.6m tok/s: 5160978
+940/20000 train_loss: 3.3009 train_time: 1.6m tok/s: 5162844
+950/20000 train_loss: 3.2503 train_time: 1.6m tok/s: 5164813
+960/20000 train_loss: 3.2321 train_time: 1.6m tok/s: 5166700
+970/20000 train_loss: 3.2610 train_time: 1.6m tok/s: 5168678
+980/20000 train_loss: 3.2442 train_time: 1.7m tok/s: 5170672
+990/20000 train_loss: 3.3050 train_time: 1.7m tok/s: 5172634
+1000/20000 train_loss: 3.2891 train_time: 1.7m tok/s: 5174445
+1010/20000 train_loss: 3.3279 train_time: 1.7m tok/s: 5176274
+1020/20000 train_loss: 3.2115 train_time: 1.7m tok/s: 5177977
+1030/20000 train_loss: 3.1750 train_time: 1.7m tok/s: 5179858
+1040/20000 train_loss: 3.3007 train_time: 1.8m tok/s: 5181328
+1050/20000 train_loss: 3.1879 train_time: 1.8m tok/s: 5182843
+1060/20000 train_loss: 3.2223 train_time: 1.8m tok/s: 5184551
+1070/20000 train_loss: 3.1965 train_time: 1.8m tok/s: 5186165
+1080/20000 train_loss: 3.1973 train_time: 1.8m tok/s: 5187901
+1090/20000 train_loss: 3.1746 train_time: 1.8m tok/s: 5189407
+1100/20000 train_loss: 3.2243 train_time: 1.9m tok/s: 5190881
+1110/20000 train_loss: 3.2501 train_time: 1.9m tok/s: 5192457
+1120/20000 train_loss: 3.1339 train_time: 1.9m tok/s: 5193998
+1130/20000 train_loss: 3.2061 train_time: 1.9m tok/s: 5195487
+1140/20000 train_loss: 3.2119 train_time: 1.9m tok/s: 5197061
+1150/20000 train_loss: 3.2503 train_time: 1.9m tok/s: 5198402
+1160/20000 train_loss: 3.1418 train_time: 1.9m tok/s: 5199911
+1170/20000 train_loss: 3.2149 train_time: 2.0m tok/s: 5201404
+1180/20000 train_loss: 3.2682 train_time: 2.0m tok/s: 5202660
+1190/20000 train_loss: 3.2151 train_time: 2.0m tok/s: 5203900
+1200/20000 train_loss: 3.1639 train_time: 2.0m tok/s: 5205236
+1210/20000 train_loss: 3.1475 train_time: 2.0m tok/s: 5206640
+1220/20000 train_loss: 3.1907 train_time: 2.0m tok/s: 5207878
+1230/20000 train_loss: 3.1719 train_time: 2.1m tok/s: 5209006
+1240/20000 train_loss: 3.2569 train_time: 2.1m tok/s: 5209960
+1250/20000 train_loss: 3.2835 train_time: 2.1m tok/s: 5211202
+1260/20000 train_loss: 3.2251 train_time: 2.1m tok/s: 5212491
+1270/20000 train_loss: 3.2233 train_time: 2.1m tok/s: 5213219
+1280/20000 train_loss: 3.1878 train_time: 2.1m tok/s: 5214627
+1290/20000 train_loss: 3.2599 train_time: 2.2m tok/s: 5215752
+1300/20000 train_loss: 3.1480 train_time: 2.2m tok/s: 5216996
+1310/20000 train_loss: 3.2008 train_time: 2.2m tok/s: 5218132
+1320/20000 train_loss: 3.2629 train_time: 2.2m tok/s: 5219047
+1330/20000 train_loss: 3.2111 train_time: 2.2m tok/s: 5220109
+1340/20000 train_loss: 3.2161 train_time: 2.2m tok/s: 5221080
+1350/20000 train_loss: 3.1793 train_time: 2.3m tok/s: 5222177
+1360/20000 train_loss: 3.2940 train_time: 2.3m tok/s: 5222961
+1370/20000 train_loss: 3.2959 train_time: 2.3m tok/s: 5224087
+1380/20000 train_loss: 3.1618 train_time: 2.3m tok/s: 5225105
+1390/20000 train_loss: 3.1612 train_time: 2.3m tok/s: 5226224
+1400/20000 train_loss: 3.2993 train_time: 2.3m tok/s: 5226997
+1410/20000 train_loss: 3.1916 train_time: 2.4m tok/s: 5227864
+1420/20000 train_loss: 3.2577 train_time: 2.4m tok/s: 5228822
+1430/20000 train_loss: 3.1210 train_time: 2.4m tok/s: 5229817
+1440/20000 train_loss: 3.2823 train_time: 2.4m tok/s: 5230670
+1450/20000 train_loss: 3.2876 train_time: 2.4m tok/s: 5231586
+1460/20000 train_loss: 3.3291 train_time: 2.4m tok/s: 5232498
+1470/20000 train_loss: 3.2192 train_time: 2.5m tok/s: 5233332
+1480/20000 train_loss: 3.2338 train_time: 2.5m tok/s: 5234211
+1490/20000 train_loss: 3.2384 train_time: 2.5m tok/s: 5235075
+1500/20000 train_loss: 3.1861 train_time: 2.5m tok/s: 5235814
+1510/20000 train_loss: 3.2193 train_time: 2.5m tok/s: 5236667
+1520/20000 train_loss: 3.1983 train_time: 2.5m tok/s: 5237524
+1530/20000 train_loss: 3.2004 train_time: 2.6m tok/s: 5238244
+1540/20000 train_loss: 3.2219 train_time: 2.6m tok/s: 5239047
+1550/20000 train_loss: 3.1849 train_time: 2.6m tok/s: 5239757
+1560/20000 train_loss: 3.2218 train_time: 2.6m tok/s: 5240519
+1570/20000 train_loss: 3.1568 train_time: 2.6m tok/s: 5241346
+1580/20000 train_loss: 3.2734 train_time: 2.6m tok/s: 5242212
+1590/20000 train_loss: 3.2238 train_time: 2.6m tok/s: 5243039
+1600/20000 train_loss: 3.1039 train_time: 2.7m tok/s: 5243571
+1610/20000 train_loss: 3.0813 train_time: 2.7m tok/s: 5244118
+1620/20000 train_loss: 3.2193 train_time: 2.7m tok/s: 5244918
+1630/20000 train_loss: 3.2419 train_time: 2.7m tok/s: 5245562
+1640/20000 train_loss: 3.2102 train_time: 2.7m tok/s: 5246297
+1650/20000 train_loss: 3.1865 train_time: 2.7m tok/s: 5246865
+1660/20000 train_loss: 3.1772 train_time: 2.8m tok/s: 5247367
+1670/20000 train_loss: 2.9799 train_time: 2.8m tok/s: 5248103
+1680/20000 train_loss: 3.1356 train_time: 2.8m tok/s: 5248811
+1690/20000 train_loss: 3.3216 train_time: 2.8m tok/s: 5249461
+1700/20000 train_loss: 3.1809 train_time: 2.8m tok/s: 5249961
+1710/20000 train_loss: 3.2476 train_time: 2.8m tok/s: 5250800
+1720/20000 train_loss: 3.1639 train_time: 2.9m tok/s: 5251339
+1730/20000 train_loss: 3.1709 train_time: 2.9m tok/s: 5251782
+1740/20000 train_loss: 3.0684 train_time: 2.9m tok/s: 5252202
+1750/20000 train_loss: 3.1142 train_time: 2.9m tok/s: 5252853
+1760/20000 train_loss: 3.2508 train_time: 2.9m tok/s: 5253507
+1770/20000 train_loss: 3.2880 train_time: 2.9m tok/s: 5254064
+1780/20000 train_loss: 3.1785 train_time: 3.0m tok/s: 5254739
+1790/20000 train_loss: 3.2320 train_time: 3.0m tok/s: 5255220
+1800/20000 train_loss: 3.1894 train_time: 3.0m tok/s: 5255882
+1810/20000 train_loss: 3.2427 train_time: 3.0m tok/s: 5256452
+1820/20000 train_loss: 3.1398 train_time: 3.0m tok/s: 5257020
+1830/20000 train_loss: 3.1566 train_time: 3.0m tok/s: 5257525
+1840/20000 train_loss: 3.2055 train_time: 3.1m tok/s: 5258015
+1850/20000 train_loss: 3.1936 train_time: 3.1m tok/s: 5258472
+1860/20000 train_loss: 3.1594 train_time: 3.1m tok/s: 5258903
+1870/20000 train_loss: 3.2495 train_time: 3.1m tok/s: 5259531
+1880/20000 train_loss: 3.1295 train_time: 3.1m tok/s: 5260084
+1890/20000 train_loss: 3.1184 train_time: 3.1m tok/s: 5260608
+1900/20000 train_loss: 3.1628 train_time: 3.2m tok/s: 5261317
+1910/20000 train_loss: 3.1186 train_time: 3.2m tok/s: 5261897
+1920/20000 train_loss: 3.0844 train_time: 3.2m tok/s: 5262478
+1930/20000 train_loss: 3.1957 train_time: 3.2m tok/s: 5263079
+1940/20000 train_loss: 3.1070 train_time: 3.2m tok/s: 5263619
+1950/20000 train_loss: 3.1173 train_time: 3.2m tok/s: 5264106
+1960/20000 train_loss: 3.0895 train_time: 3.3m tok/s: 5264508
+1970/20000 train_loss: 3.2105 train_time: 3.3m tok/s: 5264795
+1980/20000 train_loss: 3.1380 train_time: 3.3m tok/s: 5265258
+1990/20000 train_loss: 3.1849 train_time: 3.3m tok/s: 5265836
+2000/20000 train_loss: 3.1127 train_time: 3.3m tok/s: 5266354
+2010/20000 train_loss: 3.1493 train_time: 3.3m tok/s: 5266678
+2020/20000 train_loss: 3.1837 train_time: 3.4m tok/s: 5267034
+2030/20000 train_loss: 3.0927 train_time: 3.4m tok/s: 5267408
+2040/20000 train_loss: 3.2048 train_time: 3.4m tok/s: 5267895
+2050/20000 train_loss: 3.1466 train_time: 3.4m tok/s: 5268341
+2060/20000 train_loss: 3.1039 train_time: 3.4m tok/s: 5268680
+2070/20000 train_loss: 3.0717 train_time: 3.4m tok/s: 5269147
+2080/20000 train_loss: 3.1303 train_time: 3.4m tok/s: 5269647
+2090/20000 train_loss: 3.1374 train_time: 3.5m tok/s: 5270131
+2100/20000 train_loss: 3.1215 train_time: 3.5m tok/s: 5270534
+2110/20000 train_loss: 3.1661 train_time: 3.5m tok/s: 5270813
+2120/20000 train_loss: 3.0893 train_time: 3.5m tok/s: 5271250
+2130/20000 train_loss: 3.0042 train_time: 3.5m tok/s: 5271692
+2140/20000 train_loss: 3.2270 train_time: 3.5m tok/s: 5272142
+2150/20000 train_loss: 3.1392 train_time: 3.6m tok/s: 5272582
+2160/20000 train_loss: 3.1420 train_time: 3.6m tok/s: 5272858
+2170/20000 train_loss: 3.1948 train_time: 3.6m tok/s: 5273090
+2180/20000 train_loss: 3.1181 train_time: 3.6m tok/s: 5273519
+2190/20000 train_loss: 3.1309 train_time: 3.6m tok/s: 5273953
+2200/20000 train_loss: 3.2247 train_time: 3.6m tok/s: 5274278
+2210/20000 train_loss: 3.1295 train_time: 3.7m tok/s: 5274661
+2220/20000 train_loss: 3.1150 train_time: 3.7m tok/s: 5275099
+2230/20000 train_loss: 3.0894 train_time: 3.7m tok/s: 5275569
+2240/20000 train_loss: 3.1740 train_time: 3.7m tok/s: 5275994
+2250/20000 train_loss: 3.0543 train_time: 3.7m tok/s: 5276425
+2260/20000 train_loss: 3.1424 train_time: 3.7m tok/s: 5276647
+2270/20000 train_loss: 3.1905 train_time: 3.8m tok/s: 5276967
+2280/20000 train_loss: 3.1617 train_time: 3.8m tok/s: 5277368
+2290/20000 train_loss: 3.1484 train_time: 3.8m tok/s: 5277792
+2300/20000 train_loss: 3.2099 train_time: 3.8m tok/s: 5278236
+2310/20000 train_loss: 2.9860 train_time: 3.8m tok/s: 5278729
+2320/20000 train_loss: 3.1315 train_time: 3.8m tok/s: 5279000
+2330/20000 train_loss: 3.1794 train_time: 3.9m tok/s: 5279329
+2340/20000 train_loss: 3.0587 train_time: 3.9m tok/s: 5279644
+2350/20000 train_loss: 3.1369 train_time: 3.9m tok/s: 5279951
+2360/20000 train_loss: 3.1868 train_time: 3.9m tok/s: 5280248
+2370/20000 train_loss: 3.1287 train_time: 3.9m tok/s: 5280684
+2380/20000 train_loss: 3.0773 train_time: 3.9m tok/s: 5281013
+2390/20000 train_loss: 3.0551 train_time: 4.0m tok/s: 5281444
+2400/20000 train_loss: 3.1876 train_time: 4.0m tok/s: 5281792
+2410/20000 train_loss: 3.1529 train_time: 4.0m tok/s: 5282013
+2420/20000 train_loss: 3.0697 train_time: 4.0m tok/s: 5282424
+2430/20000 train_loss: 3.1524 train_time: 4.0m tok/s: 5282800
+2440/20000 train_loss: 3.1917 train_time: 4.0m tok/s: 5283206
+2450/20000 train_loss: 3.2525 train_time: 4.1m tok/s: 5283571
+2460/20000 train_loss: 3.1508 train_time: 4.1m tok/s: 5283701
+2470/20000 train_loss: 3.1592 train_time: 4.1m tok/s: 5283956
+2480/20000 train_loss: 3.3244 train_time: 4.1m tok/s: 5284385
+2490/20000 train_loss: 3.2249 train_time: 4.1m tok/s: 5284613
+2500/20000 train_loss: 3.1043 train_time: 4.1m tok/s: 5284073
+2510/20000 train_loss: 3.2375 train_time: 4.2m tok/s: 5283774
+2520/20000 train_loss: 3.2404 train_time: 4.2m tok/s: 5284135
+2530/20000 train_loss: 3.1918 train_time: 4.2m tok/s: 5284497
+2540/20000 train_loss: 3.0219 train_time: 4.2m tok/s: 5284763
+2550/20000 train_loss: 3.1461 train_time: 4.2m tok/s: 5285034
+2560/20000 train_loss: 3.0215 train_time: 4.2m tok/s: 5285355
+2570/20000 train_loss: 3.1380 train_time: 4.2m tok/s: 5285693
+2580/20000 train_loss: 3.2189 train_time: 4.3m tok/s: 5285946
+2590/20000 train_loss: 3.1274 train_time: 4.3m tok/s: 5286201
+2600/20000 train_loss: 3.1186 train_time: 4.3m tok/s: 5286549
+2610/20000 train_loss: 3.1328 train_time: 4.3m tok/s: 5286845
+2620/20000 train_loss: 3.1408 train_time: 4.3m tok/s: 5287214
+2630/20000 train_loss: 3.0626 train_time: 4.3m tok/s: 5287560
+2640/20000 train_loss: 3.1991 train_time: 4.4m tok/s: 5287870
+2650/20000 train_loss: 3.0392 train_time: 4.4m tok/s: 5288154
+2660/20000 train_loss: 3.1626 train_time: 4.4m tok/s: 5288479
+2670/20000 train_loss: 3.1311 train_time: 4.4m tok/s: 5288680
+2680/20000 train_loss: 3.0859 train_time: 4.4m tok/s: 5288974
+2690/20000 train_loss: 3.0847 train_time: 4.4m tok/s: 5289299
+2700/20000 train_loss: 3.1041 train_time: 4.5m tok/s: 5289625
+2710/20000 train_loss: 3.2275 train_time: 4.5m tok/s: 5289800
+2720/20000 train_loss: 3.0883 train_time: 4.5m tok/s: 5290018
+2730/20000 train_loss: 3.1541 train_time: 4.5m tok/s: 5290193
+2740/20000 train_loss: 3.0881 train_time: 4.5m tok/s: 5290409
+2750/20000 train_loss: 3.0717 train_time: 4.5m tok/s: 5290622
+2760/20000 train_loss: 3.1832 train_time: 4.6m tok/s: 5290868
+2770/20000 train_loss: 3.1303 train_time: 4.6m tok/s: 5291049
+2780/20000 train_loss: 3.1281 train_time: 4.6m tok/s: 5291247
+2790/20000 train_loss: 3.2004 train_time: 4.6m tok/s: 5291603
+2800/20000 train_loss: 3.1976 train_time: 4.6m tok/s: 5291793
+2810/20000 train_loss: 3.1852 train_time: 4.6m tok/s: 5292116
+2820/20000 train_loss: 3.0885 train_time: 4.7m tok/s: 5292293
+2830/20000 train_loss: 3.1651 train_time: 4.7m tok/s: 5292626
+2840/20000 train_loss: 3.1770 train_time: 4.7m tok/s: 5292876
+2850/20000 train_loss: 3.1956 train_time: 4.7m tok/s: 5293182
+2860/20000 train_loss: 3.0931 train_time: 4.7m tok/s: 5293499
+2870/20000 train_loss: 3.1296 train_time: 4.7m tok/s: 5293822
+2880/20000 train_loss: 3.1059 train_time: 4.8m tok/s: 5294126
+2890/20000 train_loss: 3.1758 train_time: 4.8m tok/s: 5294427
+2900/20000 train_loss: 3.1224 train_time: 4.8m tok/s: 5294667
+2910/20000 train_loss: 3.1495 train_time: 4.8m tok/s: 5294878
+2920/20000 train_loss: 3.1569 train_time: 4.8m tok/s: 5295214
+2930/20000 train_loss: 3.1107 train_time: 4.8m tok/s: 5295406
+2940/20000 train_loss: 3.0096 train_time: 4.9m tok/s: 5295498
+2950/20000 train_loss: 3.1422 train_time: 4.9m tok/s: 5295791
+2960/20000 train_loss: 3.1529 train_time: 4.9m tok/s: 5296016
+2970/20000 train_loss: 3.0521 train_time: 4.9m tok/s: 5296257
+layer_loop:enabled step:2970 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+2980/20000 train_loss: 3.2356 train_time: 4.9m tok/s: 5291851
+2990/20000 train_loss: 3.1268 train_time: 4.9m tok/s: 5287687
+3000/20000 train_loss: 3.2042 train_time: 5.0m tok/s: 5283592
+3010/20000 train_loss: 3.2201 train_time: 5.0m tok/s: 5279472
+3020/20000 train_loss: 3.1562 train_time: 5.0m tok/s: 5275465
+3030/20000 train_loss: 3.1408 train_time: 5.0m tok/s: 5271461
+3040/20000 train_loss: 3.1080 train_time: 5.0m tok/s: 5267505
+3050/20000 train_loss: 3.1210 train_time: 5.1m tok/s: 5263481
+3060/20000 train_loss: 3.0439 train_time: 5.1m tok/s: 5259577
+3070/20000 train_loss: 3.1593 train_time: 5.1m tok/s: 5255726
+3080/20000 train_loss: 3.1371 train_time: 5.1m tok/s: 5251903
+3090/20000 train_loss: 3.1223 train_time: 5.1m tok/s: 5248089
+3100/20000 train_loss: 3.1703 train_time: 5.2m tok/s: 5244268
+3110/20000 train_loss: 3.1433 train_time: 5.2m tok/s: 5240541
+3120/20000 train_loss: 3.2185 train_time: 5.2m tok/s: 5236828
+3130/20000 train_loss: 3.1583 train_time: 5.2m tok/s: 5233167
+3140/20000 train_loss: 3.0766 train_time: 5.2m tok/s: 5229449
+3150/20000 train_loss: 2.9879 train_time: 5.3m tok/s: 5225854
+3160/20000 train_loss: 3.0987 train_time: 5.3m tok/s: 5222246
+3170/20000 train_loss: 3.0493 train_time: 5.3m tok/s: 5218615
+3180/20000 train_loss: 3.1918 train_time: 5.3m tok/s: 5215070
+3190/20000 train_loss: 3.0316 train_time: 5.3m tok/s: 5211576
+3200/20000 train_loss: 3.1283 train_time: 5.4m tok/s: 5208075
+3210/20000 train_loss: 3.1449 train_time: 5.4m tok/s: 5204592
+3220/20000 train_loss: 3.1494 train_time: 5.4m tok/s: 5201069
+3230/20000 train_loss: 3.1386 train_time: 5.4m tok/s: 5197571
+3240/20000 train_loss: 3.2401 train_time: 5.5m tok/s: 5194199
+3250/20000 train_loss: 3.1678 train_time: 5.5m tok/s: 5190781
+3260/20000 train_loss: 3.0370 train_time: 5.5m tok/s: 5187401
+3270/20000 train_loss: 3.1035 train_time: 5.5m tok/s: 5184097
+3280/20000 train_loss: 3.1017 train_time: 5.5m tok/s: 5180759
+3290/20000 train_loss: 3.2700 train_time: 5.6m tok/s: 5177491
+3300/20000 train_loss: 3.0585 train_time: 5.6m tok/s: 5174246
+3310/20000 train_loss: 3.0236 train_time: 5.6m tok/s: 5170947
+3320/20000 train_loss: 3.0842 train_time: 5.6m tok/s: 5167719
+3330/20000 train_loss: 3.0681 train_time: 5.6m tok/s: 5164530
+3340/20000 train_loss: 3.0712 train_time: 5.7m tok/s: 5161357
+3350/20000 train_loss: 3.1023 train_time: 5.7m tok/s: 5158235
+3360/20000 train_loss: 3.0669 train_time: 5.7m tok/s: 5155163
+3370/20000 train_loss: 3.0037 train_time: 5.7m tok/s: 5152127
+3380/20000 train_loss: 3.1467 train_time: 5.7m tok/s: 5149037
+3390/20000 train_loss: 3.1392 train_time: 5.8m tok/s: 5145832
+3400/20000 train_loss: 3.0623 train_time: 5.8m tok/s: 5142737
+3410/20000 train_loss: 3.0925 train_time: 5.8m tok/s: 5139717
+3420/20000 train_loss: 3.2173 train_time: 5.8m tok/s: 5136759
+3430/20000 train_loss: 3.0266 train_time: 5.8m tok/s: 5133792
+3440/20000 train_loss: 3.0740 train_time: 5.9m tok/s: 5130886
+3450/20000 train_loss: 3.0516 train_time: 5.9m tok/s: 5128019
+3460/20000 train_loss: 3.2037 train_time: 5.9m tok/s: 5125102
+3470/20000 train_loss: 3.1386 train_time: 5.9m tok/s: 5122266
+3480/20000 train_loss: 3.0941 train_time: 5.9m tok/s: 5119428
+3490/20000 train_loss: 3.0841 train_time: 6.0m tok/s: 5116593
+3500/20000 train_loss: 3.0896 train_time: 6.0m tok/s: 5113815
+3510/20000 train_loss: 3.0954 train_time: 6.0m tok/s: 5111051
+3520/20000 train_loss: 3.0795 train_time: 6.0m tok/s: 5108270
+3530/20000 train_loss: 3.1287 train_time: 6.0m tok/s: 5105495
+3540/20000 train_loss: 3.1567 train_time: 6.1m tok/s: 5102774
+3550/20000 train_loss: 3.0516 train_time: 6.1m tok/s: 5100030
+3560/20000 train_loss: 3.0593 train_time: 6.1m tok/s: 5097355
+3570/20000 train_loss: 3.0751 train_time: 6.1m tok/s: 5094668
+3580/20000 train_loss: 3.0452 train_time: 6.1m tok/s: 5091989
+3590/20000 train_loss: 3.0216 train_time: 6.2m tok/s: 5089339
+3600/20000 train_loss: 3.0917 train_time: 6.2m tok/s: 5086638
+3610/20000 train_loss: 3.0304 train_time: 6.2m tok/s: 5084017
+3620/20000 train_loss: 3.1567 train_time: 6.2m tok/s: 5081435
+3630/20000 train_loss: 2.9943 train_time: 6.2m tok/s: 5078747
+3640/20000 train_loss: 3.0352 train_time: 6.3m tok/s: 5076215
+3650/20000 train_loss: 3.1965 train_time: 6.3m tok/s: 5073676
+3660/20000 train_loss: 3.0195 train_time: 6.3m tok/s: 5071145
+3670/20000 train_loss: 3.1046 train_time: 6.3m tok/s: 5068634
+3680/20000 train_loss: 3.0656 train_time: 6.3m tok/s: 5066145
+3690/20000 train_loss: 3.1075 train_time: 6.4m tok/s: 5063606
+3700/20000 train_loss: 3.0974 train_time: 6.4m tok/s: 5061081
+3710/20000 train_loss: 3.1139 train_time: 6.4m tok/s: 5058538
+3720/20000 train_loss: 3.0954 train_time: 6.4m tok/s: 5055993
+3730/20000 train_loss: 3.0337 train_time: 6.4m tok/s: 5053533
+3740/20000 train_loss: 2.9816 train_time: 6.5m tok/s: 5051116
+3750/20000 train_loss: 3.0255 train_time: 6.5m tok/s: 5048728
+3760/20000 train_loss: 2.9685 train_time: 6.5m tok/s: 5046367
+3770/20000 train_loss: 3.0797 train_time: 6.5m tok/s: 5044001
+3780/20000 train_loss: 3.0783 train_time: 6.6m tok/s: 5041675
+3790/20000 train_loss: 3.1633 train_time: 6.6m tok/s: 5039331
+3800/20000 train_loss: 3.0589 train_time: 6.6m tok/s: 5036917
+3810/20000 train_loss: 3.1454 train_time: 6.6m tok/s: 5034619
+3820/20000 train_loss: 3.0333 train_time: 6.6m tok/s: 5032272
+3830/20000 train_loss: 2.9862 train_time: 6.7m tok/s: 5030034
+3840/20000 train_loss: 3.0431 train_time: 6.7m tok/s: 5027671
+3850/20000 train_loss: 3.0948 train_time: 6.7m tok/s: 5025435
+3860/20000 train_loss: 3.0806 train_time: 6.7m tok/s: 5023188
+3870/20000 train_loss: 3.0390 train_time: 6.7m tok/s: 5020972
+3880/20000 train_loss: 3.0930 train_time: 6.8m tok/s: 5018702
+3890/20000 train_loss: 3.1046 train_time: 6.8m tok/s: 5016500
+3900/20000 train_loss: 2.9905 train_time: 6.8m tok/s: 5014337
+3910/20000 train_loss: 3.0817 train_time: 6.8m tok/s: 5012157
+3920/20000 train_loss: 3.1994 train_time: 6.8m tok/s: 5009996
+3930/20000 train_loss: 3.1468 train_time: 6.9m tok/s: 5007862
+3940/20000 train_loss: 3.1063 train_time: 6.9m tok/s: 5005724
+3950/20000 train_loss: 2.9952 train_time: 6.9m tok/s: 5003585
+3960/20000 train_loss: 3.0622 train_time: 6.9m tok/s: 5001488
+3970/20000 train_loss: 3.0622 train_time: 6.9m tok/s: 4999411
+3980/20000 train_loss: 3.1570 train_time: 7.0m tok/s: 4997355
+3990/20000 train_loss: 3.0982 train_time: 7.0m tok/s: 4995306
+4000/20000 train_loss: 2.9855 train_time: 7.0m tok/s: 4993255
+4000/20000 val_loss: 3.0119 val_bpb: 1.1660
+4010/20000 train_loss: 2.9810 train_time: 7.0m tok/s: 4992142
+4020/20000 train_loss: 3.0715 train_time: 7.0m tok/s: 4990243
+4030/20000 train_loss: 3.0336 train_time: 7.1m tok/s: 4988199
+4040/20000 train_loss: 3.0240 train_time: 7.1m tok/s: 4986178
+4050/20000 train_loss: 3.1013 train_time: 7.1m tok/s: 4984181
+4060/20000 train_loss: 3.0526 train_time: 7.1m tok/s: 4982198
+4070/20000 train_loss: 3.0797 train_time: 7.1m tok/s: 4980162
+4080/20000 train_loss: 3.0470 train_time: 7.2m tok/s: 4978157
+4090/20000 train_loss: 3.0607 train_time: 7.2m tok/s: 4976162
+4100/20000 train_loss: 3.0983 train_time: 7.2m tok/s: 4974201
+4110/20000 train_loss: 3.0983 train_time: 7.2m tok/s: 4972234
+4120/20000 train_loss: 3.0184 train_time: 7.2m tok/s: 4970285
+4130/20000 train_loss: 3.0695 train_time: 7.3m tok/s: 4968311
+4140/20000 train_loss: 3.0594 train_time: 7.3m tok/s: 4966426
+4150/20000 train_loss: 3.0202 train_time: 7.3m tok/s: 4964469
+4160/20000 train_loss: 3.0838 train_time: 7.3m tok/s: 4962552
+4170/20000 train_loss: 3.1925 train_time: 7.3m tok/s: 4960608
+4180/20000 train_loss: 2.9772 train_time: 7.4m tok/s: 4958772
+4190/20000 train_loss: 2.9989 train_time: 7.4m tok/s: 4956959
+4200/20000 train_loss: 2.9457 train_time: 7.4m tok/s: 4955125
+4210/20000 train_loss: 3.1374 train_time: 7.4m tok/s: 4953320
+4220/20000 train_loss: 3.0929 train_time: 7.4m tok/s: 4951506
+4230/20000 train_loss: 3.0763 train_time: 7.5m tok/s: 4949707
+4240/20000 train_loss: 3.0424 train_time: 7.5m tok/s: 4947921
+4250/20000 train_loss: 3.1849 train_time: 7.5m tok/s: 4946165
+4260/20000 train_loss: 3.0372 train_time: 7.5m tok/s: 4944381
+4270/20000 train_loss: 3.0244 train_time: 7.5m tok/s: 4942577
+4280/20000 train_loss: 3.0548 train_time: 7.6m tok/s: 4940808
+4290/20000 train_loss: 2.9494 train_time: 7.6m tok/s: 4939063
+4300/20000 train_loss: 3.0017 train_time: 7.6m tok/s: 4937331
+4310/20000 train_loss: 3.0608 train_time: 7.6m tok/s: 4935604
+4320/20000 train_loss: 3.0958 train_time: 7.7m tok/s: 4933817
+4330/20000 train_loss: 3.0624 train_time: 7.7m tok/s: 4932060
+4340/20000 train_loss: 3.1354 train_time: 7.7m tok/s: 4930295
+4350/20000 train_loss: 3.0480 train_time: 7.7m tok/s: 4928546
+4360/20000 train_loss: 3.0442 train_time: 7.7m tok/s: 4926863
+4370/20000 train_loss: 3.0251 train_time: 7.8m tok/s: 4925212
+4380/20000 train_loss: 3.0306 train_time: 7.8m tok/s: 4923562
+4390/20000 train_loss: 2.9091 train_time: 7.8m tok/s: 4921915
+4400/20000 train_loss: 3.0847 train_time: 7.8m tok/s: 4917378
+4410/20000 train_loss: 3.0572 train_time: 7.8m tok/s: 4915700
+4420/20000 train_loss: 2.9235 train_time: 7.9m tok/s: 4914031
+4430/20000 train_loss: 3.0137 train_time: 7.9m tok/s: 4912364
+4440/20000 train_loss: 3.2080 train_time: 7.9m tok/s: 4907761
+4450/20000 train_loss: 2.9779 train_time: 7.9m tok/s: 4906108
+4460/20000 train_loss: 3.1039 train_time: 7.9m tok/s: 4904496
+4470/20000 train_loss: 3.0021 train_time: 8.0m tok/s: 4902923
+4480/20000 train_loss: 3.1050 train_time: 8.0m tok/s: 4901288
+4490/20000 train_loss: 3.0061 train_time: 8.0m tok/s: 4899682
+4500/20000 train_loss: 3.1458 train_time: 8.0m tok/s: 4898080
+4510/20000 train_loss: 2.9990 train_time: 8.0m tok/s: 4896498
+4520/20000 train_loss: 2.9461 train_time: 8.1m tok/s: 4894919
+4530/20000 train_loss: 3.0019 train_time: 8.1m tok/s: 4893400
+4540/20000 train_loss: 3.1009 train_time: 8.1m tok/s: 4891873
+4550/20000 train_loss: 3.0419 train_time: 8.1m tok/s: 4890360
+4560/20000 train_loss: 3.0463 train_time: 8.2m tok/s: 4888813
+4570/20000 train_loss: 3.0301 train_time: 8.2m tok/s: 4887291
+4580/20000 train_loss: 3.0816 train_time: 8.2m tok/s: 4885801
+4590/20000 train_loss: 2.9584 train_time: 8.2m tok/s: 4884310
+4600/20000 train_loss: 3.0286 train_time: 8.2m tok/s: 4882837
+4610/20000 train_loss: 3.0775 train_time: 8.3m tok/s: 4881348
+4620/20000 train_loss: 3.0499 train_time: 8.3m tok/s: 4879900
+4630/20000 train_loss: 2.9811 train_time: 8.3m tok/s: 4878410
+4640/20000 train_loss: 3.0426 train_time: 8.3m tok/s: 4876971
+4650/20000 train_loss: 2.9688 train_time: 8.3m tok/s: 4875503
+4660/20000 train_loss: 3.0092 train_time: 8.4m tok/s: 4874034
+4670/20000 train_loss: 3.0042 train_time: 8.4m tok/s: 4872585
+4680/20000 train_loss: 3.0816 train_time: 8.4m tok/s: 4871115
+4690/20000 train_loss: 3.0043 train_time: 8.4m tok/s: 4869621
+4700/20000 train_loss: 3.0082 train_time: 8.4m tok/s: 4868189
+4710/20000 train_loss: 2.9304 train_time: 8.5m tok/s: 4866751
+4720/20000 train_loss: 3.0408 train_time: 8.5m tok/s: 4865341
+4730/20000 train_loss: 2.9646 train_time: 8.5m tok/s: 4863965
+4740/20000 train_loss: 3.0597 train_time: 8.5m tok/s: 4862607
+4750/20000 train_loss: 2.9173 train_time: 8.5m tok/s: 4861253
+4760/20000 train_loss: 3.0445 train_time: 8.6m tok/s: 4859910
+4770/20000 train_loss: 2.9418 train_time: 8.6m tok/s: 4858540
+4780/20000 train_loss: 3.0784 train_time: 8.6m tok/s: 4857173
+4790/20000 train_loss: 3.0316 train_time: 8.6m tok/s: 4855797
+4800/20000 train_loss: 3.0322 train_time: 8.6m tok/s: 4854418
+4810/20000 train_loss: 3.0058 train_time: 8.7m tok/s: 4853067
+4820/20000 train_loss: 2.9956 train_time: 8.7m tok/s: 4851745
+4830/20000 train_loss: 2.9787 train_time: 8.7m tok/s: 4850408
+4840/20000 train_loss: 2.9460 train_time: 8.7m tok/s: 4849071
+4850/20000 train_loss: 3.0221 train_time: 8.7m tok/s: 4847723
+4860/20000 train_loss: 3.0449 train_time: 8.8m tok/s: 4846381
+4870/20000 train_loss: 2.9311 train_time: 8.8m tok/s: 4845088
+4880/20000 train_loss: 2.9947 train_time: 8.8m tok/s: 4843756
+4890/20000 train_loss: 3.0366 train_time: 8.8m tok/s: 4842455
+4900/20000 train_loss: 3.0262 train_time: 8.8m tok/s: 4841144
+4910/20000 train_loss: 3.0206 train_time: 8.9m tok/s: 4839841
+4920/20000 train_loss: 3.0822 train_time: 8.9m tok/s: 4838534
+4930/20000 train_loss: 3.0308 train_time: 8.9m tok/s: 4837235
+4940/20000 train_loss: 2.9778 train_time: 8.9m tok/s: 4835935
+4950/20000 train_loss: 2.9646 train_time: 8.9m tok/s: 4834661
+4960/20000 train_loss: 2.9600 train_time: 9.0m tok/s: 4833348
+4970/20000 train_loss: 3.0933 train_time: 9.0m tok/s: 4832087
+4980/20000 train_loss: 3.0780 train_time: 9.0m tok/s: 4830821
+4990/20000 train_loss: 2.9876 train_time: 9.0m tok/s: 4829529
+5000/20000 train_loss: 3.0441 train_time: 9.0m tok/s: 4828274
+5010/20000 train_loss: 3.0603 train_time: 9.1m tok/s: 4827000
+5020/20000 train_loss: 2.9678 train_time: 9.1m tok/s: 4825743
+5030/20000 train_loss: 3.0192 train_time: 9.1m tok/s: 4824527
+5040/20000 train_loss: 3.0270 train_time: 9.1m tok/s: 4823291
+5050/20000 train_loss: 2.9380 train_time: 9.2m tok/s: 4822082
+5060/20000 train_loss: 3.1187 train_time: 9.2m tok/s: 4820891
+5070/20000 train_loss: 2.9669 train_time: 9.2m tok/s: 4819715
+5080/20000 train_loss: 2.9235 train_time: 9.2m tok/s: 4818540
+5090/20000 train_loss: 2.9879 train_time: 9.2m tok/s: 4817360
+5100/20000 train_loss: 2.9503 train_time: 9.3m tok/s: 4816179
+5110/20000 train_loss: 2.9496 train_time: 9.3m tok/s: 4814978
+5120/20000 train_loss: 2.9246 train_time: 9.3m tok/s: 4813809
+5130/20000 train_loss: 2.9222 train_time: 9.3m tok/s: 4812638
+5140/20000 train_loss: 2.9910 train_time: 9.3m tok/s: 4811481
+5150/20000 train_loss: 3.0433 train_time: 9.4m tok/s: 4810331
+5160/20000 train_loss: 2.8909 train_time: 9.4m tok/s: 4809186
+5170/20000 train_loss: 2.9251 train_time: 9.4m tok/s: 4808033
+5180/20000 train_loss: 3.0002 train_time: 9.4m tok/s: 4806874
+5190/20000 train_loss: 2.9247 train_time: 9.4m tok/s: 4805769
+5200/20000 train_loss: 2.9608 train_time: 9.5m tok/s: 4804640
+5210/20000 train_loss: 2.8966 train_time: 9.5m tok/s: 4803551
+5220/20000 train_loss: 2.9395 train_time: 9.5m tok/s: 4802419
+5230/20000 train_loss: 2.9692 train_time: 9.5m tok/s: 4801257
+5240/20000 train_loss: 3.0076 train_time: 9.5m tok/s: 4800114
+5250/20000 train_loss: 2.9626 train_time: 9.6m tok/s: 4798999
+5260/20000 train_loss: 2.9927 train_time: 9.6m tok/s: 4797887
+5270/20000 train_loss: 2.9450 train_time: 9.6m tok/s: 4796757
+5280/20000 train_loss: 3.0061 train_time: 9.6m tok/s: 4795634
+5290/20000 train_loss: 3.0203 train_time: 9.6m tok/s: 4794510
+5300/20000 train_loss: 3.0537 train_time: 9.7m tok/s: 4793412
+5310/20000 train_loss: 2.9140 train_time: 9.7m tok/s: 4792312
+5320/20000 train_loss: 2.9093 train_time: 9.7m tok/s: 4791223
+5330/20000 train_loss: 2.9873 train_time: 9.7m tok/s: 4790144
+5340/20000 train_loss: 2.8635 train_time: 9.7m tok/s: 4789036
+5350/20000 train_loss: 2.9072 train_time: 9.8m tok/s: 4787951
+5360/20000 train_loss: 2.9958 train_time: 9.8m tok/s: 4786848
+5368/20000 val_loss: 2.8447 val_bpb: 1.1013
+stopping_early: wallclock_cap train_time: 588041ms step: 5368/20000
+peak memory allocated: 25640 MiB reserved: 25652 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.84224331 val_bpb:1.10033379 eval_time:6560ms
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+Serialized model: 135615079 bytes
+Code size: 151202 bytes
+GPTQ:collecting Hessians from calibration data...
+[prefetch] daemon started: depth=4 pinned=True +GPTQ:collected 67 Hessians in 8.2s +[IDEA-064 parallel_gptq] enabled — multi-clip search active +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.247808 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.269812 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.258261 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.269052 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.261907 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.270251 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.268961 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.283977 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.151974 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.193686 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.170907 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.199360 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.196602 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.181248 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.208336 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.224559 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.273053 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.300749 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.280521 +[IDEA-064 parallel_gptq] searched 50 
clips × 256 rows using 64 workers, avg_best_err=7.293932 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.291560 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.284720 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.296231 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.312604 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.078287 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.078359 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.076178 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.076320 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.076118 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.079911 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.080441 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.074444 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=119.126941 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=119.194287 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=118.639005 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=118.969895 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=118.953076 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=119.184331 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=119.101136 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=118.902862 +[IDEA-064 parallel_gptq] 
searched 50 clips × 512 rows using 64 workers, avg_best_err=5.842355 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.834611 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.845353 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.852147 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.846437 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.847843 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.841017 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.845102 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.925267 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.922398 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.932725 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.943073 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.938194 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.940518 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.931870 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.934711 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.097234 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.092381 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.109631 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.123551 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.106187 +[IDEA-064 parallel_gptq] 
searched 50 clips × 256 rows using 64 workers, avg_best_err=11.120200 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.101704 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.113640 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.260179 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.254625 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.257115 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.255435 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.262305 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.250251 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.258443 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.260048 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=35.898565 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=35.848871 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=35.934805 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=35.876144 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=35.827384 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=35.888917 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=35.918911 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=35.870492 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.045116 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.040930 +[IDEA-064 
parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.030281 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.038786 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.042486 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.037868 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.039348 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.045624 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.991375 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.992549 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.985289 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.982840 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.993244 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.991243 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.993339 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.991113 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.154182 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.145652 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.119671 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.143257 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.140572 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.133215 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.134309 +[IDEA-064 
parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.127619 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.852577 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.853529 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.852119 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.853033 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.854123 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.853126 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.851568 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.852166 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.693018 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.656774 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.685225 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.636255 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.661620 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.668162 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.657119 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.635530 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.935637 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.937118 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.937498 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.929354 
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.938037 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.935079 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.936752 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.934452 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.647383 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.651374 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.650219 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.649084 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.647089 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.653097 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.652050 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.653890 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.525504 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.536445 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.534560 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.531229 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.533362 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.530215 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.528626 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=5.532829 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.898970 
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.898405 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.900714 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.897405 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.896071 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.896390 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.896836 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.900160 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.429086 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.406748 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.418869 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.425478 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.426085 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.422342 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.414026 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.397806 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.504506 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.505584 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.503405 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.506353 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.489011 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, 
avg_best_err=7.502630 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.501092 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=7.503104 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.653620 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.653761 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.649956 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.652863 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.645083 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.645915 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.096101 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.656570 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.648169 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.099374 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.090150 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.098137 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.087410 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.093989 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.095337 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=11.102412 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.571931 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.577973 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 
64 workers, avg_best_err=2.572013 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.564946 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.569211 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.569160 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.575610 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.565856 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.393153 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.393994 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.365876 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.404667 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.380751 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.400821 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.366182 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.380028 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.793212 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.794366 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.790989 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.790256 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.786484 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=14.740173 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=14.726229 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=6.790692 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.788764 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.795063 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=14.706655 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=14.724419 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=14.675990 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.799713 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=14.724856 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.796187 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=14.717855 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=14.728600 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.806436 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.795437 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.795132 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.805844 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.793315 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.805387 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.219715 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.223772 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.222256 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.217262 +[IDEA-064 parallel_gptq] searched 50 clips × 512 
rows using 64 workers, avg_best_err=1.221718 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.218065 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.219731 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.224861 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.473714 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.458910 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.443151 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.469136 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.458273 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.469112 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.418647 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=17.445781 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.979997 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.982077 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.978546 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.981750 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.976723 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.208519 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.979858 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.207447 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.977259 +[IDEA-064 parallel_gptq] searched 50 clips × 
512 rows using 64 workers, avg_best_err=1.980145 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.204986 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.208926 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.204289 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.173822 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.207257 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.166631 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.207558 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.205542 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.184423 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.173475 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.173101 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.161316 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.179385 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.171620 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.167686 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.168544 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.169231 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.167062 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.168200 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.168042 +[IDEA-064 parallel_gptq] searched 50 clips × 512 
rows using 64 workers, avg_best_err=0.166918 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.168693 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.102625 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.100493 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.095669 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.099609 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.102167 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.103276 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.093609 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.097060 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.568875 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.569831 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.568903 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.568798 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.311031 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.569075 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.309272 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.311104 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.567253 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.567664 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.313677 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=1.569858 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.845760 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.309144 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.844293 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.841228 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.307190 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.306798 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.844286 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.312121 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.844902 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.840530 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.293717 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.842241 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.846498 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.291989 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.284717 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.278806 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.279594 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.285206 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.990996 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.266105 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=5.283152 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.987880 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.983191 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.989048 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.993275 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.052108 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.990107 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.050352 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.986571 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.987957 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.049259 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.514135 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.512896 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.048595 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.052107 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.513542 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.511137 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.508741 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.049981 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.508563 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.508645 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=1.046741 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.513390 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.048065 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.512573 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.499405 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.509880 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.508662 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.509123 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.735599 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.737554 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.507800 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.736525 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.501700 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.504159 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.730725 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.729771 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.814629 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.811289 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.735125 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.807020 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.726650 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=1.734504 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.812234 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.816661 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.745489 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.745756 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.744760 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.813274 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.810323 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.055659 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.812044 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.055570 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.056461 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.743959 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.826244 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.744553 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.826285 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.827896 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.052290 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.744755 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.742803 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.053252 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=0.744300 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.363843 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.824096 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.054735 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.362429 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.050798 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.821347 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.051833 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.362732 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.821722 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.821843 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.822469 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.363040 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.686196 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.682072 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.362560 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.678133 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.362530 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.362680 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.362950 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.683245 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=0.602501 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.601701 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.688594 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.601239 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.281333 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.684336 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.278854 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.682091 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.683487 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.277572 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.960716 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.962657 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.601970 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.603999 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.962399 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.276301 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.601603 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.601847 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.282864 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.602613 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.965720 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=0.966681 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.955986 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.281026 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.279732 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.964957 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.966983 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.279815 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.958457 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.960261 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.961910 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.136754 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.131243 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.965583 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.965709 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.124860 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.965090 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.965197 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.965788 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.132988 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.140532 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.134645 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows 
using 64 workers, avg_best_err=1.130098 +Quantized weights: + gptq (int5): tok_emb.weight + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + passthrough (float16): blocks.attn.gate_proj.bias, blocks.attn.gate_proj.weight, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, lane_merge, skip_gates, skip_weights +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.132523 +Serialized model quantized+zstd: 15715938 bytes +Total submission size quantized+zstd: 15867140 bytes +quantized val_loss:8.96211912 val_bpb:3.46955606 eval_time:2466ms +quantized_sliding_window val_loss:8.96209632 val_bpb:3.46954724 eval_time:92138ms +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=35988657 frozen=0 + ttt_chunk [1/1238] bpb=3.387464 time=4.5s + ttt_chunk [11/1238] bpb=3.441996 time=6.8s + ttt_chunk [21/1238] bpb=3.460080 time=9.1s + ttt_chunk [31/1238] bpb=3.373085 time=11.4s + ttt_chunk [41/1238] bpb=3.289774 time=14.0s + ttt_chunk [51/1238] bpb=3.235499 time=16.3s + ttt_chunk [61/1238] bpb=3.199705 time=18.7s + ttt_chunk [71/1238] bpb=3.180278 time=21.0s + ttt_chunk [81/1238] bpb=3.151466 time=23.3s + ttt_chunk [91/1238] bpb=3.143620 time=25.8s + ttt_chunk [101/1238] bpb=3.120756 time=28.1s + ttt_chunk [111/1238] bpb=3.107512 time=30.4s + ttt_chunk [121/1238] bpb=3.092263 time=32.7s + ttt_chunk [131/1238] bpb=3.083574 time=35.0s + ttt_chunk [141/1238] bpb=3.075836 time=37.3s + ttt_chunk [151/1238] bpb=3.069587 time=39.7s + ttt_chunk [161/1238] bpb=3.060902 time=41.9s + ttt_chunk [171/1238] bpb=3.054438 time=44.3s + ttt_chunk [181/1238] bpb=3.043861 time=46.5s + ttt_chunk [191/1238] bpb=3.032582 time=48.9s + ttt_chunk [201/1238] bpb=3.026620 time=51.2s + ttt_chunk [211/1238] bpb=3.025076 time=53.5s + ttt_chunk 
[221/1238] bpb=3.015265 time=56.0s + ttt_chunk [231/1238] bpb=3.013563 time=58.3s + ttt_chunk [241/1238] bpb=3.013353 time=60.6s + ttt_chunk [251/1238] bpb=3.008726 time=63.0s + ttt_chunk [261/1238] bpb=3.000034 time=65.3s + ttt_chunk [271/1238] bpb=2.996259 time=67.8s + ttt_chunk [281/1238] bpb=2.989826 time=70.1s + ttt_chunk [291/1238] bpb=2.985456 time=72.4s + ttt_chunk [301/1238] bpb=2.977742 time=74.7s + ttt_chunk [311/1238] bpb=2.969030 time=77.0s + ttt_chunk [321/1238] bpb=2.964547 time=79.3s + ttt_chunk [331/1238] bpb=2.961460 time=81.6s + ttt_chunk [341/1238] bpb=2.960657 time=83.9s + ttt_chunk [351/1238] bpb=2.957033 time=86.2s + ttt_chunk [361/1238] bpb=2.950513 time=88.5s + ttt_chunk [371/1238] bpb=2.943785 time=90.8s + ttt_chunk [381/1238] bpb=2.940352 time=93.2s + ttt_chunk [391/1238] bpb=2.936461 time=95.5s + ttt_chunk [401/1238] bpb=2.929574 time=97.8s + ttt_chunk [411/1238] bpb=2.924515 time=100.0s + ttt_chunk [421/1238] bpb=2.919932 time=102.4s + ttt_chunk [431/1238] bpb=2.915216 time=104.7s + ttt_chunk [441/1238] bpb=2.911798 time=107.0s + ttt_chunk [451/1238] bpb=2.911721 time=109.5s + ttt_chunk [461/1238] bpb=2.907788 time=111.8s + ttt_chunk [471/1238] bpb=2.911517 time=114.1s + ttt_chunk [481/1238] bpb=2.910687 time=116.4s + ttt_chunk [491/1238] bpb=2.908153 time=118.7s + ttt_chunk [501/1238] bpb=2.903815 time=121.0s + ttt_chunk [511/1238] bpb=2.900895 time=123.3s + ttt_chunk [521/1238] bpb=2.898742 time=125.6s + ttt_chunk [531/1238] bpb=2.897327 time=127.9s + ttt_chunk [541/1238] bpb=2.896096 time=130.2s + ttt_chunk [551/1238] bpb=2.896039 time=132.5s + ttt_chunk [561/1238] bpb=2.894870 time=135.1s + ttt_chunk [571/1238] bpb=2.892276 time=137.4s + ttt_chunk [581/1238] bpb=2.892074 time=139.7s + ttt_chunk [591/1238] bpb=2.890571 time=142.0s + ttt_chunk [601/1238] bpb=2.888485 time=144.3s + ttt_chunk [611/1238] bpb=2.886638 time=146.6s + ttt_chunk [621/1238] bpb=2.883955 time=148.9s + ttt_chunk [631/1238] bpb=2.880872 time=151.2s + ttt_chunk 
[641/1238] bpb=2.878972 time=153.5s + ttt_chunk [651/1238] bpb=2.876302 time=155.8s + ttt_chunk [661/1238] bpb=2.872808 time=158.1s + ttt_chunk [671/1238] bpb=2.867963 time=160.4s + ttt_chunk [681/1238] bpb=2.864797 time=162.7s + ttt_chunk [691/1238] bpb=2.862989 time=165.0s + ttt_chunk [701/1238] bpb=2.859294 time=167.3s + ttt_chunk [711/1238] bpb=2.856345 time=169.6s + ttt_chunk [721/1238] bpb=2.853860 time=171.9s + ttt_chunk [731/1238] bpb=2.851472 time=174.2s + ttt_chunk [741/1238] bpb=2.850505 time=176.5s + ttt_chunk [751/1238] bpb=2.847386 time=178.8s + ttt_chunk [761/1238] bpb=2.842909 time=181.3s + ttt_chunk [771/1238] bpb=2.839043 time=183.6s + ttt_chunk [781/1238] bpb=2.835629 time=185.9s + ttt_chunk [791/1238] bpb=2.834719 time=188.2s + ttt_chunk [801/1238] bpb=2.834446 time=190.5s + ttt_chunk [811/1238] bpb=2.831545 time=192.8s + ttt_chunk [821/1238] bpb=2.829813 time=195.2s + ttt_chunk [831/1238] bpb=2.827759 time=197.5s + ttt_chunk [841/1238] bpb=2.826480 time=200.0s + ttt_chunk [851/1238] bpb=2.823690 time=202.3s + ttt_chunk [861/1238] bpb=2.821212 time=204.8s + ttt_chunk [871/1238] bpb=2.818448 time=207.1s + ttt_chunk [881/1238] bpb=2.816360 time=209.5s + ttt_chunk [891/1238] bpb=2.814453 time=211.8s + ttt_chunk [901/1238] bpb=2.815625 time=214.3s + ttt_chunk [911/1238] bpb=2.814021 time=216.6s + ttt_chunk [921/1238] bpb=2.813470 time=218.9s + ttt_chunk [931/1238] bpb=2.812511 time=221.2s + ttt_chunk [941/1238] bpb=2.811251 time=223.5s + ttt_chunk [951/1238] bpb=2.810956 time=225.8s + ttt_chunk [961/1238] bpb=2.810214 time=228.1s + ttt_chunk [971/1238] bpb=2.811062 time=230.4s + ttt_chunk [981/1238] bpb=2.810148 time=232.7s + ttt_chunk [991/1238] bpb=2.808792 time=235.0s + ttt_chunk [1001/1238] bpb=2.808915 time=237.3s + ttt_chunk [1011/1238] bpb=2.808044 time=239.7s + ttt_chunk [1021/1238] bpb=2.806989 time=242.1s + ttt_chunk [1031/1238] bpb=2.806164 time=244.5s + ttt_chunk [1041/1238] bpb=2.804937 time=246.8s + ttt_chunk [1051/1238] bpb=2.803088 
time=249.1s + ttt_chunk [1061/1238] bpb=2.801720 time=251.4s + ttt_chunk [1071/1238] bpb=2.800136 time=253.9s + ttt_chunk [1081/1238] bpb=2.798055 time=256.2s + ttt_chunk [1091/1238] bpb=2.795827 time=258.6s + ttt_chunk [1101/1238] bpb=2.794505 time=260.9s + ttt_chunk [1111/1238] bpb=2.793236 time=263.2s + ttt_chunk [1121/1238] bpb=2.792045 time=265.5s + ttt_chunk [1131/1238] bpb=2.790124 time=267.8s + ttt_chunk [1141/1238] bpb=2.788292 time=270.1s + ttt_chunk [1151/1238] bpb=2.786798 time=272.5s + ttt_chunk [1161/1238] bpb=2.785454 time=274.7s + ttt_chunk [1171/1238] bpb=2.783614 time=277.1s + ttt_chunk [1181/1238] bpb=2.782051 time=279.4s + ttt_chunk [1191/1238] bpb=2.780355 time=281.7s + ttt_chunk [1201/1238] bpb=2.779807 time=284.0s + ttt_chunk [1211/1238] bpb=2.779195 time=286.4s + ttt_chunk [1221/1238] bpb=2.777178 time=288.7s + ttt_chunk [1231/1238] bpb=2.776320 time=291.0s + ttt_chunk [1238/1238] bpb=2.775802 time=292.4s +ttt_sliding:done val_loss=7.165791 val_bpb=2.774133 elapsed=294.5s +quantized_ttt val_loss:7.16579127 val_bpb:2.77413346 eval_time:294713ms +[W424 07:41:21.254627579 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:41:21.291205984 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:41:21.310602054 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:41:21.316238722 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:41:21.317423232 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:41:21.329736163 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 
07:41:21.349337333 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:41:21.420666260 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) +[W424 07:41:23.952959392 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator()) + +[run] DONE 07:41:23Z +[run] === val_bpb lines === +0/20000 val_loss: 9.0088 val_bpb: 3.4876 +4000/20000 val_loss: 3.0119 val_bpb: 1.1660 +5368/20000 val_loss: 2.8447 val_bpb: 1.1013 +pre-quantization post-ema val_loss:2.84224331 val_bpb:1.10033379 eval_time:6560ms +quantized val_loss:8.96211912 val_bpb:3.46955606 eval_time:2466ms +quantized_sliding_window val_loss:8.96209632 val_bpb:3.46954724 eval_time:92138ms + ttt_chunk [1/1238] bpb=3.387464 time=4.5s + ttt_chunk [11/1238] bpb=3.441996 time=6.8s + ttt_chunk [21/1238] bpb=3.460080 time=9.1s + ttt_chunk [31/1238] bpb=3.373085 time=11.4s + ttt_chunk [41/1238] bpb=3.289774 time=14.0s + ttt_chunk [51/1238] bpb=3.235499 time=16.3s + ttt_chunk [61/1238] bpb=3.199705 time=18.7s + ttt_chunk [71/1238] bpb=3.180278 time=21.0s + ttt_chunk [81/1238] bpb=3.151466 time=23.3s + ttt_chunk [91/1238] bpb=3.143620 time=25.8s + ttt_chunk [101/1238] bpb=3.120756 time=28.1s + ttt_chunk [111/1238] bpb=3.107512 time=30.4s + ttt_chunk [121/1238] bpb=3.092263 time=32.7s + ttt_chunk [131/1238] bpb=3.083574 time=35.0s + ttt_chunk [141/1238] bpb=3.075836 time=37.3s + ttt_chunk [151/1238] bpb=3.069587 time=39.7s + ttt_chunk [161/1238] bpb=3.060902 time=41.9s + ttt_chunk [171/1238] bpb=3.054438 time=44.3s + ttt_chunk [181/1238] bpb=3.043861 time=46.5s + ttt_chunk [191/1238] bpb=3.032582 time=48.9s + ttt_chunk [201/1238] bpb=3.026620 time=51.2s + ttt_chunk [211/1238] bpb=3.025076 time=53.5s + ttt_chunk [221/1238] bpb=3.015265 time=56.0s + ttt_chunk [231/1238] bpb=3.013563 time=58.3s + ttt_chunk 
[241/1238] bpb=3.013353 time=60.6s + ttt_chunk [251/1238] bpb=3.008726 time=63.0s + ttt_chunk [261/1238] bpb=3.000034 time=65.3s + ttt_chunk [271/1238] bpb=2.996259 time=67.8s + ttt_chunk [281/1238] bpb=2.989826 time=70.1s + ttt_chunk [291/1238] bpb=2.985456 time=72.4s + ttt_chunk [301/1238] bpb=2.977742 time=74.7s + ttt_chunk [311/1238] bpb=2.969030 time=77.0s + ttt_chunk [321/1238] bpb=2.964547 time=79.3s + ttt_chunk [331/1238] bpb=2.961460 time=81.6s + ttt_chunk [341/1238] bpb=2.960657 time=83.9s + ttt_chunk [351/1238] bpb=2.957033 time=86.2s + ttt_chunk [361/1238] bpb=2.950513 time=88.5s + ttt_chunk [371/1238] bpb=2.943785 time=90.8s + ttt_chunk [381/1238] bpb=2.940352 time=93.2s + ttt_chunk [391/1238] bpb=2.936461 time=95.5s + ttt_chunk [401/1238] bpb=2.929574 time=97.8s + ttt_chunk [411/1238] bpb=2.924515 time=100.0s + ttt_chunk [421/1238] bpb=2.919932 time=102.4s + ttt_chunk [431/1238] bpb=2.915216 time=104.7s + ttt_chunk [441/1238] bpb=2.911798 time=107.0s + ttt_chunk [451/1238] bpb=2.911721 time=109.5s + ttt_chunk [461/1238] bpb=2.907788 time=111.8s + ttt_chunk [471/1238] bpb=2.911517 time=114.1s + ttt_chunk [481/1238] bpb=2.910687 time=116.4s + ttt_chunk [491/1238] bpb=2.908153 time=118.7s + ttt_chunk [501/1238] bpb=2.903815 time=121.0s + ttt_chunk [511/1238] bpb=2.900895 time=123.3s + ttt_chunk [521/1238] bpb=2.898742 time=125.6s + ttt_chunk [531/1238] bpb=2.897327 time=127.9s + ttt_chunk [541/1238] bpb=2.896096 time=130.2s + ttt_chunk [551/1238] bpb=2.896039 time=132.5s + ttt_chunk [561/1238] bpb=2.894870 time=135.1s + ttt_chunk [571/1238] bpb=2.892276 time=137.4s + ttt_chunk [581/1238] bpb=2.892074 time=139.7s + ttt_chunk [591/1238] bpb=2.890571 time=142.0s + ttt_chunk [601/1238] bpb=2.888485 time=144.3s + ttt_chunk [611/1238] bpb=2.886638 time=146.6s + ttt_chunk [621/1238] bpb=2.883955 time=148.9s + ttt_chunk [631/1238] bpb=2.880872 time=151.2s + ttt_chunk [641/1238] bpb=2.878972 time=153.5s + ttt_chunk [651/1238] bpb=2.876302 time=155.8s + ttt_chunk 
[661/1238] bpb=2.872808 time=158.1s + ttt_chunk [671/1238] bpb=2.867963 time=160.4s + ttt_chunk [681/1238] bpb=2.864797 time=162.7s + ttt_chunk [691/1238] bpb=2.862989 time=165.0s + ttt_chunk [701/1238] bpb=2.859294 time=167.3s + ttt_chunk [711/1238] bpb=2.856345 time=169.6s + ttt_chunk [721/1238] bpb=2.853860 time=171.9s + ttt_chunk [731/1238] bpb=2.851472 time=174.2s + ttt_chunk [741/1238] bpb=2.850505 time=176.5s + ttt_chunk [751/1238] bpb=2.847386 time=178.8s + ttt_chunk [761/1238] bpb=2.842909 time=181.3s + ttt_chunk [771/1238] bpb=2.839043 time=183.6s + ttt_chunk [781/1238] bpb=2.835629 time=185.9s + ttt_chunk [791/1238] bpb=2.834719 time=188.2s + ttt_chunk [801/1238] bpb=2.834446 time=190.5s + ttt_chunk [811/1238] bpb=2.831545 time=192.8s + ttt_chunk [821/1238] bpb=2.829813 time=195.2s + ttt_chunk [831/1238] bpb=2.827759 time=197.5s + ttt_chunk [841/1238] bpb=2.826480 time=200.0s + ttt_chunk [851/1238] bpb=2.823690 time=202.3s + ttt_chunk [861/1238] bpb=2.821212 time=204.8s + ttt_chunk [871/1238] bpb=2.818448 time=207.1s + ttt_chunk [881/1238] bpb=2.816360 time=209.5s + ttt_chunk [891/1238] bpb=2.814453 time=211.8s + ttt_chunk [901/1238] bpb=2.815625 time=214.3s + ttt_chunk [911/1238] bpb=2.814021 time=216.6s + ttt_chunk [921/1238] bpb=2.813470 time=218.9s + ttt_chunk [931/1238] bpb=2.812511 time=221.2s + ttt_chunk [941/1238] bpb=2.811251 time=223.5s + ttt_chunk [951/1238] bpb=2.810956 time=225.8s + ttt_chunk [961/1238] bpb=2.810214 time=228.1s + ttt_chunk [971/1238] bpb=2.811062 time=230.4s + ttt_chunk [981/1238] bpb=2.810148 time=232.7s + ttt_chunk [991/1238] bpb=2.808792 time=235.0s + ttt_chunk [1001/1238] bpb=2.808915 time=237.3s + ttt_chunk [1011/1238] bpb=2.808044 time=239.7s + ttt_chunk [1021/1238] bpb=2.806989 time=242.1s + ttt_chunk [1031/1238] bpb=2.806164 time=244.5s + ttt_chunk [1041/1238] bpb=2.804937 time=246.8s + ttt_chunk [1051/1238] bpb=2.803088 time=249.1s + ttt_chunk [1061/1238] bpb=2.801720 time=251.4s + ttt_chunk [1071/1238] bpb=2.800136 
time=253.9s + ttt_chunk [1081/1238] bpb=2.798055 time=256.2s + ttt_chunk [1091/1238] bpb=2.795827 time=258.6s + ttt_chunk [1101/1238] bpb=2.794505 time=260.9s + ttt_chunk [1111/1238] bpb=2.793236 time=263.2s + ttt_chunk [1121/1238] bpb=2.792045 time=265.5s + ttt_chunk [1131/1238] bpb=2.790124 time=267.8s + ttt_chunk [1141/1238] bpb=2.788292 time=270.1s + ttt_chunk [1151/1238] bpb=2.786798 time=272.5s + ttt_chunk [1161/1238] bpb=2.785454 time=274.7s + ttt_chunk [1171/1238] bpb=2.783614 time=277.1s + ttt_chunk [1181/1238] bpb=2.782051 time=279.4s + ttt_chunk [1191/1238] bpb=2.780355 time=281.7s + ttt_chunk [1201/1238] bpb=2.779807 time=284.0s + ttt_chunk [1211/1238] bpb=2.779195 time=286.4s + ttt_chunk [1221/1238] bpb=2.777178 time=288.7s + ttt_chunk [1231/1238] bpb=2.776320 time=291.0s + ttt_chunk [1238/1238] bpb=2.775802 time=292.4s +ttt_sliding:done val_loss=7.165791 val_bpb=2.774133 elapsed=294.5s +quantized_ttt val_loss:7.16579127 val_bpb:2.77413346 eval_time:294713ms + +[run] === artifact === + final_model.int6.ptz: 15715938 bytes diff --git a/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed42.log b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed42.log new file mode 100644 index 0000000000..a22c0af361 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/train_log_seed42.log @@ -0,0 +1,1431 @@ +[run] 128 train shards, 1 val shard(s), tokenizer ok +[run] config: + SEED=42 + MAX_WALLCLOCK_SECONDS=600 + TTT_ENABLED=1 + DATA_DIR=/root/c22_submission/final/data +[run] launcher: torchrun × 8 +[run] launching c22_train.py at 06:21:53Z +[run] log: logs/run_seed42_20260424T062153Z.log +W0424 06:21:54.798000 3428177 torch/distributed/run.py:803] +W0424 06:21:54.798000 3428177 torch/distributed/run.py:803] ***************************************** +W0424 06:21:54.798000 3428177 torch/distributed/run.py:803] Setting 
OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0424 06:21:54.798000 3428177 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.095 + beta1: 0.9 + beta2: 0.95 + compressor: zstd + data_dir: /root/c22_submission/final/data + datasets_dir: /root/c22_submission/final/data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 5 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/14e6e144-068c-406c-ad39-3eef6ad7bf85.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 4 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 3 + muon_beta2: 0.95 + muon_momentum: 0.98 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.12 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + prequant_ttt_batch_seqs: 16 + prequant_ttt_cosine_decay: True + prequant_ttt_enabled: False + prequant_ttt_epochs: 8 + prequant_ttt_freeze_blocks: 1 + prequant_ttt_grad_clip: 1.0 + prequant_ttt_lr: 0.00045 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 14e6e144-068c-406c-ad39-3eef6ad7bf85 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + 
tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /root/c22_submission/final/data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 524288 + train_files: /root/c22_submission/final/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 10 + train_seq_len: 2048 + ttt_batch_seqs: 16 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 262144 + val_files: /root/c22_submission/final/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 128 +val_tokens: 40540160 +model_params:35988657 +[curriculum] rank=7/8 buckets=10 total_seqs=736249 floor=0.02 +[curriculum] rank=5/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=1/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=6/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=3/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=4/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=2/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=0/8 buckets=10 total_seqs=781248 floor=0.02 +gptq:reserving 12s, effective=588000ms +[IDEA-051 freeze_dry] enabled — linear-combo pruning active +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +[curriculum] rank=7/8 buckets=10 total_seqs=736249 floor=0.02 +[curriculum] rank=5/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=4/8 buckets=10 total_seqs=781248 floor=0.02 
+[curriculum] rank=6/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=2/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=3/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=1/8 buckets=10 total_seqs=781248 floor=0.02 +[curriculum] rank=0/8 buckets=10 total_seqs=781248 floor=0.02 +0/20000 val_loss: 9.0097 val_bpb: 3.4880 +1/20000 train_loss: 9.0022 train_time: 0.1m tok/s: 173967 +2/20000 train_loss: 12.3659 train_time: 0.1m tok/s: 303180 +3/20000 train_loss: 11.3391 train_time: 0.1m tok/s: 422326 +4/20000 train_loss: 9.6591 train_time: 0.1m tok/s: 549746 +5/20000 train_loss: 8.6147 train_time: 0.1m tok/s: 671737 +10/20000 train_loss: 6.7865 train_time: 0.1m tok/s: 1202887 +20/20000 train_loss: 5.8514 train_time: 0.1m tok/s: 1968764 +30/20000 train_loss: 5.5230 train_time: 0.1m tok/s: 2492440 +40/20000 train_loss: 5.2373 train_time: 0.1m tok/s: 2873874 +50/20000 train_loss: 5.1811 train_time: 0.1m tok/s: 3165643 +60/20000 train_loss: 5.0552 train_time: 0.2m tok/s: 3396249 +70/20000 train_loss: 4.8586 train_time: 0.2m tok/s: 3581820 +80/20000 train_loss: 4.8072 train_time: 0.2m tok/s: 3736195 +90/20000 train_loss: 4.6890 train_time: 0.2m tok/s: 3864361 +100/20000 train_loss: 4.4795 train_time: 0.2m tok/s: 3973706 +110/20000 train_loss: 4.4695 train_time: 0.2m tok/s: 4069504 +120/20000 train_loss: 4.3225 train_time: 0.3m tok/s: 4152414 +130/20000 train_loss: 4.1679 train_time: 0.3m tok/s: 4224782 +140/20000 train_loss: 4.0885 train_time: 0.3m tok/s: 4289171 +150/20000 train_loss: 3.9930 train_time: 0.3m tok/s: 4346381 +160/20000 train_loss: 3.8598 train_time: 0.3m tok/s: 4397266 +170/20000 train_loss: 3.8177 train_time: 0.3m tok/s: 4444134 +180/20000 train_loss: 3.8140 train_time: 0.4m tok/s: 4486700 +190/20000 train_loss: 3.7135 train_time: 0.4m tok/s: 4525322 +200/20000 train_loss: 3.7545 train_time: 0.4m tok/s: 4560259 +210/20000 train_loss: 3.5786 train_time: 0.4m tok/s: 4591872 +220/20000 train_loss: 3.6396 train_time: 0.4m 
tok/s: 4621945 +230/20000 train_loss: 3.5501 train_time: 0.4m tok/s: 4650149 +240/20000 train_loss: 3.6339 train_time: 0.4m tok/s: 4675812 +250/20000 train_loss: 3.5715 train_time: 0.5m tok/s: 4700219 +260/20000 train_loss: 3.5706 train_time: 0.5m tok/s: 4722722 +270/20000 train_loss: 3.5175 train_time: 0.5m tok/s: 4743359 +280/20000 train_loss: 3.4010 train_time: 0.5m tok/s: 4763352 +290/20000 train_loss: 3.5402 train_time: 0.5m tok/s: 4782228 +300/20000 train_loss: 3.5158 train_time: 0.5m tok/s: 4799136 +310/20000 train_loss: 3.5078 train_time: 0.6m tok/s: 4815148 +320/20000 train_loss: 3.4383 train_time: 0.6m tok/s: 4830702 +330/20000 train_loss: 3.3979 train_time: 0.6m tok/s: 4845302 +340/20000 train_loss: 3.3990 train_time: 0.6m tok/s: 4858463 +350/20000 train_loss: 3.4638 train_time: 0.6m tok/s: 4871066 +360/20000 train_loss: 3.4349 train_time: 0.6m tok/s: 4883567 +370/20000 train_loss: 3.4044 train_time: 0.7m tok/s: 4895395 +380/20000 train_loss: 3.3681 train_time: 0.7m tok/s: 4906814 +390/20000 train_loss: 3.3985 train_time: 0.7m tok/s: 4917835 +400/20000 train_loss: 3.4361 train_time: 0.7m tok/s: 4927765 +410/20000 train_loss: 3.3920 train_time: 0.7m tok/s: 4937502 +420/20000 train_loss: 3.2995 train_time: 0.7m tok/s: 4946340 +430/20000 train_loss: 3.4089 train_time: 0.8m tok/s: 4955460 +440/20000 train_loss: 3.4005 train_time: 0.8m tok/s: 4963218 +450/20000 train_loss: 3.3219 train_time: 0.8m tok/s: 4971604 +460/20000 train_loss: 3.3885 train_time: 0.8m tok/s: 4979595 +470/20000 train_loss: 3.3418 train_time: 0.8m tok/s: 4987079 +480/20000 train_loss: 3.3474 train_time: 0.8m tok/s: 4994085 +490/20000 train_loss: 3.2420 train_time: 0.9m tok/s: 5001267 +500/20000 train_loss: 3.2640 train_time: 0.9m tok/s: 5007889 +510/20000 train_loss: 3.4524 train_time: 0.9m tok/s: 5014043 +520/20000 train_loss: 3.3640 train_time: 0.9m tok/s: 5020410 +530/20000 train_loss: 3.3926 train_time: 0.9m tok/s: 5026195 +540/20000 train_loss: 3.2852 train_time: 0.9m tok/s: 5031830 
+550/20000 train_loss: 3.2972 train_time: 1.0m tok/s: 5036682
+560/20000 train_loss: 3.3356 train_time: 1.0m tok/s: 5042122
+570/20000 train_loss: 3.2988 train_time: 1.0m tok/s: 5047235
+580/20000 train_loss: 3.2947 train_time: 1.0m tok/s: 5052481
+590/20000 train_loss: 3.3464 train_time: 1.0m tok/s: 5057459
+600/20000 train_loss: 3.3346 train_time: 1.0m tok/s: 5062075
+610/20000 train_loss: 3.3745 train_time: 1.1m tok/s: 5066186
+620/20000 train_loss: 3.2773 train_time: 1.1m tok/s: 5070863
+630/20000 train_loss: 3.3041 train_time: 1.1m tok/s: 5075258
+640/20000 train_loss: 3.3042 train_time: 1.1m tok/s: 5079272
+650/20000 train_loss: 3.2929 train_time: 1.1m tok/s: 5083341
+660/20000 train_loss: 3.2023 train_time: 1.1m tok/s: 5087089
+670/20000 train_loss: 3.1865 train_time: 1.1m tok/s: 5091104
+680/20000 train_loss: 3.3256 train_time: 1.2m tok/s: 5094865
+690/20000 train_loss: 3.2858 train_time: 1.2m tok/s: 5098422
+700/20000 train_loss: 3.2182 train_time: 1.2m tok/s: 5101734
+710/20000 train_loss: 3.2258 train_time: 1.2m tok/s: 5104756
+720/20000 train_loss: 3.2633 train_time: 1.2m tok/s: 5108056
+730/20000 train_loss: 3.1563 train_time: 1.2m tok/s: 5111107
+740/20000 train_loss: 3.3357 train_time: 1.3m tok/s: 5114592
+750/20000 train_loss: 3.2525 train_time: 1.3m tok/s: 5117360
+760/20000 train_loss: 3.1656 train_time: 1.3m tok/s: 5120508
+770/20000 train_loss: 3.2377 train_time: 1.3m tok/s: 5123336
+780/20000 train_loss: 3.3022 train_time: 1.3m tok/s: 5126278
+790/20000 train_loss: 3.3144 train_time: 1.3m tok/s: 5129004
+800/20000 train_loss: 3.2563 train_time: 1.4m tok/s: 5131823
+810/20000 train_loss: 3.3407 train_time: 1.4m tok/s: 5134113
+820/20000 train_loss: 3.2386 train_time: 1.4m tok/s: 5136383
+830/20000 train_loss: 3.2720 train_time: 1.4m tok/s: 5138934
+840/20000 train_loss: 3.3131 train_time: 1.4m tok/s: 5141389
+850/20000 train_loss: 3.2912 train_time: 1.4m tok/s: 5143328
+860/20000 train_loss: 3.1797 train_time: 1.5m tok/s: 5145777
+870/20000 train_loss: 3.2579 train_time: 1.5m tok/s: 5148279
+880/20000 train_loss: 3.1966 train_time: 1.5m tok/s: 5150483
+890/20000 train_loss: 3.3052 train_time: 1.5m tok/s: 5152308
+900/20000 train_loss: 3.3834 train_time: 1.5m tok/s: 5154128
+910/20000 train_loss: 3.1931 train_time: 1.5m tok/s: 5155913
+920/20000 train_loss: 3.1489 train_time: 1.6m tok/s: 5157944
+930/20000 train_loss: 3.2215 train_time: 1.6m tok/s: 5159723
+940/20000 train_loss: 3.2995 train_time: 1.6m tok/s: 5161461
+950/20000 train_loss: 3.2431 train_time: 1.6m tok/s: 5163581
+960/20000 train_loss: 3.2221 train_time: 1.6m tok/s: 5165482
+970/20000 train_loss: 3.2455 train_time: 1.6m tok/s: 5167457
+980/20000 train_loss: 3.2415 train_time: 1.7m tok/s: 5169378
+990/20000 train_loss: 3.2989 train_time: 1.7m tok/s: 5171076
+1000/20000 train_loss: 3.2749 train_time: 1.7m tok/s: 5172963
+1010/20000 train_loss: 3.3173 train_time: 1.7m tok/s: 5174511
+1020/20000 train_loss: 3.2027 train_time: 1.7m tok/s: 5176151
+1030/20000 train_loss: 3.1643 train_time: 1.7m tok/s: 5177773
+1040/20000 train_loss: 3.2916 train_time: 1.8m tok/s: 5179082
+1050/20000 train_loss: 3.1834 train_time: 1.8m tok/s: 5180735
+1060/20000 train_loss: 3.2151 train_time: 1.8m tok/s: 5182289
+1070/20000 train_loss: 3.1885 train_time: 1.8m tok/s: 5183964
+1080/20000 train_loss: 3.1940 train_time: 1.8m tok/s: 5185662
+1090/20000 train_loss: 3.1607 train_time: 1.8m tok/s: 5187058
+1100/20000 train_loss: 3.2096 train_time: 1.9m tok/s: 5188572
+1110/20000 train_loss: 3.2425 train_time: 1.9m tok/s: 5190066
+1120/20000 train_loss: 3.1258 train_time: 1.9m tok/s: 5191237
+1130/20000 train_loss: 3.1681 train_time: 1.9m tok/s: 5192757
+1140/20000 train_loss: 3.2069 train_time: 1.9m tok/s: 5194305
+1150/20000 train_loss: 3.2436 train_time: 1.9m tok/s: 5195545
+1160/20000 train_loss: 3.1288 train_time: 2.0m tok/s: 5196810
+1170/20000 train_loss: 3.2131 train_time: 2.0m tok/s: 5198016
+1180/20000 train_loss: 3.2673 train_time: 2.0m tok/s: 5198996
+1190/20000 train_loss: 3.2100 train_time: 2.0m tok/s: 5200009
+1200/20000 train_loss: 3.1564 train_time: 2.0m tok/s: 5201099
+1210/20000 train_loss: 3.1392 train_time: 2.0m tok/s: 5202313
+1220/20000 train_loss: 3.1892 train_time: 2.0m tok/s: 5203053
+1230/20000 train_loss: 3.1635 train_time: 2.1m tok/s: 5204296
+1240/20000 train_loss: 3.2514 train_time: 2.1m tok/s: 5205381
+1250/20000 train_loss: 3.2830 train_time: 2.1m tok/s: 5206481
+1260/20000 train_loss: 3.2184 train_time: 2.1m tok/s: 5207630
+1270/20000 train_loss: 3.2108 train_time: 2.1m tok/s: 5208771
+1280/20000 train_loss: 3.1799 train_time: 2.1m tok/s: 5209999
+1290/20000 train_loss: 3.2573 train_time: 2.2m tok/s: 5211222
+1300/20000 train_loss: 3.1397 train_time: 2.2m tok/s: 5212152
+1310/20000 train_loss: 3.1931 train_time: 2.2m tok/s: 5213065
+1320/20000 train_loss: 3.2517 train_time: 2.2m tok/s: 5213861
+1330/20000 train_loss: 3.2065 train_time: 2.2m tok/s: 5214930
+1340/20000 train_loss: 3.2077 train_time: 2.2m tok/s: 5215940
+1350/20000 train_loss: 3.1633 train_time: 2.3m tok/s: 5216862
+1360/20000 train_loss: 3.2834 train_time: 2.3m tok/s: 5217896
+1370/20000 train_loss: 3.2868 train_time: 2.3m tok/s: 5218878
+1380/20000 train_loss: 3.1583 train_time: 2.3m tok/s: 5219708
+1390/20000 train_loss: 3.1519 train_time: 2.3m tok/s: 5220699
+1400/20000 train_loss: 3.2879 train_time: 2.3m tok/s: 5221610
+1410/20000 train_loss: 3.1823 train_time: 2.4m tok/s: 5222549
+1420/20000 train_loss: 3.2448 train_time: 2.4m tok/s: 5223378
+1430/20000 train_loss: 3.1190 train_time: 2.4m tok/s: 5224220
+1440/20000 train_loss: 3.2770 train_time: 2.4m tok/s: 5225121
+1450/20000 train_loss: 3.2772 train_time: 2.4m tok/s: 5225928
+1460/20000 train_loss: 3.3210 train_time: 2.4m tok/s: 5226770
+1470/20000 train_loss: 3.2100 train_time: 2.5m tok/s: 5227424
+1480/20000 train_loss: 3.2240 train_time: 2.5m tok/s: 5228270
+1490/20000 train_loss: 3.2304 train_time: 2.5m tok/s: 5229161
+1500/20000 train_loss: 3.1857 train_time: 2.5m tok/s: 5229916
+1510/20000 train_loss: 3.2135 train_time: 2.5m tok/s: 5230791
+1520/20000 train_loss: 3.1891 train_time: 2.5m tok/s: 5231447
+1530/20000 train_loss: 3.1942 train_time: 2.6m tok/s: 5232080
+1540/20000 train_loss: 3.2189 train_time: 2.6m tok/s: 5232771
+1550/20000 train_loss: 3.1786 train_time: 2.6m tok/s: 5233335
+1560/20000 train_loss: 3.2165 train_time: 2.6m tok/s: 5233894
+1570/20000 train_loss: 3.1489 train_time: 2.6m tok/s: 5234307
+1580/20000 train_loss: 3.2651 train_time: 2.6m tok/s: 5235128
+1590/20000 train_loss: 3.2151 train_time: 2.7m tok/s: 5235825
+1600/20000 train_loss: 3.0990 train_time: 2.7m tok/s: 5236476
+1610/20000 train_loss: 3.0779 train_time: 2.7m tok/s: 5237013
+1620/20000 train_loss: 3.2055 train_time: 2.7m tok/s: 5237618
+1630/20000 train_loss: 3.2316 train_time: 2.7m tok/s: 5238234
+1640/20000 train_loss: 3.2003 train_time: 2.7m tok/s: 5238888
+1650/20000 train_loss: 3.1761 train_time: 2.8m tok/s: 5239513
+1660/20000 train_loss: 3.1718 train_time: 2.8m tok/s: 5240176
+1670/20000 train_loss: 2.9702 train_time: 2.8m tok/s: 5240961
+1680/20000 train_loss: 3.1313 train_time: 2.8m tok/s: 5241710
+1690/20000 train_loss: 3.3167 train_time: 2.8m tok/s: 5242150
+1700/20000 train_loss: 3.1821 train_time: 2.8m tok/s: 5242918
+1710/20000 train_loss: 3.2484 train_time: 2.8m tok/s: 5243629
+1720/20000 train_loss: 3.1540 train_time: 2.9m tok/s: 5244351
+1730/20000 train_loss: 3.1662 train_time: 2.9m tok/s: 5244790
+1740/20000 train_loss: 3.0672 train_time: 2.9m tok/s: 5245265
+1750/20000 train_loss: 3.1037 train_time: 2.9m tok/s: 5245794
+1760/20000 train_loss: 3.2393 train_time: 2.9m tok/s: 5246336
+1770/20000 train_loss: 3.2798 train_time: 2.9m tok/s: 5246888
+1780/20000 train_loss: 3.1736 train_time: 3.0m tok/s: 5247436
+1790/20000 train_loss: 3.2183 train_time: 3.0m tok/s: 5247948
+1800/20000 train_loss: 3.1827 train_time: 3.0m tok/s: 5248427
+1810/20000 train_loss: 3.2343 train_time: 3.0m tok/s: 5249054
+1820/20000 train_loss: 3.1274 train_time: 3.0m tok/s: 5249736
+1830/20000 train_loss: 3.1509 train_time: 3.0m tok/s: 5250086
+1840/20000 train_loss: 3.2072 train_time: 3.1m tok/s: 5250519
+1850/20000 train_loss: 3.1918 train_time: 3.1m tok/s: 5250934
+1860/20000 train_loss: 3.1591 train_time: 3.1m tok/s: 5251331
+1870/20000 train_loss: 3.2576 train_time: 3.1m tok/s: 5251757
+1880/20000 train_loss: 3.1276 train_time: 3.1m tok/s: 5252248
+1890/20000 train_loss: 3.1082 train_time: 3.1m tok/s: 5252836
+1900/20000 train_loss: 3.1579 train_time: 3.2m tok/s: 5253391
+1910/20000 train_loss: 3.1124 train_time: 3.2m tok/s: 5254006
+1920/20000 train_loss: 3.0839 train_time: 3.2m tok/s: 5254579
+1930/20000 train_loss: 3.1978 train_time: 3.2m tok/s: 5255019
+1940/20000 train_loss: 3.1024 train_time: 3.2m tok/s: 5255510
+1950/20000 train_loss: 3.1059 train_time: 3.2m tok/s: 5255959
+1960/20000 train_loss: 3.0864 train_time: 3.3m tok/s: 5256536
+1970/20000 train_loss: 3.2073 train_time: 3.3m tok/s: 5257081
+1980/20000 train_loss: 3.1307 train_time: 3.3m tok/s: 5257488
+1990/20000 train_loss: 3.1838 train_time: 3.3m tok/s: 5257985
+2000/20000 train_loss: 3.1073 train_time: 3.3m tok/s: 5258387
+2010/20000 train_loss: 3.1463 train_time: 3.3m tok/s: 5258926
+2020/20000 train_loss: 3.1804 train_time: 3.4m tok/s: 5259469
+2030/20000 train_loss: 3.0891 train_time: 3.4m tok/s: 5259862
+2040/20000 train_loss: 3.2028 train_time: 3.4m tok/s: 5260267
+2050/20000 train_loss: 3.1416 train_time: 3.4m tok/s: 5260681
+2060/20000 train_loss: 3.1005 train_time: 3.4m tok/s: 5261136
+2070/20000 train_loss: 3.0725 train_time: 3.4m tok/s: 5261563
+2080/20000 train_loss: 3.1366 train_time: 3.5m tok/s: 5262004
+2090/20000 train_loss: 3.1342 train_time: 3.5m tok/s: 5262388
+2100/20000 train_loss: 3.1207 train_time: 3.5m tok/s: 5262914
+2110/20000 train_loss: 3.1659 train_time: 3.5m tok/s: 5263307
+2120/20000 train_loss: 3.0892 train_time: 3.5m tok/s: 5263654
+2130/20000 train_loss: 3.0009 train_time: 3.5m tok/s: 5264170
+2140/20000 train_loss: 3.2293 train_time: 3.6m tok/s: 5264471
+2150/20000 train_loss: 3.1351 train_time: 3.6m tok/s: 5265041
+2160/20000 train_loss: 3.1396 train_time: 3.6m tok/s: 5265461
+2170/20000 train_loss: 3.1895 train_time: 3.6m tok/s: 5265902
+2180/20000 train_loss: 3.1127 train_time: 3.6m tok/s: 5266359
+2190/20000 train_loss: 3.1237 train_time: 3.6m tok/s: 5266807
+2200/20000 train_loss: 3.2219 train_time: 3.6m tok/s: 5267089
+2210/20000 train_loss: 3.1292 train_time: 3.7m tok/s: 5267529
+2220/20000 train_loss: 3.1194 train_time: 3.7m tok/s: 5267960
+2230/20000 train_loss: 3.0873 train_time: 3.7m tok/s: 5268167
+2240/20000 train_loss: 3.1689 train_time: 3.7m tok/s: 5268380
+2250/20000 train_loss: 3.0434 train_time: 3.7m tok/s: 5268637
+2260/20000 train_loss: 3.1346 train_time: 3.7m tok/s: 5268905
+2270/20000 train_loss: 3.1918 train_time: 3.8m tok/s: 5269267
+2280/20000 train_loss: 3.1647 train_time: 3.8m tok/s: 5269674
+2290/20000 train_loss: 3.1441 train_time: 3.8m tok/s: 5270132
+2300/20000 train_loss: 3.2093 train_time: 3.8m tok/s: 5270603
+2310/20000 train_loss: 2.9768 train_time: 3.8m tok/s: 5271046
+2320/20000 train_loss: 3.1263 train_time: 3.8m tok/s: 5271489
+2330/20000 train_loss: 3.1724 train_time: 3.9m tok/s: 5271849
+2340/20000 train_loss: 3.0578 train_time: 3.9m tok/s: 5272281
+2350/20000 train_loss: 3.1333 train_time: 3.9m tok/s: 5272697
+2360/20000 train_loss: 3.1888 train_time: 3.9m tok/s: 5273038
+2370/20000 train_loss: 3.1281 train_time: 3.9m tok/s: 5273322
+2380/20000 train_loss: 3.0697 train_time: 3.9m tok/s: 5273656
+2390/20000 train_loss: 3.0427 train_time: 4.0m tok/s: 5274074
+2400/20000 train_loss: 3.1847 train_time: 4.0m tok/s: 5274405
+2410/20000 train_loss: 3.1503 train_time: 4.0m tok/s: 5274702
+2420/20000 train_loss: 3.0619 train_time: 4.0m tok/s: 5275109
+2430/20000 train_loss: 3.1512 train_time: 4.0m tok/s: 5275562
+2440/20000 train_loss: 3.1880 train_time: 4.0m tok/s: 5275858
+2450/20000 train_loss: 3.2502 train_time: 4.1m tok/s: 5276233
+2460/20000 train_loss: 3.1510 train_time: 4.1m tok/s: 5276538
+2470/20000 train_loss: 3.1540 train_time: 4.1m tok/s: 5276952
+2480/20000 train_loss: 3.3277 train_time: 4.1m tok/s: 5277127
+2490/20000 train_loss: 3.2228 train_time: 4.1m tok/s: 5277354
+2500/20000 train_loss: 3.1040 train_time: 4.1m tok/s: 5277634
+2510/20000 train_loss: 3.2352 train_time: 4.2m tok/s: 5278005
+2520/20000 train_loss: 3.2358 train_time: 4.2m tok/s: 5278379
+2530/20000 train_loss: 3.1890 train_time: 4.2m tok/s: 5278622
+2540/20000 train_loss: 3.0171 train_time: 4.2m tok/s: 5278849
+2550/20000 train_loss: 3.1396 train_time: 4.2m tok/s: 5279095
+2560/20000 train_loss: 3.0139 train_time: 4.2m tok/s: 5279405
+2570/20000 train_loss: 3.1507 train_time: 4.3m tok/s: 5279801
+2580/20000 train_loss: 3.2117 train_time: 4.3m tok/s: 5280179
+2590/20000 train_loss: 3.1335 train_time: 4.3m tok/s: 5280465
+2600/20000 train_loss: 3.1178 train_time: 4.3m tok/s: 5280730
+2610/20000 train_loss: 3.1337 train_time: 4.3m tok/s: 5281049
+2620/20000 train_loss: 3.1365 train_time: 4.3m tok/s: 5281403
+2630/20000 train_loss: 3.0603 train_time: 4.4m tok/s: 5281713
+2640/20000 train_loss: 3.1884 train_time: 4.4m tok/s: 5282027
+2650/20000 train_loss: 3.0359 train_time: 4.4m tok/s: 5282261
+2660/20000 train_loss: 3.1576 train_time: 4.4m tok/s: 5282574
+2670/20000 train_loss: 3.1251 train_time: 4.4m tok/s: 5282935
+2680/20000 train_loss: 3.0913 train_time: 4.4m tok/s: 5283276
+2690/20000 train_loss: 3.0849 train_time: 4.4m tok/s: 5283502
+2700/20000 train_loss: 3.0982 train_time: 4.5m tok/s: 5283717
+2710/20000 train_loss: 3.2274 train_time: 4.5m tok/s: 5283910
+2720/20000 train_loss: 3.0868 train_time: 4.5m tok/s: 5284177
+2730/20000 train_loss: 3.1545 train_time: 4.5m tok/s: 5284360
+2740/20000 train_loss: 3.0889 train_time: 4.5m tok/s: 5284662
+2750/20000 train_loss: 3.0630 train_time: 4.5m tok/s: 5284882
+2760/20000 train_loss: 3.1808 train_time: 4.6m tok/s: 5285197
+2770/20000 train_loss: 3.1247 train_time: 4.6m tok/s: 5285521
+2780/20000 train_loss: 3.1307 train_time: 4.6m tok/s: 5285824
+2790/20000 train_loss: 3.1975 train_time: 4.6m tok/s: 5286110
+2800/20000 train_loss: 3.2013 train_time: 4.6m tok/s: 5286426
+2810/20000 train_loss: 3.1840 train_time: 4.6m tok/s: 5286709
+2820/20000 train_loss: 3.0829 train_time: 4.7m tok/s: 5286911
+2830/20000 train_loss: 3.1674 train_time: 4.7m tok/s: 5287129
+2840/20000 train_loss: 3.1718 train_time: 4.7m tok/s: 5287370
+2850/20000 train_loss: 3.1851 train_time: 4.7m tok/s: 5287583
+2860/20000 train_loss: 3.0879 train_time: 4.7m tok/s: 5287749
+2870/20000 train_loss: 3.1239 train_time: 4.7m tok/s: 5287892
+2880/20000 train_loss: 3.1009 train_time: 4.8m tok/s: 5288029
+2890/20000 train_loss: 3.1722 train_time: 4.8m tok/s: 5288166
+2900/20000 train_loss: 3.1239 train_time: 4.8m tok/s: 5288422
+2910/20000 train_loss: 3.1388 train_time: 4.8m tok/s: 5288680
+2920/20000 train_loss: 3.1550 train_time: 4.8m tok/s: 5288891
+2930/20000 train_loss: 3.1035 train_time: 4.8m tok/s: 5289017
+2940/20000 train_loss: 3.0052 train_time: 4.9m tok/s: 5289265
+2950/20000 train_loss: 3.1439 train_time: 4.9m tok/s: 5289482
+2960/20000 train_loss: 3.1524 train_time: 4.9m tok/s: 5289615
+layer_loop:enabled step:2967 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+2970/20000 train_loss: 3.4507 train_time: 4.9m tok/s: 5288346
+2980/20000 train_loss: 3.2156 train_time: 4.9m tok/s: 5284216
+2990/20000 train_loss: 3.1260 train_time: 4.9m tok/s: 5280120
+3000/20000 train_loss: 3.1999 train_time: 5.0m tok/s: 5276101
+3010/20000 train_loss: 3.2192 train_time: 5.0m tok/s: 5271996
+3020/20000 train_loss: 3.1519 train_time: 5.0m tok/s: 5267901
+3030/20000 train_loss: 3.1365 train_time: 5.0m tok/s: 5263973
+3040/20000 train_loss: 3.1089 train_time: 5.1m tok/s: 5259880
+3050/20000 train_loss: 3.1210 train_time: 5.1m tok/s: 5255811
+3060/20000 train_loss: 3.0455 train_time: 5.1m tok/s: 5251888
+3070/20000 train_loss: 3.1581 train_time: 5.1m tok/s: 5248081
+3080/20000 train_loss: 3.1359 train_time: 5.1m tok/s: 5244313
+3090/20000 train_loss: 3.1220 train_time: 5.2m tok/s: 5240414
+3100/20000 train_loss: 3.1663 train_time: 5.2m tok/s: 5236677
+3110/20000 train_loss: 3.1378 train_time: 5.2m tok/s: 5232973
+3120/20000 train_loss: 3.2145 train_time: 5.2m tok/s: 5229294
+3130/20000 train_loss: 3.1610 train_time: 5.2m tok/s: 5225676
+3140/20000 train_loss: 3.0795 train_time: 5.3m tok/s: 5222037
+3150/20000 train_loss: 2.9870 train_time: 5.3m tok/s: 5218479
+3160/20000 train_loss: 3.0975 train_time: 5.3m tok/s: 5214905
+3170/20000 train_loss: 3.0470 train_time: 5.3m tok/s: 5211320
+3180/20000 train_loss: 3.1903 train_time: 5.3m tok/s: 5207862
+3190/20000 train_loss: 3.0365 train_time: 5.4m tok/s: 5204380
+3200/20000 train_loss: 3.1228 train_time: 5.4m tok/s: 5200825
+3210/20000 train_loss: 3.1466 train_time: 5.4m tok/s: 5197258
+3220/20000 train_loss: 3.1461 train_time: 5.4m tok/s: 5193744
+3230/20000 train_loss: 3.1368 train_time: 5.4m tok/s: 5190285
+3240/20000 train_loss: 3.2413 train_time: 5.5m tok/s: 5186827
+3250/20000 train_loss: 3.1526 train_time: 5.5m tok/s: 5183389
+3260/20000 train_loss: 3.0316 train_time: 5.5m tok/s: 5180086
+3270/20000 train_loss: 3.0984 train_time: 5.5m tok/s: 5176815
+3280/20000 train_loss: 3.1027 train_time: 5.5m tok/s: 5173490
+3290/20000 train_loss: 3.2768 train_time: 5.6m tok/s: 5170105
+3300/20000 train_loss: 3.0533 train_time: 5.6m tok/s: 5166914
+3310/20000 train_loss: 3.0210 train_time: 5.6m tok/s: 5163748
+3320/20000 train_loss: 3.0881 train_time: 5.6m tok/s: 5160607
+3330/20000 train_loss: 3.0624 train_time: 5.6m tok/s: 5157439
+3340/20000 train_loss: 3.0593 train_time: 5.7m tok/s: 5154328
+3350/20000 train_loss: 3.1036 train_time: 5.7m tok/s: 5151239
+3360/20000 train_loss: 3.0687 train_time: 5.7m tok/s: 5148156
+3370/20000 train_loss: 2.9950 train_time: 5.7m tok/s: 5145037
+3380/20000 train_loss: 3.1624 train_time: 5.7m tok/s: 5142024
+3390/20000 train_loss: 3.1330 train_time: 5.8m tok/s: 5139035
+3400/20000 train_loss: 3.0643 train_time: 5.8m tok/s: 5136063
+3410/20000 train_loss: 3.0937 train_time: 5.8m tok/s: 5133111
+3420/20000 train_loss: 3.2158 train_time: 5.8m tok/s: 5130137
+3430/20000 train_loss: 3.0266 train_time: 5.8m tok/s: 5127197
+3440/20000 train_loss: 3.0725 train_time: 5.9m tok/s: 5124287
+3450/20000 train_loss: 3.0514 train_time: 5.9m tok/s: 5121348
+3460/20000 train_loss: 3.2045 train_time: 5.9m tok/s: 5118504
+3470/20000 train_loss: 3.1372 train_time: 5.9m tok/s: 5115651
+3480/20000 train_loss: 3.0854 train_time: 5.9m tok/s: 5112836
+3490/20000 train_loss: 3.0816 train_time: 6.0m tok/s: 5110041
+3500/20000 train_loss: 3.0845 train_time: 6.0m tok/s: 5107274
+3510/20000 train_loss: 3.0916 train_time: 6.0m tok/s: 5104501
+3520/20000 train_loss: 3.0708 train_time: 6.0m tok/s: 5101741
+3530/20000 train_loss: 3.1318 train_time: 6.0m tok/s: 5098942
+3540/20000 train_loss: 3.1601 train_time: 6.1m tok/s: 5096244
+3550/20000 train_loss: 3.0456 train_time: 6.1m tok/s: 5093551
+3560/20000 train_loss: 3.0538 train_time: 6.1m tok/s: 5090881
+3570/20000 train_loss: 3.0741 train_time: 6.1m tok/s: 5088226
+3580/20000 train_loss: 3.0525 train_time: 6.2m tok/s: 5085605
+3590/20000 train_loss: 3.0191 train_time: 6.2m tok/s: 5082974
+3600/20000 train_loss: 3.0916 train_time: 6.2m tok/s: 5080339
+3610/20000 train_loss: 3.0311 train_time: 6.2m tok/s: 5077714
+3620/20000 train_loss: 3.1585 train_time: 6.2m tok/s: 5075135
+3630/20000 train_loss: 2.9955 train_time: 6.3m tok/s: 5072578
+3640/20000 train_loss: 3.0325 train_time: 6.3m tok/s: 5070068
+3650/20000 train_loss: 3.2001 train_time: 6.3m tok/s: 5067547
+3660/20000 train_loss: 3.0171 train_time: 6.3m tok/s: 5064973
+3670/20000 train_loss: 3.1026 train_time: 6.3m tok/s: 5062504
+3680/20000 train_loss: 3.0714 train_time: 6.4m tok/s: 5060038
+3690/20000 train_loss: 3.1117 train_time: 6.4m tok/s: 5057511
+3700/20000 train_loss: 3.0944 train_time: 6.4m tok/s: 5054957
+3710/20000 train_loss: 3.1195 train_time: 6.4m tok/s: 5052469
+3720/20000 train_loss: 3.0921 train_time: 6.4m tok/s: 5049995
+3730/20000 train_loss: 3.0324 train_time: 6.5m tok/s: 5047635
+3740/20000 train_loss: 2.9776 train_time: 6.5m tok/s: 5045264
+3750/20000 train_loss: 3.0307 train_time: 6.5m tok/s: 5042862
+3760/20000 train_loss: 2.9648 train_time: 6.5m tok/s: 5040467
+3770/20000 train_loss: 3.0730 train_time: 6.5m tok/s: 5038125
+3780/20000 train_loss: 3.0832 train_time: 6.6m tok/s: 5035808
+3790/20000 train_loss: 3.1584 train_time: 6.6m tok/s: 5033489
+3800/20000 train_loss: 3.0569 train_time: 6.6m tok/s: 5031108
+3810/20000 train_loss: 3.1412 train_time: 6.6m tok/s: 5028734
+3820/20000 train_loss: 3.0331 train_time: 6.6m tok/s: 5026492
+3830/20000 train_loss: 2.9891 train_time: 6.7m tok/s: 5024268
+3840/20000 train_loss: 3.0387 train_time: 6.7m tok/s: 5022045
+3850/20000 train_loss: 3.0927 train_time: 6.7m tok/s: 5019796
+3860/20000 train_loss: 3.0810 train_time: 6.7m tok/s: 5017590
+3870/20000 train_loss: 3.0380 train_time: 6.7m tok/s: 5015417
+3880/20000 train_loss: 3.0952 train_time: 6.8m tok/s: 5013242
+3890/20000 train_loss: 3.0991 train_time: 6.8m tok/s: 5011054
+3900/20000 train_loss: 2.9844 train_time: 6.8m tok/s: 5008857
+3910/20000 train_loss: 3.0763 train_time: 6.8m tok/s: 5006683
+3920/20000 train_loss: 3.1971 train_time: 6.8m tok/s: 5004565
+3930/20000 train_loss: 3.1466 train_time: 6.9m tok/s: 5002459
+3940/20000 train_loss: 3.1056 train_time: 6.9m tok/s: 5000250
+3950/20000 train_loss: 2.9900 train_time: 6.9m tok/s: 4998083
+3960/20000 train_loss: 3.0537 train_time: 6.9m tok/s: 4995921
+3970/20000 train_loss: 3.0664 train_time: 6.9m tok/s: 4993871
+3980/20000 train_loss: 3.1551 train_time: 7.0m tok/s: 4991727
+3990/20000 train_loss: 3.0927 train_time: 7.0m tok/s: 4989592
+4000/20000 train_loss: 2.9807 train_time: 7.0m tok/s: 4987486
+4000/20000 val_loss: 3.0103 val_bpb: 1.1654
+4010/20000 train_loss: 2.9798 train_time: 7.0m tok/s: 4986220
+4020/20000 train_loss: 3.0665 train_time: 7.0m tok/s: 4984266
+4030/20000 train_loss: 3.0326 train_time: 7.1m tok/s: 4982260
+4040/20000 train_loss: 3.0177 train_time: 7.1m tok/s: 4980265
+4050/20000 train_loss: 3.0981 train_time: 7.1m tok/s: 4978206
+4060/20000 train_loss: 3.0491 train_time: 7.1m tok/s: 4976227
+4070/20000 train_loss: 3.0731 train_time: 7.1m tok/s: 4974279
+4080/20000 train_loss: 3.0518 train_time: 7.2m tok/s: 4972224
+4090/20000 train_loss: 3.0611 train_time: 7.2m tok/s: 4970366
+4100/20000 train_loss: 3.0992 train_time: 7.2m tok/s: 4968324
+4110/20000 train_loss: 3.0971 train_time: 7.2m tok/s: 4966342
+4120/20000 train_loss: 3.0190 train_time: 7.3m tok/s: 4964394
+4130/20000 train_loss: 3.0721 train_time: 7.3m tok/s: 4962410
+4140/20000 train_loss: 3.0541 train_time: 7.3m tok/s: 4960517
+4150/20000 train_loss: 3.0200 train_time: 7.3m tok/s: 4958620
+4160/20000 train_loss: 3.0859 train_time: 7.3m tok/s: 4956730
+4170/20000 train_loss: 3.1846 train_time: 7.4m tok/s: 4954799
+4180/20000 train_loss: 2.9746 train_time: 7.4m tok/s: 4953101
+4190/20000 train_loss: 3.0019 train_time: 7.4m tok/s: 4951178
+4200/20000 train_loss: 2.9392 train_time: 7.4m tok/s: 4949457
+4210/20000 train_loss: 3.1385 train_time: 7.4m tok/s: 4947170
+4220/20000 train_loss: 3.0932 train_time: 7.5m tok/s: 4944979
+4230/20000 train_loss: 3.0720 train_time: 7.5m tok/s: 4943281
+4240/20000 train_loss: 3.0370 train_time: 7.5m tok/s: 4941471
+4250/20000 train_loss: 3.1894 train_time: 7.5m tok/s: 4939710
+4260/20000 train_loss: 3.0343 train_time: 7.5m tok/s: 4937965
+4270/20000 train_loss: 3.0201 train_time: 7.6m tok/s: 4936203
+4280/20000 train_loss: 3.0530 train_time: 7.6m tok/s: 4934449
+4290/20000 train_loss: 2.9526 train_time: 7.6m tok/s: 4932738
+4300/20000 train_loss: 2.9983 train_time: 7.6m tok/s: 4930924
+4310/20000 train_loss: 3.0564 train_time: 7.6m tok/s: 4929208
+4320/20000 train_loss: 3.0998 train_time: 7.7m tok/s: 4927453
+4330/20000 train_loss: 3.0594 train_time: 7.7m tok/s: 4925763
+4340/20000 train_loss: 3.1371 train_time: 7.7m tok/s: 4924057
+4350/20000 train_loss: 3.0477 train_time: 7.7m tok/s: 4922355
+4360/20000 train_loss: 3.0531 train_time: 7.7m tok/s: 4920701
+4370/20000 train_loss: 3.0231 train_time: 7.8m tok/s: 4919047
+4380/20000 train_loss: 3.0297 train_time: 7.8m tok/s: 4917337
+4390/20000 train_loss: 2.9120 train_time: 7.8m tok/s: 4915703
+4400/20000 train_loss: 3.0806 train_time: 7.8m tok/s: 4914077
+4410/20000 train_loss: 3.0562 train_time: 7.8m tok/s: 4909599
+4420/20000 train_loss: 2.9258 train_time: 7.9m tok/s: 4905124
+4430/20000 train_loss: 3.0166 train_time: 7.9m tok/s: 4903531
+4440/20000 train_loss: 3.2090 train_time: 7.9m tok/s: 4901935
+4450/20000 train_loss: 2.9775 train_time: 7.9m tok/s: 4900349
+4460/20000 train_loss: 3.1030 train_time: 8.0m tok/s: 4898790
+4470/20000 train_loss: 2.9980 train_time: 8.0m tok/s: 4897203
+4480/20000 train_loss: 3.0998 train_time: 8.0m tok/s: 4895634
+4490/20000 train_loss: 3.0083 train_time: 8.0m tok/s: 4894120
+4500/20000 train_loss: 3.1428 train_time: 8.0m tok/s: 4892549
+4510/20000 train_loss: 2.9943 train_time: 8.1m tok/s: 4891038
+4520/20000 train_loss: 2.9417 train_time: 8.1m tok/s: 4889500
+4530/20000 train_loss: 3.0019 train_time: 8.1m tok/s: 4887988
+4540/20000 train_loss: 3.0968 train_time: 8.1m tok/s: 4886450
+4550/20000 train_loss: 3.0330 train_time: 8.1m tok/s: 4884968
+4560/20000 train_loss: 3.0420 train_time: 8.2m tok/s: 4883437
+4570/20000 train_loss: 3.0263 train_time: 8.2m tok/s: 4881947
+4580/20000 train_loss: 3.0777 train_time: 8.2m tok/s: 4880421
+4590/20000 train_loss: 2.9558 train_time: 8.2m tok/s: 4878951
+4600/20000 train_loss: 3.0225 train_time: 8.2m tok/s: 4877487
+4610/20000 train_loss: 3.0768 train_time: 8.3m tok/s: 4876029
+4620/20000 train_loss: 3.0500 train_time: 8.3m tok/s: 4874530
+4630/20000 train_loss: 2.9775 train_time: 8.3m tok/s: 4873101
+4640/20000 train_loss: 3.0396 train_time: 8.3m tok/s: 4871644
+4650/20000 train_loss: 2.9694 train_time: 8.3m tok/s: 4870234
+4660/20000 train_loss: 3.0079 train_time: 8.4m tok/s: 4868816
+4670/20000 train_loss: 3.0048 train_time: 8.4m tok/s: 4867394
+4680/20000 train_loss: 3.0804 train_time: 8.4m tok/s: 4865929
+4690/20000 train_loss: 2.9996 train_time: 8.4m tok/s: 4864494
+4700/20000 train_loss: 3.0047 train_time: 8.4m tok/s: 4863089
+4710/20000 train_loss: 2.9277 train_time: 8.5m tok/s: 4861725
+4720/20000 train_loss: 3.0403 train_time: 8.5m tok/s: 4860340
+4730/20000 train_loss: 2.9676 train_time: 8.5m tok/s: 4858963
+4740/20000 train_loss: 3.0569 train_time: 8.5m tok/s: 4857606
+4750/20000 train_loss: 2.9152 train_time: 8.5m tok/s: 4856254
+4760/20000 train_loss: 3.0439 train_time: 8.6m tok/s: 4854896
+4770/20000 train_loss: 2.9415 train_time: 8.6m tok/s: 4853561
+4780/20000 train_loss: 3.0749 train_time: 8.6m tok/s: 4852187
+4790/20000 train_loss: 3.0306 train_time: 8.6m tok/s: 4850820
+4800/20000 train_loss: 3.0233 train_time: 8.6m tok/s: 4849485
+4810/20000 train_loss: 3.0065 train_time: 8.7m tok/s: 4848159
+4820/20000 train_loss: 2.9936 train_time: 8.7m tok/s: 4846857
+4830/20000 train_loss: 2.9778 train_time: 8.7m tok/s: 4845536
+4840/20000 train_loss: 2.9463 train_time: 8.7m tok/s: 4844236
+4850/20000 train_loss: 3.0220 train_time: 8.8m tok/s: 4842937
+4860/20000 train_loss: 3.0473 train_time: 8.8m tok/s: 4841641
+4870/20000 train_loss: 2.9388 train_time: 8.8m tok/s: 4840327
+4880/20000 train_loss: 2.9919 train_time: 8.8m tok/s: 4839057
+4890/20000 train_loss: 3.0392 train_time: 8.8m tok/s: 4837782
+4900/20000 train_loss: 3.0250 train_time: 8.9m tok/s: 4836508
+4910/20000 train_loss: 3.0191 train_time: 8.9m tok/s: 4835248
+4920/20000 train_loss: 3.0762 train_time: 8.9m tok/s: 4833970
+4930/20000 train_loss: 3.0258 train_time: 8.9m tok/s: 4832715
+4940/20000 train_loss: 2.9788 train_time: 8.9m tok/s: 4831448
+4950/20000 train_loss: 2.9633 train_time: 9.0m tok/s: 4830180
+4960/20000 train_loss: 2.9601 train_time: 9.0m tok/s: 4828946
+4970/20000 train_loss: 3.0912 train_time: 9.0m tok/s: 4827737
+4980/20000 train_loss: 3.0789 train_time: 9.0m tok/s: 4826500
+4990/20000 train_loss: 2.9878 train_time: 9.0m tok/s: 4825253
+5000/20000 train_loss: 3.0409 train_time: 9.1m tok/s: 4824045
+5010/20000 train_loss: 3.0601 train_time: 9.1m tok/s: 4822831
+5020/20000 train_loss: 2.9700 train_time: 9.1m tok/s: 4821644
+5030/20000 train_loss: 3.0196 train_time: 9.1m tok/s: 4820401
+5040/20000 train_loss: 3.0277 train_time: 9.1m tok/s: 4819198
+5050/20000 train_loss: 2.9363 train_time: 9.2m tok/s: 4818019
+5060/20000 train_loss: 3.1229 train_time: 9.2m tok/s: 4816815
+5070/20000 train_loss: 2.9644 train_time: 9.2m tok/s: 4815592
+5080/20000 train_loss: 2.9253 train_time: 9.2m tok/s: 4814433
+5090/20000 train_loss: 2.9926 train_time: 9.2m tok/s: 4813255
+5100/20000 train_loss: 2.9541 train_time: 9.3m tok/s: 4812114
+5110/20000 train_loss: 2.9522 train_time: 9.3m tok/s: 4810978
+5120/20000 train_loss: 2.9237 train_time: 9.3m tok/s: 4809834
+5130/20000 train_loss: 2.9170 train_time: 9.3m tok/s: 4808654
+5140/20000 train_loss: 2.9938 train_time: 9.3m tok/s: 4807531
+5150/20000 train_loss: 3.0446 train_time: 9.4m tok/s: 4806405
+5160/20000 train_loss: 2.8897 train_time: 9.4m tok/s: 4805284
+5170/20000 train_loss: 2.9265 train_time: 9.4m tok/s: 4804179
+5180/20000 train_loss: 2.9961 train_time: 9.4m tok/s: 4803085
+5190/20000 train_loss: 2.9247 train_time: 9.4m tok/s: 4801959
+5200/20000 train_loss: 2.9676 train_time: 9.5m tok/s: 4800849
+5210/20000 train_loss: 2.8923 train_time: 9.5m tok/s: 4799729
+5220/20000 train_loss: 2.9386 train_time: 9.5m tok/s: 4798621
+5230/20000 train_loss: 2.9695 train_time: 9.5m tok/s: 4797527
+5240/20000 train_loss: 3.0019 train_time: 9.5m tok/s: 4796444
+5250/20000 train_loss: 2.9655 train_time: 9.6m tok/s: 4795355
+5260/20000 train_loss: 2.9879 train_time: 9.6m tok/s: 4794276
+5270/20000 train_loss: 2.9431 train_time: 9.6m tok/s: 4793167
+5280/20000 train_loss: 3.0048 train_time: 9.6m tok/s: 4792106
+5290/20000 train_loss: 3.0201 train_time: 9.6m tok/s: 4791023
+5300/20000 train_loss: 3.0588 train_time: 9.7m tok/s: 4789959
+5310/20000 train_loss: 2.9134 train_time: 9.7m tok/s: 4788927
+5320/20000 train_loss: 2.9108 train_time: 9.7m tok/s: 4787879
+5330/20000 train_loss: 2.9876 train_time: 9.7m tok/s: 4786832
+5340/20000 train_loss: 2.8651 train_time: 9.8m tok/s: 4785776
+5350/20000 train_loss: 2.9030 train_time: 9.8m tok/s: 4784721
+5360/20000 train_loss: 2.9947 train_time: 9.8m tok/s: 4783650
+5365/20000 val_loss: 2.8445 val_bpb: 1.1012
+stopping_early: wallclock_cap train_time: 588067ms step: 5365/20000
+peak memory allocated: 25639 MiB reserved: 25652 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.84180086 val_bpb:1.10016250 eval_time:6486ms
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+[prefetch] daemon started: depth=4 pinned=True
+Serialized model: 135615079 bytes
+Code size: 150905 bytes
+[prefetch] daemon started: depth=4 pinned=True
+GPTQ:collecting Hessians from calibration data...
+[prefetch] daemon started: depth=4 pinned=True
+GPTQ:collected 67 Hessians in 8.2s
+[IDEA-064 parallel_gptq] enabled — multi-clip search active
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.571056
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.578086
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.568152
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.577732
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.549580
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.561189
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.561181
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.567172
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.774064
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.728812
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.769389
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.747667
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.728488
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.761968
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.753155
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=8.771204
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.788388
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.800119
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.790998
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.778914
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.755060
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.769952
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.779055
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.785358
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.259920
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.258698
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.267428
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.257253
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.261187
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.261114
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.257230
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=3.262680
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=129.566437
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=129.729497
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=129.030914
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=129.704975
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=129.516602
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=129.372567
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=129.411788
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=129.414251
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.317895
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.314623
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.324824
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.312048
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.319844
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.318479
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.319880
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.320092
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.652251
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.640240
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.646203
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.631279
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.642526
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.642950
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.647804
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=7.643858
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.885748
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.858143
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.860826
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.856677
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.870076
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.873275
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.863361
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.871879
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.130646
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.126240
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.131094
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.130284
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.130228
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.130453
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.126790
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.129508
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=31.462836
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=31.558518
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=31.541674
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=31.472183
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=31.530579
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=31.597634
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=31.482141
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=31.565800
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.513674
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.518076
+[IDEA-064 parallel_gptq] searched 
50 clips × 512 rows using 64 workers, avg_best_err=4.510703 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.514002 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.504587 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.512391 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.512841 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=4.513601 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=12.812544 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=12.796679 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=12.814376 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=12.807729 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=12.791708 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=12.811462 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=12.809066 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=12.810678 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.533133 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.524100 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.534920 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.534584 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.518319 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.533490 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=6.531455 +[IDEA-064 parallel_gptq] 
searched 50 clips × 256 rows using 64 workers, avg_best_err=6.535766 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.692033 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.690741 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.690052 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.687906 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.692504 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.689487 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.688567 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.688802 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=13.178252 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=13.211455 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=13.199425 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=13.217078 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=13.226313 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=13.214792 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=13.200570 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=13.207002 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.844794 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.844639 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.835557 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.845556 +[IDEA-064 
parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.847182 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.834094 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.852150 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.960284 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.845847 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.958640 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.953152 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.958262 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.960886 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.950570 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.965128 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.674880 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.959588 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.673496 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.671338 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.672800 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.673843 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.663648 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.677868 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=4.674358 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.632700 +[IDEA-064 
parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.631990 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.630834 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.631115 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.632300 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.632165 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.630541 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.632244 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.550459 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.541399 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.545334 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.562027 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.559340 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.534900 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.566319 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=12.546015 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.954361 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.957887 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.959562 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.950346 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.958443 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.955675 
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.955200 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.735032 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.726770 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.723551 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.956484 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.712055 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.729407 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.731078 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.722470 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.691371 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.695067 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.689358 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.730878 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.695101 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.683601 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.698364 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.679193 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.694443 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.226102 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.235110 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.237664 
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.235958 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.231851 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.227449 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.231504 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=2.228750 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.715795 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.711304 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.730964 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.744871 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.756280 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.754588 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.738556 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=14.756482 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.330061 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.323132 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.325084 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.324308 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.322364 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.325223 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.791671 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, 
avg_best_err=6.323185 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.797667 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=6.320941 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.797496 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.793382 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.790941 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.794735 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.332537 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.336457 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.783287 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=10.788154 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.339001 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.330972 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.343582 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.341588 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.334499 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=9.337073 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.129065 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.136097 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.135834 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.135231 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 
workers, avg_best_err=1.127913 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.130709 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.130967 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.133239 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.653261 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.691728 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.695204 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.717697 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.709961 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.735134 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.729356 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=18.708995 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.883077 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.884405 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.885089 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.312930 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.885263 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.883673 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.313809 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.317538 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.885891 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 
64 workers, avg_best_err=1.884920 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.882150 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.627112 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.316948 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.640664 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.318951 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.627309 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.317493 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.315403 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.313876 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.632427 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.625586 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.636939 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.631630 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.620171 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.217777 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.220627 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.221366 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.221077 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.218501 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.220080 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 
workers, avg_best_err=0.220147 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.650231 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.219403 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.657006 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.652762 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.658110 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.662682 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.661448 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.553578 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.661035 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.656546 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.553044 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.555543 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.204441 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.202514 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.206771 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.556887 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.551722 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.634283 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.554515 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.633185 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 
workers, avg_best_err=3.630067 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.555956 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.551216 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.206915 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.201503 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.204851 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.205383 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.198738 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.633215 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.631558 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.638922 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.634438 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.656811 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.656698 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.636186 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.626145 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.665160 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.643831 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.005030 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.666569 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.007411 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 
workers, avg_best_err=1.001986 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.655218 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=5.645537 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.079503 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.006096 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.011812 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.082441 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.009689 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.081275 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.008908 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.008257 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.387758 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.396679 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.389926 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.973458 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.984929 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.979028 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.081783 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.084270 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.083215 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.082905 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 
workers, avg_best_err=1.079702 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.401451 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.397394 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.398221 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.273935 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.397702 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=2.385824 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.279604 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.278871 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.994402 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.983016 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.990907 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.985568 +[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=3.973958 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.762038 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.762889 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.758433 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.281490 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.277889 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.281505 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.281645 +[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 
workers, avg_best_err=1.275121
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.891041
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.893482
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.891036
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.761849
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.767734
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.611743
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.765478
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.619933
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.764616
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.612671
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.763588
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.886203
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.895035
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.887585
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.890005
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.892496
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.892844
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.890739
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.890510
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.610884
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.616337
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.264001
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.617583
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.264876
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.264169
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.615125
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.610486
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.884900
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.891670
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.891829
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.888127
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.885109
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.755928
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.756872
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.751851
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.264102
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.264202
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.264767
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.264953
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.264633
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.519282
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.519726
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.519690
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.762052
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.698487
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.755664
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.700595
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.759902
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.701334
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.758588
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.757231
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.294295
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.296599
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.299809
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.519285
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.519716
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.518720
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.519366
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.519910
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.696627
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.880101
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.698538
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.879112
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.698826
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.881730
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.698985
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=0.700825
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.294547
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.298955
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.296796
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.296651
+[IDEA-064 parallel_gptq] searched 50 clips × 256 rows using 64 workers, avg_best_err=1.298269
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.326201
+Quantized weights:
+ gptq (int5): tok_emb.weight
+ gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+ passthrough (float16): blocks.attn.gate_proj.bias, blocks.attn.gate_proj.weight, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, lane_merge, skip_gates, skip_weights
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.330505
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.323331
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.875809
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.878134
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.880261
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.875547
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=0.881265
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.338380
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.329539
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.333465
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.336066
+[IDEA-064 parallel_gptq] searched 50 clips × 512 rows using 64 workers, avg_best_err=1.331417
+Serialized model quantized+zstd: 15720987 bytes
+Total submission size quantized+zstd: 15871892 bytes
+quantized val_loss:8.97440563 val_bpb:3.47431261 eval_time:2474ms
+quantized_sliding_window val_loss:8.97540695 val_bpb:3.47470026 eval_time:122189ms
+ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0
+ttt_sliding:params unfrozen=35988657 frozen=0
+ ttt_chunk [1/1238] bpb=3.389142 time=32.7s
+ ttt_chunk [11/1238] bpb=3.456527 time=35.0s
+ ttt_chunk [21/1238] bpb=3.513027 time=37.4s
+ ttt_chunk [31/1238] bpb=3.492755 time=40.2s
+ ttt_chunk [41/1238] bpb=3.456835 time=42.6s
+ ttt_chunk [51/1238] bpb=3.426119 time=45.0s
+ ttt_chunk [61/1238] bpb=3.388445 time=47.3s
+ ttt_chunk [71/1238] bpb=3.359202 time=49.7s
+ ttt_chunk [81/1238] bpb=3.316519 time=52.0s
+ ttt_chunk [91/1238] bpb=3.296403 time=54.3s
+ ttt_chunk [101/1238] bpb=3.261683 time=56.7s
+ ttt_chunk [111/1238] bpb=3.238082 time=59.1s
+ ttt_chunk [121/1238] bpb=3.213687 time=61.4s
+ ttt_chunk [131/1238] bpb=3.196301 time=63.7s
+ ttt_chunk [141/1238] bpb=3.180946 time=66.0s
+ ttt_chunk [151/1238] bpb=3.168216 time=68.4s
+ ttt_chunk [161/1238] bpb=3.153227 time=70.8s
+ ttt_chunk [171/1238] bpb=3.141229 time=73.1s
+ ttt_chunk [181/1238] bpb=3.124611 time=75.5s
+ ttt_chunk [191/1238] bpb=3.107076 time=77.8s
+ ttt_chunk [201/1238] bpb=3.093030 time=80.2s
+ ttt_chunk [211/1238] bpb=3.084778 time=83.1s
+ ttt_chunk [221/1238] bpb=3.069012 time=85.5s
+ ttt_chunk [231/1238] bpb=3.061129 time=87.9s
+ ttt_chunk [241/1238] bpb=3.055993 time=90.2s
+ ttt_chunk [251/1238] bpb=3.046714 time=92.6s
+ ttt_chunk [261/1238] bpb=3.034668 time=95.5s
+ ttt_chunk [271/1238] bpb=3.027530 time=97.9s
+ ttt_chunk [281/1238] bpb=3.017481 time=100.3s
+ ttt_chunk [291/1238] bpb=3.010547 time=102.6s
+ ttt_chunk [301/1238] bpb=3.000284 time=104.9s
+ ttt_chunk [311/1238] bpb=2.989291 time=107.3s
+ ttt_chunk [321/1238] bpb=2.982505 time=109.6s
+ ttt_chunk [331/1238] bpb=2.978273 time=112.0s
+ ttt_chunk [341/1238] bpb=2.971575 time=114.3s
+ ttt_chunk [351/1238] bpb=2.965618 time=116.7s
+ ttt_chunk [361/1238] bpb=2.957035 time=119.0s
+ ttt_chunk [371/1238] bpb=2.948521 time=121.3s
+ ttt_chunk [381/1238] bpb=2.943775 time=123.7s
+ ttt_chunk [391/1238] bpb=2.938326 time=126.1s
+ ttt_chunk [401/1238] bpb=2.929976 time=129.0s
+ ttt_chunk [411/1238] bpb=2.923665 time=131.3s
+ ttt_chunk [421/1238] bpb=2.917606 time=133.7s
+ ttt_chunk [431/1238] bpb=2.911708 time=136.6s
+ ttt_chunk [441/1238] bpb=2.906867 time=138.9s
+ ttt_chunk [451/1238] bpb=2.905560 time=141.2s
+ ttt_chunk [461/1238] bpb=2.897908 time=143.6s
+ ttt_chunk [471/1238] bpb=2.893437 time=146.0s
+ ttt_chunk [481/1238] bpb=2.887669 time=148.3s
+ ttt_chunk [491/1238] bpb=2.884042 time=150.7s
+ ttt_chunk [501/1238] bpb=2.879377 time=153.1s
+ ttt_chunk [511/1238] bpb=2.875703 time=155.5s
+ ttt_chunk [521/1238] bpb=2.872974 time=157.9s
+ ttt_chunk [531/1238] bpb=2.869050 time=160.2s
+ ttt_chunk [541/1238] bpb=2.865824 time=162.6s
+ ttt_chunk [551/1238] bpb=2.864233 time=164.9s
+ ttt_chunk [561/1238] bpb=2.861896 time=167.3s
+ ttt_chunk [571/1238] bpb=2.861164 time=169.7s
+ ttt_chunk [581/1238] bpb=2.860021 time=172.0s
+ ttt_chunk [591/1238] bpb=2.856978 time=174.4s
+ ttt_chunk [601/1238] bpb=2.854741 time=176.7s
+ ttt_chunk [611/1238] bpb=2.851994 time=179.1s
+ ttt_chunk [621/1238] bpb=2.849353 time=181.4s
+ ttt_chunk [631/1238] bpb=2.845798 time=183.8s
+ ttt_chunk [641/1238] bpb=2.843705 time=186.2s
+ ttt_chunk [651/1238] bpb=2.840266 time=188.6s
+ ttt_chunk [661/1238] bpb=2.836378 time=191.0s
+ ttt_chunk [671/1238] bpb=2.831198 time=193.4s
+ ttt_chunk [681/1238] bpb=2.827903 time=195.8s
+ ttt_chunk [691/1238] bpb=2.825961 time=198.1s
+ ttt_chunk [701/1238] bpb=2.822115 time=200.5s
+ ttt_chunk [711/1238] bpb=2.819124 time=202.9s
+ ttt_chunk [721/1238] bpb=2.817627 time=205.2s
+ ttt_chunk [731/1238] bpb=2.815491 time=207.6s
+ ttt_chunk [741/1238] bpb=2.814312 time=210.0s
+ ttt_chunk [751/1238] bpb=2.811060 time=212.3s
+ ttt_chunk [761/1238] bpb=2.806529 time=214.7s
+ ttt_chunk [771/1238] bpb=2.802407 time=217.0s
+ ttt_chunk [781/1238] bpb=2.798687 time=219.4s
+ ttt_chunk [791/1238] bpb=2.797467 time=221.8s
+ ttt_chunk [801/1238] bpb=2.797757 time=224.1s
+ ttt_chunk [811/1238] bpb=2.794524 time=226.5s
+ ttt_chunk [821/1238] bpb=2.792267 time=228.8s
+ ttt_chunk [831/1238] bpb=2.789750 time=231.2s
+ ttt_chunk [841/1238] bpb=2.788190 time=233.5s
+ ttt_chunk [851/1238] bpb=2.784857 time=235.9s
+ ttt_chunk [861/1238] bpb=2.782292 time=238.3s
+ ttt_chunk [871/1238] bpb=2.779411 time=240.7s
+ ttt_chunk [881/1238] bpb=2.777006 time=243.1s
+ ttt_chunk [891/1238] bpb=2.774757 time=245.4s
+ ttt_chunk [901/1238] bpb=2.777132 time=247.8s
+ ttt_chunk [911/1238] bpb=2.775196 time=250.2s
+ ttt_chunk [921/1238] bpb=2.774448 time=252.5s
+ ttt_chunk [931/1238] bpb=2.773249 time=254.9s
+ ttt_chunk [941/1238] bpb=2.771723 time=257.3s
+ ttt_chunk [951/1238] bpb=2.771270 time=259.7s
+ ttt_chunk [961/1238] bpb=2.770451 time=262.0s
+ ttt_chunk [971/1238] bpb=2.772091 time=264.9s
+ ttt_chunk [981/1238] bpb=2.770968 time=267.3s
+ ttt_chunk [991/1238] bpb=2.769339 time=269.6s
+ ttt_chunk [1001/1238] bpb=2.769170 time=272.4s
+ ttt_chunk [1011/1238] bpb=2.768082 time=274.8s
+ ttt_chunk [1021/1238] bpb=2.766890 time=277.1s
+ ttt_chunk [1031/1238] bpb=2.765724 time=279.5s
+ ttt_chunk [1041/1238] bpb=2.764717 time=281.8s
+ ttt_chunk [1051/1238] bpb=2.762587 time=284.2s
+ ttt_chunk [1061/1238] bpb=2.761116 time=286.6s
+ ttt_chunk [1071/1238] bpb=2.759318 time=288.9s
+ ttt_chunk [1081/1238] bpb=2.757011 time=291.3s
+ ttt_chunk [1091/1238] bpb=2.754565 time=293.7s
+ ttt_chunk [1101/1238] bpb=2.753095 time=296.1s
+ ttt_chunk [1111/1238] bpb=2.751726 time=298.4s
+ ttt_chunk [1121/1238] bpb=2.750295 time=300.8s
+ ttt_chunk [1131/1238] bpb=2.748229 time=303.1s
+ ttt_chunk [1141/1238] bpb=2.746214 time=305.4s
+ ttt_chunk [1151/1238] bpb=2.744575 time=308.4s
+ ttt_chunk [1161/1238] bpb=2.743093 time=310.8s
+ ttt_chunk [1171/1238] bpb=2.741002 time=313.2s
+ ttt_chunk [1181/1238] bpb=2.739228 time=315.6s
+ ttt_chunk [1191/1238] bpb=2.737380 time=317.9s
+ ttt_chunk [1201/1238] bpb=2.736627 time=320.3s
+ ttt_chunk [1211/1238] bpb=2.735863 time=322.6s
+ ttt_chunk [1221/1238] bpb=2.733636 time=325.0s
+ ttt_chunk [1231/1238] bpb=2.732550 time=327.3s
+ ttt_chunk [1238/1238] bpb=2.731938 time=328.8s
+ttt_sliding:done val_loss=7.047824 val_bpb=2.728464 elapsed=345.3s
+quantized_ttt val_loss:7.04782394 val_bpb:2.72846410 eval_time:345475ms
+[W424 06:49:19.303221292 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+[W424 06:49:20.794465997 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+[W424 06:49:20.884842514 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+[W424 06:49:20.993919086 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+[W424 06:49:20.094509752 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+[W424 06:49:20.149141415 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+[W424 06:49:20.175587829 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+[W424 06:49:20.258368088 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+[W424 06:49:21.414874267 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
+
+[run] DONE 06:49:22Z
+[run] === val_bpb lines ===
+0/20000 val_loss: 9.0097 val_bpb: 3.4880
+4000/20000 val_loss: 3.0103 val_bpb: 1.1654
+5365/20000 val_loss: 2.8445 val_bpb: 1.1012
+pre-quantization post-ema val_loss:2.84180086 val_bpb:1.10016250 eval_time:6486ms
+quantized val_loss:8.97440563 val_bpb:3.47431261 eval_time:2474ms
+quantized_sliding_window val_loss:8.97540695 val_bpb:3.47470026 eval_time:122189ms
+ttt_sliding:done val_loss=7.047824 val_bpb=2.728464 elapsed=345.3s
+quantized_ttt val_loss:7.04782394 val_bpb:2.72846410 eval_time:345475ms
+
+[run] === artifact ===
+ final_model.int6.ptz: 15720987 bytes
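
The ` ttt_chunk` progress lines above trace the TTT recovery curve. A minimal sketch for scraping them into (chunk, bpb) pairs for plotting; the regex and `ttt_trajectory` helper are illustrative, not part of the submission code:

```python
import re

# Matches log lines of the form " ttt_chunk [i/N] bpb=X time=Ys"
# and captures the chunk index i and the bpb value X.
TTT_CHUNK = re.compile(r"ttt_chunk \[(\d+)/\d+\] bpb=([0-9.]+)")

def ttt_trajectory(log_text):
    """Return (chunk, bpb) pairs in the order they appear in the log."""
    return [(int(i), float(bpb)) for i, bpb in TTT_CHUNK.findall(log_text)]

sample = (
    " ttt_chunk [1/1238] bpb=3.389142 time=32.7s\n"
    " ttt_chunk [1238/1238] bpb=2.731938 time=328.8s\n"
)
points = ttt_trajectory(sample)
print(points)  # [(1, 3.389142), (1238, 2.731938)]
```

Feeding the full log through this yields the ~0.70 BPB recovery trajectory (3.46 → 2.73) discussed in the TTT Recovery Trajectory section.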