diff --git a/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/README.md b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/README.md
new file mode 100644
index 0000000000..51851c31b3
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/README.md
@@ -0,0 +1,71 @@
+# SP10240 + SimCTG + QAHSP + post-quant TTT (Submission A v2)
+
+**val_bpb = 1.07197** (3-seed mean post-quant TTT sliding-window, std 0.00023) | artifact 15.96 MB | 8×H100 SXM | GPTQ-quantized, brotli-compressed model + lzma-compressed self-extracting code
+
+## 3-seed results
+
+| Seed | post-EMA | quantized | sliding-window | **TTT sliding-window** |
+|------|----------|-----------|----------------|----------------------:|
+| 42   | 1.07547  | 1.09015   | 1.07422        | **1.07218** |
+| 1337 | 1.07522  | 1.08978   | 1.07386        | **1.07200** |
+| 2025 | 1.07491  | 1.08939   | 1.07350        | **1.07173** |
+| **mean** | **1.07520** | **1.08977** | **1.07386** | **1.07197** |
+| std  | 0.00028  | 0.00038   | 0.00036        | 0.00023 |
+
+The shipped `final_model.int6.ptz` is from seed 2025 (lowest val_bpb of the 3).
+
+Δ vs prior leaderboard sliding-window SOTA (1.0827, 2026-04-09 SP8192 3-Layer Recurrence): **−0.01073 BPB / 10.7 mBPB better**, well above the 3-seed σ (0.23 mBPB).
+
+Δ vs our prior Sub A (1.07502, sliding-window 3-seed): **−0.00305 BPB / 3.05 mBPB better** at the post-quant TTT level.
+
+## Architecture
+
+11L × 512d × 8H / 4KV with: 3-Layer Recurrence (loops layers 3-5), Parallel Residuals (from layer 7), LeakyReLU(0.5)² SwiGLU, Partial RoPE (16/64), XSA on all 11 layers, tied embeddings, SP10240 tokenizer, Polar Express NS Muon, GPTQ int6 (matrices) + int7 (token embeddings) + brotli compression.
+
+**Training**: 4523-4537 steps in ~588s under `MAX_WALLCLOCK_SECONDS=600` on 8×H100, single seed per run.
+**Quantization**: Mixed GPTQ int6/int7 + brotli.
+**Eval**: pre-quant post-EMA grade pass → quantized → sliding-window stride 64 → post-quant TTT (1 epoch, LR 5e-3) over remaining eval tokens.
+
+## Our novel additions on top of the PR #1855 lineage
+
+1. **SimCTG contrastive regularizer** (λ=0.3, margin=0.4) — angular spread on token-level hidden states, no inference cost. Carried over from prior Sub A.
+2. **QAHSP quant-aware activation regularizer** (λ=0.3) — STE penalty `MSE(h, STE-quantize(h, int6))` pushing hidden states toward an int6 grid during training. **Novel to this submission.** See companion Sub C (PR #2011) for the cross-base ablation characterizing where QAHSP helps and where it hurts. Both losses are sketched after this list.
+3. **Post-quant test-time training** (`TTT_ENABLED=1`, default 3 epochs at LR 5e-3, reduced to 1 epoch in this run for budget) on already-graded eval tokens, after the legal pre-quant grading pass. Follows the same score-first TTT line as PR #1413.
+4. **Bug fix to `eval_val_ttt`**: the original code referenced `compiled_forward`, which is defined only in the pre-quant TTT path; we replaced it with an eager `base_model(x, y)` call. This is what unblocked TTT from completing — without the fix, the post-quant TTT loop failed silently on the first chunk. The second sketch after this list shows the patched loop shape.
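+
+For concreteness, a minimal sketch of the two auxiliary losses (items 1-2) follows. This is not the submission's code: `h` is assumed to be token-level hidden states of shape `(batch, seq, dim)`, and the per-tensor scale and detached quantization target are illustrative choices.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def simctg_loss(h, margin=0.4):
+    # SimCTG token-level contrastive term: hinge-penalize pairwise cosine
+    # similarity between distinct token states when it exceeds 1 - margin.
+    hn = F.normalize(h, dim=-1)
+    sim = hn @ hn.transpose(1, 2)              # (B, T, T) cosine similarities
+    pen = (margin - 1.0 + sim).clamp(min=0.0)  # max(0, margin - s_ii + s_ij), s_ii = 1
+    off_diag = ~torch.eye(h.size(1), dtype=torch.bool, device=h.device)
+    return pen[:, off_diag].mean()
+
+def qahsp_loss(h, bits=6):
+    # QAHSP: MSE between hidden states and a fake-int6-quantized copy.
+    # round() has no useful gradient, so the target is detached
+    # (straight-through style); the gradient pulls h toward the int6 grid.
+    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
+    scale = h.detach().abs().amax().clamp(min=1e-8) / qmax
+    q = (h.detach() / scale).round().clamp(-qmax - 1, qmax) * scale
+    return F.mse_loss(h, q)
+
+# training objective: loss = ce + 0.3 * simctg_loss(h) + 0.3 * qahsp_loss(h)
+```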
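+
+The post-quant TTT loop (items 3-4) is, schematically, score-then-adapt. A sketch under stated assumptions: `base_model(x, y)` returns mean cross-entropy, `chunks` yields eval token chunks in order, and plain SGD with the logged `ttt_lr`/`ttt_momentum` stands in for the pipeline's actual optimizer.
+
+```python
+import torch
+
+def eval_val_ttt(base_model, chunks, lr=5e-3, momentum=0.9, epochs=1):
+    # Score-first TTT (PR #1413 lineage): every chunk is graded BEFORE the
+    # model trains on it, so no token ever influences its own score.
+    opt = torch.optim.SGD(base_model.parameters(), lr=lr, momentum=momentum)
+    total_loss = total_tokens = 0.0
+    for x, y in chunks:
+        with torch.no_grad():                  # 1) grade the chunk first;
+            loss = base_model(x, y)            #    the item-4 fix: eager call, not
+        total_loss += loss.item() * y.numel()  #    the pre-quant-only compiled_forward
+        total_tokens += y.numel()
+        for _ in range(epochs):                # 2) only then adapt on it
+            opt.zero_grad(set_to_none=True)
+            base_model(x, y).backward()
+            opt.step()
+    return total_loss / total_tokens           # converted to bpb downstream
+```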
+
+## Compliance
+
+- Trains in <600s on 8×H100 (`MAX_WALLCLOCK_SECONDS=600`).
+- Post-quant TTT runs after the legal pre-quantization post-EMA grading pass per Issue #1017 / README evaluation rules. Same compliance argument as PR #1413 (score-first TTT).
+- Eval ops total ~700-720s (sliding-window 115s + TTT 260-290s plus pre-/quantized eval ~30s). Slightly over the 600s soft rule discussed in PR #1958 — flagged for organizer review.
+- Artifact 15,958,541 bytes ≤ 16,000,000 (margin 41,459 bytes).
+
+## Files
+
+- `final_model.int6.ptz` — brotli-compressed quantized model (seed 2025, 15,932,327 bytes)
+- `train_gpt.py` — self-extracting (lzma+base85+exec, SOTA-standard format, 22,215 bytes)
+- `submission.json` — leaderboard metadata
+- `train_seed{42,1337,2025}.log` — 3-seed daemon training logs (stripped to relevant lines)
+- `README.md` — this file
+
+## Reproduction
+
+```bash
+SEED=2025 SP_VOCAB_SIZE=10240 VOCAB_SIZE=10240 MAX_WALLCLOCK_SECONDS=600 \
+  COMPRESSOR=brotli \
+  N9_SIMCTG_LAMBDA=0.3 N9_SIMCTG_MARGIN=0.4 \
+  REG_QAHSP_LAMBDA=0.3 \
+  TTT_ENABLED=1 TTT_EPOCHS=1 \
+  torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+To decode the self-extracting wrapper without executing it:
+```bash
+python3 -c "import lzma,base64,re;print(lzma.decompress(base64.b85decode(re.search(r'b85decode\(\"([^\"]+)\"\)', open('train_gpt.py').read()).group(1))).decode())"
+```
+
+## Credits
+
+PR #1855 SOTA stack (Kevin Clark et al.), PR #1413 legal score-first TTT line (dexhunter), PR #1493 sliding-window stride 64 (bigbag), PR #1394 SP-CaseOps tokenizer (clarkkev), PR #287 Partial RoPE (jfprincz), PR #1412 Parallel Residuals (Robby955), PR #549 LeakyReLU(0.5)² (abaybektursun).
+
+QAHSP, the post-quant TTT pipeline integration, and the `eval_val_ttt` bug fix are novel to this submission.
diff --git a/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/final_model.int6.ptz b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/final_model.int6.ptz
new file mode 100644
index 0000000000..dfd871f29d
Binary files /dev/null and b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/final_model.int6.ptz differ
diff --git a/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/submission.json b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/submission.json
new file mode 100644
index 0000000000..938ac153a1
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/submission.json
@@ -0,0 +1,42 @@
+{
+  "name": "SP10240 + SimCTG + QAHSP + post-quant TTT",
+  "blurb": "PR #1855 lineage SOTA stack (11L x 512d x 8H, 3-Layer Recurrence loops 3-5, Parallel Residuals, LeakyReLU(0.5)^2, Partial RoPE 16/64, XSA all-layers, SP10240) + SimCTG (lambda=0.3, margin=0.4) + QAHSP quant-aware activation regularizer (lambda=0.3) + post-quant TTT (TTT_ENABLED=1, TTT_EPOCHS=1, LR 5e-3) + Polar Express NS Muon + GPTQ int6/int7 + brotli + lzma-compressed self-extracting code. 3-seed mean 1.07197 BPB post-quant TTT sliding-window stride 64. Beats prior leaderboard sliding-window SOTA 1.0827 by 10.7 mBPB and our prior Sub A (1.07502) by 3.05 mBPB.",
+  "date": "2026-04-30",
+  "val_bpb": 1.07197,
+  "val_bpb_std": 0.00023,
+  "val_bpb_metric": "quantized_ttt_sliding_window",
+  "shipped_seed": 2025,
+  "seeds": {
+    "42": {
+      "post_ema_bpb": 1.07547,
+      "quantized_bpb": 1.09015,
+      "sliding_window_bpb": 1.07422,
+      "ttt_sliding_window_bpb": 1.07218411
+    },
+    "1337": {
+      "post_ema_bpb": 1.07522,
+      "quantized_bpb": 1.08978,
+      "sliding_window_bpb": 1.07386,
+      "ttt_sliding_window_bpb": 1.07200099
+    },
+    "2025": {
+      "post_ema_bpb": 1.07491,
+      "quantized_bpb": 1.08939,
+      "sliding_window_bpb": 1.07350,
+      "ttt_sliding_window_bpb": 1.07172856
+    }
+  },
+  "novel_contributions": {
+    "qahsp": "Quant-Aware Hidden STE Penalty regularizer at lambda=0.3 (MSE between hidden states and STE-quantized-to-int6 versions).
Novel to this submission. See companion Sub C (PR #2011) for cross-base ablation.", + "post_quant_ttt_integration": "Same legal score-first line as PR #1413; TTT_EPOCHS=1 to fit in the eval budget after sliding-window eval.", + "eval_val_ttt_bug_fix": "Original code referenced compiled_forward (defined only in the pre-quant TTT path); patched to use eager base_model(x, y) call." + }, + "compliance_notes": "Post-quant TTT runs after the legal pre-quantization post-EMA grade pass. Eval ops total ~700-720s including TTT (sliding-window 115s + TTT 260-290s + pre/quantized eval ~30s); slightly over the 600s soft rule discussed in PR #1958 -- flagged for organizer review.", + "credits": "PR #1855 (Kevin Clark et al.) - architecture; PR #1413 (dexhunter) - legal score-first TTT line; PR #1493 (bigbag) - stride-64 sliding eval; PR #1394 (clarkkev), PR #287 (jfprincz), PR #1412 (Robby955), PR #549 (abaybektursun)", + "bytes_total": 15958541, + "bytes_artifact": 15932327, + "bytes_train_gpt_self_extracting": 22215, + "bytes_readme": 2618, + "bytes_submission_json_self": null, + "cap_margin_bytes": 41459 +} diff --git a/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_gpt.py b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_gpt.py new file mode 100644 index 0000000000..7b5d7c5ed3 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_gpt.py @@ -0,0 +1,2 @@ +import lzma as L,base64 as B +exec(L.decompress(B.b85decode("{Wp48S^xk9=GL@E0stWa8~^|S5YJf5;V<<>_+0=rn@VT6Qap3bt~@<3h>ok~)Km_aAcM1$ZA=RNsrI&uUw)pb_nMj0LFYCMl-ULtvz!0lTlkwZNfQb9u;zP;lKC6%NM(=|8~7kIg$g6~+qlxGpltspif;>W9Ih1cC**hctLEJ6B&YTfwKJ0OWvU@z19^O+nlrHB%v1xZ$=#-ixbtl-nGL3Bf>&GfOs&4;pcOyIFP?wG$UVPX6|As{dP^I!I=aV9kZdH0NY)ou6M?t6-PDNK*k?8oK<6m=V<4_&Or>o~6kbl`xdLvx9KUQ5H*w_8HN5yc8m%t_~XLd2%e=#MX5z-CT8OAh!gr?aM#7(FUyfiuLW5@#Sg9IWzv*yWK~L08v?6r|CfY02^NdkE6!3C0itrE3D12xQIJD9l$C*!aU>HV2yyX?AF3b^Z5S*_?ZT5wXdOW40+Cv>njtnEaZ3D&`WV&6GCNcsI}3*IQ8^#!=|$QibiVa1~%(G#(Ww;^n#>gZ^3Ih77u?#y-oVV-m*xW#UQ{~O5jslOysHyqQNk@HqWjp=H5qkm|%*E0nu=NknUV{SfCUvimCpwjXG=@j%B7O1i{oH@X_~!n$V|=bSzlP4fa2AwK1G$aiW(gimwWeNy-X8Ibx{Cfo+G3UmMcOIjTMaqKHYHhA!1IfgmTZ&L{f;?R{{~RO%Gf?=FYW154;Irj}?Il6|-|TZSiDnt=_rSKp+-r$j<=xHw^Y9U>s*PBQQWZ;4;8tpOlXbSB)OW)$WDG5)@Bc~l3;aLq@GFC3bP1UW#DDC##g+#C_`Uj+QteTxFgFwg!g^bGx@;qQW^vGApwVo{*CW{&+$^8@bO##$72D{jjozVspa=PWDVY6kh-0W0Bm=HZ?&_yP5>+77<5>mikhs`~T=TLz4p6~7N4POA+gmrhK0qIGGj7esu>s`DJr)tmsI9V&^`LO37e~xvNIZJ$uBHC-$L|MeFW?=(#EC%Ci9tWtV}f`g3vmgTFPQK=EG&V9Veef@=07YY>M1=+N%WIX;y9JrR^%C2Jg_y5TH?}YY(=aSMk%SZo-(|XIL9UPp{59m%hL8i?fh>X0QqUCFmVeoFpCYCgPjjEKB|F}=-+r(2Sko>J&|8B(0m`nRmW+|}nG6!Qc@9|reR`NDciN-k3x5ti&EK_b43Jo!!N=hBKBmU(v-V>Xd3Y=%JM;rmmy|h+iPK0MV$OiXxSur)JVFR@nk~Rei+3$f!!xFtDs#fR-D99D41fbuoQJ2IJb@eB=_da!D>z&4epDp|Rs@z}hZ3eKysY-|_6Tkq5v9f{;VUVye1y>W)i~Io5qEc821xt*U|03j^*baO^>rfjN1g&v<@_QKT9V+yp)00Z{bEHkEPsKrd?$e!-MrbLp!mN7W6LHLA_@=st^`E@N{u0qYV7S1YdnDY!mWDfrT(L`Jk2x3G7s9MGZJjM}*rfsRz>2mJaCEtTyzxd^AHg9!42kcCpQ-!|af#S1XYZYhy*=WSP&UU>(i2iubXILS&gWCYQ)}5CR8wIF8@Vx74%2~%MAq-+uH5|B#mzX7SI<3P?nJ2lQt59brXC@h%DvW*ED>yE%|Qd4znFMyF*o&iB?y)#larFN9Q-Ip!znTnj5CKpnD@NtMxpIvJcMJ0Uo43T4S6a}E$m)So?}^wXv1hF8rM>IGY0EWKHrxj;){j7p{(>yy+SCoUHwe{rIewJsC(hG$1VJnl;$9|`USJG7r~<_(7}VcwQ*1HGu3^1XAXXZr*9H-NseX?kEMGr0_+aC{Lc!$fnjC1d>nEoh^%jlC3Wq=m_jH;*|LMYyKhpqH&7V1!kvRnZkqgKG<+2jjwDB?XKu)}QtCEr08{(U=FKsfS-hw_2LH$~WoK|vj3N>@3fFKoNTz+J;;1L4@dJeDZKb8CJ5bZd^|e^kNv;zRrr2bqgKP}~4r~;H_4?NV4^4iOe>Kk7XW+>%5aeIDHS=w~3=@}OwKzle=Bm44HsvSZy&ZY_gVO903|7{*qCTAL>UjMZCe0A~myU5Cp4AgY^y
Xa=LtHsOmd?qE?tU){;+3ouQ@_X?*~~;JLK6?79{12bB3z{9q1nkIo(FdlAwIA$mbgbPYHcQQT|}k8TzZ79Dj>>NTk_F(OKE2i2s1ByxYO>!ktz#q$hBbI_ZTZmR8@{64ks_Th1p3f*=3zSCm|s7Ejn?8h8#H3mdiB@gA|r`S$qaaos$1?+qA*+a&x9BI3EYP1+@k%eAoz4j~P5r@6YNOyp`cCDT#?Y=`DU_nQW@H=J5X~HtZKPf`Ph^bgp5TvC`fGjPO{X&#>XYH8CCA6>}x{yZ#q3PiL)Y4HWv)QQQMT#R%OtJeI{W`INp^c2va(#fS17XXw`-X+nMJl9q-6XIHV4Pj{&i{?8&L19!IBro1D-2r?lu%>hl|-*L4G8N+mHx@@P2Wa3Q3U7!Ubb0=PfC#>LqqCEo|P1q72+gQo2tJ6H)e${fsoP!S>KrUA@MzBV=2?liMa$UawIwK$`S{{SZ)`y21XW%ezkd+`csbzO?;3Zfaf58SNoH_C+ZKI`_K4e7MFaY6B3!~f=i-QN8J%ZW?rA>YO3v8OGw|z5{jg2I9Y0lnB*kvCeah;Kkbo%a$ohU*(+p{XY_=)OEC#25|mUcQdlY$wG+OxV5jhZ7Kt3B8t?chtfSq0lNF8(B^3^q=DH;{`Frtp)!mm48j+XW~khOC#W1vQh-z&M<1uLM%2^3Y5XBu&5O>Wr%4zkxqW=Y;~ZlK{wxIQ`&msz>?VaasaO|Y;~QSu{bJ+tQ(X;w^e8<&6X$7EU<)6&#z=}mCR7Lq-c&QT;v9Merratz6ibc>)gko3rM9Agw3a6U?}22TsIP||VW5?-b^FVuL)roprg1?0b_v8ii>)wYt)MFQAR7vByBQDZQ&aZpx)Xh9xGQc7c?MHWa~C@lKR^}YFvX8rR{Eb#Cx-b+M>|CiWczr)drlDGE|;bZ^G6zo-ts_vIk4Q7x&=SXA^j%nj~|E>BlNH&Yr&S%FqI7edkp`z_N&>ludnwknL|qkLYtYwV&IrN`PAPh6%pTM1X7tb47UeUX9fNhV+eX$(N1B+oV13Js10_VESg+n2`?^`j&BY5ah~D`KGWCz`VQeN+AVd!b535TYe#v`bVP~3L{-C(M$^ta1QMURPB!F5{PeTbvk(X`4$ws+{ucQq!22ZC1s!FxP`7(yHRWGp^q3Q|Q{&W*capHN;&=V0Kbk9hk!oA>x~wp}~(P~uqPmdNWBxsb8AZG)uLAPB;%4h>dpwrjcVto;DwG=C8=VvM+sSa$w73(`Xma+OG$nbW{TYBQ>E%%wa;vB)STui3k>;t{!g4SaW5AIcv2uVJgX1230uMazxYtxgpMR2S+$Y&29COvDK%mSHJ9hk>fi-n;(y!ihXa{gqA}sjDQ}p>;se=(7g6c2dQoIlrS!7ML1hIWg5?AG$_x$;8POW#>Db+BP(&OhHS>2@;!&)_ZHc9Ck!ME^D8Wh4v3CtYo1rmCr>Hm?&asXlwjZ1#hXcsFa@W)<1cbY_@NBM`MbrZ0KETbhen_YmB{?)#eom+!Z}^t9`Zb-XW+x;sFFy}J1G6ynge2a_7{Jy>JO5eytdU%Va?0k;~7#wL|D$(W|P!6F}4>gP&Jgp=cl7d)}^Bc0>N<$q;npC`Bm_iT&RT@`>Ka*Es-$?*H7ZEi^!S#Cq-PLb5aAPUV|8~9J<8~_O%-nyfb>eQs!DY8afK6&!wRrIJ~W#IxU=oUkg!!&+v5S1vM>tln`_saN1Ok6foNZ_ExKy9}|`9IWJ&FTh(oxO}izw9TV!oDAz^ng*Hz+X!7q{D7d41)kP35d)v{}Tl>yE+W-tFuqC&B8sL6#u{Am+e9yZjOh&IWq5rOaPO7@xqkdF`v#Muw2#c5n0r^r0J$h@Kw2XDZ|@xE1xW1L>`=mihQQ#?x=TDMQ4dEviLsB;&_~vIrY`#K@4V6k7e}b{cqv_RFS+I1*Tt6G7E%X+<*u#TGZ+LB!3@YH{?jz?YVI|zdBPOX*vo#3WrXEa}1^6s+NC2is^`Es(U_2Ny+q5lT{;Q#>3T#gL9KYE#2odIE%@&Xg04}{uZ9qT?TOhEy+j?|HxN61sf5B5tM{Dm>u_veu34XO9W?jgV{xhS|oRF-=~^p*buxa&58nAtKz_;`#kH|e4K|uf(0&zKKZ;bq(l=&gS76-F6TChcPaHCbYWTmdqmD=z1N+aa#TiOXDrR_pc9I7~Y}oLf=pjwRSqzy`-7p4h?y1sF%+!E0OuVpiII3{U`AU)AtPX@=2;Hu}Vjg#Xw9ZG!YNce&`4<<&9*zBEFpAQ@uzGNih8s>>N=W=-ns(1)7bCtmv+8V$PnlEROw_mB-H}Byo~PX1AV&_BibvSvMq3+cD7Q?|1*{^KTMF-EF(9p1?@?^)5i9Y)cjsYeX6L3Ai&%%4|&W(!LtMM50{MqN75MG(k!56~_QzLp;9o{oD##Wzr_?7rpmKk1h{#8Nr}EFS`CfJe!SZmXHhsmt}_f=uh!tLoouxRUvBDKald^p5A6XafNM;^&~FeA`ldVdk{BOAU)Dv3657D;lKSJ>q33P@foghiNOPcln4GkyFN&ALB7V#_y5hv^fBQj4fHwTH)Gy-(cpWgefj_B%B4jRA}aE3E$E81X+1T8Z4ayTd#ICPH&{q!;FeuhfH|Io_Pyw>f_Nc`KKjC#rdSLb|w7lNZWEkyeP6CrlQ}w>M&*VeS)F5aeWkqL8V|7#s0b=oMsE@H;=OWRsXX9f%~|*JAeEz2TRzl$hSWAATSlMXftiGvAI=TFDf@LWe$RJTD&ULosi_dW*cj`tggHP-VkBxYTY=L3ZP4c`X4@jr*gjG+Rw=Am5-i8w73rMY$F{7D<9K?LoS7=u5jzHnyO8|BBf@%jw(UW+Y1Lc>PPGmE5)ANufNg##k~ICrk{M5!IALyWA?p~^Pu(L#N!Fj#XGv-Mv?M9*QD#sCFm?{YS7ysKGqo-pGX7pmywuXThLOXC0g7tG#K{-aVImh2BGKokejUeb#;b{G`l*slOGMg`HTfiZlVh=wxuOSiea?zNcMMF6?@>Nxf!Nn>m+H`>~xgKK*`T=*G7{F=cc$@d3>#krf{T7zhG{_R}*wmdrjL0M{Bp*z^Eim388Pwju7f%uoo`z0-U=Km=4cL+)vYwsm-(r&CIIyH;Msl!PK>JYm_2Nl;a0rT;6We;zDq&s36jqYMo_12t#U=MJ6^jX(lBa0u~+wRh$lNVm-F?k)$B#KeTqRbtc#I0K)+*P+!hpC`M-#ZKuibqe^1)Ll@Fh>Ik;Jrw7n~@N(3+2|rCXB}VDWP+tAp(G^Lrcv2{i;?~Y*9a%GBnlLZ-&IG${U`lE0`?*BRrl}KwTHt@ODiVTWQxls9Ay}^u<*`7~LO+>AcO5@1v_Zw5hpgZTXE=c_(PSRoJH7vq~E3N9AZ95dX1%SFHWI5VWy?V78r>1PQ*!)6OZsk+b-O(=5o-XB6c-*rp1p0sjI1jn)Q5V78RKf<)K0=JL|lE=<~K)D=~y|NhfXA_s#|`}Xew(%x1w&)-&U_Nl5^e0YDHCcXn4xxtf_=%HQK2wM<<1NJIsG+cjoU%~Yj?WmfFW6>iyhqN9I0RtDRl(3>wo_vb_DrJXb$OA!re}~=g!#~C`Ho_
#Qe$^QeVtF!fon^kuIr4By#Wcm(X=yuj!}T-N`WfQ?vquC$B8+TV3To9U@Vj=K`1p~Ui-9&ai4q*`$mn8!!dUOZ{77yE+1^qT8=Z(-vJTPhkr#+mqcy5Yq(@nK6s`Ol`ojinYqsI5R!sUF)d1n|0VnOumzpL{<+H|ep)HN&XDo4-(d%A&{EKMz?aiz7sF34gMu#<6ZbFacrT;yPsPQlR{-}A`!DlC?L&&(4c#HK`J9)Mr#RKv+tXaeUpPNV+LSvuzX-n5R0Ne$wtW1C$(p}StmNtPMl6i=mf#pZuA(h@R%qlK4M%vl^L^MWZ92bGQX&6O+de_ijfeVBdCRTU*wf7}zsfeR@;clt@0x$++A;Effb9fcB)|O-JuXj&qL&d?X+yzuR!R{(%MBkhfnu3nYQ^(&Dt^tteNjx8Hb~Qdn0pJ+3pEviN4qzYR+dB|*oM^*o2w8oGs4K1R`~;MNv}7ksDvk^FJMR_64$@jwN^EivMAt>_5fbEfb*y>uo)ubs_*V>5}W~U){lbkExIgY?)WHpP0{*j4)l$@FQ;WjR{SUs@v<8VO7}jNy33p!yv3?>yMk+p?jofTyU~*-GeO;*QC_fXWk;18m+9xfR_B&D8UHKG^GLys25o5#~8cGt<__JK=90SGi278h8q+&3eM8VzTAswCIehZhOT=)K>zi3%cPF5bT9dZ;O&a>H~i+;d#EWl47v`a*BvCcHwbl}IB0Ijbo_HGeo(tOcF$XkrWR)cZ&Q4bC%JWe16K|3fdzP6MP;8Ata71jxR6WFuGifwLCyhN&D=eB{6i%Ky9Vj}pHv!C~)bMLL`-k$#A6}aR-S)$0ryJ53E2Nh>CyM&yow7Ym&h1r0;v;)5oFHkeF0elVExUlH&6#SnV6{f97NKElkI#7-u^&n2kc0Wxs3rku_o7ZLrtI6y@qZy#dS=$!?+B-^b?8G6J4~)P4Q-vz+Y?ku8yttkexgsp)>-7oA6UJ8rOQ}CD9v)FoJhUr*D@f^elVz4#KuaC|EJPIdyG`2&8&Z~!9ORyY;u`%gj{lIk$0+;o5)7GM!d)1dghRZ~*wVbkkEE8`r!V*{LTg`QM=qQ5U4HicDXZIONOkEW0%}XuwJXtJtC9awNkgRgaDPul=&ocXnu37*iaCuf7D4NIXp!*@brWM!q=#o^2AkDT^OPvTR(5hXvrE580a2Fy5*GyoMy#*CDwi|SJjbI(>Re_$ZA4iL|Rvh2ev;AOtaVef{ZX-j}>LHERGHV2Bp$}68(`VT@BVWy!?EN4T}xb3IZ%e;pz@beVJ8ktG^5vgs6na%kf3qFY89cxG&z0_>ivIVb=9kRdd7$Wy*yOT6QlQ_Hu|JVSi!6Dc;J-45u`}0EDa?TJ^~N(3kYn-{|-Ak49*L^)oJ@*?0!v2Mxmgmub|DnX+fd1Bvkk2{R|xjxX@X>hPRF2J;AYL>OTuhu?A}PdYDU|G8K&Pczq~k-G;V-6bK2)BVBR265r--P$}pU{Bx1(q22axWy=~jtQL@VZsaGru35vE4yLch`#xGd689#J-b6LyFcxPV)BO&403fdQ+^;>VLqaQktQ6-Y1b>lwBT3}73fpWO4DnEvGoEgk-(k?OOdy}*zAs1YFQmgD1B&u{6+6Z+aC#I8E8NYhT3k~O4qWXxS@vc`$@D(msxPGdsUD0t1RI0yiv4%_To^_1^vC-mo+I~##w_2}GbBPTdmF}GM^L>=d1l|xvn@MdGAg%o#pxmvM>6Z-GbcIB5PEB9mTcPlL;^TSEdR(9Jn5+Ky&SFMSl_@=PmYs0cXbRL`^&GGpau{byJ9ymye?Tt@PM5?%dF?)HzF@)8I$TQ-AvtcR?j&gdAPg}>=WKEArN_1Yh(QSBko1qjYJr9IB{JzTih$cH8l4<{AV$|C*oSsue?~Tdu5)Xwx8NH>l*rHt1U=K2MqtMz^`?9p3C!}bzE03RN^4~LWHh+~^|M$Tfw7jegPdyQI_5q<()MX_UsyP|q4IHtob!Ck!Ju6)^qQ7NPiR5$$(8Q*GmAziEN<6N8&>sLIZ7Fe2noBlj5fYCy7^GAAmOrI~TV{1k#s3%z2;572$In-;uqUrWK7~iY{q)@Qh=U0U(c8;>?0K|yxz6_t8->+;jw`?Xu#jYx_zMQo^Pk0ufWij|e>W$-H*$d`g*|Rz2K}?*9&H6oQu%(h)R1-Le^7k+tC^Mb{sx)mKTfBJad`rWnsy1a*ad7$-p%Z-6_o<^w;lMRCoIArGYsbb8o>hl?QqQI*ww!vF}R%U?fPM=7wQttxtwK{EJNOM5m^(Ptwf&V2#%XTRl`gGEWQg-NKJtM~K#@2&rLwkH}@x9|3DuNK&>;2C+VWSLRAe@JaAlnmQV#bGnXAMdeWNfgJq;rzBpJS+z*QZ+53k;?R;EyF4*r|k=o!mq^arv#ydu2NO5a8Q{vpEg57)b)r;-Z+2HzUFSsn%ja&KxVIdc#Ds;e<&0-!GJC1mnB&;ZCHF&OA$oz$@F$gV_RWR}8(G_$5nxW*A;)KnRSFc+HKCJN*4-%JHI!K{4Z!bYTT+#rLg`@Y(sz4A%BBK>&0^KKSQ9l}8k@5S-&U0iVr^ccOoz-_59aRI-f>2^3nzrbbExKvI^6(8;QPVdD*@K;;CdxGTAtD>sgD+RLP7ZSipkGXNu~|ow85^Mi7VS%|r3!+Yh2S^OLjkyQk-{IxWGj8i>UGq=!)5coy6)^^H&7;8sXXCV%qRQ4>0(gxT$%YJ~(@f5JAiUMTGM0f~ghDQZ$;G3udFptfUPjW*bRYAt6n~=<2x_ua-+z;4Y=N?L%_+u)R|A9Lr3nAyA^V?VlZ@8K6t5E}M1zoC9mk-je|SkL%BV10ZXn)21#qa=|72L*t{D`WI^cgVAY9-rCC<^K^YqsMfj-)LG>C36gS|{~f}7TzZ>fdQ7K6X>Qk!Q>ar~j?y?jJ4wl?}sfF#k)Y9M42W=RK>&_p=mna+Wi4(&sEG%P-5)plx-zh`WeNhmn0fo~-=ePaJ3N&TOK*i*?5XuEB&3ytp&>1)L5It>wZm0%YXD#HGr#p-)9(b$B_T9KxtkDU8gNb#?sHliO4YG(uhrVv#HmXB46lM@x|t-q41JIsM`LD9w^9iDD8t&A=qF8Pazw_4hvuPDO6)l{E6Ie!PFiCumTLoTm%%oMjfDjC=HLw3bFh1g7AeUG8)|rl7?V~e3+cs-!ekt{Vvh~DX@q4|FiS877d-S1O#Z@m_vxqbx4CgmWVc=Y7Xix6?Ddi6fCuVf}(>##J;jV3aoe@LNgk96(?kw)fiKW&367@y^RGL3Sio*v-0Ch#*kbm7c|UnAHK6-ovf=+cw*8h(PM#ks2m9I8izXz$q^@JRzcU$=8*V=y8!05aqqxN3?XktQX}TTWHJ2eO_u!*Sw&I^=}FW`H;bTVxoJY|puP>C%Hzz@^+9OD>IU&!%Jn%tp#I_4-^GSiRc^Y9C}rupI3UM|-+VC!rt0Z6SxUK{@HiS?c6Ds=pU%JptY-B?)_$m=fz|&pd7NC3LgMQ=0a@$kIgIA|8lGPzc&&mAORqRJXBzhVT$!n9fE}*|3RtS(JSp(n)J9nyEP2D{)2tfZ(lZx0)`RiVtRtu4*m4fQsNU)pT2@_{C0v
=WFV`}Q()a>oMkZ}A}2f=jKAq3Owm}VWqY(dYe!Ya>79Gh&7zt@H)&2;oNav>Y5}MX`Phs_GjVEWb`pgW>XCH(@QvC&PX1@_w7i&9m&eg*y^W3mH>p`FBNiloMI&P~Mdh@^>nESuJzr0L+Ep*gKeKlTeQGuT!35h|5(tXk>zn&rL|ilSpIapc!NA&u1Q?Wp=d->3tOcQ6c_WRbub_1_9N3s3#=twY&O0El#_KVAfVcEzv#0oq=FErP5@`^Ggg3wZ9ZngY5vUFrmPlwCA%M2|K@TJt5e=JXg~zJ`@FOt4sV31_W+H{!!|QYXj$>}v-U%)^NlFIPeh?yePlLq9eCu5l+xj%1Z;wvI+`#2H$DB&LZ6RAxaXm$fWCiNkK+n({4A5<9W`XX*7P3|gwD)+Q1gw}bP!PadS&ZnPw!Q9CH|QjZ}R8LVZ(gkh?Jm1Y;hrO<8XxeZaQy_zN@G^JW%lsiwu>z4glU-G##)N6o@2~)-3{-k(aDsIVOe>FD;?iNChm_P&d&L&3Yfm^x@ps+E7?$$b%;ZGncIQm8rd|?nh3U`3RLKJ;a@PFZRMmz?$iqs&(E5#upSj~GE^UAMK~ADTjBx(55aW6iH!)>465INmM-ixHXr5jV7`MVzddIkNvW06Pgkr%-bqAQs4<8PRtfdTGI4XXA`?vXoV=+FQArvZDQ`Zs^Ql*R0fc$9thk>bLDO;40uzuFRRC94`-%cY3E8lK3`n01}~$$Mq~$VTKl>B>(xDh?SPW?Ql8v*h*S(R~XE-$=k}riR^W@W?cwB>)W7_3<^yUE|Vg)vh!cnUv2By(bd@KRB3WdJf}#vQF@j{rprJo(y4lQM38w|l5hRXZxO3M1>W?nYbicaisVo~@Jk;hO-q)kAn!tacVc!ewm|XIdT~a)Gx)3)hRs^ZA&`k%0o?fIZIOp<`9A>g+&VL}n9s=}*gYL)!X<&wuL{}L;f2;=;?X-DbFJfcQHF<${3v@EFLaXNw@OjQiJf0TxTCp>Wo*ofGE;H3-eRs~l^vy~>Hjj%|NtgGO8oH@7gA0+aElGjujO|P?xwA0?^v-Ua!Av2xb`K$^K&#R)U&T2=xqc3~0c+Dahay>`U^a@PZp2|ADIq;ZAqbc-hp{o|B;RUR5^U&m#=>csUY7v(3u+DJyo%}LOAV3LAVW}~XH7=2+-X&30h|~}NxgxygHdv)yI|XnF+ieJg4ev3yhs80!iWi`WyT*|Pn*VWe+!{`h7q2YUeLUdc=#ot_8H@(E&e(S*aL&|`5ZG(aW{Ryo2Y{;E6lv5ou!b9N6B;1Td=)MJSaY+)I|wMqiZl|pv(X!-_^GQEyku+|aMidM_9+hyzD=>|_KGRhLGR^{Dr;IX6VIo==*QvX~^WxsaIhQwow2ZpqKc$M*)A)P=Kzz$#i7y^-@Md&Vo3pA0|1v4Ai*NLUN$MPC5{rPR{c7dmc!5lhZo1*P^j9kPpL{rFA{kE{oxy06u;g^xt!aBXDv!I_mau9?T5vut=G8`ik6j^xA#24K~iHE>+&i_|YY}y;OQB*?s6G~g)cKk`{6;x#6qx0m)o=GZ<0TTk*Dk2meFU%@qD+8|7!Pwb3wl(TN6dnuq)$yW9m+h$dDLg~CoM`n;G4ABX!Am3wtkj`7HTHP%dn%x7>fBg6UwpD)QM915l!W(clPd_Bqm&3=m4-^KC>HavFt2s?t+px{a3J`8$u~F!R|4LQ;;jo0!fjj2hjWT^2NIxmX;6Z;m2f)w-w*68sLC|Xs%0nJ$U1(xAC%a_>0JYLFB)!c~k%>|Df?m>xZ=GT(DTgA4()O;`2WetH9jjVp)%53$!ZY1jY?j^QGF^jO!{IJ-Ab1;MAtZ=f2X`TB`G#`OdUg*^B95u3Z*Ex?7|kJGfyMl$Nzwj=w}MgmP*x5)2P+#pS~xcFr4#<l}9$=dRsb@{-Cy2`JL6mJ~9iqrEu~Slg?+(v_$R`Ul&PZGq28*_})QTi24Bk(5LPsiUO;^TT^nX?!K~6{9pdr@1yPu>ief7JhbssaRGNz`B;IQ-LXrBm#r+`}M_b>`%zVkcDMm$pE?;##KCG(g3ugXuH#UTZ~cJlE?H3ay9W_Erda_LMfWcLYC)Smr=W`CL(E5PI5a3QZ|uBMT5rx5B~Cp=!?jaFQ1<4sfZt1wVpaO<%ob@61*331u$4HkgKZ4U}g3PB@z_N24Net^tiJ=padF5*9@ZBTkeP*z=>yWay}!DR_T&eD;8H!|=YS$Iw^5kRh{7nV=<$WwG{%l)Im7W_4$Ue#3bV$PS6Z7wrojMMv*f*vLIDZf!V+<(12xBDomcH_K&VsfI;sFs0QFA)bp6VxKl8Rw{1y5Cz&i6kAOqJUGLQc@eDdpPgD2T6pu+8W>JB#G1ZM(xbH|PUs0;uDMpv|GjKdLE5z5fTJ1bubjezVw`iHcX3A0P%p>eSQ9CBoXsxMc`a$VdECDZK)pqgy%^Vh}8~{S2F)+)N8Z&80pJJkBQ6I2EJ@paIqubetIe7@uRNyNzRLbrTz{7_Fn3ZqAfWWb&j2ipJO3kQa^QP@qS`iu<&{TC`srM0LJszTowlpg=0;tAp;m{ascA#Tdu_h395v&Pik}5my3KSq0DZQkAvzZ`k}k}n%T;q*q_}&*OXUDG0+jf7iR9wKt$u_S1eY_zq7Fh9V3sOSL+@ajbpeB#0y;{D06c)eu4{BziQD-4R_hLZt#I&bQwsKp^(dDUBc_2w7xS+=F@wHsrgn*H|LH1%ov^0NN&thF0~5~7??OY(01NKs=nVZ2E;w7`!I0CRzzHb@>A$!N$8?F7C|&xsEA1WvH!ZMlO+QWx5zr}PUYPl?Y;D@x-Wgh7spfA-iRiYULwK-72efajUoAdI?y85{$i``r{C%|ygyOOg-xGRRvH$QVp2ig4Mk)9N;F2Pa?KUtA;vs`MF0H{@?FDoDVq@c_%051l9sMr=%yY_}&<51wm%mFAQ3OGCnon=yKFU&2R^;mt{0{!XG(jKcakZJ)r`iPh*oSRsmNn)w`O6t}qIfzGp72dD*EImg0#zt7W`{To%R}iV*F>T1H3?pFd7-=2PXexDOoU4DGO^pV+Y~^(N`WyN6sknu(%EauZC7T^gW8RE4sh;E%{k0d8F*kGym9G6ADo93;B}*$Av^+u(~0ah{v&Iz4%b2$SgF$QK@KWA*Zp?j7$SN{*VL22G!>i%5T)dIO8}z?q2N8?b^pz3CevpZB?V#_Ftx!Tl|((NuyLWvkEa!sAvc<y~~z4W7(o9f1?Kx3;0KI(!YJ>+lU3?e2W>A-$SWXP_iYXEwgCVL4@F4*5Vb?ZY3vF?T|P*&sp9sQpf-=>h}Ne9z`DcdY9*j?EVA*-1l3%!_4^;00ESW0ruYnbzRK1vBYQl0ssI200dcD"))) diff --git a/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed1337.log b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed1337.log new file 
mode 100644 index 0000000000..8bb16b364d --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed1337.log @@ -0,0 +1,162 @@ +W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] +W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] ***************************************** +W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp10240_casefold + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/092aaf1c-e18e-4528-bb81-32dd0766130f.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + ppm_conf: 0.9 + ppm_enabled: False + ppm_lhi: 0.9 + ppm_llo: 0.05 + ppm_order: 5 + prequant_ttt_batch_seqs: 32 + prequant_ttt_chunk_tokens: 32768 + prequant_ttt_enabled: False + prequant_ttt_epochs: 21 + prequant_ttt_freeze_blocks: 2 + prequant_ttt_grad_clip: 1.0 + prequant_ttt_lr: 0.0005 + prequant_ttt_wd: 0.0 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 092aaf1c-e18e-4528-bb81-32dd0766130f + scalar_lr: 0.02 + seed: 1337 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_10240_casefold_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 1 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 10240 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 101 +val_tokens: 49999872 +model_params:36993112 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 
+loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.2310 val_bpb: 3.3640 +1/20000 train_loss: 9.2315 train_time: 0.0m tok/s: 7930234 +2/20000 train_loss: 12.2249 train_time: 0.0m tok/s: 7851607 +3/20000 train_loss: 10.7457 train_time: 0.0m tok/s: 7698980 +4/20000 train_loss: 9.2979 train_time: 0.0m tok/s: 7571286 +5/20000 train_loss: 8.6158 train_time: 0.0m tok/s: 7539839 +500/20000 train_loss: 3.4683 train_time: 0.9m tok/s: 7664692 +1000/20000 train_loss: 3.3543 train_time: 1.7m tok/s: 7675617 +1500/20000 train_loss: 3.3481 train_time: 2.6m tok/s: 7676392 +2000/20000 train_loss: 3.2917 train_time: 3.4m tok/s: 7677555 +layer_loop:enabled step:2010 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.0796 train_time: 4.7m tok/s: 7033965 +3000/20000 train_loss: 3.1011 train_time: 5.9m tok/s: 6648897 +3500/20000 train_loss: 3.0149 train_time: 7.2m tok/s: 6398929 +4000/20000 train_loss: 2.9019 train_time: 8.5m tok/s: 6201193 +4000/20000 val_loss: 3.0139 val_bpb: 1.0983 +4500/20000 train_loss: 2.9929 train_time: 9.7m tok/s: 6074841 +4537/20000 val_loss: 2.9536 val_bpb: 1.0764 +stopping_early: wallclock_cap train_time: 588142ms step: 4537/20000 +peak memory allocated: 39441 MiB reserved: 39550 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.95050684 val_bpb:1.07521787 eval_time:8884ms +Serialized model: 137528185 bytes +Code size: 17708 bytes (lzma compressed; raw 77814 bytes) +Saved compressed code: train_gpt.py.lzma +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15931088 bytes +Total submission size quantized+brotli: 15948796 bytes +quantized val_loss:2.99046535 val_bpb:1.08977947 eval_time:11026ms +quantized_sliding_window val_loss:2.94677885 val_bpb:1.07385932 eval_time:115225ms +ttt:start chunks=1526 ttt_lr=0.005 ttt_epochs=1 +quantized_ttt_sliding_window val_loss:2.94167940 val_bpb:1.07200099 eval_time:260518ms diff --git a/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed2025.log b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed2025.log new file mode 100644 index 0000000000..5ed5b04a9d --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed2025.log @@ -0,0 +1,162 @@ +W0430 20:29:35.906000 3798271 torch/distributed/run.py:803] +W0430 20:29:35.906000 3798271 torch/distributed/run.py:803] ***************************************** +W0430 20:29:35.906000 3798271 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0430 20:29:35.906000 3798271 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp10240_casefold + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/06536ee5-34fc-4c26-9f29-7b0833d65593.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + ppm_conf: 0.9 + ppm_enabled: False + ppm_lhi: 0.9 + ppm_llo: 0.05 + ppm_order: 5 + prequant_ttt_batch_seqs: 32 + prequant_ttt_chunk_tokens: 32768 + prequant_ttt_enabled: False + prequant_ttt_epochs: 21 + prequant_ttt_freeze_blocks: 2 + prequant_ttt_grad_clip: 1.0 + prequant_ttt_lr: 0.0005 + prequant_ttt_wd: 0.0 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 06536ee5-34fc-4c26-9f29-7b0833d65593 + scalar_lr: 0.02 + seed: 2025 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_10240_casefold_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 1 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 10240 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 101 +val_tokens: 49999872 +model_params:36993112 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.2318 val_bpb: 3.3642 +1/20000 train_loss: 9.2314 train_time: 0.0m tok/s: 8214428 +2/20000 train_loss: 12.3115 train_time: 0.0m tok/s: 7865234 +3/20000 train_loss: 10.8396 train_time: 0.0m tok/s: 7718388 +4/20000 train_loss: 9.3424 train_time: 0.0m tok/s: 7615084 +5/20000 train_loss: 8.6487 train_time: 0.0m tok/s: 7560628 +500/20000 train_loss: 3.4719 train_time: 0.9m tok/s: 7654176 +1000/20000 train_loss: 3.3554 train_time: 1.7m tok/s: 7660054 +1500/20000 train_loss: 3.3465 
train_time: 2.6m tok/s: 7660580 +2000/20000 train_loss: 3.2929 train_time: 3.4m tok/s: 7659357 +layer_loop:enabled step:2005 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.0786 train_time: 4.7m tok/s: 7006496 +3000/20000 train_loss: 3.0997 train_time: 5.9m tok/s: 6624910 +3500/20000 train_loss: 3.0112 train_time: 7.2m tok/s: 6378943 +4000/20000 train_loss: 2.8981 train_time: 8.5m tok/s: 6191042 +4000/20000 val_loss: 3.0125 val_bpb: 1.0978 +4500/20000 train_loss: 2.9916 train_time: 9.7m tok/s: 6069169 +4533/20000 val_loss: 2.9528 val_bpb: 1.0760 +stopping_early: wallclock_cap train_time: 588066ms step: 4533/20000 +peak memory allocated: 39441 MiB reserved: 39550 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.94964995 val_bpb:1.07490560 eval_time:9450ms +Serialized model: 137528185 bytes +Code size: 17708 bytes (lzma compressed; raw 77814 bytes) +Saved compressed code: train_gpt.py.lzma +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15932327 bytes +Total submission size quantized+brotli: 15950035 bytes +quantized val_loss:2.98939161 val_bpb:1.08938818 eval_time:11347ms +quantized_sliding_window val_loss:2.94578426 val_bpb:1.07349687 eval_time:115428ms +ttt:start chunks=1526 ttt_lr=0.005 ttt_epochs=1 +quantized_ttt_sliding_window val_loss:2.94093183 val_bpb:1.07172856 eval_time:291566ms diff --git a/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed42.log b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed42.log new file mode 100644 index 0000000000..c4c884ce61 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_SP10240_SimCTG_QAHSP_PostQuantTTT_OptioAI/train_seed42.log @@ -0,0 +1,162 @@ +W0430 20:49:27.147000 3805057 torch/distributed/run.py:803] +W0430 20:49:27.147000 3805057 torch/distributed/run.py:803] ***************************************** +W0430 20:49:27.147000 3805057 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0430 20:49:27.147000 3805057 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp10240_casefold + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/e66800bc-bd49-4ccb-9fd8-386effb83eec.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + ppm_conf: 0.9 + ppm_enabled: False + ppm_lhi: 0.9 + ppm_llo: 0.05 + ppm_order: 5 + prequant_ttt_batch_seqs: 32 + prequant_ttt_chunk_tokens: 32768 + prequant_ttt_enabled: False + prequant_ttt_epochs: 21 + prequant_ttt_freeze_blocks: 2 + prequant_ttt_grad_clip: 1.0 + prequant_ttt_lr: 0.0005 + prequant_ttt_wd: 0.0 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: e66800bc-bd49-4ccb-9fd8-386effb83eec + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_10240_casefold_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 1 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 10240 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 101 +val_tokens: 49999872 +model_params:36993112 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.2331 val_bpb: 3.3647 +1/20000 train_loss: 9.2325 train_time: 0.0m tok/s: 8214511 +2/20000 train_loss: 12.2056 train_time: 0.0m tok/s: 7823390 +3/20000 train_loss: 10.7709 train_time: 0.0m tok/s: 7668010 +4/20000 train_loss: 9.3267 train_time: 0.0m tok/s: 7600516 +5/20000 train_loss: 8.6376 train_time: 0.0m tok/s: 7531012 +500/20000 train_loss: 3.4646 train_time: 0.9m tok/s: 7616911 +1000/20000 train_loss: 3.3617 train_time: 1.7m tok/s: 7621589 +1500/20000 train_loss: 3.3438 
train_time: 2.6m tok/s: 7620445 +layer_loop:enabled step:1994 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2000/20000 train_loss: 3.4547 train_time: 3.4m tok/s: 7609207 +2500/20000 train_loss: 3.0803 train_time: 4.7m tok/s: 6968701 +3000/20000 train_loss: 3.1053 train_time: 6.0m tok/s: 6600569 +3500/20000 train_loss: 3.0121 train_time: 7.2m tok/s: 6360055 +4000/20000 train_loss: 2.8988 train_time: 8.5m tok/s: 6173132 +4000/20000 val_loss: 3.0128 val_bpb: 1.0979 +4500/20000 train_loss: 2.9907 train_time: 9.7m tok/s: 6052803 +4523/20000 val_loss: 2.9542 val_bpb: 1.0766 +stopping_early: wallclock_cap train_time: 588139ms step: 4523/20000 +peak memory allocated: 39441 MiB reserved: 39550 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.95121054 val_bpb:1.07547431 eval_time:8591ms +Serialized model: 137528185 bytes +Code size: 17708 bytes (lzma compressed; raw 77814 bytes) +Saved compressed code: train_gpt.py.lzma +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15935128 bytes +Total submission size quantized+brotli: 15952836 bytes +quantized val_loss:2.99148305 val_bpb:1.09015033 eval_time:10644ms +quantized_sliding_window val_loss:2.94777979 val_bpb:1.07422408 eval_time:115121ms +ttt:start chunks=1526 ttt_lr=0.005 ttt_epochs=1 +quantized_ttt_sliding_window val_loss:2.94218191 val_bpb:1.07218411 eval_time:264298ms