Add SP8192 + ParResid + DR + LoRA TTT + Mixed int4/int6/int8 + AWQ su… #1919

Open
dev-pratap-singh wants to merge 1 commit into openai:main from
dev-pratap-singh:submission/2026-04-29-sp8192-parresid-dr-loratt-mixedquant-awq

Conversation

@dev-pratap-singh

…bmission

Adds an unverified submission targeting val_bpb=1.0587 (3-seed mean) on the 10-min 8xH100 track, mirrored into the non-record track. Both folders contain the README, submission.json, and single-file train_gpt.py entry point.

Status: NOT YET VERIFIED ON H100 — per-seed train logs and runtime compliance flags are pending the 8xH100 reproduction run.

@dexhunter
Contributor

Hi @dev-pratap-singh — thanks for sharing this submission.

Quick technical note for the community thread, since I noticed there are no other comments yet:

The AWQ activation-aware scale calibration in train_gpt.py (around lines 1796-1819) appears to violate Issue #1017 Condition 3 (score-before-update). Per the README's explicit description, the AWQ activation statistics are collected from the first 4×2048 val tokens before int4 quantization, whereas Condition 3 requires:

"fix that position's score contribution before any x_t-dependent update or accounting rule"

The AWQ rescaling factors s_in are computed from activations on the first 8192 val tokens, then folded into the int4 weights. The artifact's scoring weights for those same val tokens therefore depend on the val tokens' own activations — i.e., position t's score uses a transform learned from x_t's activation.
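
Concretely, here is a minimal, self-contained sketch of that calibration path as I read it. Everything named below (`layer`, `val_tokens`, `fake_quant_int4`, the shapes) is an illustrative stand-in rather than the actual code in train_gpt.py; only the data flow (val activations → s_in → s_in folded into the int4 weights) follows the README's description.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(64, 64, bias=False)      # stand-in for one quantized projection
val_tokens = torch.randn(4 * 2048, 64)     # stand-in for the first 8192 val-token activations

# 1) Collect per-input-channel activation magnitudes on the VAL slice.
with torch.no_grad():
    a_max = val_tokens.abs().amax(dim=0)   # shape (64,), computed FROM val data

# 2) AWQ-style input scale: s_in = a_max ** alpha (alpha = 0.5 is the usual choice).
s_in = a_max.clamp(min=1e-5) ** 0.5

# 3) Fold the scales into the weights before quantization:
#    y = W x = (W * diag(s_in)) (x / s_in), so quantize W' = W * diag(s_in).
W_scaled = layer.weight.data * s_in        # broadcasts over input channels


def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    """Simplified symmetric per-output-row int4 fake-quantization."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    return (w / scale).round().clamp(-8, 7) * scale


W_int4 = fake_quant_int4(W_scaled)
# W_int4's codes now depend on the val tokens' own activations, so scoring
# those same val positions with W_int4 is the Condition 3 violation.
```

The folding itself is standard AWQ; the problem is purely where `val_tokens` comes from.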

This is the same class of issue as the pre-quant calibration on val data in PR #1350 / PR #1351 (flagged in earlier reviews), which Issue #677 (the illegal-submissions megathread) covers.

A clean fix: calibrate the AWQ scales on a held-out slice of the train shards (e.g., the last N tokens of train_files[0]) and freeze them before eval, as sketched below. This preserves the AWQ benefit while keeping Condition 3 satisfied.
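
A minimal sketch of that fix, reusing the stand-ins from the block above (the shard loader is assumed, so `train_tokens` just represents the held-out train slice):

```python
# Same folding math as above, but calibrated on a held-out TRAIN slice and
# frozen before eval. `train_tokens` is a stand-in for (embeddings of) the
# last N_CALIB tokens of train_files[0]; the loader itself is assumed.
N_CALIB = 4 * 2048
train_tokens = torch.randn(N_CALIB, 64)

with torch.no_grad():
    a_max = train_tokens.abs().amax(dim=0)      # calibration never sees val tokens
s_in = (a_max.clamp(min=1e-5) ** 0.5).detach()  # freeze scales before any eval pass

W_int4 = fake_quant_int4(layer.weight.data * s_in)
# From here on s_in and W_int4 are constants: every val position t is scored
# with a transform fixed before x_t contributed anything, so Condition 3 holds.
```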

The rest of the method looks quite clean: parallel residuals, depth recurrence, and LoRA score-first TTT all appear well-formed. Happy to discuss if I've misread the calibration path.
