Non-record: Post-Quantization LoRA Distillation (LCQ) on PR #1855 stack, val_bpb=1.06767#2128

Open
okezue wants to merge 1 commit into openai:main from okezue:lcq-nonrecord

Conversation

@okezue okezue commented May 1, 2026

Non-record submission

val_bpb = 1.06767 (seed 42, single-seed) | artifact 15,912,974 bytes | 8xH100 SXM | strict 600s train + eval

This is a non-record submission per README §"Non-record Submissions". It does not beat current SOTA, and is offered as documentation of a novel technique with detailed analysis of a negative result.

What is novel

Post-Quantization LoRA Distillation (LCQ): after GPTQ produces quantized weights, a small rank=4 LoRA is trained on the post-GPTQ dequantized model via KL divergence against the pre-quantization BF16 teacher's logits, on TRAIN data only (FineWeb_train shards). Everything fits within the 10-minute training cap: GPTQ_RESERVE_SECONDS=80 cuts main training to 520s, leaving 80s for GPTQ + LCQ. The trained LoRA is held in memory across the train-to-eval boundary in the same Python process and applied at eval via the model's existing forward_ttt path, using a new cu_seqlens-aware variable-length attention dispatch (the same BOS-aware masking the legal sliding window uses).
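
A minimal sketch of the distillation loop, assuming the postquant_lora_distill step looks roughly like this (names, shapes, and hyperparameters are illustrative, not the exact train_gpt.py code):

```python
import time
import torch
import torch.nn.functional as F

def lcq_distill(student, teacher, lora_params, train_batches,
                time_budget_s=60.0, lr=1e-3):
    """Sketch of post-quantization LoRA distillation (LCQ).

    student: post-GPTQ (dequantized) model with a rank-4 LoRA attached;
             only the LoRA parameters receive gradients.
    teacher: frozen pre-quantization BF16 model providing target logits.
    train_batches: iterator over TRAIN-only token batches (no val data).
    """
    opt = torch.optim.AdamW(lora_params, lr=lr)
    start = time.time()
    for tokens in train_batches:
        if time.time() - start > time_budget_s:  # stay inside the reserved slice
            break
        with torch.no_grad():
            t_logits = teacher(tokens)            # BF16 teacher targets
        s_logits = student(tokens)                # quantized base + LoRA
        # Forward KL(teacher || student) over the vocab dimension.
        loss = F.kl_div(
            F.log_softmax(s_logits.float(), dim=-1),
            F.softmax(t_logits.float(), dim=-1),
            reduction="batchmean",
        )
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    return lora_params
```

Only the LoRA factors receive gradients; the GPTQ base weights and the BF16 teacher stay frozen, so the reserved slice is spent entirely on the rank-4 adapter.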

Code-level changes:

  • BatchedLinearLoRA.forward extended to broadcast a bsz=1 LoRA against multi-batch eval inputs.
  • forward_ttt, _block_with_lora, and _parallel_block_with_lora extended to accept and propagate cu_seqlens / max_seqlen and dispatch to flash_attn_varlen_func accordingly (see the sketch after this list).
  • forward_ttt(..., return_logits=True) returns logits for distillation.
  • New postquant_lora_distill(...) runs the 60s KL distillation training loop; serialize calls it after GPTQ.
  • eval_val_sliding(..., lora=lora) accepts the trained LoRA and uses forward_ttt instead of forward_logits when set.
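
A hedged sketch of the variable-length dispatch pattern (the actual helper in train_gpt.py may differ; flash_attn_func / flash_attn_varlen_func are the standard FlashAttention entry points, and the shapes below are the usual packed layout):

```python
from flash_attn import flash_attn_func, flash_attn_varlen_func

def attend(q, k, v, cu_seqlens=None, max_seqlen=None):
    """Dispatch between dense and variable-length attention.

    cu_seqlens holds cumulative sequence lengths marking BOS/document
    boundaries. When provided, q/k/v are expected packed as
    (total_tokens, n_heads, head_dim) and the varlen kernel prevents
    attention across document boundaries, matching the legal
    sliding-window eval masking.
    """
    if cu_seqlens is not None:
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
            causal=True,
        )
    # Dense path: q/k/v shaped (batch, seq, n_heads, head_dim).
    return flash_attn_func(q, k, v, causal=True)
```

The bsz=1 LoRA broadcast is simpler still: expand the LoRA A/B factors along the batch dimension (e.g. A.expand(bsz, -1, -1)) before the batched matmul so the single trained adapter serves every eval batch.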

Why it's a negative result

stage                                             val_bpb
post-EMA BF16 (cut 520s training)                 1.06870
quantized                                         1.07702
quantized + sliding window + LCQ LoRA             1.06767
(reference) sliding-only at full 600s training    1.06286

LCQ recovers about 0.0094 BPB from the quantized baseline, roughly matching what the plain sliding window achieves (about 0.0099). The actual LoRA contribution on top of the plain sliding window is only about 0.0003 BPB. Meanwhile, the 80 seconds LCQ steals from main training costs about 0.005 BPB on the BF16 model, more than the LoRA recovers: net negative versus the plain sliding window at full training.
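
Spelled out from the table (the ~0.005 BPB training-time cost is the author's estimate; only the deltas below are directly derivable from the reported numbers):

```python
# val_bpb values reported in the table above
bf16_cut     = 1.06870  # post-EMA BF16, 520s training
quantized    = 1.07702  # after GPTQ
quant_lcq    = 1.06767  # quantized + sliding window + LCQ LoRA
sliding_full = 1.06286  # reference: sliding-only at full 600s training

recovery   = quantized - quant_lcq     # ~0.00935 recovered by sliding window + LoRA
quant_cost = quantized - bf16_cut      # ~0.00832 lost to quantization at 520s
gap_to_ref = quant_lcq - sliding_full  # ~0.00481 still behind sliding-only at 600s
print(recovery, quant_cost, gap_to_ref)
```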

The KL distillation training loss converged very low (~0.02), but the BPB delta is small, suggesting the residual quantization error lives in the long tail of the next-token distribution, which a rank=4 LoRA has too little capacity to correct.

Possible follow-ups (not attempted here): higher-rank LoRA (16-32) with careful artifact-size accounting, temperature-scaled distillation to up-weight tail tokens, or shipping LoRA in the artifact and running LCQ inside the eval budget on val tokens after they are graded (legal score-first on val).
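
Of these, temperature-scaled distillation is the only drop-in change to the existing loss; a sketch under the same assumptions as the loop above (T is a hypothetical hyperparameter, not something tuned in this submission):

```python
import torch.nn.functional as F

def kl_distill_loss(student_logits, teacher_logits, T=2.0):
    # Softening both distributions with T > 1 up-weights the tail of the
    # teacher's next-token distribution; the T**2 factor keeps gradient
    # magnitudes comparable across temperatures (standard distillation trick).
    s_logp = F.log_softmax(student_logits.float() / T, dim=-1)
    t_prob = F.softmax(teacher_logits.float() / T, dim=-1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)
```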

Compliance

  • Train budget: main + LCQ within the 10-minute cap (520s + ~75s).
  • Eval budget: post-quant + sliding window with LoRA, under 10 minutes.
  • Artifact: 15,912,974 bytes < 16,000,000 byte cap.
  • C3 score-first on val: LCQ trains exclusively on TRAIN data shards. The reported quantized_sliding_window val_bpb is single-pass causal scoring with the trained LoRA already loaded; no parameter updates driven by val tokens before scoring them.
  • C1 causality: cu_seqlens-aware variable-length attention masks doc boundaries during sliding window eval (same as the legal eval_val pattern).
  • No SLOT, no n-gram cache, no logit bias, no ETLB.
  • 8xH100 80GB SXM.

Files

  • README.md (full technique writeup)
  • submission.json (structured metadata)
  • train_gpt.py (LCQ implementation)
  • lossless_caps.py (CaseOps tokenizer utility)
  • train_seed42.log (full run log including LCQ training trace)

Credits

cc @cocohearts @valerio-oai for visibility (non-record review).

… stack, val_bpb=1.06767

Single-seed non-record submission documenting a novel post-quantization LoRA distillation technique. After GPTQ produces quantized weights, a rank=4 LoRA is trained at train-time on TRAIN data only (no val) via KL divergence against the pre-quantization BF16 teacher logits, then held in memory and applied at eval through forward_ttt with cu_seqlens-aware variable-length attention during sliding-window scoring.

Result: val_bpb 1.06767, artifact 15,912,974 bytes, train 520s, eval under 10 min cap. Beats post-GPTQ baseline (1.07702) by 0.00935 BPB but does not beat plain sliding window on the same stack at full 600s training (1.06286). The 80s of training time LCQ steals from main training costs about 0.005 BPB on the BF16 model, while the LoRA only recovers about 0.0003 BPB. Negative result documented with diagnosis and follow-up suggestions.

Compliance: train + eval each under 600s, artifact under 16,000,000 bytes, score-first on val (LCQ trains on TRAIN data only, no val tokens are used for parameter updates before being scored), C1 causal sliding window with BOS-aware cu_seqlens.
@Christopher-Lee-McClendon
Contributor

Nice idea! Regarding "The trained LoRA is held in memory across the train-to-eval boundary", is there room for this LoRA in your artifact size? The model/weights learned from the training set need to fit within 16,000,000 bytes.
