Non-record: Post-Quantization LoRA Distillation (LCQ) on PR #1855 stack, val_bpb=1.06767 #2128
Open
okezue wants to merge 1 commit into openai:main from
Conversation
… stack, val_bpb=1.06767

Single-seed non-record submission documenting a novel post-quantization LoRA distillation technique. After GPTQ produces quantized weights, a rank=4 LoRA is trained at train-time on TRAIN data only (no val) via KL divergence against the pre-quantization BF16 teacher logits, then held in memory and applied at eval through forward_ttt with cu_seqlens-aware variable-length attention during sliding-window scoring.

Result: val_bpb 1.06767, artifact 15,912,974 bytes, train 520s, eval under the 10-minute cap. Beats the post-GPTQ baseline (1.07702) by 0.00935 BPB but does not beat plain sliding window on the same stack at full 600s training (1.06286). The 80s of training time LCQ steals from main training costs about 0.005 BPB on the BF16 model, while the LoRA only recovers about 0.0003 BPB. Negative result documented with diagnosis and follow-up suggestions.

Compliance: train + eval each under 600s, artifact under 16,000,000 bytes, score-first on val (LCQ trains on TRAIN data only; no val tokens are used for parameter updates before being scored), C1 causal sliding window with BOS-aware cu_seqlens.
Contributor

Nice idea! Regarding "The trained LoRA is held in memory across the train-to-eval boundary": is there room for this LoRA in your artifact size? The model/weights learned from the training set need to fit within 16,000,000 bytes.
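For scale, here is a back-of-envelope check of what a rank=4 LoRA would add to the artifact. The dimensions below are assumptions for illustration, not figures from the PR; only the rank, the artifact size, and the 16,000,000-byte cap come from the submission.

```python
# Rough artifact-size accounting for a rank-4 LoRA.
# A LoRA pair (A: r x d_in, B: d_out x r) adds r * (d_in + d_out) params per layer.
rank = 4
d_model = 768          # assumed model width, not from the PR
n_lora_layers = 24     # assumed number of adapted linear layers, not from the PR
bytes_per_param = 2    # BF16

params = n_lora_layers * rank * (d_model + d_model)   # square projections assumed
print(f"LoRA params: {params:,} -> {params * bytes_per_param:,} bytes")
# 147,456 params ~= 294,912 bytes under these assumptions; the artifact is
# 15,912,974 of 16,000,000 bytes, leaving only 87,026 bytes of headroom,
# so a LoRA of this size would not fit in the artifact as-is.
```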
Non-record submission
val_bpb = 1.06767 (seed 42, single-seed) | artifact 15,912,974 bytes | 8xH100 SXM | strict 600s train + eval
This is a non-record submission per README §"Non-record Submissions". It does not beat current SOTA, and is offered as documentation of a novel technique with detailed analysis of a negative result.
What is novel
Post-Quantization LoRA Distillation (LCQ): after GPTQ produces quantized weights, a small rank=4 LoRA is trained on the post-GPTQ dequantized model via KL divergence against the pre-quantization BF16 teacher logits, on TRAIN data only (FineWeb_train shards), entirely within the 10-minute training cap (`GPTQ_RESERVE_SECONDS=80` cuts main training to 520s, leaving 80s for GPTQ + LCQ). The trained LoRA is held in memory across the train-to-eval boundary in the same Python process and applied at eval via the model's existing `forward_ttt` path with a new `cu_seqlens`-aware variable-length attention dispatch (the same BOS-aware masking the legal sliding window uses). A sketch of what such a distillation loop could look like follows.
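The actual loop lives in `train_gpt.py`; this is a minimal sketch of the shape it could take. `teacher_logits_fn`, `train_batches`, and the optimizer settings are illustrative assumptions, not the PR's code; only `postquant_lora_distill` and `forward_ttt(..., return_logits=True)` are named in the submission.

```python
import time
import torch
import torch.nn.functional as F

def postquant_lora_distill(student, teacher_logits_fn, lora_params, train_batches,
                           budget_s=60.0, lr=1e-3):
    """Train a small LoRA on the dequantized student to match the BF16 teacher.

    teacher_logits_fn: returns frozen pre-quantization BF16 logits (assumption).
    lora_params: the only tensors with requires_grad=True.
    """
    opt = torch.optim.AdamW(lora_params, lr=lr)
    start = time.time()
    for tokens in train_batches:               # TRAIN shards only, no val tokens
        if time.time() - start > budget_s:     # stay inside the time budget
            break
        with torch.no_grad():
            t_logits = teacher_logits_fn(tokens)                 # frozen teacher
        s_logits = student.forward_ttt(tokens, return_logits=True)
        # KL(teacher || student) over the next-token distribution
        loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.log_softmax(t_logits, dim=-1),
                        log_target=True, reduction="batchmean")
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    return lora_params
```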
Code-level changes:

- `BatchedLinearLoRA.forward` extended to broadcast a bsz=1 LoRA against multi-batch eval inputs.
- `forward_ttt`, `_block_with_lora`, `_parallel_block_with_lora` extended to accept and propagate `cu_seqlens`, `max_seqlen` and dispatch to `flash_attn_varlen_func` accordingly (see the sketch after this list).
- `forward_ttt(..., return_logits=True)` returns logits for distillation.
- New `postquant_lora_distill(...)` runs the 60s KL distillation training loop.
- `serialize` calls it after GPTQ.
- `eval_val_sliding(..., lora=lora)` accepts the trained LoRA and uses `forward_ttt` instead of `forward_logits` when set.
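A minimal sketch of the variable-length dispatch, assuming the public `flash_attn` varlen API. The wrapper itself is illustrative; only `flash_attn_varlen_func` and the `cu_seqlens`/`max_seqlen` plumbing are named in the PR.

```python
from flash_attn import flash_attn_varlen_func

def attn_varlen(q, k, v, cu_seqlens, max_seqlen):
    # q, k, v: (total_tokens, n_heads, head_dim), documents packed end-to-end.
    # cu_seqlens: int32 prefix sums of per-document lengths, e.g. [0, 512, 900, ...].
    # Each document attends only within itself, so causal attention never
    # crosses a BOS boundary (the same masking the legal sliding window uses).
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        causal=True,
    )
```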
Why it's a negative result

LCQ recovers about -0.0094 BPB from the quantized baseline, which roughly matches what plain sliding window recovers (-0.0099). The LoRA's actual contribution on top of plain sliding window is only about -0.0003 BPB. Meanwhile, the 80 seconds LCQ steals from main training costs about +0.005 BPB on the BF16 model, more than the LoRA recovers. Net negative versus plain sliding window at full training, as the arithmetic below shows.
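Spelling out the accounting, using only figures reported in this PR:

```python
# Net effect of LCQ versus plain sliding window at full 600s training:
lcq_bpb   = 1.06767   # this PR (520s main training + LCQ)
plain_bpb = 1.06286   # plain sliding window, full 600s training, same stack
print(f"net cost: +{lcq_bpb - plain_bpb:.5f} BPB")   # +0.00481
# ~= +0.005 lost to the 80s of stolen training, minus the ~0.0003 the LoRA recovers.
```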
KL distillation training loss converged very low (~0.02), but the BPB delta is small, suggesting the residual quantization error lives in the long tail of the next-token distribution, where a rank=4 LoRA has too little capacity to fix it.
Possible follow-ups (not attempted here): higher-rank LoRA (16-32) with careful artifact-size accounting, temperature-scaled distillation to up-weight tail tokens, or shipping LoRA in the artifact and running LCQ inside the eval budget on val tokens after they are graded (legal score-first on val).
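For the temperature-scaled follow-up, the standard distillation formulation (not attempted in this PR) would look roughly like this, with `T` softening both distributions:

```python
import torch.nn.functional as F

def kd_loss(s_logits, t_logits, T=2.0):
    # Softening both distributions with T > 1 flattens the teacher's next-token
    # distribution, up-weighting tail tokens where the quantization error may live.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return T * T * F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                            F.log_softmax(t_logits / T, dim=-1),
                            log_target=True, reduction="batchmean")
```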
Compliance
- Train and eval each ran under the 600s cap; the artifact is under 16,000,000 bytes.
- Score-first on val: LCQ trains on TRAIN data only; no parameter updates are driven by val tokens before scoring them.
- `quantized_sliding_window` val_bpb is single-pass causal scoring with the trained LoRA already loaded (following the `eval_val` pattern).
- C1 causal sliding window with BOS-aware `cu_seqlens` (see the sketch after this list).
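A minimal sketch of how BOS-aware `cu_seqlens` can be built for a packed window; the helper and its names are illustrative, not the PR's code.

```python
import torch

def bos_cu_seqlens(tokens, bos_id):
    # tokens: 1-D packed window; assumes every document starts with bos_id and
    # the window itself starts on a BOS, so the first boundary is index 0.
    # cu_seqlens marks document boundaries so attention never crosses a BOS.
    bos_pos = (tokens == bos_id).nonzero(as_tuple=True)[0].to(torch.int32)
    total = torch.tensor([tokens.numel()], dtype=torch.int32, device=tokens.device)
    cu_seqlens = torch.cat([bos_pos, total])
    max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())
    return cu_seqlens, max_seqlen
```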
Files

- `README.md` (full technique writeup)
- `submission.json` (structured metadata)
- `train_gpt.py` (LCQ implementation)
- `lossless_caps.py` (CaseOps tokenizer utility)
- `train_seed42.log` (full run log including LCQ training trace)

Credits
cc @cocohearts @valerio-oai for visibility (non-record review).