[Submission] Random LinearMaps + LoRA Adapters#1295

Open
austinluk wants to merge 3 commits into openai:main from austinluk:submission/random-linear-maps-lora
Conversation

@austinluk commented Apr 3, 2026

Submission

@himanshudongre

Great to see someone else exploring this direction! I've been working on the same wishlist item and just submitted my findings in PR #1301.

TL;DR: Your "Potential Improvements" section nails it — selective freezing is the key.

I tested both full freeze + adapters (your approach) and selective freeze (freeze only MLP gate+up, learn attention fully) on FineWeb data. The results are dramatic:

| Approach | Frozen % | Best CE (FineWeb) | vs Baseline |
|---|---|---|---|
| Full freeze + VeRA rank=8 | 94% | 2.3388 | +80% gap |
| Full freeze + VeRA rank=16 | 94% | 2.3288 | +79% gap |
| Full freeze + VeRA rank=32 | 94% | 2.3221 | +79% gap |
| Selective freeze (gate+up only) | 37% | 1.2792 | -1.5% (BETTER than baseline) |

Increasing adapter rank from 8→32 barely helps — the bottleneck is frozen attention weights that can't learn relational patterns, not adapter capacity.

The fix: freeze only the MLP gate and up projections (feature expansion — where Johnson-Lindenstrauss applies naturally), learn everything else. This preserves the model's ability to learn attention patterns while getting artifact savings from frozen random projections.
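The selective-freeze split described above can be sketched as a parameter-name filter. This is a minimal illustration, not the code from either PR; the name patterns (`mlp.gate_proj`, `mlp.up_proj`) are hypothetical and would need to match your model's actual parameter naming.

```python
def selective_freeze_plan(param_names):
    """Return a {name: trainable} plan: freeze only the MLP gate/up
    projections (the feature-expansion maps where Johnson-Lindenstrauss
    applies); attention, down projection, embeddings, and norms stay
    trainable. Name patterns here are illustrative."""
    FROZEN_PATTERNS = ("mlp.gate_proj", "mlp.up_proj")
    return {
        name: not any(p in name for p in FROZEN_PATTERNS)  # True = trainable
        for name in param_names
    }

names = [
    "layers.0.attn.q_proj.weight",
    "layers.0.mlp.gate_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "layers.0.mlp.down_proj.weight",
]
plan = selective_freeze_plan(names)
# In PyTorch this plan would drive: param.requires_grad_(plan[name])
```

In a real training script this plan maps straight onto `requires_grad`; only the trainable subset then needs to be serialized into the artifact.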

On the artifact-normalized comparison (the real competition question), a larger frozen model beats a smaller fully-trained model at the same artifact budget:

| Config | CE (FineWeb) | Artifact |
|---|---|---|
| 6L 192d fully-trained + dropout | 3.2531 | 2.4 MB |
| 12L 384d selective freeze + dropout | 2.8803 | 7.3 MB |

The frozen model has 4× more effective params at 3× the artifact cost — and it wins by 11.5%.

Full details + code in PR #1301. Would be interesting to see if your 12L 768d backbone with selective freeze (learn attention, freeze only MLP gate+up) closes the gap further.

@himanshudongre

Related work: I've been running extensive experiments on selective freeze (freezing gate+up projections only, 37% frozen) as an alternative to your full freeze + LoRA approach.

Key finding: selective freeze (37% frozen) dramatically outperforms full freeze + LoRA (94% frozen) — the LoRA approach has an ~80% quality gap while selective freeze shows -2.1% improvement over baseline on H100.

I also developed "progressive freeze" — train all weights fully for N steps, then freeze mid-training. This outperforms random-init freeze by 1.3 percentage points on FineWeb sp4096.
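The progressive-freeze schedule ("train everything for N steps, then freeze mid-training") reduces to a step-gated switch in the training loop. A minimal sketch, assuming a hypothetical `freeze_at_step` hyperparameter; the actual schedule used in PR #1301 may differ:

```python
def progressive_freeze_active(step, freeze_at_step=2000):
    """Progressive freeze: all weights train normally for the first
    `freeze_at_step` steps; from then on, the designated subset
    (e.g. MLP gate/up projections) is frozen for the rest of training.
    The threshold value here is a placeholder, not the PR's setting."""
    return step >= freeze_at_step

# Inside a hypothetical PyTorch training loop it would be applied once:
#
# already_frozen = False
# for step in range(total_steps):
#     if progressive_freeze_active(step) and not already_frozen:
#         for name, p in model.named_parameters():
#             if "mlp.gate_proj" in name or "mlp.up_proj" in name:
#                 p.requires_grad_(False)
#         already_frozen = True
#     ...  # forward / backward / optimizer step
```

The intuition for why this beats random-init freeze: the frozen maps are snapshots of briefly trained weights rather than pure random projections.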

Full results with 7 architecture variants across H100 and A40: PR #1301.

@MatoTeziTanka

Community Review — [Submission] Random LinearMaps + LoRA Adapters

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

**Summary.** PR #1295 ("Random Linear Maps + LoRA Adapters") submits a pure neural approach with no illegal enhancements. Head SHA: 77cec21.

**Checks**

- N-gram family bug (ILLEGAL): no n-gram, bigram, BigramHash, or XOR hash logic anywhere in the file. Clean.
- Pre-quant TTT, multi-epoch on val_tokens (ILLEGAL): `val_tokens` is loaded once at line 711 and used only for standard inference-mode evaluation in `eval_val()` (lines 207-246). `eval_val` is called twice: once per validation interval during training (line 848) and once post-quantization for a roundtrip check (line 944). Neither call trains on `val_tokens`; both run under `torch.inference_mode()` with `model.eval()`. No gradient updates touch `val_tokens` at any point. Clean.
- Score-first TTT (LEGAL): not present. No TTT pattern at all.
- Scored-region SLOT: not present.

**Architecture.** A standard 12-layer transformer (768 dim, 12 heads, 4 KV heads) where all linear layers use frozen, deterministically seeded random weights plus trainable LoRA rank-16 adapters. Only adapters, embeddings, norms, and scalars are trained and serialized; the frozen backbone is regenerated from the seed at load time (0 bytes in the artifact). Optimizer: Muon for the LoRA matrices, Adam for embeddings and scalars. Quantization: INT8 per-row quantization of the trainable-only state dict, then zlib compression. A roundtrip validation confirms the quantized model produces the same BPB before the final log.

**Conclusion.** This is a clean pure-neural submission: no illegal val_tokens training, no n-gram cheating, no scored-region slot holding. The "random backbone regenerated from seed" trick is legitimate architectural cleverness (seed stored in code, not...
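The core trick the audit describes (frozen random base weights regenerated from a seed, plus a trainable low-rank adapter) can be sketched in NumPy. This is an illustration of the technique, not the submission's code; the function names and init scales are mine, and the real implementation is PyTorch with Muon/Adam training.

```python
import numpy as np

def frozen_weight(seed, shape):
    """Regenerate a frozen base weight deterministically from a seed.
    Nothing here is serialized: the same seed reproduces the same
    matrix at load time, so the backbone costs 0 bytes in the artifact."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape).astype(np.float32) / np.sqrt(shape[1])

def lora_linear(x, seed, a, b, scale=1.0):
    """y = x @ (W_frozen + scale * A @ B): frozen random base plus a
    trainable rank-r update. Only `a` and `b` are trained and saved."""
    w = frozen_weight(seed, (a.shape[0], b.shape[1]))  # (d_in, d_out)
    return x @ (w + scale * (a @ b))

d_in, d_out, rank = 768, 768, 16
rng = np.random.default_rng(0)
a = rng.standard_normal((d_in, rank)).astype(np.float32) * 0.01
b = np.zeros((rank, d_out), dtype=np.float32)  # zero-init: adapter starts as a no-op
x = rng.standard_normal((4, d_in)).astype(np.float32)
y = lora_linear(x, seed=42, a=a, b=b)
```

With `b` zero-initialized, the layer initially equals the pure frozen random map, and the artifact only ever needs to store the `d_in*r + r*d_out` adapter values per layer rather than the full `d_in*d_out` matrix.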

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
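The INT8 per-row quantization + zlib step the audit mentions can be sketched as follows. A minimal NumPy/stdlib sketch under my own naming, not the submission's actual serialization code: each row gets its own scale so its max magnitude maps to 127, and the int8 codes plus fp32 scales are what gets compressed into the artifact.

```python
import zlib
import numpy as np

def quantize_per_row_int8(w):
    """INT8 per-row quantization: per-row scale maps each row's max
    absolute value to 127. Returns int8 codes and fp32 scales."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales).astype(np.float32)
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Reconstruct an fp32 approximation of the original matrix."""
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
q, s = quantize_per_row_int8(w)
blob = zlib.compress(q.tobytes() + s.tobytes())  # artifact payload
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # at most half a quantization step per row
```

The roundtrip BPB check the audit describes would then re-run evaluation with `w_hat` in place of `w` before the final log.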

@austinluk closed this Apr 19, 2026
@austinluk reopened this Apr 19, 2026