[Notable Non-Record Submission] Everything Everywhere All in One Bit: XNOR-mally I'd use floats - 118M XNOR-Net - 1.539 BPB - 10-Min and Unconstrained Runs #1388
Conversation
Goes without saying, but this is NOT a production-level architecture: there is a limit on how much it can learn, and the 100k steps are not really needed since 20-30k saturates the network. Some further improvements to the LR schedule (a per-step ReduceLROnPlateau mixed with cosine annealing with a longer tail) could reduce the loss slightly. Nonetheless, it was a lot of fun, even with all I spent on the RTX 5090 and 8xH100s.
Thorough work — 100+ runs across 6 phases with 0.500 BPB cumulative improvement (R1 2.074 → R40 1.574), and every negative result documented as rigorously as the wins. Standout contributions:
Noted: Phase 1 development was done on a single RTX 5090 (Vast.ai) before scaling to 8×H100 — the architecture, Triton kernel debugging, and basic training dynamics were all established on consumer hardware first. One piece of related work worth citing: BiT (arxiv:2205.13016) did full weight+activation binarization on BERT in 2022, though that's encoder-only classification rather than autoregressive LM. Your work appears to be the first to push full XNOR into the autoregressive decoder regime.
cool beans
valerio-oai left a comment
Selected for the notable non-record submissions section.
Everything Everywhere All in One Bit: XNOR-mally I'd Use Floats
118M XNOR-Net - 10-Min and Unconstrained Runs - Over 100 Experiments
Author: Ciprian-Florin Ifrim - April 2026
A full XNOR-Net language model that binarizes both weights and activations. This work extends the Binary-Weight-Network (BWN) and ternary BitNet submissions (previous PRs #640, #641, #920, #923) with a true XNOR-Net implementation, which to my knowledge is the first application of full activation binarization to transformer language models at this scale. Full explanations are available in the body of this PR and in the README of the submitted files.
Why would anyone do this?
XNOR-Nets replace float matrix multiplications with bitwise operations. In a standard neural network, computing the dot product of two 256-element vectors requires 256 multiply-adds on expensive FP units. In an XNOR-Net, both weights and activations are reduced to single bits (+1/-1, stored as 1/0), so the same dot product becomes: pack the 256 bits into 8 int32 words, XOR the weight and activation words (one instruction compares 32 element pairs at a time), then popcount the result (the number of set bits is the number of disagreements). The dot product equals group_size - 2 * popcount(XOR). That is 32 elements per clock cycle per XOR+popcount pair, versus 1 element per multiply-add on float hardware. Memory drops 32x (1 bit per weight instead of 32 bits for float32), and energy consumption drops dramatically, since integer bitwise operations use roughly 10-100x less energy than floating-point multiplies. On modern GPUs with native popcount instructions this theoretically enables a 58x compute speedup, and even on CPUs it becomes cheap with custom XNOR+popcount kernels. The catch is quality: sign() destroys magnitude information, so the challenge is training a model that stays accurate despite this extreme quantization, which is exactly what this project explores.
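To make the arithmetic concrete, here is a minimal NumPy sketch of the XOR+popcount dot product described above. It is illustrative only; the real kernels operate on packed int32 words on the GPU.

```python
import numpy as np

def xnor_dot(x_bits: np.ndarray, w_bits: np.ndarray) -> int:
    """Dot product of two +1/-1 vectors encoded as 1/0 bits, via XOR + popcount."""
    n = x_bits.size
    x_packed = np.packbits(x_bits)                    # 256 bits -> 32 bytes
    w_packed = np.packbits(w_bits)
    disagreements = sum(bin(b).count("1") for b in x_packed ^ w_packed)  # popcount(XOR)
    return n - 2 * disagreements                      # group_size - 2 * popcount(XOR)

# sanity check against the float dot product of the corresponding +1/-1 vectors
x = np.random.randint(0, 2, 256)
w = np.random.randint(0, 2, 256)
assert xnor_dot(x, w) == int(((2 * x - 1) * (2 * w - 1)).sum())
```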
If you are interested in fitness-based finetuning with EGGROLL (no gradient, no optimizer), in-between-block residuals vs per-pass residuals, the new Gram NeoMuon release, custom Triton kernels, or simply squeezing as many parameters as possible into the 16MB compressed budget, give this a read.
Best results:
Table of Contents
Architecture Overview
The model is a U-Net transformer with skip connections between encoder and decoder halves. All large weight matrices (QKV projections, attention output, MLP up/down) are binarized using the XNOR-Net approach from Rastegari et al. (2016). Small parameters (RMSNorm scales, skip weights, residual mixing, QK gains) remain in full precision.
Model Configuration (Best: R40 / N2)
U-Net Structure
The transformer is split into encoder (first N/2 layers) and decoder (remaining layers). Skip connections with learnable weights connect corresponding encoder-decoder pairs, initialized to ones. This provides error correction for the information loss inherent in binary quantization -- early features bypass the deepest (most lossy) layers.
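A minimal sketch of the learnable skip described above, assuming one scalar weight per encoder/decoder pair; names and shapes are illustrative, not the repo's exact module.

```python
import torch

class UNetSkip(torch.nn.Module):
    """Learnable skip connections between matching encoder/decoder layers."""

    def __init__(self, num_skips: int):
        super().__init__()
        # one learnable scalar per encoder/decoder pair, initialized to ones
        self.skip_weights = torch.nn.Parameter(torch.ones(num_skips))

    def forward(self, decoder_input: torch.Tensor, encoder_output: torch.Tensor, i: int) -> torch.Tensor:
        # decoder layer i receives its normal input plus the weighted encoder activation,
        # letting early features bypass the deepest (most lossy) binary layers
        return decoder_input + self.skip_weights[i] * encoder_output
```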
Weight Binarization (STE)
Each weight matrix W is binarized per group: `W_b = alpha * sign(W)`, with `alpha = mean(|W|)` computed per group of 256 weights.
During training, the Straight-Through Estimator (STE) passes gradients through sign() as if it were the identity function. The real-valued weights are maintained in float32 and updated normally; only the forward pass uses binary weights.
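A minimal PyTorch sketch of the per-group binarization plus STE, using the group size of 256 referenced elsewhere in this report; the helper name and tensor layout are illustrative.

```python
import torch

def binarize_per_group(w: torch.Tensor, group_size: int = 256) -> torch.Tensor:
    """XNOR-Net style per-group binarization with a straight-through estimator."""
    orig_shape = w.shape
    wg = w.reshape(-1, group_size)                # assumes numel is a multiple of group_size
    alpha = wg.abs().mean(dim=1, keepdim=True)    # alpha = mean(|W|) per group
    w_bin = torch.sign(wg) * alpha                # binary weights, rescaled per group
    # STE: forward pass uses the binarized weights, backward treats sign() as identity
    w_ste = wg + (w_bin - wg).detach()
    return w_ste.reshape(orig_shape)
```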
Key Technical Contributions
1. Activation Binarization Mode 2
Full XNOR (binarizing all activations) plateaus at ~2.0 bpb regardless of training duration. The root cause: the MLP down projection receives all-positive inputs from activation functions, so sign() always returns +1 -- carrying zero information. Mode 2 skips activation binarization on the MLP down projection only, breaking through to 1.575 bpb while keeping all other projections binary.
2. signsq Activation Function
`signsq(x) = x * |x|` replaces relu^2 for Mode 2. Unlike relu^2 (which outputs only positive values), signsq produces negative outputs, so subsequent sign() operations in the attention path carry real information. This is critical for quality when activations are binarized.
3. Scale QAT (Quantization-Aware Training)
Binary weight group scales (alpha = mean(|W|) per group) are stored in FP8 at save time. Without scale QAT, the model trains with float32 scales but encounters FP8 quantization error at roundtrip, causing catastrophic degradation at long training runs (0.87 bpb gap at 200k steps). Scale QAT simulates FP8 quantization via STE during training, so the model learns to compensate for precision loss. Result: gap drops from 0.87 to 0.006 bpb.
4. Triton XNOR+POPCOUNT Kernel
A custom Triton kernel performs true 1-bit matrix multiplication using XOR and population count instructions. The kernel operates on packed int32 words (32 binary weights per word) with per-group scaling factors.
5. Cosine LR Schedule
Binary STE training with a flat learning rate followed by warmdown wastes ~70% of training in a divergent plateau. Cosine decay from the start keeps every step productive and enables 4x higher peak LR (0.008 vs 0.002).
6. Sequence Length Scheduling
Training starts at seq_len=128 and ramps through 256->512->1024 over four equal time phases. Short sequences give 8x more gradient updates per second during the early phase where the model just needs to learn token frequencies. All torch.compile graphs are cached during warmup by running one forward pass at each sequence length.
7. Low Momentum for Binary STE
Standard momentum (0.95) amplifies noisy STE gradients, causing destructive sign oscillations. Reducing momentum to 0.80 dampens this noise, giving a 0.027 bpb improvement.
Development Timeline
Phase 1: RTX 5090 Development (T-series runs)
Initial development on a single RTX 5090 32GB (Blackwell SM120) on Vast.ai. Established the architecture, debugged FlashAttention3 compatibility, developed and debugged the Triton XNOR kernel, and tested basic training dynamics. Key discovery: per-group alpha scaling is essential (per-row loses 0.62 bpb).
Phase 2: 8xH100 Scaling (S-series runs)
Moved to 8xH100 SXM 80GB in a Docker container (driver 565.57, CUDA 12.7). Discovered that batch size dramatically affects binary training -- 65536 tokens outperforms 524288 by 0.07 bpb because binary networks benefit from frequent, small updates rather than rare, large ones.
Phase 3: Record Attempts (R-series runs)
Systematic hyperparameter optimization covering 42 runs. Discovered cosine LR schedule (R25-R29), momentum reduction (R31-R35), gradient clipping (R30), sequence length scheduling (R33), and scale storage optimization (R37-R40). Cumulative improvement from R1 to R40: 2.074 -> 1.574 bpb (0.500 bpb gain).
Phase 4: Notable Track (N-series runs)
Extended training at 100k-200k steps. N1 revealed the roundtrip gap problem from FP8 scale accumulation over long training. Scale QAT (N2) fixed this, achieving 1.575 roundtrip bpb.
Phase 5: EGGROLL Exploration (E-series runs)
Attempted gradient-free evolution strategies using the EGGROLL algorithm (Sarkar et al. 2026). Tested full perturbation, layer-limited perturbation, and LoRA-based perturbation across 11 runs. Found that STE+Muon finds a basin too precise for zeroth-order methods to improve upon at 115M parameters.
Phase 6: Attention Residuals (R41-R42, N3)
Implemented Attention Residuals from the Kimi Team (2026) paper as an alternative to U-Net skip connections. Each layer attends over all prior outputs via learned depth-wise attention. The 33% overhead from the depth-wise softmax reduced the number of training steps achievable in 10 minutes, resulting in worse final quality than the simpler U-Net skips.
Activation Binarization Modes
`BINARIZE_ACTIVATIONS` controls which layers have their input activations binarized:
*BWN result from separate Binary BitNet submission, not this XNOR codebase.
Why Full XNOR Plateaus at 2.0 bpb
With relu^2 or signsq activation, MLP hidden states passed through the activation are either all-positive (relu^2) or mixed-sign (signsq). When the down projection's input goes through sign(), the quality depends entirely on whether these signs carry information:
With relu^2: every hidden element is positive, so sign() returns all +1. The binary dot product `sign(x) * sign(w)` degenerates to just `sum(sign(w))` -- the activation signs carry no information. This bottleneck limits quality to ~2.0 bpb regardless of model size or training duration.
With signsq in Mode 2: the down projection receives un-binarized signsq outputs (mixed signs with magnitude information), bypassing the bottleneck. All other projections (QKV, attention out, MLP up) still use full XNOR with binarized activations.
Activation Function Analysis
relu^2 Compression Phenomenon
relu^2 makes all MLP hidden activations positive. The down projection's weight signs evolve to be highly structured (correlated within groups) because the gradient signal only comes through the positive activation channel. LZMA compresses these structured signs extremely well -- a 196M param model (16L) compresses to 15.5MB.
However, this compression comes at the cost of quality. The model trades information capacity for compressibility. With signsq, the signs are high-entropy (incompressible) but carry genuine information, yielding much better bpb.
Triton XNOR Kernel
Architecture
The kernel performs binary matrix multiplication using XOR + population count:
Each group of 256 weights is packed into 8 int32 words. The kernel accumulates per-group dot products, scales by per-group alpha, and sums across groups.
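A plain-PyTorch sketch of the math the kernel implements (per-group dot products, per-group alpha scaling, sum across groups). This is not the Triton code itself, and the tensor layouts and names are assumptions.

```python
import torch

def xnor_matmul_reference(x_sign: torch.Tensor, w_sign: torch.Tensor,
                          alpha: torch.Tensor, group_size: int = 256) -> torch.Tensor:
    """Reference math for the binary matmul.

    x_sign: (M, K) in {-1, +1}; w_sign: (N, K) in {-1, +1};
    alpha:  (N, K // group_size) per-group weight scales.
    """
    M, K = x_sign.shape
    N = w_sign.shape[0]
    G = K // group_size
    xg = x_sign.reshape(M, G, group_size)
    wg = w_sign.reshape(N, G, group_size)
    # per-group dot products, then scale each by its group alpha and sum across groups
    group_dots = torch.einsum("mgk,ngk->mng", xg, wg)
    return (group_dots * alpha.unsqueeze(0)).sum(dim=-1)
```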
Per-Group Alpha Scaling
The kernel supports per-group weight scaling factors (alpha = mean(|w|) per group), matching the STE reference path exactly. Initial versions used per-row alpha which lost 0.62 bpb of quality.
Bug Fix: int64 Promotion
Triton promotes int32 to int64 during 2D broadcast operations. When `xv[:, None] ^ wv[None, :]` creates a [BLOCK_M, BLOCK_N] tensor, the result is int64. `popc()` then dispatches to `__nv_popcll` (64-bit popcount) instead of `__nv_popc` (32-bit), counting 32 extra zero-bits for every positive int32 and 32 extra one-bits for every negative int32.
Fix: cast the XOR result back to int32 before calling popc:
bfloat16 Accumulation
The kernel accumulates in bfloat16 (not float32) to match the precision of the STE reference path and the roundtrip reconstruction. This reduced the quantization gap from 0.008 to 0.003 bpb.
Performance
At the current model size (1024d, 65536 batch tokens, 8 GPUs), the Triton kernel shows no speed improvement over the BF16 STE path (~38ms/step for both). The matrices are too small for the kernel launch overhead to be amortized. The kernel's value is correctness verification and future larger models.
Compression Pipeline
Storage Formats
Compression Comparison
Brotli consistently wins by ~50KB. zstd is worst for binary data -- it's optimized for structured text, not near-random bit patterns. The save process tries all three and picks the smallest, with a 1-byte header indicating the method for the decompressor.
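A sketch of the try-all-and-keep-smallest save step, assuming the `brotli` and `zstandard` packages and illustrative method IDs for the 1-byte header (the repo's actual header convention may differ).

```python
import lzma
import brotli              # pip install brotli
import zstandard as zstd   # pip install zstandard

METHODS = {
    0: lambda b: lzma.compress(b, preset=9),
    1: lambda b: brotli.compress(b, quality=11),
    2: lambda b: zstd.ZstdCompressor(level=19).compress(b),
}

def compress_smallest(raw: bytes) -> bytes:
    """Try all three codecs, keep the smallest, prefix a 1-byte method ID for the decompressor."""
    best_id, best_blob = min(
        ((mid, fn(raw)) for mid, fn in METHODS.items()), key=lambda t: len(t[1])
    )
    return bytes([best_id]) + best_blob
```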
FP8 Scale Storage vs BF16
FP8 scale storage saves ~0.45MB but introduces quantization error on per-group scales. For 10-minute runs (15k steps), the gap is tolerable. For 100k+ steps, the error compounds and BF16 scales are essential (or scale QAT is needed).
Sign-Sort Permutation
Post-training, MLP hidden dimensions are permuted so same-sign weight columns are adjacent. The corresponding rows of the paired projection are permuted identically, preserving model output. Intended to create long runs of identical bits for LZMA compression. Result: did not help for signsq (signs are high-entropy), only useful for relu^2 which has structured signs.
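A sketch of the permutation, assuming nn.Linear-style weight layouts, a lexicographic sort of the column sign patterns, and ignoring the per-group scale bookkeeping a real pipeline would also have to permute.

```python
import numpy as np
import torch

def sign_sort_mlp(w_up: torch.Tensor, w_down: torch.Tensor):
    """Permute MLP hidden units so similar sign patterns become adjacent.

    w_up: (hidden, d_model), w_down: (d_model, hidden). Permuting the hidden axis
    of both leaves the MLP's function unchanged because the activation is elementwise.
    """
    signs = (w_up > 0).cpu().numpy().astype(np.int64)   # (hidden, d_model) sign pattern per unit
    order = np.lexsort(signs.T[::-1])                   # lexicographic sort over sign patterns
    perm = torch.as_tensor(order)
    return w_up[perm, :], w_down[:, perm]               # same permutation on both matrices
```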
Compression Regularizer
A differentiable penalty using `tanh(10*w)` that pushes weight signs within each group toward uniformity. Controlled by `SIGN_COMPRESS_REG`. Result: hurt quality, not worth it. The regularizer fights against the STE gradient signal, reducing model capacity without sufficient compression gain.
Optimizer Exploration
Muon (Momentum + NS Orthogonalization)
The primary optimizer for binary weight matrices. Uses Newton-Schulz orthogonalization on the gradient before applying the update. Muon was chosen because it produces well-conditioned updates that help binary STE training converge faster than Adam.
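For context, a sketch of Newton-Schulz gradient orthogonalization in the style of the public Muon reference implementation; the coefficients come from that reference, and the 3-step variant used here may differ in details.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 3) -> torch.Tensor:
    """Approximately orthogonalize a gradient matrix before the Muon update."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients (public Muon reference)
    x = g.bfloat16()
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    x = x / (x.norm() + 1e-7)                 # normalize so the iteration converges
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * A @ A
        x = a * x + B @ x
    if transposed:
        x = x.T
    return x
```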
Original 3-step NS wins. Binary STE gradients are inherently noisy because sign() is a discontinuous function. More precise orthogonalization (Gram NS with 5 steps) doesn't help because the gradient itself is approximate. The library's float16 precision actively hurts because bfloat16's larger dynamic range matters more than mantissa precision for binary training.
NS Step Count Ablation
3 steps is the sweet spot. 2 steps under-orthogonalizes, producing updates that are poorly conditioned and create a huge roundtrip gap (0.049). 5 steps over-orthogonalizes noisy STE gradients, wasting compute on precision that doesn't exist in the signal.
Momentum
Momentum controls how much of the previous gradient update is carried forward. In standard float training, high momentum (0.95) smooths out mini-batch noise. But for binary STE training, each gradient is fundamentally approximate because sign() is not differentiable. High momentum amplifies these approximation errors, causing weights to oscillate across zero (flipping their sign back and forth unproductively).
At 0.80, the noise from STE gradient errors is dampened enough that the model trains stably, but there is still enough momentum to escape shallow local optima. At 0.75, gradients become too noisy (not enough smoothing), and the roundtrip gap doubles -- weights jitter more and quantize poorly.
EMA (Exponential Moving Average)
EMA averages weights over recent training history. For float models this smooths out noise, but for binary models it's catastrophic. The averaged weights have less decisive signs -- they sit closer to zero where sign() is maximally sensitive to perturbation. During roundtrip (load from compressed artifact), these near-zero weights flip unpredictably, destroying quality. EMA is harmful for binary models.
Learning Rate and Schedule Analysis
LR Schedule: Linear Warmdown vs Cosine
The training loss curve for binary STE networks shows a distinctive "wandering" pattern. After an initial drop (steps 0-3000), loss increases and oscillates for thousands of steps (3000-9000) before dropping again during warmdown. This happens because the LR is too high for stable binary training -- each step flips thousands of weight signs, some productive, some destructive. The productive and destructive flips roughly cancel out, so the model wanders sideways.
With cosine decay, the LR starts decreasing immediately after warmup. There is no sustained high-LR plateau, so the wandering phase is compressed. More importantly, cosine enables a much higher peak LR (0.008 vs 0.002) because the rapid decay prevents the accumulated noise from causing divergence.
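A sketch of the schedule, using the 0.008 peak from the sweep below and an illustrative warmup length (the run configs set their own values).

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 0.008, warmup: int = 256) -> float:
    """Cosine decay starting immediately after warmup, with no flat high-LR plateau."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup               # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```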
Cosine LR Sweep
The peak is at 0.008. Below that, the model learns too slowly in the available training time. Above that, excessive sign flips early in training prevent the model from finding a good basin.
Gradient Clipping
Small improvement. Gradient clipping prevents any single batch from causing a catastrophic cascade of sign flips. In binary networks, a large gradient can flip the sign of many weights simultaneously, and the resulting binary network can be dramatically different from what the optimizer expected.
Sequence Length Scheduling
Training starts at seq_len=128 and doubles at equal time intervals: 128->256->512->1024.
The reasoning: early in training, the model needs to learn basic token frequencies and simple bigram patterns. These require only short context. Processing short sequences is 8x faster than full 1024 (attention is quadratic), giving 8x more gradient updates per second. Once the model has learned local patterns, longer sequences allow it to learn long-range dependencies.
Implementation details: the schedule is based on either wall-clock time (if MAX_WALLCLOCK_SECONDS > 0) or step count (if using iterations). Each torch.compile graph is cached during warmup by running one forward-backward pass at each sequence length.
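A sketch of the phase lookup, assuming a normalized progress value (wall-clock or step based, as described above).

```python
def seq_len_at(progress: float) -> int:
    """Four equal phases: 128 -> 256 -> 512 -> 1024, indexed by fraction of training elapsed."""
    schedule = [128, 256, 512, 1024]
    phase = min(int(progress * len(schedule)), len(schedule) - 1)
    return schedule[phase]
```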
The 0.019 bpb gain from scheduling is entirely free -- the same total tokens are processed, just in a more efficient order. The model gets ~4x more gradient updates during the first quarter of training.
Batch Size Sweep
Binary STE training strongly prefers smaller batches with more frequent updates. Each sign() decision is discrete -- once a weight's sign flips, the effect on the network is immediate and discontinuous. More frequent updates mean the model can react to the consequences of each sign flip sooner, correcting mistakes before they propagate.
65536 is the sweet spot. Below that, per-batch gradient noise increases without speed benefit (DDP communication overhead dominates at small batch sizes). Above that, the model makes fewer sign-flip decisions per second, losing the benefit of frequent updates.
With 8 GPUs at 65536 total batch: 8192 tokens per GPU, well within VRAM. Step time is dominated by DDP synchronization, not compute.
Architecture Ablations
Group Size
The group size controls how many weights share a single scaling factor (alpha). Smaller groups give finer-grained scaling but noisier per-group statistics (fewer elements to average over). Larger groups give stable statistics but coarser approximation.
256 is optimal. At 128, per-group mean(|w|) over 128 elements is noisy. At 512+, a single alpha must represent weights with different magnitudes, losing precision.
Layers vs MLP Width
Wider MLP is strictly better than deeper for binary networks. Each layer applies sign() to its output, which is a lossy operation that compounds across depth. More layers = more compounding information loss. A wider MLP gives more capacity per layer without the compounding. The 10L 4x config fits the 16MB budget optimally.
Wider Model (768d) vs Standard (1024d)
Even with 80% more layers, the narrower model is worse. Binary networks lose information per layer, so depth hurts more than width helps.
BPE Vocabulary
Smaller vocabulary saves 1.85MB of embedding FP params, allowing more binary parameters within the 16MB budget. The larger vocabulary doesn't compensate for the lost binary capacity.
Embedding Dimension
384 embed_dim with BF16 scales is the sweet spot -- richer embedding space within budget, and the BF16 scales avoid roundtrip degradation. 512 embed with FP8 scales destroys roundtrip at long training due to accumulated scale quantization error.
Logit Softcap
10 is better. Lower softcap constrains logits more, regularizing the model. Uses a polynomial approximation (`x * (1 - x^2/3 + x^4/15)`) instead of tanh because tanh doesn't fuse with torch.compile.
Smear Module
Smear didn't help with sequence length scheduling enabled. The scheduling already provides the "easy then hard" curriculum that smear approximates.
Size Check Runs (T28-T34)
Architecture variants tested for 16MB budget fit:
Scale QAT and Roundtrip Gap
The Problem
During training, per-group weight scales (alpha = mean(|w|)) are computed in float32. At save time, these scales are quantized to FP8 for storage, an error the model never sees during training. Over 200k steps, the model becomes precisely tuned to float32 scale values that FP8 cannot represent, causing catastrophic roundtrip degradation.
N1's 100k checkpoint had roundtrip 1.986, 30k checkpoint had 2.121 -- the error compounds monotonically with training steps.
The Fix
Scale QAT simulates FP8 quantization on scales during the forward pass via STE:
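The sketch below assumes PyTorch's `torch.float8_e4m3fn` dtype for the FP8 round-trip; the actual FP8 flavor and helper names may differ.

```python
import torch

def fake_quant_scale_fp8(scale: torch.Tensor) -> torch.Tensor:
    """Simulate the FP8 storage round-trip on per-group scales during the forward pass."""
    quantized = scale.to(torch.float8_e4m3fn).to(scale.dtype)
    # STE: forward sees the quantized scale, backward flows through the float32 scale
    return scale + (quantized - scale).detach()
```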
The model sees the quantized scale values during training and learns to compensate. Combined with BF16 scale storage (which has negligible quantization error), the roundtrip gap stays below 0.006 bpb even at 100k steps.
Attention Residuals
Background
Attention Residuals (Kimi Team, 2026) replace standard residual connections with learned depth-wise attention. Instead of `h_l = h_{l-1} + f(h_{l-1})`, each layer attends over ALL prior outputs: `h_l = softmax_weighted_sum(all previous outputs)`. This allows later layers to selectively retrieve information from any earlier layer, bypassing lossy intermediate sign() operations.
Implementation
Two modes were implemented:
Queries are zero-initialized so the model starts with uniform weights (equivalent to standard residual). Keys are RMSNorm'd stored outputs. No projection matrices needed.
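A sketch of the depth-wise attention residual under those choices (zero-initialized queries, RMSNorm'd keys, no projections), assuming a recent PyTorch with `nn.RMSNorm`; the module and tensor names are assumptions, not the repo's exact code.

```python
import torch
import torch.nn.functional as F

class DepthwiseAttentionResidual(torch.nn.Module):
    """Each layer mixes all prior block outputs via a learned softmax over depth."""

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.queries = torch.nn.Parameter(torch.zeros(num_layers, dim))  # zero init -> uniform mix
        self.norm = torch.nn.RMSNorm(dim)

    def forward(self, layer_idx: int, stored: list[torch.Tensor]) -> torch.Tensor:
        # stored: outputs of all prior layers, each of shape (batch, seq, dim)
        keys = torch.stack([self.norm(h) for h in stored], dim=0)         # (L, B, S, D)
        logits = torch.einsum("d,lbsd->lbs", self.queries[layer_idx], keys)
        weights = F.softmax(logits, dim=0)                                 # softmax over depth
        return torch.einsum("lbs,lbsd->bsd", weights, torch.stack(stored, dim=0))
```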
Results
Analysis
AttnRes adds 33% overhead (51.7ms vs 38.6ms) from the depth-wise softmax computation over stored tensors. In the 10-minute track, this overhead means ~4000 fewer training steps, which more than negates any architectural benefit. Even at 100k steps (N3 vs N2), U-Net wins by 0.021 bpb in roundtrip.
The overhead comes from: storing 10+ tensors, computing 10 einsum operations for logits, softmax, and weighted sum each forward pass. Torch.compile partially fuses these but the softmax reduction dimension is too small (10 elements) for efficient GPU execution.
Mode 1 (sub-layer) crashed with an Inductor OOM error -- the backward graph with 20 stored tensors exceeds Triton's register file limits for the fused RMSNorm backward kernel.
Conclusion: for binary networks, simple weighted skip connections (U-Net) provide sufficient error correction at much lower overhead than learned depth-wise attention.
Complete Run Log
T-series: RTX 5090 Testing
S-series: 8xH100 Scaling
R-series: Record Attempts
P-series: Push/Submit (10-min track, 3 seeds)
N-series: Notable Track
EGGROLL Exploration
Background
EGGROLL (Sarkar et al. 2026) uses rank-r low-rank perturbations for efficient evolution strategies. Instead of sampling full-rank noise matrices, it samples A in R^(m x r) and B in R^(n x r) and forms E = (1/sqrt(r)) * AB^T. This enables gradient-free optimization that bypasses the STE entirely, directly optimizing the loss function over the binary weight space.
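A sketch of the perturbation sampling step only; the population loop and fitness weighting are handled elsewhere in the training loop.

```python
import torch

def lowrank_perturbation(m: int, n: int, r: int, sigma: float,
                         generator: torch.Generator | None = None) -> torch.Tensor:
    """Rank-r ES perturbation E = sigma * (1/sqrt(r)) * A @ B.T, as in EGGROLL."""
    A = torch.randn(m, r, generator=generator)
    B = torch.randn(n, r, generator=generator)
    return sigma * (A @ B.T) / (r ** 0.5)
```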
The motivation for trying EGGROLL on our binary network: the STE is a fundamentally approximate gradient. EGGROLL evaluates the true loss function (with actual sign() and quantization), so it could potentially find better solutions than STE-based gradient descent.
Implementation
Three approaches were implemented:
Full perturbation: Perturb all 115M binary weight parameters directly. Each perturbation adds sigma * (1/sqrt(r)) * AB^T to the float weights before binarization.
Layer-limited perturbation: Perturb only the last N layers (controlled by EGGROLL_LAYERS). Reduces dimensionality from 115M to 11.5M-34M.
LoRA perturbation: Create LoRA adapter pairs (A, B) for each binary weight matrix. Perturb only the LoRA parameters (~614K params at rank 4). Before each forward pass, merge LoRA into base weights, evaluate, then unmerge. Final model merges LoRA permanently.
Results: Full Perturbation
From scratch (E1-E2), ES cannot navigate the 115M-dimensional landscape at any sigma or population size. Even 4096 population provides zero useful gradient signal.
From pretrained weights (E3-E7), the perturbation scale is critical. Too large (E3, sigma=0.01): every perturbation destroys the trained model, so the "best" direction is just "least bad." Too small (E4, sigma=0.0001): fitness differences become noise-dominated. The sweet spot (E5, sigma=0.0001, pop=4096) is stable but shows zero improvement -- every perturbation direction is uphill from the STE-found basin.
Results: Layer-Limited Perturbation
Fewer parameters to perturb means less damage per step, but still no improvement. The model is at a local optimum in every direction, even when only searching an 11.5M-dimensional subspace.
Results: LoRA Perturbation
LoRA brings the perturbation dimensionality down to 614K -- manageable for ES. E10 with pop=4096 was nearly stable (only +0.004 degradation vs +0.040 for direct perturbation of the same parameters). But still no improvement, and E11 at pop=16384 was too slow at 229s/step to be practical.
Why EGGROLL Cannot Improve on STE+Muon
The fundamental issue is signal-to-noise ratio. With rank-1 perturbations in d-dimensional space, the cosine similarity between any random perturbation and the true gradient is approximately 1/sqrt(d). For d=115M, this gives ~0.00009. Population size N improves this by sqrt(N), so pop=4096 gives ~0.006 -- still 99.4% noise.
The EGGROLL paper's successful pretraining used a 256-dim model with up to 1M population. For 115M params, the required population would be orders of magnitude larger than is practical.
LoRA reduces d to 614K, giving ~0.04 per perturbation and ~2.5 with pop=4096. Better, but the LoRA subspace may not contain the improvement direction. The STE+Muon optimizer has access to 115M-dimensional gradient information per step, which is fundamentally more informative than 4096 scalar fitness samples.
Multi-Seed Variance
Three seeds (42, 7, 1337) were run with the P1 config to estimate variance:
Standard deviation of ~0.012 bpb across seeds. This is typical for binary networks where early sign choices cascade -- a different random initialization puts the model into a different basin, and small differences compound through the sign() operations.
Final Configuration
P1: 10-Minute Track Submission (R40 config)
Best single run: R40 -- 1.574 val, 1.578 roundtrip, 15.96MB
Three-seed mean: 1.602 +/- 0.012 roundtrip, 1.567 +/- 0.012 sliding
N2: Notable Track (100k steps)
Same as P1 but:
Result: 1.569 val, 1.575 roundtrip, 1.539 sliding, 15.91MB
Reproduction
Requirements
Setup
Training
Data
FineWeb 10B dataset with 1024 BPE tokenizer. 80 training shards, 1 validation shard (~40.5M tokens).
Key Insights
Binary networks need frequent, small updates. Batch size 65536 >> 524288 for quality. Each sign() is a discrete decision -- more decisions per second means faster convergence.
Full XNOR activation binarization has a quality ceiling around 2.0 bpb due to the MLP information bottleneck. Mode 2 (skipping MLP down proj) breaks through to 1.575.
Momentum should be lower than standard (0.80 vs 0.95) because STE gradient noise is amplified by momentum, causing destructive sign oscillations.
Cosine LR schedule is essential for binary STE training. Flat LR with warmdown wastes 70% of training time in a divergent plateau.
Sequence length scheduling provides free improvement -- short sequences at the start give 8x more gradient updates during the phase where the model needs to learn token frequencies.
Wider is better than deeper for binary networks. Each sign() compounds information loss across layers, but wider MLP gives more capacity per layer.
EMA is harmful for binary models -- the averaged weights have less decisive signs that don't survive quantization.
Scale QAT is essential for long training runs. Without it, FP8 scale quantization error accumulates over steps and causes catastrophic roundtrip degradation (0.87 bpb gap at 200k steps).
Attention Residuals add overhead without benefit for binary networks. The 33% slower steps reduce training progress more than depth-wise attention helps. Simple U-Net skips are sufficient.
EGGROLL cannot improve on STE+Muon at 115M parameters. The signal-to-noise ratio of zeroth-order methods is too low for practical population sizes. Even LoRA-based EGGROLL (614K params, pop=4096) shows no improvement from the STE-found basin.
References
License
This project is submitted for the OpenAI Parameter Golf Challenge. All work and experiments are credited to Ciprian-Florin Ifrim, with the aforementioned references. Document formatted by Claude.