Quadtrix.cpp: a local LLM stack you can actually read
Quadtrix.cpp is a complete local language model system built in C++ and CUDA, with a PyTorch reference path alongside it. The goal is simple: keep the entire stack visible. No black boxes. No hidden autograd. No framework magic you have to trust blindly.
The repository ships two complementary paths:
main.cu: the native C++/CUDA path
train_quadtrix.py: the PyTorch reference path
Both implement the same model. You can train in PyTorch, export the checkpoint, and run inference natively. Or do everything natively. The two paths stay in sync by design.
The model
The architecture is a standard decoder-only transformer.
Each transformer block follows pre-LN ordering:
Pre-LN is a practical choice. It keeps the residual stream alive throughout depth and makes training more stable as you add layers.
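As a rough illustration of pre-LN ordering, here is a minimal sketch, not the repository's code. It assumes each block is an attention sub-layer followed by a feed-forward sub-layer. The names `Vec`, `SubLayer`, `layer_norm`, and `pre_ln_block` are hypothetical, and the layer norm omits the learned gain and bias for brevity.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<float>;
using SubLayer = std::function<Vec(const Vec&)>;  // hypothetical stand-in for attention / MLP

// Normalise x to zero mean and unit variance (learned gain and bias omitted).
Vec layer_norm(const Vec& x, float eps = 1e-5f) {
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= static_cast<float>(x.size());
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= static_cast<float>(x.size());
    const float inv_std = 1.0f / std::sqrt(var + eps);
    Vec y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = (x[i] - mean) * inv_std;
    return y;
}

// Pre-LN: each sub-layer sees a normalised copy of the input, and its output
// is added back onto the residual stream, which itself is never normalised.
Vec pre_ln_block(const Vec& x, const SubLayer& attention, const SubLayer& feed_forward) {
    Vec h = x;
    const Vec a = attention(layer_norm(h));
    for (std::size_t i = 0; i < h.size(); ++i) h[i] += a[i];
    const Vec f = feed_forward(layer_norm(h));
    for (std::size_t i = 0; i < h.size(); ++i) h[i] += f[i];
    return h;
}
```

Because the residual path carries `x` forward without normalising it, a sub-layer that outputs zeros leaves the input unchanged. That is one way to read "keeps the residual stream alive".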
Training runs
We have three runs across different hardware and configurations. Here is what they look like.
Run 1: CUDA / bf16 · 10.84M params
This is the fastest and best-performing run.
The loss curve drops sharply in the first 500 steps, from ~11 down to ~3, then continues improving steadily to a final train loss of 2.2825 and a best validation loss of 2.3918. The smoothed curve tracks the raw loss cleanly with no major instability events.
The throughput plot tells the hardware story. After a short warmup at 0.3k tok/s, the CUDA path ramps to a peak of 19.6k tok/s and holds it flat for the rest of the run. Eighty minutes of training at that rate.
The gradient norm peaks at 2.25 at step 1200, then settles into a stable band around 1.75–2.0. That is a healthy pattern. A norm that keeps climbing is a sign of instability. This one stabilizes.
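The post does not say how the gradient norm is computed. A common convention, assumed here, is the global L2 norm over every parameter gradient treated as one flat vector. The function name `global_grad_norm` is hypothetical.

```cpp
#include <cmath>
#include <vector>

// Global L2 norm: sqrt of the sum of squares over all gradient tensors.
// Accumulates in double to limit rounding error across many parameters.
float global_grad_norm(const std::vector<std::vector<float>>& grads) {
    double sum_sq = 0.0;
    for (const auto& g : grads) {
        for (float v : g) sum_sq += static_cast<double>(v) * static_cast<double>(v);
    }
    return static_cast<float>(std::sqrt(sum_sq));
}
```

Logging this value once per step is enough to reproduce a curve like the one described above, under that assumption.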
The loss Δ per eval interval chart shows the model improving at every checkpoint in the first quarter of training, with most of the cumulative loss drop coming in the first 2,000 steps. After that the per-interval improvements are small but consistent, with no eval interval where validation got worse.
The grad norm vs loss scatter (coloured by step) shows a clean linear trend. Early steps (dark blue) have high loss and low norm. Late steps (yellow) cluster in the bottom right. The model is learning exactly as expected.
Key numbers:
Run 2: CPU · 6.68M params
Same wall time as the CUDA run but on CPU with a smaller model and half the training tokens. The best validation loss is 2.9971 at step 6,800, which is reasonable given the constraints.
The loss curves show train and val tracking each other closely for the full run — the two lines are nearly on top of each other from step 2,000 onward. That is a good sign. It means the model is not overfitting to the training set and the validation split is representative.
The generalisation gap chart (val − train) is the most interesting part of this run. The gap oscillates around zero for most of training, peaking at +0.15 around step 3,900. The final gap settles at +0.0567, meaning validation loss is only marginally above train loss at the end. For a 6.68M parameter model on 7.1M tokens this is a tight result.
The checkpoint dots show the model saved frequently in the first 2,000 steps as it improved quickly, then less often as improvements became incremental.
Key numbers:
Run 3: PyTorch CPU · 6.68M params
This is the PyTorch reference path. Same model size as Run 2, same device, but trained with the Python stack and a different dataset split. Best val is 4.1319 — higher than the native path, which makes sense given the smaller number of tokens processed and the character-level nature of this run's data.
The loss curves show a clean descent. Train loss reaches ~3.0 by the end while val stays around 4.1, giving a widening generalisation gap. The gap chart shows this clearly: after an initial dip it rises steadily, peaking at 8.965 and settling at 0.048 at step 5,500. The model is learning the training distribution faster than it generalises. More data or stronger regularisation would help here.
The gradient norm peaks at 2.2433 at step 3,395, almost identical in magnitude to Run 1's peak of 2.25 at step 1,200. The norm stabilises around a mean of 1.337 for the rest of training. That agreement between the native CUDA run and the PyTorch run is a good signal that the two implementations are aligned.
Throughput on CPU is much lower: warmup at 885 tok/s, settling to a mean of 791 tok/s. Step time averages 656.7 ms, with 113 spikes above 937 ms, likely GC pauses or OS scheduling. The CUDA run at 19.6k tok/s is about 25× faster.
The val loss Δ per eval chart shows almost all improvement happening in the first 500 steps. After that the green bars (improvement) are small and red bars (worsening) are nearly absent — the model has mostly converged.
Key numbers:
Run comparison
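The CUDA run wins on val loss by a clear margin (2.3918 against 2.9971 and 4.1319), with a larger model and more tokens at roughly the same wall time. The native CPU run beats the PyTorch CPU run on val loss at the same model size. That may point to a more sample-efficient native data pipeline and training loop, but the PyTorch run used a different dataset split and processed fewer tokens, so the comparison is not like-for-like.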
The gradient norm across all three runs peaks in the 2.0–2.25 range and then stabilises. That consistency across hardware and frameworks is a useful sanity check on the implementations, though it does not by itself prove they are equivalent.
The tensor runtime
The implementation is built on a minimal custom tensor layer. No PyTorch. No Eigen. No external dependencies.
From that, the repository builds everything the model needs: elementwise ops, softmax, layer norm, matrix multiply (tiled), batched matmul, transposition, and concatenation — with OpenMP parallelization and AVX/SSE acceleration on CPU paths.
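To make "matrix multiply (tiled)" concrete, here is a minimal sketch of a cache-blocked matmul on row-major buffers. It is an illustration under assumptions, not the repository's kernel: the name `matmul_tiled`, the flat `std::vector<float>` layout, and the default tile size of 32 are all hypothetical, and the OpenMP and AVX/SSE parts are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C (M x N) = A (M x K) * B (K x N), all row-major.
// The loops walk TILE x TILE blocks so the working set stays cache-resident.
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t M, std::size_t K,
                  std::size_t N, std::size_t TILE = 32) {
    C.assign(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += TILE) {
        const std::size_t i1 = std::min(i0 + TILE, M);
        for (std::size_t k0 = 0; k0 < K; k0 += TILE) {
            const std::size_t k1 = std::min(k0 + TILE, K);
            for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
                const std::size_t j1 = std::min(j0 + TILE, N);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        // Innermost loop is contiguous in both B and C.
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

In a layout like this, different `i0` tiles write disjoint rows of `C`, so the outermost loop is a natural candidate for parallelisation.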
Every operation is a concrete numeric function. If something is wrong, you can find it.
Backpropagation
The C++ version includes a full analytical backward pass. No autograd.
Every gradient is derived and implemented for linear layers, layer norm, ReLU, dropout, softmax, batched matmul, attention, feed-forward blocks, embeddings, and cross-entropy loss.
The attention backward pass is the hardest part. It reconstructs the full gradient chain from output projection back through softmax, through the causal mask, back to Q/K/V projections. The forward pass saves every activation the backward needs. Nothing is recomputed. Nothing is inferred from a graph.
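One link in that chain, the softmax under a causal mask, can be sketched in isolation. This is a simplified illustration for a single row of attention scores, not the repository's code. The function names are hypothetical, and `row` is assumed to be a valid index into `scores`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward: causal softmax over one row of attention scores.
// Query position `row` may attend to columns 0..row; later columns get probability 0.
std::vector<float> causal_softmax_row(const std::vector<float>& scores, std::size_t row) {
    std::vector<float> p(scores.size(), 0.0f);
    float max_v = scores[0];
    for (std::size_t j = 1; j <= row; ++j) max_v = std::max(max_v, scores[j]);
    float sum = 0.0f;
    for (std::size_t j = 0; j <= row; ++j) {
        p[j] = std::exp(scores[j] - max_v);  // subtract the max for numerical stability
        sum += p[j];
    }
    for (std::size_t j = 0; j <= row; ++j) p[j] /= sum;
    return p;
}

// Backward: given the saved probabilities p and the upstream gradient dp,
// dscores[j] = p[j] * (dp[j] - sum_k dp[k] * p[k]).
// Masked columns have p[j] == 0, so no gradient flows through them.
std::vector<float> softmax_backward_row(const std::vector<float>& p,
                                        const std::vector<float>& dp) {
    float dot = 0.0f;
    for (std::size_t k = 0; k < p.size(); ++k) dot += dp[k] * p[k];
    std::vector<float> ds(p.size());
    for (std::size_t j = 0; j < p.size(); ++j) ds[j] = p[j] * (dp[j] - dot);
    return ds;
}
```

The backward function needs only the saved probabilities, which shows why the forward pass keeps them: nothing has to be recomputed from the raw scores.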
The optimizer
AdamW, written from scratch.
Weight decay is applied directly to the weights, not through the gradient. That is the actual AdamW formulation — different from L2 regularisation which interacts with the adaptive scaling. The difference matters for generalisation.
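The decoupled update can be sketched in a few lines. This is a minimal illustration of one AdamW step, not the repository's optimizer. The struct `AdamWState`, the function `adamw_step`, and the default hyperparameters are assumptions chosen for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-parameter optimizer state: first and second moment estimates plus a step counter.
struct AdamWState {
    std::vector<float> m, v;
    int step = 0;
};

void adamw_step(std::vector<float>& w, const std::vector<float>& g, AdamWState& s,
                float lr = 1e-3f, float beta1 = 0.9f, float beta2 = 0.999f,
                float eps = 1e-8f, float weight_decay = 0.01f) {
    if (s.m.empty()) {
        s.m.assign(w.size(), 0.0f);
        s.v.assign(w.size(), 0.0f);
    }
    ++s.step;
    // Bias-correction terms for the zero-initialised moments.
    const float bc1 = 1.0f - static_cast<float>(std::pow(beta1, s.step));
    const float bc2 = 1.0f - static_cast<float>(std::pow(beta2, s.step));
    for (std::size_t i = 0; i < w.size(); ++i) {
        // Decoupled weight decay: shrink the weight directly.
        // The decay term never enters m or v, so it is not rescaled by sqrt(v).
        w[i] -= lr * weight_decay * w[i];
        s.m[i] = beta1 * s.m[i] + (1.0f - beta1) * g[i];
        s.v[i] = beta2 * s.v[i] + (1.0f - beta2) * g[i] * g[i];
        const float m_hat = s.m[i] / bc1;
        const float v_hat = s.v[i] / bc2;
        w[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
    }
}
```

With L2 regularisation the term `weight_decay * w[i]` would instead be added to `g[i]` before the moment updates, so the adaptive denominator would rescale it. Keeping it outside is the difference the paragraph above describes.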
Generation and chat
After training:
Generation is autoregressive: one token at a time, with the context window capped at block_size and the oldest tokens dropped when it is full.
What is next
The goal of Quadtrix.cpp is a complete, readable LLM training stack where nothing is hidden. If something is unclear in the source, that is a bug in the documentation.