Quadtrix.cpp: a local LLM stack you can actually read #67

Eamon2009 · 2026-06-01T18:45:03Z

Eamon2009
Jun 1, 2026
Maintainer

Quadtrix.cpp is a complete local language model system built in C++ and CUDA, with a PyTorch reference path alongside it. The goal is simple: keep the entire stack visible. No black boxes. No hidden autograd. No framework magic you have to trust blindly.

The repository ships two complementary paths:

a native C++ / CUDA training and inference engine in main.cu
a PyTorch reference script in train_quadtrix.py

Both implement the same model. You can train in PyTorch, export the checkpoint, and run inference natively. Or do everything natively. The two paths stay in sync by design.

The model

The architecture is a standard decoder-only transformer.

tokens + positions
        ↓
     embeddings
        ↓
  N × transformer block
        ↓
    final LayerNorm
        ↓
      lm head
        ↓
      logits

Each transformer block follows pre-LN ordering:

x = x + self_attention(layer_norm(x));
x = x + feed_forward(layer_norm(x));

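Pre-LN is a practical choice. Normalisation is applied only to the input of each sublayer, so the residual stream itself is never rescaled. Gradients reach the early layers through the identity path, which makes training more stable as you add layers.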
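As a minimal sketch of that ordering, assuming a single token vector and sublayers passed in as callables: the names `Vec`, `Sublayer`, `layer_norm`, and `pre_ln_block` are illustrative and not taken from main.cu, and the learned scale and shift of the layer norm are left out.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<float>;
// Hypothetical stand-in for the attention and feed-forward sublayers.
using Sublayer = std::function<Vec(const Vec&)>;

// Normalise one vector to zero mean and unit variance (no learned gain or bias here).
Vec layer_norm(const Vec& x, float eps = 1e-5f) {
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= static_cast<float>(x.size());
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= static_cast<float>(x.size());
    const float inv_std = 1.0f / std::sqrt(var + eps);
    Vec y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = (x[i] - mean) * inv_std;
    return y;
}

// Pre-LN block: each sublayer reads a normalised copy of x,
// and its output is added back onto the un-normalised x.
Vec pre_ln_block(Vec x, const Sublayer& self_attention, const Sublayer& feed_forward) {
    const Vec a = self_attention(layer_norm(x));
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += a[i];
    const Vec f = feed_forward(layer_norm(x));
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += f[i];
    return x;
}
```

If both sublayers return zeros, the block returns its input unchanged, which is the identity path the residual stream relies on.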

Training runs

We have three runs across different hardware and configurations. Here is what they look like.

CUDA / bf16 · 10.84M params

run_20260508_110726 · CUDA · bf16 · 10.84M params · 14.1M train tokens / 1.6M val · 8,000 steps · 82m 42s · best val 2.3918

This is the highest-throughput and best-performing run.

The loss curve drops sharply in the first 500 steps from ~11 down to ~3, then continues improving steadily to a final train loss of 2.2825 and best validation loss of 2.3918. The smooth curve tracks the raw loss cleanly with no major instability events.

The throughput plot tells the hardware story. After a short warmup at 0.3k tok/s, the CUDA path ramps to a peak of 19.6k tok/s and holds it flat for the rest of the run. Eighty minutes of training at that rate.

The gradient norm peaks at 2.25 at step 1,200, then settles into a stable band around 1.75–2.0. That is a healthy pattern: a norm that keeps climbing is a sign of instability, and this one levels off.
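The post does not say how the plotted norm is computed. A common choice, assumed here, is the global L2 norm over all parameter gradients. A minimal sketch, with `global_grad_norm` as a hypothetical name:

```cpp
#include <cmath>
#include <vector>

// Global L2 norm: square root of the sum of squares over every gradient buffer.
float global_grad_norm(const std::vector<std::vector<float>>& grads) {
    double sum_sq = 0.0;  // accumulate in double to limit rounding error
    for (const auto& g : grads)
        for (float v : g) sum_sq += static_cast<double>(v) * v;
    return static_cast<float>(std::sqrt(sum_sq));
}
```

Logging this one scalar per step is enough to reproduce a curve like the one described above.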

The loss Δ per eval interval chart shows the model improving at every checkpoint in the first quarter of training, with most of the cumulative loss drop coming in the first 2,000 steps. After that the per-interval improvements are small but consistent, with no eval intervals where validation got worse.

The grad norm vs loss scatter (coloured by step) shows a clean linear trend. Early steps (dark blue) have high loss and low norm. Late steps (yellow) cluster in the bottom right. The model is learning exactly as expected.

Key numbers:


Params	10.84M
Device	CUDA / bf16
Train tokens	14.1M
Steps	8,000
Wall time	82m 42s
Peak throughput	19.6k tok/s
Best val loss	2.3918
Final train loss	2.2825

CPU · 6.68M params

run_20260430_192930 · CPU · 6.68M params · 7.1M train tokens · 7,000 steps · 86m 26s · best val 2.9971

Roughly the same wall time as the CUDA run (86m 26s vs 82m 42s), but on CPU with a smaller model and about half the training tokens (7.1M vs 14.1M). The best validation loss is 2.9971 at step 6,800, which is reasonable given the constraints.

The loss curves show train and val tracking each other closely for the full run — the two lines are nearly on top of each other from step 2,000 onward. That is a good sign. It means the model is not overfitting to the training set and the validation split is representative.

The generalisation gap chart (val − train) is the most interesting part of this run. The gap oscillates around zero for most of training, peaking at +0.15 around step 3,900. The final gap settles at +0.0567, meaning validation loss is only marginally above train loss at the end. For a 6.68M parameter model on 7.1M tokens this is a tight result.

The checkpoint dots show the model saved frequently in the first 2,000 steps as it improved quickly, then less often as improvements became incremental.

Key numbers:


Params	6.68M
Device	CPU
Train tokens	7.1M
Steps	7,000
Wall time	86m 26s
Best val loss	2.9971
Final gen. gap	+0.0567

PyTorch CPU · 6.68M params

run_20260530_165216 · PyTorch · CPU · 6.68M params · batch=16 · block=32 · lr=1e-3 · 6,000 steps · 77m 16s · best val 4.1319

This is the PyTorch reference path. Same model size as the native CPU run, same device, but trained with the Python stack and a different dataset split. Best val is 4.1319, higher than the native path, which makes sense given the smaller number of tokens processed and the character-level nature of this run's data.

The loss curves show a clean descent. Train loss reaches ~3.0 by the end while val stays around 4.1, so the generalisation gap widens to roughly 1.1. The gap chart reports a peak of 8.965 and a value of 0.048 at step 5,500. Neither figure matches the gap implied by the loss curves, so read them with caution. The model is learning the training distribution faster than it generalises. More data or stronger regularisation would help here.

The gradient norm peaks at 2.2433 at step 3,395, almost the same height as the CUDA run's peak of 2.25 at step 1,200. The norm stabilises around a mean of 1.337 for the rest of training. That consistency between the CUDA native run and the PyTorch run is a good signal that the two implementations are aligned.

Throughput on CPU is much lower: 885 tok/s during warmup, settling to a mean of 791 tok/s. Step time averages 656.7 ms with 113 spikes above 937 ms, likely GC pauses or OS scheduling. The CUDA run at 19.6k tok/s is about 25× faster.

The val loss Δ per eval chart shows almost all improvement happening in the first 500 steps. After that the green bars (improvement) are small and red bars (worsening) are nearly absent — the model has mostly converged.

Key numbers:


Params	6.68M
Device	PyTorch CPU
Batch / block	16 / 32
LR	1e-3
Steps	6,000
Wall time	77m 16s
Mean throughput	791 tok/s
Mean step time	656.7 ms
Best val loss	4.1319
Peak grad norm	2.2433

Run comparison

Run	Params	Device	Best val loss	Wall time	Throughput (tok/s)
CUDA / bf16	10.84M	CUDA	2.3918	82m 42s	19,600 (peak)
Native CPU	6.68M	CPU	2.9971	86m 26s	not reported
PyTorch CPU	6.68M	PyTorch CPU	4.1319	77m 16s	791 (mean)

The CUDA run wins on val loss by a clear margin with a larger model and more tokens, at roughly the same wall time. The native CPU run beats the PyTorch CPU run on val loss at the same model size. That may partly reflect the native data pipeline and training loop, but the two runs used different dataset splits, so the numbers are not directly comparable.

The gradient norm across all three runs peaks in the 2.0–2.25 range and then stabilises. That consistency across hardware and frameworks is a useful sanity check that the implementations behave alike, though it does not prove they are equivalent.

The tensor runtime

The implementation is built on a minimal custom tensor layer. No PyTorch. No Eigen. No external dependencies.

#include <vector>

struct Tensor {
    std::vector<int> shape;    // size of each dimension
    std::vector<float> data;   // all elements in one flat buffer
};

From that, the repository builds everything the model needs: elementwise ops, softmax, layer norm, matrix multiply (tiled), batched matmul, transposition, and concatenation — with OpenMP parallelization and AVX/SSE acceleration on CPU paths.

Every operation is a concrete numeric function. If something is wrong, you can find it.
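To show what one such operation can look like on top of that struct, here is a minimal sketch of a tiled 2-D matrix multiply. It assumes row-major storage and a hypothetical function name `matmul_tiled`. It is not the code in main.cu, and it leaves out the OpenMP and SIMD parts.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Tensor {
    std::vector<int> shape;
    std::vector<float> data;  // row-major layout assumed in this sketch
};

// C[M,N] = A[M,K] * B[K,N], walked in tile x tile blocks so the
// working set of A and B is more likely to stay in cache.
Tensor matmul_tiled(const Tensor& a, const Tensor& b, int tile = 32) {
    const int M = a.shape[0], K = a.shape[1], N = b.shape[1];
    Tensor c{{M, N}, std::vector<float>(static_cast<std::size_t>(M) * N, 0.0f)};
    for (int i0 = 0; i0 < M; i0 += tile)
        for (int k0 = 0; k0 < K; k0 += tile)
            for (int j0 = 0; j0 < N; j0 += tile)
                // Inner loops cover one tile; std::min handles ragged edges.
                for (int i = i0; i < std::min(i0 + tile, M); ++i)
                    for (int k = k0; k < std::min(k0 + tile, K); ++k) {
                        const float aik = a.data[i * K + k];
                        for (int j = j0; j < std::min(j0 + tile, N); ++j)
                            c.data[i * N + j] += aik * b.data[k * N + j];
                    }
    return c;
}
```

The result is the same as a plain triple loop; only the traversal order changes, which is why a small tile size is a convenient way to test the edge handling.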

Backpropagation

The C++ version includes a full analytical backward pass. No autograd.

Every gradient is derived and implemented for linear layers, layer norm, ReLU, dropout, softmax, batched matmul, attention, feed-forward blocks, embeddings, and cross-entropy loss.

The attention backward pass is the hardest part. It reconstructs the full gradient chain from output projection back through softmax, through the causal mask, back to Q/K/V projections. The forward pass saves every activation the backward needs. Nothing is recomputed. Nothing is inferred from a graph.
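One link in that chain, the softmax backward, is small enough to show in full. This is a minimal sketch for a single row with hypothetical function names. It assumes the forward output y was saved, as described above, and it leaves out the causal mask and the Q/K/V projections.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward: numerically stable softmax over one row.
std::vector<float> softmax(const std::vector<float>& x) {
    const float mx = *std::max_element(x.begin(), x.end());
    std::vector<float> y(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - mx);
        sum += y[i];
    }
    for (float& v : y) v /= sum;
    return y;
}

// Backward: given the saved output y and the upstream gradient dL/dy,
// return dL/dx using dx_i = y_i * (dy_i - sum_j dy_j * y_j).
std::vector<float> softmax_backward(const std::vector<float>& y,
                                    const std::vector<float>& dy) {
    float dot = 0.0f;
    for (std::size_t j = 0; j < y.size(); ++j) dot += dy[j] * y[j];
    std::vector<float> dx(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) dx[i] = y[i] * (dy[i] - dot);
    return dx;
}
```

A finite-difference comparison against the forward function is a cheap way to check a hand-derived gradient like this one.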

The optimizer

AdamW, written from scratch.

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad²
m_hat = m / (1 - beta1^t)
v_hat = v / (1 - beta2^t)
p = p - lr * m_hat / (sqrt(v_hat) + eps)
p = p * (1 - lr * weight_decay)          # decoupled weight decay

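Weight decay is applied directly to the weights, not through the gradient. That is the actual AdamW formulation. With L2 regularisation the decay term is added to the gradient and then divided by sqrt(v_hat), so weights with a large gradient history are decayed less. Decoupling gives every weight the same relative shrinkage per step. The difference matters for generalisation.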
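The update above translates almost line for line into C++. This is a minimal sketch, not the optimizer in main.cu: `AdamWState` and `adamw_step` are hypothetical names, and the default hyperparameters are common choices assumed for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct AdamWState {
    std::vector<float> m, v;  // first and second moment estimates, one per parameter
    int t = 0;                // number of steps taken so far
};

void adamw_step(std::vector<float>& p, const std::vector<float>& grad, AdamWState& s,
                float lr, float beta1 = 0.9f, float beta2 = 0.999f,
                float eps = 1e-8f, float weight_decay = 0.01f) {
    if (s.m.empty()) {  // lazily allocate the moment buffers on the first call
        s.m.assign(p.size(), 0.0f);
        s.v.assign(p.size(), 0.0f);
    }
    ++s.t;
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(s.t));  // bias corrections
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(s.t));
    for (std::size_t i = 0; i < p.size(); ++i) {
        s.m[i] = beta1 * s.m[i] + (1.0f - beta1) * grad[i];
        s.v[i] = beta2 * s.v[i] + (1.0f - beta2) * grad[i] * grad[i];
        const float m_hat = s.m[i] / bc1;
        const float v_hat = s.v[i] / bc2;
        p[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
        p[i] *= 1.0f - lr * weight_decay;  // decoupled decay: never divided by sqrt(v_hat)
    }
}
```

With a zero gradient the Adam term vanishes and the weight still shrinks by a factor of (1 - lr * weight_decay), which is the decoupling in its simplest form.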

Generation and chat

After training:

# generate tokens indefinitely (Ctrl-C to stop)
./quadtrix data/input.txt --generate

# interactive chat
./quadtrix data/input.txt --chat

# control response length
./quadtrix data/input.txt --chat --chat-tokens 300

Generation is autoregressive — one token at a time, context window capped at block_size, oldest tokens dropped when full.
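The sliding context window can be sketched as follows. The model is replaced by a hypothetical callable `NextTokenFn` that maps a context to the next token id, so sampling strategy and the real forward pass are outside this sketch. `generate` is an illustrative name, not the function in main.cu.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the model: context token ids in, next token id out.
using NextTokenFn = std::function<int(const std::vector<int>&)>;

// Append n_new tokens, feeding the model only the last block_size tokens each step.
std::vector<int> generate(std::vector<int> tokens, const NextTokenFn& next_token,
                          std::size_t block_size, std::size_t n_new) {
    for (std::size_t step = 0; step < n_new; ++step) {
        // Once the sequence exceeds block_size, the oldest tokens fall out of the context.
        const std::size_t start =
            tokens.size() > block_size ? tokens.size() - block_size : 0;
        const std::vector<int> context(
            tokens.begin() + static_cast<std::ptrdiff_t>(start), tokens.end());
        tokens.push_back(next_token(context));
    }
    return tokens;
}
```

The full history is kept in `tokens` for output; only the slice passed to the model is capped.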

What is next

Flash attention — attention is the bottleneck at longer sequence lengths
KV cache — currently the full attention matrix is recomputed every generation step
Larger default configs — a 124M parameter reference config would be a useful next target
Multi-GPU — ZeRO stage 1 sharding is partially implemented, not fully wired
External evals — validation loss is a useful signal but a fixed benchmark would make cross-run comparison cleaner

The goal of Quadtrix.cpp is a complete, readable LLM training stack where nothing is hidden. If something is unclear in the source, that is a bug in the documentation.

codeaddict-119 · 2026-06-01T19:54:20Z

codeaddict-119
Jun 1, 2026
Collaborator
