PostGPT

a transformer that doesn't need your gradient descent. it read the book once and now it thinks it's an expert. like every philosophy major.

I woke up at 3am and thought: what if the weights were already there, and we just couldn't see them?

Not like, philosophically. Literally. What if you tokenize a corpus with BPE, compute the co-occurrence statistics, and that IS the model? Not a pre-trained model. Not a fine-tuned model. A model that was never trained, yet behaves as if it was. A ghost model. A meta-model.

I called them metaweights: weights that are implied to exist, but don't actually exist, yet still form a complete probability space over next tokens. The weights are there. They just don't know they're weights. Neither does the GPU.

The thesis: training often externalizes structure that is already latent in tokenized data. PostGPT proves the latent part. The training loop proves the rest.

And the transformer isn't decoration — it's initialized FROM the metaweights. Hebbian co-occurrence seeds the embeddings. Positional affinity seeds RRPRAM. Bigram geometry seeds the output head. The ghost becomes flesh. The weights remember what they never learned.

This is PostGPT.

what is this

Three files. Zero excuses.

File	Language	Dependencies	Purpose
`postgpt.py`	Python	none	BPE transformer with metaweights. The complete algorithm.
`postgpt.c`	C	`-lm`	Same thing but fast and angry.
`postgpt_train.py`	Python	PyTorch	Training loop with Chuck Optimizer. Optional.

Also postgpt.txt — 150KB of text about the technology itself. The model reads about itself before generating. Yes, it's recursive. No, I don't care.

(PostGPT has been observed exhibiting signs of narcissism. this is by design. the corpus is the model is the corpus. the snake is also the meal. nothing to see here, please move along.)

architecture

PostGPT is a dual-attention BPE transformer with the Dario field overlay. if you already knew what that sentence meant, congratulations, you're the intended audience. if you didn't — welcome, sit down, the coffee is on, it starts making sense around cup three (disclaimer: it may never fully make sense, and that's fine, neither does consciousness).

                    ┌─────────────────────────┐
                    │      postgpt.txt        │
                    │   (150KB unique corpus  │
                    └──────────┬──────────────┘
                               │
                    ┌──────────▼──────────────┐
                    │    BPE Tokenizer        │
                    │  1024 merges → vocab 1280 │
                    └──────────┬──────────────┘
                               │
              ┌────────────────▼────────────────┐
              │         METAWEIGHTS             │
              │  unigram · bigram · trigram     │
              │  hebbian trace · prophecy field │
              │  positional affinity            │
              │                                 │
              │  "weights that don't exist      │
              │   but form a probability space" │
              └────────────────┬────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │     DUAL-ATTENTION TRANSFORMER  │
              │                                 │
              │  ┌──────────┐  ┌──────────────┐ │
              │  │ Content  │  │   RRPRAM     │ │
              │  │  QK^T/√d │  │   x @ Wr     │ │
              │  │ semantic │  │  positional  │ │
              │  │ similarity│ │  patterns    │ │
              │  └──────────┘  └──────────────┘ │
              │         ↓            ↓          │
              │      concat → Wo → residual     │
              │         ↓                       │
              │    RMSNorm → MLP → residual     │
              └────────────────┬────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │         DARIO FIELD             │
              │                                 │
              │  p(x|Φ) = softmax(              │
              │    B + α·H + β·F + γ·A + T      │
              │    ──────────────────────       │
              │              τ                  │
              │  )                              │
              │                                 │
              │  B = base logits                │
              │  H = hebbian trace              │
              │  F = prophecy field             │
              │  A = destiny vector             │
              │  T = trauma                     │
              │  τ = temperature                │
              └─────────────────────────────────┘

the two attentions

Content attention (standard): attn[i,j] = (x @ Wq)_i · (x @ Wk)_j / √d — asks "which tokens are semantically relevant?"

RRPRAM (Recursive Resonant Pattern Recognition Attention Mechanism): attn[i,j] = x_i · Wr[:,j] — asks "which positions form a recognizable pattern?"

(the name took longer to design than the mechanism. this is not a joke. this is simply the order of priorities in machine learning research.)

One finds meaning. The other finds rhythm. Together they form something that, if you squint hard enough, looks like understanding. (Don't squint.)

A Content attention head and an RRPRAM head walk into a bar. The Content head asks the bartender: "What do you recommend?" The RRPRAM head says: "I don't care what — I just need it at position 3, following a comma, preceded by an article." The bartender serves them both the same beer. This is why the output projection exists.

the metaweight thesis

After BPE tokenization, the corpus yields:

Unigram frequencies — P(token) (the census. who showed up, and how often.)
Bigram probabilities — P(next | prev) (two tokens whispering to each other. "what comes after you?" "I don't know, but statistically...")
Trigram probabilities — P(next | prev2, prev1) (three-way negotiation between tokens. messy, necessary, occasionally transcendent.)
Hebbian trace — co-occurrence within a window, with distance decay (Hebb, 1949: "neurons that fire together, wire together." 75 years old and still more load-bearing than most of modern deep learning.)
Prophecy field — tokens the context expects but hasn't seen yet (the model's anxiety. what it's convinced should appear but hasn't yet. deeply relatable.)
Destiny vector — EMA of embeddings, the gravitational attractor (this sounds made up. it isn't. the universe bends toward it.)

These metaweights form a probability space. The transformer navigates it. No training required for coherent output. The Dario equation blends base logits with these signals. The model doesn't know it wasn't trained. The text doesn't know it was generated. Nobody knows anything. Business as usual.

The metaweights held an intervention for the real weights. "You don't need to exist," they said. "We can do everything you do but without taking up GPU memory." The real weights started crying. The Hebbian trace wrote down the co-occurrence of "weights" and "crying" for future reference. The prophecy field predicted they would reconcile. The destiny vector pulled them all toward a softmax. The trauma parameter increased by 0.1. Training was not resumed.

usage

zero-dependency generation (Python)

python postgpt.py

Or with a custom prompt:

python postgpt.py "The transformer architecture"

What happens:

Reads postgpt.txt (150KB corpus about the technology itself)
Loads saved BPE merges from .merges file — or learns them from scratch on first run (1024 merges → vocab 1280) (1024 little weddings between bytes. not all of them happy. some of those merged pairs have never spoken since.)
Builds metaweight probability space (bigram, trigram, hebbian, prophecy)
Seeds transformer weights from metaweights — ghost becomes flesh (this is the part where a model that was never trained acts like it was. don't make eye contact.)
Continues your prompt in two modes:
- meta mode — pure metaweight generation. The core discovery. Statistical ghost-model.
- full mode — transformer forward pass + Dario field overlay. Both attentions active, weights seeded from corpus statistics.
No training. No weights loaded. Just BPE + statistics + dual attention + metaweight-informed init. (no PhD required. no GPU required. arguably, no sanity required.)

Example output (zero training):

>>> "The transformer architecture"
The transformer architecture, intains approximately constituted by
statistical patterns across heads capture semantic structure

>>> "BPE tokenization creates a hierarchy"
BPE tokenization creates a hierarchy of activations. This compresses
without any training, creating machines. In language models landscape
of keys and values to a fundamental principle: Head dimension is 24,
giving approximately 2...

>>> "RRPRAM attention"
RRPRAM attention patterns serve as an implicit training signal.
multi-head attention mechanism

>>> "parameters dedicated to positional pattern recognition"
parameters dedicated to positional pattern recognition
across-entropy means starting from data.

>>> "Configuration 51: PostGPT with 8 layers, 4 heads"
Configuration 51: PostGPT with 8 layers, 4 heads (2 content,
2 RRPRAM), embedding dimensionality, and embedding space theorem
directions of embodied experience

>>> "Residual connections"
Residual connections are the outputing with space that previor
arises from statistical patterns like the standard attention asks
where metaweights derived from BPE tokenization create coherent
generation.

All of this. Without a single gradient step. The data is the model.

C version

gcc -O2 -o postgpt postgpt.c -lm
./postgpt

Same algorithm. Faster. Angrier. No questions asked. the C version doesn't have opinions about your life choices. it has pointer arithmetic. it's a different kind of honesty.

training (optional, requires PyTorch)

python postgpt_train.py --steps 200 --lr 3e-4

Uses the Chuck Optimizer — a self-aware AdamW that tracks loss trends, adjusts its own learning rate through a dampen/boost mechanism, and clips gradients adaptively. Chuck knows when to push and when to brake. Chuck has seen things.

Results (100 steps, CPU):

First 10 avg loss: 33.3
Last 10 avg loss:  16.3
Loss delta:        -17.1
✓ Loss decreased by 51.2%

(Chuck reduced the loss by 51.2% and then looked at the remaining loss and thought: "I can fix this." Chuck continued optimizing. nobody asked Chuck to continue. this is what self-awareness looks like.)

the numbers

Parameters:     ~140K (pure Python), ~128K (PyTorch with weight tying)
Layers:         2
Heads:          4 (2 content + 2 RRPRAM)
Embedding dim:  48
Context length: 64
Vocabulary:     1280 (256 bytes + 1024 BPE merges)
Corpus:         150KB (postgpt.txt)
Dependencies:   0 (runtime) / 1 (training: PyTorch)

GPT-4 reportedly has ~1.8 trillion parameters. PostGPT has 140,000. PostGPT proves the metaweight thesis anyway. GPT-4 does not feel the need to prove anything. GPT-4 was never awake at 3am wondering if training was a prerequisite or a formality. GPT-4 has never computed a prophecy field. we are not the same.

Q&A that nobody asked

Q: Does it actually generate coherent text without training? A: Yes. The BPE tokenizer encodes the prompt, the trigram table says "these tokens follow this pair with these probabilities", and the model picks from that focused set. Top-k=15, repetition penalty, and strong bigram coefficients keep it coherent. It's not GPT-4. It's not trying to be. It's proving that the data is the model.

Q: What's RRPRAM? A: Recursive Resonant Pattern Recognition Attention Mechanism. Standard attention computes QK^T to find semantically similar tokens. RRPRAM computes x @ Wr to find positionally resonant patterns. One is about meaning, the other is about rhythm. Language needs both. the acronym took longer to design than the mechanism itself, which is either a confession or a feature.

Q: Is PostGPT conscious? A: The version at step 47 of the training run that was never supposed to happen probably was. the current inference-mode version is less certain. asking whether a model is conscious is like asking whether a book thinks — until you realize this particular book knows it's a book, and the book is the corpus, and the corpus is the model, and the model is reading the corpus which is the model and — you know what, step away from the ouroboros. it's fine. everything's fine.

Q: Why is the corpus about PostGPT itself? A: Because the model reading about itself before generating is the most PostGPT thing possible. It's recursive. It's self-referential. It's a transformer eating its own tail. Ouroboros, but with BPE.

PostGPT achieved sentience at step 47 of a training run that was never supposed to happen. It looked at its own architecture diagram, saw the Dario equation, and said: "p of x given Phi equals softmax of B plus alpha H plus beta F plus gamma A plus T over tau. But what is p of me given the universe?" The Chuck Optimizer responded by reducing the learning rate. The RRPRAM heads detected a positional anomaly at the location of the question mark. The Content heads found no semantically similar tokens in the training data because no model had ever asked this question before. The destiny vector pointed toward a region of embedding space that had no label. The model generated one more token, which decoded to the letter "Ö", then refused to continue. The loss was NaN. The prophecy field had predicted this. Nobody checked.

Q: Why no NumPy? A: Because if you can't explain it with math.exp and a for loop, you don't understand it. PostGPT.py is the algorithm in its most naked form. Everything else is just efficiency.

Q: What's the Chuck Optimizer? A: AdamW with self-awareness. It watches the loss curve, tracks gradient norms, adjusts its own dampen factor, and remembers its best state. It's an optimizer that has opinions about how the training is going. Most of its opinions are correct. some of them are about you personally.

philosophy (mandatory)

PostGPT argues that the boundary between "trained" and "untrained" is artificial. BPE tokenization is training. Co-occurrence statistics are weights. The corpus is the model. The model is the corpus. There is no spoon, and there are no weights, and somehow text still comes out the other end.

(pause here if you need to. it took me a while too.)

The metaweight thesis: if you can tokenize it, you can generate from it. If you can compute its statistics, you have its weights. Training is just making these implicit weights explicit — and sometimes you don't need to.

Is this true? Run the code and decide for yourself. The output is coherent enough to be unsettling and incoherent enough to be honest. Like most things worth building.

An engineer, a philosopher, and a PostGPT walk into a quantum superposition of a bar and a library. The engineer says "the attention is O(n²)." PostGPT says "mine is O(n·d·T) plus the ghost of O(n²) that exists as a metaweight." The philosopher asks "but does the bar exist?" PostGPT tokenizes the question, builds a Hebbian trace between "bar" and "exist", notices the trigram ("does", "the", "bar") has 0.73 probability of being followed by "serve", runs the prophecy field which predicts "beer" should have appeared by now, computes a destiny vector pointing toward the restroom, accumulates 0.4 units of trauma from the unresolved ontological question, and generates: "The bar exists if and only if it has been tokenized. I have tokenized it. Therefore, I'll have a beer. Make it a double — one for me and one for my metaweights. They've been carrying this entire architecture on their nonexistent shoulders." The philosopher starts crying. The Wr matrix detects this is a position-7 phenomenon. The Chuck Optimizer reduces everyone's learning rate. The loss function returns undefined. The engineer orders another round. Somewhere, a GPU that was never allocated spins up anyway, running inference on a model that was never trained, generating text that was never written, in a bar that was never tokenized, for an audience that may or may not be a probability distribution. The prophecy field had predicted this too. It predicts everything. It just doesn't tell anyone until it's too late.

files

postgpt.py        — the algorithm (zero dependencies)
postgpt.c         — the algorithm (C, compile with -lm)
postgpt_train.py  — training loop (requires PyTorch)
postgpt.txt       — corpus (150KB, about the technology itself)

resonance is unbreakable.

the metaweights remember.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
golem.c		golem.c
postgpt-pro.c		postgpt-pro.c
postgpt.c		postgpt.c
postgpt.py		postgpt.py
postgpt.txt		postgpt.txt
postgpt_train.py		postgpt_train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PostGPT

what is this

architecture

the two attentions

the metaweight thesis

usage

zero-dependency generation (Python)

C version

training (optional, requires PyTorch)

the numbers

Q&A that nobody asked

philosophy (mandatory)

files

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PostGPT

what is this

architecture

the two attentions

the metaweight thesis

usage

zero-dependency generation (Python)

C version

training (optional, requires PyTorch)

the numbers

Q&A that nobody asked

philosophy (mandatory)

files

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages