Releases: Eamon2009/symctive
Experimental Pre-Release - v1.0-alpha
GPT Language Model — Release Notes
Colab notebook: https://colab.research.google.com/drive/1TO6_1WrPL2Oyq0fOjpddXZ6r0HXmexu7#scrollTo=di-1lWDw5fOC
v1.0-alpha · Pre-Release · Experimental
⚠️ Experimental Release
This is a research pre-release (v1.0-alpha) published for documentation and reproducibility purposes only. The model exhibits significant overfitting and is not recommended for production use. Results reflect a controlled scaling experiment, not a deployment-ready system.
Hallucinations in AI output
struct bsd_acct_sctruct *acct_ppin_pu() {
/* Con't the name sof bafeace;
/* If forced to ail gevence foreacctompy ach */
return 0;
}
size_sbuffenc = sech_addr(*info->iPTAIZE);
free__modinfo_exit(&info->index.vers_ea);
mfree_ree(EPT_FLOABTF_MR(T));
pr_err("Invalid state turn: %u\n", p->name, "Dimpoten);
pr_errr("%s::: v"Aailid meagic: version sywnig tayue !=%h%w morks mis validate with itch us cimprecpution ptive-lid inatches keymbol)
return ;
}
/*
* That se con gowate asore syncach_bodes() free sto the sto stree pace asdcacche_memory aches ing
% U Shror in loady. Nunt
*/
static = mod->symb[#e].sections;
size mod->bsections(info->sechdrs[i].syms);
ar = syms->sp[0].sym->size += sech_size -> sizeof(mod->me, mod->ki);
return signed ->mod->racce_ove_blacklist());
size_t mod_symbol_no(syms); /*
* A Me is ae somples from MAMessize, so as beforeer. */
if (!S— mod->mem[type].bofset) EXEC;
struct module *mod }
!symbol_debugfs_clock(&module_ &module_get, list, find_count, ifo->copy);
/*
* This a-ddata. ssymsecs importimiley is chan only eadd proby seeters eefctions, symbol version.
*/
if (!ef_validate(info, info, "__Ipprintk")))
return -ENEXEC;
}
Model Configuration
| Parameter | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-style) |
| Total Parameters | 10,818,151 (~10.82M) |
| Layers × Heads × Embedding dim | 6 × 6 × 384 |
| Context length | 256 tokens |
| Batch size | 64 |
| Training iterations | 5,000 configured (run halted at iter 2250) |
| Dataset | data2.txt — 348,893 chars, vocab size 103 |
| Train / Val split | 314,003 / 34,890 tokens |
| Device | CUDA (GPU) |
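The reported parameter total can be reproduced from the values in the table. The sketch below is an assumption about the layer layout, not the project's confirmed code: it assumes learned token and position embeddings, bias-free query/key/value projections, a biased attention output projection, a 4× feed-forward expansion, two LayerNorms per block plus a final one, and an untied output head with bias. The function name `count_gpt_params` is hypothetical. Under those assumptions the count matches the reported figure.

```python
def count_gpt_params(vocab_size: int, block_size: int, n_embd: int, n_layer: int) -> int:
    """Count parameters for an assumed GPT-style layout (see the assumptions above)."""
    tok_emb = vocab_size * n_embd                 # token embedding table
    pos_emb = block_size * n_embd                 # learned position embedding table
    qkv = 3 * n_embd * n_embd                     # bias-free Q, K, V; independent of head count
    attn_proj = n_embd * n_embd + n_embd          # attention output projection with bias
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)                # two LayerNorms, each with weight and bias
    per_block = qkv + attn_proj + ffwd + layer_norms
    final_ln = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size    # untied output head with bias
    return tok_emb + pos_emb + n_layer * per_block + final_ln + lm_head


total = count_gpt_params(vocab_size=103, block_size=256, n_embd=384, n_layer=6)
print(f"{total:,}")  # 10,818,151
```

The head count does not appear in the formula because splitting the embedding across heads changes the shape of the projections, not their total size.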
Training Log
Training was halted at iter 2250 of the configured 5,000 for analysis. Validation loss had risen at every evaluation since iter 1000, so continued training offered no further generalisation benefit.
Summary Metrics
| Metric | Value |
|---|---|
| Best validation loss | 2.9267 (at iter 750) |
| Final train loss | 0.2182 (at iter 2250) |
| Generalisation gap (Δ) | +3.7665 (val − train) |
| Val/Train ratio | ~18.3× |
| Train loss reduction | ×21.5 (iter 0 → 2250) |
Scaling Law Analysis
This experiment illustrates the data-limited regime described by neural scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022).
Chinchilla Optimal Compute Ratio
N = 10.82M parameters
D_opt ≈ 20 × N = 20 × 10.82M ≈ 216M tokens (Hoffmann et al., 2022)
D_actual = 314,003 tokens
Coverage = D_actual / D_opt ≈ 0.15%
The model was trained on approximately 0.15% of the compute-optimal token budget for its size. Under the Chinchilla scaling framework this is a severely data-limited regime: the model has far more capacity than the data can constrain, which makes overfitting close to inevitable by design.
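The arithmetic behind these figures can be checked with a few lines. This is a minimal sketch using the values reported in this release; the one added assumption, flagged in the code, is that every training iteration consumes `batch_size * block_size` tokens. The constant names are illustrative.

```python
N_PARAMS = 10_818_151        # total parameters (configuration table)
TRAIN_TOKENS = 314_003       # training split size in tokens
BATCH_SIZE = 64
BLOCK_SIZE = 256
HALT_ITER = 2250             # iteration at which the run was stopped
TOKENS_PER_PARAM = 20        # Chinchilla rule of thumb (Hoffmann et al., 2022)

d_opt = TOKENS_PER_PARAM * N_PARAMS
coverage = TRAIN_TOKENS / d_opt
tokens_per_param = TRAIN_TOKENS / N_PARAMS

# Assumption: each iteration draws batch_size * block_size tokens from the training split.
tokens_seen = HALT_ITER * BATCH_SIZE * BLOCK_SIZE
passes_over_data = tokens_seen / TRAIN_TOKENS

print(f"D_opt            = {d_opt / 1e6:.1f}M tokens")
print(f"coverage         = {coverage:.2%}")
print(f"tokens per param = {tokens_per_param:.3f}")
print(f"tokens seen      = {tokens_seen / 1e6:.1f}M (~{passes_over_data:.0f} passes)")
```

Under that assumption the model had processed the equivalent of roughly 117 passes over the training split by iter 2250, which is consistent with the memorisation described here.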
Overfitting Signal — Empirical Loss Ratio
L_val / L_train (iter 2250) = 3.985 / 0.218 ≈ 18.3×
Generalisation gap Δ = L_val − L_train = +3.767
A model that generalises well has a ratio close to 1.0. A ratio of ~18× is a strong indicator of severe memorisation: the model has learned to reproduce the training corpus rather than acquire generalisable language representations.
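The summary metrics follow directly from three logged losses. The sketch below recomputes them; the function name `overfitting_metrics` is hypothetical, and the final validation loss of 3.9847 is inferred here as the final train loss plus the reported gap (it is quoted above only to three decimals, as 3.985).

```python
def overfitting_metrics(train_start: float, train_final: float, val_final: float) -> dict:
    """Derive the gap, val/train ratio, and train-loss reduction factor."""
    return {
        "gap": val_final - train_final,
        "ratio": val_final / train_final,
        "reduction": train_start / train_final,
    }


metrics = overfitting_metrics(
    train_start=4.6914,   # train loss at iter 0
    train_final=0.2182,   # train loss at iter 2250
    val_final=3.9847,     # inferred: 0.2182 + 3.7665
)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```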
Power-Law Fit — Train Loss Descent
L(C) ∝ C^(−α), α ≈ 0.35 (empirically estimated)
L₀ = 4.6914 (iter 0)
L₂₂₅₀ = 0.2182 (iter 2250)
Reduction factor = ×21.5
Train loss follows a near-power-law decay with compute, consistent with scaling law predictions. Validation loss stabilised and began diverging post-iter 750, confirming the model crossed into the memorisation regime at that point.
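An exponent such as α is usually estimated by a straight-line fit in log-log space. The sketch below shows that procedure under stated assumptions: the intermediate losses of this run are not published here, so the curve is synthetic, generated from a known exponent purely to show that the fit recovers it. The helper `fit_power_law` is hypothetical. Note that iter 0 has to be excluded from such a fit, because log(0) is undefined.

```python
import math


def fit_power_law(steps, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(step); returns (a, alpha)."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic curve from a known exponent; these are NOT measurements from this run.
steps = [250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250]
synthetic_losses = [5.0 * s ** -0.35 for s in steps]

a, alpha = fit_power_law(steps, synthetic_losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

Running the same fit on the real logged train losses would show how well a single exponent describes the whole descent.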
Key Findings
01 — Optimal checkpoint is at iteration 750.
Validation loss reached its minimum of 2.9267 at iter 750. All training beyond this point decreased train loss while monotonically worsening generalisation — a textbook overfitting trajectory. The iter-750 checkpoint is the recommended model weight from this run.
02 — Model is significantly over-parameterised for this dataset.
At 10.82M parameters trained on ~314K tokens, the model has roughly 34 parameters per training token, or about 0.029 tokens per parameter. Chinchilla scaling recommends roughly 20 tokens per parameter; 100–200 tokens per parameter are preferred for robust generalisation.
03 — Train loss follows expected power-law decay.
The steep descent from 4.69 → 0.22 over 2,250 iterations is consistent with L ∝ C^(−α) dynamics. This indicates that the model architecture and training loop are functioning correctly and that the infrastructure is sound.
04 — The experiment validates the training pipeline.
Despite the overfitting outcome, the data loader, tokeniser, model forward pass, gradient updates, and evaluation loop all behave as expected. This is the primary deliverable of a v1.0-alpha experimental release.
Aims for v1.0-beta
- Scale the data. Increase dataset size to a minimum of ~200M tokens to approach compute-optimal training for this architecture.
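- Or scale down the model. Reduce capacity to ~1–2M parameters to bring the model closer to what the current dataset can support. At 20 tokens per parameter, ~314K tokens would support only about 16K parameters, so this narrows the gap rather than closing it.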
- Add regularisation. Introduce dropout (p = 0.1–0.2), weight decay (λ = 0.01–0.1), and early stopping triggered by validation loss plateau.
- Evaluate a smaller architecture. Test 4 layers × 4 heads × 256 embd (~3M params) against this same dataset to confirm the generalisation gap narrows predictably.
- Use the iter-750 checkpoint. Save and tag it as the canonical v1.0-alpha model weight — it represents the point of best generalisation in this run.
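Early stopping and best-checkpoint selection, as proposed in the list above, could be sketched as follows. This is a framework-free illustration, not the project's training loop: the class name `EarlyStopping`, its `patience` and `min_delta` parameters, and the loss values in `history` are all assumptions made for the example. The losses are illustrative and are not the logged values from this run.

```python
class EarlyStopping:
    """Track the best validation loss and signal when it stops improving."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # evaluations to wait after the last improvement
        self.min_delta = min_delta    # minimum decrease that counts as an improvement
        self.best_loss = float("inf")
        self.best_iter = None
        self.bad_evals = 0

    def update(self, iteration: int, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_iter = iteration   # a real loop would save a checkpoint here
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Illustrative (iteration, validation loss) pairs only.
history = [(250, 3.4), (500, 3.1), (750, 2.9), (1000, 3.0), (1250, 3.2), (1500, 3.5)]

stopper = EarlyStopping(patience=3)
stopped_at = None
for iteration, val_loss in history:
    if stopper.update(iteration, val_loss):
        stopped_at = iteration
        break

print(f"best checkpoint: iter {stopper.best_iter}, stopped at iter {stopped_at}")
```

A patience of a few evaluations avoids stopping on a single noisy reading while still ending the run long before the configured iteration limit.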
References
- Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361
- Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556
Release Info
| Field | Value |
|---|---|
| Version | v1.0-alpha |
| Release type | Pre-release / Experimental |
| Status | Research documentation only |
| Date | 2026-04-03 |
This release is published for research documentation and reproducibility purposes only. Not intended for deployment. Use at your own discretion.
First GPU Release: 10M Parameter Model on TinyStories — v1.0
Overview
This is the first official release of the Transformer Language Model project.
It introduces a fully GPU-optimised training pipeline, a 10M parameter model architecture, and a Google Colab notebook — making it possible to train a GPT-style transformer from scratch in under an hour on a free T4 GPU, with no local setup required.
Highlights
- 10M parameter model trained from scratch — no pre-trained weights, no fine-tuning
- Google Colab notebook — full end-to-end pipeline, runs free on T4 GPU
- TinyStories dataset — 100,000 stories (~50M characters) loaded via HuggingFace
- GPU-optimised hyperparameters — larger batch, longer context, deeper model
- Estimated training time — ~45–60 minutes on a free T4 GPU
- CPU fallback — `transformer.py` still works on CPU for low-resource machines
Model Architecture
| Parameter | Value |
|---|---|
| Parameters | ~10M |
| `n_embd` | `384` |
| `n_head` | `6` |
| `n_layer` | `6` |
| `batch_size` | `64` |
| `block_size` | `256` |
| `max_iters` | `5,000` |
| `dropout` | `0.2` |
| `learning_rate` | `3e-4` |
| Dataset | ~50M chars (TinyStories) |
| Device | CUDA / CPU |
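The hyperparameters in the table could be collected in a single configuration object. The sketch below is one way to do that, not the layout used by the notebook: the class name `TrainConfig` and its two derived properties are assumptions added for illustration. The field names and defaults come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    batch_size: int = 64
    block_size: int = 256
    max_iters: int = 5_000
    dropout: float = 0.2
    learning_rate: float = 3e-4

    def __post_init__(self):
        # Multi-head attention splits the embedding evenly across heads.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    @property
    def head_size(self) -> int:
        """Embedding width handled by each attention head."""
        return self.n_embd // self.n_head

    @property
    def tokens_per_iter(self) -> int:
        """Tokens in one batch, assuming every sequence is block_size long."""
        return self.batch_size * self.block_size


cfg = TrainConfig()
print(cfg.head_size, cfg.tokens_per_iter)  # 64 16384
```

Validating divisibility at construction time catches an invalid head count before any model is built.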
Dataset
Uses the TinyStories dataset by Ronen Eldan and Yuanzhi Li (Microsoft Research), loaded directly from HuggingFace:
roneneldan/TinyStories
100,000 stories are used by default. Adjustable in the notebook via NUM_STORIES.
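Loading a fixed number of stories might look like the sketch below. It assumes the Hugging Face `datasets` package is installed, that the notebook takes the first `NUM_STORIES` rows of the training split, and that each row stores its story in a `text` column. The helper names `split_spec` and `load_stories` are hypothetical and may not match the notebook.

```python
NUM_STORIES = 100_000  # default stated in the release notes


def split_spec(num_stories: int) -> str:
    """Build a split string selecting the first `num_stories` training rows."""
    if num_stories <= 0:
        raise ValueError("num_stories must be positive")
    return f"train[:{num_stories}]"


def load_stories(num_stories: int = NUM_STORIES) -> str:
    """Download the stories and join them into one training corpus."""
    # Imported lazily so split_spec() works without the datasets package installed.
    from datasets import load_dataset

    ds = load_dataset("roneneldan/TinyStories", split=split_spec(num_stories))
    # Assumption: each row keeps its story in a "text" column.
    return "\n".join(row["text"] for row in ds)
```

Calling `load_stories(1_000)` first is a quick way to test the rest of the pipeline before downloading the full default subset.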
How to Run
Open the Colab notebook and run cells in order:
- Install dependencies
- Download TinyStories from HuggingFace
- Set up project structure and config
- Configure hyperparameters
- Train the model (~45–60 min on T4 GPU)
- Download `best_model.pt`
- Generate sample output
No local GPU required.
Notes
- `best_model.pt` is not included in the repository due to file size; train via the Colab notebook to generate it
- Vocabulary size adapts dynamically to the dataset at runtime
- The CPU training path (`transformer.py`) remains functional for low-resource machines
- This release is marked as latest but not stable; the project is under active development
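A vocabulary that adapts to the dataset at runtime is typically built from the set of characters found in the corpus. The sketch below shows that idea for a character-level model; the function names `build_vocab`, `encode`, and `decode` are assumptions for illustration and may differ from the project's code.

```python
def build_vocab(text: str):
    """Return (stoi, itos) mappings built from the characters present in `text`."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text: str, stoi: dict) -> list:
    """Map each character to its integer id."""
    return [stoi[ch] for ch in text]


def decode(ids: list, itos: dict) -> str:
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)


sample = "Once upon a time, there was a tiny model."
stoi, itos = build_vocab(sample)
vocab_size = len(stoi)          # depends only on the characters seen in the data
ids = encode(sample, stoi)
print(vocab_size, decode(ids, itos) == sample)
```

Because the mapping is derived from the training text, any character absent from that text cannot be encoded at inference time and would need explicit handling.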
Known Limitations
- No pip package yet — must run via notebook or script directly
- Inference script requires manual path configuration (see README)
- Output is story-shaped but not coherent — character-level model limitation
Acknowledgements
- Dataset: [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) — Eldan & Li, Microsoft Research