Releases: Eamon2009/symctive

Experimental Pre-Release -v1.0-alpha

03 Apr 06:46

Pre-release

GPT Language Model — Release Notes

https://colab.research.google.com/drive/1TO6_1WrPL2Oyq0fOjpddXZ6r0HXmexu7#scrollTo=di-1lWDw5fOC

v1.0-alpha · Pre-Release · Experimental

⚠️ Experimental Release
This is a research pre-release (v1.0-alpha) published for documentation and reproducibility purposes only. The model exhibits significant overfitting and is not recommended for production use. Results reflect a controlled scaling experiment, not a deployment-ready system.


Hallucinations in AI output

	struct bsd_acct_sctruct *acct_ppin_pu() {
				/* Con't the name sof bafeace;
		/* If forced to ail gevence foreacctompy ach */
		return 0;
}

	size_sbuffenc = sech_addr(*info->iPTAIZE);
	free__modinfo_exit(&info->index.vers_ea);
	mfree_ree(EPT_FLOABTF_MR(T));
	pr_err("Invalid state turn: %u\n", p->name, "Dimpoten);
		pr_errr("%s::: v"Aailid meagic: version sywnig tayue !=%h%w morks mis validate with itch us cimprecpution ptive-lid inatches keymbol)
			return ;
	}


	/*
 * That se con gowate asore syncach_bodes() free sto the sto stree pace asdcacche_memory aches ing
% U Shror in loady. Nunt
	 */
	static = mod->symb[#e].sections;

	size mod->bsections(info->sechdrs[i].syms);
	ar = syms->sp[0].sym->size += sech_size -> sizeof(mod->me, mod->ki);
		return signed ->mod->racce_ove_blacklist());

	size_t mod_symbol_no(syms);	/*
	 * A Me is ae somples from MAMessize, so as beforeer. */
	if (!S— mod->mem[type].bofset)		EXEC;
		struct module *mod		}
			!symbol_debugfs_clock(&module_ &module_get, list, find_count, ifo->copy);
	/*
	 * This a-ddata. ssymsecs importimiley is chan only eadd proby seeters eefctions, symbol version.
		 */
	if (!ef_validate(info, info, "__Ipprintk")))
		return -ENEXEC;
	}

Model Configuration

Parameter | Value
Architecture | Decoder-only Transformer (GPT-style)
Total Parameters | 10,818,151 (~10.82M)
Layers × Heads × Embedding dim | 6 × 6 × 384
Context length | 256 tokens
Batch size | 64
Training iterations | 5,000
Dataset | data2.txt (348,893 chars, vocab size 103)
Train / Val split | 314,003 / 34,890 tokens
Device | CUDA (GPU)
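
The reported parameter total can be cross-checked from the configuration. The sketch below assumes a particular decoder-only layout that the release notes do not state: learned positional embeddings, bias-free query/key/value projections, a 4× feed-forward expansion, and an untied output head with bias. `count_params` is a hypothetical helper, not part of the project. Under those assumptions the count comes out to the reported 10,818,151.

```python
def count_params(n_layer, n_embd, block_size, vocab_size):
    """Parameter count for an ASSUMED decoder-only layout (not confirmed by the release)."""
    d = n_embd
    attn_qkv = 3 * d * d                          # query/key/value projections, no bias (assumed)
    attn_proj = d * d + d                         # attention output projection with bias
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two linear layers, 4x expansion (assumed)
    layer_norms = 2 * (2 * d)                     # two LayerNorms per block (weight + bias)
    per_block = attn_qkv + attn_proj + ffn + layer_norms

    tok_emb = vocab_size * d                      # token embedding table
    pos_emb = block_size * d                      # learned positional embeddings (assumed)
    final_ln = 2 * d                              # final LayerNorm
    lm_head = d * vocab_size + vocab_size         # untied output head with bias (assumed)
    return n_layer * per_block + tok_emb + pos_emb + final_ln + lm_head


total = count_params(n_layer=6, n_embd=384, block_size=256, vocab_size=103)
print(f"{total:,}")  # 10,818,151
```

The head count does not appear in the function because the heads partition the same d × d projections. Under the same assumptions, a 4-layer model with a 256-dimensional embedding comes to 3,274,855 parameters.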

Training Log

[Image: training log]

Training was halted at iter 2250 for analysis. Validation loss had risen monotonically since its minimum at iter 750, indicating no further generalisation benefit from continued training.


Summary Metrics

Metric | Value
Best validation loss | 2.9267 (at iter 750)
Final train loss | 0.2182 (at iter 2250)
Generalisation gap (Δ) | +3.7665 (val − train)
Val/Train ratio | ~18.3×
Train loss reduction | ×21.5 (iter 0 → 2250)

Scaling Law Analysis

This experiment provides a clear empirical demonstration of the dynamics predicted by neural scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022).

Chinchilla Optimal Compute Ratio

N        = 10.82M parameters
D_opt    ≈ 20 × N = 20 × 10.82M ≈ 216M tokens   (Hoffmann et al., 2022)
D_actual = 314,003 tokens

Coverage = D_actual / D_opt ≈ 0.15%

The training set contains approximately 0.15% of the compute-optimal token budget for a model of this size. Under the Chinchilla scaling framework, this is a severely data-limited regime: the model has far more capacity than the data can use, which makes overfitting close to inevitable by design.
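
The coverage figure can be reproduced with a few lines of arithmetic. This is a minimal sketch using the numbers from the configuration table; the factor of 20 tokens per parameter is the usual rule-of-thumb reading of Hoffmann et al. (2022), and the constant names are illustrative.

```python
N_PARAMS = 10_818_151      # total parameters (configuration table)
D_ACTUAL = 314_003         # training tokens (train/val split)
TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb, approximate

d_opt = TOKENS_PER_PARAM * N_PARAMS   # compute-optimal token budget
coverage = D_ACTUAL / d_opt           # fraction of that budget actually available

print(f"D_opt    = {d_opt:,} tokens")   # D_opt    = 216,363,020 tokens
print(f"Coverage = {coverage:.2%}")     # Coverage = 0.15%
```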


Overfitting Signal — Empirical Loss Ratio

L_val / L_train  (iter 2250)  =  3.985 / 0.218  ≈  18.3×
Generalisation gap  Δ         =  L_val − L_train  =  +3.767

A well-trained model exhibits a ratio approaching 1.0. A ratio of ~18× is a definitive indicator of severe memorisation. The model has learned to reproduce the training corpus rather than acquiring generalisable language representations.
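
Both signals are simple functions of the two losses at a given evaluation step. The helper below is a hypothetical sketch (`overfit_signals` is not part of the project) applied to the iter-2250 values quoted above.

```python
def overfit_signals(train_loss, val_loss):
    """Return (val/train ratio, generalisation gap) for one evaluation step."""
    ratio = val_loss / train_loss
    gap = val_loss - train_loss
    return ratio, gap


ratio, gap = overfit_signals(train_loss=0.218, val_loss=3.985)
print(f"ratio = {ratio:.1f}x, gap = {gap:+.3f}")  # ratio = 18.3x, gap = +3.767
```

Logging both values at every evaluation makes the onset of memorisation visible as soon as the ratio starts to climb away from 1.0.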


Power-Law Fit — Train Loss Descent

L(C) ∝ C^(−α),   α ≈ 0.35   (empirically estimated)
 
L₀     = 4.6914   (iter 0)
L₂₂₅₀  = 0.2182   (iter 2250)
Reduction factor = ×21.5

Train loss follows a near-power-law decay with compute, consistent with scaling law predictions. Validation loss stabilised and began diverging post-iter 750, confirming the model crossed into the memorisation regime at that point.
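
One common way to estimate α is a least-squares line through the (log C, log L) points. The sketch below shows that technique on synthetic data generated from a known power law; it does not use the run's logged losses, and `fit_power_law` is a hypothetical helper. Step 0 has to be excluded because log 0 is undefined.

```python
import math


def fit_power_law(steps, losses):
    """Least-squares fit of log L = log a - alpha * log C. Returns (a, alpha)."""
    xs = [math.log(c) for c in steps]
    ys = [math.log(loss) for loss in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# SYNTHETIC points from L = 3.0 * C^(-0.35); placeholders, not measured values.
steps = [250, 500, 750, 1000, 1500, 2250]
losses = [3.0 * c ** -0.35 for c in steps]

a, alpha = fit_power_law(steps, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")  # a = 3.000, alpha = 0.350
```

On real logs the points will not sit exactly on a line, so the fitted α should be read together with the residuals rather than as an exact constant.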


Key Findings

01 — Optimal checkpoint is at iteration 750.
Validation loss reached its minimum of 2.9267 at iter 750. All training beyond this point decreased train loss while monotonically worsening generalisation — a textbook overfitting trajectory. The iter-750 checkpoint is the recommended model weight from this run.

02 — Model is significantly over-parameterised for this dataset.
At 10.82M parameters trained on ~314K tokens, the model has approximately 0.029 tokens per parameter (about 34 parameters per token). Chinchilla scaling recommends roughly 20 tokens per parameter; ratios of 100–200 tokens per parameter are preferred for robust generalisation.

03 — Train loss follows expected power-law decay.
The steep descent from 4.69 to 0.22 over 2,250 iterations is consistent with L ∝ C^(−α) dynamics. This indicates that the model architecture and training loop are optimising as intended, so the infrastructure appears sound.

04 — The experiment validates the training pipeline.
Despite the overfitting outcome, the data loader, tokeniser, model forward pass, gradient updates, and evaluation loop all behave as expected. This is the primary deliverable of a v1.0-alpha experimental release.


Aims for v1.0-beta

  • Scale the data. Increase dataset size to a minimum of ~200M tokens to approach compute-optimal training for this architecture.
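  • Or scale down the model. Reduce capacity to ~1–2M parameters to move closer to what the current dataset can support. At 20 tokens per parameter, ~314K tokens corresponds to only ~16K parameters, so this narrows the mismatch rather than removing it.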
  • Add regularisation. Introduce dropout (p = 0.1–0.2), weight decay (λ = 0.01–0.1), and early stopping triggered by validation loss plateau.
  • Evaluate a smaller architecture. Test 4 layers × 4 heads × 256 embd (~3M params) against this same dataset to confirm the generalisation gap narrows predictably.
  • Use the iter-750 checkpoint. Save and tag it as the canonical v1.0-alpha model weight — it represents the point of best generalisation in this run.
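
The early-stopping and checkpoint-selection items above can be sketched as a patience rule over the validation history. This is an illustration under stated assumptions: `best_checkpoint`, `patience`, and `min_delta` are hypothetical names, and the history is made up except for the 2.9267 value at iter 750, which comes from the summary metrics.

```python
def best_checkpoint(val_history, patience=3, min_delta=0.0):
    """Scan (iteration, val_loss) pairs in order and stop after `patience`
    consecutive evaluations that fail to improve by more than `min_delta`.
    Returns (best_iter, best_loss, stop_iter); stop_iter is None if never triggered."""
    best_iter, best_loss = None, float("inf")
    bad_evals = 0
    stop_iter = None
    for iteration, loss in val_history:
        if loss < best_loss - min_delta:
            best_iter, best_loss = iteration, loss   # new best: this is the checkpoint to keep
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                stop_iter = iteration
                break
    return best_iter, best_loss, stop_iter


# ILLUSTRATIVE history: only (750, 2.9267) is a reported value; the rest are placeholders.
history = [(0, 4.70), (250, 3.40), (500, 3.05), (750, 2.9267),
           (1000, 3.02), (1250, 3.20), (1500, 3.45), (1750, 3.70)]

print(best_checkpoint(history, patience=3))  # (750, 2.9267, 1500)
```

With an evaluation interval of 250 iterations and a patience of 3, a rule like this would have stopped the run well before iter 2250 while still keeping the iter-750 weights.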

References

  • Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361
  • Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556

Release Info

Version | v1.0-alpha
Release type | Pre-release / Experimental
Status | Research documentation only
Date | 2026-04-03

This release is published for research documentation and reproducibility purposes only. Not intended for deployment. Use at your own discretion.

First GPU Release: 10M Parameter Model on TinyStories — v1.0

25 Mar 15:18

v1.0 — First GPU Release: 10M Parameter Model on TinyStories

Overview

This is the first official release of the Transformer Language Model project.

It introduces a fully GPU-optimised training pipeline, a 10M parameter model architecture, and a Google Colab notebook — making it possible to train a GPT-style transformer from scratch in under an hour on a free T4 GPU, with no local setup required.


Highlights

  • 10M parameter model trained from scratch — no pre-trained weights, no fine-tuning
  • Google Colab notebook — full end-to-end pipeline, runs free on T4 GPU
  • TinyStories dataset — 100,000 stories (~50M characters) loaded via HuggingFace
  • GPU-optimised hyperparameters — larger batch, longer context, deeper model
  • Estimated training time — ~45–60 minutes on a free T4 GPU
  • CPU fallback: transformer.py still works on CPU for low-resource machines

Model Architecture

Parameter | Value
Parameters | ~10M
n_embd | 384
n_head | 6
n_layer | 6
batch_size | 64
block_size | 256
max_iters | 5,000
dropout | 0.2
learning_rate | 3e-4
Dataset | ~50M chars (TinyStories)
Device | CUDA / CPU


Dataset

Uses the TinyStories dataset by Ronen Eldan and Yuanzhi Li (Microsoft Research), loaded directly from HuggingFace:

roneneldan/TinyStories

100,000 stories are used by default. Adjustable in the notebook via NUM_STORIES.
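
A minimal sketch of how the stories might be loaded and joined into one character stream is shown below. It assumes the Hugging Face `datasets` library, its `train[:N]` slice syntax, and a `text` column in the dataset; `load_stories` and `build_corpus` are hypothetical helpers, and the notebook's actual loading code may differ.

```python
NUM_STORIES = 100_000  # default stated in the release notes


def load_stories(num_stories=NUM_STORIES):
    """Fetch the first `num_stories` training stories (requires network access)."""
    from datasets import load_dataset  # third-party: pip install datasets
    ds = load_dataset("roneneldan/TinyStories", split=f"train[:{num_stories}]")
    return [row["text"] for row in ds]  # assumes the column is named "text"


def build_corpus(stories, separator="\n\n"):
    """Join non-empty stories into a single string for a character-level model."""
    return separator.join(s.strip() for s in stories if s.strip())


# Usage (downloads the dataset on first call):
#   corpus = build_corpus(load_stories())
#   print(f"{len(corpus):,} characters")
```

Lowering `NUM_STORIES` shortens both the download and the training run, at the cost of a smaller corpus.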


How to Run

Open the Colab notebook and run cells in order:

  1. Install dependencies
  2. Download TinyStories from HuggingFace
  3. Set up project structure and config
  4. Configure hyperparameters
  5. Train the model (~45–60 min on T4 GPU)
  6. Download best_model.pt
  7. Generate sample output

No local GPU required.


Notes

  • best_model.pt is not included in the repository due to file size — train via the Colab notebook to generate it
  • Vocabulary size adapts dynamically to the dataset at runtime
  • The CPU training path (transformer.py) remains functional for low-resource machines
  • This release is marked as latest but not stable — the project is under active development

Known Limitations

  • No pip package yet — must run via notebook or script directly
  • Inference script requires manual path configuration (see README)
  • Output is story-shaped but not coherent — character-level model limitation

Acknowledgements

  • Dataset: TinyStories — Eldan & Li, Microsoft Research