03 Apr 06:46

Eamon2009

3903e16

Experimental Pre-Release: v1.0-alpha

GPT Language Model — Release Notes

https://colab.research.google.com/drive/1TO6_1WrPL2Oyq0fOjpddXZ6r0HXmexu7#scrollTo=di-1lWDw5fOC

v1.0-alpha · Pre-Release · Experimental

⚠️ Experimental Release
This is a research pre-release (v1.0-alpha) published for documentation and reproducibility purposes only. The model exhibits significant overfitting and is not recommended for production use. Results reflect a controlled scaling experiment, not a deployment-ready system.

Hallucinations in AI Output

	struct bsd_acct_sctruct *acct_ppin_pu() {
				/* Con't the name sof bafeace;
		/* If forced to ail gevence foreacctompy ach */
		return 0;
}

	size_sbuffenc = sech_addr(*info->iPTAIZE);
	free__modinfo_exit(&info->index.vers_ea);
	mfree_ree(EPT_FLOABTF_MR(T));
	pr_err("Invalid state turn: %u\n", p->name, "Dimpoten);
		pr_errr("%s::: v"Aailid meagic: version sywnig tayue !=%h%w morks mis validate with itch us cimprecpution ptive-lid inatches keymbol)
			return ;
	}


	/*
 * That se con gowate asore syncach_bodes() free sto the sto stree pace asdcacche_memory aches ing
% U Shror in loady. Nunt
	 */
	static = mod->symb[#e].sections;

	size mod->bsections(info->sechdrs[i].syms);
	ar = syms->sp[0].sym->size += sech_size -> sizeof(mod->me, mod->ki);
		return signed ->mod->racce_ove_blacklist());

	size_t mod_symbol_no(syms);	/*
	 * A Me is ae somples from MAMessize, so as beforeer. */
	if (!S— mod->mem[type].bofset)		EXEC;
		struct module *mod		}
			!symbol_debugfs_clock(&module_ &module_get, list, find_count, ifo->copy);
	/*
	 * This a-ddata. ssymsecs importimiley is chan only eadd proby seeters eefctions, symbol version.
		 */
	if (!ef_validate(info, info, "__Ipprintk")))
		return -ENEXEC;
	}

Model Configuration

| Parameter | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-style) |
| Total Parameters | 10,818,151 (~10.82M) |
| Layers × Heads × Embedding dim | 6 × 6 × 384 |
| Context length | 256 tokens |
| Batch size | 64 |
| Training iterations | 5,000 |
| Dataset | `data2.txt` (348,893 chars, vocab size 103) |
| Train / Val split | 314,003 / 34,890 tokens |
| Device | CUDA (GPU) |
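
The parameter total in the table can be sanity-checked with a short count. This is a sketch under an assumed layout that the release notes do not spell out: learned token and position embeddings, bias-free key/query/value projections, a biased attention output projection, a 4× wide feed-forward block, two LayerNorms per block plus a final one, and an untied output head with bias. The function name `count_gpt_params` is hypothetical.

```python
def count_gpt_params(vocab_size: int, block_size: int, n_layer: int, n_embd: int) -> int:
    """Count trainable parameters for an ASSUMED minimal GPT layout (see lead-in)."""
    d = n_embd
    tok_emb = vocab_size * d                      # token embedding table
    pos_emb = block_size * d                      # learned position embedding table
    attn_qkv = 3 * d * d                          # key, query, value projections (no bias)
    attn_proj = d * d + d                         # attention output projection (with bias)
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two linear layers, 4x hidden width
    layer_norms = 2 * (2 * d)                     # two LayerNorms per block (weight + bias)
    per_block = attn_qkv + attn_proj + mlp + layer_norms
    final_ln = 2 * d                              # final LayerNorm (weight + bias)
    lm_head = d * vocab_size + vocab_size         # untied output head with bias
    return tok_emb + pos_emb + n_layer * per_block + final_ln + lm_head


# Values taken from the configuration table above.
total = count_gpt_params(vocab_size=103, block_size=256, n_layer=6, n_embd=384)
print(f"{total:,}")
```

Under these assumptions the count comes to 10,818,151, which matches the reported total. The number of heads does not change the count, because the heads split the same projection matrices.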

Training Log

Training was halted at iter 2250 of the configured 5,000 for analysis. Validation loss reached its minimum at iter 750 and has risen since, indicating no further generalisation benefit from continued training.

Summary Metrics

| Metric | Value |
|---|---|
| Best validation loss | 2.9267 (at iter 750) |
| Final train loss | 0.2182 (at iter 2250) |
| Generalisation gap (Δ) | +3.7665 (val − train) |
| Val/Train ratio | ~18.3× |
| Train loss reduction | ×21.5 (iter 0 → 2250) |

Scaling Law Analysis

This experiment provides a clear empirical demonstration of the dynamics predicted by neural scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022).

Chinchilla Optimal Compute Ratio

N        = 10.82M parameters   (actual model size)
D_opt    ≈ 20 × N = 20 × 10.82M ≈ 216M tokens   (Hoffmann et al., 2022)
D_actual = 314,003 tokens

Coverage = D_actual / D_opt ≈ 0.15%

The model was trained on approximately 0.15% of the compute-optimal token budget. Under the Chinchilla scaling framework, this constitutes a severely undertrained regime — the model had far more capacity than the data could fully utilise, making overfitting near-inevitable by design.
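
The coverage figure can be reproduced in a few lines. The sketch below also estimates how many passes over the training split the run made, which assumes (this is not stated in the notes) that every iteration draws `batch_size × block_size` tokens from the training split. The constant names are illustrative.

```python
N_PARAMS = 10_818_151      # total parameters, from the configuration table
TRAIN_TOKENS = 314_003     # training split size, from the configuration table
TOKENS_PER_PARAM = 20      # rule of thumb attributed to Hoffmann et al. (2022)

d_opt = TOKENS_PER_PARAM * N_PARAMS
coverage = TRAIN_TOKENS / d_opt

# ASSUMPTION: each iteration samples batch_size * block_size training tokens.
BATCH_SIZE, BLOCK_SIZE, ITERS_RUN = 64, 256, 2250
tokens_processed = BATCH_SIZE * BLOCK_SIZE * ITERS_RUN
passes = tokens_processed / TRAIN_TOKENS

print(f"D_opt ≈ {d_opt / 1e6:.0f}M tokens")
print(f"coverage ≈ {coverage:.2%}")
print(f"tokens processed = {tokens_processed:,} (~{passes:.0f} passes over the training split)")
```

If the sampling assumption holds, the run processed about 36.9M tokens by iter 2250, or roughly 117 passes over the same 314K tokens. That much repetition would be consistent with the memorisation described here.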

Overfitting Signal — Empirical Loss Ratio

L_val / L_train  (iter 2250)  =  3.985 / 0.218  ≈  18.3×
Generalisation gap  Δ         =  L_val − L_train  =  +3.767

A well-trained model exhibits a ratio approaching 1.0. A ratio of ~18× is a definitive indicator of severe memorisation. The model has learned to reproduce the training corpus rather than acquiring generalisable language representations.
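
As a worked check of the two figures above, the sketch below recomputes the ratio and gap, and adds the loss of a uniform guess over the 103-symbol vocabulary for scale. The uniform baseline assumes the reported losses are cross-entropy in nats, which the notes do not state explicitly.

```python
import math

VOCAB_SIZE = 103
train_loss, val_loss = 0.218, 3.985   # iter 2250, as reported above

ratio = val_loss / train_loss
gap = val_loss - train_loss
# ASSUMPTION: losses are natural-log cross-entropy, so a uniform guess scores ln(vocab).
uniform_loss = math.log(VOCAB_SIZE)

print(f"ratio = {ratio:.1f}x, gap = {gap:+.3f}, uniform baseline = {uniform_loss:.3f}")
```

On that assumption the uniform baseline is about 4.635, close to the iter-0 train loss of 4.6914, so the final validation loss of 3.985 sits nearer to random guessing than to the train loss.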

Power-Law Fit — Train Loss Descent

L(C) ∝ C^(−α),   α ≈ 0.35   (empirically estimated)
 
L₀     = 4.6914   (iter 0)
L₂₂₅₀  = 0.2182   (iter 2250)
Reduction factor = ×21.5

Train loss follows a near-power-law decay with compute, consistent with scaling law predictions. Validation loss stabilised and began diverging post-iter 750, confirming the model crossed into the memorisation regime at that point.
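
One common way to estimate an exponent like α is a least-squares line fit in log-log space. The sketch below shows that method on a synthetic curve generated with α = 0.35. It does not use the run's training log, which is not included in these notes, and it is not necessarily how the value above was obtained. The helper name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) in log-log space; returns (alpha, a)."""
    lx = [math.log(x) for x in xs]   # requires x > 0, so iter 0 must be excluded
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# SYNTHETIC data for illustration only, not measurements from this run.
steps = [250, 500, 1000, 2000]
losses = [5.0 * s ** -0.35 for s in steps]

alpha, a = fit_power_law(steps, losses)
print(f"alpha ≈ {alpha:.2f}, a ≈ {a:.2f}")
```

Fitting against iteration count rather than compute is a simplification. With a fixed model and batch size the two differ only by a constant factor, which leaves the fitted exponent unchanged.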

Key Findings

01 — Optimal checkpoint is at iteration 750.
Validation loss reached its minimum of 2.9267 at iter 750. All training beyond this point decreased train loss while monotonically worsening generalisation — a textbook overfitting trajectory. The iter-750 checkpoint is the recommended model weight from this run.

02 — Model is significantly over-parameterised for this dataset.
At 10.82M parameters trained on ~314K tokens, the run has roughly 0.029 tokens per parameter (about 34 parameters per token). Chinchilla scaling recommends roughly 20 tokens per parameter; ratios of 100–200 tokens per parameter are preferred for robust generalisation.

03 — Train loss follows expected power-law decay.
The steep descent from 4.69 → 0.22 over 2,250 iterations is consistent with L ∝ C^(−α) dynamics. This suggests the model architecture and training loop are functioning correctly, although a falling train loss alone cannot rule out every bug.

04 — The experiment validates the training pipeline.
Despite the overfitting outcome, the data loader, tokeniser, model forward pass, gradient updates, and evaluation loop all behave as expected. This is the primary deliverable of a v1.0-alpha experimental release.

Aims for v1.0-beta

Scale the data. Increase dataset size to a minimum of ~200M tokens to approach compute-optimal training for this architecture.
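Or scale down the model. Reduce capacity to ~1–2M parameters to narrow the mismatch between model size and the current dataset. At 20 tokens per parameter, ~314K tokens would support only about 16K parameters, so this eases overfitting pressure rather than reaching a compute-optimal ratio.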
Add regularisation. Introduce dropout (p = 0.1–0.2), weight decay (λ = 0.01–0.1), and early stopping triggered by validation loss plateau.
Evaluate a smaller architecture. Test 4 layers × 4 heads × 256 embd (~3M params) against this same dataset to confirm the generalisation gap narrows predictably.
Use the iter-750 checkpoint. Save and tag it as the canonical v1.0-alpha model weight — it represents the point of best generalisation in this run.
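
The early-stopping item above can be sketched as a small tracker that remembers the best validation loss and signals a stop after a set number of evaluations without improvement. The class name, `patience`, and `min_delta` are hypothetical and do not come from the project's code.

```python
class EarlyStopping:
    """Track the best validation loss and signal when to stop (hypothetical helper)."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # evaluations to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_iter = None
        self.bad_evals = 0

    def update(self, iteration: int, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_iter = iteration
            self.bad_evals = 0        # improvement: this is the checkpoint to save
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

In a training loop, `update` would be called after each validation pass. The checkpoint would be saved whenever `best_iter` changes, and the loop would exit once `update` returns `True`.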

References

Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361
Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556

Release Info


Version	v1.0-alpha
Release type	Pre-release / Experimental
Status	Research documentation only
Date	2026-04-03

This release is published for research documentation and reproducibility purposes only. Not intended for deployment. Use at your own discretion.

25 Mar 15:18

Eamon2009

v1.0

988b43c

v1.0 — First GPU Release: 10M Parameter Model on TinyStories

Overview

This is the first official release of the Transformer Language Model project.

It introduces a fully GPU-optimised training pipeline, a 10M parameter model architecture, and a Google Colab notebook — making it possible to train a GPT-style transformer from scratch in under an hour on a free T4 GPU, with no local setup required.

Highlights

10M parameter model trained from scratch — no pre-trained weights, no fine-tuning
Google Colab notebook — full end-to-end pipeline, runs free on T4 GPU
TinyStories dataset — 100,000 stories (~50M characters) loaded via HuggingFace
GPU-optimised hyperparameters — larger batch, longer context, deeper model
Estimated training time — ~45–60 minutes on a free T4 GPU
CPU fallback — transformer.py still works on CPU for low-resource machines

Model Architecture

| Parameter | Value |
|---|---|
| Parameters | ~10M |
| `n_embd` | 384 |
| `n_head` | 6 |
| `n_layer` | 6 |
| `batch_size` | 64 |
| `block_size` | 256 |
| `max_iters` | 5,000 |
| `dropout` | 0.2 |
| `learning_rate` | 3e-4 |
| Dataset | ~50M chars (TinyStories) |
| Device | CUDA / CPU |
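
For illustration, the hyperparameters in the table can be grouped into a single configuration object. The `GPTConfig` class and its derived properties are hypothetical and are not the project's actual config. The per-head size assumes the usual convention of dividing the embedding width evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    """Hyperparameters from the table above (hypothetical container)."""
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    batch_size: int = 64
    block_size: int = 256
    max_iters: int = 5_000
    dropout: float = 0.2
    learning_rate: float = 3e-4

    @property
    def head_size(self) -> int:
        # ASSUMPTION: embedding width is split evenly across attention heads.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head

    @property
    def tokens_per_iter(self) -> int:
        # Tokens in one batch: batch_size sequences of block_size tokens each.
        return self.batch_size * self.block_size


cfg = GPTConfig()
print(cfg.head_size, cfg.tokens_per_iter, cfg.tokens_per_iter * cfg.max_iters)
```

With these values each head is 64 wide and one iteration covers 16,384 tokens, so 5,000 iterations sample about 81.9M tokens from the ~50M-character dataset.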

Dataset

Uses the TinyStories dataset by Ronen Eldan and Yuanzhi Li (Microsoft Research), loaded directly from HuggingFace:

roneneldan/TinyStories

100,000 stories are used by default. Adjustable in the notebook via NUM_STORIES.

How to Run

Open the Colab notebook and run cells in order:

1. Install dependencies
2. Download TinyStories from HuggingFace
3. Set up project structure and config
4. Configure hyperparameters
5. Train the model (~45–60 min on T4 GPU)
6. Download best_model.pt
7. Generate sample output

No local GPU required.

Notes

best_model.pt is not included in the repository due to file size — train via the Colab notebook to generate it
Vocabulary size adapts dynamically to the dataset at runtime
The CPU training path (transformer.py) remains functional for low-resource machines
This release is marked as latest but not stable — the project is under active development
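
The note about vocabulary size adapting at runtime can be illustrated with a minimal character-level tokeniser, where the vocabulary is simply the set of distinct characters in the loaded text. This is a sketch of the general technique, not the project's code. The function names are hypothetical.

```python
def build_char_vocab(text: str):
    """Map each distinct character to an integer id; returns (stoi, itos)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text: str, stoi: dict) -> list:
    """Convert a string to a list of token ids (raises KeyError on unseen characters)."""
    return [stoi[ch] for ch in text]


def decode(ids: list, itos: dict) -> str:
    """Convert a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)


sample = "Once upon a time"
stoi, itos = build_char_vocab(sample)
vocab_size = len(stoi)   # determined by the data, not fixed in advance
print(vocab_size, decode(encode(sample, stoi), itos) == sample)
```

Because the vocabulary is derived from the text, changing the number of stories loaded can change `vocab_size`, and with it the shapes of the embedding and output layers.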

Known Limitations

No pip package yet — must run via notebook or script directly
Inference script requires manual path configuration (see README)
Output is story-shaped but not coherent — character-level model limitation

Acknowledgements

Dataset: [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) — Eldan & Li, Microsoft Research

Releases: Eamon2009/symctive
