This repository trains a decoder‑only language model from scratch for Romanian, targeting a compact corpus (~1B tokens).
- Language focus: Romanian morphology benefits from a tokenizer trained on Romanian text.
- Smaller corpus: Enables quick iteration on architecture/regularization without heavy compute.
- Control: Full control over special tokens, normalization, and vocabulary.
- Source: `klusai/ds-tf2-en-ro-3m` (Hugging Face Datasets)
- Column used: `translated_fable` (Romanian side only)
- Scale: after filtering and tokenization, the total is intended to stay under ~1B tokens.
You can swap the dataset with your own as long as you expose a single text column and update `preprocess.py` accordingly.
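For example, here is a hypothetical way to plug in a local corpus through the generic `text` loader. The file name is made up, and the rename assumes `preprocess.py` reads the `translated_fable` column; alternatively, change the column name used in `preprocess.py` itself.

```python
from datasets import load_dataset

# Hypothetical local corpus: one Romanian document per line.
ds = load_dataset("text", data_files={"train": "my_romanian_corpus.txt"})

# Expose the single text column under the name the pipeline expects
# (or instead update the column name referenced in preprocess.py).
ds = ds.rename_column("text", "translated_fable")
```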
- Type: SentencePiece Unigram (also builds an optional BPE variant for comparison)
- Vocab size: 32,000
- Special tokens: `<pad>`, `<unk>`, `<bos>`, `<eos>`
- Files produced: timestamped JSONs under `artifacts/tokenizers_<timestamp>/`
Train the tokenizers:
`python tokenizer/train_tokenizer.py`

Outputs (example):

- `artifacts/tokenizers_YYYY_MM_DD_HH_MM_SS/unigram_tokenizer.json`
- `artifacts/tokenizers_YYYY_MM_DD_HH_MM_SS/bpe_tokenizer.json`
Notes:
- Unigram is often preferable for morphologically rich languages like Romanian; it typically yields better subword splits than BPE at the same vocab size.
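For reference, here is a minimal sketch of how a Unigram tokenizer with this configuration could be trained using the Hugging Face `tokenizers` library. The corpus file, normalizer choice, and output path are assumptions for illustration; `tokenizer/train_tokenizer.py` may differ in its details.

```python
import os
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Assumed plain-text corpus with one Romanian sentence per line.
CORPUS_FILE = "data/romanian_corpus.txt"
OUT_DIR = "artifacts/tokenizers_example"

tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()             # assumed normalization
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()  # whitespace-aware subwords

trainer = trainers.UnigramTrainer(
    vocab_size=32_000,
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
    unk_token="<unk>",
)

tokenizer.train([CORPUS_FILE], trainer=trainer)

os.makedirs(OUT_DIR, exist_ok=True)
tokenizer.save(os.path.join(OUT_DIR, "unigram_tokenizer.json"))
```

The optional BPE variant follows the same pattern with `models.BPE()` and `trainers.BpeTrainer`.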
`preprocess.py` creates contiguous 2048‑token chunks for causal LM training; a sketch of this step follows the list below.
- Loads `klusai/ds-tf2-en-ro-3m` and keeps only `translated_fable`.
- Uses the local tokenizer JSON (no Hub downloads required).
- Saves an Arrow dataset ready for training.
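A minimal sketch of this pipeline, assuming a single `train` split and the standard `datasets`/`transformers` APIs (the exact filtering and split handling in `preprocess.py` may differ):

```python
from datasets import DatasetDict, load_dataset
from transformers import PreTrainedTokenizerFast

# Point this at the Unigram JSON produced by tokenizer/train_tokenizer.py.
TOKENIZER_PATH = "artifacts/tokenizers_example/unigram_tokenizer.json"
BLOCK_SIZE = 2048

tokenizer = PreTrainedTokenizerFast(tokenizer_file=TOKENIZER_PATH)

# Assumes the dataset exposes a single "train" split.
ds = load_dataset("klusai/ds-tf2-en-ro-3m", split="train")

def tokenize(batch):
    return {"input_ids": tokenizer(batch["translated_fable"])["input_ids"]}

def group_into_blocks(batch):
    # Concatenate all token ids, then cut them into contiguous
    # BLOCK_SIZE chunks, dropping the trailing remainder.
    ids = [tok for seq in batch["input_ids"] for tok in seq]
    total = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE
    return {"input_ids": [ids[i:i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]}

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
chunked = tokenized.map(group_into_blocks, batched=True)

DatasetDict({"train": chunked}).save_to_disk("artifacts/ds-tf2-en-ro-3m-tokenized")
```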
Update `TOKENIZER_PATH` in `preprocess.py` to the Unigram JSON you just trained, then run:
`python preprocess.py`

Outputs:

- `artifacts/ds-tf2-en-ro-3m-tokenized` (a DatasetDict on disk)
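The training code can then load the processed dataset straight from disk, for example:

```python
from datasets import load_from_disk

ds = load_from_disk("artifacts/ds-tf2-en-ro-3m-tokenized")
print(ds)  # DatasetDict whose examples are 2048-token "input_ids" blocks
```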
- Hugging Face Datasets/Transformers/Tokenizers
- Google SentencePiece
- Community datasets for Romanian text
- Mamba vs. Transformer benchmarks
- Quantization benchmarks
- Ablation studies
- Fine-tuning
- Generate 3M fables