Author: Adam Silva (GitHub: https://github.com/adaumsilva/)
Specialization: AI Engineering, RAG Architectures, and MLOps.
LuminaLLM is a research-grade repository dedicated to adapting open-source Large Language Models (LLMs) to specialized domains. Using Parameter-Efficient Fine-Tuning (PEFT) with 4-bit quantization (QLoRA), the project shows how to reach strong performance on niche tasks at a fraction of the memory and compute cost of full fine-tuning.
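The QLoRA recipe boils down to two configuration objects. A minimal sketch, assuming the standard `transformers` + `peft` + `bitsandbytes` stack; the repo's own builders live in `src/model/builder.py` and may differ, and the target modules and dropout below are illustrative defaults, not necessarily this project's:

```python
import torch
from transformers import BitsAndBytesConfig  # bitsandbytes-backed quantization config
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization; matmuls run in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters on the attention projections; base weights stay frozen
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

A training script would pass `bnb_config` to `AutoModelForCausalLM.from_pretrained(...)` and wrap the result with `peft.get_peft_model(model, lora_config)`, so only the adapter parameters receive gradients.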
- Efficient Fine-Tuning: Implementation of QLoRA to reduce VRAM requirements, allowing 7B+ parameter models to be tuned on accessible hardware.
- Modular Ingestion: Custom data pipelines for transforming unstructured text into instruction-following or chat-completion formats.
- Experiment Tracking: Integrated support for Weights & Biases (W&B) to monitor gradient norms, loss curves, and GPU utilization.
- Quantization & Merging: Scripts for loading models in 4/8-bit and merging LoRA weights back into the base model for production deployment.
- Performance Benchmarking: Comparative evaluation tools to measure the delta between base models and fine-tuned versions.
- Python 3.10+
- CUDA-capable GPU (≥ 16 GB VRAM recommended for Mistral-7B; 24 GB for larger models)
- CUDA 11.8 or 12.1
```bash
# 1. Clone and enter the repo
git clone https://github.com/your-org/LuminaLLM.git
cd LuminaLLM

# 2. Create a virtual environment
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# 3. Install PyTorch (CUDA 12.1 wheel)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install project dependencies
pip install -r requirements.txt

# 5. (Recommended) Install Flash Attention 2 for faster training
pip install flash-attn --no-build-isolation

# 6. Configure secrets
cp .env.example .env
# Edit .env and fill in HF_TOKEN, WANDB_API_KEY, etc.
```
```bash
# Convert Alpaca-format JSON to JSONL with a val split
python scripts/prepare_data.py \
    --input data/raw/alpaca_data.json \
    --output data/train.jsonl \
    --format alpaca \
    --val-output data/val.jsonl \
    --val-split 0.05

# Convert ShareGPT JSONL
python scripts/prepare_data.py \
    --input data/raw/sharegpt.jsonl \
    --output data/train.jsonl \
    --format sharegpt

# Convert OpenAI chat-format JSONL
python scripts/prepare_data.py \
    --input data/raw/chat_data.jsonl \
    --output data/train.jsonl \
    --format openai
```
Expected output schema per line:

```json
{"instruction": "Explain quantum entanglement", "input": "", "output": "Quantum entanglement is..."}
```

Run the exploratory data analysis notebook:

```bash
jupyter notebook notebooks/01_eda.ipynb
```

The notebook produces:

- Token-length histograms and percentile tables for `max_seq_length` selection
- Field coverage report
- Duplicate detection
- GPU VRAM estimate per batch size
Edit `configs/finetune.yaml` to set your base model, LoRA rank, learning rate, etc., then run:

```bash
python scripts/train.py --config configs/finetune.yaml
```

Override any YAML field inline:

```bash
python scripts/train.py --config configs/finetune.yaml \
    model.base_model_id=meta-llama/Meta-Llama-3-8B-Instruct \
    lora.r=32 \
    training.num_train_epochs=1 \
    training.report_to=tensorboard
```
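Dotted overrides like `lora.r=32` simply walk the nested config dict. A sketch of the mechanics under that assumption (`train.py` may instead delegate to a config library such as OmegaConf):

```python
def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply 'a.b.c=value' overrides to a nested dict, with best-effort typing."""
    for item in overrides:
        dotted, _, raw = item.partition("=")
        node = config
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})  # create intermediate sections as needed
        for cast in (int, float):
            try:
                node[leaf] = cast(raw)
                break
            except ValueError:
                continue
        else:  # neither int nor float parsed: try bool, else keep the string
            node[leaf] = {"true": True, "false": False}.get(raw.lower(), raw)
    return config
```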
Key config knobs (`configs/finetune.yaml`):

| Section | Key | Default | Notes |
|---|---|---|---|
| `model` | `base_model_id` | `mistralai/Mistral-7B-Instruct-v0.2` | Any HF Hub causal LM |
| `model` | `attn_implementation` | `flash_attention_2` | Set `eager` if Flash Attention is not installed |
| `quantization` | `bnb_4bit_quant_type` | `nf4` | `nf4` outperforms `fp4` empirically |
| `lora` | `r` | `64` | Higher = more capacity, more VRAM |
| `lora` | `lora_alpha` | `128` | Effective scale = alpha / r |
| `lora` | `use_rslora` | `true` | Rank-stabilised LoRA (recommended) |
| `training` | `optim` | `paged_adamw_8bit` | Saves ~2 GB vs standard AdamW |
| `data` | `packing` | `true` | `ConstantLengthDataset`; maximises GPU utilisation |
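The `lora_alpha` note is worth making concrete: standard LoRA scales the adapter update by `alpha / r`, while rank-stabilised LoRA uses `alpha / sqrt(r)`, which keeps the effective scale from collapsing as `r` grows. With the defaults above:

```python
import math

def lora_scale(r: int, alpha: int, rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update BA before adding it to W."""
    return alpha / math.sqrt(r) if rslora else alpha / r

print(lora_scale(64, 128))               # 2.0
print(lora_scale(64, 128, rslora=True))  # 16.0
```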
```bash
# Perplexity on the validation split
python scripts/evaluate.py \
    --adapter outputs/mistral-7b-qlora/final_adapter \
    --config configs/finetune.yaml \
    --mode perplexity

# ROUGE-1/2/L on a held-out test file
python scripts/evaluate.py \
    --adapter outputs/mistral-7b-qlora/final_adapter \
    --config configs/finetune.yaml \
    --mode rouge \
    --test-file data/test.jsonl \
    --output outputs/rouge_results.json

# LLM-as-a-judge (requires OPENAI_API_KEY)
python scripts/evaluate.py \
    --adapter outputs/mistral-7b-qlora/final_adapter \
    --config configs/finetune.yaml \
    --mode judge \
    --test-file data/test.jsonl \
    --judge-model gpt-4o-mini \
    --output outputs/judge_results.json
```
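For reference, perplexity mode reduces to exponentiating the mean per-token negative log-likelihood over the validation split. A sketch of the arithmetic (not `evaluate.py` itself, which presumably derives the NLLs from the model's cross-entropy loss):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 1/e has NLL 1.0 per token,
# so its perplexity is e.
print(perplexity([1.0, 1.0, 1.0]))  # 2.718281828459045
```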
```bash
# Interactive REPL with the merged model
python scripts/inference.py \
    --model outputs/mistral-7b-merged \
    --mode interactive

# Interactive REPL with base model + adapter (no merge required)
python scripts/inference.py \
    --base-model mistralai/Mistral-7B-Instruct-v0.2 \
    --adapter outputs/mistral-7b-qlora/final_adapter \
    --mode interactive \
    --template mistral

# Batch inference
python scripts/inference.py \
    --model outputs/mistral-7b-merged \
    --mode batch \
    --input-file data/test.jsonl \
    --output-file outputs/predictions.jsonl \
    --greedy
```
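The `--template mistral` flag selects Mistral's instruct format, which wraps the user turn in `[INST] … [/INST]`. A sketch of that formatting — the `mistral_prompt` helper is illustrative, and `inference.py` may instead rely on the tokenizer's built-in chat template:

```python
def mistral_prompt(instruction: str, system: str = "") -> str:
    """Wrap a single-turn prompt in Mistral's instruct format."""
    body = f"{system}\n\n{instruction}".strip()  # Mistral has no separate system slot
    return f"<s>[INST] {body} [/INST]"

print(mistral_prompt("Explain quantum entanglement"))
# <s>[INST] Explain quantum entanglement [/INST]
```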
```
LuminaLLM/
├── configs/
│   └── finetune.yaml        # All hyperparameters (single source of truth)
├── scripts/
│   ├── train.py             # QLoRA SFT training entry point
│   ├── evaluate.py          # Perplexity / ROUGE / LLM-as-a-judge
│   ├── inference.py         # Interactive and batch inference
│   └── prepare_data.py      # Raw → JSONL conversion
├── src/
│   ├── data/
│   │   ├── __init__.py
│   │   └── dataset.py       # DatasetPipeline, prompt formatters
│   └── model/
│       ├── __init__.py
│       ├── builder.py       # build_bnb_config, build_model_and_tokenizer, build_peft_model
│       └── utils.py         # Parameter counting, GPU telemetry, merge_and_save
├── notebooks/
│   └── 01_eda.ipynb         # Exploratory data analysis
├── outputs/                 # Checkpoints, adapters, merged models (gitignored)
├── data/                    # Local datasets (gitignored)
├── .env.example
├── .gitignore
└── requirements.txt
```
Set `training.report_to: wandb` in the config and provide `WANDB_API_KEY` in `.env`.

Logged metrics:

- `train/loss`, `train/grad_norm`, `train/learning_rate`
- `eval/loss`, `eval/perplexity` (computed post-eval)
- GPU VRAM usage via custom callback

For TensorBoard, set `training.report_to: tensorboard` in the config, then launch:

```bash
tensorboard --logdir outputs/
```

| Technique | VRAM saving | Config key |
|---|---|---|
| 4-bit NF4 quantization | ~75 % of model weights | `quantization.load_in_4bit` |
| Double quantization | ~0.4 GB extra | `quantization.bnb_4bit_use_double_quant` |
| Gradient checkpointing | ~30–40 % of activations | `training.gradient_checkpointing` |
| Paged AdamW 8-bit | ~2 GB of optimizer states | `training.optim` |
| Flash Attention 2 | ~20 % of activations | `model.attn_implementation` |
| Sequence packing | Maximises batch utilisation | `data.packing` |
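The ~75 % figure for 4-bit quantization follows directly from bytes-per-parameter arithmetic. A back-of-envelope sketch (illustrative only; real usage adds activations, optimizer state, KV cache, and CUDA overhead):

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory occupied by the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

PARAMS_7B = 7.24e9  # approximate parameter count of Mistral-7B

fp16 = weight_gib(PARAMS_7B, 16)  # ≈ 13.5 GiB
nf4 = weight_gib(PARAMS_7B, 4)    # ≈ 3.4 GiB, i.e. a 75 % reduction
```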
```bibtex
@misc{luminallm2024,
  author = {Silva, Adam},
  title  = {LuminaLLM: QLoRA Fine-Tuning Pipeline},
  year   = {2024},
  url    = {https://github.com/your-org/LuminaLLM}
}
```