Goal: Benchmark, fine-tune, and quantize Gemma-3-12B for EN→RO translation of fables within a $350 budget.
- End-to-end pipeline: data prep → translation → scoring → reports
- Cost modeling (`lib/estimate`) with budget guards
- Human-centric metrics (Accuracy, Fluency, Coherence, Style, Cultural/Pragmatic)
- Model zoo ready: proprietary (GPT, Gemini, DeepL) & open-source (Gemma, Llama, etc.)
- Fine-tuning + Quantization: LoRA recipes, GGUF exports, and W8A8 (LLM Compressor)
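To give a rough sense of why LoRA keeps fine-tuning cheap, the sketch below counts trainable adapter parameters for a single weight matrix. The rank and matrix dimensions are illustrative assumptions, not the project's actual recipe.

```python
# Hypothetical sketch: trainable-parameter count for a LoRA adapter on one
# weight matrix. Rank and dimensions below are illustrative assumptions.

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA freezes the (d_out x d_in) weight and trains two low-rank
    factors instead: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

# Example: a 3840x3840 projection adapted at rank 16 (assumed values)
full = 3840 * 3840                               # params if fully trained
lora = lora_trainable_params(3840, 3840, 16)     # params LoRA actually trains
print(lora, f"{lora / full:.2%}")                # → 122880 0.83%
```

At rank 16 the adapter trains well under 1% of the matrix's parameters, which is what makes single-node fine-tuning of a 12B model feasible on this budget.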
- Cost Analysis 📊 — Done
- Dataset Creation 🏗️ — Done
- Benchmarking 🔍 — Done
- Fine-tuning 🎯 — Done (baseline)
- Evaluation & Reports 🚀 — Done
| Model | Accuracy | Fluency | Coherence | Style | Cultural/Pragmatic | Average Score | Count | Avg Input Tokens | Avg Output Tokens | Avg Inference Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| o3-2025-04-16 | 4.86 | 4.92 | 4.89 | 4.96 | 4.97 | 4.92 | 100 | 181.3 | 342.7 | 20.37 |
| gpt-4.1-2025-04-14 | 4.86 | 4.89 | 4.85 | 4.92 | 4.94 | 4.89 | 100 | 181.3 | 342.7 | 20.37 |
| gemini-2.5-flash-preview-05-20 | 4.75 | 4.86 | 4.82 | 4.87 | 4.89 | 4.84 | 100 | 181.3 | 342.7 | 20.37 |
| tf2-12b | 4.72 | 4.88 | 4.84 | 4.87 | 4.85 | 4.83 | 100 | 0.0 | 0.0 | 0.00 |
| o3-mini-2025-01-31 | 4.71 | 4.78 | 4.87 | 4.85 | 4.92 | 4.83 | 100 | 181.3 | 342.7 | 20.37 |
| gemini-2.0-flash-001 | 4.66 | 4.82 | 4.78 | 4.89 | 4.93 | 4.82 | 100 | 181.3 | 342.7 | 20.37 |
| deepseek-r1 | 4.72 | 4.76 | 4.87 | 4.85 | 4.89 | 4.82 | 98 | 183.2 | 346.0 | 20.59 |
| tf2-12b-w8a8 | 4.70 | 4.86 | 4.85 | 4.86 | 4.83 | 4.82 | 100 | 0.0 | 0.0 | 0.00 |
| grok-3-mini-beta | 4.73 | 4.74 | 4.77 | 4.82 | 4.88 | 4.79 | 100 | 181.3 | 342.7 | 20.37 |
| gpt-4.1-mini-2025-04-14 | 4.54 | 4.71 | 4.72 | 4.84 | 4.83 | 4.73 | 98 | 181.3 | 342.2 | 20.35 |
| deepl | 4.42 | 4.73 | 4.38 | 4.69 | 4.74 | 4.59 | 100 | 181.3 | 342.7 | 20.37 |
| gemini-flash-1.5-8b | 4.14 | 4.45 | 4.67 | 4.52 | 4.46 | 4.45 | 99 | 181.3 | 342.6 | 20.40 |
| gemma-3-12b-it | 3.98 | 4.56 | 4.65 | 4.52 | 4.43 | 4.43 | 100 | 0.0 | 0.0 | 0.00 |
| EuroLLM-9B-Instruct | 3.84 | 4.27 | 4.36 | 4.27 | 4.22 | 4.19 | 98 | 0.0 | 0.0 | 0.00 |
| qwen3-14b | 2.63 | 3.13 | 3.40 | 3.02 | 2.84 | 3.00 | 99 | 183.1 | 346.2 | 20.58 |
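The "Average Score" column is the unweighted mean of the five human-centric metrics. A minimal sketch of that aggregation, using the o3 row's values from the table above:

```python
# Sketch: the Average Score column as the plain mean of the five metrics.
def average_score(scores: dict) -> float:
    return round(sum(scores.values()) / len(scores), 2)

o3_row = {  # values taken from the o3-2025-04-16 row of the leaderboard
    "accuracy": 4.86, "fluency": 4.92, "coherence": 4.89,
    "style": 4.96, "cultural_pragmatic": 4.97,
}
print(average_score(o3_row))  # → 4.92, matching the table
```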
- Source: 3M English fables from https://huggingface.co/datasets/klusai/ds-tf1-en-3m
- Ground truth: GPT-o3 EN→RO translations
- Budget target: $300
- Estimator: `lib/estimate` with per-model pricing + token usage priors
- Example infra note: sfcompute, 8× H100, ~$1.35 per GPU·h (August 2025) → effective cluster cost ≈ $10.8/h
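A minimal sketch of the per-model cost projection with a budget guard that `lib/estimate` performs. The function names and per-million-token prices here are illustrative assumptions, not the library's actual API.

```python
# Hypothetical budget-guard sketch; per-1M-token prices are made up.

def estimate_run_cost(n_samples: int, avg_in: float, avg_out: float,
                      price_in_per_m: float, price_out_per_m: float) -> float:
    """Projected USD cost for translating n_samples documents,
    given average input/output token counts and per-1M-token prices."""
    return n_samples * (avg_in * price_in_per_m + avg_out * price_out_per_m) / 1e6

def check_budget(cost: float, budget: float = 300.0) -> None:
    """Abort before launching a run that would blow the budget."""
    if cost > budget:
        raise RuntimeError(f"Projected ${cost:.2f} exceeds ${budget:.2f} budget")

# 100 fables at the token priors from the leaderboard; prices are assumptions
cost = estimate_run_cost(100, 181.3, 342.7, price_in_per_m=2.0, price_out_per_m=8.0)
check_budget(cost)          # passes: well under budget
print(f"${cost:.4f}")
```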
- Models
  - `tf2-12b-w8a8` (W8A8 via LLM Compressor): huggingface.co/klusai/tf2-12b-w8a8
  - `tf2-12b-gguf` (llama.cpp GGUF build): huggingface.co/klusai/tf2-12b-gguf
- Datasets
  - `ds-tf2-en-ro-3m` (3M EN↔RO fable pairs): huggingface.co/datasets/klusai/ds-tf2-en-ro-3m
  - `ds-tf2-en-ro-15k` (15k EN→RO ground-truth pairs for LoRA): huggingface.co/datasets/klusai/ds-tf2-en-ro-15k
Deploy options: llama.cpp (GGUF), vLLM (W8A8), or standard 🤗 Transformers.
- LLM-evaluator metrics within 98% of the o3 baseline on the test set
- Total cost under $300
- Inference cost < $0.001 / translation
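As a sanity check on the last target, per-translation cost can be backed out from the average token counts in the leaderboard. The per-1M-token prices below are illustrative assumptions (rough self-hosted-equivalent rates), not measured figures.

```python
# Sketch: per-translation cost from the leaderboard's average token counts.
avg_in, avg_out = 181.3, 342.7       # token priors from the table above
price_in, price_out = 0.10, 0.40     # assumed USD per 1M tokens (illustrative)

cost_per_translation = (avg_in * price_in + avg_out * price_out) / 1e6
print(f"${cost_per_translation:.6f}")
assert cost_per_translation < 0.001  # meets the < $0.001/translation target
```

Even with prices several times higher than these assumptions, the per-translation cost stays an order of magnitude under the $0.001 target.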