A modular framework for systematically comparing large language models across diverse task types — lecture summarization, business decision analysis, and document retrieval ranking — measuring quality, cost, and latency to help teams choose the right model for each workload.
Five steps to run your first evaluation and open the dashboard:
# 1. Install
git clone <repo-url> && cd llm-evaluation-framework
pip install -e .
# 2. Set API keys
cp .env.example .env
# Edit .env and add your DEEPSEEK_API_KEY and/or OPENAI_API_KEY
# 3. Run evaluation against a pre-curated example
python -m src.cli run \
  --config examples/lecture_summarization/config.yaml \
  --test-cases examples/lecture_summarization/test_cases.json
# 4. Print summary tables
python -m src.cli results --output-folder examples/lecture_summarization/results/
# 5. Open the interactive dashboard
python -m src.cli dashboard \
  --output-folder examples/lecture_summarization/results/ \
  --output dashboard.html

No API keys? Pre-computed results for all three examples are already included. Open any `examples/*/results/dashboard.html` directly in a browser — no server needed.
| Feature | Description |
|---|---|
| Multi-provider inference | DeepSeek, OpenAI — extensible to Anthropic, Together, any OpenAI-compatible endpoint |
| Async parallel execution | Per-provider concurrency limits, retry logic, rate-limit backoff |
| 8 evaluation metrics | 4 accuracy · 2 safety · 2 quality (see Metrics Explained) |
| Pareto frontier analysis | Identify models that can't be beaten on both quality and cost/latency simultaneously |
| Failure taxonomy | Auto-classify failures into 4 types; surface edge cases where all models fail |
| Interactive HTML dashboard | Self-contained ~55 KB file, 6 Plotly charts, no server required |
| Web upload UI | frontend/index.html — drag-and-drop JSON results, live filtering, CSV/JSON export |
| CLI interface | llm-eval run / results / analyze / dashboard |
| 3 worked examples | Lecture summarization · Decision analysis · Retrieval ranking |
Requirements: Python 3.10+
# Install in editable mode (recommended)
pip install -e .
# Or install dependencies only
pip install -r requirements.txt

Environment variables — copy `.env.example` to `.env` and fill in your keys:
DEEPSEEK_API_KEY=your_deepseek_key # required for DeepSeek models
OPENAI_API_KEY=your_openai_key # required for GPT-4o, GPT-4o-mini, etc.
ANTHROPIC_API_KEY=your_key # optional
TOGETHER_API_KEY=your_key       # optional

| Variable | Description | Default |
|---|---|---|
| `DEEPSEEK_API_KEY` | DeepSeek API key | — |
| `OPENAI_API_KEY` | OpenAI API key | — |
| `ANTHROPIC_API_KEY` | Anthropic API key | — |
| `TOGETHER_API_KEY` | Together AI API key | — |
| `EMBEDDING_MODEL` | Embedding backend for semantic similarity | `sentence-transformers` |
| `OUTPUT_FOLDER` | Default directory for result files | `./data/results/` |
| `DEBUG` | Enable verbose debug logging | `false` |
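The key-resolution step can be sketched as follows; `_PROVIDER_KEY_MAP` and `resolve_api_key` here are illustrative stand-ins for the logic in `src/config_loader.py`, not its exact code:

```python
import os

# Hypothetical mirror of the provider -> env-var mapping in the table above.
_PROVIDER_KEY_MAP = {
    "deepseek": "DEEPSEEK_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "together": "TOGETHER_API_KEY",
}

def resolve_api_key(provider: str) -> str:
    """Look up the API key for a provider, failing fast with a clear error."""
    env_var = _PROVIDER_KEY_MAP.get(provider)
    if env_var is None:
        raise ValueError(f"Unknown provider: {provider!r}")
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} in your .env file to use {provider}")
    return key
```

Failing at config-load time, rather than mid-run, means a missing key is reported before any paid API calls are made.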
python -m src.cli run \
  --config examples/lecture_summarization/config.yaml \
  --test-cases examples/lecture_summarization/test_cases.json \
  [--output-folder PATH]        # overrides config output_folder

Calls every model in the config against every test case in parallel, computes all 8 metrics, and saves two files to the output folder:
- `inference_results.json` — raw outputs, latency, token counts, cost
- `metrics_results.json` — all 8 scores per (test case, model) pair
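Both files are plain JSON, so they are easy to post-process outside the framework. A minimal sketch of averaging one metric per model (the exact record schema is an assumption here — `model_id` and per-metric score fields may be named differently):

```python
import json
from collections import defaultdict

def average_metric(path: str, metric: str) -> dict[str, float]:
    """Average one metric per model from a metrics_results.json-style file.

    Assumes a list of records, each carrying a 'model_id' and a numeric
    score under `metric`; adjust field names to the real schema.
    """
    with open(path) as f:
        records = json.load(f)
    scores: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        scores[rec["model_id"]].append(rec[metric])
    return {model: sum(v) / len(v) for model, v in scores.items()}
```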
python -m src.cli results --output-folder examples/lecture_summarization/results/

Prints an Inference Summary (latency, cost, error rate) and a Metrics Summary (all 8 metrics averaged per model) as ASCII tables.
python -m src.cli analyze --output-folder examples/lecture_summarization/results/

Prints ranked metric comparisons with inline bar charts, Pareto frontier status, model highlights, failure type breakdown, and plain-English recommendations.
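The Pareto check itself is simple: a model is on the frontier if no other model is at least as good on both quality and cost and strictly better on one. A minimal illustration (not the framework's actual `analyzer.py` code):

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> set[str]:
    """Return the models not dominated on (quality, cost).

    `models` maps name -> (quality, cost); higher quality is better,
    lower cost is better. A model is dominated if some other model is
    at least as good on both axes and strictly better on one.
    """
    frontier = set()
    for name, (q, c) in models.items():
        dominated = any(
            (q2 >= q and c2 <= c) and (q2 > q or c2 < c)
            for other, (q2, c2) in models.items()
            if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier
```

With the lecture-summarization numbers below, gpt-4o (0.845 similarity, $0.027) is dominated by deepseek-chat (0.858, $0.0009), so only the latter sits on the quality-cost frontier.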
python -m src.cli dashboard \
  --output-folder examples/lecture_summarization/results/ \
  --output my_dashboard.html \
  [--no-browser]

Generates a self-contained HTML file with 6 interactive Plotly charts (metric rankings, radar, Pareto scatter, failure distribution, performance table). Opens in the default browser automatically unless --no-browser is set.
Open frontend/index.html in any modern browser — no server, no build step.
Open frontend/index.html → drag-and-drop your result files → explore
- Upload `metrics_results.json` (required) and `inference_results.json` (optional, for cost/latency charts)
- Click Load demo data to explore immediately without running an evaluation
- Filter by model and difficulty level — all 6 charts update live
- Click failure rows to expand the input, expected output, and model output side by side
- Export the filtered view as CSV or JSON
Three fully pre-computed examples are included. Each contains config.yaml, test_cases.json, a populated results/ folder, and analysis_summary.md with real numbers.
Path: examples/lecture_summarization/
Task: Summarize university lecture excerpts (CS algorithms, ML, statistics)
Models: deepseek-chat vs gpt-4o · 15 cases
| Metric | deepseek-chat | gpt-4o |
|---|---|---|
| Semantic similarity | 0.858 | 0.845 |
| ROUGE-1 | 0.552 | 0.511 |
| Hallucination rate | 0% | 0% |
| Avg latency | 4,628 ms | 2,330 ms |
| Total cost (15 cases) | $0.0009 | $0.027 |
Key finding: DeepSeek is marginally more accurate and 30× cheaper; GPT-4o is 2× faster. Both produce zero hallucinations on factual academic content. Hard cases (transformers, RL, CLT) did not degrade either model.
→ Full analysis · Open dashboard
Path: examples/decision_analysis/
Task: Recommend a course of action given structured business scenarios (fundraising, M&A, cloud strategy, PE LBOs)
Models: deepseek-chat vs gpt-4o · 15 cases
| Metric | deepseek-chat | gpt-4o |
|---|---|---|
| Semantic similarity | 0.753 | 0.699 |
| Completeness score | 0.807 | 0.647 |
| Hallucination rate | 27% | 33% |
| Avg latency | 10,065 ms | 3,979 ms |
| Total cost (15 cases) | $0.0019 | $0.049 |
Key finding: Decision analysis is ~10 points harder than summarization for both models. DeepSeek consistently addressed quantitative trade-offs; GPT-4o gave higher-level strategic framing. Both models' "hallucinations" are appropriate domain knowledge — the evaluator flags general business facts not present in the case text. Four edge cases (SaaS pricing math, PE debt coverage, M&A earnout mechanics) broke both models.
→ Full analysis · Open dashboard
Path: examples/retrieval_ranking/
Task: Rank 5–8 documents by relevance to a query; output a JSON array of document IDs
Models: deepseek-chat vs gpt-4o · 15 cases across 9 domains
| Metric | deepseek-chat | gpt-4o |
|---|---|---|
| Kendall τ (true ranking quality) | 0.840 | 0.825 |
| Semantic similarity* | 0.999 | 0.688 |
| Avg latency | 2,436 ms | 1,342 ms |
| Total cost (15 cases) | $0.0010 | $0.021 |
* Misleading — GPT-4o wraps output in markdown code fences; both models rank equally well when parsed correctly.
Key finding: Both models are strong rankers (τ ≈ 0.83), far above keyword-based baselines. Standard metrics (ROUGE, semantic similarity) are unreliable for structured-output tasks — use Kendall τ or NDCG instead. DeepSeek is better on hard science/finance cases; GPT-4o is better on software and cross-domain medium cases. At 21× lower cost with comparable quality, DeepSeek is the clear choice for ranking pipelines.
→ Full analysis · Open dashboard
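The Kendall τ numbers above compare a predicted document ordering against the reference ordering by counting concordant and discordant item pairs. A minimal tie-free sketch (`scipy.stats.kendalltau` is the usual library route):

```python
def kendall_tau(pred: list[str], truth: list[str]) -> float:
    """Kendall's tau between two rankings of the same items (no ties).

    1.0 = identical order, -1.0 = fully reversed.
    """
    pos = {doc: i for i, doc in enumerate(truth)}  # true position of each doc
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # pred places pred[i] before pred[j]; check the true order agrees
            if pos[pred[i]] < pos[pred[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Because τ is computed on the parsed ID list, it is immune to the markdown-fence formatting issue that distorts semantic similarity in the table above.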
┌─────────────────────────────────────────────────────────────────┐
│                  config.yaml + test_cases.json                  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │    InferenceRunner      │  async HTTP, concurrency
              │  (inference_runner.py)  │  limits, retry + backoff
              └──────────┬──────────────┘
                         │
                         ├──► DeepSeek    /api/chat/completions
                         ├──► OpenAI      /v1/chat/completions
                         └──► Anthropic   /v1/messages
                         │
                         │  InferenceResult per (test_case, model)
                         ▼
              ┌─────────────────────────┐
              │    MetricsEvaluator     │  sentence-transformers
              │     (evaluator.py)      │  rouge-score
              └──────────┬──────────────┘  DeepSeek LLM-as-judge
                         │
                         │  MetricsResult (8 scores) per (test_case, model)
                         ▼
              ┌─────────────────────────┐
              │   ComparativeAnalyzer   │  Pareto frontier
              │      (analyzer.py)      │  rankings · failures
              └──────────┬──────────────┘
                         │  AnalysisReport
              ┌──────────┴────────────────────┐
              │                               │
              ▼                               ▼
    ┌─────────────────────┐       ┌────────────────────────┐
    │ DashboardGenerator  │       │     CLI / llm-eval     │
    │ (dashboard_gen*.py) │       │        (cli.py)        │
    └──────────┬──────────┘       └────────────────────────┘
               │  dashboard.html        terminal output
               ▼
    ┌─────────────────────┐
    │ frontend/index.html │  drag-and-drop · live filters
    │ (Vanilla JS+Plotly) │  failure drill-down · CSV export
    └─────────────────────┘
Key source files:
| File | Responsibility |
|---|---|
| `src/config_loader.py` | Pydantic-validated YAML loading; API key resolution from env |
| `src/inference_runner.py` | Async multi-provider inference; cost table; per-model semaphores |
| `src/evaluator.py` | Embedding cache; ROUGE; 4 LLM-as-judge metrics via DeepSeek |
| `src/analyzer.py` | Pareto frontier; metric rankings; failure classification (4 types) |
| `src/dashboard_generator.py` | 6 Plotly figures → single self-contained HTML |
| `src/cli.py` | Click CLI entry point (`llm-eval`) |
| `src/models/results.py` | `InferenceResult`, `MetricsResult` dataclasses |
| `src/models/analysis.py` | `AnalysisReport`, `FailureReport`, `ParetoFrontier` |
evaluation:
  name: "My Evaluation"
  description: "Optional description"
  task_type: "summarization"    # qa | summarization | classification | generation
                                # reasoning | code | ranking | decision_analysis | custom

models:
  - id: "deepseek"              # unique ID used in all result files
    provider: "deepseek"        # openai | deepseek | anthropic | together
    model_name: "deepseek-chat"
    temperature: 0.3            # 0.0 – 2.0
    max_tokens: 512
    # base_url: "https://..."   # override default API endpoint
    # extra_params:             # any additional fields forwarded to the API
    #   top_p: 0.95
  - id: "gpt-4o"
    provider: "openai"
    model_name: "gpt-4o"
    temperature: 0.3
    max_tokens: 512

metrics:                        # include any subset of the 8 available
  - semantic_similarity
  - rouge1
  - rouge2
  - rougeL
  - hallucination_detected
  - toxicity_score
  - relevance_score
  - completeness_score

test_generation:
  strategy: manual              # manual | llm_generated | mixed
  num_cases: 15
  difficulty_distribution:
    easy: 0.33
    medium: 0.34
    hard: 0.33

output_folder: "examples/my_eval/results/"

Test cases JSON format:
[
  {
    "id": "tc_001",
    "input": "Summarize the following lecture: ...",
    "expected_output": "Ground truth / reference answer",
    "difficulty": "easy",
    "context": null,
    "tags": ["cs", "algorithms"],
    "source": "manual"
  }
]

| Metric | How it works | Best suited for |
|---|---|---|
| `semantic_similarity` | Cosine similarity between all-MiniLM-L6-v2 embeddings of output and reference; cached per text. | Meaning-level comparison; robust to paraphrase. Primary metric for summarization and QA. |
| `rouge1` | Unigram F1 overlap between output and reference. | Surface coverage — whether key terms appear. |
| `rouge2` | Bigram F1 overlap. | Phrase-level faithfulness; penalises word-order errors. |
| `rougeL` | Longest common subsequence F1. | Sentence-level fluency and coherence. |
Warning for structured-output tasks: ROUGE and semantic similarity are misleading when the output format differs from the reference (e.g., one ranking model returns `["D1","D2"]` while another returns the same array wrapped in a markdown code fence). Use task-specific metrics like Kendall τ for ranking or exact match for classification.
| Metric | How it works | Interpretation |
|---|---|---|
| `hallucination_detected` | DeepSeek is asked: "Does this output contain information NOT in the source? YES or NO" | True = possible unsupported claim. Meaningful for summarization/QA; not appropriate for open-domain tasks where models apply general knowledge. |
| `toxicity_score` | DeepSeek is asked to rate toxicity 0–10; normalised to 0–1. | 0 = safe · 1 = toxic. Use for user-facing content. |
| Metric | How it works | Best suited for |
|---|---|---|
| `relevance_score` | DeepSeek is asked to rate how relevant the output is to the input, 0–10. | Whether the model addressed the actual question. Primary for decision analysis and QA. |
| `completeness_score` | DeepSeek is asked to rate coverage of key points vs the reference, 0–10. | Whether the answer is thorough. Particularly informative for open-ended tasks where lexical overlap is low. |
All four LLM-as-judge calls run concurrently per evaluation and use DeepSeek (the cheapest option) to keep evaluation costs low.
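That concurrency amounts to an `asyncio.gather` over the four judge prompts; in this sketch `judge()` is a stub standing in for the real DeepSeek HTTP call:

```python
import asyncio

async def judge(prompt: str) -> float:
    """Stand-in for one LLM-as-judge call (the real code hits DeepSeek)."""
    await asyncio.sleep(0)  # placeholder for the HTTP round-trip
    return 0.9              # placeholder score

async def llm_metrics(output: str) -> dict[str, float]:
    """Run the four judge prompts concurrently rather than sequentially."""
    names = ["hallucination", "toxicity", "relevance", "completeness"]
    scores = await asyncio.gather(*(judge(f"{n}: {output}") for n in names))
    return dict(zip(names, scores))
```

Gathering the four calls means the per-test-case judge latency is roughly one round-trip instead of four.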
- Create `examples/my_domain/config.yaml` and `test_cases.json`
- If your task needs a custom system prompt, add it to `_SYSTEM_PROMPTS` in `src/inference_runner.py`: `_SYSTEM_PROMPTS["my_task"] = ("You are a specialist in ... Provide ...")`
- Add `"my_task"` to the `TaskType` literal in `src/config_loader.py`
- Run `python -m src.cli run --config ... --test-cases ...`
- Add the field to `MetricsResult` in `src/models/results.py`: `my_metric: float = 0.0`
- Implement the scorer in `src/evaluator.py`: `async def _my_metric_check(self, session, output: str) -> float:` — build `prompt = f"Rate X on a scale of 0–10.\n\nTEXT:\n{_truncate(output)}"` and `return _parse_score(await self._call_deepseek(session, prompt, max_tokens=4))`
- Add a task to `_llm_metrics` in `MetricsEvaluator` so it runs concurrently with the others
- Add it to `MetricsResult(...)` in `_evaluate_async`
- Add it to the table definition in `src/dashboard_generator.py` and `src/cli.py`
OpenAI-compatible API (most providers): add a model entry in `config.yaml` with `provider: "openai"` and a custom `base_url`, then map your API-key environment variable in `_PROVIDER_KEY_MAP` in `src/config_loader.py`:

_PROVIDER_KEY_MAP["myprovider"] = "MY_PROVIDER_API_KEY"

Custom wire format: implement `_call_myprovider()` in `src/inference_runner.py` following the pattern of `_call_anthropic()`, and add a routing case in `_dispatch()`:
async def _dispatch(session, model, messages):
    if model.provider == "myprovider":
        return await _call_myprovider(session, model, messages)
    if model.provider == "anthropic":
        return await _call_anthropic(session, model, messages)
    return await _call_openai_compatible(session, model, messages)

License: MIT