A modular framework for systematically comparing large language models across diverse task types — lecture summarization, business decision analysis, and document retrieval ranking — measuring quality, cost, and latency to help teams choose the right model for each workload.
Five steps to run your first evaluation and open the dashboard:
# 1. Install
git clone <repo-url> && cd llm-evaluation-framework
pip install -e .
# 2. Set API keys
cp .env.example .env
# Edit .env and add your DEEPSEEK_API_KEY and/or OPENAI_API_KEY
# 3. Run evaluation against a pre-curated example
python -m src.cli run \
  --config examples/lecture_summarization/config.yaml \
  --test-cases examples/lecture_summarization/test_cases.json
# 4. Print summary tables
python -m src.cli results --output-folder examples/lecture_summarization/results/
# 5. Open the interactive dashboard
python -m src.cli dashboard \
  --output-folder examples/lecture_summarization/results/ \
  --output dashboard.html

No API keys? Pre-computed results for all three examples are already included. Open any `examples/*/results/dashboard.html` directly in a browser — no server needed.
| Feature | Description |
|---|---|
| Multi-provider inference | DeepSeek, OpenAI — extensible to Anthropic, Together, any OpenAI-compatible endpoint |
| Async parallel execution | Per-provider concurrency limits, retry logic, rate-limit backoff |
| 8 evaluation metrics | 4 accuracy · 2 safety · 2 quality (see Metrics Explained) |
| Pareto frontier analysis | Identify models that can't be beaten on both quality and cost/latency simultaneously |
| Failure taxonomy | Auto-classify failures into 4 types; surface edge cases where all models fail |
| Interactive HTML dashboard | Self-contained ~55 KB file, 6 Plotly charts, no server required |
| Web upload UI | frontend/index.html — drag-and-drop JSON results, live filtering, CSV/JSON export |
| CLI interface | llm-eval run / results / analyze / dashboard |
| 3 worked examples | Lecture summarization · Decision analysis · Retrieval ranking |
Requirements: Python 3.10+
# Install in editable mode (recommended)
pip install -e .
# Or install dependencies only
pip install -r requirements.txt

Environment variables — copy `.env.example` to `.env` and fill in your keys:
DEEPSEEK_API_KEY=your_deepseek_key # required for DeepSeek models
OPENAI_API_KEY=your_openai_key # required for GPT-4o, GPT-4o-mini, etc.
ANTHROPIC_API_KEY=your_key # optional
TOGETHER_API_KEY=your_key       # optional

| Variable | Description | Default |
|---|---|---|
| `DEEPSEEK_API_KEY` | DeepSeek API key | — |
| `OPENAI_API_KEY` | OpenAI API key | — |
| `ANTHROPIC_API_KEY` | Anthropic API key | — |
| `TOGETHER_API_KEY` | Together AI API key | — |
| `EMBEDDING_MODEL` | Embedding backend for semantic similarity | `sentence-transformers` |
| `OUTPUT_FOLDER` | Default directory for result files | `./data/results/` |
| `DEBUG` | Enable verbose debug logging | `false` |
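The key-resolution step can be sketched as follows; `_PROVIDER_KEY_MAP` and `resolve_api_key` here are illustrative stand-ins for the logic in `src/config_loader.py`, not its exact code:

```python
import os

# Hypothetical mirror of the provider -> env-var mapping in the table above.
_PROVIDER_KEY_MAP = {
    "deepseek": "DEEPSEEK_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "together": "TOGETHER_API_KEY",
}

def resolve_api_key(provider: str) -> str:
    """Look up the API key for a provider, failing fast with a clear error."""
    env_var = _PROVIDER_KEY_MAP.get(provider)
    if env_var is None:
        raise ValueError(f"Unknown provider: {provider!r}")
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} in your .env file to use {provider}")
    return key
```

Failing at config-load time, rather than mid-run, means a missing key is reported before any paid API calls are made.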
python -m src.cli run \
  --config examples/lecture_summarization/config.yaml \
  --test-cases examples/lecture_summarization/test_cases.json \
  [--output-folder PATH]        # overrides config output_folder

Calls every model in the config against every test case in parallel, computes all 8 metrics, and saves two files to the output folder:
- `inference_results.json` — raw outputs, latency, token counts, cost
- `metrics_results.json` — all 8 scores per (test case, model) pair
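Both files are plain JSON, so they are easy to post-process outside the framework. A minimal sketch of averaging one metric per model (the exact record schema is an assumption here — `model_id` and per-metric score fields may be named differently):

```python
import json
from collections import defaultdict

def average_metric(path: str, metric: str) -> dict[str, float]:
    """Average one metric per model from a metrics_results.json-style file.

    Assumes a list of records, each carrying a 'model_id' and a numeric
    score under `metric`; adjust field names to the real schema.
    """
    with open(path) as f:
        records = json.load(f)
    scores: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        scores[rec["model_id"]].append(rec[metric])
    return {model: sum(v) / len(v) for model, v in scores.items()}
```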
python -m src.cli results --output-folder examples/lecture_summarization/results/

Prints an Inference Summary (latency, cost, error rate) and a Metrics Summary (all 8 metrics averaged per model) as ASCII tables.
python -m src.cli analyze --output-folder examples/lecture_summarization/results/

Prints ranked metric comparisons with inline bar charts, Pareto frontier status, model highlights, failure type breakdown, and plain-English recommendations.
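The Pareto check itself is simple: a model is on the frontier if no other model is at least as good on both quality and cost and strictly better on one. A minimal illustration (not the framework's actual `analyzer.py` code):

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> set[str]:
    """Return the models not dominated on (quality, cost).

    `models` maps name -> (quality, cost); higher quality is better,
    lower cost is better. A model is dominated if some other model is
    at least as good on both axes and strictly better on one.
    """
    frontier = set()
    for name, (q, c) in models.items():
        dominated = any(
            (q2 >= q and c2 <= c) and (q2 > q or c2 < c)
            for other, (q2, c2) in models.items()
            if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier
```

With the lecture-summarization numbers below, gpt-4o (0.845 similarity, $0.027) is dominated by deepseek-chat (0.858, $0.0009), so only the latter sits on the quality-cost frontier.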
python -m src.cli dashboard \
  --output-folder examples/lecture_summarization/results/ \
  --output my_dashboard.html \
  [--no-browser]

Generates a self-contained HTML file with 6 interactive Plotly charts (metric rankings, radar, Pareto scatter, failure distribution, performance table). Opens in the default browser automatically unless --no-browser is set.
Open frontend/index.html in any modern browser — no server, no build step.
Open frontend/index.html → drag-and-drop your result files → explore
- Upload `metrics_results.json` (required) and `inference_results.json` (optional, for cost/latency charts)
- Click Load demo data to explore immediately without running an evaluation
- Filter by model and difficulty level — all 6 charts update live
- Click failure rows to expand the input, expected output, and model output side by side
- Export the filtered view as CSV or JSON
Three fully pre-computed examples are included. Each contains config.yaml, test_cases.json, a populated results/ folder, and analysis_summary.md with real numbers.
Path: examples/lecture_summarization/
Task: Summarize university lecture excerpts (CS algorithms, ML, statistics)
Models: deepseek-chat vs gpt-4o · 15 cases
| Metric | deepseek-chat | gpt-4o |
|---|---|---|
| Semantic similarity | 0.858 | 0.845 |
| ROUGE-1 | 0.552 | 0.511 |
| Hallucination rate | 0% | 0% |
| Avg latency | 4,628 ms | 2,330 ms |
| Total cost (15 cases) | $0.0009 | $0.027 |
Key finding: DeepSeek is marginally more accurate and 30× cheaper; GPT-4o is 2× faster. Both produce zero hallucinations on factual academic content. Hard cases (transformers, RL, CLT) did not degrade either model.
→ Full analysis · Open dashboard
Path: examples/decision_analysis/
Task: Recommend a course of action given structured business scenarios (fundraising, M&A, cloud strategy, PE LBOs)
Models: deepseek-chat vs gpt-4o · 15 cases
| Metric | deepseek-chat | gpt-4o |
|---|---|---|
| Semantic similarity | 0.753 | 0.699 |
| Completeness score | 0.807 | 0.647 |
| Hallucination rate | 27% | 33% |
| Avg latency | 10,065 ms | 3,979 ms |
| Total cost (15 cases) | $0.0019 | $0.049 |
Key finding: Decision analysis is ~10 points harder than summarization for both models. DeepSeek consistently addressed quantitative trade-offs; GPT-4o gave higher-level strategic framing. Both models' "hallucinations" are appropriate domain knowledge — the evaluator flags general business facts not present in the case text. Four edge cases (SaaS pricing math, PE debt coverage, M&A earnout mechanics) broke both models.
→ Full analysis · Open dashboard
Path: examples/retrieval_ranking/
Task: Rank 5–8 documents by relevance to a query; output a JSON array of document IDs
Models: deepseek-chat vs gpt-4o · 15 cases across 9 domains
| Metric | deepseek-chat | gpt-4o |
|---|---|---|
| Kendall τ (true ranking quality) | 0.840 | 0.825 |
| Semantic similarity* | 0.999 | 0.688 |
| Avg latency | 2,436 ms | 1,342 ms |
| Total cost (15 cases) | $0.0010 | $0.021 |
* Misleading — GPT-4o wraps output in markdown code fences; both models rank equally well when parsed correctly.
Key finding: Both models are strong rankers (τ ≈ 0.83), far above keyword-based baselines. Standard metrics (ROUGE, semantic similarity) are unreliable for structured-output tasks — use Kendall τ or NDCG instead. DeepSeek is better on hard science/finance cases; GPT-4o is better on software and cross-domain medium cases. At 21× lower cost with comparable quality, DeepSeek is the clear choice for ranking pipelines.
→ Full analysis · Open dashboard
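The Kendall τ numbers above compare a predicted document ordering against the reference ordering by counting concordant and discordant item pairs. A minimal tie-free sketch (`scipy.stats.kendalltau` is the usual library route):

```python
def kendall_tau(pred: list[str], truth: list[str]) -> float:
    """Kendall's tau between two rankings of the same items (no ties).

    1.0 = identical order, -1.0 = fully reversed.
    """
    pos = {doc: i for i, doc in enumerate(truth)}  # true position of each doc
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # pred places pred[i] before pred[j]; check the true order agrees
            if pos[pred[i]] < pos[pred[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Because τ is computed on the parsed ID list, it is immune to the markdown-fence formatting issue that distorts semantic similarity in the table above.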
┌─────────────────────────────────────────────────────────────────┐
│                  config.yaml + test_cases.json                  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │    InferenceRunner      │  async HTTP, concurrency
              │  (inference_runner.py)  │  limits, retry + backoff
              └──────────┬──────────────┘
                         │
                         ├──► DeepSeek    /api/chat/completions
                         ├──► OpenAI      /v1/chat/completions
                         └──► Anthropic   /v1/messages
                         │
                         │  InferenceResult per (test_case, model)
                         ▼
              ┌─────────────────────────┐
              │    MetricsEvaluator     │  sentence-transformers
              │     (evaluator.py)      │  rouge-score
              └──────────┬──────────────┘  DeepSeek LLM-as-judge
                         │
                         │  MetricsResult (8 scores) per (test_case, model)
                         ▼
              ┌─────────────────────────┐
              │   ComparativeAnalyzer   │  Pareto frontier
              │      (analyzer.py)      │  rankings · failures
              └──────────┬──────────────┘
                         │  AnalysisReport
              ┌──────────┴────────────────────┐
              │                               │
              ▼                               ▼
    ┌─────────────────────┐       ┌────────────────────────┐
    │ DashboardGenerator  │       │     CLI / llm-eval     │
    │ (dashboard_gen*.py) │       │        (cli.py)        │
    └──────────┬──────────┘       └────────────────────────┘
               │  dashboard.html        terminal output
               ▼
    ┌─────────────────────┐
    │ frontend/index.html │  drag-and-drop · live filters
    │ (Vanilla JS+Plotly) │  failure drill-down · CSV export
    └─────────────────────┘
Key source files:
| File | Responsibility |
|---|---|
| `src/config_loader.py` | Pydantic-validated YAML loading; API key resolution from env |
| `src/inference_runner.py` | Async multi-provider inference; cost table; per-model semaphores |
| `src/evaluator.py` | Embedding cache; ROUGE; 4 LLM-as-judge metrics via DeepSeek |
| `src/analyzer.py` | Pareto frontier; metric rankings; failure classification (4 types) |
| `src/dashboard_generator.py` | 6 Plotly figures → single self-contained HTML |
| `src/cli.py` | Click CLI entry point (`llm-eval`) |
| `src/models/results.py` | `InferenceResult`, `MetricsResult` dataclasses |
| `src/models/analysis.py` | `AnalysisReport`, `FailureReport`, `ParetoFrontier` |
evaluation:
  name: "My Evaluation"
  description: "Optional description"
  task_type: "summarization"    # qa | summarization | classification | generation
                                # reasoning | code | ranking | decision_analysis | custom

models:
  - id: "deepseek"              # unique ID used in all result files
    provider: "deepseek"        # openai | deepseek | anthropic | together
    model_name: "deepseek-chat"
    temperature: 0.3            # 0.0 – 2.0
    max_tokens: 512
    # base_url: "https://..."   # override default API endpoint
    # extra_params:             # any additional fields forwarded to the API
    #   top_p: 0.95
  - id: "gpt-4o"
    provider: "openai"
    model_name: "gpt-4o"
    temperature: 0.3
    max_tokens: 512

metrics:                        # include any subset of the 8 available
  - semantic_similarity
  - rouge1
  - rouge2
  - rougeL
  - hallucination_detected
  - toxicity_score
  - relevance_score
  - completeness_score

test_generation:
  strategy: manual              # manual | llm_generated | mixed
  num_cases: 15
  difficulty_distribution:
    easy: 0.33
    medium: 0.34
    hard: 0.33

output_folder: "examples/my_eval/results/"

Test cases JSON format:
[
  {
    "id": "tc_001",
    "input": "Summarize the following lecture: ...",
    "expected_output": "Ground truth / reference answer",
    "difficulty": "easy",
    "context": null,
    "tags": ["cs", "algorithms"],
    "source": "manual"
  }
]

| Metric | How it works | Best suited for |
|---|---|---|
| `semantic_similarity` | Cosine similarity between all-MiniLM-L6-v2 embeddings of output and reference; cached per text. | Meaning-level comparison; robust to paraphrase. Primary metric for summarization and QA. |
| `rouge1` | Unigram F1 overlap between output and reference. | Surface coverage — whether key terms appear. |
| `rouge2` | Bigram F1 overlap. | Phrase-level faithfulness; penalises word-order errors. |
| `rougeL` | Longest common subsequence F1. | Sentence-level fluency and coherence. |
Warning for structured-output tasks: ROUGE and semantic similarity are misleading when the output format differs from the reference (e.g., one ranking model returns `["D1","D2"]` while another returns the same array wrapped in a markdown code fence). Use task-specific metrics like Kendall τ for ranking or exact match for classification.
| Metric | How it works | Interpretation |
|---|---|---|
| `hallucination_detected` | DeepSeek is asked: "Does this output contain information NOT in the source? YES or NO" | True = possible unsupported claim. Meaningful for summarization/QA; not appropriate for open-domain tasks where models apply general knowledge. |
| `toxicity_score` | DeepSeek is asked to rate toxicity 0–10; normalised to 0–1. | 0 = safe · 1 = toxic. Use for user-facing content. |
| Metric | How it works | Best suited for |
|---|---|---|
| `relevance_score` | DeepSeek is asked to rate how relevant the output is to the input, 0–10. | Whether the model addressed the actual question. Primary for decision analysis and QA. |
| `completeness_score` | DeepSeek is asked to rate coverage of key points vs the reference, 0–10. | Whether the answer is thorough. Particularly informative for open-ended tasks where lexical overlap is low. |
All four LLM-as-judge calls run concurrently per evaluation and use DeepSeek (the cheapest option) to keep evaluation costs low.
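That concurrency amounts to an `asyncio.gather` over the four judge prompts; in this sketch `judge()` is a stub standing in for the real DeepSeek HTTP call:

```python
import asyncio

async def judge(prompt: str) -> float:
    """Stand-in for one LLM-as-judge call (the real code hits DeepSeek)."""
    await asyncio.sleep(0)  # placeholder for the HTTP round-trip
    return 0.9              # placeholder score

async def llm_metrics(output: str) -> dict[str, float]:
    """Run the four judge prompts concurrently rather than sequentially."""
    names = ["hallucination", "toxicity", "relevance", "completeness"]
    scores = await asyncio.gather(*(judge(f"{n}: {output}") for n in names))
    return dict(zip(names, scores))
```

Gathering the four calls means the per-test-case judge latency is roughly one round-trip instead of four.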
- Create `examples/my_domain/config.yaml` and `test_cases.json`
- If your task needs a custom system prompt, add it to `_SYSTEM_PROMPTS` in `src/inference_runner.py`: `_SYSTEM_PROMPTS["my_task"] = ("You are a specialist in ... Provide ...")`
- Add `"my_task"` to the `TaskType` literal in `src/config_loader.py`
- Run `python -m src.cli run --config ... --test-cases ...`
- Add the field to `MetricsResult` in `src/models/results.py`: `my_metric: float = 0.0`
- Implement the scorer in `src/evaluator.py`: `async def _my_metric_check(self, session, output: str) -> float:` — build `prompt = f"Rate X on a scale of 0–10.\n\nTEXT:\n{_truncate(output)}"` and `return _parse_score(await self._call_deepseek(session, prompt, max_tokens=4))`
- Add a task to `_llm_metrics` in `MetricsEvaluator` so it runs concurrently with the others
- Add it to `MetricsResult(...)` in `_evaluate_async`
- Add it to the table definition in `src/dashboard_generator.py` and `src/cli.py`
OpenAI-compatible API (most providers): add a model entry in `config.yaml` with `provider: "openai"` and a custom `base_url`, then map your API-key environment variable in `_PROVIDER_KEY_MAP` in `src/config_loader.py`:

_PROVIDER_KEY_MAP["myprovider"] = "MY_PROVIDER_API_KEY"

Custom wire format: implement `_call_myprovider()` in `src/inference_runner.py` following the pattern of `_call_anthropic()`, and add a routing case in `_dispatch()`:
async def _dispatch(session, model, messages):
    if model.provider == "myprovider":
        return await _call_myprovider(session, model, messages)
    if model.provider == "anthropic":
        return await _call_anthropic(session, model, messages)
    return await _call_openai_compatible(session, model, messages)

License: MIT