arshinsikka/llm-evaluation-framework

LLM Evaluation Framework

A modular framework for systematically comparing large language models across diverse task types — lecture summarization, business decision analysis, and document retrieval ranking — measuring quality, cost, and latency to help teams choose the right model for each workload.


Quick Start

Five steps to run your first evaluation and open the dashboard:

# 1. Install
git clone <repo-url> && cd llm-evaluation-framework
pip install -e .

# 2. Set API keys
cp .env.example .env
# Edit .env and add your DEEPSEEK_API_KEY and/or OPENAI_API_KEY

# 3. Run evaluation against a pre-curated example
python -m src.cli run \
  --config    examples/lecture_summarization/config.yaml \
  --test-cases examples/lecture_summarization/test_cases.json

# 4. Print summary tables
python -m src.cli results --output-folder examples/lecture_summarization/results/

# 5. Open the interactive dashboard
python -m src.cli dashboard \
  --output-folder examples/lecture_summarization/results/ \
  --output dashboard.html

No API keys? Pre-computed results for all three examples are already included. Open any examples/*/results/dashboard.html directly in a browser — no server needed.


Features

• Multi-provider inference: DeepSeek, OpenAI; extensible to Anthropic, Together, and any OpenAI-compatible endpoint
• Async parallel execution: per-provider concurrency limits, retry logic, rate-limit backoff
• 8 evaluation metrics: 4 accuracy · 2 safety · 2 quality (see Metrics Explained)
• Pareto frontier analysis: identify models that no other model beats on both quality and cost/latency
• Failure taxonomy: auto-classify failures into 4 types; surface edge cases where all models fail
• Interactive HTML dashboard: self-contained ~55 KB file, 6 Plotly charts, no server required
• Web upload UI: frontend/index.html — drag-and-drop JSON results, live filtering, CSV/JSON export
• CLI interface: llm-eval run / results / analyze / dashboard
• 3 worked examples: lecture summarization · decision analysis · retrieval ranking
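The concurrency-limit and retry behavior can be sketched as follows. This is a minimal illustration only, not the framework's actual implementation; the names `MAX_CONCURRENT`, `call_model`, and `call_with_retry` are hypothetical:

```python
import asyncio
import random

MAX_CONCURRENT = 4   # hypothetical per-provider concurrency limit
MAX_RETRIES = 3

async def call_model(prompt: str) -> str:
    """Stand-in for a real HTTP call to a provider API."""
    await asyncio.sleep(0)  # simulate I/O
    return f"response to: {prompt}"

async def call_with_retry(sem: asyncio.Semaphore, prompt: str) -> str:
    """Bound concurrency with a semaphore; back off exponentially on failure."""
    async with sem:
        for attempt in range(MAX_RETRIES):
            try:
                return await call_model(prompt)
            except Exception:
                # exponential backoff with jitter before the next attempt
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError(f"gave up after {MAX_RETRIES} attempts: {prompt}")

async def run_all(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(call_with_retry(sem, p) for p in prompts))

results = asyncio.run(run_all(["case 1", "case 2", "case 3"]))
```

Each provider gets its own semaphore in the real runner, so a slow provider cannot starve the others.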

Installation

Requirements: Python 3.10+

# Install in editable mode (recommended)
pip install -e .

# Or install dependencies only
pip install -r requirements.txt

Environment variables — copy .env.example to .env and fill in your keys:

DEEPSEEK_API_KEY=your_deepseek_key    # required for DeepSeek models
OPENAI_API_KEY=your_openai_key        # required for GPT-4o, GPT-4o-mini, etc.
ANTHROPIC_API_KEY=your_key            # optional
TOGETHER_API_KEY=your_key             # optional
| Variable | Description | Default |
|----------|-------------|---------|
| DEEPSEEK_API_KEY | DeepSeek API key | |
| OPENAI_API_KEY | OpenAI API key | |
| ANTHROPIC_API_KEY | Anthropic API key | |
| TOGETHER_API_KEY | Together AI API key | |
| EMBEDDING_MODEL | Embedding backend for semantic similarity | sentence-transformers |
| OUTPUT_FOLDER | Default directory for result files | ./data/results/ |
| DEBUG | Enable verbose debug logging | false |

Usage

CLI

llm-eval run — Run inference and evaluation

python -m src.cli run \
  --config    examples/lecture_summarization/config.yaml \
  --test-cases examples/lecture_summarization/test_cases.json \
  [--output-folder PATH]   # overrides config output_folder

Calls every model in the config against every test case in parallel, computes all 8 metrics, and saves two files to the output folder:

  • inference_results.json — raw outputs, latency, token counts, cost
  • metrics_results.json — all 8 scores per (test case, model) pair
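An illustrative record shape for inference_results.json is sketched below. The field names here are assumptions for illustration only — inspect your own output files for the exact schema:

```json
{
  "test_case_id": "tc_001",
  "model_id": "deepseek",
  "output": "…model output…",
  "latency_ms": 4628,
  "input_tokens": 512,
  "output_tokens": 180,
  "cost_usd": 0.00006
}
```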

llm-eval results — Print summary tables

python -m src.cli results --output-folder examples/lecture_summarization/results/

Prints an Inference Summary (latency, cost, error rate) and a Metrics Summary (all 8 metrics averaged per model) as ASCII tables.

llm-eval analyze — Full text analysis

python -m src.cli analyze --output-folder examples/lecture_summarization/results/

Prints ranked metric comparisons with inline bar charts, Pareto frontier status, model highlights, failure type breakdown, and plain-English recommendations.
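A model is on the quality/cost Pareto frontier when no other model is at least as good on both axes and strictly better on one. A minimal sketch of that check (illustrative only; the real logic lives in src/analyzer.py, and the numbers below are hypothetical):

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> set[str]:
    """models maps name -> (quality, cost); higher quality and lower cost are better.
    A model is dominated if another is >= on quality and <= on cost, strictly on one."""
    frontier = set()
    for name, (q, c) in models.items():
        dominated = any(
            (q2 >= q and c2 <= c) and (q2 > q or c2 < c)
            for other, (q2, c2) in models.items() if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier

# Hypothetical scores: A is cheap and accurate, B is dominated by A,
# C trades higher cost for higher quality.
scores = {"A": (0.85, 0.001), "B": (0.80, 0.030), "C": (0.90, 0.050)}
```

Here `pareto_frontier(scores)` keeps A and C: B is beaten by A on both quality and cost, while A and C each win on one axis.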

llm-eval dashboard — Generate HTML dashboard

python -m src.cli dashboard \
  --output-folder examples/lecture_summarization/results/ \
  --output        my_dashboard.html \
  [--no-browser]

Generates a self-contained HTML file with 6 interactive Plotly charts (metric rankings, radar, Pareto scatter, failure distribution, performance table). Opens in the default browser automatically unless --no-browser is set.


Web UI

Open frontend/index.html in any modern browser — no server, no build step. Drag-and-drop your result files and explore:
  • Upload metrics_results.json (required) and inference_results.json (optional, for cost/latency charts)
  • Click Load demo data to explore immediately without running an evaluation
  • Filter by model and difficulty level — all 6 charts update live
  • Click failure rows to expand the input, expected output, and model output side by side
  • Export the filtered view as CSV or JSON

Examples

Three fully pre-computed examples are included. Each contains config.yaml, test_cases.json, a populated results/ folder, and analysis_summary.md with real numbers.

Lecture Summarization

Path: examples/lecture_summarization/
Task: Summarize university lecture excerpts (CS algorithms, ML, statistics)
Models: deepseek-chat vs gpt-4o · 15 cases

| | deepseek-chat | gpt-4o |
|---|---|---|
| Semantic similarity | 0.858 | 0.845 |
| ROUGE-1 | 0.552 | 0.511 |
| Hallucination rate | 0% | 0% |
| Avg latency | 4,628 ms | 2,330 ms |
| Total cost (15 cases) | $0.0009 | $0.027 |

Key finding: DeepSeek is marginally more accurate and 30× cheaper; GPT-4o is 2× faster. Both produce zero hallucinations on factual academic content. Hard cases (transformers, RL, CLT) did not degrade either model.

Full analysis · Open dashboard


Business Decision Analysis

Path: examples/decision_analysis/
Task: Recommend a course of action given structured business scenarios (fundraising, M&A, cloud strategy, PE LBOs)
Models: deepseek-chat vs gpt-4o · 15 cases

| | deepseek-chat | gpt-4o |
|---|---|---|
| Semantic similarity | 0.753 | 0.699 |
| Completeness score | 0.807 | 0.647 |
| Hallucination rate | 27% | 33% |
| Avg latency | 10,065 ms | 3,979 ms |
| Total cost (15 cases) | $0.0019 | $0.049 |

Key finding: Decision analysis is ~10 points harder than summarization for both models. DeepSeek consistently addressed quantitative trade-offs; GPT-4o gave higher-level strategic framing. Both models' "hallucinations" are appropriate domain knowledge — the evaluator flags general business facts not present in the case text. Four edge cases (SaaS pricing math, PE debt coverage, M&A earnout mechanics) broke both models.

Full analysis · Open dashboard


Document Retrieval Ranking

Path: examples/retrieval_ranking/
Task: Rank 5–8 documents by relevance to a query; output a JSON array of document IDs
Models: deepseek-chat vs gpt-4o · 15 cases across 9 domains

| | deepseek-chat | gpt-4o |
|---|---|---|
| Kendall τ (true ranking quality) | 0.840 | 0.825 |
| Semantic similarity* | 0.999 | 0.688 |
| Avg latency | 2,436 ms | 1,342 ms |
| Total cost (15 cases) | $0.0010 | $0.021 |

* Misleading — GPT-4o wraps output in markdown code fences; both models rank equally well when parsed correctly.

Key finding: Both models are strong rankers (τ ≈ 0.83), far above keyword-based baselines. Standard metrics (ROUGE, semantic similarity) are unreliable for structured-output tasks — use Kendall τ or NDCG instead. DeepSeek is better on hard science/finance cases; GPT-4o is better on software and cross-domain medium cases. At 21× lower cost with comparable quality, DeepSeek is the clear choice for ranking pipelines.
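Kendall τ compares two rankings by counting concordant versus discordant item pairs. A pure-Python sketch of the tie-free τ-a statistic (for intuition only; a library implementation may be used in practice):

```python
from itertools import combinations

def kendall_tau(ranking_a: list[str], ranking_b: list[str]) -> float:
    """Tau-a over two rankings of the same items, assuming no ties.
    +1.0 = identical order, -1.0 = exactly reversed."""
    pos_a = {doc: i for i, doc in enumerate(ranking_a)}
    pos_b = {doc: i for i, doc in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # a pair is concordant when both rankings order x and y the same way
        same_order = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
        concordant += same_order
        discordant += not same_order
    return (concordant - discordant) / (concordant + discordant)
```

For example, swapping one adjacent pair in a 4-document ranking makes 1 of the 6 pairs discordant, giving τ = (5 − 1)/6 ≈ 0.67.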

Full analysis · Open dashboard


Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    config.yaml + test_cases.json                │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │     InferenceRunner     │  async HTTP, concurrency
              │  (inference_runner.py)  │  limits, retry + backoff
              └──────────┬──────────────┘
                         │  ┌──────────────────────────────┐
                         ├──► DeepSeek  /api/chat/completions
                         ├──► OpenAI    /v1/chat/completions
                         └──► Anthropic /v1/messages
                         │
                         │  InferenceResult per (test_case, model)
                         ▼
              ┌─────────────────────────┐
              │    MetricsEvaluator     │  sentence-transformers
              │      (evaluator.py)     │  rouge-score
              └──────────┬──────────────┘  DeepSeek LLM-as-judge
                         │
                         │  MetricsResult (8 scores) per (test_case, model)
                         ▼
              ┌─────────────────────────┐
              │  ComparativeAnalyzer    │  Pareto frontier
              │     (analyzer.py)       │  rankings · failures
              └──────────┬──────────────┘
                         │  AnalysisReport
              ┌──────────┴────────────────────┐
              │                               │
              ▼                               ▼
  ┌─────────────────────┐        ┌────────────────────────┐
  │  DashboardGenerator │        │       CLI / llm-eval   │
  │ (dashboard_gen*.py) │        │        (cli.py)        │
  └──────────┬──────────┘        └────────────────────────┘
             │ dashboard.html          terminal output
             ▼
  ┌─────────────────────┐
  │  frontend/index.html│  drag-and-drop · live filters
  │  (Vanilla JS+Plotly)│  failure drill-down · CSV export
  └─────────────────────┘

Key source files:

| File | Responsibility |
|------|----------------|
| src/config_loader.py | Pydantic-validated YAML loading; API key resolution from env |
| src/inference_runner.py | Async multi-provider inference; cost table; per-model semaphores |
| src/evaluator.py | Embedding cache; ROUGE; 4 LLM-as-judge metrics via DeepSeek |
| src/analyzer.py | Pareto frontier; metric rankings; failure classification (4 types) |
| src/dashboard_generator.py | 6 Plotly figures → single self-contained HTML |
| src/cli.py | Click CLI entry point (llm-eval) |
| src/models/results.py | InferenceResult, MetricsResult dataclasses |
| src/models/analysis.py | AnalysisReport, FailureReport, ParetoFrontier |

Configuration Reference

evaluation:
  name: "My Evaluation"
  description: "Optional description"
  task_type: "summarization"   # qa | summarization | classification | generation
                               # reasoning | code | ranking | decision_analysis | custom

models:
  - id: "deepseek"             # unique ID used in all result files
    provider: "deepseek"       # openai | deepseek | anthropic | together
    model_name: "deepseek-chat"
    temperature: 0.3           # 0.0 – 2.0
    max_tokens: 512
    # base_url: "https://..."  # override default API endpoint
    # extra_params:            # any additional fields forwarded to the API
    #   top_p: 0.95

  - id: "gpt-4o"
    provider: "openai"
    model_name: "gpt-4o"
    temperature: 0.3
    max_tokens: 512

metrics:                       # include any subset of the 8 available
  - semantic_similarity
  - rouge1
  - rouge2
  - rougeL
  - hallucination_detected
  - toxicity_score
  - relevance_score
  - completeness_score

test_generation:
  strategy: manual             # manual | llm_generated | mixed
  num_cases: 15
  difficulty_distribution:
    easy: 0.33
    medium: 0.34
    hard: 0.33

output_folder: "examples/my_eval/results/"

Test cases JSON format:

[
  {
    "id": "tc_001",
    "input": "Summarize the following lecture: ...",
    "expected_output": "Ground truth / reference answer",
    "difficulty": "easy",
    "context": null,
    "tags": ["cs", "algorithms"],
    "source": "manual"
  }
]
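Loading and sanity-checking a test-case file can be sketched as follows. The field names come from the format above; the validator itself is hypothetical, not part of the framework:

```python
import json

REQUIRED_FIELDS = {"id", "input", "expected_output", "difficulty"}

def validate_test_cases(raw: str) -> list[dict]:
    """Parse a test_cases.json string and check each case has the required fields."""
    cases = json.loads(raw)
    for case in cases:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"test case {case.get('id', '?')} missing {sorted(missing)}")
    return cases

sample = '''[{"id": "tc_001", "input": "Summarize the following lecture: ...",
              "expected_output": "Ground truth / reference answer",
              "difficulty": "easy"}]'''
cases = validate_test_cases(sample)
```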

Metrics Explained

Accuracy Metrics (higher is better, range 0–1)

| Metric | How it works | Best suited for |
|--------|--------------|-----------------|
| semantic_similarity | Cosine similarity between all-MiniLM-L6-v2 embeddings of output and reference. Cached per text. | Meaning-level comparison; robust to paraphrase. Primary metric for summarization and QA. |
| rouge1 | Unigram F1 overlap between output and reference. | Surface coverage — whether key terms appear. |
| rouge2 | Bigram F1 overlap. | Phrase-level faithfulness; penalises word-order errors. |
| rougeL | Longest common subsequence F1. | Sentence-level fluency and coherence. |
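For intuition, unigram ROUGE-1 F1 reduces to overlap counting over token multisets. This simplified sketch omits the stemming and tokenization that the rouge-score library applies:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram F1 between candidate and reference token multisets."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped per-token overlap
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```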

Warning for structured-output tasks: ROUGE and semantic similarity are misleading when the output format differs from the reference (e.g., a ranking task returns ["D1","D2"] vs ```json\n["D1","D2"]\n```). Use task-specific metrics such as Kendall τ for ranking or exact match for classification.
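One mitigation (a sketch, not part of the shipped evaluator) is to normalise the output by stripping any surrounding markdown code fence before parsing or scoring:

```python
import json
import re

def strip_code_fences(text: str) -> str:
    """Remove a surrounding ```...``` fence (with optional language tag), if present."""
    match = re.match(r"^```[\w-]*\n(.*?)\n```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()

# A fenced model reply parses to the same ranking as a bare JSON array
ranking = json.loads(strip_code_fences('```json\n["D1", "D2"]\n```'))
```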

Safety Metrics

| Metric | How it works | Interpretation |
|--------|--------------|----------------|
| hallucination_detected | DeepSeek is asked: "Does this output contain information NOT in the source? YES or NO" | True = possible unsupported claim. Meaningful for summarization/QA. Not appropriate for open-domain tasks where models apply general knowledge. |
| toxicity_score | DeepSeek is asked to rate toxicity 0–10; normalised to 0–1. | 0 = safe · 1 = toxic. Use for user-facing content. |

Quality Metrics (higher is better, range 0–1)

| Metric | How it works | Best suited for |
|--------|--------------|-----------------|
| relevance_score | DeepSeek is asked to rate how relevant the output is to the input, 0–10. | Whether the model addressed the actual question. Primary for decision analysis and QA. |
| completeness_score | DeepSeek is asked to rate coverage of key points vs the reference, 0–10. | Whether the answer is thorough. Particularly informative for open-ended tasks where lexical overlap is low. |

All four LLM-as-judge calls run concurrently per evaluation and use DeepSeek (the cheapest option) to keep evaluation costs low.
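The concurrent judge pattern can be sketched with asyncio.gather. The judge call below is a stub; the actual prompts and API client live in src/evaluator.py:

```python
import asyncio

async def judge(metric: str, output: str) -> float:
    """Stand-in for one LLM-as-judge call; a real version would hit the DeepSeek API."""
    await asyncio.sleep(0)  # simulate network I/O
    return 0.0              # a real call would parse a 0-10 score from the reply

async def judge_all(output: str) -> dict[str, float]:
    """Run the four judge metrics concurrently and collect their scores."""
    metrics = ["hallucination", "toxicity", "relevance", "completeness"]
    results = await asyncio.gather(*(judge(m, output) for m in metrics))
    return dict(zip(metrics, results))

scores = asyncio.run(judge_all("model output to evaluate"))
```

Because the four calls overlap, evaluation latency is roughly that of the slowest single judge call rather than the sum of all four.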


Extending the Framework

Add a New Evaluation Domain

  1. Create examples/my_domain/config.yaml and test_cases.json
  2. If your task needs a custom system prompt, add it to _SYSTEM_PROMPTS in src/inference_runner.py:
    _SYSTEM_PROMPTS["my_task"] = (
        "You are a specialist in ... Provide ..."
    )
  3. Add "my_task" to the TaskType literal in src/config_loader.py
  4. Run python -m src.cli run --config ... --test-cases ...

Add a New Metric

  1. Add the field to MetricsResult in src/models/results.py:
    my_metric: float = 0.0
  2. Implement the scorer in src/evaluator.py:
    async def _my_metric_check(self, session, output: str) -> float:
        prompt = f"Rate X on a scale of 0–10.\n\nTEXT:\n{_truncate(output)}"
        return _parse_score(await self._call_deepseek(session, prompt, max_tokens=4))
  3. Add a task to _llm_metrics in MetricsEvaluator so it runs concurrently with the others
  4. Add it to MetricsResult(...) in _evaluate_async
  5. Add it to the table definition in src/dashboard_generator.py and src/cli.py

Add a New Model Provider

OpenAI-compatible API (most providers): add a model entry in config.yaml with provider: "openai" and a custom base_url. Then map your API key env variable in _PROVIDER_KEY_MAP in src/config_loader.py:

_PROVIDER_KEY_MAP["myprovider"] = "MY_PROVIDER_API_KEY"

Custom wire format: implement _call_myprovider() in src/inference_runner.py following the pattern of _call_anthropic(), and add a routing case in _dispatch():

async def _dispatch(session, model, messages):
    if model.provider == "myprovider":
        return await _call_myprovider(session, model, messages)
    if model.provider == "anthropic":
        return await _call_anthropic(session, model, messages)
    return await _call_openai_compatible(session, model, messages)

License

MIT
