
[HIGH] Benchmarking Suite for Accuracy Tracking (Spider, WikiSQL) #44

@Sakeeb91

Description


🔴 Priority: HIGH | Type: Enhancement

1. SUMMARY

  • No automated benchmarking suite exists to measure and track SQL generation accuracy against standard datasets (Spider, WikiSQL, BIRD).
  • Impact: Cannot quantify accuracy claims (README states "90% on complex queries"), detect accuracy regressions, or compare model versions objectively.

2. SYSTEM CONTEXT

benchmarks/                    # NEW - proposed structure
├── datasets/
│   ├── spider/               # Spider dataset (cross-domain)
│   ├── wikisql/              # WikiSQL dataset (single-table)
│   └── custom/               # Custom domain-specific tests
├── runners/
│   ├── accuracy_runner.py    # Run benchmarks, collect metrics
│   └── report_generator.py   # Generate reports/dashboards
├── results/
│   └── {timestamp}/          # Historical results
└── configs/
    └── benchmark_config.yaml # Dataset configs, thresholds

Current state:
- README claims "90% accuracy on complex queries" with no reproducible proof
- No baseline metrics for comparison
- No regression detection in CI

3. CURRENT STATE (with code)

📄 File: README.md:244-250

## Benchmarks

| Metric | Traditional Pipeline | Agent Approach |
|--------|---------------------|----------------|
| Simple Queries | 85% accuracy | 92% accuracy |
| Complex Queries | 60% accuracy | 90% accuracy |
| Silent Failures | 15% | <1% |

These claims are not backed by reproducible benchmarks.

📄 File: docs/IMPLEMENTATION_PLAN.md (excerpt)

- Accuracy improvements via self-correction
- No formal benchmark suite defined

The implementation plan mentions accuracy improvements but defines no benchmarking infrastructure.

4. PROPOSED SOLUTION

Create a comprehensive benchmarking suite that:

  1. Supports standard Text2SQL datasets (Spider, WikiSQL, BIRD)
  2. Measures execution accuracy (exact match, execution match)
  3. Tracks metrics over time and across model versions
  4. Integrates with CI for regression detection

📄 File: benchmarks/runners/accuracy_runner.py (NEW)

from dataclasses import dataclass
from pathlib import Path
import json
import re

@dataclass
class BenchmarkResult:
    dataset: str
    total_queries: int
    exact_match: int
    execution_match: int
    exact_match_pct: float
    execution_match_pct: float
    avg_confidence: float
    avg_latency_ms: float
    errors: list[dict]

class AccuracyBenchmark:
    def __init__(self, agent_engine, dataset_path: Path):
        self.agent = agent_engine
        self.dataset_name = dataset_path.stem
        self.dataset = self._load_dataset(dataset_path)

    async def run(self) -> BenchmarkResult:
        results = []
        for example in self.dataset:
            results.append(await self._evaluate_single(example))
        return self._aggregate_results(results)

    async def _evaluate_single(self, example: dict) -> dict:
        # Generate SQL without executing it through the agent; execution
        # happens separately against the benchmark database.
        generated = await self.agent.generate_sql(
            natural_query=example["question"],
            database_id=example["db_id"],
            execute=False,
        )
        return {
            "exact_match": self._normalize_sql(generated.sql)
            == self._normalize_sql(example["query"]),
            "execution_match": await self._check_execution_match(
                generated.sql, example["query"], example["db_id"]
            ),
            "confidence": generated.confidence,
            "latency_ms": generated.execution_time_ms,
        }

    @staticmethod
    def _load_dataset(dataset_path: Path) -> list[dict]:
        # Assumes the split is a JSON list of examples (Spider's format).
        return json.loads(dataset_path.read_text())

    @staticmethod
    def _normalize_sql(sql: str) -> str:
        # First-pass normalization; the fuller version belongs in
        # benchmarks/utils/sql_normalize.py (Phase 1 below).
        return re.sub(r"\s+", " ", sql.strip().rstrip(";")).lower()

    async def _check_execution_match(self, gen_sql, gold_sql, db_id) -> bool:
        # Run both statements against the benchmark DB for db_id and
        # compare result sets; see the Phase 2 sketch below.
        raise NotImplementedError

    def _aggregate_results(self, results: list[dict]) -> BenchmarkResult:
        total = len(results) or 1  # guard against an empty dataset
        exact = sum(r["exact_match"] for r in results)
        execution = sum(r["execution_match"] for r in results)
        return BenchmarkResult(
            dataset=self.dataset_name,
            total_queries=len(results),
            exact_match=exact,
            execution_match=execution,
            exact_match_pct=exact / total,
            execution_match_pct=execution / total,
            avg_confidence=sum(r["confidence"] for r in results) / total,
            avg_latency_ms=sum(r["latency_ms"] for r in results) / total,
            errors=[],  # error categorization is added in Phase 2
        )
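
For orientation, a possible invocation of the runner (illustrative only: how the agent engine is constructed, and the dev split filename, are assumptions about this repo's layout):

import asyncio
from pathlib import Path

from benchmarks.runners.accuracy_runner import AccuracyBenchmark

async def main() -> None:
    engine = ...  # hypothetical: construct the project's agent engine here
    bench = AccuracyBenchmark(engine, Path("benchmarks/datasets/spider/dev.json"))
    result = await bench.run()
    print(f"{result.dataset}: exact={result.exact_match_pct:.1%}, "
          f"execution={result.execution_match_pct:.1%}")

asyncio.run(main())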

📄 File: benchmarks/configs/benchmark_config.yaml (NEW)

datasets:
  spider:
    path: benchmarks/datasets/spider
    split: dev
    metrics: [exact_match, execution_match]
    thresholds:
      exact_match: 0.70
      execution_match: 0.80

  wikisql:
    path: benchmarks/datasets/wikisql
    split: test
    metrics: [exact_match, execution_match]
    thresholds:
      exact_match: 0.80
      execution_match: 0.85

reporting:
  output_dir: benchmarks/results
  format: [json, markdown, html]
  compare_with_baseline: true
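
As a sketch of how the runner could consume this config (assumes PyYAML; key names mirror the YAML above, `check_thresholds` is a hypothetical helper that takes a BenchmarkResult):

from pathlib import Path
import yaml

def load_benchmark_config(path: str = "benchmarks/configs/benchmark_config.yaml") -> dict:
    # Parse the YAML config shown above.
    return yaml.safe_load(Path(path).read_text())

def check_thresholds(dataset_name: str, result, config: dict) -> list[str]:
    # Compare a BenchmarkResult's percentages against the per-dataset
    # thresholds; an empty return value means the run passed.
    thresholds = config["datasets"][dataset_name]["thresholds"]
    actual = {
        "exact_match": result.exact_match_pct,
        "execution_match": result.execution_match_pct,
    }
    return [
        f"{metric}: {actual[metric]:.2%} below threshold {minimum:.2%}"
        for metric, minimum in thresholds.items()
        if actual[metric] < minimum
    ]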

5. IMPLEMENTATION CHECKLIST

Phase 1: Infrastructure

  • Create benchmarks/ directory structure
  • Add dataset download/setup scripts (Spider, WikiSQL)
  • Implement SQL normalization for comparison (see the sketch after this list)
  • Create database setup for benchmark schemas
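
The normalization item above could start from sqlparse plus whitespace collapsing; this is a first pass, not Spider's full canonicalization, and sqlparse is a third-party dependency:

import re
import sqlparse  # third-party: pip install sqlparse

def normalize_sql(sql: str) -> str:
    # Lowercase keywords/identifiers and strip comments, then collapse
    # whitespace so pure formatting differences do not count as mismatches.
    formatted = sqlparse.format(
        sql, keyword_case="lower", identifier_case="lower", strip_comments=True
    )
    return re.sub(r"\s+", " ", formatted).strip().rstrip(";")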

Phase 2: Core Benchmark Runner

  • Implement AccuracyBenchmark class
  • Add exact match scoring
  • Add execution match scoring (run both SQLs, compare result sets; see the sketch after this list)
  • Implement confidence and latency tracking
  • Add error categorization (syntax, semantic, timeout)
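
For SQLite-backed datasets like Spider, execution match could start as simply as this (order-insensitive multiset comparison; queries whose gold SQL uses ORDER BY would need an ordered variant):

import sqlite3
from collections import Counter

def execution_match(generated_sql: str, gold_sql: str, db_path: str) -> bool:
    # Run both statements against the benchmark database and compare
    # result rows as multisets so row order is ignored.
    with sqlite3.connect(db_path) as conn:
        try:
            generated_rows = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return False  # generated SQL failed to parse or execute
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(generated_rows) == Counter(gold_rows)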

Phase 3: Reporting & CI

  • Generate JSON/Markdown/HTML reports
  • Add historical comparison (vs baseline, vs previous run)
  • Create GitHub Actions workflow for scheduled benchmarks
  • Add regression detection with configurable thresholds (see the sketch after this list)
  • Create badge/dashboard for README
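
A minimal regression gate for CI might look like the following (assumes each run serializes its BenchmarkResult to flat JSON; the 1-point tolerance is an arbitrary starting point):

import json
import sys
from pathlib import Path

TOLERANCE = 0.01  # assumption: allow a 1-point drop before failing

def detect_regressions(current_path: str, baseline_path: str) -> int:
    current = json.loads(Path(current_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    failures = [
        f"{m} regressed: {baseline[m]:.2%} -> {current[m]:.2%}"
        for m in ("exact_match_pct", "execution_match_pct")
        if current[m] < baseline[m] - TOLERANCE
    ]
    for failure in failures:
        print(failure, file=sys.stderr)
    return 1 if failures else 0  # nonzero exit fails the CI job

if __name__ == "__main__":
    sys.exit(detect_regressions(sys.argv[1], sys.argv[2]))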

Phase 4: Advanced Metrics

  • Component-level accuracy (SELECT, WHERE, JOIN, GROUP BY)
  • Complexity-stratified results (simple, medium, complex; see the classifier sketch after this list)
  • Per-domain accuracy (Spider domains)
  • Model comparison (different model versions)
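
For the complexity stratification above, a heuristic bucketing could serve until Spider's official hardness criteria are wired in (the cutoffs here are assumptions):

def classify_complexity(sql: str) -> str:
    # Rough buckets for stratified reporting; Spider's own hardness
    # labels (easy/medium/hard/extra) are the more rigorous option.
    s = f" {sql.lower()} "
    if s.count("select") > 1 or " union " in s or " intersect " in s:
        return "complex"  # nested query or set operation
    if " join " in s or " group by " in s or " having " in s:
        return "medium"
    return "simple"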

6. FILES TO MODIFY

| File | Lines | Action | Description |
|------|-------|--------|-------------|
| `benchmarks/__init__.py` | NEW | Create | Benchmark package |
| `benchmarks/runners/accuracy_runner.py` | NEW | Create | Core benchmark runner |
| `benchmarks/runners/report_generator.py` | NEW | Create | Report generation |
| `benchmarks/utils/sql_normalize.py` | NEW | Create | SQL normalization for comparison |
| `benchmarks/utils/dataset_loader.py` | NEW | Create | Dataset loading utilities |
| `benchmarks/configs/benchmark_config.yaml` | NEW | Create | Benchmark configuration |
| `scripts/download_datasets.sh` | NEW | Create | Dataset download script |
| `scripts/run_benchmarks.py` | NEW | Create | CLI for running benchmarks |
| `.github/workflows/benchmark.yml` | NEW | Create | Scheduled benchmark workflow |
| `README.md` | TBD | Modify | Add benchmark badge and results |

7. RISK ASSESSMENT

| Risk | Impact | Mitigation |
|------|--------|------------|
| Dataset licensing issues | 🟡 | Use permissively licensed datasets; document licenses |
| Benchmarks are slow (full dataset) | 🟡 | Support subset mode; run full benchmarks on schedule |
| Execution match requires real DBs | 🟡 | Set up benchmark databases; use Docker containers |
| Results vary with model updates | 🟢 | Track historical results; require approval for regressions |

8. RELATED CONTEXT
