
[HIGH] Benchmarking Suite for Accuracy Tracking (Spider, WikiSQL) #44

@Sakeeb91

Description


🔴 Priority: HIGH | Type: Enhancement

1. SUMMARY

  • No automated benchmarking suite exists to measure and track SQL generation accuracy against standard datasets (Spider, WikiSQL, BIRD).
  • Impact: Cannot quantify accuracy claims (README states "90% on complex queries"), detect accuracy regressions, or compare model versions objectively.

2. SYSTEM CONTEXT

benchmarks/                    # NEW - proposed structure
├── datasets/
│   ├── spider/               # Spider dataset (cross-domain)
│   ├── wikisql/              # WikiSQL dataset (single-table)
│   └── custom/               # Custom domain-specific tests
├── runners/
│   ├── accuracy_runner.py    # Run benchmarks, collect metrics
│   └── report_generator.py   # Generate reports/dashboards
├── results/
│   └── {timestamp}/          # Historical results
└── configs/
    └── benchmark_config.yaml # Dataset configs, thresholds

Current state:
- README claims "90% accuracy on complex queries" with no reproducible proof
- No baseline metrics for comparison
- No regression detection in CI

3. CURRENT STATE (with code)

📄 File: README.md:244-250

## Benchmarks

| Metric | Traditional Pipeline | Agent Approach |
|--------|---------------------|----------------|
| Simple Queries | 85% accuracy | 92% accuracy |
| Complex Queries | 60% accuracy | 90% accuracy |
| Silent Failures | 15% | <1% |

These claims are not backed by reproducible benchmarks.

📄 File: docs/IMPLEMENTATION_PLAN.md (excerpt)

- Accuracy improvements via self-correction
- No formal benchmark suite defined

The implementation plan mentions accuracy improvements but defines no benchmarking infrastructure.

4. PROPOSED SOLUTION

Create a comprehensive benchmarking suite that:

  1. Supports standard Text2SQL datasets (Spider, WikiSQL, BIRD)
  2. Measures execution accuracy (exact match, execution match)
  3. Tracks metrics over time and across model versions
  4. Integrates with CI for regression detection

📄 File: benchmarks/runners/accuracy_runner.py (NEW)

from dataclasses import dataclass
from pathlib import Path
import json
import re

@dataclass
class BenchmarkResult:
    dataset: str
    total_queries: int
    exact_match: int
    execution_match: int
    exact_match_pct: float
    execution_match_pct: float
    avg_confidence: float
    avg_latency_ms: float
    errors: list[dict]

class AccuracyBenchmark:
    def __init__(self, agent_engine, dataset_path: Path):
        self.agent = agent_engine
        self.dataset_name = dataset_path.stem
        self.dataset = self._load_dataset(dataset_path)

    async def run(self) -> BenchmarkResult:
        results = []
        for example in self.dataset:
            results.append(await self._evaluate_single(example))
        return self._aggregate_results(results)

    async def _evaluate_single(self, example: dict) -> dict:
        # Generate SQL without executing it through the agent; execution
        # happens separately against the benchmark database.
        generated = await self.agent.generate_sql(
            natural_query=example["question"],
            database_id=example["db_id"],
            execute=False,
        )
        return {
            "exact_match": self._normalize_sql(generated.sql)
            == self._normalize_sql(example["query"]),
            "execution_match": await self._check_execution_match(
                generated.sql, example["query"], example["db_id"]
            ),
            "confidence": generated.confidence,
            "latency_ms": generated.execution_time_ms,
        }

    @staticmethod
    def _load_dataset(dataset_path: Path) -> list[dict]:
        # Assumes the split is a JSON list of examples (Spider's format).
        return json.loads(dataset_path.read_text())

    @staticmethod
    def _normalize_sql(sql: str) -> str:
        # First-pass normalization; the fuller version belongs in
        # benchmarks/utils/sql_normalize.py (Phase 1 below).
        return re.sub(r"\s+", " ", sql.strip().rstrip(";")).lower()

    async def _check_execution_match(self, gen_sql, gold_sql, db_id) -> bool:
        # Run both statements against the benchmark DB for db_id and
        # compare result sets; see the Phase 2 sketch below.
        raise NotImplementedError

    def _aggregate_results(self, results: list[dict]) -> BenchmarkResult:
        total = len(results) or 1  # guard against an empty dataset
        exact = sum(r["exact_match"] for r in results)
        execution = sum(r["execution_match"] for r in results)
        return BenchmarkResult(
            dataset=self.dataset_name,
            total_queries=len(results),
            exact_match=exact,
            execution_match=execution,
            exact_match_pct=exact / total,
            execution_match_pct=execution / total,
            avg_confidence=sum(r["confidence"] for r in results) / total,
            avg_latency_ms=sum(r["latency_ms"] for r in results) / total,
            errors=[],  # error categorization is added in Phase 2
        )
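
For orientation, a possible invocation of the runner (illustrative only: how the agent engine is constructed, and the dev split filename, are assumptions about this repo's layout):

import asyncio
from pathlib import Path

from benchmarks.runners.accuracy_runner import AccuracyBenchmark

async def main() -> None:
    engine = ...  # hypothetical: construct the project's agent engine here
    bench = AccuracyBenchmark(engine, Path("benchmarks/datasets/spider/dev.json"))
    result = await bench.run()
    print(f"{result.dataset}: exact={result.exact_match_pct:.1%}, "
          f"execution={result.execution_match_pct:.1%}")

asyncio.run(main())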

📄 File: benchmarks/configs/benchmark_config.yaml (NEW)

datasets:
  spider:
    path: benchmarks/datasets/spider
    split: dev
    metrics: [exact_match, execution_match]
    thresholds:
      exact_match: 0.70
      execution_match: 0.80

  wikisql:
    path: benchmarks/datasets/wikisql
    split: test
    metrics: [exact_match, execution_match]
    thresholds:
      exact_match: 0.80
      execution_match: 0.85

reporting:
  output_dir: benchmarks/results
  format: [json, markdown, html]
  compare_with_baseline: true
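
As a sketch of how the runner could consume this config (assumes PyYAML; key names mirror the YAML above, `check_thresholds` is a hypothetical helper that takes a BenchmarkResult):

from pathlib import Path
import yaml

def load_benchmark_config(path: str = "benchmarks/configs/benchmark_config.yaml") -> dict:
    # Parse the YAML config shown above.
    return yaml.safe_load(Path(path).read_text())

def check_thresholds(dataset_name: str, result, config: dict) -> list[str]:
    # Compare a BenchmarkResult's percentages against the per-dataset
    # thresholds; an empty return value means the run passed.
    thresholds = config["datasets"][dataset_name]["thresholds"]
    actual = {
        "exact_match": result.exact_match_pct,
        "execution_match": result.execution_match_pct,
    }
    return [
        f"{metric}: {actual[metric]:.2%} below threshold {minimum:.2%}"
        for metric, minimum in thresholds.items()
        if actual[metric] < minimum
    ]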

5. IMPLEMENTATION CHECKLIST

Phase 1: Infrastructure

  • Create benchmarks/ directory structure
  • Add dataset download/setup scripts (Spider, WikiSQL)
  • Implement SQL normalization for comparison (see the sketch after this list)
  • Create database setup for benchmark schemas
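
The normalization item above could start from sqlparse plus whitespace collapsing; this is a first pass, not Spider's full canonicalization, and sqlparse is a third-party dependency:

import re
import sqlparse  # third-party: pip install sqlparse

def normalize_sql(sql: str) -> str:
    # Lowercase keywords/identifiers and strip comments, then collapse
    # whitespace so pure formatting differences do not count as mismatches.
    formatted = sqlparse.format(
        sql, keyword_case="lower", identifier_case="lower", strip_comments=True
    )
    return re.sub(r"\s+", " ", formatted).strip().rstrip(";")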

Phase 2: Core Benchmark Runner

  • Implement AccuracyBenchmark class
  • Add exact match scoring
  • Add execution match scoring (run both SQLs, compare result sets; see the sketch after this list)
  • Implement confidence and latency tracking
  • Add error categorization (syntax, semantic, timeout)
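
For SQLite-backed datasets like Spider, execution match could start as simply as this (order-insensitive multiset comparison; queries whose gold SQL uses ORDER BY would need an ordered variant):

import sqlite3
from collections import Counter

def execution_match(generated_sql: str, gold_sql: str, db_path: str) -> bool:
    # Run both statements against the benchmark database and compare
    # result rows as multisets so row order is ignored.
    with sqlite3.connect(db_path) as conn:
        try:
            generated_rows = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return False  # generated SQL failed to parse or execute
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(generated_rows) == Counter(gold_rows)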

Phase 3: Reporting & CI

  • Generate JSON/Markdown/HTML reports
  • Add historical comparison (vs baseline, vs previous run)
  • Create GitHub Actions workflow for scheduled benchmarks
  • Add regression detection with configurable thresholds (see the sketch after this list)
  • Create badge/dashboard for README
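
A minimal regression gate for CI might look like the following (assumes each run serializes its BenchmarkResult to flat JSON; the 1-point tolerance is an arbitrary starting point):

import json
import sys
from pathlib import Path

TOLERANCE = 0.01  # assumption: allow a 1-point drop before failing

def detect_regressions(current_path: str, baseline_path: str) -> int:
    current = json.loads(Path(current_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    failures = [
        f"{m} regressed: {baseline[m]:.2%} -> {current[m]:.2%}"
        for m in ("exact_match_pct", "execution_match_pct")
        if current[m] < baseline[m] - TOLERANCE
    ]
    for failure in failures:
        print(failure, file=sys.stderr)
    return 1 if failures else 0  # nonzero exit fails the CI job

if __name__ == "__main__":
    sys.exit(detect_regressions(sys.argv[1], sys.argv[2]))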

Phase 4: Advanced Metrics

  • Component-level accuracy (SELECT, WHERE, JOIN, GROUP BY)
  • Complexity-stratified results (simple, medium, complex; see the classifier sketch after this list)
  • Per-domain accuracy (Spider domains)
  • Model comparison (different model versions)
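
For the complexity stratification above, a heuristic bucketing could serve until Spider's official hardness criteria are wired in (the cutoffs here are assumptions):

def classify_complexity(sql: str) -> str:
    # Rough buckets for stratified reporting; Spider's own hardness
    # labels (easy/medium/hard/extra) are the more rigorous option.
    s = f" {sql.lower()} "
    if s.count("select") > 1 or " union " in s or " intersect " in s:
        return "complex"  # nested query or set operation
    if " join " in s or " group by " in s or " having " in s:
        return "medium"
    return "simple"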

6. FILES TO MODIFY

| File | Lines | Action | Description |
|------|-------|--------|-------------|
| `benchmarks/__init__.py` | NEW | Create | Benchmark package |
| `benchmarks/runners/accuracy_runner.py` | NEW | Create | Core benchmark runner |
| `benchmarks/runners/report_generator.py` | NEW | Create | Report generation |
| `benchmarks/utils/sql_normalize.py` | NEW | Create | SQL normalization for comparison |
| `benchmarks/utils/dataset_loader.py` | NEW | Create | Dataset loading utilities |
| `benchmarks/configs/benchmark_config.yaml` | NEW | Create | Benchmark configuration |
| `scripts/download_datasets.sh` | NEW | Create | Dataset download script |
| `scripts/run_benchmarks.py` | NEW | Create | CLI for running benchmarks |
| `.github/workflows/benchmark.yml` | NEW | Create | Scheduled benchmark workflow |
| `README.md` | TBD | Modify | Add benchmark badge and results |

7. RISK ASSESSMENT

| Risk | Impact | Mitigation |
|------|--------|------------|
| Dataset licensing issues | 🟡 | Use permissively licensed datasets; document licenses |
| Benchmarks are slow (full dataset) | 🟡 | Support subset mode; run full benchmarks on schedule |
| Execution match requires real DBs | 🟡 | Set up benchmark databases; use Docker containers |
| Results vary with model updates | 🟢 | Track historical results; require approval for regressions |

8. RELATED CONTEXT
