🔴 Priority: HIGH | Type: Enhancement
1. SUMMARY
- No automated benchmarking suite exists to measure and track SQL generation accuracy against standard datasets (Spider, WikiSQL, BIRD).
- Impact: Cannot quantify accuracy claims (README states "90% on complex queries"), detect accuracy regressions, or compare model versions objectively.
2. SYSTEM CONTEXT
```
benchmarks/                      # NEW - proposed structure
├── datasets/
│   ├── spider/                  # Spider dataset (cross-domain)
│   ├── wikisql/                 # WikiSQL dataset (single-table)
│   └── custom/                  # Custom domain-specific tests
├── runners/
│   ├── accuracy_runner.py       # Run benchmarks, collect metrics
│   └── report_generator.py      # Generate reports/dashboards
├── results/
│   └── {timestamp}/             # Historical results
└── configs/
    └── benchmark_config.yaml    # Dataset configs, thresholds
```
Current state:
- README claims "90% accuracy on complex queries" with no reproducible proof
- No baseline metrics for comparison
- No regression detection in CI
3. CURRENT STATE (with code)
📄 File: README.md:244-250
## Benchmarks
| Metric | Traditional Pipeline | Agent Approach |
|--------|---------------------|----------------|
| Simple Queries | 85% accuracy | 92% accuracy |
| Complex Queries | 60% accuracy | 90% accuracy |
| Silent Failures | 15% | <1% |

Claims are not backed by reproducible benchmarks.
📄 File: docs/IMPLEMENTATION_PLAN.md (excerpt)
- Accuracy improvements via self-correction
- No formal benchmark suite defined

The implementation plan mentions accuracy improvements but defines no benchmarking infrastructure.
4. PROPOSED SOLUTION
Create a comprehensive benchmarking suite that:
- Supports standard Text2SQL datasets (Spider, WikiSQL, BIRD)
- Measures accuracy via exact-match and execution-match scoring
- Tracks metrics over time and across model versions
- Integrates with CI for regression detection
📄 File: benchmarks/runners/accuracy_runner.py (NEW)
```python
from dataclasses import dataclass
from pathlib import Path
import json


@dataclass
class BenchmarkResult:
    dataset: str
    total_queries: int
    exact_match: int
    execution_match: int
    exact_match_pct: float
    execution_match_pct: float
    avg_confidence: float
    avg_latency_ms: float
    errors: list[dict]


class AccuracyBenchmark:
    def __init__(self, agent_engine, dataset_path: Path):
        self.agent = agent_engine
        self.dataset = self._load_dataset(dataset_path)

    async def run(self) -> BenchmarkResult:
        results = []
        for example in self.dataset:
            result = await self._evaluate_single(example)
            results.append(result)
        return self._aggregate_results(results)

    async def _evaluate_single(self, example: dict) -> dict:
        # Generate SQL only; execution happens separately during scoring.
        generated = await self.agent.generate_sql(
            natural_query=example["question"],
            database_id=example["db_id"],
            execute=False,
        )
        return {
            "exact_match": self._normalize_sql(generated.sql)
            == self._normalize_sql(example["query"]),
            "execution_match": await self._check_execution_match(
                generated.sql, example["query"], example["db_id"]
            ),
            "confidence": generated.confidence,
            "latency_ms": generated.execution_time_ms,
        }
```
📄 File: benchmarks/configs/benchmark_config.yaml (NEW)
```yaml
datasets:
  spider:
    path: benchmarks/datasets/spider
    split: dev
    metrics: [exact_match, execution_match]
    thresholds:
      exact_match: 0.70
      execution_match: 0.80
  wikisql:
    path: benchmarks/datasets/wikisql
    split: test
    metrics: [exact_match, execution_match]
    thresholds:
      exact_match: 0.80
      execution_match: 0.85

reporting:
  output_dir: benchmarks/results
  format: [json, markdown, html]
  compare_with_baseline: true
```

5. IMPLEMENTATION CHECKLIST
Phase 1: Infrastructure
- Create `benchmarks/` directory structure
- Add dataset download/setup scripts (Spider, WikiSQL)
- Implement SQL normalization for comparison (see sketch after this list)
- Create database setup for benchmark schemas
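A minimal sketch of the proposed `benchmarks/utils/sql_normalize.py`, assuming string-level comparison is acceptable for a first pass (Spider's official evaluator parses queries into components instead):

```python
# benchmarks/utils/sql_normalize.py (sketch)
import re


def normalize_sql(sql: str) -> str:
    """Normalize SQL for string-level exact-match comparison.

    Deliberately simple baseline: strip trailing semicolons, collapse
    whitespace, unify quote style, and lowercase. Lowercasing also affects
    string literals, which can cause false negatives on exact match.
    """
    sql = sql.strip().rstrip(";")
    sql = re.sub(r"\s+", " ", sql)   # collapse runs of whitespace/newlines
    sql = sql.replace('"', "'")      # coarse quote unification
    return sql.lower()
```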
Phase 2: Core Benchmark Runner
- Implement `AccuracyBenchmark` class
- Add exact match scoring
- Add execution match scoring (run both SQLs, compare results; sketch below)
- Implement confidence and latency tracking
- Add error categorization (syntax, semantic, timeout)
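Execution-match scoring compares result sets rather than SQL text. A sketch against local SQLite files (Spider distributes one SQLite database per `db_id`; the `db_path` layout is an assumption about how the datasets are unpacked):

```python
# Sketch of execution-match scoring against local SQLite benchmark databases.
import sqlite3
from collections import Counter
from pathlib import Path


def execution_match(generated_sql: str, gold_sql: str, db_path: Path) -> bool:
    """Return True when both queries yield the same multiset of rows."""
    def run(sql: str):
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error:
            return None  # syntax/semantic failures count as a mismatch
        finally:
            conn.close()

    generated_rows = run(generated_sql)
    gold_rows = run(gold_sql)
    if generated_rows is None or gold_rows is None:
        return False
    if "order by" in gold_sql.lower():
        return generated_rows == gold_rows            # order is significant
    return Counter(generated_rows) == Counter(gold_rows)
```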
Phase 3: Reporting & CI
- Generate JSON/Markdown/HTML reports
- Add historical comparison (vs baseline, vs previous run)
- Create GitHub Actions workflow for scheduled benchmarks
- Add regression detection with configurable thresholds (sketch below)
- Create badge/dashboard for README
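Regression detection can reduce to comparing the aggregated percentages against the thresholds in `benchmark_config.yaml` and failing the CI job on a miss. A sketch, assuming PyYAML as a dependency and that `BenchmarkResult.dataset` matches the config key:

```python
# Sketch: report threshold misses so CI can fail the run.
import sys

import yaml


def check_thresholds(result, config_path="benchmarks/configs/benchmark_config.yaml") -> bool:
    with open(config_path) as fh:
        thresholds = yaml.safe_load(fh)["datasets"][result.dataset]["thresholds"]

    failures = []
    if result.exact_match_pct < thresholds["exact_match"]:
        failures.append(f"exact_match {result.exact_match_pct:.2%} < {thresholds['exact_match']:.0%}")
    if result.execution_match_pct < thresholds["execution_match"]:
        failures.append(f"execution_match {result.execution_match_pct:.2%} < {thresholds['execution_match']:.0%}")

    for failure in failures:
        print(f"REGRESSION [{result.dataset}]: {failure}", file=sys.stderr)
    return not failures
```

The CI entry point would call this and `sys.exit(1)` when it returns `False`.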
Phase 4: Advanced Metrics
- Component-level accuracy (SELECT, WHERE, JOIN, GROUP BY)
- Complexity-stratified results (simple, medium, complex; sketch below)
- Per-domain accuracy (Spider domains)
- Model comparison (different model versions)
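One possible way to stratify results is a keyword-count heuristic over the gold SQL; the bucket boundaries below are assumptions, not an established taxonomy:

```python
# Sketch: rough complexity bucketing of gold queries for stratified reporting.
def classify_complexity(sql: str) -> str:
    lowered = f" {sql.lower()} "
    score = 0
    score += lowered.count(" join ")
    score += lowered.count("select") - 1                  # nested subqueries
    score += int("group by" in lowered)
    score += int("having" in lowered)
    score += int(any(op in lowered for op in (" union ", " intersect ", " except ")))
    if score == 0:
        return "simple"
    return "medium" if score <= 2 else "complex"
```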
6. FILES TO MODIFY TABLE
| File | Lines | Action | Description |
|---|---|---|---|
| `benchmarks/__init__.py` | NEW | Create | Benchmark package |
| `benchmarks/runners/accuracy_runner.py` | NEW | Create | Core benchmark runner |
| `benchmarks/runners/report_generator.py` | NEW | Create | Report generation |
| `benchmarks/utils/sql_normalize.py` | NEW | Create | SQL normalization for comparison |
| `benchmarks/utils/dataset_loader.py` | NEW | Create | Dataset loading utilities |
| `benchmarks/configs/benchmark_config.yaml` | NEW | Create | Benchmark configuration |
| `scripts/download_datasets.sh` | NEW | Create | Dataset download script |
| `scripts/run_benchmarks.py` | NEW | Create | CLI for running benchmarks (sketch below) |
| `.github/workflows/benchmark.yml` | NEW | Create | Scheduled benchmark workflow |
| `README.md` | TBD | Modify | Add benchmark badge and results |
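For the proposed `scripts/run_benchmarks.py`, a minimal CLI sketch; the flag names are assumptions, and `--limit` maps to the subset mode mentioned in the risk table:

```python
# scripts/run_benchmarks.py (sketch) - flag names are assumptions, not a final CLI.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run Text2SQL accuracy benchmarks")
    parser.add_argument("--dataset", choices=["spider", "wikisql", "custom"], default="spider")
    parser.add_argument("--limit", type=int, default=None,
                        help="Evaluate only the first N examples (subset mode)")
    parser.add_argument("--fail-on-regression", action="store_true",
                        help="Exit non-zero when benchmark_config.yaml thresholds are missed")
    parser.add_argument("--output-dir", default="benchmarks/results")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # Wiring to AccuracyBenchmark and check_thresholds (sketched above) is omitted;
    # this only illustrates the intended command-line surface.
    print(f"Would run {args.dataset} benchmark (limit={args.limit})")
```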
7. RISK ASSESSMENT
| Risk | Impact | Mitigation |
|---|---|---|
| Dataset licensing issues | 🟡 | Use permissively licensed datasets; document licenses |
| Benchmarks are slow (full dataset) | 🟡 | Support subset mode; run full benchmarks on schedule |
| Execution match requires real DBs | 🟡 | Set up benchmark databases; use Docker containers |
| Results vary with model updates | 🟢 | Track historical results; require approval for regressions |
8. RELATED CONTEXT
- Spider dataset: https://yale-lily.github.io/spider
- WikiSQL dataset: https://github.com/salesforce/WikiSQL
- BIRD dataset: https://bird-bench.github.io/
- Current accuracy claims: `README.md:244-250`
- Related issue: [HIGH] End-to-End Integration Tests with Real Model and Database #43 (E2E Integration Tests)