Detailed Model Analysis and Testing Results
Complete Model Testing Results
Full Model Comparison Table
Model
Type
Params
Disk Size
RAM Usage
Exec Accuracy
Avg Latency
Notes
qwen2.5-coder:7b
Domain-specific (code)
7B
4.7 GB
~6 GB
86.00%
5.41s
✅ Best overall
gpt-oss:20b
General purpose
20B
13 GB
~16 GB
90.00%
20.83s
⚠️ 4x slower, JSON errors
qwen2.5:14b
General purpose
14B
9.0 GB
~12 GB
82.00%
10.02s
❌ Worse accuracy, 2x slower
sqlcoder:7b
SQL-specific
7B
4.1 GB
~5 GB
2.00%
2.92s
❌ Failed - JSON/prompt issues
sqlcoder:15b
SQL-specific
15B
9.0 GB
~11 GB
6.00%
0.01s*
❌ Failed - not compatible
deepseek-coder-v2:16b
Domain-specific (code)
16B
8.9 GB
~11 GB
68.00%
4.04s
✅ Good accuracy, slower than 7B
codestral:22b-v0.1-q4_0
Domain-specific (code)
22B
12 GB
~15 GB
82.00%
30.59s
⚠️ Slow, decent accuracy
mistral:7b-instruct
General purpose
7B
4.1 GB
~5 GB
Failed
31-39s*
❌ JSON errors, extremely slow
Why Models Failed: Detailed Analysis
Why gpt-oss:20b Failed Despite Being Larger
Training Data Mismatch : Trained on conversational data, not code/SQL
JSON Generation : Unable to reliably generate structured JSON responses required by the agent
Inference Overhead : Larger model = slower inference without proportional accuracy gains
Context Understanding : Struggles with database schema context compared to code-specific models
Why SQL-Specific Models (sqlcoder) Failed
Prompt Format Incompatibility : sqlcoder models expect different prompt formats than our agent system
JSON Generation Issues : Unable to generate structured JSON required by IntelligentSQLAgent
Outdated Training : Older models may lack modern instruction-following capabilities
Agent Integration : These models were designed for direct SQL generation, not agent-based systems
Why Mistral Models Underperformed
mistral:7b-instruct : JSON generation failures, extremely slow (31-39s/query)
codestral:22b : Despite being code-optimized, 6x slower than qwen2.5-coder with no accuracy gain
Multi-Attempt Strategy: Detailed Results
Spider dev (first 100 samples) - Host Environment
Max Attempts
Exec Accuracy
Exact Match
Avg Latency
Avg Attempts
1
84%
3%
2.43s
1.00
5
85%
4%
3.97s
1.66
7
85%
4%
4.79s
1.94
Spider dev (first 100 samples) - Docker Environment
Max Attempts
Exec Accuracy
Exact Match
Avg Latency
Avg Attempts
1
84%
3%
2.56s
1.00
5
84%
2%
4.22s
1.66
7
84%
3%
4.77s
1.96
Phase-by-Phase Improvements
Version
Exec Accuracy
Exact Match
Avg Latency
Avg Attempts
Improvements
Original
82.00%
0.00%
9.60s
2.74
Baseline
Phase 2
86.00%
14.00%
5.41s
2.50
✅ Error Learning
Phase 3
86.00%
14.00%
5.37s
2.52
+ Semantic Analysis
Error Types Handled by IntelligentSQLAgent
Column ambiguity errors
Missing GROUP BY clauses
Invalid aggregate function usage
Table join errors
Schema mismatch errors
Type conversion errors
Syntax errors
Hardware: MacBook Pro with M-series chip
RAM: 48GB total, ~18GB available
OS: macOS Darwin 24.4.0
Python: 3.8+
Ollama: Latest version
Spider dev: First 50-100 samples
Temperature: 0.0 (deterministic) for single attempt, 0.2 for multi-attempt
Stop-on-success: Enabled
Max attempts: 1, 5, 7
# Test with IntelligentSQLAgent
python benchmarks/sql_benchmark.py \
--benchmark spider \
--use-agent \
--model-name qwen2.5-coder:7b \
--limit 50 \
--max-attempts 5 \
--temperature 0 \
--output results/spider_qwen_coder_50.json
# Quick model test
python test_ollama_models.py