Detailed Model Analysis and Testing Results

Complete Model Testing Results

Full Model Comparison Table

Model	Type	Params	Disk Size	RAM Usage	Exec Accuracy	Avg Latency	Notes
qwen2.5-coder:7b	Domain-specific (code)	7B	4.7 GB	~6 GB	86.00%	5.41s	✅ Best overall
gpt-oss:20b	General purpose	20B	13 GB	~16 GB	90.00%	20.83s	⚠️ 4x slower, JSON errors
qwen2.5:14b	General purpose	14B	9.0 GB	~12 GB	82.00%	10.02s	❌ Worse accuracy, 2x slower
sqlcoder:7b	SQL-specific	7B	4.1 GB	~5 GB	2.00%	2.92s	❌ Failed - JSON/prompt issues
sqlcoder:15b	SQL-specific	15B	9.0 GB	~11 GB	6.00%	0.01s*	❌ Failed - not compatible
deepseek-coder-v2:16b	Domain-specific (code)	16B	8.9 GB	~11 GB	68.00%	4.04s	✅ Good accuracy, slower than 7B
codestral:22b-v0.1-q4_0	Domain-specific (code)	22B	12 GB	~15 GB	82.00%	30.59s	⚠️ Slow, decent accuracy
mistral:7b-instruct	General purpose	7B	4.1 GB	~5 GB	Failed	31-39s*	❌ JSON errors, extremely slow

Why Models Failed: Detailed Analysis

Why gpt-oss:20b Failed Despite Being Larger

Training Data Mismatch: Trained on conversational data, not code/SQL
JSON Generation: Unable to reliably generate structured JSON responses required by the agent
Inference Overhead: Larger model = slower inference without proportional accuracy gains
Context Understanding: Struggles with database schema context compared to code-specific models

Why SQL-Specific Models (sqlcoder) Failed

Prompt Format Incompatibility: sqlcoder models expect different prompt formats than our agent system
JSON Generation Issues: Unable to generate structured JSON required by IntelligentSQLAgent
Outdated Training: Older models may lack modern instruction-following capabilities
Agent Integration: These models were designed for direct SQL generation, not agent-based systems

Why Mistral Models Underperformed

mistral:7b-instruct: JSON generation failures, extremely slow (31-39s/query)
codestral:22b: Despite being code-optimized, 6x slower than qwen2.5-coder with no accuracy gain

Multi-Attempt Strategy: Detailed Results

Spider dev (first 100 samples) - Host Environment

Max Attempts	Exec Accuracy	Exact Match	Avg Latency	Avg Attempts
1	84%	3%	2.43s	1.00
5	85%	4%	3.97s	1.66
7	85%	4%	4.79s	1.94

Spider dev (first 100 samples) - Docker Environment

Max Attempts	Exec Accuracy	Exact Match	Avg Latency	Avg Attempts
1	84%	3%	2.56s	1.00
5	84%	2%	4.22s	1.66
7	84%	3%	4.77s	1.96

Agent Evolution Details

Phase-by-Phase Improvements

Version	Exec Accuracy	Exact Match	Avg Latency	Avg Attempts	Improvements
Original	82.00%	0.00%	9.60s	2.74	Baseline
Phase 2	86.00%	14.00%	5.41s	2.50	✅ Error Learning
Phase 3	86.00%	14.00%	5.37s	2.52	+ Semantic Analysis

Error Types Handled by IntelligentSQLAgent

Column ambiguity errors
Missing GROUP BY clauses
Invalid aggregate function usage
Table join errors
Schema mismatch errors
Type conversion errors
Syntax errors

Testing Methodology

Test Environment

Hardware: MacBook Pro with M-series chip
RAM: 48GB total, ~18GB available
OS: macOS Darwin 24.4.0
Python: 3.8+
Ollama: Latest version

Benchmark Datasets

Spider dev: First 50-100 samples
Temperature: 0.0 (deterministic) for single attempt, 0.2 for multi-attempt
Stop-on-success: Enabled
Max attempts: 1, 5, 7

Reproduction Commands

# Test with IntelligentSQLAgent
python benchmarks/sql_benchmark.py \
    --benchmark spider \
    --use-agent \
    --model-name qwen2.5-coder:7b \
    --limit 50 \
    --max-attempts 5 \
    --temperature 0 \
    --output results/spider_qwen_coder_50.json

# Quick model test
python test_ollama_models.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detailed Model Analysis and Testing Results

Complete Model Testing Results

Full Model Comparison Table

Why Models Failed: Detailed Analysis

Why gpt-oss:20b Failed Despite Being Larger

Why SQL-Specific Models (sqlcoder) Failed

Why Mistral Models Underperformed

Multi-Attempt Strategy: Detailed Results

Spider dev (first 100 samples) - Host Environment

Spider dev (first 100 samples) - Docker Environment

Agent Evolution Details

Phase-by-Phase Improvements

Error Types Handled by IntelligentSQLAgent

Testing Methodology

Test Environment

Benchmark Datasets

Reproduction Commands

FilesExpand file tree

detailed_model_analysis.md

Latest commit

History

detailed_model_analysis.md

File metadata and controls

Detailed Model Analysis and Testing Results

Complete Model Testing Results

Full Model Comparison Table

Why Models Failed: Detailed Analysis

Why gpt-oss:20b Failed Despite Being Larger

Why SQL-Specific Models (sqlcoder) Failed

Why Mistral Models Underperformed

Multi-Attempt Strategy: Detailed Results

Spider dev (first 100 samples) - Host Environment

Spider dev (first 100 samples) - Docker Environment

Agent Evolution Details

Phase-by-Phase Improvements

Error Types Handled by IntelligentSQLAgent

Testing Methodology

Test Environment

Benchmark Datasets

Reproduction Commands