Skip to content

Latest commit

 

History

History
98 lines (81 loc) · 4.27 KB

File metadata and controls

98 lines (81 loc) · 4.27 KB

Detailed Model Analysis and Testing Results

Complete Model Testing Results

Full Model Comparison Table

Model Type Params Disk Size RAM Usage Exec Accuracy Avg Latency Notes
qwen2.5-coder:7b Domain-specific (code) 7B 4.7 GB ~6 GB 86.00% 5.41s ✅ Best overall
gpt-oss:20b General purpose 20B 13 GB ~16 GB 90.00% 20.83s ⚠️ 4x slower, JSON errors
qwen2.5:14b General purpose 14B 9.0 GB ~12 GB 82.00% 10.02s ❌ Worse accuracy, 2x slower
sqlcoder:7b SQL-specific 7B 4.1 GB ~5 GB 2.00% 2.92s ❌ Failed - JSON/prompt issues
sqlcoder:15b SQL-specific 15B 9.0 GB ~11 GB 6.00% 0.01s* ❌ Failed - not compatible
deepseek-coder-v2:16b Domain-specific (code) 16B 8.9 GB ~11 GB 68.00% 4.04s ✅ Good accuracy, slower than 7B
codestral:22b-v0.1-q4_0 Domain-specific (code) 22B 12 GB ~15 GB 82.00% 30.59s ⚠️ Slow, decent accuracy
mistral:7b-instruct General purpose 7B 4.1 GB ~5 GB Failed 31-39s* ❌ JSON errors, extremely slow

Why Models Failed: Detailed Analysis

Why gpt-oss:20b Failed Despite Being Larger

  1. Training Data Mismatch: Trained on conversational data, not code/SQL
  2. JSON Generation: Unable to reliably generate structured JSON responses required by the agent
  3. Inference Overhead: Larger model = slower inference without proportional accuracy gains
  4. Context Understanding: Struggles with database schema context compared to code-specific models

Why SQL-Specific Models (sqlcoder) Failed

  1. Prompt Format Incompatibility: sqlcoder models expect different prompt formats than our agent system
  2. JSON Generation Issues: Unable to generate structured JSON required by IntelligentSQLAgent
  3. Outdated Training: Older models may lack modern instruction-following capabilities
  4. Agent Integration: These models were designed for direct SQL generation, not agent-based systems

Why Mistral Models Underperformed

  • mistral:7b-instruct: JSON generation failures, extremely slow (31-39s/query)
  • codestral:22b: Despite being code-optimized, 6x slower than qwen2.5-coder with no accuracy gain

Multi-Attempt Strategy: Detailed Results

Spider dev (first 100 samples) - Host Environment

Max Attempts Exec Accuracy Exact Match Avg Latency Avg Attempts
1 84% 3% 2.43s 1.00
5 85% 4% 3.97s 1.66
7 85% 4% 4.79s 1.94

Spider dev (first 100 samples) - Docker Environment

Max Attempts Exec Accuracy Exact Match Avg Latency Avg Attempts
1 84% 3% 2.56s 1.00
5 84% 2% 4.22s 1.66
7 84% 3% 4.77s 1.96

Agent Evolution Details

Phase-by-Phase Improvements

Version Exec Accuracy Exact Match Avg Latency Avg Attempts Improvements
Original 82.00% 0.00% 9.60s 2.74 Baseline
Phase 2 86.00% 14.00% 5.41s 2.50 ✅ Error Learning
Phase 3 86.00% 14.00% 5.37s 2.52 + Semantic Analysis

Error Types Handled by IntelligentSQLAgent

  1. Column ambiguity errors
  2. Missing GROUP BY clauses
  3. Invalid aggregate function usage
  4. Table join errors
  5. Schema mismatch errors
  6. Type conversion errors
  7. Syntax errors

Testing Methodology

Test Environment

  • Hardware: MacBook Pro with M-series chip
  • RAM: 48GB total, ~18GB available
  • OS: macOS Darwin 24.4.0
  • Python: 3.8+
  • Ollama: Latest version

Benchmark Datasets

  • Spider dev: First 50-100 samples
  • Temperature: 0.0 (deterministic) for single attempt, 0.2 for multi-attempt
  • Stop-on-success: Enabled
  • Max attempts: 1, 5, 7

Reproduction Commands

# Test with IntelligentSQLAgent
python benchmarks/sql_benchmark.py \
    --benchmark spider \
    --use-agent \
    --model-name qwen2.5-coder:7b \
    --limit 50 \
    --max-attempts 5 \
    --temperature 0 \
    --output results/spider_qwen_coder_50.json

# Quick model test
python test_ollama_models.py