🔴 Priority: HIGH | Type: Testing
1. SUMMARY
- Current integration tests use mocked models and databases, missing true end-to-end validation of the full pipeline (real model inference → real database execution → result verification).
- Impact: Cannot verify actual accuracy, performance regressions, or model behavior changes in CI. Production bugs may slip through despite 695 unit tests passing.
2. SYSTEM CONTEXT
tests/
├── integration/
│   ├── test_api.py        # API tests with mocked inference
│   └── test_database.py   # DB tests with SQLite
├── unit/                  # 40+ unit test files (all mocked)
└── conftest.py            # Fixtures with mocked models
app/
├── agent/engine.py        # Agent orchestration (untested E2E)
├── text2sql_engine.py     # SQL generation (untested E2E)
└── streaming.py           # SSE streaming (untested E2E)
models/
├── loader.py              # Model loading (tested in isolation)
└── inference.py           # Inference (tested with mocks)
Current state: Unit tests mock InferenceEngine, ModelLoader, and database connections. No tests run actual model inference against real databases.
3. CURRENT STATE (with code)
📄 File: tests/conftest.py:45-60 (approximate)
@pytest.fixture
def mock_inference_engine():
    """Mock inference engine for testing."""
    engine = MagicMock(spec=InferenceEngine)
    engine.generate.return_value = InferenceResult(
        generated_text="SELECT * FROM users",
        sql="SELECT * FROM users",
        confidence=0.95,
        ...
    )
    return engine

All tests use mocked inference - no real model calls.
📄 File: tests/integration/test_api.py:49-80
class TestQueryEndpoint:
    """Tests for query generation endpoint."""
    # Uses test_client fixture which mocks the model
    # Never runs actual inference

Integration tests don't exercise the actual inference pipeline.
4. PROPOSED SOLUTION
Create a dedicated E2E test suite that:
- Loads the actual model (or a smaller test model)
- Connects to a real test database with known schema
- Runs natural language queries through the full pipeline
- Verifies SQL correctness and result accuracy
📄 File: tests/e2e/test_full_pipeline.py (NEW)
import pytest

from app.agent import get_agent_engine
from db.connection import get_database
# Seeding utilities and EXPECTED_USER_COUNT come from the planned tests/e2e/seed_data.py (see section 6)
from tests.e2e.seed_data import EXPECTED_USER_COUNT, cleanup_test_data, seed_test_data


@pytest.fixture(scope="module")
async def real_agent():
    """Load real agent with actual model."""
    engine = await get_agent_engine()
    yield engine


@pytest.fixture(scope="module")
async def test_db():
    """Real test database with known data."""
    db = await get_database()
    # Seed with known test data
    await seed_test_data(db)
    yield db
    await cleanup_test_data(db)


class TestFullPipeline:
    @pytest.mark.e2e
    @pytest.mark.slow
    async def test_simple_select(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Show all users",
            database_id="test",
            execute=True,
        )
        assert result.sql is not None
        assert "SELECT" in result.sql.upper()
        assert "users" in result.sql.lower()
        assert len(result.results) == EXPECTED_USER_COUNT

    @pytest.mark.e2e
    async def test_aggregation(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Count orders by status",
            database_id="test",
            execute=True,
        )
        assert "GROUP BY" in result.sql.upper()
        assert result.confidence >= 0.7

5. IMPLEMENTATION CHECKLIST
Phase 1: Infrastructure
- Create tests/e2e/ directory structure
- Add E2E pytest markers and configuration
- Create test database seeding scripts with known data (see the sketch after this list)
- Add CI job for E2E tests (separate from unit tests, runs on GPU runner or with small model)
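As a starting point for the seeding scripts, here is a minimal sketch of the planned seed_data.py. The users/orders schema, column names, and row counts are assumptions chosen to match the example tests above, and the async db.execute call signature is an assumption about the db.connection wrapper; EXPECTED_USER_COUNT is the constant the simple SELECT test asserts against.

📄 File: tests/e2e/seed_data.py (NEW, sketch)
# Deterministic seed data for E2E tests. Schema and row counts are illustrative
# assumptions; adjust to the real test schema.
EXPECTED_USER_COUNT = 5

USERS = [(i, f"user{i}", f"user{i}@example.com") for i in range(1, EXPECTED_USER_COUNT + 1)]
ORDERS = [(1, 1, "shipped"), (2, 1, "pending"), (3, 2, "shipped")]

async def seed_test_data(db):
    """Create the known test tables and insert deterministic rows."""
    await db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    await db.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)")
    for row in USERS:
        await db.execute("INSERT INTO users VALUES (?, ?, ?)", row)
    for row in ORDERS:
        await db.execute("INSERT INTO orders VALUES (?, ?, ?)", row)

async def cleanup_test_data(db):
    """Drop the seeded tables so repeated runs start clean."""
    await db.execute("DROP TABLE IF EXISTS orders")
    await db.execute("DROP TABLE IF EXISTS users")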
Phase 2: Core E2E Tests
- Simple SELECT queries (single table)
- JOIN queries (multi-table) - see the sketch after this list
- Aggregation queries (GROUP BY, SUM, COUNT)
- Filtering queries (WHERE clauses)
- Ordering and pagination (ORDER BY, LIMIT)
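For the JOIN and filtering cases, a sketch in the same style as test_full_pipeline.py above; the users/orders tables and the natural-language phrasings are assumptions about the seeded test schema.

📄 File: tests/e2e/test_full_pipeline.py (sketch of additional Phase 2 cases)
import pytest

class TestRelationalQueries:
    @pytest.mark.e2e
    @pytest.mark.slow
    async def test_join(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Show each order together with the name of the user who placed it",
            database_id="test",
            execute=True,
        )
        assert "JOIN" in result.sql.upper()
        assert "users" in result.sql.lower() and "orders" in result.sql.lower()

    @pytest.mark.e2e
    async def test_filtering(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Show orders with status shipped",
            database_id="test",
            execute=True,
        )
        assert "WHERE" in result.sql.upper()
        assert "shipped" in result.sql.lower()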
Phase 3: Advanced E2E Tests
- Multi-database routing tests (see the sketch after this list)
- Streaming endpoint tests
- Self-correction behavior tests
- Error handling and recovery tests
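A hedged sketch for the multi-database routing case, reusing the generate_sql signature shown in section 4. It assumes the real_agent fixture is promoted to tests/e2e/conftest.py; the second database id ("analytics") is a placeholder, and how the result records the targeted database depends on the real result schema.

📄 File: tests/e2e/test_multidb_e2e.py (NEW, sketch)
import pytest

class TestMultiDatabaseRouting:
    @pytest.mark.e2e
    async def test_routes_to_requested_database(self, real_agent):
        # "analytics" is a placeholder id for a second seeded test database.
        result = await real_agent.generate_sql(
            natural_query="Show all users",
            database_id="analytics",
            execute=False,
        )
        assert result.sql is not None
        # If the result model records the target database, assert on it here;
        # the attribute name depends on the real result schema.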
Phase 4: Performance E2E Tests
- Latency benchmarks (p50, p95, p99) - see the sketch after this list
- Throughput under load
- Memory usage during inference
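A minimal latency-benchmark sketch for Phase 4. The file name is hypothetical, it assumes the real_agent/test_db fixtures live in tests/e2e/conftest.py, only generation latency (execute=False) is measured, and the sample size and thresholds are placeholders rather than agreed targets.

📄 File: tests/e2e/test_performance_e2e.py (hypothetical sketch)
import statistics
import time

import pytest

@pytest.mark.e2e
@pytest.mark.slow
async def test_latency_percentiles(real_agent, test_db):
    latencies = []
    for _ in range(20):  # small sample; increase for stable p95/p99 numbers
        start = time.perf_counter()
        await real_agent.generate_sql(
            natural_query="Show all users",
            database_id="test",
            execute=False,
        )
        latencies.append(time.perf_counter() - start)

    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    print(f"p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s")
    # Threshold below is an illustrative placeholder, not an SLO.
    assert p95 < 10.0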
6. FILES TO MODIFY TABLE
| File | Lines | Action | Description |
|---|---|---|---|
| `tests/e2e/__init__.py` | NEW | Create | E2E test package |
| `tests/e2e/conftest.py` | NEW | Create | Real fixtures (model, database) |
| `tests/e2e/test_full_pipeline.py` | NEW | Create | Core pipeline tests |
| `tests/e2e/test_streaming_e2e.py` | NEW | Create | Streaming endpoint tests |
| `tests/e2e/test_multidb_e2e.py` | NEW | Create | Multi-database tests |
| `tests/e2e/seed_data.py` | NEW | Create | Test data seeding utilities |
| `pyproject.toml` | TBD | Modify | Add E2E markers and config (marker sketch below) |
| `.github/workflows/ci.yml` | TBD | Modify | Add E2E test job |
| `docker-compose.test.yml` | NEW | Create | Test environment with real DB |
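For the pyproject.toml row above, the two markers used in the examples (e2e, slow) must be registered so pytest does not warn about unknown marks. They can be listed under [tool.pytest.ini_options].markers in pyproject.toml; the sketch below shows the equivalent programmatic registration from a conftest instead.

📄 File: tests/conftest.py (sketch: marker registration; can live in pyproject.toml instead)
def pytest_configure(config):
    # Register the custom markers used by the E2E suite.
    config.addinivalue_line("markers", "e2e: end-to-end tests that load a real model and hit a real database")
    config.addinivalue_line("markers", "slow: long-running tests (model loading, benchmarks)")

With the markers registered, the E2E CI job can select only these tests (pytest -m e2e tests/e2e/) while the default unit-test run excludes them (pytest -m "not e2e").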
7. RISK ASSESSMENT
| Risk | Impact | Mitigation |
|---|---|---|
| E2E tests are slow (model loading) | 🟡 | Run separately from unit tests; use module-scoped fixtures |
| GPU required for real inference | 🟡 | Support CPU-only mode with smaller model; GitHub GPU runners |
| Flaky tests due to model non-determinism | 🟡 | Set random seeds; use temperature=0; allow confidence ranges (seeding sketch below) |
| Test data pollution | 🟢 | Use isolated test database; cleanup fixtures |
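For the non-determinism row, a sketch of the RNG-seeding half of the mitigation is below. It assumes the stack may use Python's random, NumPy, and PyTorch RNGs; forcing temperature=0 would additionally have to go through whatever generation-parameter surface InferenceEngine exposes.

📄 File: tests/e2e/conftest.py (sketch: RNG seeding for reproducibility)
import random

import pytest

@pytest.fixture(autouse=True)
def deterministic_rngs():
    """Seed every RNG the inference stack might touch before each E2E test."""
    random.seed(0)
    try:
        import numpy as np
        np.random.seed(0)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(0)
    except ImportError:
        pass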
8. RELATED CONTEXT
- Current test suite: 695 unit tests, all mocked
- CI pipeline: `.github/workflows/ci.yml`
- Model loading: `models/loader.py`
- Agent engine: `app/agent/engine.py`
- Benchmark datasets (Spider, WikiSQL) could be used for accuracy validation in Phase 4 (see the sketch below)
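If the Spider/WikiSQL idea is picked up in Phase 4, a rough sketch is below. The fixture path and file name are hypothetical, the field names (question, query, db_id) follow the public Spider JSON format, it assumes the real_agent fixture lives in tests/e2e/conftest.py, and it only asserts that some SQL was generated because exact string match is too strict for SQL equivalence.

📄 File: tests/e2e/test_benchmark_sample.py (hypothetical sketch)
import json

import pytest

def load_examples(path="tests/e2e/fixtures/spider_sample.json"):
    """Load a small, checked-in slice of Spider-style examples."""
    with open(path) as f:
        return json.load(f)

@pytest.mark.e2e
@pytest.mark.slow
async def test_benchmark_sample(real_agent):
    for example in load_examples()[:10]:
        result = await real_agent.generate_sql(
            natural_query=example["question"],
            database_id=example["db_id"],
            execute=False,
        )
        assert result.sql, f"no SQL generated for: {example['question']}"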