
[HIGH] End-to-End Integration Tests with Real Model and Database #43


🔴 Priority: HIGH | Type: Testing

1. SUMMARY

  • Current integration tests use mocked models and databases, missing true end-to-end validation of the full pipeline (real model inference → real database execution → result verification).
  • Impact: Cannot verify actual accuracy, performance regressions, or model behavior changes in CI. Production bugs may slip through despite 695 unit tests passing.

2. SYSTEM CONTEXT

tests/
├── integration/
│   ├── test_api.py           # API tests with mocked inference
│   └── test_database.py      # DB tests with SQLite
├── unit/                     # 40+ unit test files (all mocked)
└── conftest.py               # Fixtures with mocked models

app/
├── agent/engine.py           # Agent orchestration (untested E2E)
├── text2sql_engine.py        # SQL generation (untested E2E)
└── streaming.py              # SSE streaming (untested E2E)

models/
├── loader.py                 # Model loading (tested in isolation)
└── inference.py              # Inference (tested with mocks)

Current state: Unit tests mock InferenceEngine, ModelLoader, and database connections. No tests run actual model inference against real databases.

3. CURRENT STATE (with code)

📄 File: tests/conftest.py:45-60 (approximate)

import pytest
from unittest.mock import MagicMock

# Import paths assumed from the repo layout in section 2.
from models.inference import InferenceEngine, InferenceResult

@pytest.fixture
def mock_inference_engine():
    """Mock inference engine for testing."""
    engine = MagicMock(spec=InferenceEngine)
    engine.generate.return_value = InferenceResult(
        generated_text="SELECT * FROM users",
        sql="SELECT * FROM users",
        confidence=0.95,
        # ... remaining fields elided in the original snippet
    )
    return engine

All tests use mocked inference; no real model calls are ever made.

📄 File: tests/integration/test_api.py:49-80

class TestQueryEndpoint:
    """Tests for query generation endpoint."""
    # Uses test_client fixture which mocks the model
    # Never runs actual inference

Integration tests don't test the actual inference pipeline.

4. PROPOSED SOLUTION

Create a dedicated E2E test suite that:

  1. Loads the actual model (or a smaller test model)
  2. Connects to a real test database with known schema
  3. Runs natural language queries through the full pipeline
  4. Verifies SQL correctness and result accuracy

📄 File: tests/e2e/test_full_pipeline.py (NEW)

import pytest
import pytest_asyncio

from app.agent import get_agent_engine
from db.connection import get_database
# Seeding helpers live in the planned tests/e2e/seed_data.py (section 6);
# EXPECTED_USER_COUNT must match the rows that seed_test_data() inserts.
from tests.e2e.seed_data import (
    EXPECTED_USER_COUNT,
    cleanup_test_data,
    seed_test_data,
)

# Async fixtures require pytest-asyncio; module scope amortizes model
# loading across the whole module (pytest-asyncio may also need a
# matching event-loop scope, depending on version/configuration).
@pytest_asyncio.fixture(scope="module")
async def real_agent():
    """Load real agent with actual model."""
    engine = await get_agent_engine()
    yield engine

@pytest_asyncio.fixture(scope="module")
async def test_db():
    """Real test database with known data."""
    db = await get_database()
    await seed_test_data(db)  # seed with known test data
    yield db
    await cleanup_test_data(db)

@pytest.mark.asyncio  # applies to every async test in the class
class TestFullPipeline:
    @pytest.mark.e2e
    @pytest.mark.slow
    async def test_simple_select(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Show all users",
            database_id="test",
            execute=True,
        )
        assert result.sql is not None
        assert "SELECT" in result.sql.upper()
        assert "users" in result.sql.lower()
        assert len(result.results) == EXPECTED_USER_COUNT

    @pytest.mark.e2e
    async def test_aggregation(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Count orders by status",
            database_id="test",
            execute=True,
        )
        assert "GROUP BY" in result.sql.upper()
        assert result.confidence >= 0.7
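
A minimal sketch of the seeding helpers imported above; the schema (a users table with five rows) and the async execute API with named parameters are illustrative assumptions:

📄 File: tests/e2e/seed_data.py (NEW, sketch)

# Fixed, known dataset so E2E assertions can use exact counts.
EXPECTED_USER_COUNT = 5

USERS = [
    (1, "alice"),
    (2, "bob"),
    (3, "carol"),
    (4, "dave"),
    (5, "erin"),
]

async def seed_test_data(db):
    """Insert the fixed dataset into the test database."""
    for user_id, name in USERS:
        await db.execute(
            "INSERT INTO users (id, name) VALUES (:id, :name)",
            {"id": user_id, "name": name},
        )

async def cleanup_test_data(db):
    """Remove everything the seeding step created."""
    await db.execute("DELETE FROM users")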

5. IMPLEMENTATION CHECKLIST

Phase 1: Infrastructure

  • Create tests/e2e/ directory structure
  • Add E2E pytest markers and configuration (see the conftest sketch after this list)
  • Create test database seeding scripts with known data
  • Add CI job for E2E tests (separate from unit tests, runs on GPU runner or with small model)
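
For the marker/configuration item above, a minimal tests/e2e/conftest.py sketch; the RUN_E2E opt-in environment variable is an assumption, not an existing convention in this repo:

# tests/e2e/conftest.py (sketch)
import os

import pytest

def pytest_configure(config):
    # Register the E2E markers so pytest does not emit unknown-marker warnings.
    config.addinivalue_line("markers", "e2e: full-pipeline test with real model and database")
    config.addinivalue_line("markers", "slow: test that loads a model or large fixture")

def pytest_collection_modifyitems(config, items):
    # Keep the default `pytest` invocation fast and mock-only:
    # skip E2E tests unless RUN_E2E=1 (assumed opt-in switch) is set.
    if os.environ.get("RUN_E2E") == "1":
        return
    skip_e2e = pytest.mark.skip(reason="set RUN_E2E=1 to run E2E tests")
    for item in items:
        if "e2e" in item.keywords:
            item.add_marker(skip_e2e)

The same markers could instead be registered under [tool.pytest.ini_options] in pyproject.toml, per the files table in section 6.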

Phase 2: Core E2E Tests

  • Simple SELECT queries (single table)
  • JOIN queries (multi-table)
  • Aggregation queries (GROUP BY, SUM, COUNT)
  • Filtering queries (WHERE clauses)
  • Ordering and pagination (ORDER BY, LIMIT; all five categories are exercised in the parametrized sketch below)
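
The five categories above can be driven by one parametrized test; the natural-language phrasings and expected SQL fragments are illustrative assumptions about the seeded schema:

import pytest

CASES = [
    ("Show all users", ["SELECT"]),                               # simple SELECT
    ("List orders with customer names", ["JOIN"]),                # multi-table
    ("Count orders by status", ["GROUP BY", "COUNT"]),            # aggregation
    ("Show users named alice", ["WHERE"]),                        # filtering
    ("Top 10 customers by total spend", ["ORDER BY", "LIMIT"]),   # ordering/pagination
]

@pytest.mark.e2e
@pytest.mark.asyncio
@pytest.mark.parametrize("query,fragments", CASES)
async def test_core_query_categories(real_agent, test_db, query, fragments):
    result = await real_agent.generate_sql(
        natural_query=query,
        database_id="test",
        execute=True,
    )
    for fragment in fragments:
        assert fragment in result.sql.upper()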

Phase 3: Advanced E2E Tests

  • Multi-database routing tests
  • Streaming endpoint tests (sketched after this list)
  • Self-correction behavior tests
  • Error handling and recovery tests
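
A sketch of the streaming endpoint test; the route path, payload shape, and local base URL are assumptions to be adjusted against the actual SSE route in app/streaming.py:

import httpx
import pytest

@pytest.mark.e2e
@pytest.mark.asyncio
async def test_streaming_emits_sse_events():
    # Assumed route and payload; adjust to the real SSE endpoint.
    async with httpx.AsyncClient(base_url="http://localhost:8000") as client:
        async with client.stream(
            "POST",
            "/query/stream",
            json={"natural_query": "Show all users", "database_id": "test"},
        ) as response:
            assert response.status_code == 200
            events = [
                line async for line in response.aiter_lines()
                if line.startswith("data:")
            ]
    assert events, "expected at least one SSE data event"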

Phase 4: Performance E2E Tests

  • Latency benchmarks (p50, p95, p99; sketched after this list)
  • Throughput under load
  • Memory usage during inference
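
For the latency item, a sketch that measures percentiles over repeated runs; the 20-sample count and the 2-second p95 budget are illustrative, not project requirements:

import statistics
import time

import pytest

@pytest.mark.e2e
@pytest.mark.slow
@pytest.mark.asyncio
async def test_latency_percentiles(real_agent, test_db):
    latencies = []
    for _ in range(20):
        start = time.perf_counter()
        await real_agent.generate_sql(
            natural_query="Show all users",
            database_id="test",
            execute=True,
        )
        latencies.append(time.perf_counter() - start)
    # statistics.quantiles(n=100) returns the 1st..99th percentile cut points.
    pct = statistics.quantiles(latencies, n=100)
    p50, p95, p99 = pct[49], pct[94], pct[98]
    assert p95 < 2.0, f"p95={p95:.2f}s (p50={p50:.2f}s, p99={p99:.2f}s)"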

6. FILES TO MODIFY

File                             Lines  Action  Description
tests/e2e/__init__.py            NEW    Create  E2E test package
tests/e2e/conftest.py            NEW    Create  Real fixtures (model, database)
tests/e2e/test_full_pipeline.py  NEW    Create  Core pipeline tests
tests/e2e/test_streaming_e2e.py  NEW    Create  Streaming endpoint tests
tests/e2e/test_multidb_e2e.py    NEW    Create  Multi-database tests
tests/e2e/seed_data.py           NEW    Create  Test data seeding utilities
pyproject.toml                   TBD    Modify  Add E2E markers and config
.github/workflows/ci.yml         TBD    Modify  Add E2E test job
docker-compose.test.yml          NEW    Create  Test environment with real DB

7. RISK ASSESSMENT

Risk                                      Impact  Mitigation
E2E tests are slow (model loading)        🟡      Run separately from unit tests; use module-scoped fixtures
GPU required for real inference           🟡      Support CPU-only mode with a smaller model; GitHub GPU runners
Flaky tests due to model non-determinism  🟡      Set random seeds; use temperature=0; allow confidence ranges
Test data pollution                       🟢      Use an isolated test database; cleanup fixtures
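
For the non-determinism row, a sketch of an autouse seeding fixture; the PyTorch backend is an assumption based on the GPU requirement above, and temperature=0 must be set wherever models/inference.py exposes generation parameters:

import random

import pytest

@pytest.fixture(autouse=True)
def deterministic_inference():
    """Pin RNG state so repeated E2E runs produce comparable output."""
    random.seed(0)
    try:
        import torch  # assumed PyTorch-backed model
        torch.manual_seed(0)
    except ImportError:
        pass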

8. RELATED CONTEXT

  • Current test suite: 695 unit tests, all mocked
  • CI pipeline: .github/workflows/ci.yml
  • Model loading: models/loader.py
  • Agent engine: app/agent/engine.py
  • Benchmark datasets (Spider, WikiSQL) could be used for accuracy validation in Phase 4
