Skip to content

Insufficient test coverage and CI coverage for production validator#81

Open
machine-god-deus wants to merge 1 commit into
mainfrom
agent/issue-79-insufficient-test-coverage-and-ci-coverage-for-production-validator
Open

Insufficient test coverage and CI coverage for production validator#81
machine-god-deus wants to merge 1 commit into
mainfrom
agent/issue-79-insufficient-test-coverage-and-ci-coverage-for-production-validator

Conversation

@machine-god-deus

@machine-god-deus machine-god-deus commented Apr 4, 2026

Copy link
Copy Markdown

Summary

  • Expanded /health endpoint in src/crusades/api/server.py to report DB connectivity, evaluations in the last hour, queue depth, and degraded/unhealthy status instead of a static {"status": "healthy"}
  • Added scripts/health_monitor.py: a standalone loop that probes the validator /health endpoint on a configurable interval and posts Discord webhook alerts on failure, with cooldown and recovery notifications
  • Added health-monitor service to docker/compose.yml under a monitoring profile, wired to the validator container
  • Created CI workflow .github/workflows/test.yml running lint (ruff), pytest with coverage (--cov), and coverage reporting on PRs (threshold 60%)
  • Added test suite across 7 files (tests/test_api.py, test_health_monitor.py, test_config.py, test_database.py, test_exceptions.py, test_models.py, conftest.py) — 147 tests passing covering API endpoints, health monitor logic, config loading, database operations, exception handling, and models

Validation

  • pytest tests/ -v — 147 passed, 0 failed
  • Health monitor tests cover evaluate_health() across 4 status paths, post_discord() with and without webhook, and probe_health() for success/URLError/generic exception scenarios
  • API health check tests verify full response shape, degraded status on zero evaluations, and unhealthy status on exception

Risks

  • The health-monitor Docker service depends on scripts/health_monitor.py being present at the repo root; renaming or moving the script without updating the compose volume mount will silently break monitoring
  • Coverage threshold in CI is set to 60%, below the 80% target mentioned in the issue — may need to be raised as more modules gain coverage

Closes #79

Summary by CodeRabbit

Release Notes

  • New Features

    • Added health monitoring service that continuously checks validator status and sends Discord alerts on failures and recovery.
    • Enhanced health endpoint with live database status checks, queue depth, and evaluation metrics.
  • Chores

    • Added automated testing workflow with linting and code coverage reporting.
    • Expanded test suite with comprehensive coverage of core functionality.

@coderabbitai

coderabbitai Bot commented Apr 4, 2026

Copy link
Copy Markdown

Walkthrough

A comprehensive testing and monitoring update adds GitHub Actions CI/CD with automated testing, a health monitoring script with Discord alerting, enhanced API health endpoint with database status, extensive test suite covering configuration/database/API/models/exceptions, and pytest fixtures for test infrastructure.

Changes

Cohort / File(s) Summary
CI/CD Infrastructure
.github/workflows/test.yml
Added GitHub Actions workflow running Ruff linting and pytest with coverage reporting (threshold: 60%) on pull requests and main branch pushes.
Health Monitoring
scripts/health_monitor.py, docker/compose.yml, src/crusades/api/server.py
Added health monitor script that probes validator /health endpoint, evaluates status, manages alert state with cooldown, and posts Discord notifications on health transitions. Updated /health endpoint to return live database checks, evaluations count, queue depth, and connection status. Added Docker health-monitor service under monitoring profile with configurable thresholds.
Test Infrastructure
tests/conftest.py
Added pytest fixtures: db (in-memory SQLite database), seeded_db (pre-populated with submissions/evaluations), and mock_db_client (MockClient for tests).
Test Modules
tests/test_api.py, tests/test_config.py, tests/test_database.py, tests/test_exceptions.py, tests/test_health_monitor.py, tests/test_models.py
Added comprehensive test suites covering API endpoints (health, stats, leaderboard, submissions), configuration loading/validation, database CRUD/leaderboard/validator-state operations, exception classification, health monitor probe/evaluate/notify logic, and SQLAlchemy model defaults/table registration.
Dependencies
pyproject.toml
Added pytest-mock>=3.12.0 to dev optional dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant HealthMonitor as Health Monitor<br/>(Script)
    participant ValidatorAPI as Validator API<br/>(/health)
    participant Database as Database
    participant Discord as Discord<br/>Webhook

    loop Every CHECK_INTERVAL_SECONDS
        HealthMonitor->>ValidatorAPI: GET /health (timeout: 10s)
        ValidatorAPI->>Database: get_validator_status()
        Database-->>ValidatorAPI: status, evaluations_1h, queue_depth
        ValidatorAPI-->>HealthMonitor: {status, db_connected, ...}
        
        alt Response OK
            HealthMonitor->>HealthMonitor: evaluate_health(response)
            alt Healthy
                alt Was Alerting
                    HealthMonitor->>Discord: POST recovery (green)
                    HealthMonitor->>HealthMonitor: is_alerting = false
                else Normal
                    HealthMonitor->>HealthMonitor: log healthy status
                end
            else Unhealthy
                alt Cooldown Expired
                    HealthMonitor->>Discord: POST alert (red)
                    HealthMonitor->>HealthMonitor: is_alerting = true
                else In Cooldown
                    HealthMonitor->>HealthMonitor: log failure (no notify)
                end
            end
        else Network/Error
            HealthMonitor->>HealthMonitor: evaluate_health({error: "Unreachable"})
            alt Cooldown Expired
                HealthMonitor->>Discord: POST alert (red)
            end
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Fix Truncate Sequence Length #17: Modified EvaluationErrorCode enum members and is_fatal classification logic; this PR adds comprehensive tests asserting on the same enum's classification behavior, creating a direct dependency on the enum's semantics.

Suggested reviewers

  • joellidin

Poem

🐰 whispers with glee
New tests bloom like clover in the spring,
Coverage grows, and health checks take wing—
Discord alerts when the validator stumbles,
A watchful bunny ensures nothing crumbles! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR addresses multiple core objectives from issue #79: health checks with Discord alerting (scripts/health_monitor.py), expanded /health endpoint, comprehensive test suite (147 tests), and CI pipeline with coverage reporting, though coverage threshold (60%) is below the stated >80% target. Increase CI coverage threshold from 60% to at least 80% to meet the explicit requirement stated in issue #79 objectives.
Docstring Coverage ⚠️ Warning Docstring coverage is 32.97% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly reflects the main objective: expanding test coverage and CI infrastructure for the validator, which is the primary focus of this comprehensive PR.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #79 requirements: health monitoring, test coverage, and CI pipeline setup. No unrelated refactoring or feature additions are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch agent/issue-79-insufficient-test-coverage-and-ci-coverage-for-production-validator

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (7)
tests/test_config.py (1)

15-19: netuid == 3 is brittle for a defaults-loading test.

Asserting an exact repo config value at Line 18 can create noisy failures when defaults are intentionally updated. Consider asserting validity constraints instead.

Example adjustment
     def test_load_from_project_hparams(self):
         """Loads real hparams.json from repository."""
         hparams = HParams.load()
-        assert hparams.netuid == 3
+        assert isinstance(hparams.netuid, int)
+        assert hparams.netuid >= 0
         assert hparams.evaluation_runs >= 1
         assert hparams.eval_timeout > 0
         assert hparams.benchmark_model_name != ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_config.py` around lines 15 - 19, The test
test_load_from_project_hparams currently asserts a brittle exact value
(hparams.netuid == 3); instead update the assertions to check validity
constraints: use HParams.load() to get hparams and replace the exact equality
with checks like isinstance(hparams.netuid, int) and hparams.netuid >= 0 (or
membership in an allowed set) while keeping the existing assertion for
hparams.evaluation_runs >= 1; locate this in the test_load_from_project_hparams
function and modify the hparams.netuid assertion accordingly.
docker/compose.yml (1)

123-125: Avoid startup race for monitor alerts.

Using condition: service_started can start the monitor before the validator endpoint is actually ready, which may cause false alerts right after boot.

Suggested compose tweak
   depends_on:
     validator:
-      condition: service_started
+      condition: service_healthy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose.yml` around lines 123 - 125, The monitor currently depends on
the validator using depends_on: validator with condition: service_started which
can cause a race; add a healthcheck to the validator service (e.g., a curl/HTTP
check against its readiness endpoint) and change the monitor's depends_on to
wait for condition: service_healthy for the validator so the monitor only starts
after the validator is actually ready; update the validator service block (add a
healthcheck section) and update the monitor depends_on entry accordingly.
scripts/health_monitor.py (2)

154-159: Consider handling SIGTERM for graceful Docker shutdown.

Docker sends SIGTERM before SIGKILL when stopping containers. Currently only KeyboardInterrupt (SIGINT) is handled. The container will still stop, but adding SIGTERM handling would allow for cleaner log messages.

🔧 Suggested SIGTERM handling
+import signal
+
+def handle_signal(signum, frame):
+    logger.info(f"Received signal {signum}, shutting down")
+    sys.exit(0)
+
 if __name__ == "__main__":
+    signal.signal(signal.SIGTERM, handle_signal)
     try:
         run()
     except KeyboardInterrupt:
         logger.info("Monitor stopped")
         sys.exit(0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/health_monitor.py` around lines 154 - 159, Add SIGTERM handling in
the module's __main__ shutdown path so Docker stops are logged and cleaned up
like KeyboardInterrupt; import the signal module, define a handler (e.g.,
shutdown_handler) that logs "Monitor stopped" via logger and calls sys.exit(0),
register it with signal.signal(signal.SIGTERM, shutdown_handler), and keep the
existing try/except around run() for KeyboardInterrupt so both run() and logger
are used for graceful shutdown.

87-106: Consider handling unknown status explicitly.

An "unknown" status (the fallback on line 96) currently returns (True, "healthy"), which may mask issues if the endpoint returns an unexpected status. Consider treating unknown statuses as unhealthy or at least logging a warning.

🔧 Alternative: treat unknown as unhealthy
     if status == "degraded":
         return False, (
             f"Validator is degraded: 0 evaluations in last hour, "
             f"queue_depth={data.get('queue_depth', '?')}"
         )

+    if status != "healthy":
+        return False, f"Unexpected status: {status}"
+
     return True, "healthy"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/health_monitor.py` around lines 87 - 106, The evaluate_health
function currently treats the fallback status "unknown" as healthy; update
evaluate_health(data: dict) to explicitly handle unexpected statuses by treating
any status not in {"healthy","degraded","unhealthy"} as unhealthy (or at least
return False) and return a descriptive reason including the unexpected status
(use the local variable status and the input data for context); optionally emit
or call a logger/warning when status is unknown to surface the anomaly instead
of returning True, "healthy".
tests/test_health_monitor.py (3)

49-53: Consider explicitly asserting no network calls in noop test.

The test verifies no error is raised, but doesn't explicitly confirm urlopen wasn't called. This would make the test more robust.

🔧 Suggested improvement
     def test_noop_when_no_webhook_url(self):
         """post_discord is a no-op when DISCORD_WEBHOOK_URL is empty."""
         with patch.object(health_monitor, "DISCORD_WEBHOOK_URL", ""):
-            # Should return without error and without making any network calls
-            health_monitor.post_discord("test alert")
+            with patch.object(health_monitor, "urlopen") as mock_urlopen:
+                health_monitor.post_discord("test alert")
+                mock_urlopen.assert_not_called()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_health_monitor.py` around lines 49 - 53, Update the
test_noop_when_no_webhook_url to not only patch
health_monitor.DISCORD_WEBHOOK_URL to "" but also patch the network call (e.g.,
urllib.request.urlopen) and assert it was not called; specifically, wrap
health_monitor.post_discord("test alert") with a patch on urllib.request.urlopen
(or the exact urlopen symbol used by health_monitor) and add an assertion like
urlopen.assert_not_called() after calling post_discord to ensure no network
activity occurred.

90-100: Move URLError import to module level.

The URLError import inside the test method works but is non-idiomatic. Moving it to the module-level imports improves consistency.

🔧 Suggested change

Add to imports at top of file:

from urllib.error import URLError

Then remove line 92.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_health_monitor.py` around lines 90 - 100, Move the URLError import
from inside the test method to the module-level imports: add "from urllib.error
import URLError" to the top of the test file and remove the inline import in
test_returns_error_on_url_error; this keeps the test_returns_error_on_url_error
function unchanged (it still patches health_monitor.urlopen and asserts
probe_health returns an error) while making URLError a top-level import for
consistency.

10-12: Consider using a pytest plugin or conftest.py for path setup.

The sys.path.insert approach works but is fragile and duplicates path setup if other test files need the same import. Consider:

  1. Adding a conftest.py in the tests directory that does this once
  2. Using pytest-pythonpath or similar plugin
  3. Making scripts/ a proper package with __init__.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_health_monitor.py` around lines 10 - 12, The test currently
mutates sys.path inside tests/test_health_monitor.py via sys.path.insert to
import health_monitor; remove that direct insertion and centralize path setup by
either (A) adding a tests/conftest.py that performs the same sys.path.insert so
all tests can import health_monitor without duplication, or (B) make scripts a
proper package by adding an __init__.py in scripts and adjusting imports to use
the package, or (C) configure pytest to use a pythonpath plugin; implement one
of these options and delete the sys.path.insert line from
tests/test_health_monitor.py so imports rely on the centralized setup instead of
per-test manipulation of sys.path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test.yml:
- Around line 30-36: The CI coverage gate currently sets thresholdAll: 60 which
doesn't meet the linked objective of >80%; update the workflow action block
using orgoro/coverage@v3.2 (the Upload coverage step that references
coverageFile: coverage.xml) to set thresholdAll to a value greater than 80
(e.g., 81) so the coverage gate enforces the >80% requirement.

In `@scripts/health_monitor.py`:
- Around line 37-38: Remove the unused environment variable binding
STALE_THRESHOLD (from STALE_THRESHOLD_SECONDS) and any references in the module
docstring: delete the line defining STALE_THRESHOLD =
int(os.getenv("STALE_THRESHOLD_SECONDS", "3600")) and remove the
STALE_THRESHOLD_SECONDS mention in the file-level docstring so only
actually-used config (e.g., ALERT_COOLDOWN) remains; confirm no other code
references STALE_THRESHOLD or STALE_THRESHOLD_SECONDS before removing.

In `@src/crusades/api/server.py`:
- Around line 132-145: The health endpoint currently sets "db_connected": True
even when get_db_client() returns a MockClient; update the logic in the health
handler (where get_db_client(), client.get_validator_status(),
client.get_queue_stats() and evaluations_1h are used) to detect mocked clients
and report db_connected=False for them—e.g., check isinstance(client,
MockClient) or a boolean flag like client.is_mock (prefer the existing property
if present), and set db_connected = False in the returned dict when the client
is a MockClient; otherwise keep db_connected = True and preserve the existing
status/evaluations_1h logic.

In `@tests/test_api.py`:
- Around line 12-32: The fixtures client and authed_client mutate the
module-level singleton server_module.app.state.api_key which can leak between
tests; change each fixture to create or clone a fresh FastAPI app instance (or
restore the original state in a finalizer) instead of directly mutating
server_module.app.state, e.g., obtain a new app copy for TestClient or save the
original value of server_module.app.state.api_key and restore it after the yield
(using a try/finally or fixture teardown) while continuing to patch
server_module._db_client with MockClient so tests remain isolated.

---

Nitpick comments:
In `@docker/compose.yml`:
- Around line 123-125: The monitor currently depends on the validator using
depends_on: validator with condition: service_started which can cause a race;
add a healthcheck to the validator service (e.g., a curl/HTTP check against its
readiness endpoint) and change the monitor's depends_on to wait for condition:
service_healthy for the validator so the monitor only starts after the validator
is actually ready; update the validator service block (add a healthcheck
section) and update the monitor depends_on entry accordingly.

In `@scripts/health_monitor.py`:
- Around line 154-159: Add SIGTERM handling in the module's __main__ shutdown
path so Docker stops are logged and cleaned up like KeyboardInterrupt; import
the signal module, define a handler (e.g., shutdown_handler) that logs "Monitor
stopped" via logger and calls sys.exit(0), register it with
signal.signal(signal.SIGTERM, shutdown_handler), and keep the existing
try/except around run() for KeyboardInterrupt so both run() and logger are used
for graceful shutdown.
- Around line 87-106: The evaluate_health function currently treats the fallback
status "unknown" as healthy; update evaluate_health(data: dict) to explicitly
handle unexpected statuses by treating any status not in
{"healthy","degraded","unhealthy"} as unhealthy (or at least return False) and
return a descriptive reason including the unexpected status (use the local
variable status and the input data for context); optionally emit or call a
logger/warning when status is unknown to surface the anomaly instead of
returning True, "healthy".

In `@tests/test_config.py`:
- Around line 15-19: The test test_load_from_project_hparams currently asserts a
brittle exact value (hparams.netuid == 3); instead update the assertions to
check validity constraints: use HParams.load() to get hparams and replace the
exact equality with checks like isinstance(hparams.netuid, int) and
hparams.netuid >= 0 (or membership in an allowed set) while keeping the existing
assertion for hparams.evaluation_runs >= 1; locate this in the
test_load_from_project_hparams function and modify the hparams.netuid assertion
accordingly.

In `@tests/test_health_monitor.py`:
- Around line 49-53: Update the test_noop_when_no_webhook_url to not only patch
health_monitor.DISCORD_WEBHOOK_URL to "" but also patch the network call (e.g.,
urllib.request.urlopen) and assert it was not called; specifically, wrap
health_monitor.post_discord("test alert") with a patch on urllib.request.urlopen
(or the exact urlopen symbol used by health_monitor) and add an assertion like
urlopen.assert_not_called() after calling post_discord to ensure no network
activity occurred.
- Around line 90-100: Move the URLError import from inside the test method to
the module-level imports: add "from urllib.error import URLError" to the top of
the test file and remove the inline import in test_returns_error_on_url_error;
this keeps the test_returns_error_on_url_error function unchanged (it still
patches health_monitor.urlopen and asserts probe_health returns an error) while
making URLError a top-level import for consistency.
- Around line 10-12: The test currently mutates sys.path inside
tests/test_health_monitor.py via sys.path.insert to import health_monitor;
remove that direct insertion and centralize path setup by either (A) adding a
tests/conftest.py that performs the same sys.path.insert so all tests can import
health_monitor without duplication, or (B) make scripts a proper package by
adding an __init__.py in scripts and adjusting imports to use the package, or
(C) configure pytest to use a pythonpath plugin; implement one of these options
and delete the sys.path.insert line from tests/test_health_monitor.py so imports
rely on the centralized setup instead of per-test manipulation of sys.path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad5763a3-dad0-40bd-9751-754375e16108

📥 Commits

Reviewing files that changed from the base of the PR and between 650d166 and 94fd426.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • .github/workflows/test.yml
  • docker/compose.yml
  • pyproject.toml
  • scripts/health_monitor.py
  • src/crusades/api/server.py
  • tests/__init__.py
  • tests/conftest.py
  • tests/test_api.py
  • tests/test_config.py
  • tests/test_database.py
  • tests/test_exceptions.py
  • tests/test_health_monitor.py
  • tests/test_models.py

Comment on lines +30 to +36
- name: Upload coverage
if: github.event_name == 'pull_request'
uses: orgoro/coverage@v3.2
with:
coverageFile: coverage.xml
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 60

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Coverage gate does not meet the linked objective.

Line 36 sets thresholdAll: 60, but the linked issue objective specifies coverage target >80%. This leaves the requirement only partially implemented.

Minimal fix
       - name: Upload coverage
         if: github.event_name == 'pull_request'
         uses: orgoro/coverage@v3.2
         with:
           coverageFile: coverage.xml
           token: ${{ secrets.GITHUB_TOKEN }}
-          thresholdAll: 60
+          thresholdAll: 80
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Upload coverage
if: github.event_name == 'pull_request'
uses: orgoro/coverage@v3.2
with:
coverageFile: coverage.xml
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 60
- name: Upload coverage
if: github.event_name == 'pull_request'
uses: orgoro/coverage@v3.2
with:
coverageFile: coverage.xml
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 80
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yml around lines 30 - 36, The CI coverage gate
currently sets thresholdAll: 60 which doesn't meet the linked objective of >80%;
update the workflow action block using orgoro/coverage@v3.2 (the Upload coverage
step that references coverageFile: coverage.xml) to set thresholdAll to a value
greater than 80 (e.g., 81) so the coverage gate enforces the >80% requirement.

Comment thread scripts/health_monitor.py
Comment on lines +37 to +38
STALE_THRESHOLD = int(os.getenv("STALE_THRESHOLD_SECONDS", "3600"))
ALERT_COOLDOWN = int(os.getenv("ALERT_COOLDOWN_SECONDS", "300"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

STALE_THRESHOLD_SECONDS is defined but unused.

The environment variable STALE_THRESHOLD_SECONDS is documented and loaded but never referenced in the code. The staleness detection appears to be handled by the /health endpoint itself (which returns status: "degraded" when evaluations_1h == 0), making this variable dead code.

🗑️ Suggested fix: Remove unused variable
 DISCORD_WEBHOOK_URL = os.getenv("DISCORD_WEBHOOK_URL", "")
 CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL_SECONDS", "60"))
-STALE_THRESHOLD = int(os.getenv("STALE_THRESHOLD_SECONDS", "3600"))
 ALERT_COOLDOWN = int(os.getenv("ALERT_COOLDOWN_SECONDS", "300"))

Also remove from the docstring (line 15).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
STALE_THRESHOLD = int(os.getenv("STALE_THRESHOLD_SECONDS", "3600"))
ALERT_COOLDOWN = int(os.getenv("ALERT_COOLDOWN_SECONDS", "300"))
DISCORD_WEBHOOK_URL = os.getenv("DISCORD_WEBHOOK_URL", "")
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL_SECONDS", "60"))
ALERT_COOLDOWN = int(os.getenv("ALERT_COOLDOWN_SECONDS", "300"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/health_monitor.py` around lines 37 - 38, Remove the unused
environment variable binding STALE_THRESHOLD (from STALE_THRESHOLD_SECONDS) and
any references in the module docstring: delete the line defining STALE_THRESHOLD
= int(os.getenv("STALE_THRESHOLD_SECONDS", "3600")) and remove the
STALE_THRESHOLD_SECONDS mention in the file-level docstring so only
actually-used config (e.g., ALERT_COOLDOWN) remains; confirm no other code
references STALE_THRESHOLD or STALE_THRESHOLD_SECONDS before removing.

Comment on lines +132 to +145
client = get_db_client()
if client is None:
return {"status": "healthy", "db_connected": False}

validator_status = client.get_validator_status()
queue = client.get_queue_stats()

evaluations_1h = validator_status.get("evaluations_completed_1h", 0)
status = "healthy" if evaluations_1h > 0 else "degraded"

return {
"status": status,
"db_connected": True,
"evaluations_1h": evaluations_1h,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

db_connected can be incorrectly true when running on MockClient.

At Line 132, get_db_client() may return MockClient (DB missing path), but Line 144 still reports "db_connected": True. That can mask real DB outages in /health.

Suggested fix
     try:
         client = get_db_client()
         if client is None:
             return {"status": "healthy", "db_connected": False}
+        db_connected = not isinstance(client, MockClient)

         validator_status = client.get_validator_status()
         queue = client.get_queue_stats()

         evaluations_1h = validator_status.get("evaluations_completed_1h", 0)
         status = "healthy" if evaluations_1h > 0 else "degraded"

         return {
             "status": status,
-            "db_connected": True,
+            "db_connected": db_connected,
             "evaluations_1h": evaluations_1h,
             "queue_depth": queue.get("queued_count", 0),
             "current_evaluation": validator_status.get("current_evaluation"),
             "validator_status": validator_status.get("status", "unknown"),
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
client = get_db_client()
if client is None:
return {"status": "healthy", "db_connected": False}
validator_status = client.get_validator_status()
queue = client.get_queue_stats()
evaluations_1h = validator_status.get("evaluations_completed_1h", 0)
status = "healthy" if evaluations_1h > 0 else "degraded"
return {
"status": status,
"db_connected": True,
"evaluations_1h": evaluations_1h,
client = get_db_client()
if client is None:
return {"status": "healthy", "db_connected": False}
db_connected = not isinstance(client, MockClient)
validator_status = client.get_validator_status()
queue = client.get_queue_stats()
evaluations_1h = validator_status.get("evaluations_completed_1h", 0)
status = "healthy" if evaluations_1h > 0 else "degraded"
return {
"status": status,
"db_connected": db_connected,
"evaluations_1h": evaluations_1h,
"queue_depth": queue.get("queued_count", 0),
"current_evaluation": validator_status.get("current_evaluation"),
"validator_status": validator_status.get("status", "unknown"),
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/crusades/api/server.py` around lines 132 - 145, The health endpoint
currently sets "db_connected": True even when get_db_client() returns a
MockClient; update the logic in the health handler (where get_db_client(),
client.get_validator_status(), client.get_queue_stats() and evaluations_1h are
used) to detect mocked clients and report db_connected=False for them—e.g.,
check isinstance(client, MockClient) or a boolean flag like client.is_mock
(prefer the existing property if present), and set db_connected = False in the
returned dict when the client is a MockClient; otherwise keep db_connected =
True and preserve the existing status/evaluations_1h logic.

Comment thread tests/test_api.py
Comment on lines +12 to +32
@pytest.fixture
def client():
"""TestClient using the module-level app with MockClient injected."""
mock = MockClient()
app = server_module.app
# Ensure no API key is required
app.state.api_key = None
with patch.object(server_module, "_db_client", mock):
with TestClient(app) as c:
yield c


@pytest.fixture
def authed_client():
"""TestClient with API key required."""
mock = MockClient()
app = server_module.app
app.state.api_key = "test-secret-key"
with patch.object(server_module, "_db_client", mock):
with TestClient(app) as c:
yield c

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential test isolation issue with shared app state.

Both fixtures directly modify server_module.app.state.api_key. Since app is a module-level singleton, this mutation persists across tests. If tests run in parallel or in unexpected order, they may interfere with each other.

Consider using a fresh app instance per test or resetting the state in a fixture finalizer.

🔧 Suggested improvement using fixture cleanup
 `@pytest.fixture`
 def client():
     """TestClient using the module-level app with MockClient injected."""
     mock = MockClient()
     app = server_module.app
+    original_api_key = app.state.api_key
     # Ensure no API key is required
     app.state.api_key = None
     with patch.object(server_module, "_db_client", mock):
         with TestClient(app) as c:
             yield c
+    app.state.api_key = original_api_key
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@pytest.fixture
def client():
"""TestClient using the module-level app with MockClient injected."""
mock = MockClient()
app = server_module.app
# Ensure no API key is required
app.state.api_key = None
with patch.object(server_module, "_db_client", mock):
with TestClient(app) as c:
yield c
@pytest.fixture
def authed_client():
"""TestClient with API key required."""
mock = MockClient()
app = server_module.app
app.state.api_key = "test-secret-key"
with patch.object(server_module, "_db_client", mock):
with TestClient(app) as c:
yield c
`@pytest.fixture`
def client():
"""TestClient using the module-level app with MockClient injected."""
mock = MockClient()
app = server_module.app
original_api_key = app.state.api_key
# Ensure no API key is required
app.state.api_key = None
with patch.object(server_module, "_db_client", mock):
with TestClient(app) as c:
yield c
app.state.api_key = original_api_key
`@pytest.fixture`
def authed_client():
"""TestClient with API key required."""
mock = MockClient()
app = server_module.app
app.state.api_key = "test-secret-key"
with patch.object(server_module, "_db_client", mock):
with TestClient(app) as c:
yield c
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_api.py` around lines 12 - 32, The fixtures client and
authed_client mutate the module-level singleton server_module.app.state.api_key
which can leak between tests; change each fixture to create or clone a fresh
FastAPI app instance (or restore the original state in a finalizer) instead of
directly mutating server_module.app.state, e.g., obtain a new app copy for
TestClient or save the original value of server_module.app.state.api_key and
restore it after the yield (using a try/finally or fixture teardown) while
continuing to patch server_module._db_client with MockClient so tests remain
isolated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Insufficient test coverage and CI coverage for production validator

1 participant