Add resilient database initialization with retries and improved startup#22

Merged
vincentmakes merged 4 commits into main from claude/fix-postgres-timeout-beI4R on Feb 26, 2026

@vincentmakes

Summary

This PR improves the reliability of database initialization by adding exponential backoff retry logic and better separation of critical vs. optional migrations. It also enhances the Docker entrypoint to support multi-database scenarios and configurable wait timeouts.

Key Changes

Master DB Initialization (app/services/master_db.py)

  • Added asyncio import for async sleep functionality
  • Implemented exponential backoff retry logic (up to 5 attempts with 2^n second delays) for initial database connectivity verification
  • Improved logging to track connection attempts and failures
  • Refactored _apply_pending_migrations() to distinguish between:
    • Critical tables (admin_users, admin_sessions): Must succeed or app fails to start
    • Optional migrations (organizations, SSO, tenant columns): Failures are logged but don't prevent startup
  • Updated docstrings to document retry behavior and migration strategy
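
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in app/services/master_db.py: the function name `connect_with_backoff`, the `check` callable, and the injectable `sleep` parameter are all hypothetical. The sketch also assumes no sleep happens after the final failed attempt, so it only ever waits 2, 4, 8, and 16 seconds; the actual implementation may differ on that point.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


async def connect_with_backoff(
    check: Callable[[], Awaitable[None]],
    max_attempts: int = 5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Call `check` until it succeeds; return the attempt number that worked.

    `check` is a hypothetical coroutine that raises if the database is not
    reachable yet (for example, one that runs `SELECT 1`).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await check()
            logger.info("Database reachable on attempt %d/%d", attempt, max_attempts)
            return attempt
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: let the error propagate so startup fails loudly.
                logger.error("Database unreachable after %d attempts", attempt)
                raise
            delay = 2 ** attempt  # exponential backoff: 2, 4, 8, 16 seconds
            logger.warning(
                "Attempt %d/%d failed (%s); retrying in %ds",
                attempt, max_attempts, exc, delay,
            )
            await sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is only a convenience for testing the backoff schedule without real delays.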

Docker Entrypoint (docker-entrypoint.sh)

  • Enhanced wait_for_db() function with:
    • Configurable label parameter for clearer logging (e.g., "tenant DB" vs "master DB")
    • Configurable max attempts via DB_WAIT_ATTEMPTS environment variable (defaults to 30)
  • Added support for multi-tenant scenarios:
    • Waits for tenant database first
    • Conditionally waits for master database if MULTI_TENANT=true and MASTER_DB_HOST differs from tenant DB host
  • Improved messaging to indicate which database is being waited for and how many attempts remain
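
The entrypoint itself is a shell script, but the decision it makes about which databases to wait for can be expressed in Python for illustration. The helper names `databases_to_wait_for` and `max_wait_attempts` are hypothetical, and the `localhost` fallback for `DB_HOST` is an assumption; only the environment variable names and the default of 30 attempts come from the description above.

```python
from typing import List, Mapping, Tuple


def databases_to_wait_for(env: Mapping[str, str]) -> List[Tuple[str, str]]:
    """Return (label, host) pairs in the order the entrypoint would wait."""
    # Assumption: fall back to "localhost" when DB_HOST is unset.
    tenant_host = env.get("DB_HOST", "localhost")
    targets = [("tenant DB", tenant_host)]

    master_host = env.get("MASTER_DB_HOST", "")
    # Wait for the master DB only in multi-tenant mode, and only when it
    # lives on a different host than the tenant DB.
    if env.get("MULTI_TENANT") == "true" and master_host and master_host != tenant_host:
        targets.append(("master DB", master_host))
    return targets


def max_wait_attempts(env: Mapping[str, str]) -> int:
    """Read DB_WAIT_ATTEMPTS, defaulting to 30 as described in the PR."""
    return int(env.get("DB_WAIT_ATTEMPTS", "30"))
```

For example, `databases_to_wait_for(os.environ)` would yield one entry in single-tenant mode and two when a separate master host is configured.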

Implementation Details

  • The retry logic uses exponential backoff (2, 4, 8, 16, 32 seconds) to gracefully handle PostgreSQL containers that are still initializing
  • Critical migrations run outside the try/except block so that fatal errors propagate and stop startup
  • Optional migrations are wrapped in try/except so that a failure leaves a partially initialized schema rather than blocking startup
  • The Docker entrypoint remains non-fatal for database readiness (app starts regardless), but the Python code now has its own retry mechanism for robustness
  • Environment variable DB_WAIT_ATTEMPTS allows operators to tune startup behavior for their infrastructure
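
The split between critical and optional migrations might look something like the sketch below. It is not the body of `_apply_pending_migrations()`: the function `apply_migrations`, the `Migration` alias, and the idea of passing migrations as (name, callable) pairs are hypothetical. The sketch also wraps each optional step in its own try/except, which is an assumption; the PR may use a single block around all of them.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)

# Hypothetical shape: a migration is a name plus a zero-argument callable.
Migration = Tuple[str, Callable[[], None]]


def apply_migrations(
    critical: Iterable[Migration],
    optional: Iterable[Migration],
) -> List[str]:
    """Run critical steps unguarded, optional steps guarded.

    Returns the names of optional migrations that failed.
    """
    # Critical steps (e.g. admin_users, admin_sessions): no try/except,
    # so any exception propagates and the app fails to start.
    for name, step in critical:
        logger.info("Applying critical migration %s", name)
        step()

    # Optional steps (e.g. organizations, SSO): log the failure and continue.
    failed: List[str] = []
    for name, step in optional:
        try:
            step()
        except Exception:
            logger.exception("Optional migration %s failed; continuing startup", name)
            failed.append(name)
    return failed
```

Returning the failed names is a design choice of this sketch so that a caller could surface them in a health check; the source only says such failures are logged.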

https://claude.ai/code/session_01KHhXcB7ZB2eojjBd3tYxtB

Two issues caused admin login to fail with "relation admin_users does not exist":

1. docker-entrypoint.sh only waited for DB_HOST (tenant DB), not MASTER_DB_HOST.
   In multi-tenant mode with separate hosts, the master DB was never waited for.
   Also increased default wait attempts from 15 to 30 (configurable via
   DB_WAIT_ATTEMPTS env var).

2. master_db.py _apply_pending_migrations() wrapped ALL table creation in a
   single try/except that silently caught errors. If PostgreSQL wasn't ready
   or the migration failed for any reason, the critical admin_users table
   was never created but the app started anyway. Now core tables (admin_users,
   admin_sessions) are created outside the try/except so failures are fatal,
   while optional migrations (organizations, SSO) remain non-fatal.

In addition to these two fixes, init_db() now retries with exponential backoff
(5 attempts) so the app can recover if PostgreSQL becomes ready slightly after
the entrypoint wait completes.

https://claude.ai/code/session_01KHhXcB7ZB2eojjBd3tYxtB

Instead of suppressing all errors with 2>/dev/null, capture and display
the actual socket error (e.g., "Name or service not known", "Connection
refused", "timed out") so it's clear WHY the connection is failing.
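
A connectivity probe that reports the underlying socket error, rather than discarding it, could be sketched like this. The function `probe_tcp` and its parameters are hypothetical, and how the entrypoint actually performs its check is not shown in this PR description; the sketch only illustrates surfacing the error text.

```python
import socket
from typing import Optional


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> Optional[str]:
    """Try a TCP connection; return None on success, else a readable error.

    DNS failures (socket.gaierror), refused connections, and timeouts are
    all subclasses of OSError, so one handler covers them.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        # Keep the exception type and message so the log says WHY it failed.
        return f"{type(exc).__name__}: {exc}"
```

A wait loop could then print the returned string on each failed attempt instead of a generic "not ready" message.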

https://claude.ai/code/session_01KHhXcB7ZB2eojjBd3tYxtB

Admin routes (/api/admin/auth/login, etc.) are mounted unconditionally
and depend on get_master_db, which needs the admin_users table. But
init_db() for the master DB was only called when MULTI_TENANT=true.
In single-tenant mode, the admin_users table was never created, causing
"relation admin_users does not exist" on every admin login attempt.

https://claude.ai/code/session_01KHhXcB7ZB2eojjBd3tYxtB

@vincentmakes merged commit 7f1fe8e into main on Feb 26, 2026
10 checks passed
@vincentmakes deleted the claude/fix-postgres-timeout-beI4R branch on February 26, 2026 at 19:14