Add resilient database initialization with retries and improved startup #22
Merged
vincentmakes merged 4 commits into main on Feb 26, 2026
Two issues caused admin login to fail with "relation admin_users does not exist":

1. docker-entrypoint.sh only waited for DB_HOST (the tenant DB), not MASTER_DB_HOST. In multi-tenant mode with separate hosts, nothing ever waited for the master DB. The default number of wait attempts also increases from 15 to 30 (configurable via the DB_WAIT_ATTEMPTS env var).

2. In master_db.py, _apply_pending_migrations() wrapped ALL table creation in a single try/except that silently swallowed errors. If PostgreSQL wasn't ready or the migration failed for any reason, the critical admin_users table was never created, but the app started anyway. Core tables (admin_users, admin_sessions) are now created outside the try/except so failures are fatal, while optional migrations (organizations, SSO) remain non-fatal.

In addition, init_db() now retries with exponential backoff (5 attempts) so the app can recover if PostgreSQL becomes ready slightly after the entrypoint wait completes.

https://claude.ai/code/session_01KHhXcB7ZB2eojjBd3tYxtB
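The retry behaviour described in this commit could be sketched roughly as below. This is not the code from the PR. The helper name `init_with_retry`, the 1-second base delay, the doubling factor, and the injectable `sleep` parameter are all assumptions made for illustration. Only the count of 5 attempts comes from the commit message.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def init_with_retry(init_fn, attempts=5, base_delay=1.0, sleep=asyncio.sleep):
    """Await init_fn(), retrying with exponential backoff when it raises.

    attempts=5 mirrors the commit message; base_delay and the doubling
    schedule are assumed values, not taken from the PR.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await init_fn()
        except Exception as exc:  # a real version would catch driver-specific errors
            if attempt == attempts:
                raise  # out of retries: let startup fail loudly
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, 8s
            logger.warning(
                "DB init attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, attempts, exc, delay,
            )
            await sleep(delay)
```

Re-raising on the final attempt keeps the failure fatal, which matches the intent stated above for the core tables.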
Instead of suppressing all errors with 2>/dev/null, capture and display the actual socket error (e.g., "Name or service not known", "Connection refused", "timed out") so it's clear WHY the connection is failing. https://claude.ai/code/session_01KHhXcB7ZB2eojjBd3tYxtB
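The entrypoint itself is a shell script and its exact probe command is not shown here, so the following is only a Python sketch of the same idea: attempt a TCP connection and report the underlying socket error instead of discarding it. The function names `probe` and `wait_for_port`, the timeout, and the delay are hypothetical.

```python
import socket
import sys
import time


def probe(host, port, timeout=3.0):
    """Return None if a TCP connection succeeds, otherwise the error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:  # includes DNS failures, refused connections, timeouts
        return f"{type(exc).__name__}: {exc}"


def wait_for_port(host, port, attempts=30, delay=1.0):
    """Probe repeatedly, printing WHY each failed attempt failed."""
    for attempt in range(1, attempts + 1):
        error = probe(host, port)
        if error is None:
            return True
        print(f"[{attempt}/{attempts}] {host}:{port} not reachable: {error}",
              file=sys.stderr)
        if attempt < attempts:
            time.sleep(delay)
    return False
```

Catching `OSError` is enough here because `socket.gaierror`, `ConnectionRefusedError`, and socket timeouts are all subclasses of it.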
Admin routes (/api/admin/auth/login, etc.) are mounted unconditionally and depend on get_master_db, which needs the admin_users table. But init_db() for the master DB was only called when MULTI_TENANT=true. In single-tenant mode, the admin_users table was never created, causing "relation admin_users does not exist" on every admin login attempt. https://claude.ai/code/session_01KHhXcB7ZB2eojjBd3tYxtB
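The shape of this fix might look something like the sketch below, where master DB initialization runs in every mode and only the tenant-specific step stays behind the flag. The `startup` function and the `init_tenant_registry` callback are hypothetical names, and the PR's actual startup hook is not shown in this conversation.

```python
import asyncio


async def startup(multi_tenant, init_master_db, init_tenant_registry):
    """Initialize the master DB unconditionally; tenant setup only if enabled.

    Both callbacks are placeholders for whatever the application really calls.
    """
    # Admin routes are always mounted, so admin_users must always exist.
    await init_master_db()
    if multi_tenant:
        await init_tenant_registry()
```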
Summary
This PR improves the reliability of database initialization by adding exponential backoff retry logic and better separation of critical vs. optional migrations. It also enhances the Docker entrypoint to support multi-database scenarios and configurable wait timeouts.
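The separation between critical and optional migrations can be illustrated with a small sketch. It is synchronous for brevity, and the function name `apply_migrations` and the `(name, callable)` step format are assumptions, not the structure of `_apply_pending_migrations()` itself.

```python
import logging

logger = logging.getLogger(__name__)


def apply_migrations(core_steps, optional_steps):
    """Run core steps with failures propagating; log and skip optional failures.

    Each step is a (name, callable) pair. Returns the names of the optional
    steps that failed.
    """
    for name, step in core_steps:
        step()  # deliberately no try/except: a failure here aborts startup
    failed = []
    for name, step in optional_steps:
        try:
            step()
        except Exception:
            logger.exception("optional migration %r failed; continuing", name)
            failed.append(name)
    return failed
```

Returning the failed names lets a caller report which optional features are unavailable without hiding the error the way a single blanket try/except would.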
Key Changes
Master DB Initialization (app/services/master_db.py)
- Added asyncio import for async sleep functionality
- Reworked _apply_pending_migrations() to distinguish between:
  - core tables (admin_users, admin_sessions), whose creation failures are fatal
  - optional migrations (organizations, SSO), whose failures remain non-fatal

Docker Entrypoint (docker-entrypoint.sh)
- wait_for_db() function with:
  - DB_WAIT_ATTEMPTS environment variable (defaults to 30)
  - a wait on the master DB when MULTI_TENANT=true and MASTER_DB_HOST differs from the tenant DB host

Implementation Details
- DB_WAIT_ATTEMPTS allows operators to tune startup behavior for their infrastructure

https://claude.ai/code/session_01KHhXcB7ZB2eojjBd3tYxtB
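As a rough illustration of the entrypoint's decision logic, the sketch below derives which hosts to wait for and how many attempts to make from an environment mapping. The real logic lives in the shell script. The function name `hosts_to_wait_for` and the case-insensitive handling of `MULTI_TENANT` are assumptions.

```python
def hosts_to_wait_for(env):
    """Return (hosts, attempts) derived from an environment-like mapping."""
    attempts = int(env.get("DB_WAIT_ATTEMPTS", "30"))  # default of 30 per the PR
    hosts = []
    if env.get("DB_HOST"):
        hosts.append(env["DB_HOST"])
    master = env.get("MASTER_DB_HOST")
    multi_tenant = env.get("MULTI_TENANT", "").lower() == "true"
    # Wait for the master DB separately only when it is a different host.
    if multi_tenant and master and master not in hosts:
        hosts.append(master)
    return hosts, attempts
```

In practice the mapping would be `os.environ`. Passing a plain dict keeps the function easy to exercise in isolation.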