Rpc health failover system #207

Open
1evi7eo wants to merge 2 commits into bnb-chain:main from 1evi7eo:rpc-health-failover-system

1evi7eo commented Jan 24, 2026

Description

This PR introduces an RPC Health Failover System for BNB Smart Chain (BSC), a BNB Chain Cookbook demo that monitors multiple RPC endpoints, checks their health in real time, and automatically fails over to the best available endpoint when the active one becomes slow or unavailable. It keeps blockchain applications available by maintaining redundancy across multiple RPC providers.

Key Features:

  • Multi-Endpoint Monitoring: Simultaneously monitors multiple BSC RPC endpoints, configured via the comma-separated BSC_RPC_URLS environment variable
  • Periodic Health Checks: Automatically tests each endpoint every 5 seconds (configurable via HEALTH_CHECK_INTERVAL_MS) by calling the eth_blockNumber RPC method
  • Status Classification: Categorizes endpoints into three status levels:
    • Healthy: Latency < 1000ms, responding correctly
    • Degraded: Latency 1000–3000ms, still functional but slow
    • Unhealthy: Latency > 3000ms or errors, not recommended
  • Automatic Failover: Selects the best available endpoint by status priority (healthy > degraded > unhealthy) and then by latency, switching automatically when the current endpoint fails or degrades
  • Real-time Failover: If the active endpoint fails during an RPC call, the system immediately tries the alternative endpoints in priority order
  • Manual Override: Allows manual selection of any endpoint via the /api/set-active API endpoint
  • Status Tracking: Monitors latency, errors, consecutive failures, last check timestamp, and last block number for each endpoint
  • RESTful API: Express.js backend with endpoints for health status, manual endpoint selection, and test RPC calls
  • Real-time Dashboard: Modern dark-mode UI showing status of all endpoints with color-coded health indicators
  • Configurable Timeouts: RPC call timeout is configurable via the RPC_TIMEOUT_MS environment variable (default: 3000ms)
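
As a rough illustration of the status thresholds and environment variables listed above, a minimal sketch might look like the following. The function and type names (classifyStatus, loadConfig, FailoverConfig) are assumptions made for this example and are not taken from the PR's source. The environment is passed in as a plain object so the sketch does not depend on any runtime.

```typescript
// Sketch only: all names here are illustrative, not the PR's actual code.
type EndpointStatus = "healthy" | "degraded" | "unhealthy";

interface FailoverConfig {
  rpcUrls: string[];
  healthCheckIntervalMs: number;
  rpcTimeoutMs: number;
}

// Thresholds from the Status Classification list.
const HEALTHY_MAX_MS = 1000;
const DEGRADED_MAX_MS = 3000;

/** Map one health-check result to a status level. */
function classifyStatus(latencyMs: number, error?: string): EndpointStatus {
  if (error !== undefined) return "unhealthy"; // any error overrides latency
  if (latencyMs < HEALTHY_MAX_MS) return "healthy";
  if (latencyMs <= DEGRADED_MAX_MS) return "degraded";
  return "unhealthy";
}

/** Parse a positive integer, falling back when the value is unset or invalid. */
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const value = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(value) && value > 0 ? value : fallback;
}

/** Read the three environment variables named in the feature list. */
function loadConfig(env: Record<string, string | undefined>): FailoverConfig {
  const rpcUrls = (env["BSC_RPC_URLS"] ?? "")
    .split(",")
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
  return {
    rpcUrls,
    healthCheckIntervalMs: parsePositiveInt(env["HEALTH_CHECK_INTERVAL_MS"], 5000),
    rpcTimeoutMs: parsePositiveInt(env["RPC_TIMEOUT_MS"], 3000),
  };
}
```

In a Node.js process this could be called as loadConfig(process.env). Treating 1000ms and 3000ms as degraded is one reading of the ranges above; the PR's own boundary handling may differ.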

How Failover Works:

  1. Health checks run periodically (default: every 5 seconds) on all configured endpoints
  2. Each endpoint is tested by calling eth_blockNumber and measuring response latency
  3. Endpoints are ranked by status priority (healthy > degraded > unhealthy) and then by latency
  4. The best endpoint is automatically selected as active
  5. If the active endpoint fails during an RPC call, the system immediately fails over to the next best endpoint
  6. Failed endpoints are marked unhealthy and tracked with consecutive failure counts
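
The ranking and failover steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's RpcFailoverManager: the EndpointState shape, rankEndpoints, callWithFailover, and the injected RpcTransport function are all hypothetical names chosen for the example. The transport is injected so the logic can be exercised without network access.

```typescript
// Sketch only: types and function names are assumptions for illustration.
type EndpointStatus = "healthy" | "degraded" | "unhealthy";

interface EndpointState {
  url: string;
  status: EndpointStatus;
  latencyMs: number;
  consecutiveFailures: number;
  lastError?: string;
}

// Lower number means higher priority (healthy > degraded > unhealthy).
const STATUS_PRIORITY: Record<EndpointStatus, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

/** Return a sorted copy: by status priority first, then by latency. */
function rankEndpoints(endpoints: EndpointState[]): EndpointState[] {
  return [...endpoints].sort(
    (a, b) =>
      STATUS_PRIORITY[a.status] - STATUS_PRIORITY[b.status] ||
      a.latencyMs - b.latencyMs,
  );
}

// Hypothetical transport: performs one JSON-RPC call against one URL.
type RpcTransport = (url: string, method: string, params: unknown[]) => Promise<unknown>;

/** Try the active endpoint first, then the rest in ranked order. */
async function callWithFailover(
  endpoints: EndpointState[],
  activeUrl: string,
  transport: RpcTransport,
  method: string,
  params: unknown[] = [],
): Promise<{ result: unknown; servedBy: string }> {
  const ranked = rankEndpoints(endpoints);
  const active = ranked.find((e) => e.url === activeUrl);
  const order = active ? [active, ...ranked.filter((e) => e !== active)] : ranked;
  for (const endpoint of order) {
    try {
      const result = await transport(endpoint.url, method, params);
      // Success clears the failure streak; status is left for the next health check.
      endpoint.consecutiveFailures = 0;
      delete endpoint.lastError;
      return { result, servedBy: endpoint.url };
    } catch (err) {
      // Step 6: mark the endpoint unhealthy and record the failure.
      endpoint.status = "unhealthy";
      endpoint.consecutiveFailures += 1;
      endpoint.lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`All ${order.length} endpoints failed for ${method}`);
}
```

A caller would typically store servedBy as the new active endpoint, so that later calls start from the endpoint that last succeeded.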

Use Cases:

  • High-availability dApps that require reliable RPC access
  • Production applications needing automatic failover capabilities
  • Monitoring RPC provider performance and reliability
  • Educational tool for understanding RPC redundancy patterns
  • Building resilient blockchain infrastructure

Tech Stack:

  • TypeScript for type safety and maintainability
  • Express.js for HTTP server and RESTful API endpoints
  • Direct JSON-RPC calls using the native fetch API, with AbortController for timeouts
  • Plain HTML/CSS/JS for frontend with modern dark theme UI
  • Vitest for comprehensive unit testing
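
To illustrate the fetch plus AbortController approach named in the tech stack, a minimal sketch of a timed eth_blockNumber probe is shown below. The helper names (buildRpcBody, parseBlockNumber, rpcCall, probeEndpoint) are assumptions for this example and may not match the PR's checkEndpointHealth implementation. The sketch assumes a runtime with a global fetch, such as Node.js 18 or later.

```typescript
// Sketch only: helper names are illustrative, not the PR's actual code.
interface JsonRpcResponse {
  jsonrpc: string;
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

/** Build the JSON-RPC 2.0 request body for a method call. */
function buildRpcBody(method: string, params: unknown[] = [], id = 1): string {
  return JSON.stringify({ jsonrpc: "2.0", id, method, params });
}

/** Convert the hex quantity returned by eth_blockNumber (e.g. "0x10") to a number. */
function parseBlockNumber(hex: unknown): number {
  if (typeof hex !== "string" || !/^0x[0-9a-fA-F]+$/.test(hex)) {
    throw new Error(`Unexpected eth_blockNumber result: ${String(hex)}`);
  }
  return Number.parseInt(hex, 16);
}

/** POST one JSON-RPC call, aborting the request if it exceeds timeoutMs. */
async function rpcCall(
  url: string,
  method: string,
  params: unknown[] = [],
  timeoutMs = 3000,
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: buildRpcBody(method, params),
      signal: controller.signal,
    });
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const payload = (await response.json()) as JsonRpcResponse;
    if (payload.error) {
      throw new Error(`RPC error ${payload.error.code}: ${payload.error.message}`);
    }
    return payload.result;
  } finally {
    clearTimeout(timer); // always release the timer, on success or failure
  }
}

/** Measure eth_blockNumber latency for one endpoint. */
async function probeEndpoint(
  url: string,
  timeoutMs = 3000,
): Promise<{ latencyMs: number; blockNumber: number }> {
  const startedAt = Date.now();
  const result = await rpcCall(url, "eth_blockNumber", [], timeoutMs);
  return { latencyMs: Date.now() - startedAt, blockNumber: parseBlockNumber(result) };
}
```

When the timer fires, fetch rejects with an abort error, which a caller can treat like any other endpoint failure.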

This implementation provides a complete failover system that keeps applications available when individual RPC endpoints experience issues, which makes it useful for applications that depend on reliable RPC access.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce.

  • Ran unit tests with npm test to verify all health check and failover functions
  • Tested the checkEndpointHealth function with various RPC endpoints and response times
  • Verified that the RpcFailoverManager class correctly initializes and manages multiple endpoints
  • Tested the startHealthChecks and stopHealthChecks methods for periodic monitoring
  • Validated that runHealthChecks updates endpoint status, latency, and error tracking
  • Tested that selectBestEndpoint correctly prioritizes endpoints by status and then latency
  • Verified that setActiveEndpoint allows manual endpoint selection
  • Tested that getHealthStatus returns correct counts and endpoint information
  • Validated the call method's automatic failover when the active endpoint fails
  • Tested that the failover logic tries alternative endpoints in priority order
  • Verified consecutive failure tracking and error message storage
  • Tested that the Express API endpoints (/api/health, /api/set-active, /api/test-call) return correct data
  • Validated error handling for invalid endpoints, RPC failures, and timeout scenarios
  • Tested with multiple real BSC RPC endpoints to verify health monitoring and failover
  • Verified that the frontend UI correctly displays endpoint status, latency, and health indicators
  • Tested automatic failover by simulating endpoint failures
  • Built and tested production build with npm run build && npm start
  • Tested on BSC mainnet with real RPC endpoints to verify production readiness

Reproduction Steps:

  1. Clone the repository and run ./clone-and-run.sh (or manually: npm install, cp .env.example .env, npm run build, npm test, npm start)
  2. Open http://localhost:3000 in a browser
  3. View the dashboard showing health status of all configured RPC endpoints
  4. Observe automatic health checks running every 5 seconds
  5. Click "Test Active Endpoint" to verify the current active endpoint is working
  6. Manually select a different endpoint using the dropdown and verify it becomes active
  7. Simulate an endpoint failure by stopping one RPC service and observe automatic failover
  8. Verify the system automatically selects the best available endpoint based on health and latency

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@vivixu-cmd

Congratulations! You have received a Cookbook reward. Please reply with your BSC wallet address. Thanks

1evi7eo commented Jan 27, 2026

> Congratulations! You have received a Cookbook reward. Please reply with your BSC wallet address. Thanks

Thank you for the opportunity to contribute!
0x23b23556c3CAA3C582EeE23Fc0D972352FB2a62c
