fix: bound rpc genvm calls and health scans #1648

Merged: MuncleUscles merged 1 commit into main from codex/rpc-backpressure-health on May 29, 2026
MuncleUscles (Member) commented May 29, 2026

Summary

  • Bound gen_call, sim_call, and GenVM-backed eth_call behind shared RPC admission control so busy pods reject excess work before opening validator snapshots or contract snapshots.
  • Removed the stray eth_call snapshot debug print.
  • Added a cooldown after the optional no-progress consensus-history scan times out, so health checks do not keep re-running the same expensive JSON scan every tick under DB pressure.
  • Added regression tests for RPC admission and the health scan cooldown.

Incident Fit

Prod showed JSON-RPC pool exhaustion and repeated no-progress scan timeouts while GenVM remained locally healthy. This keeps GenVM-backed reads from queueing unlimited DB/GenVM work in the API process and prevents a timed-out health scan from compounding the pressure.

Tests

  • /Users/edgars/Dev/genlayer-studio/.venv/bin/python -m py_compile backend/protocol_rpc/endpoints.py backend/protocol_rpc/health.py tests/unit/test_rpc_genvm_admission.py tests/db-sqlalchemy/test_health_orphan_detection.py
  • /Users/edgars/Dev/genlayer-studio/.venv/bin/python -m pytest tests/unit/test_rpc_genvm_admission.py tests/unit/test_contract_not_found_handling.py tests/unit/test_call_interceptor.py
  • /Users/edgars/Dev/genlayer-studio/.venv/bin/python -m pytest tests/db-sqlalchemy/test_health_orphan_detection.py -k 'no_progress_scan_error or no_progress_scan_timeout' (not run: blocked locally because POSTGRES_URL is not set)

Summary by CodeRabbit

  • Bug Fixes & Performance
    • Improved error handling and messaging during high-load scenarios
    • Optimized health check monitoring with cooldown mechanisms to prevent resource exhaustion
    • Enhanced request admission control for RPC endpoints under load

coderabbitai Bot (Contributor) commented May 29, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR adds two control mechanisms: a reusable admission-control context manager that gates GenVM concurrency across JSON-RPC endpoints with -32006 rejection when overloaded, and a cooldown-based suppression mechanism for expensive consensus "no-progress" scans when they time out.

Changes

GenVM Admission Control

| Layer / File(s) | Summary |
| --- | --- |
| Admission context manager and infrastructure: backend/protocol_rpc/endpoints.py | Imports asynccontextmanager, defines _genvm_admission_semaphore for admission control separate from execution concurrency, and implements _admit_genvm_call() to reject overloaded requests with JSON-RPC error code -32006 and retry_after_seconds metadata. |
| Admission wrapping for gen_call, sim_call, and eth_call: backend/protocol_rpc/endpoints.py | Wraps execution of gen_call, sim_call, and eth_call inside async with _admit_genvm_call() to enforce admission gating at endpoint entry, removes the debug print, and preserves error handling for contract-not-found cases. |
| GenVM admission control unit tests: tests/unit/test_rpc_genvm_admission.py | Adds _AsyncSnapshot helper and six async tests verifying admission rejection with error code -32006 and retry metadata, slot release on inner errors, early rejection before snapshot construction, and successful execution with proper cleanup. |
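The admission pattern described above can be illustrated with a minimal sketch. This is not the code from backend/protocol_rpc/endpoints.py, which is not shown on this page. The names `RpcError`, `make_admission_gate`, and `demo`, the slot limit, and the retry value are assumptions chosen for illustration; only the -32006 code and the retry_after_seconds field come from the walkthrough.

```python
import asyncio
from contextlib import asynccontextmanager

GENVM_OVERLOADED_CODE = -32006  # error code named in the walkthrough


class RpcError(Exception):
    """Hypothetical stand-in for the project's JSON-RPC error type."""

    def __init__(self, code, message, data=None):
        super().__init__(message)
        self.code = code
        self.data = data or {}


def make_admission_gate(limit, retry_after_seconds=1):
    """Build an async context manager that admits at most `limit` callers."""
    semaphore = asyncio.Semaphore(limit)

    @asynccontextmanager
    async def admit():
        # Reject immediately instead of queueing when every slot is taken,
        # so no snapshot or DB work is started for the rejected request.
        if semaphore.locked():
            raise RpcError(
                GENVM_OVERLOADED_CODE,
                "GenVM capacity exhausted",
                {"retry_after_seconds": retry_after_seconds},
            )
        await semaphore.acquire()  # a slot is free, so this does not suspend
        try:
            yield
        finally:
            # Runs on success and on inner errors, so the slot is never leaked.
            semaphore.release()

    return admit


async def demo():
    admit = make_admission_gate(limit=1)
    results = []
    async with admit():
        try:
            async with admit():  # second caller while the only slot is held
                results.append("admitted")
        except RpcError as exc:
            results.append((exc.code, exc.data["retry_after_seconds"]))
    try:
        async with admit():
            raise ValueError("inner failure")
    except ValueError:
        pass
    async with admit():  # succeeds only if the slot was released above
        results.append("admitted after release")
    return results
```

Rejecting at entry rather than awaiting the semaphore is the design choice that matters here: a waiting caller would still hold a request slot in the API process, which is the queueing behaviour the PR description says it wants to avoid.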

Health Check Cooldown Suppression

| Layer / File(s) | Summary |
| --- | --- |
| Suppression state and cooldown configuration: backend/protocol_rpc/health.py | Introduces module-level _no_progress_scan_suppressed_until timestamp and get_no_progress_scan_error_cooldown_seconds() env-backed helper to manage cooldown window duration for progress-scan suppression. |
| Health check logic with cooldown enforcement: backend/protocol_rpc/health.py | Integrates cooldown duration into consensus health setup, declares suppression state as global within consensus query function, initializes no_progress_scan_suppressed flag, implements conditional scan skipping during cooldown (marked as error), and updates error handling to set/reset suppression timestamp. |
| Health output propagation and test isolation: backend/protocol_rpc/health.py, tests/db-sqlalchemy/test_health_orphan_detection.py | Adds no_progress_scan_suppressed to cached consensus service payload and health response; updates test fixture to snapshot/restore suppression state for isolation; introduces regression test verifying cooldown suppression after timeout. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • genlayerlabs/genlayer-studio#1638: Modifies the "no-progress" consensus health-scan logic and extends regression tests for related timeout and suppression behavior.
  • genlayerlabs/genlayer-studio#1639: Modifies backend/protocol_rpc/health.py and test suite to add no_progress_check_error and cooldown/suppression logic for expensive progress queries.
  • genlayerlabs/genlayer-studio#1443: Modifies backend/protocol_rpc/endpoints.py to reject GenVM-backed calls with error code -32006 when capacity is exhausted using similar admission-rejection patterns.

Suggested labels

run-tests

Suggested reviewers

  • cristiam86

Poem

🐰 A rabbit built two gates today,
One bounds the GenVM's way—when full, it says "retry!"
The other calms the scanner's fray,
A cooldown pause when queries cry.
Both changes keep the burrow spry! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.31% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix: bound rpc genvm calls and health scans' accurately summarizes the main changes: adding admission control for GenVM RPC calls and implementing cooldown for health scans. |
| Description check | ✅ Passed | The PR description is well-structured with summary, incident fit, and tests sections, covering the main changes and their rationale, though some template sections like 'Decisions made' and 'Reviewing tips' are not filled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@MuncleUscles MuncleUscles force-pushed the codex/rpc-backpressure-health branch from 9c2bfdc to 592278c Compare May 29, 2026 13:15
@MuncleUscles MuncleUscles merged commit 76a1497 into main May 29, 2026
11 of 12 checks passed
@MuncleUscles MuncleUscles deleted the codex/rpc-backpressure-health branch May 29, 2026 13:21
github-actions (Contributor) commented

🎉 This PR is included in version 0.120.18 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
