Skip to content

Fix router HTTP timeout connection cleanup#1115

Open
fallintoplace wants to merge 1 commit into
NVIDIA:mainfrom
fallintoplace:fix/router-http-timeout-connection-cleanup
Open

Fix router HTTP timeout connection cleanup#1115
fallintoplace wants to merge 1 commit into
NVIDIA:mainfrom
fallintoplace:fix/router-http-timeout-connection-cleanup

Conversation

@fallintoplace

@fallintoplace fallintoplace commented Jun 20, 2026

Copy link
Copy Markdown

Summary

Fix a router webserver HTTP timeout path that left pending `PORTFORWARD-*` entries in the in-memory connection registry when the backend never connected.

Details

`webserver_http_request()` allocates a router connection before asking the control websocket to open the backend port-forward. If the backend connection times out, the request returns `504` but previously left the pending entry behind. Repeated backend timeouts could accumulate stale connection records.

This change removes the pending connection on timeout and on the defensive early failure path after allocation. The backend websocket cleanup now uses an idempotent `pop()` as well, so a cleanup race does not turn into a `KeyError`.

Validation

  • `/Users/hoangvu/go/bin/bazelisk test //src/service/router:test_router`
  • `/Users/hoangvu/go/bin/bazelisk test //src/service/router:router_lib-pylint //src/service/router:test_router-pylint`

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of backend connection timeouts with enhanced cleanup of temporary connections to prevent resource leaks.
  • Tests

    • Added comprehensive unit tests for backend timeout scenarios and connection management behavior.

@fallintoplace fallintoplace requested a review from a team as a code owner June 20, 2026 12:14
@github-actions github-actions Bot added the external The author is not in @NVIDIA/osmo-dev label Jun 20, 2026
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a _remove_router_connection(key) helper in router.py that centralizes cleanup of the connections map and signals wait_close. The helper is invoked in the backend websocket timeout handler and in both the 504 and 404 exit paths of the HTTP forwarding function. A new test_router.py unit test and matching Bazel target verify the 504 cleanup behavior under a forced asyncio.TimeoutError.

Changes

Router timeout cleanup and tests

Layer / File(s) Summary
Connection cleanup helper and call sites
src/service/router/router.py
Adds _remove_router_connection(key) that pops from connections and signals wait_close; replaces unsafe del connections[key] in the WS timeout handler with connections.pop(key, None); calls _remove_router_connection before the 504 return on HTTP timeout and before the 404 return on missing websocket/close handle.
Unit test and Bazel wiring
src/service/router/test_router.py, src/service/router/BUILD
Adds WebserverHttpRequestTestCase with fake request/websocket helpers and a test that patches asyncio.wait_for to force TimeoutError, then asserts the 504 response, connection key removal from router.connections, and the exact JSON payload sent to the control websocket. Registers the test via a new osmo_py_test target in BUILD.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A key left dangling caused such a fright,
So I popped it away with pop(key, None) right.
The wait_close now signals, the map is all clear,
A 504 tested — no ghost connections here.
Hop hop, the router is tidy tonight! 🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix router HTTP timeout connection cleanup' directly and accurately summarizes the main change: addressing connection cleanup issues in router HTTP timeout handling.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/service/router/router.py (1)

102-107: ⚡ Quick win

Add return type annotation to the helper function.

The _remove_router_connection function is missing a return type annotation. As per coding guidelines, type annotations should be added where they improve clarity.

📝 Proposed fix
-def _remove_router_connection(key: str):
+def _remove_router_connection(key: str) -> None:
     connection = connections.pop(key, None)
     if connection and connection.wait_close:
         connection.wait_close.set()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/service/router/router.py` around lines 102 - 107, The
_remove_router_connection function is missing a return type annotation. Add the
return type annotation -> None to the function signature since the function does
not return any value, only performs operations on the connection object
retrieved from the connections dictionary.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/service/router/test_router.py`:
- Around line 34-37: The cookies dictionary is currently defined as a class
attribute in the _FakeRequest class, which causes all instances to share the
same mutable dictionary. Move the cookies dictionary definition from the class
level into an __init__ method of the _FakeRequest class to make it an instance
attribute instead, ensuring each instance has its own independent copy of the
cookies dictionary.

---

Nitpick comments:
In `@src/service/router/router.py`:
- Around line 102-107: The _remove_router_connection function is missing a
return type annotation. Add the return type annotation -> None to the function
signature since the function does not return any value, only performs operations
on the connection object retrieved from the connections dictionary.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 256030e3-d13e-450f-aacb-e83df1dc01dd

📥 Commits

Reviewing files that changed from the base of the PR and between cc7e0bf and a0d0ccb.

📒 Files selected for processing (3)
  • src/service/router/BUILD
  • src/service/router/router.py
  • src/service/router/test_router.py

Comment on lines +34 to +37
cookies = {
'_osmo_router_affinity': 'sticky-session',
'ignored': 'cookie',
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use instance attribute for mutable cookies dictionary.

The cookies dictionary is defined as a class attribute, which means all instances of _FakeRequest share the same dict. If multiple instances were created and one modified cookies, all would see the change. While this test only creates one instance, it's better practice to make mutable data instance attributes.

🔧 Proposed fix
 class _FakeRequest:
-    cookies = {
-        '_osmo_router_affinity': 'sticky-session',
-        'ignored': 'cookie',
-    }
+    def __init__(self):
+        self.cookies = {
+            '_osmo_router_affinity': 'sticky-session',
+            'ignored': 'cookie',
+        }
 
     async def body(self):
         return b''
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 34-37: Mutable default value for class attribute

(RUF012)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/service/router/test_router.py` around lines 34 - 37, The cookies
dictionary is currently defined as a class attribute in the _FakeRequest class,
which causes all instances to share the same mutable dictionary. Move the
cookies dictionary definition from the class level into an __init__ method of
the _FakeRequest class to make it an instance attribute instead, ensuring each
instance has its own independent copy of the cookies dictionary.

Source: Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external The author is not in @NVIDIA/osmo-dev

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant