fix(e2e): add gateway connectivity retry before recovery approve#5375

Closed
hunglp6d wants to merge 1 commit into
NVIDIA:mainfrom
hunglp6d:fix/nightly-e2e-gateway-approve-retry-eb1f00d

@hunglp6d hunglp6d commented Jun 13, 2026

Summary

✨ [AI-generated PR]

The issue-4462-gateway-pinned-approval-characterization-e2e nightly job fails intermittently because the gateway WebSocket is transiently unreachable after the legacy approve characterization step deliberately provokes a failure. The recovery openclaw devices approve call then fires immediately, without waiting for the gateway, and fails with "gateway connect failed".

This PR adds a short polling loop (5 attempts x 2 s) that waits for device_state_json to succeed before issuing the recovery approve, giving the gateway time to stabilise.
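The polling loop described above can be sketched roughly as follows. This is an illustration, not the code from the PR: wait_for_gateway and its argument order are hypothetical names, and the probe command is passed in by the caller so the sketch makes no assumption about how device_state_json is invoked in the real test.

```shell
# Sketch only: wait_for_gateway is a hypothetical helper, not part of the PR.
# Usage: wait_for_gateway <attempts> <delay_seconds> <probe command...>
wait_for_gateway() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the probe quietly; a zero exit status means the gateway answered.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    echo "gateway not yet reachable (attempt ${i}/${attempts})" >&2
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

With 5 attempts and a 2 s delay, the helper waits at most about 8 s before giving up, since it does not sleep after the last attempt. A caller would run it with the probe as trailing arguments, for example wait_for_gateway 5 2 followed by whatever command checks the gateway, and only issue the recovery approve if it returns 0.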

Root Cause

Between the June 12 and June 13 nightlies, commits b747bfa and 09a5c69 hardened the recovery proxy-env sourcing path. The legacy approve's failed WebSocket request leaves the gateway briefly unstable; the test's recovery approve fires instantly without any backoff. This is a test-side timing issue, not a product bug.

Changes

  • test/e2e/test-issue-4462-scope-upgrade-approval.sh: Insert a gateway-readiness polling loop in legacy_gateway_pinned_approval_characterization() before the recovery approve_request call.

Nightly Run

  • Failing run: https://github.com/NVIDIA/NemoClaw/actions/runs/27450816965
  • Failing job: issue-4462-gateway-pinned-approval-characterization-e2e / run
  • Error: FAIL: recovery after legacy characterization: openclaw devices approve failed... gateway connect failed: G
  • Classification: infra_flake — transient gateway unreachability after deliberate failure

Test Plan

  • Re-run issue-4462-gateway-pinned-approval-characterization-e2e E2E job with this fix
  • Verify the retry loop logs "gateway not yet reachable" on transient failures
  • Confirm the recovery approve succeeds after the retry
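One way to check the second item of the test plan is to count the retry messages in a saved copy of the job log. The sketch below assumes the log has been downloaded to a local file; count_retry_lines and the file path are hypothetical, and only the message text comes from this PR.

```shell
# Sketch only: count_retry_lines is a hypothetical helper.
# Prints how many lines of the given log contain the retry message.
count_retry_lines() {
  local log_file="$1"
  # grep -c prints the number of matching lines. It exits non-zero when the
  # count is 0, so "|| true" keeps the helper usable under "set -e".
  grep -c -F 'gateway not yet reachable' "$log_file" || true
}
```

A count of 0 on a passing run means the gateway was reachable on the first probe. A non-zero count followed by a successful recovery approve shows that the retry absorbed a transient failure.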

Signed-off-by: Hung Le <hple@nvidia.com>

Fixes #5377

After the legacy gateway-pinned approve deliberately fails in the
issue-4462 characterization test, the gateway WebSocket can be
transiently unreachable. The immediate recovery approve then fails
with "gateway connect failed".

Add a short polling loop (5 × 2 s) that waits for device_state_json
to succeed before issuing the recovery approve, giving the gateway
time to stabilise after the failed request.

Signed-off-by: Hung Le <hple@nvidia.com>

copy-pr-bot Bot commented Jun 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 13, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8ad876d5-50c3-4b17-ac9a-a968cc5679b7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@hunglp6d

This is an AI-generated PR.
Closed because a fix was merged in #5412.

@hunglp6d hunglp6d closed this Jun 14, 2026
Development

Successfully merging this pull request may close these issues.

nightly-e2e: issue-4462 gateway-pinned approval E2E flakes on transient gateway unreachability