Skip to content

Conversation

Copy link

Copilot AI commented Dec 22, 2025

During vSphere VM deletion, if vm.ShutdownGuest() fails on first attempt (VMware Tools not ready, transient API failure), the entire deletion fails. After 3 retries, Rancher abandons the deletion leaving orphaned VMs.

Changes

Unified retry loop with precise timing:

  • Single loop using time.Now() and deadline-based timing for accuracy
  • Continuously retries d.Stop() until GracefulShutdownTimeout is reached
  • Checks VM state on every iteration, allowing early exit even if shutdown initiation fails
  • Log each failed attempt at Debug level with error details

Timeout tracking:

  • Uses time.Since() to track elapsed time from start
  • Ensures total time never exceeds GracefulShutdownTimeout
  • Smart sleep logic that prevents exceeding the deadline

Fallback behavior:

  • Proceeds to d.Kill() (hard power off) if graceful shutdown can't be initiated within timeout
  • Also falls back if graceful shutdown doesn't complete within timeout

Before:

if err := d.Stop(); err != nil {
    return err  // Fails immediately
}
// wait for machine to stop...

After:

start := time.Now()
deadline := start.Add(time.Duration(d.GracefulShutdownTimeout) * time.Second)

for time.Now().Before(deadline) {
    if !gracefulShutdownInitiated {
        if err := d.Stop(); err != nil {
            // Log and retry
        } else {
            gracefulShutdownInitiated = true
        }
    }
    
    // Always check state for early exit
    if machineState == state.Stopped {
        break
    }
    
    // Sleep without exceeding deadline
    time.Sleep(...)
}
// If not stopped, proceed to d.Kill()

[!WARNING]

Original prompt

There is an issue with vSphere nodes deletion if GracefulShutdownTimeout is set but rancher/machine is failing to actually kick off graceful shutdown and then machine deletion fails. After 3 retries Rancher fails to delete machine and users

We want to update the logic to where we continue attempting to kick off graceful shutdown until 'GracefulShutdownTimeout' is met (and if we succeed to kick off graceful shutdown before 'GracefulShutdownTimeout' is met we wait for graceful shutdown to complete until total wait time between waiting to kick off and waiting for actual shutdown exceeds GracefulShutdownTimeout'). Worth nothing that if we fail to kick off graceful shutdown within GracefulShutdownTimeout we should move on to hard restart (d.Kill`).

Note graceful shutdown was implemented as part of #222.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 5 commits December 22, 2025 18:53
- Retry graceful shutdown initiation until GracefulShutdownTimeout is met
- Track total elapsed time for both retry attempts and waiting for shutdown completion
- If graceful shutdown can't be initiated within timeout, proceed to hard power off
- Improved logging to distinguish between failed initiation attempts and successful shutdown

Co-authored-by: snasovich <[email protected]>
- Test validates that timeout accepts 0 and positive values
- Test validates that negative timeout values are rejected
- Ensures existing validation logic is maintained

Co-authored-by: snasovich <[email protected]>
- Sleep before incrementing elapsed time for accurate timing
- Ensures elapsed time accurately reflects actual waiting time
- Second loop properly uses remaining timeout from first loop

Co-authored-by: snasovich <[email protected]>
All changes complete and validated:
- Retry logic properly implemented with accurate timing
- Tests pass successfully
- Code review completed and issues addressed
- Security scan passed with no alerts

Co-authored-by: snasovich <[email protected]>
Copilot AI changed the title [WIP] Update logic for graceful shutdown of vSphere nodes Retry graceful shutdown on vSphere VM deletion until timeout Dec 22, 2025
Copilot AI requested a review from snasovich December 22, 2025 19:00
@jferrazbr
Copy link

@copilot please update the description based on the latest changes pushed to the branch

Copy link
Author

Copilot AI commented Jan 13, 2026

@copilot please update the description based on the latest changes pushed to the branch

Updated the PR description to reflect the refactored implementation with improved timing using time.Now() and a single unified loop (commit 680b751).

@jferrazbr jferrazbr force-pushed the copilot/update-graceful-shutdown-logic branch from 680b751 to d1517d4 Compare January 14, 2026 14:10
@jferrazbr jferrazbr marked this pull request as ready for review January 14, 2026 16:03
@jferrazbr jferrazbr requested review from a team and jiaqiluo and removed request for jferrazbr and snasovich January 14, 2026 16:03
Copy link
Member

@jiaqiluo jiaqiluo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jiaqiluo jiaqiluo requested a review from a team January 14, 2026 17:56
@jferrazbr jferrazbr merged commit 8425450 into master Jan 14, 2026
1 check passed
@jferrazbr jferrazbr deleted the copilot/update-graceful-shutdown-logic branch January 14, 2026 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants