fix: dpf: Move machine into correct state on failure and allow restart of reprovisioning when dpf fails. #2978

Merged
abvarshney-nv merged 1 commit into NVIDIA:main from abvarshney-nv:reprov_dpf_restart
Jun 29, 2026
Conversation

@abvarshney-nv

Contributor

When a DPF provisioning failure occurred while a host was in ManagedHostState::Assigned, the state machine unconditionally transitioned to the top-level ManagedHostState::Failed, which lost the Assigned context.

Additionally, start_dpu_reprovision had no path to restart reprovisioning from a DpfProvisioning failure.
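
The intended state shaping can be sketched as follows. This is a minimal illustration, not the code from this PR: the enum and struct definitions are simplified stand-ins (the real ManagedHostState, InstanceState, and FailureDetails carry more data), the DpuInit variant is a hypothetical placeholder for any non-assigned state, and the helper name make_failure_state is borrowed from the review walkthrough with an assumed signature.

```rust
// Simplified, hypothetical stand-ins for the controller's state types.
#[derive(Debug, Clone, PartialEq)]
struct FailureDetails {
    cause: String,
}

#[derive(Debug, Clone, PartialEq)]
enum InstanceState {
    Ready,
    Failed { details: FailureDetails },
}

#[derive(Debug, Clone, PartialEq)]
enum ManagedHostState {
    // Placeholder for any state that is not `Assigned`.
    DpuInit,
    Assigned { instance_state: InstanceState },
    Failed { details: FailureDetails },
}

/// Pick the failure shape from the current state: stay inside `Assigned`
/// when the host is assigned, otherwise use the top-level `Failed`.
fn make_failure_state(current: &ManagedHostState, details: FailureDetails) -> ManagedHostState {
    match current {
        ManagedHostState::Assigned { .. } => ManagedHostState::Assigned {
            instance_state: InstanceState::Failed { details },
        },
        _ => ManagedHostState::Failed { details },
    }
}
```

In this sketch, a host that fails while Assigned stays Assigned with an embedded InstanceState::Failed, and a failure from any other state produces the top-level ManagedHostState::Failed.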

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@abvarshney-nv abvarshney-nv requested a review from a team as a code owner June 29, 2026 16:54
@abvarshney-nv abvarshney-nv requested a review from wminckler June 29, 2026 16:54
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9179a63-1925-4d17-93d9-192ca6ea1eca

📥 Commits

Reviewing files that changed from the base of the PR and between bafdc2d and 0918037.

📒 Files selected for processing (2)
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/dpf.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/machine-controller/src/handler/dpf.rs
  • crates/machine-controller/src/handler.rs

Summary by CodeRabbit

  • Bug Fixes
    • Improved DPU reprovision restart behavior by expanding the set of failure scenarios that are eligible for automatic recovery.
    • Adjusted restart criteria for BIOS setup and DPF provisioning failures to restart under the intended conditions.
    • Refined DPU failure-state transitions to preserve the existing assignment context when appropriate, improving recovery continuity.
  • Tests
    • Added/updated unit tests to validate restartable failure detection and confirm the expected resulting failure-state transitions.

Walkthrough

The PR updates DPF failure handling to preserve assigned-state failure shape during reprovision paths and expands the restartable failure predicate used when restarting DPU reprovision after failure.

Changes

DPU Reprovision Failure Routing

| Layer / File(s) | Summary |
| --- | --- |
| Failure state helper and caller updates (crates/machine-controller/src/handler/dpf.rs) | Adds make_failure_state to choose between top-level ManagedHostState::Failed and Assigned-embedded InstanceState::Failed, and routes DPF provisioning and reprovisioning failure paths through it. |
| Restartable failure predicate and tests (crates/machine-controller/src/handler.rs) | Replaces the reprovision restartability check with is_reprovision_restartable_failure, broadens restartable DpfProvisioning handling, wires it into start_dpu_reprovision, and adds unit coverage for the restartable cases. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main fix: correcting DPF failure state handling and enabling reprovision restart. |
| Description check | ✅ Passed | The description matches the code changes and the stated bug fix for DPF failure handling and reprovision restart. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/machine-controller/src/handler/dpf.rs (1)

230-268: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoff

The failure-shape selection here duplicates the inline logic in handler.rs.

make_failure_state reimplements the exact Assigned-vs-Failed selection already present in attempt_state_transition (handler.rs Lines 761–773). Two copies of this state-shaping rule will drift over time. Consider promoting make_failure_state to a shared location and calling it from both sites.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler/dpf.rs` around lines 230 - 268, The
failure-state branching in make_failure_state duplicates the Assigned-vs-Failed
logic already used by attempt_state_transition, so consolidate this
state-shaping rule into one shared helper. Move or expose make_failure_state in
a common location used by both dpf_cr_creation_failed and
attempt_state_transition, and have both call the same helper so the
ManagedHostState selection stays consistent in one place.
crates/machine-controller/src/handler.rs (1)

11705-11758: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Express this as a table-driven test.

is_reprovision_restartable_failure is a total operation returning bool, enumerated here through a sequence of standalone assert!s. The repository conventions direct such cases to value_scenarios! / check_values, which keeps each case labeled and makes adding a row (e.g. host-attributed BiosSetupFailed with a non-MainFlow source) a one-liner. The make_failed closure already isolates the one operation under test, so the table maps cleanly.

As per coding guidelines: "Use value_scenarios! for total operations (those returning a plain value, Option, or bool)." As per path instructions: "Prefer carbide-test-support scenarios ... or explicit cases ... for test coverage in Rust."
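
A table-driven shape for such a test might look like the sketch below. It uses only the standard library, because the exact syntax of the repository's value_scenarios! and check_values helpers is not shown in this thread. The types, the AttributedTo model, and the predicate body are simplified assumptions rather than the merged logic, and any FailureSource conditions are left out.

```rust
// Hypothetical, simplified failure model; not the crate's real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FailureCause {
    BiosSetupFailed,
    DpfProvisioning,
    Other,
}

// Which machine the failure was recorded against (assumed model).
#[derive(Debug, Clone, Copy, PartialEq)]
enum AttributedTo {
    Host,
    Dpu,
}

// Stand-in predicate: restartable only for host-attributed failures with
// one of the two causes named in the review. The real rule may differ.
fn is_reprovision_restartable_failure(who: AttributedTo, cause: FailureCause) -> bool {
    who == AttributedTo::Host
        && matches!(cause, FailureCause::BiosSetupFailed | FailureCause::DpfProvisioning)
}

// Each row is (label, attribution, cause, expected result).
const CASES: &[(&str, AttributedTo, FailureCause, bool)] = &[
    ("host + DpfProvisioning", AttributedTo::Host, FailureCause::DpfProvisioning, true),
    ("host + BiosSetupFailed", AttributedTo::Host, FailureCause::BiosSetupFailed, true),
    ("host + unrelated cause", AttributedTo::Host, FailureCause::Other, false),
    ("dpu + DpfProvisioning", AttributedTo::Dpu, FailureCause::DpfProvisioning, false),
];

// Returns the labels of all rows whose actual result differs from the expectation.
fn failing_cases() -> Vec<&'static str> {
    CASES
        .iter()
        .filter(|(_, who, cause, expected)| {
            is_reprovision_restartable_failure(*who, *cause) != *expected
        })
        .map(|(label, ..)| *label)
        .collect()
}
```

A test would then assert that failing_cases() is empty. Adding a new scenario is a single new row in CASES, and a failing row is reported by its label.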

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 11705 - 11758, Convert
the standalone boolean assertions in
is_reprovision_restartable_failure_matches_expected_causes into a table-driven
test using value_scenarios! and check_values, since
is_reprovision_restartable_failure is a total bool-returning operation. Keep the
existing make_failed helper, then express each host/DPU and FailureSource
combination as labeled rows so the expected restartable/non-restartable result
is obvious and easy to extend with additional cases.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/machine-controller/src/handler/dpf.rs`:
- Around line 374-388: The DpfProvisioning failure path is attributing the error
to the DPU ID, which prevents restartable handling because
is_reprovision_restartable_failure only matches the host machine ID. Update the
FailureDetails construction in dpf handler so the DpfProvisioning branch uses
state.host_snapshot.id when calling make_failure_state, consistent with the
sibling branches that already attribute restartable failures to the host.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45a7c493-e960-45ae-8fbc-e97cd7b0f17e

📥 Commits

Reviewing files that changed from the base of the PR and between 5408f4d and a1ba8ee.

📒 Files selected for processing (2)
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/dpf.rs

Comment thread crates/machine-controller/src/handler/dpf.rs
    let details = FailureDetails {
        cause: FailureCause::DpfProvisioning {
            err: format!(
                "DPUDevice/DPUNode creation failed. Force-delete again to clean old values. Wait until DPU CR are deleted. {err}"

Contributor

do you still want the error to say to force-delete the machine?

Contributor Author

For ingestion, yes; for reprovisioning, no. Let me fix this as well.

Contributor Author

updated

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/machine-controller/src/handler.rs (1)

1869-1888: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Clear persisted failure details before leaving ManagedHostState::Failed.

This branch transitions out of Failed without the failure-detail cleanup that the other recovery paths do. On the next controller iteration, the pre-dispatch get_failed_state(mh_snapshot) check can immediately promote the host back to ManagedHostState::Failed, so the restarted reprovision never makes forward progress.

Please clear the relevant failure record(s) as part of this restart path before returning the transition.
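
Why a stale record blocks recovery can be seen in a toy model of the controller loop. Everything here is a hypothetical simplification: Snapshot, failure_record, step, and restart_reprovision stand in for the controller's real snapshot, persisted failure details, iteration logic, and restart path, and get_failed_state is reduced to a lookup on the record.

```rust
#[derive(Debug, Clone, PartialEq)]
enum State {
    Failed,
    Reprovisioning,
}

// Toy snapshot: the current state plus a persisted failure record, if any.
#[derive(Debug, Clone)]
struct Snapshot {
    state: State,
    failure_record: Option<String>,
}

// Pre-dispatch check: any persisted failure record forces the Failed state.
fn get_failed_state(snapshot: &Snapshot) -> Option<State> {
    snapshot.failure_record.as_ref().map(|_| State::Failed)
}

// One controller iteration: the failure check runs before normal dispatch.
fn step(snapshot: &mut Snapshot) {
    if let Some(failed) = get_failed_state(snapshot) {
        snapshot.state = failed;
    }
    // Normal per-state handling would go here.
}

// Restart reprovisioning; `clear_record` models the cleanup the review asks for.
fn restart_reprovision(snapshot: &mut Snapshot, clear_record: bool) {
    if clear_record {
        snapshot.failure_record = None;
    }
    snapshot.state = State::Reprovisioning;
}

// Run a restart followed by one more iteration and report the resulting state.
fn state_after_restart(clear_record: bool) -> State {
    let mut snapshot = Snapshot {
        state: State::Failed,
        failure_record: Some("DpfProvisioning".to_string()),
    };
    restart_reprovision(&mut snapshot, clear_record);
    step(&mut snapshot);
    snapshot.state
}
```

In this model, state_after_restart(false) ends in State::Failed again because the leftover record trips the pre-dispatch check, while state_after_restart(true) stays in State::Reprovisioning.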

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 1869 - 1888, Clear the
persisted failure details in this recovery branch before transitioning out of
ManagedHostState::Failed, since the next loop’s get_failed_state(mh_snapshot)
check can immediately re-fail the host. Update the restart path in handler.rs
around
ReprovisionState::next_substate_based_on_bfb_support(...).next_state_with_all_dpus_updated(...)
to perform the same failure-record cleanup used by the other recovery paths
before returning next_state.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 1929-1940: The restartability check in the reprovision helper is
too broad because it treats every ManagedHostState::Failed with
FailureCause::DpfProvisioning as reprovision-restartable. Update the match near
start_dpu_reprovision/its helper logic in handler.rs to only allow
reprovision-specific failures, either by checking additional provenance in
FailureDetails or by carrying explicit reprovision context in the failure state.
Add a negative test covering a non-reprovision DpfProvisioning failure to ensure
it is not considered restartable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4aa4a23b-808a-4f25-bee8-eb65d5a910d8

📥 Commits

Reviewing files that changed from the base of the PR and between a1ba8ee and bafdc2d.

📒 Files selected for processing (2)
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/dpf.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/machine-controller/src/handler/dpf.rs

Comment thread crates/machine-controller/src/handler.rs
Comment thread crates/machine-controller/src/handler/dpf.rs
@abvarshney-nv abvarshney-nv enabled auto-merge (squash) June 29, 2026 17:38
@github-actions


🔍 Container Scan Summary

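| Service | Total | Critical | High | Medium | Low | Other |
| --- | --- | --- | --- | --- | --- | --- |
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 288 | 6 | 26 | 105 | 7 | 144 |
| machine-validation-runner | 751 | 30 | 190 | 274 | 36 | 221 |
| machine_validation | 751 | 30 | 190 | 274 | 36 | 221 |
| machine_validation-aarch64 | 751 | 30 | 190 | 274 | 36 | 221 |
| nvmetal-carbide | 751 | 30 | 190 | 274 | 36 | 221 |
| TOTAL | 3298 | 126 | 786 | 1207 | 151 | 1028 |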

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@abvarshney-nv abvarshney-nv merged commit 257c38c into NVIDIA:main Jun 29, 2026
56 checks passed
@abvarshney-nv abvarshney-nv deleted the reprov_dpf_restart branch June 30, 2026 03:39
Development

Successfully merging this pull request may close these issues.

bug: NICo does not restart DPU reprovisioning if machine stuck in Failed/DpfProvisioning state

2 participants