fix: dpf: Move machine into correct state on failure and allow restart of reprovisioning when dpf fails. by abvarshney-nv · Pull Request #2978 · NVIDIA/infra-controller

abvarshney-nv · 2026-06-29T16:54:19Z

When a DPF provisioning failure occurred while a host was in ManagedHostState::Assigned, the state machine was unconditionally transitioning to top-level ManagedHostState::Failed. This lost the Assigned context.

Additionally, start_dpu_reprovision had no path to restart reprovisioning from a DpfProvisioning failure.
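
The state shaping described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `ManagedHostState`, `InstanceState`, `FailureDetails`, and `make_failure_state` appear in the PR, but the type definitions, fields, and function signature below are simplified assumptions.

```rust
// Simplified stand-ins for the controller's types. The real enums carry far
// more data; the fields here are assumptions made for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct FailureDetails {
    cause: String,
}

#[derive(Debug, Clone, PartialEq)]
enum InstanceState {
    Ready,
    Failed { details: FailureDetails },
}

#[derive(Debug, Clone, PartialEq)]
enum ManagedHostState {
    Ready,
    Assigned { instance_state: InstanceState },
    Failed { details: FailureDetails },
}

/// Pick the failure shape from the state the host is currently in.
/// A host that is `Assigned` stays `Assigned` and records the failure in its
/// embedded instance state; any other host moves to the top-level `Failed`.
fn make_failure_state(current: &ManagedHostState, details: FailureDetails) -> ManagedHostState {
    match current {
        ManagedHostState::Assigned { .. } => ManagedHostState::Assigned {
            instance_state: InstanceState::Failed { details },
        },
        _ => ManagedHostState::Failed { details },
    }
}
```

With this shape, a failure on an assigned host keeps the `Assigned` wrapper, so later recovery logic can still tell that the host belongs to an instance.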


coderabbitai · 2026-06-29T16:54:40Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9179a63-1925-4d17-93d9-192ca6ea1eca

📥 Commits

Reviewing files that changed from the base of the PR and between bafdc2d and 0918037.

📒 Files selected for processing (2)

crates/machine-controller/src/handler.rs
crates/machine-controller/src/handler/dpf.rs

🚧 Files skipped from review as they are similar to previous changes (2)

crates/machine-controller/src/handler/dpf.rs
crates/machine-controller/src/handler.rs

Summary by CodeRabbit

Bug Fixes
- Improved DPU reprovision restart behavior by expanding the set of failure scenarios that are eligible for automatic recovery.
- Adjusted restart criteria for BIOS setup and DPF provisioning failures to restart under the intended conditions.
- Refined DPU failure-state transitions to preserve the existing assignment context when appropriate, improving recovery continuity.
Tests
- Added/updated unit tests to validate restartable failure detection and confirm the expected resulting failure-state transitions.

Walkthrough

The PR updates DPF failure handling to preserve assigned-state failure shape during reprovision paths and expands the restartable failure predicate used when restarting DPU reprovision after failure.

Changes

DPU Reprovision Failure Routing

| Layer / File(s) | Summary |
| --- | --- |
| Failure state helper and caller updates (`crates/machine-controller/src/handler/dpf.rs`) | Adds `make_failure_state` to choose between top-level `ManagedHostState::Failed` and `Assigned`-embedded `InstanceState::Failed`, and routes DPF provisioning and reprovisioning failure paths through it. |
| Restartable failure predicate and tests (`crates/machine-controller/src/handler.rs`) | Replaces the reprovision restartability check with `is_reprovision_restartable_failure`, broadens restartable `DpfProvisioning` handling, wires it into `start_dpu_reprovision`, and adds unit coverage for the restartable cases. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main fix: correcting DPF failure state handling and enabling reprovision restart. |
| Description check | ✅ Passed | The description matches the code changes and the stated bug fix for DPF failure handling and reprovision restart. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

crates/machine-controller/src/handler/dpf.rs (1)
230-268: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoff

The failure-shape selection here duplicates the inline logic in handler.rs.

make_failure_state reimplements the exact Assigned-vs-Failed selection already present in attempt_state_transition (handler.rs Lines 761–773). Two copies of this state-shaping rule will drift over time. Consider promoting make_failure_state to a shared location and calling it from both sites.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler/dpf.rs` around lines 230 - 268, The
failure-state branching in make_failure_state duplicates the Assigned-vs-Failed
logic already used by attempt_state_transition, so consolidate this
state-shaping rule into one shared helper. Move or expose make_failure_state in
a common location used by both dpf_cr_creation_failed and
attempt_state_transition, and have both call the same helper so the
ManagedHostState selection stays consistent in one place.
crates/machine-controller/src/handler.rs (1)
11705-11758: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Express this as a table-driven test.

is_reprovision_restartable_failure is a total operation returning bool, enumerated here through a sequence of standalone assert!s. The repository conventions direct such cases to value_scenarios! / check_values, which keeps each case labeled and makes adding a row (e.g. host-attributed BiosSetupFailed with a non-MainFlow source) a one-liner. The make_failed closure already isolates the one operation under test, so the table maps cleanly.

As per coding guidelines: "Use value_scenarios! for total operations (those returning a plain value, Option, or bool)." As per path instructions: "Prefer carbide-test-support scenarios ... or explicit cases ... for test coverage in Rust."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 11705 - 11758, Convert
the standalone boolean assertions in
is_reprovision_restartable_failure_matches_expected_causes into a table-driven
test using value_scenarios! and check_values, since
is_reprovision_restartable_failure is a total bool-returning operation. Keep the
existing make_failed helper, then express each host/DPU and FailureSource
combination as labeled rows so the expected restartable/non-restartable result
is obvious and easy to extend with additional cases.
Sources: Coding guidelines, Path instructions
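
For readers unfamiliar with the pattern, a table-driven test can be sketched in plain Rust as below. This does not use the repository's `value_scenarios!` / `check_values` helpers, whose API is not shown in this thread; the `Attribution` and `Cause` enums, the `is_restartable` rule, and the helper names are all hypothetical stand-ins.

```rust
// Stand-ins for the controller's types; the variants are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Attribution {
    Host,
    Dpu,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Cause {
    DpfProvisioning,
    BiosSetupFailed,
    Other,
}

// Hypothetical rule used only to give the table something to check:
// host-attributed DPF provisioning or BIOS setup failures are restartable.
fn is_restartable(attribution: Attribution, cause: Cause) -> bool {
    attribution == Attribution::Host
        && matches!(cause, Cause::DpfProvisioning | Cause::BiosSetupFailed)
}

// One labeled row per case: (label, attribution, cause, expected result).
fn cases() -> Vec<(&'static str, Attribution, Cause, bool)> {
    vec![
        ("host DpfProvisioning", Attribution::Host, Cause::DpfProvisioning, true),
        ("host BiosSetupFailed", Attribution::Host, Cause::BiosSetupFailed, true),
        ("DPU DpfProvisioning", Attribution::Dpu, Cause::DpfProvisioning, false),
        ("host unrelated cause", Attribution::Host, Cause::Other, false),
    ]
}

// Labels of the rows whose actual result differs from the expected one.
fn failing_rows() -> Vec<&'static str> {
    cases()
        .into_iter()
        .filter(|(_, attribution, cause, expected)| {
            is_restartable(*attribution, *cause) != *expected
        })
        .map(|(label, ..)| label)
        .collect()
}

#[test]
fn restartable_failure_table() {
    assert!(failing_rows().is_empty(), "failing rows: {:?}", failing_rows());
}
```

Adding a case is then a single new row in `cases()`, and a failure reports the row's label rather than an anonymous `assert!`.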

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/machine-controller/src/handler/dpf.rs`:
- Around line 374-388: The DpfProvisioning failure path is attributing the error
to the DPU ID, which prevents restartable handling because
is_reprovision_restartable_failure only matches the host machine ID. Update the
FailureDetails construction in dpf handler so the DpfProvisioning branch uses
state.host_snapshot.id when calling make_failure_state, consistent with the
sibling branches that already attribute restartable failures to the host.
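
To illustrate why the attributed machine ID matters, here is a minimal sketch under stated assumptions. `FailureCause::DpfProvisioning { err }` and the predicate name come from this PR; `MachineId`, `FailedState`, and the predicate's signature and body are hypothetical simplifications, not the controller's actual code.

```rust
// Hypothetical, simplified types for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct MachineId(String);

#[derive(Debug, Clone, PartialEq)]
enum FailureCause {
    DpfProvisioning { err: String },
    Other,
}

#[derive(Debug, Clone, PartialEq)]
struct FailedState {
    // The machine the failure is recorded against (host or DPU).
    machine_id: MachineId,
    cause: FailureCause,
}

/// Assumed shape of the check: the failure is restartable only when it is
/// recorded against the host and the cause is a DPF provisioning failure.
fn is_reprovision_restartable_failure(failed: &FailedState, host_id: &MachineId) -> bool {
    failed.machine_id == *host_id
        && matches!(failed.cause, FailureCause::DpfProvisioning { .. })
}
```

Under this assumed rule, a `DpfProvisioning` failure recorded against the DPU ID never matches, which is the behavior the comment above warns about.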

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45a7c493-e960-45ae-8fbc-e97cd7b0f17e

📥 Commits

Reviewing files that changed from the base of the PR and between 5408f4d and a1ba8ee.

📒 Files selected for processing (2)

crates/machine-controller/src/handler.rs
crates/machine-controller/src/handler/dpf.rs

wminckler · 2026-06-29T17:17:02Z

+    let details = FailureDetails {
+        cause: FailureCause::DpfProvisioning {
+            err: format!(
+                "DPUDevice/DPUNode creation failed. Force-delete again to clean old values. Wait until DPU CR are deleted. {err}"


Do you still want the error to say to force-delete the machine?

abvarshney-nv: For ingestion, yes. For reprovisioning, no. Let me fix this as well.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/machine-controller/src/handler.rs (1)
1869-1888: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Clear persisted failure details before leaving ManagedHostState::Failed.

This branch transitions out of Failed without the failure-detail cleanup that the other recovery paths do. On the next controller iteration, the pre-dispatch get_failed_state(mh_snapshot) check can immediately promote the host back to ManagedHostState::Failed, so the restarted reprovision never makes forward progress.

Please clear the relevant failure record(s) as part of this restart path before returning the transition.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 1869 - 1888, Clear the
persisted failure details in this recovery branch before transitioning out of
ManagedHostState::Failed, since the next loop’s get_failed_state(mh_snapshot)
check can immediately re-fail the host. Update the restart path in handler.rs
around
ReprovisionState::next_substate_based_on_bfb_support(...).next_state_with_all_dpus_updated(...)
to perform the same failure-record cleanup used by the other recovery paths
before returning next_state.
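
The re-fail loop described in this comment can be modeled with a small sketch. Only the name `get_failed_state` comes from the review; its signature here, the `Snapshot` and `State` types, and the `clear_record` switch are assumptions introduced to compare the two behaviors side by side.

```rust
// Hypothetical, heavily simplified model of the controller loop.
#[derive(Debug, Clone, PartialEq)]
enum State {
    Failed,
    Reprovisioning,
}

#[derive(Debug)]
struct Snapshot {
    // Persisted failure details, if any.
    failure_record: Option<String>,
}

/// Pre-dispatch check: any persisted failure record forces `Failed`.
fn get_failed_state(snapshot: &Snapshot) -> Option<State> {
    snapshot.failure_record.as_ref().map(|_| State::Failed)
}

/// Restart path out of `Failed`. `clear_record` toggles the cleanup so that
/// both behaviors can be compared.
fn restart_reprovision(snapshot: &mut Snapshot, clear_record: bool) -> State {
    if clear_record {
        snapshot.failure_record = None;
    }
    State::Reprovisioning
}

/// One controller iteration: the pre-dispatch check wins over the state the
/// previous iteration asked for.
fn next_iteration(snapshot: &Snapshot, requested: State) -> State {
    get_failed_state(snapshot).unwrap_or(requested)
}
```

In this model, restarting without clearing the record puts the host straight back into `Failed` on the next iteration, while clearing it first lets `Reprovisioning` proceed.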

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 1929-1940: The restartability check in the reprovision helper is
too broad because it treats every ManagedHostState::Failed with
FailureCause::DpfProvisioning as reprovision-restartable. Update the match near
start_dpu_reprovision/its helper logic in handler.rs to only allow
reprovision-specific failures, either by checking additional provenance in
FailureDetails or by carrying explicit reprovision context in the failure state.
Add a negative test covering a non-reprovision DpfProvisioning failure to ensure
it is not considered restartable.
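
One way to carry the provenance this comment asks for is an explicit origin field on the failure details. The sketch below is an assumption, not what the PR implements: `FailureOrigin`, the `origin` field, and the `is_restartable` helper are hypothetical, and only `FailureDetails` and `FailureCause::DpfProvisioning { err }` are names taken from the thread.

```rust
// Hypothetical provenance marker; not part of the PR as shown.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FailureOrigin {
    Ingestion,
    Reprovision,
}

#[derive(Debug, Clone, PartialEq)]
enum FailureCause {
    DpfProvisioning { err: String },
    Other,
}

#[derive(Debug, Clone, PartialEq)]
struct FailureDetails {
    cause: FailureCause,
    // Assumed extra field recording which flow produced the failure.
    origin: FailureOrigin,
}

/// Restartable only for DPF provisioning failures raised during reprovision.
fn is_restartable(details: &FailureDetails) -> bool {
    matches!(details.cause, FailureCause::DpfProvisioning { .. })
        && details.origin == FailureOrigin::Reprovision
}
```

The negative test the comment requests then reduces to asserting that an ingestion-time `DpfProvisioning` failure is rejected by the predicate.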

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4aa4a23b-808a-4f25-bee8-eb65d5a910d8

📥 Commits

Reviewing files that changed from the base of the PR and between a1ba8ee and bafdc2d.

📒 Files selected for processing (2)

crates/machine-controller/src/handler.rs
crates/machine-controller/src/handler/dpf.rs

🚧 Files skipped from review as they are similar to previous changes (1)

crates/machine-controller/src/handler/dpf.rs

github-actions · 2026-06-29T18:18:34Z

🔍 Container Scan Summary

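| Service | Total | Critical | High | Medium | Low | Other |
| --- | --- | --- | --- | --- | --- | --- |
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 288 | 6 | 26 | 105 | 7 | 144 |
| machine-validation-runner | 751 | 30 | 190 | 274 | 36 | 221 |
| machine_validation | 751 | 30 | 190 | 274 | 36 | 221 |
| machine_validation-aarch64 | 751 | 30 | 190 | 274 | 36 | 221 |
| nvmetal-carbide | 751 | 30 | 190 | 274 | 36 | 221 |
| TOTAL | 3298 | 126 | 786 | 1207 | 151 | 1028 |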

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

abvarshney-nv requested a review from a team as a code owner June 29, 2026 16:54

abvarshney-nv requested a review from wminckler June 29, 2026 16:54

abvarshney-nv linked an issue Jun 29, 2026 that may be closed by this pull request: "bug: NICo does not restart DPU reprovisioning if machine stuck in Failed/DpfProvisioning state" #2834 (Open, 2 tasks)

coderabbitai Bot reviewed Jun 29, 2026 (comment thread on crates/machine-controller/src/handler/dpf.rs)

abvarshney-nv force-pushed the reprov_dpf_restart branch from a1ba8ee to bafdc2d June 29, 2026 17:15

wminckler reviewed Jun 29, 2026

coderabbitai Bot reviewed Jun 29, 2026 (comment thread on crates/machine-controller/src/handler.rs)

Commit 0918037: fix: dpf: Move machine into correct state on failure and allow restart of reprovisioning when dpf fails.

abvarshney-nv force-pushed the reprov_dpf_restart branch from bafdc2d to 0918037 June 29, 2026 17:21

wminckler reviewed Jun 29, 2026 (comment thread on crates/machine-controller/src/handler/dpf.rs)

wminckler approved these changes Jun 29, 2026

abvarshney-nv enabled auto-merge (squash) June 29, 2026 17:38

abvarshney-nv merged commit 257c38c into NVIDIA:main Jun 29, 2026
56 checks passed

abvarshney-nv deleted the reprov_dpf_restart branch June 30, 2026 03:39
