fix: dpf: Move machine into correct state on failure and allow restart of reprovisioning when dpf fails.#2978
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
Summary by CodeRabbit
WalkthroughThe PR updates DPF failure handling to preserve assigned-state failure shape during reprovision paths and expands the restartable failure predicate used when restarting DPU reprovision after failure. ChangesDPU Reprovision Failure Routing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
crates/machine-controller/src/handler/dpf.rs (1)
230-268: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoffThe failure-shape selection here duplicates the inline logic in
handler.rs.
make_failure_statereimplements the exactAssigned-vs-Failedselection already present inattempt_state_transition(handler.rsLines 761–773). Two copies of this state-shaping rule will drift over time. Consider promotingmake_failure_stateto a shared location and calling it from both sites.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler/dpf.rs` around lines 230 - 268, The failure-state branching in make_failure_state duplicates the Assigned-vs-Failed logic already used by attempt_state_transition, so consolidate this state-shaping rule into one shared helper. Move or expose make_failure_state in a common location used by both dpf_cr_creation_failed and attempt_state_transition, and have both call the same helper so the ManagedHostState selection stays consistent in one place.crates/machine-controller/src/handler.rs (1)
11705-11758: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winExpress this as a table-driven test.
is_reprovision_restartable_failureis a total operation returningbool, enumerated here through a sequence of standaloneassert!s. The repository conventions direct such cases tovalue_scenarios!/check_values, which keeps each case labeled and makes adding a row (e.g. host-attributedBiosSetupFailedwith a non-MainFlowsource) a one-liner. Themake_failedclosure already isolates the one operation under test, so the table maps cleanly.As per coding guidelines: "Use
value_scenarios!for total operations (those returning a plain value,Option, orbool)." As per path instructions: "Prefercarbide-test-supportscenarios ... or explicit cases ... for test coverage in Rust."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler.rs` around lines 11705 - 11758, Convert the standalone boolean assertions in is_reprovision_restartable_failure_matches_expected_causes into a table-driven test using value_scenarios! and check_values, since is_reprovision_restartable_failure is a total bool-returning operation. Keep the existing make_failed helper, then express each host/DPU and FailureSource combination as labeled rows so the expected restartable/non-restartable result is obvious and easy to extend with additional cases.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/machine-controller/src/handler/dpf.rs`:
- Around line 374-388: The DpfProvisioning failure path is attributing the error
to the DPU ID, which prevents restartable handling because
is_reprovision_restartable_failure only matches the host machine ID. Update the
FailureDetails construction in dpf handler so the DpfProvisioning branch uses
state.host_snapshot.id when calling make_failure_state, consistent with the
sibling branches that already attribute restartable failures to the host.
---
Nitpick comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 11705-11758: Convert the standalone boolean assertions in
is_reprovision_restartable_failure_matches_expected_causes into a table-driven
test using value_scenarios! and check_values, since
is_reprovision_restartable_failure is a total bool-returning operation. Keep the
existing make_failed helper, then express each host/DPU and FailureSource
combination as labeled rows so the expected restartable/non-restartable result
is obvious and easy to extend with additional cases.
In `@crates/machine-controller/src/handler/dpf.rs`:
- Around line 230-268: The failure-state branching in make_failure_state
duplicates the Assigned-vs-Failed logic already used by
attempt_state_transition, so consolidate this state-shaping rule into one shared
helper. Move or expose make_failure_state in a common location used by both
dpf_cr_creation_failed and attempt_state_transition, and have both call the same
helper so the ManagedHostState selection stays consistent in one place.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 45a7c493-e960-45ae-8fbc-e97cd7b0f17e
📒 Files selected for processing (2)
crates/machine-controller/src/handler.rscrates/machine-controller/src/handler/dpf.rs
a1ba8ee to
bafdc2d
Compare
| let details = FailureDetails { | ||
| cause: FailureCause::DpfProvisioning { | ||
| err: format!( | ||
| "DPUDevice/DPUNode creation failed. Force-delete again to clean old values. Wait until DPU CR are deleted. {err}" |
There was a problem hiding this comment.
do you still want the error to say to force-delete the machine?
There was a problem hiding this comment.
for ingestion yes. for reprov no. Let me fix this as well.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/machine-controller/src/handler.rs (1)
1869-1888: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winClear persisted failure details before leaving
ManagedHostState::Failed.This branch transitions out of
Failedwithout the failure-detail cleanup that the other recovery paths do. On the next controller iteration, the pre-dispatchget_failed_state(mh_snapshot)check can immediately promote the host back toManagedHostState::Failed, so the restarted reprovision never makes forward progress.Please clear the relevant failure record(s) as part of this restart path before returning the transition.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler.rs` around lines 1869 - 1888, Clear the persisted failure details in this recovery branch before transitioning out of ManagedHostState::Failed, since the next loop’s get_failed_state(mh_snapshot) check can immediately re-fail the host. Update the restart path in handler.rs around ReprovisionState::next_substate_based_on_bfb_support(...).next_state_with_all_dpus_updated(...) to perform the same failure-record cleanup used by the other recovery paths before returning next_state.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 1929-1940: The restartability check in the reprovision helper is
too broad because it treats every ManagedHostState::Failed with
FailureCause::DpfProvisioning as reprovision-restartable. Update the match near
start_dpu_reprovision/its helper logic in handler.rs to only allow
reprovision-specific failures, either by checking additional provenance in
FailureDetails or by carrying explicit reprovision context in the failure state.
Add a negative test covering a non-reprovision DpfProvisioning failure to ensure
it is not considered restartable.
---
Outside diff comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 1869-1888: Clear the persisted failure details in this recovery
branch before transitioning out of ManagedHostState::Failed, since the next
loop’s get_failed_state(mh_snapshot) check can immediately re-fail the host.
Update the restart path in handler.rs around
ReprovisionState::next_substate_based_on_bfb_support(...).next_state_with_all_dpus_updated(...)
to perform the same failure-record cleanup used by the other recovery paths
before returning next_state.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4aa4a23b-808a-4f25-bee8-eb65d5a910d8
📒 Files selected for processing (2)
crates/machine-controller/src/handler.rscrates/machine-controller/src/handler/dpf.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/machine-controller/src/handler/dpf.rs
…t of reprovisioning when dpf fails.
bafdc2d to
0918037
Compare
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
When a DPF provisioning failure occurred while a host was in ManagedHostState::Assigned, the state machine was unconditionally transitioning to top-level ManagedHostState::Failed. This lost the Assigned context.
Additionally, start_dpu_reprovision had no path to restart reprovisioning from a DpfProvisioning failure.
Related issues
Type of Change
Breaking Changes
Testing
Additional Notes