Skip to content

Feat/enhance error handle#557

Open
jackieismpc wants to merge 23 commits intork8s-dev:mainfrom
jackieismpc:feat/enhance-error-handle
Open

Feat/enhance error handle#557
jackieismpc wants to merge 23 commits intork8s-dev:mainfrom
jackieismpc:feat/enhance-error-handle

Conversation

@jackieismpc
Copy link
Copy Markdown
Contributor

#556

summary

  • This update finalizes the stable API alignment for dagrs 0.8.1 and dagrs-derive 0.5.0, and includes an additional review-and-fix pass on top of the local commit stack.
  • The main crate continues to converge on a single structured error surface based on DagrsError / ErrorCode, a single terminal event based on ExecutionTerminated, and a consistent contract between graph APIs and derive macros.
  • The review fixes in this pass focus on checkpoint resume correctness, reset-environment coverage, the Graph::add_edge iterator contract, and keeping tests/docs in sync.

change

  • dagrs

    • Checkpoint resume now rebuilds runtime graph state and replays serialized checkpoint outputs into downstream input channels, so resume_from_checkpoint() can continue execution instead of only restoring status flags.
    • NodeState now carries a StoredOutputKind tag so checkpoint output data can be restored safely.
    • Graph::add_edge now accepts IntoIterator<Item = NodeId>, matching the upgrade plan while remaining backward-compatible with existing Vec<NodeId> call sites.
    • Fixed stale checkpoint documentation examples by replacing the removed set_checkpoint_interval(...) API with set_checkpoint_config(...).
    • Updated README and CHANGELOG to document replayed checkpoint outputs during resume.
  • dagrs tests

    • Reworked test_resume_execution_basic into a real resume test that verifies completed nodes are not rerun and pending downstream nodes receive checkpointed input.
    • Added reset_policy_test to cover both default reset() environment preservation and explicit reset_with(ResetPolicy::ResetEnv) behavior.
    • Updated checkpoint serialization coverage to validate StoredOutputKind.
  • dagrs-derive

    • Fixed dependencies! type inference after Graph::add_edge was widened to an iterator-based signature by collecting explicitly into Vec<NodeId>.
  • validation

    • cargo fmt --all -- --check
    • cargo clippy -p dagrs --all-targets --all-features -- -D warnings
    • cargo clippy -p dagrs-derive --all-targets --all-features -- -D warnings
    • cargo test -p dagrs --all-features
    • cargo test -p dagrs-derive --all-features
    • cargo build -p dagrs --examples --all-features

Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec3cf29d13

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af41ade474

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Signed-off-by: jackie <jackieismpc@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the dagrs runtime and dagrs-derive macros to the stated stable API surface (dagrs 0.8.1 / dagrs-derive 0.5.0), centered on structured errors (DagrsError/ErrorCode), updated event/termination contracts, and more correct checkpoint resume behavior (including output replay).

Changes:

  • Introduces/propagates a unified structured error model across graph build, runtime, channels, hooks, and checkpoints; updates events to end with ExecutionTerminated.
  • Enhances checkpoint/resume semantics by making node execution state explicit (NodeExecStatus) and supporting replay of serialized outputs during resume; expands/updates tests accordingly.
  • Updates dagrs-derive macros and adds UI fixtures/tests; also includes broad Buck build rule refactors across multiple crates.

Reviewed changes

Copilot reviewed 75 out of 80 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
project/slayerfs/BUCK Renames vendor/manifest targets and adds Windows-specific rustc_flags select.
project/rks/BUCK Renames vendor/manifest targets and updates Windows rustc_flags wiring.
project/rkl/BUCK Renames vendor/manifest targets and updates Windows rustc_flags wiring.
project/rkforge/BUCK Renames vendor/manifest targets and adds Windows-specific rustc_flags select.
project/rfuse3/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/libvault/BUCK Renames vendor/manifest/buildscript targets and updates env/rustc_flags references.
project/libscheduler/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/libruntime/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/libnetwork/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/libfuse-fs/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/libcsi/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/libcni/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/distribution/BUCK Renames vendor/manifest targets and adds Windows-specific rustc_flags select.
project/dagrs/tests/v2_test.rs Updates tests to handle Result-returning build APIs.
project/dagrs/tests/typed_channel_test.rs Updates tests to unwrap add_node/add_edge results.
project/dagrs/tests/router_reuse_test.rs Updates build/reset calls to return/unwrap Result.
project/dagrs/tests/router_merge_test.rs Updates build calls for Result-returning APIs.
project/dagrs/tests/reset_policy_test.rs Adds coverage for reset environment lifecycle policy.
project/dagrs/tests/loop_reset_test.rs Updates reset call to unwrap Result.
project/dagrs/tests/hook_test.rs Updates hook contract (no on_error), retries, and termination events.
project/dagrs/tests/event_test.rs Updates event model assertions (skip reasons, durations, termination event).
project/dagrs/tests/checkpoint_test.rs Expands checkpoint/resume tests and adapts to new NodeState model.
project/dagrs/tests/chain_skip_test.rs Updates build calls for Result-returning APIs.
project/dagrs/tests/branch_pruning_test.rs Updates build calls for Result-returning APIs.
project/dagrs/src/utils/output.rs Reworks Output to carry DagrsError and adds execution_failed helper.
project/dagrs/src/utils/hook.rs Updates hook docs/contracts and retry error type to DagrsError.
project/dagrs/src/utils/execstate.rs Adjusts mutex handling and exposes is_success.
project/dagrs/src/utils/checkpoint.rs Introduces NodeExecStatus/StoredOutputKind, new checkpoint ordering, and DagrsError-based store APIs.
project/dagrs/src/node/typed_action.rs Adapts typed channels to new InChannels internal structure.
project/dagrs/src/node/mod.rs Adds restore_from_checkpoint hook; hardens EnvVar node-table lookup.
project/dagrs/src/node/loop_node.rs Adds loop condition/node checkpoint restore hooks.
project/dagrs/src/lib.rs Re-exports structured error types and updated checkpoint types.
project/dagrs/src/graph/loop_subgraph.rs Makes LoopSubgraph build APIs fallible and avoids panics on execution.
project/dagrs/src/graph/event.rs Replaces GraphFinished with ExecutionTerminated, adds typed error payloads and skip reasons.
project/dagrs/src/graph/error.rs Defines ErrorCode, DagrsError, and ErrorContext as the unified error surface.
project/dagrs/src/graph/abstract_graph.rs Makes abstract edge insertion fallible and de-panics missing-node paths.
project/dagrs/src/connection/out_channel.rs Converts channel errors into structured DagrsError and updates APIs accordingly.
project/dagrs/src/connection/in_channel.rs Converts channel errors into structured DagrsError and adds disabled-sender support for resume/control-flow.
project/dagrs/README.md Updates docs for new build/runtime/error/event contracts and versions.
project/dagrs/examples/typed_action.rs Updates example to unwrap add_node/add_edge results.
project/dagrs/examples/recv_any_example.rs Updates example to unwrap add_node/add_edge results.
project/dagrs/examples/loop_dag.rs Updates example to unwrap build calls.
project/dagrs/examples/hello_dagrs.rs Updates example to unwrap add_node result.
project/dagrs/examples/dagrs-sklearn/tests/dag_job_test.rs Updates expected failure type to ErrorCode-based structured errors.
project/dagrs/examples/dagrs-sklearn/src/parser/yaml_parser.rs Propagates build errors from add_node/add_edge via Result.
project/dagrs/examples/dagrs-sklearn/src/command_action.rs Maps command execution failures into DagrsError with details.
project/dagrs/examples/dagrs-sklearn/examples/sklearn.rs Adapts output error handling to DagrsError.
project/dagrs/examples/dagrs-sklearn/examples/notebook.ipynb Updates embedded code to new Output::Err(DagrsError) handling.
project/dagrs/examples/dagrs-sklearn/Cargo.toml Declares a workspace boundary for the example crate.
project/dagrs/examples/custom_node.rs Updates example to unwrap add_node result.
project/dagrs/examples/conditional_node.rs Updates example to unwrap build calls.
project/dagrs/examples/compute_dag.rs Updates example to unwrap build calls.
project/dagrs/examples/auto_relay.rs Updates example main signature and propagates dependencies!/run errors.
project/dagrs/examples/auto_node.rs Adjusts generics ordering/bounds for auto-node example.
project/dagrs/docs/CHANGELOG.md Adds v0.8.1 changelog with migration notes.
project/dagrs/Cargo.toml Bumps dagrs version to 0.8.1.
project/dagrs/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/dagrs-derive/tests/ui/pass/auto_node_success/src/main.rs Adds passing UI fixture for auto_node macro behavior.
project/dagrs-derive/tests/ui/pass/auto_node_success/Cargo.toml Adds pass fixture manifest for UI compilation test.
project/dagrs-derive/tests/ui/pass/auto_node_success/Cargo.lock Adds lockfile for pass UI fixture.
project/dagrs-derive/tests/ui/fail/tuple_struct/src/main.rs Adds failing UI fixture for tuple struct diagnostics.
project/dagrs-derive/tests/ui/fail/tuple_struct/Cargo.toml Adds fail fixture manifest for tuple struct case.
project/dagrs-derive/tests/ui/fail/tuple_struct/Cargo.lock Adds lockfile for tuple struct fail fixture.
project/dagrs-derive/tests/ui/fail/reserved_field/src/main.rs Adds failing UI fixture for reserved field diagnostics.
project/dagrs-derive/tests/ui/fail/reserved_field/Cargo.toml Adds fail fixture manifest for reserved field case.
project/dagrs-derive/tests/ui/fail/reserved_field/Cargo.lock Adds lockfile for reserved field fail fixture.
project/dagrs-derive/tests/dependencies_macro.rs Adds tests ensuring dependencies! propagates build errors and supports ?.
project/dagrs-derive/tests/auto_node_ui.rs Adds cargo-driven UI test runner for fixtures.
project/dagrs-derive/src/relay.rs Updates macro expansion to propagate build errors and avoid identifier collisions.
project/dagrs-derive/src/lib.rs Updates macro docs/contracts; documents dependencies! returns Result.
project/dagrs-derive/src/auto_node.rs Improves generics handling, reserved-field diagnostics, and removes unsafe Send/Sync injection.
project/dagrs-derive/README.md Documents macro contract targeting dagrs 0.8.1 and updated behavior.
project/dagrs-derive/Cargo.toml Bumps dagrs-derive version to 0.5.0.
project/dagrs-derive/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/common/BUCK Renames vendor/manifest targets for Buck rule consistency.
project/Cargo.toml Updates workspace dependency version for dagrs-derive.
project/Cargo.lock Updates locked versions for dagrs and dagrs-derive.
project/aardvark-dns/BUCK Renames vendor/manifest/buildscript targets and adds Windows-specific rustc_flags select.
buckal.snap Updates buckal fingerprints after crate/version/buildrule changes.

Comment on lines +209 to 218
let disabled = self.1.clone();
let futures = self
.0
.iter_mut()
.filter(|(id, _)| !disabled.contains(id))
.map(|(_, c)| async {
let content: Content = c.lock().await.recv(NodeId(0)).await?;
Ok(content.into_inner())
});
join_all(futures).await.into_iter().map(f).collect()
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as InChannels::map: TypedInChannels::map uses recv(NodeId(0)), so any channel errors will carry the wrong channel_id in DagrsError.context. Pass the actual sender NodeId into recv to keep error context accurate.

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +45
/// After the task is successfully executed, set the execution result.
pub(crate) fn set_output(&self, output: Output) {
self.success.store(true, Ordering::Relaxed);
*self.output.lock().unwrap() = output;
if let Ok(mut guard) = self.output.lock() {
*guard = output;
}
}

/// [`Output`] for fetching internal storage.
/// This function is generally not called directly, but first uses the semaphore for synchronization control.
pub(crate) fn get_output(&self) -> Option<Content> {
self.output.lock().unwrap().get_out()
self.output.lock().ok().and_then(|guard| guard.get_out())
}
pub(crate) fn get_full_output(&self) -> Output {
self.output.lock().unwrap().clone()
self.output
.lock()
.map(|guard| guard.clone())
.unwrap_or_else(|_| Output::empty())
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExecState::set_output silently ignores poisoned mutexes (and get_full_output falls back to Output::empty() on poison). That can mask real execution results and make downstream logic think a node produced no output. Consider recovering the poisoned lock via poisoned.into_inner() and still reading/writing the stored Output, or propagate/panic so corruption isn’t silently hidden.

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +62
pub fn add_node(&mut self, node: impl Node + 'static) -> DagrsResult<NodeId> {
let node_id = node.id();
for existing in &self.inner_nodes {
let existing = existing.try_lock().map_err(|_| {
DagrsError::new(
ErrorCode::DgBld0005ConcurrentBuildMutation,
"failed to acquire loop subgraph node lock while building graph",
)
})?;
if existing.id() == node_id {
return Err(DagrsError::new(
ErrorCode::DgBld0003DuplicateNodeId,
"duplicate node id detected while building loop subgraph",
)
.with_node_id(node_id.as_usize()));
}
}
self.inner_nodes.push(Arc::new(Mutex::new(node)));
Ok(node_id)
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoopSubgraph::add_node checks for duplicate IDs by try_lock()ing every existing inner node. During graph build this introduces a spurious failure mode (DgBld0005ConcurrentBuildMutation) if any node mutex is locked (even if no real concurrent build mutation is happening). A more robust approach is to track inner node IDs separately (e.g., a HashSet<NodeId>) so duplicate detection is lock-free and deterministic.

Copilot uses AI. Check for mistakes.
.arg("--quiet")
.arg("--manifest-path")
.arg(case_dir.join("Cargo.toml"))
.env("CARGO_NET_OFFLINE", "true")
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These UI tests force CARGO_NET_OFFLINE=true, but the fixtures depend on registry crates (e.g. async-trait = "0.1" in tests/ui/pass/auto_node_success). In a clean CI environment without a pre-populated Cargo registry cache, cargo check will fail because it can’t download those dependencies. Either remove the offline setting, vendor the fixture dependencies, or switch to a UI-test approach that doesn’t require registry access.

Suggested change
.env("CARGO_NET_OFFLINE", "true")

Copilot uses AI. Check for mistakes.
@genedna
Copy link
Copy Markdown
Member

genedna commented Mar 23, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: edf6ec72e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

genedna pushed a commit that referenced this pull request Mar 23, 2026
Review content is in PR_557_REVIEW.md. The GH_TOKE PAT in the environment
is invalid (returns 401 Bad Credentials), preventing direct posting to
the GitHub PR thread via API.

https://claude.ai/code/session_012q6mfySUQapXeVUmL5mu7v
Signed-off-by: jackie <jackieismpc@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8a1822a58

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1767 to 1771
self.abstract_graph.add_edge(from_id, *to_id)?;

let (tx, rx) = mpsc::channel::<Content>(32);
from_channel.insert(*to_id, Arc::new(Mutex::new(OutChannel::Mpsc(tx.clone()))));
rx_map.insert(*to_id, rx);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Roll back sender-side edge mutations on receiver-lock errors

This mutates the abstract graph and sender-side channel map before the later try_lock_node_for_build(...) on each target's input channel. If a receiver is concurrently locked, add_edge() now returns Err(DgBld0005...) but leaves the edge half-installed: abstract_graph, from_channel, and in_degree are updated, while the new rx is dropped. Retrying the same add_edge() call then no-ops because from_channel.0.contains_key(to_id) is already true, so that edge remains permanently broken and sends fail against a closed channel.

Useful? React with 👍 / 👎.

Signed-off-by: jackie <jackieismpc@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1464917ea

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6369540a3d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +624 to +627
if !matches!(node_state.status, NodeExecStatus::Succeeded)
|| node_state.output_data.is_some()
|| node_state.output_summary.is_none()
{
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Rerun channel-only successes during checkpoint resume

Update the rerun predicate to handle succeeded nodes that sent data through OutChannels but returned Output::empty(). In this code path, save_checkpoint_with_loop_node_iterations() records status=Succeeded with both output_data and output_summary unset, and succeeded_node_requires_rerun() therefore returns false; then prune_succeeded_nodes_from_resume_span() can drop that sender from active_nodes. If a pending peer in the same resumed block still waits on that sender (a common pattern in this repo, e.g. actions that broadcast(...) then return empty), resume can deadlock because no replay packet exists and the sender is never rerun.

Useful? React with 👍 / 👎.

Signed-off-by: jackieismpc <jackieismpc@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants