Commit 9cb3570

dulinriley authored and facebook-github-bot committed
Ignore mailbox errors in GetState for HostMeshAgent
Summary: An actor that owns ActorMeshes periodically sends a GetState message to the HostMeshAgent and ProcMeshAgent. If that sender crashes while waiting for a reply, the reply attempt causes a MailboxSenderError on the agents. The agents should not stop because of such an error, since it only means that nobody will receive the reply. GetRankStatus and GetState messages are read-only and have no side effects, so no invalid state is left behind, and it is fine to just log a warning for the MailboxSenderError. Differential Revision: D85720817
1 parent a0aceab commit 9cb3570
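
The difference between the old and new behavior can be illustrated with a minimal sketch that uses only the standard library. It assumes a plain `std::sync::mpsc` channel as a stand-in for the reply port; `HandlerError` and both handler functions are hypothetical names for illustration and are not part of hyperactor.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for the error a handler returns to its runtime.
/// In this sketch, returning it models "the agent gets stopped".
#[derive(Debug)]
pub struct HandlerError(pub String);

/// Old behavior: a failed reply is propagated with `?`, so the handler
/// itself fails when the requester has gone away.
pub fn handle_get_state_propagating(reply: &Sender<u32>, state: u32) -> Result<(), HandlerError> {
    reply
        .send(state)
        .map_err(|e| HandlerError(e.to_string()))?;
    Ok(())
}

/// New behavior: a failed reply is logged and the handler still returns Ok,
/// because the request was read-only and left no state to clean up.
pub fn handle_get_state_tolerant(reply: &Sender<u32>, state: u32) -> Result<(), HandlerError> {
    if let Err(e) = reply.send(state) {
        // The real change uses tracing::warn!; eprintln! keeps this sketch dependency-free.
        eprintln!("failed to send GetState reply due to error: {}", e);
    }
    Ok(())
}
```

Dropping the receiving end of the channel before calling either function reproduces the crashed-requester case: the first function returns `Err`, the second logs and returns `Ok(())`.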

2 files changed: +44 -5 lines changed


hyperactor_mesh/src/proc_mesh/mesh_agent.rs

Lines changed: 22 additions & 3 deletions
@@ -669,8 +669,17 @@ impl Handler<resource::GetRankStatus> for ProcMeshAgent {
             StatusOverlay::try_from_runs(vec![(rank..(rank + 1), status)])
                 .expect("valid single-run overlay")
         };
-        get_rank_status.reply.send(cx, overlay)?;
-
+        let result = get_rank_status.reply.send(cx, overlay);
+        // Ignore errors, because returning Err from here would cause the ProcMeshAgent
+        // to be stopped, which would prevent querying and spawning other actors.
+        // This only means some actor that requested the state of an actor failed to receive it.
+        if let Err(e) = result {
+            tracing::warn!(
+                actor = %cx.self_id(),
+                "failed to send GetRankStatus reply due to error: {}",
+                e
+            );
+        }
         Ok(())
     }
 }
@@ -724,7 +733,17 @@ impl Handler<resource::GetState<ActorState>> for ProcMeshAgent {
             },
         };
 
-        get_state.reply.send(cx, state)?;
+        let result = get_state.reply.send(cx, state);
+        // Ignore errors, because returning Err from here would cause the ProcMeshAgent
+        // to be stopped, which would prevent querying and spawning other actors.
+        // This only means some actor that requested the state of an actor failed to receive it.
+        if let Err(e) = result {
+            tracing::warn!(
+                actor = %cx.self_id(),
+                "failed to send GetState reply due to error: {}",
+                e
+            );
+        }
         Ok(())
     }
 }

hyperactor_mesh/src/v1/host_mesh/mesh_agent.rs

Lines changed: 22 additions & 2 deletions
@@ -290,7 +290,17 @@ impl Handler<resource::GetRankStatus> for HostMeshAgent {
             StatusOverlay::try_from_runs(vec![(rank..(rank + 1), status)])
                 .expect("valid single-run overlay")
         };
-        get_rank_status.reply.send(cx, overlay)?;
+        let result = get_rank_status.reply.send(cx, overlay);
+        // Ignore errors, because returning Err from here would cause the HostMeshAgent
+        // to be stopped, which would take down the entire host. This only means
+        // some actor that requested the rank status failed to receive it.
+        if let Err(e) = result {
+            tracing::warn!(
+                actor = %cx.self_id(),
+                "failed to send GetRankStatus reply due to error: {}",
+                e
+            );
+        }
         Ok(())
     }
 }
@@ -403,7 +413,17 @@ impl Handler<resource::GetState<ProcState>> for HostMeshAgent {
             },
         };
 
-        get_state.reply.send(cx, state)?;
+        let result = get_state.reply.send(cx, state);
+        // Ignore errors, because returning Err from here would cause the HostMeshAgent
+        // to be stopped, which would take down the entire host. This only means
+        // some actor that requested the state of a proc failed to receive it.
+        if let Err(e) = result {
+            tracing::warn!(
+                actor = %cx.self_id(),
+                "failed to send GetState reply due to error: {}",
+                e
+            );
+        }
         Ok(())
     }
 }
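
The same warn-and-continue block appears in all four handlers above. One possible way to factor it out is sketched below; this is not part of the commit. The helper name `warn_on_reply_error` and its parameters are hypothetical, and `eprintln!` stands in for `tracing::warn!` so the sketch has no external dependencies.

```rust
use std::fmt::Display;

/// Hypothetical helper: consumes the result of a reply send, logs a warning
/// on failure, and never propagates the error to the caller.
/// Returns true if the reply was delivered and false otherwise.
pub fn warn_on_reply_error<E: Display>(actor: &str, message: &str, result: Result<(), E>) -> bool {
    match result {
        Ok(()) => true,
        Err(e) => {
            // Assumed log shape, modeled on the messages in the diff above.
            eprintln!(
                "actor={} failed to send {} reply due to error: {}",
                actor, message, e
            );
            false
        }
    }
}
```

A handler could then pass the result of its reply send to this helper and return `Ok(())` unconditionally, which keeps the rationale comment in one place.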
