Skip to content

Conversation

@Frando
Copy link
Member

@Frando Frando commented Nov 18, 2025

Description

  • fix idle timeout clear condition (previously it would hot loop)
  • fix hot loop when local_addrs watchable becomes disconnected during shutdown
  • when sending a datagram fails in the transports sender, include the dst address in the error message
  • do not break the RemoteStateActor when sending a datagram fails

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

@github-actions
Copy link

github-actions bot commented Nov 18, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3676/docs/iroh/

Last updated: 2025-11-18T13:53:21Z

@n0bot n0bot bot added this to iroh Nov 18, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Nov 18, 2025
@github-actions
Copy link

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 655ad0e

@Frando Frando force-pushed the Frando/mp-fix-remote-state-actor-loop branch from 6ffcbb0 to 7c11665 Compare November 18, 2025 13:51
@matheus23
Copy link
Member

@Frando here's another version of fixing the idle timeout: #3678

I think I'd prefer that version ✌️

Comment on lines -256 to +246
_ = self.local_addrs.updated() => {
_ = self.local_addrs.updated(), if self.local_addrs.is_connected() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this briefly while on the call, but I think we got distracted by something else before the answer:
Wouldn't it make sense to match on the Result you get back?

  • In case it's Err(Disconnected) you know that you can probably close the actor. This should only happen when magicsock::DiscoveredDirectAddrs is dropped, which should only happen when MagicSock is dropped. Soo.. I think dropping out of the loop at this point is totally fine, even desired.
  • IIUC, the , if self.local_addrs.is_connected() is a precondition, so it's possible that you start polling updated() while it's connected, but then it disconnects and resolves to Err(Disconnected). I don't think that was the intended change here. The fix for this would be to just match the Result from updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this. Though hope that one day we have structured concurrency and disconnected just can't happen.

This... might be another thing to look into as a fix for this. I went with spawning these actors detached because it was the fasted thing to do and get on with thinking about the logic I had to solve.

But we should totally improve this. They should be spawned in a JoinSet, but that might need an RemoteMap actor or something to manage these tasks and that itself be a child of the magicsock::Actor. Or manage the JoinSet immediately in the magicsock::Actor, but I didn't know if that violated too many boundaries.

Though having said all that, I think we should handle this disconnect stuff here correctly and do the structured-concurrency fix some later time.

Comment on lines -281 to +277
if self.connections.is_empty() && inbox.is_idle() && idle_timeout.is_none() {
let is_idle = self.connections.is_empty() && inbox.is_idle();
if idle_timeout.is_none() && is_idle {
trace!("start idle timeout");
idle_timeout
.as_mut()
.set_future(time::sleep(ACTOR_MAX_IDLE_TIMEOUT));
} else if idle_timeout.is_some() {
} else if idle_timeout.is_some() && !is_idle {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented, I think I'd prefer the other version from my PR to this fix. There's lots of states that idle_timeout can be in, and it's probably easier to default it to sth that continuously gets reset further and further into the future.

Comment on lines -303 to +295
self.handle_msg_send_datagram(transmit).await?;
if let Err(err) = self.handle_msg_send_datagram(transmit).await {
warn!("failed to send datagram: {err:#}");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bug:
We should avoid short-circuiting not on the handle_msg_send_datagram level, but on the send_datagram level. Otherwise a single path that fails will make us stop sending datagrams to any other paths that might still work.

So in essence that just means this if let Err(err) = ... should be moved inside the handle_msg_send_datagram call.

let task = task::spawn(
async move {
if let Err(err) = self.run(rx).await {
error!("actor failed: {err:#}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this was a poor-man's attempt at being able to have "fatal" errors. The right thing of course is to panic and spawn the tasks in a JoinSet and making sure the panic bubbles up to the endpoint. But that's for another day I guess. For now making this infallible is the safer approach I think. While if forces you to handle the fatal error everywhere it will probably result in better code.

if any_match {
Err(io::Error::other("all available transports failed"))
Err(io::Error::other(format!(
"all available transports failed for destination {dst:?}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, convention in the stdlib is that the caller should provide the context. Don't we have somewhere else to add context?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🏗 In progress

Development

Successfully merging this pull request may close these issues.

4 participants