.stop() sometimes hangs indefinitely #2625
Comments
When the stop times out, I have one connection left with 0 streams.
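One way to observe this, assuming the public getConnections API (a sketch):

// Dump the remaining connections and their open stream counts after
// stop() has timed out.
for (const conn of node.getConnections()) {
  console.log(conn.remotePeer.toString(), 'streams:', conn.streams.length)
}
|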
Closing the connection manager first:

async _invokeStartableMethod (methodName) {
  // reject if the components don't finish starting/stopping within 10s
  let timeoutId
  const timeout = new Promise((resolve, reject) => {
    timeoutId = setTimeout(() => {
      reject(new Error('timeout'))
    }, 10000)
  })

  // stop the connection manager before the rest of the components
  const connectionManager = this.components['connectionManager']
  if (isStartable(connectionManager)) {
    await connectionManager[methodName]?.()
  }

  const promise = Promise.all(Object.values(this.components)
    .filter(obj => isStartable(obj))
    .map(async (startable) => {
      await startable[methodName]?.()
    }))

  try {
    await Promise.race([promise, timeout])
  } finally {
    // clear the timer handle, not the promise
    clearTimeout(timeoutId)
  }
} |
I may be missing something but where is the
Looking at the code, the TCP listener aborts each connection on close whereas the WebSocket transport tries to close them gracefully. Do you still see the issue if the transport aborts the connections instead of closing them on close? |
// it-ws/server.ts
async close (): Promise<void> {
  await new Promise<void>((resolve, reject) => {
    // terminate any connections the server is still tracking before closing
    this.server.closeAllConnections()
    this.server.close((err) => {
      if (err != null) {
        reject(err)
        return
      }

      resolve()
    })
  })
}
|
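For reference, a user-land sketch of the "abort instead of close" idea, assuming the Connection.abort(err) API from @libp2p/interface; this is an approximation, not the actual transport change:

// Tear down every open connection immediately instead of waiting for a
// graceful close, then stop the node.
for (const conn of node.getConnections()) {
  conn.abort(new Error('shutting down'))
}
await node.stop()
|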
Actually the closeAllConnections trick does work, since the connection is a
The "close connection manager first" trick seems to always work. Also, adding a ~5s delay before stopping seems to work. It feels like there is some kind of connection creation in progress that is not caught correctly on closing.
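For concreteness, the second mitigation is just a delay before shutdown; a minimal sketch, using the 5s value mentioned above:

// Give any in-flight connection setup time to settle, then stop the node.
await new Promise((resolve) => setTimeout(resolve, 5_000))
await node.stop()
|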
Aha, that's interesting, nice find. The docs do say though:
...so I wonder if there's something else going on here? |
Yeah, the whole thing about aborting WebSocket connections before closing does not seem to have an impact. Only the two things I mentioned in my last comment mitigated the problem for me. |
That would explain something! |
I'm running a node with this configuration, and I keep having the hanging problem:

const node = await createLibp2p({
  transports: [webRTCDirect(), webRTC(), circuitRelayTransport()],
  services: {
    ping: ping(),
    identify: identify(),
  },
  streamMuxers: [yamux()],
})

So, in my script I connect to a Go node running on the same machine using the WebRTC transport, and I call ping a few times. After that, I do something like this:

async function main() {
  try {
    // Create node, connect and ping.
    // ...
  } finally {
    console.log('STOPPING', node.status)
    await node.stop()
    console.log('STOPPED', node.status)
  }
  console.log('RETURNING', node.status)
  return
}

main()

I see all 3 logs and the node's status moves through the expected states, yet the process still doesn't always exit. And stopping the connection manager explicitly doesn't help either: the script still sometimes hangs. |
Can you try adding why-is-node-running to your script? It's invaluable at tracking down why node doesn't exit.

import why from 'why-is-node-running'
// ... more imports

async function main() {
  try {
    // Create node, connect and ping.
    // ...
  } finally {
    // print all open handles after 5s - `.unref` the timeout so it doesn't keep the process running itself
    setTimeout(() => {
      why()
    }, 5_000).unref()

    console.log('STOPPING', node.status)
    await node.stop()
    console.log('STOPPED', node.status)
  }
  console.log('RETURNING', node.status)
  return
} |
Oh, thanks! Didn't know such a thing exists (I'm not a JS developer in general, just had to dig into js-libp2p this time). |
This is what I get. Not sure where to start digging, but I hope it's useful for resolving the issue:
|
I tried closing various components manually, making sure to close each connection explicitly, making sure to close all the streams on a connection before closing it, etc. Nothing seems to help: sometimes the program hangs anyway.
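For concreteness, a sketch of that kind of manual cleanup, assuming the current public Connection/Stream interfaces:

// Close all streams on every connection, then the connections themselves,
// before stopping the node.
for (const conn of node.getConnections()) {
  await Promise.all(conn.streams.map(async (stream) => stream.close()))
  await conn.close()
}
await node.stop()
|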
The |
Hm... interesting. I only dial one peer in this script, and then I make sure I close the connection explicitly before stopping the node (I also tried closing the connection after stopping the node, which is weird, but who knows). I'll check if there's something stuck in the dial queue. |
The dial queue is empty when the program is getting stuck. I also tried passing a custom abort signal to all the components and method calls that accept one, and then making sure I abort after stopping the libp2p node. Didn't help either.
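In case it helps, a sketch of what I mean by the abort-signal approach, assuming the standard AbortController and the signal option that dial accepts:

// Dial with a custom AbortSignal, then abort it after the node has
// stopped, in case something is still holding an in-flight operation.
const controller = new AbortController()
const conn = await node.dial(pid, { signal: controller.signal })
// ... ping, etc ...
await node.stop()
controller.abort()
|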
Another curious thing which may or may not be relevant is that when I run my script with |
Another observation: with the TCP transport I don't have this problem; it seems to only be happening with the WebRTC Direct transport. I can't use TCP though, because I'm planning to run this in the browser (even though I'm testing locally with Node.js for a faster feedback loop). |
Yeah, that's to be expected - the
Is the node you are dialing public? If so, can you share the multiaddr of the peer you are trying to dial? |
No, I'm dialing a local node running on the same machine. I'll try to dial some public node to see if it keeps happening. |
I can't replicate this locally - what OS/node version are you using? |
I tried dialing ... To reproduce, try running a Kubo node. I'm running Kubo v0.34.1. And here's the cleaned-up script that I hope is useful. I'm running it with Node v20.12.2 (pretty old, yeah, but the same problem happens with Bun too, so I'd guess it's not Node-specific).

import {createLibp2p} from 'libp2p'
import {webRTC, webRTCDirect} from '@libp2p/webrtc'
import {circuitRelayTransport} from '@libp2p/circuit-relay-v2'
import {ping} from '@libp2p/ping'
import {multiaddr} from '@multiformats/multiaddr'
import {identify} from '@libp2p/identify'
import {peerIdFromString} from '@libp2p/peer-id'
import {yamux} from '@chainsafe/libp2p-yamux'
import why from 'why-is-node-running'
export async function main() {
  // Replace with peer info of your locally running Kubo node.
  const peerInfo = {
    id: '12D3KooWH7RY9kkwqUAn4vzrtes5D9oFf48J3nQDRNHm81uqEdMJ',
    addrs: [
      '/ip4/127.0.0.1/udp/4001/webrtc-direct/certhash/uEiATWFB-N5p3HLeAetDJ6G4gL3eSCQ8_bn1oeWMjird2ng/p2p/12D3KooWH7RY9kkwqUAn4vzrtes5D9oFf48J3nQDRNHm81uqEdMJ',
    ],
  }

  const node = await createLibp2p({
    start: false,
    transports: [webRTCDirect(), webRTC(), circuitRelayTransport()],
    services: {
      ping: ping(),
      identify: identify(),
    },
    streamMuxers: [yamux()],
  })

  const pid = peerIdFromString(peerInfo.id)
  await node.peerStore.merge(pid, {
    multiaddrs: peerInfo.addrs.map((v) => multiaddr(v)),
  })

  await node.start()

  try {
    console.log(await node.services.ping.ping(pid))
    console.log(await node.services.ping.ping(pid))
    const info = await node.peerStore.get(pid)
    console.log('PROTOCOLS', info.protocols)
  } finally {
    // print all open handles after 5s - `.unref` the timeout so it doesn't keep the process running itself
    setTimeout(() => {
      why()
    }, 5_000).unref()

    console.log('STOPPING', node.status)
    await node.stop()
    console.log('STOPPED', node.status)
  }

  console.log('RETURNING', node.status)
  return
}

process.on('exit', (code) => {
  console.log('Process exiting with code', code)
})

async function realMain() {
  await main()
  console.log('MAIN DONE')
}

realMain() |
Thanks for that - I've replicated the issue locally. Can you try patching your local copy of
It basically just wraps the use of the handshake datachannel in a
I'm not seeing the hang any more with the fix (though sometimes there's an EOF where the remote closes the channel during the noise handshake, but it doesn't keep the process running).
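A minimal sketch of that general pattern, assuming the fix ensures the handshake channel is always closed even when the handshake fails (the wrapper name is illustrative, not the actual @libp2p/webrtc internals):

// Hypothetical helper: run the handshake against the datachannel and
// guarantee the channel is closed afterwards, so no open handle is left
// to keep the process alive.
async function withHandshakeChannel<T> (
  channel: RTCDataChannel,
  fn: (channel: RTCDataChannel) => Promise<T>
): Promise<T> {
  try {
    return await fn(channel)
  } finally {
    channel.close()
  }
}
|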
@achingbrain Yeah, this seems to help. Thanks! Hope the PR won't take long to get accepted :) |
@achingbrain Before the PR is accepted, do you think there's any way to close that leaking channel from the outside? Maybe some obscure API that would let you close the handle that prevents the program from exiting? |
There's no secret API that allows access to the channel. The strange thing is that the RTCPeerConnection is closed when the node is stopped - at this point any open datachannels should get closed too, but for a reason that still needs figuring out, this doesn't appear to be happening. If you are blocked & need a release ASAP you could try something like patch-package to apply the changes after an |
I'm still seeing this even after #3076 - I think the bug might be in |
Version: 1.8.1
Subsystem: @libp2p/websocket, connection manager, components
Severity: High
Description:
Steps to reproduce the error:
I have a test scenario with a few nodes that I connect to each other and send some data. Sometimes when terminating a node, it endlessly waits for the WebSocket transport to shut down. This is because there are connections that don't close (?), and in the it-ws library there are no calls to server.closeAllConnections() before close. I have managed to make the shutdown work better if I modify js-libp2p/packages/libp2p/src/components.ts (line 72 in 73f2b6b). Also, there are no issues if I use the TCP transport instead.
This is not a good description of how to reproduce the issue, because it is a bit random when it occurs. But I wanted to create this issue in case others have the same problem, and I will update the description once I have isolated a good test.