Skip to content

.stop() sometimes hangs indefinitely #2625

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
marcus-pousette opened this issue Jul 18, 2024 · 27 comments · Fixed by #3076
Closed

.stop() sometimes hangs indefinitely #2625

marcus-pousette opened this issue Jul 18, 2024 · 27 comments · Fixed by #3076
Labels
need/author-input Needs input from the original author

Comments

@marcus-pousette
Copy link
Contributor

marcus-pousette commented Jul 18, 2024

  • Version:
    1.8.1

  • Subsystem:
    @libp2p/websocket , connection manager, components

Severity:

High

Description:

Steps to reproduce the error:

I have a test scenario with a few nodes that I connect to each other and send some data

Sometimes when terminating a node it endlessly waits for the Websocket transport to shutdown. This is because there are connections that does close (?) and in the it-ws library there are no calls to server.closeAllConnections() before close.

I have managed to make the shutdown work better if modify

private async _invokeStartableMethod (methodName: 'beforeStart' | 'start' | 'afterStart' | 'beforeStop' | 'stop' | 'afterStop'): Promise<void> {
to terminate the components in the reverse order in sequentially... so I suspect there is some kind of race condition going on or leak of some sort

Also no issues if I use Tcp transport instead.

This is not a good description on how to reproduce the issue, because it is a bit random when it occurs. But wanted to create this issue if others have the same problem and will update the description if I have a isolated a good test

@marcus-pousette marcus-pousette added the need/triage Needs initial labeling and prioritization label Jul 18, 2024
@marcus-pousette
Copy link
Contributor Author

marcus-pousette commented Jul 18, 2024

When the stop timeouts, I have one connection left with 0 streams,
remoteAddr: '/ip6/::ffff:7f00:1/tcp/56327/p2p/12D3KooWApN6mEB5666wTpqpzkgcKXC7kgMxmm7CAWRiD3WAg4Yv'
direction: 'inbound'
encryption: '/noise'
multiplexer: '/yamux/1.0.0'
transient: false

@marcus-pousette
Copy link
Contributor Author

marcus-pousette commented Jul 18, 2024

Closing the connectionManager component before all other components seems also to fix the issue

async _invokeStartableMethod(methodName) {
        const timeout = new Promise((resolve, reject) => {
            setTimeout(() => {
                reject('timeout')
            }, 10000);
        });
        const connectionManager = this.components["connectionManager"]
        if (isStartable(connectionManager)) {
            await connectionManager[methodName]?.();
        }
        let promise = Promise.all(Object.values(this.components)
            .filter(obj => isStartable(obj))
            .map(async (startable) => {
                await startable[methodName]?.();
            }));
        const race = Promise.race([promise, timeout]);
        try {
            await race;
        }
        catch (e) {
            throw e;
        }
        finally {
            clearTimeout(timeout);
        }
    }

@achingbrain
Copy link
Member

in the it-ws library there are no calls to server.closeAllConnections() before close.

I may be missing something but where is the closeAllConnections method defined? The it-ws module stores a reference to a http/https net.Server which has no such method.

no issues if I use Tcp transport

Looking at the code, the TCP listener aborts each connection on close whereas the WebSocket transport tries to close them gracefully.

Do you still see the issue if the transport aborts the connections instead of closing them on close?

@marcus-pousette
Copy link
Contributor Author

http.Server and https.Server has that method since NodeJs 18.2.0

/// it-ws/server.ts

 async close (): Promise<void> {
    await new Promise<void>((resolve, reject) => {
      this.server.closeAllConnections()
      this.server.close((err) => {
        if (err != null) {
          reject(err); return
        }

        resolve()
      })
    })
  }
  

@marcus-pousette
Copy link
Contributor Author

marcus-pousette commented Jul 18, 2024

Actually the closeAllConnections trick does work since the connection is a tcp one (?)

The close connection manager first trick seems to always work.

Also adding like a 5s delay before stopping also seems to work.

It feels like there is some kind of connection creation in process that is not caught correctly on closing

@achingbrain
Copy link
Member

http.Server and https.Server has that method since NodeJs 18.2.0

Aha, that's interesting, nice find.

The docs do say though:

This does not destroy sockets upgraded to a different protocol, such as WebSocket or HTTP/2

...so I wonder if there's something else going on here?

@marcus-pousette
Copy link
Contributor Author

Ye the whole thing about aborting Websocket connections before closing does not seem to have an impact. Only the two things I mentioned in my last comment did mitigate the problem for me

@marcus-pousette
Copy link
Contributor Author

Looking at the code, the TCP listener aborts each connection on close whereas the WebSocket transport tries to close them gracefully.

That would explain something!

@burdiyan
Copy link

burdiyan commented Apr 9, 2025

I'm running a node with this configuration, and I keep having the hanging problem:

const node = await createLibp2p({
    transports: [webRTCDirect(), webRTC(), circuitRelayTransport()],
    services: {
      ping: ping(),
      identify: identify(),
    },
    streamMuxers: [yamux()],
  })

So, in my script I connect to a Go node running on the same machine using WebRTC transport, and I call ping a few times. After that, I do something like this:

async main() {
    try {
        // Create node, connect and ping.
        // ...
    } finally {
        console.log('STOPPING', node.status)
        await node.stop()
        console.log('STOPPED', node.status)
    }
    console.log('RETURNING', node.status)
    return 
}

main()

and I see all 3 logs, I see that the node moves from started to stopped state, but the script sometimes is left hanging. I tried to inspect the active requests and handles of the runtime, but they are empty. It's really weird.

And stopping the connection manager explicitly doesn't help either — the script still sometimes hangs.

@achingbrain
Copy link
Member

Can you try adding why-is-node-running to your script? It's invaluable at tracking down why node doesn't exit.

import why from 'why-is-node-running'

// ... more imports

async main() {
  try {
    // Create node, connect and ping.
    // ...
  } finally {
    // print all open handles after 5s - `.unref` the timeout so it doesn't keep the process running itself
    setTimeout(() => {
      why()
    }, 5_000).unref()

    console.log('STOPPING', node.status)
    await node.stop()
    console.log('STOPPED', node.status)
  }

  console.log('RETURNING', node.status)
  return 
}

@achingbrain achingbrain added need/author-input Needs input from the original author and removed need/triage Needs initial labeling and prioritization labels Apr 10, 2025
@burdiyan
Copy link

Oh, thanks! Didn't know such thing exists (I'm not a JS developer in general, just had to dig into js-libp2p this time).

@burdiyan
Copy link

This is what I get. Not sure where to start digging, but hope it could be useful to resolve the issue:

There are 7 handle(s) keeping the process running.

# KEYPAIRGENREQUEST
@libp2p/crypto/dist/src/keys/ed25519/index.js:31 - const key = keypair('ed25519', {
@libp2p/crypto/dist/src/keys/ed25519/utils.js:22 - const { privateKey, publicKey } = crypto.generateKey();
@libp2p/crypto/dist/src/keys/index.js:19         - return generateEd25519KeyPair();
libp2p/dist/src/index.js:47                      - options.privateKey ??= await generateKeyPair('Ed25519');
src/run.mjs:36                                                                                                         - const node = await createLibp2p({
src/run.mjs:97                                                                                                         - await main()

# KEYPAIRGENREQUEST
@peculiar/webcrypto/build/webcrypto.js:1546                                 - const keys = crypto.generateKeyPairSync("ec", {
@peculiar/webcrypto/build/webcrypto.js:1729                                 - const keys = await EcCrypto.generateKey({
webcrypto-core/build/webcrypto-core.js:131                                  - return this.onGenerateKey.apply(this, args);
webcrypto-core/build/webcrypto-core.js:1435                                 - const result = await provider.generateKey({ ...preparedAlgorithm, name: provider.name }, extractable, keyUsages, ...params);
@libp2p/webrtc/dist/src/private-to-public/utils/get-rtcpeerconnection.js:62 - const keyPair = await crypto.subtle.generateKey({
@libp2p/webrtc/dist/src/private-to-public/transport.js:74                   - const peerConnection = await createDialerRTCPeerConnection('client', ufrag, typeof this.init.rtcConfiguration === 'function' ? await this.init.rtcConfiguration() : this.init.rtcConfiguration ?? {});

# ThreadSafeCallback callback
@ipshipyard/node-datachannel/dist/esm/polyfill/RTCDataChannel.mjs:45     - __privateGet(this, _dataChannel).onClosed(() => {
@ipshipyard/node-datachannel/dist/esm/polyfill/RTCPeerConnection.mjs:282 - const dataChannel = new RTCDataChannel(channel, opts);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:15            - const handshakeDataChannel = peerConnection.createDataChannel('', { negotiated: true, id: 0 });
@libp2p/webrtc/dist/src/private-to-public/transport.js:76                - return await raceSignal(connect(peerConnection, ufrag, {
@libp2p/webrtc/dist/src/private-to-public/transport.js:36                - const rawConn = await this._connect(ma, options);
libp2p/dist/src/connection-manager/dial-queue.js:210                     - const conn = await this.components.transportManager.dial(address.multiaddr, {
libp2p/dist/src/connection-manager/dial-queue.js:147                     - return await this.dialPeer(options, signal);
race-signal/dist/src/index.js:31                                         - return await Promise.race([

# KEYPAIRGENREQUEST
@chainsafe/libp2p-noise/dist/src/crypto/index.js:116          - const { publicKey, privateKey } = crypto.generateKeyPairSync('x25519', {
@chainsafe/libp2p-noise/dist/src/noise.js:40                  - this.staticKey = _crypto.generateX25519KeyPair();
@chainsafe/libp2p-noise/dist/src/index.js:41                  - return (components) => new Noise(components, init);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:70 - const connectionEncrypter = noise({ prologueBytes: noisePrologue })(options);
race-signal/dist/src/index.js:31                              - return await Promise.race([
@libp2p/webrtc/dist/src/private-to-public/transport.js:76     - return await raceSignal(connect(peerConnection, ufrag, {

# SIGNREQUEST
@libp2p/crypto/dist/src/keys/ed25519/index.js:87               - return crypto.sign(null, msg instanceof Uint8Array ? msg : msg.subarray(), obj);
@libp2p/crypto/dist/src/keys/ed25519/ed25519.js:51             - return crypto.hashAndSign(this.raw, message);
@chainsafe/libp2p-noise/dist/src/utils.js:8                    - const identitySig = await privateKey.sign(getSignaturePayload(staticPublicKey));
@chainsafe/libp2p-noise/dist/src/performHandshake.js:40        - const payload = await createHandshakePayload(privateKey, s.publicKey, extensions);
@chainsafe/libp2p-noise/dist/src/noise.js:154                  - result = await performHandshakeResponder({
@chainsafe/libp2p-noise/dist/src/noise.js:106                  - const handshake = await this.performHandshakeResponder(wrappedConnection, this.components.privateKey, options?.remotePeer?.publicKey, options);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:123 - await connectionEncrypter.secureInbound(wrappedDuplex, {

# KEYPAIRGENREQUEST
@chainsafe/libp2p-noise/dist/src/crypto/index.js:116    - const { publicKey, privateKey } = crypto.generateKeyPairSync('x25519', {
@chainsafe/libp2p-noise/dist/src/protocol.js:98         - const e = this.crypto.generateKeypair();
@chainsafe/libp2p-noise/dist/src/protocol.js:200        - const e = this.writeE();
@chainsafe/libp2p-noise/dist/src/performHandshake.js:54 - await connection.write(xx.writeMessageB(payload), options);
@chainsafe/libp2p-noise/dist/src/noise.js:154           - result = await performHandshakeResponder({
@chainsafe/libp2p-noise/dist/src/noise.js:106           - const handshake = await this.performHandshakeResponder(wrappedConnection, this.components.privateKey, options?.remotePeer?.publicKey, options);

# SIGNREQUEST
@libp2p/crypto/dist/src/keys/ed25519/index.js:110              - return crypto.verify(null, msg instanceof Uint8Array ? msg : msg.subarray(), obj, sig);
@libp2p/crypto/dist/src/keys/ed25519/ed25519.js:31             - return crypto.hashAndVerify(this.raw, sig, data);
@chainsafe/libp2p-noise/dist/src/utils.js:26                   - if (!(await publicKey.verify(signaturePayload, payload.identitySig))) {
@chainsafe/libp2p-noise/dist/src/performHandshake.js:60        - const receivedPayload = await decodeHandshakePayload(plaintext, xx.rs, remoteIdentityKey);
@chainsafe/libp2p-noise/dist/src/noise.js:154                  - result = await performHandshakeResponder({
@chainsafe/libp2p-noise/dist/src/noise.js:106                  - const handshake = await this.performHandshakeResponder(wrappedConnection, this.components.privateKey, options?.remotePeer?.publicKey, options);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:123 - await connectionEncrypter.secureInbound(wrappedDuplex, {

@burdiyan
Copy link

I tried to closing various components manually, making sure to close each connection explicitly, making sure to closing all the streams on the connection before closing them, etc. Nothing seems to help — sometimes the program hangs anyway.

@achingbrain
Copy link
Member

The # KEYPAIRGENREQUEST and # SIGNREQUEST stuff is safe to ignore - it's probably the # ThreadSafeCallback callback - it looks like a dial has hung and/or is still in progress and it's not being torn down properly.

@burdiyan
Copy link

Hm... interesting. I only dial one peer in this script, and then I make sure I close the connection explicitly before stopping the node (I also tried closing the connection after stopping the node, which is weird, but who knows). I'll check if there's something stuck in the dial queue.

@burdiyan
Copy link

The dial queue is empty when the program is getting stuck.

I also tried passing custom abort signal to all the components and method calls that accept it, and then making sure I abort after stopping the libp2p node. Didn't help either.

@burdiyan
Copy link

Another curious thing which may or may not be relevant is that when I run my script with node it hangs much more often than if I run it with bun.

@burdiyan
Copy link

Another observation: using TCP transport I don't have this problem — seems to only be happening with the WebRTC Direct transport. I can't use TCP though because I'm planning to run this in the browser (even though I'm testing locally with NodeJS for faster feedback loop).

@achingbrain
Copy link
Member

seems to only be happening with the WebRTC Direct transport

Yeah, that's to be expected - the ThreadSafeCallback reference that's keeping the process running is part of WebRTCDirect (e.g. private to public):

# ThreadSafeCallback callback
@ipshipyard/node-datachannel/dist/esm/polyfill/RTCDataChannel.mjs:45     - __privateGet(this, _dataChannel).onClosed(() => {
@ipshipyard/node-datachannel/dist/esm/polyfill/RTCPeerConnection.mjs:282 - const dataChannel = new RTCDataChannel(channel, opts);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:15            - const handshakeDataChannel = peerConnection.createDataChannel('', { negotiated: true, id: 0 });
@libp2p/webrtc/dist/src/private-to-public/transport.js:76                - return await raceSignal(connect(peerConnection, ufrag, {
@libp2p/webrtc/dist/src/private-to-public/transport.js:36                - const rawConn = await this._connect(ma, options);
libp2p/dist/src/connection-manager/dial-queue.js:210                     - const conn = await this.components.transportManager.dial(address.multiaddr, {
libp2p/dist/src/connection-manager/dial-queue.js:147                     - return await this.dialPeer(options, signal);
race-signal/dist/src/index.js:31                                         - return await Promise.race([

Is the node you are dialing public? If so can you share the multiaddr of the peer you are trying to dial?

@burdiyan
Copy link

Is the node you are dialing public? If so can you share the multiaddr of the peer you are trying to dial?

No, I'm dialing a local node running on the same machine. I'll try to dial some public node to see if it keeps happening.

@achingbrain
Copy link
Member

I can't replicate this locally - what OS/node version are you using?

@burdiyan
Copy link

I tried dialing 12D3KooWKnDdG3iXw9eTFijk3EWSunZcFi54Zka4wmtqtt6rPxc8 on /ip4/40.160.9.115/udp/4005/webrtc-direct/certhash/uEiDvpLLt7lCMQ_BD2XcXoubpigiSCwMZ_y-d5IGl2NfVoA and it keeps happening, but very randomly.

To reproduce try running a Kubo node. I'm running Kubo v0.34.1.

And here's the cleaned up script that I hope is useful. I'm running it with Node v20.12.2 (pretty old, yeah, but same problem happens with Bun too, so would guess it's not Node-specific).

import {createLibp2p} from 'libp2p'
import {webRTC, webRTCDirect} from '@libp2p/webrtc'
import {circuitRelayTransport} from '@libp2p/circuit-relay-v2'
import {ping} from '@libp2p/ping'
import {multiaddr} from '@multiformats/multiaddr'
import {identify} from '@libp2p/identify'
import {peerIdFromString} from '@libp2p/peer-id'
import {yamux} from '@chainsafe/libp2p-yamux'
import why from 'why-is-node-running'

export async function main() {
  // Replace with peer info of your locally running Kubo node.
  const peerInfo = {
    id: '12D3KooWH7RY9kkwqUAn4vzrtes5D9oFf48J3nQDRNHm81uqEdMJ',
    addrs: [
      '/ip4/127.0.0.1/udp/4001/webrtc-direct/certhash/uEiATWFB-N5p3HLeAetDJ6G4gL3eSCQ8_bn1oeWMjird2ng/p2p/12D3KooWH7RY9kkwqUAn4vzrtes5D9oFf48J3nQDRNHm81uqEdMJ',
    ],
  }

  const node = await createLibp2p({
    start: false,
    transports: [webRTCDirect(), webRTC(), circuitRelayTransport()],
    services: {
      ping: ping(),
      identify: identify(),
    },
    streamMuxers: [yamux()],
  })

  const pid = peerIdFromString(peerInfo.id)

  await node.peerStore.merge(pid, {
    multiaddrs: peerInfo.addrs.map((v) => multiaddr(v)),
  })

  await node.start()

  try {
    console.log(await node.services.ping.ping(pid))
    console.log(await node.services.ping.ping(pid))

    const info = await node.peerStore.get(pid)
    console.log('PROTOCOLS', info.protocols)
  } finally {
    // print all open handles after 5s - `.unref` the timeout so it doesn't keep the process running itself
    setTimeout(() => {
      why()
    }, 5_000).unref()

    console.log('STOPPING', node.status)
    await node.stop()
    console.log('STOPPED', node.status)
  }

  console.log('RETURNING', node.status)
  return
}

process.on('exit', (code) => {
  console.log('Process exiting with code', code)
})

async function realMain() {
  await main()
  console.log('MAIN DONE')
}

realMain()

@achingbrain
Copy link
Member

achingbrain commented Apr 10, 2025

Thanks for that - I've replicated the issue locally. Can you try patching your local copy of @libp2p/webrtc in node_modules with the changes from https://github.com/libp2p/js-libp2p/pull/3076/files?w=1 ?

It basically just wraps the use of the handshake datachannel in a try/finally and explicitly closes the channel once the handshake is complete or has failed.

I'm not seeing the hang any more with the fix (though sometimes there's an EOF where the remote closes the channel during the noise handshake, but it doesn't keep the process running).

@burdiyan
Copy link

burdiyan commented Apr 10, 2025

@achingbrain Yeah, this seems to help. Thanks! Hope the PR won't take long to get accepted :)

@burdiyan
Copy link

@achingbrain Before the PR is accepted, do you think there's any way to close that leaking channel from the outside? Maybe some obscure API that would let you close the handle that prevents the program to exit?

@achingbrain
Copy link
Member

There's no secret API that allows access to the channel. The strange thing is that the RTCPeerConnection is closed when the node is stopped - at this point any open datachannels should get closed too but for a reason that needs figuring out this doesn't appear to be happening.

If you are blocked & need a release ASAP you could try something like patch-package to apply the changes after an npm install, though you'll likely lose compatibility with any rust-libp2p nodes until libp2p/rust-libp2p#5986 is resolved.

@achingbrain
Copy link
Member

achingbrain commented Apr 16, 2025

I'm still seeing this even after #3076 - I think the bug might be in node-datachannel, it looks like sometimes RTCDataChannels just don't close properly - murat-dogan/node-datachannel#349

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/author-input Needs input from the original author
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants