Skip to content

Conversation

@piotrpio
Copy link
Contributor

@piotrpio piotrpio commented Nov 5, 2025

Signed-off-by: Piotr Piotrowski [email protected]


#### Custom server pool

Client should have a way to specify custom server pool for reconnection
Copy link
Collaborator

@scottf scottf Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure...clients should have a way to specify a custom server pool for ALL connection attempts including the original connection and reconnection attempts.

It also seems logical that the pool is part of connection options. Is this document saying that the pool can be replaced at any time? Much like an auth handler, I think the pool should be able to change it's behavior itself, for instance in response to outside input, RTTs or any other thing without having to replace the entire pool with another instance...but this may be a language specific problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could be more clear, but yes, the idea is to have a method which sets a server pool which you can invoke whenever you want. It does not force reconnect itself, it just configures the pool and you can call it whenever and however many times you want.

As for the initial server pool - it's already part of establishing connection in all clients right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saying... Java only allows you to set the server pool implementation as part of connection options. I'm not entirely against allowing the dev to set this at any time, I'm just saying it would make more sense if the pool implementation is dynamic in a similar way to an auth handler that reads from a file every time. Just set it at the beginning of the connection and you are done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In simpler words: calling this API replaces all the servers, with the new provided servers.

// before attempting connection. If false, attempt connection immediately.
type ReconnectToServerHandler func([]Server, ServerInfo) (*url.URL, bool)
```

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There also needs to be feedback to the pool.

  • Give the pool discovered servers whenever the client receives them.
  • Tell the pool when a connection failed or succeeded.

Copy link
Contributor Author

@piotrpio piotrpio Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give the pool discovered servers whenever the client receives them

Agree, we should probably add discovered servers o the pool (with information on whether a server was added from server INFO or by the user)

Tell the pool when a connection failed or succeeded.

Implementation may vary, but in go all servers in pool have a reconnect counter which is reset on successful connect (which is used for tracking MaxReconnects), so there is feedback.


New pool should be used for all reconnection attempts until the client is
restarted or a new pool is specified. New pool is subject to the same rules as
the default one - randomization unless disabled, max reconnect attempts, etc.
Copy link
Collaborator

@scottf scottf Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pool should not have to care about anything of the normal options or behavior, but it can care if it choses. This is the point of suppling a custom pool, to allow it to do anything it wants.

This means that randomization and the act of resolving ip addresses are part of the pool behavior.

Max reconnect attempts is not part of the pool behavior, it's part of the behavior that uses the pool. The behavior should ask for a resolved ip address, but the pool can chose what the result of a resolved ip address request is, so for instance it could give a resolved address or not when asked for an address

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ie - once the server list is set, those are the only servers - on connect, the client will learn new servers - if the client doesn't want to use gossiped servers, it should specify its option to ignore cluster updates.

INFO.

The callback should return the server URL to which the client should attempt to
reconnect (a url outside of the pool should not be allowed). Additionally, the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we care if the reconnect to a specific server wants to connect to something outside of the pool?

What happens when a reconnect to a specific server fails?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A retry like normal.

We want to clearly separate concerns here:

  • Managing the pool
  • Picking from the pool

These are different things, the callback is picking from the pool only.


The callback should return the server URL to which the client should attempt to
reconnect (a url outside of the pool should not be allowed). Additionally, the
callback should allow to specify whether client should include a delay before
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delay should not be part of the callback - it's already configured for the connection options. The pool should simply give a server address to connect to.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really true, we have various dynamic reconnect delay options already.

The use case here for example when you get information about primary and backup clusters from a service discovery system. You would have a primary and backup cluster, you'd try the primary cluster a few times (with backoff) and finally decide on some value to switch to the standby cluster - that switch should essentially reset the backoff because now you're retrying against a new cluster.

I think probably the simple boolean here is not enough, but we do need the flexibility to influence delay+target selection as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'm also not sure a boolean is enough here, but given that we already have either ReconnectWait (which is a duration) or a callback which can customize it, I think it's reasonable to be able to reuse what's there. Also (at least in go), user may want to decide to add a sleep in the reconnect callback.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different concern than the pool that is already addressed, granted in a way that the user may not be able override in certain implementations.

It may be simple enough to allow a callback for this specific behavior. For instance the Java code already has a functional callback, outside of the pool, that can have dynamic behavior, and the developer has the ability to have that callback be aware or influenced by the pool.

This is the model I propose. Keep the pool the pool. Keep other behaviors separate. It's up to the user to coordinate that behavior.

// or a new server URL. New servers will be automatically added to the pool.
// - bool: If true, apply reconnect delay (ReconnectWait + jitter or CustomReconnectDelayCB)
// before attempting connection. If false, attempt connection immediately.
type ReconnectToServerHandler func([]Server, ServerInfo) (*url.URL, bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should simply return the server - this is a "select" server - if this is done any client flag for selection/retry etc is getting ignored - not a fan.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above:

The use case here for example when you get information about primary and backup clusters from a service discovery system. You would have a primary and backup cluster, you'd try the primary cluster a few times (with backoff) and finally decide on some value to switch to the standby cluster - that switch should essentially reset the backoff because now you're retrying against a new cluster.

I think probably the simple boolean here is not enough, but we do need the flexibility to influence delay+target selection as above


As arguments, it should receive the list of servers that are in the pool
(together with reconnect attempt number), as well as the most up-to-date server
INFO.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not quite sure why we want to expose all the metrics - the callbacks for reconnecting/reconnect/disconnect have info on what is happening, and grabbing the info for their investigation is available on connection - this means that they can grab/store/use their own data to supplement the selection the "server selector".

Effectively, I would move my internal selection logic to such function and use it internally, this way how it's handled is uniform.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you get information from SRV records you get priority and weight as part of the SRV answer, this means reconnections have to proceed respecting those. It's metrics entirely outside of the clients knowledge, only by knowing the recent reconnection counts etc can this pluggable behaviour make those decisions

attempts, ignoring any advertised servers from the Server.

New pool should be used for all reconnection attempts until the client is
restarted or a new pool is specified. New pool is subject to the same rules as
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to clarify here what happens if a custom pool it set at run time and gossip learn is on, do they merge like they would normally from the initial url list or not?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I meant by feedback. Basically the pool cannot be just "give me the next server" There has to be a way for the pool to be given the discovered servers, although the pool can ignore them if it wants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could just add gossiped servers to the pool with an indicator whether it was configured by the user or came from auto-discovery, just like we do without the custom pool/

@scottf
Copy link
Collaborator

scottf commented Nov 7, 2025

For reference here is the current Java behavior. The pool and re/connect behavior is based on the original go reference implementation. Not here: details about default pool behavior tracking failures and default pool management.

Pool behavior:

  • be initialized with the connection options to be able to extract information like:
    • the bootstrap servers
    • max reconnect attempts
    • whether to ignore discovered servers
    • whether to randomize
  • accept discovered urls
  • be able to peek at the next server - look at the next server without the pool marking it as given
  • give the next server
  • be asked to resolve a urls's host to a list of valid ip based utrls. Always returns a list of resolved or a list of just the original url
  • be told when a connect to a server succeeded
  • be told when a connect to a server failed
  • be able to be asked if the pool represents the needs to upgrade to a secure connection

Reconnect delay handler / callback:

  • input is the reconnect round, starting at 1, incrementing every time the full server list is tried
  • returns a length of time to wait
  • default implementations leverages connection options
    • reconnect jitter for Tls/Secure connection
    • reconnect jitter for insecure connections

Initial connect behavior:

  • Peek at the next server from the pool - this gives the ability to know if you've gone through the whole list or have already tried a server. Most useful on reconnect when reconciling with max reconnect attempts.
  • Get the next server from the pool
  • Ask the pool to resolve the server to ip addresses.
  • Try to connect to all of the resolved list
  • Tell pool success of failure for the unresolved host
  • Loop until all servers are tried or success;

Reconnect behavior:

  • Peek at the next server from the pool - this gives the ability to know if you've gone through the whole list or have already tried a server
  • Get the next server from the pool
  • Invoke the reconnect delay handler / callback and wait the expected time
  • Ask the pool to resolve the server to ip addresses.
  • Try to connect to all of the resolved list
  • Tell pool success of failure for the unresolved host
  • Loop until all servers are tried for the maximum number of reconnects or success

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants