Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: rework Reconciler to use proper http.Transport #929

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

DmitriyMV
Copy link
Member

@DmitriyMV DmitriyMV commented Feb 13, 2025

chore: rework Reconciler to use proper http.Transport

  • Ensure Reconciler is internally consistent on all variations of Reconcile call (including parallel). Track aliases and clusters
    side by side.
  • Add tests for the above.
  • Replace HealthCheck logic with the actual tcp probing.
  • Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
  • Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
  • Ensure that active probes are actually deleted on time expiration.
  • [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
  • Expose metrics. Specifically for idleConnections and inuse connections. Register those in prometheus.
  • Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For #886

Signed-off-by: Dmitriy Matrenichev [email protected]

@DmitriyMV DmitriyMV added the integration/e2e-workload-proxy Triggers all e2e workload proxy tests for Omni label Feb 13, 2025
logger: logger,
lbLogger: logger.WithOptions(zap.IncreaseLevel(zapcore.ErrorLevel)),
logLevel: logLevel,
trans := &http.Transport{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use https://github.com/hashicorp/go-cleanhttp instead of hand-rolling it

ExpectContinueTimeout: 1 * time.Second,
}

if err := http2.ConfigureTransport(trans); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we dial without TLS, I'm not sure it has much value

trans := &http.Transport{
Proxy: http.ProxyFromEnvironment,
ForceAttemptHTTP2: true,
MaxIdleConns: 100,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to consider idle connections in a specific way, as the host would be service name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's also stuff like MaxConnsPerHost

return proxy, clusterID, nil
// hostPortForAlias returns a unique IP:port for the given cluster and alias.
func hostPortForAlias(clusterID resource.ID, alias string) string {
return fmt.Sprintf("%s_%s:4242", clusterID, alias)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have some kind of guarantee that _ can't be in the clusterID ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. Maybe pick some other (perhaps unicode) symbol? But there is no guarantee that it will not be picked in the future either.There is one in ClusterDiscoveryConfig.Validate and one in Config.Validate checks in Talos, but both only ensure that ClusterID is not empty. As for Omni, I'm currently try to dig usages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we pick a very obscure character, we can also disallow it in the validations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use net.JoinHostPort

}

func (b *backend) HealthCheck(context.Context) (upstream.Tier, error) { return 0, nil }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that should be a Dial check actually ?


return startErr
}
func (registry *Reconciler) removeLB(cluster resource.ID, alias string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should call this on some "idle" timeout - that is if we don't have DialContext for the cluster for some timeout, we can shut it down

@DmitriyMV DmitriyMV force-pushed the reconciler branch 6 times, most recently from 06b6aae to 754b393 Compare February 23, 2025 19:16
@DmitriyMV DmitriyMV linked an issue Feb 23, 2025 that may be closed by this pull request
@DmitriyMV DmitriyMV force-pushed the reconciler branch 4 times, most recently from 75c07ff to e752ce3 Compare February 23, 2025 21:24
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with the actual tcp probing.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration/e2e-workload-proxy Triggers all e2e workload proxy tests for Omni status/ok-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

workload proxy refactoring
4 participants