workload proxy refactoring #886

smira · 2025-01-29T16:08:25Z

Ignoring Healthchecks

HTTP request from the user arrives to <uniqueID>.proxy.omni, uniqueID can be mapped to cluster, workload service.
We validate the request, and we can convert cluster,service pair to a list of [IP:port, ...]
We use Go's http ReverseProxy, we rewrite the URL to be http://<cluster>.<service>/<original URL>.
In Reverse Proxy HTTP transport, we override Dial function to "resolve" <cluster>.<service> to one of the [IP:port] pairs.

The benefit: http.Client can re-use outgoing HTTP connections, including HTTP/1.1 keep-alive if the host part of the URL is the same.

We can control size of the HTTP transport idle connection pool, etc.

With Healthchecks

Running healthchecks has a cost: it will do Dial on all configured endpoints with a configured interval, which has both CPU & network resource usage.

We can run healthchecks on demand - if some cluster-service pair is being used, we start healthchecking, if it is idle for some time, we can shut down healthchecks. We can bootstrap initial healthcheck state with machine connection status.

Note: don't run healthchecks for single-node clusters.

In the flow above, on step (4), when we Dial, we do upstreams.Pick() to get a random healthy machine.

Note: we can enable healthcheck to use latency tiers.

The text was updated successfully, but these errors were encountered:

smira · 2025-01-29T16:08:50Z

This should resolve a memory leak related to memconn in tcp.Loadbalancer.

Remove inmemory connection, and use custom `Director` and `DialContext` to proxify connections. For siderolabs#886 Signed-off-by: Dmitriy Matrenichev <[email protected]>

DmitriyMV · 2025-02-13T11:54:21Z

Shutdown healthchecks for cluster that hasn't been used for some time. (remove upstream.List[T]) but keep the key. Lazy upstream initialization.
Simplify map->map to single cluster map. We have an cluster list with list of upstreams where for each alias only port information differs. That means that we can simplify map of upstreams from two levels to one. Create a separate map of aliases for port info.
Before set upstream list we "down" those who are definitely down. Required for those where requests comes right after Reconcile.
Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.
Set metrics for idleConnections and inused connections: either using prometheus or using Go internals.
Put those to Grafana.

Remove inmemory connection, and use custom `Director` and `DialContext` to proxify connections. For siderolabs#886 Signed-off-by: Dmitriy Matrenichev <[email protected]>

- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters side by side. - [x] Add tests for the above. - [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it. - [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory? - [ ] Replace HealthCheck logic with the actual tcp probing. - [ ] Ensure that active probes are actually deleted on time expiration. - [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case). - [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus. - [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections. For siderolabs#886 Signed-off-by: Dmitriy Matrenichev <[email protected]>

- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters side by side. - [x] Add tests for the above. - [x] Replace HealthCheck logic with the actual tcp probing. - [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it. - [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory? - [ ] Ensure that active probes are actually deleted on time expiration. - [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case). - [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus. - [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections. For siderolabs#886 Signed-off-by: Dmitriy Matrenichev <[email protected]>

This was referenced Jan 29, 2025

fix: do not use memconn for tcpproxy load balancer #885

Closed

Investigate high memory consumption on the Omni account #795

Open

DmitriyMV self-assigned this Feb 5, 2025

DmitriyMV linked a pull request Feb 13, 2025 that will close this issue

chore: rework Reconciler to use proper http.Transport #929

Open

8 tasks

DmitriyMV linked a pull request Feb 23, 2025 that will close this issue

chore: rework Reconciler to use proper http.Transport #929

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workload proxy refactoring #886

workload proxy refactoring #886

smira commented Jan 29, 2025 •

edited

Loading

smira commented Jan 29, 2025

DmitriyMV commented Feb 13, 2025 •

edited

Loading

workload proxy refactoring #886

workload proxy refactoring #886

Comments

smira commented Jan 29, 2025 • edited Loading

Ignoring Healthchecks

With Healthchecks

smira commented Jan 29, 2025

DmitriyMV commented Feb 13, 2025 • edited Loading

smira commented Jan 29, 2025 •

edited

Loading

DmitriyMV commented Feb 13, 2025 •

edited

Loading