Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workload proxy refactoring #886

Open
Tracked by #795
smira opened this issue Jan 29, 2025 · 2 comments · May be fixed by #929
Open
Tracked by #795

workload proxy refactoring #886

smira opened this issue Jan 29, 2025 · 2 comments · May be fixed by #929
Assignees

Comments

@smira
Copy link
Member

smira commented Jan 29, 2025

Ignoring Healthchecks

  1. HTTP request from the user arrives to <uniqueID>.proxy.omni, uniqueID can be mapped to cluster, workload service.
  2. We validate the request, and we can convert cluster,service pair to a list of [IP:port, ...]
  3. We use Go's http ReverseProxy, we rewrite the URL to be http://<cluster>.<service>/<original URL>.
  4. In Reverse Proxy HTTP transport, we override Dial function to "resolve" <cluster>.<service> to one of the [IP:port] pairs.

The benefit: http.Client can re-use outgoing HTTP connections, including HTTP/1.1 keep-alive if the host part of the URL is the same.

We can control size of the HTTP transport idle connection pool, etc.

With Healthchecks

Running healthchecks has a cost: it will do Dial on all configured endpoints with a configured interval, which has both CPU & network resource usage.

We can run healthchecks on demand - if some cluster-service pair is being used, we start healthchecking, if it is idle for some time, we can shut down healthchecks. We can bootstrap initial healthcheck state with machine connection status.

Note: don't run healthchecks for single-node clusters.

In the flow above, on step (4), when we Dial, we do upstreams.Pick() to get a random healthy machine.

Note: we can enable healthcheck to use latency tiers.

@smira
Copy link
Member Author

smira commented Jan 29, 2025

This should resolve a memory leak related to memconn in tcp.Loadbalancer.

@DmitriyMV DmitriyMV self-assigned this Feb 5, 2025
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 13, 2025
Remove inmemory connection, and use custom `Director` and `DialContext`
to proxify connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
@DmitriyMV DmitriyMV linked a pull request Feb 13, 2025 that will close this issue
8 tasks
@DmitriyMV
Copy link
Member

DmitriyMV commented Feb 13, 2025

  • Shutdown healthchecks for cluster that hasn't been used for some time. (remove upstream.List[T]) but keep the key. Lazy upstream initialization.
  • Simplify map->map to single cluster map. We have an cluster list with list of upstreams where for each alias only port information differs. That means that we can simplify map of upstreams from two levels to one. Create a separate map of aliases for port info.
  • Before set upstream list we "down" those who are definitely down. Required for those where requests comes right after Reconcile.
  • Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.
  • Set metrics for idleConnections and inused connections: either using prometheus or using Go internals.
  • Put those to Grafana.

DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 22, 2025
Remove inmemory connection, and use custom `Director` and `DialContext`
to proxify connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Replace HealthCheck logic with the actual tcp probing.
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Replace HealthCheck logic with the actual tcp probing.
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Replace HealthCheck logic with the actual tcp probing.
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with the actual tcp probing.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with the actual tcp probing.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
@DmitriyMV DmitriyMV linked a pull request Feb 23, 2025 that will close this issue
8 tasks
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with the actual tcp probing.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with the actual tcp probing.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with the actual tcp probing.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with the actual tcp probing.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/omni that referenced this issue Feb 23, 2025
- [x] Ensure `Reconciler` is internally consistent on all variations of `Reconcile` call (including parallel). Track aliases and clusters
side by side.
- [x] Add tests for the above.
- [x] Replace HealthCheck logic with the actual tcp probing.
- [ ] Replace probing port on alias removal. That is - if we lost an alias to the probing port, find a new one and use it.
- [ ] Find a way to test the dialer logic. Pass the cache as DI? Or add probes factory?
- [ ] Ensure that active probes are actually deleted on time expiration.
- [?] On setting upstream.List[T] ensure that "down" clusters are added with the negative score so we don't see those on dial attempt. (Looks like it's already the case).
- [ ] Expose metrics. Specifically for `idleConnections` and `inuse` connections. Register those in prometheus.
- [ ] Use latency checker from standard prometheus exporter. GotConn and WriteReq are important. GotResponse maybe interesting for their workload. Count can be used for idleConnection because in-flight requests and in-flight connections.

For siderolabs#886

Signed-off-by: Dmitriy Matrenichev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants