Skip to content

Conversation

@ipochi
Copy link
Contributor

@ipochi ipochi commented Oct 27, 2025

Currently, the server only logs if it receives a drain request from the agent. Ideally, in that scenario, the server should mark it and not select it for newer requests. This PR implements that.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ipochi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 27, 2025
@jkh52
Copy link
Contributor

jkh52 commented Oct 30, 2025

I'm happy to see interest in completing this feature! When I had been looking into this, here's the questions I ran into:

  1. Whether the server should keep agent draining state by raw connection, or by agent identifiers. The current BackendStorage makes the latter much more complicated, I agree with the former (like this PR does). But note that currently, the agent sync loop will continue to issue new syncs, and a server could get into conflicting drain state for a given agent. So without an accompanying agent change (to stop syncing new connections after entering draining state), it is possible for the server to continue picking these agents.

That potential agent change is probably an improvement but needs careful risk analysis.

  1. Fallback logic in case no non-draining agent is found, continue using one that is draining. I think the argument for this is strong, because it's "no worse" than the current server behavior, and may avoid degrading some clusters (those that are limping along with very few agents, e.g. 1).

@ipochi
Copy link
Contributor Author

ipochi commented Oct 30, 2025

I'm happy to see interest in completing this feature! When I had been looking into this, here's the questions I ran into:

  1. Whether the server should keep agent draining state by raw connection, or by agent identifiers. The current BackendStorage makes the latter much more complicated, I agree with the former (like this PR does). But note that currently, the agent sync loop will continue to issue new syncs, and a server could get into conflicting drain state for a given agent. So without an accompanying agent change (to stop syncing new connections after entering draining state), it is possible for the server to continue picking these agents.

That potential agent change is probably an improvement but needs careful risk analysis.

  1. Fallback logic in case no non-draining agent is found, continue using one that is draining. I think the argument for this is strong, because it's "no worse" than the current server behavior, and may avoid degrading some clusters (those that are limping along with very few agents, e.g. 1).

Thank you for the feedback @jkh52

Your point on 1, hadn't crossed my mind. It seems logical thing to do to stop the agent sync loop if the agent is draining. I'll give more thought on this whether the server should do more so that it doesn't get into a conflicted state or the agent should stop teh sync loops.

On Point 2, I think its valid point that continue to use the draining agent if there are no non-draining agents. Its akin falling back on current behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants