Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Traffic Control] Add dynamic config via admin api #20887

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

williampsmith
Copy link
Contributor

@williampsmith williampsmith commented Jan 15, 2025

Description

Implement an admin API endpoint to do a hot reconfiguration of the traffic control policy without clearing blocklists or existing count min sketch.

To facilitate this, the PR introduces a refactor that allows AuthorityState to hold a reference to TrafficController, and for all sui callsites to reference this rather than instantiating separately. This is ok for our purposes (and indeed is much cleaner) because sui-node enforces that validators do not run json rpc server. Additionally, we need to make the policy a member of the traffic controller struct in order to reference it from the main thread.

Things to note:
For now, this only works when the node is previously running traffic controller with a blocklist policy. i.e. it will fail if used to enable traffic control on a node that did not already have it enabled. It will also fail if called when the node is running an allowlist policy.

Given the above, there are two intended use patterns:

  1. Before being called, the node is running traffic controller in dry-run mode with some non-allowlist policy config so that tallying and metrics collection occurs. In such a case, it can be called with the appropriate flag to disable dry-run mode. Note that it can also be called to enable dry-run mode where the node previously had it running in active mode.
  2. Before being called, the node is running traffic controller in active mode, but the node operator wants to elevate the policy restriction by adjusting the threshold values.

Note also that this currently is supported for rpc node and validator server traffic control use cases. Bridge node is not yet supported.

Test plan

How did you test the new or updated feature?


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • gRPC:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:

Copy link

vercel bot commented Jan 15, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 28, 2025 10:34pm
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Mar 28, 2025 10:34pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Mar 28, 2025 10:34pm

@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 15, 2025 03:19 — with GitHub Actions Inactive
@williampsmith williampsmith force-pushed the william/traffic-control-admin-api branch from 8e20a05 to 9f63abd Compare January 16, 2025 05:05
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 16, 2025 05:06 — with GitHub Actions Inactive
@williampsmith williampsmith force-pushed the william/traffic-control-admin-api branch from 9f63abd to 813327f Compare January 16, 2025 05:28
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 16, 2025 05:28 — with GitHub Actions Inactive
@williampsmith williampsmith marked this pull request as ready for review January 16, 2025 05:32
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 16, 2025 05:32 — with GitHub Actions Inactive
Copy link
Contributor

@johnjmartin johnjmartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the most familiar with this code, but LGTM

@williampsmith williampsmith force-pushed the william/traffic-control-admin-api branch from e5a02bc to 8a45392 Compare January 28, 2025 21:39
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 28, 2025 21:39 — with GitHub Actions Inactive
@williampsmith williampsmith force-pushed the william/traffic-control-admin-api branch from 8a45392 to 6fa8705 Compare January 28, 2025 22:47
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 28, 2025 22:47 — with GitHub Actions Inactive
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 29, 2025 01:21 — with GitHub Actions Inactive
@williampsmith williampsmith force-pushed the william/traffic-control-admin-api branch from 161d531 to 4402ce6 Compare January 29, 2025 01:49
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 29, 2025 01:50 — with GitHub Actions Inactive
@williampsmith williampsmith temporarily deployed to sui-typescript-aws-kms-test-env January 29, 2025 09:39 — with GitHub Actions Inactive
@williampsmith williampsmith force-pushed the william/traffic-control-admin-api branch from 2e88644 to 8f75876 Compare February 1, 2025 02:06
@@ -738,6 +824,7 @@ async fn test_traffic_sketch_allowlist_mode() {
}

async fn assert_traffic_control_ok(mut test_cluster: TestCluster) -> Result<(), anyhow::Error> {
telemetry_subscribers::init_for_testing();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason for adding this to the different tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without it, you won't get logs printed when you run unit tests

@johnjmartin
Copy link
Contributor

To facilitate this, the PR introduces a refactor that allows AuthorityState to hold a reference to TrafficController, and for all sui callsites to reference this rather than instantiating separately. This is ok for our purposes (and indeed is much cleaner) because sui-node enforces that validators do not run json rpc server.

can you expand on this? without the enforcement is there a risk that trafficController blocks JSONRPC requests? We do run this on fullnodes too

it will fail if used to enable traffic control on a node that did not already have it enabled. It will also fail if called when the node is running an allowlist policy.

is this by design? I think it's desirable for validators to enable traffic control only via the api, but I understand if that's adding too much complexity to this PR.

@williampsmith
Copy link
Contributor Author

To facilitate this, the PR introduces a refactor that allows AuthorityState to hold a reference to TrafficController, and for all sui callsites to reference this rather than instantiating separately. This is ok for our purposes (and indeed is much cleaner) because sui-node enforces that validators do not run json rpc server.

can you expand on this? without the enforcement is there a risk that trafficController blocks JSONRPC requests? We do run this on fullnodes too

Basically with the current formulation, there will only ever be one instance of TrafficController, since it is held in AuthorityState, which is a singleton. However, there are multiple holders of references to AuthorityState, and specifically, there are two that are callsites for traffic controller. One is the json rpc side, and the other is the AuthorityServer (which runs on validator only). So the risk is that if both json rpc and authority server are running on the same sui-node instance, they will both be writing to the same TrafficController instance, which we want to avoid. But the assertion I'm making is that sui-node is always either a validator or an rpc node or neither, which makes this safe. Lmk if you have any reason to disagree with that statement.

it will fail if used to enable traffic control on a node that did not already have it enabled. It will also fail if called when the node is running an allowlist policy.

is this by design? I think it's desirable for validators to enable traffic control only via the api, but I understand if that's adding too much complexity to this PR.

Yeah it would make things pretty complicated indeed, since you would need to change a bunch of references in authority server, for example, to be mutexes in order to create a traffic controller reference where previously there was none. The solution is to have nodes running in dry run mode, and then they can dynamically go from dry-run mode to live mode through this API without issue. There should be no reason for nodes to not at least run dry run mode at this point.

…; during downtime we ignore tally requests with log line
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants