feat: Extend the text based configuration to include feature flags and the SaturationDetector's configuration #1492

shmuelk · 2025-08-28T10:24:42Z

What type of PR is this?
/kind feature

What this PR does / why we need it:
Eliminates the use of environment variables to enable the new Data Layer and configure the SaturationDetector.

Using environment variables for these kinds of settings does not scale. It is also much easier to maintain multiple configurations, if needed, in the form of simple YAML text.

Which issue(s) this PR fixes:
Fixes #1481

Does this PR introduce a user-facing change?:

- The saturation detector is no longer configured via the environment variables `SD_QUEUE_DEPTH_THRESHOLD`, `SD_KV_CACHE_UTIL_THRESHOLD`, and `SD_METRICS_STALENESS_THRESHOLD`. Instead it is configured via the `saturationDetector` section of the text based configuration. See the text based configuration guide for more details.
- The experimental Data Layer feature is no longer enabled via the `ENABLE_EXPERIMENTAL_DATALAYER_V2` environment variable. It is instead enabled in the `featureGates` section of the text based configuration. See the text based configuration guide for more details.
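
As a rough illustration of the first change, the sketch below decodes a configuration document that carries a `saturationDetector` section. It is not the loader code from this PR: the inner field names and types are assumptions inferred from the old environment variable names, and `parseConfig` is a hypothetical helper. It uses `encoding/json` to stay within the standard library; Kubernetes-style loaders commonly convert YAML to JSON first, which is why the `json` struct tags apply.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SaturationDetector mirrors the section name introduced by this PR.
// The field names and types are assumptions made for illustration.
type SaturationDetector struct {
	QueueDepthThreshold       int     `json:"queueDepthThreshold,omitempty"`
	KVCacheUtilThreshold      float64 `json:"kvCacheUtilThreshold,omitempty"`
	MetricsStalenessThreshold string  `json:"metricsStalenessThreshold,omitempty"`
}

// EndpointPickerConfig is trimmed down to the single new section.
type EndpointPickerConfig struct {
	// A nil pointer means the section was absent, so defaults apply.
	SaturationDetector *SaturationDetector `json:"saturationDetector,omitempty"`
}

// parseConfig (hypothetical) decodes the JSON form of the configuration.
func parseConfig(doc []byte) (*EndpointPickerConfig, error) {
	cfg := &EndpointPickerConfig{}
	if err := json.Unmarshal(doc, cfg); err != nil {
		return nil, fmt.Errorf("decoding configuration: %w", err)
	}
	return cfg, nil
}

func main() {
	doc := []byte(`{"saturationDetector": {"queueDepthThreshold": 5, "kvCacheUtilThreshold": 0.8}}`)
	cfg, err := parseConfig(doc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *cfg.SaturationDetector)
}
```

Keeping the section behind a pointer lets the loader tell "section omitted" apart from "section present with zero values".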

netlify · 2025-08-28T10:24:49Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`5b9a70e`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68cb00a34b57d7000746b203
😎 Deploy Preview	https://deploy-preview-1492--gateway-api-inference-extension.netlify.app

k8s-ci-robot · 2025-08-28T10:24:50Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shmuelk
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-08-28T10:24:52Z

Hi @shmuelk. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

nirrozenbaum · 2025-08-28T11:57:27Z

apix/config/v1alpha1/endpointpickerconfig_types.go

+	// +optional
+	// SaturationDetector when present specifies the configuration of the
+	// Saturation detector. If not present, default values are used.
+	SaturationDetector *SaturationDetector `json:"saturationDetector,omitempty"`


I think this is too specific and should be generalized.
We've talked in the past about having a parameters section in the config API that multiple structs in the code can consume. For example, I might want to consume the metricsStalenessThreshold in more places.
This approach is not future-proof: we might need to keep adding fields here instead of just generalizing.

Parts of the code "just knowing" which parameters they are looking for in a pool of parameters is as error-prone as environment variables. A typo will cause a parameter to be ignored, and presumably a default value will be used. I believe scenarios like this will be hard to debug.

The sections need to be explicit.

Whether to use constants, as I did, or references to a parameter is a different issue. It does, however, make for a very verbose configuration, with the need for more validation code, unless you do something like:

saturationDetector:
  queueDepthThreshold:
    value: 5
  kvCacheUtilThreshold:
    parameterRef: plover
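
A minimal sketch of how such a `value` / `parameterRef` field could be resolved is shown below. The type `IntSetting`, its `Resolve` method, and the shared parameter map are all hypothetical names for illustration; nothing like this is part of the PR. The point is that an unknown reference fails loudly instead of silently falling back to a default.

```go
package main

import (
	"errors"
	"fmt"
)

// IntSetting is a hypothetical field type: exactly one of Value or
// ParameterRef is expected to be set.
type IntSetting struct {
	Value        *int   `json:"value,omitempty"`
	ParameterRef string `json:"parameterRef,omitempty"`
}

// Resolve returns the literal value, or looks the reference up in the
// shared parameters. A misspelled reference is reported as an error.
func (s IntSetting) Resolve(params map[string]int) (int, error) {
	switch {
	case s.Value != nil && s.ParameterRef != "":
		return 0, errors.New("value and parameterRef are mutually exclusive")
	case s.Value != nil:
		return *s.Value, nil
	case s.ParameterRef != "":
		v, ok := params[s.ParameterRef]
		if !ok {
			return 0, fmt.Errorf("unknown parameter %q", s.ParameterRef)
		}
		return v, nil
	default:
		return 0, errors.New("one of value or parameterRef is required")
	}
}

func main() {
	five := 5
	params := map[string]int{"plover": 3}
	settings := []IntSetting{
		{Value: &five},          // literal value
		{ParameterRef: "plover"}, // valid reference
		{ParameterRef: "plovr"},  // typo, rejected
	}
	for _, s := range settings {
		v, err := s.Resolve(params)
		fmt.Println(v, err)
	}
}
```

Every field that accepts either form needs this extra check, which is the additional validation cost mentioned above.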

nirrozenbaum · 2025-08-28T11:59:39Z

apix/config/v1alpha1/endpointpickerconfig_types.go

+type FeatureGates struct {
+	// +optional
+	// EnableDataLayer If present and true enables the experimental Datalayer APIs.
+	// The Datalayer APIs will be disabled if EnableDataLayer is not present in the
+	// configuration or its value is false.
+	EnableDataLayer bool `json:"enableDataLayer,omitempty"`
+
+	// +optional
+	// EnableFlowControl If present and true enables the experimental FlowControl feature.
+	// The FlowControl feature will be disabled if EnableFlowControl is not in the
+	// configuration or its value is false.
+	EnableFlowControl bool `json:"enableFlowControl,omitempty"`
+}


Can this be generalized to a map[string]bool instead of two boolean fields? What if tomorrow I have another feature? Would we need to update the config API code for every new feature?
What if I'd like to use the same config API to enable a feature in llm-d that is not part of the GIE?

The disadvantage of a map, and of environment variables for that matter, is that it is difficult to validate. A silly typo causes all sorts of problems.

Validating the keys against a fixed list would not address your wish to enable experimental features in llm-d.

Having thought about this some more, what about the following:

FeatureGates will be a map[string]bool

There will be an API to register feature gate keys. The EPP will use this to set its own feature gate keys and llm-d can use this as well.

The map will be validated. Any key that is in the FeatureGates map and not registered will cause the validation of the configuration to fail.

The EPP config will have a FeatureGates map containing a value for every registered key. The values will be initialized to false and then overlaid with the values from the text configuration's FeatureGates field.

@nirrozenbaum ?

I have pushed changes that implement the approach described above.
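
The registration scheme proposed above can be sketched roughly as follows. This illustrates the idea only and is not the code pushed to this PR: `Registry`, `Register`, `Validate`, `Resolve`, and the gate names `dataLayer` and `flowControl` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Registry holds the feature gate keys that components have registered.
// All names in this sketch are hypothetical.
type Registry struct {
	known map[string]struct{}
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{known: map[string]struct{}{}}
}

// Register adds a key. The EPP would register its own keys, and a
// downstream project could register additional ones.
func (r *Registry) Register(key string) {
	r.known[key] = struct{}{}
}

// Validate fails if the configuration names a key nobody registered.
func (r *Registry) Validate(configured map[string]bool) error {
	var unknown []string
	for key := range configured {
		if _, ok := r.known[key]; !ok {
			unknown = append(unknown, key)
		}
	}
	if len(unknown) > 0 {
		sort.Strings(unknown) // stable error text
		return fmt.Errorf("unregistered feature gates: %v", unknown)
	}
	return nil
}

// Resolve returns a value for every registered key: false by default,
// overlaid with whatever the configuration specifies.
func (r *Registry) Resolve(configured map[string]bool) (map[string]bool, error) {
	if err := r.Validate(configured); err != nil {
		return nil, err
	}
	resolved := make(map[string]bool, len(r.known))
	for key := range r.known {
		resolved[key] = false
	}
	for key, enabled := range configured {
		resolved[key] = enabled
	}
	return resolved, nil
}

func main() {
	r := NewRegistry()
	r.Register("dataLayer")
	r.Register("flowControl")

	gates, err := r.Resolve(map[string]bool{"dataLayer": true})
	fmt.Println(gates, err)

	_, err = r.Resolve(map[string]bool{"dataLayr": true}) // typo
	fmt.Println(err)
}
```

Because the resolved map always contains every registered key, callers can read a gate without checking whether it was present in the configuration.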

LukeAVanDrie · 2025-08-28T17:33:44Z

pkg/epp/config/loader/configloader.go

+	return featureConfig
+}
+
+func loadSaturationDetectorConfig(sd *configapi.SaturationDetector) saturationdetector.Config {


nit: As we expand the configuration surface for more components, does it make sense to include all validation and defaulting logic in loader/...? Or is it preferable to delegate to the local config.go file for each component? E.g., saturationdetector/config.go could export:

// ValidateAndApplyDefaults checks the provided configuration for validity and then mutates the receiver to populate any
// empty fields with system defaults.
func (c *Config) ValidateAndApplyDefaults() error { ... }

If I were a developer working on an individual component, I think the latter would be easier to trace as it keeps the default values and validation logic scoped locally to the component. It also keeps the single entry point loader/... as a pure delegator. Overall, I think this approach has tighter cohesion and looser coupling.

I am wiring up the Flow Control layer this week and am considering the best path forward as it has a much larger configuration surface than the saturation detector.

In previous reviews of the original text based configuration PR, there was a desire to keep all of the default-setting code in one place, to make the overall configuration easier to understand.

There was also a strong desire to separate validation from the setting of defaults.

I would assume the same is basically true for the validation of the configuration. At this time the validation code is inside configloader.go. It could easily be moved into its own source file, with a set of functions each validating a different part of the configuration.

I have moved the validation code into a separate file (validation.go) in the pkg/epp/config/loader directory.
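
For readers following along, the separation described above might look roughly like the sketch below: one function that only fills in defaults, and a set of validation functions that each cover one part of the configuration. The types, function names, and default values are hypothetical and do not reproduce validation.go; the list-of-strings shape for feature gates is assumed from the later commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, trimmed-down configuration types.
type SaturationDetector struct {
	QueueDepthThreshold  int
	KVCacheUtilThreshold float64
}

type Config struct {
	FeatureGates       []string // assumed shape: a list of enabled gate names
	SaturationDetector *SaturationDetector
}

// setDefaults only fills in missing values; it never rejects anything.
// The default numbers are placeholders.
func setDefaults(cfg *Config) {
	if cfg.SaturationDetector == nil {
		cfg.SaturationDetector = &SaturationDetector{}
	}
	sd := cfg.SaturationDetector
	if sd.QueueDepthThreshold == 0 {
		sd.QueueDepthThreshold = 5
	}
	if sd.KVCacheUtilThreshold == 0 {
		sd.KVCacheUtilThreshold = 0.8
	}
}

// validateFeatureGates rejects names that were never registered.
func validateFeatureGates(gates []string, known map[string]struct{}) error {
	for _, gate := range gates {
		if _, ok := known[gate]; !ok {
			return fmt.Errorf("unknown feature gate %q", gate)
		}
	}
	return nil
}

// validateSaturationDetector checks value ranges only.
func validateSaturationDetector(sd *SaturationDetector) error {
	if sd == nil {
		return nil
	}
	if sd.QueueDepthThreshold < 0 {
		return errors.New("queueDepthThreshold must not be negative")
	}
	if sd.KVCacheUtilThreshold < 0 || sd.KVCacheUtilThreshold > 1 {
		return errors.New("kvCacheUtilThreshold must be in [0, 1]")
	}
	return nil
}

// validate runs every per-section check and reports all failures together.
func validate(cfg *Config, known map[string]struct{}) error {
	return errors.Join(
		validateFeatureGates(cfg.FeatureGates, known),
		validateSaturationDetector(cfg.SaturationDetector),
	)
}

func main() {
	known := map[string]struct{}{"dataLayer": {}}
	cfg := &Config{FeatureGates: []string{"dataLayer"}}
	setDefaults(cfg)
	fmt.Println(validate(cfg, known), *cfg.SaturationDetector)
}
```

`errors.Join` (Go 1.20 and later) returns nil when every check passes, so the caller sees either no error or all of the problems at once.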

kfswain · 2025-09-16T22:46:48Z

/ok-to-test

k8s-ci-robot · 2025-10-11T05:08:44Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 28, 2025

k8s-ci-robot requested review from liu-cong and nirrozenbaum August 28, 2025 10:24

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 28, 2025

k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 28, 2025

nirrozenbaum reviewed Aug 28, 2025

LukeAVanDrie reviewed Aug 28, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2025

shmuelk added 8 commits September 15, 2025 15:22

Added feature gates and saturation config to YAML config

25c645a

Signed-off-by: Shmuel Kallner <[email protected]>

Added feature gates and saturation config to EPP config

b454a12

Signed-off-by: Shmuel Kallner <[email protected]>

Updated config loader for feature gates and saturation config

84656ff

Signed-off-by: Shmuel Kallner <[email protected]>

Updated runner to use new config capabilities

ccec6bf

Signed-off-by: Shmuel Kallner <[email protected]>

Added feature flag for data layer

e55bdc5

Signed-off-by: Shmuel Kallner <[email protected]>

Updated tests

659e40f

Signed-off-by: Shmuel Kallner <[email protected]>

Fixed lint issue

9fc75b5

Signed-off-by: Shmuel Kallner <[email protected]>

Updated documentation

1ae9119

Signed-off-by: Shmuel Kallner <[email protected]>

shmuelk force-pushed the extended-config branch from cd0dcdc to 1ae9119 Compare September 15, 2025 12:32

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 16, 2025

shmuelk added 4 commits September 17, 2025 21:23

Changed featureFlags to be an array of strings, similar to K8S

675ba22

Signed-off-by: Shmuel Kallner <[email protected]>

Loader changes due to featureFlags changes

eb2fbc5

Signed-off-by: Shmuel Kallner <[email protected]>

Moved config validation into a separate file

522d88f

Signed-off-by: Shmuel Kallner <[email protected]>

Updates to tests

150c998

Signed-off-by: Shmuel Kallner <[email protected]>

shmuelk added 2 commits September 17, 2025 21:25

Updates to documentation

4d9fc5c

Signed-off-by: Shmuel Kallner <[email protected]>

Added copyright

5b9a70e

Signed-off-by: Shmuel Kallner <[email protected]>

shmuelk mentioned this pull request Sep 25, 2025

REQUEST: New membership for shmuelk kubernetes/org#5864

Closed

11 tasks

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 11, 2025

5 participants