shmuelk
Contributor

@shmuelk shmuelk commented Aug 28, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:
Eliminates the use of environment variables to enable the new Data Layer and configure the SaturationDetector.

Using environment variables for this kind of configuration does not scale. It is also much easier to maintain multiple configurations, if needed, as simple YAML text.

Which issue(s) this PR fixes:
Fixes #1481

Does this PR introduce a user-facing change?:

- The saturation detector is no longer configured via the environment variables `SD_QUEUE_DEPTH_THRESHOLD`, `SD_KV_CACHE_UTIL_THRESHOLD`, and `SD_METRICS_STALENESS_THRESHOLD`. Instead, it is configured via the `saturationDetector` section of the text-based configuration. See the text-based configuration guide for more details.
- The experimental Data Layer feature is no longer enabled via the `ENABLE_EXPERIMENTAL_DATALAYER_V2` environment variable. Instead, it is enabled in the `featureGates` section of the text-based configuration. See the text-based configuration guide for more details.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 28, 2025

netlify bot commented Aug 28, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit cd0dcdc
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68b08a91669eba0008b4d5ec
😎 Deploy Preview https://deploy-preview-1492--gateway-api-inference-extension.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shmuelk
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 28, 2025
@k8s-ci-robot
Contributor

Hi @shmuelk. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 28, 2025
// +optional
// SaturationDetector when present specifies the configuration of the
// Saturation detector. If not present, default values are used.
SaturationDetector *SaturationDetector `json:"saturationDetector,omitempty"`
Contributor

I think this is too specific and should be generalized.
We've talked in the past about having a parameters section in the config API that multiple structs in the code can consume. For example, I might want to consume the metricsStalenessThreshold in more places.
This approach is not future-proof: we might need to keep adding fields here instead of just generalizing.

Contributor Author

Having parts of the code "just know" which parameters to look for in a pool of parameters is as error-prone as environment variables. A typo will cause a parameter to be ignored, and presumably a default value will be used instead. I believe scenarios like this will be hard to debug.

The sections need to be explicit.
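
A minimal sketch of the difference, for illustration only. It uses JSON and `encoding/json` as a stand-in for the YAML loader. The `int` field type, the `lookup` and `decodeStrict` helpers, and the fallback value of 5 are assumptions, not code from this PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// SaturationDetector is an explicit, typed section. The field name follows
// this PR's discussion; the int type is an assumption.
type SaturationDetector struct {
	QueueDepthThreshold int `json:"queueDepthThreshold,omitempty"`
}

// lookup mimics a consumer that "just knows" its key in a shared parameter
// pool and silently falls back to a default when the key is absent.
func lookup(pool map[string]int, key string, def int) int {
	if v, ok := pool[key]; ok {
		return v
	}
	return def
}

// decodeStrict decodes an explicit section and rejects unknown field names.
func decodeStrict(data []byte) (SaturationDetector, error) {
	var sd SaturationDetector
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&sd)
	return sd, err
}

func main() {
	// Note the typo in both inputs: "queueDepthTreshold".
	pool := map[string]int{"queueDepthTreshold": 10}
	// The typo goes unnoticed and the fallback value is used.
	fmt.Println(lookup(pool, "queueDepthThreshold", 5))

	// With an explicit section, the same typo is reported as an error.
	_, err := decodeStrict([]byte(`{"queueDepthTreshold": 10}`))
	fmt.Println(err)
}
```

With the parameter pool, the misspelled key is simply never read. With the typed section, strict decoding fails at load time and names the offending field.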

Whether to use constants, as I did, or references to a parameter is a different issue. References do, however, make for a very verbose configuration and need more validation code, unless you do something like:

saturationDetector:
   queueDepthThreshold:
     value: 5
   kvCacheUtilThreshold:
     parameterRef: plover
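
One way the `value` / `parameterRef` shape above might be resolved and validated, as a rough sketch. The `IntOrRef` type, its `Resolve` method, the use of `int` for both thresholds, and the value 80 for `plover` are all hypothetical and not part of this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// IntOrRef holds either a literal value or a reference to a named parameter.
// The type and its field names are hypothetical.
type IntOrRef struct {
	Value        *int   `json:"value,omitempty"`
	ParameterRef string `json:"parameterRef,omitempty"`
}

// Resolve returns the literal value, or looks the reference up in params.
// It fails if neither or both are set, or if the reference is unknown.
func (v IntOrRef) Resolve(params map[string]int) (int, error) {
	hasValue, hasRef := v.Value != nil, v.ParameterRef != ""
	switch {
	case hasValue == hasRef:
		return 0, errors.New("exactly one of value or parameterRef must be set")
	case hasValue:
		return *v.Value, nil
	}
	p, ok := params[v.ParameterRef]
	if !ok {
		return 0, fmt.Errorf("unknown parameter %q", v.ParameterRef)
	}
	return p, nil
}

func main() {
	five := 5
	params := map[string]int{"plover": 80} // hypothetical shared parameters

	fmt.Println(IntOrRef{Value: &five}.Resolve(params))           // literal
	fmt.Println(IntOrRef{ParameterRef: "plover"}.Resolve(params)) // reference
	fmt.Println(IntOrRef{ParameterRef: "plovr"}.Resolve(params))  // misspelled reference fails
}
```

The extra validation is the "exactly one of" check plus the lookup failure, which is the additional code the comment above alludes to.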

Comment on lines 135 to 147
type FeatureGates struct {
	// +optional
	// EnableDataLayer, if present and true, enables the experimental Datalayer APIs.
	// The Datalayer APIs will be disabled if EnableDataLayer is not present in the
	// configuration or its value is false.
	EnableDataLayer bool `json:"enableDataLayer,omitempty"`

	// +optional
	// EnableFlowControl, if present and true, enables the experimental FlowControl feature.
	// The FlowControl feature will be disabled if EnableFlowControl is not present in the
	// configuration or its value is false.
	EnableFlowControl bool `json:"enableFlowControl,omitempty"`
}
Contributor

Can this be generalized to a map[string]bool instead of two boolean fields? What if I have another feature tomorrow? Would we need to update the config API code for every new feature?
What if I'd like to use the same config API to enable a feature in llm-d that is not part of the GIE?

Contributor Author

The disadvantage of a map, and of environment variables for that matter, is that it is difficult to validate. A silly typo causes all sorts of problems.

Validating the keys will not address your wish to enable experimental features in llm-d.

Contributor Author

@shmuelk shmuelk Aug 28, 2025

Having thought about this some more, what about the following:

  • FeatureGates will be a map[string]bool.
  • There will be an API to register feature gate keys. The EPP will use this to register its own feature gate keys, and llm-d can use it as well.
  • The map will be validated. Any key that is in the FeatureGates map but not registered will cause validation of the configuration to fail.
  • The EPP config will have a FeatureGates map that contains a value for every registered key. Each value will be initialized to false and then overlaid with the value from the text configuration's FeatureGates field.
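
The four points above could look roughly like the following sketch. The `Registry` type, its `Register` and `Load` methods, and the key strings (borrowed from the JSON tags of the struct under review, plus a made-up llm-d key) are hypothetical and not the API that was pushed.

```go
package main

import (
	"fmt"
	"sort"
)

// Registry holds the set of known feature gate keys. Hypothetical type.
type Registry struct{ known map[string]struct{} }

func NewRegistry() *Registry { return &Registry{known: map[string]struct{}{}} }

// Register adds a feature gate key. The EPP and llm-d would both call this.
func (r *Registry) Register(key string) { r.known[key] = struct{}{} }

// Load validates the configured gates against the registered keys and returns
// a map with an entry for every registered key: false by default, overlaid
// with the configured values.
func (r *Registry) Load(configured map[string]bool) (map[string]bool, error) {
	var unknown []string
	for k := range configured {
		if _, ok := r.known[k]; !ok {
			unknown = append(unknown, k)
		}
	}
	if len(unknown) > 0 {
		sort.Strings(unknown) // stable error message
		return nil, fmt.Errorf("unregistered feature gates: %v", unknown)
	}
	gates := make(map[string]bool, len(r.known))
	for k := range r.known {
		gates[k] = configured[k] // a missing key yields false
	}
	return gates, nil
}

func main() {
	r := NewRegistry()
	r.Register("enableDataLayer")
	r.Register("enableFlowControl")
	r.Register("llmdExperiment") // hypothetical key registered by llm-d

	gates, err := r.Load(map[string]bool{"enableDataLayer": true})
	fmt.Println(gates, err)

	// A misspelled key fails validation instead of being silently ignored.
	_, err = r.Load(map[string]bool{"enableDatalayer": true})
	fmt.Println(err)
}
```

Because every registered key is always present in the result, consumers can read a gate without checking whether it was configured.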

@nirrozenbaum ?

Contributor Author

I have pushed changes that implement the approach described above.

return featureConfig
}

func loadSaturationDetectorConfig(sd *configapi.SaturationDetector) saturationdetector.Config {
Contributor

@LukeAVanDrie LukeAVanDrie Aug 28, 2025

nit: As we expand the configuration surface for more components, does it make sense to include all validation and defaulting logic in loader/...? Or is it preferable to delegate to the local config.go files for each component? E.g., saturationdetector/config.go could export:

// ValidateAndApplyDefaults checks the provided configuration for validity and then mutates the receiver to populate any
// empty fields with system defaults.
func (c *Config) ValidateAndApplyDefaults() error { ... }

If I were a developer working on an individual component, I think the latter would be easier to trace as it keeps the default values and validation logic scoped locally to the component. It also keeps the single entry point loader/... as a pure delegator. Overall, I think this approach has tighter cohesion and looser coupling.

I am wiring up the Flow Control layer this week and am considering the best path forward as it has a much larger configuration surface than the saturation detector.
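
A rough sketch of what that delegation might look like. The field names follow the thresholds mentioned in this PR, but the field types, the default values, the "zero means unset" convention, and the `loadConfig` helper are placeholders chosen for illustration, not the project's actual code or defaults.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Placeholder defaults, chosen only for this example.
const (
	defaultQueueDepthThreshold       = 5
	defaultKVCacheUtilThreshold      = 0.8
	defaultMetricsStalenessThreshold = 200 * time.Millisecond
)

// Config would live in the component's own config.go. Types are assumptions.
type Config struct {
	QueueDepthThreshold       int
	KVCacheUtilThreshold      float64
	MetricsStalenessThreshold time.Duration
}

// ValidateAndApplyDefaults checks the configuration for validity and then
// populates any empty (zero) fields with defaults.
func (c *Config) ValidateAndApplyDefaults() error {
	if c.QueueDepthThreshold < 0 {
		return errors.New("queueDepthThreshold must not be negative")
	}
	if c.KVCacheUtilThreshold < 0 || c.KVCacheUtilThreshold > 1 {
		return errors.New("kvCacheUtilThreshold must be between 0 and 1")
	}
	if c.MetricsStalenessThreshold < 0 {
		return errors.New("metricsStalenessThreshold must not be negative")
	}
	if c.QueueDepthThreshold == 0 {
		c.QueueDepthThreshold = defaultQueueDepthThreshold
	}
	if c.KVCacheUtilThreshold == 0 {
		c.KVCacheUtilThreshold = defaultKVCacheUtilThreshold
	}
	if c.MetricsStalenessThreshold == 0 {
		c.MetricsStalenessThreshold = defaultMetricsStalenessThreshold
	}
	return nil
}

// loadConfig stands in for the loader, which only delegates to the component.
func loadConfig(c Config) (Config, error) {
	if err := c.ValidateAndApplyDefaults(); err != nil {
		return Config{}, fmt.Errorf("saturationDetector: %w", err)
	}
	return c, nil
}

func main() {
	cfg, err := loadConfig(Config{QueueDepthThreshold: 10})
	fmt.Println(cfg, err)

	_, err = loadConfig(Config{KVCacheUtilThreshold: 1.5})
	fmt.Println(err)
}
```

One caveat of this sketch: treating a zero value as "unset" means a user cannot explicitly configure zero. Pointer fields or a separate "present" flag would be needed if zero is a meaningful setting.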

Development

Successfully merging this pull request may close these issues.

feat: Eliminate use of environment variables to configure the EPP
4 participants