-
Notifications
You must be signed in to change notification settings - Fork 207
refactor: refactor monitoring session #1906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
refactor: refactor monitoring session #1906
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
liu-cong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @JeffLuoo Did we decide to deprecate monitoring.gke.enabled and use provider:gke to enable GKE monitoring?
@capri-xiyue Note we set monitoring.gke in llm-d user guides today, but I assume deprecating it just means monitoring.gke becomes a noop right?
Yes, there is already a deprecation note. We need to update the guides in GKE, the previous release now supports both, so we can safely migrate the guide. |
|
@capri-xiyue can you pls address the comments so we can move forward with this PR? |
@ahg-g I have returned for a brief three-day window before taking leave again until late December. I will prioritize moving this PR forward immediately. |
Yes, my understanding of it is that |
|
@JeffLuoo Can you help review this PR to see it makes sense or not? |
|
/retest |
@ahg-g I've addressed all the comments |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: capri-xiyue, JeffLuoo The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| enabled: true # log all requests by default | ||
| --- | ||
| {{- if or .Values.inferenceExtension.monitoring.gke.enabled (and .Values.inferenceExtension.monitoring.prometheus.enabled .Values.inferenceExtension.monitoring.prometheus.auth.enabled) }} | ||
| {{- if and .Values.inferenceExtension.monitoring.prometheus.enabled .Values.inferenceExtension.monitoring.prometheus.auth.enabled }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the prometheus enabled is by default set to false in the values.yaml: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/v1.2.0-rc.1/config/charts/inferencepool/values.yaml#L55-L56
This implies that the monitoring stack on GKE won't have all required objects like roles binding by default if inferenceExtension.monitoring.prometheus.enabled is set to false with this change.
Shall we use inferenceExtension.monitoring.prometheus.enabled to control the enablement of monitoring stacks regardless of the provider? If so, we need to update the mentioning of inferenceExtension.monitoring.gke.enable=true, e.g. https://github.com/search?q=repo%3Allm-d%2Fllm-d%20inferenceExtension.monitoring.gke.enable%3Dtrue&type=code
wdyt? @ahg-g
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the prometheus enabled is by default set to false in the values.yaml: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/v1.2.0-rc.1/config/charts/inferencepool/values.yaml#L55-L56
This implies that the monitoring stack on GKE won't have all required objects like roles binding by default if
inferenceExtension.monitoring.prometheus.enabledis set to false with this change.
Can you elaborate here? the existing gke.enable flag that we are removing in this PR is not controlling that either, right?
Shall we use
inferenceExtension.monitoring.prometheus.enabledto control the enablement of monitoring stacks regardless of the provider? If so, we need to update the mentioning ofinferenceExtension.monitoring.gke.enable=true, e.g. https://github.com/search?q=repo%3Allm-d%2Fllm-d%20inferenceExtension.monitoring.gke.enable%3Dtrue&type=codewdyt? @ahg-g
I agree as long as all what this flag enables is indeed prometheus-related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. EPP uses Prometheus as the metric source behind the metrics endpoint.
The PR LGTM then, we just need to make sure we update all references later to include inferenceExtension.monitoring.prometheus.enabled in addition to setting the provider to GKE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more place to update -
| --set inferenceExtension.monitoring.gke.enabled=true \ |
With this change, we switch to use inferenceExtension.monitoring.prometheus.enabled so mentioning of inferenceExtension.monitoring.gke.enabled needs to be updated as well. This is the only mentioning I found in EPP repo: https://github.com/search?q=repo%3Akubernetes-sigs%2Fgateway-api-inference-extension+monitoring.gke&type=code
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
gke needs to be removed in monitoring session
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: