Skip to content

introduce a general purpose pod scraper and add models scraper #783

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

nirrozenbaum
Copy link
Contributor

@nirrozenbaum nirrozenbaum commented May 5, 2025

This PR lays the ground for pod scraper extensibility (part of the data layer extensibility).
it's still not pluggable or extensible, but the PR shows a clear way for doing that (see in main.go the initialization of the PodScraperFactory).
the PR converts the existing metrics scraper to the new format and adds models scraper for /v1/models according to OpenAI models API schema.

pay attention that the intention in the models scraper is NOT multiple base models.
it is intended for a single base model, BUT with a different set of LoRA adapters on different pods.

still WIP, but pushing a PR to start collecting feedback on the general direction.

EDIT:
the PR is now ready for final review after fixing all tests and finalizing the structures.
two non blocking points that will be handled in a follow up PR:

  • pod scraper should not start a go routine per scraper. all scrapers should consolidate to a single go routine per pod.
  • when reading ScrapedData using the GetData() function from PodInfo, current code requires type infer. need to think if this can be cleaner.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 5, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and robscott May 5, 2025 12:36
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nirrozenbaum
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 5, 2025
Copy link

netlify bot commented May 5, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 77bc048
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/681b2ef7a7384200086913df
😎 Deploy Preview https://deploy-preview-783--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2025
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling"
runserver "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/server"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
)

const (
defaultMetricsEndpoint = "/metrics"
modelsScrapeInterval = 30 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we are moving these already, should we make them configurable? That's okay if we want to hold off on doing that, we already have a breathtaking amount of config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it should go to config. if there is no objection I'll do in a follow up PR.
prefer to finalize this PR and get it merged, otherwise I'll hit rebase issues every day (since it touches many files).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG

@@ -38,16 +39,20 @@ import (
"sigs.k8s.io/controller-runtime/pkg/manager"
"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
"sigs.k8s.io/gateway-api-inference-extension/internal/runnable"
backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should refer to this package as metricscollection or metricsingress to make clear the difference between the collector package and the metric emission/egress package

Copy link
Contributor Author

@nirrozenbaum nirrozenbaum May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. this is a bit confusing.
I left the namings as it was, but I agree.
I'm suggesting to postpone the a follow up due to the PR size and the large number of files it touches.

var errs error
updated := existing.Clone()
updated := s.existingMetrics.Clone().(*Metrics)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to type infer?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this has a generic return type, I assume b/c of the way the interface was defined. I wonder if we can convert the clone func to a generic. It seems odd to have to type infer on a func calledClone as that implies it wasn't quite a clone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, wrote in other comment as well.
still not 100% sure about the ScrapeResult which needs to go through another iteration for sure.
just wanted to make progress with this PR.

@@ -29,7 +30,8 @@ func (ss *KVCacheScorer) Name() string {
func (ss *KVCacheScorer) Score(ctx *types.SchedulingContext, pods []types.Pod) map[types.Pod]float64 {
scores := make(map[types.Pod]float64, len(pods))
for _, pod := range pods {
scores[pod] = 1 - pod.GetMetrics().KVCacheUsagePercent
podMetrics := pod.GetData()[metrics.MetricsDataKey].(*metrics.Metrics)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this seems like the primary way to access data; GetData()[metrics.MetricsDataKey].(*metrics.Metrics) seems like a code smell to me. Non-blocking, we can noodle on this overtime, but something seems amiss

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agreed.
this needs to be better defined.
I'm still not 100% about the ScrapeResult which is used as general purpose return value from Scrape.
eventually when accessing the data, we need to know the type for accessing the specific fields.

I'm still trying to think about a better way, but considering the fact that most part are generalized, I wanted to make progress and try to finalize this in next iteration of the pod scrapers.

@nirrozenbaum
Copy link
Contributor Author

nirrozenbaum commented May 6, 2025

@kfswain got an excellent pointer from Abdullah in the Prefix scorer PR (was send in a different context), which seems very related to this PR.

Kube-scheduler has an interface called CycleState that serves as an abstract parameters map that is passed to plugins during the scheduling cycle.
See CycleState concept in kube-scheduler: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/cycle_state.go#L32 and https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/interface.go#L556

the reason for using an abstract map is the ability to be pluggable & extensible.

this is kind of identical to the ScrapeResult interface with the Clone function. Pay attention that everywhere where the CycleState is used is doing a type infer, cause it's a general purpose parameters map. similarly, in this PR, ScrapeResult is a general purpose result of scrape, and when one wants to use it, there is a need for type infer.

few usage examples in kube-scheduler plugins (with type infer):

LMKWYT.

@kfswain
Copy link
Collaborator

kfswain commented May 6, 2025

@kfswain got an excellent pointer from Abdullah in the Prefix scorer PR (was send in a different context), which seems very related to this PR.

Kube-scheduler has an interface called CycleState that serves as an abstract parameters map that is passed to plugins during the scheduling cycle. See CycleState concept in kube-scheduler: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/cycle_state.go#L32 and https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/interface.go#L556

the reason for using an abstract map is the ability to be pluggable & extensible.

this is kind of identical to the ScrapeResult interface with the Clone function. Pay attention that everywhere where the CycleState is used is doing a type infer, cause it's a general purpose parameters map. similarly, in this PR, ScrapeResult is a general purpose result of scrape, and when one wants to use it, there is a need for type infer.

few usage examples in kube-scheduler plugins (with type infer):

LMKWYT.

Yeah, im totally with you there, I'm just suggesting we use a generic to genericize the type infer and hide it from usage as much as we can. See my play example here: https://go.dev/play/p/FAE2ZIH_xRt

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2025
includes adapting the changes in all tests

Signed-off-by: Nir Rozenbaum <[email protected]>
@nirrozenbaum nirrozenbaum changed the title [WIP] introducing a general purpose pod scraper and implement models scraper introduce a general purpose pod scraper and add models scraper May 7, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2025
@nirrozenbaum
Copy link
Contributor Author

the PR is now ready for final review after fixing all tests and finalizing the structures.
two non blocking points that will be handled in a follow up PR:

  • pod scraper should not start a go routine per scraper. all scrapers should consolidate to a single go routine per pod.
  • when reading ScrapedData using the GetData() function from PodInfo, current code requires type infer. need to think if this can be cleaner.

updated these points in the PR description as well.

@nirrozenbaum nirrozenbaum requested a review from kfswain May 7, 2025 10:43
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2025
@nirrozenbaum
Copy link
Contributor Author

For easier and simpler review, will break this PR into a series of smaller PRs.
can be closed for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants