-
Notifications
You must be signed in to change notification settings - Fork 76
introduce a general purpose pod scraper and add models scraper #783
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: nirrozenbaum The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling" | ||
runserver "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/server" | ||
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" | ||
) | ||
|
||
const ( | ||
defaultMetricsEndpoint = "/metrics" | ||
modelsScrapeInterval = 30 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if we are moving these already, should we make them configurable? That's okay if we want to hold off on doing that, we already have a breathtaking amount of config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it should go to config. if there is no objection I'll do in a follow up PR.
prefer to finalize this PR and get it merged, otherwise I'll hit rebase issues every day (since it touches many files).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG
@@ -38,16 +39,20 @@ import ( | |||
"sigs.k8s.io/controller-runtime/pkg/manager" | |||
"sigs.k8s.io/controller-runtime/pkg/metrics/filters" | |||
"sigs.k8s.io/gateway-api-inference-extension/internal/runnable" | |||
backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" | |||
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should refer to this package as metricscollection
or metricsingress
to make clear the difference between the collector package and the metric emission/egress package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. this is a bit confusing.
I left the namings as it was, but I agree.
I'm suggesting to postpone the a follow up due to the PR size and the large number of files it touches.
var errs error | ||
updated := existing.Clone() | ||
updated := s.existingMetrics.Clone().(*Metrics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to type infer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this has a generic return type, I assume b/c of the way the interface was defined. I wonder if we can convert the clone func to a generic. It seems odd to have to type infer on a func calledClone
as that implies it wasn't quite a clone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, wrote in other comment as well.
still not 100% sure about the ScrapeResult which needs to go through another iteration for sure.
just wanted to make progress with this PR.
@@ -29,7 +30,8 @@ func (ss *KVCacheScorer) Name() string { | |||
func (ss *KVCacheScorer) Score(ctx *types.SchedulingContext, pods []types.Pod) map[types.Pod]float64 { | |||
scores := make(map[types.Pod]float64, len(pods)) | |||
for _, pod := range pods { | |||
scores[pod] = 1 - pod.GetMetrics().KVCacheUsagePercent | |||
podMetrics := pod.GetData()[metrics.MetricsDataKey].(*metrics.Metrics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this seems like the primary way to access data; GetData()[metrics.MetricsDataKey].(*metrics.Metrics)
seems like a code smell to me. Non-blocking, we can noodle on this overtime, but something seems amiss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, agreed.
this needs to be better defined.
I'm still not 100% about the ScrapeResult which is used as general purpose return value from Scrape.
eventually when accessing the data, we need to know the type for accessing the specific fields.
I'm still trying to think about a better way, but considering the fact that most part are generalized, I wanted to make progress and try to finalize this in next iteration of the pod scrapers.
@kfswain got an excellent pointer from Abdullah in the Prefix scorer PR (was send in a different context), which seems very related to this PR. Kube-scheduler has an interface called the reason for using an abstract map is the ability to be pluggable & extensible. this is kind of identical to the ScrapeResult interface with the Clone function. Pay attention that everywhere where the CycleState is used is doing a type infer, cause it's a general purpose parameters map. similarly, in this PR, ScrapeResult is a general purpose result of scrape, and when one wants to use it, there is a need for type infer. few usage examples in kube-scheduler plugins (with type infer):
LMKWYT. |
Yeah, im totally with you there, I'm just suggesting we use a generic to genericize the type infer and hide it from usage as much as we can. See my play example here: https://go.dev/play/p/FAE2ZIH_xRt |
aa301d2
to
0be970f
Compare
Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
a559f7b
to
bd8fa4c
Compare
includes adapting the changes in all tests Signed-off-by: Nir Rozenbaum <[email protected]>
5c04425
to
77bc048
Compare
the PR is now ready for final review after fixing all tests and finalizing the structures.
updated these points in the PR description as well. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
For easier and simpler review, will break this PR into a series of smaller PRs. |
This PR lays the ground for pod scraper extensibility (part of the data layer extensibility).
it's still not pluggable or extensible, but the PR shows a clear way for doing that (see in main.go the initialization of the PodScraperFactory).
the PR converts the existing metrics scraper to the new format and adds models scraper for
/v1/models
according to OpenAI models API schema.pay attention that the intention in the models scraper is NOT multiple base models.
it is intended for a single base model, BUT with a different set of LoRA adapters on different pods.
still WIP, but pushing a PR to start collecting feedback on the general direction.
EDIT:
the PR is now ready for final review after fixing all tests and finalizing the structures.
two non blocking points that will be handled in a follow up PR: