KEP-897: Propose centralized experiment tracking in Kubeflow #892
Conversation
> This proposal introduces an optional/configurable authorization mode for Model Registry that leverages Kubernetes
> subject access review without adding custom resource definitions or modifying the existing API concepts. Instead, it
> maps Kubernetes RBAC concepts to the existing REST API entities at the namespace level.
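For illustration only, a minimal sketch of the kind of namespace-scoped SubjectAccessReview check this implies, using the Kubernetes Python client. The resource the REST entities are mapped to ("services" below) and the user/namespace values are assumptions for the sketch, not something the KEP prescribes.

```python
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside a pod
authz = client.AuthorizationV1Api()

# Ask the Kubernetes API server whether a given user may "get" a resource in
# the namespace that the Model Registry REST entity is mapped to.
review = client.V1SubjectAccessReview(
    spec=client.V1SubjectAccessReviewSpec(
        user="user@example.com",
        resource_attributes=client.V1ResourceAttributes(
            namespace="kubeflow-user-example",
            verb="get",
            resource="services",
        ),
    )
)
allowed = authz.create_subject_access_review(review).status.allowed
```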
I'm supportive of implementation solution(s) that do not force Model Registry to mix multiple cross-cutting concerns in its code, i.e. "business logic" (data transformation and queries) and "authorization" (tenancy) 👍
It is a general enough driving principle and best practice, but I thought I'd add it here as well :)
Yes, we need to keep multi-tenancy.
cc WGs to review
@kubeflow/wg-pipeline-leads @kubeflow/wg-data-leads @kubeflow/wg-automl-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads
> This proposal aims to resolve the current fragmented and limited experiment tracking experience by expanding the
> **Kubeflow Model Registry** into a unified, centralized metadata store. Currently, experiment tracking is scattered
> across components like Kubeflow Pipelines (which requires pipeline execution for tracking) and Katib (limited to
I am interested in whether we should define the concept of a Kubeflow Experiment outside of KFP and Katib,
e.g. an Experiment that sits on top of TrainJob, Katib Jobs, Pipelines, and Spark Jobs.
I think we are on the same page. The overall goal is that an experiment in Model Registry could contain a mix of experiment runs from Pipelines, TrainJob, the SDK, etc. Each run would indicate the source it came from, but experiments are not exclusive to a particular source.
When Pipelines is integrated in this mode, it wouldn't have its own experiments in its database like today. Instead it would pull the list of available ones from Model Registry. We'd still keep the existing Pipelines experiments concept for standalone installations of Pipelines though.
> The proposal tackles these issues by:
>
> - **Expanding Model Registry** into a central experiment tracking store for experiments, runs, metrics, and artifacts
Would it be in scope for Model Registry, or would we need another Kubeflow project for it?
If we decide to use Model Registry for it, we might want to find another name for this project.
cc @kubeflow/wg-data-leads
+1, in some regards we are seeing the need for an "AI Asset Registry", but I'm also in favour of simply "Kubeflow Registry".
I would stick with Model Registry.
I like the idea of Kubeflow Registry, since it makes it very clear to users that this project is designed for metadata storage, not only for model artifacts. Alternatively, we could name it Kubeflow Tracker/Tracking, or find a better name.
I think the plan is that Kubeflow Model Registry is MLflow-compatible, so we can then use the MLflow SDK for Pipelines, Trainer, etc. as well and get rid of ml-metadata. CC @franciscojavierarceo
BTW, the name Model Registry is still valid since experiment tracking is just part of the Model lifecycle and needs to be stored in Model Registry to be able to track model lineage.
I think we should align on what an Experiment means in the AI lifecycle.
For example, if we consider any type of MLOps activity to be an Experiment (e.g. data preparation, training, HPOs, evals), we might need to find a better name.
We could also consider a rename such as "Kubeflow Registry" (or anything else we converge on) after this work is completed; given that more capabilities are being added or will be added (like the catalog, experiments, etc.), we would then have more data and could make a better-informed decision.
Wearing my other hat: with our present focus on Graduation, making a name change at this time might also be strategically challenging.
cc @mprahl @andreyvelich wdyt?
I'm fine with postponing the renaming conversation. We can clarify through documentation and the UI as part of the delivery.
Sure, we can rename it after the CNCF graduation, and include it as a Northstar in this KEP.
I just would like us to align on the future project scope and goals.
> - **Expanding Model Registry** into a central experiment tracking store for experiments, runs, metrics, and artifacts
>   across all Kubeflow components.
> - Providing **MLFlow SDK compatibility**, enabling users to leverage familiar APIs while storing data in Kubeflow.
Why MLflow and not Weights & Biases (https://wandb.ai/site/)?
I find that these days more and more organizations are adopting it.
@andreyvelich I do really like Weights & Biases, but MLflow has a plugin architecture that allows the SDK to use a different backend (Model Registry in this case). So we could add integrations with Weights & Biases (e.g. export experiments and metrics to it from KFP), but we could not easily reuse their SDK, which is the reason MLflow is proposed here.
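As an illustration of that plugin mechanism, here is a minimal sketch of how a tracking-store plugin could be registered so that the MLflow SDK routes tracking calls to a different backend. The package, module, class name, and URI scheme below are hypothetical placeholders, not the actual implementation.

```python
# setup.py of a hypothetical "mlflow-model-registry" plugin package.
# MLflow discovers tracking-store plugins through the "mlflow.tracking_store"
# entry-point group and dispatches to a plugin based on the tracking URI
# scheme, e.g. mlflow.set_tracking_uri("model-registry://model-registry.kubeflow:8080").
from setuptools import setup

setup(
    name="mlflow-model-registry",
    packages=["mlflow_model_registry"],
    install_requires=["mlflow"],
    entry_points={
        "mlflow.tracking_store": [
            "model-registry=mlflow_model_registry.store:ModelRegistryStore",
        ],
    },
)
```

The `ModelRegistryStore` class would subclass `mlflow.store.tracking.abstract_store.AbstractStore` and translate calls such as `create_experiment()`, `create_run()`, and `log_metric()` into requests against the experiment-tracking REST API proposed for Model Registry.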
What are the benefits of using the MLflow SDK if we will eventually implement a tracking API in the Kubeflow SDK?
MLFlow is the dominant player in the market. https://clickpy.clickhouse.com/dashboard/mlflow
FWIW Weights and Biases is a close second but MLFlow has dominated the market for longer. https://clickpy.clickhouse.com/dashboard/wandb
@andreyvelich the main benefits of the MLflow plugin are that the Kubeflow SDK could choose to wrap the MLflow SDK to support the powerful autolog feature, and that existing workflows leveraging the MLflow SDK would continue to work with minimal changes when pointed at a Model Registry API with experiment tracking.
I see, that makes sense!
@andreyvelich Model Registry isn't CRD based and they already added some of the APIs mentioned in this KEP here: kubeflow/model-registry#1318.
I am trying to understand how we will leverage the MLflow autolog feature in the context of runs.
I understand that Model Registry has the ability to orchestrate Experiments and Runs that are not CRD-based, but in the end users should be able to create Kubeflow CRDs (e.g. TrainJob, SparkJob, Workflow) within their Experiments.
@andreyvelich technically MLflow had 1 million more downloads last month than Weights & Biases.
Also, I believe both proofs of concept were done with MLflow, and I believe we intend to eventually try to be compatible with both.
@andreyvelich the autologging happens inside your training code that runs within the job containers, not at the orchestration level.
So for example you create a TrainJob CRD, then Kubeflow starts containers based on that CRD and inside those containers your training code runs with MLflow autologging enabled, which captures metrics and parameters and sends them to the Model Registry backend.
The CRDs orchestrate where/how the code runs and autologging just captures what happens during execution, so they're separate layers that work together. And users don't need to modify their CRDs at all, experiment tracking is just another library in their training code.
BTW, this is similar to the demo I shared earlier with MLflow SDK in Kubeflow SDK.
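To make this concrete, a minimal sketch of what the training code inside such a container could look like, assuming the tracking URI is injected via the standard `MLFLOW_TRACKING_URI` environment variable (the URI scheme and experiment name below are placeholders):

```python
# train.py — runs inside the containers created for the TrainJob CRD.
# Experiment tracking is just another library in the training code; the CRD
# itself does not change.
import os

import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumed to be injected into the pod spec, e.g.
# MLFLOW_TRACKING_URI=model-registry://model-registry-service.kubeflow:8080
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "file:./mlruns"))
mlflow.set_experiment("my-experiment")

# Autologging captures parameters, metrics, and model artifacts emitted by
# supported frameworks during execution and sends them to the backend.
mlflow.autolog()

X, y = load_iris(return_X_y=True)
with mlflow.start_run(run_name="trainjob-run"):
    RandomForestClassifier(n_estimators=100).fit(X, y)
```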
I see, that looks great! I guess, in the case of distributed training (when we execute the train function across multiple nodes), we will enable autolog() only for the MASTER NODE, like this:

```python
if dist.get_rank() == 0:
    mlflow.autolog()
```
Exactly, but we still need to disable it on the other nodes, so we can just do it this way:

```python
rank = dist.get_rank()
mlflow.autolog(disable=(rank != 0))
```
> Kubeflow components without forcing everything through pipelines. This restriction often drives users to seek
> solutions outside the Kubeflow ecosystem.
> 1. **Fragmented experience**: Users must navigate multiple interfaces to correlate run results and evaluation data,
>    depending on which Kubeflow component they use (e.g., Pipeline runs, Katib experiments, Training Operator results).
Suggested change:
- depending on which Kubeflow component they use (e.g., Pipeline runs, Katib experiments, Training Operator results).
+ depending on which Kubeflow project they use (e.g., Pipeline runs, Katib experiments, TrainJob results).
> ### Non-Goals
>
> 1. **Decoupling Katib's experiments from their current implementation**: While the Kubeflow community can revisit
I think the concept of a Katib Experiment and a Pipeline is different.
In Katib, an Experiment is just a CRD which defines an HP tuning job: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/hp-tuning/random.yaml
Katib maintains its own database just to allow the metrics collector to push metrics (e.g. accuracy, loss): https://www.kubeflow.org/docs/components/katib/reference/architecture/#katib-control-plane-components.
It is much more lightweight compared to KFP.
@andreyvelich it would require more thought of how to integrate it, but at a high-level, the domain models seem to map well:
- Katib experiment -> Model Registry experiment
- Katib trial -> Model Registry run
- Katib trial parameters -> Model Registry run parameters
- Katib trial metrics -> Model Registry run metrics
So I think Katib's APIs could stay the same but just also create/reuse an experiment in Model Registry and then create a "run" in the Model Registry experiment per executed "trial".
I think for the purpose of this KEP, aligning on it being theoretically possible for Katib to export this metadata to the domain models proposed for Model Registry would be enough. The Katib maintainers/community can then decide if they'd like to integrate after the Model Registry implementation.
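Purely to illustrate the mapping above, a hypothetical sketch of how a Katib controller or metrics collector could export a finished trial, assuming the MLflow SDK is pointed at the Model Registry tracking backend (the experiment/trial names and values are made up):

```python
import mlflow

# Assumed: MLFLOW_TRACKING_URI points at the Model Registry tracking backend.
# Katib Experiment -> Model Registry experiment (created or reused by name).
mlflow.set_experiment("random-hp-tuning")

# Katib Trial -> Model Registry run; trial parameters and observed metrics
# map to run parameters and run metrics.
trial = {
    "name": "random-hp-tuning-abc123",
    "parameters": {"lr": 0.01, "num_layers": 3},
    "metrics": {"accuracy": 0.93, "loss": 0.21},
}

with mlflow.start_run(run_name=trial["name"]):
    mlflow.log_params(trial["parameters"])
    mlflow.log_metrics(trial["metrics"])
```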
I think the concept of a Katib Experiment is misleading. A Katib Experiment is a batch job (e.g. OptimizeJob) that has its own scheduler (e.g. Suggestion) which decides whether it needs to create more runs (e.g. Trials).
Ideally, we should refactor the Katib API towards OptimizeJob, where we have a clear definition of a hyperparameter optimization job.
@kubeflow/wg-training-leads @astefanutti @franciscojavierarceo Any thoughts?
Before the Experiment API, we had the StudyJob concept, which fully aligned with the original Google Vizier paper.
Ref: https://github.com/google/vizier
> - Enable users to leverage MLFlow's familiar APIs while storing data in centralized Kubeflow infrastructure
> - Support automatic logging for popular ML frameworks (TensorFlow, PyTorch, scikit-learn, etc.) through MLFlow's SDK
>
> #### Kubeflow SDK Enhancement
It would be nice to have a code snippet showing what those integrations will look like.
I'll add this in Kubeflow SDK Implementation under Design Details. The example is from a proof of concept developed by @kramaranya!
> experiments in Katib.
> 1. **Users**: A user to associate runs, metrics, artifacts, etc. for auditing. This will generally map to the Kubernetes
>    identity.
> 1. **Runs**: An execution in the machine learning workflow. In Pipelines, this maps to a pipeline run. In Katib, this
Shall we have a representation of Run depending on the Kubeflow CRDs, i.e.:

- TrainJob
- OptimizeJob
- SparkJob
- Pipeline/Workflow

I can imagine that users might want to create an Experiment where they only have a training job, or they might want to create a more complex Experiment with multiple steps (e.g. data processing, training, evals).
So a run could be just a single TrainJob, or it could be a Pipeline with multiple steps (each step is a nested run). I'm trying to model this after MLflow so it's agnostic to the source that logged it.
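For reference, this is a minimal sketch of how MLflow expresses that parent/child relationship with nested runs, which is the model being borrowed here (experiment, run, and metric names are illustrative):

```python
import mlflow

mlflow.set_experiment("pipeline-style-experiment")

# The parent run represents the whole pipeline execution; each step is a
# nested child run. A single TrainJob would simply be a run with no children.
with mlflow.start_run(run_name="pipeline-run"):
    with mlflow.start_run(run_name="data-prep", nested=True):
        mlflow.log_metric("rows_processed", 10000)
    with mlflow.start_run(run_name="train", nested=True):
        mlflow.log_param("lr", 0.01)
        mlflow.log_metric("accuracy", 0.93)
```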
> ### Model Registry Domain Models
>
> The expanded Model Registry will include the following domain models, heavily influenced by
We should add the schema / API spec in this KEP.
We can also link to:
kubeflow/model-registry#1224 (comment)
I'll clarify that the linked comment includes APIs as well.
> depending on which Kubeflow project they use (e.g., Pipeline runs, Katib experiments, Training Operator results).
> 1. **No unified tracking**: Users cannot easily compare runs, share insights, or maintain consistent metadata across the
>    entire Kubeflow ecosystem.
> 1. **Maintenance challenges**: The Kubeflow Pipelines dependency on MLMD creates technical debt and limits future
ML-metadata also violates a lot of security best practices and breaks hard multi-tenancy.
and has a lot of CVEs:

```
Scanning ghcr.io/kubeflow/kfp-metadata-envoy:2.5.0
+----------+------+--------+-----+
| Critical | High | Medium | Low |
+----------+------+--------+-----+
|        0 |    0 |     28 |   7 |
+----------+------+--------+-----+

Scanning ghcr.io/kubeflow/kfp-metadata-writer:2.5.0
+----------+------+--------+-----+
| Critical | High | Medium | Low |
+----------+------+--------+-----+
|       13 |  372 |    973 | 845 |
+----------+------+--------+-----+

Scanning gcr.io/tfx-oss-public/ml_metadata_store_server:1.14.0
+----------+------+--------+-----+
| Critical | High | Medium | Low |
+----------+------+--------+-----+
|        0 |    0 |     41 |  18 |
+----------+------+--------+-----+
```
> interoperability.
>
> ### Goals
Add per-namespace/profile multi-tenancy as a goal.
I'm closing this KEP because my team no longer has capacity to take this on. If others want to pursue this, feel free to fork the KEP and I'll be happy to review and advise. 😄
@mprahl may we keep it open for now? Just to have it tracked. The stalebot will close it anyway if there is no activity on this topic.
I agree with @juliusvonkohout! Maybe we should put out a call for contributors to help us add Experiment Tracking support via MLFlow for Kubeflow sub-projects. cc @kubeflow/wg-training-leads @kubeflow/wg-pipeline-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-outreach-committee @jbottum
Rather than tying it strictly to the MLflow implementation choice, I believe it would be very helpful to add an SPI (strongly inspired by MLflow Experiments/Runs to begin with), so that if one day you want to tie in another integration in this area, you could. Not to dispute MLflow's leading popularity, but in other community discussions other alternatives also have their market share, so an SPI would also prepare the ground for additional contributors, adding to what Andrey just said. What would be the @kubeflow/kubeflow-steering-committee PoV on this?
I fully agree - designing an extensible architecture makes sense, since it will let us easily swap between experiment tracking solutions (e.g., MLflow, W&B, or even a custom option).
Very much IMHO: in the short term, an SPI that is 1:1 with the MLflow API (with the MLflow integration as its implementation).
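Purely as an illustration of the kind of SPI being discussed, a hypothetical Python sketch mirroring the MLflow experiment/run surface; none of these names are an agreed interface:

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ExperimentTrackingSPI(ABC):
    """Hypothetical provider interface mirroring the MLflow experiment/run model.

    Concrete providers (MLflow, W&B, a custom backend, ...) would implement
    this, so the rest of Kubeflow depends only on the interface.
    """

    @abstractmethod
    def create_experiment(self, name: str) -> str:
        """Create (or reuse) an experiment and return its ID."""

    @abstractmethod
    def start_run(
        self,
        experiment_id: str,
        run_name: Optional[str] = None,
        parent_run_id: Optional[str] = None,
    ) -> str:
        """Start a run (optionally nested under a parent) and return its ID."""

    @abstractmethod
    def log_params(self, run_id: str, params: Dict[str, str]) -> None:
        """Log run parameters."""

    @abstractmethod
    def log_metrics(self, run_id: str, metrics: Dict[str, float], step: int = 0) -> None:
        """Log run metrics at an optional step."""

    @abstractmethod
    def end_run(self, run_id: str, status: str = "FINISHED") -> None:
        """Mark a run as finished (or failed)."""
```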
Experiment tracking is heavily dependent on the Registry and UI supporting it for visualizations and for tracking models, versions, and metrics. What are your thoughts on that with this SPI-based integration? If we say the SPI enables providers to capture data and lets users use the native tools they integrated with, for example using the MLflow UI separately? My next question is how we foresee bringing the champion model back into the Kubeflow Model Registry for deployment or management? Or do we need to? For me, this also defines the scope of Model Registry activities going forward. Thoughts?
I've reached out to the MLflow community to see their willingness for me to contribute a multi-tenancy feature, which would allow us to have a single MLflow instance for a Kubeflow installation. Then the Kubeflow community (could be the Pipeline WG) could maintain an MLflow plugin to handle Kubernetes RBAC requirements:
Thank you very much. Ping me on Slack if you need help.
GitHub issue: #897
This proposal aims to resolve the current fragmented and limited experiment tracking experience by expanding the
Kubeflow Model Registry into a unified, centralized metadata store. Currently, experiment tracking is scattered
across components like Kubeflow Pipelines (which requires pipeline execution for tracking) and Katib (limited to
hyperparameter tuning). This leads to challenges such as limited flexibility for direct logging from Python scripts
or Jupyter notebooks, a fragmented user experience across multiple interfaces, and maintenance difficulties due
to reliance on the inactive MLMD project.