KEP-913: Add a KEP for the reusable KFP components repository #914
Conversation
@thesuperzapper @HumairAK @droctothorpe @zazulam @chensun @andreyvelich @franciscojavierarceo could you please review this proposal if interested? This was brought up on the last community call by @HumairAK and I can bring it up for discussion next week. Feel free to tag others for review as well.
| 1. Move reusable components and pipelines into a dedicated GitHub repository with clear structure and governance.
| 2. Provide standardized metadata, documentation, and testing requirements for every asset.
| 3. Ship an installable Python package for core (community-maintained) artifacts that is versioned to match Kubeflow |
Do you mean "components" not "artifacts"?
Whoops. You're right.
| 4. Maintain a parallel, clearly demarcated area for third-party contributions, shipped as its own Python package that
| tracks the same release cadence as the core catalog.
| 5. Automate maintenance (e.g. stale component detection, dependency validation) to keep the catalog healthy.
| 6. Provide developer onboarding materials and guidance for agents generating components/pipelines.
This will be fantastic training data for an LLM-driven pipeline generator.
| ## Summary
| Establish a dedicated Kubeflow Pipelines (KFP) repository\* that hosts reusable components and full pipelines under a |
We have a robust internal product offering that implements much of what this document describes. I will encourage the internal maintainers to explore the possibility of contributing.
For further context, Red Hat aims to have its own repo of Red Hat-supported components for OpenShift AI customers, but aims to use the same repo structure, CI, documentation style, etc. as what gets accepted in upstream Kubeflow (this proposal). So if Capital One can also align, then we'd have more resources for working on follow-up things, like potentially an API and UI for this catalog, to contribute to upstream Kubeflow and use downstream.
| │ │ │ └── test_component.py
| │ │ └── <supporting_files>
| │ └── ... (other categories: evaluation/, data_processing/, etc.)
| ├── pipelines |
Curious to hear more about the rationale behind sharable pipelines (as opposed to components).
@droctothorpe I see a few reasons:
- Nested pipelines are supported, so components that benefit from running in parallel or in a chain can be grouped and then used as if they were a single component.
- Provides good examples of how some of these components stitch together for common use cases (e.g. converting PDFs to markdown and inserting them into a vector database)
- It provides quick starts for tutorials and documentation.
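For illustration, a minimal sketch of a pipeline reused as a component; recent KFP v2 SDKs support invoking one pipeline from another, and the component names and bodies below are hypothetical placeholders for the PDF-to-vector-database example above:

```python
from kfp import dsl


@dsl.component
def convert_pdf_to_markdown(pdf_uri: str) -> str:
    # Hypothetical stand-in for real PDF-parsing logic.
    return pdf_uri.rsplit(".", 1)[0] + ".md"


@dsl.component
def insert_into_vector_db(markdown_uri: str) -> str:
    # Hypothetical stand-in for real vector-database ingestion.
    return f"indexed:{markdown_uri}"


# The inner pipeline chains the two components...
@dsl.pipeline(name="pdf-ingestion")
def pdf_ingestion(pdf_uri: str) -> str:
    md = convert_pdf_to_markdown(pdf_uri=pdf_uri)
    return insert_into_vector_db(markdown_uri=md.output).output


# ...and can then be invoked from a parent pipeline as if it were a component.
@dsl.pipeline(name="document-workflow")
def document_workflow(pdf_uri: str):
    pdf_ingestion(pdf_uri=pdf_uri)
```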
| - Every asset must include `component.py` or `pipeline.py`, `metadata.yaml`, `README.md`, `OWNERS`, and optional
| supporting files. The `OWNERS` file empowers the owning team to review changes, update metadata, and manage lifecycle
| tasks without central gatekeeping.
| - Internal unit tests are optional; when provided, they must live under a `tests/` subdirectory to avoid clutter.
Should unit tests be mandatory?
And what about integration tests to run the components on actual KFP infrastructure?
I'll add a section about enabling tests on Kind clusters in CI with KFP installed in standalone mode. My concern is that many components may depend on external services or be resource-intensive, so I didn't want to require it, but I think it's valid to have it as an option. I'll explicitly make it an opt-out in the metadata file.
I'll add a new file, such as test_pipeline.py, in the components/pipelines directories that gets automatically run in the CI environment if defined. In test_pipeline.py, there will be an optional verify_pipeline function that takes the completed pipeline and KFP client as input and is responsible for verifying the result.
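For concreteness, a sketch of what that hook might look like; the verify_pipeline signature is an assumption drawn from this comment, not a finalized interface:

```python
# test_pipeline.py -- sketch of the opt-in verification hook described above.
from kfp import Client


def verify_pipeline(run, client: Client) -> None:
    """Verify the result of the completed pipeline run.

    Args:
        run: Completed run object, e.g. as returned by client.get_run()
            after the CI harness waits for the run to finish.
        client: KFP client connected to the Kind test cluster, available
            for fetching task details or output artifacts.
    """
    # Hypothetical assertion; real checks would inspect the run's outputs.
    assert run.state == "SUCCEEDED", f"pipeline finished in state {run.state!r}"
```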
| lint/tests pass but does not guarantee functionality.
| - Third-party assets remain the responsibility of their listed owners; Kubeflow maintainers provide validation
| infrastructure only.
We may want to clarify a formal deprecation policy in case incompatibilities or CVEs surface in a contributed component.
NVM, I see you addressed this further down.
| ### Standardized README Templates
| Each component/pipeline directory includes a `README.md` generated from a template and auto-populated with docstring |
Might be nice to provide a CLI that handles boilerplate generation and validation in a consolidated way that can also be invoked in CI.
Good point, I'll add that!
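As a rough illustration, the README-generation step of such a CLI could be as small as the following, using only the stdlib; the template text and file layout are assumptions based on the structure this KEP describes:

```python
import ast
from pathlib import Path

# Hypothetical template; the real one would live in the repository.
TEMPLATE = """# {name}

{summary}

See `component.py` for the full interface and `metadata.yaml` for metadata.
"""


def generate_readme(asset_dir: Path) -> None:
    """Render README.md from the first function docstring in component.py.

    Assumes component.py defines at least one function with a docstring.
    """
    source = (asset_dir / "component.py").read_text()
    functions = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)]
    docstring = ast.get_docstring(functions[0]) or ""
    summary = docstring.split("\n\n")[0]  # first paragraph of the docstring
    (asset_dir / "README.md").write_text(
        TEMPLATE.format(name=asset_dir.name, summary=summary)
    )
```

Parsing with `ast` rather than importing the module keeps the generator safe to run in CI even when a component's dependencies aren't installed.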
| 3. Black formatting (`black --check --line-length 120`).
| 4. Docstring lint verifying Google-style docstrings (e.g. `pydocstyle --convention=google`) and enforcing docstrings on
| every `dsl.component` or `dsl.pipeline`-decorated function.
| 5. Static import guard: ensure only stdlib imports appear at module top level; third-party imports must live inside the |
I wonder if people will want to author components that leverage the embedded artifact pattern you authored (since it greatly simplifies component logic testing), in which case, the static import guard may need to be refined.
Good point. I'll keep it local for now but we can adjust later.
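For reference, a minimal sketch of how the guard could work using the stdlib `ast` module; allowing `kfp` at the top level is an assumption, since component files need `from kfp import dsl`:

```python
import ast
import sys
from pathlib import Path

# Modules permitted at module top level. sys.stdlib_module_names requires
# Python 3.10+; allowing "kfp" itself is an assumption (see above).
ALLOWED_ROOTS = set(sys.stdlib_module_names) | {"kfp"}


def top_level_import_violations(path: Path) -> list[str]:
    """Return non-allowed modules imported at the module's top level."""
    violations = []
    for node in ast.parse(path.read_text()).body:  # top-level statements only
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        violations.extend(
            f"{path}: top-level import of {root!r}"
            for root in roots
            if root and root not in ALLOWED_ROOTS
        )
    return violations
```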
| ### Open Questions
| - Should we expose the catalog via an API/website in addition to GitHub? (Out of scope initially but worth tracking.) |
Something like https://operatorhub.io/ would be appreciated by end users I think. If we build a consolidated CLI, it would be cool if it provided some list / describe component capabilities as well.
Just to clarify, I think a static website makes more sense than an API; it could easily be generated in CI and served via GH Pages.
Agreed. I might play around with this and see how much effort it'd be to contribute one.
| ### Open Questions
| - Should we expose the catalog via an API/website in addition to GitHub? (Out of scope initially but worth tracking.)
| - Should the core components Python package be included in the Kubeflow SDK directly? |
I lean towards no, but maybe there are some benefits that I'm not taking into consideration.
IMO both are helpful for different reasons: one for SEO, and the other because developers are too lazy to google, and we could probably compile the examples directly into the docs. That would probably be better in general anyway.
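For what it's worth, the import ergonomics being weighed look roughly like this; both package and module paths below are hypothetical:

```python
# Standalone core-components package (hypothetical names):
from kfp_components.core.data_processing import convert_pdf_to_markdown

# Bundled into the Kubeflow SDK instead (also hypothetical):
from kubeflow.pipelines.components.data_processing import convert_pdf_to_markdown
```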
Left some minor comments and questions. Overall, looks really good! Appreciate all the thought that went into this. It's a massive step up from the components directory in its current state.
Force-pushed from f72f2fb to 6955a81
| - Third-party assets remain the responsibility of their listed owners; Kubeflow maintainers provide validation
| infrastructure only.
| ### Artifact Metadata Schema |
Hey Matt, as the catalog grows, having a structured approach to discoverability might be nice. Some thoughts:
- A CI job to build a consolidated catalog (catalog.json) from fields in metadata.yaml, from which a UI or SDK could be built. Publish the catalog for easy integration with external tools.
- A tags field might be useful.
- I would imagine a lot of the components will be related to other Kubeflow components (Trainer, Katib, etc.). Having more explicit fields for them, along with min_ versions, might be useful. Treat them as "core dependencies". For example, as a user I want to use Kubeflow Trainer, and I want to know all available components and their version compatibility, from the SDK or UI.
Thanks for the review @briangallagher! I addressed points 2 and 3. I put point 1 in the open questions section so we don't lose track of it.
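For point 1, a rough sketch of the catalog-build job; it assumes PyYAML, and the field names (name, tags, kubeflow_dependencies) are illustrative, pending the final metadata.yaml schema:

```python
import json
from pathlib import Path

import yaml  # PyYAML


def build_catalog(repo_root: Path) -> None:
    """Aggregate every metadata.yaml into a single catalog.json."""
    entries = []
    for meta_path in sorted(repo_root.glob("components/**/metadata.yaml")):
        meta = yaml.safe_load(meta_path.read_text()) or {}
        entries.append(
            {
                "path": str(meta_path.parent.relative_to(repo_root)),
                "name": meta.get("name"),
                "tags": meta.get("tags", []),
                "kubeflow_dependencies": meta.get("kubeflow_dependencies", {}),
            }
        )
    (repo_root / "catalog.json").write_text(json.dumps(entries, indent=2))
```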
LGTM
| project introduces standardized metadata, documentation, testing, and maintenance automation to make components
| discoverable, reliable, and safe to adopt.
| \*Working title `kubeflow/kfp-components`; the final repository name will be confirmed during implementation. |
I would like to propose we call it kubeflow/pipelines-components for consistency with kubeflow/pipelines.
Updated!
+1
| │ │ └── <component-name>/
| │ │ ├── __init__.py (exposes the component entrypoint for imports)
| │ │ ├── component.py
| │ │ ├── metadata.yaml |
As discussed in the community meeting, we need to consider how we will manage the Docker images (which are part of the component).
My preference is that we require components to ONLY use either:
- an approved "base" docker image
- an extension of one of the "approved base images" using a Dockerfile defined in this repo (possibly under the components folder).
@thesuperzapper the latest push adds content around this. Please let me know if that aligns with your thoughts.
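For illustration, components could pin an approved base image through the existing base_image parameter of dsl.component; the image reference below is hypothetical, and the approved list would be defined by this repository's governance:

```python
from kfp import dsl


# Hypothetical approved base image; the real registry and tag policy
# would come from this repo's governance docs.
@dsl.component(base_image="ghcr.io/kubeflow/pipelines-components-base:1.0")
def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens in the input text."""
    return len(text.split())
```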
Force-pushed from 6955a81 to acd4aac
Signed-off-by: mprahl <[email protected]>
Force-pushed from acd4aac to 63d2659
LGTM
This KEP proposes creating a dedicated kubeflow/kfp-components repository to host reusable Kubeflow Pipelines components and pipelines under a clear core vs. third_party split. It introduces standardized per-asset metadata and autogenerated READMEs; enforced CI (formatting, docstrings, a static import guard, compile checks, dependency probes, optional pytest, and example compilation); separate Python packages for core and third-party assets (kfp-components, kfp-components-third-party) with ergonomic imports and semver aligned to Kubeflow; and governance via OWNERS files plus scheduled automation to keep assets verified, dependencies current, and stale items removed. The rollout covers bootstrapping the repo, migrating curated assets with a deprecation window, onboarding third parties, and coordinating with the ongoing pipelines cleanup to reduce fragmentation and improve discoverability, reliability, and reuse.
Resolves: #913