Conversation

@mprahl mprahl commented Oct 20, 2025

This KEP proposes creating a dedicated kubeflow/kfp-components repository to host reusable Kubeflow Pipelines components and pipelines under a clear core vs. third_party split. Every asset carries standardized metadata and an autogenerated README, backed by enforced CI (formatting, docstrings, a static import guard, compile checks, dependency probes, optional pytest, and example compilation). Core and third-party assets ship as separate Python packages (kfp-components and kfp-components-third-party) with ergonomic imports and semver aligned to Kubeflow, governed via OWNERS files plus scheduled automation that keeps assets verified, dependencies current, and stale items removed. Rollout covers bootstrapping the repo, migrating curated assets with a deprecation window, onboarding third parties, and coordinating with the ongoing pipelines cleanup to reduce fragmentation and improve discoverability, reliability, and reuse.

Resolves: #913 (KEP-913: Create dedicated repository for Kubeflow Pipelines components and pipelines)

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mprahl commented Oct 20, 2025

@thesuperzapper @HumairAK @droctothorpe @zazulam @chensun @andreyvelich @franciscojavierarceo could you please review this proposal if interested? This was brought up on the last community call by @HumairAK and I can bring it up for discussion next week.

Feel free to tag others for review as well.


1. Move reusable components and pipelines into a dedicated GitHub repository with clear structure and governance.
2. Provide standardized metadata, documentation, and testing requirements for every asset.
3. Ship an installable Python package for core (community-maintained) artifacts that is versioned to match Kubeflow
Contributor:

Do you mean "components" not "artifacts"?

Author:

Whoops. You're right.

4. Maintain a parallel, clearly demarcated area for third-party contributions, shipped as its own Python package that
tracks the same release cadence as the core catalog.
5. Automate maintenance (e.g. stale component detection, dependency validation) to keep the catalog healthy.
6. Provide developer onboarding materials and guidance for agents generating components/pipelines.
@droctothorpe (Contributor) commented Oct 21, 2025:

This will be fantastic training data for an LLM-driven pipeline generator.


## Summary

Establish a dedicated Kubeflow Pipelines (KFP) repository\* that hosts reusable components and full pipelines under a
Contributor:

We have a robust internal product offering that implements much of what this document describes. I will encourage the internal maintainers to explore the possibility of contributing.

Author:

For further context, Red Hat aims to have its own repo of Red Hat-supported components for OpenShift AI customers, but aims to use the same repo structure, CI, documentation style, etc. as what gets accepted in upstream Kubeflow (this proposal). So if Capital One can also align, then we'd have more resources for follow-up work, such as an API and UI for this catalog, to contribute to upstream Kubeflow and use downstream.

│ │ │ └── test_component.py
│ │ └── <supporting_files>
│ └── ... (other categories: evaluation/, data_processing/, etc.)
├── pipelines
Contributor:

Curious to hear more about the rationale behind sharable pipelines (as opposed to components).

Author:

@droctothorpe I see a few reasons:

  1. Nested pipelines are supported, so a set of components that benefit from running in parallel or in a chain can be wrapped in a pipeline and then reused as if it were a single component (see the sketch below).
  2. It provides good examples of how these components stitch together for common use cases (e.g. converting PDFs to markdown and inserting them into a vector database).
  3. It provides quick starts for tutorials and documentation.
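
To illustrate reason 1, here is a minimal sketch (component and pipeline names are illustrative, not from the proposal):

```python
from kfp import dsl

@dsl.component
def convert_pdf(url: str) -> str:
    # Placeholder logic; a real component would do the conversion.
    return f"markdown for {url}"

@dsl.pipeline(name="ingest-docs")
def ingest_docs(url: str):
    # A small pipeline composed of one or more components.
    convert_pdf(url=url)

@dsl.pipeline(name="rag-prep")
def rag_prep(url: str):
    # The inner pipeline is invoked exactly as if it were a component.
    ingest_docs(url=url)
```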

- Every asset must include `component.py` or `pipeline.py`, `metadata.yaml`, `README.md`, `OWNERS`, and optional
supporting files. The `OWNERS` file empowers the owning team to review changes, update metadata, and manage lifecycle
tasks without central gatekeeping.
- Optional internal unit tests must live under a `tests/` subdirectory to avoid clutter.
Contributor:

Should unit tests be mandatory?


And what about integration tests to run the components on actual KFP infrastructure?

Author:

I'll add a section about enabling tests on Kind clusters in CI with KFP installed in standalone mode. My concern is that many components may depend on external services or be resource intensive, so I didn't want to require it, but I think it's valid to have it as an option. I'll explicitly make it an opt-out in the metadata file.

I'll add a new file, such as test_pipeline.py, in the component/pipeline directories that gets automatically run in the CI environment if defined. In test_pipeline.py, there is an optional verify_pipeline function that takes the completed pipeline run and the KFP client as input and is responsible for verifying the result.
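
A minimal sketch of that contract (assuming the CI harness passes the completed run object and a `kfp.Client`; exact signatures would be pinned down during implementation):

```python
# test_pipeline.py -- hypothetical CI hook described above
from kfp import Client

def verify_pipeline(run, client: Client) -> None:
    """Called by CI after the pipeline finishes on the Kind cluster."""
    # Assumes the run object exposes a terminal state string.
    assert run.state == "SUCCEEDED"
    # Additional assertions could fetch and inspect outputs via `client`.
```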

lint/tests pass but does not guarantee functionality.
- Third-party assets remain the responsibility of their listed owners; Kubeflow maintainers provide validation
infrastructure only.

Contributor:

We may want to clarify a formal deprecation policy in case incompatibilities or CVEs surface in a contributed component.


NVM, I see you addressed this further down.


### Standardized README Templates

Each component/pipeline directory includes a `README.md` generated from a template and auto-populated with docstring
Contributor:

Might be nice to provide a CLI that handles boilerplate generation and validation in a consolidated way that can also be invoked in CI.

Author:

Good point, I'll add that!
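
A rough sketch of what such a consolidated CLI entrypoint could look like (subcommand names are hypothetical):

```python
# Hypothetical dev CLI for scaffolding and validation; also invokable from CI.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="kfp-components")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("scaffold", help="generate component boilerplate from a template")
    sub.add_parser("validate", help="run metadata, README, and lint checks")
    args = parser.parse_args()
    print(f"would run: {args.command}")

if __name__ == "__main__":
    main()
```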

3. Black formatting (`black --check --line-length 120`).
4. Docstring lint verifying Google-style docstrings (e.g. `pydocstyle --convention=google`) and enforcing docstrings on
every `dsl.component` or `dsl.pipeline`-decorated function.
5. Static import guard: ensure only stdlib imports appear at module top level; third-party imports must live inside the
Contributor:

I wonder if people will want to author components that leverage the embedded artifact pattern you authored (since it greatly simplifies component logic testing), in which case, the static import guard may need to be refined.

Author:

Good point. I'll keep it local for now but we can adjust later.
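
For reference, the pattern the guard enforces might look like this (a minimal sketch; `kfp` itself would presumably be exempt from the stdlib-only rule):

```python
from kfp import dsl  # kfp/stdlib imports at module top level

@dsl.component(packages_to_install=["pandas"])
def count_rows(csv_path: str) -> int:
    # Third-party imports are deferred into the component body so the
    # module can be imported and linted without pandas installed.
    import pandas as pd

    return len(pd.read_csv(csv_path))
```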


### Open Questions

- Should we expose the catalog via an API/website in addition to GitHub? (Out of scope initially but worth tracking.)
Contributor:

Something like https://operatorhub.io/ would be appreciated by end users I think. If we build a consolidated CLI, it would be cool if it provided some list / describe component capabilities as well.

Contributor:

Just to clarify, I think a static website makes more sense than an API; it could easily be generated in CI and served via GH Pages.

Author:

Agreed. I might play around with this and see how much effort it'd be to contribute one.

### Open Questions

- Should we expose the catalog via an API/website in addition to GitHub? (Out of scope initially but worth tracking.)
- Should the core components Python package be included in the Kubeflow SDK directly?
Contributor:

I lean towards no, but maybe there are some benefits that I'm not taking into consideration.

Contributor:

IMO both are helpful for different reasons: one for SEO, and the other because developers are too lazy to Google. We could probably compile the examples directly into the docs, which would likely be better in general anyways.

@droctothorpe (Contributor)

Left some minor comments and questions. Overall, looks really good! Appreciate all the thought that went into this. It's a massive step up from the components directory in its current state.

- Third-party assets remain the responsibility of their listed owners; Kubeflow maintainers provide validation
infrastructure only.

### Artifact Metadata Schema

Contributor:

Hey Matt, as the catalog grows, having a structured approach to discoverability might be nice. Some thoughts:

  - A CI job to build a consolidated catalog (catalog.json) from the fields in each metadata.yaml, from which a UI or SDK could be built (see the sketch below). Publish the catalog for easy integration with external tools.
  - A `tags` field might be useful.
  - I would imagine a lot of the components will relate to other Kubeflow projects (Trainer, Katib, etc.). Having explicit fields for them, along with min_ versions, might be useful; treat them as "core dependencies". For example, as a user I want to use Kubeflow Trainer and I want to know, from the SDK or UI, all available components related to it and their version compatibility.
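
A minimal sketch of what that aggregation step could look like (file names follow this proposal's metadata.yaml convention; the catalog shape is illustrative):

```python
# Hypothetical CI step: merge every metadata.yaml into a single catalog.json.
import json
from pathlib import Path

import yaml  # PyYAML

def build_catalog(root: str = ".") -> None:
    entries = [
        yaml.safe_load(path.read_text())
        for path in sorted(Path(root).rglob("metadata.yaml"))
    ]
    Path("catalog.json").write_text(json.dumps(entries, indent=2))

if __name__ == "__main__":
    build_catalog()
```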

Author:

Thanks for the review @briangallagher! I addressed points 2 and 3. I put point 1 in the open questions section so we don't lose track of it.

@HumairAK

LGTM

project introduces standardized metadata, documentation, testing, and maintenance automation to make components
discoverable, reliable, and safe to adopt.

\*Working title `kubeflow/kfp-components`; the final repository name will be confirmed during implementation.
Member:

I would like to propose we call it kubeflow/pipelines-components for consistency with kubeflow/pipelines.

Author:

Updated!

@terrytangyuan (Member)

+1

│ │ └── <component-name>/
│ │ ├── __init__.py (exposes the component entrypoint for imports)
│ │ ├── component.py
│ │ ├── metadata.yaml
Member:

As discussed in the community meeting, we need to consider how we will manage the docker images (which are part of the component).

My preference is that we require components to ONLY use either:

  1. an approved "base" docker image
  2. an extension of one of the "approved base images" using a Dockerfile defined in this repo (possibly under the components folder).

Author:

@thesuperzapper the latest push adds content around this. Please let me know if that aligns with your thoughts.
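
For illustration, pinning a component to an approved base image via KFP's `base_image` parameter might look like this (the image reference is a placeholder, not an actual approved image):

```python
from kfp import dsl

# Placeholder reference standing in for an approved base image.
@dsl.component(base_image="ghcr.io/kubeflow/pipelines-components-base:1.0")
def hello(name: str) -> str:
    return f"Hello, {name}!"
```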

@johnugeorge (Member)

Lgtm
