
blog: Add post on introducing Kubeflow Trainer V2 #169


Merged: 17 commits merged into kubeflow:master on Jul 21, 2025

Conversation

kramaranya
Contributor

Member

@tarekabouzeid left a comment


Great work! Thank you so much @kramaranya

Member

@andreyvelich left a comment


Thanks @kramaranya, I left a few comments!
/cc @astefanutti @deepanker13 @saileshd1402 @kubeflow/wg-training-leads

- Abstract Kubernetes complexity from data scientists
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community

We’re deeply grateful to all contributors and community members who made the **Trainer v2** possible with their hard work and valuable feedback. We'd like to give special recognition to [andreyvelich](https://github.com/andreyvelich), [tenzen-y](https://github.com/tenzen-y), [electronic-waste](https://github.com/electronic-waste), [astefanutti](https://github.com/astefanutti), [ironicbo](https://github.com/ironicbo), [mahdikhashan](https://github.com/mahdikhashan), [kramaranya](https://github.com/kramaranya), [harshal292004](https://github.com/harshal292004), [akshaychitneni](https://github.com/akshaychitneni), [chenyi015](https://github.com/chenyi015) and the rest of the contributors. See the full [contributor list](https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%206%20months&var-metric=commits&var-repogroup_name=kubeflow%2Ftrainer&var-country_name=All&var-companies=All) for everyone who helped make this release possible.
Member

I would also like to highlight @ahg-g, @kannon92, and @vsoch's contributions here, since their feedback was essential while we designed the Kubeflow Trainer architecture last year together with the Batch WG.

WDYT @tenzen-y ?

```yaml
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
```
Member

Should you also set `numNodes: 2`?
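For illustration, a TrainJob manifest with that field set might look roughly like the following. This is a hypothetical sketch: the API version and the `runtimeRef` name are assumed from the Kubeflow Trainer v2 API and are not taken from the post under review.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  runtimeRef:
    name: torch-distributed   # assumed runtime name, for illustration only
  trainer:
    numNodes: 2               # the node count suggested in this review comment
```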

# LLM Fine-Tuning Support

Another improvement of **Trainer v2** is its **built-in support for fine-tuning large language models**, where we provide two types of trainers:
- `BuiltinTrainer` - already includes the fine-tuning logic and allows data scientists to start fine-tuning quickly, requiring only parameter adjustments,
Member

We should say that in the first release we will support torchtune Runtimes for Llama models.
cc @Electronic-Waste

Member

Yes, I agree. We need to say that in the first release:

  1. We support the TorchTune LLM Trainer as one option in `BuiltinTrainer`.
  2. For the TorchTune LLM Trainer, we provide users with some runtimes (`ClusterTrainingRuntime`). Currently, we only support Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct in the manifests.

Contributor Author

thank you! updated in 336b058

Contributor

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: saileshd1402, kubeflow/wg-training-leads.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Thanks @kramaranya, I left a few comments!
/cc @astefanutti @deepanker13 @saileshd1402 @kubeflow/wg-training-leads

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@Electronic-Waste left a comment

@kramaranya Huge thanks for this. And thank you for the mention, @andreyvelich. I left some suggestions regarding the LLM Fine-Tuning Support section.

Comment on lines 165 to 182
```python
job_name = TrainerClient().train(
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dtype="bf16",
            batch_size=1,
            epochs=1,
            num_nodes=5,
        ),
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="tatsu-lab/alpaca",
        )
    ),
    runtime=Runtime(
        name="torchtune-llama3.1-8b",
    ),
)
```
Member

Suggested change
```python
job_name = TrainerClient().train(
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dtype="bf16",
            batch_size=1,
            epochs=1,
            num_nodes=5,
        ),
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="tatsu-lab/alpaca",
        )
    ),
    runtime=Runtime(
        name="torchtune-llama3.1-8b",
    ),
)
```

```python
job_name = client.train(
    runtime=Runtime(
        name="torchtune-llama3.2-1b"
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="<YOUR_HF_TOKEN>",  # Replace with your Hugging Face token
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            resources_per_node={
                "gpu": 1,
            }
        )
    )
)
```

Maybe we need to switch to a runnable example

Member

And you can also say that, "For more details, please refer to this example".

Contributor Author

sgtm, thanks @Electronic-Waste!
updated in 336b058

Signed-off-by: kramaranya <[email protected]>
Member


The diagram below shows how different personas interact with these custom resources:

![division_of_labor](/images/2025-07-09-introducing-trainer-v2/user-personas.drawio.svg)
Member

Can you use the user-personas diagram here and delete the other one?

```python
job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=my_train_func,
```
Member

@kramaranya Did you get a chance to check it ?

- **[Native Kueue integration](https://github.com/kubernetes-sigs/kueue/issues/3884)** - improve resource management and scheduling capabilities for TrainJob resources
- **[Model Registry integrations](https://github.com/kubeflow/trainer/issues/2245)** - export trained models directly to Model Registry

For users migrating from **Trainer v1**, check out a [**Migration Guide**](https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/).
Member

Maybe we should highlight it in a separate section?
And we should also say migrating from Kubeflow Training Operator v1.

title: "Introducing Kubeflow Trainer V2"
hide: false
permalink: /trainer/intro/
author: "AutoML & Training WG"
Member

Maybe "Kubeflow Trainer Team"?

Contributor Author

sounds good to me, thanks!
wdyt @andreyvelich?

Member

Yes, Kubeflow Trainer Team sounds good!

author: "AutoML & Training WG"
---

Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.


nit: "hide this complexity"


**The main goals of KF Trainer v2 include:**
- Make AI/ML workloads easier to manage at scale
- Improve the Python interface


Suggested change
- Improve the Python interface
- Provide a Pythonic interface to train models


Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.

**The main goals of KF Trainer v2 include:**


Suggested change
**The main goals of KF Trainer v2 include:**
**The main goals of Kubeflow Trainer v2 include:**

- Abstract Kubernetes complexity from AI Practitioners
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community

We’re deeply grateful to all contributors and community members who made the **Trainer v2** possible with their hard work and valuable feedback. We'd like to give special recognition to [andreyvelich](https://github.com/andreyvelich), [tenzen-y](https://github.com/tenzen-y), [electronic-waste](https://github.com/electronic-waste), [astefanutti](https://github.com/astefanutti), [ironicbo](https://github.com/ironicbo), [mahdikhashan](https://github.com/mahdikhashan), [kramaranya](https://github.com/kramaranya), [harshal292004](https://github.com/harshal292004), [akshaychitneni](https://github.com/akshaychitneni), [chenyi015](https://github.com/chenyi015) and the rest of the contributors. We would also like to highlight [ahg-g](https://github.com/ahg-g), [kannon92](https://github.com/kannon92), and [vsoch](https://github.com/vsoch) whose feedback was essential while we designed the Kubeflow Trainer architecture together with the Batch WG. See the full [contributor list](https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%206%20months&var-metric=commits&var-repogroup_name=kubeflow%2Ftrainer&var-country_name=All&var-companies=All) for everyone who helped make this release possible.


nit: break lines to keep one sentence per line.


**Trainer v2** leverages these Kubernetes-native improvements to reuse existing functionality rather than reinventing the wheel. This collaboration between the Kubernetes and Kubeflow communities delivers a more standardized approach to ML training on Kubernetes.

# Division of Labor


Labor sounds a bit too laborious 😃. Maybe just "User Personas" or "For AI practitioners and MLOps engineers"?

Member

@kramaranya Did you get a chance to check it ?

Contributor Author

Sounds good to me :) I'm leaning towards "User Personas".
Another option I was considering was "Personas: Platform Engineers and AI Practitioners", but "User Personas" seems a better option in case we change personas later again.
cc @andreyvelich @franciscojavierarceo @tenzen-y @Electronic-Waste any preferences?

Member

Sure, User Personas make sense to me.


Running machine learning workloads on Kubernetes can be challenging. Distributed training, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The **Kubeflow Trainer v2 (KF Trainer)** was created to simplify this complexity, by making training on Kubernetes easier for AI Practitioners.

**The main goals of KF Trainer v2 include:**
Copy link

@astefanutti Jul 18, 2025


It may be obvious for everyone involved in the project but it doesn't seem to me like very explicit / prominent in this article: PyTorch :)

I'd try to message that Kubeflow trainer v2 is the easiest and most scalable way to run PyTorch distributed training on Kubernetes!

Member

Agree, emphasis that PyTorch is the primary framework for us makes sense.
Let's include this as one of the main goals.
WDYT @kramaranya @Electronic-Waste @tenzen-y @franciscojavierarceo ?

Member

@kramaranya Did you get a chance to check it ?

Contributor Author

Yeah, I do agree we should emphasize on this point.

I'm leaning toward modifying the current goal "Make AI/ML workloads easier to manage at scale" to be:
"Make AI/ML workloads easier to manage at scale, with PyTorch as the primary framework"

And then modify an intro:
"Running machine learning workloads on Kubernetes can be challenging. Distributed training and LLM fine-tuning, in particular, involve managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The Kubeflow Trainer v2 (KF Trainer) was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs."

Alternatively, we could just add a new goal with no intro changes:
"Deliver the easiest and most scalable PyTorch distributed training on Kubernetes"

What do you think @astefanutti @andreyvelich ?

Contributor Author

@andreyvelich could you also take a look at ^^, so I can update it?

Member

The description looks good.
For the goals, we can leave the "make AI/ML workloads easier to manage at scale" goal as it is, and just add another goal for PyTorch, as you said.

Contributor Author

Awesome, updated in aa4f6e7. @astefanutti please let me know what you think :)


This looks great, thanks!

Signed-off-by: kramaranya <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Jul 20, 2025
Member

@Electronic-Waste left a comment

@kramaranya Thanks for this great work! Just one nit.

.gitignore Outdated
```diff
@@ -11,3 +11,4 @@ _notebooks/.ipynb_checkpoints
 .netlify
 .tweet-cache
 __pycache__
+.idea
```
Member

Suggested change
.idea
.idea

Need a new blank line here

Signed-off-by: kramaranya <[email protected]>
@astefanutti

/lgtm

Thanks!

@google-oss-prow google-oss-prow bot added the lgtm label Jul 21, 2025
@eoinfennessy left a comment

This is great, thanks @kramaranya!

I added a few minor suggestions and clarifying questions.

# Python SDK

**The KF Trainer v2** introduces a **redesigned Python SDK**, which is intended to be the **primary interface for AI Practitioners**.
The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.


What is meant by providing a unified interface across cloud environments?

Suggested change
The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.
The SDK provides the same interface for multiple ML frameworks, and abstracts the underlying complexities of Kubernetes and cloud environments.

Contributor Author

@kramaranya Jul 21, 2025

This means you can use the same SDK commands and configurations for any cloud provider, without needing to learn different APIs for each platform. I think "a unified interface" works better here compared to "the same interface". wdyt
https://www.kubeflow.org/docs/components/trainer/overview/#what-is-kubeflow-trainer

Member

Unified interface makes sense for me.

# Simplified API

Previously, in the **Kubeflow Training Operator** users worked with different custom resources for each ML framework, each with their own framework-specific configurations.
The **KF Trainer v2** replaces these multiple CRDs with a **unified TrainJob API** that works with **multiple ML frameworks**.


Suggested change
The **KF Trainer v2** replaces these multiple CRDs with a **unified TrainJob API** that works with **multiple ML frameworks**.
**Kubeflow Trainer v2** replaces these multiple CRDs with a **unified TrainJob CRD** that works with **multiple ML frameworks**.

Contributor Author

To stay consistent, we should keep KF Trainer v2, and I would keep API to avoid duplication :)

Comment on lines 204 to 205
One of the challenges in **KF Trainer v1** was supporting additional ML frameworks, especially for closed-sourced frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framework** that allows customers to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.


I was under the impression that the pipeline framework was introduced to make it easier for Kubeflow Trainer developers to support adding new frameworks to Trainer, and was not a user-facing change.

@andreyvelich @tenzen-y do we intend to document how users can implement custom plugins?

Suggest replacing "customers" with "users":

Suggested change
One of the challenges in **KF Trainer v1** was supporting additional ML frameworks, especially for closed-sourced frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framewor**k that allows customers to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.
The v2 architecture addresses this by introducing a **Pipeline Framework** that allows users to **extend the Plugins** and **support orchestration** for their custom in-house ML frameworks.

Member

@andreyvelich Jul 21, 2025

Yes, the doc by @IRONICBo is a work in progress here: kubeflow/website#4039

@kramaranya @eoinfennessy Maybe we could be more explicit here, and say that allows platform administrators to extend the Plugins ... ?

Contributor Author

makes sense to me

Contributor Author

updated in 4280688

Contributor

@eoinfennessy: changing LGTM is restricted to collaborators

In response to this:

This is great, thanks @kramaranya!

I added a few minor suggestions and clarifying questions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: kramaranya <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Jul 21, 2025

![user_personas](/images/2025-07-09-introducing-trainer-v2/user-personas.drawio.svg)

- **Platform Engineers** define and manage **the infrastructure configurations** required for training jobs using `TrainingRuntimes` or `ClusterTrainingRuntimes`.
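For readers less familiar with these resources, a `ClusterTrainingRuntime` that a Platform Engineer might define could look roughly like this. This is a sketch only: the field names are assumed from the Kubeflow Trainer v2 API, and the runtime name, image, and values are illustrative rather than taken from the post.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed        # assumed runtime name
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto       # one process per available GPU
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:latest   # illustrative image
```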
Contributor Author

Hm, @andreyvelich should it actually be Platform Administrators?

Member

Yes, let's keep the persona name consistent please.

Contributor Author

thanks, updated in 4280688

Signed-off-by: kramaranya <[email protected]>
Member

@andreyvelich left a comment

I think, we can address the changes in the followup PRs
/lgtm
/approve

/hold feel free to un-hold once it is ready @kramaranya

@google-oss-prow google-oss-prow bot added the lgtm label Jul 21, 2025
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kramaranya
Contributor Author

Thanks everyone for reviews!!!
/unhold

@google-oss-prow google-oss-prow bot merged commit 5d960bc into kubeflow:master Jul 21, 2025
7 checks passed
@tenzen-y
Member

@kramaranya Thank you for this blog!


Successfully merging this pull request may close these issues.

Create Blog Post Introducing Kubeflow Trainer V2