This repository provides a set of different experiments designed for you to try out and explore FlexAI. These experiments range from running your very first training job, to fine-tuning language, diffusion, and text-to-speech models using techniques like QLoRA and LoRA, as well as integrating FlexAI with other platforms, such as experiment trackers.
FlexAI CLI: Install the FlexAI CLI by following the Installing the FlexAI CLI guide.
The following table lists the experiments available in this repository. Each experiment is designed to walk you through a specific use case and contains its required code, as well as detailed instructions on how to run it on FlexAI:
| No. | Section | Description |
|---|---|---|
| 1 | A Simple Training Job on FlexAI | Step-by-step guide to get your first Training Job on FlexAI running |
| 2 | A Simple Distributed Data Parallel (DDP) Training Job on FlexAI | Demonstrates that you only need to add two flags to start a DDP Training Job |
| 3 | Resuming a Training Job from a Checkpoint | Learn how to resume a Training Job from a previously saved checkpoint |
| 4 | Streaming Large Datasets During a Training Job | Train a model on a large dataset using streaming |
| 5 | Training Job & Experiment Tracking | Using Weights and Biases with FlexAI for experiment tracking |
| 6 | Fine-Tuning a Language Model with QLoRA | Fine-tune a causal language model efficiently using QLoRA |
| 7 | Fine-Tuning a Diffusion Model with LoRA | Fine-tune a diffusion model efficiently using LoRA |
| 8 | Fine-Tuning a Text-to-Speech Model | Fine-tune a text-to-speech (TTS) model |
| 9 | Fine-Tuning a Language Model using Flash Attention | Fine-tune a causal language model efficiently using the flash-attn package |
| 10 | Train and Serve a French LLM on FlexAI with LlamaFactory | Fine-tune Qwen2.5-7B on French data using LlamaFactory and deploy as a production-ready inference endpoint |
| 11 | Train and Serve Language Models with Axolotl | Fine-tune language models on domain-specific data using Axolotl framework with custom dataset configurations and FSDP |
| 12 | Reinforcement Learning Fine-Tuning with EasyR1 | Train language models with RL using GRPO and DAPO algorithms for enhanced reasoning capabilities |
| 13 | GRPO Training on Vision-Language Models | Fine-tune vision-language models using GRPO reinforcement learning with HuggingFace TRL, LoRA, vLLM, and DeepSpeed ZeRO-3 |
| 14 | Language Model Evaluation with LM-Evaluation-Harness | Comprehensive evaluation of language models across 300+ tasks and benchmarks using the LM-Evaluation-Harness framework |
| 15 | RAG Application with LangChain and FlexAI Inference Endpoints | Interactive interface for users to ask questions based on provided documents using Retrieval-Augmented Generation |
| 16 | Speech-to-Text Application Using FlexAI Inference Endpoints | Interactive interface for recording audio messages and receiving transcriptions |
| 17 | Multi-Agent Orchestration with LangGraph | Build a multi-agent system where specialized AI agents work together under a central supervisor |
| 18 | Text-to-Image Inference with FlexAI Endpoints | Deploy and use Stable Diffusion 3.5 Large for high-quality image generation via FlexAI inference endpoints |
| 19 | Text-to-Audio Inference with FlexAI Endpoints | Deploy and use Stable Audio Open 1.0 for high-quality audio generation via FlexAI inference endpoints |
| 20 | Text-to-Speech Inference with FlexAI Endpoints | Deploy and use the Kokoro model for natural voice synthesis via FlexAI inference endpoints |
| 21 | Text-to-Video Inference with FlexAI Endpoints | Deploy and use Wan2.2-T2V-A14B for high-quality video generation via FlexAI inference endpoints |
| 22 | Object Detection and Computer Vision with Ultralytics YOLO11 | Train and deploy state-of-the-art object detection, segmentation, and pose estimation models using YOLO11 |
This repository includes experiments that utilize the HuggingFace Accelerate library for efficient training.
The FlexAI CLI simplifies running training scripts by automatically determining the appropriate execution method:
- `python`: used for single-accelerator training.
- `torchrun`: automatically used for multi-accelerator distributed training.
If you're accustomed to using the accelerate launch command, you can seamlessly run the same scripts on FlexAI without modification. Simply provide the script to FlexAI, and it will handle execution.
As highlighted in the Accelerate documentation, the accelerate launch command is optional: Accelerate's functionality is integrated directly into your script, making it compatible with other launchers like torchrun.
Note
Unlike accelerate launch, torchrun does not use the YAML configuration file generated by accelerate config.
If your training setup relies on specific configurations from the YAML file, you may need to adjust your script to explicitly define these settings using the Accelerator class.
By doing so, you ensure seamless execution across different environments while maintaining flexibility for various training setups.