Empowering AI with Determined Precision and Speed
Built with the following tools and technologies: Python, Hugging Face Transformers, Determined AI, and DeepSpeed.
- Overview
- Features
- Project Structure
- Getting Started
- Project Roadmap
- Contributing
- License
- Acknowledgments
The GRAG-HessianAI-Training-Pipeline project streamlines deep learning model training by orchestrating dataset processing, hyperparameter tuning, and distributed computing on Determined AI. It bundles CPT, SFT, DPO, ORPO, and LoRA fine-tuning workflows built on Hugging Face Transformers and DeepSpeed, integrating external libraries with custom utilities to keep training efficient. Targeting AI researchers and developers, it simplifies the workflow from model training through evaluation.
| Feature | Summary |
|---|---|
| ⚙️ Architecture | Modular Determined AI training pipeline: each training stage (CPT, SFT, DPO, ORPO, LoRA) has its own experiment configuration and fine-tuning script, backed by shared DeepSpeed configurations. |
| 🔩 Code Quality | Self-contained Python training scripts with shared utilities (`utils.py`, `lora_utils.py`) for model loading, tokenization, and checkpoint handling. |
| 📄 Documentation | This README, including a project index that describes every configuration file and script. |
| 🔌 Integrations | Hugging Face Transformers and Datasets, Determined AI, and DeepSpeed. |
| 🧩 Modularity | Training stages are configured through separate YAML files and launched through matching `*_finetune.py` entrypoints. |
| 🧪 Testing | Test suite runnable with pytest (see Getting Started). |
| ⚡️ Performance | Distributed training with DeepSpeed, mixed precision, gradient accumulation, and gradient checkpointing. |
| 🛡️ Security | |
| 📦 Dependencies | Managed through requirements.txt (transformers, datasets, scikit-learn, and related packages). |
└── GRAG-HessianAI-Training-Pipeline/
    └── GRAG_Hessian_AI_Determined_Training_Pipeline/
        ├── Orpo_attendee.yaml
        ├── README.md
        ├── chat_format.py
        ├── config.yaml
        ├── cpt.yaml
        ├── cpt_finetune.py
        ├── cptold.txt
        ├── dpo.yaml
        ├── dpo_finetune.py
        ├── ds_configs/
        ├── inference.py
        ├── lora.yaml
        ├── lora_finetune.py
        ├── lora_utils.py
        ├── metadata.json
        ├── old_startup-hook.sh
        ├── orpo.yaml
        ├── orpo_finetune.py
        ├── requirements.txt
        ├── sft.yaml
        ├── sft_finetune.py
        ├── startup-hook.sh
        ├── untitled.txt
        ├── utils.py
        └── utils_lora_old.py
GRAG-HESSIANAI-TRAINING-PIPELINE/

GRAG_Hessian_AI_Determined_Training_Pipeline

- Orpo_attendee.yaml: Defines the training pipeline configuration for LLAMA_8B_ORPO_attendee, specifying resources, hyperparameters, and environment settings. Sets up training on specific dataset subsets with arguments such as batch size, learning rate, and evaluation strategy, and configures DeepSpeed optimization and gradient checkpointing.
- cpt_finetune.py: Orchestrates the training pipeline by loading datasets, setting up special tokens, and initializing training. Uses distributed computing and fine-tuning driven by the specified hyperparameters, and ties together external libraries and the project's custom utilities (a hedged sketch of this shared flow appears right after this index).
- requirements.txt: Manages project dependencies by specifying required packages and versions, including transformers, datasets, and scikit-learn.
- lora_finetune.py: Loads and processes datasets for training a conversational AI model, ensures they are in the correct format, applies the necessary transformations, sets up special tokens, and starts training with the specified training arguments and callbacks.
- sft_finetune.py: Runs the supervised fine-tuning pipeline on a self-feeding chat dataset: loads the data, sets up special tokens, formats prompts, and launches training with the given configuration.
- chat_format.py: Generates ChatML templates for user, system, and assistant messages based on predefined roles, and provides helpers to retrieve assistant prompts and template IDs for responses (an illustrative template sketch also follows the index).
- metadata.json: Tracks the progress and identity of a specific trial, recording the number of completed steps and the unique trial ID used to monitor and manage training.
- config.yaml: Defines container bind mounts, environment configuration, and resource allocation for the training pipeline.
- cptold.txt: Legacy training pipeline configuration for a Qwen1.5 model, specifying dataset subsets, model details, hyperparameters, resources, and training settings.
- sft.yaml: Determined AI experiment configuration for supervised fine-tuning, covering the dataset, hyperparameters, resource allocation, environment setup, and training parameters.
- untitled.txt: Patches the HF callback script to handle additional metric types.
- lora.yaml: Experiment configuration for the Nemo_12B_Lora_ORPO_attendee project, covering the dataset, hyperparameters, resource allocation, environment setup, and training configuration.
- utils.py: Handles model retrieval and tokenization for the training pipeline: loads models according to the inference mode, customizes tokenization parameters, downloads model checkpoints, and integrates with Determined for distributed training.
- dpo.yaml: Determined AI experiment configuration for DPO fine-tuning, specifying resources, hyperparameters, environment settings, data subsets, the loss function, and training strategy.
- dpo_finetune.py: Runs DPO fine-tuning on Determined: loads datasets, processes conversation formats, tokenizes the data, and trains the model with distributed support.
- cpt.yaml: Configuration for training with DeepSpeed on the GRAG-CPT-Hessian-AI dataset, fine-tuning the Mistral-Nemo-Base-2407 model; covers data subsets, batch sizes, mixed precision, and gradient accumulation steps.
- utils_lora_old.py: Legacy helpers for model retrieval, tokenizer setup, and checkpoint downloading, including LoRA integration and tokenization handling.
- inference.py: Runs model inference with Determined AI, using a pre-trained model to process conversations and produce outputs; handles input processing, generation, and result storage for real-time inference tasks.
- orpo_finetune.py: Fine-tunes a conversational model with the ORPO technique: handles dataset processing, model setup, and training execution, integrating Determined AI for distributed training and Hugging Face Transformers for model management; the main function starts training from the specified parameters and hyperparameters.
- lora_utils.py: Retrieves pre-trained language models and tokenizers with support for custom model and tokenization configurations, downloads model checkpoints, and defines tokenization functions.
- startup-hook.sh: Startup hook that upgrades dependencies and patches a bug in the Hugging Face callback module.
- orpo.yaml: Configuration for training a mini ORPO model on attendee-specific data subsets, with a specific model and dataset, customized training arguments and hyperparameters, and a DeepSpeed configuration.
- old_startup-hook.sh: Previous startup hook that upgrades dependencies and modifies a specific condition in training-metrics handling.

ds_configs

- ds_config_stage_1.json: Stage 1 DeepSpeed configuration with automatic settings for mixed precision, optimizer, scheduler, zero optimization, gradient accumulation, and gradient clipping; includes options for batch sizes and FLOPs profiling.
- ds_config_stage_2_cpu_offload.json: Stage 2 configuration with CPU offload; sets FP16, the AdamW optimizer, the WarmupLR scheduler, zero-optimization settings for gradient accumulation and clipping, training and per-GPU micro-batch sizes, and an optional FLOPs profiler.
- ds_config_stage_2.json: Stage 2 configuration specifying optimization parameters, gradient accumulation, and zero-optimization strategies.
- ds_config_stage_3.json: Stage 3 configuration for mixed precision, optimizer settings, and zero-optimization parameters, tuning batch sizes, gradient accumulation, and clipping (an illustrative DeepSpeed configuration sketch follows the index).
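The fine-tuning scripts described above (cpt_finetune.py, sft_finetune.py, dpo_finetune.py, orpo_finetune.py, lora_finetune.py) share a common shape: load a dataset, register special tokens, build training arguments, and hand off to a trainer, optionally under a DeepSpeed config from ds_configs/. The sketch below is a minimal, hypothetical illustration of that pattern using Hugging Face Transformers; the model and dataset names are placeholders, and the repository's actual scripts, hyperparameters, and Determined integration will differ.

```python
# Minimal, hypothetical sketch of the shared fine-tuning flow (not the repo's actual code).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "gpt2"  # placeholder; the repo's configs reference models such as Mistral-Nemo-Base-2407


def main() -> None:
    # Tokenizer and model setup; add a pad token when the base model lacks one.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.resize_token_embeddings(len(tokenizer))

    # Load a small public dataset as a stand-in for the GRAG/Hessian-AI subsets.
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Training arguments; uncomment the deepspeed line to reuse a config from ds_configs/
    # when launching under DeepSpeed (e.g. through a Determined experiment).
    args = TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=1,
        gradient_checkpointing=True,
        logging_steps=10,
        # deepspeed="ds_configs/ds_config_stage_2.json",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=collator,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```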
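The ds_configs/ files are DeepSpeed configurations (stages 1 to 3, plus a stage-2 CPU-offload variant). As a rough, hypothetical illustration of what such a stage-2 configuration contains and how a Hugging Face TrainingArguments object can consume it, consider the sketch below; the actual JSON files in the repository will contain different or additional settings.

```python
# Hypothetical ZeRO stage-2 style DeepSpeed config expressed as a Python dict.
# "auto" lets Hugging Face fill values in from TrainingArguments at launch time.
# Running this for real requires the accelerate and deepspeed packages.
from transformers import TrainingArguments

ds_config = {
    "fp16": {"enabled": "auto"},
    "optimizer": {"type": "AdamW", "params": {"lr": "auto", "weight_decay": "auto"}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": "auto"}},
    "zero_optimization": {"stage": 2},
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# TrainingArguments accepts either a path to a JSON file (as in ds_configs/) or a dict.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    deepspeed=ds_config,
)
```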
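chat_format.py is described as producing ChatML-style templates for system, user, and assistant roles. The snippet below is an illustrative stand-in, not the repository's implementation, showing what such role-based formatting typically looks like; the exact marker tokens and helper names in the repo may differ.

```python
# Illustrative ChatML-style formatting; not the repository's chat_format.py.
CHATML_TURN = "<|im_start|>{role}\n{content}<|im_end|>\n"


def build_prompt(messages: list[dict]) -> str:
    """Render {'role', 'content'} messages and open an assistant turn for generation."""
    rendered = "".join(
        CHATML_TURN.format(role=m["role"], content=m["content"]) for m in messages
    )
    return rendered + "<|im_start|>assistant\n"


if __name__ == "__main__":
    demo = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the training pipeline."},
    ]
    print(build_prompt(demo))
```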
Before getting started with GRAG-HessianAI-Training-Pipeline, ensure your runtime environment meets the following requirements:
- Programming Language: Python
- Package Manager: Pip
Install GRAG-HessianAI-Training-Pipeline using one of the following methods:
Build from source:
- Clone the GRAG-HessianAI-Training-Pipeline repository:
❯ git clone ../GRAG-HessianAI-Training-Pipeline
- Navigate to the project directory:
❯ cd GRAG-HessianAI-Training-Pipeline
- Install the project dependencies:
❯ pip install -r GRAG_Hessian_AI_Determined_Training_Pipeline/requirements.txt
Run GRAG-HessianAI-Training-Pipeline using the following command:
❯ python {entrypoint}
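The entrypoint to pass to python depends on the stage you want to run (for example, one of the fine-tuning scripts listed in the project index). Because the YAML files in GRAG_Hessian_AI_Determined_Training_Pipeline are Determined AI experiment configurations, jobs are more typically submitted to a Determined cluster. Assuming you have access to such a cluster and that the GRAG_Hessian_AI_Determined_Training_Pipeline directory serves as the experiment context, a launch would look roughly like:

❯ det experiment create GRAG_Hessian_AI_Determined_Training_Pipeline/sft.yaml GRAG_Hessian_AI_Determined_Training_Pipeline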
Run the test suite using the following command:
❯ pytest
- Task 1: Implement feature one.
- Task 2: Implement feature two.
- Task 3: Implement feature three.
- 💬 Join the Discussions: Share your insights, provide feedback, or ask questions.
- 🐛 Report Issues: Submit bugs found or log feature requests for the GRAG-HessianAI-Training-Pipeline project.
- 💡 Submit Pull Requests: Review open PRs, and submit your own PRs.
Contributing Guidelines
- Fork the Repository: Start by forking the project repository to your own account.
- Clone Locally: Clone the forked repository to your local machine using a git client.
git clone <url-of-your-fork>
- Create a New Branch: Always work on a new branch, giving it a descriptive name.
git checkout -b new-feature-x
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message describing your updates.
git commit -m 'Implemented new feature x.'
- Push to Your Fork: Push the changes to your forked repository.
git push origin new-feature-x
- Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
- Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!
This project is licensed under the Apache License, Version 2.0.
--- Contributors: Marcel Rosiak, Soumya Paul, Siavash Mollaebrahim, Zain Ul Haq