Commit

Convert to package

artemisp committed Mar 22, 2024
1 parent e4a84e8 commit 3bc98a7
Showing 23 changed files with 263 additions and 42 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -24,4 +24,7 @@
# Ignore predictions files
**predictions**

**mrqa**
**mrqa**

build/
parallelm.egg-info/
30 changes: 17 additions & 13 deletions README.md
@@ -6,7 +6,7 @@ The codebase is based on top of [`PyTorch Lightning`](https://lightning.ai/docs/

Templates are developed to be compatible with [`balance-my-slurm`](https://github.com/artemisp/balance-my-slurm/tree/main) so check it out! 🧐

An example config file to run is provided in `src/configs/train/llama_mrqa.py`. Make sure to download the data before running it.
An example config file to run is provided in `parallelm/configs/train/llama_mrqa.py`. Make sure to download the data before running it.
You can do so as follows:
```
>> mkdir mrqa
@@ -15,7 +15,7 @@ An example config file to run is provided in `src/configs/train/llama_mrqa.py`.
```
Then you can train the model by:
```
srun --gpus 1 --nodes 1 --mem-per-cpu 12GB --constraint 48GBgpu --ntasks-per-node 1 --cpus-per-gpu 10 /nlp/data/artemisp/mambaforge/envs/test_me/bin/python src/pl_ft.py --cfg /nlp/data/artemisp/multigpu-lm-templates/src/configs/train/llama_mrqa.py
srun --gpus 1 --nodes 1 --mem-per-cpu 12GB --constraint 48GBgpu --ntasks-per-node 1 --cpus-per-gpu 10 /nlp/data/artemisp/mambaforge/envs/test_me/bin/python parallelm/pl_ft.py --cfg /nlp/data/artemisp/multigpu-lm-templates/parallelm/configs/train/llama_mrqa.py
```

<a name="toc"></a>
@@ -53,6 +53,8 @@ In installing `PyTorch` we assume `CUDA` version ~~12.0~~ 12.1 are compatible wi
>> conda activate test_me
>> conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
>> python -m pip install -r requirements.txt
>> cd ParalleLM
>> python -m pip install -e .
```


@@ -63,6 +65,8 @@ In installing `PyTorch` we assume `CUDA` version ~~12.0~~ 12.1 are compatible wi
>> conda activate test_me
>> conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
>> python -m pip install -r requirements.txt
>> cd ParalleLM
>> python -m pip install -e .
```

If you want to use a faster (like a LOT FASTER) package manager built on top
@@ -73,7 +77,7 @@ If you want to use a faster (like a LOT FASTER) package manager built on top
<a name="skeleton"></a>
## 📁 Files and Skeleton
```
├── src
├── parallelm
│   ├── common
│   │   ├── checkpoint_utils.py # utility functions for checkpointing
│   ├── configs
@@ -101,7 +105,7 @@ Now let's look at each of them in turn:
### 📊 Data: The Heart of Your NLP Adventure! 🚀
This module handles dataloading, preprocessing, and post-processing.

#### `data/pl_dataloaders.py`
#### `parallelm.data.pl_dataloaders`

* `CustomDataset`: A subclass of `torch.utils.data.Dataset`, designed for flexible data handling. It supports initialization with datasets in various formats, optional tokenizer integration, and custom preprocessing. It is used by `CustomDataModule`.
* `CustomDataModule`: Extends `pl.LightningDataModule` to organize data loading, preprocessing, and setup for different phases like training and validation. It supports distributed training and custom tokenization and preprocessing workflows.
@@ -137,17 +141,17 @@ This module handles dataloading, preprocessing, and post-processing.
batch_size (int, optional): The batch size to use for training and inference. Defaults to `None`.
"""

#### `data/postprocessing.py`
#### `parallelm.data.postprocessing`
Define postprocessing functions here. They are accessed in the `models.pl_modules.CustomModule` prediction step and are selected in the config by name, e.g. `datamodule_kwargs: {"postproc_fn": <func_name>}`. Each postprocessing function accepts a single string.
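
As an illustration, here is a hedged example of such a function. The name `strip_inst_answer` is hypothetical; only the string-in/string-out contract comes from the description above:
```
# Hypothetical example for parallelm/data/postprocessing.py; the only contract
# taken from the docs is: one string in, one string out.
def strip_inst_answer(text: str) -> str:
    """Keep only the text generated after the closing [/INST] tag and trim whitespace."""
    if "[/INST]" in text:
        text = text.split("[/INST]", 1)[1]
    return text.strip()
```
It would then be referenced as `{"postproc_fn": "strip_inst_answer"}` in the config.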

#### `data/preprocessing.py`
#### `parallelm.data.preprocessing`
Define preprocessing functions here. It is used by `data.pl_dataloaders.CustomDataModule` for template formatting, and tokenization. The relevant arguments in the config are `preprocessing_kwargs` and `tokenization_kwargs`.
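
To make the template step concrete, here is a hedged illustration of how `column_dict` and `input_template` appear to combine (behaviour inferred from the example configs, not the exact implementation):
```
# Assumed behaviour, for illustration only: the columns listed under
# column_dict["inputs"] fill the {} slots of input_template in order, and
# column_dict["target"] fills target_template.
example = {"definition": "Answer the question.", "inputs": "Who wrote Hamlet?", "targets": "Shakespeare"}

preprocessing_kwargs = {
    "column_dict": {"inputs": ["definition", "inputs"], "target": "targets"},
    "input_template": "[INST] {} {} [/INST]",
    "target_template": "{}",
}

model_input = preprocessing_kwargs["input_template"].format(
    *[example[c] for c in preprocessing_kwargs["column_dict"]["inputs"]]
)
model_target = preprocessing_kwargs["target_template"].format(
    example[preprocessing_kwargs["column_dict"]["target"]]
)
# model_input  -> "[INST] Answer the question. Who wrote Hamlet? [/INST]"
# model_target -> "Shakespeare"
```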

<a name="data"></a>
### Models 🤖


#### `models/pl_modules.py`
#### `parallelm.models.pl_modules`
* `CustomModule`: A `lightning` wrapper around a `transformers` model that allows for training in `LoRA` or prefix tuning mode using the implementation from [here](https://github.com/kipgparker/soft-prompt-tuning/blob/main/soft_embedding.py) as well as quantization. It allows for distributed training, and high control of the training processes. The `training_step` method can be adapted for different loss functions if necessary.
"""
A PyTorch Lightning module that encapsulates a Hugging Face transformer model
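
For intuition, here is a simplified, hedged sketch of the soft-prompt (prefix-tuning) idea that `CustomModule` supports — a stand-in for, not a copy of, the linked `SoftEmbedding` implementation:
```
# Simplified stand-in (hypothetical class, not the repository's SoftEmbedding):
# n_prefix_tokens learned vectors are prepended to the input embeddings, and only
# those vectors receive gradients while the base model stays frozen.
import torch
from torch import nn

class TinySoftEmbedding(nn.Module):
    def __init__(self, wte: nn.Embedding, n_prefix_tokens: int = 30):
        super().__init__()
        self.wte = wte  # the (frozen) embedding table of the base model
        self.prefix = nn.Parameter(torch.randn(n_prefix_tokens, wte.embedding_dim) * 0.02)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        tok = self.wte(input_ids)                                    # (B, T, D)
        pre = self.prefix.unsqueeze(0).expand(tok.size(0), -1, -1)   # (B, n, D)
        return torch.cat([pre, tok], dim=1)                          # (B, n+T, D)
```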
@@ -240,8 +244,8 @@ Define preprocessing functions here. It is used by `data.pl_dataloaders.CustomDa
Okay.. all good till now, but of course you want to take some control. Don't you worry! A lot of the things you would want to do can simply be achieved by changing a single variable in the config!

Two example configs are provided in
* `/nlp/data/artemisp/multigpu-lm-templates/src/configs/train/llama_mrqa.py`
* `/nlp/data/artemisp/multigpu-lm-templates/src/configs/base.py`
* `/nlp/data/artemisp/multigpu-lm-templates/parallelm/configs/train/llama_mrqa.py`
* `/nlp/data/artemisp/multigpu-lm-templates/parallelm/configs/base.py`

## General Configuration
* `output_dir`: Specifies the current working directory of the project. This is used as a base to construct paths for data, outputs, and logs.
@@ -471,19 +475,19 @@ The most important is the `column_dict`. It populates the `input_template`/`targ
# Ready to train?

You can submit a batch job:
`sbatch src/slurm_scripts/run_ft.sh --cfg path/to/your/config` to run on 8 GPUs, and
`sbatch src/slurm_scripts/run_ft_1gpu.sh --cfg path/to/your/config` to run on one. You can modify the scripts accordingly.
`sbatch parallelm/slurm_scripts/run_ft.sh --cfg path/to/your/config` to run on 8 GPUs, and
`sbatch parallelm/slurm_scripts/run_ft_1gpu.sh --cfg path/to/your/config` to run on one. You can modify the scripts accordingly.

For interactive debugging, do the following:
` srun --gpus 1 --nodes 1 --mem-per-cpu 12GB --constraint 48GBgpu --ntasks-per-node 1 --cpus-per-gpu 10 python src/pl_ft.py --cfg /path/to/your/config`
` srun --gpus 1 --nodes 1 --mem-per-cpu 12GB --constraint 48GBgpu --ntasks-per-node 1 --cpus-per-gpu 10 python parallelm/pl_ft.py --cfg /path/to/your/config`


<a name="evaluate"></a>
# Ready to evaluate?

Create a config which defines `resume_from_checkpoint`, passed both in `module_kwargs` and at the top level of the config. Then specify your `metrics` and `output_dir`. Finally, either pass a raw dataset with a `predict` split, or specify your prediction split in `datamodule_kwargs`.
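
A minimal, hedged sketch of such an evaluation config — the option names mirror the training configs in this repository, while the checkpoint path and dataset values are placeholders:
```
# Hedged sketch of a prediction config; paths and the dataset id are placeholders.
output_dir = "output/llama2/lora/mrqa_eval"
resume_from_checkpoint = "output/llama2/lora/natural_instructions_200k/last.ckpt"
metrics = ["bleu"]

datamodule_kwargs = {
    "raw_data": "mrqa",          # any dataset with a `predict` split also works
    "predict_split": "dev",
    "batch_size": 4,
}

module_kwargs = {
    "model_name": "meta-llama/Llama-2-7b-hf",
    "resume_from_checkpoint": resume_from_checkpoint,  # also set here, per the note above
    "postproc_fn": "identity",
}
```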

Run `sbatch src/slurm_scripts/run_predict_1gpu.sh --cfg path/to/your/config` to run on one gpu or select the 4gpu script for faster inference. You can modify the scripts accordingly.
Run `sbatch parallelm/slurm_scripts/run_predict_1gpu.sh --cfg path/to/your/config` to run on one gpu or select the 4gpu script for faster inference. You can modify the scripts accordingly.


## How to Cite
File renamed without changes.
188 changes: 188 additions & 0 deletions configs/base_llama70b.py
@@ -0,0 +1,188 @@
import os
proj_dir=os.getcwd()

seed=42
debug=True
strategy='ddp'

prefix_tuning=False
prefix_tokens=30

# Output directory
output_dir = f'{os.getenv("OUTPUT_DIR", f"{proj_dir}/output")}/llama2/lora/natural_instructions_200k'
resume_from_checkpoint = None
metrics = ['bleu']

raw_data = "Muennighoff/natural-instructions"


preprocessing_kwargs = {
    "remove_html": False,
    "pad_punctuation": False,
    "drop_tables": False,
    "column_dict": {"inputs": ["definition", "inputs"], "target": "targets"},
    "input_template": "[INST] {} {} [/INST]",
    "target_template": "{}",
    "concat_input_output": True,
    "keep_columns": ["definition", "input", "target", "context_aware_embeds"],
}


tokenization_kwargs = {
    "tokenizer_name": 'meta-llama/Llama-2-70b-hf',
    "max_input_length": 1024,
    "max_target_length": 1024,
    "padding": "max_length",
    "truncation": True,
    "concat_input_output": True,
    "prefix_tuning": prefix_tuning,
    "n_prefix_tokens": prefix_tokens,
    "decoder_prefix": False,
    "pad_token": 'unk_token'
}

# Datamodule Arguments
datamodule_kwargs = {
    "debug": debug,
    "strategy": strategy,
    "raw_data": raw_data,
    "deduplicate_columns": ["id"],
    "load_from_cache_file": False,
    "num_workers": 12,
    "batch_size": 1,
    "shots": 10000,
    "dev_from_train": -1,  # set to -1 to use the dev file for validation, else subsample from train
    "overfit": False,
    "dev_size": 1024,
    "tiny": False,
    "tiny_size": 1024,
    "filter_long_sequences": True,
    "preprocessing_kwargs": preprocessing_kwargs,
    "tokenization_kwargs": tokenization_kwargs,
    "batch_tokenize": True,
    "predict_split": 'dev',
}


## logger arguments
logger_type=None
logger_kwargs = {
    'name': 'llama2/lora/natural_instructions_200k',
    'save_dir': os.getenv("OUTPUT_DIR", f"{proj_dir}/wandb_logs"),
    'project': os.getenv("WANDB_PROJ_NAME", "test"),
    'log_model': False,
    'resume': os.getenv("WANDB_RESUME", "allow"),
}

optimizer_config = {
    "lr": 1e-4,
    "eps": 1e-8,
    "weight_decay": 1e-4,
    "scheduler": "CosineAnnealingLR",
}

lora_config = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ['q_proj', 'v_proj', 'k_proj', 'lm_head'],
    "task_type": "CAUSAL_LM",
}


quantization_config = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16"
}

generation_kwargs = {
    "max_new_tokens": 30,
    "min_new_tokens": 1,
    "num_return_sequences": 1,
    "do_sample": False,
}

# Model Arguments
module_kwargs = {
    "model_name": 'meta-llama/Llama-2-7b-hf',
    "optimizer": 'AdamW',
    "auto_model_class": "AutoModelForCausalLM",
    "prefix_tuning": prefix_tuning,
    "n_prefix_tokens": prefix_tokens,
    "initialize_from_vocab": False,

    "optimizer_type": "AdamW",
    "optimizer_config": optimizer_config,
    "gradient_checkpointing": True,
    "quantization_precision": 4,
    "precision": "bf16",
    "tokenization_kwargs": tokenization_kwargs,

    "lora": True,
    "lora_config": lora_config,
    "quantization": True,
    "quantization_config": quantization_config,

    "generation_kwargs": generation_kwargs,

    "freeze_encoder": False,
    "freeze_encoder_layers": [],
    "freeze_decoder": False,
    "freeze_decoder_layers": [],
    "keep_in_fp32_modules": [],
    "resume_from_checkpoint": resume_from_checkpoint,
    "postproc_fn": "identity",
}


# Callbacks
checkpoint_callback=True
checkpoint_callback_kwargs = {
    "dirpath": output_dir,
    "verbose": True,
    "monitor": "val_loss",
    "mode": "min",
    "save_last": True,
    "save_top_k": 1,
    "every_n_train_steps": 10,
    "save_on_train_epoch_end": False
}

# Trainer Arguments
accelerator='auto'
devices="auto"
num_nodes=1
precision="bf16-mixed"
fast_dev_run=False
max_epochs=1
min_epochs=None
max_steps=100000
min_steps=1000
max_time=None
limit_train_batches=None
limit_val_batches=None
limit_test_batches=None
limit_predict_batches=None
overfit_batches=0.0
val_check_interval=.1
check_val_every_n_epoch=1
num_sanity_val_steps=0
log_every_n_steps=50
enable_progress_bar=True
enable_model_summary=True
accumulate_grad_batches=4
gradient_clip_val=0.3
gradient_clip_algorithm='norm'
deterministic=None
benchmark=None
inference_mode=True
profiler=None
detect_anomaly=False
barebones=False
sync_batchnorm=strategy in ['ddp', 'fsdp','fsdp_native', 'ddp_find_unused_parameters_true']
reload_dataloaders_every_n_epochs=0
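
Configs like the one above are plain Python modules whose top-level names are read by the training script. As a hedged sketch of that pattern (an assumption about the mechanism, not necessarily what `parallelm/pl_ft.py` does), such a file can be loaded programmatically like this:
```
# Hedged sketch: load a flat Python config file as a module and read its
# top-level names. Illustrates the pattern only; it is an assumption, not
# necessarily how pl_ft.py consumes --cfg.
import importlib.util

def load_config(path: str):
    spec = importlib.util.spec_from_file_location("cfg", path)
    cfg = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(cfg)  # executes the config file's top-level assignments
    return cfg

cfg = load_config("configs/base_llama70b.py")
print(cfg.module_kwargs["model_name"], cfg.max_steps)
```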
1 change: 1 addition & 0 deletions parallelm/__init__.py
@@ -0,0 +1 @@
name = 'parallelm'
1 change: 1 addition & 0 deletions parallelm/common/__init__.py
@@ -0,0 +1 @@
name = 'parallelm'
File renamed without changes.
1 change: 1 addition & 0 deletions parallelm/data/__init__.py
@@ -0,0 +1 @@
name = 'parallelm'
File renamed without changes.
@@ -18,7 +18,7 @@
cache_dir = os.getenv('CACHE_DIR', "./.cache")

sys.path.append(os.getcwd())
from src.data.preprocessing import get_inputs_and_targets, tokenize_inputs_and_targets, batch_tokenize_inputs_and_targets
from parallelm.data.preprocessing import get_inputs_and_targets, tokenize_inputs_and_targets, batch_tokenize_inputs_and_targets


class CustomDataset(Dataset):
File renamed without changes.
@@ -2,7 +2,7 @@

import torch
from transformers import AutoTokenizer
from src.data.data_utils import (
from parallelm.data.data_utils import (
_remove_html,
_pad_punctuation,
_filter_na,
11 changes: 6 additions & 5 deletions src/models/pl_modules.py → parallelm/models/pl_modules.py
@@ -5,17 +5,16 @@

import os
import sys
sys.path.append(os.getcwd())

from torch import nn
from tqdm import tqdm
from dotenv import load_dotenv

from src.models.soft_embedding import SoftEmbedding
from src.common.checkpoint_utils import trim_lora, trim_prefix
from parallelm.models.soft_embedding import SoftEmbedding
from parallelm.common.checkpoint_utils import trim_lora, trim_prefix
from transformers import BitsAndBytesConfig, AutoTokenizer, LlamaTokenizer, T5Config, AutoModel
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
import src.data.postprocessing as postprocessing
import parallelm.data.postprocessing as postprocessing

# Load the variables from the .env file
load_dotenv(os.getcwd()+'/.env')
@@ -315,7 +314,9 @@ def configure_optimizers(self):
        if self.optimizer_type == 'Adafactor':
            from transformers import Adafactor
            optimizer = Adafactor(self.trainer.model.parameters(), **self.optimizer_config)

        if 'bnb' in self.optimizer_type:
            import bitsandbytes as bnb
            optimizer = getattr(bnb.optim, self.optimizer_type.split('.')[-1])(self.trainer.model.parameters(), **self.optimizer_config)
        if self.optimizer_config.get('scheduler', None):
            scheduler = getattr(torch.optim.lr_scheduler, scheduler)
            return [optimizer], [scheduler(optimizer, **scheduler_config)]
File renamed without changes.