Commit

Convert to package

artemisp committed Mar 22, 2024
1 parent e4a84e8 commit 3bc98a7
Showing 23 changed files with 263 additions and 42 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -24,4 +24,7 @@
# Ignore predictions files
**predictions**

**mrqa**
**mrqa**

build/
parallelm.egg-info/
30 changes: 17 additions & 13 deletions README.md
@@ -6,7 +6,7 @@ The codebase is based on top of [`PyTorch Lightning`](https://lightning.ai/docs/

Templates are developed to be compatible with [`balance-my-slurm`](https://github.com/artemisp/balance-my-slurm/tree/main) so check it out! 🧐

An example config file to run is provided in `src/configs/train/llama_mrqa.py`. Make sure to download the data before running it.
An example config file to run is provided in `parallelm/configs/train/llama_mrqa.py`. Make sure to download the data before running it.
You can do so as follows:
```
>> mkdir mrqa
@@ -15,7 +15,7 @@ An example config file to run is provided in `src/configs/train/llama_mrqa.py`.
```
Then you can train the model by:
```
srun --gpus 1 --nodes 1 --mem-per-cpu 12GB --constraint 48GBgpu --ntasks-per-node 1 --cpus-per-gpu 10 /nlp/data/artemisp/mambaforge/envs/test_me/bin/python src/pl_ft.py --cfg /nlp/data/artemisp/multigpu-lm-templates/src/configs/train/llama_mrqa.py
srun --gpus 1 --nodes 1 --mem-per-cpu 12GB --constraint 48GBgpu --ntasks-per-node 1 --cpus-per-gpu 10 /nlp/data/artemisp/mambaforge/envs/test_me/bin/python parallelm/pl_ft.py --cfg /nlp/data/artemisp/multigpu-lm-templates/parallelm/configs/train/llama_mrqa.py
```

<a name="toc"></a>
@@ -53,6 +53,8 @@ In installing `PyTorch` we assume `CUDA` version ~~12.0~~ 12.1 are compatible wi
>> conda activate test_me
>> conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
>> python -m pip install -r requirements.txt
>> cd ParalleLM
>> python -m pip install -e .
```


@@ -63,6 +65,8 @@ In installing `PyTorch` we assume `CUDA` version ~~12.0~~ 12.1 are compatible wi
>> conda activate test_me
>> conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
>> python -m pip install -r requirements.txt
>> cd ParalleLM
>> python -m pip install -e .
```

If you want to use a faster (like a LOT FASTER) package manager built on top
@@ -73,7 +77,7 @@ If you want to use a faster (like a LOT FASTER) package manager built on top
<a name="skeleton"></a>
## 📁 Files and Skeleton
```
├── src
├── parallelm
│   ├── common
│   │   ├── checkpoint_utils.py # utility functions for checkpointing
│   ├── configs
@@ -101,7 +105,7 @@ Now let's look at each of them in turn:
### 📊 Data: The Heart of Your NLP Adventure! 🚀
This module handles dataloading, preprocessing, and post-processing.

#### `data/pl_dataloaders.py`
#### `parallelm.data.pl_dataloaders`

* `CustomDataset`: A subclass of `torch.utils.data.Dataset`, designed for flexible data handling. It supports initialization with datasets in various formats, optional tokenizer integration, and custom preprocessing. It is used by `CustomDataModule`.
* `CustomDataModule`: Extends `pl.LightningDataModule` to organize data loading, preprocessing, and setup for different phases like training and validation. It supports distributed training and custom tokenization and preprocessing workflows.
@@ -137,17 +141,17 @@ This module handles dataloading, preprocessing, and post-processing.
batch_size (int, optional): The batch size to use for training and inference. Defaults to `None`.
"""

#### `data/postprocessing.py`
#### `parallelm.data.postprocessing`
Define postprocessing functions here. They are accessed in the `models.pl_modules.CustomModule` prediction step and are selected in the config by name, e.g. `datamodule_kwargs: {"postproc_fn": <func_name>}`. Each postprocessing function accepts a single string.
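
As an illustration, here is a hedged example of such a function. The name `strip_inst_answer` is hypothetical; only the string-in/string-out contract comes from the description above:
```
# Hypothetical example for parallelm/data/postprocessing.py; the only contract
# taken from the docs is: one string in, one string out.
def strip_inst_answer(text: str) -> str:
    """Keep only the text generated after the closing [/INST] tag and trim whitespace."""
    if "[/INST]" in text:
        text = text.split("[/INST]", 1)[1]
    return text.strip()
```
It would then be referenced as `{"postproc_fn": "strip_inst_answer"}` in the config.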

#### `data/preprocessing.py`
#### `parallelm.data.preprocessing`
Define preprocessing functions here. It is used by `data.pl_dataloaders.CustomDataModule` for template formatting, and tokenization. The relevant arguments in the config are `preprocessing_kwargs` and `tokenization_kwargs`.
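
To make the template step concrete, here is a hedged illustration of how `column_dict` and `input_template` appear to combine (behaviour inferred from the example configs, not the exact implementation):
```
# Assumed behaviour, for illustration only: the columns listed under
# column_dict["inputs"] fill the {} slots of input_template in order, and
# column_dict["target"] fills target_template.
example = {"definition": "Answer the question.", "inputs": "Who wrote Hamlet?", "targets": "Shakespeare"}

preprocessing_kwargs = {
    "column_dict": {"inputs": ["definition", "inputs"], "target": "targets"},
    "input_template": "[INST] {} {} [/INST]",
    "target_template": "{}",
}

model_input = preprocessing_kwargs["input_template"].format(
    *[example[c] for c in preprocessing_kwargs["column_dict"]["inputs"]]
)
model_target = preprocessing_kwargs["target_template"].format(
    example[preprocessing_kwargs["column_dict"]["target"]]
)
# model_input  -> "[INST] Answer the question. Who wrote Hamlet? [/INST]"
# model_target -> "Shakespeare"
```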

<a name="data"></a>
### Models 🤖


#### `models/pl_modules.py`
#### `parallelm.models.pl_modules`
* `CustomModule`: A `lightning` wrapper around a `transformers` model that allows for training in `LoRA` or prefix tuning mode using the implementation from [here](https://github.com/kipgparker/soft-prompt-tuning/blob/main/soft_embedding.py) as well as quantization. It allows for distributed training, and high control of the training processes. The `training_step` method can be adapted for different loss functions if necessary.
"""
A PyTorch Lightning module that encapsulates a Hugging Face transformer model
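
For intuition, here is a simplified, hedged sketch of the soft-prompt (prefix-tuning) idea that `CustomModule` supports — a stand-in for, not a copy of, the linked `SoftEmbedding` implementation:
```
# Simplified stand-in (hypothetical class, not the repository's SoftEmbedding):
# n_prefix_tokens learned vectors are prepended to the input embeddings, and only
# those vectors receive gradients while the base model stays frozen.
import torch
from torch import nn

class TinySoftEmbedding(nn.Module):
    def __init__(self, wte: nn.Embedding, n_prefix_tokens: int = 30):
        super().__init__()
        self.wte = wte  # the (frozen) embedding table of the base model
        self.prefix = nn.Parameter(torch.randn(n_prefix_tokens, wte.embedding_dim) * 0.02)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        tok = self.wte(input_ids)                                    # (B, T, D)
        pre = self.prefix.unsqueeze(0).expand(tok.size(0), -1, -1)   # (B, n, D)
        return torch.cat([pre, tok], dim=1)                          # (B, n+T, D)
```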
@@ -240,8 +244,8 @@ Define preprocessing functions here. It is used by `data.pl_dataloaders.CustomDa
Okay.. all good till now, but of course you want to take some control. Don't you worry! A lot of the things you would want to do can simply be achieved by changing a single variable in the config!

Two example configs are provided in
* `/nlp/data/artemisp/multigpu-lm-templates/src/configs/train/llama_mrqa.py`
* `/nlp/data/artemisp/multigpu-lm-templates/src/configs/base.py`
* `/nlp/data/artemisp/multigpu-lm-templates/parallelm/configs/train/llama_mrqa.py`
* `/nlp/data/artemisp/multigpu-lm-templates/parallelm/configs/base.py`

## General Configuration
* `output_dir`: Specifies the current working directory of the project. This is used as a base to construct paths for data, outputs, and logs.
@@ -471,19 +475,19 @@ The most important is the `column_dict`. It populates the `input_template`/`targ
# Ready to train?

You can submit a batch job:
`sbatch src/slurm_scripts/run_ft.sh --cfg path/to/your/config` to run on 8 GPUs, and
`sbatch src/slurm_scripts/run_ft_1gpu.sh --cfg path/to/your/config` to run on one. You can modify the scripts accordingly.
`sbatch parallelm/slurm_scripts/run_ft.sh --cfg path/to/your/config` to run on 8 GPUs, and
`sbatch parallelm/slurm_scripts/run_ft_1gpu.sh --cfg path/to/your/config` to run on one. You can modify the scripts accordingly.

For interactive debugging, do the following:
` srun --gpus 1 --nodes 1 --mem-per-cpu 12GB --constraint 48GBgpu --ntasks-per-node 1 --cpus-per-gpu 10 python src/pl_ft.py --cfg /path/to/your/config`
` srun --gpus 1 --nodes 1 --mem-per-cpu 12GB --constraint 48GBgpu --ntasks-per-node 1 --cpus-per-gpu 10 python parallelm/pl_ft.py --cfg /path/to/your/config`


<a name="evaluate"></a>
# Ready to evaluate?

Create a config which defines `resume_from_checkpoint`, passed both in `module_kwargs` and at the top level of the config. Then specify your `metrics` and `output_dir`. Finally, either pass a raw dataset with a `predict` split, or specify your prediction split in `datamodule_kwargs`.
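
A minimal, hedged sketch of such an evaluation config — the option names mirror the training configs in this repository, while the checkpoint path and dataset values are placeholders:
```
# Hedged sketch of a prediction config; paths and the dataset id are placeholders.
output_dir = "output/llama2/lora/mrqa_eval"
resume_from_checkpoint = "output/llama2/lora/natural_instructions_200k/last.ckpt"
metrics = ["bleu"]

datamodule_kwargs = {
    "raw_data": "mrqa",          # any dataset with a `predict` split also works
    "predict_split": "dev",
    "batch_size": 4,
}

module_kwargs = {
    "model_name": "meta-llama/Llama-2-7b-hf",
    "resume_from_checkpoint": resume_from_checkpoint,  # also set here, per the note above
    "postproc_fn": "identity",
}
```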

Run `sbatch src/slurm_scripts/run_predict_1gpu.sh --cfg path/to/your/config` to run on one gpu or select the 4gpu script for faster inference. You can modify the scripts accordingly.
Run `sbatch parallelm/slurm_scripts/run_predict_1gpu.sh --cfg path/to/your/config` to run on one gpu or select the 4gpu script for faster inference. You can modify the scripts accordingly.


## How to Cite
File renamed without changes.
188 changes: 188 additions & 0 deletions configs/base_llama70b.py
@@ -0,0 +1,188 @@
import os
proj_dir=os.getcwd()

seed=42
debug=True
strategy='ddp'

prefix_tuning=False
prefix_tokens=30

# Output directory
output_dir = f'{os.getenv("OUTPUT_DIR", f"{proj_dir}/output")}/llama2/lora/natural_instructions_200k'
resume_from_checkpoint = None
metrics = ['bleu']

raw_data = "Muennighoff/natural-instructions"


preprocessing_kwargs = {
    "remove_html": False,
    "pad_punctuation": False,
    "drop_tables": False,
    "column_dict": {"inputs": ["definition", "inputs"], "target": "targets"},
    "input_template": "[INST] {} {} [/INST]",
    "target_template": "{}",
    "concat_input_output": True,
    "keep_columns": ["definition", "input", "target", "context_aware_embeds"],
}


tokenization_kwargs = {
    "tokenizer_name": 'meta-llama/Llama-2-70b-hf',
    "max_input_length": 1024,
    "max_target_length": 1024,
    "padding": "max_length",
    "truncation": True,
    "concat_input_output": True,
    "prefix_tuning": prefix_tuning,
    "n_prefix_tokens": prefix_tokens,
    "decoder_prefix": False,
    "pad_token": 'unk_token'
}

# Datamodule Arguments
datamodule_kwargs = {
    "debug": debug,
    "strategy": strategy,
    "raw_data": raw_data,
    "deduplicate_columns": ["id"],
    "load_from_cache_file": False,
    "num_workers": 12,
    "batch_size": 1,
    "shots": 10000,
    "dev_from_train": -1,  # set to -1 to use the dev file for validation, else subsample from train
    "overfit": False,
    "dev_size": 1024,
    "tiny": False,
    "tiny_size": 1024,
    "filter_long_sequences": True,
    "preprocessing_kwargs": preprocessing_kwargs,
    "tokenization_kwargs": tokenization_kwargs,
    "batch_tokenize": True,
    "predict_split": 'dev',
}


## logger arguments
logger_type=None
logger_kwargs = {
    'name': 'llama2/lora/natural_instructions_200k',
    'save_dir': os.getenv("OUTPUT_DIR", f"{proj_dir}/wandb_logs"),
    'project': os.getenv("WANDB_PROJ_NAME", "test"),
    'log_model': False,
    'resume': os.getenv("WANDB_RESUME", "allow"),
}

optimizer_config = {
    "lr": 1e-4,
    "eps": 1e-8,
    "weight_decay": 1e-4,
    "scheduler": "CosineAnnealingLR",
}

lora_config = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ['q_proj', 'v_proj', 'k_proj', 'lm_head'],
    "task_type": "CAUSAL_LM",
}


quantization_config = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16"
}

generation_kwargs = {
    "max_new_tokens": 30,
    "min_new_tokens": 1,
    "num_return_sequences": 1,
    "do_sample": False,
}

# Model Arguments
module_kwargs = {
    "model_name": 'meta-llama/Llama-2-7b-hf',
    "optimizer": 'AdamW',
    "auto_model_class": "AutoModelForCausalLM",
    "prefix_tuning": prefix_tuning,
    "n_prefix_tokens": prefix_tokens,
    "initialize_from_vocab": False,

    "optimizer_type": "AdamW",
    "optimizer_config": optimizer_config,
    "gradient_checkpointing": True,
    "quantization_precision": 4,
    "precision": "bf16",
    "tokenization_kwargs": tokenization_kwargs,

    "lora": True,
    "lora_config": lora_config,
    "quantization": True,
    "quantization_config": quantization_config,

    "generation_kwargs": generation_kwargs,

    "freeze_encoder": False,
    "freeze_encoder_layers": [],
    "freeze_decoder": False,
    "freeze_decoder_layers": [],
    "keep_in_fp32_modules": [],
    "resume_from_checkpoint": resume_from_checkpoint,
    "postproc_fn": "identity",
}


# Callbacks
checkpoint_callback=True
checkpoint_callback_kwargs = {
    "dirpath": output_dir,
    "verbose": True,
    "monitor": "val_loss",
    "mode": "min",
    "save_last": True,
    "save_top_k": 1,
    "every_n_train_steps": 10,
    "save_on_train_epoch_end": False
}

# Trainer Arguments
accelerator='auto'
devices="auto"
num_nodes=1
precision="bf16-mixed"
fast_dev_run=False
max_epochs=1
min_epochs=None
max_steps=100000
min_steps=1000
max_time=None
limit_train_batches=None
limit_val_batches=None
limit_test_batches=None
limit_predict_batches=None
overfit_batches=0.0
val_check_interval=.1
check_val_every_n_epoch=1
num_sanity_val_steps=0
log_every_n_steps=50
enable_progress_bar=True
enable_model_summary=True
accumulate_grad_batches=4
gradient_clip_val=0.3
gradient_clip_algorithm='norm'
deterministic=None
benchmark=None
inference_mode=True
profiler=None
detect_anomaly=False
barebones=False
sync_batchnorm=strategy in ['ddp', 'fsdp','fsdp_native', 'ddp_find_unused_parameters_true']
reload_dataloaders_every_n_epochs=0
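
Configs like the one above are plain Python modules whose top-level names are read by the training script. As a hedged sketch of that pattern (an assumption about the mechanism, not necessarily what `parallelm/pl_ft.py` does), such a file can be loaded programmatically like this:
```
# Hedged sketch: load a flat Python config file as a module and read its
# top-level names. Illustrates the pattern only; it is an assumption, not
# necessarily how pl_ft.py consumes --cfg.
import importlib.util

def load_config(path: str):
    spec = importlib.util.spec_from_file_location("cfg", path)
    cfg = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(cfg)  # executes the config file's top-level assignments
    return cfg

cfg = load_config("configs/base_llama70b.py")
print(cfg.module_kwargs["model_name"], cfg.max_steps)
```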
1 change: 1 addition & 0 deletions parallelm/__init__.py
@@ -0,0 +1 @@
name = 'parallelm'
1 change: 1 addition & 0 deletions parallelm/common/__init__.py
@@ -0,0 +1 @@
name = 'parallelm'
File renamed without changes.
1 change: 1 addition & 0 deletions parallelm/data/__init__.py
@@ -0,0 +1 @@
name = 'parallelm'
File renamed without changes.
@@ -18,7 +18,7 @@
cache_dir = os.getenv('CACHE_DIR', "./.cache")

sys.path.append(os.getcwd())
from src.data.preprocessing import get_inputs_and_targets, tokenize_inputs_and_targets, batch_tokenize_inputs_and_targets
from parallelm.data.preprocessing import get_inputs_and_targets, tokenize_inputs_and_targets, batch_tokenize_inputs_and_targets


class CustomDataset(Dataset):
File renamed without changes.
@@ -2,7 +2,7 @@

import torch
from transformers import AutoTokenizer
from src.data.data_utils import (
from parallelm.data.data_utils import (
_remove_html,
_pad_punctuation,
_filter_na,
11 changes: 6 additions & 5 deletions src/models/pl_modules.py → parallelm/models/pl_modules.py
@@ -5,17 +5,16 @@

import os
import sys
sys.path.append(os.getcwd())

from torch import nn
from tqdm import tqdm
from dotenv import load_dotenv

from src.models.soft_embedding import SoftEmbedding
from src.common.checkpoint_utils import trim_lora, trim_prefix
from parallelm.models.soft_embedding import SoftEmbedding
from parallelm.common.checkpoint_utils import trim_lora, trim_prefix
from transformers import BitsAndBytesConfig, AutoTokenizer, LlamaTokenizer, T5Config, AutoModel
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
import src.data.postprocessing as postprocessing
import parallelm.data.postprocessing as postprocessing

# Load the variables from the .env file
load_dotenv(os.getcwd()+'/.env')
@@ -315,7 +314,9 @@ def configure_optimizers(self):
        if self.optimizer_type == 'Adafactor':
            from transformers import Adafactor
            optimizer = Adafactor(self.trainer.model.parameters(), **self.optimizer_config)

        if 'bnb' in self.optimizer_type:
            import bitsandbytes as bnb
            optimizer = getattr(bnb.optim, self.optimizer_type.split('.')[-1])(self.trainer.model.parameters(), **self.optimizer_config)
        if self.optimizer_config.get('scheduler', None):
            scheduler = getattr(torch.optim.lr_scheduler, scheduler)
            return [optimizer], [scheduler(optimizer, **scheduler_config)]
File renamed without changes.