Commit ce1eaa3
Implemented baseline LoRA PEFT with FSDP integration, tested on one node. (#5)
* Implemented baseline LoRA PEFT for one NVIDIA GPU.
* Added support for saving LoRA adapters. Added support for non-FSDP models.
* save_utils: added support for non-FSDP optimizers. trainer: replaced clip_grad_norm_ with nn.utils.clip_grad_norm_ for LoRA compatibility.
* example_lora: highlighted current LoRA (non-FSDP) limitations.
* Added instructions on LoRA on one GPU.
* Added example script for launching LoRA.
* Revised instructions on LoRA on one GPU.
* Implemented LoRA FSDP. Also see https://github.com/facebookresearch/llama-recipes/blob/674b37ee66f59a7845cbc3868948f4d7fa69c679/src/llama_recipes/utils/fsdp_utils.py#L9
* Reverted automatic formatter changes in README.md.
* Eliminated non-FSDP logic from save_utils. Set model path to local copy of llama-2-7b in example config.
* Moved LoRA config out of example config.yaml.
* Implemented LoRA benchmarking logic for worker.
* model_utils: refactored get_lora_model to reduce interface width (this method no longer wraps load_model_and_tokenizer). test_modelling: revised base model fixture scope since torch FSDP wrap is in-place. launch_benchmark: added confirmation before launching.
* test_modelling: moved text output to data/.
* Added example YAML config for LoRA benchmarking.
* launch_benchmark: marked qos flag as optional.
* launch_benchmark: added option to limit the number of jobs launched.
* launch_benchmark: implemented torch profiler integration.
* Merged changes from the low CPU memory usage feature (#6) into jjt/lora-benchmarking.
* Added changes to implement the low CPU memory usage feature.
* Implemented new ruff linting changes and ran a fix across files.
* Revised launch_benchmark.py to use the new profiling path.
* Enabled automatic creation of the data/trace folder.
* Added instructions for profiling tools.
* Cleaned up duplicate imports from merge.
* Cleaned up parse_benchmark.py.
* Integrated LoRA logic into llama_example.py.
* Moved lora_configs into train_parameters in the config YAML. Adjusted docs/config.md accordingly.
* Revised handling of nproc-per-node in the benchmark script.
* Included parameter_count info in benchmark output.
* Implemented basic util for parsing benchmarking output.
* model_utils: enabled low_cpu_mem_usage in the auto model from_pretrained by default.
* launch_lora_benchmark.sh: implemented automatic identification of num_gpus. parse_benchmark: implemented option to specify which benchmark artifact folder to load.
* requirements.txt: included accelerate to support low_cpu_mem loading.
* benchmark.py: adjusted BenchmarkingDataset to avoid a StopIteration exception.
* benchmark.py: added env var flag to toggle export_trace.
* parse_benchmark: included profiler table in output file. launch_benchmark: automated folder creation. launch_lora_benchmark: included model info in SLURM output.
* get_lora_model_from_base_model: enabled PEFT for models loaded via low_cpu_mem. More investigation might be needed.
* model_utils: revised dtype handling for PEFT-wrapped models.
* parse_benchmark: implemented sorting of profiler table output. launch_benchmark: revised default run time limit.
* Merged example_lora into examples/llama_example.py.
* Added instructions related to parse_benchmark.
* parse_benchmark: implemented aggregation across repeated metrics.
* Implemented non-LoRA profiling and benchmarking.
* Various static type-checking and formatting fixes.
* Implemented restoring LoRA train state from the filesystem. During training the adapter weights are saved to and loaded from the filesystem; the base model weights are loaded separately. Revised reference to optim_state_dict_to_load in load_optimizer.
* Included train step number in LoRA adapter output path.
* Added reference throughput table to documentation.
* Added unit description to the reference throughput table. Applied markdown formatting via prettier.
* Benchmark: added option to override max_length of the pre-trained model.
* Deleted unused `accelerate` dependency from requirements.txt.
* Benchmark: added comment on max_length.
* Benchmark: added comment on batch size.
* Benchmark: added option to override batch size.
* Benchmark throughput documentation: revised word choices.
* Moved profiling-tracking logic out of Trainer.
* Eliminated hasattr check related to no_sync since FSDP is always enabled.
* Replaced peft fsdp_auto_wrap_policy to eliminate the implicit `accelerate` dependency. Eliminated redundant bfloat16 type conversion. Fixed scope of placeholder for `is_peft_adapter_restored`.
* Configured the LoRA auto-wrap policy as off by default; the policy is enabled only when LoRA is required.
* Revised punctuation in lora_requires_grad_policy_fn.
* Renamed declarative `enable_lora` to descriptive `is_lora_enabled`.
* Replaced optimizer.load_state_dict with load_sharded_optimizer_state_dict for the PEFT optimizer. Added LoRA/PEFT documentation.
* benchmarking: deleted unused TypeVar in parse_benchmark.py.
* Replaced config getattr and hasattr with dict methods.
* Deleted redundant LoRA-specific launch scripts.
* Added launch_benchmark.sh for throughput benchmarks.
* Benchmark: run `makedirs` only `if __name__ == "__main__"`.
* Replaced PEFT class attributes in Trainer with instance attributes. Added information about the benchmarking environment. Additional formatting fixes.

---------

Co-authored-by: Adil <[email protected]>
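The LoRA FSDP integration referenced above pairs FSDP's transformer auto-wrap policy with a lambda policy that separately wraps the trainable adapter modules, following the llama-recipes helper linked in the commit message. Below is a minimal sketch of that pattern: `lora_requires_grad_policy_fn` is a name taken from the commit message, but its body and the surrounding helper are illustrative assumptions rather than this repository's exact implementation.

```python
# Sketch of a LoRA-aware FSDP auto-wrap policy, modeled on the llama-recipes
# fsdp_utils helper referenced above. Function bodies are assumptions.
import functools

from torch.distributed.fsdp.wrap import (
    _or_policy,
    lambda_auto_wrap_policy,
    transformer_auto_wrap_policy,
)


def lora_requires_grad_policy_fn(module) -> bool:
    # Wrap leaf modules whose weight is trainable (the LoRA adapters), so FSDP
    # keeps frozen base weights and trainable adapter weights in separate units.
    return (
        len(list(module.named_children())) == 0
        and getattr(module, "weight", None) is not None
        and module.weight.requires_grad
    )


def get_lora_auto_wrap_policy(decoder_layer_cls):
    lambda_policy = functools.partial(
        lambda_auto_wrap_policy,
        lambda_fn=lora_requires_grad_policy_fn,
    )
    transformer_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={decoder_layer_cls},
    )
    # Wrap a module whenever either policy matches.
    return functools.partial(
        _or_policy,
        policies=[lambda_policy, transformer_policy],
    )
```

Wrapping the adapters separately is what allows FSDP to flatten frozen and trainable parameters into different units, which the commit notes is required for sharded LoRA training.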
1 parent 8320c48 commit ce1eaa3

26 files changed: +1686 −109 lines

.gitignore

Lines changed: 6 additions & 1 deletion

```diff
@@ -4,4 +4,9 @@
 **/*.sh
 __pycache__/
 wandb/
-build/
+build/
+data/
+**/*.pyc
+/.cache
+/.vscode
+/data
```

README.md

Lines changed: 4 additions & 5 deletions

```diff
@@ -61,11 +61,10 @@ We implement several training optimizations that can be reviewed under [`docs/tr
 
 We have provided an example script to show what a regular workflow would look like for the user. It assumes a preprocessed dataset has already been created. The [`examples/launch.sh`](examples/launch.sh) script begins dense finetuning a Llama-2 7B chat model sharded across a node of 4x A100-80GB GPUs. With the Python environment activated, this can be launched using `sbatch launch.sh`. We also provide a script to launch the same training run in a multinode setting across two A100 nodes at [`examples/launch_multinode.sh`](examples/launch_multinode.sh). Please note that hybrid sharding strategies need to be employed as you scale to multinode settings to minimize communication bottlenecks. More information regarding this can be found in [`docs/config.md`](docs/config.md).
 
-At the end of training, a consolidated model will be saved under your output directory as a `.bin` file. You can simply just run [`vectorlm/utils/convert_to_hf.py`](vectorlm/utils/convert_to_hf.py) to convert it to the regular HuggingFace model format. The script uses the main config file to determine save locations.
-
-## Roadmap
-- PEFT methods (LoRA).
+At the end of training, a consolidated model will be saved under your output directory.
+- If LoRA is enabled, the output will be a PEFT adapter repository that can be loaded directly via [AutoModel.from_pretrained](https://huggingface.co/docs/transformers/main/en/peft#load-a-peft-adapter).
+- Otherwise, the output would be a `.bin` file. You can simply just run [`vectorlm/utils/convert_to_hf.py`](vectorlm/utils/convert_to_hf.py) to convert it to the regular HuggingFace model format. The script uses the main config file to determine save locations.
 
 # Contributors
 
-Adil Asif, Ziwen Han, John Willes.
+Adil Asif, Ziwen Han, John Willes, Jacob-Junqi Tian.
```
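Since the LoRA output described above is a standard PEFT adapter repository, loading it back only takes a few lines. This is a minimal sketch; the adapter path below is a hypothetical example of where a training run might write its end-of-epoch output.

```python
# Loading a saved LoRA adapter directory (path is a hypothetical example).
# With `peft` installed, transformers detects adapter_config.json and loads
# both the base model and the adapter weights.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("output_dir/epoch_9/end-epoch-model")

# Equivalent explicit form via peft:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("output_dir/epoch_9/end-epoch-model")
```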

configs/config.yaml

Lines changed: 10 additions & 0 deletions

```diff
@@ -20,6 +20,16 @@ train_parameters:
   use_flash_attention: True
   low_cpu_mem_usage: True
 
+  # LoRA config: uncomment the block below to enable LoRA
+
+  # lora_peft_config:
+  #   task_type: CAUSAL_LM
+  #   inference_mode: False
+  #   r: 8
+  #   lora_alpha: 32
+  #   lora_dropout: 0.1
+
+
   # Gradient norm clipping
   max_grad_norm: 1
   gradient_accumulation_steps: 4
```

docs/config.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -29,6 +29,7 @@ The key-value pairs stored under `wandb_config` are directly passed into the [`w
 * `use_activation_checkpointing`: Whether to use activation checkpointing. This greatly reduces memory footprint as only a few intermediate activations are saved during the forward pass, and are then recomputed for the backward pass on the fly. However, the tradeoff between compute vs. memory usually makes this worth it.
 * `use_flash_attention`: Whether to use Flash Attention. If it is supported for your model in HuggingFace, you can enable this option.
 * `low_cpu_mem_usage`: Whether to efficiently load the model. If enabled, the model weights are only loaded once on rank 0 and are broadcasted to the rest of the world from the main rank. It will prevent the CPU memory from exploding when loading large models (e.g. LLaMa-70B).
+- `lora_peft_config`: Optionally, fine-tune the model using low-rank adaptation via HuggingFace PEFT. Uncomment this section to enable LoRA. All parameters specified under this section are forwarded to [peft.LoraConfig](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraConfig).
 
 ### Gradient
```
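Because the keys under `lora_peft_config` are forwarded verbatim to `peft.LoraConfig`, the commented-out block in `configs/config.yaml` above corresponds roughly to the sketch below. Here `get_peft_model` stands in for the repository's `get_lora_model_from_base_model` helper (see the `llama_example.py` diff further down), and the model name is only an example.

```python
# Rough equivalent of the example lora_peft_config block; a sketch under the
# assumption that the keys map one-to-one onto peft.LoraConfig arguments.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_peft_config = {
    "task_type": "CAUSAL_LM",
    "inference_mode": False,
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
}

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
model = get_peft_model(base_model, LoraConfig(**lora_peft_config))
model.print_trainable_parameters()  # only the adapter parameters require grad
```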

docs/reference_throughput.md

Lines changed: 33 additions & 0 deletions

# Reference Throughput

We've benchmarked VectorLM on the Vaughan cluster for a number of model architectures across a variety of node configurations.
In experiments labelled as LoRA, we set hidden dimension to 8. During the testing, the NVIDIA driver version was 525.105.17, CUDA Runtime 12.1.105, and torch 2.2.2.

For consistency, we use a batch size of 8 and the maximum context length that the pre-trained LLM supports, capped at 65536. Note that especially for smaller models, it might be possible to further increase throughput by switching to a larger batch size.

Entries that read NaN represent combinations where the node configuration does not have enough GPU memory for the training run to complete. An exception is gemma-2b, which currently does not support full-rank FSDP fine-tuning.

All values in the table below represent the median training throughput in tokens per second across all training steps, aggregated across all GPU devices.

|                                      | Llama-2-13b-hf | Llama-2-7b-hf | Mistral-7B-v0.1 | Mixtral-8x7B-Instruct-v0.1 | gemma-2b | opt-350m |
| :----------------------------------- | -------------: | ------------: | --------------: | -------------------------: | -------: | -------: |
| (full_rank) NVIDIA A100-SXM4-80GB x1 | 424.726 | 570.818 | 528.747 | nan | nan | 780.045 |
| (full_rank) NVIDIA A100-SXM4-80GB x2 | 660.355 | 919.19 | 794.566 | 275.459 | nan | 1227.67 |
| (full_rank) NVIDIA A100-SXM4-80GB x4 | 1309.4 | 1744.39 | 1577.09 | 817.162 | nan | 2181.46 |
| (full_rank) NVIDIA A40 x1 | nan | 47.6435 | 107.503 | nan | nan | 666.881 |
| (full_rank) NVIDIA A40 x2 | nan | 313.074 | 322.624 | nan | nan | 854.672 |
| (full_rank) NVIDIA A40 x4 | 345.96 | 570.977 | 553.658 | nan | nan | 1765.49 |
| (full_rank) Tesla T4 x1 | nan | nan | nan | nan | nan | 475.51 |
| (full_rank) Tesla T4 x2 | nan | nan | nan | nan | nan | 768.008 |
| (full_rank) Tesla T4 x4 | nan | nan | nan | nan | nan | 1383.6 |
| (full_rank) Tesla T4 x8 | nan | nan | nan | nan | nan | 2414.68 |
| (lora) NVIDIA A100-SXM4-80GB x1 | 560.167 | 646.801 | 525.802 | nan | 851.678 | 859.379 |
| (lora) NVIDIA A100-SXM4-80GB x2 | 871.993 | 1157.17 | 1105.68 | 239.431 | 1724.57 | 1463.82 |
| (lora) NVIDIA A100-SXM4-80GB x4 | 1783.53 | 2091.03 | 2150.06 | 1309.74 | 2719.24 | 2381.01 |
| (lora) NVIDIA A40 x1 | 272.931 | 435.386 | 336.507 | nan | 983.256 | 652.611 |
| (lora) NVIDIA A40 x2 | 105.442 | 457.183 | 356.263 | nan | 725.723 | 1136.17 |
| (lora) NVIDIA A40 x4 | 543.22 | 715.416 | 642.642 | nan | 1302.62 | 1647.57 |
| (lora) Tesla T4 x1 | nan | nan | nan | nan | 148.272 | 571.471 |
| (lora) Tesla T4 x2 | nan | 101.126 | 102.859 | nan | 256.534 | 811.159 |
| (lora) Tesla T4 x4 | nan | 188.575 | 190.127 | nan | 495.755 | 1506.05 |
| (lora) Tesla T4 x8 | 196.709 | 372.375 | 351.361 | nan | 897.81 | 2945.86 |
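For readers who want to reproduce the headline number, the following sketch shows one way to compute the metric as defined above (median per-step tokens per second, summed over ranks). This is an interpretation of the description, not the benchmark's actual bookkeeping code.

```python
# Sketch: median tokens/second across training steps, aggregated over all ranks.
from statistics import median


def median_throughput(step_durations_s, batch_size, seq_len, world_size):
    tokens_per_step_per_rank = batch_size * seq_len
    return median(
        tokens_per_step_per_rank * world_size / duration
        for duration in step_durations_s
    )


# e.g. 4 GPUs, batch size 8, 4096-token context, per-step wall times in seconds
print(median_throughput([1.9, 2.0, 2.1], batch_size=8, seq_len=4096, world_size=4))
```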

examples/launch.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -23,4 +23,4 @@ export LOGLEVEL=INFO
 export PYTHONFAULTHANDLER=1
 # export CUDA_LAUNCH_BLOCKING=0
 
-torchrun --nnodes=1 --nproc-per-node=4 llama_example.py --yaml_path ../configs/config.yaml
+torchrun --nnodes=1 --nproc-per-node=${SLURM_GPUS_ON_NODE} llama_example.py --yaml_path ../configs/config.yaml
```

examples/llama_example.py

Lines changed: 51 additions & 6 deletions

```diff
@@ -1,3 +1,5 @@
+from __future__ import annotations
+
 import argparse
 import math
 import os
@@ -9,15 +11,24 @@
 from torch.optim import AdamW
 from tqdm import tqdm
 from transformers import set_seed
-from transformers.models.llama.modeling_llama import LlamaDecoderLayer
 
 from vectorlm.dataset import Dataset
 from vectorlm.trainer import Trainer
 from vectorlm.utils.data_utils import Config
 from vectorlm.utils.misc_utils import cleanup, setup, wandb_setup
-from vectorlm.utils.model_utils import load_model_and_tokenizer, shard_model
+from vectorlm.utils.model_utils import (
+    get_lora_model_from_base_model,
+    get_submodule_by_pattern,
+    load_model_and_tokenizer,
+    shard_model,
+)
 from vectorlm.utils.optimizer_utils import get_custom_scheduler
-from vectorlm.utils.save_utils import save_consolidated_model
+from vectorlm.utils.save_utils import (
+    checkpoint_exists,
+    get_latest_checkpoint_dir,
+    save_consolidated_model,
+    save_peft_adapter,
+)
 
 
 def parse_args() -> Namespace:
@@ -30,7 +41,9 @@ def parse_args() -> Namespace:
     """
     parser = argparse.ArgumentParser()
     parser.add_argument(
-        "--yaml_path", default="configs/config.yaml", required=False,
+        "--yaml_path",
+        default="configs/config.yaml",
+        required=False,
     )
     return parser.parse_args()
 
@@ -67,14 +80,40 @@ def main(config: Config) -> None:
         training_args.low_cpu_mem_usage,
     )
 
+    lora_peft_config = config.train_parameters.get("lora_peft_config")
+    is_peft_adapter_restored = False
+    is_lora_enabled = False
+    if lora_peft_config is not None:
+        is_lora_enabled = True
+        peft_adapter_path = None
+        # Restore peft adapter from filesystem if available.
+        if checkpoint_exists(training_args.output_dir):
+            peft_adapter_path = os.path.join(
+                training_args.output_dir,
+                "checkpoints",
+                get_latest_checkpoint_dir(
+                    os.path.join(training_args.output_dir, "checkpoints"),
+                ),
+            )
+            is_peft_adapter_restored = True
+
+        model = get_lora_model_from_base_model(
+            model,
+            lora_peft_config,
+            peft_adapter_path,
+        )
+
+    decoder_layer_module = get_submodule_by_pattern(model, r"DecoderLayer$")
+    assert decoder_layer_module is not None, f"No DecoderLayer found in {model}"
     model = shard_model(
         model,
-        LlamaDecoderLayer,
+        decoder_layer_module,
         training_args.use_mp,
         training_args.use_activation_checkpointing,
         training_args.sharding_strategy,
         local_rank,
         training_args.low_cpu_mem_usage,
+        is_lora_enabled,
     )
 
     # load dataset
@@ -112,6 +151,7 @@ def main(config: Config) -> None:
         dataset,
         optimizer,
         lr_scheduler,
+        is_peft_adapter_restored,
    )
 
     # Checkpoint check. Always call before training.
@@ -138,9 +178,14 @@ def main(config: Config) -> None:
             f"epoch_{epoch}",
             "end-epoch-model",
         )
-        save_consolidated_model(trainer.model, hf_save_dir, rank)
+
+        if is_lora_enabled:
+            save_peft_adapter(trainer.model, hf_save_dir)
+        else:
+            save_consolidated_model(trainer.model, hf_save_dir, rank)
         dataset.reset_dataloaders()
 
+
 if __name__ == "__main__":
     args = parse_args()
     config = Config(yaml_path=args.yaml_path)
```
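The `get_submodule_by_pattern(model, r"DecoderLayer$")` call above replaces the hard-coded `LlamaDecoderLayer` import so that FSDP wrapping works for any architecture whose decoder block follows the `*DecoderLayer` naming convention. A minimal sketch of what such a helper might look like is below; the real implementation lives in `vectorlm/utils/model_utils.py` and may differ.

```python
# Hypothetical sketch of a regex-based submodule-class lookup.
from __future__ import annotations

import re

import torch.nn as nn


def get_submodule_by_pattern(model: nn.Module, pattern: str) -> type[nn.Module] | None:
    """Return the class of the first submodule whose class name matches pattern."""
    for submodule in model.modules():
        if re.search(pattern, type(submodule).__name__):
            return type(submodule)
    return None
```

For a Llama-2 checkpoint this resolves to `LlamaDecoderLayer`, which `shard_model` then uses as the FSDP transformer auto-wrap target.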

profiling/README.md

Lines changed: 20 additions & 0 deletions

# Profiling Utils

To modify the specific SLURM resource types to benchmark, adjust the launcher script `launch_benchmark.py` as needed. Modify `profiling/configs/lora-benchmark.yaml` to adjust parameters such as batch size and token width.

On the Vector cluster, run the following to launch the benchmarks:

```bash
$ mkdir data/
$ python3 launch_benchmark.py

# The launcher script will print a list of
# SLURM commands it plans to run. Press ENTER
# to accept and automatically invoke the commands.
```

After the SLURM jobs complete, profiler output can be found under `data/benchmark`. Invoke the following to generate a Markdown summary of the results:

```bash
$ python3 profiling/parse_benchmark.py --folder data/benchmark
```
