Commit ce1eaa3
Implemented baseline LoRA PEFT with FSDP integration, tested on one node. (#5)
* Implemented baseline LoRA PEFT for one NVIDIA GPU.
* Added support for saving LoRA adapters. Added support for non-FSDP models.
* save_utils: added support for non-FSDP optimizers. trainer: replaced clip_grad_norm_ with nn.utils.clip_grad_norm_ for LoRA compatibility.
* example_lora: highlighted current LoRA (non-FSDP) limitations.
* Added instructions on LoRA on one GPU.
* Added example script for launching LoRA.
* Revised instructions on LoRA on one GPU.
* Implemented LoRA FSDP. Also see https://github.com/facebookresearch/llama-recipes/blob/674b37ee66f59a7845cbc3868948f4d7fa69c679/src/llama_recipes/utils/fsdp_utils.py#L9
* Reverted automatic formatter changes in README.md.
* Eliminated non-FSDP logic from save_utils. Set model path to local copy of llama-2-7b in example config.
* Moved LoRA config out of example config.yaml.
* Implemented LoRA benchmarking logic for worker.
* model_utils: refactored get_lora_model to reduce interface width (this method no longer wraps load_model_and_tokenizer). test_modelling: revised base model fixture scope since torch FSDP wrap is in-place. launch_benchmark: added confirmation before launching.
* test_modelling: moved text output to data/.
* Added example YAML config for LoRA benchmarking.
* launch_benchmark: marked qos flag as optional.
* launch_benchmark: added option to limit the number of jobs launched.
* launch_benchmark: implemented torch profiler integration.
* Merged changes from the low CPU memory usage feature (#6) into jjt/lora-benchmarking.
* Added changes to implement the low CPU memory usage feature.
* Implemented new ruff linting changes and ran a fix across files.
* Revised launch_benchmark.py to use the new profiling path.
* Enabled automatic creation of the data/trace folder.
* Added instructions for profiling tools.
* Cleaned up duplicate imports from merge.
* Cleaned up parse_benchmark.py.
* Integrated LoRA logic into llama_example.py.
* Moved lora_configs into train_parameters in the config YAML. Adjusted docs/config.md accordingly.
* Revised handling of nproc-per-node in the benchmark script.
* Included parameter_count info in benchmark output.
* Implemented basic util for parsing benchmarking output.
* model_utils: enabled low_cpu_mem_usage in the auto model from_pretrained by default.
* launch_lora_benchmark.sh: implemented automatic identification of num_gpus. parse_benchmark: implemented option to specify which benchmark artifact folder to load.
* requirements.txt: included accelerate to support low_cpu_mem loading.
* benchmark.py: adjusted BenchmarkingDataset to avoid a StopIteration exception.
* benchmark.py: added env var flag to toggle export_trace.
* parse_benchmark: included profiler table in output file. launch_benchmark: automated folder creation. launch_lora_benchmark: included model info in SLURM output.
* get_lora_model_from_base_model: enabled PEFT for models loaded via low_cpu_mem. More investigation might be needed.
* model_utils: revised dtype handling for PEFT-wrapped models.
* parse_benchmark: implemented sorting of profiler table output. launch_benchmark: revised default run time limit.
* Merged example_lora into examples/llama_example.py.
* Added instructions related to parse_benchmark.
* parse_benchmark: implemented aggregation across repeated metrics.
* Implemented non-LoRA profiling and benchmarking.
* Various static type-checking and formatting fixes.
* Implemented restoring LoRA train state from the filesystem. During training the adapter weights are saved to and loaded from the filesystem; the base model weights are loaded separately. Revised reference to optim_state_dict_to_load in load_optimizer.
* Included train step number in LoRA adapter output path.
* Added reference throughput table to documentation.
* Added unit description to the reference throughput table. Applied markdown formatting via prettier.
* Benchmark: added option to override max_length of the pre-trained model.
* Deleted unused `accelerate` dependency from requirements.txt.
* Benchmark: added comment on max_length.
* Benchmark: added comment on batch size.
* Benchmark: added option to override batch size.
* Benchmark throughput documentation: revised word choices.
* Moved profiling-tracking logic out of Trainer.
* Eliminated hasattr check related to no_sync since FSDP is always enabled.
* Replaced peft fsdp_auto_wrap_policy to eliminate the implicit `accelerate` dependency. Eliminated redundant bfloat16 type conversion. Fixed scope of placeholder for `is_peft_adapter_restored`.
* Configured the LoRA auto-wrap policy as off by default; the policy is enabled only when LoRA is required.
* Revised punctuation in lora_requires_grad_policy_fn.
* Renamed declarative `enable_lora` to descriptive `is_lora_enabled`.
* Replaced optimizer.load_state_dict with load_sharded_optimizer_state_dict for the PEFT optimizer. Added LoRA/PEFT documentation.
* benchmarking: deleted unused TypeVar in parse_benchmark.py.
* Replaced config getattr and hasattr with dict methods.
* Deleted redundant LoRA-specific launch scripts.
* Added launch_benchmark.sh for throughput benchmarks.
* Benchmark: run `makedirs` only `if __name__ == "__main__"`.
* Replaced PEFT class attributes in Trainer with instance attributes. Added information about the benchmarking environment. Additional formatting fixes.

---------

Co-authored-by: Adil <[email protected]>
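The LoRA FSDP integration referenced above pairs FSDP's transformer auto-wrap policy with a lambda policy that separately wraps the trainable adapter modules, following the llama-recipes helper linked in the commit message. Below is a minimal sketch of that pattern: `lora_requires_grad_policy_fn` is a name taken from the commit message, but its body and the surrounding helper are illustrative assumptions rather than this repository's exact implementation.

```python
# Sketch of a LoRA-aware FSDP auto-wrap policy, modeled on the llama-recipes
# fsdp_utils helper referenced above. Function bodies are assumptions.
import functools

from torch.distributed.fsdp.wrap import (
    _or_policy,
    lambda_auto_wrap_policy,
    transformer_auto_wrap_policy,
)


def lora_requires_grad_policy_fn(module) -> bool:
    # Wrap leaf modules whose weight is trainable (the LoRA adapters), so FSDP
    # keeps frozen base weights and trainable adapter weights in separate units.
    return (
        len(list(module.named_children())) == 0
        and getattr(module, "weight", None) is not None
        and module.weight.requires_grad
    )


def get_lora_auto_wrap_policy(decoder_layer_cls):
    lambda_policy = functools.partial(
        lambda_auto_wrap_policy,
        lambda_fn=lora_requires_grad_policy_fn,
    )
    transformer_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={decoder_layer_cls},
    )
    # Wrap a module whenever either policy matches.
    return functools.partial(
        _or_policy,
        policies=[lambda_policy, transformer_policy],
    )
```

Wrapping the adapters separately is what allows FSDP to flatten frozen and trainable parameters into different units, which the commit notes is required for sharded LoRA training.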
1 parent 8320c48 commit ce1eaa3

26 files changed: +1686 −109 lines

.gitignore

Lines changed: 6 additions & 1 deletion

```diff
@@ -4,4 +4,9 @@
 **/*.sh
 __pycache__/
 wandb/
-build/
+build/
+data/
+**/*.pyc
+/.cache
+/.vscode
+/data
```

README.md

Lines changed: 4 additions & 5 deletions

```diff
@@ -61,11 +61,10 @@ We implement several training optimizations that can be reviewed under [`docs/tr
 
 We have provided an example script to show what a regular workflow would look like for the user. It assumes a preprocessed dataset has already been created. The [`examples/launch.sh`](examples/launch.sh) script begins dense finetuning a Llama-2 7B chat model sharded across a node of 4x A100-80GB GPUs. With the Python environment activated, this can be launched using `sbatch launch.sh`. We also provide a script to launch the same training run in a multinode setting across two A100 nodes at [`examples/launch_multinode.sh`](examples/launch_multinode.sh). Please note that hybrid sharding strategies need to be employed as you scale to multinode settings to minimize communication bottlenecks. More information regarding this can be found in [`docs/config.md`](docs/config.md).
 
-At the end of training, a consolidated model will be saved under your output directory as a `.bin` file. You can simply just run [`vectorlm/utils/convert_to_hf.py`](vectorlm/utils/convert_to_hf.py) to convert it to the regular HuggingFace model format. The script uses the main config file to determine save locations.
-
-## Roadmap
-- PEFT methods (LoRA).
+At the end of training, a consolidated model will be saved under your output directory.
+- If LoRA is enabled, the output will be a PEFT adapter repository that can be loaded directly via [AutoModel.from_pretrained](https://huggingface.co/docs/transformers/main/en/peft#load-a-peft-adapter).
+- Otherwise, the output would be a `.bin` file. You can simply just run [`vectorlm/utils/convert_to_hf.py`](vectorlm/utils/convert_to_hf.py) to convert it to the regular HuggingFace model format. The script uses the main config file to determine save locations.
 
 # Contributors
 
-Adil Asif, Ziwen Han, John Willes.
+Adil Asif, Ziwen Han, John Willes, Jacob-Junqi Tian.
```
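Since the LoRA output described above is a standard PEFT adapter repository, loading it back only takes a few lines. This is a minimal sketch; the adapter path below is a hypothetical example of where a training run might write its end-of-epoch output.

```python
# Loading a saved LoRA adapter directory (path is a hypothetical example).
# With `peft` installed, transformers detects adapter_config.json and loads
# both the base model and the adapter weights.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("output_dir/epoch_9/end-epoch-model")

# Equivalent explicit form via peft:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("output_dir/epoch_9/end-epoch-model")
```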

configs/config.yaml

Lines changed: 10 additions & 0 deletions

```diff
@@ -20,6 +20,16 @@ train_parameters:
   use_flash_attention: True
   low_cpu_mem_usage: True
 
+  # LoRA config: uncomment the block below to enable LoRA
+
+  # lora_peft_config:
+  #   task_type: CAUSAL_LM
+  #   inference_mode: False
+  #   r: 8
+  #   lora_alpha: 32
+  #   lora_dropout: 0.1
+
+
   # Gradient norm clipping
   max_grad_norm: 1
   gradient_accumulation_steps: 4
```

docs/config.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -29,6 +29,7 @@ The key-value pairs stored under `wandb_config` are directly passed into the [`w
 * `use_activation_checkpointing`: Whether to use activation checkpointing. This greatly reduces memory footprint as only a few intermediate activations are saved during the forward pass, and are then recomputed for the backward pass on the fly. However, the tradeoff between compute vs. memory usually makes this worth it.
 * `use_flash_attention`: Whether to use Flash Attention. If it is supported for your model in HuggingFace, you can enable this option.
 * `low_cpu_mem_usage`: Whether to efficiently load the model. If enabled, the model weights are only loaded once on rank 0 and are broadcasted to the rest of the world from the main rank. It will prevent the CPU memory from exploding when loading large models (e.g. LLaMa-70B).
+- `lora_peft_config`: Optionally, fine-tune the model using low-rank adaptation via HuggingFace PEFT. Uncomment this section to enable LoRA. All parameters specified under this section are forwarded to [peft.LoraConfig](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraConfig).
 
 ### Gradient
```
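Because the keys under `lora_peft_config` are forwarded verbatim to `peft.LoraConfig`, the commented-out block in `configs/config.yaml` above corresponds roughly to the sketch below. Here `get_peft_model` stands in for the repository's `get_lora_model_from_base_model` helper (see the `llama_example.py` diff further down), and the model name is only an example.

```python
# Rough equivalent of the example lora_peft_config block; a sketch under the
# assumption that the keys map one-to-one onto peft.LoraConfig arguments.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_peft_config = {
    "task_type": "CAUSAL_LM",
    "inference_mode": False,
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
}

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
model = get_peft_model(base_model, LoraConfig(**lora_peft_config))
model.print_trainable_parameters()  # only the adapter parameters require grad
```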

docs/reference_throughput.md

Lines changed: 33 additions & 0 deletions

# Reference Throughput

We've benchmarked VectorLM on the Vaughan cluster for a number of model architectures across a variety of node configurations.
In experiments labelled as LoRA, we set hidden dimension to 8. During the testing, the NVIDIA driver version was 525.105.17, CUDA Runtime 12.1.105, and torch 2.2.2.

For consistency, we use a batch size of 8 and the maximum context length that the pre-trained LLM supports, capped at 65536. Note that especially for smaller models, it might be possible to further increase throughput by switching to a larger batch size.

Entries that read NaN represent combinations where the node configuration does not have enough GPU memory for the training run to complete. An exception is gemma-2b, which currently does not support full-rank FSDP fine-tuning.

All values in the table below represent the median training throughput in tokens per second across all training steps, aggregated across all GPU devices.

|                                      | Llama-2-13b-hf | Llama-2-7b-hf | Mistral-7B-v0.1 | Mixtral-8x7B-Instruct-v0.1 | gemma-2b | opt-350m |
| :----------------------------------- | -------------: | ------------: | --------------: | -------------------------: | -------: | -------: |
| (full_rank) NVIDIA A100-SXM4-80GB x1 | 424.726 | 570.818 | 528.747 | nan | nan | 780.045 |
| (full_rank) NVIDIA A100-SXM4-80GB x2 | 660.355 | 919.19 | 794.566 | 275.459 | nan | 1227.67 |
| (full_rank) NVIDIA A100-SXM4-80GB x4 | 1309.4 | 1744.39 | 1577.09 | 817.162 | nan | 2181.46 |
| (full_rank) NVIDIA A40 x1 | nan | 47.6435 | 107.503 | nan | nan | 666.881 |
| (full_rank) NVIDIA A40 x2 | nan | 313.074 | 322.624 | nan | nan | 854.672 |
| (full_rank) NVIDIA A40 x4 | 345.96 | 570.977 | 553.658 | nan | nan | 1765.49 |
| (full_rank) Tesla T4 x1 | nan | nan | nan | nan | nan | 475.51 |
| (full_rank) Tesla T4 x2 | nan | nan | nan | nan | nan | 768.008 |
| (full_rank) Tesla T4 x4 | nan | nan | nan | nan | nan | 1383.6 |
| (full_rank) Tesla T4 x8 | nan | nan | nan | nan | nan | 2414.68 |
| (lora) NVIDIA A100-SXM4-80GB x1 | 560.167 | 646.801 | 525.802 | nan | 851.678 | 859.379 |
| (lora) NVIDIA A100-SXM4-80GB x2 | 871.993 | 1157.17 | 1105.68 | 239.431 | 1724.57 | 1463.82 |
| (lora) NVIDIA A100-SXM4-80GB x4 | 1783.53 | 2091.03 | 2150.06 | 1309.74 | 2719.24 | 2381.01 |
| (lora) NVIDIA A40 x1 | 272.931 | 435.386 | 336.507 | nan | 983.256 | 652.611 |
| (lora) NVIDIA A40 x2 | 105.442 | 457.183 | 356.263 | nan | 725.723 | 1136.17 |
| (lora) NVIDIA A40 x4 | 543.22 | 715.416 | 642.642 | nan | 1302.62 | 1647.57 |
| (lora) Tesla T4 x1 | nan | nan | nan | nan | 148.272 | 571.471 |
| (lora) Tesla T4 x2 | nan | 101.126 | 102.859 | nan | 256.534 | 811.159 |
| (lora) Tesla T4 x4 | nan | 188.575 | 190.127 | nan | 495.755 | 1506.05 |
| (lora) Tesla T4 x8 | 196.709 | 372.375 | 351.361 | nan | 897.81 | 2945.86 |
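For readers who want to reproduce the headline number, the following sketch shows one way to compute the metric as defined above (median per-step tokens per second, summed over ranks). This is an interpretation of the description, not the benchmark's actual bookkeeping code.

```python
# Sketch: median tokens/second across training steps, aggregated over all ranks.
from statistics import median


def median_throughput(step_durations_s, batch_size, seq_len, world_size):
    tokens_per_step_per_rank = batch_size * seq_len
    return median(
        tokens_per_step_per_rank * world_size / duration
        for duration in step_durations_s
    )


# e.g. 4 GPUs, batch size 8, 4096-token context, per-step wall times in seconds
print(median_throughput([1.9, 2.0, 2.1], batch_size=8, seq_len=4096, world_size=4))
```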

examples/launch.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -23,4 +23,4 @@ export LOGLEVEL=INFO
 export PYTHONFAULTHANDLER=1
 # export CUDA_LAUNCH_BLOCKING=0
 
-torchrun --nnodes=1 --nproc-per-node=4 llama_example.py --yaml_path ../configs/config.yaml
+torchrun --nnodes=1 --nproc-per-node=${SLURM_GPUS_ON_NODE} llama_example.py --yaml_path ../configs/config.yaml
```

examples/llama_example.py

Lines changed: 51 additions & 6 deletions

```diff
@@ -1,3 +1,5 @@
+from __future__ import annotations
+
 import argparse
 import math
 import os
@@ -9,15 +11,24 @@
 from torch.optim import AdamW
 from tqdm import tqdm
 from transformers import set_seed
-from transformers.models.llama.modeling_llama import LlamaDecoderLayer
 
 from vectorlm.dataset import Dataset
 from vectorlm.trainer import Trainer
 from vectorlm.utils.data_utils import Config
 from vectorlm.utils.misc_utils import cleanup, setup, wandb_setup
-from vectorlm.utils.model_utils import load_model_and_tokenizer, shard_model
+from vectorlm.utils.model_utils import (
+    get_lora_model_from_base_model,
+    get_submodule_by_pattern,
+    load_model_and_tokenizer,
+    shard_model,
+)
 from vectorlm.utils.optimizer_utils import get_custom_scheduler
-from vectorlm.utils.save_utils import save_consolidated_model
+from vectorlm.utils.save_utils import (
+    checkpoint_exists,
+    get_latest_checkpoint_dir,
+    save_consolidated_model,
+    save_peft_adapter,
+)
 
 
 def parse_args() -> Namespace:
@@ -30,7 +41,9 @@ def parse_args() -> Namespace:
     """
     parser = argparse.ArgumentParser()
     parser.add_argument(
-        "--yaml_path", default="configs/config.yaml", required=False,
+        "--yaml_path",
+        default="configs/config.yaml",
+        required=False,
     )
     return parser.parse_args()
 
@@ -67,14 +80,40 @@ def main(config: Config) -> None:
         training_args.low_cpu_mem_usage,
     )
 
+    lora_peft_config = config.train_parameters.get("lora_peft_config")
+    is_peft_adapter_restored = False
+    is_lora_enabled = False
+    if lora_peft_config is not None:
+        is_lora_enabled = True
+        peft_adapter_path = None
+        # Restore peft adapter from filesystem if available.
+        if checkpoint_exists(training_args.output_dir):
+            peft_adapter_path = os.path.join(
+                training_args.output_dir,
+                "checkpoints",
+                get_latest_checkpoint_dir(
+                    os.path.join(training_args.output_dir, "checkpoints"),
+                ),
+            )
+            is_peft_adapter_restored = True
+
+        model = get_lora_model_from_base_model(
+            model,
+            lora_peft_config,
+            peft_adapter_path,
+        )
+
+    decoder_layer_module = get_submodule_by_pattern(model, r"DecoderLayer$")
+    assert decoder_layer_module is not None, f"No DecoderLayer found in {model}"
     model = shard_model(
         model,
-        LlamaDecoderLayer,
+        decoder_layer_module,
         training_args.use_mp,
         training_args.use_activation_checkpointing,
         training_args.sharding_strategy,
         local_rank,
         training_args.low_cpu_mem_usage,
+        is_lora_enabled,
     )
 
     # load dataset
@@ -112,6 +151,7 @@ def main(config: Config) -> None:
         dataset,
         optimizer,
         lr_scheduler,
+        is_peft_adapter_restored,
    )
 
     # Checkpoint check. Always call before training.
@@ -138,9 +178,14 @@ def main(config: Config) -> None:
             f"epoch_{epoch}",
             "end-epoch-model",
         )
-        save_consolidated_model(trainer.model, hf_save_dir, rank)
+
+        if is_lora_enabled:
+            save_peft_adapter(trainer.model, hf_save_dir)
+        else:
+            save_consolidated_model(trainer.model, hf_save_dir, rank)
         dataset.reset_dataloaders()
 
+
 if __name__ == "__main__":
     args = parse_args()
     config = Config(yaml_path=args.yaml_path)
```
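The `get_submodule_by_pattern(model, r"DecoderLayer$")` call above replaces the hard-coded `LlamaDecoderLayer` import so that FSDP wrapping works for any architecture whose decoder block follows the `*DecoderLayer` naming convention. A minimal sketch of what such a helper might look like is below; the real implementation lives in `vectorlm/utils/model_utils.py` and may differ.

```python
# Hypothetical sketch of a regex-based submodule-class lookup.
from __future__ import annotations

import re

import torch.nn as nn


def get_submodule_by_pattern(model: nn.Module, pattern: str) -> type[nn.Module] | None:
    """Return the class of the first submodule whose class name matches pattern."""
    for submodule in model.modules():
        if re.search(pattern, type(submodule).__name__):
            return type(submodule)
    return None
```

For a Llama-2 checkpoint this resolves to `LlamaDecoderLayer`, which `shard_model` then uses as the FSDP transformer auto-wrap target.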

profiling/README.md

Lines changed: 20 additions & 0 deletions

# Profiling Utils

To modify the specific SLURM resource types to benchmark, adjust the launcher script `launch_benchmark.py` as needed. Modify `profiling/configs/lora-benchmark.yaml` to adjust parameters such as batch size and token width.

On the Vector cluster, run the following to launch the benchmarks:

```bash
$ mkdir data/
$ python3 launch_benchmark.py

# The launcher script will print a list of
# SLURM commands it plans to run. Press ENTER
# to accept and automatically invoke the commands.
```

After the SLURM jobs complete, profiler output can be found under `data/benchmark`. Invoke the following to generate a Markdown summary of the results:

```bash
$ python3 profiling/parse_benchmark.py --folder data/benchmark
```
