
Finetuning LLM workspace template failed with OOM for LoRA/Llama70B #174

Open · sudhirn-anyscale opened this issue Apr 16, 2024 · 2 comments

@sudhirn-anyscale

Launched the fine-tuning job as follows and it failed with an OOM error for Llama-2-70B. Full job log:
ray_job_log_job_eqeqt513ex4xy1sgwgcjk8ag1i.log

$ python main.py job_compute_configs/aws.yaml training_configs/lora/llama-2-70b-4k-4xg5_48xlarge.yaml

Error

   result = forward_call(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 268, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB (GPU 4; 21.99 GiB total capacity; 16.79 GiB already allocated; 907.38 MiB free; 20.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
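
(For reference, the error message itself points at an allocator tweak that can be tried when relaunching the job. A minimal sketch, assuming the same launch command as above; the 128 MiB split size is only an illustrative value, not one from the template, and this does not address the root cause discussed below.)

$ export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
$ python main.py job_compute_configs/aws.yaml training_configs/lora/llama-2-70b-4k-4xg5_48xlarge.yaml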

@pcmoritz
Contributor

I'm not sure why nobody answered this issue, but I ran into a similar problem recently, and the cause was the template -- even though the config claimed it was using LoRA, it was actually missing the lora_config section. The template has since been updated and now looks like this:

# Change this to the model you want to fine-tune
model_id: meta-llama/Meta-Llama-3-70B-Instruct

# Change this to the path to your training data
train_path: s3://air-example-data/gsm8k/train.jsonl

# Change this to the path to your validation data. This is optional
valid_path: s3://air-example-data/gsm8k/test.jsonl

# Change this to the context length you want to use. Examples with longer
# context length will be truncated.
context_length: 4096

# Change this to total number of GPUs that you want to use
num_devices: 8

# Change this to the number of epochs that you want to train for
num_epochs: 3

# Change this to the batch size that you want to use
train_batch_size_per_device: 2
eval_batch_size_per_device: 2

# Change this to the learning rate that you want to use
learning_rate: 1e-4

# This will pad batches to the longest sequence. Use "max_length" when profiling to profile the worst case.
padding: "longest"

# By default, we will keep the best checkpoint. You can change this to keep more checkpoints.
num_checkpoints_to_keep: 1

# DeepSpeed configuration; you can provide your own DeepSpeed setup
deepspeed:
  config_path: deepspeed_configs/zero_3_offload_optim+param.json

# Accelerator type. The value of 0.001 is not important as long as it is
# between 0 and 1; it just ensures that the given accelerator type is
# available for each trainer worker.
worker_resources:
  accelerator_type:A100-80G: 0.001

# Lora configuration
lora_config:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
    - embed_tokens
    - lm_head
  task_type: "CAUSAL_LM"
  bias: "none"
  modules_to_save: []

Note the lora_config section at the bottom; that config has been working for me :)
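
(For reference, the keys in the lora_config block above mirror the constructor arguments of peft.LoraConfig. A minimal Python sketch of the equivalent, assuming the template ultimately forwards these values to the peft library:)

from peft import LoraConfig

# Mirrors the lora_config block from the template above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
    task_type="CAUSAL_LM",
    bias="none",
    modules_to_save=[],
)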

@sudhirn-anyscale
Author

Thanks @pcmoritz
