
Finetuning LLM workspace template failed with OOM for LoRA/Llama70B #174

Open · sudhirn-anyscale opened this issue Apr 16, 2024 · 2 comments

@sudhirn-anyscale

Launched the fine-tuning job as follows and it failed with an OOM error for Llama-2-70B. Full job log:
ray_job_log_job_eqeqt513ex4xy1sgwgcjk8ag1i.log

$ python main.py job_compute_configs/aws.yaml training_configs/lora/llama-2-70b-4k-4xg5_48xlarge.yaml

Error

   result = forward_call(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 268, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB (GPU 4; 21.99 GiB total capacity; 16.79 GiB already allocated; 907.38 MiB free; 20.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
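
(For reference, the error message itself points at an allocator tweak that can be tried when relaunching the job. A minimal sketch, assuming the same launch command as above; the 128 MiB split size is only an illustrative value, not one from the template, and this does not address the root cause discussed below.)

$ export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
$ python main.py job_compute_configs/aws.yaml training_configs/lora/llama-2-70b-4k-4xg5_48xlarge.yaml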

@pcmoritz
Contributor

I'm not sure why nobody answered this issue, but I ran into a similar problem recently, and the cause was the template -- even though the config claimed it was using LoRA, it was actually missing the lora_config section. The template has since been updated and now looks like this:

# Change this to the model you want to fine-tune
model_id: meta-llama/Meta-Llama-3-70B-Instruct

# Change this to the path to your training data
train_path: s3://air-example-data/gsm8k/train.jsonl

# Change this to the path to your validation data. This is optional
valid_path: s3://air-example-data/gsm8k/test.jsonl

# Change this to the context length you want to use. Examples with longer
# context length will be truncated.
context_length: 4096

# Change this to total number of GPUs that you want to use
num_devices: 8

# Change this to the number of epochs that you want to train for
num_epochs: 3

# Change this to the batch size that you want to use
train_batch_size_per_device: 2
eval_batch_size_per_device: 2

# Change this to the learning rate that you want to use
learning_rate: 1e-4

# This will pad batches to the longest sequence. Use "max_length" when profiling to profile the worst case.
padding: "longest"

# By default, we will keep the best checkpoint. You can change this to keep more checkpoints.
num_checkpoints_to_keep: 1

# DeepSpeed configuration; you can provide your own DeepSpeed setup
deepspeed:
  config_path: deepspeed_configs/zero_3_offload_optim+param.json

# Accelerator type. The value of 0.001 is not important as long as it is
# between 0 and 1; it just ensures that the given accelerator type is
# available for each trainer worker.
worker_resources:
  accelerator_type:A100-80G: 0.001

# Lora configuration
lora_config:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
    - embed_tokens
    - lm_head
  task_type: "CAUSAL_LM"
  bias: "none"
  modules_to_save: []

Note the lora_config section at the bottom; that config has been working for me :)
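
(For reference, the keys in the lora_config block above mirror the constructor arguments of peft.LoraConfig. A minimal Python sketch of the equivalent, assuming the template ultimately forwards these values to the peft library:)

from peft import LoraConfig

# Mirrors the lora_config block from the template above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
    task_type="CAUSAL_LM",
    bias="none",
    modules_to_save=[],
)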

@sudhirn-anyscale
Author

Thanks @pcmoritz
