
Commit ad2a4bd

Superoffload examples (#990)
* feat: add examples for superoffload
* fix: typo
* fix: remove hardcoded GPU bind
* feat: add requirement for superoffload
1 parent 01f520e commit ad2a4bd

10 files changed (+1529 −0 lines)
Lines changed: 111 additions & 0 deletions

# SuperOffload Fine-Tuning Examples

This directory shows how to fine-tune popular large language models using [DeepSpeed](https://www.deepspeed.ai/) ZeRO Stage 3 with **SuperOffload**. SuperOffload is an optimized CPU-offloading engine for full-parameter training on emerging "Superchips" (NVIDIA GH200 / GB200, AMD MI300A) that provide very high CPU↔GPU bandwidth. It enables:

* 1× GH200: GPT-OSS-20B, Qwen3-14B, Phi-4
* 2× GH200: Seed-OSS-36B, Qwen3-30B-A3B
* 4× GH200: Llama-70B

With common sequence lengths and batch sizes, SuperOffload can deliver up to ~500 TFLOPS on GH200, about 50% higher throughput than ZeRO-Offload.

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. No custom model code required

All examples use Hugging Face Transformers with DeepSpeed ZeRO Stage 3; no custom modeling code is required.

### 3. Enable SuperOffload (one line)

Add the `super_offload` flag to the `offload_optimizer` block in the ZeRO Stage 3 DeepSpeed config:

```jsonc
"zero_optimization": {
  "stage": 3,
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true,
    "ratio": 0.90,
    "super_offload": true,
    "cpuadam_cores_perc": 0.90
  }
}
```

To fall back to ZeRO-Offload, remove `"super_offload": true` (and optionally `cpuadam_cores_perc`).
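
For reference, a minimal sketch of how the plain ZeRO-Offload config can be generated the same way the example scripts do (via a bash heredoc); the file name and batch size here are illustrative:

```bash
# Illustrative only: write a ZeRO-Offload config without "super_offload",
# mirroring the "zerooffload" branch of the example scripts.
cat > zerooffload_config.json << 'EOF'
{
  "train_batch_size": 4,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
EOF
```
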

### 4. Run a fine-tuning script

Fine-tune GPT-OSS-20B (1× GH200):

```bash
bash finetune_gpt-oss-20b_1gpu.sh superoffload
```

Fine-tune Qwen3-14B (1× GH200):

```bash
bash finetune_qwen3-14b_1gpu.sh superoffload
```

Fine-tune Phi-4 (1× GH200):

```bash
bash finetune_phi-4_1gpu.sh superoffload
```

Fine-tune Llama 8B (1× GH200):

```bash
bash finetune_llama-8b_1gpu.sh superoffload
```

Fine-tune Seed-OSS-36B (2× GH200):

```bash
bash finetune_seed-oss-36b_2gpu.sh superoffload
```

Fine-tune Llama 70B (4× GH200):

```bash
bash finetune_llama-70b_4gpu.sh superoffload
```

Switch to ZeRO-Offload by replacing `superoffload` with `zerooffload` as the first argument.

Each script optionally accepts a second argument for the batch size (default 4):

```bash
bash finetune_qwen3-14b_1gpu.sh superoffload 8
```

Logs, DeepSpeed configs, and outputs are written beside the script location (e.g. `qwen3-14b_superoffload_output/`).
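
For example, after the Qwen3-14B run above you can inspect the generated artifacts (the config file name below follows the pattern used by the scripts and is an assumption for this model):

```bash
# Illustrative: inspect the artifacts a run leaves next to the script.
ls qwen3-14b_superoffload_output/        # logs and any saved outputs
cat qwen3-14b_superoffload_config.json   # generated DeepSpeed config (assumed name)
```
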

> If a script is missing for a larger model, copy an existing one, change `MODEL_NAME`, and update the output naming, for example:
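
A minimal sketch, assuming you want a Qwen3-30B-A3B script for 2× GH200 (the new file name, model ID, and `sed` patterns are hypothetical; review the copy before running):

```bash
# Hypothetical adaptation: start from the Qwen3-14B script.
cp finetune_qwen3-14b_1gpu.sh finetune_qwen3-30b-a3b_2gpu.sh
# Rename the model and the output/config prefixes inside the copy
# (the exact MODEL_NAME string to replace depends on the original script).
sed -i 's|Qwen3-14B|Qwen3-30B-A3B|g; s|qwen3-14b|qwen3-30b-a3b|g' finetune_qwen3-30b-a3b_2gpu.sh
# Finally, set GPUS_PER_NODE=2 in the copy and launch as usual:
bash finetune_qwen3-30b-a3b_2gpu.sh superoffload
```
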

## Notes

* NUMA binding is required for efficient training on GH200. Each GPU is paired with a CPU, and the training process should be launched on the CPU directly associated with that GPU. This pairing improves affinity, delivering higher CPU–GPU bandwidth and greater throughput. DeepSpeed provides a simple interface for this: add the `--bind_cores_to_rank` flag when launching the DeepSpeed engine (see the launch sketch after this list).
* Memory System Resource Partitioning and Monitoring (MPAM) is essential for achieving optimal throughput. In SuperOffload, GPU execution is overlapped with CPU-based Adam execution, and MPAM helps reduce interference between the two, leading to smoother execution and better performance.
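
A minimal launch sketch with NUMA binding enabled; the file names and arguments follow the 4× GH200 example script, and the DeepSpeed config is assumed to have been generated already (e.g. by that script):

```bash
# 4-GPU launch with each rank's CPU cores bound to the NUMA node of its GPU.
deepspeed --num_gpus=4 --bind_cores_to_rank finetune_zero3.py \
    --deepspeed_config=llama-3.3-70b-instruct_superoffload_config.json \
    --model_name meta-llama/Llama-3.3-70B-Instruct \
    --batch_size 4 \
    --output_dir llama-3.3-70b-instruct_superoffload_output
```
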

## Citation

If you use SuperOffload, please cite:

```bib
@inproceedings{superoffload,
  author    = {Xinyu Lian and Masahiro Tanaka and Olatunji Ruwase and Minjia Zhang},
  title     = "{SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips}",
  year      = {2026},
  booktitle = {Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'26)}
}
```
Lines changed: 137 additions & 0 deletions
#!/bin/bash
set -e

echo "================================================"
echo "GPT-OSS-20B Fine-tuning with DeepSpeed on 1 GPU"
echo "================================================"

# MODE options: "superoffload" or "zerooffload"
MODE=$1
BATCH_SIZE=${2:-4}

SCRIPT_DIR=$(dirname "$0")
MODEL_NAME="openai/gpt-oss-20b"
OUTPUT_DIR="${SCRIPT_DIR}/gpt-oss-20b_${MODE}_output"
DS_CONFIG_JSON="${SCRIPT_DIR}/gpt-oss-20b_${MODE}_config.json"

mkdir -p $OUTPUT_DIR

# Script argument parameters
ACTIVATION_CHECKPOINTING=true
SAVE_CHECKPOINT=false
MAX_LENGTH=8192
LOG_INTERVAL=1
DATASET_NAME="tatsu-lab/alpaca"
DATASET_PERCENTAGE=10.0
USE_WANDB=false
WANDB_PROJECT="gpt-oss-20b"
WANDB_RUN_NAME="gpt-oss-20b-$MODE"
DETERMINISTIC=false
BENCH_STEPS=10
WARMUP_STEPS=20

EPOCHS=1
LR=1e-5
WARMUP=0.05
WEIGHT_DECAY=0.01
SEED=42

ACTIVATION_CHECKPOINTING_FLAG=""
if [ "$ACTIVATION_CHECKPOINTING" = "true" ]; then
    ACTIVATION_CHECKPOINTING_FLAG="--activation_checkpointing"
fi

SAVE_CHECKPOINT_ARG=""
if [ "$SAVE_CHECKPOINT" = "true" ]; then
    SAVE_CHECKPOINT_ARG="--save_checkpoint"
fi

WANDB_FLAG=""
if [ "$USE_WANDB" = "true" ]; then
    WANDB_FLAG="--use_wandb"
fi

DETERMINISTIC_FLAG=""
if [ "$DETERMINISTIC" = "true" ]; then
    DETERMINISTIC_FLAG="--deterministic"
fi

# Create DeepSpeed configuration file
if [ "$MODE" = "superoffload" ]; then
cat > "$DS_CONFIG_JSON" << EOF
{
  "train_batch_size": $BATCH_SIZE,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_bucket_size": 8e8,
    "sub_group_size": 8e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "ratio": 0.90,
      "super_offload": true,
      "cpuadam_cores_perc": 0.90
    }
  },
  "wall_clock_breakdown": true
}
EOF
elif [ "$MODE" = "zerooffload" ]; then
cat > "$DS_CONFIG_JSON" << EOF
{
  "train_batch_size": $BATCH_SIZE,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_bucket_size": 8e8,
    "sub_group_size": 8e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "wall_clock_breakdown": true
}
EOF
fi

# Set number of GPUs
GPUS_PER_NODE=1
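
# --leaf_module "GptOssExperts" tells finetune_zero3.py to treat the MoE experts
# module as a single ZeRO-3 leaf (assumption: the script wires this to DeepSpeed's
# z3 leaf-module mechanism), so expert parameters are gathered together rather
# than fetched per sub-module while only a subset of experts is active.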
CMD="deepspeed --num_gpus=$GPUS_PER_NODE finetune_zero3.py \
108+
--deepspeed_config=$DS_CONFIG_JSON \
109+
--model_name $MODEL_NAME \
110+
--leaf_module "GptOssExperts" \
111+
--num_train_epochs $EPOCHS \
112+
--lr $LR \
113+
--batch_size $BATCH_SIZE \
114+
--weight_decay $WEIGHT_DECAY \
115+
--output_dir $OUTPUT_DIR \
116+
--seed $SEED \
117+
--max_length $MAX_LENGTH \
118+
--log_interval $LOG_INTERVAL \
119+
--dataset_name $DATASET_NAME \
120+
--dataset_percentage $DATASET_PERCENTAGE \
121+
--bench_steps $BENCH_STEPS \
122+
--warmup_steps $WARMUP_STEPS \
123+
--attn_implementation eager \
124+
$ACTIVATION_CHECKPOINTING_FLAG \
125+
$SAVE_CHECKPOINT_ARG \
126+
$WANDB_FLAG \
127+
--wandb_project $WANDB_PROJECT \
128+
--wandb_run_name $WANDB_RUN_NAME \
129+
$DETERMINISTIC_FLAG"
130+
131+
echo "Starting training with MODE $MODE"
132+
echo "================================================"
133+
eval $CMD
134+
135+
echo "================================================"
136+
echo "Training completed"
137+
echo "================================================"
Lines changed: 130 additions & 0 deletions
#!/bin/bash
set -e

echo "================================================"
echo "Llama-3.3-70B-Instruct Fine-tuning with DeepSpeed on 4 GPUs"
echo "================================================"

# MODE options: "superoffload" or "zerooffload"
MODE=$1
BATCH_SIZE=${2:-4}

SCRIPT_DIR=$(dirname "$0")
MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
OUTPUT_DIR="${SCRIPT_DIR}/llama-3.3-70b-instruct_${MODE}_output"
DS_CONFIG_JSON="${SCRIPT_DIR}/llama-3.3-70b-instruct_${MODE}_config.json"

mkdir -p $OUTPUT_DIR

# Script argument parameters
ACTIVATION_CHECKPOINTING=true
SAVE_CHECKPOINT=false
MAX_LENGTH=4096
LOG_INTERVAL=1
DATASET_NAME="tatsu-lab/alpaca"
DATASET_PERCENTAGE=10.0
USE_WANDB=false
WANDB_PROJECT="llama-3.3-70b-instruct"
WANDB_RUN_NAME="llama-3.3-70b-instruct-$MODE"
DETERMINISTIC=false
BENCH_STEPS=10
WARMUP_STEPS=20

EPOCHS=1
LR=1e-5
WARMUP=0.05
WEIGHT_DECAY=0.01
SEED=42

ACTIVATION_CHECKPOINTING_FLAG=""
if [ "$ACTIVATION_CHECKPOINTING" = "true" ]; then
    ACTIVATION_CHECKPOINTING_FLAG="--activation_checkpointing"
fi

SAVE_CHECKPOINT_ARG=""
if [ "$SAVE_CHECKPOINT" = "true" ]; then
    SAVE_CHECKPOINT_ARG="--save_checkpoint"
fi

WANDB_FLAG=""
if [ "$USE_WANDB" = "true" ]; then
    WANDB_FLAG="--use_wandb"
fi

DETERMINISTIC_FLAG=""
if [ "$DETERMINISTIC" = "true" ]; then
    DETERMINISTIC_FLAG="--deterministic"
fi

# Create DeepSpeed configuration file
if [ "$MODE" = "superoffload" ]; then
cat > "$DS_CONFIG_JSON" << EOF
{
  "train_batch_size": $BATCH_SIZE,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_bucket_size": 4e8,
    "sub_group_size": 4e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "ratio": 0.90,
      "super_offload": true,
      "cpuadam_cores_perc": 0.90
    }
  },
  "wall_clock_breakdown": true
}
EOF
elif [ "$MODE" = "zerooffload" ]; then
cat > "$DS_CONFIG_JSON" << EOF
{
  "train_batch_size": $BATCH_SIZE,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_bucket_size": 4e8,
    "sub_group_size": 4e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "wall_clock_breakdown": true
}
EOF
fi

GPUS_PER_NODE=4
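
# --bind_cores_to_rank has the DeepSpeed launcher bind each rank to a share of
# the CPU cores on the NUMA node paired with its GPU (see the NUMA note in the
# README), which keeps CPU-GPU traffic on the local high-bandwidth link.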
CMD="deepspeed --num_gpus=$GPUS_PER_NODE --bind_cores_to_rank finetune_zero3.py \
107+
--deepspeed_config=$DS_CONFIG_JSON \
108+
--model_name $MODEL_NAME \
109+
--num_train_epochs $EPOCHS \
110+
--lr $LR \
111+
--batch_size $BATCH_SIZE \
112+
--weight_decay $WEIGHT_DECAY \
113+
--output_dir $OUTPUT_DIR \
114+
--seed $SEED \
115+
--max_length $MAX_LENGTH \
116+
--log_interval $LOG_INTERVAL \
117+
--dataset_name $DATASET_NAME \
118+
--dataset_percentage $DATASET_PERCENTAGE \
119+
--bench_steps $BENCH_STEPS \
120+
--warmup_steps $WARMUP_STEPS \
121+
$ACTIVATION_CHECKPOINTING_FLAG \
122+
$SAVE_CHECKPOINT_ARG \
123+
$WANDB_FLAG \
124+
--wandb_project $WANDB_PROJECT \
125+
--wandb_run_name $WANDB_RUN_NAME \
126+
$DETERMINISTIC_FLAG"
127+
128+
echo "Starting training with MODE $MODE"
129+
echo "================================================"
130+
eval $CMD
