In this repository, we provide the artifact for the paper Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization 🚀.
Mist is an automatic distributed training configuration optimizer that accelerates large language model (LLM) training by co-optimizing parallelism techniques (data, tensor, and pipeline parallelism) alongside memory footprint reduction strategies (activation checkpointing, redundancy elimination, and offloading).
Key Features:
- 🚀 Optimized Performance: Achieves up to 2.04× speedup compared to state-of-the-art automatic systems.
- ⚡ Smart Parallelism & Memory Optimization: Dynamically balances memory usage and compute efficiency.
- 🔍 Symbolic Performance Analysis: Rapidly explores optimization configurations using symbolic expressions.
- 🔄 Overlap-Centric Scheduling: Maximizes computation-communication overlap for efficient GPU utilization.
Non-Goals
- Production: Mist is a research prototype built on PyTorch to explore distributed training optimizations. Certain production features, such as dynamic gradient scaling, gradient clipping, and training monitoring, are intentionally omitted. For production use, we recommend applying Mist's optimized strategies in Megatron-LM and DeepSpeed. We also disable these features in the baselines for a fair performance comparison.
- Numerical Stability: even though we have tried our best to ensure correct execution and have tested the correctness of several base cases, numerical instabilities may arise due to complex overlap scheduling and data race conditions in complicated configurations. We are happy to improve this further once such cases are spotted.
We recommend using Docker Engine to build the artifact so that all software dependencies are fully controlled. Please follow the instructions to install Docker Engine and the NVIDIA Container Toolkit first. Note that if the current user is not in the docker group, all of the following docker-related commands require root privileges (i.e., run them with sudo).
For convenience, we also provide the installation script below (extracted from the official guides):
curl https://get.docker.com | sh && sudo systemctl --now enable docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Clone the repository and build the docker container. NOTE: if your GPUs are not L4 GPUs (sm_89), you may have to change the environment variable TORCH_CUDA_ARCH_LIST in the Dockerfile. You can find details here.
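If you are unsure which architecture your GPUs use, a quick way to check (assuming a recent NVIDIA driver that supports the compute_cap query field) is:
# Print the compute capability of each GPU, e.g. 8.9 for L4, 8.0 for A100
nvidia-smi --query-gpu=compute_cap --format=csv,noheader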
git clone https://github.com/Dazz993/Mist.git
cd Mist
docker build -t mist -f Dockerfile .
docker run --gpus all -it --rm --privileged --ipc=host --shm-size=20G --ulimit memlock=-1 --name "mist" -v $(pwd):/workspace/ mist
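Once inside the container, you can optionally confirm that all GPUs are visible:
# List the GPUs visible in the container
nvidia-smi -L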
To get consistent and stable results, especially on machines like L4, fix the GPU frequency:
# Check supported frequencies
nvidia-smi -q -d SUPPORTED_CLOCKS
# Set the frequencies (e.g. for L4 GPUs)
nvidia-smi -ac 6251,1050
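Optionally, verify that the application clocks were applied:
# Show the current and application clocks
nvidia-smi -q -d CLOCK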
Mist can efficiently analyze the model execution time (including the breakdown for pipeline parallelism) and the memory usage. We use the test-small-base config as an example: a GPT-1.3B model running on 2 GPUs with BSZ=8, DP=2, FlashAttn=False. This is the best configuration that Megatron-LM can achieve. The corresponding YAML file is /benchmark/mist/configs/test-small-base.yaml. This YAML file contains network and memory parameters for GCP L4 GPUs; for other GPUs, this setup can be used as a functionality test.
cd /workspace/benchmark/mist/analysis/
python run.py --config-name test-small-base
Expected results:
# ... Breakdowns ...
# ..................
Total latency: 10.659405381925026
Peak fwd memory: [19255.25]
Peak bwd memory: [19503.125005722046]
Mist can also execute these configurations directly and efficiently.
cd /workspace/benchmark/mist/exec/
torchrun --nproc-per-node 2 benchmark_one_case.py --config-name test-small-base
Expected results:
[Total] Median: 11.3290 s, Mean: 11.3290 s, Std: 0.05885070
Total Latency: 11.3290
[Stage Peak] Allocated memories: [19273.83] MB
[Stage Peak] Reserved memories: [20862.00, 21096.00] MB
Mist provides highly accurate memory estimation, ensuring reliable resource planning. Performance estimation may show slight deviations, as Mist primarily focuses on comparing performance across different configurations rather than predicting absolute runtime. Some constant overheads, such as the optimizer step time, are omitted since they remain the same across configurations. We will cover this later.
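As a rough sanity check with the numbers above, the estimated peak memory (19255.25-19503.13 MB) closely brackets the measured peak allocated memory (19273.83 MB), while the latency estimate is within about 6% of the measured value:
# Relative latency estimation error for test-small-base (estimated 10.6594 s vs. measured 11.3290 s)
python3 -c "print(f'{abs(10.6594 - 11.3290) / 11.3290:.1%}')"   # ~5.9%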
Then let's use Mist to tune the best configuration:
cd /workspace/benchmark/mist/tune/
python tune_one_case.py --config-name test-small-base +output_path=/workspace/benchmark/mist/tune/results/test-small-mist
Expected results:
Best cost: 9.26892465
Best solution: [16,
[(((0, 11), (1, 1), 5, 0), (2, 1, 1, 0, 0, 1, 0.0, 0.0, 0.0, 0.0)),
(((12, 23), (1, 1), 0, 0), (2, 1, 1, 0, 0, 1, 0.0, 0.0, 0.0, 0.0))]
]
Saved the best solution to /workspace/benchmark/mist/tune/results/test-small-mist.yaml
The outputs can be interpreted as:
Gradient Accumulation Steps: 16. Two pipeline stages.
- ----------------------------------------------------
- (0, 11): (layer_idx_start, layer_idx_end)
- (1, 1) : (nnodes, nprocs_per_node)
- 5: number of checkpointed layers in a single stage
- ----------------------------------------------------
- (2, 1, 1): (Batch size, DP, TP)
- (0, 0, 1): (WeightsSharding, GradsSharding, OptSharding)
- (0.0, 0.0, 0.0, 0.0): (W, G, O, A), which map to the weights, gradients,
  optimizer states, and activation offloading ratios.
Execute the tuned configurations:
cd /workspace/benchmark/mist/exec/
torchrun --nproc-per-node 2 \
benchmark_one_case.py \
--config-path /workspace/benchmark/mist/tune/results/ \
--config-name test-small-mist
Expected results:
Total Latency: 9.9345
Therefore, the speedup is roughly ~14% (11.3290 / 9.9345 ≈ 1.14×). This is the data point in (Figure 11, (a) - 1).
Step 3: Run Single-Node Performance Evaluation [Specifically for GCP L4 GPUs] (Est. Time: 3.5 hours)
For L4 GPUs, we directly provide the configurations tuned by us, which can be used to quickly test the speedup of Mist compared to the baselines. We also provide a general process for evaluating on a brand-new cluster; see the next section.
We provide the best configurations found on our L4 clusters under /workspace/benchmark/mist/tuned_configs/.
cd /workspace/benchmark/mist/tuned_configs/
bash run_single_node.sh
Results are summarized in /workspace/benchmark/mist/tuned_configs/l4-24gb/gpt/summary.json and the corresponding llama file.
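To inspect the summary directly, you can pretty-print the JSON file (the same works for the llama summary):
# Pretty-print the summarized Mist results
python3 -m json.tool /workspace/benchmark/mist/tuned_configs/l4-24gb/gpt/summary.json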
Then we evaluate the performance of Megatron-LM. Its best configurations were found manually by us and mostly match the results of searching our baseline search space.
cd /workspace/benchmark/megatron/
bash scripts/tops/l4/gpt2/1_8xl4_node_1_pcie.sh
bash scripts/tops/l4/llama/1_8xl4_node_1_pcie.sh
Results are under /workspace/benchmark/megatron/results.
Similarly, we evaluate the performance of DeepSpeed.
cd /workspace/benchmark/deepspeed/
bash scripts/tops/l4/gpt2/1_8xl4_node_1_pcie.sh
bash scripts/tops/l4/llama/1_8xl4_node_1_pcie.sh
Results are under /workspace/benchmark/deepspeed/results.
We provide a Python script to collect the results for easy comparison.
cd /workspace/benchmark/
python scripts/collect_single_node_results_v1.py
Expected results (for clarity, we omit the absolute numbers):
+----------------------+-----------------------+------------------------+
| SpeedUp | SpeedUp vs Megatron | SpeedUp vs DeepSpeed |
+======================+=======================+========================+
| gpt2-1.3b-flash_True | 1.279X | 1.473X |
+----------------------+-----------------------+------------------------+
| gpt2-2.7b-flash_True | 1.193X | 1.488X |
+----------------------+-----------------------+------------------------+
| gpt2-7b-flash_True | 1.191X | 1.709X |
+----------------------+-----------------------+------------------------+
+-----------------------+-----------------------+------------------------+
| SpeedUp | SpeedUp vs Megatron | SpeedUp vs DeepSpeed |
+=======================+=======================+========================+
| llama-1.3b-flash_True | 1.557X | 1.498X |
+-----------------------+-----------------------+------------------------+
| llama-2.7b-flash_True | 1.374X | 1.535X |
+-----------------------+-----------------------+------------------------+
| llama-7b-flash_True | 1.325X | 1.730X |
+-----------------------+-----------------------+------------------------+
+-----------------------+-----------------------+------------------------+
| SpeedUp | SpeedUp vs Megatron | SpeedUp vs DeepSpeed |
+=======================+=======================+========================+
| gpt2-1.3b-flash_False | 1.175X | 1.418X |
+-----------------------+-----------------------+------------------------+
| gpt2-2.7b-flash_False | 1.141X | 1.384X |
+-----------------------+-----------------------+------------------------+
| gpt2-7b-flash_False | 1.222X | 2.053X |
+-----------------------+-----------------------+------------------------+
We show how the tuning time changes as the search space is expanded step by step. To test this, we use a GPT-22B model to be run on 4 nodes with 8 GPUs per node:
cd /workspace/benchmark/mist/benchmark-tuning-time
python run.py --model=gpt2/22b -n 4 -m 8
The results are under /workspace/benchmark/mist/benchmark-tuning-time/results.
We first profile the networking and overlapping params.
cd /workspace/mist/tools/
bash scripts/profile_interference_single_node.sh
Its output should be similar to:
GPU to GPU params: [4.5, 0.0, 2.995, 0.0004, 4.1604, 0.0003, 4.6325, 0.0005, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0, 4.5, 0.0]
CPU to GPU params: [4.1353, 0.0003]
GPU to CPU params: [3.9305, 0.0006]
Inference params: [10.3312, 7.6785, 4.0921, 2.1695, 2.6942, 3.1015, 1.9612, 2.2752, 2.7452, 2.3535, 6.6221, 2.5708, 2.8219, 4.4773, 2.7195, 3.0124, 1.2946, 1.0224, 1.973, 1.982, 1.9532, 1.9653, 2.4663, 1.5854, 1.3876, 2.8578, 1.9576, 1.9656]
Steps:
- Create a new experiment template under /workspace/benchmark/mist/configs/experiment/. For example, if you are using A10 GPUs:

  cd /workspace/benchmark/mist/configs/experiment/
  cp template-l4.yaml template-a10g.yaml

  Please make sure the xxx in template-xxx is the simplified device name, which can be obtained by:

  python3 -c "import torch; print(torch.cuda.get_device_name(0).split(' ')[-1].lower())"

- Update template-a10g.yaml:
  - Copy the GPU-GPU, CPU-GPU, GPU-CPU, and interference params into the hardware block in the template.
  - Update nvlink and memory_capacity, where memory_capacity is the size of the device memory (in GB).
- Update /workspace/benchmark/mist/experiment/run.py:
  - Add the entry in the TEMPLATE dict (see the helper command after this list).
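To quickly locate the TEMPLATE dict before adding the new device entry:
# Show where the TEMPLATE dict is defined in run.py
grep -n "TEMPLATE" /workspace/benchmark/mist/experiment/run.py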
The entry-point Python file of the experiments is /workspace/benchmark/mist/experiment/run.py.
You can create your own shell file or modify the existing ones, which list all the configurations to be run:
# nnodes, nproc_per_node, model, model_size, seq_len, global_batch_size, use_flash_attention, extra_args
run 1 2 "gpt2" "1.3b" 2048 32 true "--skip-exec"
run 1 2 "gpt2" "1.3b" 2048 32 false "--skip-exec"
run 1 2 "gpt2" "1.3b" 2048 32 true "--skip-tune"
run 1 2 "gpt2" "1.3b" 2048 32 false "--skip-tune"`
cd /workspace/benchmark/mist/experiment/
bash scripts/24gb_pcie/gpt/1-run_8_a10_single_node.sh
Due to limited large-scale compute resources, we have not had a chance to polish the instructions for multi-node performance evaluation in a containerized manner. You may find that in many places we use miniconda. Here we provide general instructions and the key files that enable benchmarking in a similar way.
To easily test multi-node performance, we use pdsh for fast execution. Specifically, we use pdsh -f 1024 -R ssh -w worker-[1-$NUM_HOSTS].
Assume your cluster contains 4 nodes, each with 8 GPUs. Number them from 1 to 4, e.g., the first machine as worker-1 and the last machine as worker-4. Each machine should be able to directly ssh to any other one through worker-%idx.
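Before benchmarking, it is worth verifying that pdsh can reach every worker (a quick connectivity check; set NUM_HOSTS to the size of your cluster):
# Every worker should print its hostname if the naming and passwordless ssh are set up correctly
NUM_HOSTS=4
pdsh -f 1024 -R ssh -w worker-[1-$NUM_HOSTS] hostname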
Similarly, we need to first profile the network and overlap params. Please refer to /workspace/mist/tools/scripts/profile_interference_multinode.sh for details. (This may take a long time, as we apply a relatively fine-grained network bandwidth estimation method.)
After obtaining the networking and overlap params, repeat the steps above to update template-a10g.yaml and /workspace/benchmark/mist/experiment/run.py on all machines.
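One simple way to propagate these files, assuming the repository lives at the same path on every machine (a sketch; adjust the host range to your cluster), is:
# Copy the updated template and run.py from worker-1 to the other workers
for i in 2 3 4; do
  scp /workspace/benchmark/mist/configs/experiment/template-a10g.yaml worker-$i:/workspace/benchmark/mist/configs/experiment/
  scp /workspace/benchmark/mist/experiment/run.py worker-$i:/workspace/benchmark/mist/experiment/
done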
Take a look at 3-1-tune_22b_8_a10_4_nodes.sh and 3-2-exec_22b_8_a10_4_nodes.sh under /workspace/benchmark/mist/experiment/scripts/24gb_pcie/gpt/.
- During multi-node testing, there may be cases where only one or a few machines go down because of OOM errors while the others keep running. Although we have tried our best to handle this for smooth benchmarking, idling situations may still occur and require manual monitoring.
For AE reviewers, please contact the authors through HotCRP for any questions. For other users, please open an issue publicly or contact Zhanda ([email protected]) for any technical questions.