Ten numbers of CPM-Ant training. #112
zh-zheng announced in Training Logs
We summarize the training process of CPM-Ant into 10 numbers:
- 68 days: The training lasted 68 days.
- 10 billion parameters: CPM-Ant contains 10 billion parameters, and we will release several smaller variants compressed by BMCook.
- 2 tasks: We used 2 pre-training tasks: text generation and infilling.
- 32 GPUs: We used 32 NVIDIA A100 40G GPUs to train CPM-Ant. At the beginning of the training we used only 8 GPUs, and increased the number to 32 on June 7.
- 200G data: We selected 200G of high-quality Chinese data (about 50 billion tokens) from an 11T raw corpus to train the model. Since training a 10-billion-parameter model takes about 200 billion tokens (refer to this paper), CPM-Ant was trained for 4 epochs; see the short calculation after this list.
- 430,813 Yuan: With the help of BMTrain, training CPM-Ant cost 430,813 Yuan (about $63k), only 1/20 of what Google spent to train the T5-11B model (about $1.3 million)!
- 4872 kg CO2e: The carbon emission of the CPM-Ant training process is about 4872 kg CO2e (calculated on this website), while the carbon emission of Google's training of T5-11B is 46.7 t CO2e (reference), about 10 times ours. Training big models with the BMTrain toolkit is clearly more environmentally friendly!
- 6 interruptions: The training was interrupted 6 times, at the times and for the reasons listed below.
  - June 6, 13:20 - June 6, 16:10: the GPU driver was upgraded.
  - June 7, 00:30 - June 7, 17:20: the number of GPUs was increased from 8 to 32.
  - June 10, 17:40 - June 12, 02:20: one node in the cluster failed, and we also upgraded the model; see the log for June 12 for details.
  - June 14, 21:00 - June 15, 11:50: an out-of-memory error occurred due to an abnormal data read caused by insufficient disk space.
  - June 16, 08:30 - June 16, 12:10: a computing node failed.
  - June 19, 09:10 - June 19, 16:10: a CUDA out-of-memory error occurred; the problem did not recur after restarting the training.
- 1 bug: During the model upgrade on June 10-12 we introduced a bug that made the data loader start reading from the beginning of the data every time training was resumed; it was found and fixed on June 22. The "side effect" of this bug is visible in the loss curve: whenever the model re-reads the beginning of the data, the training loss drops sharply, because it has already seen those samples many times. A minimal sketch of this kind of fix is shown after this list.
- 22,545 characters: To better track the training process, we update the training logs every day; they contain a total of 22,545 characters.
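
As a quick sanity check, the headline numbers above are easy to reproduce. The sketch below recomputes the epoch count, the cost ratio, and the carbon ratio from the figures quoted in this post; all constants are just the rounded values mentioned above.

```python
# Sanity check of the headline numbers quoted in this post.

tokens_per_epoch = 50e9    # ~50 billion tokens in the 200G corpus
tokens_needed = 200e9      # ~200 billion tokens to train a 10B-parameter model
print(f"epochs: {tokens_needed / tokens_per_epoch:.0f}")  # -> 4

cpm_ant_cost_usd = 63_000      # 430,813 Yuan, roughly $63k
t5_11b_cost_usd = 1_300_000    # ~$1.3 million reported for T5-11B
print(f"cost ratio: 1/{t5_11b_cost_usd / cpm_ant_cost_usd:.0f}")  # -> ~1/21

cpm_ant_co2e_kg = 4_872
t5_11b_co2e_kg = 46_700        # 46.7 t CO2e
print(f"CO2e ratio: {t5_11b_co2e_kg / cpm_ant_co2e_kg:.1f}x")     # -> ~9.6x
```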
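
For the "1 bug" item, the fix amounts to recording how many samples the data loader has already consumed in the checkpoint and skipping that many samples on resume. The sketch below is ours and not the actual CPM-Ant code; `resume_data_iterator` and `consumed_samples` are hypothetical names used only for illustration.

```python
# Hedged sketch (not the actual CPM-Ant code): resume a streaming dataset at
# the position recorded in the checkpoint instead of restarting from sample 0.
import itertools

def resume_data_iterator(dataset, consumed_samples):
    """Skip the samples that were already consumed before the restart."""
    # The bug described above was equivalent to ignoring `consumed_samples`
    # and iterating from the first sample again after every resume, so the
    # model kept re-seeing the (already familiar) beginning of the corpus.
    return itertools.islice(iter(dataset), consumed_samples, None)

# Toy usage: pretend the previous run had already consumed 3 samples.
corpus = ["sample0", "sample1", "sample2", "sample3", "sample4"]
for sample in resume_data_iterator(corpus, consumed_samples=3):
    print(sample)  # prints sample3, then sample4
```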