ss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4, reducing to 2
[2024-01-17 10:10:24,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=16, lr=[0.0001], mom=[(0.9, 0.95)]
[2024-01-17 10:10:24,478] [INFO] [timer.py:260:stop] epoch=0/micro_step=64/global_step=16, RunningAvgSamplesPerSec=7.620608764200451, CurrSamplesPerSec=7.801349699356398, MemAllocated=13.44GB, MaxMemAllocated=14.82GB
0%| | 67/114599 [00:09<4:29:54, 7.07batch/s]
[2024-01-17 10:10:25,080] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
[2024-01-17 10:10:25,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=17, lr=[0.0001], mom=[(0.9, 0.95)]
[2024-01-17 10:10:25,081] [INFO] [timer.py:260:stop] epoch=0/micro_step=68/global_step=17, RunningAvgSamplesPerSec=7.547169766053195, CurrSamplesPerSec=6.64997792221485, MemAllocated=13.44GB, MaxMemAllocated=14.82GB
0%| | 71/114599 [00:10<4:44:42, 6.70batch/s]
Traceback (most recent call last):
  File "/root/ChatGLM-Finetuning/train.py", line 234, in <module>
    main()
  File "/root/ChatGLM-Finetuning/train.py", line 195, in main
    model.step()
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2148, in step
    self._take_model_step(lr_kwargs)
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2054, in _take_model_step
    self.optimizer.step()
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1778, in step
    self._update_scale(self.overflow)
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2029, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
[2024-01-17 10:10:28,152] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1997
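For anyone reading the log above: every step overflows, the loss scale is halved each time (4, then 2, then 1), and the run aborts once the scale cannot go any lower. The snippet below is only a toy sketch of that halve-on-overflow behavior, not DeepSpeed's actual `loss_scaler.py`. The class name `ToyDynamicLossScaler`, the helper `steps_until_failure`, and the default values are hypothetical, and the sketch leaves out details such as hysteresis and raising the scale again after a run of clean steps.

```python
class ToyDynamicLossScaler:
    """Toy illustration of dynamic loss scaling (NOT DeepSpeed's implementation)."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, min_scale=1.0):
        # All three defaults are assumptions chosen for illustration.
        self.cur_scale = init_scale
        self.scale_factor = scale_factor
        self.min_scale = min_scale

    def update_scale(self, overflow):
        """Shrink the scale after an overflowing step; fail if already at the floor."""
        if not overflow:
            return  # a real scaler would eventually grow the scale again here
        if self.cur_scale == self.min_scale:
            raise RuntimeError(
                "Current loss scale already at minimum - cannot decrease scale anymore."
            )
        self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)


def steps_until_failure(scaler):
    """Count how many consecutive overflowing steps are absorbed before the error."""
    steps = 0
    try:
        while True:
            scaler.update_scale(overflow=True)
            steps += 1
    except RuntimeError:
        return steps


if __name__ == "__main__":
    scaler = ToyDynamicLossScaler(init_scale=4.0)
    print(steps_until_failure(scaler), scaler.cur_scale)
```

Starting from a scale of 4, the toy scaler absorbs two overflows (4 to 2, 2 to 1) and raises on the third, which mirrors the last lines of the log. If every step overflows like this, the scaler is only reporting the symptom; the gradients are non-finite in fp16 no matter how small the scale gets.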
450586509 changed the title from "Getting this error when running on RTX 4090: Current loss scale already at minimum - cannot decrease scale anymore" to "Getting this error when fine-tuning chatglm3 on RTX 4090: Current loss scale already at minimum - cannot decrease scale anymore" on Jan 17, 2024