Commit

Merge branch 'ko3n1g/ci/finalize-frozen-ckpt-tests' into 'main'
ci: Add frozen checkpoints

See merge request ADLR/megatron-lm!2541
ko3n1g committed Jan 16, 2025
2 parents 004fbcb + 4aada1b commit b835a10
Showing 4 changed files with 6 additions and 4 deletions.
2 changes: 2 additions & 0 deletions tests/test_utils/python_scripts/launch_jet_workload.py
@@ -274,12 +274,14 @@ def main(
logger.error(e)
time.sleep((3**n_download_attempt) * 60)
n_download_attempt += 1
no_log = True
except KeyError as e:
logger.error(e)
break
no_log = True

if no_log:
logger.error("Did not find any logs to download, retry.")
continue

concat_logs = "\n".join(logs)
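The hunk above sets a `no_log` flag in both exception branches so that the caller can skip the iteration when nothing was downloaded. The following is a minimal, self-contained sketch of that retry pattern, not the repository's implementation. The names `download_with_retry`, `fetch`, `max_attempts`, `base_delay`, and `sleep` are hypothetical, and the built-in `ConnectionError` stands in for whatever exceptions the real script catches. Indentation is lost in this view, so the exact position of the second `no_log = True` relative to `break` cannot be recovered here. The sketch sets the flag before `break`, because a statement that follows `break` in the same block never runs.

```python
import logging
import time

logger = logging.getLogger(__name__)


def download_with_retry(fetch, max_attempts=3, base_delay=60, sleep=time.sleep):
    """Call `fetch` with exponential backoff and return (logs, no_log).

    `fetch` is a hypothetical zero-argument callable that returns a list of
    log strings; it stands in for the real log download.
    """
    logs = []
    no_log = False
    n_download_attempt = 0
    while n_download_attempt < max_attempts:
        try:
            logs = fetch()
            no_log = False
            break
        except ConnectionError as e:
            # Transient failure: wait 3**n * base_delay seconds, then retry.
            logger.error(e)
            sleep((3**n_download_attempt) * base_delay)
            n_download_attempt += 1
            no_log = True
        except KeyError as e:
            # A missing key will not be fixed by retrying, so stop here.
            logger.error(e)
            no_log = True  # set before break so the assignment is reachable
            break
    return logs, no_log
```

A caller would then mirror the `if no_log:` check from the hunk: when `no_log` is true, log the error and move on to the next iteration; otherwise join the returned lines with `"\n".join(logs)`.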
2 changes: 1 addition & 1 deletion tests/test_utils/recipes/bert.yaml
@@ -13,7 +13,7 @@ spec:
n_repeat:
artifacts:
/workspace/data/bert_data: text/the_pile/bert_shard00
# /workspace/checkpoints/bert_mr_mcore_tp2_pp2_frozen_resume_torch_dist_dgx_a100_1N8G_dev: model/mcore_bert/bert_mr_mcore_tp2_pp2_frozen_resume_torch_dist_dgx_a100_1N8G_dev/22390338
/workspace/checkpoints/bert_mr_mcore_tp2_pp2_frozen_resume_torch_dist_dgx_a100_1N8G_dev: model/mcore_bert/bert_mr_mcore_tp2_pp2_frozen_resume_torch_dist_dgx_a100_1N8G_dev/22410107
script: |-
ls
cd /opt/megatron-lm
4 changes: 2 additions & 2 deletions tests/test_utils/recipes/gpt.yaml
@@ -12,8 +12,8 @@ spec:
platforms: dgx_a100
artifacts:
/workspace/data/gpt3_data: text/the_pile/shard00
# /workspace/checkpoints/gpt3_mr_mcore_te_tp4_pp2_frozen_resume_torch_dist_reshard_8x1xNone_dgx_a100_1N8G_dev: model/mcore_gpt/gpt3_mr_mcore_te_tp4_pp2_frozen_resume_torch_dist_reshard_8x1xNone_dgx_a100_1N8G_dev/22390338
# /workspace/checkpoints/gpt3_mr_mcore_te_tp2_pp1_frozen_resume_torch_dist_te_8experts2parallel_dist_optimizer_dgx_a100_1N8G_dev: model/mcore_gpt/gpt3_mr_mcore_te_tp2_pp1_frozen_resume_torch_dist_te_8experts2parallel_dist_optimizer_dgx_a100_1N8G_dev/22390338
/workspace/checkpoints/gpt3_mr_mcore_te_tp4_pp2_frozen_resume_torch_dist_reshard_8x1xNone_dgx_a100_1N8G_dev: model/mcore_gpt/gpt3_mr_mcore_te_tp4_pp2_frozen_resume_torch_dist_reshard_8x1xNone_dgx_a100_1N8G_dev/22410107
/workspace/checkpoints/gpt3_mr_mcore_te_tp2_pp1_frozen_resume_torch_dist_te_8experts2parallel_dist_optimizer_dgx_a100_1N8G_dev: model/mcore_gpt/gpt3_mr_mcore_te_tp2_pp1_frozen_resume_torch_dist_te_8experts2parallel_dist_optimizer_dgx_a100_1N8G_dev/22410107
script: |-
ls
cd /opt/megatron-lm
2 changes: 1 addition & 1 deletion tests/test_utils/recipes/t5.yaml
@@ -11,7 +11,7 @@ spec:
platforms: dgx_a100
artifacts:
/workspace/data/t5_data: text/the_pile/t5_shard00
# /workspace/checkpoints/t5_220m_mr_mcore_te_tp2_pp2_frozen_resume_torch_dgx_a100_1N8G_dev: model/mcore_t5/t5_220m_mr_mcore_te_tp2_pp2_frozen_resume_torch_dgx_a100_1N8G_dev/22390338
/workspace/checkpoints/t5_220m_mr_mcore_te_tp2_pp2_frozen_resume_torch_dgx_a100_1N8G_dev: model/mcore_t5/t5_220m_mr_mcore_te_tp2_pp2_frozen_resume_torch_dgx_a100_1N8G_dev/22410107
script: |-
ls
cd /opt/megatron-lm