Fetch from nvidia Megatron-LM #5

RaymondLi0 · 2022-08-03T20:32:25Z

No description provided.

Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: Oliver Koenig <[email protected]> Co-authored-by: Guyue Huang <[email protected]> Co-authored-by: Pingtian Li <[email protected]> Co-authored-by: Xin Yao <[email protected]> Co-authored-by: Shanmugam Ramasamy <[email protected]> Co-authored-by: Shanmugam Ramasamy <[email protected]>

ci: Misc refactorings See merge request ADLR/megatron-lm!3529


ko3n1g and others added 30 commits May 12, 2025 23:03

ci(hotfix): Update Dockerfile.ci.dev

5c7ecad

Revert "ADLR/megatron-lm!2711 - Add in-process restart"

e41dde6

This reverts commit d87ba91.

ADLR/megatron-lm!3292 - ci: Run on multiple clusters

f61b17c

Merge branch 'ko3n1g/ci/multi-cluster' into 'main'

c552e21

ci: Run on multiple clusters See merge request ADLR/megatron-lm!3292

ADLR/megatron-lm!3302 - ci: Allow specific TE-ref

55343df

Merge branch 'ko3n1g/ci/te-nightly' into 'main'

d50e830

ci: Allow specific TE-ref See merge request ADLR/megatron-lm!3302

ADLR/megatron-lm!3299 - ci(fix): Write logs to log_dir

8c4875f

Merge branch 'ko3n1g/ci/unit-tests-locally' into 'main'

d6eb60b

ci(fix): Write logs to log_dir See merge request ADLR/megatron-lm!3299

ADLR/megatron-lm!3253 - Address dist checkpointing PyT 24.08 failure

c58e57f

Merge branch 'dist-ckpt-2408' into 'main'

4a114e6

Address dist checkpointing PyT 24.08 failure See merge request ADLR/megatron-lm!3253

ADLR/megatron-lm!3307 - ci(hotfix): Downstream pipeline

d2cbe5a

Merge branch 'ko3n1g/ci/fix-downstream-pipeline' into 'main'

53d55fb

ci(hotfix): Downstream pipeline See merge request ADLR/megatron-lm!3307

ADLR/megatron-lm!3308 - MR feedback: added units for arguments, optional argparse flag to clear GPU...

9c586bf

Co-authored-by: Szymon Migacz <[email protected]>

Merge branch 'inprocess_mr' into 'main'

8416bff

MR feedback: added units for arguments, optional argparse flag to clear GPU... See merge request ADLR/megatron-lm!3308

ADLR/megatron-lm!2966 - Allow process group as optional argument for mamba class constructor

07b1992

Co-authored-by: Zhiyu Li <[email protected]>

Merge branch 'zhiyul/orthotope/ssm' into 'main'

175497e

Allow process group as optional argument for mamba class constructor See merge request ADLR/megatron-lm!2966

ADLR/megatron-lm!2588 - Add NVTX ranges to categorize execution

7f9f2bf

Merge branch 'llama31_automated_breakdown' into 'main'

8a9e864

Add NVTX ranges to categorize execution See merge request ADLR/megatron-lm!2588

ADLR/megatron-lm!3116 - Move fsdp 2 import from _composable to public

1ff5a37

Merge branch 'boxiangw/public_fsdp_import' into 'main'

ed0d528

Move fsdp 2 import from _composable to public See merge request ADLR/megatron-lm!3116

ADLR/megatron-lm!3321 - ci: Add nemo-image to `ci-rebuild-mcore-nemo-image`

d70e2e4

Merge branch 'ko3n1g/ci/fix-rebuild-job' into 'main'

054fad5

ci: Add nemo-image to `ci-rebuild-mcore-nemo-image` See merge request ADLR/megatron-lm!3321

ADLR/megatron-lm!3197 - ci: Re-enable tests that failed on memory

e494219

Merge branch 'ko3n1g/ci/re-enable-broken-tests' into 'main'

bfc751a

ci: Re-enable tests that failed on memory See merge request ADLR/megatron-lm!3197

tests: Disable flaky test

a73b4d2

Signed-off-by: oliver könig <[email protected]>

ADLR/megatron-lm!3254 - Engine updates

407e504

Co-authored-by: Shanmugam Ramasamy <[email protected]> Co-authored-by: Shanmugam Ramasamy <[email protected]>

Merge branch 'engine_updates' into 'main'

7fe8f69

Engine updates See merge request ADLR/megatron-lm!3254

ADLR/megatron-lm!3312 - ci: Onboard mr-slim to h100

ee1d765

Co-authored-by: Mcore Bot <[email protected]>

Merge branch 'ko3n1g/ci/dev-on-h100' into 'main'

861a8fa

ci: Onboard mr-slim to h100 See merge request ADLR/megatron-lm!3312

ADLR/megatron-lm!3334 - chore: Deprecate T5 tests

cf03fb2

skierat and others added 30 commits June 17, 2025 11:56

ADLR/megatron-lm!3444 - Quick fix for NeMo: handle alternate key names like 'pre_wd_mult' instead of 'wd_mult'

e0b2c60

Merge branch 'skierat/quick_nemo_fix' into 'main'

bfa39e8

Quick fix for NeMo: handle alternate key names like 'pre_wd_mult' instead of 'wd_mult' See merge request ADLR/megatron-lm!3444

ADLR/megatron-lm!3477 - chore: Bump version 0.14.0

0e3af7e

Merge branch 'ko3n1g/chore/release-version-0.14.0' into 'main'

27c9b6c

chore: Bump version 0.14.0 See merge request ADLR/megatron-lm!3477

ADLR/megatron-lm!3071 - Added offloading support for MCore layers

3987e89

Co-authored-by: Selvaraj Anandaraj <[email protected]> Co-authored-by: Selvaraj Anandaraj <[email protected]>

Merge branch 'lora_offload' into 'main'

4a91173

Added offloading support for MCore layers See merge request ADLR/megatron-lm!3071

ADLR/megatron-lm!3437 - Bug fix to reset kv chunks assigned to -1 and avoid shuffling of new tokens

115785f

Co-authored-by: Shanmugam Ramasamy <[email protected]> Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: Shanmugam Ramasamy <[email protected]>

Merge branch 'bugFixDE' into 'main'

3b0f763

Bug fix to reset kv chunks assigned to -1 and avoid shuffling of new tokens See merge request ADLR/megatron-lm!3437

ADLR/megatron-lm!3483 - chore: Add init to tools

642a181

Merge branch 'ko3n1g/chore/tool-init' into 'main'

0710137

chore: Add init to tools See merge request ADLR/megatron-lm!3483

ADLR/megatron-lm!3480 - Fix unit test test_fp8_param.py blockwise scaling

171c351

Merge branch 'fix_2425' into 'main'

57082f9

Fix unit test test_fp8_param.py blockwise scaling See merge request ADLR/megatron-lm!3480

ADLR/megatron-lm!3492 - chore: Add init to examples

9f1c4b2

Merge branch 'ko3n1g/chore/examples-init' into 'main'

6ac5633

chore: Add init to examples See merge request ADLR/megatron-lm!3492

ADLR/megatron-lm!3493 - build: Force pin down setuptools

2074d19

Merge branch 'ko3n1g/build/fix-setuptools-version' into 'main'

0600a3c

build: Force pin down setuptools See merge request ADLR/megatron-lm!3493

ADLR/megatron-lm!3341 - Pad input tensors and enable fp8 weights for fp8 inference

a002d50

Merge branch 'fp8_inference' into 'main'

6a6cd47

Pad input tensors and enable fp8 weights for fp8 inference See merge request ADLR/megatron-lm!3341

ADLR/megatron-lm!3398 - M4 Taskforce: Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism

2151c65

Co-authored-by: yaoyu-33 <[email protected]> Co-authored-by: Mcore Bot <[email protected]>

Merge branch 'yuya/m4_hyper_comm_grid' into 'main'

45400df

M4 Taskforce: Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism See merge request ADLR/megatron-lm!3398

ADLR/megatron-lm!3508 - Pass strict=False to load_checkpoint in inference

db59202

Merge branch 'helenn-allow-loading-unstrict-checkpoint' into 'main'

1ab876d

Pass strict=False to load_checkpoint in inference See merge request ADLR/megatron-lm!3508

ADLR/megatron-lm!3526 - Skip fused rope check if te version < 1.4.0

9964092

Merge branch 'boxiangw/skip-te-fused-rope-test' into 'main'

878d65f

Skip fused rope check if te version < 1.4.0 See merge request ADLR/megatron-lm!3526

Merge branch 'ko3n1g/chore/some-fixes' into 'main'

cc3ed64

ci: Misc refactorings See merge request ADLR/megatron-lm!3529

ADLR/megatron-lm!3284 - Add option to load main params from checkpoint when specifying '--no-load-optim'

1e42279

Merge branch 'kunlunl/load_main_params_from_ckpt' into 'main'

c203e6a

Add option to load main params from checkpoint when specifying '--no-load-optim' See merge request ADLR/megatron-lm!3284

ADLR/megatron-lm!3328 - MiMO VLM training example and functional tests

881dfe4

Co-authored-by: Yashaswi Karnati <[email protected]> Co-authored-by: Yashaswi Karnati <[email protected]>

Merge branch 'yash/mimo_train_loop_mr' into 'main'

6b70889

MiMO VLM training example and functional tests See merge request ADLR/megatron-lm!3328
