Fetch from nvidia Megatron-LM #5

Open
wants to merge 5,152 commits into base: load-iter

RaymondLi0

No description provided.

ko3n1g and others added 30 commits May 12, 2025 23:03
ci: Run on multiple clusters

See merge request ADLR/megatron-lm!3292
ci: Allow specific TE-ref

See merge request ADLR/megatron-lm!3302
ci(fix): Write logs to log_dir

See merge request ADLR/megatron-lm!3299
Address dist checkpointing PyT 24.08 failure

See merge request ADLR/megatron-lm!3253
ci(hotfix): Downstream pipeline

See merge request ADLR/megatron-lm!3307
…nal argparse flag to clear GPU...

Co-authored-by: Szymon Migacz <[email protected]>
MR feedback: added units for arguments, optional argparse flag to clear GPU...

See merge request ADLR/megatron-lm!3308
Allow process group as optional argument for mamba class constructor

See merge request ADLR/megatron-lm!2966
Add NVTX ranges to categorize execution

See merge request ADLR/megatron-lm!2588
Move fsdp 2 import from _composable to public

See merge request ADLR/megatron-lm!3116
ci: Add nemo-image to `ci-rebuild-mcore-nemo-image`

See merge request ADLR/megatron-lm!3321
ci: Re-enable tests that failed on memory

See merge request ADLR/megatron-lm!3197
Signed-off-by: oliver könig <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Engine updates

See merge request ADLR/megatron-lm!3254
ci: Onboard mr-slim to h100

See merge request ADLR/megatron-lm!3312
skierat and others added 30 commits June 17, 2025 11:56
Quick fix for NeMo: handle alternate key names like 'pre_wd_mult' instead of 'wd_mult'

See merge request ADLR/megatron-lm!3444
chore: Bump version 0.14.0

See merge request ADLR/megatron-lm!3477
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Added offloading support for MCore layers

See merge request ADLR/megatron-lm!3071
… avoid shuffling of new tokens

Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Bug fix to reset kv chunks assigned to -1 and avoid shuffling of new tokens

See merge request ADLR/megatron-lm!3437
chore: Add init to tools

See merge request ADLR/megatron-lm!3483
Fix unit test test_fp8_param.py blockwise scaling

See merge request ADLR/megatron-lm!3480
chore: Add init to examples

See merge request ADLR/megatron-lm!3492
build: Force pin down setuptools

See merge request ADLR/megatron-lm!3493
Pad input tensors and enable fp8 weights for fp8 inference

See merge request ADLR/megatron-lm!3341
…l Communication Grid for Model Parallelism

Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Mcore Bot <[email protected]>
M4 Taskforce: Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism

See merge request ADLR/megatron-lm!3398
Pass strict=False to load_checkpoint in inference

See merge request ADLR/megatron-lm!3508
Skip fused rope check if te version < 1.4.0

See merge request ADLR/megatron-lm!3526
Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: Oliver Koenig <[email protected]>
Co-authored-by: Guyue Huang <[email protected]>
Co-authored-by: Pingtian Li <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
ci: Misc refactorings

See merge request ADLR/megatron-lm!3529
Add option to load main params from checkpoint when specifying '--no-load-optim'

See merge request ADLR/megatron-lm!3284
Co-authored-by: Yashaswi Karnati <[email protected]>
MiMO VLM training example and functional tests

See merge request ADLR/megatron-lm!3328