Fetch from nvidia Megatron-LM #5

Open
wants to merge 5,177 commits into
base: load-iter
Commits
407e504
ADLR/megatron-lm!3254 - Engine updates
shanmugamr1992 May 18, 2025
7fe8f69
Merge branch 'engine_updates' into 'main'
shanmugamr1992 May 18, 2025
ee1d765
ADLR/megatron-lm!3312 - ci: Onboard mr-slim to h100
ko3n1g May 19, 2025
861a8fa
Merge branch 'ko3n1g/ci/dev-on-h100' into 'main'
ko3n1g May 19, 2025
cf03fb2
ADLR/megatron-lm!3334 - chore: Deprecate T5 tests
ko3n1g May 19, 2025
8e1c3df
Merge branch 'ko3n1g/chore/remove-t5-from-lts' into 'main'
ko3n1g May 19, 2025
6eb0bcf
ADLR/megatron-lm!3062 - Fix wrong fp8_meta info when resume training …
BestJuly May 19, 2025
dfc0a3d
Merge branch 'lit/fix_fp8_moe_resume_training' into 'main'
ko3n1g May 19, 2025
ee4815f
ADLR/megatron-lm!3198 - Adding Audio Submodules for MiMO.
yashaswikarnati May 19, 2025
3895909
Merge branch 'yash/mimo_audio_submodules' into 'main'
jaredcasper May 19, 2025
5749637
ADLR/megatron-lm!3314 - Bugfix in huggingface hf_llava converter
May 20, 2025
5da8c0b
Merge branch 'tpoon/hf-small-fix' into 'main'
jaredcasper May 20, 2025
22d0305
ADLR/megatron-lm!3339 - ci: Remove deprecated bert tests
ko3n1g May 20, 2025
4d1a4e8
Merge branch 'ko3n1g/tests/deprecated-bert-tests' into 'main'
ko3n1g May 20, 2025
9eebd51
ADLR/megatron-lm!3252 - Skyw/force vp stage passing in core
skyw May 20, 2025
bed7dbd
Merge branch 'skyw/force_vp_stage_passing_in_core' into 'main'
ko3n1g May 20, 2025
000c978
ADLR/megatron-lm!3134 - Fix MMMU prompt and inference context
May 21, 2025
eb20e24
Merge branch 'matthieul/fix_mmodal_inference' into 'main'
jaredcasper May 21, 2025
3a49d53
ADLR/megatron-lm!3336 - fix: Make NVTX optional
ko3n1g May 21, 2025
5a676b3
Merge branch 'ko3n1g/fix/guard-nvtx' into 'main'
ko3n1g May 21, 2025
0db0e83
ADLR/megatron-lm!3329 - Fix CUDA_DEVICE_MAX_CONNECTIONS check on Blac…
duncanriach May 22, 2025
75b1ca1
Merge branch 'fix-cdmc-check-on-blackwell' into 'main'
deepakn94 May 22, 2025
40df28b
ADLR/megatron-lm!2902 - Userbuffer registration for MCore-FSDP
youngeunkwon0405 May 22, 2025
20304a3
Merge branch 'fsdp-ubr' into 'main'
ko3n1g May 22, 2025
6b6e9db
ci(hotfix):switch runner
ko3n1g May 22, 2025
497c3e2
ADLR/megatron-lm!3346 - Fix text generation
May 24, 2025
b8be0af
Merge branch 'matthieul/fix_text_generation' into 'main'
trintamaki May 24, 2025
194b2be
ADLR/megatron-lm!3343 - ci: Run tests on H100
ko3n1g May 24, 2025
3996ec2
Merge branch 'ko3n1g/ci/dev-on-h100-2' into 'main'
ko3n1g May 24, 2025
022bcb5
ADLR/megatron-lm!3274 - feat: add force-load-balancing for MoE router
Victarry May 24, 2025
18b32aa
Merge branch 'denliu/router_force_balance' into 'main'
ko3n1g May 24, 2025
37587af
ADLR/megatron-lm!3353 - tests: Onboard gpt-nemo test
ko3n1g May 24, 2025
957dc60
Merge branch 'ko3n1g/tests/gpt-nemo' into 'main'
ko3n1g May 24, 2025
32b6d48
ADLR/megatron-lm!3309 - Add user guide for Multi-Storage Client integ…
shunjiad May 24, 2025
cd88296
Merge branch 'chore-msc-doc' into 'main'
ko3n1g May 24, 2025
de7945b
ADLR/megatron-lm!3335 - tests: Onboard MoE memory test
ko3n1g May 25, 2025
1e05700
Merge branch 'ko3n1g/tests/moe-memory' into 'main'
ko3n1g May 25, 2025
c8cc2c6
ADLR/megatron-lm!3354 - ci: Restart on segfault
ko3n1g May 26, 2025
42ccbb8
Merge branch 'ko3n1g/ci/restart-on-segfault' into 'main'
ko3n1g May 26, 2025
1c7d3db
ADLR/megatron-lm!3364 - chore: add pre-commit config file
ko3n1g May 27, 2025
0e6223d
Merge branch 'ko3n1g/chore/add-precommit' into 'main'
ko3n1g May 27, 2025
f1c74a6
ADLR/megatron-lm!3359 - ci: Update nightlies
ko3n1g May 27, 2025
0d77a93
Merge branch 'ko3n1g/ci/update-nightlies-2' into 'main'
ko3n1g May 27, 2025
e582852
ADLR/megatron-lm!3231 - Multiple touches for TensorRT Model Optimizer…
ChenhanYu May 28, 2025
7b4fbe9
Merge branch 'chenhany/heterogenous_sharded_ckpt' into 'main'
ko3n1g May 28, 2025
fb68b3a
ADLR/megatron-lm!3360 - add more nemo2 tests
dimapihtar May 28, 2025
27c33cb
Merge branch 'add_nemo2_tests' into 'main'
ko3n1g May 28, 2025
e019f82
ADLR/megatron-lm!3348 - fix: handle checkpoint_dir path as a string
shunjiad May 29, 2025
0cb4da1
Merge branch 'fix-torch-msc-checkpointing' into 'main'
deepakn94 May 29, 2025
90e768c
ADLR/megatron-lm!3367 - Update dataset helper for online video decoding
May 29, 2025
705d312
Merge branch 'matthieul/fix_text_generation' into 'main'
trintamaki May 29, 2025
7c1baea
ADLR/megatron-lm!3365 - Do not use eval on arbitrary user input.
jaredcasper May 29, 2025
c820c68
Merge branch 'safer-eval' into 'main'
jaredcasper May 29, 2025
c6b08c2
ADLR/megatron-lm!3363 - tests: Update frozen-checkpoints
ko3n1g May 30, 2025
8a39761
Merge branch 'ko3n1g/tests/frozen-cpkt' into 'main'
ko3n1g May 30, 2025
8d08685
ADLR/megatron-lm!3375 - Consolidate eval methods across train and gen…
May 30, 2025
13898cb
Merge branch 'matthieul/consolidate_eval' into 'main'
trintamaki May 30, 2025
de245df
ADLR/megatron-lm!3388 - ci: Auto-restart on nan
ko3n1g May 30, 2025
0a438ed
Merge branch 'ko3n1g/ci/restart-on-nan' into 'main'
ko3n1g May 30, 2025
23e6471
ADLR/megatron-lm!2949 - perf(mla, experimental): MLA RoPE fusion and …
hxbai Jun 2, 2025
9c1a535
Merge branch 'hongxiaob/mla_rope' into 'main'
ko3n1g Jun 2, 2025
da3f0ff
ADLR/megatron-lm!3280 - Fix custom FSDP float8 tensor set_item
shjwudp Jun 3, 2025
549d637
Merge branch 'fix_cfsdp_fp8_param_load' into 'main'
chtruong814 Jun 3, 2025
24c60db
ADLR/megatron-lm!3401 - ci: Move queue blocker
ko3n1g Jun 3, 2025
cfea2ea
Merge branch 'ko3n1g/ci/move-queue-blocker' into 'main'
ko3n1g Jun 3, 2025
37b0afd
ADLR/megatron-lm!3400 - ci: Improve error-handling of missing logs
ko3n1g Jun 4, 2025
6a62a54
Merge branch 'ko3n1g/ci/better-log-failure-handling' into 'main'
ko3n1g Jun 4, 2025
4648912
ADLR/megatron-lm!3408 - ci: Control job concurrency
ko3n1g Jun 4, 2025
cde60ce
Merge branch 'ko3n1g/ci/job-concurrency' into 'main'
ko3n1g Jun 4, 2025
eab047c
ADLR/megatron-lm!3412 - ci: Catch missing logs
ko3n1g Jun 4, 2025
25a26ca
Merge branch 'ko3n1g/ci/fix-no-log' into 'main'
ko3n1g Jun 4, 2025
9bdfe31
ADLR/megatron-lm!3411 - ci: Remove tests from A100
ko3n1g Jun 4, 2025
ff64f96
Merge branch 'ko3n1g/ci/move-tests' into 'main'
ko3n1g Jun 4, 2025
d960800
ADLR/megatron-lm!3393 - Add an option to skip counting zeros in grad …
erhoo82 Jun 5, 2025
b47a9bb
Merge branch 'no_count_zeros' into 'main'
ko3n1g Jun 5, 2025
bc80491
ADLR/megatron-lm!3326 - Add an interface to set high priority stream …
youngeunkwon0405 Jun 5, 2025
957f348
Merge branch 'comm-priority-setting' into 'main'
ko3n1g Jun 5, 2025
7af72f9
ADLR/megatron-lm!3241 - Llama4 inference
wdykas Jun 6, 2025
4eb36f8
Merge branch 'llama4-inference' into 'main'
chtruong814 Jun 6, 2025
61a42f6
ADLR/megatron-lm!3421 - Change default value of high_priority_stream_…
youngeunkwon0405 Jun 6, 2025
7c64be3
Merge branch 'comm-priority-patch' into 'main'
jaredcasper Jun 6, 2025
92d68da
ADLR/megatron-lm!3170 - [feat, moe]: FP8 padding optimization of MoE …
Victarry Jun 9, 2025
140dce2
Merge branch 'denliu/router_pad' into 'main'
ko3n1g Jun 9, 2025
9e3adb5
ADLR/megatron-lm!3306 - Remove deprecated alltoall_seq dispatcher.
Victarry Jun 9, 2025
823466e
Merge branch 'denliu/remove_alltoall_seq_dispatcher' into 'main'
ko3n1g Jun 9, 2025
db07e3f
ADLR/megatron-lm!3347 - Fix flash decode bug caused by unnecessary ro…
santhnm2 Jun 9, 2025
2e15d12
Merge branch 'hybrid_example' into 'main'
ko3n1g Jun 9, 2025
1589517
ADLR/megatron-lm!3404 - Fix perf issues with NVTX range profiling
Jun 9, 2025
b04c901
Merge branch 'nvtx_perf_fix' into 'main'
ko3n1g Jun 9, 2025
791454d
ADLR/megatron-lm!3385 - Enforce param group ordering after checkpoint…
skierat Jun 9, 2025
40cb6e7
Merge branch 'skierat/fix_param_groups' into 'main'
ko3n1g Jun 9, 2025
54cdc7a
ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype, embeddi…
cuichenx Jun 10, 2025
d1409db
Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'
ko3n1g Jun 10, 2025
629b615
Revert "Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'"
ko3n1g Jun 10, 2025
50a1247
Reapply "Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'"
ko3n1g Jun 10, 2025
5ae21f8
Revert "ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype,…
ko3n1g Jun 10, 2025
62e7e60
ADLR/megatron-lm!3332 - fix(mtp): Fix issue with MTP+VPP after !3108 …
shifangx Jun 11, 2025
ad36348
Merge branch 'shifang/fix_vp_stage' into 'main'
ko3n1g Jun 11, 2025
0f4f095
ADLR/megatron-lm!3384 - Expose TE fused MLP with module spec
timmoon10 Jun 11, 2025
0595ef2
Merge branch 'mfutrega/fused_swiglu' into 'main'
ko3n1g Jun 11, 2025
9e5fe7a
ADLR/megatron-lm!3403 - Moe inference functional tests
wdykas Jun 12, 2025
0dea9a5
Merge branch 'moe-tests' into 'main'
ko3n1g Jun 12, 2025
80d66ec
ADLR/megatron-lm!3458 - ci: Benchmark release tests suite with TE2.2 …
ko3n1g Jun 12, 2025
a3e2222
Merge branch 'ko3n1g/chore/release-benchmarks-dev' into 'main'
ko3n1g Jun 12, 2025
15e4446
ADLR/megatron-lm!3371 - Move data to GPU for TP data processing
parthmannan Jun 12, 2025
d58f062
Merge branch 'pmannan/improve_data_processing' into 'main'
ko3n1g Jun 12, 2025
f5cfc10
Reapply "ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype…
ko3n1g Jun 12, 2025
5bb6cf3
update golden values
ko3n1g Jun 12, 2025
603592a
ADLR/megatron-lm!3366 - Optimize dummy weight tensors for cudagraph a…
gdengk Jun 12, 2025
40bfaf5
Merge branch 'gaod/llama4/cudagraph_optimize' into 'main'
ko3n1g Jun 12, 2025
6782fe4
ADLR/megatron-lm!3377 - Add --enable-experimental to args.
Victarry Jun 12, 2025
32737be
Merge branch 'denliu/add_enable_experimental_flag' into 'main'
ko3n1g Jun 12, 2025
e63aee4
ADLR/megatron-lm!3281 - perf(MLA): MLA down proj switch back to TELinear
yuzhongw-nvidia Jun 13, 2025
ae63c41
Merge branch 'mla_down_proj_telinear' into 'main'
ko3n1g Jun 13, 2025
9042182
ADLR/megatron-lm!3463 - ci: Retry on network errors
ko3n1g Jun 13, 2025
819f752
Merge branch 'ko3n1g/ci/wait-resources-resiliency' into 'main'
ko3n1g Jun 13, 2025
b8605c6
ADLR/megatron-lm!3361 - Add TE functional tests
ko3n1g Jun 13, 2025
107fc72
Merge branch 'ko3n1g/guyueh/te_functional_tests' into 'main'
ko3n1g Jun 13, 2025
effa991
revert
ko3n1g Jun 13, 2025
ad7d1df
ci: Restart on cuda error
ko3n1g Jun 13, 2025
f21a28b
Revert "ADLR/megatron-lm!3281 - perf(MLA): MLA down proj switch back …
ko3n1g Jun 13, 2025
a4fc916
Merge branch 'ko3n1g/ci/restart-on-cuda'
ko3n1g Jun 13, 2025
7f7ffcf
Merge branch 'ko3n1g/chore/re-apply-3399'
ko3n1g Jun 13, 2025
73558db
ci: Set gpt-nemo tests as allowed to fail
ko3n1g Jun 13, 2025
42f7f7f
ci: Fix while loop
ko3n1g Jun 13, 2025
0bbcbb1
ADLR/megatron-lm!3024 - Added support for offloading Swiglu activatio…
sanandaraj5597 Jun 13, 2025
fdcf52b
Merge branch 'swiglu_offload' into 'main'
ericharper Jun 13, 2025
cfe7b06
ADLR/megatron-lm!3279 - Fix MoE Aux loss
aklife97 Jun 13, 2025
aaddc23
Merge branch 'akhattar/auxloss_fix' into 'main'
ko3n1g Jun 13, 2025
db8cd9a
ADLR/megatron-lm!3429 - llama 3p1 nemotron nano vl 8b v1 instructions
Jun 13, 2025
dca59c6
Merge branch 'matthieul/llama_3p1_nemotron_nano_vl_8b_v1' into 'main'
ko3n1g Jun 13, 2025
9caa5d3
ADLR/megatron-lm!3289 - Fix attention unit test
santhnm2 Jun 14, 2025
8a03b29
Merge branch 'attention_unit_test_fix' into 'main'
ko3n1g Jun 14, 2025
04c93ae
ADLR/megatron-lm!3265 - Handle strict argument for local checkpointing
Jun 14, 2025
59ae4e3
Merge branch 'jszulc/local-ckpt-strict-loading' into 'main'
ko3n1g Jun 14, 2025
77732c3
ADLR/megatron-lm!2795 - feat(Pipeline Parallel, MoE): Flexible Asymme…
Shunkangz Jun 14, 2025
aec50ee
Merge branch 'flexible_vpp' into 'main'
ko3n1g Jun 14, 2025
19d30fa
ADLR/megatron-lm!3317 - Fix version check of test_fp8_param.py
kunlunl Jun 14, 2025
48396b2
Merge branch 'kunlunl/fix_fp8_param_ut_version_check' into 'main'
ko3n1g Jun 14, 2025
0d549aa
ADLR/megatron-lm!3461 - Fix common state comparison primitive
mikolajblaz Jun 14, 2025
de3da90
Merge branch 'mblaz/fix-dict-utils-diff' into 'main'
ko3n1g Jun 14, 2025
f2116e2
ADLR/megatron-lm!3153 - Update inference README
mathemakitten Jun 14, 2025
a981bf8
Merge branch 'helenn-update-inference-readme' into 'main'
jaredcasper Jun 14, 2025
d920c0d
ADLR/megatron-lm!3345 - M4 Taskforce: update get_rank & get_size of PG
yaoyu-33 Jun 14, 2025
fabb0a0
Merge branch 'yuya/m4_get_rank_get_size_of_pg_update' into 'main'
ko3n1g Jun 14, 2025
03322c1
ADLR/megatron-lm!3448 - CRADIO-g support
Jun 14, 2025
c85b6e7
Merge branch 'tpoon/cradio-g-mr' into 'main'
ko3n1g Jun 14, 2025
9d509a0
ADLR/megatron-lm!3127 - feat(optimizer): Support bf16 dtype for optim…
BestJuly Jun 14, 2025
083b1dc
Merge branch 'lit/support_bf16_optimzer_states' into 'main'
ko3n1g Jun 14, 2025
9900d9a
ADLR/megatron-lm!3379 - Megatron SFT
Jun 14, 2025
775a1d1
Merge branch 'megatron-main-sft' into 'main'
ko3n1g Jun 14, 2025
ee56591
ADLR/megatron-lm!3376 - Fix cuda graph for MambaLayer
guyueh1 Jun 14, 2025
5b4e466
Merge branch 'fix_cuda_graph_for_ssm' into 'main'
ko3n1g Jun 14, 2025
e3ec174
ADLR/megatron-lm!2276 - Add Mamba context parallel
duncanriach Jun 14, 2025
55080a3
Merge branch 'duncan/mamba-context-parallel' into 'main'
ericharper Jun 14, 2025
d559555
ADLR/megatron-lm!3415 - [MXFP8]Reduce memory footprint by initializin…
Jun 14, 2025
bcf96e3
Merge branch 'qiyuw/mxfp8-param' into 'main'
ko3n1g Jun 14, 2025
66194b7
ADLR/megatron-lm!3462 - Add hybrid functional inference test
wdykas Jun 14, 2025
d738935
Merge branch 'mamba-inference-test' into 'main'
ko3n1g Jun 14, 2025
bf6e998
ADLR/megatron-lm!3316 - added llama model training example with FP8
sbhavani Jun 14, 2025
38e30f5
Merge branch 'main' into 'main'
ko3n1g Jun 14, 2025
0f05866
ADLR/megatron-lm!3387 - feat(MoE): Using `te_general_gemm` to handle …
hxbai Jun 14, 2025
dc8372b
Merge branch 'hongxiaob/custom_router_gating' into 'main'
ko3n1g Jun 14, 2025
1674ce3
ADLR/megatron-lm!3190 - Mark weights from vision encoder to be non-te…
wdykas Jun 14, 2025
a165235
Merge branch 'hf-diverge-fix' into 'main'
ko3n1g Jun 14, 2025
0431153
ADLR/megatron-lm!2850 - Granular upcycling implementation
shifangx Jun 15, 2025
c2fb1de
Merge branch 'shifang/granular_upcycling' into 'main'
ko3n1g Jun 15, 2025
a0937dd
ADLR/megatron-lm!3424 - Add GPU energy (and ~power) monitoring for tr…
Jun 15, 2025
cca17b7
Merge branch 'energy-monitoring' into 'main'
ko3n1g Jun 15, 2025
8333bd5
ADLR/megatron-lm!3217 - feat(MoE): Support ep a2a overlap - (01) Add …
Wohox Jun 16, 2025
3e55583
Merge branch 'pingtianl/fine_grained_transformer_layer_submodules' in…
ko3n1g Jun 16, 2025
5005416
ADLR/megatron-lm!3397 - build: Switch to uv
ko3n1g Jun 16, 2025
0df9325
Merge branch 'ko3n1g/build/refactor-setup' into 'main'
ko3n1g Jun 16, 2025
59f2093
ADLR/megatron-lm!3468 - build: Simplify nemo image
ko3n1g Jun 16, 2025
df7401b
Merge branch 'ko3n1g/build/simplify-nemo-image' into 'main'
ko3n1g Jun 16, 2025
2b1c2d6
ADLR/megatron-lm!3272 - Make completions endpoint use MCore inference…
santhnm2 Jun 16, 2025
c40f31f
Merge branch 'completions_endpoint_fix' into 'main'
ko3n1g Jun 16, 2025
2b11af0
ADLR/megatron-lm!3420 - Implement dist-ckpt content versioning
mikolajblaz Jun 16, 2025
83a0f5a
Merge branch 'mblaz/dist-ckpt-content-versioning' into 'main'
ko3n1g Jun 16, 2025
8c1d0c7
ADLR/megatron-lm!3451 - fix (ckpt): Fix `_extra_state` for TE 2.5
yaox12 Jun 16, 2025
6bf889f
Merge branch 'xiny/fix_extra_state' into 'main'
ko3n1g Jun 16, 2025
6dc6050
ADLR/megatron-lm!3081 - Add Hybrid Shard Data-Parallel Support for Cu…
shjwudp Jun 16, 2025
aad967f
Merge branch 'custom_fsdp_hsdp_support' into 'main'
ko3n1g Jun 16, 2025
c7cf075
ADLR/megatron-lm!3450 - Revert `fork` to `spawn` based on stability i…
sbak5 Jun 16, 2025
c8f2f56
Merge branch 'sbak/ckpt_manager_fix' into 'main'
jaredcasper Jun 16, 2025
f7e4641
ADLR/megatron-lm!3301 - Add kitchen extension with per-layer configur…
kwyss-nvidia Jun 16, 2025
8c15450
Merge branch 'kwyss/megatron_kitchen_extension' into 'main'
jaredcasper Jun 16, 2025
1e8e9a4
ADLR/megatron-lm!3474 - Add deprecation warning for legacy inference
santhnm2 Jun 17, 2025
b87f147
Merge branch 'legacy_deprecation_warning' into 'main'
ko3n1g Jun 17, 2025
ab77e52
ADLR/megatron-lm!3181 - Change naming of original_max_position_embedd…
BoxiangW Jun 17, 2025
2386c6c
Merge branch 'boxiangw/mla-yarn-change-option-name' into 'main'
ericharper Jun 17, 2025
fee5600
ADLR/megatron-lm!3472 - Make cudagraph replay check more descriptive …
mathemakitten Jun 17, 2025
c3dc507
Merge branch 'helenn-flag-specific-error-for-cudagraph-replay' into '…
ericharper Jun 17, 2025
db70ed4
ADLR/megatron-lm!3414 - M4 Taskforce: Disable T5 and encoder_and_deco…
yaoyu-33 Jun 17, 2025
5615930
Merge branch 'yuya/m4_remove_encoder_pp_tests_ci_add_deprecation' int…
ko3n1g Jun 17, 2025
e0b2c60
ADLR/megatron-lm!3444 - Quick fix for NeMo: handle alternate key name…
skierat Jun 17, 2025
bfa39e8
Merge branch 'skierat/quick_nemo_fix' into 'main'
ko3n1g Jun 17, 2025
0e3af7e
ADLR/megatron-lm!3477 - chore: Bump version 0.14.0
ko3n1g Jun 17, 2025
27c9b6c
Merge branch 'ko3n1g/chore/release-version-0.14.0' into 'main'
ericharper Jun 17, 2025
3987e89
ADLR/megatron-lm!3071 - Added offloading support for MCore layers
sanandaraj5597 Jun 17, 2025
4a91173
Merge branch 'lora_offload' into 'main'
ericharper Jun 17, 2025
115785f
ADLR/megatron-lm!3437 - Bug fix to reset kv chunks assigned to -1 and…
shanmugamr1992 Jun 18, 2025
3b0f763
Merge branch 'bugFixDE' into 'main'
shanmugamr1992 Jun 18, 2025
642a181
ADLR/megatron-lm!3483 - chore: Add init to tools
ko3n1g Jun 18, 2025
0710137
Merge branch 'ko3n1g/chore/tool-init' into 'main'
ko3n1g Jun 18, 2025
171c351
ADLR/megatron-lm!3480 - Fix unit test test_fp8_param.py blockwise sca…
guyueh1 Jun 18, 2025
57082f9
Merge branch 'fix_2425' into 'main'
ko3n1g Jun 18, 2025
9f1c4b2
ADLR/megatron-lm!3492 - chore: Add init to examples
ko3n1g Jun 18, 2025
6ac5633
Merge branch 'ko3n1g/chore/examples-init' into 'main'
ko3n1g Jun 18, 2025
2074d19
ADLR/megatron-lm!3493 - build: Force pin down setuptools
ko3n1g Jun 18, 2025
0600a3c
Merge branch 'ko3n1g/build/fix-setuptools-version' into 'main'
ko3n1g Jun 18, 2025
a002d50
ADLR/megatron-lm!3341 - Pad input tensors and enable fp8 weights for …
santhnm2 Jun 18, 2025
6a6cd47
Merge branch 'fp8_inference' into 'main'
ko3n1g Jun 18, 2025
2151c65
ADLR/megatron-lm!3398 - M4 Taskforce: Add HyperCommGrid: N-Dimensiona…
yaoyu-33 Jun 26, 2025
45400df
Merge branch 'yuya/m4_hyper_comm_grid' into 'main'
chtruong814 Jun 26, 2025
db59202
ADLR/megatron-lm!3508 - Pass strict=False to load_checkpoint in infer…
mathemakitten Jun 26, 2025
1ab876d
Merge branch 'helenn-allow-loading-unstrict-checkpoint' into 'main'
deepakn94 Jun 26, 2025
9964092
ADLR/megatron-lm!3526 - Skip fused rope check if te version < 1.4.0
BoxiangW Jun 27, 2025
878d65f
Merge branch 'boxiangw/skip-te-fused-rope-test' into 'main'
ko3n1g Jun 27, 2025
e2d16c0
ADLR/megatron-lm!3529 - ci: Misc refactorings
ko3n1g Jun 27, 2025
cc3ed64
Merge branch 'ko3n1g/chore/some-fixes' into 'main'
ko3n1g Jun 27, 2025
1e42279
ADLR/megatron-lm!3284 - Add option to load main params from checkpoin…
kunlunl Jun 27, 2025
c203e6a
Merge branch 'kunlunl/load_main_params_from_ckpt' into 'main'
ko3n1g Jun 27, 2025
881dfe4
ADLR/megatron-lm!3328 - MiMO VLM training example and functional tests
yashaswikarnati Jun 28, 2025
6b70889
Merge branch 'yash/mimo_train_loop_mr' into 'main'
ko3n1g Jun 28, 2025
4ba4542
ADLR/megatron-lm!3539 - test: Disable apex tests
ko3n1g Jun 30, 2025
d125627
Merge branch 'ko3n1g/test/disable-apex-tests' into 'main'
ko3n1g Jun 30, 2025
5e34e9c
ADLR/megatron-lm!3533 - Added double buffering switch for offloading
sanandaraj5597 Jun 30, 2025
8a416d0
Merge branch 'double_buffering_interface' into 'main'
jaredcasper Jun 30, 2025
7fd003f
ADLR/megatron-lm!3440 - Add vp_stage attr to FSDP wrapper.
cspades Jul 1, 2025
5e0e2c7
Merge branch 'cye/fsdp-vp-stage-fix' into 'main'
ericharper Jul 1, 2025
6d5670e
ADLR/megatron-lm!3544 - tests: Disable Apex tests (part 2)
ko3n1g Jul 1, 2025
805f3b8
Merge branch 'ko3n1g/tests/disable-apex-tests-2' into 'main'
ko3n1g Jul 1, 2025
e392d40
ADLR/megatron-lm!3456 - Fix num_warmup_microbatches for PP=1 CUDA gra…
buptzyb Jul 1, 2025
c237a3d
Merge branch 'robinz/fix_schedule' into 'main'
ko3n1g Jul 1, 2025
8e7428e
ADLR/megatron-lm!3547 - tests: Remove multimodal test
ko3n1g Jul 1, 2025
720ea36
Merge branch 'ko3n1g/ci/nightlies' into 'main'
ko3n1g Jul 1, 2025
f06fa41
ADLR/megatron-lm!3549 - build: Guard modelopt on macOS
ko3n1g Jul 1, 2025
76144fe
Merge branch 'ko3n1g/build/guard-modelopt' into 'main'
ko3n1g Jul 1, 2025
4c092ba
ADLR/megatron-lm!3525 - Fix TE version change on rope_fusion
BoxiangW Jul 2, 2025
683895b
Merge branch 'boxiangw/te-rope-fusion-fix' into 'main'
ko3n1g Jul 2, 2025
106ca9b
ADLR/megatron-lm!3554 - ci: Retry on `Call to CUDA function failed.`
ko3n1g Jul 2, 2025
809aab6
Merge branch 'ko3n1g/ci/restart-cuda-error' into 'main'
ko3n1g Jul 2, 2025
915ae4c
tests(hotfix): Update golden values file
ko3n1g Jul 2, 2025
6d1e2d7
ADLR/megatron-lm!3545 - Fix FSDP-double-buffer
youngeunkwon0405 Jul 2, 2025
6f6968f
Merge branch 'fix_fsdp_double_buffer' into 'main'
ko3n1g Jul 2, 2025
f7ba245
ADLR/megatron-lm!3557 - Fix 'apex.contrib.nccl_allocator' has no attr…
youngeunkwon0405 Jul 2, 2025
b61e211
Merge branch 'fix_nccl_allocator_error' into 'main'
ko3n1g Jul 2, 2025
a82fa72
ADLR/megatron-lm!3478 - Fix zero grad_norm when enabling precision-aw…
BestJuly Jul 2, 2025
dc65034
Merge branch 'lit/fix_zero_grad_norm' into 'main'
ko3n1g Jul 2, 2025
4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
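
As a quick check of what these four settings amount to, here is a minimal sketch that parses the same text with Python's standard `configparser`. This is not flake8's own loader, and the helper name `load_flake8_settings` is hypothetical. It makes two things visible: `E501` (line too long) is in `extend-ignore`, so the `max-line-length = 100` limit would not be reported by that pycodestyle check, and `F401` is already ignored globally, so the `__init__.py` per-file entry appears redundant as written.

```python
import configparser

# The .flake8 contents exactly as added in this diff.
FLAKE8_CONFIG = """\
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
"""


def load_flake8_settings(text):
    """Parse the INI text into plain Python values (illustrative helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["flake8"]
    return {
        "max_line_length": section.getint("max-line-length"),
        "extend_ignore": [code.strip() for code in section["extend-ignore"].split(",")],
        # Each whitespace-separated entry has the form "<file pattern>:<codes>".
        "per_file_ignores": dict(
            entry.split(":", 1) for entry in section["per-file-ignores"].split()
        ),
    }


settings = load_flake8_settings(FLAKE8_CONFIG)
# True when the line-length check itself is switched off.
line_length_check_disabled = "E501" in settings["extend_ignore"]
print(settings, line_length_check_disabled)
```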
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,32 @@
---
name: BUG
about: Report a bug that needs attention
title: "[BUG]"
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Stack trace/logs**
If applicable, add the stack trace or logs from the time of the error.

**Environment (please complete the following information):**
- Megatron-LM commit ID
- PyTorch version
- CUDA version
- NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
23 changes: 23 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,23 @@
---
name: ENHANCEMENT
about: Suggest an idea to improve this project
title: "[ENHANCEMENT]"
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Proposed implementation**
If you have a proposed implementation for the feature state it here or link to a PR.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement
  request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see.

**New performance**
What speed or accuracy do you see after the update.

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
31 changes: 31 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,31 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests

on:
  schedule:
  - cron: '15 18 * * *'

jobs:
  stale:

    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write

    steps:
    - uses: actions/stale@v5
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        days-before-stale: 60
        stale-issue-message: 'Marking as stale. No activity in 60 days.'
        stale-pr-message: 'Marking as stale. No activity in 60 days.'
        stale-issue-label: 'stale'
        stale-pr-label: 'stale'
        remove-stale-when-updated: true
        operations-per-run: 1000
        days-before-close: -1
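
The policy this workflow encodes can be sketched in a few lines. This is a simplified model under stated assumptions, not the `actions/stale` implementation: the function `classify` and its arguments are hypothetical, and idle time is measured from a single last-activity timestamp. The schedule `15 18 * * *` runs once a day at 18:15 UTC. As documented for `actions/stale`, a negative `days-before-close` means stale items are never closed automatically, so although the header comment says the workflow "warns and then closes", this configuration only applies the `stale` label.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the workflow's `with:` block.
DAYS_BEFORE_STALE = 60
DAYS_BEFORE_CLOSE = -1  # negative: never close automatically


def classify(last_activity, now, is_stale_labeled=False):
    """Return the action a daily run would take for one issue or PR (simplified)."""
    idle = now - last_activity
    if is_stale_labeled:
        # Closing is considered only when days-before-close is non-negative.
        if DAYS_BEFORE_CLOSE >= 0 and idle >= timedelta(days=DAYS_BEFORE_CLOSE):
            return "close"
        return "keep-stale"
    if idle >= timedelta(days=DAYS_BEFORE_STALE):
        return "mark-stale"
    return "active"


now = datetime(2025, 7, 2, 18, 15, tzinfo=timezone.utc)
print(classify(now - timedelta(days=10), now))         # recently updated item
print(classify(now - timedelta(days=61), now))         # idle past the threshold
print(classify(now - timedelta(days=400), now, True))  # already labeled stale
```

With `remove-stale-when-updated: true`, any new activity on a labeled item removes the label again, which in this model corresponds to calling `classify` with a fresh `last_activity` and `is_stale_labeled=False`.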
15 changes: 14 additions & 1 deletion .gitignore
@@ -1,2 +1,15 @@
__pycache__

*.so
build
.coverage_*
*.egg-info
*~
slurm*
logs
.vscode
local/
.gitmodules
wandb/
onelogger.log
onelogger.err
.venv
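
To see roughly which paths these patterns cover, the sketch below approximates gitignore matching with Python's `fnmatch`. It is deliberately not a full gitignore implementation (no negation, no anchored or `**` patterns), and the function name `is_ignored` and the sample paths are hypothetical. It assumes two rules that hold for the patterns in this file: a pattern without a slash can match any path component, and a trailing slash restricts the match to directories, approximated here as non-final components.

```python
import fnmatch
from pathlib import PurePosixPath

# Patterns as they appear in the updated .gitignore (blank line omitted).
PATTERNS = [
    "__pycache__", "*.so", "build", ".coverage_*", "*.egg-info", "*~",
    "slurm*", "logs", ".vscode", "local/", ".gitmodules", "wandb/",
    "onelogger.log", "onelogger.err", ".venv",
]


def is_ignored(path):
    """Approximate check: does any component of `path` match a pattern?"""
    parts = PurePosixPath(path).parts
    for pattern in PATTERNS:
        dir_only = pattern.endswith("/")
        name = pattern.rstrip("/")
        # Directory-only patterns are tested against parent components only,
        # because the final component may be a regular file.
        candidates = parts[:-1] if dir_only else parts
        if any(fnmatch.fnmatchcase(part, name) for part in candidates):
            return True
    return False


for sample in [
    "megatron/core/__pycache__/utils.cpython-310.pyc",
    "slurm-1234.out",
    "wandb/run-1/output.log",
    "megatron/training/arguments.py",
]:
    print(sample, is_ignored(sample))
```

Note that `.gitmodules` is itself in the list, so a submodule manifest added locally would not show up as an untracked file.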