[pull] main from inclusionAI:main#62
Merged
Merged
Conversation
* fix(sft): report CP-invariant token-count stats (#1242) After PR #1223 introduced CP-local loss, compute_packed_sft_loss started recording `n_tokens`, `n_valid_tokens` and `prompt_tokens` using the CP-split `loss_mask` / `logprobs`. These denominators are summed only across the DP group at export time, so the reported values under-count by the CP factor (e.g. ~4x smaller with CP=4). The ratios reported via `stat(..., denominator=...)` (loss, ppl, vocab_*) remain correct because numerator and denominator scale together, so the issue is easy to miss. Preserve the pre-CP-split loss_mask as `_global_loss_mask` when the CP-local path constructs its inputs, and use it as the denominator for the token-count metrics so they are invariant to the CP topology. Keep separate `n_tokens_local` / `n_valid_tokens_local` denominators with CP-local shapes for the CP-local tensors (`logprobs`, `vocab_*`), since `stats_tracker.stat` requires matching shapes. Verified on 64-GPU (CP=2) and 128-GPU (CP=4) SPMD runs with the same Qwen3-30B + swe_distilled_1000 setup: step-1 `n_tokens` matches exactly across both topologies (6,320,900), and equals `CP * n_tokens_local` on each, confirming the fix is topology-invariant. Fixes #1242 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * style: fix 'Loggin stats' typo Noticed by Copilot reviewer on PR #1249. * fix(sft): make CP-local SFT loss/ppl/vocab_* stats CP-invariant After PR #1223 introduced CP-local loss, ratio metrics in compute_packed_sft_loss (loss/entropy/ppl/vocab_*) started reporting the average over rank 0's CP slice rather than the global average, because both numerator and denominator were CP-local tensors and export only reduced across DP. Counts were already fixed in PR #1242, but the ratios were not -- empirically loss/ppl/max can drift up to 25% (e.g. ppl/max 2.57 vs 3.20) on long-context SFT where prompt / completion tokens land in different CP slices. Fix the reporting without reintroducing the expensive logits all-gather that #1223 removed: at export time, the per-key reduce sees only scalar numerator/denominator (already .sum()-ed), so all-reducing across DP + CP costs a few bytes -- not 37GB of logits. Key changes: - stats_tracker: add per-key reduce_group override (kw-only) on denominator/scalar/stat; _avg/_min/_max/_sum/SCALAR honor it via _effective_reduce_group; reset clears it; fix latent reduce_types pop bug in single-key export path. - megatron_engine: in CP-local forward_step, expose _cp_reduce_group (CP) and _cp_dp_reduce_group (DP+CP) on cp_inputs. - lm_engine: use _cp_reduce_group to all-reduce per-sequence seqlogp/valid-count so ppl is CP-invariant; use _cp_dp_reduce_group as reduce_group for loss/entropy/vocab_* stats so the global mean is reported. Verified on a 64-GPU CP=2 SFT replay (Qwen3-30B-A3B + scale-swe data, seed=1, BS=128): with the fix, loss/avg, ppl/avg/max, vocab_*/avg match the pre-CP-local reference run to within 0.06%; grad_norm and n_tokens/n_valid_tokens are unchanged (the latter remains as fixed in PR #1242). Refs: #1242 * fix: reassemble CP packed sequences * chore: revert unnecessary changes --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: 博惟 <bowei.fw@antgroup.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )