Skip to content

[pull] main from inclusionAI:main#62

Merged
pull[bot] merged 3 commits into
axistore80-coder:mainfrom
areal-project:main
May 6, 2026
Merged

[pull] main from inclusionAI:main#62
pull[bot] merged 3 commits into
axistore80-coder:mainfrom
areal-project:main

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented May 6, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

garrett4wade and others added 3 commits May 6, 2026 16:09
* fix(sft): report CP-invariant token-count stats (#1242)

After PR #1223 introduced CP-local loss, compute_packed_sft_loss
started recording `n_tokens`, `n_valid_tokens` and `prompt_tokens`
using the CP-split `loss_mask` / `logprobs`. These denominators are
summed only across the DP group at export time, so the reported
values under-count by the CP factor (e.g. ~4x smaller with CP=4).
The ratios reported via `stat(..., denominator=...)` (loss, ppl,
vocab_*) remain correct because numerator and denominator scale
together, so the issue is easy to miss.

Preserve the pre-CP-split loss_mask as `_global_loss_mask` when the
CP-local path constructs its inputs, and use it as the denominator
for the token-count metrics so they are invariant to the CP topology.
Keep separate `n_tokens_local` / `n_valid_tokens_local` denominators
with CP-local shapes for the CP-local tensors (`logprobs`, `vocab_*`),
since `stats_tracker.stat` requires matching shapes.

Verified on 64-GPU (CP=2) and 128-GPU (CP=4) SPMD runs with the same
Qwen3-30B + swe_distilled_1000 setup: step-1 `n_tokens` matches
exactly across both topologies (6,320,900), and equals
`CP * n_tokens_local` on each, confirming the fix is topology-invariant.

Fixes #1242

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style: fix 'Loggin stats' typo

Noticed by Copilot reviewer on PR #1249.

* fix(sft): make CP-local SFT loss/ppl/vocab_* stats CP-invariant

After PR #1223 introduced CP-local loss, ratio metrics in
compute_packed_sft_loss (loss/entropy/ppl/vocab_*) started reporting
the average over rank 0's CP slice rather than the global average,
because both numerator and denominator were CP-local tensors and
export only reduced across DP. Counts were already fixed in PR #1242,
but the ratios were not -- empirically loss/ppl/max can drift up to
25% (e.g. ppl/max 2.57 vs 3.20) on long-context SFT where prompt /
completion tokens land in different CP slices.

Fix the reporting without reintroducing the expensive logits
all-gather that #1223 removed: at export time, the per-key reduce
sees only scalar numerator/denominator (already .sum()-ed), so
all-reducing across DP + CP costs a few bytes -- not 37GB of logits.

Key changes:
- stats_tracker: add per-key reduce_group override (kw-only) on
  denominator/scalar/stat; _avg/_min/_max/_sum/SCALAR honor it via
  _effective_reduce_group; reset clears it; fix latent reduce_types
  pop bug in single-key export path.
- megatron_engine: in CP-local forward_step, expose
  _cp_reduce_group (CP) and _cp_dp_reduce_group (DP+CP) on cp_inputs.
- lm_engine: use _cp_reduce_group to all-reduce per-sequence
  seqlogp/valid-count so ppl is CP-invariant; use
  _cp_dp_reduce_group as reduce_group for loss/entropy/vocab_*
  stats so the global mean is reported.

Verified on a 64-GPU CP=2 SFT replay (Qwen3-30B-A3B + scale-swe data,
seed=1, BS=128): with the fix, loss/avg, ppl/avg/max, vocab_*/avg
match the pre-CP-local reference run to within 0.06%; grad_norm and
n_tokens/n_valid_tokens are unchanged (the latter remains as fixed
in PR #1242).

Refs: #1242

* fix: reassemble CP packed sequences

* chore: revert unnecessary changes

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: 博惟 <bowei.fw@antgroup.com>
@pull pull Bot locked and limited conversation to collaborators May 6, 2026
@pull pull Bot added the ⤵️ pull label May 6, 2026
@pull pull Bot merged commit de55b4c into axistore80-coder:main May 6, 2026
3 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants