[pull] main from inclusionAI:main by pull[bot] · Pull Request #62 · axistore80-coder/AReaL

pull · 2026-05-06T13:21:32Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

* fix(sft): report CP-invariant token-count stats (#1242) After PR #1223 introduced CP-local loss, compute_packed_sft_loss started recording `n_tokens`, `n_valid_tokens` and `prompt_tokens` using the CP-split `loss_mask` / `logprobs`. These denominators are summed only across the DP group at export time, so the reported values under-count by the CP factor (e.g. ~4x smaller with CP=4). The ratios reported via `stat(..., denominator=...)` (loss, ppl, vocab_*) remain correct because numerator and denominator scale together, so the issue is easy to miss. Preserve the pre-CP-split loss_mask as `_global_loss_mask` when the CP-local path constructs its inputs, and use it as the denominator for the token-count metrics so they are invariant to the CP topology. Keep separate `n_tokens_local` / `n_valid_tokens_local` denominators with CP-local shapes for the CP-local tensors (`logprobs`, `vocab_*`), since `stats_tracker.stat` requires matching shapes. Verified on 64-GPU (CP=2) and 128-GPU (CP=4) SPMD runs with the same Qwen3-30B + swe_distilled_1000 setup: step-1 `n_tokens` matches exactly across both topologies (6,320,900), and equals `CP * n_tokens_local` on each, confirming the fix is topology-invariant. Fixes #1242 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * style: fix 'Loggin stats' typo Noticed by Copilot reviewer on PR #1249. * fix(sft): make CP-local SFT loss/ppl/vocab_* stats CP-invariant After PR #1223 introduced CP-local loss, ratio metrics in compute_packed_sft_loss (loss/entropy/ppl/vocab_*) started reporting the average over rank 0's CP slice rather than the global average, because both numerator and denominator were CP-local tensors and export only reduced across DP. Counts were already fixed in PR #1242, but the ratios were not -- empirically loss/ppl/max can drift up to 25% (e.g. ppl/max 2.57 vs 3.20) on long-context SFT where prompt / completion tokens land in different CP slices. Fix the reporting without reintroducing the expensive logits all-gather that #1223 removed: at export time, the per-key reduce sees only scalar numerator/denominator (already .sum()-ed), so all-reducing across DP + CP costs a few bytes -- not 37GB of logits. Key changes: - stats_tracker: add per-key reduce_group override (kw-only) on denominator/scalar/stat; _avg/_min/_max/_sum/SCALAR honor it via _effective_reduce_group; reset clears it; fix latent reduce_types pop bug in single-key export path. - megatron_engine: in CP-local forward_step, expose _cp_reduce_group (CP) and _cp_dp_reduce_group (DP+CP) on cp_inputs. - lm_engine: use _cp_reduce_group to all-reduce per-sequence seqlogp/valid-count so ppl is CP-invariant; use _cp_dp_reduce_group as reduce_group for loss/entropy/vocab_* stats so the global mean is reported. Verified on a 64-GPU CP=2 SFT replay (Qwen3-30B-A3B + scale-swe data, seed=1, BS=128): with the fix, loss/avg, ppl/avg/max, vocab_*/avg match the pre-CP-local reference run to within 0.06%; grad_norm and n_tokens/n_valid_tokens are unchanged (the latter remains as fixed in PR #1242). Refs: #1242 * fix: reassemble CP packed sequences * chore: revert unnecessary changes --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: 博惟 <bowei.fw@antgroup.com>

garrett4wade and others added 3 commits May 6, 2026 16:09

chore: upgrade GCP CI image (#1303)

2498d17

gov: add 0501 community meeting materials (#1306)

de55b4c

pull Bot locked and limited conversation to collaborators May 6, 2026

pull Bot added the ⤵️ pull label May 6, 2026

pull Bot merged commit de55b4c into axistore80-coder:main May 6, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] main from inclusionAI:main#62

[pull] main from inclusionAI:main#62
pull[bot] merged 3 commits into
axistore80-coder:mainfrom
areal-project:main

pull Bot commented May 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pull Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pull Bot commented May 6, 2026 •

edited

Loading