fix(qwen3_5): shard ViT frames across CP ranks#175

Merged
kcz358 merged 1 commit into main from fix/qwen3-5-vit-cp-shard on May 20, 2026

@kcz358 kcz358 commented May 20, 2026

What

Updates Qwen3.5 ViT frame parallel dispatch so CP ranks are symmetric sources:

  • Each CP rank first takes a deterministic contiguous shard of its duplicated local frames.
  • The existing LPT balancing still runs across the flat dp×cp group.
  • Reverse dispatch returns each CP rank's frame shard.
  • CP ranks then use an autograd-aware variable-length all-gather to reconstruct the full local-dp feature sequence before masked scatter.
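The first two steps above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the function names `contiguous_shard` and `lpt_assign` are hypothetical, and the per-frame cost used for balancing (for example, patch tokens per frame) is an assumption, since the description does not say what the LPT balancer weighs.

```python
import heapq


def contiguous_shard(num_frames: int, cp_size: int, cp_rank: int) -> range:
    """Deterministic contiguous slice of frame indices owned by one CP rank.

    The first (num_frames % cp_size) ranks get one extra frame, so every
    rank can compute every other rank's slice without communication.
    """
    base, extra = divmod(num_frames, cp_size)
    start = cp_rank * base + min(cp_rank, extra)
    stop = start + base + (1 if cp_rank < extra else 0)
    return range(start, stop)


def lpt_assign(costs: list[int], num_workers: int) -> list[list[int]]:
    """Longest-processing-time-first: give each item, heaviest first,
    to the currently least-loaded worker. Returns item indices per worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    buckets: list[list[int]] = [[] for _ in range(num_workers)]
    # Sort by descending cost; break ties by index to stay deterministic.
    for idx in sorted(range(len(costs)), key=lambda i: (-costs[i], i)):
        load, w = heapq.heappop(heap)
        buckets[w].append(idx)
        heapq.heappush(heap, (load + costs[idx], w))
    return buckets


# Example: 10 duplicated local frames split across 4 CP ranks.
shards = [list(contiguous_shard(10, 4, r)) for r in range(4)]
# Example: balance 6 frames with assumed costs across 2 workers.
buckets = lpt_assign([5, 3, 3, 2, 2, 1], 2)
```

Because the shard boundaries depend only on `num_frames`, `cp_size`, and `cp_rank`, every CP rank contributes frames to the balancer, which is what makes all ranks sources.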

Why

The previous implementation made cp_rank=0 the only source and cp_rank>0 pure receivers, then used a CP all_reduce as a broadcast. That makes the autograd graphs asymmetric between the source rank and the receiver ranks. It works for some qwen3_5 video cases but can deadlock in more complex wrappers such as aero_realtime with SP + ViT frame parallel: cp_rank=0 is still in backward while cp_rank>0 has already reached grad clipping.

This change keeps the same global dp×cp load balancing but removes the CP source/receiver asymmetry.
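To make the symmetry concrete, here is a single-process simulation of a variable-length all-gather and its gradient, written in plain Python rather than `torch.distributed`. It is a sketch under assumptions: the helper names are hypothetical, the pad-gather-trim scheme is one common way to gather unequal shards and may differ from what this PR does, and lists of floats stand in for feature tensors.

```python
def all_gather_varlen(shards: list[list[float]]) -> tuple[list[float], list[int]]:
    """Simulate pad -> gather -> trim; shards[r] is rank r's local features."""
    lengths = [len(s) for s in shards]            # 1. exchange shard lengths
    max_len = max(lengths, default=0)
    # 2. pad every shard to the same size so a fixed-size gather is possible
    padded = [s + [0.0] * (max_len - len(s)) for s in shards]
    full: list[float] = []
    for r, buf in enumerate(padded):              # 3. trim padding, concatenate
        full.extend(buf[: lengths[r]])
    return full, lengths


def all_gather_varlen_backward(
    grads_per_rank: list[list[float]], lengths: list[int], rank: int
) -> list[float]:
    """Gradient for `rank`'s shard: the sum, over all ranks, of each rank's
    gradient restricted to that shard's slice of the gathered sequence."""
    start = sum(lengths[:rank])
    stop = start + lengths[rank]
    return [sum(g[i] for g in grads_per_rank) for i in range(start, stop)]


# Three simulated CP ranks holding shards of different lengths.
full, lengths = all_gather_varlen([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
# Each rank produces a gradient for the full gathered sequence.
grads = [[1.0] * 6, [2.0] * 6, [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]
grad_rank0 = all_gather_varlen_backward(grads, lengths, 0)
grad_rank2 = all_gather_varlen_backward(grads, lengths, 2)
```

In this scheme every rank runs the same forward collective and the same backward reduction, so no rank is left waiting in backward while the others move on.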

Validation

  • Verified syntax with .
  • Manually validated in training: aero_realtime with SP + vit_frame_parallel no longer deadlocks; qwen3_5 SP + vit_frame_parallel + video still runs.

@kcz358 kcz358 merged commit 1a898a2 into main May 20, 2026
3 checks passed
@kcz358 kcz358 deleted the fix/qwen3-5-vit-cp-shard branch May 20, 2026 03:35