fix(qwen3_5): shard ViT frames across CP ranks by kcz358 · Pull Request #175 · EvolvingLMMs-Lab/lmms-engine

kcz358 · 2026-05-20T03:32:27Z

What

Updates Qwen3.5 ViT frame parallel dispatch so CP ranks are symmetric sources:

Each CP rank first takes a deterministic contiguous shard of its duplicated local frames.
The existing LPT balancing still runs across the flat dp×cp group.
Reverse dispatch returns each CP rank's frame shard.
CP ranks then use an autograd-aware variable-length all-gather to reconstruct the full local-dp feature sequence before masked scatter.
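
The first and last steps above can be sketched in plain Python. This is a minimal model of the data layout only, not the PR's code: the helper names `contiguous_shard` and `gather_in_rank_order` are hypothetical, the even-split-with-remainder rule is an assumption about what "deterministic contiguous shard" means, and the real all-gather is a distributed, autograd-aware collective rather than a list concatenation.

```python
from typing import List, Sequence, Tuple


def contiguous_shard(num_frames: int, cp_size: int, cp_rank: int) -> Tuple[int, int]:
    """Return the [start, end) frame range owned by cp_rank (hypothetical helper).

    The first (num_frames % cp_size) ranks take one extra frame, so the split
    depends only on (num_frames, cp_size, cp_rank) and every rank can compute
    every other rank's range without communication.
    """
    base, extra = divmod(num_frames, cp_size)
    start = cp_rank * base + min(cp_rank, extra)
    end = start + base + (1 if cp_rank < extra else 0)
    return start, end


def gather_in_rank_order(shards: Sequence[List[str]]) -> List[str]:
    """Model a variable-length all-gather: concatenate shards in rank order."""
    full: List[str] = []
    for shard in shards:
        full.extend(shard)
    return full


# Every CP rank starts with the same duplicated list of local frames.
frames = [f"frame{i}" for i in range(7)]
cp_size = 3

shards = []
for rank in range(cp_size):
    start, end = contiguous_shard(len(frames), cp_size, rank)
    shards.append(frames[start:end])

shard_sizes = [len(s) for s in shards]
reconstructed = gather_in_rank_order(shards)
```

Because each rank contributes a shard of possibly different length, the gather has to exchange sizes (or derive them from the same deterministic rule) before exchanging data. Concatenating in rank order restores the original frame order, which the masked scatter relies on.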

Why

The previous implementation made cp_rank=0 the only source and cp_rank>0 pure receivers, then used CP all_reduce as a broadcast. That creates source/receiver autograd graph asymmetry. It works for some qwen3_5 video cases but can deadlock in more complex wrappers such as aero_realtime with SP + ViT frame parallel: cp_rank=0 remains in backward while cp_rank>0 reaches grad clipping.
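
A toy model may help show why graph asymmetry can hang. Blocking collectives are matched across ranks by call order, so if one rank's backward pass issues a collective that the other ranks' backward passes never issue, the ranks end up waiting on different operations. The operation names below are hypothetical; the PR does not state which collectives were involved, and `first_mismatch` is an illustrative helper, not part of lmms-engine.

```python
from typing import Dict, List, Optional


def first_mismatch(schedules: Dict[int, List[str]]) -> Optional[int]:
    """Return the first step at which ranks disagree on the collective, else None.

    A disagreement models a hang: each rank blocks in a different collective
    and none of them can complete.
    """
    ranks = sorted(schedules)
    longest = max(len(schedules[r]) for r in ranks)
    for step in range(longest):
        ops = {
            schedules[r][step] if step < len(schedules[r]) else "<done>"
            for r in ranks
        }
        if len(ops) > 1:
            return step
    return None


# Hypothetical schedules. In the asymmetric case only cp_rank=0 owns a source
# branch in its autograd graph, so only it runs the extra backward collective.
asymmetric = {
    0: ["fwd:dispatch", "fwd:cp_all_reduce", "bwd:cp_all_reduce",
        "bwd:dispatch", "grad_clip:all_reduce"],
    1: ["fwd:dispatch", "fwd:cp_all_reduce", "bwd:cp_all_reduce",
        "grad_clip:all_reduce"],
}

# In the symmetric case every CP rank is a source, so the schedules match.
symmetric = {
    0: ["fwd:dispatch", "fwd:cp_all_gather", "bwd:cp_all_gather",
        "bwd:dispatch", "grad_clip:all_reduce"],
    1: ["fwd:dispatch", "fwd:cp_all_gather", "bwd:cp_all_gather",
        "bwd:dispatch", "grad_clip:all_reduce"],
}

asymmetric_hang_step = first_mismatch(asymmetric)
symmetric_hang_step = first_mismatch(symmetric)
```

In the asymmetric schedule the ranks diverge at the step where rank 0 is still in a backward collective and rank 1 has moved on to grad clipping, which is the symptom described above.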

This keeps the same global dp×cp load balancing but removes cp source/receiver asymmetry.
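
For readers unfamiliar with the balancing step, here is a minimal sketch assuming LPT refers to the longest-processing-time-first greedy heuristic. The function `lpt_assign`, the integer costs (for example, patches per frame), and the two-worker setup are illustrative assumptions; in the PR the workers would be the ranks of the flat dp×cp group.

```python
import heapq
from typing import List, Sequence


def lpt_assign(costs: Sequence[int], num_workers: int) -> List[List[int]]:
    """Greedy LPT: visit items by descending cost and give each one to the
    currently least-loaded worker. Returns item indices per worker."""
    assignment: List[List[int]] = [[] for _ in range(num_workers)]
    # Min-heap of (current load, worker id); ties go to the lower worker id.
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    # Sort by descending cost, breaking ties by index so the result is deterministic.
    order = sorted(range(len(costs)), key=lambda i: (-costs[i], i))
    for item in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(item)
        heapq.heappush(heap, (load + costs[item], worker))
    return assignment


costs = [8, 7, 6, 5, 4]  # hypothetical per-frame costs
assignment = lpt_assign(costs, num_workers=2)
loads = [sum(costs[i] for i in items) for items in assignment]
```

Determinism matters here: every rank must arrive at the same assignment independently, otherwise the dispatch and reverse dispatch would disagree about who holds which frame.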

Validation

Verified syntax with .
Manually validated in training: aero_realtime with SP + vit_frame_parallel no longer deadlocks; qwen3_5 SP + vit_frame_parallel + video still runs.

fix(qwen3_5): shard ViT frames across CP ranks

47ed66d

kcz358 merged commit 1a898a2 into main May 20, 2026
3 checks passed

kcz358 deleted the fix/qwen3-5-vit-cp-shard branch May 20, 2026 03:35
