[KMCompiler] [TLERaw] Add WS thread budget repro cases#713
Draft
Zhang-kg wants to merge 9 commits into
5005b55 to b60113f
Summary
This PR adds two self-contained repro cases for the TLE warp-specialization (WS) thread-budget issue: oor384_receiver_w4 and oor384_receiver_w4_allwarps.
Both cases reproduce:
triton.runtime.errors.OutOfResources:
out of resource: threads, Required: 384, Hardware limit: 256.
The repros are intentionally isolated. They do not depend on the previous external megakernel-moe experiment directory.
Environment
Validated locally on:
GPU: NVIDIA H100 80GB HBM3
CUDA: 12.8
Python: 3.10
NVSHMEM: 3.4.5 (nvidia-nvshmem-cu12==3.4.5)
MPI launcher: Open MPI 4.1.2
Triton: FlagTree PR682-based Triton with TLE raw NVSHMEM support
Repro Cases
1. oor384_receiver_w4
This case configures:
default / dispatch partition: 4 warps
receiver worker partition: 4 warps
compute worker partition: 4 warps
total: 12 warps = 384 threads
The receiver partition is allocated 4 warps, but only warp_id == 0 executes the receiver logic.
Result:
Required: 384, Hardware limit: 256
2. oor384_receiver_w4_allwarps
This case removes the warp_id == 0 guard and lets the 4 receiver warps participate in receiver work distribution.
Result is still:
Required: 384, Hardware limit: 256
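To illustrate why removing the guard does not change the outcome, here is a minimal host-side Python model of the two receiver variants. It is not the repro kernel: the function names and the round-robin distribution are assumptions made for illustration, since the actual work-distribution scheme is not spelled out here. The point is only that both variants occupy the same 4 allocated warps.

```python
RECEIVER_WARPS = 4  # warps allocated to the receiver partition in both cases


def assign_guarded(num_items):
    """Model of oor384_receiver_w4: only warp 0 runs the receiver logic."""
    work = {warp_id: [] for warp_id in range(RECEIVER_WARPS)}
    work[0] = list(range(num_items))  # warps 1..3 are allocated but idle
    return work


def assign_all_warps(num_items):
    """Model of oor384_receiver_w4_allwarps (round-robin is an assumption)."""
    work = {warp_id: [] for warp_id in range(RECEIVER_WARPS)}
    for item in range(num_items):
        work[item % RECEIVER_WARPS].append(item)
    return work


guarded = assign_guarded(8)
all_warps = assign_all_warps(8)

# The number of allocated warps is identical; only the split of work differs.
print(len(guarded), len(all_warps))
print([len(items) for items in guarded.values()])
print([len(items) for items in all_warps.values()])
```

In this model the guard only decides which warps do useful work; the partition still contributes 4 warps to the block either way.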
This shows that the out-of-resources (OOR) error is independent of whether the receiver partition uses only warp 0 or all 4 of its warps. The issue is the total WS role thread budget:
4 + 4 + 4 warps = 12 warps = 384 threads
while the compiled function reports:
CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 256
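The arithmetic above can be sketched in plain Python. This is a hedged illustration of the budget check, not Triton's launcher code: `required_threads`, `check_thread_budget`, and the partition labels are hypothetical names, and `RuntimeError` stands in for `triton.runtime.errors.OutOfResources`. The warp size of 32 is the standard NVIDIA value.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def required_threads(partition_warps):
    """Total threads per block when every WS partition gets its own warps."""
    return sum(partition_warps.values()) * WARP_SIZE


def check_thread_budget(partition_warps, max_threads_per_block):
    """Raise if the summed partition budget exceeds the per-function limit."""
    required = required_threads(partition_warps)
    if required > max_threads_per_block:
        # Stand-in for the OutOfResources error reported by the repros.
        raise RuntimeError(
            f"out of resource: threads, Required: {required}, "
            f"Hardware limit: {max_threads_per_block}"
        )
    return required


# Partition sizes taken from the repro cases; the key names are illustrative.
partitions = {"default_dispatch": 4, "receiver": 4, "compute": 4}

try:
    check_thread_budget(partitions, max_threads_per_block=256)
except RuntimeError as err:
    message = str(err)
    print(message)
```

Under this model, any combination of partitions totalling more than 8 warps exceeds a 256-thread limit, regardless of how the warps inside each partition are used.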
How To Reproduce
Set local environment variables:
Run receiver-w4:
Run receiver-w4-allwarps:
Expected result for both:
Notes
This PR does not claim this is a compiler bug by itself. The repro documents a concrete TLE WS thread-budget boundary:
dispatch/default 4 warps + receiver 4 warps + compute 4 warps
currently results in a 384-thread kernel requirement, while the compiled function is limited to 256 threads per block.