
[KMCompiler] [TLERaw] Add WS thread budget repro cases #713

Draft
Zhang-kg wants to merge 9 commits into flagos-ai:triton_v3.6.x from Zhang-kg:tleraw-ws-thread-budget-repros
@Zhang-kg

Summary

This PR adds two self-contained repro cases for the TLE warp-specialization thread-budget issue:

python/test/tle/integration/isolated_repros/oor384_receiver_w4
python/test/tle/integration/isolated_repros/oor384_receiver_w4_allwarps

Both cases reproduce:

triton.runtime.errors.OutOfResources:
out of resource: threads, Required: 384, Hardware limit: 256.

The repros are intentionally isolated. They do not depend on the previous external megakernel-moe experiment directory.

Environment

Validated locally on:

GPU: NVIDIA H100 80GB HBM3
CUDA: 12.8
Python: 3.10
NVSHMEM: 3.4.5 (nvidia-nvshmem-cu12==3.4.5)
MPI launcher: Open MPI 4.1.2
Triton: FlagTree PR682-based Triton with TLE raw NVSHMEM support

Repro Cases

1. oor384_receiver_w4

This case configures:

default / dispatch partition: 4 warps
receiver worker partition: 4 warps
compute worker partition: 4 warps
total: 12 warps = 384 threads

The receiver partition is allocated 4 warps, but only warp_id == 0 executes the receiver logic.

Result:

Required: 384, Hardware limit: 256
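
The warp counts listed for this case can be checked with a few lines of plain Python. This is only an arithmetic sketch: the dictionary keys and the helper name `required_threads` are illustrative and are not taken from the repro scripts. It assumes the usual NVIDIA warp size of 32 threads.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

# Warp counts per WS partition, copied from the oor384_receiver_w4 description.
# The key names are illustrative, not identifiers from the repro.
PARTITION_WARPS = {
    "default_dispatch": 4,
    "receiver": 4,
    "compute": 4,
}


def required_threads(partition_warps, warp_size=WARP_SIZE):
    """Threads per block when every partition is given its own warps."""
    return sum(partition_warps.values()) * warp_size


total_warps = sum(PARTITION_WARPS.values())
print(f"{total_warps} warps = {required_threads(PARTITION_WARPS)} threads")
# prints "12 warps = 384 threads"
```

Dropping any one 4-warp partition from the dictionary gives 8 warps = 256 threads, which is exactly the reported limit.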

2. oor384_receiver_w4_allwarps

This case removes the warp_id == 0 guard and lets the 4 receiver warps participate in receiver work distribution.

Result is still:

Required: 384, Hardware limit: 256

This shows that the OOR is independent of whether the receiver partition internally uses only warp 0 or all 4 receiver warps. The issue is the total WS role thread budget:

4 + 4 + 4 warps = 12 warps = 384 threads

while the compiled function reports:

CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 256
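
The comparison between the two cases can be modelled in plain Python. The sketch below is not Triton or TLE code. It assumes a round-robin split of receiver work items across warps for the all-warps case (the PR does not specify the distribution scheme), and the function names are hypothetical. It illustrates that the number of active receiver warps changes while the allocated thread count does not.

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 256  # value reported for the compiled function in this PR


def receiver_assignment(num_items, receiver_warps, only_warp0):
    """Map each receiver warp to the work items it handles (hypothetical model)."""
    work = {warp: [] for warp in range(receiver_warps)}
    for item in range(num_items):
        # With the guard, warp 0 does everything; otherwise assume round-robin.
        owner = 0 if only_warp0 else item % receiver_warps
        work[owner].append(item)
    return work


def allocated_threads(default_warps, receiver_warps, compute_warps):
    """Threads the launch needs: set by allocated warps, not by active ones."""
    return (default_warps + receiver_warps + compute_warps) * WARP_SIZE


for only_warp0 in (True, False):
    work = receiver_assignment(num_items=8, receiver_warps=4, only_warp0=only_warp0)
    active = sum(1 for items in work.values() if items)
    need = allocated_threads(4, 4, 4)
    print(f"guard={only_warp0}: active receiver warps={active}, "
          f"required threads={need}, over limit={need > MAX_THREADS_PER_BLOCK}")
```

In this model both configurations require 384 threads, which matches the identical error reported by the two repro cases.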

How To Reproduce

Set local environment variables:

export CUDA_HOME=/usr/local/cuda-12.8
export NVSHMEM_HOME=/path/to/nvshmem
export LD_LIBRARY_PATH="$NVSHMEM_HOME/lib:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH:-}"
export CPATH="${CUDA_HOME}/targets/x86_64-linux/include:$NVSHMEM_HOME/include:${CPATH:-}"
export PYTHON_BIN=/path/to/python
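
If the repro is driven from a Python launcher rather than a shell, the same variables can be assembled as a dictionary. This is a minimal sketch: `build_repro_env` is a hypothetical helper that is not part of this PR, and it only mirrors the `export` lines above, including the `${VAR:-}` fallback to an empty string.

```python
import os


def build_repro_env(cuda_home, nvshmem_home, base_env=None):
    """Return a copy of the environment with the repro variables set.

    Hypothetical helper mirroring the shell exports above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_HOME"] = cuda_home
    env["NVSHMEM_HOME"] = nvshmem_home
    # Prepend the new entries and keep any existing value, like ${VAR:-} does.
    env["LD_LIBRARY_PATH"] = (
        f"{nvshmem_home}/lib:{cuda_home}/lib64:{env.get('LD_LIBRARY_PATH', '')}"
    )
    env["CPATH"] = (
        f"{cuda_home}/targets/x86_64-linux/include:"
        f"{nvshmem_home}/include:{env.get('CPATH', '')}"
    )
    return env


env = build_repro_env("/usr/local/cuda-12.8", "/path/to/nvshmem", base_env={})
print(env["LD_LIBRARY_PATH"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when invoking a repro script.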

Run receiver-w4:

cd python/test/tle/integration/isolated_repros/oor384_receiver_w4

PYTHONNOUSERSITE=1 \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONPATH=/path/to/FlagTree/python:$PWD \
"$PYTHON_BIN" repro_receiver_w4.py

Run receiver-w4-allwarps:

cd python/test/tle/integration/isolated_repros/oor384_receiver_w4_allwarps

PYTHONNOUSERSITE=1 \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONPATH=/path/to/FlagTree/python:$PWD \
NVCC=$PWD/nvcc_flock_wrapper.sh \
TRITON_CACHE_DIR=/tmp/tle_ws_oor384_receiver_w4_allwarps_selfcontained \
"$PYTHON_BIN" repro_receiver_w4_allwarps.py

Expected result for both:

triton.runtime.errors.OutOfResources:
out of resource: threads, Required: 384, Hardware limit: 256.
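
When checking the expected result automatically, the message text can be parsed instead of compared by eye. The sketch below works on the message string only and does not import Triton. The regular expression is written against the exact wording shown above, and the helper name `parse_out_of_resources` is an assumption rather than an existing API.

```python
import re

# Pattern written against the message text quoted in this PR.
_OOR_PATTERN = re.compile(
    r"out of resource: (?P<name>[^,]+), "
    r"Required: (?P<required>\d+), "
    r"Hardware limit: (?P<limit>\d+)"
)


def parse_out_of_resources(message):
    """Return (resource name, required, limit), or None if the text does not match."""
    match = _OOR_PATTERN.search(message)
    if match is None:
        return None
    return match["name"], int(match["required"]), int(match["limit"])


msg = "out of resource: threads, Required: 384, Hardware limit: 256."
print(parse_out_of_resources(msg))
# prints ('threads', 384, 256)
```

A repro harness could catch the exception raised by the script, pass `str(exc)` to this helper, and assert that the result equals `("threads", 384, 256)`.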

Notes

This PR does not claim that this behavior is, by itself, a compiler bug. The repro documents a concrete TLE WS thread-budget boundary:

dispatch/default 4 warps + receiver 4 warps + compute 4 warps

currently results in a 384-thread kernel requirement, while the compiled function is limited to 256 threads per block.

@CLAassistant

CLAassistant commented Jun 23, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ lizhangyu258
❌ Zhang-kg

@Zhang-kg Zhang-kg force-pushed the tleraw-ws-thread-budget-repros branch 2 times, most recently from 5005b55 to b60113f Compare June 23, 2026 06:21