[KMCompiler] [TLERaw] Add WS thread budget repro cases by Zhang-kg · Pull Request #713 · flagos-ai/FlagTree

Zhang-kg · 2026-06-23T06:16:35Z

Summary

This PR adds two self-contained repro cases for the TLE warp-specialization thread-budget issue:

python/test/tle/integration/isolated_repros/oor384_receiver_w4
python/test/tle/integration/isolated_repros/oor384_receiver_w4_allwarps

Both cases reproduce:

triton.runtime.errors.OutOfResources:
out of resource: threads, Required: 384, Hardware limit: 256.

The repros are intentionally isolated. They do not depend on the previous external megakernel-moe experiment directory.

Environment

Validated locally on:

GPU: NVIDIA H100 80GB HBM3
CUDA: 12.8
Python: 3.10
NVSHMEM: 3.4.5 (nvidia-nvshmem-cu12==3.4.5)
MPI launcher: Open MPI 4.1.2
Triton: FlagTree PR682-based Triton with TLE raw NVSHMEM support

Repro Cases

1. oor384_receiver_w4

This case configures:

default / dispatch partition: 4 warps
receiver worker partition: 4 warps
compute worker partition: 4 warps
total: 12 warps = 384 threads

The receiver partition is allocated 4 warps, but only warp_id == 0 executes the receiver logic.

Result:

Required: 384, Hardware limit: 256

2. oor384_receiver_w4_allwarps

This case removes the warp_id == 0 guard and lets the 4 receiver warps participate in receiver work distribution.

Result is still:

Required: 384, Hardware limit: 256

This shows that the OutOfResources (OOR) error does not depend on whether the receiver partition internally uses only warp 0 or all 4 of its warps. The issue is the total WS role thread budget:

4 + 4 + 4 warps = 12 warps = 384 threads

while the compiled function reports:

CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 256
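
The arithmetic behind this mismatch can be sketched in a few lines of Python. This is only an illustration, not code from the repro directories: the helper name ws_thread_budget and the role labels are hypothetical, and a warp size of 32 threads is assumed (as on NVIDIA GPUs).

```python
WARP_SIZE = 32  # assumption: 32 threads per warp, as on NVIDIA GPUs


def ws_thread_budget(role_warps, max_threads_per_block):
    """Return (required_threads, fits) for a set of WS role partitions.

    role_warps maps a (hypothetical) role label to its warp count.
    """
    total_warps = sum(role_warps.values())
    required_threads = total_warps * WARP_SIZE
    return required_threads, required_threads <= max_threads_per_block


# Warp counts taken from the PR description: 4 + 4 + 4 warps.
roles = {"default_dispatch": 4, "receiver": 4, "compute": 4}
required, fits = ws_thread_budget(roles, max_threads_per_block=256)
print(f"Required: {required}, Hardware limit: 256, fits: {fits}")
```

Under the same assumption, a 256-thread limit corresponds to 8 warps, so the 12-warp configuration exceeds it by 4 warps regardless of how the receiver warps are used internally.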

How To Reproduce

Set local environment variables:

export CUDA_HOME=/usr/local/cuda-12.8
export NVSHMEM_HOME=/path/to/nvshmem
export LD_LIBRARY_PATH="$NVSHMEM_HOME/lib:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH:-}"
export CPATH="${CUDA_HOME}/targets/x86_64-linux/include:$NVSHMEM_HOME/include:${CPATH:-}"
export PYTHON_BIN=/path/to/python

Run receiver-w4:

cd python/test/tle/integration/isolated_repros/oor384_receiver_w4

PYTHONNOUSERSITE=1 \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONPATH=/path/to/FlagTree/python:$PWD \
"$PYTHON_BIN" repro_receiver_w4.py

Run receiver-w4-allwarps:

cd python/test/tle/integration/isolated_repros/oor384_receiver_w4_allwarps

PYTHONNOUSERSITE=1 \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONPATH=/path/to/FlagTree/python:$PWD \
NVCC=$PWD/nvcc_flock_wrapper.sh \
TRITON_CACHE_DIR=/tmp/tle_ws_oor384_receiver_w4_allwarps_selfcontained \
"$PYTHON_BIN" repro_receiver_w4_allwarps.py

Expected result for both:

triton.runtime.errors.OutOfResources:
out of resource: threads, Required: 384, Hardware limit: 256.
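
When scripting the two runs, the expected result can be checked by parsing the error text. The following is a minimal sketch, assuming the message keeps exactly the format quoted above; the helper name parse_oor_threads is hypothetical and not part of Triton or of this PR.

```python
import re

# Matches only the message format quoted in this PR description.
_OOR_THREADS = re.compile(
    r"out of resource: threads, Required: (\d+), Hardware limit: (\d+)"
)


def parse_oor_threads(message):
    """Return (required, limit) as ints, or None if the text does not match."""
    match = _OOR_THREADS.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


expected_message = (
    "triton.runtime.errors.OutOfResources:\n"
    "out of resource: threads, Required: 384, Hardware limit: 256."
)
parsed = parse_oor_threads(expected_message)
print(parsed)
```

A wrapper around either repro script could compare the parsed pair against (384, 256) to confirm that the run hit the thread-budget error rather than some unrelated failure.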

Notes

This PR does not claim that this is, by itself, a compiler bug. The repros document a concrete TLE WS thread-budget boundary:

dispatch/default 4 warps + receiver 4 warps + compute 4 warps

currently results in a 384-thread kernel requirement, while the compiled function is limited to 256 threads per block.

CLAassistant · 2026-06-23T06:16:44Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ lizhangyu258
❌ Zhang-kg

lizhangyu258 added 7 commits June 10, 2026 00:37

support nvshmem

011ab66

add nvshmem/example

9588e13

add macro define parameters

fd24dca

merge tle_raw.call and libdevice.call

396925c

refactor cuda jit nvcc compile

c56b174

register make_cubin during the first initialization

2f7ef63

add test case

8cea58e

github-actions Bot added nvidia triton_v3.6.x labels Jun 23, 2026

Zhang-kg force-pushed the tleraw-ws-thread-budget-repros branch 2 times, most recently from 5005b55 to b60113f Compare June 23, 2026 06:21

Zhang-kg added 2 commits June 23, 2026 06:27

[KMCompiler] [TLERaw] Add WS thread budget repro cases

b60113f

oor_min_reproduce_without_nvshmem

1b1ecc5

Zhang-kg wants to merge 9 commits into flagos-ai:triton_v3.6.x from Zhang-kg:tleraw-ws-thread-budget-repros

3 participants
