[TLE] Feature/extrac tile strides new by lzllx123 · Pull Request #649 · flagos-ai/FlagTree

lzllx123 · 2026-06-04T02:40:13Z

Summary:

[TLE] Add an optional strides parameter to tle.extract_tile and tle.insert_tile, enabling strided tile extraction and insertion.
Adapted for NVIDIA and GCU backends.

API Changes:

Python: extract_tile(x, index, tile_shape, strides=None) / insert_tile(x, tile, index, strides=None)
MLIR: Tle_ExtractTileOp and Tle_InsertTileOp each gain an optional strides attribute (OptionalAttr:$strides)
Backward compatible: strides defaults to tile_shape when omitted
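
As a rough mental model of the new parameter, the NumPy sketch below treats strides as the step between tile origins, so the tile at `index` starts at `index[d] * strides[d]` along each dimension. This reading is an assumption inferred from the stated default (strides equal to tile_shape gives non-overlapping tiles); `extract_tile_ref` and `insert_tile_ref` are hypothetical helpers written for illustration, not the TLE API, and the real semantics should be confirmed against the PR's unit tests.

```python
import numpy as np

def extract_tile_ref(x, index, tile_shape, strides=None):
    """Hypothetical NumPy model: tile `index` starts at index[d] * strides[d]."""
    if strides is None:
        strides = tile_shape  # assumed default: non-overlapping tiles
    starts = [i * s for i, s in zip(index, strides)]
    window = tuple(slice(b, b + t) for b, t in zip(starts, tile_shape))
    return x[window].copy()

def insert_tile_ref(x, tile, index, strides=None):
    """Hypothetical inverse: copy of x with `tile` written at the same offset."""
    if strides is None:
        strides = tile.shape
    starts = [i * s for i, s in zip(index, strides)]
    window = tuple(slice(b, b + t) for b, t in zip(starts, tile.shape))
    out = x.copy()
    out[window] = tile
    return out

x = np.arange(64).reshape(8, 8)
# Default strides: tile (1, 1) covers rows 4..7 and columns 4..7.
default_tile = extract_tile_ref(x, (1, 1), (4, 4))
# Strides (2, 2): tile (1, 1) covers rows 2..5 and columns 2..5, overlapping its neighbours.
overlap_tile = extract_tile_ref(x, (1, 1), (4, 4), (2, 2))
```

Under this model, strides smaller than tile_shape produce overlapping windows, which is the access pattern a sliding-window operator such as a convolution would need.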

Performance:

New tutorials optimized with tle.extract_tile / tle.insert_tile:

05-glu.py (GLU)
06-2D_Depthwise_Conv.py (2D Depthwise Conv)
07-causal-conv1d.py (Causal Conv1D)
08-rope.py (RoPE)

CI:

Added lightweight correctness tests for tutorials 05-08 in hopper-build-and-test.yml.

sunnycase

Requesting changes before merge:

Please replace the Chinese comments with English. I found Chinese comments in:
- python/triton/experimental/tle/language/gpu/semantic.py:118
- python/test/tle/unit/test_insert_tile_static_index.py:93
- third_party/tle/dialect/lib/Transforms/ExtractTileToLLVM.cpp:40
Comments in source and tests should be readable by all maintainers.

Please expand the PR summary. The current summary only states that strides was added to tle.extract_tile and tle.insert_tile. It should explicitly document the changed APIs, including the Python API surface and any corresponding TLE/MLIR op attribute changes. It should also state which operators show performance improvement and include the measured before/after numbers or speedups, for example GLU, 2D Depthwise Conv, Causal Conv1D, and RoPE if those new tutorials are the performance evidence.

Please add the newly introduced tutorials to CI. This PR adds:
- python/tutorials/tle/05-glu.py
- python/tutorials/tle/06-2D_Depthwise_Conv.py
- python/tutorials/tle/07-causal-conv1d.py
- python/tutorials/tle/08-rope.py
However, the Hopper TLE tutorial workflow still only runs the existing tutorials through 04-cluster-gemm.py plus the DeepSeek examples (.github/workflows/hopper-build-and-test.yml:122 and .github/workflows/hopper-build-and-test.yml:180). Please add lightweight correctness invocations for the new tutorials, such as the existing --only_unit_test paths for 05/06 and non-benchmark correctness runs for 07/08, or add equivalent CI test targets so these examples do not land untested.

lzllx123 · 2026-06-08T09:05:41Z

@sunnycase
All three requested items have been fully resolved:

All Chinese comments in the three listed source and test files have been replaced with English.
The PR description has been expanded to document the modified Python APIs, the corresponding TLE/MLIR op attribute changes, and the measured speedup data for GLU, 2D Depthwise Conv, Causal Conv1D, and RoPE.
Lightweight correctness test targets for the four new tutorial files have been added to the Hopper CI workflow, following the existing --only_unit_test pattern, so that all new tutorials are validated during CI runs.

sunnycase · 2026-06-10T08:37:29Z

Could you please expand the performance section with reproducible detailed data? Right now the PR lists the optimized tutorials/operators, but reviewers still need more information to evaluate the actual benefit, applicable scope, and whether there are any regressions. A per-operator performance table would be very helpful. Please include at least:

Operator/tutorial name, for example GLU, 2D Depthwise Conv, Causal Conv1D, and RoPE.
Input sizes/shapes, dtype, layout, stride configuration, and key parameters such as batch/head/channel.
Baseline version/implementation and this PR implementation, with latency/throughput comparison. Averages plus p50/p90 or stddev would be preferred.
Speedup as a percentage or ratio, and any configurations where performance regresses.
Test conditions: hardware model, driver/CUDA/LLVM/FlagTree commit, backend, compile flags/env vars, benchmark command, warmup/iteration counts, and timing method.
Correctness validation command, plus the location of benchmark scripts or raw logs so reviewers can reproduce the results.

With these details, it will be much easier to review the performance impact and make a confident merge decision.

lzllx123 · 2026-06-11T03:46:05Z

@sunnycase
Thank you for the detailed review! Below are the full benchmark details for reproducibility.

Test Environment

Item	Value
GPU	NVIDIA GeForce RTX 5090
CUDA	12.8
Triton	3.6.0
Timing method	`triton.testing.do_bench` (mean latency, unit: ms)

Correctness Validation & Tolerance

Each benchmark automatically validates numerical equivalence between the TLE kernel and the baseline. Tolerances are set per dtype:

Operator	dtype	rtol	atol
GLU	float32	1e-4	1e-4
GLU	float16	1e-3	1e-2
2D Depthwise Conv	float32	1e-3	1e-3
2D Depthwise Conv	float16	1e-2	1e-2
Causal Conv1D	bfloat16	1e-2	1e-2
RoPE	bfloat16	1e-2	1e-2

Tolerances differ across dtypes because lower-precision formats have fewer mantissa bits (float32: 23 bits, float16: 10 bits, bfloat16: 7 bits), which increases rounding error. Correctness validation runs automatically at the start of each benchmark script, and all configurations listed below passed.
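
To make the link between mantissa bits and tolerance concrete, the sketch below derives the unit roundoff step of each format (2 raised to minus the number of mantissa bits) and applies the usual `allclose`-style criterion `|actual - expected| <= atol + rtol * |expected|`. Whether the benchmark scripts use exactly this criterion is an assumption; the helper names are illustrative only.

```python
# Explicit mantissa (fraction) bits per format, as quoted in the comment above.
MANTISSA_BITS = {"float32": 23, "float16": 10, "bfloat16": 7}

def machine_epsilon(bits):
    """Spacing between 1.0 and the next representable value: 2**-bits."""
    return 2.0 ** -bits

def is_close(actual, expected, rtol, atol):
    """Scalar form of the criterion used by numpy.allclose and torch.allclose."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

eps = {name: machine_epsilon(bits) for name, bits in MANTISSA_BITS.items()}
for name, value in eps.items():
    print(f"{name}: eps = {value:.3e}")

# A single bfloat16 rounding step near 1.0 passes the bfloat16 tolerance
# (rtol=atol=1e-2) but would fail the float32 GLU tolerance (rtol=atol=1e-4).
one_step_off = 1.0 + eps["bfloat16"]
passes_bf16_tolerance = is_close(one_step_off, 1.0, rtol=1e-2, atol=1e-2)
passes_fp32_tolerance = is_close(one_step_off, 1.0, rtol=1e-4, atol=1e-4)
```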

Benchmark commands:

python python/tutorials/tle/05-glu.py --benchmark
python python/tutorials/tle/06-2D_Depthwise_Conv.py --benchmark
python python/tutorials/tle/07-causal-conv1d.py --benchmark
python python/tutorials/tle/08-rope.py --benchmark

Operator 1 — GLU

Baseline: FlagGems/glu
Config: batch=4096, dtype=fp32, Warmup=25, Rep=100, Runs=5

dim	Baseline mean (ms)	Baseline p50	Baseline p90	TLE mean (ms)	TLE p50	TLE p90	Speedup
256	0.0103	0.0102	0.0103	0.0102	0.0102	0.0102	1.00x
512	0.0177	0.0177	0.0178	0.0171	0.0170	0.0173	1.04x
1024	0.0348	0.0348	0.0348	0.0348	0.0348	0.0349	1.00x
2048	0.0722	0.0722	0.0723	0.0657	0.0657	0.0658	1.10x
4096	0.1409	0.1409	0.1411	0.1305	0.1305	0.1305	1.08x

Speedup ranges from 1.00x to 1.10x. Gains are more pronounced at larger dimensions (dim >= 2048). No regressions observed.
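
For readers checking the tables, the Speedup column appears to be the ratio of baseline mean latency to TLE mean latency. That is an assumption about how the column was produced, and some rows may have been computed from unrounded values. A minimal sketch using two rows of the GLU table:

```python
def speedup(baseline_ms, tle_ms):
    """Ratio above 1 means the TLE kernel is faster than the baseline."""
    return baseline_ms / tle_ms

# (dim, baseline mean ms, TLE mean ms) copied from the GLU table above.
glu_rows = [(2048, 0.0722, 0.0657), (4096, 0.1409, 0.1305)]

for dim, base, tle in glu_rows:
    ratio = speedup(base, tle)
    saved = (1.0 - tle / base) * 100.0  # latency reduction in percent
    print(f"dim={dim}: {ratio:.2f}x ({saved:.1f}% lower latency)")
```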

Operator 2 — 2D Depthwise Convolution

Baseline: self-constructed
Config: C=64, K=5x5, layout=HWC, dtype=fp32, Warmup=25, Rep=100, Runs=5

H=W	Baseline mean (ms)	Baseline p50	Baseline p90	TLE mean (ms)	TLE p50	TLE p90	Speedup
112	0.0193	0.0189	0.0199	0.0121	0.0122	0.0123	1.56x
128	0.0205	0.0205	0.0205	0.0123	0.0123	0.0123	1.67x
256	0.0453	0.0453	0.0454	0.0389	0.0389	0.0389	1.16x
512	0.1106	0.1106	0.1106	0.1044	0.1044	0.1044	1.06x

Speedup ranges from 1.06x to 1.67x. Largest gains at smaller spatial sizes (H=W=112/128). No regressions observed.

Operator 3 — Causal Conv1D

Baseline: vLLM v0.13.0 causal_conv1d implementation
https://github.com/vllm-project/vllm/blob/releases/v0.13.0/vllm/model_executor/layers/mamba/ops/causal_conv1d.py
Config: width=4, dtype=bfloat16, Warmup=10, Rep=100, Runs=5

Varlen mode — dim=4096, batch=32

seqlen	Baseline mean (ms)	Baseline p50	Baseline p90	TLE mean (ms)	TLE p50	TLE p90	Speedup
2048	0.3543	0.3543	0.3560	0.3119	0.3123	0.3123	1.13x
4096	0.5609	0.5611	0.5612	0.4803	0.4803	0.4809	1.17x
8192	0.9857	0.9851	0.9867	0.8381	0.8377	0.8391	1.18x
16384	1.8396	1.8401	1.8412	1.5459	1.5462	1.5487	1.19x

Varlen mode — dim=8192, batch=128

seqlen	Baseline mean (ms)	Baseline p50	Baseline p90	TLE mean (ms)	TLE p50	TLE p90	Speedup
2048	0.6292	0.6287	0.6304	0.5614	0.5612	0.5622	1.12x
4096	1.0738	1.0353	1.1510	0.8934	0.8945	0.8956	1.16x
8192	1.8539	1.8534	1.8589	1.5678	1.5688	1.5701	1.18x
16384	3.5206	3.5195	3.5242	2.9483	2.9471	2.9528	1.19x

Update mode — batch=256

dim	Baseline mean (ms)	Baseline p50	Baseline p90	TLE mean (ms)	TLE p50	TLE p90	Speedup
1024	0.0082	0.0082	0.0082	0.0082	0.0082	0.0082	1.00x
2048	0.0102	0.0102	0.0102	0.0102	0.0102	0.0103	1.00x
4096	0.0164	0.0164	0.0164	0.0164	0.0164	0.0164	1.00x
8192	0.0308	0.0308	0.0308	0.0308	0.0308	0.0308	1.00x

Update mode — batch=1024

dim	Baseline mean (ms)	Baseline p50	Baseline p90	TLE mean (ms)	TLE p50	TLE p90	Speedup
1024	0.0164	0.0164	0.0164	0.0164	0.0164	0.0164	1.00x
2048	0.0308	0.0308	0.0308	0.0307	0.0307	0.0307	1.00x
4096	0.0554	0.0554	0.0554	0.0553	0.0553	0.0553	1.00x
8192	0.1024	0.1024	0.1024	0.1006	0.1006	0.1007	1.02x

Varlen mode: consistent 1.12x–1.19x speedup across both dim/batch configs, improving with longer sequences. Update mode is memory-bandwidth-bound at these sizes and shows minimal difference (<=1.02x), which is expected behavior.

Operator 4 — RoPE

Baseline: FlagGems rotary_embedding implementation
https://github.com/FlagOpen/FlagGems/blob/master/src/flag_gems/fused/rotary_embedding.py
Config: dtype=bfloat16, Warmup=20, Rep=200, Rounds=5, Trim=0.2

Non-interleaved (LLaMA style) — Out-of-place

batch	seq_len	q_heads	k_heads	head_dim	Baseline mean (ms)	TLE mean (ms)	Speedup
1	128	32	8	128	0.04154	0.03599	1.15x
1	1024	32	8	128	0.05790	0.04818	1.28x
8	1024	32	8	128	0.15669	0.14546	1.08x
32	128	32	8	128	0.10072	0.08708	1.16x
32	1024	32	8	128	0.48890	0.48106	1.02x
8	1024	16	16	128	0.12941	0.12115	1.07x
32	1024	16	16	128	0.39571	0.39261	1.01x
1	128	32	8	256	0.04215	0.03628	1.16x
8	128	32	8	256	0.06520	0.05900	1.11x
32	128	32	8	256	0.15438	0.14734	1.05x
32	1024	16	16	256	0.75638	0.75895	1.00x

Non-interleaved (LLaMA style) — In-place

batch	seq_len	q_heads	k_heads	head_dim	Baseline mean (ms)	TLE mean (ms)	Speedup
1	128	32	8	128	0.03561	0.02669	1.33x
8	1024	32	8	128	0.16146	0.13192	1.22x
32	128	32	8	128	0.10214	0.07458	1.37x
32	1024	32	8	128	0.51791	0.47204	1.10x
8	1024	16	16	128	0.13435	0.10812	1.24x
32	1024	16	16	128	0.41790	0.38337	1.09x
1	128	32	8	256	0.03510	0.02729	1.28x
8	128	32	8	256	0.05346	0.04381	1.22x
32	128	32	8	256	0.14494	0.13633	1.06x
32	1024	16	16	256	0.74934	0.74127	1.01x

Interleaved (GPT-NeoX style) — Out-of-place

batch	seq_len	q_heads	k_heads	head_dim	Baseline mean (ms)	TLE mean (ms)	Speedup
1	128	32	8	128	0.04140	0.03523	1.17x
1	1024	32	8	128	0.05167	0.04168	1.24x
8	128	32	8	128	0.05104	0.04093	1.25x
8	1024	32	8	128	0.15702	0.14350	1.09x
32	128	32	8	128	0.10005	0.08794	1.14x
32	1024	32	8	128	0.48878	0.48212	1.01x
8	1024	16	16	128	0.12875	0.12100	1.06x
32	1024	16	16	128	0.39361	0.39187	1.00x
1	128	32	8	256	0.04269	0.03668	1.16x
8	128	32	8	256	0.06545	0.05981	1.10x
32	128	32	8	256	0.15524	0.14697	1.06x
32	1024	16	16	256	0.75355	0.77006	0.98x ⚠️

Interleaved (GPT-NeoX style) — In-place

batch	seq_len	q_heads	k_heads	head_dim	Baseline mean (ms)	TLE mean (ms)	Speedup
1	128	32	8	128	0.03471	0.02656	1.31x
8	1024	32	8	128	0.16395	0.13200	1.24x
32	128	32	8	128	0.10276	0.07474	1.37x
32	1024	32	8	128	0.53328	0.47174	1.13x
8	1024	16	16	128	0.13583	0.10792	1.26x
32	1024	16	16	128	0.42746	0.38305	1.12x
1	128	32	8	256	0.03529	0.02749	1.28x
8	128	32	8	256	0.05359	0.04437	1.21x
32	128	32	8	256	0.14604	0.13378	1.09x
32	1024	16	16	256	0.74467	0.74151	1.00x

Note on regression: One minor regression was observed in the interleaved out-of-place case (batch=32, seq_len=1024, 16 query/16 key heads, head_dim=256): a ~2% slowdown (0.98x). This is within the measurement noise range and will be investigated further. No in-place configuration shows a regression.
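
One way to make the "within noise" judgment explicit is to compare the slowdown against a fixed tolerance. The sketch below does this for the flagged RoPE row; the 3% threshold and the `classify` helper are assumptions chosen for illustration, not values taken from the benchmark scripts.

```python
def classify(baseline_ms, tle_ms, noise_tolerance=0.03):
    """Label a result by comparing the slowdown to an assumed noise tolerance."""
    ratio = baseline_ms / tle_ms
    if ratio >= 1.0:
        return "no regression"
    if 1.0 - ratio <= noise_tolerance:
        return "within noise"
    return "regression"

# Interleaved out-of-place, batch=32, seq_len=1024, head_dim=256 (from the table above).
flagged = classify(0.75355, 0.77006)
# The same row under a stricter, equally hypothetical 1% tolerance.
flagged_strict = classify(0.75355, 0.77006, noise_tolerance=0.01)
```

A tolerance like this is only meaningful if it is derived from the run-to-run spread of the benchmark, for example from repeated runs of the same configuration.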

Summary

Operator	Baseline Source	Best Speedup	Typical Range	Regression
GLU	FlagGems	1.10x	1.00x – 1.10x	None
2D Depthwise Conv	self-constructed	1.67x	1.06x – 1.67x	None
Causal Conv1D (Varlen)	vLLM v0.13.0	1.19x	1.12x – 1.19x	None
Causal Conv1D (Update)	vLLM v0.13.0	1.02x	1.00x – 1.02x	None
RoPE (in-place)	FlagGems	1.37x	1.00x – 1.37x	None
RoPE (out-of-place)	FlagGems	1.28x	1.00x – 1.28x	1 config ~2%

All benchmark scripts and raw logs are available under python/tutorials/tle/. The single marginal regression in RoPE (interleaved out-of-place, batch=32, seq_len=1024, 16 query/16 key heads, head_dim=256) is noted.

sunnycase

LGTM

sunnycase

LGTM

lzllx123 requested review from sunnycase and zhzhcookie as code owners June 4, 2026 02:40

github-actions Bot added tle triton_v3.6.x enflame hcu labels Jun 4, 2026

sunnycase changed the title ~~Feature/extrac tile strides new~~ [TLE] Feature/extrac tile strides new Jun 5, 2026

sunnycase mentioned this pull request Jun 5, 2026

[debug] Instrument TLE cluster GEMM CI failure #655

Closed

github-actions Bot added the nvidia label Jun 8, 2026

sunnycase force-pushed the feature/extrac_tile_strides_new branch from f53f2ba to eb9caf6 June 8, 2026 08:04

github-actions Bot removed the nvidia label Jun 8, 2026

sunnycase requested changes Jun 8, 2026

github-actions Bot added the nvidia label Jun 8, 2026

lzllx123 force-pushed the feature/extrac_tile_strides_new branch from d1c0c52 to ee039c1 June 10, 2026 08:52

github-actions Bot removed the nvidia label Jun 10, 2026

lzllx123 force-pushed the feature/extrac_tile_strides_new branch from ee039c1 to 990527a June 11, 2026 02:45

sunnycase previously approved these changes Jun 11, 2026

ph0375 and others added 10 commits June 15, 2026 10:31

extract_tile_strides

b7c01b6

[TLE] Extend the strides parameter of insert_tile

26c70ea

[TLE]Fix the issue of extract_tile and insert_tile allocating separate smem

618330e

[TLE]Update the strides parameters of extract_tile and insert_tile, and add test examples optimized using these two primitives

1942c0c

[TLE]Modify the test file of glu

4458c62

Apply code-format changes

a44b487

Apply code-format changes

845152e

[TLE]fix some GCU backend problems

cf984ea

Apply code-format changes

22ac13c

[TLE]fix some GCU backends problems 2.

3f0ca98

flagtree-bot and others added 12 commits June 15, 2026 10:31

Apply code-format changes

5c7b837

[TLE]fix some GCU backend problems 3.

c7ccd1d

[TLE] fix some GCU backend problems 4.

d05e79b

[TLE] Handle misaligned address errors gracefully in cluster-gemm autotuning

86d1f89

fix 04-cluster-gemm problems

02ebf0e

Revert 04-cluster-gemm.py to c7e6953

c3feb70

Replace Chinese comments with English in TLE source and tests

0937efb

Apply code-format changes

26474e0

Update hopper-build-and-test.yml

ca5d01a

Update the test examples under FlagTree/python/tutorials/tle.

2fa5b26

Apply code-format changes

e7b1ba7

Guard extract_tile/insert_tile code with __TLE__ macro

fa4b322

lzllx123 dismissed sunnycase’s stale review via fa4b322 June 15, 2026 03:01

lzllx123 force-pushed the feature/extrac_tile_strides_new branch from 46b8e1d to fa4b322 June 15, 2026 03:01

Merge triton_v3.6.x, resolve ci workflow conflict

d6e4229

sunnycase approved these changes Jun 15, 2026

lzllx123 wants to merge 23 commits into triton_v3.6.x from feature/extrac_tile_strides_new

4 participants
