[TLE] Feature/extrac tile strides new #649
f53f2ba to eb9caf6
sunnycase
left a comment
Requesting changes before merge:
- Please replace the Chinese comments with English. I found Chinese comments in:
  - python/triton/experimental/tle/language/gpu/semantic.py:118
  - python/test/tle/unit/test_insert_tile_static_index.py:93
  - third_party/tle/dialect/lib/Transforms/ExtractTileToLLVM.cpp:40

  Comments in source and tests should be readable by all maintainers.
- Please expand the PR summary. The current summary only states that strides was added to tle.extract_tile and tle.insert_tile. It should explicitly document the changed APIs, including the Python API surface and any corresponding TLE/MLIR op attribute changes. It should also state which operators show performance improvement and include the measured before/after numbers or speedups, for example GLU, 2D Depthwise Conv, Causal Conv1D, and RoPE if those new tutorials are the performance evidence.
- Please add the newly introduced tutorials to CI. This PR adds:
  - python/tutorials/tle/05-glu.py
  - python/tutorials/tle/06-2D_Depthwise_Conv.py
  - python/tutorials/tle/07-causal-conv1d.py
  - python/tutorials/tle/08-rope.py

  However, the Hopper TLE tutorial workflow still only runs the existing tutorials through 04-cluster-gemm.py plus the DeepSeek examples (.github/workflows/hopper-build-and-test.yml:122 and .github/workflows/hopper-build-and-test.yml:180). Please add lightweight correctness invocations for the new tutorials, such as the existing --only_unit_test paths for 05/06 and non-benchmark correctness runs for 07/08, or add equivalent CI test targets so these examples do not land untested.
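For the first request, one way to locate any remaining non-English comments is a small scan for CJK characters. This is only a sketch: the function names `find_cjk_lines` and `scan_tree`, the chosen Unicode ranges, and the file suffix list are assumptions for illustration and are not part of this PR.

```python
import re
from pathlib import Path

# Assumed ranges: CJK symbols/punctuation, CJK Unified Ideographs,
# and half-width/full-width forms. Extend if other scripts matter.
CJK_PATTERN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]")


def find_cjk_lines(text):
    """Return (line_number, line) pairs for lines containing CJK characters."""
    return [
        (lineno, line)
        for lineno, line in enumerate(text.splitlines(), start=1)
        if CJK_PATTERN.search(line)
    ]


def scan_tree(root, suffixes=(".py", ".cpp", ".h", ".td")):
    """Yield (path, line_number, stripped_line) for every hit under root."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in find_cjk_lines(text):
                yield path, lineno, line.strip()
```

Iterating over `scan_tree("python")` or `scan_tree("third_party/tle")` from the repository root and printing each hit as `path:line` gives output in the same form as the locations listed above.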
@sunnycase
Could you please expand the performance section with reproducible detailed data? Right now the PR lists the optimized tutorials/operators, but reviewers still need more information to evaluate the actual benefit, applicable scope, and whether there are any regressions. A per-operator performance table would be very helpful. Please include at least:
With these details, it will be much easier to review the performance impact and make a confident merge decision.
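As an illustration of what one row of such a per-operator table could carry, the sketch below derives a speedup from a baseline and a TLE latency and labels it against a noise band. The helper names, the 3% noise band, and the column layout are assumptions, not something specified in this thread.

```python
def speedup(baseline_ms, tle_ms):
    """Speedup of the TLE kernel over the baseline (greater than 1.0 is faster)."""
    return baseline_ms / tle_ms


def format_row(config, baseline_ms, tle_ms, noise=0.03):
    """Format one table row; `noise` is an assumed run-to-run variation band."""
    ratio = speedup(baseline_ms, tle_ms)
    if ratio < 1.0 - noise:
        status = "regression"
    elif ratio > 1.0 + noise:
        status = "improvement"
    else:
        status = "within noise"
    return f"| {config} | {baseline_ms:.3f} | {tle_ms:.3f} | {ratio:.2f}x | {status} |"
```

Reporting both raw latencies alongside the ratio lets a reviewer judge whether a small difference is meaningful at that problem size.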
d1c0c52 to ee039c1
ee039c1 to 990527a
@sunnycase
Test Environment
Correctness Validation & Tolerance
Each benchmark automatically validates numerical equivalence between the TLE kernel and the baseline. Tolerances are set per dtype:
Tolerances differ across dtypes because lower-precision formats have fewer mantissa bits (float32: 23 bits, float16: 10 bits, bfloat16: 7 bits), which naturally widens rounding error. Correctness validation runs automatically at the start of each benchmark script. All configurations listed below passed.
Benchmark commands:
python python/tutorials/tle/05-glu.py --benchmark
python python/tutorials/tle/06-2D_Depthwise_Conv.py --benchmark
python python/tutorials/tle/07-causal-conv1d.py --benchmark
python python/tutorials/tle/08-rope.py --benchmark
Operator 1: GLU
Baseline: FlagGems/glu
Speedup ranges from 1.00x to 1.10x. Gains are more pronounced at larger dimensions (dim >= 2048). No regressions observed.
Operator 2: 2D Depthwise Convolution
Baseline: self-constructed
Speedup ranges from 1.06x to 1.67x. Largest gains at smaller spatial sizes (H=W=112/128). No regressions observed.
Operator 3: Causal Conv1D
Baseline: vLLM v0.13.0 causal_conv1d implementation
Varlen mode: dim=4096, batch=32
Varlen mode: dim=8192, batch=128
Update mode: batch=256
Update mode: batch=1024
Varlen mode: consistent 1.12x to 1.19x speedup across both dim/batch configs, improving with longer sequences. Update mode is memory-bandwidth-bound at these sizes and shows minimal difference (<=1.02x), which is expected behavior.
Operator 4: RoPE
Baseline: FlagGems rotary_embedding implementation
Non-interleaved (LLaMA style), out-of-place
Non-interleaved (LLaMA style), in-place
Interleaved (GPT-NeoX style), out-of-place
Interleaved (GPT-NeoX style), in-place
Note on regression: One minor regression observed at interleaved out-of-place (batch=32, seq=1024, 16q/16k heads, head_dim=256): ~2% slowdown (0.98x). This is within measurement noise range and will be investigated further. All in-place configurations show no regression.
Summary
All benchmark scripts and raw logs are available under python/tutorials/tle/. The single marginal regression in RoPE (interleaved out-of-place, large GQA shape) is noted.
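The mantissa-width argument in the tolerance note can be made concrete with a little arithmetic. The sketch below computes the machine epsilon implied by each mantissa width (2 to the power of minus the bit count) and shows the usual relative/absolute closeness test. The helper names are illustrative, and the tolerances actually used by the benchmark scripts are not reproduced here.

```python
# Mantissa widths quoted in the tolerance note above.
MANTISSA_BITS = {"float32": 23, "float16": 10, "bfloat16": 7}


def machine_epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value: 2**-mantissa_bits."""
    return 2.0 ** -mantissa_bits


def is_close(actual, expected, rtol, atol):
    """Same criterion as torch.allclose: |a - e| <= atol + rtol * |e|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


for name, bits in MANTISSA_BITS.items():
    print(f"{name:9s} eps = {machine_epsilon(bits):.3e}")
```

The epsilons come out at roughly 1.2e-7 for float32, 9.8e-4 for float16, and 7.8e-3 for bfloat16, so a tolerance that is comfortable for float32 would reject correctly rounded bfloat16 results.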
…nd add test examples optimized using these two primitives
46b8e1d to fa4b322
Summary:
API Changes:
Performance:
New tutorials optimized with tle.extract_tile / tle.insert_tile:
CI:
Added lightweight correctness tests for tutorials 05-08 in hopper-build-and-test.yml.