[BACKEND] Add opt-in TileIR backend integration #703

Open
KingsleyLiu-NV wants to merge 11 commits into flagos-ai:triton_v3.6.x from KingsleyLiu-NV:feature/flagtree-tileir-integration

@KingsleyLiu-NV KingsleyLiu-NV commented Jun 17, 2026

Summary

This PR integrates Triton-to-Tile-IR into FlagTree as an independent tileir backend.

TileIR has its own compiler and driver and is installed alongside the existing NVIDIA and AMD backends. The common python/triton layer only provides backend-neutral routing, compiler, driver, and language-extension hooks. TileIR-specific policy and implementation remain under third_party/tileir.

Runtime behavior is unchanged unless FLAGTREE_USE_TILEIR=1 is set.

Design

TileIR is installed as an independent backend with its own TileIRBackend and TileIRDriver. On NVIDIA systems, CudaDriver remains the active hardware driver and produces the initial cuda target. Routing may select a tileir target for an individual kernel without replacing or modifying the NVIDIA backend.

The shared Python changes are backend-neutral hooks:

python/triton/runtime/jit.py
├── python/triton/backends/__init__.py::route_target()
│   └── TileIRBackend.route_target()
│       └── third_party/tileir/backend/router.py
└── python/triton/compiler/compiler.py
    ├── get_backend(final_target)
    ├── backend.make_ir(...)
    └── get_driver(final_target, active_driver)

The common interfaces under python/triton/backends do not import TileIR implementation code. TileIR-specific routing, compilation, and driver behavior remain under third_party/tileir.
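As a rough illustration of the flow above, the sketch below shows how a backend-neutral `route_target()` hook might dispatch to registered backends without importing any backend implementation. Every name in it (`Target`, `register_router`, `example_router`) is hypothetical and chosen for illustration; the actual signatures in python/triton/backends/__init__.py may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Target:
    """Hypothetical stand-in for a compilation target such as ("cuda", 90)."""
    backend: str
    arch: int


# A router inspects the initial target and the kernel source and returns a new
# target, or None when it has no opinion. (Assumed hook shape, not the PR's.)
Router = Callable[[Target, str], Optional[Target]]

_ROUTERS: List[Router] = []


def register_router(router: Router) -> None:
    """Backends register their own hook; the common layer never imports them."""
    _ROUTERS.append(router)


def route_target(initial: Target, kernel_src: str) -> Target:
    """Return the first routed target offered by a backend, else the initial one."""
    for router in _ROUTERS:
        routed = router(initial, kernel_src)
        if routed is not None:
            return routed
    return initial


def example_router(initial: Target, kernel_src: str) -> Optional[Target]:
    """Illustrative backend-side hook: reroute every CUDA kernel to "tileir"."""
    if initial.backend == "cuda":
        return Target("tileir", initial.arch)
    return None
```

With no router registered, `route_target()` returns the initial target unchanged, which mirrors the opt-in behavior described in the summary. The compiler and driver would then be looked up from the returned target rather than from the active hardware driver.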

With FLAGTREE_USE_TILEIR=1:

  • CUDA kernels without TLE route to TileIR.
  • Kernels using only the supported tle.gpu.tile view/token subset route to TileIR.
  • Kernels with other or unknown TLE usage remain on the native NVIDIA backend.
  • Non-CUDA targets remain unchanged.

The policy is implemented in the TileIR backend router.
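The four rules can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the code in third_party/tileir/backend/router.py: `choose_backend`, `tileir_enabled`, and `SUPPORTED_TLE_OPS` are hypothetical names, the supported set lists only the two operations named in the validation notes, and treating exactly `"1"` as the enabling value is an assumption.

```python
import os
from typing import Iterable, Mapping, Optional

# Placeholder for the supported tle.gpu.tile view/token subset. Only these two
# names are taken from the PR text; the real set is an assumption here.
SUPPORTED_TLE_OPS = frozenset({"load_view_tko", "store_view_tko"})


def tileir_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when FLAGTREE_USE_TILEIR is exactly "1" (assumed parsing)."""
    env = os.environ if env is None else env
    return env.get("FLAGTREE_USE_TILEIR") == "1"


def choose_backend(target_backend: str,
                   tle_ops: Optional[Iterable[str]],
                   enabled: bool) -> str:
    """Pick a backend name following the routing rules listed above.

    tle_ops holds the TLE operation names the kernel uses; None means the
    usage could not be determined.
    """
    if not enabled or target_backend != "cuda":
        return target_backend  # opt-out, or a non-CUDA target: unchanged
    if tle_ops is None:
        return target_backend  # unknown TLE usage stays on native NVIDIA
    if set(tle_ops) <= SUPPORTED_TLE_OPS:
        return "tileir"  # also covers kernels without TLE (empty set)
    return target_backend  # unsupported TLE usage stays on native NVIDIA
```

For example, `choose_backend("cuda", ["load_view_tko"], enabled=True)` selects `"tileir"`, while the same call with an operation outside the supported set keeps `"cuda"`.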

Backend language APIs use the generic tl.ext registry. tle.gpu.tile.<name> lazily forwards to tl.ext.<name>, while the TileIR implementation remains in extend_core.py, extend_semantic.py, and triton_tileir.cc. Ordinary TLE imports therefore do not depend on TileIR.
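The lazy forwarding described above could look roughly like the sketch below. `ExtRegistry` and `LazyAlias` are hypothetical stand-ins for `tl.ext` and `tle.gpu.tile`; the registered function is a dummy, and the PR's real registry API is not shown here.

```python
from typing import Any, Callable, Dict


class ExtRegistry:
    """Hypothetical stand-in for a generic tl.ext-style registry."""

    def __init__(self) -> None:
        self._ops: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._ops[name] = fn

    def __getattr__(self, name: str) -> Callable[..., Any]:
        # Called only when normal attribute lookup fails.
        try:
            return self.__dict__["_ops"][name]
        except KeyError:
            raise AttributeError(f"no extension op named {name!r}") from None


class LazyAlias:
    """Forwards attribute access to a registry at lookup time, not import time."""

    def __init__(self, registry: ExtRegistry) -> None:
        self._registry = registry

    def __getattr__(self, name: str) -> Callable[..., Any]:
        registry = self.__dict__.get("_registry")
        if registry is None:
            raise AttributeError(name)
        return getattr(registry, name)


ext = ExtRegistry()
tile = LazyAlias(ext)  # the alias exists before any backend registers ops

# A backend registers its op later; the alias resolves it on first use.
ext.register("load_view_tko", lambda view: ("loaded", view))
result = tile.load_view_tko("v0")
```

Because the alias resolves names only on access, importing it does not require the backend that eventually supplies the operations, which is the property the paragraph above relies on.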

Implementation

This PR:

  • registers TileIR as an independent FlagTree backend;
  • adds generic per-kernel target routing;
  • selects the compiler and kernel driver from the final routed target;
  • adds a generic backend language-extension registry through tl.ext;
  • adds TileIR-specific routing, frontend, lowering, and driver integration;
  • adds TLE view/token operations and their C++ builder bindings;
  • adds compatibility handling for FlagTree's Triton 3.6 and LLVM versions;
  • adds tutorials, correctness checks, benchmarks, and TileIR CI coverage.

The TileIR source is based on upstream commit a3befd959b02410cfbdac08d91d817b0ec0b3e33.

cuda-tile is pinned at commit 2e5ccba66fb3afdba34b26cf358418283027c248.

The upstream baseline, dependency pins, build requirements, LLVM compatibility handling, and FlagTree-local vendor changes are recorded in the TileIR backend README.

Validation

Load View Token Ordering

01-load-view-token-ordering.py validates:

  • TLE tensor-view operations;
  • memory-token creation and chaining;
  • load_view_tko and store_view_tko;
  • successful TileIR execution;
  • expected native NVIDIA rejection when TileIR routing is disabled.

Mixed Kernel Routing

02-mixed-kernel-routing.py validates in one process:

  • a plain Triton kernel routed to TileIR;
  • a non-TileIR TLE kernel routed to native NVIDIA;
  • correct results from both paths;
  • expected TileIR and native cache artifacts.

Triton TileIR Benchmarks

03-triton-tileir-benchmarks.py provides:

  • self-contained Triton kernels;
  • native NVIDIA and TileIR execution;
  • correctness checks;
  • CUPTI kernel-time measurements;
  • seven benchmark families;
  • a curated CI subset with three representative pairs per case.

The available benchmark families are:

bmm
fmha
linear_bias_act
mla
mla_decoding
matmul
rope

Tutorial usage and reference H100 performance results are documented in the TileIR tutorials.

CI

The dedicated TileIR CI workflow:

  • builds FlagTree with TileIR and CTK 13.3;
  • verifies tileiras;
  • runs the load-view token-ordering tutorial;
  • runs the mixed-routing tutorial;
  • runs the curated native NVIDIA and TileIR benchmark subset.

Existing backend CI continues to use the common Triton frontend without importing TileIR implementation code.

@CLAassistant

CLAassistant commented Jun 17, 2026

CLA assistant check
All committers have signed the CLA.

@KingsleyLiu-NV KingsleyLiu-NV marked this pull request as draft June 17, 2026 08:09

Collaborator

Are there performance results on Blackwell?

Author

If you can specify the Blackwell GPUs you’re interested in (e.g., B200 or RTX PRO 6000), I can add the benchmark results accordingly.

Collaborator

> If you can specify the Blackwell GPUs you’re interested in (e.g., B200 or RTX PRO 6000), I can add the benchmark results accordingly.

B200, please.

@sunnycase sunnycase left a comment

Collaborator

Thanks for the change. One question about the API placement: since tileir does not seem to expose GPU-specific details here, would it be a better fit under tle rather than tle.gpu? That may keep the namespace aligned with the abstraction level, unless there is a planned GPU-specific surface that I am missing.

@Vincent-Xiao Vincent-Xiao marked this pull request as ready for review June 23, 2026 08:47
@KingsleyLiu-NV

Author

> Thanks for the change. One question about the API placement: since tileir does not seem to expose GPU-specific details here, would it be a better fit under tle rather than tle.gpu? That may keep the namespace aligned with the abstraction level, unless there is a planned GPU-specific surface that I am missing.

I put the load-view-token related APIs under tle.gpu.tile based on a suggestion from @Vincent-Xiao. I’m not sure this is the best approach, but it’s relatively easy to change since it is just an alias.

@KingsleyLiu-NV KingsleyLiu-NV force-pushed the feature/flagtree-tileir-integration branch 2 times, most recently from 47b0477 to 01680a4 Compare June 29, 2026 05:20
@KingsleyLiu-NV

Author

@Vincent-Xiao Can you please approve CI workflows for my latest commit?

@KingsleyLiu-NV KingsleyLiu-NV force-pushed the feature/flagtree-tileir-integration branch from 01680a4 to cbf1dc5 Compare June 29, 2026 07:25