Added flux demo #3418
Conversation
Can the app display the inference time? It might be nice to have some stats rendered live as you generate.
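One rough sketch of how that could look: time the pipeline call inside the Gradio handler and return the elapsed time to a stats component. All names here (`pipe`, the handler, the layout) are illustrative, not the demo's actual code.

```python
import time
import gradio as gr

def generate_with_stats(prompt: str, steps: int = 20):
    # Time the diffusion call and surface it next to the image
    # (pipe is assumed to be the demo's Flux pipeline handle).
    start = time.perf_counter()
    image = pipe(prompt, num_inference_steps=steps).images[0]
    elapsed = time.perf_counter() - start
    return image, f"Inference time: {elapsed:.2f} s"

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    image = gr.Image(label="Generated image")
    stats = gr.Textbox(label="Stats", interactive=False)
    gr.Button("Generate").click(
        generate_with_stats, inputs=prompt, outputs=[image, stats]
    )
```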
```python
import gradio as gr
import modelopt.torch.quantization as mtq
import register_sdpa
```
We can avoid copying the `register_sdpa.py` and `sdpa_converter.py` files by doing this:
```python
import sys
import os

# Register SDPA as a standalone operator. Converter and lowering pass are defined in register_sdpa.py
sys.path.append(os.path.join(os.path.dirname(__file__), "../dynamo"))
from register_sdpa import *
```
What does this file do?
```diff
@@ -112,6 +112,8 @@
     min_block_size=1,
     use_fp32_acc=True,
     use_explicit_typing=True,
+    use_python_runtime=True,
```
Can we default to using the C++ runtime (`use_python_runtime=False`)?
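For reference, a minimal sketch of the compile call with that change (`ep` and `example_inputs` stand in for the demo's exported program and inputs):

```python
import torch_tensorrt

trt_gm = torch_tensorrt.dynamo.compile(
    ep,                        # exported FLUX backbone (placeholder name)
    inputs=example_inputs,     # placeholder for the demo's real inputs
    min_block_size=1,
    use_fp32_acc=True,
    use_explicit_typing=True,
    use_python_runtime=False,  # default to the C++ runtime, as suggested
)
```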
```python
backbone.to("cpu")
pipe.transformer = trt_gm
del ep
torch.cuda.empty_cache()
pipe.transformer.config = config

trt_gm.device = torch.device("cuda")
```
Can we use `offload_module_to_cpu=True` to handle this block of code?
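A sketch of the suggested replacement, assuming `offload_module_to_cpu` is passed through the demo's compile call:

```python
import torch_tensorrt

# With offload_module_to_cpu=True the compiler offloads the original PyTorch
# module to CPU once the TRT engine is built, which should make the manual
# backbone.to("cpu") / torch.cuda.empty_cache() block above unnecessary.
trt_gm = torch_tensorrt.dynamo.compile(
    ep,
    inputs=example_inputs,  # placeholder
    offload_module_to_cpu=True,
)
pipe.transformer = trt_gm
pipe.transformer.config = config
```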
```diff
@@ -912,7 +913,7 @@ def contains_metadata(gm: torch.fx.GraphModule) -> bool:
     parse_graph_io(submodule, subgraph_data)
     dryrun_tracker.tensorrt_graph_count += 1
     dryrun_tracker.per_subgraph_data.append(subgraph_data)

+    torch.cuda.empty_cache()
```
Is this needed here?
```diff
@@ -341,7 +370,7 @@ def refit_module_weights(

     # Iterate over all components that can be accelerated
     # Generate the corresponding TRT Module for those

+    new_weight_module.module().to(CPU_DEVICE)
```
Is `new_weight_module` the updated-weights module that the user provides?
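For context, a rough sketch of the refit flow as I understand it (signature assumed from the public `refit_module_weights` API; `updated_model` and `example_args` are placeholders):

```python
import torch
import torch_tensorrt

# The user re-exports the model with updated weights and hands it to refit.
exp_program = torch.export.export(updated_model, args=example_args)
refitted_gm = torch_tensorrt.dynamo.refit_module_weights(
    compiled_module,   # the previously compiled TRT module
    exp_program,       # new_weight_module: carries the updated weights
    arg_inputs=example_args,
    verify_output=True,
)
```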
```python
if verify_output and arg_inputs is not None:
    new_gm.to(torch.cuda.current_device())
```
We should ensure we use the device that's passed in via args/compilation settings (or the default device), and not rely on torch.cuda calls unless needed.
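A sketch of that pattern, assuming the `to_torch_device` helper in `torch_tensorrt.dynamo.utils` and a `settings` object carrying the compilation settings:

```python
from torch_tensorrt.dynamo.utils import to_torch_device

# Resolve the target device from the compilation settings (or the configured
# default) rather than hard-coding torch.cuda.current_device().
device = to_torch_device(settings.device)
new_gm.to(device)
```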
```python
self.original_model.to("cpu")
torch.cuda.empty_cache()
```
Use the deallocate module helper instead.
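Roughly, assuming the `deallocate_module` helper in `torch_tensorrt.dynamo.utils`:

```python
from torch_tensorrt.dynamo.utils import deallocate_module

# deallocate_module releases the module's storage and frees GPU memory in one
# call, replacing the manual .to("cpu") + torch.cuda.empty_cache() pair.
deallocate_module(self.original_model, delete_module=False)
```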
```python
from torch_tensorrt.dynamo.lowering.passes.pass_utils import (
    clean_up_graph_after_modifications,
)
```
Use them from the examples instead.
I think we should avoid copying the whole model script for measuring perf. Try using the sys.path approach, importing the model, and keeping just a perf loop. Something like:

```python
import sys
import os

sys.path.append(torchtrt_root + "examples/dynamo/apps")
from flux_demo import *

model = <insert FLUX model (fp16 or fp8)>
results = measure_flux_perf(....)
```
## Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

## Type of change

Please delete options that are not relevant and/or add your own.

## Checklist: