Seeking Optimization for SkyReels I2V: Addressing Speed and Quality Challenges with Negative Prompts and TeaCache Acceleration #378

Open
ptmaster opened this issue Feb 19, 2025 · 11 comments

Comments

@ptmaster

Today everyone has successfully run I2V, thanks! But the primary issue remains speed. Including negative prompts has doubled the computation time, and while enabling TeaCache acceleration improves speed, it unfortunately compromises visual quality. These two factors, one positive and one negative, make for a difficult trade-off. On average, generating a 97-frame video takes approximately 15 minutes on a single 4090. I kindly request your assistance in optimizing this process at your earliest convenience. Thank you very much! :)

@kijai
Owner

kijai commented Feb 19, 2025

Losing CFG distillation really hurts speed, indeed. The best way to mitigate that is to find out how many steps we really need CFG for, as the nodes already support scheduling it over time. My default workflow only runs CFG for half the steps, speeding things up considerably with little quality loss. However, it's hard to judge both the value of CFG itself and how many steps to run it for; this just requires testing.
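For illustration, here is a minimal sketch of what step-scheduled CFG looks like in a generic sampling loop. This is not the wrapper's actual code; the denoiser, sigma schedule, and Euler update below are placeholder assumptions, but the core idea is the same: skip the negative-prompt pass after a chosen fraction of the steps.

```python
import torch

def sample_with_scheduled_cfg(model, latents, sigmas, cond, uncond,
                              cfg_scale=6.0, cfg_end_ratio=0.5):
    """Run classifier-free guidance only for the first fraction of the steps.

    After cfg_end_ratio * num_steps, the negative-prompt (unconditional)
    pass is skipped, roughly halving the cost of the remaining steps.
    """
    num_steps = len(sigmas) - 1
    cfg_end_step = int(num_steps * cfg_end_ratio)
    for i in range(num_steps):
        sigma = sigmas[i]
        denoised_cond = model(latents, sigma, cond)          # positive-prompt pass
        if i < cfg_end_step and cfg_scale > 1.0:
            denoised_uncond = model(latents, sigma, uncond)  # negative-prompt pass
            denoised = denoised_uncond + cfg_scale * (denoised_cond - denoised_uncond)
        else:
            denoised = denoised_cond                         # CFG disabled: single pass
        # simple Euler update in sigma space, purely illustrative
        d = (latents - denoised) / sigma
        latents = latents + d * (sigmas[i + 1] - sigmas[i])
    return latents

# toy usage with a dummy denoiser, just to show the call shape
dummy = lambda x, sigma, c: x * 0.9
latents = torch.randn(1, 16, 8, 8)
sigmas = torch.linspace(1.0, 0.02, 31)
out = sample_with_scheduled_cfg(dummy, latents, sigmas, cond=None, uncond=None)
```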

@ObiLeek

ObiLeek commented Feb 19, 2025

15 minutes is too much. On my RTX 4090 the times for 97 frames are as follows (including model loading):

attention_mode: sageattn_varlen
teacache: 0.15
model quantization: fp8_e4m3fn
time: 156 seconds

attention_mode: sageattn_varlen
teacache: disabled
model quantization: fp8_e4m3fn
time: 186 seconds

attention_mode: sageattn_varlen
teacache: 0.15
model quantization: disabled
time: 197 seconds

BlockSwap set to fill VRAM to 95%.

I would recommend checking your VRAM fill to see if it is at 97 percent or more. If it is, pull it down to 95 percent or less.
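If you want to check the fill level programmatically rather than eyeballing a monitor, one quick way is torch.cuda.mem_get_info, which reports free and total bytes on the current CUDA device (assumes a CUDA build of PyTorch):

```python
import torch

# free/total VRAM in bytes on the current CUDA device
free, total = torch.cuda.mem_get_info()
used_pct = 100 * (total - free) / total
print(f"VRAM used: {used_pct:.1f}% "
      f"({(total - free) / 1e9:.1f} / {total / 1e9:.1f} GB)")
```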

@wwwffbf

wwwffbf commented Feb 19, 2025

@ObiLeek what is your resolution? So fast!

He is using 544x960, 97 frames, 30 steps, I guess.

@ptmaster
Author

15 minutes is too much. On my RTX 4090 the times for 97 frames are as follows (including model loading):

attention_mode: sageattn_varlen
teacache: 0.15
model quantization: fp8_e4m3fn
time: 156 seconds

attention_mode: sageattn_varlen
teacache: disabled
model quantization: fp8_e4m3fn
time: 186 seconds

attention_mode: sageattn_varlen
teacache: 0.15
model quantization: disabled
time: 197 seconds

BlockSwap set to fill VRAM to 95%.

I would recommend checking VRAM fill to see if it is at 97 percent or more. If it is, then pull it down to 95 or less.

At 960x544? Perhaps you should take a look at the previous post, which documented a method to avoid running out of memory and having to enable blockswap.
#372


@ObiLeek

ObiLeek commented Feb 19, 2025

Resolution is 544x960, 30 steps. All other parameters are the same as in the example workflow.

But auto_cpu_offload is not the same as blockswap. Blockswapping is much faster, at least in my environment.
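For anyone unfamiliar with the distinction: whole-model CPU offload moves the entire model off the GPU between uses, while block swapping keeps only a chosen number of transformer blocks in CPU RAM and streams each one onto the GPU just for its forward pass. A minimal illustrative sketch of that idea follows; it is not the wrapper's actual implementation, and the class and parameter names are made up:

```python
import torch
import torch.nn as nn

class BlockSwapRunner(nn.Module):
    """Keep the last `blocks_to_swap` blocks on CPU and stream each one to
    the GPU only while it runs, trading PCIe transfer time for VRAM headroom."""

    def __init__(self, blocks: nn.ModuleList, blocks_to_swap: int, device="cuda"):
        super().__init__()
        self.blocks = blocks
        self.blocks_to_swap = blocks_to_swap
        self.device = device
        keep = len(blocks) - blocks_to_swap
        for i, blk in enumerate(blocks):
            blk.to(device if i < keep else "cpu")

    def forward(self, x):
        keep = len(self.blocks) - self.blocks_to_swap
        for i, blk in enumerate(self.blocks):
            if i >= keep:
                blk.to(self.device)   # stream this block in
            x = blk(x)
            if i >= keep:
                blk.to("cpu")         # and back out, freeing VRAM for the next one
        return x

# toy usage: 8 small blocks, swap the last 4
device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = nn.ModuleList(nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(8))
runner = BlockSwapRunner(blocks, blocks_to_swap=4, device=device)
out = runner(torch.randn(2, 64, device=device))
```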

@wwwffbf

wwwffbf commented Feb 19, 2025

Resolution is 544x960, 30 steps. All other parameters are the same as in the example workflow.

But auto_cpu_offload is not the same as blockswap. Blockswapping is much more faster. At least in my environment.

Incredible speed! Could you share your workflow?

@ptmaster
Author

Resolution is 544x960, 30 steps. All other parameters are the same as in the example workflow.

But auto_cpu_offload is not the same as blockswap. Blockswapping is much more faster. At least in my environment.

I understand that purely from a speed perspective your numbers seem credible, and you do own a real 4090. However, we can't overlook image quality and how dynamic the results are. While a flowmatch sampler with TeaCache acceleration can easily reach around three minutes, it's hard to get satisfactory results that way. Could we please avoid focusing solely on raw speed? It feels a bit misleading and could give KJ the wrong impression of what needs updating; I hope you understand my concern. :)

@ObiLeek

ObiLeek commented Feb 19, 2025

Sure, I understand. TeaCache is a bit of a quality lottery with this model, but with the fixed seed I used for the tests, the quality was great. In any case, here is the result without it:

attention_mode: sageattn_varlen
teacache: disabled
model quantization: fp8_e4m3fn
time: 186 seconds
scheduler: SDE-DPMSolverMultistepScheduler
resolution: 544x960
steps: 30
frames: 97

It's not a problem for me to run any test if there is interest. The workflow is identical to hyvideo_skyreel_img2vid_example_01 in the latest version, but with the above parameters and modified blockswapping.
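For context, the teacache value (0.15 above) acts as a change threshold: the idea behind TeaCache is to track how much the transformer's input has changed since the last full evaluation and, while the accumulated relative change stays under the threshold, reuse the previously computed residual instead of running the transformer again. A rough sketch of that caching decision, purely illustrative and not the wrapper's actual code:

```python
import torch

class TeaCacheLikeWrapper:
    """Caching heuristic in the spirit of TeaCache: skip the expensive
    transformer call while the accumulated relative L1 change of its input
    stays below a threshold, reusing the last cached residual instead."""

    def __init__(self, rel_l1_thresh=0.15):
        self.rel_l1_thresh = rel_l1_thresh
        self.prev_input = None
        self.cached_residual = None
        self.accum = 0.0

    def __call__(self, transformer, x):
        if self.prev_input is not None and self.cached_residual is not None:
            rel_change = ((x - self.prev_input).abs().mean()
                          / (self.prev_input.abs().mean() + 1e-8)).item()
            self.accum += rel_change
            if self.accum < self.rel_l1_thresh:
                self.prev_input = x
                return x + self.cached_residual   # cheap path: reuse residual
        out = transformer(x)                      # full path: recompute
        self.cached_residual = out - x
        self.prev_input = x
        self.accum = 0.0
        return out

# toy usage with a dummy transformer over a few "steps"
cache = TeaCacheLikeWrapper(rel_l1_thresh=0.15)
dummy_transformer = lambda x: x * 0.95
x = torch.randn(1, 16, 32)
for _ in range(5):
    x = cache(dummy_transformer, x)
```

In a sketch like this, a higher threshold skips more transformer calls (faster, but riskier for quality), while 0 never takes the cheap path.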

@wwwffbf

wwwffbf commented Feb 19, 2025

blockswapping.

How did you set up the block swap? When using fp8_e4m3fn and Triton, even if I minimize the block swap, or remove it entirely, VRAM usage never goes above 90 percent.

@ganicus

ganicus commented Feb 19, 2025

Where is TeaCache in the workflow? Is it part of the Torch Compile node somewhere?

@pftq

pftq commented Feb 22, 2025

It's a new node called "Hunyuan TeaCache" that connects to the teacache_args input on the Hunyuan Sampler.
