Seeking Optimization for SkyReels I2V: Addressing Speed and Quality Challenges with Negative Prompts and TeaCache Acceleration #378
Losing CFG distillation really does hurt speed; the best way to mitigate that is to find out how many steps we really need CFG for, as the nodes already support scheduling it over time. My default workflow only runs CFG for half the steps, which speeds it up considerably with little quality loss. However, it's hard to judge both the value of CFG itself and how many steps to run it for; this just requires testing.
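For illustration, here is a minimal sketch of what scheduling CFG over only part of the steps looks like inside a denoising loop. All names (`sample_with_scheduled_cfg`, `step_fn`, `cfg_end_ratio`) are hypothetical, not the wrapper's actual internals; only the idea of dropping the second, unconditional forward pass after a cutoff step comes from the comment above:

```python
def sample_with_scheduled_cfg(model, step_fn, latents, cond, uncond,
                              timesteps, cfg_scale=6.0, cfg_end_ratio=0.5):
    """Run classifier-free guidance only for the first part of the steps.

    `model(latents, t, embeds)` is the denoiser and `step_fn` the scheduler
    update; both are assumed interfaces for this sketch. After
    `cfg_end_ratio` of the steps, the unconditional pass is skipped,
    roughly halving the per-step cost for the remaining steps.
    """
    cfg_end_step = int(len(timesteps) * cfg_end_ratio)
    for i, t in enumerate(timesteps):
        noise_cond = model(latents, t, cond)
        if i < cfg_end_step and cfg_scale > 1.0:
            # CFG active: a second forward pass, which is the expensive part.
            noise_uncond = model(latents, t, uncond)
            noise_pred = noise_uncond + cfg_scale * (noise_cond - noise_uncond)
        else:
            # CFG disabled for the remaining steps: single pass only.
            noise_pred = noise_cond
        latents = step_fn(latents, noise_pred, t)
    return latents
```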
15 minutes is too much. On my RTX 4090 the times for 97 frames are as follows (including model loading): [three timing screenshots, each captioned attention_mode: sageattn_varlen], with BlockSwap set to fill VRAM to 95%. I would recommend checking the VRAM fill to see if it is at 97 percent or more; if it is, pull it down to 95 or less.
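As a quick way to check that fill level, here is a small sketch using PyTorch's `torch.cuda.mem_get_info()`; the API call is real, but the helper name is made up:

```python
import torch

def vram_fill_percent(device=0):
    """Return how full the GPU's VRAM currently is, as a percentage.

    torch.cuda.mem_get_info() reports (free_bytes, total_bytes) for the
    device; this helper name is illustrative, not part of the wrapper.
    """
    free, total = torch.cuda.mem_get_info(device)
    return 100.0 * (total - free) / total

# If this reads 97% or more during sampling, increase block swapping
# (or lower resolution/frame count) until it sits at ~95% or below.
print(f"VRAM fill: {vram_fill_percent():.1f}%")
```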
@ObiLeek what is your resolution? So fast!
He is using 544x960x97F, 30 steps, I guess.
With 960×544? Perhaps you should take a look at the previous post, which documented a method to avoid running out of memory without having to enable blockswap.
Resolution is 544x960, 30 steps. All other parameters are the same as in the example workflow. But auto_cpu_offload is not the same as blockswap; blockswapping is much faster, at least in my environment.
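For context on why the two differ: block swapping keeps only part of the transformer resident in VRAM and shuttles the remaining blocks between CPU and GPU around each block's own forward pass, rather than offloading whole models between pipeline stages. A rough sketch of the idea, with assumed names rather than the wrapper's actual code:

```python
def forward_with_block_swap(blocks, x, blocks_in_vram=20, device="cuda"):
    """Run a stack of transformer blocks, keeping only some on the GPU.

    `blocks` is assumed to be a list of torch.nn.Module transformer blocks.
    The first `blocks_in_vram` blocks stay resident; each remaining block
    is moved to the GPU only for its own forward pass and then back to
    CPU RAM, bounding peak VRAM usage.
    """
    for i, block in enumerate(blocks):
        swapped = i >= blocks_in_vram
        if swapped:
            block.to(device)   # swap in just before use
        x = block(x)
        if swapped:
            block.to("cpu")    # swap out to free VRAM for the next block
    return x
```

Because only the overflow blocks pay the transfer cost each step, this tends to beat offloading the entire model, at the price of PCIe traffic proportional to how many blocks do not fit.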
Incredible speed! Could you share your workflow?
I understand that, purely from a speed perspective, your numbers seem credible and you own a real 4090. However, we can't overlook the importance of image quality and motion. While using a flowmatch sampler with TeaCache acceleration might easily achieve speeds around three minutes, it's challenging to obtain satisfactory results that way. Could we please avoid focusing solely on raw speed? It feels a bit misleading and could give KJ a false impression of the update, and I hope you understand my concern. :)
Sure, I understand. TeaCache is a bit of a quality lottery with this model, but with the fixed seed I used for the tests, the quality was great. However, I shared the result without it: [timing screenshot, attention_mode: sageattn_varlen]. It's not a problem for me to run any test if there is interest. The workflow is identical to hyvideo_skyreel_img2vid_example_01 in the latest version, but with the above parameters and modified blockswapping.
Where is TeaCache in the workflow? Is it part of the Torch Compile node somewhere?
It's a new node called "Hunyuan TeaCache", and it connects to the teacache_args input on the Hunyuan Sampler.
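For anyone wondering what the node does under the hood: TeaCache-style caching tracks how much the transformer's input changes from one denoising step to the next and, while the accumulated relative change stays under a threshold, reuses the previous step's cached residual instead of running the full transformer. A simplified sketch of that mechanism follows; this is a paraphrase of the general technique, the class and parameter names are made up, and the real node's logic differs in detail:

```python
class TeaCacheSketch:
    """Simplified sketch of TeaCache-style step skipping (x: torch.Tensor)."""

    def __init__(self, rel_l1_thresh=0.15):
        self.rel_l1_thresh = rel_l1_thresh  # higher = more skipping, lossier
        self.prev_input = None
        self.cached_residual = None
        self.accumulated = 0.0

    def __call__(self, transformer, x, *args):
        # Accumulate the relative L1 change of the input between steps.
        if self.prev_input is not None:
            num = (x - self.prev_input).abs().mean()
            den = self.prev_input.abs().mean()
            self.accumulated += (num / den).item()
        self.prev_input = x.detach()

        if self.cached_residual is not None and self.accumulated < self.rel_l1_thresh:
            # Cheap path: reuse the last residual instead of a full forward.
            return x + self.cached_residual

        out = transformer(x, *args)               # full forward pass
        self.cached_residual = (out - x).detach() # remember what this step added
        self.accumulated = 0.0
        return out
```

Raising the threshold skips more steps, which is exactly where the speed/quality lottery mentioned above comes from.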
Today everyone has successfully run I2V, thanks! But the primary issue remains speed. The inclusion of negative prompts has doubled the computation time, and while enabling TeaCache acceleration improves speed, it unfortunately compromises visual quality. These two factors, one helping and one hurting, make for a challenging trade-off. On average, generating a 97-frame video takes approximately 15 minutes on a single 4090. I kindly request your assistance in optimizing this process at your earliest convenience. Thank you very much! :)