
SkyReels Multi-GPU #399

pftq opened this issue Feb 22, 2025 · 12 comments
pftq commented Feb 22, 2025

Is the multi-gpu option in SkyReels exposed through the ComfyUI nodes? If not, can this be added?

And as a side note, there seems to be a bug in the official code - I made a quick fix that seemed to address it in case you run into it as well:
SkyworkAI/SkyReels-V1#66

kijai commented Feb 22, 2025

For the wrapper it probably could, but I don't have multiple GPUs to test that stuff and I'm not that interested in cloud inference myself.

pftq commented Feb 22, 2025

If you want to just expose the parameter in the node and that's it, I'd be happy to test it. I'm not that familiar with ComfyUI myself so I'm not sure where to start to make that adjustment. Complication would be if the bug I mentioned needs to be patched in the Kijai version as well - where would that change have to be mirrored?

kijai commented Feb 22, 2025

> If you want to just expose the parameter in the node and that's it, I'd be happy to test it. I'm not that familiar with ComfyUI myself so I'm not sure where to start to make that adjustment. Complication would be if the bug I mentioned needs to be patched in the Kijai version as well - where would that change have to be mirrored?

It's not that simple, as the pipeline is different; it would be a lot more work than just exposing an option.

pftq commented Feb 23, 2025

Are we talking about the same thing? I mean multiple GPUs to speed up the same video being generated, not dividing the batch across different GPUs. What would the pipeline change be there?

I just finished testing the multi-GPU on the main Python SkyReels repo with up to 4 GPUs, and it's a 1:1 performance improvement (2x for every 2x in GPUs). It would be really good to have.

@magekinnarus

> I just finished testing the multi-GPU on the main python SkyReels up to 4 GPUs and it's 1:1 performance improvement (2x for every 2x in GPUs). Would be really good to have.

This is data parallelism, which duplicates the model across multiple GPUs. I think most people struggle to fit the model on their GPUs and are looking for ways to spread the model weights or components across multiple GPUs so it can run on less powerful hardware. Your method requires multiple GPUs, each powerful enough to process the whole thing individually, and merely duplicates the process on each GPU to speed things up. Your use case is an exception rather than the norm.
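To make the contrast concrete, here is a toy sketch (not SkyReels or ComfyUI code; the function and its block-assignment scheme are illustrative assumptions) of what "spreading the model weights" means: each GPU holds only a contiguous slice of the layers, unlike data parallelism, where every GPU holds a full copy.

```python
def shard_layers(num_layers: int, num_gpus: int) -> dict[int, int]:
    """Toy model-sharding map: assign contiguous blocks of layers to GPUs,
    so each device holds roughly num_layers / num_gpus of the weights."""
    per_gpu = (num_layers + num_gpus - 1) // num_gpus  # ceiling division
    return {layer: layer // per_gpu for layer in range(num_layers)}

# 8 transformer blocks over 2 GPUs: each GPU stores half the weights.
print(shard_layers(8, 2))  # {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
```

Data parallelism, by contrast, would map every layer to every GPU, which is why it only helps when each card can already fit the full model.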

pftq commented Feb 23, 2025

Maybe, but my POV is that this is functionality already supported by the model, apparently toggled by setting num_gpus to 2, 4, whatever. If it can be exposed as a node or parameter in the ComfyUI wrapper, then there's a lot of upside.

mr-lab commented Feb 23, 2025

> For the wrapper it probably could, but I don't have multiple GPUs to test that stuff and I'm not that interested in cloud inference myself.

I offer again: a rented multi-GPU cloud machine, for the benefit of the community.

mr-lab commented Feb 23, 2025

> I just finished testing the multi-GPU on the main python SkyReels up to 4 GPUs and it's 1:1 performance improvement (2x for every 2x in GPUs). Would be really good to have.

> This is data parallelism which duplicates the model across multiple GPUs. I think most people struggle to fit the model in their GPUs and are looking for ways to spread the model weights or components across multiple GPUs to fit the model to perform with less powerful GPUs. Your method requires having multiple GPUs powerful enough to process the whole thing on each individual GPU and merely duplicate the process on each GPU to speed things up. Your use case is an exception rather than a norm.

Not really. I want multi-GPU support; most of us are using cloud compute with multiple GPUs.

xDiT has support for the default HunyuanVideo model and more, like Flux... the issue is you lose quantization and LoRA, and I have no clue how to add that.

I will have a look at SkyReels and see if its multi-GPU implementation is compatible with any HunyuanVideo nodes. It seems that it's way better than xDiT, which offers a 1.3x speedup for each GPU.

What really bothers me is TACO-DiT, a closed-source multi-GPU ComfyUI implementation of xDiT, offered for a price.

> Are we talking about the same thing? I mean multiple GPUs to speed up the same video being generated, not a dividing the batch to different GPUs. What would be the pipeline change there?

> I just finished testing the multi-GPU on the main python SkyReels up to 4 GPUs and it's 1:1 performance improvement (2x for every 2x in GPUs). Would be really good to have.

Edit: I read the code of SkyReels.
It's true that it's not as simple as just adding a multi-GPU function; a whole lot of things need to be redone. Here is the list:
- unet loader
- lora loader
- text_encoders loader
- sampler
- possibly decoder/encoder (if you want faster decoding on multi-GPU)
and more. The good news is it uses a lot simpler code than xDiT; torch.multiprocessing will just spawn(single_gpu_run) for each GPU.
Gonna take a crack at this using the HunyuanVideoWrapper code, not SkyReels.
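The spawn(single_gpu_run) pattern mentioned above can be sketched with the standard library. torch.multiprocessing.spawn follows the same shape (one process per rank, with the rank passed as the first argument), so this is an illustrative stand-in rather than the actual SkyReels code:

```python
import multiprocessing as mp

def single_gpu_run(rank: int, world_size: int, queue) -> None:
    # In the real pipeline this would select GPU `rank`, call
    # dist.init_process_group(...), and run that rank's share of inference.
    queue.put(rank)

def spawn(fn, world_size: int) -> list:
    """Minimal stand-in for torch.multiprocessing.spawn(fn, nprocs=world_size)."""
    queue = mp.Queue()
    procs = [mp.Process(target=fn, args=(rank, world_size, queue))
             for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(queue.get() for _ in range(world_size))

if __name__ == "__main__":
    print(spawn(single_gpu_run, world_size=4))  # [0, 1, 2, 3]
```

The real version replaces the queue with NCCL collectives; the process-per-rank structure is the same.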

pftq commented Feb 24, 2025

I've been testing the multi-GPU without ComfyUI on the main SkyReels repo and it's super fast. 8 GPUs is just straight-up 8x faster. So yeah, definitely a lot better than xDiT. Something that took 4 hours to render on 1 GPU now takes only 30 min.

Is the implementation not in the SkyReels model itself? I was hoping it might be as simple as exposing the gpu_num parameter. Mentioning faster decoding/encoding confuses me a bit, since most of the time is spent in the sampler, so we don't really need a faster decoder/encoder. I just want to make sure we're not talking about something besides the multi-GPU inference SkyReels already has implemented. To me, it seems like it should just take something between the model loader and the sampler to change the gpu_num flag and let the model handle the rest when it runs in the sampler.
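If it really does come down to passing one value through, the node described here could look something like the following. This is a hypothetical sketch using ComfyUI's standard custom-node conventions (INPUT_TYPES / RETURN_TYPES / FUNCTION); the num_gpus attribute and where a sampler would read it are assumptions, not the actual HunyuanVideoWrapper API:

```python
class SkyReelsNumGPUs:
    """Hypothetical pass-through node: tags a loaded model with a num_gpus
    value for a downstream sampler to pick up."""

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "model": ("MODEL",),
            "num_gpus": ("INT", {"default": 1, "min": 1, "max": 8}),
        }}

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "apply"
    CATEGORY = "SkyReels"

    def apply(self, model, num_gpus):
        model.num_gpus = num_gpus  # assumed attribute; the sampler would read it
        return (model,)
```

The real work would be in the sampler honoring that value, which is the part kijai notes is not trivial for the wrapper's pipeline.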

mr-lab commented Feb 25, 2025

> I've been testing the multi-gpu without comfyui on the main skyreels repo and it's super fast. 8 GPUs is just straight up 8x faster. So yeah, definitely a lot better than xdit. Something that was 4 hours to render on 1 GPU improves to taking only 30 min.

> Is the implementation not in the Skyreels model itself? I was hoping it might be something like just exposing the gpu_num parameter. Mentioning faster decoding/encoding is confusing me a bit, since most of the time spent is in the sampler, so we don't really need a faster decoder/encoder. Just want to make sure we're not talking about something besides the multi-gpu inference SkyReels already has implemented. To me, it seems like it should just be in between the model loader and the sampler to change the gpu_num flag and let the model handle the rest when it's run in the sampler.

I think you can even straight up run the HunyuanVideo base model on it. After reading most of the code, the SkyReels pipeline has the following extras:
- a way to move text encoders between CPU and GPU dynamically
- per-sample CFG handling and guidance rescaling
- additional parameters, such as attention_kwargs, max_sequence_length, and cfg_for
- CLIP layer skipping: more control over CLIP layers via clip_skip

And in skyreels_video_infer.py we have this for multi-GPU:

```python
from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae

max_batch_dim_size = 2 if enable_cfg_parallel and world_size > 1 else 1
max_ulysses_dim_size = int(world_size / max_batch_dim_size)
logger.info(f"max_batch_dim_size: {max_batch_dim_size}, max_ulysses_dim_size:{max_ulysses_dim_size}")

mesh = init_context_parallel_mesh(
    self.pipe.device.type,
    max_ring_dim_size=1,
    max_batch_dim_size=max_batch_dim_size,
)
parallelize_pipe(self.pipe, mesh=mesh)
parallelize_vae(self.pipe.vae, mesh=mesh._flatten())
```
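The dimension arithmetic in that snippet determines how the ranks are split: when CFG parallelism is on and there is more than one GPU, two ranks divide the conditional/unconditional batch, and the rest go to Ulysses-style sequence parallelism. Restated as a pure function (the name mesh_dims is mine; the logic is from the snippet above):

```python
def mesh_dims(world_size: int, enable_cfg_parallel: bool) -> tuple:
    """Reproduce the snippet's split of ranks between the batch (CFG)
    dimension and the Ulysses sequence-parallel dimension."""
    max_batch_dim_size = 2 if enable_cfg_parallel and world_size > 1 else 1
    max_ulysses_dim_size = world_size // max_batch_dim_size
    return max_batch_dim_size, max_ulysses_dim_size

print(mesh_dims(8, True))   # (2, 4): 2 ranks for CFG x 4 for sequence parallelism
print(mesh_dims(4, False))  # (1, 4): all 4 ranks go to sequence parallelism
```

This is also why even world sizes are the natural fit when CFG parallelism is enabled: an odd world size leaves a rank unused by the integer division.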

Then we have single_gpu_run, which I believe is run once per GPU, called by mp.spawn(single_gpu_run, ...), and torch.distributed is imported to call this:

```python
dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:23456",
    timeout=timedelta(seconds=600),
    world_size=world_size,
    rank=local_rank,
)
```

I think I'm very close to making the musubi-tuner generation script a multi-GPU one.

@kijai can choose to implement this or not; that is not my problem anymore. His project sure has benefited the community as it is. I understand that he has no multi-GPU setup, and this whole thing is really a gift he keeps giving for free. For that, thank you @kijai either way.

But it's easy to implement, since SkyreelsVideoPipeline is just a wrapper class around HunyuanVideoPipeline:

```python
class SkyreelsVideoPipeline(HunyuanVideoPipeline):
```

It adds img2video functions and more control; the multi-GPU part is just https://github.com/chengzeyi/ParaAttention with torch.multiprocessing. I walk back my comment that this is not easy; it's very easy, but I'm not very good at coding ComfyUI nodes. I'm testing this with hv_generate_video.py from musubi-tuner.
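The wrapper relationship can be shown schematically. The base class below is a stand-in for the real HunyuanVideoPipeline, and the method name image_to_video plus its return value are illustrative, not SkyReels' actual interface:

```python
class HunyuanVideoPipeline:  # stand-in for the real base pipeline
    def __call__(self, prompt: str, **kwargs) -> str:
        return f"latents for {prompt!r}"

class SkyreelsVideoPipeline(HunyuanVideoPipeline):
    """Thin subclass: the text2video path is inherited unchanged;
    img2video just feeds extra conditioning into the same call."""

    def image_to_video(self, image, prompt: str, **kwargs) -> str:
        # The real pipeline VAE-encodes the image and concatenates it with
        # the noise latents; here we only show the subclassing shape.
        return self(prompt, image_latents=image, **kwargs)
```

Because the subclass only adds entry points, parallelizing the base pipeline parallelizes the wrapper the same way.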

Good luck to all, and thank you.

@pinballelectronica


pftq commented Feb 25, 2025

> https://github.com/pollockjj/ComfyUI-MultiGPU

It's not the same, and it complicates the setup a bit; we're talking about the built-in multi-GPU support of the SkyReels model for inference, so ideally it's just a matter of turning that on when it's used in ComfyUI. I'll try to look into it more later this week as well; I just have other things on the to-do list to test first.
