
Static Noise at ~193rd Frame on more than 8 sec #392

Open · pftq opened this issue Feb 21, 2025 · 21 comments


pftq commented Feb 21, 2025

I'm trying to test the frame limit on this on an H100. I ran into an issue where the video would be fine most of the way, except at roughly the 193rd frame it'd suddenly become a burst of static for a moment before continuing just fine again. I posted this on Hugging Face as well: https://huggingface.co/Kijai/SkyReels-V1-Hunyuan_comfy/discussions/5

I found this on the SkyReels GitHub:
"At maximum VRAM capacity, a 544px × 960px × 289f 12s video can be produced (using --sequence_batch, taking ~1.5h on one RTX 4090; adding GPUs greatly reduces time)."

This seems to imply that it can handle more than 200 frames if the argument "--sequence_batch" is passed to SkyReels. Is there a way to do this in the ComfyUI nodes?


kijai commented Feb 21, 2025

I have to admit I never fully understood what they mean by that. The --sequence_batch flag enables a boolean called cfg_for, which makes the model run cond and uncond sequentially, reducing memory use. In the wrapper that's the same as disabling batched_cfg.

So I suppose all they mean is that it reduces VRAM use enough to allow such high frame counts; in general, doing that many frames is slow.
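Roughly, the trade-off between the two modes looks like this (an illustrative sketch, not the actual wrapper code; the function names are mine):

import torch

def cfg_batched(model, latents, uncond, cond, scale):
    # One forward pass over a doubled batch: faster, but ~2x activation memory.
    noise = model(torch.cat([latents, latents]), torch.cat([uncond, cond]))
    noise_uncond, noise_cond = noise.chunk(2)
    return noise_uncond + scale * (noise_cond - noise_uncond)

def cfg_sequential(model, latents, uncond, cond, scale):
    # Two forward passes at the original batch size: slower, lower peak VRAM.
    noise_uncond = model(latents, uncond)
    noise_cond = model(latents, cond)
    return noise_uncond + scale * (noise_cond - noise_uncond)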

I have also noticed the soft cap in frame count and how the output turns into noise past it with this model; so far I haven't found a way around it. I'm not sure it's that useful anyway. I think a better approach would be simply to continue from the last frame, as the results already loop at 201 frames.


pftq commented Feb 21, 2025

I've noticed sometimes it loops, but sometimes it continues as if the noise didn't happen. The main issue with using the last frame is that the motion isn't continuous, so I'm trying to figure out why their repo implies they've had no problems up to 12 seconds, as that'd be preferable to restarting from the last frame.


pftq commented Feb 22, 2025

I think I found the cause of the 192-frame limit. It's in the VAE, specifically autoencoder_kl_hunyuan_video.py. The original code runs an out-of-bounds slice at the 25th tile (193rd frame, at 8 frames per tile).

Find:

for i in range(0, num_frames, tile_latent_stride_num_frames):
    tile = z[:, :, i : i + tile_latent_min_num_frames + 1, :, :]

Replace:

for i in range(0, num_frames, tile_latent_stride_num_frames):
    end_idx = min(i + tile_latent_min_num_frames + 1, num_frames)
    tile = z[:, :, i : end_idx, :, :]
    if i > 0:
        tile = torch.cat([z[:, :, i - 1 : i, :, :], tile], dim=2)   # 20250222 pftq: Add previous frame for continuity
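For reference, here's a toy run of the tiling arithmetic showing where the last slices extend past the end (tile sizes below assume the diffusers-style defaults of a 4-latent-frame tile with a 3-frame stride; note PyTorch clamps over-long slices silently rather than raising):

num_frames = 49                       # e.g. (193 - 1) // 4 + 1 latent frames at 4x temporal compression
tile_latent_min_num_frames = 4        # assumed values, for illustration only
tile_latent_stride_num_frames = 3

for i in range(0, num_frames, tile_latent_stride_num_frames):
    end = i + tile_latent_min_num_frames + 1
    marker = "  <-- extends past num_frames" if end > num_frames else ""
    print(f"tile [{i}:{end}]{marker}")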

Is this something you can test? I'm not sure where the fix should be applied in the ComfyUI context.


kijai commented Feb 22, 2025

But decoding with the base model was never an issue at any frame count though?


pftq commented Feb 22, 2025

I never took an interest in the base model (lack of i2v) but also maybe no one bothered to go for more than 8 seconds. I can double check later today.


kijai commented Feb 22, 2025

> I never took an interest in the base model (lack of i2v) but also maybe no one bothered to go for more than 8 seconds. I can double check later today.

I've done over 500 frames before with context windows and that decodes fine, and you can also see the noisy frames with the VHS nodes' live preview feature in Comfy even before decoding. It's pretty weird, as it's just 2-3 latents' worth of frames after 193... but then it continues fine after.
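Roughly, context windows split the latent frames into overlapping chunks that are denoised separately and blended at the overlaps, something like this (an illustrative sketch, not the wrapper's actual scheduling):

def window_indices(num_latent_frames, window=16, overlap=4):
    # Overlapping temporal windows; the overlaps get blended after denoising
    # so motion stays continuous across window boundaries.
    step = window - overlap
    starts = range(0, max(num_latent_frames - overlap, 1), step)
    return [(s, min(s + window, num_latent_frames)) for s in starts]

# ~500 output frames is roughly 126 latent frames at 4x temporal compression
print(window_indices(126))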


pftq commented Feb 22, 2025

What node is that for previewing before the VAE decode? I'd like to check as well.


kijai commented Feb 22, 2025

> What node is that for previewing before the VAE decode? I'd like to check as well.

Video Helper Suite options have a toggle for live previews; then in ComfyUI Manager enable the latent2rgb preview mode (or auto).
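Under the hood latent2rgb is just a fixed linear projection from the latent channels to RGB, roughly like this (an illustrative sketch; Comfy ships tuned per-model coefficient matrices, the factors here are placeholders):

import torch

def latent2rgb(latents, factors):
    # latents: (C, T, H, W) video latents; factors: (C, 3) per-channel RGB weights.
    rgb = torch.einsum("cthw,cr->rthw", latents, factors)
    lo, hi = rgb.min(), rgb.max()
    return (rgb - lo) / (hi - lo + 1e-8)  # normalize to [0, 1] for preview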

I think the issue is more about the image condition and looping that would occur around that frame count normally.


pftq commented Feb 22, 2025

Is there some way to not use the Hunyuan VAE or use a different one? See if that eliminates the issue. Sorry if it's a weird question - just diving into this for the first time so I'm not as familiar.


kijai commented Feb 22, 2025

> Is there some way to not use the Hunyuan VAE or use a different one? See if that eliminates the issue. Sorry if it's a weird question - just diving into this for the first time so I'm not as familiar.

The only way without training one is the latent2rgb method, which at best can only give an approximate preview.

What would make this issue clear is using their repo to do the 201 frames and seeing if it still happens; if it doesn't, then it's something that can be fixed here as well.


kijai commented Feb 24, 2025

May have something for this now; this project was just released: https://github.com/thu-ml/RIFLEx

It's a really simple addition to the rotary positional embed code. I've made a node for the native implementation to use it and have been able to do 253 frames without looping or noise:

[Image and example video attached: SkyReelHyVidComfyNative_00008.2.mp4]
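The gist of RIFLEx as I read it (a sketch of the idea only; the names below are mine, not the node's API): the temporal rotary embedding has one "intrinsic" frequency component whose period matches the trained video length, and stretching that one frequency stops positions past the trained length from wrapping around.

import math
import torch

def riflex_temporal_freqs(dim, theta, k, test_frames):
    # Standard RoPE inverse frequencies for the temporal axis.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    # RIFLEx: stretch the k-th ("intrinsic") frequency so one full period
    # covers the extrapolated length, instead of repeating within it,
    # which would otherwise show up as looping or noise.
    freqs[k - 1] = 2 * math.pi / test_frames
    return freqs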


pftq commented Feb 24, 2025

Nice, I look forward to trying that when you have it available - I was in the middle of trying to patch the Python code for the Hunyuan VAE at runtime with a node and went down a deep rabbit hole lol

If you're curious - I got stuck on the VAE input at runtime not actually matching the autoencoder_kl_hunyuan_video.py I'm trying to patch (it was AutoencoderKLCausal3D instead):

import copy
import types
import torch
from comfy.sd import VAE

class DecoderOutput:

    def __init__(self, sample):
        self.sample = sample

class HunyuanVAEFrameLimitFix:

    NAME = "Hunyuan VAE Frame Limit Fix"
    RETURN_TYPES = ("VAE",)
    RETURN_NAMES = ("vae", )
    FUNCTION = "execute"
    CATEGORY = "HunyuanVideoWrapper"

    @classmethod
    def INPUT_TYPES(s):
        return {"required": {
                    "vae": ("VAE",),
                    },
                }

    def __init__(self):
        pass

    # 20250222 pftq: Define patched method to fix out-of-bounds access in temporal tiling
    def patched_temporal_tiled_decode(self, z: torch.Tensor, return_dict: bool = True):
        batch_size, num_channels, num_frames, height, width = z.shape
        num_sample_frames = (num_frames - 1) * self.temporal_compression_ratio + 1

        tile_latent_min_height = self.tile_sample_min_height // self.spatial_compression_ratio
        tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio
        tile_latent_min_num_frames = self.tile_sample_min_num_frames // self.temporal_compression_ratio
        tile_latent_stride_num_frames = self.tile_sample_stride_num_frames // self.temporal_compression_ratio
        blend_num_frames = self.tile_sample_min_num_frames - self.tile_sample_stride_num_frames

        row = []
        for i in range(0, num_frames, tile_latent_stride_num_frames):
            # 20250222 pftq: Cap end_idx to prevent out-of-bounds access
            end_idx = min(i + tile_latent_min_num_frames + 1, num_frames)
            tile = z[:, :, i:end_idx, :, :]
            if i > 0:
                # 20250222 pftq: Add previous frame for continuity
                tile = torch.cat([z[:, :, i - 1:i, :, :], tile], dim=2)
            if self.use_tiling and (tile.shape[-1] > tile_latent_min_width or tile.shape[-2] > tile_latent_min_height):
                decoded = self.tiled_decode(tile, return_dict=True).sample
            else:
                tile = self.post_quant_conv(tile)
                decoded = self.decoder(tile)
            if i > 0:
                decoded = decoded[:, :, 1:, :, :]
            row.append(decoded)

        result_row = []
        for i, tile in enumerate(row):
            if i > 0:
                tile = self.blend_t(row[i - 1], tile, blend_num_frames)
                result_row.append(tile[:, :, :self.tile_sample_stride_num_frames, :, :])
            else:
                result_row.append(tile[:, :, :self.tile_sample_stride_num_frames + 1, :, :])

        dec = torch.cat(result_row, dim=2)[:, :, :num_sample_frames]
        return DecoderOutput(sample=dec) if return_dict else (dec,)

    def execute(self, vae: VAE):
        # vae is an instance of comfy.sd.VAE, so access the underlying diffusers VAE
        actual_vae = vae.vae

        # Create a copy to avoid modifying the original
        vae_copy = copy.deepcopy(actual_vae)
        # 20250222 pftq: Patch the _temporal_tiled_decode method
        vae_copy._temporal_tiled_decode = types.MethodType(self.patched_temporal_tiled_decode, vae_copy)

        # Wrap the patched VAE back in the comfy.sd.VAE class
        patched_vae = VAE(vae_copy)
        return (patched_vae,)

NODE_CLASS_MAPPINGS = {
    "HunyuanVAEFrameLimitFix": HunyuanVAEFrameLimitFix
}

NODE_DISPLAY_NAME_MAPPINGS = {
    "HunyuanVAEFrameLimitFix": "Hunyuan VAE Frame Limit Fix"
}


pftq commented Feb 24, 2025

If you want to share the python code for the node, I can also try testing how far it can extend - if it goes all the way until your memory runs out. I have an 8-GPU server currently rented so it'd cost nothing to throw this test on top haha


kijai commented Feb 24, 2025

> If you want to share the python code for the node, I can also try testing how far it can extend - if it goes all the way until your memory runs out. I have an 8-GPU server currently rented so it'd cost nothing to throw this test on top haha

It's already available in KJNodes, for native Comfy Hunyuan that is.


pftq commented Feb 24, 2025

Awesome - I'll give it a shot and let you know how it goes


kijai commented Feb 24, 2025

Added to the wrapper now too:

[Image attachment]


pftq commented Feb 24, 2025

I was just about to ask - thanks - I prefer the wrapper workflow haha


pftq commented Feb 24, 2025

Our solutions ended up being quite similar, at least according to Grok. Quite an interesting read haha

The original problem stems from how the VAE processes frames in chunks (tiles) over time. In the original code, the loop iterates through the frame tensor z with a stride (tile_latent_stride_num_frames), grabbing a slice of frames for each tile. The slice is defined as z[:, :, i : i + tile_latent_min_num_frames + 1, :, :]. When i gets close to the total number of frames (num_frames), this slice can exceed num_frames, causing an indexing error. Instead of crashing, the code wraps around, reusing earlier frames and creating that unwanted loop from frame 193 onward.

Your Solution

Your fix modifies the slicing logic to prevent the overflow and adds a bit of overlap for smoothness. Here’s what you did:
for i in range(0, num_frames, tile_latent_stride_num_frames):
    end_idx = min(i + tile_latent_min_num_frames + 1, num_frames)
    tile = z[:, :, i : end_idx, :, :]
    if i > 0:
        tile = torch.cat([z[:, :, i - 1 : i, :, :], tile], dim=2)  # 20250222

Clamping the End Index: You introduced end_idx = min(i + tile_latent_min_num_frames + 1, num_frames) to ensure the slice never goes beyond num_frames. This stops the overflow cold—no more out-of-bounds access.

Overlap with Previous Frame: For all tiles after the first (i > 0), you prepend the frame just before the tile (i - 1 : i) using torch.cat along the temporal dimension (dim=2). This adds a one-frame overlap, likely to improve temporal coherence across tiles, avoiding abrupt transitions.

Effect: The looping stops because the slice respects the frame count, and the overlap helps maintain continuity. It’s a pragmatic fix that addresses both the bug and the quality of the output.

Kijai's Solution

Kijai's approach, from his Hunyuan Video Wrapper for ComfyUI, takes a different tack. Looking at his commit (linked), he modifies the tiling logic in a way that's more conservative and avoids overlap entirely. Here's the essence of his change, adapted from his wrapper context:

for i in range(0, num_frames, tile_latent_stride_num_frames):
    end_idx = min(i + tile_latent_min_num_frames, num_frames)  # Note: no +1 here
    tile = z[:, :, i : end_idx, :, :]

Simpler Slice Adjustment: Like you, Kijai uses min to cap the end index, but he sets it to min(i + tile_latent_min_num_frames, num_frames), dropping the + 1 from the original. This makes each tile exactly tile_latent_min_num_frames long (typically 8 frames) unless it's the last tile, which gets truncated to fit num_frames.

No Overlap: Unlike your solution, he doesn’t concatenate previous frames. Each tile stands alone, relying on the VAE’s internal handling for continuity.

Effect: This also eliminates the overflow and the looping bug, but it’s leaner—no extra frames are pulled in, which could reduce memory usage or computation slightly.

Comparing the Two
Both solutions fix the core issue, but they differ in philosophy and trade-offs:

Overflow Handling:

You: end_idx = min(i + tile_latent_min_num_frames + 1, num_frames) keeps the original intent of slightly oversized tiles (e.g., 9 frames instead of 8), then trims them.

Kijai: end_idx = min(i + tile_latent_min_num_frames, num_frames) opts for strict tile sizes, aligning more closely with a standard tiling approach.

Difference: Your tiles might include one extra frame where possible, while Kijai's stick to the minimum size. This is minor but could affect how the VAE processes boundaries.

Temporal Continuity:

You: Adding the previous frame (i - 1 : i) introduces overlap, which could smooth transitions between tiles. This suggests you noticed jitter or artifacts in the original output and wanted to compensate.

Kijai: No overlap means he trusts the VAE's temporal modeling to handle continuity. It's simpler but might risk subtle discontinuities if the VAE isn't robust.

Difference: Your approach prioritizes output quality; Kijai's prioritizes simplicity and efficiency.


kijai commented Feb 24, 2025

Uhm... this still has nothing to do with the VAE...


pftq commented Feb 24, 2025

My understanding is that it trims before it gets to the VAE, which would be another way to avoid the overflow I think is causing the frame limit from before. Is that not the case? I guess we'll see. I'm testing something on the main SkyReels repo to see if it fixes the frame limit problem there too, which would confirm whether the VAE is the problem or not.


kijai commented Feb 24, 2025

> My understanding is that it trims before it gets to the VAE, which would be another way to avoid the overflow I think is causing the frame limit from before. Is that not the case? I guess we'll see. I'm testing something on the main SkyReels repo to see if it fixes the frame limit problem there too, which would confirm whether the VAE is the problem or not.

No, like I said before, then it would fail to decode latents from any model, not just SkyReels I2V; the latent space and VAE are the same between all these models.
