Memory optimization + fixes by Elricfae · Pull Request #242 · VisoMasterFusion/VisoMaster-Fusion

Elricfae · 2026-05-13T20:30:40Z

Summary

These changes focus on performance optimization and code quality improvements, with a particular emphasis on VRAM management, CUDA efficiency, and code formatting consistency. The modifications span multiple processor modules and optimize memory allocation patterns throughout the face processing pipeline.

Key Changes:

🚀 Performance Optimizations:

VRAM/PCIe Efficiency: Replaced CPU-allocated tensors with direct GPU allocation using torch.ones(), torch.zeros(), and torch.full() to avoid unnecessary PCIe transfers
Memory Reuse in DDIM Loop: Pre-allocated buffers are now reused in place (via .fill_()) instead of creating new tensors on every step, reducing VRAM fragmentation
Lazy Kernel Initialization: Converted convolution kernels to lazy-initialized properties, ensuring thread-safe CUDA allocation without pre-allocation overhead
Video Buffer Management: Optimized frame buffer sizing from preroll_target * 4 to preroll_target + (num_threads * 2) with RAM safety net checks to prevent OOM on high-resolution videos
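To illustrate the buffer sizing change, here is a minimal sketch of the arithmetic. Only the two formulas come from this PR; the function names, the `ram_fraction` parameter, and the sample values (a preroll of 30 frames, 8 threads, 1 GiB of free RAM) are assumptions for illustration, and the actual RAM safety check may work differently.

```python
def old_buffer_size(preroll_target: int) -> int:
    """Previous sizing: a fixed multiple of the preroll target."""
    return preroll_target * 4


def new_buffer_size(preroll_target: int, num_threads: int) -> int:
    """New sizing: the preroll target plus two frames of headroom per worker thread."""
    return preroll_target + (num_threads * 2)


def cap_by_ram(frames: int, frame_bytes: int, available_bytes: int,
               ram_fraction: float = 0.5) -> int:
    """Hypothetical RAM safety net: never buffer more frames than fit in a
    fraction of the currently available RAM, but always allow at least one."""
    max_frames = int(available_bytes * ram_fraction) // frame_bytes
    return max(1, min(frames, max_frames))


if __name__ == "__main__":
    preroll, threads = 30, 8              # assumed example values
    frame_4k_rgb = 3840 * 2160 * 3        # bytes per uncompressed 4K RGB frame
    wanted = new_buffer_size(preroll, threads)
    print(old_buffer_size(preroll), wanted)            # 120 46
    print(cap_by_ram(wanted, frame_4k_rgb, 1024**3))   # 21
```

With these assumed numbers the buffer shrinks from 120 to 46 frames, and the cap lowers it further to 21 frames when only 1 GiB of RAM is free for 4K input.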

🧹 Code Quality:

Standardized multi-line formatting for io_binding.bind_output() calls across all face detection/swapping models (RetinaFace, SCRFD, YOLOFace, YuNet, CSCS, etc.)
Removed redundant .clone() operations where references suffice (e.g., mask assignments, tensor assignments)
Unified tensor operations using in-place methods and cleaner chaining patterns
Removed redundant CUDA stream sync comments and dead code

🐛 Batch Video Processing Fixes:

Face Mix-up Prevention: Added force_recognition_in_batch flag to disable fast-path optimization during batch processing, ensuring true identity verification on each video
State Reset: Detector and Temporal EMA states are now purged between batch videos
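The two batch fixes can be pictured with a small sketch. Only the `force_recognition_in_batch` flag name comes from this PR; the `BatchState` class, its fields, and the `identify` helper are hypothetical stand-ins, not the project's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BatchState:
    """Hypothetical per-run state carried by the face pipeline."""
    force_recognition_in_batch: bool = False
    detector_cache: dict = field(default_factory=dict)  # track id -> identity
    ema_state: dict = field(default_factory=dict)       # temporal smoothing state

    def reset_between_videos(self) -> None:
        """Purge everything that must not leak from one video into the next."""
        self.detector_cache.clear()
        self.ema_state.clear()


def identify(state: BatchState, track_id: int, recognize: Callable[[], str]) -> str:
    """Return the identity for a tracked face.

    The fast path reuses a cached identity. When the flag is set, the fast
    path is skipped and full recognition runs every time.
    """
    if not state.force_recognition_in_batch and track_id in state.detector_cache:
        return state.detector_cache[track_id]
    identity = recognize()
    state.detector_cache[track_id] = identity
    return identity
```

In this sketch, a batch runner would set the flag once and call `reset_between_videos()` after each file, so a cached identity from one video can never be applied to a face in the next.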

🔧 Minor Fixes:

Adjusted eye ratio minimum threshold from 0.10 to 0.08 for better eyelid blending
Suppressed non-critical ONNX warnings by consistently setting log_severity_level = 3
Streamlined ONNX session initialization logic
Input Face Loader Robustness: Added try-except wrapping in the loader worker to gracefully skip corrupted images instead of crashing
Input Face UI Batching: Implemented batched UI updates with a Python deque + timer system to prevent main thread freezing during large face loads
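The loader robustness and UI batching changes follow a common pattern, sketched below under stated assumptions. The class and function names are hypothetical, the timer is represented by a plain `on_timer_tick()` method rather than a real GUI timer, and the batch size of 16 is an arbitrary example value.

```python
from collections import deque
from typing import Any, Callable, Iterable, List


class BatchedUiUpdater:
    """Queue loaded items and hand them to the UI a few at a time."""

    def __init__(self, add_to_ui: Callable[[Any], None], batch_size: int = 16):
        self.pending: deque = deque()
        self.add_to_ui = add_to_ui
        self.batch_size = batch_size

    def enqueue(self, item: Any) -> None:
        """Called from the loader worker; deque.append is thread-safe in CPython."""
        self.pending.append(item)

    def on_timer_tick(self) -> int:
        """Called periodically on the main thread; drains at most one batch."""
        count = min(self.batch_size, len(self.pending))
        for _ in range(count):
            self.add_to_ui(self.pending.popleft())
        return count


def load_faces(paths: Iterable[str], decode: Callable[[str], Any],
               updater: BatchedUiUpdater) -> List[str]:
    """Decode each image; skip and record files that fail instead of crashing."""
    skipped = []
    for path in paths:
        try:
            updater.enqueue(decode(path))
        except Exception:
            skipped.append(path)
    return skipped
```

Because each tick touches at most `batch_size` items, the main thread returns to its event loop between batches instead of inserting thousands of widgets in one call.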

📊 Expected Impact

VRAM: Noticeably reduced fragmentation and peak memory usage, especially during multi-iteration operations
Throughput: GPU kernels execute with less synchronization overhead
Reliability: Batch processing now handles mixed-quality videos without face identity confusion
UX: Input face loading no longer freezes the UI on large datasets

Elricfae added 2 commits May 13, 2026 22:25

Memory optimization + fixes

671faf0

Batch fixes and Input faces loading fix

bdbce05

Elricfae added the patch label (bug fixes, documentation, refactors, dependency updates) May 14, 2026

Glat0s merged commit 5ce122a into VisoMasterFusion:dev May 14, 2026

t3nka mentioned this pull request May 14, 2026

Input Faces search refreshes endlessly after typing more than one character #200

Closed

Glat0s merged 2 commits into VisoMasterFusion:dev from Elricfae:dev
