
Memory optimization + fixes #242

Merged
Glat0s merged 2 commits into VisoMasterFusion:dev from Elricfae:dev on May 14, 2026

Elricfae (Contributor) commented May 13, 2026

Summary

These changes focus on performance optimization and code quality, with particular emphasis on VRAM management, CUDA efficiency, and consistent code formatting. The modifications span multiple processor modules and rework memory allocation patterns throughout the face-processing pipeline.

Key Changes:

🚀 Performance Optimizations:

  • VRAM/PCIe Efficiency: Replaced CPU-allocated tensors with direct GPU allocation using torch.ones(), torch.zeros(), and torch.full() to avoid unnecessary PCIe transfers
  • Memory Reuse in DDIM Loop: Buffers are now pre-allocated once and reused in place (via .fill_()) instead of creating new tensors on every step, avoiding the per-step allocations that fragment VRAM
  • Lazy Kernel Initialization: Converted convolution kernels to lazy-initialized properties, ensuring thread-safe CUDA allocation without pre-allocation overhead
  • Video Buffer Management: Optimized frame buffer sizing from preroll_target * 4 to preroll_target + (num_threads * 2) with RAM safety net checks to prevent OOM on high-resolution videos
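The new frame-buffer sizing rule can be sketched as plain arithmetic. This is a hypothetical illustration, not the code merged in this PR: the function name `frame_buffer_capacity`, its `frame_bytes` and `available_ram_bytes` parameters, and the 50% RAM budget are assumptions used to show how a RAM safety net could cap `preroll_target + (num_threads * 2)`.

```python
def frame_buffer_capacity(
    preroll_target: int,
    num_threads: int,
    frame_bytes: int,
    available_ram_bytes: int,
    ram_fraction: float = 0.5,
) -> int:
    """Return how many decoded frames to buffer (hypothetical helper).

    The base size follows the PR description: preroll_target + num_threads * 2,
    replacing the old preroll_target * 4. The RAM check is an assumed safety
    net: the buffer may use at most ram_fraction of the currently free RAM.
    """
    base = preroll_target + num_threads * 2
    # Assumed safety net: how many whole frames fit in the allowed RAM budget.
    ram_cap = int(available_ram_bytes * ram_fraction) // max(frame_bytes, 1)
    # Never drop below one frame, otherwise playback could not proceed at all.
    return max(1, min(base, ram_cap))
```

For a 4K RGB frame (3840 × 2160 × 3 bytes, about 24.9 MB), `preroll_target=30` and `num_threads=4` give 38 buffered frames under the new rule versus 120 (about 3 GB) under the old `preroll_target * 4` rule; with only 512 MiB of free RAM, the assumed safety net lowers the cap to 10 frames.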

🧹 Code Quality:

  • Standardized multi-line formatting for io_binding.bind_output() calls across all face detection/swapping models (RetinaFace, SCRFD, YOLOFace, YuNet, CSCS, etc.)
  • Removed redundant .clone() operations where references suffice (e.g., mask assignments, tensor assignments)
  • Unified tensor operations using in-place methods and cleaner chaining patterns
  • Cleaned up redundant CUDA stream-sync comments and removed dead code

🐛 Batch Video Processing Fixes:

  • Face Mix-up Prevention: Added a force_recognition_in_batch flag that disables the fast-path optimization during batch processing, so identity is actually verified in each video
  • State Reset: Detector and temporal EMA states are now purged between videos in a batch
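The face mix-up and both fixes for it can be modeled with a small identity cache. This sketch is hypothetical and not VisoMaster's actual code: `process_batch`, `recognize`, and `reset_between_videos` are illustrative names. The fast path reuses the last recognized identity, so with neither a reset nor forced recognition, the second video inherits the first video's identity. Detector and temporal EMA state would be reset at the same point in the loop.

```python
from typing import Callable, List, Optional


def process_batch(
    videos: List[List[str]],
    recognize: Callable[[str], str],
    force_recognition_in_batch: bool,
    reset_between_videos: bool = True,
) -> List[List[str]]:
    """Return the identity assigned to every frame of every video.

    `recognize` stands in for a hypothetical, expensive identity check.
    """
    last_id: Optional[str] = None  # state shared across the whole batch
    results: List[List[str]] = []
    for frames in videos:
        if reset_between_videos:
            # Purge per-video state (detector/EMA state would be reset here too).
            last_id = None
        ids: List[str] = []
        for face in frames:
            if not force_recognition_in_batch and last_id is not None:
                # Fast path: reuse the previous identity without verifying it.
                ids.append(last_id)
            else:
                # Slow path: run the recognizer and remember its answer.
                last_id = recognize(face)
                ids.append(last_id)
        results.append(ids)
    return results
```

With two videos `[["A", "A"], ["B", "B"]]` and an identity recognizer, disabling both the reset and forced recognition labels every frame of the second video as `"A"`; enabling either one restores `"B"`.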

🔧 Minor Fixes:

  • Adjusted eye ratio minimum threshold from 0.10 to 0.08 for better eyelid blending
  • Suppressed non-critical ONNX warnings by consistently setting log_severity_level = 3
  • Streamlined ONNX session initialization logic
  • Input Face Loader Robustness: Added try-except wrapping in the loader worker to gracefully skip corrupted images instead of crashing
  • Input Face UI Batching: Implemented batched UI updates with a Python deque + timer system to prevent main thread freezing during large face loads
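The loader-robustness and UI-batching changes follow a common producer/consumer pattern. The sketch below is an assumption about the general shape, not the PR's implementation: `load_faces`, `BatchedUIUpdater`, `load_image`, `add_to_ui`, and `batch_size` are hypothetical names. In a Qt application, `on_tick` would typically be connected to a `QTimer`'s `timeout` signal; here it is called directly so the example runs without a GUI. `collections.deque` is used because its `append` and `popleft` are documented as thread-safe.

```python
import collections
from typing import Callable, Deque, Iterable, List, Tuple


def load_faces(
    paths: Iterable[str],
    load_image: Callable[[str], object],
    queue: Deque[Tuple[str, object]],
) -> List[str]:
    """Worker side: load each image, skipping failures instead of crashing.

    Returns the skipped paths. `load_image` is a hypothetical loader that
    raises on corrupted files.
    """
    skipped: List[str] = []
    for path in paths:
        try:
            queue.append((path, load_image(path)))
        except Exception:
            # Corrupted or unreadable file: record it and keep going.
            skipped.append(path)
    return skipped


class BatchedUIUpdater:
    """UI side: drain at most `batch_size` queued items per timer tick."""

    def __init__(
        self,
        queue: Deque[Tuple[str, object]],
        add_to_ui: Callable[[str, object], None],
        batch_size: int = 20,
    ) -> None:
        self.queue = queue
        self.add_to_ui = add_to_ui
        self.batch_size = batch_size

    def on_tick(self) -> int:
        # Bounded work per tick keeps the main thread responsive.
        count = 0
        while self.queue and count < self.batch_size:
            path, image = self.queue.popleft()
            self.add_to_ui(path, image)
            count += 1
        return count
```

Capping the work done per tick means a load of thousands of faces is spread over many short event-loop slices instead of one long blocking update.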

📊 Expected Impact

  • VRAM: Noticeably reduced fragmentation and peak memory usage, especially during multi-iteration operations
  • Throughput: GPU kernels execute with less synchronization overhead
  • Reliability: Batch processing now handles mixed-quality videos without face identity confusion
  • UX: Input face loading no longer freezes the UI on large datasets

Elricfae added the patch label (Bug fixes, documentation, refactors, dependency updates) on May 14, 2026
Glat0s merged commit 5ce122a into VisoMasterFusion:dev on May 14, 2026