Commit c5d2de3
Revise the default logic for the model cache RAM limit (#7566)
## Summary

This PR revises the logic for calculating the model cache RAM limit. See the code for thorough documentation of the change.

The updated logic is more conservative in the amount of RAM that it will use. This will likely be a better default for more users. Of course, users can still choose to set a more aggressive limit by overriding the logic with `max_cache_ram_gb`.

## Related Issues / Discussions

- Should help with #7563

## QA Instructions

Exercise all heuristics:

- [x] Heuristic 1
- [x] Heuristic 2
- [x] Heuristic 3
- [x] Heuristic 4

## Merge Plan

- [x] Merge #7565 first and update the target branch

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
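As a concrete illustration of the override mentioned above (the value is illustrative only, borrowed from the example in the updated `docs/features/low-vram.md` below, not a recommendation from this PR):

```yaml
# invokeai.yaml
# Skip the new dynamic default and pin an aggressive RAM cache limit.
# Illustrative value for a 32GB-RAM machine with no other heavy processes running.
max_cache_ram_gb: 28
```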
2 parents f7511bf + ce57c4e commit c5d2de3

2 files changed (+103, -35 lines)

docs/features/low-vram.md

Lines changed: 27 additions & 9 deletions
@@ -28,11 +28,12 @@ It is possible to fine-tune the settings for best performance or if you still ge
 
 ## Details and fine-tuning
 
-Low-VRAM mode involves 3 features, each of which can be configured or fine-tuned:
+Low-VRAM mode involves 4 features, each of which can be configured or fine-tuned:
 
-- Partial model loading
-- Dynamic RAM and VRAM cache sizes
-- Working memory
+- Partial model loading (`enable_partial_loading`)
+- Dynamic RAM and VRAM cache sizes (`max_cache_ram_gb`, `max_cache_vram_gb`)
+- Working memory (`device_working_mem_gb`)
+- Keeping a RAM weight copy (`keep_ram_copy_of_weights`)
 
 Read on to learn about these features and understand how to fine-tune them for your system and use-cases.
 
@@ -67,12 +68,20 @@ As of v5.6.0, the caches are dynamically sized. The `ram` and `vram` settings ar
 But, if your GPU has enough VRAM to hold models fully, you might get a perf boost by manually setting the cache sizes in `invokeai.yaml`:
 
 ```yaml
-# Set the RAM cache size to as large as possible, leaving a few GB free for the rest of your system and Invoke.
-# For example, if your system has 32GB RAM, 28GB is a good value.
+# The default max cache RAM size is logged on InvokeAI startup. It is determined based on your system RAM / VRAM.
+# You can override the default value by setting `max_cache_ram_gb`.
+# Increasing `max_cache_ram_gb` will increase the amount of RAM used to cache inactive models, resulting in faster model
+# reloads for the cached models.
+# As an example, if your system has 32GB of RAM and no other heavy processes, setting the `max_cache_ram_gb` to 28GB
+# might be a good value to achieve aggressive model caching.
 max_cache_ram_gb: 28
-# Set the VRAM cache size to be as large as possible while leaving enough room for the working memory of the tasks you will be doing.
-# For example, on a 24GB GPU that will be running unquantized FLUX without any auxiliary models,
-# 18GB is a good value.
+# The default max cache VRAM size is adjusted dynamically based on the amount of available VRAM (taking into
+# consideration the VRAM used by other processes).
+# You can override the default value by setting `max_cache_vram_gb`. Note that this value takes precedence over the
+# `device_working_mem_gb`.
+# It is recommended to set the VRAM cache size to be as large as possible while leaving enough room for the working
+# memory of the tasks you will be doing. For example, on a 24GB GPU that will be running unquantized FLUX without any
+# auxiliary models, 18GB might be a good value.
 max_cache_vram_gb: 18
 ```
 
@@ -109,6 +118,15 @@ device_working_mem_gb: 4
 
 Once decoding completes, the model manager "reclaims" the extra VRAM allocated as working memory for future model loading operations.
 
+### Keeping a RAM weight copy
+
+Invoke has the option of keeping a RAM copy of all model weights, even when they are loaded onto the GPU. This optimization is _on_ by default, and enables faster model switching and LoRA patching. Disabling this feature will reduce the average RAM load while running Invoke (peak RAM likely won't change), at the cost of slower model switching and LoRA patching. If you have limited RAM, you can disable this optimization:
+
+```yaml
+# Set to false to reduce the average RAM usage at the cost of slower model switching and LoRA patching.
+keep_ram_copy_of_weights: false
+```
+
 ### Disabling Nvidia sysmem fallback (Windows only)
 
 On Windows, Nvidia GPUs are able to use system RAM when their VRAM fills up via **sysmem fallback**. While it sounds like a good idea on the surface, in practice it causes massive slowdowns during generation.
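For reference, all four of the settings named in the updated docs above live in `invokeai.yaml`. A combined sketch, using purely illustrative values drawn from the examples in this diff (not defaults shipped by this PR), might look like:

```yaml
# invokeai.yaml -- illustrative low-VRAM tuning sketch, not shipped defaults.
enable_partial_loading: true       # partial model loading
max_cache_ram_gb: 28               # override the dynamically calculated RAM cache limit
max_cache_vram_gb: 18              # override the dynamically sized VRAM cache
device_working_mem_gb: 4           # working memory reserved for operations like VAE decoding
keep_ram_copy_of_weights: false    # trade slower model switching / LoRA patching for lower average RAM
```

In practice you would pick only the subset that matches your hardware; omitting `max_cache_ram_gb` and `max_cache_vram_gb` leaves the dynamic defaults introduced by this PR in effect.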

invokeai/backend/model_manager/load/model_cache/model_cache.py

Lines changed: 76 additions & 26 deletions
@@ -123,6 +123,8 @@ def __init__(
         self._cached_models: Dict[str, CacheRecord] = {}
         self._cache_stack: List[str] = []
 
+        self._ram_cache_size_bytes = self._calc_ram_available_to_model_cache()
+
     @property
     def stats(self) -> Optional[CacheStats]:
         """Return collected CacheStats object."""
@@ -388,41 +390,89 @@ def _get_vram_in_use(self) -> int:
         # Alternative definition of VRAM in use:
         # return sum(ce.cached_model.cur_vram_bytes() for ce in self._cached_models.values())
 
-    def _get_ram_available(self) -> int:
-        """Get the amount of RAM available for the cache to use, while keeping memory pressure under control."""
+    def _calc_ram_available_to_model_cache(self) -> int:
+        """Calculate the amount of RAM available for the cache to use."""
         # If self._max_ram_cache_size_gb is set, then it overrides the default logic.
         if self._max_ram_cache_size_gb is not None:
-            ram_total_available_to_cache = int(self._max_ram_cache_size_gb * GB)
-            return ram_total_available_to_cache - self._get_ram_in_use()
-
-        virtual_memory = psutil.virtual_memory()
-        ram_total = virtual_memory.total
-        ram_available = virtual_memory.available
-        ram_used = ram_total - ram_available
-
-        # The total size of all the models in the cache will often be larger than the amount of RAM reported by psutil
-        # (due to lazy-loading and OS RAM caching behaviour). We could just rely on the psutil values, but it feels
-        # like a bad idea to over-fill the model cache. So, for now, we'll try to keep the total size of models in the
-        # cache under the total amount of system RAM.
-        cache_ram_used = self._get_ram_in_use()
-        ram_used = max(cache_ram_used, ram_used)
-
-        # Aim to keep 10% of RAM free.
-        ram_available_based_on_memory_usage = int(ram_total * 0.9) - ram_used
+            self._logger.info(f"Using user-defined RAM cache size: {self._max_ram_cache_size_gb} GB.")
+            return int(self._max_ram_cache_size_gb * GB)
+
+        # Heuristics for dynamically calculating the RAM cache size, **in order of increasing priority**:
+        # 1. As an initial default, use 50% of the total RAM for InvokeAI.
+        #    - Assume a 2GB baseline for InvokeAI's non-model RAM usage, and use the rest of the RAM for the model cache.
+        # 2. On a system with a lot of RAM (e.g. 64GB+), users probably don't want InvokeAI to eat up too much RAM.
+        #    There are diminishing returns to storing more and more models. So, we apply an upper bound.
+        #    - On systems without a CUDA device, the upper bound is 32GB.
+        #    - On systems with a CUDA device, the upper bound is 2x the amount of VRAM.
+        # 3. On systems with a CUDA device, the minimum should be the VRAM size (less the working memory).
+        #    - Setting lower than this would mean that we sometimes kick models out of the cache when there is room for
+        #      all models in VRAM.
+        #    - Consider an extreme case of a system with 8GB RAM / 24GB VRAM. I haven't tested this, but I think
+        #      you'd still want the RAM cache size to be ~24GB (less the working memory). (Though you'd probably want to
+        #      set `keep_ram_copy_of_weights: false` in this case.)
+        # 4. Absolute minimum of 4GB.
+
+        # NOTE(ryand): We explored dynamically adjusting the RAM cache size based on memory pressure (using psutil), but
+        # decided against it for now, for the following reasons:
+        # - It was surprisingly difficult to get memory metrics with consistent definitions across OSes. (If you go
+        #   down this path again, don't underestimate the amount of complexity here and be sure to test rigorously on all
+        #   OSes.)
+        # - Making the RAM cache size dynamic opens the door for performance regressions that are hard to diagnose and
+        #   hard for users to understand. It is better for users to see that their RAM is maxed out, and then override
+        #   the default value if desired.
+
+        # Lookup the total VRAM size for the CUDA execution device.
+        total_cuda_vram_bytes: int | None = None
+        if self._execution_device.type == "cuda":
+            _, total_cuda_vram_bytes = torch.cuda.mem_get_info(self._execution_device)
+
+        # Apply heuristic 1.
+        # ------------------
+        heuristics_applied = [1]
+        total_system_ram_bytes = psutil.virtual_memory().total
+        # Assumed baseline RAM used by InvokeAI for non-model stuff.
+        baseline_ram_used_by_invokeai = 2 * GB
+        ram_available_to_model_cache = int(total_system_ram_bytes * 0.5 - baseline_ram_used_by_invokeai)
+
+        # Apply heuristic 2.
+        # ------------------
+        max_ram_cache_size_bytes = 32 * GB
+        if total_cuda_vram_bytes is not None:
+            max_ram_cache_size_bytes = 2 * total_cuda_vram_bytes
+        if ram_available_to_model_cache > max_ram_cache_size_bytes:
+            heuristics_applied.append(2)
+            ram_available_to_model_cache = max_ram_cache_size_bytes
+
+        # Apply heuristic 3.
+        # ------------------
+        if total_cuda_vram_bytes is not None:
+            if self._max_vram_cache_size_gb is not None:
+                min_ram_cache_size_bytes = int(self._max_vram_cache_size_gb * GB)
+            else:
+                min_ram_cache_size_bytes = total_cuda_vram_bytes - int(self._execution_device_working_mem_gb * GB)
+            if ram_available_to_model_cache < min_ram_cache_size_bytes:
+                heuristics_applied.append(3)
+                ram_available_to_model_cache = min_ram_cache_size_bytes
 
-        # If we are running out of RAM, then there's an increased likelihood that we will run into this issue:
-        # https://github.com/invoke-ai/InvokeAI/issues/7513
-        # To keep things running smoothly, there's a minimum RAM cache size that we always allow (even if this means
-        # using swap).
-        min_ram_cache_size_bytes = 4 * GB
-        ram_available_based_on_min_cache_size = min_ram_cache_size_bytes - cache_ram_used
+        # Apply heuristic 4.
+        # ------------------
+        if ram_available_to_model_cache < 4 * GB:
+            heuristics_applied.append(4)
+            ram_available_to_model_cache = 4 * GB
 
-        return max(ram_available_based_on_memory_usage, ram_available_based_on_min_cache_size)
+        self._logger.info(
+            f"Calculated model RAM cache size: {ram_available_to_model_cache / MB:.2f} MB. Heuristics applied: {heuristics_applied}."
+        )
+        return ram_available_to_model_cache
 
     def _get_ram_in_use(self) -> int:
         """Get the amount of RAM currently in use."""
         return sum(ce.cached_model.total_bytes() for ce in self._cached_models.values())
 
+    def _get_ram_available(self) -> int:
+        """Get the amount of RAM available for the cache to use."""
+        return self._ram_cache_size_bytes - self._get_ram_in_use()
+
     def _capture_memory_snapshot(self) -> Optional[MemorySnapshot]:
         if self._log_memory_usage:
            return MemorySnapshot.capture()
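To sanity-check what limit a given machine ends up with, the four heuristics above can be restated as a small standalone function. This is an illustrative sketch, not the project's API: it ignores the `max_cache_ram_gb` / `max_cache_vram_gb` override branches, and the `working_mem_gb=3.0` default plus the sample machine specs are assumptions made for the example.

```python
GB = 2**30  # bytes per GiB, local to this sketch


def calc_ram_cache_limit_bytes(
    total_ram_bytes: int,
    total_vram_bytes: int | None,
    working_mem_gb: float = 3.0,  # assumed working-memory value, for illustration only
) -> tuple[int, list[int]]:
    """Restate the four heuristics, applied in order of increasing priority."""
    heuristics_applied = [1]

    # Heuristic 1: 50% of system RAM, minus a 2GB baseline for non-model usage.
    limit = int(total_ram_bytes * 0.5 - 2 * GB)

    # Heuristic 2: upper bound of 32GB without CUDA, or 2x VRAM with CUDA.
    upper_bound = 2 * total_vram_bytes if total_vram_bytes is not None else 32 * GB
    if limit > upper_bound:
        heuristics_applied.append(2)
        limit = upper_bound

    # Heuristic 3: with CUDA, never go below the VRAM size less the working memory.
    if total_vram_bytes is not None:
        lower_bound = total_vram_bytes - int(working_mem_gb * GB)
        if limit < lower_bound:
            heuristics_applied.append(3)
            limit = lower_bound

    # Heuristic 4: absolute minimum of 4GB.
    if limit < 4 * GB:
        heuristics_applied.append(4)
        limit = 4 * GB

    return limit, heuristics_applied


# 8GB RAM / 24GB VRAM: heuristic 3 raises the limit to 21GB (VRAM minus working memory).
print(calc_ram_cache_limit_bytes(8 * GB, 24 * GB))    # (22548578304, [1, 3])

# 128GB RAM / 12GB VRAM: heuristic 2 caps the limit at 2x VRAM = 24GB.
print(calc_ram_cache_limit_bytes(128 * GB, 12 * GB))  # (25769803776, [1, 2])
```

On a real install, the limit actually chosen (and the list of heuristics that fired) is logged at startup by the `Calculated model RAM cache size: ... Heuristics applied: ...` message added in this diff.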
