
Commit ce57c4e

Update the Low-VRAM docs.
1 parent 0cf51ce commit ce57c4e


docs/features/low-vram.md

Lines changed: 27 additions & 9 deletions
@@ -28,11 +28,12 @@ It is possible to fine-tune the settings for best performance or if you still ge

## Details and fine-tuning

-Low-VRAM mode involves 3 features, each of which can be configured or fine-tuned:
+Low-VRAM mode involves 4 features, each of which can be configured or fine-tuned:

-- Partial model loading
-- Dynamic RAM and VRAM cache sizes
-- Working memory
+- Partial model loading (`enable_partial_loading`)
+- Dynamic RAM and VRAM cache sizes (`max_cache_ram_gb`, `max_cache_vram_gb`)
+- Working memory (`device_working_mem_gb`)
+- Keeping a RAM weight copy (`keep_ram_copy_of_weights`)

Read on to learn about these features and understand how to fine-tune them for your system and use-cases.

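For orientation, a minimal `invokeai.yaml` sketch that brings the four settings named above together might look as follows. The numbers are the illustrative values used elsewhere in this doc, not recommendations, and `enable_partial_loading: true` is assumed to be the boolean toggle for the first feature; the defaults logged at startup are a reasonable starting point.

```yaml
# Low-VRAM mode, sketch only; tune every value for your own hardware.
enable_partial_loading: true     # partial model loading (assumed boolean toggle)
max_cache_ram_gb: 28             # RAM model cache size, in GB
max_cache_vram_gb: 18            # VRAM model cache size, in GB
device_working_mem_gb: 4         # working memory reserved for inference, in GB
keep_ram_copy_of_weights: true   # on by default; set to false to reduce average RAM use
```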
@@ -67,12 +68,20 @@ As of v5.6.0, the caches are dynamically sized. The `ram` and `vram` settings ar
But, if your GPU has enough VRAM to hold models fully, you might get a perf boost by manually setting the cache sizes in `invokeai.yaml`:

```yaml
-# Set the RAM cache size to as large as possible, leaving a few GB free for the rest of your system and Invoke.
-# For example, if your system has 32GB RAM, 28GB is a good value.
+# The default max cache RAM size is logged on InvokeAI startup. It is determined based on your system RAM / VRAM.
+# You can override the default value by setting `max_cache_ram_gb`.
+# Increasing `max_cache_ram_gb` will increase the amount of RAM used to cache inactive models, resulting in faster model
+# reloads for the cached models.
+# As an example, if your system has 32GB of RAM and no other heavy processes, setting the `max_cache_ram_gb` to 28GB
+# might be a good value to achieve aggressive model caching.
max_cache_ram_gb: 28
-# Set the VRAM cache size to be as large as possible while leaving enough room for the working memory of the tasks you will be doing.
-# For example, on a 24GB GPU that will be running unquantized FLUX without any auxiliary models,
-# 18GB is a good value.
+# The default max cache VRAM size is adjusted dynamically based on the amount of available VRAM (taking into
+# consideration the VRAM used by other processes).
+# You can override the default value by setting `max_cache_vram_gb`. Note that this value takes precedence over the
+# `device_working_mem_gb`.
+# It is recommended to set the VRAM cache size to be as large as possible while leaving enough room for the working
+# memory of the tasks you will be doing. For example, on a 24GB GPU that will be running unquantized FLUX without any
+# auxiliary models, 18GB might be a good value.
max_cache_vram_gb: 18
```

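To make the precedence note in the new comments concrete, here is a small sketch using the same illustrative values: when `max_cache_vram_gb` is set explicitly, it takes precedence over `device_working_mem_gb`.

```yaml
# Sketch only: an explicitly set VRAM cache cap takes precedence over
# device_working_mem_gb (illustrative values from this doc).
max_cache_vram_gb: 18
device_working_mem_gb: 4
```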
@@ -109,6 +118,15 @@ device_working_mem_gb: 4

Once decoding completes, the model manager "reclaims" the extra VRAM allocated as working memory for future model loading operations.

+### Keeping a RAM weight copy
+
+Invoke has the option of keeping a RAM copy of all model weights, even when they are loaded onto the GPU. This optimization is _on_ by default, and enables faster model switching and LoRA patching. Disabling this feature will reduce the average RAM load while running Invoke (peak RAM likely won't change), at the cost of slower model switching and LoRA patching. If you have limited RAM, you can disable this optimization:
+
+```yaml
+# Set to false to reduce the average RAM usage at the cost of slower model switching and LoRA patching.
+keep_ram_copy_of_weights: false
+```
+
### Disabling Nvidia sysmem fallback (Windows only)

On Windows, Nvidia GPUs are able to use system RAM when their VRAM fills up via **sysmem fallback**. While it sounds like a good idea on the surface, in practice it causes massive slowdowns during generation.
