In the usual case with a dedicated GPU card, the full GGUF file has to be uploaded to GPU RAM. In some cases (e.g. an Intel iGPU with shared VRAM) that pool is as small as 3.9 GB, and after allocating a few more VRAM blocks for shaders etc., the GPU-accessible RAM range is used up.
The workaround is to offload some layers to the CPU with a smaller `-ngl X` parameter, which reduces memory use from the GPU's point of view. But then main memory has to hold another full copy of the GGUF data for CPU inference, since VRAM is assumed to be separate from main memory in the usual dedicated-GPU case. The result is a big chunk of duplicated data, which exhausts CPU/OS RAM and closes the door on some PCs running llama.cpp with more recent 4-bit quantized LLMs.
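For concreteness, a partial-offload invocation looks something like this (the model path and layer count are placeholders, not from any real setup):

```shell
# Hypothetical example: offload only the first 20 layers to the GPU
# via -ngl (--n-gpu-layers); the remaining layers run on the CPU.
./llama-cli -m ./models/model-q4_k_m.gguf -ngl 20 -p "Hello"
```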
I am wondering whether there is some Linux kernel mechanism, such as dmabuf sharing of the "video RAM" or an IOMMU remap back to the CPU, that could avoid the duplicated memory usage.