In the usual case with a dedicated GPU card, the full GGUF file has to be uploaded to GPU RAM. In some cases (e.g. an Intel iGPU with shared VRAM) that pool is as small as 3.9 GB, and after allocating a few more VRAM blocks for shaders etc., the GPU-accessible RAM range is used up.
The workaround is to offload some layers to the CPU with a smaller `-ngl X` parameter, which reduces memory use from the GPU's point of view. But then main memory has to hold another full copy of the GGUF data for CPU inference, since VRAM is assumed to be separate from main memory in the usual dedicated-GPU case. The result is a big chunk of duplicated data, which exhausts CPU/OS RAM and closes the door on some PCs running llama.cpp with more recent 4-bit quantized LLMs.
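For concreteness, a partial-offload invocation looks something like this (the model path and layer count are placeholders, not from any real setup):

```shell
# Hypothetical example: offload only the first 20 layers to the GPU
# via -ngl (--n-gpu-layers); the remaining layers run on the CPU.
./llama-cli -m ./models/model-q4_k_m.gguf -ngl 20 -p "Hello"
```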
I am wondering whether there is some Linux kernel mechanism, such as dmabuf sharing of the "video RAM" or an IOMMU remap back to the CPU, that could avoid the duplicated memory usage.