
Maybe it would be better to have a diagram showing how llama.cpp processes inference #11967

Closed
yinuu opened this issue Feb 20, 2025 · 2 comments
Comments

@yinuu

yinuu commented Feb 20, 2025

I'm using llama.cpp to deploy the deepseek-r1-671B-Q4_0 weights, but I found the documentation/README.md is barely detailed; I even had to read the source to understand what happens when I turn a flag on. For example '--gpu-layers': according to the code it is a key condition for pipeline parallelism (PP), but the documentation says nothing about this detail, and I saw no better performance when I set it greater than the model's number of tensor layers.
// TODO: move these checks to ggml_backend_sched
// enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
bool pipeline_parallel =
    model->n_devices() > 1 &&
    model->params.n_gpu_layers > (int)model->hparams.n_layer &&
    model->params.split_mode == LLAMA_SPLIT_MODE_LAYER &&
    params.offload_kqv;
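To make that condition concrete, here is a minimal standalone C++ sketch of the same check. This is not llama.cpp's actual API; the struct and field names below are illustrative stand-ins for the values the quoted code reads:

```cpp
#include <cstdio>

// Illustrative stand-ins for the values the quoted llama.cpp code reads;
// these names are made up for this sketch and are not part of llama.cpp.
struct params_sketch {
    int  n_devices;      // number of devices the model is split across
    int  n_gpu_layers;   // value passed via '--gpu-layers'
    int  n_layer;        // number of layers in the model
    bool split_by_layer; // split mode is by layer (LLAMA_SPLIT_MODE_LAYER)
    bool offload_kqv;    // KQV offloading is enabled
};

// Mirrors the quoted condition: pipeline parallelism is only enabled when the
// model spans more than one device, every layer is offloaded
// (n_gpu_layers > n_layer), the split mode is by layer, and KQV offloading is on.
static bool pipeline_parallel_enabled(const params_sketch & p) {
    return p.n_devices    > 1
        && p.n_gpu_layers > p.n_layer
        && p.split_by_layer
        && p.offload_kqv;
}

int main() {
    // Example: a 61-layer model split across 2 GPUs with '--gpu-layers 62'.
    params_sketch p { /*n_devices=*/2, /*n_gpu_layers=*/62, /*n_layer=*/61,
                      /*split_by_layer=*/true, /*offload_kqv=*/true };
    std::printf("pipeline parallelism: %s\n",
                pipeline_parallel_enabled(p) ? "enabled" : "disabled");
    return 0;
}
```

Going by this check alone, raising '--gpu-layers' above the model's layer count only influences pipeline parallelism, and only together with more than one device, layer split mode, and KQV offloading.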
It would be highly appreciated if there were a processing diagram, ideally with the related flags attached to each node.

thanks all the way

@foldl
Contributor

foldl commented Feb 20, 2025

You can have a look at #10825. Is that what you need?

@yinuu
Author

yinuu commented Feb 21, 2025

You can have a look at #10825. Is that what you need?

It's not that clear, but it really helps. Thanks again!

@yinuu yinuu closed this as completed Feb 21, 2025