content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md
@@ -1,9 +1,9 @@
 ---
-title: Distributed inference using llama.cpp
+title: Run distributed inference with llama.cpp on Arm-based AWS Graviton4 instances
 
 minutes_to_complete: 30
 
-who_is_this_for: This introductory topic is for developers with some experience using llama.cpp who want to learn distributed inference.
+who_is_this_for: This introductory topic is for developers with some experience using llama.cpp who want to learn how to run distributed inference on Arm-based servers.
 
 learning_objectives:
     - Set up a main host and worker nodes with llama.cpp
@@ -16,13 +16,13 @@
-    - Download and build `llama.cpp`, a C++ library for efficient CPU inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
+    - Download and build `llama.cpp`, a C++ library for efficient inference of Llama and similar large language models on CPUs, optimized for local and embedded environments.
     - Convert Meta's `safetensors` files to a single GGUF file.
     - Quantize the 16-bit GGUF weights file to 4-bit weights.
     - Load and run the model.
 
 {{% notice Note %}}
-The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take 1-2 hours. If you already have a quantized GGUF file, you can skip the download and quantization.
+The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take several hours depending on bandwidth and system resources. If you already have a quantized GGUF file, you can skip the download and quantization.
 {{% /notice %}}
 
-## Set up dependencies
+## Install dependencies
 
 Before you start, make sure you have permission to access Meta's [Llama 3.1 70B parameter model](https://huggingface.co/meta-llama/Llama-3.1-70B).
 
@@ -35,7 +35,7 @@ You must repeat the install steps on each device. However, only run the download
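For context on the workflow this learning path's objectives describe, the sketch below strings together the kind of llama.cpp commands involved: build with the RPC backend on every device, convert and quantize the model on the main host, then run inference across the workers. It is illustrative only and not taken from the PR: the worker hostnames, port, and file paths are placeholders, and binary names and flags can differ between llama.cpp releases.

```bash
# Build llama.cpp with the RPC backend on every device (main host and workers).
# GGML_RPC=ON enables the rpc-server binary and the --rpc client flag.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release

# On the main host only: convert Meta's safetensors checkpoint to a single
# 16-bit GGUF file, then quantize it to 4-bit (Q4_0) weights.
python3 convert_hf_to_gguf.py /path/to/Llama-3.1-70B --outtype f16 \
  --outfile llama-3.1-70b-f16.gguf
./build/bin/llama-quantize llama-3.1-70b-f16.gguf llama-3.1-70b-q4_0.gguf Q4_0

# On each worker node: start an RPC server exposing the local CPU backend.
# 0.0.0.0 and 50052 are placeholder bind address and port.
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the main host: run the model, offloading layers to the workers.
# worker1 and worker2 are placeholder hostnames.
./build/bin/llama-cli -m llama-3.1-70b-q4_0.gguf \
  --rpc worker1:50052,worker2:50052 -ngl 99 \
  -p "Explain distributed inference in one paragraph."
```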