Run llama.cpp on RunPod
RunPod provides a cheap serverless GPU service that makes it easy to serve AI models. They handle queuing and auto-scaling for you.
You just have to provide a Docker image. This repository contains instructions to build your own image for any model.
- Clone this repository.
- Choose a model and download it to the `workspace` directory. Here we use the 7B-parameter Llama 2 model quantized to Q4_K_M:

```bash
wget -P workspace https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
```
- Build the Docker image. Create a `llama-runpod-finetune` repository on Docker Hub and replace `your-docker-hub-login` with your login:

```bash
docker build -t llama-runpod-finetune .
docker tag llama-runpod-finetune your-docker-hub-login/llama-runpod-finetune:latest
docker push your-docker-hub-login/llama-runpod-finetune:latest
```
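If the push is rejected with an authentication error, log in to Docker Hub first:

```bash
docker login
```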
- Go to RunPod's serverless console and create a template. You can pass arguments to `llama_cpp` in the `LLAMA_ARGS` environment variable. Here are mine:

```json
{"model": "llama-2-7b.Q4_K_M.gguf", "n_gpu_layers": -1}
```

`n_gpu_layers` is set to -1 to offload all layers to the GPU.
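Assuming the handler forwards these values to `llama_cpp.Llama`, other constructor arguments can be set the same way, for example the context window size (`n_ctx`; the value below is just an illustration):

```json
{"model": "llama-2-7b.Q4_K_M.gguf", "n_gpu_layers": -1, "n_ctx": 4096}
```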
- Create the endpoint from the template you just created:
- Profit! You can now send requests to the endpoint, as in the example below. Replace `ENDPOINT_ID` and `API_KEY` with your own values; you can create an API key in your RunPod account settings.
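A minimal sketch of a request using RunPod's synchronous `runsync` API (the URL scheme and Bearer authentication are RunPod's standard serverless interface; the fields inside `input`, such as `prompt` and `max_tokens`, are assumptions about what this repository's handler expects):

```bash
curl -X POST "https://api.runpod.ai/v2/ENDPOINT_ID/runsync" \
  -H "Authorization: Bearer API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "Tell me a joke.", "max_tokens": 128}}'
```

For longer generations, use the asynchronous `/run` endpoint instead and poll `/status/{job_id}` for the result.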
TODO