Website | HuggingFace Collection | Whitepaper | GitHub | Discord
This demo lets you run Bonsai language models locally on Mac (Metal) or Linux/Windows (CUDA), using either of two inference backends:
- llama.cpp (GGUF) — C/C++, runs on Mac (Metal), Linux/Windows (CUDA), and CPU.
- MLX (MLX format) — Python, optimized for Apple Silicon.
The required inference kernels are not yet available in upstream llama.cpp or MLX. Pre-built binaries and source code come from our forks:
- llama.cpp: PrismML-Eng/llama.cpp — pre-built binaries
- MLX: PrismML-Eng/mlx (branch `prism`)
Three model sizes are available: 8B, 4B, and 1.7B, each in two formats:
| Model | Format | HuggingFace Repo |
|---|---|---|
| Bonsai-8B | GGUF | prism-ml/Bonsai-8B-gguf |
| Bonsai-8B | MLX | prism-ml/Bonsai-8B-mlx-1bit |
| Bonsai-4B | GGUF | prism-ml/Bonsai-4B-gguf |
| Bonsai-4B | MLX | prism-ml/Bonsai-4B-mlx-1bit |
| Bonsai-1.7B | GGUF | prism-ml/Bonsai-1.7B-gguf |
| Bonsai-1.7B | MLX | prism-ml/Bonsai-1.7B-mlx-1bit |
Set `BONSAI_MODEL` to choose which size to download and run (default: `8B`).
**macOS/Linux:**

```bash
git clone https://github.com/PrismML-Eng/Bonsai-demo.git
cd Bonsai-demo

# (Optional) Choose a model size: 8B (default), 4B, or 1.7B
export BONSAI_MODEL=8B

# One command does everything: installs deps, downloads models + binaries
./setup.sh
```

**Windows (PowerShell):**

```powershell
git clone https://github.com/PrismML-Eng/Bonsai-demo.git
cd Bonsai-demo

# (Optional) Choose a model size: 8B (default), 4B, or 1.7B
$env:BONSAI_MODEL = "8B"

# Run setup
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\setup.ps1
```

You can download a different size and switch between them instantly — no full re-setup needed:
```bash
BONSAI_MODEL=4B ./scripts/download_models.sh
BONSAI_MODEL=4B ./scripts/run_llama.sh -p "Who are you? Introduce yourself in haiku"
```

The setup script handles everything for you, even on a fresh machine:
- Checks/installs system deps — Xcode CLT on macOS, `build-essential` on Linux
- Installs `uv` — fast Python package manager (user-local, not global)
- Creates a Python venv and runs `uv sync` — installs cmake, ninja, and huggingface-cli from `pyproject.toml`
- Downloads models from HuggingFace (needs `PRISM_HF_TOKEN` while the repos are private)
- Downloads pre-built binaries from the GitHub Release (or builds from source if you prefer)
- Builds MLX from source (macOS only) — clones our fork, then runs `uv sync --extra mlx` for the full ML stack
Re-running `setup.sh` is safe — it skips already-completed steps.
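The skip-if-done behavior can be pictured with a marker-file pattern. This is a minimal sketch only — the step names and the `.setup-stamps/` directory are hypothetical, and the real `setup.sh` may track progress differently:

```shell
#!/bin/sh
# Skip-if-done sketch: each completed step drops a marker file,
# so a re-run detects the marker and skips the step.
STAMP_DIR=".setup-stamps"
mkdir -p "$STAMP_DIR"

run_once() {
  step="$1"; shift
  if [ -f "$STAMP_DIR/$step" ]; then
    echo "skip: $step (already done)"
  else
    "$@" && touch "$STAMP_DIR/$step"
  fi
}

run_once deps echo "installing system deps..."   # runs the command
run_once deps echo "installing system deps..."   # skipped this time
```

Because each step is guarded independently, a failed run can be restarted and resumes where it left off.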
All run scripts respect `BONSAI_MODEL` (default `8B`). Set it to run a different size:
```bash
./scripts/run_llama.sh -p "What is the capital of France?"

# Run a different model size
BONSAI_MODEL=4B ./scripts/run_llama.sh -p "Who are you? Introduce yourself in haiku"
```

For MLX inference (macOS only):

```bash
source .venv/bin/activate
./scripts/run_mlx.sh -p "What is the capital of France?"
```

Start llama-server with its built-in chat UI:
```bash
./scripts/start_llama_server.sh   # http://localhost:8080

# Serve a different model size
BONSAI_MODEL=4B ./scripts/start_llama_server.sh
```

The 8B model supports a context of up to 65,536 tokens.
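Besides the chat UI, `llama-server` exposes an OpenAI-compatible HTTP API. A sketch of querying it, assuming the server started by `start_llama_server.sh` is already listening on port 8080:

```shell
# Query llama-server's OpenAI-compatible chat endpoint.
# The payload follows the standard chat-completions schema.
PAYLOAD='{
  "messages": [{"role": "user", "content": "What is the capital of France?"}],
  "max_tokens": 64
}'
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "llama-server is not running on port 8080"
```

Any OpenAI-compatible client library can be pointed at the same base URL.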
By default the scripts pass `-c 0`, which lets llama.cpp's `--fit` automatically size the KV cache to your available memory (no pre-allocation waste). If your build doesn't support `-c 0`, the scripts fall back to a safe value based on system RAM.
Estimates for Bonsai-8B (weights + KV cache + activations):
| Context Size | Est. Memory Usage |
|---|---|
| 8,192 tokens | ~2.5 GB |
| 32,768 tokens | ~5.9 GB |
| 65,536 tokens | ~10.5 GB |
Override the context size explicitly:

```bash
./scripts/run_llama.sh -c 8192 -p "Your prompt"
```
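The RAM-based fallback can be sketched roughly as follows — the thresholds below are illustrative, not the scripts' actual values:

```shell
# Pick a context size from total system memory (illustrative thresholds).
# Works on macOS (sysctl, bytes) and Linux (/proc/meminfo, kB).
total_gb=$( (sysctl -n hw.memsize 2>/dev/null || awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo) \
            | awk '{print int($1 / 1073741824)}' )
if   [ "$total_gb" -ge 16 ]; then ctx=65536
elif [ "$total_gb" -ge 8 ];  then ctx=32768
else                              ctx=8192
fi
echo "Fallback context size: $ctx tokens"
```

The chosen value corresponds to one of the rows in the memory table above, leaving headroom for weights and activations.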
Open WebUI provides a ChatGPT-like browser interface. It auto-starts the backend servers if they're not already running. Ctrl+C stops everything.
```bash
# Install (heavy — separate from base deps)
source .venv/bin/activate
uv pip install open-webui

# One command — starts backends + opens http://localhost:9090
./scripts/start_openwebui.sh
```

If you prefer to build llama.cpp from source instead of using pre-built binaries:
**macOS:**

```bash
./scripts/build_mac.sh
```

Clones PrismML-Eng/llama.cpp, builds with Metal, and outputs to `bin/mac/`.
**Linux (CUDA):**

```bash
./scripts/build_cuda_linux.sh
```

Auto-detects the CUDA version. Pass `--cuda-path /usr/local/cuda-12.8` to use a specific toolkit.
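The auto-detection can be pictured as locating `nvcc` and reading its version — a sketch only; the real script's logic may differ:

```shell
# Locate a CUDA toolkit: honor an explicit CUDA_PATH, else derive it
# from nvcc on PATH (nvcc lives at <toolkit>/bin/nvcc).
CUDA_PATH="${CUDA_PATH:-}"
if [ -z "$CUDA_PATH" ] && command -v nvcc >/dev/null 2>&1; then
  CUDA_PATH=$(dirname "$(dirname "$(command -v nvcc)")")
fi
if [ -n "$CUDA_PATH" ]; then
  # e.g. "Cuda compilation tools, release 12.8, ..." -> "12.8"
  "$CUDA_PATH/bin/nvcc" --version | sed -n 's/.*release \([0-9.][0-9.]*\).*/\1/p'
else
  echo "no CUDA toolkit found; pass --cuda-path explicitly" >&2
fi
```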
**Windows (CUDA):**

```powershell
.\scripts\build_cuda_windows.ps1
```

Auto-detects the CUDA toolkit. Pass `-CudaPath "C:\path\to\cuda"` to use a specific version. Requires Visual Studio Build Tools (or full Visual Studio) and the CUDA toolkit.
All binaries are available from the GitHub Release:
| Platform |
|---|
| macOS Apple Silicon |
| Linux x64 (CUDA 12.4) |
| Linux x64 (CUDA 12.8) |
| Linux x64 (CUDA 13.1) |
| Windows x64 (CUDA 12.4) |
| Windows x64 (CUDA 13.1) |
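A download script can map the running platform onto one of these assets. A minimal sketch with hypothetical asset names — the actual names in the GitHub Release may differ:

```shell
# Map uname output to a release-asset name (names are illustrative).
case "$(uname -s)-$(uname -m)" in
  Darwin-arm64)          asset="macos-arm64" ;;
  Linux-x86_64)          asset="linux-x64-cuda" ;;
  MINGW*|MSYS*|CYGWIN*)  asset="windows-x64-cuda" ;;
  *)                     asset="" ;;
esac
echo "${asset:-unsupported platform}"
```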
After setup, the directory looks like this:

```
Bonsai-demo/
├── README.md
├── setup.sh                     # macOS/Linux setup
├── setup.ps1                    # Windows setup
├── pyproject.toml               # Python dependencies
├── scripts/
│   ├── common.sh                # Shared helpers + BONSAI_MODEL
│   ├── download_models.sh       # HuggingFace download
│   ├── download_binaries.sh     # GitHub release download
│   ├── run_llama.sh             # llama.cpp (auto-detects Mac/Linux)
│   ├── run_mlx.sh               # MLX inference
│   ├── mlx_generate.py          # MLX Python script
│   ├── start_llama_server.sh    # llama.cpp server (port 8080)
│   ├── start_mlx_server.sh      # MLX server (port 8081)
│   ├── start_openwebui.sh       # Open WebUI + auto-starts backends
│   ├── build_mac.sh             # Build llama.cpp for Mac
│   ├── build_cuda_linux.sh      # Build llama.cpp for Linux CUDA
│   └── build_cuda_windows.ps1   # Build llama.cpp for Windows CUDA
├── models/                      # ← downloaded by setup
│   ├── gguf/
│   │   ├── 8B/                  # GGUF 8B model
│   │   ├── 4B/                  # GGUF 4B model
│   │   └── 1.7B/                # GGUF 1.7B model
│   ├── Bonsai-8B-mlx/           # MLX 8B model (macOS)
│   ├── Bonsai-4B-mlx/           # MLX 4B model (macOS)
│   └── Bonsai-1.7B-mlx/         # MLX 1.7B model (macOS)
├── bin/                         # ← downloaded or built by setup
│   ├── mac/                     # macOS binaries
│   └── cuda/                    # CUDA binaries
├── mlx/                         # ← cloned by setup (macOS)
└── .venv/                       # ← created by setup
```
Items marked with ← are created at setup time and excluded from git.