Ollama API Proxy

A simple proxy server for Ollama API requests with authentication, designed to provide OpenAI-compatible endpoints for Ollama models.

📋 Table of Contents

  • ✨ Features
  • 🚀 Quick Start
  • 🎮 Runtime selection
  • 🔌 API Usage
  • ☁️ Cloudflare Tunnel Setup
  • ⚙️ Configuration
  • 📜 NPM Scripts
  • 📂 Project Structure
  • 🔒 Security Notes
  • 🔧 Troubleshooting
  • 📄 License
  • 🤝 Contributing

✨ Features

  • OpenAI API Compatibility: Accept requests in OpenAI format and forward them to Ollama
  • Authentication: API key-based authentication for secure access
  • Docker Support: Complete containerized setup with Docker Compose
  • Cloudflare Tunnel Integration: Built-in support for secure external access
  • Health Check Endpoint: Monitor proxy status
  • GPU & CPU Support: Dual Ollama instances with intelligent routing
  • Model-Based Routing: Configure which models run on CPU vs GPU via JSON config
  • GPU Acceleration: NVIDIA GPU support for Ollama GPU container
  • Flexible Configuration: Environment-based and file-based configuration

🚀 Quick Start

📋 Prerequisites
  • Docker and Docker Compose
  • For NVIDIA GPU: NVIDIA Docker runtime (nvidia-docker2)
  • For CPU-only: no extra requirements
  • Node.js 18+ (for local development)

🐳 Docker Deployment (Recommended)

  1. Clone the repository:

    git clone https://github.com/loonylabs-dev/ollama-proxy.git
    cd ollama-proxy
  2. Set up environment:

    cp .env.example .env
    # Edit .env with your API key and configuration
  3. Configure model routing:

    cp model-routing.example.json model-routing.json
    # Edit model-routing.json to specify which models run on CPU vs GPU
  4. Choose your runtime (see Runtime selection below):

    # NVIDIA GPU
    docker-compose -f docker-compose.nvidia.yml up -d
    
    # CPU-only
    docker-compose -f docker-compose.cpu.yml up -d
  5. Download models (in Ollama containers):

    # For GPU instance
    docker exec -it ollama-proxy-ollama-gpu-1 /bin/bash
    ollama pull llama3
    ollama pull codellama
    
    # For CPU instance (optional - for smaller models)
    docker exec -it ollama-proxy-ollama-cpu-1 /bin/bash
    ollama pull llama3.2:1b
    ollama pull phi3:mini         # Lightweight model

💻 Local Development
  1. Install dependencies:

    npm install
  2. Set up environment:

    cp .env.example .env
    # Edit .env with your configuration (change OLLAMA_URL to http://localhost:11434)
  3. Start Ollama locally (port 11434)

  4. Start the proxy:

    npm run dev          # Development mode
    # or
    npm run build && npm start  # Production mode

🎮 Runtime selection

This project supports two runtime configurations, each with its own Docker Compose file: NVIDIA GPU and CPU-only.

NVIDIA GPUs (RTX, Tesla, Quadro)

Supported GPUs:

  • Consumer: RTX 3060, 3090, 4060, 4090, 5090, etc.
  • Datacenter: Tesla T4, V100, A100, H100
  • Professional: Quadro RTX series

Requirements:

  • NVIDIA Docker runtime (nvidia-docker2)
  • NVIDIA drivers 545.x+ recommended
  • Docker Compose

Setup:

# Using docker-compose directly
docker-compose -f docker-compose.nvidia.yml up -d

# Or create a symlink (Linux/macOS)
ln -s docker-compose.nvidia.yml docker-compose.yml
docker-compose up -d

# Or copy to docker-compose.yml (Windows)
copy docker-compose.nvidia.yml docker-compose.yml
docker-compose up -d

Features:

  • Full CUDA acceleration
  • Up to 36GB VRAM allocation
  • Production-tested with RTX 5090, 4090, 3090
  • Includes GPU watchdog for automatic recovery

CPU-only (no GPU)

  • Runs a single Ollama instance on CPU
  • Best for small/medium models or environments without a GPU

Setup:

# Using docker-compose directly
docker-compose -f docker-compose.cpu.yml up -d

# Or create a symlink (Linux)
ln -s docker-compose.cpu.yml docker-compose.yml
docker-compose up -d

# Or copy to docker-compose.yml (Windows)
copy docker-compose.cpu.yml docker-compose.yml
docker-compose up -d

Choosing the Right Configuration

| Factor | NVIDIA | CPU-only |
| --- | --- | --- |
| Best For | Large models (>13B), maximum performance | Small/medium models, simplicity |
| Memory | Dedicated GPU VRAM | System RAM/CPU |
| Power | Higher | Lower |
| Requirements | NVIDIA drivers + nvidia-docker2 | None |

🔌 API Usage

The proxy supports both native Ollama API and OpenAI-compatible endpoints. Choose the API that best fits your use case:

🎯 CPU/GPU Routing

The proxy runs two separate Ollama instances with automatic model-based routing:

  • GPU Instance (default): High-performance inference with NVIDIA GPU for large models
  • CPU Instance: Optimized for smaller models (≤3B parameters)

Configuration via model-routing.json:

{
  "cpu": [
    "gemma3:4b",
    "llama3.2:1b",
    "phi3:mini",
    "qwen2.5:0.5b"
  ],
  "gpu": ["*"]
}
  • cpu: Array of model names to route to CPU instance
  • gpu: Array of model patterns (use ["*"] for "all others")

How it works:

  1. When a request comes in, the proxy checks the requested model name
  2. If the model is in the cpu list, it routes to the CPU instance
  3. Otherwise, it routes to the GPU instance (default)
  4. The routing is transparent - no headers or special configuration needed (see the sketch below)
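
For illustration, here is a minimal TypeScript sketch of this routing decision, assuming the routing file has the shape shown above. The helper names (loadRouting, resolveOllamaUrl) and the fallback URLs are illustrative, not the proxy's actual internals:

import { readFileSync } from "fs";

// Shape of model-routing.json as documented above
interface ModelRouting {
  cpu: string[];
  gpu: string[];
}

// Hypothetical helper: load the routing config once at startup
function loadRouting(path = "model-routing.json"): ModelRouting {
  return JSON.parse(readFileSync(path, "utf-8")) as ModelRouting;
}

// Hypothetical helper: pick the upstream Ollama URL for a requested model.
// An exact match against the CPU list wins; everything else goes to the
// GPU instance, which is the default (["*"] in the gpu array).
function resolveOllamaUrl(model: string, routing: ModelRouting): string {
  const cpuUrl = process.env.OLLAMA_CPU_URL ?? "http://ollama-cpu:11434";
  const gpuUrl = process.env.OLLAMA_GPU_URL ?? "http://ollama-gpu:11434";
  return routing.cpu.includes(model) ? cpuUrl : gpuUrl;
}

// With the default config: "gemma3:4b" -> CPU instance, "llama3" -> GPU instance
const routing = loadRouting();
console.log(resolveOllamaUrl("gemma3:4b", routing));
console.log(resolveOllamaUrl("llama3", routing));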

Example requests:

# This will automatically route to GPU (llama3 not in CPU list)
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'

# This will automatically route to CPU (gemma3:4b in CPU list)
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3:4b", "messages": [{"role": "user", "content": "Hello"}]}'

When to use CPU routing:

  • Small models (≤3B parameters) where CPU performance is sufficient
  • Lightweight models for testing or development
  • Cost optimization for simple queries
  • Running multiple concurrent requests on smaller models

Native Ollama API

Use these endpoints for direct Ollama compatibility or tools that support Ollama natively. Model-based routing is fully supported - requests are automatically routed to CPU or GPU based on your model-routing.json configuration.

Available Endpoints:

  • POST /api/chat - Chat with streaming support (with model-based routing)
  • POST /api/generate - Text generation (with model-based routing)
  • GET /api/tags - List installed models

Example - Chat (GPU routing):

curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "model": "llama3",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

Example - Chat (CPU routing):

# This routes to CPU if gemma3:4b is in your model-routing.json CPU list
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "model": "gemma3:4b",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

Example - List Models:

curl -X GET http://localhost:3000/api/tags \
  -H "Authorization: Bearer your_api_key_here"

OpenAI-Compatible API

Use these endpoints for OpenAI tools (LobeChat, ChatGPT-Web, OpenAI Python library, etc.):

Available Endpoints:

  • POST /v1/chat/completions - Chat completions (OpenAI format)
  • GET /v1/models - List models (OpenAI format)

Example - Chat:

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "model": "llama3",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

Tool Configuration: Most OpenAI tools require the base URL to end with /v1:

✅ Correct: http://your-domain.com/v1
❌ Wrong: http://your-domain.com

Popular Tools:

  • LobeChat: Base URL: http://your-domain.com/v1
  • OpenAI Python: base_url="http://your-domain.com/v1"
  • ChatGPT-Web: API Endpoint: http://your-domain.com/v1
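
As a concrete client example, here is a minimal TypeScript sketch using the official openai Node.js package against the proxy. The host name and key are placeholders; only the /v1 base URL convention and the Bearer key come from this setup:

import OpenAI from "openai";

// Point the standard OpenAI client at the proxy; the base URL must end with /v1
const client = new OpenAI({
  baseURL: "http://your-domain.com/v1",
  apiKey: "your_api_key_here", // sent as "Authorization: Bearer ..."
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "llama3",
    messages: [{ role: "user", content: "Hello, how are you?" }],
  });
  console.log(completion.choices[0].message.content);
}

main().catch(console.error);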

General Endpoints

  • GET /health - Health check (no authentication required)
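
As a quick sanity check, the health endpoint can be called without any Authorization header. A minimal TypeScript example (the port assumes the local-development default of 3000):

// Node.js 18+ provides a global fetch
fetch("http://localhost:3000/health")
  .then(async (res) => console.log(res.status, await res.text()))
  .catch(console.error);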

☁️ Cloudflare Tunnel Setup

For secure external access:

  1. Set up Cloudflare Tunnel:

    # Install cloudflared and create a tunnel
    cloudflared tunnel create ollama-proxy
  2. Configure tunnel:

    cp cloudflare/config.example.yml cloudflare/config.yml
    # Edit config.yml with your tunnel ID and domain
    # IMPORTANT: hostname must NOT include protocol or port (use e.g. "api.example.com")
  3. Add tunnel credentials: Place your tunnel credentials JSON file in cloudflare/

  4. The tunnel will automatically start with Docker Compose

⚙️ Configuration

🔧 Environment Variables
| Variable | Default | Description |
| --- | --- | --- |
| API_KEY | Required | Authentication key for API access |
| OLLAMA_GPU_URL | http://ollama-gpu:11434 (Docker), http://localhost:11434 (local) | Ollama GPU server URL (default for unlisted models) |
| OLLAMA_CPU_URL | http://ollama-cpu:11434 (Docker), http://localhost:11435 (local) | Ollama CPU server URL (for models in the model-routing.json CPU list) |
| PORT | 3000 | Proxy server port (local dev only) |

📋 Model Routing Configuration

The model-routing.json file controls which Ollama instance handles each model.

Default configuration (created from model-routing.example.json):

{
  "cpu": [
    "gemma3:4b",
    "llama3.2:1b",
    "phi3:mini",
    "qwen2.5:0.5b"
  ],
  "gpu": ["*"]
}

CPU-only routing example (route all models to CPU):

{
  "cpu": ["*"],
  "gpu": []
}

Configuration rules:

  • Models listed in cpu array are routed to the CPU instance
  • All other models are routed to GPU (indicated by ["*"] in gpu array)
  • Model names must match exactly (including version tags like :4b)
  • Changes require proxy container restart to take effect

Best practices:

  • List small models (≤3B parameters) in the CPU array
  • Keep GPU for large models and high-performance tasks
  • Test both instances after configuration changes
  • Use exact model names as they appear in ollama list

🐳 Docker Configuration

The Docker setup includes:

  • Ollama GPU container: Runs Ollama with NVIDIA GPU support
  • Ollama CPU container: Runs Ollama on CPU for smaller models
  • Proxy container: Runs the API proxy with intelligent routing
  • Watchdog GPU container: Monitors GPU instance health
  • Cloudflared container: Provides tunnel access (optional)

Two configurations available:

  • docker-compose.nvidia.yml - NVIDIA GPU setup
  • docker-compose.cpu.yml - CPU-only setup

🎮 GPU Configuration

NVIDIA GPU Setup (docker-compose.nvidia.yml):

  • 36GB memory limit
  • 16GB memory reservation
  • CUDA runtime with nvidia-docker2
  • Unlimited locked memory

CPU-only Setup (docker-compose.cpu.yml):

  • 16GB memory limit
  • 8GB memory reservation
  • Optimized CPU settings (threads, batch size)

See Runtime selection for details.

📜 NPM Scripts
| Script | Description |
| --- | --- |
| npm run start:ollama | Start Docker Compose setup |
| npm run stop:ollama | Stop Docker Compose setup |
| npm run logs:ollama | View Docker Compose logs |
| npm run restart:ollama | Restart Docker Compose setup |
| npm run dev | Start development server |
| npm run build | Build for production |
| npm start | Start production server |

📂 Project Structure

ollama-proxy/
├── src/                          # Source code
│   ├── index.ts                 # Main proxy server
│   ├── types/                   # TypeScript type definitions
│   └── utils/                   # Utility functions (transformers)
├── watchdog-gpu/                # GPU monitoring container
│   ├── Dockerfile               # Watchdog container image
│   └── watchdog.sh              # Monitoring script
├── ollama/                      # Ollama configuration
│   └── ollama.json              # Model settings
├── cloudflare/                  # Cloudflare tunnel config
│   ├── config.yml               # Tunnel config (ignored)
│   ├── config.example.yml       # Tunnel template
│   └── *.json                   # Credentials (ignored)
├── logs/                        # Log files
│   └── watchdog/                # Watchdog logs
├── docs/                        # Documentation
│   ├── README.md                # Documentation index
│   ├── DOCKER_COMPOSE.md        # Docker Compose guide
│   └── OLLAMA_SETTINGS.md       # Ollama configuration
├── model-routing.json           # Model routing config (ignored, user-specific)
├── model-routing.example.json   # Model routing template
├── model-routing.example.cpu-only.json   # CPU-only routing template (wildcard to CPU)
├── docker-compose.yml           # Runtime selection guide (documentation)
├── docker-compose.nvidia.yml    # NVIDIA GPU configuration
├── docker-compose.cpu.yml       # CPU-only configuration
├── docker-compose.example.nvidia.yml  # NVIDIA example configuration
├── docker-compose.example.cpu.yml     # CPU-only example configuration
├── Dockerfile                   # Proxy container image
├── .env                         # Environment variables (ignored)
├── .env.example                 # Environment template
├── package.json                 # Node.js dependencies
└── tsconfig.json                # TypeScript configuration

🔒 Security Notes

  • Keep your API key secure and never commit it to version control
  • Cloudflare tunnel credentials are sensitive and excluded from git
  • model-routing.json is gitignored - your routing config stays private
  • The proxy only accepts requests with valid API keys
  • Internal communication uses Docker networks for security

🔧 Troubleshooting

API Connection Issues

OpenAI Tools (404 errors):

  • Problem: "Cannot POST /chat/completions" or 404 errors
  • Solution: Use /v1 in base URL: http://your-domain.com/v1
  • Why: OpenAI tools append /chat/completions automatically

Native Ollama Tools:

  • Problem: Connection refused or 404 on /api/* endpoints
  • Solution: Use base URL without /v1: http://your-domain.com
  • Endpoints: /api/chat, /api/tags, /api/generate

Model Issues

Model Not Found:

  • Symptoms: Chat fails but models list works
  • Check available models:
    • OpenAI format: GET /v1/models
    • Ollama format: GET /api/tags
  • Solution: Download missing models in Ollama container

Download Models:

# GPU instance
docker exec -it ollama-proxy-ollama-gpu-1 /bin/bash
ollama list                    # List installed models
ollama pull llama3            # Download new models
ollama pull qwen2.5:7b        # Download specific version

# CPU instance (smaller models recommended)
docker exec -it ollama-proxy-ollama-cpu-1 /bin/bash
ollama pull llama3.2:1b       # Small efficient model
ollama pull phi3:mini         # Lightweight model

Authentication Issues

401 Unauthorized:

  • Check API_KEY in the .env file
  • Ensure the Authorization: Bearer your_api_key header is correct (see the example below)
  • The health endpoint (/health) doesn't require authentication
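
To verify the key outside of any particular tool, here is a minimal TypeScript request with the expected header (host and port are placeholders for your deployment):

const apiKey = process.env.API_KEY ?? "your_api_key_here";

// 200 means the key is accepted; 401 means the header or key is wrong
fetch("http://localhost:3000/v1/models", {
  headers: { Authorization: `Bearer ${apiKey}` },
})
  .then((res) => console.log(res.status))
  .catch(console.error);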

Performance Issues

Request Too Large:

  • The proxy accepts request bodies up to 10MB
  • Consider reducing conversation history for long chats (see the sketch below)
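
One simple client-side mitigation, sketched in TypeScript; the system-prompt handling and the 20-message window are illustrative choices, not proxy settings:

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep any system prompts plus only the most recent messages so that long
// conversations stay well under the 10MB request body limit
function trimHistory(messages: ChatMessage[], keepLast = 20): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keepLast)];
}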

GPU Not Used:

  • Verify the GPU configuration in docker-compose.nvidia.yml
  • Check that the NVIDIA Docker runtime is installed: sudo apt install nvidia-docker2
  • Test GPU access in the container: docker exec ollama-proxy-ollama-gpu-1 nvidia-smi
  • Check the Ollama logs: npm run logs:ollama or docker logs ollama-proxy-ollama-gpu-1 -f
  • Look for "insufficient VRAM" or "offloaded 0/X layers" - either message indicates a fallback to CPU

GPU Initialization Issues

Problem: GPU detected by system but not by Ollama container

  • Symptoms:
    • ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
    • cuda driver library failed to get device context 800/801
    • Failed to initialize NVML: Unknown Error
    • nvidia-smi works on host but fails in container

Solution: Configure nvidia-container-runtime cgroups support

# Enable cgroups in nvidia-container-runtime
sudo sed -i 's/#no-cgroups = false/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml

# Restart Docker to apply changes
sudo systemctl restart docker

# Recreate containers
docker-compose down && docker-compose up -d

GPU Configuration Best Practices: Use both mechanisms for maximum compatibility

# ✅ Recommended: Use BOTH for reliable GPU access
ollama-gpu:
  runtime: nvidia  # Direct GPU access (required!)
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

# ❌ Common mistake: Using only deploy without runtime
# This configuration may fail to initialize GPU in container
deploy:
  resources:
    reservations:
      devices: [...]
# Missing: runtime: nvidia

Why both mechanisms?

  • runtime: nvidia ensures GPU driver access in container (nvidia-smi works)
  • deploy.resources provides resource limits and allocation
  • Using only deploy without runtime can lead to NVML initialization failures

NVIDIA Driver Compatibility:

  • Known issues: Driver series 555.x had Ollama compatibility problems
  • Recommended: Use stable drivers (545.x, 552.x series)
  • RTX 5090: Driver 575.64+ generally works but may show performance warnings

GPU Watchdog for Production Stability

Problem: Ollama occasionally falls back to CPU after model switching due to VRAM not being released properly. This is a known issue affecting production deployments.

Solution: The setup includes an automatic GPU watchdog container that monitors and restarts Ollama when GPU issues occur:

# Watchdog starts automatically with the full stack
docker-compose up -d

# View GPU watchdog logs
docker logs ollama-proxy-watchdog-gpu-1 -f

# Or view persistent logs
tail -f logs/watchdog/ollama-gpu-watchdog.log

What the watchdog monitors:

  • insufficient VRAM to load any model layers
  • offloaded 0/X layers to GPU (indicates CPU fallback)
  • gpu VRAM usage didn't recover within timeout
  • runner.vram="0 B" (GPU not allocated)
  • context limit hit - shifting (warning only - known Ollama Issue #2805)
  • Hung requests: Ollama runner processes running longer than timeout (5 minutes default)

Features:

  • Fully automated - no manual intervention required
  • Container-based - runs as part of your Docker stack
  • Silent monitoring - only logs when problems detected
  • Intelligent escalation - tries a quick restart first and escalates to full recreation if the same error persists
  • Hung request detection - automatically restarts Ollama if a runner process is stuck (addresses the Ollama #2805 infinite loop bug)
  • Log deduplication - prevents spam from repeated pattern detection (MD5-based, 100 entry cache)
  • JSON structured logs for easy monitoring
  • Health checks and restart policies
  • Runs as root - required for Docker socket access
  • Configurable via environment variables:
    • CHECK_INTERVAL=5 (seconds between checks)
    • RESTART_COOLDOWN=60 (minimum seconds between restarts)
    • HUNG_REQUEST_TIMEOUT=300 (seconds before considering request hung)
    • LOG_LEVEL=INFO (DEBUG, INFO, WARNING, ERROR)

Logging Behavior:

  • INFO mode (default): Silent during normal operation, logs only when problems detected
  • DEBUG mode: Verbose logging including all monitored log lines (for troubleshooting only)
  • Logs are written to both stdout (Docker logs) and /var/log/watchdog/ollama-watchdog.log

Architecture: The watchdog runs as a separate container with access to the Docker socket, allowing it to monitor and restart the Ollama container when GPU fallback is detected. It runs as root to access the Docker daemon. This ensures your setup remains production-ready without manual intervention.

Escalation Strategy: When problems are detected, the watchdog uses an intelligent escalation approach:

  1. First attempt: Quick container restart (docker restart)
  2. Second attempt: Full container recreation via docker-compose (docker compose up -d --force-recreate) if the same error persists
  3. Success tracking: Resets escalation counter when GPU access is successfully verified

This two-tier approach handles both transient issues (quick restart) and stubborn GPU context errors (full recreation).

Docker Issues

Port Conflicts:

  • Setup uses internal Docker networking (no host ports exposed)
  • Access via Cloudflare tunnel or modify docker-compose.yml

Container Won't Start:

  • Check the Docker Compose logs: docker-compose logs
  • Verify that the .env file exists and contains API_KEY
  • Ensure the NVIDIA runtime is available when using the GPU configuration

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

For questions or support, please open an issue on GitHub.
