
Migrate to OpenAI-compatible vLLM API format#168

Merged
FranardoHuang merged 6 commits into main from openai_format
Jan 30, 2026
Conversation

@FranardoHuang
Member

Summary

  • Refactor backend to use external vLLM servers via OpenAI-compatible API instead of direct model loading
  • Add automated vLLM server startup/shutdown scripts with proper GPU assignments
  • Update configuration to support remote vLLM server deployment
  • Remove hardcoded API keys and paths

Changes

Architecture

  • Backend now connects to separate vLLM servers (chat, embedding, whisper) via HTTP
  • Enables running inference on GPU machines separate from the API server
  • Uses AsyncOpenAI client for chat completions with streaming support

New Scripts

  • scripts/start_vllm_servers.sh - Starts all 3 vLLM servers in tmux with:
    • Sequential startup with readiness monitoring
    • GPU memory utilization reporting
    • Proper CUDA_VISIBLE_DEVICES assignments
  • scripts/stop_vllm_servers.sh - Cleanly shuts down all servers
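The "readiness monitoring" step in the startup script amounts to polling each server until it answers. A hedged sketch of that check in Python (vLLM's OpenAI-compatible server exposes a `/health` endpoint; the timeout values here are examples, not what the script uses):

```python
# Sketch: block until a vLLM server's /health endpoint responds, or give up.
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Return True once GET {base_url}/health returns 200, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval_s)
    return False
```

Starting the servers sequentially with a check like this avoids three models competing for GPU memory during initialization.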

Configuration

  • New .env variables: VLLM_CHAT_URL, VLLM_EMBEDDING_URL, VLLM_WHISPER_URL, VLLM_API_KEY
  • Updated .env.example with all vLLM configuration options
  • Added docs/vllm-setup.md with complete deployment guide
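A settings object reading these variables might look like the sketch below. The field names and localhost defaults are assumptions for illustration; only the environment variable names come from the PR.

```python
# Sketch: load the new .env variables with localhost fallbacks.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VLLMSettings:
    chat_url: str = os.getenv("VLLM_CHAT_URL", "http://localhost:8000/v1")
    embedding_url: str = os.getenv("VLLM_EMBEDDING_URL", "http://localhost:8001/v1")
    whisper_url: str = os.getenv("VLLM_WHISPER_URL", "http://localhost:8002/v1")
    api_key: str = os.getenv("VLLM_API_KEY", "EMPTY")
```

Pointing the three URLs at a remote host is the only change needed to move inference onto a separate GPU machine.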

Test plan

  • vLLM servers start successfully with ./scripts/start_vllm_servers.sh
  • GPU memory allocations match expected values
  • Backend connects to vLLM servers on startup
  • Chat completions work with streaming
  • Embedding generation works for RAG
  • Whisper transcription works for audio

🤖 Generated with Claude Code

Benzhang2004 and others added 6 commits November 24, 2025 08:42
- Remove hardcoded API key from config.py (default to 'EMPTY')
- Fix hardcoded URL in chat_service.py audio_generator (use settings)
- Fix hardcoded paths in chat_service.py and rag_retriever.py (use dynamic paths)
- Add vLLM server configuration to .env.example with localhost defaults
- Rename 4090modelservice.md to docs/vllm-setup.md with improved docs
- Add API key generation instructions and remote server setup guide
- Update README with distributed architecture diagram and vLLM config section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts by keeping openai_format's configuration-based approach:
- model.py: Keep vLLM client with settings config
- chat_service.py: Keep clean whitespace
- rag_retriever.py: Keep lazy loading embedding client pattern

Incorporates main's changes:
- Add F1_racing course support
- Use dynamic paths for file locations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add start_vllm_servers.sh: tmux-based script to start all 3 vLLM servers
  sequentially with proper GPU assignments and startup monitoring
- Add stop_vllm_servers.sh: script to cleanly shut down all vLLM servers
- Update vllm-setup.md with GPU assignments (CUDA_VISIBLE_DEVICES) and
  automated script documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@FranardoHuang FranardoHuang merged commit 8f57bc0 into main Jan 30, 2026
1 check failed
@FranardoHuang FranardoHuang deleted the openai_format branch January 30, 2026 01:36
FranardoHuang added a commit that referenced this pull request Mar 12, 2026
* Openai API format

* Update

* Edit

* fix: remove hardcoded API keys and paths, add vLLM server documentation

- Remove hardcoded API key from config.py (default to 'EMPTY')
- Fix hardcoded URL in chat_service.py audio_generator (use settings)
- Fix hardcoded paths in chat_service.py and rag_retriever.py (use dynamic paths)
- Add vLLM server configuration to .env.example with localhost defaults
- Rename 4090modelservice.md to docs/vllm-setup.md with improved docs
- Add API key generation instructions and remote server setup guide
- Update README with distributed architecture diagram and vLLM config section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add vLLM server startup/shutdown scripts with GPU assignments

- Add start_vllm_servers.sh: tmux-based script to start all 3 vLLM servers
  sequentially with proper GPU assignments and startup monitoring
- Add stop_vllm_servers.sh: script to cleanly shut down all vLLM servers
- Update vllm-setup.md with GPU assignments (CUDA_VISIBLE_DEVICES) and
  automated script documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Benzhang2004 <zhangjialin04@sina.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>


2 participants