
Migrate to OpenAI-compatible vLLM API format#168

Merged
FranardoHuang merged 6 commits into main from openai_format
Jan 30, 2026
Conversation

@FranardoHuang
Member

Summary

  • Refactor backend to use external vLLM servers via OpenAI-compatible API instead of direct model loading
  • Add automated vLLM server startup/shutdown scripts with proper GPU assignments
  • Update configuration to support remote vLLM server deployment
  • Remove hardcoded API keys and paths

Changes

Architecture

  • Backend now connects to separate vLLM servers (chat, embedding, whisper) via HTTP
  • Enables running inference on GPU machines separate from the API server
  • Uses AsyncOpenAI client for chat completions with streaming support

New Scripts

  • scripts/start_vllm_servers.sh - Starts all 3 vLLM servers in tmux with:
    • Sequential startup with readiness monitoring
    • GPU memory utilization reporting
    • Proper CUDA_VISIBLE_DEVICES assignments
  • scripts/stop_vllm_servers.sh - Cleanly shuts down all servers
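The "readiness monitoring" step in the startup script amounts to polling each server until it answers. A hedged sketch of that check in Python (vLLM's OpenAI-compatible server exposes a `/health` endpoint; the timeout values here are examples, not what the script uses):

```python
# Sketch: block until a vLLM server's /health endpoint responds, or give up.
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Return True once GET {base_url}/health returns 200, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval_s)
    return False
```

Starting the servers sequentially with a check like this avoids three models competing for GPU memory during initialization.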

Configuration

  • New .env variables: VLLM_CHAT_URL, VLLM_EMBEDDING_URL, VLLM_WHISPER_URL, VLLM_API_KEY
  • Updated .env.example with all vLLM configuration options
  • Added docs/vllm-setup.md with complete deployment guide
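A settings object reading these variables might look like the sketch below. The field names and localhost defaults are assumptions for illustration; only the environment variable names come from the PR.

```python
# Sketch: load the new .env variables with localhost fallbacks.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VLLMSettings:
    chat_url: str = os.getenv("VLLM_CHAT_URL", "http://localhost:8000/v1")
    embedding_url: str = os.getenv("VLLM_EMBEDDING_URL", "http://localhost:8001/v1")
    whisper_url: str = os.getenv("VLLM_WHISPER_URL", "http://localhost:8002/v1")
    api_key: str = os.getenv("VLLM_API_KEY", "EMPTY")
```

Pointing the three URLs at a remote host is the only change needed to move inference onto a separate GPU machine.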

Test plan

  • vLLM servers start successfully with ./scripts/start_vllm_servers.sh
  • GPU memory allocations match expected values
  • Backend connects to vLLM servers on startup
  • Chat completions work with streaming
  • Embedding generation works for RAG
  • Whisper transcription works for audio

🤖 Generated with Claude Code

Benzhang2004 and others added 6 commits November 24, 2025 08:42
- Remove hardcoded API key from config.py (default to 'EMPTY')
- Fix hardcoded URL in chat_service.py audio_generator (use settings)
- Fix hardcoded paths in chat_service.py and rag_retriever.py (use dynamic paths)
- Add vLLM server configuration to .env.example with localhost defaults
- Rename 4090modelservice.md to docs/vllm-setup.md with improved docs
- Add API key generation instructions and remote server setup guide
- Update README with distributed architecture diagram and vLLM config section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts by keeping openai_format's configuration-based approach:
- model.py: Keep vLLM client with settings config
- chat_service.py: Keep clean whitespace
- rag_retriever.py: Keep lazy loading embedding client pattern

Incorporates main's changes:
- Add F1_racing course support
- Use dynamic paths for file locations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add start_vllm_servers.sh: tmux-based script to start all 3 vLLM servers
  sequentially with proper GPU assignments and startup monitoring
- Add stop_vllm_servers.sh: script to cleanly shut down all vLLM servers
- Update vllm-setup.md with GPU assignments (CUDA_VISIBLE_DEVICES) and
  automated script documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@FranardoHuang FranardoHuang merged commit 8f57bc0 into main Jan 30, 2026
1 check failed
@FranardoHuang FranardoHuang deleted the openai_format branch January 30, 2026 01:36
FranardoHuang added a commit that referenced this pull request Mar 12, 2026
* Openai API format

* Update

* Edit

* fix: remove hardcoded API keys and paths, add vLLM server documentation

- Remove hardcoded API key from config.py (default to 'EMPTY')
- Fix hardcoded URL in chat_service.py audio_generator (use settings)
- Fix hardcoded paths in chat_service.py and rag_retriever.py (use dynamic paths)
- Add vLLM server configuration to .env.example with localhost defaults
- Rename 4090modelservice.md to docs/vllm-setup.md with improved docs
- Add API key generation instructions and remote server setup guide
- Update README with distributed architecture diagram and vLLM config section

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add vLLM server startup/shutdown scripts with GPU assignments

- Add start_vllm_servers.sh: tmux-based script to start all 3 vLLM servers
  sequentially with proper GPU assignments and startup monitoring
- Add stop_vllm_servers.sh: script to cleanly shut down all vLLM servers
- Update vllm-setup.md with GPU assignments (CUDA_VISIBLE_DEVICES) and
  automated script documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Benzhang2004 <zhangjialin04@sina.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>


2 participants