This directory (vllm/) is set up so you can use it as its own repository and run a service with systemctl, using start_vllm.sh as the entry point.
Self-contained: the TurboQuant code needed at runtime lives under third_party/turboquant/ (vendored copy). You do not need a sibling checkout at ../turboquant unless you want to override the path with TURBOQUANT_ROOT.
- start_vllm.sh: entry point (CUDA + TurboQuant env + starts the server)
- run_api_server_turboquant.py: wrapper that enables TurboQuant before vLLM initializes
- config.yaml: local vLLM config (model/tokenizer/etc.); create from config.example.yaml (often gitignored when publishing)
- third_party/turboquant/: vendored TurboQuant used by bootstrap.sh and start_vllm.sh (PYTHONPATH + editable install)
- requirements.txt: editable TurboQuant install with the vllm extra from third_party/
- systemd/: systemd unit + installer
- Linux with systemd
- CUDA installed (for GPU). Default: CUDA_HOME=/usr/local/cuda
- Python 3
- Python dependencies (vLLM + turboquant, etc.). If you use the local venv: vllm/venv/
- Install Python dependencies in the interpreter that will run the service (e.g. local venv or conda):
cd vllm
python3 -m venv venv
./venv/bin/python -m pip install -U pip setuptools wheel
./venv/bin/pip install -r requirements.txt
(requirements.txt installs TurboQuant in editable mode from third_party/turboquant[vllm].)
- Alternative to the previous step, if you only need to ensure TurboQuant is available for your chosen Python:
VLLM_PYTHON=/root/anaconda3/bin/python3 ./bootstrap.sh
This script:
  - upgrades pip
  - exits successfully if import turboquant already works
  - otherwise installs turboquant in editable mode from third_party/turboquant (preferred), or from ../turboquant if present, or from TURBOQUANT_ROOT
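
The behaviour described for bootstrap.sh can be sketched roughly as follows. This is an illustration, not the script's actual contents: the function names resolve_turboquant_dir and ensure_turboquant are hypothetical, and the lookup order simply follows the order listed above (the real script may give TURBOQUANT_ROOT a different precedence).

```shell
# Sketch only: the real bootstrap.sh may differ in details and precedence.

# Hypothetical helper: print the first candidate source directory that exists.
# $1 is the path to the vllm/ directory.
resolve_turboquant_dir() {
    local root="$1"
    local candidate
    for candidate in \
        "$root/third_party/turboquant" \
        "$root/../turboquant" \
        "${TURBOQUANT_ROOT:-}"; do
        if [ -n "$candidate" ] && [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: upgrade pip, then install TurboQuant only if the
# import does not already work for the given interpreter.
ensure_turboquant() {
    local python="$1" root="$2" src
    "$python" -m pip install -U pip || return 1
    if "$python" -c 'import turboquant' >/dev/null 2>&1; then
        return 0
    fi
    src="$(resolve_turboquant_dir "$root")" || {
        echo "no TurboQuant source directory found" >&2
        return 1
    }
    "$python" -m pip install -e "$src"
}
```

Under these assumptions, calling ensure_turboquant with the interpreter path and the path to vllm/ is a no-op apart from the pip upgrade whenever the import already succeeds.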
Create your config.yaml from the template and set paths:
cp config.example.yaml config.yaml
Then edit config.yaml and set at least:
model: /data/models/...
tokenizer: /data/models/...
From the vllm/ directory:
./start_vllm.sh
To pick GPU / port:
CUDA_VISIBLE_DEVICES=0 VLLM_PORT=8000 ./start_vllm.sh
- Install and enable the service:
sudo ./systemd/install_vllm_turboquant_service.sh
- (Optional) Tune service environment variables:
sudo nano /etc/default/vllm-turboquant
Main variables:
  - VLLM_ROOT: absolute path to this repo (the installer can set this)
  - VLLM_PYTHON: Python binary (e.g. /root/anaconda3/bin/python3 or .../venv/bin/python)
  - VLLM_HOST, VLLM_PORT
  - CUDA_VISIBLE_DEVICES, CUDA_HOME
  - TQ_KEY_BITS, TQ_VALUE_BITS, TQ_BUFFER_SIZE, TQ_INITIAL_LAYERS_COUNT
- Restart after changes:
sudo systemctl restart vllm-turboquant
- Status:
systemctl status vllm-turboquant
- Logs:
journalctl -u vllm-turboquant -f
- Stop / start:
sudo systemctl stop vllm-turboquant
sudo systemctl start vllm-turboquant

- Python / packages not found: set VLLM_PYTHON=/path/to/venv/bin/python in /etc/default/vllm-turboquant
- import turboquant fails: run ./bootstrap.sh or pip install -r requirements.txt using the same interpreter as VLLM_PYTHON
- CUDA errors: fix CUDA_HOME and verify nvcc / libraries on the host
- Port in use: change VLLM_PORT
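
The troubleshooting checks above could be bundled into a small preflight script along these lines. This is a sketch and not part of the repository: all function names are hypothetical, and the port check relies on bash's /dev/tcp redirection, which is assumed to be available in the shell you run it with.

```shell
# Sketch of a preflight check; every function name here is hypothetical.

# Succeeds if the given interpreter path exists and is executable.
python_ok() { [ -x "$1" ]; }

# Succeeds if the given interpreter can import turboquant.
turboquant_ok() { "$1" -c 'import turboquant' >/dev/null 2>&1; }

# Succeeds if nvcc is present and executable under the given CUDA_HOME.
cuda_ok() { [ -x "$1/bin/nvcc" ]; }

# Succeeds if something accepts TCP connections on 127.0.0.1:<port>.
# Uses bash's /dev/tcp; in shells without it this always reports "free".
port_in_use() { (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null; }

# Run all checks, print one hint per problem, return 1 if any check failed.
preflight() {
    local python="$1" cuda_home="$2" port="$3" status=0
    if ! python_ok "$python"; then
        echo "Python not executable: $python (set VLLM_PYTHON)"
        status=1
    elif ! turboquant_ok "$python"; then
        echo "import turboquant fails: run ./bootstrap.sh with this interpreter"
        status=1
    fi
    if ! cuda_ok "$cuda_home"; then
        echo "nvcc not found under $cuda_home (check CUDA_HOME)"
        status=1
    fi
    if port_in_use "$port"; then
        echo "Port $port is already in use (change VLLM_PORT)"
        status=1
    fi
    return "$status"
}
```

For example, preflight "$VLLM_PYTHON" "$CUDA_HOME" "$VLLM_PORT" with the values from /etc/default/vllm-turboquant. Run it while the service is stopped, otherwise the port check will report the service's own listener.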