🦀 A lightweight HTTP server in Rust for running LLM models at the edge, without expensive compute or heavy dependencies.
- Minimal Dependencies: Built with Candle ML framework - no PyTorch or heavy ML frameworks required
- HTTP API: Simple REST API with Axum for easy integration
- CUDA Support: Optional GPU acceleration (just uncomment in `Cargo.toml`)
- Temperature & Seed Control: Fine-tune generation behavior
- Auto Model Download: Automatically fetches models from Hugging Face Hub
- CORS Enabled: Ready for web application integration
- Rust 1.70+ (install from rustup.rs)
- (Optional) CUDA 11.8+ for GPU acceleration
```bash
git clone https://github.com/yourusername/RustLLM_serve.git
cd RustLLM_serve
cargo build --release
cargo run --release
```

The server will start on `http://0.0.0.0:3000` and automatically download the TinyLlama model on first run.
Check if the server is running and which model is loaded.
Endpoint: GET /health
Example:
```bash
curl http://localhost:3000/health
```

Response:

```json
{
  "status": "ok",
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
```

Generate text completion from a prompt.
Endpoint: POST /inference
Request Body:
```json
{
  "prompt": "Hello, my name is",
  "max_tokens": 50,
  "temperature": 0.8,
  "seed": 42
}
```

Parameters:
- `prompt` (required): Input text to complete
- `max_tokens` (optional): Maximum tokens to generate (default: 50, max: 1024)
- `temperature` (optional): Sampling temperature, 0.01-100.0 (default: 1.0)
  - Lower values → more deterministic
  - Higher values → more creative/random
- `seed` (optional): Random seed for reproducibility (default: 42)
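To make the `temperature` and `seed` parameters concrete, here is a small illustrative sketch of how temperature-scaled sampling with a seeded RNG typically works. The function names and the xorshift PRNG are assumptions for illustration only, not this project's actual Candle-based sampler:

```rust
/// Softmax over logits, scaled by temperature: lower temperature sharpens
/// the distribution toward the top logit; higher temperature flattens it.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Tiny deterministic PRNG (xorshift64), so the same seed reproduces
/// the same sampling decisions. Returns a value in [0, 1).
fn next_rand(state: &mut u64) -> f32 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 40) as f32 / (1u64 << 24) as f32
}

/// Sample a token index from the temperature-scaled distribution.
fn sample(logits: &[f32], temperature: f32, seed: &mut u64) -> usize {
    let probs = softmax_with_temperature(logits, temperature);
    let r = next_rand(seed);
    let mut cum = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cum += p;
        if r < cum {
            return i;
        }
    }
    probs.len() - 1
}

fn main() {
    let logits = [2.0_f32, 1.0, 0.5, 0.1];
    // Low temperature concentrates probability on the top logit...
    let cold = softmax_with_temperature(&logits, 0.1);
    // ...high temperature pushes the distribution toward uniform.
    let hot = softmax_with_temperature(&logits, 100.0);
    assert!(cold[0] > hot[0]);
    // Same seed → identical sampled sequence, hence reproducible output.
    let (mut s1, mut s2) = (42_u64, 42_u64);
    let a: Vec<usize> = (0..5).map(|_| sample(&logits, 0.8, &mut s1)).collect();
    let b: Vec<usize> = (0..5).map(|_| sample(&logits, 0.8, &mut s2)).collect();
    assert_eq!(a, b);
    println!("cold[0]={:.3} hot[0]={:.3}", cold[0], hot[0]);
}
```

This is why requests with the same `prompt`, `temperature`, and `seed` return the same completion.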
Example:
```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Response:

```json
{
  "generated_text": "The future of AI is bright and full of possibilities..."
}
```

Edit `Cargo.toml` and uncomment the CUDA features:
```toml
# Comment these lines:
# candle-core = "0.9.2"
# candle-nn = "0.9.2"

# Uncomment these lines:
candle-core = { version = "0.9.2", features = ["cuda"] }
candle-nn = { version = "0.9.2", features = ["cuda"] }
```

In `src/main.rs`, modify the model repository:

```rust
let repo_id = api.model("TinyLlama/TinyLlama-1.1B-Chat-v1.0".to_string());
// Change to any compatible Llama-architecture model from Hugging Face
```

In `src/main.rs`:

```rust
let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await?;
// Change port as needed
```

```
RustLLM_serve/
├── src/
│   ├── main.rs          # Server initialization and model loading
│   ├── config.rs        # Model configuration structures
│   ├── api/
│   │   ├── handlers.rs  # HTTP request handlers
│   │   ├── models.rs    # Request/Response models
│   │   └── server.rs    # Router configuration
│   └── llm/
│       ├── models.rs    # LLM model trait and implementations
│       ├── inference.rs # Text generation logic
│       ├── decoder.rs   # Transformer decoder
│       ├── attention.rs # Self-attention mechanism
│       ├── embedding.rs # Token embeddings
│       └── ...          # Other model components
├── Cargo.toml
└── README.md
```
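As a rough illustration of what the `attention.rs` module computes, here is a minimal single-head scaled dot-product attention, `softmax(QKᵀ/√d)·V`, on plain nested `Vec`s. This is a sketch for intuition only; the actual implementation works on Candle tensors and includes multi-head projections, masking, and KV caching:

```rust
/// Minimal single-head scaled dot-product attention on nested Vecs.
fn attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d = q[0].len() as f32;
    q.iter()
        .map(|qi| {
            // Score this query against every key: qi · kj / sqrt(d).
            let scores: Vec<f32> = k
                .iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() / d.sqrt())
                .collect();
            // Softmax over the scores to get attention weights.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            let weights: Vec<f32> = exps.iter().map(|e| e / sum).collect();
            // Output row = weighted sum of the value vectors.
            (0..v[0].len())
                .map(|c| weights.iter().zip(v).map(|(w, vj)| w * vj[c]).sum())
                .collect()
        })
        .collect()
}

fn main() {
    let q = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let k = q.clone();
    let v = vec![vec![10.0, 0.0], vec![0.0, 10.0]];
    let out = attention(&q, &k, &v);
    // Each output row is a convex combination of the value rows,
    // so its components sum to 10.
    assert_eq!(out.len(), 2);
    assert!((out[0][0] + out[0][1] - 10.0).abs() < 1e-4);
    println!("{:?}", out);
}
```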
Test the health endpoint:
```bash
curl http://localhost:3000/health
```

Test inference:

```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_tokens": 50}'
```

The server uses `tracing` for structured logging. Logs include:
- Model loading progress
- Inference requests and responses
- HTTP request traces
Set log level via environment variable:
```bash
RUST_LOG=debug cargo run
```

- Use Release Mode: Always run with `--release` for production
- Enable CUDA: GPU acceleration can provide a 10-100x speedup
- Adjust max_tokens: Lower values = faster responses
- Batch Requests: The server handles concurrent requests efficiently
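As a sketch of how concurrent requests can share a single loaded model, here is the general pattern with `std::thread` and `Arc<Mutex<_>>`. This is illustrative only: the real server uses Axum's async tasks on the tokio runtime, and the "model" below is a stand-in counter, not an actual LLM:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Stand-in for a loaded model: a counter of tokens "generated".
    let model = Arc::new(Mutex::new(0u32));

    // Four concurrent "requests", each cloning a handle to the shared model.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let model = Arc::clone(&model);
            thread::spawn(move || {
                // Each request locks the model only for its forward pass,
                // so requests queue on the lock rather than failing.
                let mut m = model.lock().unwrap();
                *m += 10;
                format!("request {} generated 10 tokens", i)
            })
        })
        .collect();

    for h in handles {
        println!("{}", h.join().unwrap());
    }
    // All four requests completed against the one shared model.
    assert_eq!(*model.lock().unwrap(), 40);
}
```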
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Built with Candle - Hugging Face's Rust ML framework
- Uses Axum for HTTP server
- Models from Hugging Face Hub