Feature Request: Native llama.cpp / GGUF Backend Support #151

@quevedoSteven

Description

Problem Statement

Add native support for llama.cpp-based inference servers (GGUF models) as a backend option, alongside existing OpenAI-compatible providers.

This would enable efficient local deployment of Decepticon using quantized models on consumer GPUs (e.g., a T4 or RTX 3060).

Proposed Solution

Add a llama.cpp backend option, either:

  • Native llama.cpp integration
    Directly interface with llama.cpp (via subprocess or Python bindings like llama-cpp-python).
    Allow configuration of:
      - n_gpu_layers
      - n_ctx
      - n_batch
      - sampling parameters (temperature, top_p, etc.)
    A rough sketch of this approach is included after this list.

  • Enhanced OpenAI-compatible support
    Officially support llama.cpp’s OpenAI-compatible server mode.
    Add config presets for llama.cpp quirks:
      - streaming differences
      - stop token handling
      - system prompt formatting (important for models like Qwen)
    A client-side sketch of this mode follows as well.
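
To make the first option concrete, here is a minimal sketch of what a native backend wrapper could look like, assuming the llama-cpp-python bindings; the model path, prompt, and parameter values are purely illustrative and not tied to any existing Decepticon config:

```python
# Minimal sketch of a native llama.cpp backend via llama-cpp-python.
# Assumes: pip install llama-cpp-python (built with GPU offload support)
# and a local GGUF file; the path and values below are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf",  # hypothetical local model
    n_gpu_layers=-1,   # offload all layers; reduce on smaller cards like a 3060
    n_ctx=8192,        # context window
    n_batch=512,       # prompt-processing batch size
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello from a quantized model."},
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```

Recent llama-cpp-python versions can apply the chat template embedded in the GGUF metadata, which should also cover the system prompt formatting point above for Qwen-style models.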

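For the second option, here is a rough sketch of talking to llama.cpp's built-in OpenAI-compatible server from the standard openai client; the launch flags, port, model name, and Qwen stop token are assumptions for illustration:

```python
# Rough sketch: using llama.cpp's OpenAI-compatible server (llama-server).
# Assumes a server started roughly like:
#   llama-server -m models/qwen2.5-7b-instruct-q4_k_m.gguf -c 8192 -ngl 99 --port 8080
# (model path and flags are illustrative).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server exposes /v1/chat/completions
    api_key="not-needed-locally",         # no real key required by default
)

stream = client.chat.completions.create(
    model="local-gguf",  # llama-server serves whichever model it was started with
    messages=[{"role": "user", "content": "Hello from a quantized model."}],
    temperature=0.7,
    stream=True,
    stop=["<|im_end|>"],  # example: ChatML stop token used by Qwen-style models
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

A config preset per model family could capture exactly these knobs (base URL, stop tokens, streaming behaviour) so users don't have to rediscover them.
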
Alternatives Considered

  • Ollama backend: works, but adds overhead and reduces control.
  • OpenAI-compatible proxy: partially works, but lacks optimization and standardization.

Area

Docker / Infrastructure

Additional Context

No response

Metadata

    Labels

    enhancement (New feature or request)
