Feature Request: Native llama.cpp / GGUF Backend Support #151

@quevedoSteven

Description

Problem Statement

Add native support for llama.cpp-based inference servers (GGUF models) as a backend option, alongside existing OpenAI-compatible providers.

This would enable efficient local deployment of Decepticon using quantized models on consumer GPUs (e.g., a T4 or RTX 3060).

Proposed Solution

Add a llama.cpp backend option, either:

  • Native llama.cpp integration
    Directly interface with llama.cpp (via subprocess or Python bindings like llama-cpp-python).
    Allow configuration of:
      - n_gpu_layers
      - n_ctx
      - n_batch
      - sampling parameters (temperature, top_p, etc.)
    A rough sketch of this approach is included after this list.

  • Enhanced OpenAI-compatible support
    Officially support llama.cpp’s OpenAI-compatible server mode.
    Add config presets for llama.cpp quirks:
      - streaming differences
      - stop token handling
      - system prompt formatting (important for models like Qwen)
    A client-side sketch of this mode follows as well.
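
To make the first option concrete, here is a minimal sketch of what a native backend wrapper could look like, assuming the llama-cpp-python bindings; the model path, prompt, and parameter values are purely illustrative and not tied to any existing Decepticon config:

```python
# Minimal sketch of a native llama.cpp backend via llama-cpp-python.
# Assumes: pip install llama-cpp-python (built with GPU offload support)
# and a local GGUF file; the path and values below are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf",  # hypothetical local model
    n_gpu_layers=-1,   # offload all layers; reduce on smaller cards like a 3060
    n_ctx=8192,        # context window
    n_batch=512,       # prompt-processing batch size
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello from a quantized model."},
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```

Recent llama-cpp-python versions can apply the chat template embedded in the GGUF metadata, which should also cover the system prompt formatting point above for Qwen-style models.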

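For the second option, here is a rough sketch of talking to llama.cpp's built-in OpenAI-compatible server from the standard openai client; the launch flags, port, model name, and Qwen stop token are assumptions for illustration:

```python
# Rough sketch: using llama.cpp's OpenAI-compatible server (llama-server).
# Assumes a server started roughly like:
#   llama-server -m models/qwen2.5-7b-instruct-q4_k_m.gguf -c 8192 -ngl 99 --port 8080
# (model path and flags are illustrative).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server exposes /v1/chat/completions
    api_key="not-needed-locally",         # no real key required by default
)

stream = client.chat.completions.create(
    model="local-gguf",  # llama-server serves whichever model it was started with
    messages=[{"role": "user", "content": "Hello from a quantized model."}],
    temperature=0.7,
    stream=True,
    stop=["<|im_end|>"],  # example: ChatML stop token used by Qwen-style models
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

A config preset per model family could capture exactly these knobs (base URL, stop tokens, streaming behaviour) so users don't have to rediscover them.
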
Alternatives Considered

  • Ollama backend: works, but adds overhead and reduces control.
  • OpenAI-compatible proxy: partially works, but lacks optimization and standardization.

Area

Docker / Infrastructure

Additional Context

No response

Metadata

    Labels

    enhancement (New feature or request)
