
OpenAI::chat_stream_with_tools bypasses OpenAICompatibleProvider, hardcodes /responses endpoint — breaks OpenAI-compatible servers (llama.cpp, LM Studio) #121

@gtrak

Description


Summary

In v1.3.8, the OpenAI backend's chat_with_tools and chat_stream_with_tools methods were rewritten to always use the Responses API (/responses) instead of delegating to OpenAICompatibleProvider, which uses the configurable CHAT_ENDPOINT (defaulting to chat/completions). This breaks any OpenAI-compatible server that only implements /v1/chat/completions (llama.cpp, LM Studio, etc.).

Regression

v1.3.7 (src/backends/openai.rs:520-521) correctly delegated to the compatible provider:

async fn chat_stream_with_tools(...) {
    // Delegate to the inner OpenAICompatibleProvider which has the full implementation
    self.provider.chat_stream_with_tools(messages, tools).await
}

v1.3.8 (src/backends/openai.rs:392-412) now builds a Responses API payload instead:

async fn chat_stream_with_tools(...) {
    let params = ResponsesRequestParams {
        config: &self.provider.config,
        messages,
        tools,
        stream: true,
    };
    let body = build_responses_request(params)?;
    let response = self
        .send_responses_request(&body, "OpenAI responses stream")
        .await?;
    let response = self
        .ensure_success_response(response, "OpenAI responses API")
        .await?;
    Ok(create_responses_stream_chunks(response))
}

Root Cause

The responses_url() method at src/backends/openai.rs:578-584 hardcodes the endpoint:

fn responses_url(&self) -> Result<reqwest::Url, LLMError> {
    self.provider
        .config
        .base_url
        .join("responses")  // <-- HARDCODED, bypasses CHAT_ENDPOINT
        .map_err(|e| LLMError::HttpError(e.to_string()))
}

This completely bypasses the OpenAIProviderConfig trait's CHAT_ENDPOINT constant (which defaults to "chat/completions" at src/providers/openai_compatible.rs:95).
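One way to remove the hardcoding would be to expose the path as an associated constant alongside CHAT_ENDPOINT, so compatible providers can override it. A minimal sketch; the trait name ProviderEndpoints, the RESPONSES_ENDPOINT constant, and the impls below are hypothetical, not the crate's actual types:

```rust
// Hypothetical sketch: mirror the existing CHAT_ENDPOINT pattern with a
// second overridable constant instead of a hardcoded .join("responses").
trait ProviderEndpoints {
    const CHAT_ENDPOINT: &'static str = "chat/completions";
    const RESPONSES_ENDPOINT: &'static str = "responses";
}

// openai.com keeps the default /responses path.
struct OpenAI;
impl ProviderEndpoints for OpenAI {}

// An OpenAI-compatible server overrides it (or the backend could check the
// override and delegate to the chat/completions implementation instead).
struct LlamaCpp;
impl ProviderEndpoints for LlamaCpp {
    const RESPONSES_ENDPOINT: &'static str = "chat/completions";
}

fn main() {
    assert_eq!(OpenAI::CHAT_ENDPOINT, "chat/completions");
    assert_eq!(OpenAI::RESPONSES_ENDPOINT, "responses");
    assert_eq!(LlamaCpp::RESPONSES_ENDPOINT, "chat/completions");
    println!("ok");
}
```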

Impact

When a local llama.cpp server (or any OpenAI-compatible server) is used with tool calling:

  1. The OpenAI backend sends a Responses API payload to /v1/responses
  2. llama.cpp only implements /v1/chat/completions, not /v1/responses
  3. Even if the server responds, the Responses stream parser (src/backends/openai/responses/stream/events.rs:103) expects OpenAI's Responses API SSE format (with item.id, item.call_id, etc.), not the Chat Completions format
  4. Result: ResponseFormatError: Missing id in responses event

Observed error:

Response format error: Missing id in responses event. Raw response: {"arguments":"","call_id":"fc_e4qppGeHFazV6EpUOz2tmXT92N7Yqmwu","name":"Read","status":"in_progress","type":"function_call"}

The raw response is actually valid Responses API-style data from llama.cpp, but it's being sent to the /responses endpoint which the server doesn't fully support.
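The mismatch can be shown with a toy check. This is not the crate's parser (which deserializes the SSE event properly); it only mirrors the top-level `id` requirement that produces the error above:

```rust
// Toy illustration: the Responses stream parser requires a top-level `id`
// on each event. The llama.cpp payload carries `call_id` but no `id`, so
// parsing fails with "Missing id in responses event".
fn has_top_level_id(raw_event: &str) -> bool {
    // Crude substring check standing in for real JSON deserialization.
    raw_event.contains("\"id\":")
}

fn main() {
    let llama_event = r#"{"arguments":"","call_id":"fc_e4qppGeHFazV6EpUOz2tmXT92N7Yqmwu","name":"Read","status":"in_progress","type":"function_call"}"#;
    // `call_id` does not satisfy the `id` requirement.
    assert!(!has_top_level_id(llama_event));
    println!("ok");
}
```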

Suggested Fix

Any of the following would work:

  1. Add a config option to choose between Responses API and Chat Completions (e.g., use_responses_api: bool on the OpenAI struct, defaulting to true for openai.com but false for custom base URLs)
  2. Auto-detect: If a custom base_url is provided (not api.openai.com), fall back to OpenAICompatibleProvider's chat/completions implementation
  3. Re-delegate for custom URLs: When base_url differs from the default, delegate to OpenAICompatibleProvider::chat_stream_with_tools which uses CHAT_ENDPOINT
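For option 2, a minimal sketch of the detection. The helper name and the string-based host extraction are illustrative only; the real code would inspect the already-parsed reqwest::Url:

```rust
// Hypothetical helper for option 2 (auto-detect): only api.openai.com is
// known to implement /v1/responses, so any custom base_url falls back to
// the OpenAICompatibleProvider chat/completions path.
fn should_use_responses_api(base_url: &str) -> bool {
    base_url
        .split("//")
        .nth(1)                                  // strip the scheme
        .and_then(|rest| rest.split('/').next()) // keep only host[:port]
        .map_or(false, |host| host == "api.openai.com")
}

fn main() {
    assert!(should_use_responses_api("https://api.openai.com/v1/"));
    assert!(!should_use_responses_api("http://localhost:8080/v1/"));
    println!("ok");
}
```

A downside of pure auto-detection is that proxies fronting the real Responses API would also be demoted, which is why option 1's explicit `use_responses_api` flag may be the safer default.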

Workaround

Pin to llm = "=1.3.7" in Cargo.toml.


Note: Evidence gathered by an LLM agent (opencode/GLM5.1) during investigation.
