feat(config): per-operation LLM provider/model endpoints (extraction vs consolidation vs reflect) #1646

@kevin-ho

Use case

As a Hindsight operator running an active agent, I want to point extraction (retain) at a free local model while keeping consolidation and reflect on a stronger cloud model. Extraction is high-volume structured JSON work — it doesn't need frontier quality. Consolidation and reflect are lower-volume but quality-sensitive. Right now I have to pick one endpoint for everything, which means either overpaying for extraction or under-serving consolidation.

Problem statement

Hindsight uses a single LLM endpoint for all operations. Extraction (retain) accounts for the vast majority of token burn during active use, but it's structured output work that fast local models handle well. Consolidation and reflect benefit from stronger reasoning. There's no way to route different operations to different endpoints, so you're forced into an all-or-nothing cost/quality tradeoff.

How would this feature help

This would let operators cut token costs significantly (60-80% for active agents) by running extraction on a free local model (llama.cpp, Ollama, vLLM) while keeping consolidation and reflect on a paid cloud model for quality. It also enables tiered setups: a fast, cheap model for extraction, a stronger model for consolidation, and the best model for reflect.

Proposed solution

Add per-operation LLM provider/model env var overrides following the existing per-scope pattern (HINDSIGHT_API_RETAIN_LLM_EXTRA_BODY in #1607):

# Global (existing, unchanged — fallback for anything not overridden)
HINDSIGHT_API_LLM_PROVIDER=openai
HINDSIGHT_API_LLM_MODEL=some-model

# Override extraction to a local/cheaper model
HINDSIGHT_API_RETAIN_LLM_PROVIDER=openai
HINDSIGHT_API_RETAIN_LLM_MODEL=some-efficient-model
HINDSIGHT_API_RETAIN_LLM_BASE_URL=http://localhost:8080/v1

# Override consolidation to a stronger model
HINDSIGHT_API_CONSOLIDATION_LLM_PROVIDER=anthropic
HINDSIGHT_API_CONSOLIDATION_LLM_MODEL=some-reasoning-model

For each of {RETAIN,REFLECT,CONSOLIDATION}:

  • HINDSIGHT_API_<SCOPE>_LLM_PROVIDER
  • HINDSIGHT_API_<SCOPE>_LLM_MODEL
  • HINDSIGHT_API_<SCOPE>_LLM_BASE_URL
  • HINDSIGHT_API_<SCOPE>_LLM_API_KEY
  • HINDSIGHT_API_<SCOPE>_LLM_TIMEOUT
  • HINDSIGHT_API_<SCOPE>_LLM_EXTRA_BODY
  • HINDSIGHT_API_<SCOPE>_LLM_LITELLMROUTER_CONFIG

When unset, each scope falls back to the global HINDSIGHT_API_LLM_*. Embedding generation always uses the configured embedding model regardless.
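
To make the fallback rule concrete, here is a minimal Python sketch of how a scoped setting could be resolved. It is not Hindsight's actual implementation: the function name `resolve_llm_setting`, the `SCOPES` tuple, and the choice to treat an empty string as unset are all assumptions made for illustration. Only the env var naming scheme comes from the proposal above.

```python
import os
from typing import Mapping, Optional

# Scopes named in this proposal.
SCOPES = ("RETAIN", "REFLECT", "CONSOLIDATION")


def resolve_llm_setting(
    scope: str,
    field: str,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return HINDSIGHT_API_<SCOPE>_LLM_<FIELD>, else HINDSIGHT_API_LLM_<FIELD>.

    Hypothetical helper: empty values are treated as unset (an assumption).
    """
    env = os.environ if env is None else env
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")

    scoped = env.get(f"HINDSIGHT_API_{scope}_LLM_{field}")
    if scoped:
        return scoped
    # Fall back to the global setting; None if that is missing too.
    return env.get(f"HINDSIGHT_API_LLM_{field}") or None


example_env = {
    "HINDSIGHT_API_LLM_PROVIDER": "openai",
    "HINDSIGHT_API_LLM_MODEL": "some-model",
    "HINDSIGHT_API_RETAIN_LLM_MODEL": "some-efficient-model",
    "HINDSIGHT_API_RETAIN_LLM_BASE_URL": "http://localhost:8080/v1",
    "HINDSIGHT_API_CONSOLIDATION_LLM_PROVIDER": "anthropic",
}

retain_model = resolve_llm_setting("RETAIN", "MODEL", example_env)
reflect_model = resolve_llm_setting("REFLECT", "MODEL", example_env)
```

One thing this sketch surfaces: with field-by-field fallback, a scope that overrides only PROVIDER would still inherit the global BASE_URL and API_KEY, so a scope pointing at a different provider probably wants those set together.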

Alternatives considered

  • LiteLLM Router as a single entrypoint with routing rules: works, but adds operational complexity and another moving part, and every call still incurs the cost of going through the proxy. Per-scope env vars are simpler and keep local traffic truly local.
  • Two separate Hindsight instances sharing the same database: functional, but doubles operational overhead (two containers, two health checks, confusing worker coordination).
  • Setting HINDSIGHT_API_RETAIN_EXTRACTION_MODE=chunks to skip extraction entirely: eliminates LLM cost for retain but also eliminates structured fact extraction, which is the whole point.

Priority

Medium

Additional context

The plumbing already exists — MemoryEngine.__init__ constructs separate LLMConfig instances per operation. Per-scope *_LLM_TIMEOUT, *_LLM_EXTRA_BODY (#1607), and *_LLM_LITELLMROUTER_CONFIG already follow this exact pattern. This would extend it to provider/model/base_url/api_key.
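
As a rough picture of what extending the pattern could look like, the sketch below builds one settings object per operation from the environment. It is illustrative only: `ScopeLLMSettings` and `build_scope_settings` are hypothetical names, and the dataclass is a stand-in, not the real `LLMConfig` (whose constructor signature is not shown in this issue).

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class ScopeLLMSettings:
    """Hypothetical stand-in for per-operation LLM settings."""

    provider: Optional[str]
    model: Optional[str]
    base_url: Optional[str]
    api_key: Optional[str]

    @classmethod
    def from_env(
        cls, scope: str, env: Optional[Mapping[str, str]] = None
    ) -> "ScopeLLMSettings":
        env = os.environ if env is None else env

        def pick(field: str) -> Optional[str]:
            # Scoped value first, then the global HINDSIGHT_API_LLM_* value.
            return (
                env.get(f"HINDSIGHT_API_{scope}_LLM_{field}")
                or env.get(f"HINDSIGHT_API_LLM_{field}")
                or None
            )

        return cls(
            provider=pick("PROVIDER"),
            model=pick("MODEL"),
            base_url=pick("BASE_URL"),
            api_key=pick("API_KEY"),
        )


def build_scope_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, ScopeLLMSettings]:
    """Build one settings object per operation (retain, reflect, consolidation)."""
    scopes = ("RETAIN", "REFLECT", "CONSOLIDATION")
    return {s.lower(): ScopeLLMSettings.from_env(s, env) for s in scopes}


settings = build_scope_settings(
    {
        "HINDSIGHT_API_LLM_PROVIDER": "openai",
        "HINDSIGHT_API_LLM_MODEL": "some-model",
        "HINDSIGHT_API_RETAIN_LLM_MODEL": "some-efficient-model",
        "HINDSIGHT_API_RETAIN_LLM_BASE_URL": "http://localhost:8080/v1",
    }
)
```

An engine constructor could then hand each entry to whatever builds the per-operation client, leaving the global values as the default for any scope with no overrides.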
