Use case
As a Hindsight operator running an active agent, I want to point extraction (retain) at a free local model while keeping consolidation and reflect on a stronger cloud model. Extraction is high-volume structured JSON work — it doesn't need frontier quality. Consolidation and reflect are lower-volume but quality-sensitive. Right now I have to pick one endpoint for everything, which means either overpaying for extraction or under-serving consolidation.
Problem statement
Hindsight uses a single LLM endpoint for all operations. Extraction (retain) accounts for the vast majority of token burn during active use, but it's structured output work that fast local models handle well. Consolidation and reflect benefit from stronger reasoning. There's no way to route different operations to different endpoints, so you're forced into an all-or-nothing cost/quality tradeoff.
How would this feature help
This would let operators cut token costs significantly (an estimated 60-80% for active agents) by running extraction on a free local model (llama.cpp, Ollama, vLLM) while keeping consolidation and reflect on a paid cloud model for quality. It also enables tiered setups: a fast, cheap model for extraction, a stronger model for consolidation, and the best model for reflect.
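As a rough way to reason about the savings figure, the sketch below computes the fraction of total LLM spend removed when retain moves to a cheaper endpoint. The function name and the input numbers are hypothetical and purely illustrative; they are not measurements from a real deployment.

```python
def savings_fraction(retain_cost: float, other_cost: float,
                     local_retain_cost: float = 0.0) -> float:
    """Fraction of total LLM spend saved by moving retain to a cheaper endpoint.

    retain_cost:       current spend on extraction (retain)
    other_cost:        current spend on consolidation and reflect
    local_retain_cost: spend on retain after the move (0.0 for a free local model)
    """
    total = retain_cost + other_cost
    if total <= 0:
        raise ValueError("total cost must be positive")
    return (retain_cost - local_retain_cost) / total


# Illustrative inputs only: if retain made up 70 of every 100 units of spend,
# moving it to a free local model would remove that share of the bill.
print(f"{savings_fraction(70.0, 30.0):.0%}")
```

Under this simple model, the saving equals retain's share of total spend, so the quoted range corresponds to extraction accounting for roughly that share of the bill.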
Proposed solution
Add per-operation LLM provider/model env var overrides following the existing per-scope pattern (HINDSIGHT_API_RETAIN_LLM_EXTRA_BODY in #1607):
# Global (existing, unchanged — fallback for anything not overridden)
HINDSIGHT_API_LLM_PROVIDER=openai
HINDSIGHT_API_LLM_MODEL=some-model
# Override extraction to a local/cheaper model
HINDSIGHT_API_RETAIN_LLM_PROVIDER=openai
HINDSIGHT_API_RETAIN_LLM_MODEL=some-efficient-model
HINDSIGHT_API_RETAIN_LLM_BASE_URL=http://localhost:8080/v1
# Override consolidation to a stronger model
HINDSIGHT_API_CONSOLIDATION_LLM_PROVIDER=anthropic
HINDSIGHT_API_CONSOLIDATION_LLM_MODEL=some-reasoning-model
For each of {RETAIN,REFLECT,CONSOLIDATION}, the full per-scope set would be (TIMEOUT, EXTRA_BODY, and LITELLMROUTER_CONFIG already exist):
HINDSIGHT_API_<SCOPE>_LLM_PROVIDER
HINDSIGHT_API_<SCOPE>_LLM_MODEL
HINDSIGHT_API_<SCOPE>_LLM_BASE_URL
HINDSIGHT_API_<SCOPE>_LLM_API_KEY
HINDSIGHT_API_<SCOPE>_LLM_TIMEOUT
HINDSIGHT_API_<SCOPE>_LLM_EXTRA_BODY
HINDSIGHT_API_<SCOPE>_LLM_LITELLMROUTER_CONFIG
When unset, each scope falls back to the global HINDSIGHT_API_LLM_*. Embedding generation always uses the configured embedding model regardless.
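The fallback rule above could be sketched roughly as follows. This is a minimal illustration, not Hindsight's actual code: `resolve_llm_settings`, `SCOPES`, and `FIELDS` are hypothetical names, and it assumes fallback happens field by field, with an empty value treated the same as unset.

```python
import os
from typing import Dict, Mapping, Optional

# Scopes and fields taken from the proposal; the helper itself is hypothetical.
SCOPES = ("RETAIN", "REFLECT", "CONSOLIDATION")
FIELDS = ("PROVIDER", "MODEL", "BASE_URL", "API_KEY",
          "TIMEOUT", "EXTRA_BODY", "LITELLMROUTER_CONFIG")


def resolve_llm_settings(
    scope: str, env: Optional[Mapping[str, str]] = None
) -> Dict[str, Optional[str]]:
    """Resolve LLM settings for one scope, falling back to the global vars."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    env = os.environ if env is None else env

    resolved: Dict[str, Optional[str]] = {}
    for field in FIELDS:
        scoped = env.get(f"HINDSIGHT_API_{scope}_LLM_{field}")
        fallback = env.get(f"HINDSIGHT_API_LLM_{field}")
        # Assumption: an empty string counts as "unset" and triggers fallback.
        resolved[field.lower()] = scoped if scoped else fallback
    return resolved


example_env = {
    "HINDSIGHT_API_LLM_PROVIDER": "openai",
    "HINDSIGHT_API_LLM_MODEL": "some-model",
    "HINDSIGHT_API_RETAIN_LLM_MODEL": "some-efficient-model",
    "HINDSIGHT_API_RETAIN_LLM_BASE_URL": "http://localhost:8080/v1",
}
retain = resolve_llm_settings("RETAIN", example_env)
reflect = resolve_llm_settings("REFLECT", example_env)
print(retain["model"], retain["base_url"])
print(reflect["model"], reflect["base_url"])
```

One design question this raises: with field-by-field fallback, a scope that overrides only the provider or base URL would still inherit the global API key, so an implementation may want to resolve provider, base URL, and key as a group instead.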
Alternatives considered
- LiteLLM Router as a single entrypoint with routing rules: works, but it adds operational complexity and another moving part, and every call still pays the cost of going through the proxy. Per-scope env vars are simpler and keep local traffic truly local.
- Two separate Hindsight instances sharing the same database — functional but doubles operational overhead (two containers, two health checks, confusing worker coordination).
- Setting HINDSIGHT_API_RETAIN_EXTRACTION_MODE=chunks to skip extraction entirely: eliminates LLM cost for retain but also eliminates structured fact extraction, which is the whole point.
Priority
Medium
Additional context
The plumbing already exists — MemoryEngine.__init__ constructs separate LLMConfig instances per operation. Per-scope *_LLM_TIMEOUT, *_LLM_EXTRA_BODY (#1607), and *_LLM_LITELLMROUTER_CONFIG already follow this exact pattern. This would extend it to provider/model/base_url/api_key.