Feature Request
Please add DeepSeek V4 Flash (deepseek-v4-flash) to the retain() model leaderboard.
Why it matters
DeepSeek V4 Flash is one of the most popular low-cost API models for production agent workloads. It's a natural fit for Hindsight's retain/consolidation pipeline:
- Pricing: $0.14/M input, $0.28/M output — cheaper than most models on the leaderboard
- Provider: DeepSeek API direct (
api.deepseek.com/v1) — OpenAI-compatible endpoint
- Real-world usage: We're running it in production with Hindsight v0.6.1 for consolidation and fact extraction. Results so far:
- Clean JSON schema conformance with zero retries (after disabling thinking mode via
{"thinking": {"type": "disabled"}})
- ~6-8 seconds per consolidation batch (8 facts per batch)
- Healthy merge ratios: creates/updates/skips all within expected ranges
- ~220 calls/day at ~$0.04/day total cost
Missing from the current leaderboard
The current leaderboard covers Groq (gpt-oss), OpenAI (GPT variants), Google (Gemini), and Ollama Cloud (Gemma). DeepSeek is absent entirely despite being:
- One of the top 3 API providers by volume
- The cheapest option per-token for structured output tasks
- Fully OpenAI-compatible (works with Hindsight's existing
openai provider)
Benchmark concern
The "Not Viable" list shows 10 sub-3B models that failed schema conformance. That's useful data. But there's a gap between those tiny models and the 20B+ models that dominate the leaderboard. DeepSeek V4 Flash would fill that gap — it's a mid-sized model (likely 20-30B range based on pricing tier) that's affordable enough for daily production use.
Configuration for testing
HINDSIGHT_API_LLM_PROVIDER=openai
HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash
HINDSIGHT_API_LLM_BASE_URL=https://api.deepseek.com/v1
HINDSIGHT_API_LLM_API_KEY=<deepseek-api-key>
Note: Thinking mode should be disabled for best results with structured extraction tasks. Flash's base comprehension handles fact comparison and JSON schema conformation without chain-of-thought, and thinking mode actually hurts by burning output tokens on internal monologue before the JSON response.
Would love to see how it benchmarks against the current leaderboard entries. Happy to help test if needed.
Thanks!
Feature Request
Please add DeepSeek V4 Flash (
deepseek-v4-flash) to the retain() model leaderboard.Why it matters
DeepSeek V4 Flash is one of the most popular low-cost API models for production agent workloads. It's a natural fit for Hindsight's retain/consolidation pipeline:
api.deepseek.com/v1) — OpenAI-compatible endpoint{"thinking": {"type": "disabled"}})Missing from the current leaderboard
The current leaderboard covers Groq (gpt-oss), OpenAI (GPT variants), Google (Gemini), and Ollama Cloud (Gemma). DeepSeek is absent entirely despite being:
openaiprovider)Benchmark concern
The "Not Viable" list shows 10 sub-3B models that failed schema conformance. That's useful data. But there's a gap between those tiny models and the 20B+ models that dominate the leaderboard. DeepSeek V4 Flash would fill that gap — it's a mid-sized model (likely 20-30B range based on pricing tier) that's affordable enough for daily production use.
Configuration for testing
Note: Thinking mode should be disabled for best results with structured extraction tasks. Flash's base comprehension handles fact comparison and JSON schema conformation without chain-of-thought, and thinking mode actually hurts by burning output tokens on internal monologue before the JSON response.
Would love to see how it benchmarks against the current leaderboard entries. Happy to help test if needed.
Thanks!