Skip to content

Request: Add DeepSeek V4 Flash to retain() leaderboard #12

@GusBot69

Description

@GusBot69

Feature Request

Please add DeepSeek V4 Flash (deepseek-v4-flash) to the retain() model leaderboard.

Why it matters

DeepSeek V4 Flash is one of the most popular low-cost API models for production agent workloads. It's a natural fit for Hindsight's retain/consolidation pipeline:

  • Pricing: $0.14/M input, $0.28/M output — cheaper than most models on the leaderboard
  • Provider: DeepSeek API direct (api.deepseek.com/v1) — OpenAI-compatible endpoint
  • Real-world usage: We're running it in production with Hindsight v0.6.1 for consolidation and fact extraction. Results so far:
    • Clean JSON schema conformance with zero retries (after disabling thinking mode via {"thinking": {"type": "disabled"}})
    • ~6-8 seconds per consolidation batch (8 facts per batch)
    • Healthy merge ratios: creates/updates/skips all within expected ranges
    • ~220 calls/day at ~$0.04/day total cost

Missing from the current leaderboard

The current leaderboard covers Groq (gpt-oss), OpenAI (GPT variants), Google (Gemini), and Ollama Cloud (Gemma). DeepSeek is absent entirely despite being:

  • One of the top 3 API providers by volume
  • The cheapest option per-token for structured output tasks
  • Fully OpenAI-compatible (works with Hindsight's existing openai provider)

Benchmark concern

The "Not Viable" list shows 10 sub-3B models that failed schema conformance. That's useful data. But there's a gap between those tiny models and the 20B+ models that dominate the leaderboard. DeepSeek V4 Flash would fill that gap — it's a mid-sized model (likely 20-30B range based on pricing tier) that's affordable enough for daily production use.

Configuration for testing

HINDSIGHT_API_LLM_PROVIDER=openai
HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash
HINDSIGHT_API_LLM_BASE_URL=https://api.deepseek.com/v1
HINDSIGHT_API_LLM_API_KEY=<deepseek-api-key>

Note: Thinking mode should be disabled for best results with structured extraction tasks. Flash's base comprehension handles fact comparison and JSON schema conformation without chain-of-thought, and thinking mode actually hurts by burning output tokens on internal monologue before the JSON response.

Would love to see how it benchmarks against the current leaderboard entries. Happy to help test if needed.

Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions