Request: Add DeepSeek V4 Flash to retain() leaderboard

## Feature Request

Please add **DeepSeek V4 Flash** (`deepseek-v4-flash`) to the retain() model leaderboard.

### Why it matters

DeepSeek V4 Flash is one of the most popular low-cost API models for production agent workloads. It's a natural fit for Hindsight's retain/consolidation pipeline:

- **Pricing:** $0.14/M input, $0.28/M output — cheaper than most models on the leaderboard
- **Provider:** DeepSeek API direct (`api.deepseek.com/v1`) — OpenAI-compatible endpoint
- **Real-world usage:** We're running it in production with Hindsight v0.6.1 for consolidation and fact extraction. Results so far:
  - Clean JSON schema conformance with zero retries (after disabling thinking mode via `{"thinking": {"type": "disabled"}}`)
  - ~6-8 seconds per consolidation batch (8 facts per batch)
  - Healthy merge ratios: creates/updates/skips all within expected ranges
  - ~220 calls/day at ~$0.04/day total cost

### Missing from the current leaderboard

The current leaderboard covers Groq (gpt-oss), OpenAI (GPT variants), Google (Gemini), and Ollama Cloud (Gemma). DeepSeek is absent entirely despite being:
- One of the top 3 API providers by volume
- The cheapest option per-token for structured output tasks
- Fully OpenAI-compatible (works with Hindsight's existing `openai` provider)

### Benchmark concern

The "Not Viable" list shows 10 sub-3B models that failed schema conformance. That's useful data. But there's a gap between those tiny models and the 20B+ models that dominate the leaderboard. DeepSeek V4 Flash would fill that gap — it's a mid-sized model (likely 20-30B range based on pricing tier) that's affordable enough for daily production use.

### Configuration for testing

```env
HINDSIGHT_API_LLM_PROVIDER=openai
HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash
HINDSIGHT_API_LLM_BASE_URL=https://api.deepseek.com/v1
HINDSIGHT_API_LLM_API_KEY=<deepseek-api-key>
```

Note: Thinking mode should be disabled for best results with structured extraction tasks. Flash's base comprehension handles fact comparison and JSON schema conformation without chain-of-thought, and thinking mode actually hurts by burning output tokens on internal monologue before the JSON response.

Would love to see how it benchmarks against the current leaderboard entries. Happy to help test if needed.

Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Request: Add DeepSeek V4 Flash to retain() leaderboard #12

Feature Request

Why it matters

Missing from the current leaderboard

Benchmark concern

Configuration for testing

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Request: Add DeepSeek V4 Flash to retain() leaderboard #12

Description

Feature Request

Why it matters

Missing from the current leaderboard

Benchmark concern

Configuration for testing

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions