🐟 Sentifish

Web search provider benchmarking platform — run standardized IR evaluations across Brave, Serper, Tavily, and TinyFish, then compare results side-by-side.

Why

Search APIs all claim to be the best. Sentifish lets you prove it with data: run the same queries across multiple providers, score them with real IR metrics, and see who actually delivers.

Demo

Providers competing head-to-head on the same queries

Precision, Recall, NDCG, MRR scored per provider

NDCG@10 quality tracking across runs

Full dashboard with stats, comparisons, and run history

Features

6 search providers: Brave, Serper, SerpAPI, Tavily, Exa, TinyFish
IR scoring: Precision@K, Recall@K, NDCG@K, MRR, Latency
Async evaluation runs: POST triggers run, poll for results as providers finish
Dashboard UI: provider head-to-head comparison, NDCG trend charts, run history
Typed API client: full TypeScript SDK with React Query integration
Persistent results: in-memory store + disk persistence to ./results/
Dataset management: create, list, and manage evaluation datasets
Multi-dataset evaluation: Run benchmarks across multiple datasets in parallel
7 curated datasets: Sample + 6 domain-specific datasets covering developer tools, news, research, startups, ML engineering, and product design
Railway-ready: Dockerfile + railway.toml included

Architecture

sentifish/          → Backend (Python/FastAPI)
sentifish-ui/       → Frontend (React/Vite/TypeScript)

Backend

Python FastAPI server with async evaluation engine.

Providers run concurrently (TinyFish: 2 concurrent, others: 5)
Scores stream in as providers complete via as_completed
Results persisted to ./results/ on disk
Multi-dataset runs spawn parallel evaluation tasks

Frontend

React SPA with a mobile-first dashboard.

React + Vite + TypeScript with SWC
shadcn/ui component library (Radix primitives)
Framer Motion animations
Tailwind CSS with custom design tokens (DM Sans + JetBrains Mono)
React Query for server state
React Router for navigation

Dashboard components:

Component	What it shows
`StatCards`	Total queries evaluated, best NDCG@10, fastest provider, total runs
`ProviderComparison`	Animated bar charts comparing P@K, R@K, NDCG, MRR across all providers
`TrendChart`	NDCG@10 sparklines over last 12 runs per provider
`RecentRuns`	Timeline of evaluation runs with status, provider badges, best scores
`InsightCard`	Auto-generated evaluation insights

Quick Start

Backend

cd server
uv sync --all-groups
cp ../env.sample .env   # fill in API keys (see env.sample for all options including API_KEY)
./run

Server starts at http://localhost:4010.

Frontend

cd ../sentifish-ui
npm install
npm run dev

UI starts at http://localhost:5173. Set VITE_SENTIFISH_API_URL to point at the backend.

API

Endpoint	Method	Description
`/health`	GET	Health check
`/api/providers`	GET	List configured providers
`/api/datasets`	GET/POST	List or create datasets
`/api/datasets/{name}`	GET/DELETE	Get or delete a dataset
`/api/runs`	GET/POST	List runs or trigger a new eval
`/api/runs/{id}`	GET	Full run results
`/api/runs/{id}/summary`	GET	Aggregated scores per provider

Trigger an eval

curl -X POST http://localhost:4010/api/runs \
  -H "Content-Type: application/json" \
  -d '{"dataset": "sample", "providers": ["brave", "serper", "tavily", "tinyfish"], "top_k": 10}'

Poll for results

curl http://localhost:4010/api/runs/{run_id}

Scores stream in as providers complete — no need to wait for the slowest one.

Multi-dataset evaluation

curl -X POST http://localhost:4010/api/runs \
  -H "Content-Type: application/json" \
  -d '{"datasets": ["developer-tools", "ml-engineer"], "providers": ["brave", "serper"], "top_k": 10}'

Returns one run per dataset, all executing in parallel:

{
  "runs": [
    {"id": "...", "dataset": "developer-tools", "status": "running"},
    {"id": "...", "dataset": "ml-engineer", "status": "running"}
  ]
}

Datasets

Ships with 7 evaluation datasets (48+ queries total):

Dataset	Domain	Queries	Description
`sample`	General	5	Mixed topics for smoke testing
`developer-tools`	Engineering	8	Framework docs, debugging, API references
`news-and-current-events`	Journalism	8	Fact-checking, trending topics, explainers
`academic-research`	Science	8	Papers, scientific concepts, literature reviews
`startup-founder`	Business	8	Market research, fundraising, competitor analysis
`ml-engineer`	AI/ML	8	Model benchmarks, implementations, tooling
`product-designer`	Design	8	UX patterns, design systems, accessibility

Metrics

Metric	What it measures
Precision@K	Fraction of top-K results that are relevant
Recall@K	Fraction of known relevant docs found in top-K
NDCG@K	Ranking quality (rewards relevant results appearing higher)
MRR	How quickly the first relevant result appears
Latency	Wall-clock response time per provider

Content Depth (content_depth) is the normalized average snippet length returned by a provider. It measures how much substantive content each result contains — providers that return longer, richer snippets score higher. Displayed in the dashboard's Provider Comparison chart and in per-query score tables.

Tests

cd server
uv run pytest --cov=app tests/

Deploy

Deploys via Docker on Railway. Push to main triggers auto-deploy.

docker build -t sentifish .
docker run -p 4010:4010 --env-file .env sentifish

Tech Stack

Layer	Stack
Backend	Python, FastAPI, asyncio, uv
Frontend	React 18, Vite, TypeScript, Tailwind, shadcn/ui, Framer Motion
Scoring	Custom IR metrics engine (Precision, Recall, NDCG, MRR)
Search	Brave API, Serper API, Tavily API, TinyFish API (SSE)
Storage	In-memory + disk persistence
Deploy	Docker, Railway

Roadmap

From search benchmarking to full agent observability

Sentifish today benchmarks search providers. The bigger vision is observability for any AI agent's output — think Datadog, but for agents.

LLM-as-Judge Scoring — Semantic relevance scoring via OpenAI. Evaluate answer quality, not just URL matches. Catches cases where different URLs contain equally good content.

Agent Output Evaluation — Beyond search: benchmark any web agent's task output for correctness and drift. Define expected outputs, run agents on schedule, score results automatically.

Drift Detection & Alerts — Scheduled re-evaluation with auto-alerting when agent quality degrades over time. Catch silent failures before users do.

Self-Improving Baselines — Feed failure context + last N successful runs to an LLM. Auto-classify real drift vs benign change, and adapt baselines without manual intervention.

Multi-Agent Arena — Head-to-head comparison of browser agents (TinyFish, Browserbase, Multion, Steel) on identical tasks, scored with the same metrics framework.

Security

Write endpoints (POST /api/runs, POST /api/datasets, DELETE /api/datasets/{name}) can be protected with an optional API key. Set the API_KEY environment variable to enable authentication — all write requests must then include a matching X-Api-Key header. When API_KEY is not set, auth is disabled and the API is open.

API_KEY=your-secret-key-here

curl -X POST http://localhost:4010/api/runs \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: your-secret-key-here" \
  -d '{"dataset": "sample", "providers": ["brave"], "top_k": 10}'

Research

The docs/research/ directory contains 51 research files on agentic frameworks, evaluation methods, and observability — the background research that shaped Sentifish's design.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 208 Commits
.github/workflows		.github/workflows
assets/gifs		assets/gifs
docs		docs
research		research
server		server
ui		ui
.gitignore		.gitignore
.release-please-manifest.json		.release-please-manifest.json
Dockerfile		Dockerfile
README.md		README.md
env.sample		env.sample
railway.toml		railway.toml
release-please-config.json		release-please-config.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🐟 Sentifish

Why

Demo

Features

Architecture

Backend

Frontend

Quick Start

Backend

Frontend

API

Trigger an eval

Poll for results

Multi-dataset evaluation

Datasets

Metrics

Tests

Deploy

Tech Stack

Roadmap

Security

Research

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🐟 Sentifish

Why

Demo

Features

Architecture

Backend

Frontend

Quick Start

Backend

Frontend

API

Trigger an eval

Poll for results

Multi-dataset evaluation

Datasets

Metrics

Tests

Deploy

Tech Stack

Roadmap

Security

Research

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages