agent-routing-eval-lab

Offline evaluation lab for agent routing and tool-selection policies before production.

Most agent teams tweak prompts, routers, tool catalogs, and policy rules frequently. This lab demonstrates why production changes need offline evaluation first: a new policy can improve one metric while quietly increasing cost, latency, unsafe actions, or unresolved requests.

What this repo demonstrates

Replaying logged decisions from a realistic support/ops scenario
Comparing candidate routers on the same historical data
Measuring quality/cost/latency/safety trade-offs
Flagging weak support/coverage regions in logs
Using context-aware bounded tool cards vs exposing every tool schema
Generating a decision-ready report before rollout
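
The first two items, replaying logged decisions and comparing candidate routers on the same data, can be sketched in a few lines. This is a minimal illustration, not the lab's evaluator: the `LoggedDecision` schema, the `replay` helper, and both toy routers are hypothetical names, and it assumes each logged row carries a ground-truth `correct_tool` label (plausible for synthetic data, often unavailable in production logs).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LoggedDecision:
    # Hypothetical log schema; the lab's real columns may differ.
    intent: str
    logged_tool: str   # what the production router actually chose
    correct_tool: str  # assumed ground-truth label


Router = Callable[[str], str]


def replay(logs: List[LoggedDecision], routers: Dict[str, Router]) -> Dict[str, float]:
    """Score every router on the same logged rows (correct-tool rate)."""
    results = {}
    for name, route in routers.items():
        hits = sum(1 for row in logs if route(row.intent) == row.correct_tool)
        results[name] = hits / len(logs)
    return results


logs = [
    LoggedDecision("refund", "billing_api", "billing_api"),
    LoggedDecision("password_reset", "kb_search", "identity_api"),
    LoggedDecision("outage", "kb_search", "status_page"),
    LoggedDecision("refund", "kb_search", "billing_api"),
]

# Two toy candidates: one ignores the intent, one uses a small lookup table.
intent_table = {"refund": "billing_api", "outage": "status_page"}
routers = {
    "always_kb": lambda intent: "kb_search",
    "intent_map": lambda intent: intent_table.get(intent, "kb_search"),
}

rates = replay(logs, routers)
print(rates)
```

Because every candidate sees identical rows, differences in the resulting rates come from the routing logic rather than from traffic drift.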

Architecture

flowchart LR
    A[Historical logged decisions] --> B[Candidate routing policies]
    B --> C[Offline evaluator]
    C --> D[Metrics + support coverage checks]
    D --> E[Markdown report + terminal summary]
    E --> F[Rollout decision \n hold / revise / canary]

See /docs/architecture.md for details. For terminology used throughout the lab, see /docs/glossary.md.

Quickstart

make install
make test
make generate-data
make evaluate
make report
make demo

Example comparison output

Policy	Success	Correct Tool	Avg Cost	Avg Latency (ms)	Unsafe	Unresolved	Regret	Score
contextweaver_v1	79.67%	83.67%	$0.153	183.3	2.67%	8.67%	0.319	83.41
baseline	62.00%	66.00%	$0.309	304.2	3.00%	37.67%	0.666	66.67
strict_policy	46.00%	46.00%	$0.065	121.8	0.00%	44.67%	0.711	60.91
cost_aware	23.67%	23.67%	$0.051	105.4	0.00%	23.00%	1.014	49.38

Source: mirrors reports/example_report.md, generated by make demo with 300 synthetic rows and seed 7.
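
One way to read the table is to ask which policies are strictly worse than another on every axis you care about. The sketch below applies a simple Pareto-dominance check to three columns (Success, Avg Cost, Unsafe) copied from the table above. The choice of those three metrics and the `dominates` / `dominated_policies` helpers are illustrative assumptions, not how the lab computes its Score column.

```python
from typing import Dict, List, Tuple

# (success %, avg cost $, unsafe %) copied from the comparison table.
Metrics = Tuple[float, float, float]

POLICIES: Dict[str, Metrics] = {
    "contextweaver_v1": (79.67, 0.153, 2.67),
    "baseline": (62.00, 0.309, 3.00),
    "strict_policy": (46.00, 0.065, 0.00),
    "cost_aware": (23.67, 0.051, 0.00),
}


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    # Higher success is better; lower cost and lower unsafe rate are better.
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def dominated_policies(policies: Dict[str, Metrics]) -> List[str]:
    """Names of policies that some other policy dominates."""
    return sorted(
        name
        for name, metrics in policies.items()
        if any(
            dominates(other, metrics)
            for other_name, other in policies.items()
            if other_name != name
        )
    )


dominated = dominated_policies(POLICIES)
print(dominated)
```

On these three metrics only `baseline` is dominated (by `contextweaver_v1`). The other three policies each win somewhere, which is the trade-off the report is meant to surface.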

Enterprise governance mapping

Use this lab as a pre-deployment gate before online A/B testing. It helps teams reject policy changes that improve happy-path demos but harm safety, support coverage, or operating cost.
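
A pre-deployment gate of this kind can be expressed as a small rule function. The following is a sketch under stated assumptions: the `PolicyMetrics` fields, the `rollout_decision` name, and the 3% unsafe ceiling are hypothetical, and real thresholds belong to your own governance policy. The three outcomes (hold, revise, canary) follow the labels in the architecture diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyMetrics:
    success: float   # fraction of requests resolved successfully
    unsafe: float    # fraction of requests with an unsafe action
    avg_cost: float  # average dollars per request


def rollout_decision(
    candidate: PolicyMetrics,
    baseline: PolicyMetrics,
    max_unsafe: float = 0.03,  # assumed ceiling, not a value from the lab
) -> str:
    """Map offline metrics to hold / revise / canary."""
    # Safety regressions block the rollout outright.
    if candidate.unsafe > max_unsafe or candidate.unsafe > baseline.unsafe:
        return "hold"
    # Safe but worse on quality or cost: send back for another iteration.
    if candidate.success < baseline.success or candidate.avg_cost > baseline.avg_cost:
        return "revise"
    # No regression on any gated metric: eligible for a limited online canary.
    return "canary"


# Values taken from the example comparison table, converted to fractions.
baseline = PolicyMetrics(success=0.6200, unsafe=0.0300, avg_cost=0.309)
contextweaver_v1 = PolicyMetrics(success=0.7967, unsafe=0.0267, avg_cost=0.153)
strict_policy = PolicyMetrics(success=0.4600, unsafe=0.0000, avg_cost=0.065)

decisions = {
    "contextweaver_v1": rollout_decision(contextweaver_v1, baseline),
    "strict_policy": rollout_decision(strict_policy, baseline),
}
print(decisions)
```

Under these assumed rules `contextweaver_v1` would qualify for a canary while `strict_policy` would be sent back for revision because its success rate falls below the baseline. Checking safety first keeps a cheaper or more accurate policy from passing on the strength of its other metrics.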

Public library showcase

skdr-eval: wrapped by src/agent_routing_eval_lab/adapters/skdr_eval_adapter.py as the evaluation anchor. The adapter is explicit about fallback behavior and emits warnings when native API wiring is unavailable.
contextweaver: demonstrated via bounded tool cards in src/agent_routing_eval_lab/adapters/contextweaver_adapter.py and the ContextWeaverRouter.
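
To make the idea of bounded tool cards concrete, here is a minimal sketch of selecting a small, context-relevant subset of a catalog instead of exposing every tool schema. It does not use or mirror the contextweaver API: `ToolCard`, `select_cards`, the tag-overlap scoring, and the catalog entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class ToolCard:
    # A compact description shown to the router instead of a full schema.
    name: str
    tags: FrozenSet[str]
    summary: str


def select_cards(context_tags: Set[str], catalog: List[ToolCard], budget: int) -> List[ToolCard]:
    """Return at most `budget` cards, ranked by tag overlap with the context."""
    scored = [(len(card.tags & context_tags), card) for card in catalog]
    relevant = [(score, card) for score, card in scored if score > 0]
    # Highest overlap first; ties broken by name so the result is deterministic.
    relevant.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [card for _, card in relevant[:budget]]


catalog = [
    ToolCard("billing_api", frozenset({"billing", "refund"}), "Issue refunds and view invoices"),
    ToolCard("identity_api", frozenset({"account", "password"}), "Reset credentials"),
    ToolCard("status_page", frozenset({"outage", "incident"}), "Check service status"),
    ToolCard("kb_search", frozenset({"billing", "account", "outage"}), "Search help articles"),
]

selected = [card.name for card in select_cards({"refund", "billing"}, catalog, budget=2)]
print(selected)
```

With a budget of two, a refund request sees only `billing_api` and `kb_search`. The budget caps the number of tool descriptions the router has to consider, and cards with no overlap are never shown.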

Optional extensions to deterministic flows (e.g., ChainWeaver) or governance layers (e.g., AgentFence / agent-kernel) are noted in docs, but routing evaluation stays the main focus.

When to use this pattern

Evaluating agent router changes safely
Reducing tool-call costs without quality regressions
Validating stricter safety/approval policies
Preparing an agent for production rollout
Comparing prompt/model/tool-catalog changes before traffic exposure

Limitations

Synthetic demo data, not production telemetry
Offline evaluation cannot replace online experiments
Still requires red-teaming and human review for high-risk actions
Counterfactual estimates are sensitive to support/coverage in logs
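
The last limitation can be checked mechanically: if a candidate policy would choose an action that the logs rarely or never contain for a given context, any offline estimate for that region rests on very little evidence. The sketch below counts logged (context, action) pairs and flags thin ones. The `weak_support` helper, the `min_count` threshold, and the toy log are assumptions for illustration, not the lab's coverage check.

```python
from collections import Counter
from typing import Callable, List, Tuple


def weak_support(
    logs: List[Tuple[str, str]],
    policy: Callable[[str], str],
    min_count: int = 2,  # assumed threshold; tune to your log volume
) -> List[Tuple[str, str, int]]:
    """List (context, action, logged_count) where the policy's choice is rarely logged."""
    seen = Counter(logs)
    flagged = []
    for context in sorted({ctx for ctx, _ in logs}):
        action = policy(context)
        count = seen[(context, action)]  # Counter returns 0 for unseen pairs
        if count < min_count:
            flagged.append((context, action, count))
    return flagged


# Toy log of (context, action) pairs.
logs = (
    [("refund", "billing_api")] * 3
    + [("outage", "kb_search")] * 4
    + [("outage", "status_page")]
)

# Hypothetical candidate policy expressed as a lookup table.
candidate = {"refund": "billing_api", "outage": "status_page"}.get

flagged = weak_support(logs, candidate)
print(flagged)
```

Here the candidate's choice for outage requests appears only once in the logs, so that region is flagged. Estimates for flagged regions are better treated as unknown than as evidence for or against the candidate.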
