Agent-Eval is a Python framework for building LLM agents with tools. Go from an empty folder to a working agent with custom tools in minutes.
- Quick setup - pip install to a working agent in minutes
- Simple tools - Drop a Python class in /tools and it's automatically discovered
- Any model - Gemini, OpenAI, Claude, or 100+ others via LiteLLM
- Built-in metrics - Token usage, timing, and conversation tracking
- File-based prompts - Version control your system prompts
- Beautiful CLI - Interactive mode with colors and helpful commands
- Rapid prototyping - Test agent ideas quickly
- Production development - Scale from prototype to production
- Multi-model experiments - Easy switching between LLM providers
- Team collaboration - Git-friendly structure for shared development
# 1. Install
pip install -e .
# 2. Set your API key (Gemini has a generous free tier, no credit card required!)
export GEMINI_API_KEY="your_key_here"
# 3. Chat with your agent instantly
agent-eval run "What are some great movies that came out this year?"
# 4. Start interactive mode
agent-eval chat

That's it! You now have a fully functional LLM agent with tool support, conversation management, and evaluation metrics.
Creating tools is ridiculously simple. Just drop a subclass of Tool anywhere in the /tools directory and the agent will automatically discover and use it:
from agent_eval.tools import Tool

class WeatherTool(Tool):
    def get_schema(self):
        return {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }

    def execute(self, location: str):
        # Your implementation here
        return f"Weather in {location}: Sunny, 72°F"

That's literally it!
- No registration required
- Auto-discovery at runtime
- Automatic validation
- Built-in error handling
- Tool approval workflow
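If you want to sanity-check a tool before the agent picks it up, you can exercise it directly in a test file or REPL; no agent or LLM call is involved. The module path below (tools/weather_tool.py) is just an assumed location for the class above:

```python
# Quick local check of the WeatherTool above -- no agent or LLM involved.
# Assumes the class was saved as tools/weather_tool.py.
from tools.weather_tool import WeatherTool

tool = WeatherTool()
print(tool.get_schema()["name"])        # -> "get_weather"
print(tool.execute(location="Berlin"))  # -> "Weather in Berlin: Sunny, 72°F"
```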
Agent-Eval treats prompts as first-class citizens with proper version control and organization:
your-project/
├── prompts/
│   ├── default_system.txt       # Main system prompt
│   ├── customer_support.txt     # Domain-specific variants
│   ├── data_analyst.txt         # Role-specific prompts
│   └── llm_judge_evaluation.txt # Custom evaluation criteria
Features:
- File-based prompts - Easy to version control and collaborate
- Hot-swapping - Test different prompts instantly
- A/B testing - Compare prompt performance systematically
- Template system - Reusable prompt components
# Test different prompts instantly
agent-eval chat --system-prompt="prompts/customer_support.txt"
# Hot-swap during development
echo "You are a specialized data analyst..." > prompts/custom.txt
agent-eval chat --system-prompt="prompts/custom.txt"

Agent-Eval supports 100+ models through LiteLLM integration:
export GEMINI_API_KEY="your_key"
# Generous free tier: 15 RPM, 1M tokens/day

# OpenAI
export OPENAI_API_KEY="your_key"
agent-eval chat --model="openai/gpt-4"
# Anthropic Claude
export ANTHROPIC_API_KEY="your_key"
agent-eval chat --model="anthropic/claude-3-sonnet-20240229"
# Or any other provider...

Supported Providers:
- Gemini (Free tier available!)
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude family)
- Cohere, Hugging Face, Azure and 100+ more
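Under the hood, LiteLLM exposes a single completion call for every provider, which is why switching models is just a matter of changing the model string. A minimal sketch of that unified API (plain LiteLLM, independent of Agent-Eval's CLI):

```python
# Same call shape for every provider; only the model string and API key differ.
import litellm

response = litellm.completion(
    model="gemini/gemini-2.5-flash",  # or "openai/gpt-4", "anthropic/claude-3-sonnet-20240229", ...
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)
```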
# Interactive mode with colors and emojis
$ agent-eval chat
Initializing agent...
┌────────────────────────────────────────┐
│               Agent-Eval               │
│     LLM Agent Evaluation Framework     │
└────────────────────────────────────────┘
Model: gemini/gemini-2.5-flash
Tools: ask_user, search_web
Auto-approve: No
Ready! Type '/help' for commands
You: What's the capital of France?
Agent: The capital of France is Paris.
Tokens: 245 | Tools: 0 | Iterations: 1 | Time: 0.8s

Interactive commands:
- /help    - Show available commands
- /tools   - List discovered tools
- /metrics - Show performance stats
- /save    - Save conversation to file
- /clear   - Reset conversation
Every interaction is automatically tracked and monitored:
- Token usage per query
- Tool execution count
- Response time tracking
- Error rate monitoring
- Iteration counting
// Auto-saved to conversation_history/
{
  "conversation_history": [...],
  "metrics": {
    "total_tokens": 1250,
    "tool_calls": 3,
    "iterations": 2,
    "duration": 2.34
  },
  "model": "gemini/gemini-2.5-flash",
  "tools_available": ["search_web", "ask_user"]
}
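Because runs are saved as plain JSON, you can post-process them with a few lines of Python. The conversation_history/ directory and field names below mirror the example above; the exact filename pattern is an assumption:

```python
# Summarize saved runs (assumes the JSON layout shown above).
import json
from pathlib import Path

for path in sorted(Path("conversation_history").glob("*.json")):
    run = json.loads(path.read_text())
    m = run["metrics"]
    print(f"{path.name}: {m['total_tokens']} tokens, "
          f"{m['tool_calls']} tool calls, {m['duration']:.2f}s")
```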
The professional workflow that scales from prototype to bulletproof agents:

Create Test Cases → Run Evaluation → Review Results
        ↑                ↑                 ↓
  Add More Tests ← Passes? ← Tweak System Prompt
The Flow:
- Write test cases in golden_dataset/
- Run evaluation: agent-eval evaluate --dataset golden_dataset/
- Analyze failures in evaluation reports
- Refine the system prompt in prompts/default_system.txt
- Re-evaluate until the tests pass
- Add edge cases you discovered
- Repeat → build bulletproof agents
This iterative workflow is what separates toy demos from production-grade agents.
The missing piece for production agent evaluation:
your-project/
├── golden_dataset/              # Your test cases
│   ├── customer_support.json
│   └── data_analysis.json
├── evaluation_logs/             # Automated runs
│   └── run_2025_01_15_14_30/
│       ├── responses.json
│       ├── metrics.json
│       └── failures.log
└── llm_judge_reports/           # AI evaluation
    ├── accuracy_scores.json
    └── quality_analysis.md
How it will work:
- Define Test Cases - JSON files with input/expected output pairs (see the sketch after this list)
- Run Evaluation - agent-eval evaluate --dataset golden_dataset/
- LLM-as-Judge - Automatic quality scoring for subjective responses
- Detailed Reports - Pass/fail rates, performance trends, insights
- CI/CD Integration - Automated testing in your deployment pipeline
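As a rough illustration, a golden dataset entry could be as small as an input/expected-output pair. The field names below are hypothetical, since the evaluation format described here is still planned:

```json
// golden_dataset/customer_support.json (hypothetical schema)
[
  {
    "input": "My order #1234 hasn't arrived yet. What can I do?",
    "expected_output": "Apologizes, offers to check order status, and provides tracking or refund options",
    "category": "order_tracking"
  }
]
```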
Enhanced Evaluation Features:
Custom LLM-as-Judge Prompts

prompts/
├── llm_judge_evaluation.txt # Your custom evaluation criteria
├── accuracy_rubric.txt      # Domain-specific scoring
└── safety_evaluation.txt    # Safety/compliance checks
Evaluation Dashboard
- Pass/fail rates by test case category
- Prompt performance comparisons
- Regression detection across iterations
- Cost analysis per evaluation run
CI/CD Integration
# In your GitHub Actions
- name: Evaluate Agent
  run: agent-eval evaluate --dataset golden_dataset/ --fail-threshold 0.95

This will make Agent-Eval the complete solution for agent development and evaluation.
# Override any setting
agent-eval chat \
--model="openai/gpt-4" \
--max-iterations=15 \
--auto-approve \
--system-prompt="You are a helpful coding assistant"

pip install -e .

# Clone and setup
git clone <repository-url>
cd agent-eval
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e .
# Run tests
python test_agent.py # Comprehensive agent tests
python -m pytest tests/ -v  # Tool-specific tests

# Option 1: Environment variables
export GEMINI_API_KEY="your_key"
export OPENAI_API_KEY="your_key"
# Option 2: .env file
echo "GEMINI_API_KEY=your_key" > .env

Agent-Eval is built on solid foundations:
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Beautiful CLI   │─────│    Agent Core    │─────│  Tool Discovery  │
│  • Colored UI    │     │  • Conversation  │     │  • Auto-detect   │
│  • Interactive   │     │  • Metrics       │     │  • Validation    │
│  • Intuitive     │     │  • Prompt Mgmt   │     │  • Execution     │
└──────────────────┘     └──────────────────┘     └──────────────────┘
                                  │                         │
                    ┌─────────────┴────────┐                │
                    │    Prompt System     │                │
                    │  • File-based        │                │
                    │  • Version Control   │                │
                    │  • Hot-swapping      │                │
                    │  • A/B Testing       │                │
                    └───────────┬──────────┘                │
                                │                           │
                        ┌───────┴────────┐                  │
                        │     LiteLLM    │──────────────────┘
                        │  • 100+ Models │
                        │  • Unified API │
                        │  • Reliability │
                        └────────────────┘
Core Components:
- CLI Layer - Beautiful, intuitive developer experience
- Agent Core - Conversation management, metrics, error handling
- Prompt System - File-based prompts with version control and A/B testing
- Tool System - Zero-config auto-discovery and execution
- LiteLLM Integration - Universal model provider support
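For intuition only, the core of this kind of architecture boils down to a small loop: send the conversation plus the discovered tool schemas to the model through LiteLLM, execute any tool the model requests, append the result, and repeat. This is a conceptual sketch, not Agent-Eval's actual implementation; names like discovered_tools and run_agent are illustrative:

```python
# Conceptual tool-calling loop (not Agent-Eval's real internals).
import json
import litellm

def run_agent(user_message, discovered_tools, model="gemini/gemini-2.5-flash", max_iterations=10):
    tools_by_name = {t.get_schema()["name"]: t for t in discovered_tools}
    tool_schemas = [{"type": "function", "function": t.get_schema()} for t in discovered_tools]
    messages = [{"role": "user", "content": user_message}]

    for _ in range(max_iterations):
        response = litellm.completion(model=model, messages=messages, tools=tool_schemas or None)
        message = response.choices[0].message
        messages.append(message.model_dump())   # keep the assistant turn in the history

        if not message.tool_calls:              # no tool requested -> final answer
            return message.content

        for call in message.tool_calls:         # run each requested tool and report back
            tool = tools_by_name[call.function.name]
            result = tool.execute(**json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
    return "Stopped after reaching max_iterations."
```

A real agent core would layer the metrics, error handling, and tool-approval workflow listed above on top of a loop like this.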
Open an issue with:
- Clear description
- Steps to reproduce
- Expected vs actual behavior
- Your environment details
If Agent-Eval makes your LLM agent development easier, please star the repo!
Every star helps us reach more developers who are struggling with complex agent frameworks.
Built with ❤️ for the LLM developer community