🤖 Agent-Eval

✨ What is Agent-Eval?

Agent-Eval is a Python framework for building LLM agents with tools. Go from an empty folder to a working agent with custom tools in minutes.

🚀 What You Get

  • ⚡ Quick setup - pip install to a working agent in minutes
  • 🔧 Simple tools - Drop a Python class in /tools and it's automatically discovered
  • 🌐 Any model - Gemini, OpenAI, Claude, or 100+ others via LiteLLM
  • 📊 Built-in metrics - Token usage, timing, and conversation tracking
  • 📜 File-based prompts - Version control your system prompts
  • 🎨 Beautiful CLI - Interactive mode with colors and helpful commands

🎯 Perfect For

  • Rapid prototyping - Test agent ideas quickly
  • Production development - Scale from prototype to production
  • Multi-model experiments - Easy switching between LLM providers
  • Team collaboration - Git-friendly structure for shared development

⚡ Quick Start - Be Running in 60 Seconds

# 1. Install
pip install -e .

# 2. Set your API key (Gemini has a generous free tier, no credit card required!)
export GEMINI_API_KEY="your_key_here"

# 3. Chat with your agent instantly
agent-eval run "What are some great movies that came out this year?"

# 4. Start interactive mode
agent-eval chat

That's it! 🎉 You now have a fully functional LLM agent with tool support, conversation management, and evaluation metrics.


🔧 Tool Development - The Magic Happens Here

Creating tools is ridiculously simple. Just drop a subclass of Tool anywhere in the /tools directory and the agent will automatically discover and use it:

from agent_eval.tools import Tool

class WeatherTool(Tool):
    def get_schema(self):
        return {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }

    def execute(self, location: str):
        # Your implementation here
        return f"Weather in {location}: Sunny, 72ยฐF"

That's literally it! 🤯

  • ✅ No registration required
  • ✅ Auto-discovery at runtime
  • ✅ Automatic validation
  • ✅ Built-in error handling
  • ✅ Tool approval workflow
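
Before wiring a new tool into a full agent run, it can help to exercise it directly. Here is a minimal sanity check, assuming the example above was saved as tools/weather_tool.py and is run from the project root (both details are just for illustration):

# sanity_check.py - exercise a tool directly, outside the agent loop
from tools.weather_tool import WeatherTool  # assumed module path for the example above

tool = WeatherTool()

# The schema is what the agent hands to the LLM so it knows how to call the tool
print(tool.get_schema())

# execute() is what runs when the model requests the tool
print(tool.execute(location="Paris"))  # -> "Weather in Paris: Sunny, 72°F"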

📜 System Prompt Management

Agent-Eval treats prompts as first-class citizens with proper version control and organization:

your-project/
├── prompts/
│   ├── default_system.txt        # Main system prompt
│   ├── customer_support.txt      # Domain-specific variants
│   ├── data_analyst.txt          # Role-specific prompts
│   └── llm_judge_evaluation.txt  # Custom evaluation criteria

Features:

  • 🎯 File-based prompts - Easy to version control and collaborate
  • 🔄 Hot-swapping - Test different prompts instantly
  • 📊 A/B testing - Compare prompt performance systematically
  • 🎨 Template system - Reusable prompt components (sketched below)

# Test different prompts instantly
agent-eval chat --system-prompt="prompts/customer_support.txt"

# Hot-swap during development
echo "You are a specialized data analyst..." > prompts/custom.txt
agent-eval chat --system-prompt="prompts/custom.txt"
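
The template-system bullet above could be as simple as string substitution over these prompt files. A rough sketch of the idea, not Agent-Eval's actual loader (the {placeholder} syntax and helper name are assumptions):

# compose_prompt.py - illustrative only; the framework's own prompt loading may differ
from pathlib import Path

def load_prompt(path: str, **variables: str) -> str:
    """Read a prompt file and fill in {placeholder} variables."""
    template = Path(path).read_text(encoding="utf-8")
    return template.format(**variables)

system_prompt = load_prompt(
    "prompts/customer_support.txt",
    company_name="Acme Corp",  # assumes the file contains a {company_name} placeholder
)
print(system_prompt)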

๐ŸŒ Any LLM Provider - Your Choice

Agent-Eval supports 100+ models through LiteLLM integration:

🆓 Start Free with Gemini

export GEMINI_API_KEY="your_key"
# Generous free tier: 15 RPM, 1M tokens/day

🚀 Scale with Premium Providers

# OpenAI
export OPENAI_API_KEY="your_key"
agent-eval chat --model="openai/gpt-4"

# Anthropic Claude
export ANTHROPIC_API_KEY="your_key"
agent-eval chat --model="anthropic/claude-3-sonnet-20240229"

# Or any other provider...

Supported Providers:

  • Gemini (Free tier available!)
  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic (Claude family)
  • Cohere, Hugging Face, Azure and 100+ more
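
Under the hood these all go through LiteLLM's single completion() call, which accepts provider-prefixed model names. A minimal sketch of LiteLLM used directly, independent of Agent-Eval's own wrapper (each provider's API key must be set for its model to work):

# litellm_demo.py - same call shape regardless of provider
import litellm

messages = [{"role": "user", "content": "Name three uses for an LLM agent."}]

# Swap the model string to switch providers.
for model in ["gemini/gemini-2.5-flash", "openai/gpt-4", "anthropic/claude-3-sonnet-20240229"]:
    response = litellm.completion(model=model, messages=messages)
    print(model, "->", response.choices[0].message.content)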

🎨 Beautiful CLI Experience

# Interactive mode with colors and emojis
$ agent-eval chat

🚀 Initializing agent...

╔══════════════════════════════════════╗
║           🤖 Agent-Eval              ║
║    LLM Agent Evaluation Framework    ║
╚══════════════════════════════════════╝

Model: gemini/gemini-2.5-flash
Tools: ask_user, search_web
Auto-approve: No

💬 Ready! Type '/help' for commands

You: What's the capital of France?
🤖 Agent: The capital of France is Paris.

📊 Tokens: 245 | Tools: 0 | Iterations: 1 | Time: 0.8s

๐Ÿ› ๏ธ Built-in Commands

  • /help - Show available commands
  • /tools - List discovered tools
  • /metrics - Show performance stats
  • /save - Save conversation to file
  • /clear - Reset conversation

📊 Built-in Metrics & Tracking

Every interaction is tracked automatically:

🔍 Real-time Metrics

  • Token usage per query
  • Tool execution count
  • Response time tracking
  • Error rate monitoring
  • Iteration counting

💾 Conversation Persistence

// Auto-saved to conversation_history/
{
  "conversation_history": [...],
  "metrics": {
    "total_tokens": 1250,
    "tool_calls": 3,
    "iterations": 2,
    "duration": 2.34
  },
  "model": "gemini/gemini-2.5-flash",
  "tools_available": ["search_web", "ask_user"]
}
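
Because each run lands as plain JSON, post-run analysis needs nothing beyond the standard library. A small sketch (the exact filenames inside conversation_history/ are assumed):

# inspect_run.py - summarize the most recent saved conversation
import json
from pathlib import Path

# Pick the newest file in conversation_history/
latest = max(Path("conversation_history").glob("*.json"), key=lambda p: p.stat().st_mtime)
run = json.loads(latest.read_text())

metrics = run["metrics"]
print(f"Model:      {run['model']}")
print(f"Tokens:     {metrics['total_tokens']}")
print(f"Tool calls: {metrics['tool_calls']}")
print(f"Duration:   {metrics['duration']:.2f}s")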

🔮 What's Coming Next

🔄 The Agent Development Loop (The Secret to Production-Ready Agents)

The professional workflow that scales from prototype to bulletproof agents:

๐Ÿ“ Create Test Cases โ†’ ๐Ÿš€ Run Evaluation โ†’ ๐Ÿ“Š Review Results
     โฌ†                                           โฌ‡
๐Ÿ” Add More Tests  โ† โœ… Passes? โ†’ ๐ŸŽฏ Tweak System Prompt

The Flow:

  1. ๐Ÿ“ Write test cases in golden_dataset/
  2. ๐Ÿš€ Run evaluation: agent-eval evaluate --dataset golden_dataset/
  3. ๐Ÿ“Š Analyze failures in evaluation reports
  4. ๐ŸŽฏ Refine system prompt in prompts/default_system.txt
  5. ๐Ÿ” Re-evaluate until tests pass
  6. โž• Add edge cases you discovered
  7. ๐Ÿ”„ Repeat โ†’ Build bulletproof agents

This iterative workflow is what separates toy demos from production-grade agents.

📊 Evaluation Pipeline (Coming Soon)

The missing piece for production agent evaluation:

your-project/
├── golden_dataset/           # Your test cases
│   ├── customer_support.json
│   └── data_analysis.json
├── evaluation_logs/          # Automated runs
│   └── run_2025_01_15_14_30/
│       ├── responses.json
│       ├── metrics.json
│       └── failures.log
└── llm_judge_reports/        # AI evaluation
    ├── accuracy_scores.json
    └── quality_analysis.md

How it will work:

  1. ๐Ÿ“ Define Test Cases - JSON files with input/expected output pairs
  2. ๐Ÿš€ Run Evaluation - agent-eval evaluate --dataset golden_dataset/
  3. ๐Ÿค– LLM-as-Judge - Automatic quality scoring for subjective responses
  4. ๐Ÿ“ˆ Detailed Reports - Pass/fail rates, performance trends, insights
  5. ๐Ÿ”„ CI/CD Integration - Automated testing in your deployment pipeline
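
The on-disk test-case format isn't finalized yet, but step 1 implies simple input/expected-output pairs. One plausible shape, written here by a short Python script (the field names are illustrative, not a committed schema):

# make_golden_dataset.py - writes an illustrative test-case file
import json

test_cases = [
    {
        "id": "refund-policy-1",
        "input": "A customer wants a refund after 45 days. What should I tell them?",
        "expected_output": "Politely explain that refunds are only available within 30 days.",
        "tags": ["customer_support"],
    },
]

with open("golden_dataset/customer_support.json", "w") as f:
    json.dump(test_cases, f, indent=2)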

Enhanced Evaluation Features:

🎯 Custom LLM-as-Judge Prompts

prompts/
├── llm_judge_evaluation.txt    # Your custom evaluation criteria
├── accuracy_rubric.txt         # Domain-specific scoring
└── safety_evaluation.txt       # Safety/compliance checks
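
A judge prompt is just another text file. A sketch of what prompts/llm_judge_evaluation.txt might contain (the criteria, scale, and output format are only an example):

You are grading an AI assistant's answer against an expected answer.
Score the response from 1-5 on each criterion:
- Accuracy: does it match the facts in the expected answer?
- Completeness: does it cover every point the expected answer covers?
- Tone: is it appropriate for the scenario?
Return JSON: {"accuracy": <1-5>, "completeness": <1-5>, "tone": <1-5>, "reasoning": "<one sentence>"}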

📊 Evaluation Dashboard

  • Pass/fail rates by test case category
  • Prompt performance comparisons
  • Regression detection across iterations
  • Cost analysis per evaluation run

🔄 CI/CD Integration

# In your GitHub Actions
- name: Evaluate Agent
  run: agent-eval evaluate --dataset golden_dataset/ --fail-threshold 0.95

This will make Agent-Eval the complete solution for agent development and evaluation.


🚀 Advanced Usage

🎛️ Custom Configuration

# Override any setting
agent-eval chat \
  --model="openai/gpt-4" \
  --max-iterations=15 \
  --auto-approve \
  --system-prompt="You are a helpful coding assistant"

๐Ÿ—๏ธ Installation & Setup

๐Ÿ“ฆ Quick Install

pip install -e .

๐Ÿ Development Setup

# Clone and setup
git clone <repository-url>
cd agent-eval
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -e .

# Run tests
python test_agent.py  # Comprehensive agent tests
python -m pytest tests/ -v  # Tool-specific tests

🔑 API Key Setup

# Option 1: Environment variables
export GEMINI_API_KEY="your_key"
export OPENAI_API_KEY="your_key"

# Option 2: .env file
echo "GEMINI_API_KEY=your_key" > .env

๐Ÿ›๏ธ Architecture

Agent-Eval is built from a small set of focused components:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Beautiful CLI │────│  Agent Core      │────│  Tool Discovery │
│  • Colored UI   │    │  • Conversation  │    │  • Auto-detect  │
│  • Interactive  │    │  • Metrics       │    │  • Validation   │
│  • Intuitive    │    │  • Prompt Mgmt   │    │  • Execution    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                               │                         │
                    ┌──────────┴──────────┐              │
                    │   Prompt System     │              │
                    │  • File-based       │              │
                    │  • Version Control  │              │
                    │  • Hot-swapping     │              │
                    │  • A/B Testing      │              │
                    └─────────────────────┘              │
                               │                         │
                       ┌───────┴────────┐                │
                       │   LiteLLM      │────────────────┘
                       │  • 100+ Models │
                       │  • Unified API │
                       │  • Reliability │
                       └────────────────┘

Core Components:

  • 🎨 CLI Layer - Beautiful, intuitive developer experience
  • 🧠 Agent Core - Conversation management, metrics, error handling
  • 📜 Prompt System - File-based prompts with version control and A/B testing
  • 🔧 Tool System - Zero-config auto-discovery and execution
  • 🌐 LiteLLM Integration - Universal model provider support

๐Ÿ› Found a Bug?

Open an issue with:

  • Clear description
  • Steps to reproduce
  • Expected vs actual behavior
  • Your environment details

🌟 Star Us!

If Agent-Eval makes your LLM agent development easier, please star the repo! ⭐

Every star helps us reach more developers who are struggling with complex agent frameworks.


Built with โค๏ธ for the LLM developer community
