adventurewave-labs/HASEB

HASEB: Holistic Agentic System Evaluator & Benchmarking Suite

A unified, open-source evaluation platform for holistically assessing agentic systems across diverse tasks with multi-dimensional "process viability" metrics.

License: ISC Node.js Version TypeScript React PostgreSQL

📖 Documentation • 🚀 Quick Start • 🏗️ Architecture • 📊 Benchmarks • 🔧 Configuration

🎯 Overview

HASEB (Holistic Agentic System Evaluator & Benchmarking Suite) is a comprehensive evaluation platform designed to assess AI agents across multiple dimensions including performance, efficiency, cost, robustness, and quality. Built with modern web technologies and following SPARC methodology, HASEB provides researchers and developers with the tools needed to systematically evaluate and compare agentic systems.

Key Features

  • 🔄 Multi-Environment Support: SWE-bench, GAIA, OSWorld, WebArena, AgentBench
  • 📊 Multi-Dimensional Metrics: Performance, Efficiency, Cost, Robustness, Quality
  • ⚡ Real-Time Monitoring: WebSocket-based live evaluation tracking
  • 🎯 LangGraph Orchestration: Stateful workflow management
  • 📱 Interactive Dashboard: React-based visualization interface
  • 🗄️ PostgreSQL Backend: Scalable metrics storage and analysis
  • 🔒 Enterprise Security: JWT authentication, rate limiting, CORS protection
  • 📈 Comprehensive Analytics: Real-time leaderboards and trend analysis

🚀 Quick Start

Prerequisites

Minimum versions required:

  • Node.js >= 18.0.0 (tested with 18.19.0+)
  • PostgreSQL >= 15.0 (tested with 15.4+)
  • npm >= 9.0.0 (tested with 10.2.4+)
  • Git >= 2.30.0

System Requirements

  • Memory: Minimum 4GB RAM (8GB+ recommended)
  • Storage: Minimum 10GB free space
  • OS: Linux (Ubuntu 20.04+), macOS (12+), or Windows 10+

Installation

  1. Clone the repository

    git clone https://github.com/adventurewave-labs/HASEB.git haseb
    cd haseb
  2. Verify Node.js version

    node --version  # Should be >= 18.0.0
    npm --version   # Should be >= 9.0.0
  3. Install dependencies

    npm install
  4. Set up environment variables

    cp .env.example .env
    # Edit .env with your configuration - see Configuration section below
  5. Set up PostgreSQL database

    # Verify PostgreSQL is running
    pg_isready
    
    # Create database
    createdb haseb
    
    # Run migrations
    npm run migrate
    
    # Seed test data (optional)
    npm run seed:test
  6. Start the development servers

    # Terminal 1: Start backend server
    npm run dev:backend
    
    # Terminal 2: Start frontend development server
    npm run dev
  7. Verify the installation

    # Check health endpoint
    curl http://localhost:3000/health
    
    # Check API documentation
    curl http://localhost:3000/api-docs
  8. Access the application

Docker Quick Start

# Start PostgreSQL for testing
docker-compose -f docker-compose.test.yml up -d

# Run the application
npm run dev:backend

🏗️ Architecture

System Components

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend API   │    │   Database      │
│   (React 19)    │◄──►│   (Express)     │◄──►│   (PostgreSQL)  │
│                 │    │                 │    │                 │
│ • Dashboard     │    │ • REST API      │    │ • Evaluations   │
│ • Leaderboards  │    │ • WebSocket     │    │ • Agents        │
│ • Analytics     │    │ • Auth          │    │ • Benchmarks    │
│ • Settings      │    │ • Validation    │    │ • Metrics       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌─────────────────┐
                    │   Orchestrator  │
                    │   (LangGraph)   │
                    │                 │
                    │ • Evaluation    │
                    │   Workflows     │
                    │ • Metrics       │
                    │   Collection    │
                    │ • Agent         │
                    │   Coordination  │
                    └─────────────────┘

Core Modules

Evaluation Orchestrator (/src/orchestrator/)

  • EvaluationOrchestrator.ts: LangGraph-based workflow management
  • EnvironmentManager.ts: Environment setup and teardown
  • MetricsCollector.ts: Multi-dimensional metrics collection
  • WebSocketManager.ts: Real-time progress updates
  • ExecutionEngine.ts: Task execution coordination
  • EvaluationQueue.ts: Task queue management

Multi-Environment Agents (/src/agents/)

  • SWE_Bench_Agent.ts: Code generation evaluation
  • GUI_Automation_Agent.ts: GUI-based environments
  • General_Reasoning_Agent.ts: General-purpose benchmarks
  • BaseExecutionAgent.ts: Common agent functionality

API Layer (/src/api/)

  • agents.ts: Agent management endpoints
  • evaluations.ts: Evaluation orchestration
  • benchmarks.ts: Benchmark configuration
  • metrics.ts: Metrics collection and analysis
  • auth.ts: Authentication and authorization
  • orchestrator.ts: Workflow orchestration

Database Layer (/src/database/)

  • models/: TypeScript models for all entities
  • migrations.ts: Database schema migrations
  • connection.ts: PostgreSQL connection pooling
  • seed-*.ts: Database seeding scripts

Frontend Components (/src/components/, /src/pages/)

  • DashboardLayout.tsx: Main application layout
  • RealTimeEvaluations.tsx: Live evaluation monitoring
  • MetricCard.tsx: Metrics visualization
  • TopAgentsChart.tsx: Performance leaderboards

Database Schema

-- Core Tables
users              -- User management and authentication
agents             -- AI agent definitions and configurations
benchmarks         -- Benchmark definitions and datasets
evaluations        -- Evaluation execution records
tasks              -- Individual task execution within evaluations
evaluation_states  -- State tracking for complex workflows
migrations         -- Database migration tracking

Metrics Collection System

HASEB collects comprehensive metrics across five dimensions:

Performance Metrics

  • Task Success Rate: Percentage of successfully completed tasks
  • Completion Time: Total time taken for evaluation
  • First Success Time: Time to first successful completion

Efficiency Metrics

  • Execution Time: Total CPU time used
  • Latency per Step: Average time per evaluation step
  • Total Steps: Number of steps taken
  • Token Efficiency: Tasks completed per token

Cost Metrics

  • Total Tokens: Input + output tokens used
  • Estimated Cost: USD equivalent of API calls
  • Cost per Task: Average cost per completed task
  • Resource Utilization: CPU, memory, storage usage

Robustness Metrics

  • Tool Call Error Rate: Percentage of failed tool calls
  • Recovery Rate: Success rate after errors
  • Error Types: Classification of errors encountered
  • Fallback Usage: How often fallback mechanisms were used

Quality Metrics

  • Tool Selection Accuracy: Correct tool selection rate
  • Parameter Accuracy: Correct parameter usage rate
  • Output Relevance: Relevance score of outputs
  • Output Completeness: Completeness score of outputs
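
Several of the metrics above are simple ratios over per-task records. The sketch below shows one way they could be derived; the `TaskRecord` fields and the `summarize` helper are hypothetical names chosen for illustration and are not taken from the HASEB source.

```typescript
// Hypothetical per-task record; field names are illustrative, not HASEB's schema.
interface TaskRecord {
  succeeded: boolean;
  steps: number;
  durationMs: number;
  tokens: number;
  costUsd: number;
  toolCalls: number;
  failedToolCalls: number;
}

interface MetricsSummary {
  taskSuccessRate: number;   // completed tasks / all tasks
  latencyPerStepMs: number;  // total duration / total steps
  tokenEfficiency: number;   // completed tasks per token
  costPerTaskUsd: number;    // total cost / completed tasks
  toolCallErrorRate: number; // failed tool calls / all tool calls
}

// Divide, returning 0 instead of NaN or Infinity when the denominator is 0.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function summarize(tasks: TaskRecord[]): MetricsSummary {
  const sum = (pick: (t: TaskRecord) => number): number =>
    tasks.reduce((acc, t) => acc + pick(t), 0);
  const completed = tasks.filter((t) => t.succeeded).length;

  return {
    taskSuccessRate: ratio(completed, tasks.length),
    latencyPerStepMs: ratio(sum((t) => t.durationMs), sum((t) => t.steps)),
    tokenEfficiency: ratio(completed, sum((t) => t.tokens)),
    costPerTaskUsd: ratio(sum((t) => t.costUsd), completed),
    toolCallErrorRate: ratio(sum((t) => t.failedToolCalls), sum((t) => t.toolCalls)),
  };
}

const summary = summarize([
  { succeeded: true, steps: 10, durationMs: 400, tokens: 3000, costUsd: 0.06, toolCalls: 8, failedToolCalls: 1 },
  { succeeded: false, steps: 30, durationMs: 600, tokens: 5000, costUsd: 0.1, toolCalls: 12, failedToolCalls: 4 },
]);
console.log(summary);
```

Returning 0 for an empty denominator is a design choice of this sketch; a real collector might prefer to report such metrics as missing.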

📊 Supported Benchmarks

| Benchmark  | Type              | Description                                         | Tasks  | Data Path        |
|------------|-------------------|-----------------------------------------------------|--------|------------------|
| SWE-bench  | Code Generation   | Real-world software engineering tasks from GitHub   | 2,294  | ./data/swe-bench |
| GAIA       | General Reasoning | General AI Assistant tasks across domains           | 1,000+ | ./data/gaia      |
| OSWorld    | GUI Automation    | Operating system interaction and desktop automation | 300+   | ./data/osworld   |
| WebArena   | Web Automation    | Web-based task completion and browser automation    | 800+   | ./data/webarena  |
| AgentBench | General Purpose   | Multi-domain agent evaluation suite                 | 500+   | Custom           |

🔧 Configuration

Environment Variables

Create a .env file in the project root with the following variables:

Required Configuration

# Server Configuration
NODE_ENV=development
PORT=3000
API_BASE_URL=http://localhost:3000

# Database Configuration (Required)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=haseb
DB_USER=postgres
DB_PASSWORD=password
DB_SSL=false
DB_MAX_CONNECTIONS=20
DB_IDLE_TIMEOUT=30000

# JWT Configuration (Required for production)
JWT_SECRET=your-super-secret-jwt-key-change-this-in-production
JWT_EXPIRES_IN=24h
JWT_REFRESH_EXPIRES_IN=7d
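
The database variables above could be validated at startup so that a missing value fails fast rather than surfacing as a connection error later. The sketch below is illustrative only: `loadDbPoolOptions` and its helpers are hypothetical names, and the returned shape assumes the backend uses node-postgres, whose `Pool` accepts `max` and `idleTimeoutMillis`. This README does not state which driver is used.

```typescript
type Env = Record<string, string | undefined>;

// Shape chosen to line up with node-postgres Pool options (assumed driver).
interface DbPoolOptions {
  host: string;
  port: number;
  database: string;
  user: string;
  password: string;
  ssl: boolean;
  max: number;               // from DB_MAX_CONNECTIONS
  idleTimeoutMillis: number; // from DB_IDLE_TIMEOUT
}

// Read a variable, failing loudly when it is missing or empty.
function required(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Parse an integer variable, falling back to a default when it is unset.
function integer(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed)) {
    throw new Error(`${name} must be an integer, got "${raw}"`);
  }
  return parsed;
}

function loadDbPoolOptions(env: Env): DbPoolOptions {
  return {
    host: required(env, "DB_HOST"),
    port: integer(env, "DB_PORT", 5432),
    database: required(env, "DB_NAME"),
    user: required(env, "DB_USER"),
    password: required(env, "DB_PASSWORD"),
    ssl: env["DB_SSL"] === "true",
    max: integer(env, "DB_MAX_CONNECTIONS", 20),
    idleTimeoutMillis: integer(env, "DB_IDLE_TIMEOUT", 30000),
  };
}

const options = loadDbPoolOptions({
  DB_HOST: "localhost",
  DB_NAME: "haseb",
  DB_USER: "postgres",
  DB_PASSWORD: "password",
  DB_SSL: "false",
});
console.log(options.port, options.max, options.ssl);
```

In an application the function would be called with `process.env`; passing the environment as a parameter keeps the loader easy to test.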

Optional Configuration

# CORS Configuration
CORS_ORIGIN=http://localhost:3000

# Rate Limiting
RATE_LIMIT_WINDOW_MS=900000
RATE_LIMIT_MAX_REQUESTS=100

# Logging
LOG_LEVEL=info
LOG_FILE_PATH=logs

# External API Keys (if needed for integrations)
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key

# Benchmark Data Paths
SWE_BENCH_DATA_PATH=./data/swe-bench
GAIA_DATA_PATH=./data/gaia
OSWORLD_DATA_PATH=./data/osworld
WEBARENA_DATA_PATH=./data/webarena

# File Upload Configuration
MAX_FILE_SIZE=10MB
UPLOAD_PATH=./uploads

# Email Configuration (optional)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your-email@gmail.com
SMTP_PASS=your-app-password

# Security
BCRYPT_ROUNDS=12
SESSION_SECRET=your-session-secret-change-this

# Monitoring (optional)
SENTRY_DSN=your-sentry-dsn
PROMETHEUS_PORT=9090

# Cache Configuration (optional)
REDIS_URL=redis://localhost:6379
CACHE_TTL=3600
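
To make the rate-limiting values concrete: `RATE_LIMIT_WINDOW_MS=900000` with `RATE_LIMIT_MAX_REQUESTS=100` reads as at most 100 requests per client per 15-minute window. The sketch below is one plausible interpretation using a fixed-window counter; `FixedWindowRateLimiter` is a hypothetical class, and this README does not say which middleware or algorithm HASEB actually uses.

```typescript
// Fixed-window counter keyed by client (for example, an IP address).
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly windowMs: number;    // RATE_LIMIT_WINDOW_MS
  private readonly maxRequests: number; // RATE_LIMIT_MAX_REQUESTS

  constructor(windowMs: number, maxRequests: number) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
  }

  // Returns true if the request is allowed, false if the client is over the limit.
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (current === undefined || now - current.start >= this.windowMs) {
      // First request from this client, or the previous window has expired.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.maxRequests) return false;
    current.count += 1;
    return true;
  }
}

// 900000 ms = 15 minutes, matching the defaults listed above.
const limiter = new FixedWindowRateLimiter(900000, 100);
console.log(limiter.allow("203.0.113.7"));
```

This sketch never evicts idle clients from the map, so a long-running server would need periodic cleanup or a store with expiry.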

Database Setup

  1. Install PostgreSQL

    # Ubuntu/Debian
    sudo apt update && sudo apt install postgresql postgresql-contrib
    
    # macOS (install a version that satisfies the >= 15 requirement)
    brew install postgresql@15
    brew services start postgresql@15
    
    # Windows
    # Download from postgresql.org and follow installation guide
  2. Create Database and User

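    # Connect to PostgreSQL
    sudo -u postgres psql
    
    # In PostgreSQL shell:
    CREATE DATABASE haseb;
    CREATE USER haseb_user WITH PASSWORD 'your_password';
    GRANT ALL PRIVILEGES ON DATABASE haseb TO haseb_user;
    -- PostgreSQL 15+ does not let non-owners create objects in the public schema by default
    \c haseb
    GRANT ALL ON SCHEMA public TO haseb_user;
    \q
    
    # Then set DB_USER=haseb_user and DB_PASSWORD=your_password in .env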
  3. Run Database Migrations

    npm run migrate
  4. Verify Database Connection

    npm run dev:backend
    # Check logs for "Database connected successfully"
    curl http://localhost:3000/health

Authentication Setup

For production deployment, configure JWT authentication:

JWT_SECRET=generate-secure-random-string-here
JWT_EXPIRES_IN=24h
BCRYPT_ROUNDS=12

Generate a secure JWT secret:

node -e "console.log(require('crypto').randomBytes(64).toString('hex'))"
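
To illustrate what `JWT_SECRET` and `JWT_EXPIRES_IN` control, here is a minimal HS256 sign-and-verify sketch built only on `node:crypto`. The `signToken` and `verifyToken` functions are hypothetical; HASEB's own auth code is not shown in this README and most likely relies on a maintained JWT library, which is the better choice for production.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Encode a JSON-serializable value as base64url, the encoding JWT segments use.
function encodeSegment(value: unknown): string {
  return Buffer.from(JSON.stringify(value)).toString("base64url");
}

function hmacSha256(data: string, secret: string): Buffer {
  return createHmac("sha256", secret).update(data).digest();
}

// Issue an HS256 token that expires `expiresInSeconds` after `nowSeconds`.
function signToken(
  claims: Record<string, unknown>,
  secret: string,
  expiresInSeconds: number,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): string {
  const header = encodeSegment({ alg: "HS256", typ: "JWT" });
  const payload = encodeSegment({ ...claims, iat: nowSeconds, exp: nowSeconds + expiresInSeconds });
  const signature = hmacSha256(`${header}.${payload}`, secret).toString("base64url");
  return `${header}.${payload}.${signature}`;
}

// Return the claims if the signature matches and the token is unexpired, else null.
function verifyToken(
  token: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): Record<string, unknown> | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, payload, signature] = parts;

  const expected = hmacSha256(`${header}.${payload}`, secret);
  const actual = Buffer.from(signature, "base64url");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) return null;

  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
  if (typeof claims.exp !== "number" || claims.exp <= nowSeconds) return null;
  return claims;
}

// JWT_EXPIRES_IN=24h corresponds to 24 * 60 * 60 seconds.
const demoToken = signToken({ sub: "user-1" }, "replace-with-JWT_SECRET", 24 * 60 * 60);
console.log(verifyToken(demoToken, "replace-with-JWT_SECRET") !== null);
```

The verifier always recomputes an HMAC-SHA256 signature regardless of the header's `alg` value, so a token claiming a different algorithm is rejected instead of being trusted.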

📖 Documentation

API Documentation

User Guides

Developer Documentation

  • Architecture Overview: ARCHITECTURE.md
  • Database Schema: See src/database/migrations.ts
  • Testing Guide: See Testing section below

🧪 Testing

Running Tests

# Run all tests
npm test

# Run tests with coverage
npm run test:coverage

# Run specific test suites
npm run test:unit          # Unit tests only
npm run test:integration   # Integration tests only
npm run test:e2e          # End-to-end tests (Playwright)
npm run test:performance  # Performance benchmarks
npm run test:security     # Security tests

# Watch mode for development
npm run test:watch

# Backend-specific tests
npm run test:backend

Test Structure

tests/
β”œβ”€β”€ unit/                  # Unit tests
β”‚   β”œβ”€β”€ agents/           # Agent logic tests
β”‚   β”œβ”€β”€ api/              # API endpoint tests
β”‚   β”œβ”€β”€ services/         # Service layer tests
β”‚   β”œβ”€β”€ utils/            # Utility function tests
β”‚   β”œβ”€β”€ hooks/            # React hook tests
β”‚   β”œβ”€β”€ store/            # State management tests
β”‚   └── database/         # Database model tests
β”œβ”€β”€ integration/          # Integration tests
β”‚   β”œβ”€β”€ database/         # Database integration
β”‚   β”œβ”€β”€ metrics-system/   # Metrics collection
β”‚   └── multi-agent/      # Multi-agent workflows
β”œβ”€β”€ e2e/                  # End-to-end tests
β”‚   β”œβ”€β”€ dashboard/        # Dashboard UI tests
β”‚   β”œβ”€β”€ evaluations/      # Evaluation workflow
β”‚   └── agents-workflow/  # Agent workflow tests
β”œβ”€β”€ performance/          # Performance tests
└── security/             # Security tests

Coverage Requirements

  • Minimum Coverage: 90%
  • Critical Path Coverage: 95%
  • API Endpoint Coverage: 100%
  • Database Model Coverage: 95%

Test Database Setup

Tests use a separate database configuration:

# Start test database
docker-compose -f docker-compose.test.yml up -d

# Run tests with test database
NODE_ENV=test npm test

🚀 Deployment

Production Deployment

  1. Environment Setup

    export NODE_ENV=production
    
    # Build the application
    npm run build
  2. Database Setup

    # Run production migrations
    npm run migrate
    
    # Seed production data (optional)
    npm run seed
  3. Start Production Server

    npm start

Docker Deployment

# Build production image
docker build -t haseb:latest .

# Run with environment variables
docker run -d \
  --name haseb \
  -p 3000:3000 \
  -e NODE_ENV=production \
  -e DATABASE_URL=postgresql://... \
  haseb:latest

Cloud Deployment Guides

📊 Metrics & Analytics

Real-Time Monitoring

  • WebSocket Updates: Live evaluation progress at ws://localhost:3000
  • Dashboard Metrics: Real-time performance charts
  • Health Monitoring: System health at /health
  • API Metrics: Request/response tracking

Performance Analytics

The system tracks comprehensive metrics across all evaluations:

// Example metrics structure
{
  performance: {
    taskSuccessRate: 0.85,
    executionTime: 1200,
    firstSuccessTime: 800
  },
  efficiency: {
    totalSteps: 45,
    latencyPerStep: 26.7,
    totalTokens: 15000
  },
  cost: {
    estimatedCost: 0.25,
    costPerTask: 0.05,
    resourceUtilization: 0.67
  },
  robustness: {
    toolCallErrorRate: 0.12,
    recoveryRate: 0.89,
    errorTypes: ['timeout', 'api_limit']
  },
  quality: {
    toolSelectionAccuracy: 0.92,
    parameterAccuracy: 0.88,
    outputRelevance: 0.91
  }
}
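
A typed version of this structure makes malformed payloads easier to catch. The interface below is inferred from the example object alone, and `validateMetrics` is a hypothetical helper; the real models in `src/database/models/` may differ. The sketch assumes that rate and accuracy fields lie between 0 and 1, and that `latencyPerStep` equals `executionTime / totalSteps`, which holds for the example values (1200 / 45 ≈ 26.7).

```typescript
// Types inferred from the example object; the real models may differ.
interface EvaluationMetrics {
  performance: { taskSuccessRate: number; executionTime: number; firstSuccessTime: number };
  efficiency: { totalSteps: number; latencyPerStep: number; totalTokens: number };
  cost: { estimatedCost: number; costPerTask: number; resourceUtilization: number };
  robustness: { toolCallErrorRate: number; recoveryRate: number; errorTypes: string[] };
  quality: { toolSelectionAccuracy: number; parameterAccuracy: number; outputRelevance: number };
}

// Collect human-readable problems instead of throwing on the first one.
function validateMetrics(m: EvaluationMetrics): string[] {
  const problems: string[] = [];
  const rates: Array<[string, number]> = [
    ["performance.taskSuccessRate", m.performance.taskSuccessRate],
    ["robustness.toolCallErrorRate", m.robustness.toolCallErrorRate],
    ["robustness.recoveryRate", m.robustness.recoveryRate],
    ["quality.toolSelectionAccuracy", m.quality.toolSelectionAccuracy],
    ["quality.parameterAccuracy", m.quality.parameterAccuracy],
    ["quality.outputRelevance", m.quality.outputRelevance],
  ];
  for (const [name, value] of rates) {
    // ASSUMPTION: these fields are fractions in the range 0..1.
    if (!(value >= 0 && value <= 1)) problems.push(`${name} must be between 0 and 1`);
  }
  // ASSUMPTION: latencyPerStep is executionTime divided by totalSteps.
  if (m.efficiency.totalSteps > 0) {
    const derived = m.performance.executionTime / m.efficiency.totalSteps;
    if (Math.abs(derived - m.efficiency.latencyPerStep) > 0.05) {
      problems.push("efficiency.latencyPerStep does not match executionTime / totalSteps");
    }
  }
  return problems;
}

const exampleMetrics: EvaluationMetrics = {
  performance: { taskSuccessRate: 0.85, executionTime: 1200, firstSuccessTime: 800 },
  efficiency: { totalSteps: 45, latencyPerStep: 26.7, totalTokens: 15000 },
  cost: { estimatedCost: 0.25, costPerTask: 0.05, resourceUtilization: 0.67 },
  robustness: { toolCallErrorRate: 0.12, recoveryRate: 0.89, errorTypes: ["timeout", "api_limit"] },
  quality: { toolSelectionAccuracy: 0.92, parameterAccuracy: 0.88, outputRelevance: 0.91 },
};
console.log(validateMetrics(exampleMetrics));
```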

🔌 API Reference

Base URL

Development: http://localhost:3000/api
Production:  https://your-domain.com/api

Authentication

# Login (if auth is enabled)
POST /api/auth/login
{
  "email": "user@example.com",
  "password": "password123"
}

# Get current user
GET /api/auth/me
Authorization: Bearer <jwt_token>
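
A client might chain these two calls as sketched below. The `login` and `getCurrentUser` helpers are hypothetical, and the assumption that the login response carries the JWT in a `token` field is not confirmed by this README. The fetch implementation is injected so the sketch can run against a stub.

```typescript
// Minimal subset of the Fetch API this sketch relies on, so a stub can be injected.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// POST credentials to /auth/login and return the JWT from the response.
async function login(
  fetchImpl: FetchLike,
  baseUrl: string,
  email: string,
  password: string,
): Promise<string> {
  const response = await fetchImpl(`${baseUrl}/auth/login`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ email, password }),
  });
  if (!response.ok) throw new Error(`Login failed with status ${response.status}`);
  const body = (await response.json()) as { token?: unknown };
  // ASSUMPTION: the JWT is returned in a `token` field.
  if (typeof body.token !== "string") throw new Error("Login response did not contain a token");
  return body.token;
}

// GET /auth/me with the token in the Authorization header.
async function getCurrentUser(
  fetchImpl: FetchLike,
  baseUrl: string,
  token: string,
): Promise<unknown> {
  const response = await fetchImpl(`${baseUrl}/auth/me`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
  return response.json();
}
```

On Node.js 18+ the built-in `fetch` can be passed as `fetchImpl`, with `http://localhost:3000/api` as `baseUrl` in development.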

Core Endpoints

# Agents
GET    /api/agents              # List agents
POST   /api/agents              # Create agent
GET    /api/agents/:id          # Get agent details
PUT    /api/agents/:id          # Update agent
DELETE /api/agents/:id          # Delete agent

# Evaluations
GET    /api/evaluations         # List evaluations
POST   /api/evaluations         # Start evaluation
GET    /api/evaluations/:id     # Get evaluation details
PATCH  /api/evaluations/:id/status  # Update status
PUT    /api/evaluations/:id/metrics  # Update metrics

# Benchmarks
GET    /api/benchmarks          # List benchmarks
POST   /api/benchmarks          # Create benchmark
GET    /api/benchmarks/:id      # Get benchmark details

# Metrics
GET    /api/metrics/dashboard   # Dashboard metrics
GET    /api/metrics/performance # Performance analytics
GET    /api/metrics/leaderboard # Leaderboard data
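
One way to keep client code in sync with this endpoint list is a small route table with path-parameter substitution. The `routes` map and `buildRequest` helper below are hypothetical and cover only a subset of the endpoints. Because the listed paths already include `/api`, the sketch expects the server origin rather than the `/api` base URL.

```typescript
// Route templates copied from the endpoint list above (subset).
const routes = {
  listAgents: { method: "GET", path: "/api/agents" },
  getAgent: { method: "GET", path: "/api/agents/:id" },
  startEvaluation: { method: "POST", path: "/api/evaluations" },
  getEvaluation: { method: "GET", path: "/api/evaluations/:id" },
  updateEvaluationStatus: { method: "PATCH", path: "/api/evaluations/:id/status" },
  leaderboard: { method: "GET", path: "/api/metrics/leaderboard" },
} as const;

type RouteName = keyof typeof routes;

// Substitute `:name` placeholders, URL-encoding each value.
function buildRequest(
  origin: string,
  name: RouteName,
  params: Record<string, string> = {},
): { method: string; url: string } {
  const { method, path } = routes[name];
  const resolved = path.replace(/:([A-Za-z]+)/g, (_match: string, key: string) => {
    const value = params[key];
    if (value === undefined) throw new Error(`Missing path parameter: ${key}`);
    return encodeURIComponent(value);
  });
  return { method, url: `${origin}${resolved}` };
}

console.log(buildRequest("http://localhost:3000", "getEvaluation", { id: "uuid-here" }));
```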

WebSocket API

// Connect to real-time updates
const ws = new WebSocket('ws://localhost:3000');

// Subscribe once the connection is open; calling send() while the
// socket is still connecting throws an InvalidStateError
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'subscribe',
    channel: 'evaluations',
    evaluationId: 'uuid-here'
  }));
};

// Receive real-time updates
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Update:', data);
};
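
Client code often wraps the subscribe message and incoming frames in small typed helpers. The sketch below does that under stated assumptions: `subscribeMessage` and `parseUpdate` are hypothetical names, the message shape is copied from the example above, and the format of server updates is not documented here, so the parser only checks that a frame is a JSON object.

```typescript
// Message shape taken from the subscribe call in the example above.
interface SubscribeMessage {
  type: "subscribe";
  channel: "evaluations";
  evaluationId: string;
}

// Serialize a subscription request for one evaluation.
function subscribeMessage(evaluationId: string): string {
  const message: SubscribeMessage = { type: "subscribe", channel: "evaluations", evaluationId };
  return JSON.stringify(message);
}

// Parse an incoming frame; return null instead of throwing on malformed input.
function parseUpdate(raw: string): Record<string, unknown> | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null;
  }
}

console.log(subscribeMessage("uuid-here"));
```

Returning `null` for unparseable frames keeps a single bad message from crashing the `onmessage` handler.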

🛠️ Development

Development Workflow

  1. Setup Development Environment

    git clone <repository>
    cd haseb
    npm install
    cp .env.example .env
    # Edit .env with your configuration
  2. Start Development Servers

    # Terminal 1: Backend
    npm run dev:backend
    
    # Terminal 2: Frontend
    npm run dev
  3. Run Tests in Watch Mode

    npm run test:watch
  4. Lint and Format Code

    npm run lint:fix
    npm run format

Code Style and Standards

  • TypeScript: Strict mode enabled, 100% type coverage required
  • ESLint: Recommended rules with React and TypeScript plugins
  • Prettier: Standard formatting with 2-space indentation
  • Husky: Pre-commit hooks for quality enforcement
  • Conventional Commits: Standardized commit message format

Adding New Features

  1. Create feature branch: git checkout -b feature/new-feature
  2. Write tests first: Follow TDD methodology
  3. Implement functionality: Write production code
  4. Update documentation: Include API docs and README updates
  5. Run full test suite: Ensure 100% pass rate
  6. Submit pull request: With comprehensive description

Database Changes

  1. Create migration: Add to src/database/migrations.ts
  2. Update models: Modify TypeScript models in src/database/models/
  3. Test migration: Run npm run migrate:test
  4. Update seeds: Modify seed files if needed

🐛 Troubleshooting

Common Issues and Solutions

Database Connection Issues

# Check PostgreSQL status
sudo systemctl status postgresql

# Test connection
psql -h localhost -U username -d haseb

# Reset database
npm run migrate:reset

Port Conflicts

# Find process using port 3000
lsof -i :3000

# Kill process
kill -9 <PID>

# Use different port
PORT=3001 npm run dev

Memory Issues

# Increase Node.js memory limit
NODE_OPTIONS="--max-old-space-size=4096" npm run dev

Frontend Build Issues

# Clear build cache
rm -rf dist node_modules/.vite
npm install
npm run build

Debug Mode

# Enable debug logging
DEBUG=haseb:* npm run dev

# Database query logging
LOG_LEVEL=debug npm run dev

# Verbose test output
DEBUG=haseb:* npm test

Getting Help

📄 License

This project is licensed under the ISC License - see the LICENSE file for details.

🙏 Acknowledgments

  • LangChain Team for the excellent LangGraph framework
  • SWE-bench Team for the benchmark dataset
  • GAIA Team for general reasoning tasks
  • React Community for the amazing frontend framework
  • PostgreSQL Team for the reliable database system
  • OpenAI Community for API integration patterns

🗺️ Roadmap

Version 1.1 (Q1 2024)

  • Additional benchmark integrations (HumanEval, MBPP)
  • Advanced analytics dashboard with custom metrics
  • Custom metrics framework for domain-specific evaluation
  • Agent marketplace and sharing platform

Version 1.2 (Q2 2024)

  • Distributed evaluation support across multiple nodes
  • Advanced visualizations with D3.js integration
  • Performance optimization for large-scale evaluations
  • Mobile responsive design improvements

Version 2.0 (Q3 2024)

  • Multi-cloud deployment support (AWS, GCP, Azure)
  • Advanced agent orchestration with hierarchical workflows
  • Real-time collaboration features for teams
  • Enterprise features with SSO and advanced security

Built with ❤️ by the HASEB Team

Website • Documentation • GitHub
