A unified, open-source evaluation platform for holistically assessing agentic systems across diverse tasks with multi-dimensional "process viability" metrics.
Documentation • Quick Start • Architecture • Benchmarks • Configuration
HASEB (Holistic Agentic System Evaluator & Benchmarking Suite) is a comprehensive evaluation platform designed to assess AI agents across multiple dimensions including performance, efficiency, cost, robustness, and quality. Built with modern web technologies and following SPARC methodology, HASEB provides researchers and developers with the tools needed to systematically evaluate and compare agentic systems.
- Multi-Environment Support: SWE-bench, GAIA, OSWorld, WebArena, AgentBench
- Multi-Dimensional Metrics: Performance, Efficiency, Cost, Robustness, Quality
- Real-Time Monitoring: WebSocket-based live evaluation tracking
- LangGraph Orchestration: Stateful workflow management
- Interactive Dashboard: React-based visualization interface
- PostgreSQL Backend: Scalable metrics storage and analysis
- Enterprise Security: JWT authentication, rate limiting, CORS protection
- Comprehensive Analytics: Real-time leaderboards and trend analysis
Minimum versions required:
- Node.js >= 18.0.0 (tested with 18.19.0+)
- PostgreSQL >= 15.0 (tested with 15.4+)
- npm >= 9.0.0 (tested with 10.2.4+)
- Git >= 2.30.0
- Memory: Minimum 4GB RAM (8GB+ recommended)
- Storage: Minimum 10GB free space
- OS: Linux (Ubuntu 20.04+), macOS (12+), or Windows 10+
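The version minimums above can be checked programmatically. The following is a minimal sketch, not part of the HASEB codebase: `parseVersion`, `meetsMinimum`, and `MINIMUMS` are hypothetical names, and the comparison only handles plain dotted numeric versions (no pre-release tags).

```typescript
// Hypothetical helpers for illustration; not part of HASEB.

// Turn "v18.19.0" or "18.19.0" into [18, 19, 0].
function parseVersion(version: string): number[] {
  return version
    .trim()
    .replace(/^v/, "") // `node --version` prints a leading "v"
    .split(".")
    .map((part) => {
      const n = Number.parseInt(part, 10);
      return Number.isNaN(n) ? 0 : n;
    });
}

// True when `actual` is greater than or equal to `minimum`,
// comparing component by component and treating missing parts as 0.
function meetsMinimum(actual: string, minimum: string): boolean {
  const a = parseVersion(actual);
  const m = parseVersion(minimum);
  const length = Math.max(a.length, m.length);
  for (let i = 0; i < length; i++) {
    const left = a[i] ?? 0;
    const right = m[i] ?? 0;
    if (left !== right) return left > right;
  }
  return true; // all components equal
}

// Minimums copied from the list above.
const MINIMUMS = {
  node: "18.0.0",
  postgresql: "15.0",
  npm: "9.0.0",
  git: "2.30.0",
} as const;

console.log(meetsMinimum("v18.19.0", MINIMUMS.node));
console.log(meetsMinimum("2.9.0", MINIMUMS.git));
```

A numeric comparison matters here: a plain string comparison would rank Git 2.9.0 above 2.30.0.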
- Clone the repository

      git clone https://github.com/your-org/haseb.git
      cd haseb

- Verify Node.js version

      node --version   # Should be >= 18.0.0
      npm --version    # Should be >= 9.0.0

- Install dependencies

      npm install

- Set up environment variables

      cp .env.example .env
      # Edit .env with your configuration - see Configuration section below

- Set up PostgreSQL database

      # Verify PostgreSQL is running
      pg_isready
      # Create database
      createdb haseb
      # Run migrations
      npm run migrate
      # Seed test data (optional)
      npm run seed:test

- Start the development servers

      # Terminal 1: Start backend server
      npm run dev:backend
      # Terminal 2: Start frontend development server
      npm run dev

- Verify the installation

      # Check health endpoint
      curl http://localhost:3000/health
      # Check API documentation
      curl http://localhost:3000/api-docs

- Access the application
  - Frontend: http://localhost:3000
  - API Documentation: http://localhost:3000/api-docs
  - Health Check: http://localhost:3000/health
  - API Root: http://localhost:3000/
# Start PostgreSQL for testing
docker-compose -f docker-compose.test.yml up -d
# Run the application
npm run dev:backend

+-----------------+      +-----------------+      +-----------------+
|    Frontend     |      |   Backend API   |      |    Database     |
|   (React 19)    |<---->|    (Express)    |<---->|  (PostgreSQL)   |
|                 |      |                 |      |                 |
| • Dashboard     |      | • REST API      |      | • Evaluations   |
| • Leaderboards  |      | • WebSocket     |      | • Agents        |
| • Analytics     |      | • Auth          |      | • Benchmarks    |
| • Settings      |      | • Validation    |      | • Metrics       |
+-----------------+      +-----------------+      +-----------------+
         |                        |                        |
         +------------------------+------------------------+
                                  |
                         +-----------------+
                         |  Orchestrator   |
                         |   (LangGraph)   |
                         |                 |
                         | • Evaluation    |
                         |   Workflows     |
                         | • Metrics       |
                         |   Collection    |
                         | • Agent         |
                         |   Coordination  |
                         +-----------------+
- EvaluationOrchestrator.ts: LangGraph-based workflow management
- EnvironmentManager.ts: Environment setup and teardown
- MetricsCollector.ts: Multi-dimensional metrics collection
- WebSocketManager.ts: Real-time progress updates
- ExecutionEngine.ts: Task execution coordination
- EvaluationQueue.ts: Task queue management
- SWE_Bench_Agent.ts: Code generation evaluation
- GUI_Automation_Agent.ts: GUI-based environments
- General_Reasoning_Agent.ts: General-purpose benchmarks
- BaseExecutionAgent.ts: Common agent functionality
- agents.ts: Agent management endpoints
- evaluations.ts: Evaluation orchestration
- benchmarks.ts: Benchmark configuration
- metrics.ts: Metrics collection and analysis
- auth.ts: Authentication and authorization
- orchestrator.ts: Workflow orchestration
- models/: TypeScript models for all entities
- migrations.ts: Database schema migrations
- connection.ts: PostgreSQL connection pooling
- seed-*.ts: Database seeding scripts
- DashboardLayout.tsx: Main application layout
- RealTimeEvaluations.tsx: Live evaluation monitoring
- MetricCard.tsx: Metrics visualization
- TopAgentsChart.tsx: Performance leaderboards
-- Core Tables
users -- User management and authentication
agents -- AI agent definitions and configurations
benchmarks -- Benchmark definitions and datasets
evaluations -- Evaluation execution records
tasks -- Individual task execution within evaluations
evaluation_states -- State tracking for complex workflows
migrations -- Database migration tracking

HASEB collects comprehensive metrics across five dimensions:
- Task Success Rate: Percentage of successfully completed tasks
- Completion Time: Total time taken for evaluation
- First Success Time: Time to first successful completion
- Execution Time: Total CPU time used
- Latency per Step: Average time per evaluation step
- Total Steps: Number of steps taken
- Token Efficiency: Tasks completed per token
- Total Tokens: Input + output tokens used
- Estimated Cost: USD equivalent of API calls
- Cost per Task: Average cost per completed task
- Resource Utilization: CPU, memory, storage usage
- Tool Call Error Rate: Percentage of failed tool calls
- Recovery Rate: Success rate after errors
- Error Types: Classification of errors encountered
- Fallback Usage: How often fallback mechanisms were used
- Tool Selection Accuracy: Correct tool selection rate
- Parameter Accuracy: Correct parameter usage rate
- Output Relevance: Relevance score of outputs
- Output Completeness: Completeness score of outputs
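Several of the metrics listed above are simple ratios over per-task records. The sketch below shows one way they could be derived. It is illustrative only: the `TaskRecord` shape, its field names, and `summarize` are assumptions rather than HASEB's actual data model, and rates are returned as fractions (0 to 1) rather than percentages.

```typescript
// Hypothetical per-task record; field names are illustrative, not HASEB's schema.
interface TaskRecord {
  succeeded: boolean;
  steps: number;
  durationMs: number;
  tokens: number; // input + output tokens
  costUsd: number;
  toolCalls: number;
  failedToolCalls: number;
}

interface Summary {
  taskSuccessRate: number; // completed tasks / all tasks
  latencyPerStepMs: number; // total duration / total steps
  tokenEfficiency: number; // completed tasks per token
  costPerTask: number; // total cost / completed tasks
  toolCallErrorRate: number; // failed tool calls / all tool calls
}

// Divide, returning 0 for a zero denominator so empty runs do not produce NaN.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function summarize(tasks: TaskRecord[]): Summary {
  const completed = tasks.filter((t) => t.succeeded).length;
  const sum = (pick: (t: TaskRecord) => number): number =>
    tasks.reduce((total, t) => total + pick(t), 0);

  return {
    taskSuccessRate: ratio(completed, tasks.length),
    latencyPerStepMs: ratio(sum((t) => t.durationMs), sum((t) => t.steps)),
    tokenEfficiency: ratio(completed, sum((t) => t.tokens)),
    costPerTask: ratio(sum((t) => t.costUsd), completed),
    toolCallErrorRate: ratio(sum((t) => t.failedToolCalls), sum((t) => t.toolCalls)),
  };
}

const exampleTasks: TaskRecord[] = [
  { succeeded: true, steps: 10, durationMs: 2000, tokens: 1000, costUsd: 0.5, toolCalls: 8, failedToolCalls: 1 },
  { succeeded: false, steps: 10, durationMs: 4000, tokens: 3000, costUsd: 1.5, toolCalls: 12, failedToolCalls: 4 },
];

console.log(summarize(exampleTasks));
```

Note that cost per task divides by completed tasks only, matching the definition above, so failed runs raise the figure instead of diluting it.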
| Benchmark | Type | Description | Tasks | Data Path |
|---|---|---|---|---|
| SWE-bench | Code Generation | Real-world software engineering tasks from GitHub | 2,294 | ./data/swe-bench |
| GAIA | General Reasoning | General AI Assistant tasks across domains | 1,000+ | ./data/gaia |
| OSWorld | GUI Automation | Operating system interaction and desktop automation | 300+ | ./data/osworld |
| WebArena | Web Automation | Web-based task completion and browser automation | 800+ | ./data/webarena |
| AgentBench | General Purpose | Multi-domain agent evaluation suite | 500+ | Custom |
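The data paths in the table correspond to the `*_DATA_PATH` variables in this README's `.env` example. As a rough sketch, a lookup that prefers the environment variable and falls back to the table's default might look like the following. `BENCHMARKS` and `resolveDataPath` are hypothetical names, and AgentBench is omitted because its path is listed as custom.

```typescript
// Hypothetical registry for illustration; not HASEB's actual implementation.
type BenchmarkName = "SWE-bench" | "GAIA" | "OSWorld" | "WebArena";

interface BenchmarkInfo {
  type: string;
  envVar: string; // variable name from the .env example
  defaultPath: string; // data path from the table above
}

const BENCHMARKS: Record<BenchmarkName, BenchmarkInfo> = {
  "SWE-bench": { type: "Code Generation", envVar: "SWE_BENCH_DATA_PATH", defaultPath: "./data/swe-bench" },
  GAIA: { type: "General Reasoning", envVar: "GAIA_DATA_PATH", defaultPath: "./data/gaia" },
  OSWorld: { type: "GUI Automation", envVar: "OSWORLD_DATA_PATH", defaultPath: "./data/osworld" },
  WebArena: { type: "Web Automation", envVar: "WEBARENA_DATA_PATH", defaultPath: "./data/webarena" },
};

// Use the environment override when it is set and non-empty, else the default.
function resolveDataPath(
  name: BenchmarkName,
  env: Record<string, string | undefined>,
): string {
  const info = BENCHMARKS[name];
  const override = env[info.envVar];
  return override !== undefined && override.length > 0 ? override : info.defaultPath;
}

console.log(resolveDataPath("GAIA", {}));
console.log(resolveDataPath("GAIA", { GAIA_DATA_PATH: "/mnt/datasets/gaia" }));
```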
Create a .env file in the project root with the following variables:
# Server Configuration
NODE_ENV=development
PORT=3000
API_BASE_URL=http://localhost:3000
# Database Configuration (Required)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=haseb
DB_USER=postgres
DB_PASSWORD=password
DB_SSL=false
DB_MAX_CONNECTIONS=20
DB_IDLE_TIMEOUT=30000
# JWT Configuration (Required for production)
JWT_SECRET=your-super-secret-jwt-key-change-this-in-production
JWT_EXPIRES_IN=24h
JWT_REFRESH_EXPIRES_IN=7d
# CORS Configuration
CORS_ORIGIN=http://localhost:3000
# Rate Limiting
RATE_LIMIT_WINDOW_MS=900000
RATE_LIMIT_MAX_REQUESTS=100
# Logging
LOG_LEVEL=info
LOG_FILE_PATH=logs
# External API Keys (if needed for integrations)
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
# Benchmark Data Paths
SWE_BENCH_DATA_PATH=./data/swe-bench
GAIA_DATA_PATH=./data/gaia
OSWORLD_DATA_PATH=./data/osworld
WEBARENA_DATA_PATH=./data/webarena
# File Upload Configuration
MAX_FILE_SIZE=10MB
UPLOAD_PATH=./uploads
# Email Configuration (optional)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your-email@gmail.com
SMTP_PASS=your-app-password
# Security
BCRYPT_ROUNDS=12
SESSION_SECRET=your-session-secret-change-this
# Monitoring (optional)
SENTRY_DSN=your-sentry-dsn
PROMETHEUS_PORT=9090
# Cache Configuration (optional)
REDIS_URL=redis://localhost:6379
CACHE_TTL=3600
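Every value in the `.env` file arrives as a string, so numeric and boolean settings need parsing and validation before use. The sketch below illustrates one approach for a handful of the variables above. The helper names (`readInt`, `readBool`, `parseSize`, `loadConfig`) are hypothetical and not taken from the HASEB source, and binary multiples (1 MB = 1024 × 1024 bytes) are an assumption about how `MAX_FILE_SIZE` is interpreted.

```typescript
// Hypothetical config helpers for illustration; not HASEB's actual loader.
type Env = Record<string, string | undefined>;

// Read an integer variable, falling back when unset and failing loudly on bad input.
function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value)) {
    throw new Error(`${key} must be an integer, got "${raw}"`);
  }
  return value;
}

// Only the literal string "true" (case-insensitive) enables a flag.
function readBool(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key];
  if (raw === undefined) return fallback;
  return raw.trim().toLowerCase() === "true";
}

// Convert sizes such as "10MB" to bytes (binary multiples assumed).
function parseSize(raw: string): number {
  const match = /^(\d+)\s*(B|KB|MB|GB)$/i.exec(raw.trim());
  if (match === null) throw new Error(`Unrecognized size: "${raw}"`);
  const [, digits = "0", unit = "B"] = match;
  const units: Record<string, number> = { B: 1, KB: 1024, MB: 1024 ** 2, GB: 1024 ** 3 };
  return Number(digits) * (units[unit.toUpperCase()] ?? 1);
}

function loadConfig(env: Env) {
  return {
    port: readInt(env, "PORT", 3000),
    db: {
      host: env["DB_HOST"] ?? "localhost",
      port: readInt(env, "DB_PORT", 5432),
      ssl: readBool(env, "DB_SSL", false),
      maxConnections: readInt(env, "DB_MAX_CONNECTIONS", 20),
    },
    rateLimit: {
      windowMs: readInt(env, "RATE_LIMIT_WINDOW_MS", 900000), // 900000 ms = 15 minutes
      maxRequests: readInt(env, "RATE_LIMIT_MAX_REQUESTS", 100),
    },
    maxFileSizeBytes: parseSize(env["MAX_FILE_SIZE"] ?? "10MB"),
  };
}

const sampleConfig = loadConfig({ PORT: "3001", DB_SSL: "true", MAX_FILE_SIZE: "10MB" });
console.log(sampleConfig);
```

In a Node.js process the same function could be called as `loadConfig(process.env)`. Failing at startup on a malformed value is usually easier to diagnose than a silent fallback.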
- Install PostgreSQL

      # Ubuntu/Debian
      sudo apt update && sudo apt install postgresql postgresql-contrib
      # macOS
      brew install postgresql
      brew services start postgresql
      # Windows
      # Download from postgresql.org and follow installation guide

- Create Database and User

      # Connect to PostgreSQL
      sudo -u postgres psql
      # In PostgreSQL shell:
      CREATE DATABASE haseb;
      CREATE USER haseb_user WITH PASSWORD 'your_password';
      GRANT ALL PRIVILEGES ON DATABASE haseb TO haseb_user;
      \q

- Run Database Migrations

      npm run migrate

- Verify Database Connection

      npm run dev:backend
      # Check logs for "Database connected successfully"
      curl http://localhost:3000/health

For production deployment, configure JWT authentication:
JWT_SECRET=generate-secure-random-string-here
JWT_EXPIRES_IN=24h
BCRYPT_ROUNDS=12

Generate a secure JWT secret:
node -e "console.log(require('crypto').randomBytes(64).toString('hex'))"

- Interactive API Docs: http://localhost:3000/api-docs (Swagger UI)
- OpenAPI Specification: Available at /api-docs/json
- REST API Reference: See API_DOCUMENTATION.md
- Installation Guide: INSTALLATION.md
- Demo Walkthrough: DEMO.md
- Troubleshooting: TROUBLESHOOTING.md
- Architecture Overview: ARCHITECTURE.md
- Database Schema: See src/database/migrations.ts
- Testing Guide: See Testing section below
# Run all tests
npm test
# Run tests with coverage
npm run test:coverage
# Run specific test suites
npm run test:unit # Unit tests only
npm run test:integration # Integration tests only
npm run test:e2e # End-to-end tests (Playwright)
npm run test:performance # Performance benchmarks
npm run test:security # Security tests
# Watch mode for development
npm run test:watch
# Backend-specific tests
npm run test:backend

tests/
├── unit/                 # Unit tests
│   ├── agents/           # Agent logic tests
│   ├── api/              # API endpoint tests
│   ├── services/         # Service layer tests
│   ├── utils/            # Utility function tests
│   ├── hooks/            # React hook tests
│   ├── store/            # State management tests
│   └── database/         # Database model tests
├── integration/          # Integration tests
│   ├── database/         # Database integration
│   ├── metrics-system/   # Metrics collection
│   └── multi-agent/      # Multi-agent workflows
├── e2e/                  # End-to-end tests
│   ├── dashboard/        # Dashboard UI tests
│   ├── evaluations/      # Evaluation workflow
│   └── agents-workflow/  # Agent workflow tests
├── performance/          # Performance tests
└── security/             # Security tests

- Minimum Coverage: 90%
- Critical Path Coverage: 95%
- API Endpoint Coverage: 100%
- Database Model Coverage: 95%
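The thresholds above could be enforced with a small check in CI. The sketch below is illustrative only: `THRESHOLDS` and `findShortfalls` are hypothetical names, and how measured percentages are obtained from the coverage tool is left open, since this README does not name the test runner.

```typescript
// Coverage thresholds from the list above, in percent.
const THRESHOLDS = {
  overall: 90,
  criticalPath: 95,
  apiEndpoints: 100,
  databaseModels: 95,
} as const;

type Area = keyof typeof THRESHOLDS;

// Return one message per area whose measured coverage is below its threshold.
function findShortfalls(measured: Record<Area, number>): string[] {
  return (Object.keys(THRESHOLDS) as Area[])
    .filter((area) => measured[area] < THRESHOLDS[area])
    .map((area) => `${area}: ${measured[area]}% is below the required ${THRESHOLDS[area]}%`);
}

// Example figures, invented for the demonstration.
const shortfalls = findShortfalls({
  overall: 92.5,
  criticalPath: 94,
  apiEndpoints: 100,
  databaseModels: 95,
});

console.log(shortfalls);
```

A CI step could exit with a non-zero status whenever the returned list is non-empty.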
Tests use a separate database configuration:
# Start test database
docker-compose -f docker-compose.test.yml up -d
# Run tests with test database
NODE_ENV=test npm test

- Environment Setup

      export NODE_ENV=production
      # Build the application
      npm run build

- Database Setup

      # Run production migrations
      npm run migrate
      # Seed production data (optional)
      npm run seed

- Start Production Server

      npm start

# Build production image
docker build -t haseb:latest .
# Run with environment variables
docker run -d \
--name haseb \
-p 3000:3000 \
-e NODE_ENV=production \
-e DATABASE_URL=postgresql://... \
haseb:latest

- AWS ECS: See docs/deployment/aws.md
- Google Cloud: See docs/deployment/gcp.md
- Azure: See docs/deployment/azure.md
- Heroku: See docs/deployment/heroku.md
- WebSocket Updates: Live evaluation progress at ws://localhost:3000
- Dashboard Metrics: Real-time performance charts
- Health Monitoring: System health at /health
- API Metrics: Request/response tracking
The system tracks comprehensive metrics across all evaluations:
// Example metrics structure
{
performance: {
taskSuccessRate: 0.85,
executionTime: 1200,
firstSuccessTime: 800
},
efficiency: {
totalSteps: 45,
latencyPerStep: 26.7,
totalTokens: 15000
},
cost: {
estimatedCost: 0.25,
costPerTask: 0.05,
resourceUtilization: 0.67
},
robustness: {
toolCallErrorRate: 0.12,
recoveryRate: 0.89,
errorTypes: ['timeout', 'api_limit']
},
quality: {
toolSelectionAccuracy: 0.92,
parameterAccuracy: 0.88,
outputRelevance: 0.91
}
}

Development: http://localhost:3000/api
Production: https://your-domain.com/api
# Login (if auth is enabled)
POST /api/auth/login
{
"email": "user@example.com",
"password": "password123"
}
# Get current user
GET /api/auth/me
Authorization: Bearer <jwt_token>

# Agents
GET /api/agents # List agents
POST /api/agents # Create agent
GET /api/agents/:id # Get agent details
PUT /api/agents/:id # Update agent
DELETE /api/agents/:id # Delete agent
# Evaluations
GET /api/evaluations # List evaluations
POST /api/evaluations # Start evaluation
GET /api/evaluations/:id # Get evaluation details
PATCH /api/evaluations/:id/status # Update status
PUT /api/evaluations/:id/metrics # Update metrics
# Benchmarks
GET /api/benchmarks # List benchmarks
POST /api/benchmarks # Create benchmark
GET /api/benchmarks/:id # Get benchmark details
# Metrics
GET /api/metrics/dashboard # Dashboard metrics
GET /api/metrics/performance # Performance analytics
GET /api/metrics/leaderboard # Leaderboard data

// Connect to real-time updates
const ws = new WebSocket('ws://localhost:3000');
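// Subscribe to evaluation updates once the connection is open;
// calling send() while the socket is still connecting throws an InvalidStateError
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'subscribe',
    channel: 'evaluations',
    evaluationId: 'uuid-here'
  }));
};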
// Receive real-time updates
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log('Update:', data);
};
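The REST endpoints listed above take JSON bodies and, for authenticated routes, a bearer token. The following sketch shows one way a client might assemble such requests. `buildRequest` and `ApiRequest` are hypothetical names, not part of HASEB, and the response shapes are deliberately left out because this README does not specify them.

```typescript
// Hypothetical request builder for illustration; not part of HASEB.
interface ApiRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body?: string };
}

function buildRequest(
  baseUrl: string,
  method: string,
  path: string,
  options: { token?: string; body?: unknown } = {},
): ApiRequest {
  const headers: Record<string, string> = { Accept: "application/json" };
  if (options.token) {
    headers["Authorization"] = `Bearer ${options.token}`;
  }
  const init: ApiRequest["init"] = { method, headers };
  if (options.body !== undefined) {
    headers["Content-Type"] = "application/json";
    init.body = JSON.stringify(options.body);
  }
  // Join base URL and path with exactly one slash between them.
  const url = `${baseUrl.replace(/\/+$/, "")}/${path.replace(/^\/+/, "")}`;
  return { url, init };
}

const API_BASE = "http://localhost:3000/api";

// POST /api/auth/login with the example credentials shown above.
const loginRequest = buildRequest(API_BASE, "POST", "/auth/login", {
  body: { email: "user@example.com", password: "password123" },
});

// GET /api/agents with a bearer token (placeholder value).
const listAgentsRequest = buildRequest(API_BASE, "GET", "/agents", {
  token: "<jwt_token>",
});

console.log(loginRequest.url, listAgentsRequest.init.headers);
```

On Node.js 18 or later, each result can be passed straight to the built-in `fetch`, for example `fetch(loginRequest.url, loginRequest.init)`.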
- Setup Development Environment

      git clone <repository>
      cd haseb
      npm install
      cp .env.example .env
      # Edit .env with your configuration

- Start Development Servers

      # Terminal 1: Backend
      npm run dev:backend
      # Terminal 2: Frontend
      npm run dev

- Run Tests in Watch Mode

      npm run test:watch

- Lint and Format Code

      npm run lint:fix
      npm run format

- TypeScript: Strict mode enabled, 100% type coverage required
- ESLint: Recommended rules with React and TypeScript plugins
- Prettier: Standard formatting with 2-space indentation
- Husky: Pre-commit hooks for quality enforcement
- Conventional Commits: Standardized commit message format
- Create feature branch: git checkout -b feature/new-feature
- Write tests first: Follow TDD methodology
- Implement functionality: Write production code
- Update documentation: Include API docs and README updates
- Run full test suite: Ensure 100% pass rate
- Submit pull request: With comprehensive description
- Create migration: Add to src/database/migrations.ts
- Update models: Modify TypeScript models in src/database/models/
- Test migration: Run npm run migrate:test
- Update seeds: Modify seed files if needed
# Check PostgreSQL status
sudo systemctl status postgresql
# Test connection
psql -h localhost -U username -d haseb
# Reset database
npm run migrate:reset

# Find process using port 3000
lsof -i :3000
# Kill process
kill -9 <PID>
# Use different port
PORT=3001 npm run dev

# Increase Node.js memory limit
NODE_OPTIONS="--max-old-space-size=4096" npm run dev

# Clear build cache
rm -rf dist node_modules/.vite
npm install
npm run build

# Enable debug logging
DEBUG=haseb:* npm run dev
# Database query logging
LOG_LEVEL=debug npm run dev
# Verbose test output
DEBUG=haseb:* npm test

- Documentation: ./docs/
- API Reference: http://localhost:3000/api-docs
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: support@haseb.org
This project is licensed under the ISC License - see the LICENSE file for details.
- LangChain Team for the excellent LangGraph framework
- SWE-bench Team for the benchmark dataset
- GAIA Team for general reasoning tasks
- React Community for the amazing frontend framework
- PostgreSQL Team for the reliable database system
- OpenAI Community for API integration patterns
- Additional benchmark integrations (HumanEval, MBPP)
- Advanced analytics dashboard with custom metrics
- Custom metrics framework for domain-specific evaluation
- Agent marketplace and sharing platform
- Distributed evaluation support across multiple nodes
- Advanced visualizations with D3.js integration
- Performance optimization for large-scale evaluations
- Mobile responsive design improvements
- Multi-cloud deployment support (AWS, GCP, Azure)
- Advanced agent orchestration with hierarchical workflows
- Real-time collaboration features for teams
- Enterprise features with SSO and advanced security
Built with ❤️ by the HASEB Team
Website • Documentation • GitHub