RFC-002: Metrics and Observability Hooks #6

@pontino

Summary

Add comprehensive metrics and observability capabilities to eggai-clutch, enabling monitoring of pipeline performance, step latency, error rates, and resource utilization through pluggable exporters.

Motivation

Production deployments require visibility into:

  • Pipeline latency - End-to-end and per-step timing
  • Error rates - Failure patterns across agents
  • Throughput - Requests processed per second
  • Resource usage - Queue depths, memory, concurrent executions

The existing on_step hook provides basic visibility but lacks structured metrics.

Proposed API

Metrics Configuration

from eggai_clutch import Clutch, Strategy, MetricsConfig

clutch = Clutch(
    "customer_support",
    strategy=Strategy.SEQUENTIAL,
    metrics=MetricsConfig(
        enabled=True,
        exporter="prometheus",
        prometheus_port=9090,
        labels={"service": "support", "env": "prod"},
    ),
)
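One possible shape for MetricsConfig is sketched below. Only the four keyword arguments come from the example above; the defaults, the validation rules, and every exporter name other than "prometheus" are assumptions made for illustration, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed exporter keys; only "prometheus" appears in the RFC example.
KNOWN_EXPORTERS = {"callback", "prometheus", "statsd", "opentelemetry"}


@dataclass(frozen=True)
class MetricsConfig:
    """Hypothetical sketch of the proposed configuration object."""

    enabled: bool = False                     # assumed default
    exporter: str = "callback"                # assumed default
    prometheus_port: Optional[int] = None
    labels: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Fail early on typos instead of silently exporting nothing.
        if self.exporter not in KNOWN_EXPORTERS:
            raise ValueError(f"unknown exporter: {self.exporter!r}")
        if self.exporter == "prometheus" and self.prometheus_port is None:
            raise ValueError("prometheus exporter requires prometheus_port")


config = MetricsConfig(
    enabled=True,
    exporter="prometheus",
    prometheus_port=9090,
    labels={"service": "support", "env": "prod"},
)
```

Validating at construction time is one design option; it surfaces a misconfigured exporter when the Clutch is built rather than at the first execution.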

Built-in Metrics

Metric                               Type       Description
clutch_step_duration_seconds         Histogram  Per-step execution time
clutch_execution_duration_seconds    Histogram  Total pipeline time
clutch_executions_total              Counter    Execution count
clutch_step_errors_total             Counter    Error count by type
clutch_active_executions             Gauge      Currently running pipelines

Hook Integration

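from eggai_clutch import Clutch, MetricEvent  # MetricEvent import path is assumed

# Define the callback before passing it to Clutch.
async def metric_callback(event: MetricEvent) -> None:
    print(f"{event.name}: {event.value} {event.labels}")

clutch = Clutch(
    "pipeline",
    on_step=async_step_callback,        # existing hook, defined elsewhere
    on_metric=metric_callback,          # NEW
)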
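A minimal sketch of what MetricEvent and the dispatch to on_metric could look like follows. The three fields (name, value, labels) are taken from the callback above; the `emit` helper, its support for both sync and async callbacks, and the decision to swallow callback errors are assumptions for illustration.

```python
import asyncio
import inspect
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, List, Optional, Union


@dataclass(frozen=True)
class MetricEvent:
    """Hypothetical event type with the fields used by the callback above."""

    name: str
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


MetricCallback = Callable[[MetricEvent], Union[None, Awaitable[None]]]


async def emit(callback: Optional[MetricCallback], event: MetricEvent) -> None:
    """Call on_metric, awaiting the result if the callback is async."""
    if callback is None:
        return
    try:
        result = callback(event)
        if inspect.isawaitable(result):
            await result
    except Exception as exc:
        # Assumed policy: a failing metrics callback must not fail the pipeline.
        print(f"on_metric callback failed: {exc!r}")


received: List[MetricEvent] = []


async def metric_callback(event: MetricEvent) -> None:
    received.append(event)


asyncio.run(emit(metric_callback, MetricEvent("clutch_executions_total", 1, {"env": "prod"})))
```

Whether callback failures should be swallowed, logged, or propagated is a choice the full specification would need to settle; the sketch picks the most conservative option for the pipeline.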

Exporter Interface

from abc import ABC, abstractmethod

class MetricsExporter(ABC):
    @abstractmethod
    async def record_histogram(self, name: str, value: float, labels: dict) -> None: ...
    @abstractmethod
    async def increment_counter(self, name: str, value: int, labels: dict) -> None: ...
    @abstractmethod
    async def set_gauge(self, name: str, value: float, labels: dict) -> None: ...

Implementation Scope

Phase 1: Core Infrastructure

  • MetricsConfig and MetricEvent types
  • Timing in _call_handler
  • on_metric callback
  • Callback exporter
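The "Timing in _call_handler" item could be approached roughly as follows. This is a standalone sketch, not the library's internals: `timed_call`, the `record` sink, and the `step` and `error_type` label names are assumptions; only the metric names come from the Built-in Metrics table.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, List, Tuple

Record = Tuple[str, float, Dict[str, str]]


async def timed_call(
    step: str,
    handler: Callable[[Any], Awaitable[Any]],
    payload: Any,
    record: Callable[[Record], None],
) -> Any:
    """Run one step handler, recording its duration and any error."""
    start = time.perf_counter()  # monotonic clock, suitable for intervals
    try:
        return await handler(payload)
    except Exception as exc:
        record(("clutch_step_errors_total", 1.0,
                {"step": step, "error_type": type(exc).__name__}))
        raise  # metrics must not change the pipeline's error behavior
    finally:
        # Runs on success and on failure, so failed steps are timed too.
        record(("clutch_step_duration_seconds",
                time.perf_counter() - start, {"step": step}))


records: List[Record] = []


async def ok(text: str) -> str:
    return text.upper()


async def boom(text: str) -> str:
    raise ValueError("bad input")


async def main() -> str:
    out = await timed_call("classify", ok, "hi", records.append)
    try:
        await timed_call("route", boom, "hi", records.append)
    except ValueError:
        pass
    return out


result = asyncio.run(main())
```

Recording the duration in a `finally` block means slow failures show up in the latency histogram as well as in the error counter, which is usually what an operator wants to see.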

Phase 2: Exporters

  • Prometheus exporter with HTTP endpoint
  • StatsD exporter
  • OpenTelemetry exporter (optional dependency)
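As an illustration of what the StatsD exporter would need to produce, the sketch below formats metrics in the StatsD line protocol (`name:value|type`). The `statsd_line` helper is hypothetical. Two points are assumptions worth flagging: plain StatsD has no label concept, so labels are rendered with the DogStatsD tag extension (`|#key:value`), and StatsD timers are expressed in milliseconds, so durations measured in seconds need converting.

```python
from typing import Dict, Optional, Union

# StatsD type suffixes: counter, gauge, timer (milliseconds).
_STATSD_TYPES = {"counter": "c", "gauge": "g", "timer": "ms"}


def statsd_line(
    name: str,
    value: Union[int, float],
    kind: str,
    labels: Optional[Dict[str, str]] = None,
) -> str:
    """Format one metric as a StatsD datagram payload."""
    line = f"{name}:{value}|{_STATSD_TYPES[kind]}"
    if labels:
        # DogStatsD-style tags; not understood by every StatsD server.
        tags = ",".join(f"{k}:{v}" for k, v in sorted(labels.items()))
        line += f"|#{tags}"
    return line


counter = statsd_line("clutch_executions_total", 1, "counter", {"env": "prod"})
gauge = statsd_line("clutch_active_executions", 3, "gauge")
# A 0.25 s step duration, converted to milliseconds for the timer type.
timer = statsd_line("clutch_step_duration", 0.25 * 1000, "timer")
```

Each resulting string would typically be sent as a single UDP datagram to the StatsD daemon; the transport is left out here to keep the sketch focused on the wire format.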

Phase 3: Dashboard Examples

  • Grafana dashboard templates
  • Documentation

Related Issues

Reference

See eggai-clutch-rfcs/RFC-002-metrics-hooks.md for full specification.
