Heartbeat marks idle reactive agents as Crashed, conflating silence with failure #1102

@jeremykpark

Description

Summary

The kernel heartbeat monitor marks an agent Unresponsive after 60s of
inactivity and Crashed after another 30s, then fires auto-recovery via
the supervisor. For reactive agents (no continuous/cron schedule), silence
is the steady state — there is no activity between user messages. This
causes false-positive crash detection on every idle period, consuming
max_restarts budget and producing noisy UI state flips even when
nothing is actually wrong.
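
The behavior described above can be modeled as a purely time-based classifier. This is a minimal sketch, not OpenFang's actual code: the function and constant names are hypothetical, and only the 60s and 30s thresholds come from this report.

```python
from enum import Enum

# Thresholds taken from the report; the names are hypothetical.
UNRESPONSIVE_AFTER_SECS = 60
CRASHED_AFTER_SECS = UNRESPONSIVE_AFTER_SECS + 30


class State(Enum):
    RUNNING = "Running"
    UNRESPONSIVE = "Unresponsive"
    CRASHED = "Crashed"


def classify_by_silence(idle_secs: float) -> State:
    """Classify an agent purely by how long it has been silent."""
    if idle_secs > CRASHED_AFTER_SECS:
        return State.CRASHED
    if idle_secs > UNRESPONSIVE_AFTER_SECS:
        return State.UNRESPONSIVE
    return State.RUNNING


if __name__ == "__main__":
    # A reactive agent waiting two minutes for the next user message
    # is indistinguishable from a hung one under this rule.
    for idle in (10, 75, 120):
        print(idle, classify_by_silence(idle).value)
```

Because the only input is elapsed silence, a healthy agent waiting for input and a genuinely hung agent produce the same result.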

Environment

  • OpenFang 0.6.0 (Linux x86_64)
  • Agent: reactive mode, schedule: reactive, no cron/continuous
  • autonomous.heartbeat_interval_secs = 30, max_restarts = 10

Actual

  • Crashed fires on any silence > 90s regardless of whether the agent
    is mid-operation or simply waiting for input.
  • Auto-recovery consumes a restart from max_restarts; users hitting
    the cap get their agent permanently marked failed for no real reason.
  • Dashboard briefly flashes Crashed on window focus before auto-recovery
    completes, which is alarming for users who aren't aware it's a
    false positive.
  • Users can't tell from the logs whether a real hang occurred or the
    agent was merely idle.
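
To illustrate how quickly the restart budget can drain, here is a rough sketch. It assumes each idle gap longer than 90s costs exactly one restart and that the budget never resets; neither assumption is confirmed by this report, and the helper names are hypothetical.

```python
CRASH_AFTER_IDLE_SECS = 90  # 60s to Unresponsive + 30s to Crashed, per the report
MAX_RESTARTS = 10           # value from the Environment section


def restarts_consumed(idle_gaps_secs: list[float]) -> int:
    """Count idle gaps long enough to trigger a false crash and a restart."""
    return sum(1 for gap in idle_gaps_secs if gap > CRASH_AFTER_IDLE_SECS)


def budget_exhausted(idle_gaps_secs: list[float],
                     max_restarts: int = MAX_RESTARTS) -> bool:
    """True once false-positive restarts alone reach the restart cap (assumed >=)."""
    return restarts_consumed(idle_gaps_secs) >= max_restarts
```

Under these assumptions, ten pauses of a few minutes each between user messages are enough to exhaust max_restarts = 10 without a single real failure.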

Proposed change

Introduce a new state (e.g. Idle or Awaiting) that represents
"no LLM activity but agent is healthy and ready for input." The
Crashed state should only fire when:

  • an LLM call is in-flight and exceeds a configurable deadline, or
  • a shell_exec/tool call is blocked beyond its own timeout, or
  • the agent task process has actually panicked/exited.
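
The three conditions above could be expressed roughly as follows. This is an illustrative sketch only: AgentSnapshot, its fields, and the default deadlines are hypothetical and are not taken from the OpenFang codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Health(Enum):
    IDLE = "Idle"
    RUNNING = "Running"
    CRASHED = "Crashed"


@dataclass
class AgentSnapshot:
    task_alive: bool                              # False if the agent task panicked or exited
    llm_call_started_at: Optional[float] = None   # start time of an in-flight LLM call, if any
    tool_call_started_at: Optional[float] = None  # start time of an in-flight tool call, if any
    llm_deadline_secs: float = 120.0              # hypothetical configurable deadline
    tool_timeout_secs: float = 60.0               # hypothetical per-tool timeout


def classify(snapshot: AgentSnapshot, now: float) -> Health:
    """Report Crashed only for a dead task or an overdue in-flight operation."""
    if not snapshot.task_alive:
        return Health.CRASHED
    in_flight = False
    for started_at, limit in (
        (snapshot.llm_call_started_at, snapshot.llm_deadline_secs),
        (snapshot.tool_call_started_at, snapshot.tool_timeout_secs),
    ):
        if started_at is None:
            continue
        in_flight = True
        if now - started_at > limit:
            return Health.CRASHED
    # Nothing in flight means the agent is simply waiting for input.
    return Health.RUNNING if in_flight else Health.IDLE
```

The key design point is that elapsed silence is never an input on its own: time only matters relative to an operation that is actually in flight.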

Expected Behavior

An idle reactive agent should remain in a steady Idle or Awaiting state, distinct from Crashed. Crashed should reflect an actual failure (a panic, an unrecoverable error, or an in-flight LLM call that exceeds its timeout), not the mere absence of activity.

Steps to Reproduce

  1. Create or activate any reactive agent (e.g., Browser Hand, Clip Hand,
    or a user-created agent with schedule: reactive).
  2. Send it a message, let the turn complete, then leave the chat window
    idle for >90 seconds.
  3. Observe in the dashboard or via GET /api/agents/<id>: state
    transitions Running → Unresponsive → Crashed → (auto-recover) → Running.
  4. No errors in journalctl -u openfang, no messages in any queue, no
    exception traces. The agent never actually failed.
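
To capture the transitions in step 3 without watching the dashboard, a small polling helper along these lines may help. It is a sketch under assumptions: the base URL is whatever address your OpenFang API listens on, and the JSON response is assumed to expose a top-level "state" field, which this report implies but does not confirm.

```python
import json
import time
import urllib.request
from typing import Callable, List


def fetch_state(base_url: str, agent_id: str) -> str:
    """Read the agent state over HTTP (response shape is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/agents/{agent_id}") as resp:
        return json.load(resp)["state"]


def record_transitions(get_state: Callable[[], str], polls: int,
                       interval_secs: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll get_state and keep only the points where the state changes."""
    seen: List[str] = []
    for i in range(polls):
        state = get_state()
        if not seen or seen[-1] != state:
            seen.append(state)
        if i < polls - 1:
            sleep(interval_secs)
    return seen
```

For example, record_transitions(lambda: fetch_state(base_url, agent_id), polls=40) covers a little over three minutes at the default interval, which should be long enough to observe one full Running → Unresponsive → Crashed → Running cycle if the bug reproduces.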

OpenFang Version

0.6.0

Operating System

Linux (x86_64)

Logs / Screenshots

No response

Labels

bug (Something isn't working)