Description
Title
Heartbeat marks idle reactive agents as Crashed, conflating silence with failure
Summary
The kernel heartbeat monitor marks an agent Unresponsive after 60s of
inactivity and Crashed after another 30s, then fires auto-recovery via
the supervisor. For reactive agents (no continuous/cron schedule), silence
is the steady state — there is no activity between user messages. This
causes false-positive crash detection on every idle period, consuming
max_restarts budget and producing noisy UI state flips even when
nothing is actually wrong.
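To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is a model, not OpenFang's actual code: the function and constant names are hypothetical, and only the 60s and 30s thresholds come from this report.

```python
from enum import Enum

# Thresholds from this report: Unresponsive after 60s of silence,
# Crashed after a further 30s. The names are hypothetical.
UNRESPONSIVE_AFTER_SECS = 60
CRASHED_AFTER_SECS = UNRESPONSIVE_AFTER_SECS + 30


class AgentState(Enum):
    RUNNING = "Running"
    UNRESPONSIVE = "Unresponsive"
    CRASHED = "Crashed"


def state_from_silence(silence_secs: float) -> AgentState:
    """Model of the reported behavior: the state depends only on the
    time since the last observed activity, not on what the agent is doing."""
    if silence_secs > CRASHED_AFTER_SECS:
        return AgentState.CRASHED
    if silence_secs > UNRESPONSIVE_AFTER_SECS:
        return AgentState.UNRESPONSIVE
    return AgentState.RUNNING
```

Under this model a reactive agent that finished its turn two minutes ago is reported as Crashed, because nothing in the input distinguishes "waiting for the next message" from "hung".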
Environment
- OpenFang 0.6.0 (Linux x86_64)
- Agent: reactive mode, schedule: reactive, no cron/continuous
- autonomous.heartbeat_interval_secs = 30, max_restarts = 10
Actual
- Crashed fires on any silence > 90s regardless of whether the agent
is mid-operation or simply waiting for input.
- Auto-recovery consumes a restart from max_restarts; users hitting
the cap get their agent permanently marked failed for no real reason.
- Dashboard briefly flashes Crashed on window focus before auto-recovery
completes, which is alarming for users who aren't aware it's a
false positive.
- Users can't distinguish from logs whether a real hang occurred or just
idle time.
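A rough way to estimate how quickly the restart budget drains is sketched below. It assumes that each idle gap longer than 90s costs exactly one restart, that the counter never resets, and that the cap is hit once the count reaches max_restarts. None of these is confirmed here, and the function names are hypothetical.

```python
# 60s to Unresponsive plus 30s to Crashed, per this report.
CRASHED_AFTER_SECS = 90


def false_restarts(idle_gaps_secs, crashed_after_secs=CRASHED_AFTER_SECS):
    """Count idle gaps long enough to trigger a false Crashed state.
    Assumption: one restart is consumed per qualifying gap."""
    return sum(1 for gap in idle_gaps_secs if gap > crashed_after_secs)


def budget_exhausted(idle_gaps_secs, max_restarts=10):
    """Assumption: the agent is marked failed once the number of
    restarts reaches max_restarts and the counter is never reset."""
    return false_restarts(idle_gaps_secs) >= max_restarts
```

With max_restarts = 10, ten pauses of two minutes between messages would be enough to exhaust the budget under these assumptions, without a single real failure.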
Proposed change
Introduce a new state (e.g. Idle or Awaiting) that represents
"no LLM activity but agent is healthy and ready for input." The
Crashed state should only fire when:
- an LLM call is in-flight and exceeds a configurable deadline, or
- a shell_exec/tool call is blocked beyond its own timeout, or
- the agent task process has actually panicked/exited.
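The three conditions above could be expressed roughly as follows. This is a sketch of the proposal only: AgentSnapshot, its fields, and llm_deadline_secs are hypothetical names, not existing OpenFang types or config keys.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AgentState(Enum):
    RUNNING = "Running"
    IDLE = "Idle"
    CRASHED = "Crashed"


@dataclass
class AgentSnapshot:
    # False if the agent task has panicked or exited.
    task_alive: bool
    # Seconds the current LLM call has been in flight; None if there is none.
    llm_call_elapsed_secs: Optional[float] = None
    # Seconds the current tool call has been running; None if there is none.
    tool_call_elapsed_secs: Optional[float] = None
    # The tool call's own timeout, if it has one.
    tool_timeout_secs: Optional[float] = None


def classify(snapshot: AgentSnapshot, llm_deadline_secs: float) -> AgentState:
    """Crashed only on evidence of failure; silence alone yields Idle."""
    llm_stuck = (
        snapshot.llm_call_elapsed_secs is not None
        and snapshot.llm_call_elapsed_secs > llm_deadline_secs
    )
    tool_stuck = (
        snapshot.tool_call_elapsed_secs is not None
        and snapshot.tool_timeout_secs is not None
        and snapshot.tool_call_elapsed_secs > snapshot.tool_timeout_secs
    )
    if not snapshot.task_alive or llm_stuck or tool_stuck:
        return AgentState.CRASHED
    if (
        snapshot.llm_call_elapsed_secs is not None
        or snapshot.tool_call_elapsed_secs is not None
    ):
        return AgentState.RUNNING
    return AgentState.IDLE
```

Time since the last activity is deliberately not an input here, so an agent with a live task and nothing in flight stays Idle no matter how long it waits.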
Expected Behavior
An idle reactive agent should remain in a steady Idle or Waiting state, distinct from Crashed. Crashed should reflect actual failure (panic, unrecoverable error, or stuck LLM call that exceeds a timeout mid-operation), not mere absence of activity.
Steps to Reproduce
- Create or activate any reactive agent (e.g., Browser Hand, Clip Hand,
or a user-created agent with schedule: reactive).
- Send it a message, let the turn complete, then leave the chat window
idle for >90 seconds.
- Observe in the dashboard or via GET /api/agents/<id>: state
transitions Running → Unresponsive → Crashed → (auto-recover) → Running.
- No errors in journalctl -u openfang, no messages in any queue, no
exception traces. The agent never actually failed.
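To capture the state flips without watching the dashboard, a small poller along these lines may help. It assumes that GET /api/agents/<id> returns a JSON object with a top-level "state" key and that the API is reachable at a base URL you supply. Neither detail is confirmed in this report, and the helper names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, List


def fetch_state(base_url: str, agent_id: str) -> str:
    """Read the agent state from GET /api/agents/<id>.
    Assumption: the body is a JSON object with a top-level "state" key."""
    url = f"{base_url}/api/agents/{agent_id}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["state"]


def collapse_transitions(states: Iterable[str]) -> List[str]:
    """Drop consecutive duplicates so only the state changes remain."""
    transitions: List[str] = []
    for state in states:
        if not transitions or transitions[-1] != state:
            transitions.append(state)
    return transitions


def watch(fetch: Callable[[], str], polls: int, interval_secs: float) -> List[str]:
    """Poll `fetch` a fixed number of times and return the observed transitions."""
    observed = []
    for _ in range(polls):
        observed.append(fetch())
        time.sleep(interval_secs)
    return collapse_transitions(observed)
```

For example, calling watch(lambda: fetch_state(base_url, agent_id), polls=40, interval_secs=5) right after a turn completes covers a bit over three minutes of idle time, which should be long enough to record the full cycle if the bug is present.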
OpenFang Version
0.6
Operating System
Linux (x86_64)
Logs / Screenshots
No response