Problem
stage_heartbeat emission is gated on stdout/stderr growth, so quiet-but-active phases produce sparse or missing heartbeat events.
In run 01KJR25QS6RY52D7VS2ZRAAXMK, merge_implementation remained active while heartbeat cadence showed multi-minute gaps tied to low output growth.
Why this matters
- Healthy stages can appear stalled in detached monitoring flows.
- Triage quality drops because “alive but quiet” and “possibly stuck” are not clearly distinguished.
- progress.ndjson and live.json under-represent liveness during quiet command windows.
Evidence
Run artifact:
~/.local/state/kilroy/attractor/runs/01KJR25QS6RY52D7VS2ZRAAXMK/progress.ndjson
Observed heartbeat sequence for merge_implementation includes gaps while the stage remains active, e.g.:
- 20:36:54Z (elapsed 240) -> 20:38:54Z (elapsed 360) (missing 300)
- 20:38:54Z (elapsed 360) -> 20:42:54Z (elapsed 600) (4-minute gap)
- later attempt: 20:53:05Z (elapsed 180) -> 20:55:05Z (elapsed 300) (missing 240)
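The gaps can be read off mechanically from the elapsed values. The sketch below assumes a nominal 60-second heartbeat interval, which is inferred from the "missing 300" and "missing 240" annotations rather than confirmed from the code. `missingTicks` is a hypothetical helper for illustration, not something in the codebase.

```go
package main

import "fmt"

// missingTicks returns the elapsed-second marks at which a heartbeat was
// expected but not observed. observed must be ascending; intervalS is the
// nominal heartbeat interval in seconds. Hypothetical helper.
func missingTicks(observed []int, intervalS int) []int {
	if intervalS <= 0 {
		return nil // avoid looping forever on a nonsensical interval
	}
	var missing []int
	for i := 1; i < len(observed); i++ {
		// Walk the expected marks strictly between two observed heartbeats.
		for t := observed[i-1] + intervalS; t < observed[i]; t += intervalS {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	// Elapsed values from the first attempt; 60s is an assumed interval.
	fmt.Println(missingTicks([]int{240, 360, 600}, 60)) // [300 420 480 540]
	// Elapsed values from the later attempt.
	fmt.Println(missingTicks([]int{180, 300}, 60)) // [240]
}
```

Under that assumption, the 4-minute gap corresponds to three consecutive missed heartbeats (420, 480, 540).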
Stage logs show ongoing activity around long-running commands (including npm install) while heartbeat signal remains irregular.
Relevant code:
internal/attractor/engine/codergen_router.go:
- heartbeat loop setup around :1149
- output-growth gate around :1167
- appendProgress called only when gate passes around :1171
- API path has similar growth-gated heartbeat behavior around :317
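To make the effect of the gate concrete, here is a minimal simulation of the pattern described above. It is an illustrative reconstruction, not the code in codergen_router.go: `gatedTicks` and the byte counts are hypothetical, and the only behavior taken from the issue is "emit a heartbeat only when output has grown since the last emission".

```go
package main

import "fmt"

// gatedTicks returns the indices of the ticks on which a growth-gated
// heartbeat would be emitted. sizes[i] is the combined stdout+stderr byte
// count observed at tick i. Illustrative only.
func gatedTicks(sizes []int64) []int {
	var emitted []int
	var last int64
	for i, sz := range sizes {
		if sz > last { // the gate: emit only when output grew
			emitted = append(emitted, i)
			last = sz
		}
	}
	return emitted
}

func main() {
	// Output grows on ticks 0, 2 and 5; ticks 1, 3 and 4 are quiet.
	sizes := []int64{100, 100, 150, 150, 150, 200}
	fmt.Println(gatedTicks(sizes)) // [0 2 5]
}
```

The process is alive on all six ticks, yet only three heartbeats would be recorded, which matches the shape of the gaps seen in progress.ndjson.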
Steps to reproduce / observe
- Run a codergen stage with a long command that has quiet periods.
- Monitor progress.ndjson for stage_heartbeat timestamps.
- Compare with stage stdout.log/stderr.log evolution.
- Observe heartbeat gaps caused by lack of output growth, not process death.
Scope boundaries
This issue is about observability/liveness signaling.
This issue is not:
- A retry policy change.
- A command execution semantics change.
Potential directions (non-prescriptive)
- Emit liveness heartbeat on each interval while process/session is active.
- Keep byte counters as diagnostics rather than emission gates.
- Add explicit quiet-state metrics (since_last_output_s, idle_for_s).
- Separate “liveness” and “output progress” event semantics.
Definition of done
- Active stages produce predictable liveness signals even when output is quiet.
- Monitoring can reliably differentiate “alive but quiet” from missing liveness.
- Reproduction no longer shows heartbeat gaps solely due to output gating.
- Tests cover both output-producing and quiet-active periods.