Harden change sync lease recovery #14

Merged
dutifulbob merged 1 commit into main from feat/reliable-change-sync-leases
Apr 15, 2026

@dutifulbob
Member

Summary

  • add owner-scoped fetch/backfill leases with heartbeat timestamps and a startup recovery pass
  • heartbeat both fetch and backfill work while they run, and expose lease diagnostics in the change status output
  • add migration and regression tests covering stale lease reclaim and startup recovery

Validation

  • go test ./internal/githubsync/...
  • go test ./internal/httpapi/... ./internal/ghr/...
  • go test ./...
  • go vet ./...
  • go build ./cmd/ghreplica && go build ./cmd/ghr
  • npm run docs:check

Notes

  • I could not run a local Postgres migration smoke test here because this environment cannot access Docker; I will validate the migration during deploy.

@dutifulbob force-pushed the feat/reliable-change-sync-leases branch from 0ce410e to 4c9cacc on April 15, 2026 18:15
@dutifulbob merged commit 7f13f69 into main on Apr 15, 2026
1 check passed
@dutifulbob deleted the feat/reliable-change-sync-leases branch on April 15, 2026 18:18
@dutifulbob
Member Author

Final report:

What shipped:

  • owner-scoped fetch/backfill leases in repo_change_sync_states
  • heartbeat timestamps and stale-lease reclamation logic
  • startup lease recovery before the change-sync loop starts
  • heartbeat-driven fetch/backfill status semantics and diagnostics in /v1/changes/.../status
  • CLI status output for lease owner/heartbeat/expires fields
  • regression tests for stale fetch reclaim, fresh lease non-reclaim, and startup recovery

Validation run locally:

  • go test ./internal/githubsync/... -> passed
  • go test ./internal/httpapi/... ./internal/ghr/... -> passed
  • go test ./... -> passed
  • go vet ./... -> passed
  • go build ./cmd/ghreplica && go build ./cmd/ghr -> passed
  • npm run docs:check -> passed
  • local Postgres migration smoke -> could not run here because this environment could not access Docker

Review / comments:

  • codex review --base main was run on the pushed branch head
  • no PR issue comments were created
  • no inline PR review comments were created
  • no actionable P0/P1 findings were surfaced

CI:

  • 1 check passed on the PR

Production deploy / verification:

  • updated VM checkout to merged main
  • built and restarted ghreplica
  • applied 000008_change_sync_lease_ownership.up.sql in production and inserted it into schema_migrations
  • restarted ghreplica after the schema fix
  • curl -fsS https://ghreplica.dutiful.dev/healthz -> passed
  • curl -fsS https://ghreplica.dutiful.dev/readyz -> passed
  • curl -fsS https://ghreplica.dutiful.dev/v1/changes/repos/openclaw/openclaw/status -> passed
  • sudo docker exec gcp-ghreplica-1 ghreplica backfill repo openclaw/openclaw --mode open_only --priority 10 -> passed after schema fix

Reliability check performed in production:

  • observed active fetch heartbeats advancing every 10s on openclaw/openclaw
  • force-restarted gcp-ghreplica-1 during that active fetch
  • observed the old owner stop heartbeating
  • observed a new worker owner reclaim the fetch lease and resume within the heartbeat-staleness window instead of waiting for the full 15m lease TTL
  • latest observed live status after the forced restart showed:
    • fetch_lease_owner_id=42f72ba5b24d:1:1776277582619356936
    • fetch_lease_heartbeat_at=2026-04-15T18:28:47.668021Z
    • fetch_in_progress=true
    • last_error=null

This was the exact failure mode we were fixing, and production now recovers through it without getting stuck behind the old stale lease.
