E2E canary: post-deploy NIP-42 + send kind-9 to a known group, fail-fast on regression #29

Description

@MastaP

Why

The 2026-05-09 deploy of PR #24 (kind denormalization) shipped with a latent regression in WarmCaches. Under the DB CPU pressure that existed at deploy time, the 90k-event membership replay timed out partway and left the cache partially populated, yet g.cachesWarmed = true was set unconditionally. The relay then falsely rejected writes from real members with restricted: you are not a member of that group.

The user noticed when their personal account couldn't post. PR #26 fixed the underlying cause (snapshot-based warm-up + per-group fully-loaded marker + DB fallback in IsMember). But the regression was visible to actual users for some minutes before anyone noticed.

A small, self-contained synthetic test running on every prod deploy would have caught this in seconds and rolled the deploy back automatically.

Proposed shape

A new Go binary at cmd/e2e-canary that:

  1. Connects via WSS to wss://sphere-relay.unicity.network (or arg).
  2. Completes the NIP-42 auth handshake with a known canary keypair (seckey from Secrets Manager).
  3. Sends a kind-9 message carrying the NIP-29 group tag ["h", "<canary-group>"].
  4. Waits for the relay's OK response, with a timeout (~10s).
  5. Exits 0 if the relay accepted the event; on rejection or timeout, exits 1 and prints the relay's error message (or a timeout notice if no response arrived).

Built into the same Docker image (/usr/local/bin/zooid-canary) so we don't manage two image lifecycles.

Where it runs

A separate ECS task definition, sphere-zooid-relay-eu-canary, using the same cluster, network config, and image as the relay. It is triggered after every CFN update-stack completes:

  1. update-stack → wait stable.
  2. run-task for the canary; capture exit code.
  3. If non-zero: log the relay error, alert via the existing Grafana alert pipeline, and (optionally) trigger update-stack rollback to the previous image.
  4. If zero: deploy declared healthy.
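
The decision in steps 3 and 4 can be sketched as a small pure function, which keeps the gate easy to test separately from the AWS calls. This is only an illustration: the Action names and the gate function are hypothetical, and the nil case assumes the pipeline may get no exit code at all (for example if the canary container never started), which is treated here as a failure.

```go
package main

import "fmt"

// Action is one step the deploy pipeline takes after the canary task ends.
type Action string

const (
	DeclareHealthy Action = "declare-healthy"
	LogRelayError  Action = "log-relay-error"
	AlertGrafana   Action = "alert-grafana"
	RollbackStack  Action = "rollback-stack"
)

// gate decides what to do from the canary's exit code.
// exit is nil when no exit code was reported; that counts as a failure.
// autoRollback models the "(optionally) trigger rollback" switch.
func gate(exit *int, autoRollback bool) []Action {
	if exit != nil && *exit == 0 {
		return []Action{DeclareHealthy}
	}
	actions := []Action{LogRelayError, AlertGrafana}
	if autoRollback {
		actions = append(actions, RollbackStack)
	}
	return actions
}

func main() {
	ok, bad := 0, 1
	fmt.Println(gate(&ok, true))   // canary passed
	fmt.Println(gate(&bad, true))  // canary failed, rollback enabled
	fmt.Println(gate(nil, false))  // no exit code, rollback disabled
}
```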

Setup

  • Generate a canary keypair: openssl rand -hex 32 for seckey, derive pubkey.
  • Add canary user to a dedicated canary group (kind-9000 from a relay admin), so we don't pollute general.
  • Store seckey in Secrets Manager (sphere-zooid-relay-eu-canary-key).
  • Wire the canary task to read it via the standard ECS secrets[] (granted on the execution role).

Optional next step: continuous canary via CloudWatch Synthetics every 5 minutes — same script, different trigger. Catches drift / capacity / cert issues independent of deploys.

Out of scope

  • Reading a canary group's history (different test, would catch read-side regressions).
  • Multi-region canary.
  • Canary for the tokens relay — same idea applies but its event semantics differ (no NIP-29 group filter); track separately.

Refs
