E2E canary: post-deploy NIP-42 + send kind-9 to a known group, fail-fast on regression

## Why

The 2026-05-09 deploy of PR #24 (kind denormalization) shipped with a latent regression in `WarmCaches`: under the DB CPU pressure that existed at deploy time, the 90k-event membership replay timed out partway and left the cache partially populated, yet `g.cachesWarmed = true` was still set unconditionally. The relay then falsely rejected writes from real members with `restricted: you are not a member of that group`.

The user noticed when their personal account couldn't post. PR #26 fixed the underlying cause (snapshot-based warm-up + per-group fully-loaded marker + DB fallback in IsMember). But the regression was visible to actual users for some minutes before anyone noticed.

A small, self-contained synthetic test running on every prod deploy would have caught this in seconds and let the pipeline roll the deploy back automatically.

## Proposed shape

A new Go binary at `cmd/e2e-canary` that:

1. Connects via WSS to `wss://sphere-relay.unicity.network` (or arg).
2. Completes the NIP-42 auth handshake with a known canary keypair (seckey from Secrets Manager).
3. Sends a kind-9 message carrying the group tag `["h", "<canary-group>"]`.
4. Waits for the relay's `OK` response, with a timeout (~10s).
5. Exits 0 if the `OK` reports `accepted=true`; on rejection or timeout, exits 1 and prints the relay's error message.
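
The protocol-level pieces of the steps above can be sketched with the standard library alone. This is a minimal sketch, not the proposed implementation: the names `Event`, `AuthEvent`, `GroupMessage`, `ParseOK` and `ExitCode` are hypothetical, the `"canary"` content string is an assumption, and the WebSocket transport and Schnorr signing are deliberately left out (both would need a third-party library). It shows how the event id is derived per NIP-01, which tags the NIP-42 auth event and the kind-9 message carry, and how the relay's `OK` frame maps to the exit code.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// Event carries the NIP-01 fields that determine the event id.
// Signing (BIP-340 Schnorr) is left out of this sketch.
type Event struct {
	PubKey    string
	CreatedAt int64
	Kind      int
	Tags      [][]string
	Content   string
}

// ID hashes the NIP-01 serialization [0,pubkey,created_at,kind,tags,content].
// Caveat: encoding/json escapes a few characters (e.g. U+2028) differently
// from NIP-01, so this sketch assumes plain ASCII canary content.
func (e Event) ID() (string, error) {
	tags := e.Tags
	if tags == nil {
		tags = [][]string{} // must serialize as [], not null
	}
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // NIP-01 does not escape <, > or &
	if err := enc.Encode([]any{0, e.PubKey, e.CreatedAt, e.Kind, tags, e.Content}); err != nil {
		return "", err
	}
	// Encoder appends a newline that is not part of the serialization.
	sum := sha256.Sum256(bytes.TrimSuffix(buf.Bytes(), []byte("\n")))
	return hex.EncodeToString(sum[:]), nil
}

// AuthEvent builds the unsigned NIP-42 auth event (kind 22242) for a challenge.
func AuthEvent(pubkey, relayURL, challenge string, now int64) Event {
	return Event{PubKey: pubkey, CreatedAt: now, Kind: 22242,
		Tags: [][]string{{"relay", relayURL}, {"challenge", challenge}}}
}

// GroupMessage builds the unsigned kind-9 message scoped to a group via the "h" tag.
func GroupMessage(pubkey, group string, now int64) Event {
	return Event{PubKey: pubkey, CreatedAt: now, Kind: 9,
		Tags: [][]string{{"h", group}}, Content: "canary"}
}

// ParseOK decodes ["OK", <event id>, <accepted>, <message>] and checks the id.
func ParseOK(frame []byte, wantID string) (accepted bool, reason string, err error) {
	var parts []json.RawMessage
	if err = json.Unmarshal(frame, &parts); err != nil {
		return false, "", fmt.Errorf("not a JSON array: %w", err)
	}
	if len(parts) < 3 {
		return false, "", errors.New("OK frame needs at least 3 elements")
	}
	var label, id string
	if err = json.Unmarshal(parts[0], &label); err != nil || label != "OK" {
		return false, "", errors.New("not an OK frame")
	}
	if err = json.Unmarshal(parts[1], &id); err != nil || id != wantID {
		return false, "", errors.New("OK frame is for a different event")
	}
	if err = json.Unmarshal(parts[2], &accepted); err != nil {
		return false, "", fmt.Errorf("bad accepted flag: %w", err)
	}
	if len(parts) > 3 {
		_ = json.Unmarshal(parts[3], &reason) // message is informational only
	}
	return accepted, reason, nil
}

// ExitCode maps the outcome to the process exit status: 0 only on acceptance.
func ExitCode(accepted bool, err error) int {
	if err != nil || !accepted {
		return 1
	}
	return 0
}
```

In the real binary, a `main` would read frames off the socket until `ParseOK` succeeds or the ~10s deadline passes, treat the timeout as an error, and pass the result to `os.Exit(ExitCode(...))`.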

The binary is built into the relay's Docker image (as `/usr/local/bin/zooid-canary`) so we don't manage two image lifecycles.

## Where it runs

A separate ECS task definition `sphere-zooid-relay-eu-canary`, in the same cluster and with the same network config and image as the relay. It is triggered after every CFN `update-stack` completes:

1. `update-stack` → wait stable.
2. `run-task` for the canary; capture exit code.
3. If non-zero: log the relay error, alert via the existing Grafana alert pipeline, and (optionally) trigger `update-stack` rollback to the previous image.
4. If zero: deploy declared healthy.
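
The gate in steps 3 and 4 is a small decision table, sketched below as a pure function so it can be unit-tested apart from the AWS calls. The `Action` type, its constants and the `autoRollback` flag are hypothetical names for illustration; the flag stands in for the "(optionally)" in step 3.

```go
package main

// Action is one step the deploy pipeline takes once the canary task has exited.
type Action string

const (
	DeclareHealthy Action = "declare-healthy"
	LogRelayError  Action = "log-relay-error"
	AlertGrafana   Action = "alert-grafana"
	RollbackStack  Action = "rollback-stack"
)

// Gate maps the canary's exit code to the ordered pipeline actions.
// A zero exit code declares the deploy healthy; anything else logs the
// relay error and alerts, and rolls back only when autoRollback is set.
func Gate(exitCode int, autoRollback bool) []Action {
	if exitCode == 0 {
		return []Action{DeclareHealthy}
	}
	actions := []Action{LogRelayError, AlertGrafana}
	if autoRollback {
		actions = append(actions, RollbackStack)
	}
	return actions
}
```

Keeping rollback behind a flag lets the canary run in alert-only mode first, until we trust it not to flake.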

## Setup

- Generate a canary keypair: `openssl rand -hex 32` for seckey, derive pubkey.
- Add canary user to a dedicated `canary` group (kind-9000 from a relay admin), so we don't pollute `general`.
- Store seckey in Secrets Manager (`sphere-zooid-relay-eu-canary-key`).
- Wire the canary task to read it via the standard ECS `secrets[]` (granted on the execution role).

Optional next step: continuous canary via CloudWatch Synthetics every 5 minutes — same script, different trigger. Catches drift / capacity / cert issues independent of deploys.

## Out of scope

- Reading a canary group's history (different test, would catch read-side regressions).
- Multi-region canary.
- Canary for the tokens relay — same idea applies but its event semantics differ (no NIP-29 group filter); track separately.

## Refs

- The regression that motivated this: 2026-05-09 deploy of #24, found the hard way; root-caused and fixed in #26 over five review iterations.
