Why
The 2026-05-09 deploy of PR #24 (kind denormalization) shipped with a latent regression in WarmCaches: under the DB CPU pressure that existed at deploy time, the 90k-event membership replay timed out partway, the cache ended up partially populated, and g.cachesWarmed = true was set unconditionally. The relay then false-rejected real members on writes with restricted: you are not a member of that group.
The user noticed when their personal account couldn't post. PR #26 fixed the underlying cause (snapshot-based warm-up + per-group fully-loaded marker + DB fallback in IsMember). But the regression was visible to actual users for some minutes before anyone noticed.
A small, self-contained synthetic test running on every prod deploy would have caught this in seconds and rolled the deploy back automatically.
Proposed shape
A new Go binary at cmd/e2e-canary that:
- Connects via WSS to wss://sphere-relay.unicity.network (or a URL passed as an argument).
- Completes the NIP-42 auth handshake with a known canary keypair (seckey from Secrets Manager).
- Sends a kind-9 message tagged #h: ['<canary-group>'].
- Waits for the relay's OK response, with a timeout (~10s).
- Asserts accepted=true and exits 0; on rejection or timeout, exits 1 with the relay's error message.
Built into the same Docker image (/usr/local/bin/zooid-canary) so we don't manage two image lifecycles.
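As a rough illustration of the final assert step, the sketch below maps a relay OK frame to the canary's exit code. It assumes the NIP-01 frame shape ["OK", <event-id>, <accepted>, <message>]; the names okResult, parseOK, and exitCodeFor are hypothetical, and the WebSocket transport, NIP-42 handshake, event signing, and the ~10s timeout are deliberately left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// okResult holds the fields of a relay OK frame:
// ["OK", <event-id>, <accepted>, <message>].
type okResult struct {
	EventID  string
	Accepted bool
	Message  string
}

// parseOK decodes one raw relay frame and fails if it is not a well-formed OK.
func parseOK(raw []byte) (okResult, error) {
	var parts []json.RawMessage
	if err := json.Unmarshal(raw, &parts); err != nil {
		return okResult{}, fmt.Errorf("frame is not a JSON array: %w", err)
	}
	if len(parts) < 3 {
		return okResult{}, errors.New("frame has fewer than 3 elements")
	}
	var label string
	if err := json.Unmarshal(parts[0], &label); err != nil || label != "OK" {
		return okResult{}, errors.New("frame is not an OK message")
	}
	var res okResult
	if err := json.Unmarshal(parts[1], &res.EventID); err != nil {
		return okResult{}, fmt.Errorf("bad event id: %w", err)
	}
	if err := json.Unmarshal(parts[2], &res.Accepted); err != nil {
		return okResult{}, fmt.Errorf("bad accepted flag: %w", err)
	}
	if len(parts) > 3 {
		// The message is informational; ignore it if it is not a string.
		_ = json.Unmarshal(parts[3], &res.Message)
	}
	return res, nil
}

// exitCodeFor returns 0 only when the relay accepted the event we sent;
// otherwise it returns 1 together with a reason to log.
func exitCodeFor(raw []byte, sentID string) (int, string) {
	res, err := parseOK(raw)
	if err != nil {
		return 1, err.Error()
	}
	if res.EventID != sentID {
		return 1, "OK refers to a different event: " + res.EventID
	}
	if !res.Accepted {
		return 1, res.Message
	}
	return 0, ""
}

func main() {
	// Placeholder event id; a real one is a 64-character hex string.
	frame := []byte(`["OK","abc123",false,"restricted: you are not a member of that group"]`)
	code, reason := exitCodeFor(frame, "abc123")
	fmt.Printf("exit=%d reason=%q\n", code, reason)
}
```

Checking the event id guards against an OK for some other event on the same connection being mistaken for the canary's result.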
Where it runs
A separate ECS task definition sphere-zooid-relay-eu-canary in the same cluster, network config, and image as the relay. Triggered after every CFN update-stack completes:
1. update-stack → wait stable.
2. run-task for the canary; capture exit code.
   - If non-zero: log the relay error, alert via the existing Grafana alert pipeline, and (optionally) trigger update-stack rollback to the previous image.
   - If zero: deploy declared healthy.
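The gate above reduces to a small decision on the canary's exit code. A minimal sketch, assuming the exit code has already been read from the stopped canary task; the action type, its values, and the autoRollback switch (standing in for the "(optionally)" rollback) are hypothetical names, not part of any existing pipeline:

```go
package main

import "fmt"

// action is the step the deploy pipeline takes after the canary finishes.
type action string

const (
	declareHealthy   action = "declare-healthy"
	alertOnly        action = "alert"
	alertAndRollback action = "alert-and-rollback"
)

// decide maps the canary container's exit code to the post-deploy action.
// autoRollback is a hypothetical switch for the optional rollback step.
func decide(exitCode int, autoRollback bool) action {
	if exitCode == 0 {
		return declareHealthy
	}
	if autoRollback {
		return alertAndRollback
	}
	return alertOnly
}

func main() {
	for _, code := range []int{0, 1} {
		for _, rollback := range []bool{false, true} {
			fmt.Printf("exit=%d autoRollback=%t -> %s\n", code, rollback, decide(code, rollback))
		}
	}
}
```

Treating every non-zero code the same way keeps the gate simple: a rejection and a timeout both mean the deploy cannot be declared healthy.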
Setup
- Generate a canary keypair: openssl rand -hex 32 for seckey, derive pubkey.
- Add canary user to a dedicated canary group (kind-9000 from a relay admin), so we don't pollute general.
- Store seckey in Secrets Manager (sphere-zooid-relay-eu-canary-key).
- Wire the canary task to read it via the standard ECS secrets[] (granted on the execution role).
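For the keypair step, the following sketch shows a Go equivalent of openssl rand -hex 32 with one extra check: a secp256k1 secret key must be an integer in [1, n-1], where n is the curve order. The function names newSeckey and validSeckey are hypothetical. Deriving the pubkey needs a secp256k1 library, which the Go standard library does not include, so that part is omitted.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"math/big"
)

// curveOrder is the secp256k1 group order n.
var curveOrder, _ = new(big.Int).SetString(
	"FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141", 16)

// validSeckey checks that hexKey is 32 hex-encoded bytes in the range [1, n-1].
func validSeckey(hexKey string) error {
	raw, err := hex.DecodeString(hexKey)
	if err != nil {
		return fmt.Errorf("seckey is not hex: %w", err)
	}
	if len(raw) != 32 {
		return errors.New("seckey must be exactly 32 bytes")
	}
	k := new(big.Int).SetBytes(raw)
	if k.Sign() == 0 || k.Cmp(curveOrder) >= 0 {
		return errors.New("seckey is outside [1, n-1]")
	}
	return nil
}

// newSeckey draws 32 random bytes until they form a valid key.
// A retry is astronomically unlikely but keeps the function total.
func newSeckey() (string, error) {
	buf := make([]byte, 32)
	for {
		if _, err := rand.Read(buf); err != nil {
			return "", err
		}
		key := hex.EncodeToString(buf)
		if validSeckey(key) == nil {
			return key, nil
		}
	}
}

func main() {
	key, err := newSeckey()
	if err != nil {
		fmt.Println("failed to generate seckey:", err)
		return
	}
	// Print only the length so the secret never lands in logs.
	fmt.Printf("generated seckey of %d hex characters\n", len(key))
}
```

The same validSeckey check could run at canary startup so that a malformed secret fails fast with a clear message rather than as an auth error from the relay.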
Optional next step: continuous canary via CloudWatch Synthetics every 5 minutes — same script, different trigger. Catches drift / capacity / cert issues independent of deploys.
Out of scope
- Reading a canary group's history (different test, would catch read-side regressions).
- Multi-region canary.
- Canary for the tokens relay — same idea applies but its event semantics differ (no NIP-29 group filter); track separately.
Refs