Add Redis Sentinel (HA) support#149
Conversation
Branches createRedisCommitmentQueue on REDIS_SENTINEL_ADDRS to use NewFailoverClient (still *redis.Client) while keeping the single-endpoint path unchanged. New env vars: REDIS_SENTINEL_ADDRS, REDIS_MASTER_NAME, REDIS_SENTINEL_PASSWORD, REDIS_SENTINEL_USERNAME, REDIS_ROUTE_BY_LATENCY, REDIS_ROUTE_RANDOMLY, REDIS_REPLICA_ONLY.
…#148) Addresses review feedback on the previous commit: - Symmetric validation: setting REDIS_MASTER_NAME without REDIS_SENTINEL_ADDRS would have silently fallen back to single-endpoint mode against localhost:6379, dropping the master name on the floor. This is a real footgun for operators who intended to enable Sentinel, so the factory now returns a clear error in that case (mirroring the existing check for sentinel addrs without master name). - Operator visibility: log the active mode (sentinel/direct) and the endpoint(s) at queue creation time so HA wiring is verifiable from startup logs. - Tests: add the factory-level unit test that closes the coverage gap on the sentinel-addrs-without-master-name validation, plus a symmetric test for the new master-name-without-sentinel-addrs validation. Both exercise the unexported helper directly without spinning up Redis or MongoDB.
There was a problem hiding this comment.
Code Review
This pull request introduces Redis Sentinel (HA) support for the commitment queue, allowing for failover-capable Redis configurations. The changes include updates to the configuration structure, environment variable parsing, and the storage factory to support failover clients. A review comment identifies an opportunity to improve context propagation in the Redis client creation process to ensure connection tests respect upstream timeouts.
There was a problem hiding this comment.
Pull request overview
This PR adds Redis Sentinel (high availability) support for the Redis-backed commitment queue by extending Redis configuration, switching the storage factory to build a Sentinel failover client when configured, and documenting/testing the new behavior.
Changes:
- Extend
RedisConfigwith Sentinel-specific fields and env var parsing (including a new comma-slice helper). - Update
internal/storage/factory.goto construct either a direct Redis client or a Sentinel failover client, with symmetric startup validation errors for misconfiguration. - Add unit tests for config parsing and factory validation, and document the new env vars in
README.md(plus achanges.txtentry).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| README.md | Documents Redis Sentinel env vars and usage example; updates Redis section wording. |
| internal/storage/factory.go | Adds Sentinel/direct branching and startup validation/logging for Redis commitment queue client construction. |
| internal/storage/factory_test.go | Adds unit tests for the two Sentinel/direct misconfiguration guards. |
| internal/config/config.go | Extends RedisConfig, wires new env vars, and adds getEnvStringSliceOrDefault. |
| internal/config/config_test.go | Adds tests for Sentinel env parsing, defaults, and the new string-slice helper. |
| changes.txt | Adds a changelog entry describing the Sentinel support work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Plumb context.Context through createRedisCommitmentQueue from NewStorage so the initial Redis Ping respects upstream timeouts and cancellation instead of using a fresh context.Background() (Gemini review comment). - Drop the ReplicaOnly knob entirely: the commitment queue is write-heavy (XADD, XACK, XReadGroup, XAutoClaim) and routing all commands to read-only replicas would break the queue with READONLY errors at runtime. Removing the knob is safer than documenting it as incompatible (Copilot review comment). - Correct two pre-existing stale defaults in the README's Storage Configuration table: REDIS_FLUSH_INTERVAL is 100ms (not 50ms) and REDIS_MAX_BATCH_SIZE is 5000 (not 2000). Spotted by Copilot while reviewing the Sentinel section in the same table. - Update changelog entry to reflect the final state.
Brings up redis master + redis-sentinel as separate containers on a shared Docker network, then exercises the same FailoverClient construction the production factory uses for REDIS_SENTINEL_ADDRS to verify the full Store -> Stream -> MarkProcessed pipeline works end-to-end through Sentinel-discovered master resolution. The factory itself is bypassed because Sentinel resolves the master alias to the container's internal IP at monitor time and returns that to clients, which is not reachable from the host where the test runs. A test-only Dialer redirects any port-6379 dial to the host-mapped master address; sentinel dials (port 26379) pass through. Factory validation logic is covered separately by factory_test.go.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Following review-driven simplification of the Sentinel options surface: - MinIdleConns is now wired through to both NewClient and NewFailoverClient. The field was already on RedisConfig and loaded from REDIS_MIN_IDLE_CONNS, but neither client constructor was passing it. Cold pool reconnects on burst traffic are a real cost for a high-throughput service; closing the gap is a one-line fix per branch. - RouteByLatency and RouteRandomly are removed from RedisConfig, env wiring, FailoverOptions, README and tests. Same shape of footgun as ReplicaOnly, narrower scope: most queue commands (XADD/XACK/ XReadGroup/XAutoClaim) are writes and hit master regardless, but XPendingExt IS a read-only command. With these flags on it would route to a replica and observe replication-lagged consumer-group state, causing silent re-delivery of already-acked entries. Better to not expose them at all than to document them as incompatible.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Closes #148.
Summary
RedisConfiggainsSentinelAddrs,MasterName,SentinelPassword,SentinelUsername(env varsREDIS_SENTINEL_ADDRS,REDIS_MASTER_NAME,REDIS_SENTINEL_PASSWORD,REDIS_SENTINEL_USERNAME).internal/storage/factory.go::createRedisCommitmentQueuebranches onlen(SentinelAddrs) > 0: Sentinel mode usesredislib.NewFailoverClient, otherwise the originalredislib.NewClientpath runs unchanged. Both constructors return*redis.Client, so no downstream signature changes. The factory now takes acontext.Contextso the initialPinghonors upstream timeouts.REDIS_SENTINEL_ADDRSwithoutREDIS_MASTER_NAME(and vice versa) returns a clear error at startup instead of silently falling back to single-endpoint mode.sentinel/direct) at queue construction time so HA wiring is verifiable from startup logs.REDIS_MIN_IDLE_CONNSenv var through both client constructors (it was previously loaded intoRedisConfigbut never passed to the redis options).changes.txthas a dated entry. Two pre-existing stale defaults in the README's Storage Configuration table (REDIS_FLUSH_INTERVAL,REDIS_MAX_BATCH_SIZE) are corrected to match the code.Why no
ReplicaOnly/RouteByLatency/RouteRandomlyknobsgo-redisexposes theseFailoverOptionsto send some or all commands to read-only replicas. Most queue commands (XADD/XACK/XReadGroup/XAutoClaim) are classified as writes and would hit master regardless, butXPendingExtis read-only — with any of these flags on it would route to a replica and observe replication-lagged consumer-group state, silently re-delivering already-acked entries. They are intentionally not exposed; there is no legitimate use case for any of them in this component.Test plan
go build ./...cleango vet ./...cleango test -count=1 ./internal/config/...— sentinel env parsing, defaults, comma-separated slice helpergo test -count=1 ./internal/storage/...— factory validation guards (both directions) plus existing storage tests including the testcontainer-backed Redis stream testsgo test -count=1 -race ./internal/config/... ./internal/storage/...cleango test -count=1 -run TestSentinelCommitmentQueueIntegration ./internal/storage/redis/— new integration test brings up redis master + redis-sentinel via testcontainers and runs the full Store → Stream → MarkProcessed pipeline through Sentinel-discovered master resolutionBackwards compatibility
Existing deployments using
REDIS_HOST/REDIS_PORTcontinue to work without any config change.UniversalClientwas considered but rejected in favor ofNewFailoverClientbecause it returns*redis.Clientdirectly, keepinginternal/storage/redis/commitment.goand other downstream consumers untouched.Out of scope
sentinel-compose.ymlexample deployment.