
Conversation

YuryHrytsuk
Collaborator

@YuryHrytsuk YuryHrytsuk commented Aug 14, 2025

What do these changes do?

Related issue/s

Related PR/s

Devops Actions ⚠️

  • create new docker swarm overlay network for rabbit
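A sketch of that action, guarded so it only attempts anything against an active swarm; the network name `rabbit` is illustrative, not necessarily the deployment's real name:

```shell
#!/bin/sh
# Create the dedicated overlay network for the RabbitMQ cluster.
if command -v docker >/dev/null 2>&1 \
   && docker info 2>/dev/null | grep -q 'Swarm: active'; then
  # --attachable also lets standalone containers (e.g. ad-hoc CLI
  # tooling) join the overlay network
  docker network inspect rabbit >/dev/null 2>&1 \
    || docker network create --driver overlay --attachable rabbit
  status="network ensured"
else
  status="no active docker swarm; skipping"
fi
echo "$status"
```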

Prerequisites

Checklist

  • I tested and it works

New stack

  • The Stack has been included in CI Workflow

New service

  • Service has resource limits and reservations
  • Service has placement constraints or is global
  • Service is restartable --> the cluster consists of multiple nodes (each node is a separate docker service); nodes can be restarted one at a time, or more if the cluster's Raft quorum allows it
  • Service restart is zero-downtime --> a cluster of 3+ nodes can survive a single-node restart (the node can be put into maintenance mode to be on the safe side)
  • Service has >1 replicas in PROD
  • Service has docker healthcheck enabled
  • Service is monitored (via prometheus and grafana) --> to be done in the next PR, when we switch from rabbit (in the simcore stack) to the cluster rabbit introduced here
  • Service is not bound to one specific node (e.g. via files or volumes) --> it is bound because of its volumes; there is no way around this in our docker swarm setup
  • Relevant OPS E2E Tests are added (e2e test of rabbit state)
  • Grafana dashboards updated accordingly --> we already have a dashboard that should support the cluster; to be tested in another PR, when we switch from simcore rabbit to the clustered rabbit introduced here

If exposed via traefik

  • Service's Public URL is included in maintenance mode --> unrelated
  • Service's Public URL is included in testing mode --> unrelated
  • Service has Traefik (Service Loadbalancer) healthcheck enabled --> the haproxy healthcheck monitors the rabbit nodes
  • Credentials page is updated --> to be updated in another PR, when we switch traffic to this rabbit cluster
  • Url added to e2e test services (e2e test checking that the URL can be accessed) --> to be done when we switch traffic
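The haproxy healthcheck mentioned above can be sketched roughly as follows; the listener name, node hostnames, and check intervals are illustrative, not the deployment's actual values:

```
listen rabbitmq_amqp
    bind *:5672
    mode tcp
    balance roundrobin
    # mark a node down after 3 failed TCP checks, up again after 2 good ones
    server rabbit-node-1 rabbit-node-1:5672 check inter 5s fall 3 rise 2
    server rabbit-node-2 rabbit-node-2:5672 check inter 5s fall 3 rise 2
    server rabbit-node-3 rabbit-node-3:5672 check inter 5s fall 3 rise 2
```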

@YuryHrytsuk YuryHrytsuk self-assigned this Aug 14, 2025
@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Aug 14, 2025

TODO

  • document how to put node under maintenance --> readme
  • support single node cluster (for local or tiny deployments) --> done via jinja and iterating over node count
  • document how to update erlang cookie (auth secret to access rabbit nodes with CLI client)
  • document autoscaling (joining nodes dynamically on demand) --> not supported at the moment
  • how to properly add / remove nodes? --> readme
  • test rabbit node count >= 3 --> covered by a repo config-values unit test
  • how to apply new settings in rabbitmq.conf on a running cluster --> not supported (more in readme)
    • avoid causing restart of containers because of config sha change --> drop the sha part of the config name so that docker does not update the service on config changes
  • add e2e test monitoring health of the cluster --> ops e2e test added
  • run haproxy highly available --> 2+ replicas running
  • make down (reasonable behaviour) --> simply remove the stack but not the volumes; add an extra target to clean volumes
  • applying changes via CI Pipelines --> a deploy rabbit job is added
    • start the cluster fresh with empty volumes; add more later if there is a need. Otherwise rely on manual operations if it comes to that (manually using makefile targets)
  • restarting (rabbitmq node) service --> document behaviour
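The "put node under maintenance" item above can be sketched as a small helper; it assumes RabbitMQ 3.8+ maintenance mode and that the script runs inside the node's container (tool availability is checked so the sketch degrades gracefully elsewhere):

```shell
#!/bin/sh
# Put one cluster node under maintenance around a restart.
if command -v rabbitmq-upgrade >/dev/null 2>&1; then
  # Drain: transfer queue leaderships away, close client connections,
  # and mark the node as under maintenance
  rabbitmq-upgrade drain
  # ... restart or maintain the node here ...
  # Revive: clear maintenance mode so the node takes work again
  rabbitmq-upgrade revive
  drained="yes"
else
  echo "rabbitmq-upgrade not found; run this inside a RabbitMQ container"
  drained="no"
fi
```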

Clients should properly use HA rabbit

  • configure default replication factor for quorum queues? --> via rabbitmq.conf
  • how to connect to a multi-node cluster --> it is hidden by haproxy (loadbalancer) --> no changes
  • for backenders: make sure clients retry connection on failure

Cluster Formation

Source https://www.rabbitmq.com/docs/clustering


Ways of Forming a Cluster

  • Declaratively by listing cluster nodes in config file <--- we use
  • Declaratively using DNS-based discovery
  • Declaratively using AWS (EC2) instance discovery
  • Declaratively using Kubernetes discovery
  • Declaratively using Consul-based discovery
  • Declaratively using etcd-based discovery
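For the approach we use (nodes listed in the config file), the rabbitmq.conf side looks roughly like this; the node names are illustrative:

```ini
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rabbit@rabbit-node-1
cluster_formation.classic_config.nodes.2 = rabbit@rabbit-node-2
cluster_formation.classic_config.nodes.3 = rabbit@rabbit-node-3
```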

Node Names (Identifiers)

  • must be unique --> achieved via docker service name and env variable

Cluster Formation Requirements

  • every cluster member must be able to resolve hostnames of every other cluster member, its own hostname, as well as machines on which command line tools such as rabbitmqctl might be used --> docker swarm networking

Ports That Must Be Opened for Clustering and Replication --> all works by default in docker swarm (all ports allowed)

  • 4369: epmd, a helper discovery daemon used by RabbitMQ nodes and CLI tools
  • 6000 through 6500: used by RabbitMQ Stream replication
  • 25672: used for inter-node and CLI tools communication and is allocated from a dynamic range (limited to a single port by default, computed as AMQP port + 20000)
  • 35672-35682: used by CLI tools for communication with nodes and is allocated from a dynamic range

Nodes in a Cluster

  • Nodes are Equal Peers

For two nodes to be able to communicate they must have the same shared secret called the Erlang cookie.

  • Erlang cookie generation should be done at cluster deployment stage ⚠️ --> achieved via common secret
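One way the common secret can be wired in a swarm compose file, sketched below; the service and secret names and the uid/gid are illustrative (the official image also accepts the deprecated RABBITMQ_ERLANG_COOKIE environment variable):

```yaml
secrets:
  rabbitmq_erlang_cookie:
    external: true   # created once at deployment time, e.g. from /dev/urandom

services:
  rabbit-node-1:
    image: rabbitmq:3-management
    secrets:
      # every node mounts the same cookie file, so Erlang distribution
      # (clustering and CLI tools) can authenticate between nodes
      - source: rabbitmq_erlang_cookie
        target: /var/lib/rabbitmq/.erlang.cookie
        uid: "999"   # rabbitmq user in the official image
        gid: "999"
        mode: 0400
```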

Node Counts and Quorum:

  • Two node clusters are highly recommended against --> added a test to forbid a 2-node cluster configuration
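The guard test mentioned above can be sketched like this; the function name is illustrative (the real check lives in the repo's config-values unit tests):

```python
def validate_cluster_size(node_count: int) -> None:
    """Reject cluster sizes that cannot hold a quorum after one failure.

    Two-node clusters lose quorum as soon as either node goes down,
    so only a single node (dev) or three or more nodes are allowed.
    """
    if node_count < 1:
        raise ValueError("cluster needs at least one node")
    if node_count == 2:
        raise ValueError(
            "two-node RabbitMQ clusters are recommended against: "
            "use 1 node (dev) or 3+ nodes (odd count preferred)"
        )

validate_cluster_size(1)  # single-node dev deployment: fine
validate_cluster_size(3)  # smallest HA deployment: fine
```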

Clustering and Clients

Messaging Protocols

  • In case of a node failure, clients should be able to reconnect to a different node, recover their topology and continue operation --> Task for backenders
  • Most client libraries accept a list of endpoints --> we use loadbalancer and 1 endpoint

Stream Clients

  • RabbitMQ Stream protocol clients behave differently from messaging protocol clients --> not relevant for us

Queue and Stream Leader Replica Placement

Quorum queues

  • grow the queue once a node holding a replica goes down?
  • test that queues have 3+ replicas
  • increase replicas if there are only 2?
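A sketch of the rabbitmq.conf knob behind the "default replication factor" point above; the value is illustrative:

```ini
# new quorum queues get this many replicas by default
quorum_queue.initial_cluster_size = 3
```

Replicas of existing quorum queues can later be extended onto a new node with `rabbitmq-queues grow <node> all`.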

Cleaning volumes

  • Avoid tasks taking unlimited space --> do not retry jobs + always remove stack before starting new tasks
  • Avoid unexpected volume removal
    • Deleting volumes failed but tasks keep running --> do not retry jobs + use timeouts
    • Deleting volumes unrelated to rabbit (safeguards) --> added

HA Proxy highly available

  • running 2+ replicas and statistics --> we do not expose / use statistics at the beginning

@YuryHrytsuk YuryHrytsuk changed the title from "Add ha rabbit" to "Add ha rabbit (but not use it)" Aug 28, 2025
@YuryHrytsuk YuryHrytsuk changed the title from "Add ha rabbit (but not use it)" to "Add (ha) rabbit cluster" Aug 28, 2025
@YuryHrytsuk YuryHrytsuk changed the title from "Add (ha) rabbit cluster" to "Add (ha) rabbit cluster (but not use it)" Sep 3, 2025
@YuryHrytsuk YuryHrytsuk changed the title from "Add (ha) rabbit cluster (but not use it)" to "Add (ha) rabbit cluster (but not use it) ⚠️" Sep 4, 2025