
Conversation

YuryHrytsuk
Collaborator

@YuryHrytsuk YuryHrytsuk commented Aug 14, 2025

What do these changes do?

Related issue/s

Related PR/s

Devops Actions ⚠️

  • create new docker swarm overlay network for rabbit
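A sketch of that action, guarded so it only attempts anything against an active swarm; the network name `rabbit` is illustrative, not necessarily the deployment's real name:

```shell
#!/bin/sh
# Create the dedicated overlay network for the RabbitMQ cluster.
if command -v docker >/dev/null 2>&1 \
   && docker info 2>/dev/null | grep -q 'Swarm: active'; then
  # --attachable also lets standalone containers (e.g. ad-hoc CLI
  # tooling) join the overlay network
  docker network inspect rabbit >/dev/null 2>&1 \
    || docker network create --driver overlay --attachable rabbit
  status="network ensured"
else
  status="no active docker swarm; skipping"
fi
echo "$status"
```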

Prerequisites

Checklist

  • I tested and it works

New stack

  • The Stack has been included in CI Workflow

New service

  • Service has resource limits and reservations
  • Service has placement constraints or is global
  • Service is restartable --> the cluster consists of multiple nodes (each node is a separate docker service); nodes can be restarted one at a time, or more if the cluster's Raft quorum allows it
  • Service restart is zero-downtime --> a cluster of 3+ nodes can survive a single-node restart (the node can be put into maintenance mode to be on the safe side)
  • Service has >1 replicas in PROD
  • Service has docker healthcheck enabled
  • Service is monitored (via prometheus and grafana) --> to be done in the next PR, when we switch from rabbit (in the simcore stack) to the cluster rabbit introduced here
  • Service is not bound to one specific node (e.g. via files or volumes) --> it is bound because of its volumes; there is no way around this in our docker swarm setup
  • Relevant OPS E2E Tests are added (e2e test of rabbit state)
  • Grafana dashboards updated accordingly --> we already have a dashboard that should support the cluster; to be tested in another PR, when we switch from simcore rabbit to the clustered rabbit introduced here

If exposed via traefik

  • Service's Public URL is included in maintenance mode --> unrelated
  • Service's Public URL is included in testing mode --> unrelated
  • Service has Traefik (Service Loadbalancer) healthcheck enabled --> the haproxy healthcheck monitors the rabbit nodes
  • Credentials page is updated --> to be updated in another PR, when we switch traffic to this rabbit cluster
  • Url added to e2e test services (e2e test checking that the URL can be accessed) --> to be done when we switch traffic
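The haproxy healthcheck mentioned above can be sketched roughly as follows; the listener name, node hostnames, and check intervals are illustrative, not the deployment's actual values:

```
listen rabbitmq_amqp
    bind *:5672
    mode tcp
    balance roundrobin
    # mark a node down after 3 failed TCP checks, up again after 2 good ones
    server rabbit-node-1 rabbit-node-1:5672 check inter 5s fall 3 rise 2
    server rabbit-node-2 rabbit-node-2:5672 check inter 5s fall 3 rise 2
    server rabbit-node-3 rabbit-node-3:5672 check inter 5s fall 3 rise 2
```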

@YuryHrytsuk YuryHrytsuk self-assigned this Aug 14, 2025
@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Aug 14, 2025

TODO

  • document how to put node under maintenance --> readme
  • support single node cluster (for local or tiny deployments) --> done via jinja and iterating over node count
  • document how to update erlang cookie (auth secret to access rabbit nodes with CLI client)
  • document autoscaling (joining nodes dynamically on demand) --> not supported at the moment
  • how to properly add / remove nodes? --> readme
  • test rabbit node count >= 3 --> covered by a repo config-values unit test
  • how to apply new settings in rabbitmq.conf on a running cluster --> not supported (more in readme)
    • avoid causing restart of containers because of config sha change --> drop the sha part of the config name so that docker does not update the service on config changes
  • add e2e test monitoring health of the cluster --> ops e2e test added
  • run haproxy highly available --> 2+ replicas running
  • make down (reasonable behaviour) --> simply remove the stack but not the volumes; add an extra target to clean volumes
  • applying changes via CI Pipelines --> a deploy rabbit job is added
    • start the cluster fresh with empty volumes; add more later if there is a need. Otherwise rely on manual operations if it comes to that (manually using makefile targets)
  • restarting (rabbitmq node) service --> document behaviour
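The "put node under maintenance" item above can be sketched as a small helper; it assumes RabbitMQ 3.8+ maintenance mode and that the script runs inside the node's container (tool availability is checked so the sketch degrades gracefully elsewhere):

```shell
#!/bin/sh
# Put one cluster node under maintenance around a restart.
if command -v rabbitmq-upgrade >/dev/null 2>&1; then
  # Drain: transfer queue leaderships away, close client connections,
  # and mark the node as under maintenance
  rabbitmq-upgrade drain
  # ... restart or maintain the node here ...
  # Revive: clear maintenance mode so the node takes work again
  rabbitmq-upgrade revive
  drained="yes"
else
  echo "rabbitmq-upgrade not found; run this inside a RabbitMQ container"
  drained="no"
fi
```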

Clients should properly use HA rabbit

  • configure default replication factor for quorum queues? --> via rabbitmq.conf
  • how to connect to a multi-node cluster --> it is hidden by haproxy (loadbalancer) --> no changes
  • for backenders: make sure clients retry connection on failure

Cluster Formation

Source https://www.rabbitmq.com/docs/clustering


Ways of Forming a Cluster

  • Declaratively by listing cluster nodes in config file <--- we use
  • Declaratively using DNS-based discovery
  • Declaratively using AWS (EC2) instance discovery
  • Declaratively using Kubernetes discovery
  • Declaratively using Consul-based discovery
  • Declaratively using etcd-based discovery
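For the approach we use (nodes listed in the config file), the rabbitmq.conf side looks roughly like this; the node names are illustrative:

```ini
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rabbit@rabbit-node-1
cluster_formation.classic_config.nodes.2 = rabbit@rabbit-node-2
cluster_formation.classic_config.nodes.3 = rabbit@rabbit-node-3
```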

Node Names (Identifiers)

  • must be unique --> achieved via docker service name and env variable

Cluster Formation Requirements

  • every cluster member must be able to resolve hostnames of every other cluster member, its own hostname, as well as machines on which command line tools such as rabbitmqctl might be used --> docker swarm networking

Ports That Must Be Opened for Clustering and Replication --> all works by default in docker swarm (all ports allowed)

  • 4369: epmd, a helper discovery daemon used by RabbitMQ nodes and CLI tools
  • 6000 through 6500: used by RabbitMQ Stream replication
  • 25672: used for inter-node and CLI tools communication and is allocated from a dynamic range (limited to a single port by default, computed as AMQP port + 20000)
  • 35672-35682: used by CLI tools for communication with nodes and is allocated from a dynamic range

Nodes in a Cluster

  • Nodes are Equal Peers

For two nodes to be able to communicate they must have the same shared secret called the Erlang cookie.

  • Erlang cookie generation should be done at cluster deployment stage ⚠️ --> achieved via common secret
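One way the common secret can be wired in a swarm compose file, sketched below; the service and secret names and the uid/gid are illustrative (the official image also accepts the deprecated RABBITMQ_ERLANG_COOKIE environment variable):

```yaml
secrets:
  rabbitmq_erlang_cookie:
    external: true   # created once at deployment time, e.g. from /dev/urandom

services:
  rabbit-node-1:
    image: rabbitmq:3-management
    secrets:
      # every node mounts the same cookie file, so Erlang distribution
      # (clustering and CLI tools) can authenticate between nodes
      - source: rabbitmq_erlang_cookie
        target: /var/lib/rabbitmq/.erlang.cookie
        uid: "999"   # rabbitmq user in the official image
        gid: "999"
        mode: 0400
```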

Node Counts and Quorum:

  • Two node clusters are highly recommended against --> added a test to forbid a 2-node cluster configuration
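The guard test mentioned above can be sketched like this; the function name is illustrative (the real check lives in the repo's config-values unit tests):

```python
def validate_cluster_size(node_count: int) -> None:
    """Reject cluster sizes that cannot hold a quorum after one failure.

    Two-node clusters lose quorum as soon as either node goes down,
    so only a single node (dev) or three or more nodes are allowed.
    """
    if node_count < 1:
        raise ValueError("cluster needs at least one node")
    if node_count == 2:
        raise ValueError(
            "two-node RabbitMQ clusters are recommended against: "
            "use 1 node (dev) or 3+ nodes (odd count preferred)"
        )

validate_cluster_size(1)  # single-node dev deployment: fine
validate_cluster_size(3)  # smallest HA deployment: fine
```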

Clustering and Clients

Messaging Protocols

  • In case of a node failure, clients should be able to reconnect to a different node, recover their topology and continue operation --> Task for backenders
  • Most client libraries accept a list of endpoints --> we use loadbalancer and 1 endpoint

Stream Clients

  • RabbitMQ Stream protocol clients behave differently from messaging protocol clients --> not relevant for us

Queue and Stream Leader Replica Placement

Quorum queues

  • grow the queue once a node holding a replica goes down?
  • test that queues have 3+ replicas
  • increase replicas if there are only 2?
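A sketch of the rabbitmq.conf knob behind the "default replication factor" point above; the value is illustrative:

```ini
# new quorum queues get this many replicas by default
quorum_queue.initial_cluster_size = 3
```

Replicas of existing quorum queues can later be extended onto a new node with `rabbitmq-queues grow <node> all`.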

Cleaning volumes

  • Avoid tasks taking unlimited space --> do not retry jobs + always remove stack before starting new tasks
  • Avoid unexpected volume removal
    • Deleting volumes failed but tasks keep running --> do not retry jobs + use timeouts
    • Deleting volumes unrelated to rabbit (safeguards) --> added

HA Proxy highly available

  • running 2+ replicas and statistics --> we do not expose / use statistics at the beginning

@YuryHrytsuk YuryHrytsuk changed the title from "Add ha rabbit" to "Add ha rabbit (but not use it)" Aug 28, 2025
@YuryHrytsuk YuryHrytsuk changed the title from "Add ha rabbit (but not use it)" to "Add (ha) rabbit cluster" Aug 28, 2025
@YuryHrytsuk YuryHrytsuk changed the title from "Add (ha) rabbit cluster" to "Add (ha) rabbit cluster (but not use it)" Sep 3, 2025
@YuryHrytsuk YuryHrytsuk changed the title from "Add (ha) rabbit cluster (but not use it)" to "Add (ha) rabbit cluster (but not use it) ⚠️" Sep 4, 2025