Skip to content

[DPE-9519]: add quorum reconfig based on cluster size#21

Open
skourta wants to merge 174 commits into9/edgefrom
dpe-9510-quorum-management
Open

[DPE-9519]: add quorum reconfig based on cluster size#21
skourta wants to merge 174 commits into9/edgefrom
dpe-9510-quorum-management

Conversation

@skourta
Copy link
Contributor

@skourta skourta commented Mar 12, 2026

This pull request introduces dynamic quorum management for the sentinel cluster, ensuring that the quorum value automatically adjusts based on the number of active units in the cluster. It also adds robust event handling to trigger quorum reconfiguration when units join or depart, and updates integration tests to verify correct quorum behavior during scaling operations.

Dynamic quorum management:

  • Added a quorum property to ConfigManager that calculates the required quorum based on the number of active units, replacing the static QUORUM_NUMBER constant. [1] [2]
  • Updated sentinel configuration generation to use the dynamic quorum value, ensuring the monitor config always reflects the current cluster state.

Sentinel quorum operations:

  • Implemented get_configured_quorum and set_quorum methods in SentinelManager to retrieve and update the quorum value via the CLI, and added a generic set method to the client for configuration changes. [1] [2]
  • Added _reconfigure_quorum_if_necessary method to handle quorum updates, and integrated it with peer relation change and departure event handlers for automatic reconfiguration. [1] [2] [3]

Testing enhancements:

  • Introduced a new get_quorum helper and updated integration tests to assert correct quorum values after scaling up and down the cluster, improving test coverage for HA scenarios. [1] [2] [3] [4] [5] [6] [7]
  • Modified the CLI execution helper to support custom ports and JSON output for sentinel queries.

These changes collectively ensure that the sentinel cluster maintains the correct quorum configuration as the cluster size changes, improving reliability and correctness in high-availability deployments.

@skourta skourta marked this pull request as ready for review March 12, 2026 13:38
skourta added 2 commits March 13, 2026 15:16
This pull request introduces a significant refactor to how the codebase
handles primary endpoint addressing, shifting from using only IP
addresses to supporting both IP addresses and hostnames, depending on
the deployment substrate. It updates configuration management and event
handling to use the new `primary_endpoint` concept, adds hostname
resolution to Sentinel management, and ensures better compatibility with
non-VM substrates. Additionally, it introduces a utility function for IP
validation and improves Sentinel configuration for hostname-based
communication.

Key changes include:

**Primary Endpoint Refactor and Configuration Management:**

- Refactored all configuration and service management logic in
`config.py` and related event handlers to use a new `primary_endpoint`
parameter (which can be a hostname or IP) instead of just `primary_ip`.
This includes updating method signatures, internal logic, and how
endpoints are determined based on the substrate (VM vs. Kubernetes).
[[1]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL106-R106)
[[2]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL140-R147)
[[3]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL156-R160)
[[4]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL50-R50)
[[5]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL85-R88)
[[6]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL97-R122)
[[7]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL196-R205)
[[8]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL244-R253)
[[9]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL253-R268)
[[10]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cR284-R292)
[[11]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL338-R361)
[[12]](diffhunk://#diff-3dfb4a104b407f7d4b2e07795a63745a09cde228a8dbe86602e7b72ea600fbe4L161-R166)
[[13]](diffhunk://#diff-3dfb4a104b407f7d4b2e07795a63745a09cde228a8dbe86602e7b72ea600fbe4L209-R212)

- Updated event-driven updates to always use FQDN hostnames instead of
short hostnames, improving consistency and DNS compatibility.
[[1]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL106-R106)
[[2]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL248-R256)
[[3]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL290-R298)

**Sentinel Management Improvements:**

- Enhanced Sentinel management to resolve hostnames to IPs when
necessary, using the new `is_valid_ip` helper, and improved logic for
retrieving active Sentinel IPs.
[[1]](diffhunk://#diff-b226109a257f8cc9cb6b0a4a1eb4c1c730d2cc9620744b6855e4f3a96ca3041dR22)
[[2]](diffhunk://#diff-b226109a257f8cc9cb6b0a4a1eb4c1c730d2cc9620744b6855e4f3a96ca3041dL272-R285)

- Added `resolve-hostnames` and `announce-hostnames` options to the
generated Sentinel configuration to enable hostname-based communication
within the cluster.

**Utility Enhancements:**

- Introduced a new `is_valid_ip` helper function in `common/helpers.py`
to reliably check if a string is a valid IP address, supporting the
above refactors and Sentinel logic.

These changes collectively improve the charm's flexibility and
reliability in heterogeneous environments, especially for Kubernetes and
other non-VM substrates.
Copy link
Contributor

@reneradoi reneradoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections from my side. The Sentinel quorum is only used for sentinel down detection and doesn't impact failover voting, so we could even be a bit more relaxed and only update this in the start workflow and on the departure events.

Base automatically changed from DPE-9324-scale-down to 9/edge March 18, 2026 12:14
Copilot AI review requested due to automatic review settings March 18, 2026 13:50
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds dynamic Sentinel quorum management so the configured quorum automatically adjusts to the current cluster size, and updates unit/integration tests to validate quorum behavior during scaling and lifecycle events.

Changes:

  • Replaced the static Sentinel quorum constant with a computed ConfigManager.quorum and used it in sentinel config generation.
  • Added Sentinel quorum read/update operations (get_configured_quorum, set_quorum, and SentinelClient.set) and hooked automatic reconfiguration into peer relation changed/departed events.
  • Enhanced integration/unit tests with quorum assertions and helper utilities (plus minor integration timeout adjustments).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/managers/config.py Computes quorum dynamically from active units and uses it in sentinel monitor config.
src/common/client.py Adds SentinelClient.set() to update Sentinel config via CLI.
src/managers/sentinel.py Adds methods to read and write the currently configured sentinel quorum.
src/events/base_events.py Triggers quorum reconfiguration on peer relation changed/departed.
tests/unit/test_charm.py Adds unit tests covering quorum update/no-update behavior; adjusts patches for new quorum reads.
tests/unit/test_tls.py Updates TLS rotation unit tests to patch Sentinel quorum reads.
tests/integration/helpers.py Extends CLI helper to support custom ports/JSON and adds get_quorum() helper.
tests/integration/ha/test_scaling.py Adds quorum assertions across initial deploy/scale up/scale down paths.
tests/integration/continuous_writes.py Increases Glide request timeout to reduce flakiness during scaling tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +323 to +328
return int(client.primary(self.state.bind_address)["quorum"])

def set_quorum(self, quorum: int) -> None:
"""Set the quorum for the sentinel cluster."""
client = self._get_sentinel_client()
client.set(self.state.bind_address, PRIMARY_NAME, "quorum", str(quorum))
def set_quorum(self, quorum: int) -> None:
"""Set the quorum for the sentinel cluster."""
client = self._get_sentinel_client()
client.set(self.state.bind_address, PRIMARY_NAME, "quorum", str(quorum))
Comment on lines +488 to +497
def get_quorum(juju: jubilant.Juju, unit_name: str) -> int:
"""Get the currently configured sentinel quorum."""
status = juju.status()
model_info = juju.show_model()
units = status.get_units(APP_NAME)
unit_endpoint = (
units[unit_name].public_address
if model_info.type != "kubernetes"
else units[unit_name].address
)
Comment on lines +498 to +504
result = exec_valkey_cli(
hostname=unit_endpoint,
username=CharmUsers.SENTINEL_CHARM_ADMIN.value,
password=get_password(juju, user=CharmUsers.SENTINEL_CHARM_ADMIN),
command="SENTINEL primary primary",
port=SENTINEL_PORT,
json=True,
Comment on lines +547 to +556
if self.charm.sentinel_manager.get_configured_quorum() != self.charm.config_manager.quorum:
logger.debug("Updating sentinel quorum to match current cluster size")
try:
self.charm.sentinel_manager.set_quorum(self.charm.config_manager.quorum)
self.charm.config_manager.set_sentinel_config_properties(
self.charm.sentinel_manager.get_primary_ip()
)
except ValkeyWorkloadCommandError as e:
logger.error(f"Failed to update sentinel quorum: {e}")
# not critical to defer here, we can wait for the next relation change or config change to try again
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants