[DPE-9325]: Network HA tests by skourta · Pull Request #23 · canonical/valkey-operator

skourta · 2026-03-16T14:32:40Z

This pull request introduces several improvements focused on handling TLS certificate updates when a unit's IP address changes, enhances lock management logic, and adds support for advanced TLS configuration in integration tests. It also introduces new HA testing infrastructure for Kubernetes environments.

TLS certificate and IP change handling:

Added logic to detect IP address changes and trigger certificate regeneration or refresh, including a new _on_ip_change method in base_events.py and a custom exception TLSCertificatesRequireRefreshError. Certificates are now checked for SAN changes and updated if needed, with events emitted for client relations. (src/events/base_events.py, src/managers/tls.py, src/common/exceptions.py)
Updated integration tests to support TLS, including reading certificate files and configuring Glide clients with advanced TLS settings. (tests/integration/continuous_writes.py)

Lock management enhancements:

Improved lock handling by adding a timestamp for lock requests/releases, customizing the lock timestamp attribute per subclass, and refining the logic for determining the next unit to receive the lock. (src/common/locks.py)
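
The description does not spell out how the next lock holder is chosen. As a minimal sketch of one plausible reading (oldest request first, unit name as tie-breaker), assuming each unit publishes a request timestamp in peer data; `next_unit_for_lock` and its argument shape are hypothetical and not taken from src/common/locks.py:

```python
from typing import Dict, Optional


def next_unit_for_lock(requests: Dict[str, Optional[float]]) -> Optional[str]:
    """Pick the unit whose lock request is oldest (hypothetical selection rule).

    `requests` maps a unit name to the timestamp of its pending lock request,
    or to None when the unit is not currently asking for the lock.
    """
    pending = {unit: ts for unit, ts in requests.items() if ts is not None}
    if not pending:
        return None
    # Sort by timestamp first, then by unit name so ties resolve deterministically.
    return min(pending, key=lambda unit: (pending[unit], unit))
```

Ordering by timestamp keeps a unit that asked early from being starved by units that request the lock repeatedly.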

Kubernetes HA testing infrastructure:

Added new fixtures and helpers for deploying Chaos Mesh in Kubernetes environments to enable HA failure simulations, including a deployment script and network loss YAML. (tests/integration/ha/conftest.py, tests/integration/ha/helpers/deploy_chaos_mesh.sh, tests/integration/ha/helpers/chaos_network_loss.yml)

Configuration and dependency updates:

Added kubernetes dependency to pyproject.toml for K8S-related testing and operations. (pyproject.toml)
Updated config management to bind to the current endpoint and propagate IP changes to Valkey runtime configuration. (src/managers/config.py, src/managers/cluster.py)

These changes make certificate management and lock coordination more robust, and enable advanced HA testing in Kubernetes environments.

Copilot

Pull request overview

This PR extends the charm’s HA/integration testing capabilities (including K8S network-failure simulation) and improves runtime behavior around IP changes by coordinating restarts and refreshing/regenerating TLS certificates when SANs no longer match.

Changes:

Add HA “network cut” integration tests for VM and K8S substrates (with TLS on/off variants) and supporting Chaos Mesh tooling.
Enhance TLS/IP-change handling by detecting SAN drift and triggering cert regeneration/refresh plus a coordinated restart workflow.
Improve lock coordination by introducing a restart lock and related peer state fields; add K8S client dependency for integration helpers.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 15 comments.

File	Description
`tests/unit/test_charm.py`	Updates unit tests and adds coverage for IP-change behavior without a TLS relation.
`tests/spread/vm/test_network_cut_tls_on.py/task.yaml`	Spread task to run VM HA network cut test with TLS enabled.
`tests/spread/vm/test_network_cut_tls_off.py/task.yaml`	Spread task to run VM HA network cut test with TLS disabled.
`tests/spread/k8s/test_network_cut_tls_on.py/task.yaml`	Spread task to run K8S HA network cut test with TLS enabled.
`tests/spread/k8s/test_network_cut_tls_off.py/task.yaml`	Spread task to run K8S HA network cut test with TLS disabled.
`tests/integration/helpers.py`	Adds TLS-aware CLI execution, more robust primary detection, and a helper to read unit IP.
`tests/integration/ha/test_network_cut.py`	New HA scenario test validating failover/replica counts and TLS SAN updates after IP change.
`tests/integration/ha/helpers/helpers.py`	New HA helper utilities for cutting/restoring network on VM/K8S (Chaos Mesh + kubernetes client).
`tests/integration/ha/helpers/deploy_chaos_mesh.sh`	Script to deploy Chaos Mesh in test namespaces.
`tests/integration/ha/helpers/destroy_chaos_mesh.sh`	Script to remove Chaos Mesh resources after tests.
`tests/integration/ha/helpers/chaos_network_loss.yml`	NetworkChaos manifest template to simulate packet loss.
`tests/integration/ha/conftest.py`	Adds fixture to deploy/destroy Chaos Mesh for K8S runs.
`tests/integration/cw_helpers.py`	Adds TLS toggle to continuous-writes assertion helper.
`tests/integration/continuous_writes.py`	Adds advanced TLS client configuration support for continuous writes.
`tests/integration/conftest.py`	Adds async cleanup fixture for continuous writes.
`src/statuses.py`	Adds new maintenance statuses for unhealthy restarts.
`src/managers/tls.py`	Adds SAN parsing and comparison to decide when certs require regeneration.
`src/managers/config.py`	Ensures replica config uses the current endpoint binding.
`src/events/tls.py`	Observes config-changed to trigger cert refresh/regeneration and restart on IP change.
`src/events/base_events.py`	Introduces restart event + restart lock usage and restart orchestration logic.
`src/core/models.py`	Adds peer fields for restart lock coordination.
`src/common/locks.py`	Adds restart lock and timestamps for lock request/release updates.
`pyproject.toml`	Adds `kubernetes` dependency for K8S HA test tooling.
`poetry.lock`	Updates lockfile to include new dependencies and Poetry version metadata.

tests/integration/ha/helpers/helpers.py

+        subprocess.check_output(
+            f"microk8s kubectl -n {model_name} delete networkchaos network-loss-primary",
+            shell=True,
+            env=env,
+        )
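
As an illustration only, the same kubectl call could be issued without `shell=True` by passing an argument list, so that `model_name` is never parsed by a shell. This is a sketch under assumptions: `build_delete_chaos_cmd` and `delete_network_chaos` are hypothetical names, and the resource name is copied from the hunk above.

```python
import subprocess
from typing import Dict, List


def build_delete_chaos_cmd(model_name: str, resource: str = "network-loss-primary") -> List[str]:
    """Return the kubectl invocation as an argument list (hypothetical helper)."""
    return ["microk8s", "kubectl", "-n", model_name, "delete", "networkchaos", resource]


def delete_network_chaos(model_name: str, env: Dict[str, str]) -> bytes:
    """Delete the NetworkChaos resource; raises CalledProcessError on failure."""
    # With a list and the default shell=False, each element reaches kubectl
    # as exactly one argument, whatever characters it contains.
    return subprocess.check_output(build_delete_chaos_cmd(model_name), env=env)
```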


src/events/base_events.py

+        if event.restart_valkey and not self.charm.cluster_manager.is_healthy(
+            check_replica_sync=False
+        ):
+            self.charm.status.set_running_status(
+                ClusterStatuses.VALKEY_UNHEALTHY_RESTART.value,
+                scope="unit",
+                component_name=self.charm.cluster_manager.name,
+                statuses_state=self.charm.state.statuses,
+            )
+
+        self.charm.state.statuses.delete(
+            ClusterStatuses.VALKEY_UNHEALTHY_RESTART.value,
+            scope="unit",
+            component=self.charm.cluster_manager.name,
+        )
+
+        if event.restart_sentinel and not self.charm.sentinel_manager.is_healthy():
+            self.charm.status.set_running_status(
+                ClusterStatuses.SENTINEL_UNHEALTHY_RESTART.value,
+                scope="unit",
+                component_name=self.charm.cluster_manager.name,
+                statuses_state=self.charm.state.statuses,
+            )
+
+        self.charm.state.statuses.delete(
+            ClusterStatuses.SENTINEL_UNHEALTHY_RESTART.value,
+            scope="unit",
+            component=self.charm.cluster_manager.name,
+        )
+


tests/integration/ha/helpers/helpers.py

+from logging import getLogger
+
+import jubilant
+import kubernetes as kubernetes


tests/integration/ha/helpers/destroy_chaos_mesh.sh

+    echo "deleting api-resources"
+    for i in $(kubectl api-resources | grep chaos-mesh | awk '{print $1}'); do timeout 30 kubectl delete "${i}" --all --all-namespaces || :; done
+
+    if [ "$(kubectl -n "${chaos_mesh_ns}" get mutatingwebhookconfiguration | grep -c 'choas-mesh-mutation')" = "1" ]; then


src/managers/tls.py

+                san_type, san_value = sans.split(":")
+
+                if san_type.strip() == "DNS":
+                    sans_dns.add(san_value)
+                if san_type.strip() == "IP Address":
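
A note on the split above: `sans.split(":")` yields more than two parts for any value that itself contains a colon, such as an IPv6 address, and the tuple unpacking then raises `ValueError`. A minimal sketch of a more tolerant parser, assuming the input is a comma-separated line such as `DNS:valkey-0, IP Address:10.1.2.3`; `parse_sans` is a hypothetical name, not the function in src/managers/tls.py:

```python
from typing import Set, Tuple


def parse_sans(san_line: str) -> Tuple[Set[str], Set[str]]:
    """Split a SAN line into (DNS names, IP addresses)."""
    sans_dns: Set[str] = set()
    sans_ip: Set[str] = set()
    for entry in san_line.split(","):
        # partition() splits on the first colon only, so IPv6 values survive.
        san_type, sep, san_value = entry.partition(":")
        if not sep:
            continue  # skip entries that have no "type:value" shape
        san_type, san_value = san_type.strip(), san_value.strip()
        if san_type == "DNS":
            sans_dns.add(san_value)
        elif san_type == "IP Address":
            sans_ip.add(san_value)
    return sans_dns, sans_ip
```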


tests/integration/helpers.py

+
+def get_ip_from_unit(juju: jubilant.Juju, unit_name: str) -> str:
+    """Get the IP address of a unit based on the substrate type."""
+    return juju.exec("unit-get", "private-address", unit=unit_name).stdout.strip()


tests/integration/ha/helpers/destroy_chaos_mesh.sh

+    if [ "$(kubectl get clusterrole | grep 'chaos-mesh' | awk '{print $1}' | wc -l)" != "0" ]; then
+        echo "deleting clusterroles"
+        timeout 30 kubectl delete clusterrole "$(kubectl get clusterrole | grep 'chaos-mesh' | awk '{print $1}')" || :


tests/integration/ha/helpers/helpers.py

+        env = os.environ
+        env["KUBECONFIG"] = os.path.expanduser("~/.kube/config")
+        subprocess.check_output(
+            f"microk8s kubectl -n {model_name} delete networkchaos network-loss-primary",


tests/integration/ha/helpers/helpers.py

+        env = os.environ
+        env["KUBECONFIG"] = os.path.expanduser("~/.kube/config")
+        try:
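
Note that `env = os.environ` binds the process-wide mapping rather than a copy, so setting `KUBECONFIG` on it changes the environment of the whole test process. A minimal sketch of building an isolated environment instead; `kubeconfig_env` is a hypothetical helper and the default path is the one used in the hunk:

```python
import os
from typing import Dict


def kubeconfig_env(kubeconfig: str = "~/.kube/config") -> Dict[str, str]:
    """Return a copy of the current environment with KUBECONFIG set."""
    env = os.environ.copy()  # plain dict, detached from os.environ
    env["KUBECONFIG"] = os.path.expanduser(kubeconfig)
    return env
```

The returned dict can be passed as `env=` to `subprocess` calls without leaking the variable into later tests.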


reneradoi

Very nice, especially the test coverage and the restart lock! I only have a few comments, but the general concept looks good already.

reneradoi · 2026-03-20T15:14:12Z

src/events/tls.py

        self.charm.cluster_manager.reload_tls_settings(tls_config)
        self.charm.sentinel_manager.restart_service()

+    def _on_config_changed(self, event: ops.ConfigChangedEvent) -> None:


Separating concerns is nice and I really like it! After reading the code, though, I think I need to take back my earlier comment: it makes more sense to keep this in the same event handler and method. Splitting it out seems overly complicated just to adhere to the structure, and now half of the logic here is unrelated to TLS concerns. So let's move this code to base_events._on_config_changed() and have a clean implementation there.

reneradoi · 2026-03-20T15:22:41Z

src/events/base_events.py

+                statuses_state=self.charm.state.statuses,
+            )
+
+        self.charm.state.statuses.delete(


Does this reset a status that might just previously have been set? Or am I missing something here?
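
To make the question concrete: in the quoted hunk the `delete` call sits at the same indentation as the `if`, so it runs whether or not the status was just set. If both calls act on the same status store, an either/or shape would avoid the immediate reset. The sketch below models the store as a plain set; `sync_unhealthy_status` and its parameters are hypothetical and do not reflect the charm's status API.

```python
from typing import Set


def sync_unhealthy_status(
    active: Set[str], status: str, restart_requested: bool, healthy: bool
) -> None:
    """Set `status` when a requested restart left the service unhealthy, else clear it."""
    if restart_requested and not healthy:
        active.add(status)
    else:
        # Only clear when the unhealthy condition does not hold.
        active.discard(status)
```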

reneradoi · 2026-03-20T15:44:59Z

src/events/base_events.py

+    """Event for restarting the workload when certain events happen, e.g. IP change."""
+
+    def __init__(
+        self, handle: ops.Handle, restart_valkey: bool = True, restart_sentinel: bool = True


Nice! This will certainly be useful!

reneradoi · 2026-03-20T15:47:39Z

src/events/base_events.py

+        if event.restart_valkey:
+            self.charm.workload.restart(self.charm.workload.valkey_service)
+        if event.restart_sentinel:
+            self.charm.sentinel_manager.restart_service()
+


We need to add error handling here in case these methods raise, otherwise the event would crash.
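
A minimal sketch of what such handling could look like, assuming the restart calls are wrapped as plain callables. `restart_services` is a hypothetical helper; it catches `Exception` only because the concrete exception types raised by the workload are not shown in this hunk, and real code would likely narrow that.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def restart_services(restarts: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each restart callable and return the names of those that raised."""
    failed: List[str] = []
    for name, restart in restarts.items():
        try:
            restart()
        except Exception:  # assumption: narrow to the workload's own errors
            logger.exception("Failed to restart %s", name)
            failed.append(name)
    return failed
```

The handler could then set an appropriate status or defer the event when the returned list is non-empty, instead of letting the hook fail.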

reneradoi · 2026-03-20T15:59:01Z

tests/integration/ha/helpers/destroy_chaos_mesh.sh

@@ -0,0 +1,52 @@
+#!/bin/bash
+
+# Utility script to removing chaosmesh from the K8S cluster, to clean up test artefacts


If available, can you please add the source of these deploy/destroy scripts as a comment so that we can cross check in case of future failures?

reneradoi · 2026-03-20T16:13:07Z

tests/integration/ha/test_network_cut.py

+    for unit in juju.status().apps[APP_NAME].units:
+        if unit == primary_unit_name:
+            continue
+        assert not is_unit_reachable(


In addition to that, we should also make sure that the connectivity from the controller is lost, see here for reference: https://github.com/canonical/charmed-etcd-operator/blob/3.6/edge/tests/integration/ha/test_network_cut.py#L236

skourta added 30 commits January 19, 2026 06:42

disbale default user and add charmed-operator user and password gener…

5fb720b

…ation on leader elected

add secret handlign and config for admin password

93f8b41

bind to 0.0.0.0

a0e62d4

switch to glide

c7caead

add unit tests

af42d57

add integeration tests

8889bd5

add install deps to ci unit tests

7157121

add sudo to apt

90750a1

install protobug for glide on integration tests

6a80e46

auto approve installing deps

301e627

update rust

2be061c

sudo apt

e2ea39f

set default rust on spread

07353c3

save acl after udpating password so the change persists across restarts

a8a2f18

feedback from rene

87c443e

switch updating password to write acl file and then load it

2cd5c8b

implement feedback

1f73be7

add different charm users

b51956d

update passwords on non leader units

1615306

chagne scope of status for units and fix exception catching

7aa4505

fixing unit tests WIP

c63f21e

Merge branch '9/edge' into create-all-charm-users

9093a80

small charm restructure and enahnce unit tests

d8e2754

fix integration tests

fc9c9d3

add wrong username update test

3bc8774

fix copilot feedback

b812847

fix unit tests

913f85f

add charm sentinel user

073f087

initial scale up implementation

f6a8489

Merge branch 'DPE-9135-create-all-charm-users' into DPE-9174-scale-up

9d36b0f

skourta added 21 commits March 16, 2026 15:26

clean cwrites even when test fails

a0c6017

remove f strings in loggers

d00a189

charm level feedback

abe43b9

rename ip to endpoint and add existing app

d801de9

add support for existing app in scale tests

4dc6340

patch is_failover_in_progress

4e399a8

Merge branch 'DPE-9324-scale-down' into dpe-9325-network-ha

335100b

simplify code

b229e7d

fix bug and unit tests

be5bd08

add network cut without ip change for vm

d62ea9b

Merge branch '9/edge' into DPE-9324-scale-down

cb1138d

only remove APP_NAME in tests

a931e7c

minor feedback

d0aeff6

Merge branch 'DPE-9324-scale-down' into dpe-9325-network-ha

abc1996

small refactor

dacaaba

fix bug in config gen

a883967

run tls on k8s too

2a78390

add tls on for k8s

1ca51a3

remove skip on build and deploy

3fbf5c9

do not crash if deletion on key fails on valkey on cw clearing

e61cf2c

add rolling restart for ip change

b9b961a

Base automatically changed from DPE-9324-scale-down to 9/edge March 18, 2026 12:14

Merge branch '9/edge' into dpe-9325-network-ha

95c6a04

skourta marked this pull request as ready for review March 18, 2026 14:55

skourta requested review from Mehdi-Bendriss, Copilot and reneradoi March 18, 2026 14:55

Copilot started reviewing on behalf of skourta March 18, 2026 14:55

Copilot AI reviewed Mar 18, 2026

reneradoi reviewed Mar 20, 2026

skourta wants to merge 185 commits into 9/edge from dpe-9325-network-ha

3 participants
