Skip to content

[DPE-9325]: Network HA tests#23

Open
skourta wants to merge 185 commits into9/edgefrom
dpe-9325-network-ha
Open

[DPE-9325]: Network HA tests#23
skourta wants to merge 185 commits into9/edgefrom
dpe-9325-network-ha

Conversation

@skourta
Copy link
Contributor

@skourta skourta commented Mar 16, 2026

This pull request introduces several improvements focused on handling TLS certificate updates when a unit's IP address changes, enhances lock management logic, and adds support for advanced TLS configuration in integration tests. It also introduces new HA testing infrastructure for Kubernetes environments.

TLS certificate and IP change handling:

  • Added logic to detect IP address changes and trigger certificate regeneration or refresh, including a new _on_ip_change method in base_events.py and a custom exception TLSCertificatesRequireRefreshError. Certificates are now checked for SAN changes and updated if needed, with events emitted for client relations. (src/events/base_events.py, src/managers/tls.py, src/common/exceptions.py, [1] [2] [3] [4]
  • Updated integration tests to support TLS, including reading certificate files and configuring Glide clients with advanced TLS settings. (tests/integration/continuous_writes.py, [1] [2] [3]

Lock management enhancements:

  • Improved lock handling by adding a timestamp for lock requests/releases, customizing lock timestamp attribute per subclass, and refining logic for determining the next unit to receive the lock. (src/common/locks.py, [1] [2] [3] [4] [5]

Kubernetes HA testing infrastructure:

  • Added new fixtures and helpers for deploying Chaos Mesh in Kubernetes environments to enable HA failure simulations, including a deployment script and network loss YAML. (tests/integration/ha/conftest.py, tests/integration/ha/helpers/deploy_chaos_mesh.sh, tests/integration/ha/helpers/chaos_network_loss.yml, [1] [2] [3]

Configuration and dependency updates:

  • Added kubernetes dependency to pyproject.toml for K8S-related testing and operations. (pyproject.toml, pyproject.tomlR57)
  • Updated config management to bind to the current endpoint and propagate IP changes to Valkey runtime configuration. (src/managers/config.py, src/managers/cluster.py, [1] [2]

These changes improve the robustness of certificate management, lock coordination, and enable advanced HA testing capabilities in Kubernetes environments.

Base automatically changed from DPE-9324-scale-down to 9/edge March 18, 2026 12:14
@skourta skourta marked this pull request as ready for review March 18, 2026 14:55
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the charm’s HA/integration testing capabilities (including K8S network-failure simulation) and improves runtime behavior around IP changes by coordinating restarts and refreshing/regenerating TLS certificates when SANs no longer match.

Changes:

  • Add HA “network cut” integration tests for VM and K8S substrates (with TLS on/off variants) and supporting Chaos Mesh tooling.
  • Enhance TLS/IP-change handling by detecting SAN drift and triggering cert regeneration/refresh plus a coordinated restart workflow.
  • Improve lock coordination by introducing a restart lock and related peer state fields; add K8S client dependency for integration helpers.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
tests/unit/test_charm.py Updates unit tests and adds coverage for IP-change behavior without a TLS relation.
tests/spread/vm/test_network_cut_tls_on.py/task.yaml Spread task to run VM HA network cut test with TLS enabled.
tests/spread/vm/test_network_cut_tls_off.py/task.yaml Spread task to run VM HA network cut test with TLS disabled.
tests/spread/k8s/test_network_cut_tls_on.py/task.yaml Spread task to run K8S HA network cut test with TLS enabled.
tests/spread/k8s/test_network_cut_tls_off.py/task.yaml Spread task to run K8S HA network cut test with TLS disabled.
tests/integration/helpers.py Adds TLS-aware CLI execution, more robust primary detection, and a helper to read unit IP.
tests/integration/ha/test_network_cut.py New HA scenario test validating failover/replica counts and TLS SAN updates after IP change.
tests/integration/ha/helpers/helpers.py New HA helper utilities for cutting/restoring network on VM/K8S (Chaos Mesh + kubernetes client).
tests/integration/ha/helpers/deploy_chaos_mesh.sh Script to deploy Chaos Mesh in test namespaces.
tests/integration/ha/helpers/destroy_chaos_mesh.sh Script to remove Chaos Mesh resources after tests.
tests/integration/ha/helpers/chaos_network_loss.yml NetworkChaos manifest template to simulate packet loss.
tests/integration/ha/conftest.py Adds fixture to deploy/destroy Chaos Mesh for K8S runs.
tests/integration/cw_helpers.py Adds TLS toggle to continuous-writes assertion helper.
tests/integration/continuous_writes.py Adds advanced TLS client configuration support for continuous writes.
tests/integration/conftest.py Adds async cleanup fixture for continuous writes.
src/statuses.py Adds new maintenance statuses for unhealthy restarts.
src/managers/tls.py Adds SAN parsing and comparison to decide when certs require regeneration.
src/managers/config.py Ensures replica config uses the current endpoint binding.
src/events/tls.py Observes config-changed to trigger cert refresh/regeneration and restart on IP change.
src/events/base_events.py Introduces restart event + restart lock usage and restart orchestration logic.
src/core/models.py Adds peer fields for restart lock coordination.
src/common/locks.py Adds restart lock and timestamps for lock request/release updates.
pyproject.toml Adds kubernetes dependency for K8S HA test tooling.
poetry.lock Updates lockfile to include new dependencies and Poetry version metadata.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +125 to +129
subprocess.check_output(
f"microk8s kubectl -n {model_name} delete networkchaos network-loss-primary",
shell=True,
env=env,
)
Comment on lines +575 to +604
if event.restart_valkey and not self.charm.cluster_manager.is_healthy(
check_replica_sync=False
):
self.charm.status.set_running_status(
ClusterStatuses.VALKEY_UNHEALTHY_RESTART.value,
scope="unit",
component_name=self.charm.cluster_manager.name,
statuses_state=self.charm.state.statuses,
)

self.charm.state.statuses.delete(
ClusterStatuses.VALKEY_UNHEALTHY_RESTART.value,
scope="unit",
component=self.charm.cluster_manager.name,
)

if event.restart_sentinel and not self.charm.sentinel_manager.is_healthy():
self.charm.status.set_running_status(
ClusterStatuses.SENTINEL_UNHEALTHY_RESTART.value,
scope="unit",
component_name=self.charm.cluster_manager.name,
statuses_state=self.charm.state.statuses,
)

self.charm.state.statuses.delete(
ClusterStatuses.SENTINEL_UNHEALTHY_RESTART.value,
scope="unit",
component=self.charm.cluster_manager.name,
)

from logging import getLogger

import jubilant
import kubernetes as kubernetes
echo "deleting api-resources"
for i in $(kubectl api-resources | grep chaos-mesh | awk '{print $1}'); do timeout 30 kubectl delete "${i}" --all --all-namespaces || :; done

if [ "$(kubectl -n "${chaos_mesh_ns}" get mutatingwebhookconfiguration | grep -c 'choas-mesh-mutation')" = "1" ]; then
Comment on lines +575 to +604
if event.restart_valkey and not self.charm.cluster_manager.is_healthy(
check_replica_sync=False
):
self.charm.status.set_running_status(
ClusterStatuses.VALKEY_UNHEALTHY_RESTART.value,
scope="unit",
component_name=self.charm.cluster_manager.name,
statuses_state=self.charm.state.statuses,
)

self.charm.state.statuses.delete(
ClusterStatuses.VALKEY_UNHEALTHY_RESTART.value,
scope="unit",
component=self.charm.cluster_manager.name,
)

if event.restart_sentinel and not self.charm.sentinel_manager.is_healthy():
self.charm.status.set_running_status(
ClusterStatuses.SENTINEL_UNHEALTHY_RESTART.value,
scope="unit",
component_name=self.charm.cluster_manager.name,
statuses_state=self.charm.state.statuses,
)

self.charm.state.statuses.delete(
ClusterStatuses.SENTINEL_UNHEALTHY_RESTART.value,
scope="unit",
component=self.charm.cluster_manager.name,
)

Comment on lines +272 to +276
san_type, san_value = sans.split(":")

if san_type.strip() == "DNS":
sans_dns.add(san_value)
if san_type.strip() == "IP Address":

def get_ip_from_unit(juju: jubilant.Juju, unit_name: str) -> str:
"""Get the IP address of a unit based on the substrate type."""
return juju.exec("unit-get", "private-address", unit=unit_name).stdout.strip()
Comment on lines +35 to +37
if [ "$(kubectl get clusterrole | grep 'chaos-mesh' | awk '{print $1}' | wc -l)" != "0" ]; then
echo "deleting clusterroles"
timeout 30 kubectl delete clusterrole "$(kubectl get clusterrole | grep 'chaos-mesh' | awk '{print $1}')" || :
env = os.environ
env["KUBECONFIG"] = os.path.expanduser("~/.kube/config")
subprocess.check_output(
f"microk8s kubectl -n {model_name} delete networkchaos network-loss-primary",
Comment on lines +72 to +74
env = os.environ
env["KUBECONFIG"] = os.path.expanduser("~/.kube/config")
try:
Copy link
Contributor

@reneradoi reneradoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, especially the test coverage and the restart lock! I only have a few comments, but general concept looks good already.

self.charm.cluster_manager.reload_tls_settings(tls_config)
self.charm.sentinel_manager.restart_service()

def _on_config_changed(self, event: ops.ConfigChangedEvent) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separating concerns is nice and I really like it! After reading the code though I think I need to revert my earlier comment and admit that it makes more sense to have it in the same event handler and method. Reason being that it seems to be overly complicated just to adhere to the structure, and now half of the logic here is not related to TLS concerns. Therefore, I'd say let's move this code to the base_events._on_config_changed() and have a clean implementation there.

statuses_state=self.charm.state.statuses,
)

self.charm.state.statuses.delete(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this reset a status that might just previously have been set? Or am I missing something here?

"""Event for restarting the workload when certain events happen, e.g. IP change."""

def __init__(
self, handle: ops.Handle, restart_valkey: bool = True, restart_sentinel: bool = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This will certainly be useful!

Comment on lines +570 to +574
if event.restart_valkey:
self.charm.workload.restart(self.charm.workload.valkey_service)
if event.restart_sentinel:
self.charm.sentinel_manager.restart_service()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add error handling here in case these methods raise, otherwise the event would crash.

@@ -0,0 +1,52 @@
#!/bin/bash

# Utility script to removing chaosmesh from the K8S cluster, to clean up test artefacts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If available, can you please add the source of these deploy/destroy scripts as a comment so that we can cross check in case of future failures?

for unit in juju.status().apps[APP_NAME].units:
if unit == primary_unit_name:
continue
assert not is_unit_reachable(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to that, we should also make sure that the connectivity from the controller is lost, see here for reference: https://github.com/canonical/charmed-etcd-operator/blob/3.6/edge/tests/integration/ha/test_network_cut.py#L236

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants