Skip to content

[DPE-9326]: Add HA Failover tests#28

Open
skourta wants to merge 11 commits intodpe-9325-network-hafrom
dpe-9326-failover-tests
Open

[DPE-9326]: Add HA Failover tests#28
skourta wants to merge 11 commits intodpe-9325-network-hafrom
dpe-9326-failover-tests

Conversation

@skourta
Copy link
Contributor

@skourta skourta commented Mar 19, 2026

This pull request introduces a coordinated restart lock mechanism for workload restarts, ensures safe handling of IP/hostname changes (especially on VMs), and improves certificate SAN handling for TLS. It also adds new status reporting for unhealthy restarts and refines lock management for start and restart operations. The changes are grouped below by theme.

Restart Lock and Event Handling:

  • Introduced a new RestartLock class (subclassing DataBagLock) to coordinate restarts across units, with associated model fields (restart_member, request_restart_lock) and lock logic. This prevents concurrent restarts and ensures safe rolling restarts. [1] [2] [3]
  • Added a RestartWorkloadEvent and corresponding event source/handler in BaseEvents, which triggers coordinated restarts of Valkey and Sentinel services, checks health post-restart, and sets or clears new unhealthy restart statuses. [1] [2] [3] [4] [5]
  • Updated peer relation and leader election handlers to process both start and restart locks. [1] [2]

TLS Certificate and IP/Hostname Change Handling:

  • Implemented logic to detect when the private IP/hostname changes (on VMs), and trigger certificate regeneration or refresh if SANs have changed, using new methods get_current_sans and certificate_sans_require_update in the TLS manager. [1] [2]
  • On IP/hostname changes, the charm now emits a coordinated restart event (on VMs) to ensure services pick up new certificates and configuration.
  • Added observation of config_changed in TLS event handler for this purpose.

Lock Management Improvements:

  • Refined DataBagLock and StartLock logic: added dynamic timestamp attribute, improved lock selection, and clarified lock-free checks for both start and restart operations. [1] [2] [3] [4] [5]
  • Adjusted start event flow to defer only when necessary and ensure lock acquisition is handled after pre-checks. [1] [2]

Status and Configuration Reporting:

  • Added new status objects for reporting unhealthy states after restarts (VALKEY_UNHEALTHY_RESTART, SENTINEL_UNHEALTHY_RESTART).
  • Minor config logic tweaks for endpoint and replica configuration. [1] [2]

Dependency Update:

  • Added kubernetes as a dependency in pyproject.toml.

@skourta skourta changed the base branch from 9/edge to dpe-9325-network-ha March 19, 2026 16:27
@skourta skourta marked this pull request as ready for review March 19, 2026 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant