Skip to content

patch: advanced rolling ops using etcd#364

Open
patriciareinoso wants to merge 10 commits intocanonical:mainfrom
patriciareinoso:DPE-9349-rolling-ops
Open

patch: advanced rolling ops using etcd#364
patriciareinoso wants to merge 10 commits intocanonical:mainfrom
patriciareinoso:DPE-9349-rolling-ops

Conversation

@patriciareinoso
Copy link

@patriciareinoso patriciareinoso commented Mar 16, 2026

Context

All the code on this PR is new

This implementation is based on DA241 Spec

  • The content of charmlibs/advanced-rollingops/src/charmlibs/advanced_rollingops/_dp_interfaces_v1.py belongs to another library that is currently being migrated to charmlibs so you can ignore it for now.

Summary

This PR is the first part of the implementation of advanced rolling ops at cluster level.

This PR includes:

  • Management of the client certificate used to connect to etcd
    • The leader unit creates a self-signed certificate with live of 20 years
    • Share the certificate with the other units using a peer relation
  • Implementation of the integration with etcd
    • Share the mtls certificate
    • Observe the resource_created event
    • Observe the endpoints_changed event
  • Management of a env file needed to connecto etcd via etcdctl

This PR does not implement the locking mechanism. In here we only test that we can connect to etcd from each unit.

Current workflow:

  1. The unit make a request
  2. A new background process is spawn
  3. The background process dispatches a Juju hook
  4. The unit observes that hook
  5. The unit writes and read a value in etcd
  6. If the unit was able to connect to etcd, it executes the "restart" function.

This is a very simplified workflow to be able to test that the units from different apps can reach etcd.

To do

  • Replace the content of charmlibs/advanced-rollingops/src/charmlibs/advanced_rollingops/_dp_interfaces_v1.py with the the actual library once it is migrated to charmlibs
  • Implement the actual locking mechanism
  • Figure out how to properly install etcdctl

@james-garner-canonical
Copy link
Collaborator

james-garner-canonical commented Mar 16, 2026

Hi @patriciareinoso thanks for opening this PR. I haven't looked through the code yet, but I wanted to start with a request for some additional context first.

Please update the PR to describe whether this is a migration of an existing codebase or should be treated as all new. If it's a migration, please link to the existing library, and describe what's changing/open to change. If there was a spec involved in the design of this new lib, please also link to it.

You've probably already seen it, but I just want to also drop a link to our migration guide https://documentation.ubuntu.com/charmlibs/how-to/migrate/

If you'd like to talk about any of this synchronously, feel free to set up a meeting, my availability is in my calendar. Otherwise we'll proceed with asynchronous review when we have the context.

PS: If the PR isn't ready for review yet, you can mark it as a draft and manually re-request review when ready.

Copy link

@Gu1nness Gu1nness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly questions around security and try/catch errors.
Excellent work!


ca_key = PrivateKey.generate(key_size=4096)
ca_attributes = CertificateRequestAttributes(
common_name='rollingops-client-ca', is_ca=True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't that be common_name ? Or something decided by the charm integrating ?
Or is it because it can't be chosen in a distributed way ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to Neha's comment on the common_name, it might be worth adding add_unique_id_to_subject_name=False to the attributes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinning to this file but it's a more generic question: Is is possible for a unit to request a lock for another unit by "hacking" the etcdctl calls that request a lock ? What would be the consequences ?
Is there a way to protect from that ?

os.chmod(cls.ENV_FILE, 0o600)

@classmethod
def load_env(cls) -> dict[str, str]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function scares me.
I trust that it probably works but it does scare me that we're trying to parse some exports that way.
Why do we need to go through a dedicated file to do that ?
It seems to me that we're writing to the file just to read it as an environment provider for the command runner.

Isn't there another way to do that?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameters that are passed as env=cls.load_env() to the subprocess command could instead be appended to the command itself (example).

This would eliminate the need to write and load an env file, but would require access to relation data at runtime.

Comment on lines +103 to +105
def _on_install(self, event: InstallEvent) -> None:
subprocess.run(['apt-get', 'update'], check=True)
subprocess.run(['apt-get', 'install', '-y', 'etcd-client'], check=True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use https://github.com/canonical/charmlibs/tree/main/apt ?
Not sure this would work in airgapped env also.
Maybe good as a first start but that raises the question of migration in the future.

"""Stop the etcd worker process in the current unit."""
unit = event.departing_unit
if unit == self.model.unit:
self.worker.stop()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to be sure:

  • worker spawns one process and dies
  • Process triggers one dispatch and dies
  • Dispatch from the process prevents this event (relation-departed) from running (because only one event at a time).
    Is it to cover race conditions ? Or are the workers incomplete for now ?

@Gu1nness
Copy link

Also I think you need to add a changelog file (cf https://documentation.ubuntu.com/charmlibs/explanation/charmlibs-publishing/ )

james-garner-canonical added a commit that referenced this pull request Mar 17, 2026
#364 has revealed that CI breaks with packages with a `requires-python`
lower bound that's higher than 3.10 (it has `requires-python>=3.12`).

This is because we would run `just integration-k8s` and `just
integration-machine` in the `init` step to check if they have any tests,
and that uses our default `justfile` Python version of 3.10.

This PR fixes this by recording the minimum supported Python version
when we calculate the Python versions to test the library with, and
using that version in the failing step.
@james-garner-canonical
Copy link
Collaborator

CI failure should be fixed in #366

@james-garner-canonical
Copy link
Collaborator

Integration tests should be fixed for real now with #373 merged, please merge main to pull in the fix.

Copy link

@reneradoi reneradoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the implementation looks good for the scope of the PR. I've left a few comments, mostly about error handling, and a few questions.


ca_key = PrivateKey.generate(key_size=4096)
ca_attributes = CertificateRequestAttributes(
common_name='rollingops-client-ca', is_ca=True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to Neha's comment on the common_name, it might be worth adding add_unique_id_to_subject_name=False to the attributes.

client_key = PrivateKey.generate(key_size=4096)

csr_attributes = CertificateRequestAttributes(
common_name=common_name, add_unique_id_to_subject_name=False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to add Subject Alternative Names, as certificate verification based on common name has been deprecated.

os.chmod(cls.ENV_FILE, 0o600)

@classmethod
def load_env(cls) -> dict[str, str]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameters that are passed as env=cls.load_env() to the subprocess command could instead be appended to the command itself (example).

This would eliminate the need to write and load an env file, but would require access to relation data at runtime.

)

@classmethod
def get_first_key_value(cls, key_prefix: str) -> tuple[str, dict[str, str]] | None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be used - what is it needed for?


pid = int(pid_str)
try:
os.kill(pid, signal.SIGINT)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use SIGTERM instead, as SIGINT usually means keyboard interrupt?

app_data = relation.data[self.model.app]
secret_id = app_data.get(SECRET_FIELD)
if secret_id:
return

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified with := and would be nice to have a comment here to understand that this means the certificate already exists.

logger.error('No etcd endpoints available')
return

self.shared_certificates.sync_to_local_files()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why updating certs if endpoints on etcd side have changed?

"""
cls.BASE_DIR.mkdir(parents=True, exist_ok=True)

cls.SERVER_CA.write_text(tls_ca_pem or '')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't do or '' to be honest. It means silently continuing in a situation that is problematic, rather make it apparent that there is no CA.

cls.ENV_FILE.unlink(missing_ok=True)

@classmethod
def run(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment on the subprocess.run command: This might raise a CalledProcessError if check is enabled, these errors should be caught.

I think it is in general a code practice to always have check=true and catch the error instead of checking the output of the run command (see also in the other methods of this class).

return env

@classmethod
def ensure_initialized(cls):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this method may raise, we should make sure to catch the errors and act appropriately where it is called, otherwise this could result in uncaught errors at runtime.

Comment on lines +303 to +306
try:
import psycopg2
except ImportError:
psycopg2 = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about vendoring this library as is.

From what I can see, postgres is not used by the rolling ops functionality, and this import is referenced in a method that's not ultimately used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants