Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(slurmctld): Support InfluxDB #70

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jamesbeedy
Copy link
Contributor

@jamesbeedy jamesbeedy commented Jan 27, 2025

These changes add and modify code pertaining to the support of InfluxDB for a job profiling database.

Added pip requirements: netifaces-plus, influxdb

Changes:

  • cluster_name changed to a public property of the charm and is now stored in the peer-relation application data.

  • Removed default cluster-name charm config and replaced with generated name "charmed-hpc-XXXX" if no charm config is supplied.

  • Added _on_start charm event to allow slurmctld-peer interface time to join after _on_install hook completes and before the slurm.conf is written (now in _on_start()).

  • Added influxdb interface for relating to influxdb charm.

  • Improve the influxdb relation by using relation-joined instead of relation-changed and fix an error with departing units stuck in the new_nodes stored state by removing them from new_nodes when a unit departs.

  • Add new_node stored state reconciliation on slurmd_departed for nodes that have been removed from the relation but remain in new_nodes in stored state.

  • Fix the slurmdbd joined/changed race by adding logic to the config rendering that will disallow the AltAuth/jwt_key from making it into the configuration until we actually have the key from slurmctld.

  • Remove unused slurmd override template.

  • Update slurmdbd.pid filepath and add slurmdbd override.

  • Adjusted unit tests to account for new cluster_name changes.

  • Added slurmctld-peer interface for storing cluster_name.

  • Drive by fix for slurmdbd failing to start and false positive status with extra conditional in check_status, Fixes Slurmdbd false positive active status #74 and Slurmdbd failing to start #75

Fixes: #52, #56, #18, #56
Depends on: charmed-hpc/hpc-libs#66, charmed-hpc/hpc-libs#65, charmed-hpc/hpc-libs#64

TODO: Integration tests

@jamesbeedy jamesbeedy changed the title Support InfluxDB feat(slurmctld): Support InfluxDB Jan 29, 2025
@jamesbeedy jamesbeedy marked this pull request as ready for review January 29, 2025 23:01
@jamesbeedy jamesbeedy requested a review from a team as a code owner January 29, 2025 23:01
@jamesbeedy jamesbeedy requested review from jedel1043 and dsloanm and removed request for a team January 29, 2025 23:01
@NucciTheBoss NucciTheBoss self-requested a review January 30, 2025 14:21
@NucciTheBoss
Copy link
Member

@jamesbeedy we're gonna take a look at your PR this week! We have some documentation stuff that we're cooking up

@jamesbeedy
Copy link
Contributor Author

@jamesbeedy we're gonna take a look at your PR this week! We have some documentation stuff that we're cooking up

Sweet! Thanks!

@NucciTheBoss
Copy link
Member

Hey @jamesbeedy, this looks good so far. Can you just do a git rebase origin/main real quick? Code looks good but I'm noticing that some changes in this PR here look like they're duplicating previous commits that have been merged already.

@jamesbeedy jamesbeedy force-pushed the influxdb_interface branch 3 times, most recently from 484b186 to 51f12e3 Compare February 4, 2025 19:25
@NucciTheBoss NucciTheBoss added feature This pull request adds a new feature to the project C-slurm Component: Slurm and removed enhancement labels Feb 4, 2025
@jamesbeedy jamesbeedy force-pushed the influxdb_interface branch 2 times, most recently from f7318d1 to 0789dd7 Compare February 9, 2025 17:07
These changes add and modify code pertaining to
the support of InfluxDB for a job profiling database.

Added pip requirements: netifaces-plus, influxdb

Changes:
* cluster_name changed to a public property of the charm and is now stored in the peer-relation application data.

* Removed default cluster-name charm config and replaced with generated name "charmed-hpc-XXXX" if no charm config is supplied.

* Added _on_start charm event to allow slurmctld-peer interface time to join after _on_start hook is called and before the slurm.conf is written.

* Added influxdb interface for relating to influxdb charm.

* Adjusted unit tests to account for new cluster_name changes.

* Added slurmctld-peer interface for storing cluster_name.

Fixes: charmed-hpc#71

influxdb relation and new_nodes accounting

These changes improve the influxdb relation by using
relation-joined instead of relation-changed and fix an error
with departing units stuck in the new_nodes stored state by
removing them from new_nodes when a unit departs.

* Add logging for new_node reconcile accounting.

* These changes fix the slurmdbd joined/changed race by
adding logic to the config rendering that will disallow
the AltAuth/jwt_key from making it into the configuration
until we actually have the key from slurmctld.

* Remove unused slurmd overried template.

* Update slurmdbd.pid filepath and add slurmdbd override.

* Update tests/integration/test_charm.py

* Do not defer the joined event if we aren't the leader

These changes add two tests around setting the cluster_name.

1) test that the correct error is raised when setting the
cluster_name and no peer-relation exists

2) test that the start hook does the correct things when setting
the cluster_name fails

Note: integration test removed from this pr because it was added
      in charmed-hpc#78

* handle non-leader write to application data case

These changes add the raising of an error if the cluster_name
is set by a non-leader in the slurmctld-peer relation and
associated unit test.

* add JobAcctGatherType configuration

The JobAcctGatherType parameter tells slurm how to collect
profiling data. Set it to cgroup or linux depending on
is_container() or not.

* use influxdb relation data to store admin_info

Use the influxdb relation data to store admin_info instead
of StoredState. This enalbes slurmctld to be swapped out with
a new unit and persist the pre-existing influxdb information to be
used by the new slurmctld unit.

Drive by fix for charmed-hpc#79

* add test sbatch script and integration test

These changes add an integration test and associated sbatch script
to verify task level accounting works.

* stop sackd following installation until we have a relation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-slurm Component: Slurm feature This pull request adds a new feature to the project
Projects
None yet
2 participants