-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(slurmctld): Support InfluxDB #70
base: main
Are you sure you want to change the base?
Conversation
5cd7aac
to
552329a
Compare
@jamesbeedy we're gonna take a look at your PR this week! We have some documentation stuff that we're cooking up |
Sweet! Thanks! |
5b50989
to
167dcd6
Compare
Hey @jamesbeedy, this looks good so far. Can you just do a |
484b186
to
51f12e3
Compare
f7318d1
to
0789dd7
Compare
These changes add and modify code pertaining to the support of InfluxDB for a job profiling database. Added pip requirements: netifaces-plus, influxdb Changes: * cluster_name changed to a public property of the charm and is now stored in the peer-relation application data. * Removed default cluster-name charm config and replaced with generated name "charmed-hpc-XXXX" if no charm config is supplied. * Added _on_start charm event to allow slurmctld-peer interface time to join after _on_start hook is called and before the slurm.conf is written. * Added influxdb interface for relating to influxdb charm. * Adjusted unit tests to account for new cluster_name changes. * Added slurmctld-peer interface for storing cluster_name. Fixes: charmed-hpc#71 influxdb relation and new_nodes accounting These changes improve the influxdb relation by using relation-joined instead of relation-changed and fix an error with departing units stuck in the new_nodes stored state by removing them from new_nodes when a unit departs. * Add logging for new_node reconcile accounting. * These changes fix the slurmdbd joined/changed race by adding logic to the config rendering that will disallow the AltAuth/jwt_key from making it into the configuration until we actually have the key from slurmctld. * Remove unused slurmd overried template. * Update slurmdbd.pid filepath and add slurmdbd override. * Update tests/integration/test_charm.py * Do not defer the joined event if we aren't the leader These changes add two tests around setting the cluster_name. 1) test that the correct error is raised when setting the cluster_name and no peer-relation exists 2) test that the start hook does the correct things when setting the cluster_name fails Note: integration test removed from this pr because it was added in charmed-hpc#78 * handle non-leader write to application data case These changes add the raising of an error if the cluster_name is set by a non-leader in the slurmctld-peer relation and associated unit test. * add JobAcctGatherType configuration The JobAcctGatherType parameter tells slurm how to collect profiling data. Set it to cgroup or linux depending on is_container() or not. * use influxdb relation data to store admin_info Use the influxdb relation data to store admin_info instead of StoredState. This enalbes slurmctld to be swapped out with a new unit and persist the pre-existing influxdb information to be used by the new slurmctld unit. Drive by fix for charmed-hpc#79 * add test sbatch script and integration test These changes add an integration test and associated sbatch script to verify task level accounting works. * stop sackd following installation until we have a relation
5bb4206
to
2409a9e
Compare
These changes add and modify code pertaining to the support of InfluxDB for a job profiling database.
Added pip requirements: netifaces-plus, influxdb
Changes:
cluster_name changed to a public property of the charm and is now stored in the peer-relation application data.
Removed default cluster-name charm config and replaced with generated name "charmed-hpc-XXXX" if no charm config is supplied.
Added _on_start charm event to allow slurmctld-peer interface time to join after _on_install hook completes and before the slurm.conf is written (now in _on_start()).
Added influxdb interface for relating to influxdb charm.
Improve the influxdb relation by using relation-joined instead of relation-changed and fix an error with departing units stuck in the new_nodes stored state by removing them from new_nodes when a unit departs.
Add new_node stored state reconciliation on slurmd_departed for nodes that have been removed from the relation but remain in new_nodes in stored state.
Fix the slurmdbd joined/changed race by adding logic to the config rendering that will disallow the AltAuth/jwt_key from making it into the configuration until we actually have the key from slurmctld.
Remove unused slurmd override template.
Update slurmdbd.pid filepath and add slurmdbd override.
Adjusted unit tests to account for new cluster_name changes.
Added slurmctld-peer interface for storing cluster_name.
Drive by fix for slurmdbd failing to start and false positive status with extra conditional in check_status, Fixes Slurmdbd false positive active status #74 and Slurmdbd failing to start #75
Fixes: #52, #56, #18, #56
Depends on: charmed-hpc/hpc-libs#66, charmed-hpc/hpc-libs#65, charmed-hpc/hpc-libs#64
TODO: Integration tests