24.05.ug before reduce patches #69

Merged
merged 126 commits into from
Dec 19, 2024


itkovian
Member

This is the version we will update to first.

itkovian and others added 30 commits November 4, 2022 14:13
fix: do not try to deref before pointer is initialised
fix: also pass nnodes through the message
itkovian and others added 28 commits December 5, 2024 15:15
Address race condition where submitted jobs would occasionally start
RUNNING when the test expected PENDING jobs.

The running_vs_pending tests are no longer necessary.

Ticket 21393

Cherry-picked: d4ebd3f
Cherry-pick !228 into slurm-24.05

See merge request SchedMD/dev/slurm!230
If the backup controller was in control when scontrol reconfigure was run,
then once the primary controller resumed, the backup could fail to honor
the request to relinquish control.

In most cases this would result in the backup controller crashing, but
in a quiet enough system both controllers could continue to operate as
the primary.

Changelog: slurmctld - Fix crash and possible split brain issue if the
 backup controller handles an scontrol reconfigure while in control
 before the primary resumes operation.
Ticket: 21532
Cherry-picked: f37371b
Cherry-pick !218 into slurm-24.05

See merge request SchedMD/dev/slurm!238
The controller doesn't pack job_record_t's node_addrs, so the stepmgr
wasn't receiving them and couldn't pass them on to the steps. When the
steps completed and tried to communicate back to the dynamic stepmgr, the
node addr lookup failed. The job's node_addrs are already passed in the
cred, so we can get them from there.

Changelog: Fix stepmgr not getting dynamic node addrs from the controller
Ticket: 21535
Cherry-picked: a4af0fb
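As a rough illustration of "get them from the cred", the lookup amounts to finding a node's address in the address list that travels alongside the node list. This is a minimal sketch under stated assumptions: cred_nodes_t and cred_node_addr() are hypothetical stand-ins, not Slurm's real credential type or API, and the node names and addresses in the usage are example values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the node fields carried in a job credential.
 * These are NOT Slurm's real type or field names. */
typedef struct {
	int node_cnt;
	const char **node_names; /* node_names[i]: name of the i-th node */
	const char **node_addrs; /* node_addrs[i]: address of node_names[i] */
} cred_nodes_t;

/* Return the address recorded for node_name, or NULL when the credential
 * carries no addresses or does not list that node. */
const char *cred_node_addr(const cred_nodes_t *cred, const char *node_name)
{
	assert(cred != NULL && node_name != NULL);
	if (cred->node_addrs == NULL)
		return NULL;
	for (int i = 0; i < cred->node_cnt; i++) {
		if (strcmp(cred->node_names[i], node_name) == 0)
			return cred->node_addrs[i];
	}
	return NULL;
}
```

Keeping names and addresses as parallel arrays means a single index links a node to its address, so nothing beyond what the credential already carries is needed.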
Changelog: stepmgr - avoid "Unexpected missing socket" errors.
Ticket: 21422
Cherry-picked: 8469d65
Changelog: Fix `scontrol show steps` with dynamic stepmgr
Ticket: 21422
Cherry-picked: 59e69f3
Cherry-pick !220 into slurm-24.05

See merge request SchedMD/dev/slurm!241
Cherry-pick !224 into slurm-24.05

See merge request SchedMD/dev/slurm!243
Cherry-pick !256 into slurm-24.05

See merge request SchedMD/dev/slurm!259
The previous default behavior of trying IPv4 remains; however, if the
controller appears to be IPv6-only, the IPv4 attempt is skipped.

All Slurm controllers are expected to have the same IP address families
available.

Cherry-picked: bf4a853
Ticket: 20997
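One way to make the "appears to be IPv6-only" decision is to resolve the controller host with POSIX getaddrinfo() using AF_UNSPEC, record which address families come back, and skip IPv4 only when every result is IPv6. This is a minimal sketch, not Slurm's implementation: resolve_families(), should_try_ipv4() and the FAMILY_* bitmask are hypothetical names chosen for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <netdb.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical bitmask of address families a host resolved to. */
enum { FAMILY_IPV4 = 1 << 0, FAMILY_IPV6 = 1 << 1 };

/* Resolve host and return a FAMILY_* bitmask, or -1 if resolution fails. */
int resolve_families(const char *host)
{
	struct addrinfo hints, *res, *ai;
	int mask = 0;

	assert(host != NULL);
	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;     /* ask for both IPv4 and IPv6 */
	hints.ai_socktype = SOCK_STREAM; /* one entry per address */

	if (getaddrinfo(host, NULL, &hints, &res) != 0)
		return -1;

	for (ai = res; ai != NULL; ai = ai->ai_next) {
		if (ai->ai_family == AF_INET)
			mask |= FAMILY_IPV4;
		else if (ai->ai_family == AF_INET6)
			mask |= FAMILY_IPV6;
	}
	freeaddrinfo(res);
	return mask;
}

/* Keep the default of trying IPv4; skip it only when the host resolved
 * to IPv6 addresses and nothing else. */
bool should_try_ipv4(int mask)
{
	if (mask < 0)
		return true; /* unknown: fall back to the default behavior */
	return !((mask & FAMILY_IPV6) && !(mask & FAMILY_IPV4));
}
```

A failed lookup deliberately falls back to trying IPv4, so the sketch only changes behavior when the controller is positively known to be IPv6-only.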
When an IPv6 address is used directly, wrap it in [] to separate the
address from the port.

Cherry-picked: 04babaa
Ticket: 20997
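The bracketing rule exists because a literal IPv6 address contains colons itself, so "addr:port" would be ambiguous. A minimal sketch, assuming a hypothetical helper format_host_port() (not a Slurm function), that brackets any host containing ':' that isn't already bracketed:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "host:port" into buf, or "[host]:port" when
 * host is a literal IPv6 address (it contains ':' and is not already
 * bracketed). Returns snprintf()'s result. */
int format_host_port(char *buf, size_t len, const char *host,
		     unsigned short port)
{
	assert(buf != NULL && host != NULL);
	if (strchr(host, ':') != NULL && host[0] != '[')
		return snprintf(buf, len, "[%s]:%hu", host, port);
	return snprintf(buf, len, "%s:%hu", host, port);
}
```

For example, the host "2001:db8::1" with port 6817 becomes "[2001:db8::1]:6817", while "192.0.2.10" stays "192.0.2.10:6817". These addresses and this port are example values only.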
Changelog: Support IPv6 in configless mode.
Cherry-picked: 050517c
Ticket: 20997
Cherry-pick !118 into slurm-24.05

See merge request SchedMD/dev/slurm!233
Update slurm.spec and debian/changelog as well.
itkovian force-pushed the 24.05.ug-before-reduce-patches branch from f801b37 to 30a0b7d on December 16, 2024 09:54
hajgato merged commit 7a6ac98 into hpcugent:24.05.ug on Dec 19, 2024
1 check passed