Configure automatic restart for OOM-terminated pods in JupyterHub and Jobs #108

Draft
Copilot wants to merge 4 commits into master from copilot/configure-oom-restart-pods

Copilot AI commented Jan 19, 2026

Pods terminated by OOM errors in JupyterHub and batch workloads did not restart automatically, requiring manual intervention.

Changes

JupyterHub singleuser pods (MAIA_scripts/MAIA_create_JupyterHub_config.py)

  • Set restartPolicy: OnFailure via extraPodConfig in generated JupyterHub values
  • With this policy, the kubelet restarts a notebook pod's containers in place after an OOM kill or any other non-zero exit
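
A minimal sketch of how a config generator could inject this setting into the values dictionary. The helper name `with_restart_policy` and the sample `base` values are assumptions made for illustration; the actual code in MAIA_scripts/MAIA_create_JupyterHub_config.py is not reproduced in this description.

```python
import copy


def with_restart_policy(values, policy="OnFailure"):
    """Return a copy of a JupyterHub values dict with
    singleuser.extraPodConfig.restartPolicy set.

    Hypothetical helper: name and signature are illustrative only.
    """
    allowed = {"Always", "OnFailure", "Never"}  # values Kubernetes accepts for a pod
    if policy not in allowed:
        raise ValueError(f"unsupported restartPolicy: {policy!r}")
    result = copy.deepcopy(values)  # do not mutate the caller's dict
    singleuser = result.setdefault("singleuser", {})
    extra_pod_config = singleuser.setdefault("extraPodConfig", {})
    extra_pod_config["restartPolicy"] = policy
    return result


# Example input; the memory limit is a placeholder value.
base = {"singleuser": {"memory": {"limit": "4G"}}}
updated = with_restart_policy(base)
print(updated["singleuser"]["extraPodConfig"])
```

Working on a deep copy keeps existing singleuser settings intact and leaves the input untouched, which makes the change easy to unit test.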

MAIAKubeGate Job workloads (charts/maiakubegate/)

  • Made restartPolicy and backoffLimit configurable
  • Defaults: restartPolicy: OnFailure, backoffLimit: 3
  • User workloads retry up to 3 times on failure

Kaniko build Jobs (charts/maiakubegate-kaniko/)

  • Made restart behavior configurable
  • Defaults: restartPolicy: Never, backoffLimit: 0 (no retries for builds)
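
The two sets of defaults above can be sketched as follows. This is an illustration in Python rather than the charts' actual Helm templates, and `job_restart_settings` is a hypothetical name. It places `backoffLimit` on the Job spec and `restartPolicy` on the pod template, and it rejects `Always`, which Kubernetes does not accept for Job pods.

```python
def job_restart_settings(values, default_policy, default_backoff):
    """Resolve restart settings for a Job from user-supplied values.

    Hypothetical helper: returns the fragment of a Job spec that carries
    backoffLimit and the pod template's restartPolicy.
    """
    policy = values.get("restartPolicy", default_policy)
    backoff = values.get("backoffLimit", default_backoff)
    # Job pod templates only accept OnFailure or Never.
    if policy not in ("OnFailure", "Never"):
        raise ValueError(f"restartPolicy {policy!r} is not valid for a Job")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(backoff, bool) or not isinstance(backoff, int) or backoff < 0:
        raise ValueError(f"backoffLimit must be a non-negative integer, got {backoff!r}")
    return {
        "backoffLimit": backoff,
        "template": {"spec": {"restartPolicy": policy}},
    }


# Defaults described for user workloads and for Kaniko builds.
workload = job_restart_settings({}, "OnFailure", 3)
kaniko = job_restart_settings({}, "Never", 0)
# A user override that keeps the default policy but allows more retries.
custom = job_restart_settings({"backoffLimit": 5}, "OnFailure", 3)
print(workload, kaniko, custom, sep="\n")
```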

Configuration example:

# JupyterHub values now include:
"singleuser": {
    "extraPodConfig": {
        "restartPolicy": "OnFailure"
    },
    ...
}

# Helm values for Jobs:
restartPolicy: OnFailure  # or Never
backoffLimit: 3           # retry attempts

All changes are backward compatible through Helm default-value patterns. Documentation is added in docs/OOM_RESTART_CONFIGURATION.md.
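
One detail worth checking with default-value patterns: Helm's `default` template function treats 0, empty strings, and false as unset, so a template such as `{{ .Values.backoffLimit | default 3 }}` would turn an explicit `backoffLimit: 0` into 3. Whether the charts in this PR use that exact pattern is not stated here. The sketch below only emulates the two behaviours in Python, with hypothetical function names.

```python
def helm_style_default(fallback, value):
    """Emulate Helm's `default`: empty values (None, 0, "", False, [], {})
    are replaced by the fallback."""
    return value if value else fallback


def explicit_default(fallback, value):
    """Fall back only when the key is truly unset, so an explicit 0 survives."""
    return fallback if value is None else value


user_values = {"backoffLimit": 0}  # user explicitly asks for no retries

print(helm_style_default(3, user_values.get("backoffLimit")))  # prints 3
print(explicit_default(3, user_values.get("backoffLimit")))    # prints 0
```

In a chart, the second behaviour is typically obtained by setting the default in values.yaml and reading the value directly, or by guarding the template with a `hasKey` check.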

Original prompt

This section details the original issue you should resolve.

<issue_title>Pods terminated with OOM should restart automatically (JupyterHub/Kubeflow)</issue_title>
<issue_description>Problem
Currently, when a pod started through JupyterHub or Kubeflow in the MAIA platform runs out of memory (OOM), it does not restart automatically. This results in the need for manual intervention to restart the affected pods, interrupting user workflows and reducing reliability.

Expected Behavior

Pods that are terminated due to OOM (Out Of Memory) errors should be automatically restarted by Kubernetes without requiring manual action from users or administrators.

Steps to Reproduce

  1. Launch a pod via JupyterHub or Kubeflow in the MAIA cluster.
  2. Perform memory-intensive operations until the pod exceeds its memory limit and is terminated with an OOM error.
  3. Observe that the pod does not automatically restart and remains in a failed state.

Desired Solution

  • Configure MAIA platform pod templates, deployment, or JupyterHub/Kubeflow integrations so that OOM pods always restart automatically.
  • Potentially add or adjust the restartPolicy (should be Always for most interactive workloads).
  • Validate expected behavior by forcing several OOM events and confirming pods are restarted.

Additional Context

  • Applies to pods started both from JupyterHub and via Kubeflow pipelines/interactive sessions.
  • Consider if default behavior differs between Deployment vs Job resources, and document where relevant.

Please investigate and update the configuration as needed so that user pods do not require manual restart after OOM events.</issue_description>

Comments on the Issue (you are @copilot in this section)

Copilot AI and others added 3 commits January 19, 2026 09:18
Configure JupyterHub singleuser pods to restart on failure (including OOM) by adding extraPodConfig with restartPolicy: OnFailure. This ensures that user pods that encounter out-of-memory errors will be automatically restarted by Kubernetes without manual intervention.

Co-authored-by: SimoneBendazzoli93 <[email protected]>
- Configure maiakubegate Jobs with restartPolicy: OnFailure and backoffLimit: 3 (default)
- Make maiakubegate-kaniko Jobs configurable while keeping conservative defaults (Never/0)
- Add documentation in values.yaml for both charts explaining the restart behavior
- These changes allow user workload Jobs to automatically retry on OOM failures

Co-authored-by: SimoneBendazzoli93 <[email protected]>
Created comprehensive documentation explaining:
- The changes made to enable automatic pod restarts on OOM
- Configuration options for JupyterHub and Jobs
- Testing and validation procedures
- Migration notes for existing deployments

Co-authored-by: SimoneBendazzoli93 <[email protected]>
Copilot AI changed the title from "[WIP] Configure automatic restart for OOM terminated pods" to "Configure automatic restart for OOM-terminated pods in JupyterHub and Jobs" on Jan 19, 2026

Development

Successfully merging this pull request may close these issues.

Pods terminated with OOM should restart automatically (JupyterHub/Kubeflow)

2 participants