Configure automatic restart for OOM-terminated pods in JupyterHub and Jobs by Copilot · Pull Request #108 · minnelab/MAIA

Copilot · 2026-01-19T09:12:52Z

Pods terminated by OOM errors in JupyterHub and batch workloads did not restart automatically, requiring manual intervention.

Changes

JupyterHub singleuser pods (MAIA_scripts/MAIA_create_JupyterHub_config.py)

Set restartPolicy: OnFailure via extraPodConfig in generated JupyterHub values
Notebook pods now restart automatically on OOM or other failures
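
As a rough illustration of the change above, the sketch below shows one way a config generator could set the restart policy without overwriting other `extraPodConfig` keys. The helper name `with_restart_policy` and the sample values are hypothetical; they are not taken from MAIA_create_JupyterHub_config.py.

```python
import copy

# Restart policies Kubernetes accepts in a Pod spec.
POD_RESTART_POLICIES = {"Always", "OnFailure", "Never"}


def with_restart_policy(values, policy="OnFailure"):
    """Return a copy of the JupyterHub Helm values with
    singleuser.extraPodConfig.restartPolicy set to `policy`.

    Hypothetical helper for illustration only.
    """
    if policy not in POD_RESTART_POLICIES:
        raise ValueError(f"unsupported restartPolicy: {policy!r}")
    merged = copy.deepcopy(values)  # leave the caller's dict untouched
    singleuser = merged.setdefault("singleuser", {})
    extra_pod_config = singleuser.setdefault("extraPodConfig", {})
    extra_pod_config["restartPolicy"] = policy
    return merged


# Example: existing extraPodConfig keys are preserved (sample values are made up).
base_values = {"singleuser": {"extraPodConfig": {"dnsPolicy": "ClusterFirst"}}}
print(with_restart_policy(base_values)["singleuser"]["extraPodConfig"])
```

Copying the input first keeps the function free of side effects, which makes it easier to reuse the same base values for several generated configurations.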

MAIAKubeGate Job workloads (charts/maiakubegate/)

Made restartPolicy and backoffLimit configurable
Defaults: restartPolicy: OnFailure, backoffLimit: 3
User workloads retry up to 3 times on failure

Kaniko build Jobs (charts/maiakubegate-kaniko/)

Made restart behavior configurable
Defaults: restartPolicy: Never, backoffLimit: 0 (no retries for builds)
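
The defaults listed above can be sketched as a small resolver that merges user overrides onto per-chart defaults and rejects values a Job would not accept. This is a Python illustration only, not the charts' Helm templates; `resolve_job_restart` and `CHART_DEFAULTS` are hypothetical names. The one Kubernetes rule it relies on is that a Job's pod template permits only `OnFailure` or `Never` as its restart policy.

```python
# A Job's pod template may only use these restart policies.
JOB_RESTART_POLICIES = {"OnFailure", "Never"}

# Defaults as described in this pull request.
CHART_DEFAULTS = {
    "maiakubegate": {"restartPolicy": "OnFailure", "backoffLimit": 3},
    "maiakubegate-kaniko": {"restartPolicy": "Never", "backoffLimit": 0},
}


def resolve_job_restart(chart, overrides=None):
    """Merge user overrides onto the chart defaults and validate the result.

    Hypothetical helper for illustration only.
    """
    settings = dict(CHART_DEFAULTS[chart])
    settings.update(overrides or {})

    if settings["restartPolicy"] not in JOB_RESTART_POLICIES:
        raise ValueError(
            f"Jobs accept only OnFailure or Never, got {settings['restartPolicy']!r}"
        )

    limit = settings["backoffLimit"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
        raise ValueError(f"backoffLimit must be a non-negative integer, got {limit!r}")

    return settings


print(resolve_job_restart("maiakubegate"))
print(resolve_job_restart("maiakubegate", {"backoffLimit": 0}))
```

Note that an explicit `backoffLimit: 0` override is kept rather than replaced by the default, which matters for workloads that should never retry.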

Configuration example:

# JupyterHub values now include:
"singleuser": {
    "extraPodConfig": {
        "restartPolicy": "OnFailure"
    },
    ...
}

# Helm values for Jobs:
restartPolicy: OnFailure  # or Never
backoffLimit: 3           # retry attempts

All changes are backward compatible through Helm default-value patterns. Documentation was added in docs/OOM_RESTART_CONFIGURATION.md.

Original prompt

This section details the original issue you should resolve

<issue_title>Pods terminated with OOM should restart automatically (JupyterHub/Kubeflow)</issue_title>
<issue_description>## Problem
Currently, when a pod started through JupyterHub or Kubeflow in the MAIA platform runs out of memory (OOM), it does not restart automatically. This results in the need for manual intervention to restart the affected pods, interrupting user workflows and reducing reliability.

Expected Behavior

Pods that are terminated due to OOM (Out Of Memory) errors should be automatically restarted by Kubernetes without requiring manual action from users or administrators.

Steps to Reproduce

Launch a pod via JupyterHub or Kubeflow in the MAIA cluster.

Perform memory-intensive operations until the pod exceeds its memory limit and is terminated with an OOM error.

Observe that the pod does not automatically restart and remains in a failed state.

Desired Solution

Configure MAIA platform pod templates, deployment, or JupyterHub/Kubeflow integrations so that OOM pods always restart automatically.

Potentially add or adjust the restartPolicy (should be Always for most interactive workloads).

Validate expected behavior by forcing several OOM events and confirming pods are restarted.

Additional Context

Applies to pods started both from JupyterHub and via Kubeflow pipelines/interactive sessions.

Consider if default behavior differs between Deployment vs Job resources, and document where relevant.

Please investigate and update the configuration as needed so that user pods do not require manual restart after OOM events.</issue_description>
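
The validation step requested in the issue (force OOM events, then confirm the pod restarted) could be checked with a sketch like the one below. It inspects the pod JSON as returned by `kubectl get pod <name> -o json` and reports, per container, the restart count and whether the last termination reason was `OOMKilled`. The function name `oom_restart_summary` and the sample pod are hypothetical; the `status.containerStatuses[].restartCount` and `lastState.terminated.reason` fields are standard Kubernetes pod status fields.

```python
def oom_restart_summary(pod):
    """Summarize restart counts and OOM kills per container.

    `pod` is a dict parsed from `kubectl get pod <name> -o json`.
    Hypothetical helper for illustration only.
    """
    summary = {}
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for container_status in statuses:
        # lastState.terminated is present only if the container terminated before.
        terminated = container_status.get("lastState", {}).get("terminated") or {}
        summary[container_status["name"]] = {
            "restartCount": container_status.get("restartCount", 0),
            "lastOOMKilled": terminated.get("reason") == "OOMKilled",
        }
    return summary


# Made-up sample resembling a notebook pod that was OOM-killed and restarted twice.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "notebook",
                "restartCount": 2,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}
print(oom_restart_summary(sample_pod))
```

A non-zero `restartCount` together with `lastOOMKilled` being true indicates that the container was restarted in place after an OOM kill rather than left in a failed state.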

Comments on the Issue (you are @copilot in this section)

Fixes Pods terminated with OOM should restart automatically (JupyterHub/Kubeflow) #107

Configure JupyterHub singleuser pods to restart on failure (including OOM) by adding extraPodConfig with restartPolicy: OnFailure. This ensures that user pods that encounter out-of-memory errors will be automatically restarted by Kubernetes without manual intervention.
Co-authored-by: SimoneBendazzoli93 <[email protected]>

- Configure maiakubegate Jobs with restartPolicy: OnFailure and backoffLimit: 3 (default)
- Make maiakubegate-kaniko Jobs configurable while keeping conservative defaults (Never/0)
- Add documentation in values.yaml for both charts explaining the restart behavior
- These changes allow user workload Jobs to automatically retry on OOM failures
Co-authored-by: SimoneBendazzoli93 <[email protected]>

Created comprehensive documentation explaining:
- The changes made to enable automatic pod restarts on OOM
- Configuration options for JupyterHub and Jobs
- Testing and validation procedures
- Migration notes for existing deployments
Co-authored-by: SimoneBendazzoli93 <[email protected]>

Initial plan

1a20d98

Copilot AI assigned Copilot and SimoneBendazzoli93 Jan 19, 2026

Copilot started work on behalf of SimoneBendazzoli93 January 19, 2026 09:13 View session

Copilot AI and others added 3 commits January 19, 2026 09:18

Copilot AI changed the title ~~[WIP] Configure automatic restart for OOM terminated pods~~ Configure automatic restart for OOM-terminated pods in JupyterHub and Jobs Jan 19, 2026

Copilot AI requested a review from SimoneBendazzoli93 January 19, 2026 09:27

Copilot finished work on behalf of SimoneBendazzoli93 January 19, 2026 09:27

Copilot wants to merge 4 commits into master from copilot/configure-oom-restart-pods
