Issue launching job within an existing job array job #1782

Open

marksibrahim opened this issue Nov 26, 2024 · 4 comments

@marksibrahim
In our workflow we launch a SLURM job array with submitit for model training. Within each array task, our Python code then launches separate evaluation jobs. We confirmed the workflow works when the main training job is not a job array. When the main training job is a job array, the subsequent evaluation jobs look for their .pkl files under the wrong parent job ID and exit with an error.
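Roughly, our setup follows this pattern (the folders, parameters, and function names below are just illustrative placeholders, not our actual code):

import submitit

def run_evaluation(config):
    # placeholder for our real evaluation entry point
    return {"config": config, "score": 0.0}

def train_and_eval(config):
    # this runs inside one task of the training job array
    eval_executor = submitit.AutoExecutor(folder="logs/eval")
    eval_executor.update_parameters(timeout_min=30)
    # nested submission: this is where the evaluation job picks up the wrong parent job ID
    eval_job = eval_executor.submit(run_evaluation, config)
    return eval_job.job_id

if __name__ == "__main__":
    configs = [{"lr": 1e-3}, {"lr": 1e-4}]
    train_executor = submitit.AutoExecutor(folder="logs/train")
    train_executor.update_parameters(timeout_min=60, slurm_array_parallelism=2)
    # one array task per config; each task then submits its own evaluation job
    train_executor.map_array(train_and_eval, configs)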

The hacky workaround we have for now is to manually remove the SLURM_ARRAY_JOB_ID environment variable before submitting and reinstate it afterwards:

import os

# main SLURM job array launched with submitit is running

training_job_id = os.environ["SLURM_ARRAY_JOB_ID"]
os.environ.pop("SLURM_ARRAY_JOB_ID")

# launch the evaluation job
job = executor.submit(my_evaluation_job)

# reinstate the main job's array ID
os.environ["SLURM_ARRAY_JOB_ID"] = training_job_id

Is there a recommended workflow to make sure launching a job within an existing submitit job array works as expected?

@baldassarreFe
Contributor

Can you check if this context manager solves your issue?

import contextlib
import os
import typing as tp


@contextlib.contextmanager
def clean_env(extra_names: tp.Sequence[str] = ()) -> tp.Iterator[None]:
    """Removes slurm and submitit related environment variables so as to avoid interferences
    when submitting a new job from a job.

    Parameters
    ----------
    extra_names: Sequence[str]
        Additional environment variables to hide inside the context,
        e.g. TRITON_CACHE_DIR and TORCHINDUCTOR_CACHE_DIR when using torch.compile.

    Note
    ----
    A slurm job submitted from within a slurm job inherits some of its attributes, which may
    be confusing and cause weird gres errors (or pytorch distributed errors).
    Submitting within this context should prevent this.

    Usage
    -----
    with submitit.helpers.clean_env():
        executor.submit(...)
    """
    distrib_names = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK", "LOCAL_WORLD_SIZE")
    # pop and remember every slurm/submitit/distributed variable so it can be restored later
    cluster_env = {
        x: os.environ.pop(x)
        for x in os.environ
        if (
            x.startswith(("SLURM_", "SLURMD_", "SRUN_", "SBATCH_", "SUBMITIT_"))
            or x in distrib_names
            or x in extra_names
        )
    }
    try:
        yield
    finally:
        # restore the hidden variables, even if the body raised
        os.environ.update(cluster_env)

If not, can you identify which env variable(s) are still present and interfere with launching new jobs? If needed, we can update the list of variables that are hidden by the context manager.
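For example, a quick diagnostic along these lines (just a rough sketch, assuming a submitit version that ships clean_env in submitit.helpers) would show what is still set inside the context:

import os
import submitit

PREFIXES = ("SLURM_", "SLURMD_", "SRUN_", "SBATCH_", "SUBMITIT_")

with submitit.helpers.clean_env():
    leftover = [k for k in os.environ if k.startswith(PREFIXES)]
    # anything printed here survived the cleanup and could still interfere
    print("still set inside clean_env:", leftover)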

@marksibrahim
Author

@baldassarreFe thank you for getting back to us. This would take care of the variable, but we're worried this solution would alter the state of the existing job's environment variables and thus break requeuing. Is there a way to reinstate the existing job's environment variables after this cleanup step?

@baldassarreFe
Contributor

I believe the context manager does exactly what you need: the environment is only altered temporarily and is restored to its previous state once the context exits (the finally block takes care of that). You can check by printing the env before and after just to be sure.
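Something along these lines (purely illustrative) would confirm that nothing changes outside the context:

import os
import submitit

def slurm_snapshot():
    # keep only the variables that clean_env hides
    return {
        k: v
        for k, v in os.environ.items()
        if k.startswith(("SLURM_", "SLURMD_", "SRUN_", "SBATCH_", "SUBMITIT_"))
    }

before = slurm_snapshot()
with submitit.helpers.clean_env():
    inside = slurm_snapshot()  # should be empty inside the context
    # executor.submit(my_evaluation_job) would go here
after = slurm_snapshot()

assert before == after, "environment was not restored after the context"
print("hidden inside the context:", sorted(before.keys() - inside.keys()))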

@marksibrahim
Author

marksibrahim commented Dec 4, 2024 via email
