
Conversation

nastaran78 (Collaborator) commented on Jul 22, 2025

We added support for asynchronous evaluation in this PR.

TL;DR
Async: launch the job, return None, and have the job call neps.save_pipeline_results() when it finishes.


1  Return types

| Allowed return | When to use | Minimal example |
|---|---|---|
| Scalar | simple objective, single fidelity | `return loss` |
| Dict | need cost / extra metrics | `{"objective_to_minimize": loss, "cost": 3}` |
| None | you launch the job elsewhere (SLURM, k8s, …) | see § 3 Async |

All other values raise a TypeError inside NePS.
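
For example, a minimal sketch of the first two return styles (the hyperparameter names and the toy loss are placeholders, not part of the NePS API):

def evaluate_pipeline(learning_rate: float, optimizer: str):
    # stand-in for real training; compute your validation loss here
    val_loss = (learning_rate - 0.01) ** 2

    # style 1 – plain scalar:
    #   return val_loss

    # style 2 – dict with extra metrics:
    return {"objective_to_minimize": val_loss, "cost": 3}

The None return style is what § 3 describes.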

2  Result dictionary keys

| key | purpose | required? |
|---|---|---|
| objective_to_minimize | the scalar NePS will minimise | yes |
| cost | wall-clock time, GPU-hours, … | yes iff a cost budget is enabled (i.e. you passed max_cost_total to neps.run) |
| learning_curve | list/np.array of intermediate objectives | optional |
| extra | any JSON-serialisable blob | optional |
| exception | the Exception raised if the evaluation failed | optional |

Tip  Return exactly what you need; extra keys are preserved in the trial’s report.yaml.
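
For illustration, a result dict that uses every key could look like this (all values are made up):

result = {
    "objective_to_minimize": 0.1234,               # required
    "cost": 180,                                   # only needed if a cost budget is set
    "learning_curve": [0.9, 0.5, 0.2, 0.1234],     # intermediate objectives
    "extra": {"n_params": 1_200_000, "seed": 42},  # any JSON-serialisable blob
}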


3  Asynchronous evaluation (advanced)

3.1 Design

  1. The Python side (your evaluate_pipeline function)

    • creates & submits a job script.
    • returns None so the worker thread isn’t blocked.
  2. The submit script or the job must call

    neps.save_pipeline_results(
        user_result=result_dict,
        pipeline_id=pipeline_id,
        root_directory=root_directory,
    )

    when it finishes.
    This writes report.yaml and marks the trial SUCCESS / CRASHED; a minimal failure example is sketched just below.
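
For instance, a job that failed could report the error like this (a minimal sketch; the exception key is the optional field from § 2, and the pipeline_id and root_directory values are placeholders for the arguments injected into evaluate_pipeline, see § 4):

from pathlib import Path
import neps

def train():
    raise RuntimeError("CUDA out of memory")  # stand-in for a real failure

try:
    train()
except Exception as e:
    neps.save_pipeline_results(
        user_result={"exception": e},          # marks the trial CRASHED
        pipeline_id="0",                       # the id injected into evaluate_pipeline
        root_directory=Path("results/async_example"),
    )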

3.2 Code walk‑through

submit.py – called by NePS synchronously

import subprocess
from pathlib import Path

def evaluate_pipeline(
    pipeline_directory: Path,
    pipeline_id: str,          # NePS injects this automatically
    root_directory: Path,      # idem
    learning_rate: float,
    optimizer: str,
):
    # 1) write a Slurm script
    script = f"""#!/bin/bash
#SBATCH --time=0-00:10
#SBATCH --job-name=trial_{pipeline_id}
#SBATCH --partition=bosch_cpu-cascadelake
#SBATCH --output={pipeline_directory}/%j.out
#SBATCH --error={pipeline_directory}/%j.err

python run_pipeline.py \
       --learning_rate {learning_rate} \
       --optimizer {optimizer} \
       --pipeline_id {pipeline_id} \
       --root_dir {root_directory}
""")

    # 2) submit and RETURN None (async)
    sumit_job(script)
    return None  # ⟵ signals async mode

run_pipeline.py – executed on the compute node

import argparse
from pathlib import Path

import neps

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float)
parser.add_argument("--optimizer")
parser.add_argument("--pipeline_id")
parser.add_argument("--root_dir")
args = parser.parse_args()
try:
    # … do heavy training …
    val_loss = 0.1234
    wall_clock_cost = 180  # seconds
    result = {
        "objective_to_minimize": val_loss,
        "cost": wall_clock_cost,
    }
except Exception as e:
    # locals such as val_loss may not exist if training failed early,
    # so report only the exception; NePS will mark the trial CRASHED
    result = {"exception": e}

neps.save_pipeline_results(
    user_result=result,
    pipeline_id=args.pipeline_id,
    root_directory=Path(args.root_dir),
)
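
For context, the optimisation itself is launched as usual; a rough sketch, assuming the current NePS search-space classes (neps.Float, neps.Categorical) and neps.run arguments, which may differ in your NePS version, and placeholder search-space bounds:

import neps
from submit import evaluate_pipeline  # the async function from submit.py above

pipeline_space = {
    "learning_rate": neps.Float(lower=1e-5, upper=1e-1, log=True),
    "optimizer": neps.Categorical(choices=["adam", "sgd"]),
}

neps.run(
    evaluate_pipeline=evaluate_pipeline,
    pipeline_space=pipeline_space,
    root_directory="results/async_example",
    max_evaluations_total=20,
)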

3.3 Why this matters

  • No worker idles while your job is in the queue ➜ better throughput.
  • Crashes inside the job still mark the trial CRASHED instead of hanging.
  • Compatible with Successive‑Halving/ASHA — NePS just waits for report.yaml.

4  Extra injected arguments

| name | provided when | description |
|---|---|---|
| pipeline_directory | always | per-trial working dir (…/trials/<id>/) |
| previous_pipeline_directory | only for multi-fidelity | directory of the lower-fidelity checkpoint; can be None |
| pipeline_id | async only | trial id string you pass back to save_pipeline_results |
| root_directory | async only | optimisation root folder; pass it back unchanged |
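
For completeness, a sketch of how the multi-fidelity arguments might be consumed (the checkpoint file name and the epochs fidelity are illustrative, not part of the NePS API):

from pathlib import Path

def evaluate_pipeline(
    pipeline_directory: Path,
    previous_pipeline_directory: Path | None,   # None on the first fidelity rung
    learning_rate: float,
    epochs: int,                                # illustrative fidelity parameter
):
    checkpoint = None
    if previous_pipeline_directory is not None:
        candidate = previous_pipeline_directory / "checkpoint.pt"
        if candidate.exists():
            checkpoint = candidate              # resume from the lower-fidelity trial

    # ... train for `epochs`, optionally warm-starting from `checkpoint` ...
    val_loss = 0.1                              # placeholder result
    (pipeline_directory / "checkpoint.pt").touch()  # persist your own checkpoint
    return {"objective_to_minimize": val_loss}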

nastaran78 force-pushed the eval_callback branch 4 times, most recently from 195d67b to c605336, on July 29, 2025
nastaran78 changed the title from "feat: Callback for saving Evaluation Result" to "feat: Async Saving Evaluation Result" on Jul 29, 2025
nastaran78 force-pushed the eval_callback branch 3 times, most recently from 9394b98 to 9bfc463, on August 18, 2025
nastaran78 requested a review from Neeratyoy on August 19, 2025
nastaran78 (Collaborator, Author) commented:

1- issue for using config instead of pipeline

automl deleted a comment from Neeratyoy on Aug 23, 2025
automl deleted a comment from Neeratyoy on Aug 23, 2025