
Commit 3519afc

Update README.md with review comments 1
1 parent b32730f

1 file changed: pathwaysutils/elastic/README.md (+40 -29)
@@ -1,30 +1,31 @@
 # Elastic Training with Pathways
 
-This document demonstrates how to leverage the elasticity primitives within `manager.py` to create resilient JAX training loop that can handle hardware failures gracefully. We illustrate this using an example based on the MaxText training loop running on TPUs provisioned by GKE via `PathwaysJob` API.
+This document demonstrates how to leverage the elasticity primitives within `pathwaysutils.elastic` to create a resilient JAX training loop that can handle hardware failures gracefully. We illustrate this using an example based on the MaxText training loop running on TPUs provisioned by GKE via the `PathwaysJob` API.
 
 ## Overview
 
-Distributed training jobs, especially long-running ones, are susceptible to various failures, such as machine preemptions or hardware issues. Elasticity allows a training job to adapt to changes in the number of available accelerators without crashing. It typically involves:
+Distributed training jobs, especially long-running ones, are susceptible to various failures, such as machine preemptions and hardware issues. Elasticity allows a training job to adapt to changes in the number of available accelerators without crashing. It typically involves:
 
-1. **Training State Management:** Regularly snapshotting the training state (model params, optimizer state, data iterator state).
-2. **Failure Detection:** Pathways Resource Manager detecting when workers join or leave.
-3. **Failure Propogation:** Pathways runtime propogates the error to JAX client.
-4. **Training Reconfiguration:** Adapting the training computation distribution to the current set of healthy workers.
-5. **Resumption:** Continuing training from the last valid snapshot with the new configuration.
+1. **Training State Management**: Regularly snapshotting the training state (model params, optimizer state, data iterator state).
+1. **Failure Detection**: Pathways Resource Manager detects when workers join or leave.
+1. **Failure Propagation**: Pathways runtime propagates the error to the JAX client.
+1. **Training Reconfiguration**: Adapting the training computation distribution to the current set of healthy workers.
+1. **Resumption**: Continuing training from the last valid snapshot with the new configuration.
 
 The `pathwaysutils.elastic` primitives provide building blocks to integrate this logic into JAX training loops run using the Pathways `Proxy` JAX backend.
 
 ## Prerequisites
 
 * A [Pathways compatible GKE cluster](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster) with TPU and CPU nodepools.
 * `kubectl` configured to interact with your cluster.
-* Access to a container image containing JAX, your model code (e.g., MaxText), and the `pathwaysutils` library with elasticity features integrated.
+* Access to a container image containing JAX, your model code (e.g., MaxText), and the `pathwaysutils` package with elasticity features integrated.
 
 ## Elastic MaxText Training with Pathways on GKE
 
-This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices using Pathways. See the [PathwaysJob docs](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro#pathwaysjob_api) for more details about the various attributes set in the YAML below.
+This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices using Pathways. See the [PathwaysJob docs](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro#pathwaysjob_api) for more details about the various attributes set in the YAML below.
 
 ### 1. Elastic PathwaysJob Definition (`pathwaysjob-elastic.yaml`)
+Please set the variables marked with `<>` below before executing the script.
 ```yaml
 apiVersion: pathways-job.pathways.domain/v1
 kind: PathwaysJob
@@ -34,7 +35,7 @@ spec:
   maxRestarts: 0
   workers:
   - type: ct5lp-hightpu-4t
-    topology: 4x8
+    topology: 4x8
     numSlices: 3
     maxSliceRestarts: 2
   pathwaysDir: "gs://<BUCKET>" # Pre-create this bucket.
@@ -50,40 +51,51 @@ spec:
         command:
         - bash
        - -c
-        - |
-          python3 -m MaxText.elastic_train MaxText/configs/base.yml base_output_directory=gs://<BUCKET> per_device_batch_size=4 enable_checkpointing=false remat_policy=full global_parameter_scale=8 steps=50 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=pathways-<USER> enable_pathways_goodput=True
+        - >
+          python3 -m MaxText.elastic_train MaxText/configs/base.yml
+          base_output_directory=gs://<BUCKET>
+          per_device_batch_size=4
+          enable_checkpointing=false
+          remat_policy=full
+          global_parameter_scale=8
+          steps=50
+          max_target_length=2048
+          use_iota_embed=true
+          reuse_example_batch=1
+          dataset_type=synthetic
+          attention=flash
+          gcs_metrics=True
+          enable_pathways_goodput=True
+          run_name=pathways-<USER>
 ```
-The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the `main` container above is integrated with `pathwaysutils.elastic` primitives.
+The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the `main` container above is integrated with `pathwaysutils.elastic` primitives.
 
 ### 2. Running the Elastic Training Loop and Simulating Hardware Failures
 
-The following bash script demonstrates launching the above elastic maxtext job with Pathways, monitoring its progress, simulating a worker failure by issuing a `SIGILL` to a Pathways worker pod, and observing the recovery. Please set the variables marked as `<>` below before executing the script. At the end of the script, we verify elasticity worked as expected.
+The following bash script demonstrates launching the above elastic MaxText job with Pathways, monitoring its progress, simulating a hardware failure by issuing a `kubectl drain` to a randomly selected TPU node, and observing the recovery. Please set the variables marked with `<>` below before executing the script. At the end of the script, we verify that elasticity worked as expected.
 
 ```bash
 #!/bin/bash
 WORKING_DIR=</LOCAL/DIRECTORY/PATH>
 USER_LABEL_SELECTOR="<USER>"
 LOG_DIR="${WORKING_DIR}/logs"
+RUN_ID=pathways-${USER_LABEL_SELECTOR}
+LOG_FILE="${LOG_DIR}/logs_${RUN_ID}.log"
 JOB_DEFINITION_FILE="${WORKING_DIR}/pathwaysjob-elastic.yaml" # Copy the above yaml into this file
 
 mkdir -p ${LOG_DIR}
 
-run_id=$(date +"%s")
-echo "Running Elastic MaxText with Run ID: $run_id"
+echo "Running Elastic MaxText with Run ID: ${RUN_ID}"
 
 # 1. Launch the PathwaysJob
 kubectl apply -f "$JOB_DEFINITION_FILE"
-if [ $? -ne 0 ]; then
-  echo "Error: Failed to apply job definition."
-  exit 1
-fi
 
 # 2. Monitor the PathwaysJob
 echo "Waiting for pods to start..."
 head_pod=""
 for i in $(seq 1 10)
 do
-  head_pod=$(kubectl get pods | grep "$USER_LABEL_SELECTOR" | grep 'head' | grep 'Running' | awk '{print $1}' | head -n 1)
+  head_pod=$(kubectl get pods -o=name --field-selector='status.phase==Running' | grep "$USER_LABEL_SELECTOR" | grep 'head' | head -n 1)
   if [ -n "$head_pod" ]; then
     echo "Found head pod: $head_pod"
     break
@@ -93,21 +105,20 @@ do
 done
 
 if [ -z "$head_pod" ]; then
-  echo "Error: Could not find running head pod after multiple attempts. Cleaning up..."
+  echo "Error: Could not find running head pod after multiple attempts. Cleaning up..." 1>&2
   kubectl delete -f "$JOB_DEFINITION_FILE"
   exit 1
 fi
 
-log_file="${LOG_DIR}/logs_${run_id}.log"
-echo "Streaming logs from $head_pod to $log_file"
-kubectl logs -f "$head_pod" >> "${log_file}" &
+echo "Streaming logs from $head_pod to ${LOG_FILE}"
+kubectl logs -f "$head_pod" >> "${LOG_FILE}" &
 logs_pid=$!
 echo "Waiting for job to start making progress..."
 sleep 90s
 
 # 3. Simulate Failure: Evict a Worker Pod
 echo "Randomly select a worker pod to disrupt..."
-read -r node_name pod_name <<<$(kubectl get pods -o wide | grep "$USER_LABEL_SELECTOR" | grep 'worker-[0-9]-0-' | grep 'Running' | shuf | head -n 1 | awk '{print $7, $1}')
+read -r node_name pod_name <<<$(kubectl get pods -o wide --field-selector='status.phase==Running' | grep "$USER_LABEL_SELECTOR" | grep worker | shuf | head -n 1 | awk '{print $7, $1}')
 
 if [ -z "$pod_name" ] || [ -z "$node_name" ]; then
   echo "Warning: Could not find a running worker pod to disrupt. Skipping disruption."
@@ -127,11 +138,11 @@ fi
 sleep 90s
 
 # 6. Terminate the Job and Cleanup
-echo "Terminating Run ID $run_id"
+echo "Terminating Run ID ${RUN_ID}"
 kubectl delete -f "$JOB_DEFINITION_FILE"
 # Ensure log streaming process is killed
 kill "$logs_pid" 2>/dev/null
-echo "Completed Run ID $run_id."
+echo "Completed Run ID ${RUN_ID}."
 
 # 7. Verify by printing steps where training reconfigured from N to N-1 slices and later back to N slices
 # Expect output like:
@@ -156,5 +167,5 @@ awk '
   prev_step = step
   prev_good_slice_count = good_slice_count
 }
-' "$log_file"
+' "${LOG_FILE}"
 ```
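The Overview section of the README above describes a snapshot / detect / reconfigure / resume cycle. Below is a minimal, self-contained sketch of that control flow only. `ElasticManager`, `maybe_snapshot`, and `reconfigure` are hypothetical names introduced for illustration, not the actual `pathwaysutils.elastic` API; see the MaxText elastic training script linked above for the real integration.

```python
"""Illustrative elastic-training control flow (hypothetical helper names)."""
import copy
import random


class HardwareFailure(Exception):
    """Stand-in for the error surfaced to the client when a slice is lost."""


class ElasticManager:
    """Hypothetical helper tracking healthy slices and the latest snapshot."""

    def __init__(self, total_slices: int, snapshot_every: int):
        self.good_slices = total_slices
        self.snapshot_every = snapshot_every
        self._snapshot = None

    def maybe_snapshot(self, step: int, state: dict) -> None:
        # Step 1 (Training State Management): keep a recent copy of params,
        # optimizer state, and data-iterator position.
        if step % self.snapshot_every == 0:
            self._snapshot = (step, copy.deepcopy(state))

    def reconfigure(self) -> tuple[int, dict]:
        # Steps 2-5: a failure was detected and propagated; shrink to the
        # currently healthy slices and resume from the last snapshot.
        self.good_slices = max(1, self.good_slices - 1)
        step, state = self._snapshot
        return step, copy.deepcopy(state)


def train_step(state: dict) -> dict:
    # Placeholder for the jitted MaxText train step; fails occasionally to
    # mimic a lost TPU slice.
    if random.random() < 0.05:
        raise HardwareFailure("slice became unhealthy")
    state["params"] += 1  # pretend parameter update
    return state


def train(num_steps: int = 50) -> dict:
    manager = ElasticManager(total_slices=3, snapshot_every=5)
    state = {"params": 0}
    step = 0
    while step < num_steps:
        try:
            manager.maybe_snapshot(step, state)
            state = train_step(state)
            step += 1
        except HardwareFailure:
            # Roll back to the last snapshot and continue on fewer slices.
            step, state = manager.reconfigure()
            print(f"reconfigured to {manager.good_slices} slices at step {step}")
    return state


if __name__ == "__main__":
    print(train())
```

In the real setup, the job also scales back up to the full slice count once the disrupted capacity returns, which is what the verification step at the end of the bash script checks for.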

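The hunks above elide the script's actual disruption and recovery steps. Since the section text states that the failure is simulated with `kubectl drain` on the selected node, those steps presumably look roughly like the following; this is an assumption for illustration, not the exact commands from the MaxText example.

```bash
# Hypothetical sketch of the elided disruption/recovery steps, assuming the
# node selected in step 3 above is stored in "$node_name".

# Simulate a hardware failure by draining the TPU node hosting the worker.
kubectl drain "$node_name" --ignore-daemonsets --delete-emptydir-data --force

# Give the job time to detect the failure and reconfigure to fewer slices.
sleep 90s

# Make the node schedulable again so the job can recover to the full slice count.
kubectl uncordon "$node_name"
```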