@akansha1812

Add complete helm chart with readme and tests the scripts

Comment on lines 270 to 273
```
cd $REPO_ROOT/src/utils/checkpointing_metrics
python3 calculate_checkpoint_metrics.py --gcs_logs_path=${GCS_LOGS_PATH}
```
Collaborator

Did you test this? I'm not sure if it has been updated to work with Nemo 2.

Author

removed this.

Collaborator

Let's update the file path to match the other recipes in this directory with a "-gcs" suffix.

@@ -0,0 +1,303 @@
<!-- mdformat global-off -->
# Pretrain llama3-1-70b-gpus128 workloads on a4 GKE Node pools with Nvidia NeMo Framework using Google Cloud Storage for training data and checkpoints
Collaborator

A4

@@ -0,0 +1,303 @@
<!-- mdformat global-off -->
# Pretrain llama3-1-70b-gpus128 workloads on a4 GKE Node pools with Nvidia NeMo Framework using Google Cloud Storage for training data and checkpoints
Collaborator

Are there supposed to be two spaces here?


### Configure and submit a pretraining job

#### Using 16 node (64 gpus) fp8 precision
Collaborator

Is it fp8 or bf16?

Author

bf16. updated

Comment on lines +224 to +267
### Analyze results

When completed, the job creates several artifacts, including logs and traces, and places them
in the Google Cloud Storage logs bucket as follows:

```
gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/<JOB_ID>
├── nemo-configuration.yaml
├── lightning_logs.txt
├── nemo_error_logs.txt
├── nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt
├── dllogger
│ ├── rank-0
│ │ ├── dllogger.json
...
```

- `nemo-configuration.yaml`: the NeMo configuration used by the pretraining script. This includes
the combined [configuration file](../16node-bf16-seq8192-gbs512/llama3-1-70b.py)
and the command line overrides
- `lightning_logs.txt`: the log files generated by PyTorch Lightning, which is used by NeMo
- `nemo_error_logs.txt`: the warning and error logs generated by NeMo
- `nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt`: the NeMo logs for each rank
- `dllogger/`: the log captured by [NVIDIA DLLogger](https://github.com/NVIDIA/dllogger).
DLLogger is configured to store logs on the rank 0 node. The log is in JSON format
and includes loss, step_time, and other key metrics for each training step
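
As a rough illustration of how the DLLogger output could be inspected, the sketch below collects one metric across training steps and summarizes it. It assumes the file follows DLLogger's JSON stream layout, where each line is a JSON object (optionally prefixed with `DLLL `) whose `data` field holds the metrics. The function names `read_metric` and `summarize` are hypothetical, and the metric key (`step_time` here) is taken from the description above, so check it against the keys in your own log.

```python
import json
import statistics


def read_metric(lines, key):
    """Collect the numeric values logged under `key`, one per log record."""
    values = []
    for line in lines:
        line = line.strip()
        # Assumption: DLLogger's JSON stream backend prefixes records with "DLLL ".
        if line.startswith("DLLL "):
            line = line[len("DLLL "):]
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON records
        if not isinstance(record, dict):
            continue
        data = record.get("data", {})
        # Keep only records that actually carry a numeric value for this key.
        if isinstance(data, dict) and isinstance(data.get(key), (int, float)):
            values.append(float(data[key]))
    return values


def summarize(values):
    """Return count, mean, and max for a list of values, or None if it is empty."""
    if not values:
        return None
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "max": max(values),
    }
```

To use it, copy `dllogger/rank-0/dllogger.json` from the logs bucket to a local file, open it, and pass the file object to `read_metric` together with the metric key you want to inspect.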

The `<JOB_ID>` has the following format:
- `$USER--llama31-70b-gcs-[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]`, where the suffix is the date and time at which the job was started.
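
If you need to sort or filter jobs by start time, the ID format above can be parsed mechanically. The following is a minimal sketch that assumes the ID matches the pattern exactly as documented. `parse_job_id` is a hypothetical helper, not part of this repository, and the returned timestamp is naive because the time zone is not encoded in the ID.

```python
import re
from datetime import datetime

# Pattern derived from the documented format:
#   $USER--llama31-70b-gcs-[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]
JOB_ID_PATTERN = re.compile(
    r"^(?P<user>.+)--llama31-70b-gcs-"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})$"
)


def parse_job_id(job_id):
    """Split a job ID into the user name and the job start time."""
    match = JOB_ID_PATTERN.match(job_id)
    if match is None:
        raise ValueError(f"unexpected job ID format: {job_id!r}")
    started = datetime.strptime(match.group("stamp"), "%Y-%m-%d-%H-%M-%S")
    return match.group("user"), started
```

For example, `parse_job_id("alice--llama31-70b-gcs-2025-01-31-09-05-07")` yields the user `alice` and a `datetime` for 31 January 2025 at 09:05:07.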


The NeMo log files include information about checkpoint operations on each rank. You can use the [checkpointing_metrics](../../../../src/utils/checkpointing_metrics) utility to calculate statistics for checkpoint write times.

To calculate statistics:


1. Set a path to the NeMo logs.

```
export JOB_ID=<JOB_ID>
export GCS_LOGS_PATH="gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/${JOB_ID}"
```

Replace `<JOB_ID>` with the ID of your job.
Collaborator

This section seems to end abruptly. I'm ok if we don't want to update the checkpoint metrics utility, but we should at least tell users where they can find this data in the logs.
