use torchmetrics ppl logging #382
base: main
Conversation
@@ -78,7 +78,7 @@ def bert_padding_collate_fn(
     "text": padding_value,
     "types": 0,
     "attention_mask": False,
-    "labels": -1,
+    "labels": -100,
Whoops 😄. We should fix this; currently labels are padded with both -1 and -100.
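A minimal sketch of the invariant in question, assuming the label pad value lives in one place and is reused by the metric (constant and placeholder values are illustrative, not the actual bionemo-llm API):

```python
import torchmetrics.text

# Hypothetical single source of truth for the label pad value.
LABEL_PAD_VALUE = -100

padding_values = {
    "text": 0,                  # token pad id, placeholder value
    "types": 0,
    "attention_mask": False,
    "labels": LABEL_PAD_VALUE,  # was -1 in one place and -100 in another
}

# The metric only ignores positions whose label equals its ignore_index,
# so the two values have to match.
ppl = torchmetrics.text.Perplexity(ignore_index=LABEL_PAD_VALUE)
```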
 def validation_step(self, batch, batch_idx: Optional[int] = None) -> Tensor:
     """In mcore the loss-function is part of the forward-pass when labels are provided."""
-    return self.forward_step(batch)
+    outputs = self.forward_step(batch)
+    self.valid_ppl(outputs["token_logits"], batch["labels"])
+    self.log("valid_ppl_step", self.valid_ppl)
+    return outputs
Where does validation_step get called, and how does this interact with megatron parallelism? Just wondering whether torchmetrics would correctly detect the parallelism settings and collect data from all ranks with this configuration.
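For context, torchmetrics does not detect megatron's parallelism settings at all; a short sketch of the relevant default:

```python
import torchmetrics.text

# torchmetrics does not inspect megatron's parallel_state: unless you pass a
# process_group when constructing the metric, its state is synced over the
# default torch.distributed world group, i.e. across every TP/PP/DP rank.
ppl = torchmetrics.text.Perplexity(ignore_index=-100)  # process_group defaults to None (= world)
```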
Maybe this would be a good place for a comment about token_logits being [s, b, *] while other things are [b, s, *].
Running with: force-pushed from 72097b9 to 238a30f
transposing labels: force-pushed from 238a30f to 11169e0
logits = outputs["token_logits"].transpose(0, 1)  # [s, b] -> [b, s]
self.train_ppl(logits, batch["labels"])
self.log("train_ppl_step", self.train_ppl)
For PP, we need to check whether this is at the last PP stage.
https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-llm/src/bionemo/llm/lightning.py#L410
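A sketch of what that guard could look like, assuming megatron-core's parallel_state is initialized and that only the last stage actually produces token_logits (not the exact code in this PR):

```python
from megatron.core import parallel_state

# Inside the LightningModule; a sketch only.
def training_step(self, batch, batch_idx=None):
    outputs = self.forward_step(batch)
    # Only the last pipeline stage has real logits; earlier stages should not
    # update the metric at all.
    if parallel_state.is_pipeline_last_stage():
        logits = outputs["token_logits"].transpose(0, 1)  # [s, b] -> [b, s]
        self.train_ppl(logits, batch["labels"])
        self.log("train_ppl_step", self.train_ppl)
    return outputs
```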
I believe NeMo returns 0 by default when it isn't at the last PP stage, but don't rely on my vague memory here. I don't know how it translates to other outputs such as token_logits.
If you need a device group for each parallel group, check out parallel_states.get_pipeline_model_parallel_group. You might have multiple PP groups if PP and TP happen at the same time: each TP under a PP group, or the other way around.
I am not sure how VPP plays a role here.
self.train_ppl = torchmetrics.text.Perplexity(ignore_index=-100)
self.valid_ppl = torchmetrics.text.Perplexity(ignore_index=-100)
Would be great for this to be configurable once the PR has matured.
For sure, I think we should make an ESM2LightningModule with a lot of these args pre-defined, where we can then attach metrics in an obvious way.
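A rough sketch of such a module (the class name and constructor are hypothetical, and in practice it would likely subclass the existing BioNeMo Lightning module rather than pl.LightningModule directly):

```python
import lightning.pytorch as pl
import torchmetrics.text

class ESM2LightningModule(pl.LightningModule):
    """Hypothetical ESM2-specific module with metrics attached in one obvious place."""

    def __init__(self, label_ignore_index: int = -100) -> None:
        super().__init__()
        # Registered as child modules so Lightning moves them to the right
        # device and torchmetrics handles its own cross-rank reduction.
        self.train_ppl = torchmetrics.text.Perplexity(ignore_index=label_ignore_index)
        self.valid_ppl = torchmetrics.text.Perplexity(ignore_index=label_ignore_index)
```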
outputs = self.forward_step(batch)
logits = outputs["token_logits"].transpose(0, 1)  # [s, b] -> [b, s]
self.train_ppl(logits, batch["labels"])
self.log("train_ppl_step", self.train_ppl)
And similarly, we need a different mechanism for CP, which isn't implemented even in the current PerplexityLoggingCallback.
https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-llm/src/bionemo/llm/lightning.py#L429
outputs = self.forward_step(batch)
logits = outputs["token_logits"].transpose(0, 1)  # [s, b] -> [b, s]
self.train_ppl(logits, batch["labels"])
self.log("train_ppl_step", self.train_ppl)
For TP, things are easier since there is already a TP aware method to get the loss.
https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-llm/src/bionemo/llm/model/loss.py#L262
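If we go the gather route instead, a sketch of what that could look like, assuming token_logits are vocab-parallel (sharded along the last dim) when TP > 1, which should be verified against the actual model config:

```python
from megatron.core import parallel_state
from megatron.core.tensor_parallel import gather_from_tensor_model_parallel_region

# Inside the step, after outputs = self.forward_step(batch); a sketch only.
token_logits = outputs["token_logits"]  # assumed [s, b, vocab/tp] when TP > 1
if parallel_state.get_tensor_model_parallel_world_size() > 1:
    # All-gather the vocab shards so each rank sees the full vocabulary
    # before handing the logits to a rank-local metric.
    token_logits = gather_from_tensor_model_parallel_region(token_logits)
logits = token_logits.transpose(0, 1)  # [s, b] -> [b, s]
```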
outputs = self.forward_step(batch)
logits = outputs["token_logits"].transpose(0, 1)  # [s, b] -> [b, s]
self.train_ppl(logits, batch["labels"])
self.log("train_ppl_step", self.train_ppl)
We don't need to sync_dist here.
def on_train_epoch_end(self):
    """Log perplexity at the end of the epoch."""
    self.log("train_ppl_step", self.train_ppl)

def on_valid_epoch_end(self):
    """Log perplexity at the end of the epoch."""
    self.log("valid_ppl_step", self.valid_ppl)
We only need to log once, in the step.
outputs = self.forward_step(batch)
logits = outputs["token_logits"].transpose(0, 1)  # [s, b] -> [b, s]
self.valid_ppl(logits, batch["labels"])
self.log("valid_ppl_step", self.valid_ppl)
But we will need sync_dist=True and on_epoch=True here.
Why is that? It seems to give similar perplexity values to the callback:
https://wandb.ai/clara-discovery/bionemo-hf-pretraining/reports/val_ppl-valid_ppl_step-24-11-20-18-12-31---VmlldzoxMDI2NjQ0MA
The sync_dist, sync_dist_group and reduce_fx flags from self.log(...) don’t affect the metric logging in any manner. The metric class contains its own distributed synchronization logic.
https://lightning.ai/docs/torchmetrics/stable/pages/lightning.html#logging-torchmetrics
So I think we do want on_epoch=True here (I'm not sure why my example run was so close without it...), but I think sync_dist is handled internally?
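For reference, a minimal sketch of the step with on_epoch=True, leaning on torchmetrics for the reduction (an assumption about how the final code would look, not a verbatim diff from this PR):

```python
def validation_step(self, batch, batch_idx=None):
    outputs = self.forward_step(batch)
    logits = outputs["token_logits"].transpose(0, 1)  # [s, b] -> [b, s]
    self.valid_ppl(logits, batch["labels"])  # updates metric state only
    # Logging the Metric object itself: Lightning calls compute() at epoch end
    # and torchmetrics does its own all-gather; sync_dist here would be a no-op.
    self.log("valid_ppl", self.valid_ppl, on_epoch=True)
    return outputs
```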
You need sync_dist_group because you don't want to sync with 0 or other weird values from non-last-pp-stage devices.
Maybe they are the same on a single device because you had self.log("valid_ppl_step", self.valid_ppl) in on_valid_epoch_end?
self.train_ppl = torchmetrics.text.Perplexity(ignore_index=-100)
self.valid_ppl = torchmetrics.text.Perplexity(ignore_index=-100)
Any changes to the defaults we should make there? I used torchmetrics with the HF pretraining and it seemed to work well, so I don't think the oversampling is too big of an issue.
sync_on_compute is on by default, so we are good on that end. I wonder if we need to specify process_group to skip non-last-pp-stage devices.
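One hedged way to wire that up, combining the last-stage guard discussed above with a narrower sync group (the claim that a last-PP-stage rank's data-parallel group contains only other last-stage ranks is an assumption about megatron's group layout worth double-checking):

```python
import torchmetrics.text
from megatron.core import parallel_state

# Construct the metric so compute() only reduces over data-parallel peers,
# which for a last-PP-stage rank should themselves all be last-stage ranks.
valid_ppl = torchmetrics.text.Perplexity(
    ignore_index=-100,
    process_group=parallel_state.get_data_parallel_group(),
)

# ...and, as discussed above, only update it where logits actually exist:
# if parallel_state.is_pipeline_last_stage():
#     valid_ppl(logits, labels)
```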
I also don't have a clear idea on how to treat TP devices.
Related effort: #525
Some experiments using torchmetrics for perplexity logging.
Example wandb chart for single-GPU training: https://wandb.ai/clara-discovery/bionemo-hf-pretraining/runs/fgn32r4p?nw=nwuserpstjohn