📚 Documentation
The `RateTracker` class in https://github.com/pytorch/xla/blob/fe3f23c62c747da30595cb9906d929b926aae6e4/torch_xla/core/xla_model.py doesn't have a docstring. This class is used in many tests, including one that is referenced from the main documentation, so new PyTorch/XLA users may see it as a natural, supported way to track and report training efficiency metrics.
`RateTracker`'s behavior is subtle and potentially confusing, since tracking throughput can involve measuring data at different granularities (e.g. batches, examples, or, for LLMs, tokens) and reporting per-accelerator, per-host, or globally. Here is my understanding of how it behaves; please correct me where I'm wrong.
Following the examples in those tests (where the batch size is added to the tracker at each training step), I think `rate()` measures examples (not tokens) per second over the most recent interval (specifically, since the last time `rate()` was called), and `global_rate()` measures the same over the whole training run. The expectation, then, is that `global_rate()` will be low at first, but once compilation and other one-time costs are paid it will rise and typically approach the per-batch training rate, though the latter may vary.
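To make the distinction concrete, here is a minimal pure-Python sketch of a tracker with the semantics I described above. Note this is *not* the actual `RateTracker` implementation (which, as far as I can tell, also applies exponential smoothing); it only illustrates the assumed `rate()` vs `global_rate()` difference. The `timer` parameter is my own addition for testability.

```python
import time


class SimpleRateTracker:
    """Sketch of the rate semantics described above (not torch_xla's code)."""

    def __init__(self, timer=time.time):
        self._timer = timer
        self._start = self._timer()
        self._partial_count = 0.0        # count added since the last rate() call
        self._partial_time = self._start
        self._count = 0.0                # total count since construction

    def add(self, count):
        # e.g. tracker.add(batch_size) once per training step
        self._partial_count += count
        self._count += count

    def rate(self):
        # Items/sec since the last rate() call (or since construction).
        now = self._timer()
        r = self._partial_count / max(now - self._partial_time, 1e-9)
        self._partial_count = 0.0
        self._partial_time = now
        return r

    def global_rate(self):
        # Items/sec averaged over the tracker's whole lifetime, so one-time
        # costs (compilation, warmup) drag it down early in training.
        return self._count / max(self._timer() - self._start, 1e-9)
```

Under this reading, `global_rate()` converges toward the steady-state `rate()` as the one-time startup cost is amortized over more steps.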
In terms of what granularity of devices the metrics reflect: for SPMD, I think both are global metrics (for the whole training job), but for other distribution strategies I think they're per-device.
Is that right?
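If my per-device vs. global reading is correct, converting a tracker reading into job-wide throughput would look something like this hypothetical helper (the function and its `spmd` flag are my own illustration, not part of torch_xla):

```python
def global_throughput(tracked_rate, num_devices, spmd=False):
    """Hypothetical conversion of a tracker reading to a job-wide rate.

    Assumption (as stated above): under SPMD the tracker already sees the
    global batch, so its rate is already global; under data-parallel
    strategies each replica tracks only its own shard, so the per-device
    rate must be scaled by the number of devices.
    """
    return tracked_rate if spmd else tracked_rate * num_devices


# e.g. 8 data-parallel devices each reporting 500 examples/sec:
print(global_throughput(500.0, 8))             # 4000.0 examples/sec job-wide
print(global_throughput(4000.0, 8, spmd=True)) # already global: 4000.0
```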