
xla_model.RateTracker doesn't have a docstring and its behavior is subtle and potentially confusing. #6760

@ebreck

Description


📚 Documentation

The RateTracker class in https://github.com/pytorch/xla/blob/fe3f23c62c747da30595cb9906d929b926aae6e4/torch_xla/core/xla_model.py doesn't have a docstring. This class is used in lots of tests, including this one that is referenced from the main documentation, so new PyTorch/XLA users may see it as a natural and supported way to track and report training efficiency metrics.

RateTracker's behavior is subtle and potentially confusing, since tracking throughput can involve measuring data at different granularities (e.g. batches, examples, or, for LLMs, tokens) and reporting per accelerator, per host, or globally. Here is what I think the answers to these questions are; please correct me if I'm wrong.

Following the examples in those tests (where the batch size is added to the tracker at each training step), I think that `rate()` measures the examples (not tokens) per second seen during the last batch (more precisely, since the last time `rate()` was called), and `global_rate()` measures the same over the whole training run. The expectation would therefore be that `global_rate()` is low at first, but rises once compilation and other one-time costs are amortized, typically approaching the per-batch rate, though the latter may vary.
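
For concreteness, here is the pattern from those tests, condensed into a minimal self-contained sketch (the `sleep` stands in for a real training step, and `batch_size = 32` is an arbitrary placeholder; the comments reflect my reading above, not documented behavior):

```python
import time

import torch_xla.core.xla_model as xm

tracker = xm.RateTracker()
batch_size = 32  # arbitrary placeholder; the tests add flags.batch_size

for step in range(1, 101):
    time.sleep(0.01)         # stand-in for a real training step
    tracker.add(batch_size)  # as in the tests: count examples, not tokens
    if step % 20 == 0:
        # As I read it: rate() is examples/sec since the previous rate() call,
        # while global_rate() is total examples / wall time since construction.
        print(f'step {step}: rate={tracker.rate():.1f} ex/s, '
              f'global_rate={tracker.global_rate():.1f} ex/s')
```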

In terms of what device granularity the metrics reflect: for SPMD, I think they are both global metrics (covering the whole training job), but for other distribution strategies, I think they're per-device.
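
If that's right, then under a non-SPMD strategy something like the following would be needed to estimate job-wide throughput (a sketch assuming `torch_xla.runtime.world_size()` returns the number of participating devices; `estimated_global_rate` is a hypothetical helper, not part of the API):

```python
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

def estimated_global_rate(tracker: xm.RateTracker) -> float:
    # Hypothetical: if global_rate() is per-device under multi-process data
    # parallelism, scaling by the device count approximates the job-wide
    # throughput. If the metric is already global under SPMD, no scaling
    # would be needed there.
    return tracker.global_rate() * xr.world_size()
```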

Is that right?
