[Device-Metrics]: Add support to dump device metrics in log xpu, cpu#10554
[Device-Metrics]: Add support to dump device metrics in log xpu, cpu#10554koppulac wants to merge 2 commits into
Conversation
Signed-off-by: Chiranjeevi Koppula <ckoppula@habana.ai>
There was a problem hiding this comment.
Code Review
This pull request adds background telemetry sampling for Intel XPU and host CPU/RAM resources using xpu-smi and psutil to prevent blocking the main training thread. It also removes the TorchProfilerCallback and ModuleProfilerCallback classes. The review feedback suggests removing the hardcoded sudo from the xpu-smi command to prevent blocking or permission issues in restricted environments, and questions whether the deletion of the profiler callbacks was intentional, as it could break existing profiling workflows.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| Metric 31 provides the most accurate compute utilization for AI/ML workloads. | ||
| """ | ||
| metrics: dict[str, float] = {} | ||
| cmd = ["sudo", "xpu-smi", "dump", "-d", str(device_id), "-m", "5,18,31", "-n", "1"] |
There was a problem hiding this comment.
Hardcoding sudo in the xpu-smi command is problematic because:
- In many environments (such as Docker containers, standard user accounts on shared servers, or non-root execution environments), the user does not have
sudoprivileges orsudorequires a password. - If
sudorequires a password, the command will block waiting for input, eventually hitting the 3-second timeout on every logging step. This will severely degrade training performance by blocking the background thread pool queue.
Since querying telemetry via xpu-smi dump does not require root privileges, we should remove sudo from the command.
| cmd = ["sudo", "xpu-smi", "dump", "-d", str(device_id), "-m", "5,18,31", "-n", "1"] | |
| cmd = ["xpu-smi", "dump", "-d", str(device_id), "-m", "5,18,31", "-n", "1"] |
| self.thread_pool.submit(self._write_log, args.output_dir, logs) | ||
|
|
||
|
|
||
| class TorchProfilerCallback(TrainerCallback): |
There was a problem hiding this comment.
The deletion of TorchProfilerCallback and ModuleProfilerCallback (along with the removal of defaultdict and fnmatch imports) seems unrelated to the PR's objective of adding XPU and CPU device metrics support. Please confirm if this deletion was intentional or if it was an accidental omission during a rebase/merge, as removing these callbacks will break any existing profiling workflows that rely on them.
Signed-off-by: Chiranjeevi Koppula <ckoppula@habana.ai>
What does this PR do?
Fixes # (issue)
Before submitting