Skip to content

[Device-Metrics]: Add support to dump device metrics in log xpu, cpu#10554

Open
koppulac wants to merge 2 commits into
hiyouga:mainfrom
koppulac:device_metrics
Open

[Device-Metrics]: Add support to dump device metrics in log xpu, cpu#10554
koppulac wants to merge 2 commits into
hiyouga:mainfrom
koppulac:device_metrics

Conversation

@koppulac

@koppulac koppulac commented Jun 8, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes # (issue)

Before submitting

Signed-off-by: Chiranjeevi Koppula <ckoppula@habana.ai>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds background telemetry sampling for Intel XPU and host CPU/RAM resources using xpu-smi and psutil to prevent blocking the main training thread. It also removes the TorchProfilerCallback and ModuleProfilerCallback classes. The review feedback suggests removing the hardcoded sudo from the xpu-smi command to prevent blocking or permission issues in restricted environments, and questions whether the deletion of the profiler callbacks was intentional, as it could break existing profiling workflows.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Metric 31 provides the most accurate compute utilization for AI/ML workloads.
"""
metrics: dict[str, float] = {}
cmd = ["sudo", "xpu-smi", "dump", "-d", str(device_id), "-m", "5,18,31", "-n", "1"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding sudo in the xpu-smi command is problematic because:

  1. In many environments (such as Docker containers, standard user accounts on shared servers, or non-root execution environments), the user does not have sudo privileges or sudo requires a password.
  2. If sudo requires a password, the command will block waiting for input, eventually hitting the 3-second timeout on every logging step. This will severely degrade training performance by blocking the background thread pool queue.

Since querying telemetry via xpu-smi dump does not require root privileges, we should remove sudo from the command.

Suggested change
cmd = ["sudo", "xpu-smi", "dump", "-d", str(device_id), "-m", "5,18,31", "-n", "1"]
cmd = ["xpu-smi", "dump", "-d", str(device_id), "-m", "5,18,31", "-n", "1"]

self.thread_pool.submit(self._write_log, args.output_dir, logs)


class TorchProfilerCallback(TrainerCallback):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The deletion of TorchProfilerCallback and ModuleProfilerCallback (along with the removal of defaultdict and fnmatch imports) seems unrelated to the PR's objective of adding XPU and CPU device metrics support. Please confirm if this deletion was intentional or if it was an accidental omission during a rebase/merge, as removing these callbacks will break any existing profiling workflows that rely on them.

Signed-off-by: Chiranjeevi Koppula <ckoppula@habana.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant