Add configurable profiler for v1 training by addsubmuldiv · Pull Request #10595 · hiyouga/LlamaFactory

addsubmuldiv · 2026-06-21T08:54:21Z

What does this PR do?

Adds configurable torch profiler support for the v1 trainer only.

Main changes:

Adds a self-contained v1 profiler controller under src/llamafactory/v1/core/utils/profiler.py, next to other training-core utilities.
Adds the v1 profiler callback under src/llamafactory/v1/utils/callbacks/profiler_callback.py, matching the existing v1 callback layout.
Keeps the profiler implementation isolated from the legacy trainer and shared extras modules.
Reuses enable_torch_profiler as the v1 public switch.
Supports CPU, CUDA, and Ascend NPU profiler backends with v1 YAML arguments for activities, schedule, rank selection, and output directory.
Adds Ascend NPU options for profiler_level and profiler_aic_metrics, plus validated profiler_backend_options.npu advanced options.
Writes profiler traces under per-rank directories such as <profiler_output_dir>/rank_0/.
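The rank selection and per-rank output layout can be pictured with a short sketch. The helper names `should_profile` and `rank_output_dir` are hypothetical and are not taken from the PR; only the `rank0` and `all` values of `profiler_rank_mode` and the `rank_<N>` directory pattern come from this description, and the real option may accept other modes.

```python
from pathlib import Path


def should_profile(rank: int, rank_mode: str) -> bool:
    """Decide whether this process records a trace (hypothetical helper)."""
    if rank_mode == "all":
        return True
    if rank_mode == "rank0":
        return rank == 0
    raise ValueError(f"unsupported profiler_rank_mode: {rank_mode!r}")


def rank_output_dir(output_dir: str, rank: int) -> Path:
    """Return <profiler_output_dir>/rank_<rank> (hypothetical helper)."""
    return Path(output_dir) / f"rank_{rank}"


if __name__ == "__main__":
    # Two ranks with profiler_rank_mode: all, as in the 2-NPU smoke test.
    for rank in range(2):
        if should_profile(rank, "all"):
            print(rank_output_dir("./saves/profile", rank))
```

Keeping one directory per rank means traces from different processes never overwrite each other, which matters once `profiler_rank_mode: all` is used.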

Example:

enable_torch_profiler: true
profiler_output_dir: ./saves/profile
profiler_skip_first: 8
profiler_wait_steps: 0
profiler_warmup_steps: 1
profiler_active_steps: 3
profiler_repeat: 1
profiler_rank_mode: rank0

# Ascend NPU only
profiler_level: level1
profiler_aic_metrics: pipe_utilization
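To see which training steps the schedule above would record, the following sketch mirrors the documented behavior of `torch.profiler.schedule` in pure Python. It assumes the `profiler_skip_first`, `profiler_wait_steps`, `profiler_warmup_steps`, `profiler_active_steps`, and `profiler_repeat` fields are forwarded to the `skip_first`, `wait`, `warmup`, `active`, and `repeat` arguments of that function, and that the profiler is stepped once per training step; the PR text does not state either point explicitly. `phase_at` is a hypothetical name.

```python
def phase_at(step: int, skip_first: int, wait: int, warmup: int, active: int, repeat: int) -> str:
    """Return "none", "warmup", or "record" for a zero-based step index."""
    if step < skip_first:
        return "none"
    step -= skip_first
    cycle = wait + warmup + active
    # repeat == 0 means the wait/warmup/active cycle continues indefinitely.
    if repeat > 0 and step // cycle >= repeat:
        return "none"
    position = step % cycle
    if position < wait:
        return "none"
    if position < wait + warmup:
        return "warmup"
    return "record"


if __name__ == "__main__":
    config = dict(skip_first=8, wait=0, warmup=1, active=3, repeat=1)
    recorded = [step for step in range(20) if phase_at(step, **config) == "record"]
    print(recorded)  # [9, 10, 11]
```

Under these assumptions the example configuration skips steps 0 to 7, warms up on step 8, records steps 9 to 11, and then stays idle because `profiler_repeat` is 1.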

Before submitting

Did you read the contributor guideline?
Did you write any new necessary tests?

Verification

Local checks in the lf_dev environment:

python -m py_compile src/llamafactory/v1/core/utils/profiler.py src/llamafactory/v1/utils/callbacks/profiler_callback.py src/llamafactory/v1/utils/callbacks/__init__.py src/llamafactory/v1/config/training_args.py src/llamafactory/v1/core/base_trainer.py
ruff check src/llamafactory/v1/core/utils/profiler.py src/llamafactory/v1/utils/callbacks/profiler_callback.py src/llamafactory/v1/utils/callbacks/__init__.py src/llamafactory/v1/config/training_args.py src/llamafactory/v1/core/base_trainer.py
git diff --check
Verified the final PR diff only touches src/llamafactory/v1/....
Profiler config smoke check for enable_torch_profiler, explicit false overrides, and NPU profiler_level / profiler_aic_metrics / profiler_backend_options.npu validation.
Profiler callback import smoke check through llamafactory.v1.utils.callbacks.ProfilerCallback.

Manual Ascend NPU smoke tests:

v1 trainer, 1 NPU: training completed and generated rank_0 profiler output.
v1 trainer, 2 NPUs with profiler_rank_mode: all: training completed and generated separate rank_0 and rank_1 profiler outputs, including trace_view.json, operator_details.csv, and step_trace_time.csv.

Legacy trainer / v0 was not rerun after this change because the final diff no longer touches legacy trainer files.

gemini-code-assist

Code Review

This pull request introduces a robust PyTorch profiler integration supporting CPU, CUDA, and Ascend NPU devices, managed via a new ProfilerController and exposed through training arguments and callbacks. The feedback highlights two important robustness improvements: checking for the existence of _ExperimentalConfig on the torch_npu profiler module to prevent AttributeError on older versions, and wrapping inspect.signature in a try-except block to safely handle built-in functions that lack signature metadata.

gemini-code-assist · 2026-06-21T08:56:23Z

+        if self.name != "npu" or config.activities == "cpu":
+            return None
+
+        profiler = self.profiler_module


In older or custom versions of torch_npu, _ExperimentalConfig might not be available in torch_npu.profiler. To prevent an AttributeError and subsequent training crash, we should check if _ExperimentalConfig exists on the profiler module before attempting to instantiate it.

        profiler = self.profiler_module
        if not hasattr(profiler, "_ExperimentalConfig"):
            if logger is not None:
                _log(
                    logger,
                    "warning",
                    "The installed torch_npu version does not support _ExperimentalConfig. NPU-specific profiling options will be ignored.",
                )
            return None
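The same feature-detection guard can be exercised without an Ascend device. This is a minimal sketch, not the PR's code: `build_experimental_config` is a hypothetical helper, and `SimpleNamespace` objects stand in for `torch_npu.profiler`, which is not imported here. The stand-in config simply echoes its keyword arguments and does not model the real `_ExperimentalConfig` parameters.

```python
import logging
from types import SimpleNamespace
from typing import Any, Optional

logger = logging.getLogger(__name__)


def build_experimental_config(profiler_module: Any, **options: Any) -> Optional[Any]:
    """Return profiler_module._ExperimentalConfig(**options), or None if it is missing."""
    config_cls = getattr(profiler_module, "_ExperimentalConfig", None)
    if config_cls is None:
        logger.warning(
            "The installed torch_npu version does not support _ExperimentalConfig. "
            "NPU-specific profiling options will be ignored."
        )
        return None
    return config_cls(**options)


if __name__ == "__main__":
    # Stand-ins for an older and a newer torch_npu.profiler module.
    old_module = SimpleNamespace()
    new_module = SimpleNamespace(_ExperimentalConfig=lambda **kw: dict(kw))
    print(build_experimental_config(old_module, profiler_level="level1"))
    print(build_experimental_config(new_module, profiler_level="level1"))
```

Returning `None` instead of raising lets training continue with default profiler settings when the installed backend is too old for the advanced options.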

gemini-code-assist · 2026-06-21T08:56:23Z

+    logger: Any = None,
+    explicit_keys: Optional[set[str]] = None,
+) -> dict[str, Any]:
+    parameters = inspect.signature(fn).parameters


To prevent potential runtime crashes on certain Python/PyTorch environments or compiled/built-in objects where inspect.signature might raise a ValueError (e.g., "no signature found for builtin"), it is safer to wrap the signature inspection in a try...except ValueError block and return the keyword arguments unfiltered as a fallback.

Suggested change

-    parameters = inspect.signature(fn).parameters
+    try:
+        parameters = inspect.signature(fn).parameters
+    except ValueError:
+        return kwargs
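A self-contained version of this pattern might look like the sketch below. The name `filter_supported_kwargs` and the `make_schedule` stand-in are hypothetical; the function in the PR also takes `logger` and `explicit_keys` arguments whose handling is not shown in the hunk, so they are left out here.

```python
import inspect
from typing import Any, Callable


def filter_supported_kwargs(fn: Callable[..., Any], kwargs: dict[str, Any]) -> dict[str, Any]:
    """Keep only the keyword arguments that fn declares (hypothetical helper)."""
    try:
        parameters = inspect.signature(fn).parameters
    except ValueError:
        # No signature metadata (some builtins and C extensions): pass everything through.
        return kwargs
    # A **kwargs parameter means fn accepts arbitrary keywords.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()):
        return kwargs
    return {key: value for key, value in kwargs.items() if key in parameters}


def make_schedule(wait: int = 0, warmup: int = 0, active: int = 1) -> tuple[int, int, int]:
    """Stand-in for a backend function that accepts only some options."""
    return wait, warmup, active


if __name__ == "__main__":
    requested = {"wait": 0, "warmup": 1, "active": 3, "skip_first_wait": 1}
    print(filter_supported_kwargs(make_schedule, requested))
```

The unfiltered fallback trades safety for availability: an unsupported keyword may still fail at call time, but the filter itself no longer crashes on callables that cannot be introspected.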

Copilot

Pull request overview

This PR adds a shared, unified torch-profiler configuration and control path so both the legacy (v0) trainer and the v1 trainer can enable profiling via the same YAML arguments (enable_torch_profiler + associated profiler_* fields), including CPU/CUDA and Ascend NPU-specific options.

Changes:

Introduces a reusable ProfilerConfig/ProfilerController implementation that encapsulates profiler backend selection, validation, scheduling, stepping, and trace output layout.
Wires profiler callbacks into both training architectures (legacy TorchProfilerCallback and new v1 ProfilerCallback) behind the existing enable_torch_profiler switch.
Adds English + Simplified Chinese documentation for profiler arguments, schedule semantics, and Ascend NPU settings.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

File	Description
src/llamafactory/v1/core/base_trainer.py	Registers the v1 profiler callback when `enable_torch_profiler` is set.
src/llamafactory/v1/config/training_args.py	Adds v1 training args for profiler configuration (schedule, activities, rank mode, NPU options).
src/llamafactory/v1/accelerator/profiler.py	Implements the v1 `ProfilerCallback` that delegates to the shared controller.
src/llamafactory/train/callbacks.py	Refactors legacy `TorchProfilerCallback` to use the shared `ProfilerController` and updates docs/comments.
src/llamafactory/hparams/training_args.py	Extends legacy profiler hyperparameters to match the unified config surface.
src/llamafactory/extras/profiler.py	Adds the shared profiler config/controller and backend handling (CPU/CUDA/NPU).
docs/zh/hyperparameters/training-argument.md	Adds Simplified Chinese documentation for profiler YAML options and NPU notes.
docs/en/hyperparameters/training-argument.md	Adds English documentation for profiler YAML options and NPU notes.

+    def validate(self, backend_name: Optional[str] = None) -> None:
+        if not self.enabled:
+            return
+
+        _validate_int_option("profiler_skip_first", self.skip_first, min_value=0)
+        _validate_int_option("profiler_wait_steps", self.wait_steps, min_value=0)
+        _validate_int_option("profiler_warmup_steps", self.warmup_steps, min_value=0)
+        _validate_int_option("profiler_active_steps", self.active_steps, min_value=1)
+        _validate_int_option("profiler_repeat", self.repeat, min_value=0)
+        _validate_choice_option("profiler_activities", self.activities, _SUPPORTED_ACTIVITIES)
+        _validate_choice_option("profiler_rank_mode", self.rank_mode, _SUPPORTED_RANK_MODES)
+
+        if backend_name == "npu" and self.activities != "cpu":
+            self.npu_profiler_level_name()
+            self.npu_aic_metrics_name()
+            self.npu_backend_options()
+        self.schedule_kwargs()
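The hunk above calls `_validate_int_option` and `_validate_choice_option`, whose bodies are not shown on this page. The sketch below is one plausible shape for such helpers, written under that assumption; the names are given without the leading underscore to mark them as illustrative rather than the PR's implementation, and the set of rank modes passed in the usage example is only what the description mentions.

```python
from collections.abc import Collection
from typing import Any


def validate_int_option(name: str, value: Any, min_value: int) -> None:
    """Raise ValueError unless value is an int (not a bool) >= min_value."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{name} must be an integer, got {value!r}.")
    if value < min_value:
        raise ValueError(f"{name} must be >= {min_value}, got {value}.")


def validate_choice_option(name: str, value: Any, choices: Collection[str]) -> None:
    """Raise ValueError unless value is one of the allowed choices."""
    if value not in choices:
        raise ValueError(f"{name} must be one of {sorted(choices)}, got {value!r}.")


if __name__ == "__main__":
    validate_int_option("profiler_active_steps", 3, min_value=1)
    validate_choice_option("profiler_rank_mode", "rank0", {"rank0", "all"})
    try:
        validate_int_option("profiler_active_steps", 0, min_value=1)
    except ValueError as err:
        print(err)
```

Note that `profiler_active_steps` is the only schedule field in the hunk with a minimum of 1: a cycle with zero active steps would never record anything.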


addsubmuldiv added 4 commits June 20, 2026 17:52

add unified profiler configuration

4d0d387

add npu profiler options

9f20b2a

document profiler training arguments

367f14e

reuse torch profiler enable flag

e863a2e

gemini-code-assist Bot reviewed Jun 21, 2026

addsubmuldiv marked this pull request as ready for review June 21, 2026 09:01

Copilot AI review requested due to automatic review settings June 21, 2026 09:01

Copilot started reviewing on behalf of addsubmuldiv June 21, 2026 09:02

addsubmuldiv changed the title ~~Add profiler config~~ Add configurable torch profiler for training Jun 21, 2026

Copilot AI reviewed Jun 21, 2026

addsubmuldiv marked this pull request as draft June 22, 2026 02:04

addsubmuldiv changed the title ~~Add configurable torch profiler for training~~ Add configurable profiler for v1 training Jun 22, 2026

addsubmuldiv added 5 commits June 22, 2026 02:14

isolate profiler support to v1

b1844be

move v1 profiler callback to callbacks package

b6f8634

move v1 profiler controller to core utils

6200725

restore v1 accelerator profiler placeholder

4acc108

support bare optional profiler bool flags

95b3cef

addsubmuldiv marked this pull request as ready for review June 22, 2026 07:30

addsubmuldiv temporarily deployed to docker July 1, 2026 03:59 — with GitHub Actions Inactive

addsubmuldiv wants to merge 9 commits into hiyouga:main from addsubmuldiv:add-profiler-config
