Add configurable profiler for v1 training #10595

Open
addsubmuldiv wants to merge 9 commits into hiyouga:main from addsubmuldiv:add-profiler-config

@addsubmuldiv addsubmuldiv commented Jun 21, 2026

What does this PR do?

Adds configurable torch profiler support for the v1 trainer only.

Main changes:

  • Adds a self-contained v1 profiler controller under src/llamafactory/v1/core/utils/profiler.py, next to other training-core utilities.
  • Adds the v1 profiler callback under src/llamafactory/v1/utils/callbacks/profiler_callback.py, matching the existing v1 callback layout.
  • Keeps the profiler implementation isolated from the legacy trainer and shared extras modules.
  • Reuses enable_torch_profiler as the v1 public switch.
  • Supports CPU, CUDA, and Ascend NPU profiler backends with v1 YAML arguments for activities, schedule, rank selection, and output directory.
  • Adds Ascend NPU options for profiler_level and profiler_aic_metrics, plus validated profiler_backend_options.npu advanced options.
  • Writes profiler traces under per-rank directories such as <profiler_output_dir>/rank_0/.
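
The rank selection and per-rank output layout described above can be sketched roughly as follows. This is an illustrative sketch only: `should_profile_rank` and `rank_output_dir` are hypothetical helper names, not functions from this PR, and the only rank modes assumed are the two that appear in this conversation (`rank0` and `all`).

```python
from pathlib import Path


def should_profile_rank(rank: int, rank_mode: str) -> bool:
    """Return True if the process with this rank should run the profiler.

    Hypothetical helper; assumes only the "rank0" and "all" modes.
    """
    if rank_mode == "all":
        return True
    if rank_mode == "rank0":
        return rank == 0
    raise ValueError(f"Unsupported profiler_rank_mode: {rank_mode!r}")


def rank_output_dir(profiler_output_dir: str, rank: int) -> Path:
    """Build the per-rank trace directory, e.g. <profiler_output_dir>/rank_0."""
    return Path(profiler_output_dir) / f"rank_{rank}"
```

With `profiler_rank_mode: rank0`, only rank 0 would write traces under this sketch; with `all`, every rank would get its own `rank_<n>` directory.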

Example:

enable_torch_profiler: true
profiler_output_dir: ./saves/profile
profiler_skip_first: 8
profiler_wait_steps: 0
profiler_warmup_steps: 1
profiler_active_steps: 3
profiler_repeat: 1
profiler_rank_mode: rank0

# Ascend NPU only
profiler_level: level1
profiler_aic_metrics: pipe_utilization
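
To make the schedule fields concrete, here is a small pure-Python sketch of which steps would be recorded. It assumes the `profiler_*` fields map one-to-one onto the `skip_first`, `wait`, `warmup`, `active`, and `repeat` arguments of `torch.profiler.schedule` and that steps are counted from 0; neither assumption is confirmed by this PR. The function `schedule_phase` is a hypothetical stand-in that mirrors the documented schedule semantics, with the final "record and save" step folded into "record".

```python
def schedule_phase(step: int, *, skip_first: int, wait: int, warmup: int,
                   active: int, repeat: int) -> str:
    """Return "none", "warmup", or "record" for a 0-based training step."""
    if step < skip_first:
        return "none"
    step -= skip_first
    cycle = wait + warmup + active
    # repeat == 0 means the cycle repeats for the whole run.
    if repeat > 0 and step // cycle >= repeat:
        return "none"
    position = step % cycle
    if position < wait:
        return "none"
    if position < wait + warmup:
        return "warmup"
    return "record"


# Values taken from the YAML example above.
phases = [
    schedule_phase(s, skip_first=8, wait=0, warmup=1, active=3, repeat=1)
    for s in range(14)
]
```

Under these assumptions, steps 0 to 7 are skipped, step 8 is warmup, steps 9 to 11 are recorded, and profiling stops afterwards because `profiler_repeat` is 1.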

Before submitting

Verification

Local checks in the lf_dev environment:

  • python -m py_compile src/llamafactory/v1/core/utils/profiler.py src/llamafactory/v1/utils/callbacks/profiler_callback.py src/llamafactory/v1/utils/callbacks/__init__.py src/llamafactory/v1/config/training_args.py src/llamafactory/v1/core/base_trainer.py
  • ruff check src/llamafactory/v1/core/utils/profiler.py src/llamafactory/v1/utils/callbacks/profiler_callback.py src/llamafactory/v1/utils/callbacks/__init__.py src/llamafactory/v1/config/training_args.py src/llamafactory/v1/core/base_trainer.py
  • git diff --check
  • Verified the final PR diff only touches src/llamafactory/v1/....
  • Profiler config smoke check for enable_torch_profiler, explicit false overrides, and NPU profiler_level / profiler_aic_metrics / profiler_backend_options.npu validation.
  • Profiler callback import smoke check through llamafactory.v1.utils.callbacks.ProfilerCallback.

Manual Ascend NPU smoke tests:

  • v1 trainer, 1 NPU: training completed and generated rank_0 profiler output.
  • v1 trainer, 2 NPUs with profiler_rank_mode: all: training completed and generated separate rank_0 and rank_1 profiler outputs, including trace_view.json, operator_details.csv, and step_trace_time.csv.

Legacy trainer / v0 was not rerun after this change because the final diff no longer touches legacy trainer files.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a robust PyTorch profiler integration supporting CPU, CUDA, and Ascend NPU devices, managed via a new ProfilerController and exposed through training arguments and callbacks. The feedback highlights two important robustness improvements: checking for the existence of _ExperimentalConfig on the torch_npu profiler module to prevent AttributeError on older versions, and wrapping inspect.signature in a try-except block to safely handle built-in functions that lack signature metadata.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

        if self.name != "npu" or config.activities == "cpu":
            return None

        profiler = self.profiler_module

medium

In older or custom versions of torch_npu, _ExperimentalConfig might not be available in torch_npu.profiler. To prevent an AttributeError and subsequent training crash, we should check if _ExperimentalConfig exists on the profiler module before attempting to instantiate it.

        profiler = self.profiler_module
        if not hasattr(profiler, "_ExperimentalConfig"): 
            if logger is not None:
                _log(
                    logger,
                    "warning",
                    "The installed torch_npu version does not support _ExperimentalConfig. NPU-specific profiling options will be ignored.",
                )
            return None
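
The same guard can be written with `getattr` so the lookup and the existence check happen in one step. The sketch below is self-contained and uses `SimpleNamespace` objects as stand-ins for an old and a new profiler module; they are not the real `torch_npu.profiler` API, and `build_experimental_config` is a hypothetical name.

```python
from types import SimpleNamespace
from typing import Any, Optional


def build_experimental_config(profiler_module: Any, **options: Any) -> Optional[Any]:
    """Instantiate _ExperimentalConfig if the module provides it, else return None."""
    config_cls = getattr(profiler_module, "_ExperimentalConfig", None)
    if config_cls is None:
        # Older or custom builds: skip NPU-specific options instead of crashing.
        return None
    return config_cls(**options)


# Stand-in modules for illustration only.
old_module = SimpleNamespace()
new_module = SimpleNamespace(_ExperimentalConfig=lambda **kw: dict(kw))
```

Returning `None` here matches the early-return path already used for non-NPU backends, so callers would not need a separate branch for the unsupported case.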

    logger: Any = None,
    explicit_keys: Optional[set[str]] = None,
) -> dict[str, Any]:
    parameters = inspect.signature(fn).parameters

medium

To prevent potential runtime crashes on certain Python/PyTorch environments or compiled/built-in objects where inspect.signature might raise a ValueError (e.g., "no signature found for builtin"), it is safer to wrap the signature inspection in a try...except ValueError block and return the keyword arguments unfiltered as a fallback.

Suggested change
-    parameters = inspect.signature(fn).parameters
+    try:
+        parameters = inspect.signature(fn).parameters
+    except ValueError:
+        return kwargs
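
For context, a standalone version of this kind of keyword filtering might look like the sketch below. `filter_kwargs` is a hypothetical name and does not reproduce the PR's function (which also takes `logger` and `explicit_keys`); the pass-through for callables that accept `**kwargs` is an added assumption.

```python
import inspect
from typing import Any, Callable


def filter_kwargs(fn: Callable[..., Any], kwargs: dict[str, Any]) -> dict[str, Any]:
    """Keep only the keyword arguments that fn declares in its signature."""
    try:
        parameters = inspect.signature(fn).parameters
    except ValueError:
        # No signature metadata (some built-in or compiled callables):
        # fall back to passing everything through unfiltered.
        return dict(kwargs)
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()):
        # fn accepts **kwargs, so every key is acceptable.
        return dict(kwargs)
    return {key: value for key, value in kwargs.items() if key in parameters}


def example(a, b=1):
    return a + b


def accepts_anything(**options):
    return options
```

One trade-off of the unfiltered fallback is that an unsupported key would then surface as a `TypeError` at call time rather than being dropped silently.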

@addsubmuldiv addsubmuldiv marked this pull request as ready for review June 21, 2026 09:01
Copilot AI review requested due to automatic review settings June 21, 2026 09:01
@addsubmuldiv addsubmuldiv changed the title from "Add profiler config" to "Add configurable torch profiler for training" Jun 21, 2026

Copilot AI left a comment

Pull request overview

This PR adds a shared, unified torch-profiler configuration and control path so both the legacy (v0) trainer and the v1 trainer can enable profiling via the same YAML arguments (enable_torch_profiler + associated profiler_* fields), including CPU/CUDA and Ascend NPU-specific options.

Changes:

  • Introduces a reusable ProfilerConfig/ProfilerController implementation that encapsulates profiler backend selection, validation, scheduling, stepping, and trace output layout.
  • Wires profiler callbacks into both training architectures (legacy TorchProfilerCallback and new v1 ProfilerCallback) behind the existing enable_torch_profiler switch.
  • Adds English + Simplified Chinese documentation for profiler arguments, schedule semantics, and Ascend NPU settings.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/llamafactory/v1/core/base_trainer.py | Registers the v1 profiler callback when enable_torch_profiler is set. |
| src/llamafactory/v1/config/training_args.py | Adds v1 training args for profiler configuration (schedule, activities, rank mode, NPU options). |
| src/llamafactory/v1/accelerator/profiler.py | Implements the v1 ProfilerCallback that delegates to the shared controller. |
| src/llamafactory/train/callbacks.py | Refactors legacy TorchProfilerCallback to use the shared ProfilerController and updates docs/comments. |
| src/llamafactory/hparams/training_args.py | Extends legacy profiler hyperparameters to match the unified config surface. |
| src/llamafactory/extras/profiler.py | Adds the shared profiler config/controller and backend handling (CPU/CUDA/NPU). |
| docs/zh/hyperparameters/training-argument.md | Adds Simplified Chinese documentation for profiler YAML options and NPU notes. |
| docs/en/hyperparameters/training-argument.md | Adds English documentation for profiler YAML options and NPU notes. |

Comment on lines +118 to +134
    def validate(self, backend_name: Optional[str] = None) -> None:
        if not self.enabled:
            return

        _validate_int_option("profiler_skip_first", self.skip_first, min_value=0)
        _validate_int_option("profiler_wait_steps", self.wait_steps, min_value=0)
        _validate_int_option("profiler_warmup_steps", self.warmup_steps, min_value=0)
        _validate_int_option("profiler_active_steps", self.active_steps, min_value=1)
        _validate_int_option("profiler_repeat", self.repeat, min_value=0)
        _validate_choice_option("profiler_activities", self.activities, _SUPPORTED_ACTIVITIES)
        _validate_choice_option("profiler_rank_mode", self.rank_mode, _SUPPORTED_RANK_MODES)

        if backend_name == "npu" and self.activities != "cpu":
            self.npu_profiler_level_name()
            self.npu_aic_metrics_name()
            self.npu_backend_options()
        self.schedule_kwargs()
@addsubmuldiv addsubmuldiv marked this pull request as draft June 22, 2026 02:04
@addsubmuldiv addsubmuldiv changed the title from "Add configurable torch profiler for training" to "Add configurable profiler for v1 training" Jun 22, 2026