Add configurable profiler for v1 training #10595
Code Review
This pull request introduces a robust PyTorch profiler integration supporting CPU, CUDA, and Ascend NPU devices, managed via a new ProfilerController and exposed through training arguments and callbacks. The feedback highlights two important robustness improvements: checking for the existence of _ExperimentalConfig on the torch_npu profiler module to prevent AttributeError on older versions, and wrapping inspect.signature in a try-except block to safely handle built-in functions that lack signature metadata.
    if self.name != "npu" or config.activities == "cpu":
        return None

    profiler = self.profiler_module
In older or custom versions of torch_npu, _ExperimentalConfig might not be available in torch_npu.profiler. To prevent an AttributeError and subsequent training crash, we should check if _ExperimentalConfig exists on the profiler module before attempting to instantiate it.
    profiler = self.profiler_module
    if not hasattr(profiler, "_ExperimentalConfig"):
        if logger is not None:
            _log(
                logger,
                "warning",
                "The installed torch_npu version does not support _ExperimentalConfig. NPU-specific profiling options will be ignored.",
            )
        return None

        logger: Any = None,
        explicit_keys: Optional[set[str]] = None,
    ) -> dict[str, Any]:
        parameters = inspect.signature(fn).parameters
To prevent potential runtime crashes on certain Python/PyTorch environments or compiled/built-in objects where inspect.signature might raise a ValueError (e.g., "no signature found for builtin"), it is safer to wrap the signature inspection in a try...except ValueError block and return the keyword arguments unfiltered as a fallback.
    - parameters = inspect.signature(fn).parameters
    + try:
    +     parameters = inspect.signature(fn).parameters
    + except ValueError:
    +     return kwargs
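The fallback suggested above can be tried in isolation. The following is a minimal sketch, not the PR's code: `filter_supported_kwargs` and `make_schedule` are hypothetical names, the `logger` and `explicit_keys` parameters from the diff are left out, and catching `TypeError` goes beyond the review suggestion (the `inspect` documentation lists it for unsupported object types).

```python
import inspect
from typing import Any, Callable


def filter_supported_kwargs(fn: Callable[..., Any], kwargs: dict[str, Any]) -> dict[str, Any]:
    """Keep only the keyword arguments that ``fn`` declares (hypothetical helper)."""
    try:
        parameters = inspect.signature(fn).parameters
    except (ValueError, TypeError):
        # No usable signature metadata (e.g. some built-ins): pass everything
        # through unfiltered, as the review comment proposes.
        return dict(kwargs)

    # A **kwargs parameter means the callable accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()):
        return dict(kwargs)

    # Only parameters that can actually be passed by keyword count as supported.
    keyword_kinds = (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    accepted = {name for name, p in parameters.items() if p.kind in keyword_kinds}
    return {key: value for key, value in kwargs.items() if key in accepted}


def make_schedule(wait: int, active: int) -> tuple[int, int]:
    """Stand-in for a profiler factory that accepts a fixed set of keywords."""
    return wait, active


print(filter_supported_kwargs(make_schedule, {"wait": 1, "active": 2, "unknown": 3}))
# {'wait': 1, 'active': 2}
```

Returning the arguments unfiltered keeps training running, but an unsupported keyword can then still fail at call time, so logging a warning in the fallback branch may be worth considering.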
Pull request overview
This PR adds a shared, unified torch-profiler configuration and control path so both the legacy (v0) trainer and the v1 trainer can enable profiling via the same YAML arguments (enable_torch_profiler + associated profiler_* fields), including CPU/CUDA and Ascend NPU-specific options.
Changes:
- Introduces a reusable ProfilerConfig/ProfilerController implementation that encapsulates profiler backend selection, validation, scheduling, stepping, and trace output layout.
- Wires profiler callbacks into both training architectures (legacy TorchProfilerCallback and new v1 ProfilerCallback) behind the existing enable_torch_profiler switch.
- Adds English + Simplified Chinese documentation for profiler arguments, schedule semantics, and Ascend NPU settings.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/llamafactory/v1/core/base_trainer.py | Registers the v1 profiler callback when enable_torch_profiler is set. |
| src/llamafactory/v1/config/training_args.py | Adds v1 training args for profiler configuration (schedule, activities, rank mode, NPU options). |
| src/llamafactory/v1/accelerator/profiler.py | Implements the v1 ProfilerCallback that delegates to the shared controller. |
| src/llamafactory/train/callbacks.py | Refactors legacy TorchProfilerCallback to use the shared ProfilerController and updates docs/comments. |
| src/llamafactory/hparams/training_args.py | Extends legacy profiler hyperparameters to match the unified config surface. |
| src/llamafactory/extras/profiler.py | Adds the shared profiler config/controller and backend handling (CPU/CUDA/NPU). |
| docs/zh/hyperparameters/training-argument.md | Adds Simplified Chinese documentation for profiler YAML options and NPU notes. |
| docs/en/hyperparameters/training-argument.md | Adds English documentation for profiler YAML options and NPU notes. |
    def validate(self, backend_name: Optional[str] = None) -> None:
        if not self.enabled:
            return

        _validate_int_option("profiler_skip_first", self.skip_first, min_value=0)
        _validate_int_option("profiler_wait_steps", self.wait_steps, min_value=0)
        _validate_int_option("profiler_warmup_steps", self.warmup_steps, min_value=0)
        _validate_int_option("profiler_active_steps", self.active_steps, min_value=1)
        _validate_int_option("profiler_repeat", self.repeat, min_value=0)
        _validate_choice_option("profiler_activities", self.activities, _SUPPORTED_ACTIVITIES)
        _validate_choice_option("profiler_rank_mode", self.rank_mode, _SUPPORTED_RANK_MODES)

        if backend_name == "npu" and self.activities != "cpu":
            self.npu_profiler_level_name()
            self.npu_aic_metrics_name()
            self.npu_backend_options()
        self.schedule_kwargs()
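The bounds checked in `validate` line up with the parameters of `torch.profiler.schedule`. Assuming the `profiler_*` step options are forwarded to that schedule (the snippet itself does not show this), the sketch below re-implements the phase selection in plain Python to show why `active` must be at least 1 while the other values may be 0. `profiler_phase` is a hypothetical name and the string phases stand in for `ProfilerAction` values.

```python
def profiler_phase(
    step: int, *, skip_first: int, wait: int, warmup: int, active: int, repeat: int
) -> str:
    """Return the profiler phase for a 0-based step (illustrative re-implementation)."""
    if active < 1 or min(skip_first, wait, warmup, repeat) < 0:
        raise ValueError("active must be >= 1; the other options must be >= 0")

    # The first `skip_first` steps are never profiled.
    if step < skip_first:
        return "none"
    step -= skip_first

    cycle = wait + warmup + active
    # repeat == 0 means the cycle repeats for the whole run.
    if repeat > 0 and step >= repeat * cycle:
        return "none"

    position = step % cycle
    if position < wait:
        return "none"
    if position < wait + warmup:
        return "warmup"
    # The last active step of each cycle also triggers the trace export.
    return "record_and_save" if position == cycle - 1 else "record"


phases = [
    profiler_phase(step, skip_first=1, wait=1, warmup=1, active=2, repeat=1)
    for step in range(7)
]
print(phases)
# ['none', 'none', 'warmup', 'record', 'record_and_save', 'none', 'none']
```

With `active=0` no step would ever reach a recording phase, which is consistent with `profiler_active_steps` being the only option validated with `min_value=1`.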
What does this PR do?
Adds configurable torch profiler support for the v1 trainer only.
Main changes:
- src/llamafactory/v1/core/utils/profiler.py, next to other training-core utilities.
- src/llamafactory/v1/utils/callbacks/profiler_callback.py, matching the existing v1 callback layout.
- extras modules.
- enable_torch_profiler as the v1 public switch.
- profiler_level and profiler_aic_metrics, plus validated profiler_backend_options.npu advanced options.
- <profiler_output_dir>/rank_0/.

Example:
Before submitting
Verification
Local checks in the lf_dev environment:
- python -m py_compile src/llamafactory/v1/core/utils/profiler.py src/llamafactory/v1/utils/callbacks/profiler_callback.py src/llamafactory/v1/utils/callbacks/__init__.py src/llamafactory/v1/config/training_args.py src/llamafactory/v1/core/base_trainer.py
- ruff check src/llamafactory/v1/core/utils/profiler.py src/llamafactory/v1/utils/callbacks/profiler_callback.py src/llamafactory/v1/utils/callbacks/__init__.py src/llamafactory/v1/config/training_args.py src/llamafactory/v1/core/base_trainer.py
- git diff --check
- src/llamafactory/v1/....
- enable_torch_profiler, explicit false overrides, and NPU profiler_level/profiler_aic_metrics/profiler_backend_options.npu validation.
- llamafactory.v1.utils.callbacks.ProfilerCallback.

Manual Ascend NPU smoke tests:
- rank_0 profiler output.
- profiler_rank_mode: all: training completed and generated separate rank_0 and rank_1 profiler outputs, including trace_view.json, operator_details.csv, and step_trace_time.csv.

Legacy trainer / v0 was not rerun after this change because the final diff no longer touches legacy trainer files.
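The per-rank output layout seen in the smoke tests can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `should_profile_rank` and `rank_output_dir` are hypothetical names, and treating every mode other than `all` as rank-0-only is an assumption, since the full set of supported rank modes is not listed here.

```python
from pathlib import Path


def should_profile_rank(rank: int, rank_mode: str) -> bool:
    """Decide whether this process records a trace (hypothetical helper)."""
    if rank_mode == "all":
        return True
    # Assumption: any other mode profiles rank 0 only.
    return rank == 0


def rank_output_dir(output_dir: str, rank: int) -> Path:
    """Build the <profiler_output_dir>/rank_<N> directory for one process."""
    return Path(output_dir) / f"rank_{rank}"


# Two processes with profiler_rank_mode: all, as in the second smoke test.
for rank in range(2):
    if should_profile_rank(rank, "all"):
        print(rank_output_dir("profiler_out", rank).as_posix())
# profiler_out/rank_0
# profiler_out/rank_1
```

Keeping one directory per rank avoids processes overwriting each other's trace files when more than one rank is profiled.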