Optimization
This page contains the API reference documentation for the optimizers included in timm.
Optimizers
Factory functions
timm.optim.create_optimizer_v2
< source >( model_or_params: typing.Union[torch.nn.modules.module.Module, typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] opt: str = 'sgd' lr: typing.Optional[float] = None weight_decay: float = 0.0 momentum: float = 0.9 foreach: typing.Optional[bool] = None filter_bias_and_bn: bool = True layer_decay: typing.Optional[float] = None param_group_fn: typing.Optional[typing.Callable[[torch.nn.modules.module.Module], typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]]]] = None **kwargs: typing.Any )
Parameters
- model_or_params — A PyTorch model or an iterable of parameters/parameter groups. If a model is provided, parameters will be automatically extracted and grouped based on the other arguments.
- opt — Name of the optimizer to create (e.g., ‘adam’, ‘adamw’, ‘sgd’). Use list_optimizers() to see available options.
- lr — Learning rate. If None, will use the optimizer’s default.
- weight_decay — Weight decay factor. Will be used to create param groups if model_or_params is a model.
- momentum — Momentum factor for optimizers that support it. Only used if the chosen optimizer accepts a momentum parameter.
- foreach — Enable/disable foreach (multi-tensor) implementation if available. If None, will use optimizer-specific defaults.
- filter_bias_and_bn — If True, bias, norm layer parameters (all 1d params) will not have weight decay applied. Only used when model_or_params is a model and weight_decay > 0.
- layer_decay — Optional layer-wise learning rate decay factor. If provided, learning rates will be scaled by layer_decay^(max_depth - layer_depth). Only used when model_or_params is a model.
- param_group_fn — Optional function to create custom parameter groups. If provided, other parameter grouping options will be ignored.
- **kwargs — Additional optimizer-specific arguments (e.g., betas for Adam).
Create an optimizer instance via timm registry.
Creates and configures an optimizer with appropriate parameter groups and settings. Supports automatic parameter group creation for weight decay and layer-wise learning rates, as well as custom parameter grouping.
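To make the layer_decay scaling concrete, the sketch below computes the per-layer learning-rate scale layer_decay^(max_depth - layer_depth) described for that parameter. The helper name layer_lr_scales and the example values are hypothetical and chosen only for illustration; this is not timm's internal code.

```python
def layer_lr_scales(layer_decay: float, max_depth: int) -> list:
    """Return the LR scale for each layer depth 0..max_depth (hypothetical helper)."""
    # The deepest layer keeps a scale of 1.0; earlier layers are
    # scaled down geometrically by layer_decay per step of depth.
    return [layer_decay ** (max_depth - depth) for depth in range(max_depth + 1)]


scales = layer_lr_scales(layer_decay=0.5, max_depth=3)
print(scales)  # [0.125, 0.25, 0.5, 1.0]
```

With a base learning rate of 1e-3, the earliest layer in this toy setup would train at 1.25e-4 while the last layer trains at the full 1e-3.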
Examples:
Basic usage with a model:
    optimizer = create_optimizer_v2(model, 'adamw', lr=1e-3)
SGD with momentum and weight decay:
    optimizer = create_optimizer_v2(
        model, 'sgd', lr=0.1, momentum=0.9, weight_decay=1e-4
    )
Adam with layer-wise learning rate decay:
    optimizer = create_optimizer_v2(
        model, 'adam', lr=1e-3, layer_decay=0.7
    )
Custom parameter groups:
    def group_fn(model):
        return [
            {'params': model.backbone.parameters(), 'lr': 1e-4},
            {'params': model.head.parameters(), 'lr': 1e-3}
        ]
    optimizer = create_optimizer_v2(
        model, 'sgd', param_group_fn=group_fn
    )
Note: Parameter group handling precedence:
- If param_group_fn is provided, it will be used exclusively
- If layer_decay is provided, layer-wise groups will be created
- If weight_decay > 0 and filter_bias_and_bn is True, weight decay groups will be created
- Otherwise, all parameters will be in a single group
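The precedence rules in the note above can be read as a simple decision chain. The following is a minimal sketch of that chain only; grouping_strategy and its string labels are hypothetical names for this illustration, and the real factory builds actual parameter groups rather than returning a label.

```python
from typing import Callable, Optional


def grouping_strategy(
    param_group_fn: Optional[Callable] = None,
    layer_decay: Optional[float] = None,
    weight_decay: float = 0.0,
    filter_bias_and_bn: bool = True,
) -> str:
    """Return which grouping rule would apply (hypothetical helper, not part of timm)."""
    if param_group_fn is not None:
        return "custom"        # param_group_fn is used exclusively
    if layer_decay is not None:
        return "layer_decay"   # layer-wise groups are created
    if weight_decay > 0 and filter_bias_and_bn:
        return "weight_decay"  # separate decay / no-decay groups
    return "single"            # all parameters in one group


print(grouping_strategy(layer_decay=0.7, weight_decay=0.05))  # layer_decay
print(grouping_strategy(weight_decay=0.05))                   # weight_decay
print(grouping_strategy())                                    # single
```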
timm.optim.list_optimizers
< source >( filter: typing.Union[str, typing.List[str]] = '' exclude_filters: typing.Optional[typing.List[str]] = None with_description: bool = False ) → If with_description is False
Parameters
- filter — Wildcard style filter string or list of filter strings (e.g., 'adam*' for all Adam variants, or ['adam*', '*8bit'] for Adam variants and 8-bit optimizers). Empty string means no filtering.
- exclude_filters — Optional list of wildcard patterns to exclude. For example, ['*8bit', 'fused*'] would exclude 8-bit and fused implementations.
- with_description — If True, returns tuples of (name, description) instead of just names. Descriptions provide brief explanations of optimizer characteristics.
Returns
- If with_description is False: list of optimizer names as strings (e.g., ['adam', 'adamw', …])
- If with_description is True: list of tuples of (name, description) (e.g., [('adam', 'Adaptive Moment…'), …])
List available optimizer names, optionally filtered.
List all registered optimizers, with optional filtering using wildcard patterns. Optimizers can be filtered using include and exclude patterns, and can optionally return descriptions with each optimizer name.
Examples:
    list_optimizers()
    ['adam', 'adamw', 'sgd', …]

    list_optimizers(['la*', 'nla*'])  # List lamb & lars
    ['lamb', 'lambc', 'larc', 'lars', 'nlarc', 'nlars']

    list_optimizers('*adam*', exclude_filters=['bnb*', 'fused*'])  # Exclude bnb & apex adam optimizers
    ['adam', 'adamax', 'adamp', 'adamw', 'nadam', 'nadamw', 'radam']

    list_optimizers(with_description=True)  # Get descriptions
    [('adabelief', 'Adapts learning rate based on gradient prediction error'),
     ('adadelta', 'torch.optim Adadelta, Adapts learning rates based on running windows of gradients'),
     ('adafactor', 'Memory-efficient implementation of Adam with factored gradients'),
     …]
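For readers unfamiliar with wildcard-style filtering, the sketch below shows one way include and exclude patterns could be combined, using Python's standard fnmatch module. The helper filter_names and the short name list are assumptions made for this illustration; they are not timm's implementation or its full registry.

```python
from fnmatch import fnmatch
from typing import List, Optional, Union


def filter_names(
    names: List[str],
    filter: Union[str, List[str]] = "",
    exclude_filters: Optional[List[str]] = None,
) -> List[str]:
    """Filter names with shell-style wildcards (hypothetical helper, not timm code)."""
    include = [filter] if isinstance(filter, str) else list(filter)
    include = [p for p in include if p]  # an empty string means "no filtering"
    if include:
        kept = [n for n in names if any(fnmatch(n, p) for p in include)]
    else:
        kept = list(names)
    for pattern in exclude_filters or []:
        kept = [n for n in kept if not fnmatch(n, pattern)]
    return sorted(kept)


# Illustrative subset of names, not the full registry.
names = ["adam", "adamw", "bnbadam", "fusedadam", "lamb", "lars", "nadam", "sgd"]
print(filter_names(names, ["la*", "nla*"]))
print(filter_names(names, "*adam*", exclude_filters=["bnb*", "fused*"]))
```

Note that a pattern without any wildcard, such as "adam", only matches that exact name under fnmatch rules.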
timm.optim.get_optimizer_class
< source >( name: str bind_defaults: bool = True ) → If bind_defaults is False
Parameters
- name — Name of the optimizer to retrieve (e.g., ‘adam’, ‘sgd’)
- bind_defaults — If True, returns a partial function with default arguments from OptimInfo bound. If False, returns the raw optimizer class.
Returns
- If bind_defaults is False: the optimizer class (e.g., torch.optim.Adam)
- If bind_defaults is True: a partial function with default arguments bound
Raises
ValueError — If optimizer name is not found in registry
Get optimizer class by name with option to bind default arguments.
Retrieves the optimizer class or a partial function with default arguments bound. This allows direct instantiation of optimizers with their default configurations without going through the full factory.
Examples:
Get SGD with nesterov momentum default:
    SGD = get_optimizer_class('sgd')  # nesterov=True bound
    opt = SGD(model.parameters(), lr=0.1, momentum=0.9)
Get raw optimizer class:
    SGD = get_optimizer_class('sgd', bind_defaults=False)
    opt = SGD(model.parameters(), lr=1e-3, momentum=0.9)
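The difference between a class with bound defaults and the raw class can be illustrated with functools.partial. Everything in the sketch below (ToySGD, REGISTRY, get_class) is a hypothetical stand-in written for this example and does not reflect how timm stores its optimizer registry.

```python
from functools import partial


class ToySGD:
    """Stand-in optimizer class used only for this illustration."""

    def __init__(self, params, lr=0.01, momentum=0.0, nesterov=False):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.nesterov = nesterov


# Hypothetical registry: name -> (class, default kwargs).
REGISTRY = {"sgd": (ToySGD, {"nesterov": True})}


def get_class(name: str, bind_defaults: bool = True):
    """Look up a class and optionally bind its registered defaults."""
    if name not in REGISTRY:
        raise ValueError(f"Optimizer {name} not found in registry")
    cls, defaults = REGISTRY[name]
    return partial(cls, **defaults) if bind_defaults and defaults else cls


bound = get_class("sgd")                     # partial with nesterov=True bound
raw = get_class("sgd", bind_defaults=False)  # the plain class
print(bound([], lr=0.1, momentum=0.9).nesterov)  # True
print(raw([], lr=0.1, momentum=0.9).nesterov)    # False
```

Because the bound version is a partial, callers can still override a bound default by passing the keyword explicitly.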
Optimizer Classes
class timm.optim.AdaBelief
< source >( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-16 weight_decay = 0 amsgrad = False decoupled_decay = True fixed_decay = False rectify = True degenerated_to_sgd = True )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-16)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- amsgrad (boolean, optional) — whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
- decoupled_decay (boolean, optional) — (default: True) If set as True, then the optimizer uses decoupled weight decay as in AdamW
- fixed_decay (boolean, optional) — (default: False) This is used when decoupled_decay is set as True. When fixed_decay == True, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay$. When fixed_decay == False, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay \times lr$. Note that in this case, the weight decay ratio decreases with learning rate (lr).
- rectify (boolean, optional) — (default: True) If set as True, then perform the rectified update similar to RAdam
- degenerated_to_sgd (boolean, optional) — (default: True) If set as True, then perform SGD update when variance of gradient is high
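The two decoupled weight decay rules described for fixed_decay differ only in whether the decay term is multiplied by the learning rate. The scalar sketch below applies each rule once; apply_decoupled_decay is a hypothetical helper for illustration and ignores the gradient-based part of the update.

```python
def apply_decoupled_decay(w: float, decay: float, lr: float, fixed_decay: bool) -> float:
    """One decoupled weight-decay step on a scalar weight (illustrative only)."""
    if fixed_decay:
        return w - w * decay   # decay is independent of the learning rate
    return w - w * decay * lr  # decay shrinks together with the learning rate


print(apply_decoupled_decay(1.0, decay=0.5, lr=0.5, fixed_decay=True))   # 0.5
print(apply_decoupled_decay(1.0, decay=0.5, lr=0.5, fixed_decay=False))  # 0.75
```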
Implements AdaBelief algorithm. Modified from Adam in PyTorch
reference: AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients, NeurIPS 2020
For a complete table of recommended hyperparameters, see https://github.com/juntang-zhuang/Adabelief-Optimizer
For example train/args for EfficientNet, see these gists:
- link to train_script: https://gist.github.com/juntang-zhuang/0a501dd51c02278d952cf159bc233037
- link to args.yaml: https://gist.github.com/juntang-zhuang/517ce3c27022b908bb93f78e4f786dc3
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.Adafactor
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: typing.Optional[float] = None eps: float = 1e-30 eps_scale: float = 0.001 clip_threshold: float = 1.0 decay_rate: float = -0.8 betas: typing.Optional[typing.Tuple[float, float]] = None weight_decay: float = 0.0 scale_parameter: bool = True warmup_init: bool = False min_dim_size_to_factor: int = 16 caution: bool = False )
Implements Adafactor algorithm.
This implementation is based on: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
(see https://arxiv.org/abs/1804.04235)
Note that this optimizer internally adjusts the learning rate depending on the scale_parameter, relative_step and warmup_init options.
To use a manual (external) learning rate schedule you should set scale_parameter=False and relative_step=False.
Parameters
- params — iterable of parameters to optimize or dicts defining parameter groups
- lr — external learning rate
- eps — regularization constant for the squared gradient
- eps_scale — regularization constant for the parameter scale
- clip_threshold — threshold of root-mean-square of final gradient update
- decay_rate — coefficient used to compute running averages of square gradient
- beta1 — coefficient used for computing running averages of gradient
- weight_decay — weight decay
- scale_parameter — if True, learning rate is scaled by root-mean-square of parameter
- warmup_init — time-dependent learning rate computation depends on whether warm-up initialization is being used
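As an illustration of what clip_threshold controls, the sketch below applies the update-clipping rule from the Adafactor paper: the update is divided by max(1, RMS(update) / clip_threshold). It works on plain Python lists, and the helper names rms and clip_update are assumptions for this example; timm's tensor implementation may differ in detail.

```python
import math


def rms(values):
    """Root-mean-square of a list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def clip_update(update, clip_threshold=1.0):
    """Scale the update down so its RMS does not exceed clip_threshold."""
    scale = max(1.0, rms(update) / clip_threshold)
    return [u / scale for u in update]


print(clip_update([3.0, 4.0], clip_threshold=1.0))  # rescaled to an RMS of about 1.0
print(clip_update([0.3, 0.4], clip_threshold=1.0))  # already below the threshold, unchanged
```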
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.AdafactorBigVision
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: float = 1.0 min_dim_size_to_factor: int = 16 decay_rate: float = 0.8 decay_offset: int = 0 beta2_cap: float = 0.999 momentum: typing.Optional[float] = 0.9 momentum_dtype: typing.Union[str, torch.dtype] = torch.bfloat16 eps: typing.Optional[float] = None weight_decay: float = 0.0 clipping_threshold: typing.Optional[float] = None unscaled_wd: bool = False caution: bool = False foreach: typing.Optional[bool] = False )
PyTorch implementation of BigVision’s Adafactor variant with both single and multi tensor implementations.
Adapted from https://github.com/google-research/big_vision by Ross Wightman
class timm.optim.Adahessian
< source >( params lr = 0.1 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.0 hessian_power = 1.0 update_each = 1 n_samples = 1 avg_conv_kernel = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 0.1)
- betas ((float, float), optional) — coefficients used for computing running averages of gradient and the squared hessian trace (default: (0.9, 0.999))
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0.0)
- hessian_power (float, optional) — exponent of the hessian trace (default: 1.0)
- update_each (int, optional) — compute the hessian trace approximation only after this number of steps (to save time) (default: 1)
- n_samples (int, optional) — how many times to sample z for the approximation of the hessian trace (default: 1)
Implements the AdaHessian algorithm from “ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning”
Gets all parameters in all param_groups with gradients
Computes the Hutchinson approximation of the hessian trace and accumulates it for each trainable parameter.
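The Hutchinson approximation mentioned above estimates a trace from Hessian-vector products with random sign vectors z, which is what n_samples counts. Below is a minimal pure-Python sketch on a toy 2x2 matrix; hutchinson_trace and matvec are hypothetical names, and a real optimizer would obtain the product H z from autograd rather than from an explicit matrix.

```python
import random


def hutchinson_trace(matvec, dim, n_samples=1, seed=0):
    """Estimate trace(H) as the average of z^T (H z) over random sign vectors z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]  # Rademacher sample
        hz = matvec(z)                                      # Hessian-vector product
        total += sum(zi * hzi for zi, hzi in zip(z, hz))
    return total / n_samples


# Toy symmetric "Hessian" used only for this illustration.
H = [[2.0, 1.0], [1.0, 3.0]]


def matvec(z):
    return [sum(h * zj for h, zj in zip(row, z)) for row in H]


print(hutchinson_trace(matvec, dim=2, n_samples=1000))  # close to trace(H) = 5.0
```

A single sample already gives an unbiased estimate; more samples reduce its variance at the cost of extra Hessian-vector products.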
step
< source >( closure = None )
Performs a single optimization step.
Zeros out the accumulated hessian traces.
class timm.optim.AdamP
< source >( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0 delta = 0.1 wd_ratio = 0.1 nesterov = False )
class timm.optim.Adan
< source >( params lr: float = 0.001 betas: typing.Tuple[float, float, float] = (0.98, 0.92, 0.99) eps: float = 1e-08 weight_decay: float = 0.0 no_prox: bool = False caution: bool = False foreach: typing.Optional[bool] = None )
Parameters
- params — Iterable of parameters to optimize or dicts defining parameter groups.
- lr — Learning rate.
- betas — Coefficients used for first- and second-order moments.
- eps — Term added to the denominator to improve numerical stability.
- weight_decay — Decoupled weight decay (L2 penalty)
- no_prox — How to perform the weight decay
- caution — Enable caution from ‘Cautious Optimizers’
- foreach — If True, uses the torch._foreach implementation. Faster but uses slightly more memory.
Implements a pytorch variant of Adan.
Adan was proposed in Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models https://arxiv.org/abs/2208.06677
Performs a single optimization step.
class timm.optim.Adopt
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: typing.Union[float, torch.Tensor] = 0.001 betas: typing.Tuple[float, float] = (0.9, 0.9999) eps: float = 1e-06 clip_exp: typing.Optional[float] = 0.333 weight_decay: float = 0.0 decoupled: bool = False caution: bool = False foreach: typing.Optional[bool] = False maximize: bool = False capturable: bool = False differentiable: bool = False )
ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate: https://arxiv.org/abs/2411.02853
step
< source >( closure = None )
Perform a single optimization step.
class timm.optim.Lamb
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: float = 0.001 bias_correction: bool = True betas: typing.Tuple[float, float] = (0.9, 0.999) eps: float = 1e-06 weight_decay: float = 0.01 grad_averaging: bool = True max_grad_norm: typing.Optional[float] = 1.0 trust_clip: bool = False always_adapt: bool = False caution: bool = False decoupled_decay: bool = False )
Parameters
- params — Iterable of parameters to optimize or dicts defining parameter groups.
- lr — Learning rate
- betas — Coefficients used for computing running averages of gradient and its norm.
- eps — Term added to the denominator to improve numerical stability.
- weight_decay — Weight decay
- grad_averaging — Whether to apply (1-beta2) to grad when calculating running averages of gradient.
- max_grad_norm — Value used to clip global grad norm.
- trust_clip — Enable LAMBC trust ratio clipping.
- always_adapt — Apply adaptive learning rate even to parameters with 0.0 weight decay.
- caution — Apply caution.
Implements a pure PyTorch variant of the FusedLAMB (NvLamb variant) optimizer from apex.optimizers.FusedLAMB.
Reference: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/Transformer-XL/pytorch/lamb.py
LAMB was proposed in:
- Large Batch Optimization for Deep Learning - Training BERT in 76 minutes: https://arxiv.org/abs/1904.00962
- On the Convergence of Adam and Beyond: https://openreview.net/forum?id=ryQu7f-RZ
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.LaProp
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: float = 0.0004 betas: typing.Tuple[float, float] = (0.9, 0.999) eps: float = 1e-15 weight_decay: float = 0.0 caution: bool = False )
LaProp Optimizer
Paper: LaProp: Separating Momentum and Adaptivity in Adam, https://arxiv.org/abs/2002.04839
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.Lars
< source >( params lr = 1.0 momentum = 0 dampening = 0 weight_decay = 0 nesterov = False trust_coeff = 0.001 eps = 1e-08 trust_clip = False always_adapt = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups.
- lr (float, optional) — learning rate (default: 1.0).
- momentum (float, optional) — momentum factor (default: 0)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- dampening (float, optional) — dampening for momentum (default: 0)
- nesterov (bool, optional) — enables Nesterov momentum (default: False)
- trust_coeff (float) — trust coefficient for computing adaptive lr / trust_ratio (default: 0.001)
- eps (float) — eps for division denominator (default: 1e-8)
- trust_clip (bool) — enable LARC trust ratio clipping (default: False)
- always_adapt (bool) — always apply LARS LR adapt, otherwise only when group weight_decay != 0 (default: False)
LARS for PyTorch
Paper: Large batch training of Convolutional Networks
- https://arxiv.org/pdf/1708.03888.pdf
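To show how trust_coeff, weight_decay and eps interact, the sketch below computes a layer-wise trust ratio following the formula in the LARS paper, trust_coeff * ||w|| / (||g|| + weight_decay * ||w|| + eps). The helper names are hypothetical, and the fallback to 1.0 when either norm is zero is an assumption of this sketch rather than a statement about timm's code.

```python
import math


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_trust_ratio(weights, grads, weight_decay=0.0, trust_coeff=0.001, eps=1e-8):
    """Layer-wise trust ratio; falls back to 1.0 when either norm is zero."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0
    return trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)


ratio = lars_trust_ratio([3.0, 4.0], [6.0, 8.0], eps=0.0)
print(ratio)  # approximately 0.0005 (= 0.001 * 5 / 10)
```

The gradient of each layer would then be multiplied by this ratio, so layers whose gradients are large relative to their weights take proportionally smaller steps.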
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.Lion
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: float = 0.0001 betas: typing.Tuple[float, float] = (0.9, 0.99) weight_decay: float = 0.0 caution: bool = False maximize: bool = False foreach: typing.Optional[bool] = None )
Implements Lion algorithm.
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.MADGRAD
< source >( params: typing.Any lr: float = 0.01 momentum: float = 0.9 weight_decay: float = 0 eps: float = 1e-06 decoupled_decay: bool = False )
Parameters
- params (iterable) — Iterable of parameters to optimize or dicts defining parameter groups.
- lr (float) — Learning rate (default: 1e-2).
- momentum (float) — Momentum value in the range [0,1) (default: 0.9).
- weight_decay (float) — Weight decay, i.e. a L2 penalty (default: 0).
- eps (float) — Term added to the denominator outside of the root operation to improve numerical stability. (default: 1e-6).
MADGRAD: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (https://arxiv.org/abs/2101.11075).
MADGRAD is a general purpose optimizer that can be used in place of SGD or Adam and may converge faster and generalize better. Currently GPU-only. Typically, the same learning rate schedule that is used for SGD or Adam may be used. The overall learning rate is not comparable to either method and should be determined by a hyper-parameter sweep.
MADGRAD requires less weight decay than other methods, often as little as zero. Momentum values used for SGD or Adam’s beta1 should work here also.
On sparse problems both weight_decay and momentum should be set to 0.
step
< source >( closure: typing.Optional[typing.Callable[[], float]] = None )
Performs a single optimization step.
class timm.optim.Mars
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: float = 0.003 betas: typing.Tuple[float, float] = (0.9, 0.99) eps: float = 1e-08 weight_decay: float = 0.0 gamma: float = 0.025 mars_type: str = 'adamw' optimize_1d: bool = False lr_1d_factor: float = 1.0 betas_1d: typing.Optional[typing.Tuple[float, float]] = None caution: bool = False )
MARS Optimizer
Paper: MARS: Unleashing the Power of Variance Reduction for Training Large Models https://arxiv.org/abs/2411.10438
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.NAdamW
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: float = 0.001 betas: typing.Tuple[float, float] = (0.9, 0.999) eps: float = 1e-08 weight_decay: float = 0.01 caution: bool = False maximize: bool = False foreach: typing.Optional[bool] = None capturable: bool = False )
Parameters
- params — iterable of parameters to optimize or dicts defining parameter groups
- lr — learning rate
- betas — coefficients used for computing running averages of gradient and its square
- eps — term added to the denominator to improve numerical stability
- weight_decay — weight decay coefficient
- caution — enable caution
Implements NAdamW algorithm.
See Table 1 in https://arxiv.org/abs/1910.05446 for the implementation of the NAdam algorithm (there is also a comment in the code which highlights the only difference of NAdamW and AdamW).
For further details regarding the algorithm we refer to
- Decoupled Weight Decay Regularization: https://arxiv.org/abs/1711.05101
- On the Convergence of Adam and Beyond: https://openreview.net/forum?id=ryQu7f-RZ
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.NvNovoGrad
< source >( params lr = 0.001 betas = (0.95, 0.98) eps = 1e-08 weight_decay = 0 grad_averaging = False amsgrad = False )
Parameters
- params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) — learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.95, 0.98))
- eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
- grad_averaging — gradient averaging
- amsgrad (boolean, optional) — whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
Implements Novograd algorithm.
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.RMSpropTF
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: float = 0.01 alpha: float = 0.9 eps: float = 1e-10 weight_decay: float = 0 momentum: float = 0.0 centered: bool = False decoupled_decay: bool = False lr_in_momentum: bool = True caution: bool = False )
Parameters
- params — iterable of parameters to optimize or dicts defining parameter groups
- lr — learning rate
- momentum — momentum factor
- alpha — smoothing (decay) constant
- eps — term added to the denominator to improve numerical stability
- centered — if True, compute the centered RMSProp, the gradient is normalized by an estimation of its variance
- weight_decay — weight decay (L2 penalty) (default: 0)
- decoupled_decay — decoupled weight decay as per https://arxiv.org/abs/1711.05101
- lr_in_momentum — learning rate scaling is included in the momentum buffer update as per defaults in Tensorflow
- caution — apply caution
Implements RMSprop algorithm (TensorFlow style epsilon)
NOTE: This is a direct cut-and-paste of PyTorch RMSprop with eps applied before sqrt and a few other modifications to more closely match TensorFlow for matching hyper-params.
Noteworthy changes include:
- Epsilon applied inside square-root
- square_avg initialized to ones
- LR scaling of update accumulated in momentum buffer
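The first change in the list above, where epsilon is applied, matters most when the running average of squared gradients is close to zero. The sketch below compares the two denominators on scalars; the function names are hypothetical and the code only illustrates the placement of eps, not the full RMSprop update.

```python
import math


def denom_tf_style(square_avg: float, eps: float) -> float:
    """Epsilon added inside the square root (TensorFlow-style)."""
    return math.sqrt(square_avg + eps)


def denom_torch_style(square_avg: float, eps: float) -> float:
    """Epsilon added after the square root (as in torch.optim.RMSprop)."""
    return math.sqrt(square_avg) + eps


eps = 1e-10
print(denom_tf_style(0.0, eps))     # about 1e-05
print(denom_torch_style(0.0, eps))  # 1e-10
```

With the same eps, the inside-the-root form gives a much larger denominator near zero, which is one reason hyper-parameters tuned for one convention do not transfer directly to the other.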
Proposed by G. Hinton in his course.
The centered version first appears in Generating Sequences With Recurrent Neural Networks.
step
< source >( closure = None )
Performs a single optimization step.
class timm.optim.SGDP
< source >( params lr = <required parameter> momentum = 0 dampening = 0 weight_decay = 0 nesterov = False eps = 1e-08 delta = 0.1 wd_ratio = 0.1 )
class timm.optim.SGDW
< source >( params: typing.Union[typing.Iterable[torch.Tensor], typing.Iterable[typing.Dict[str, typing.Any]]] lr: float = 0.001 momentum: float = 0.0 dampening: float = 0.0 weight_decay: float = 0.0 nesterov: bool = False caution: bool = False maximize: bool = False foreach: typing.Optional[bool] = None differentiable: bool = False )
step
< source >( closure = None )
Performs a single optimization step.