
torch.optim


Contents

How to use an optimizer

Constructing it

Per-parameter options

Taking an optimization step

Algorithms

How to adjust Learning Rate


torch.optim is a package implementing various optimization algorithms. Most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can also be easily integrated in the future.

How to use an optimizer

To use torch.optim you have to construct an optimizer object that will hold the current state and update the parameters based on the computed gradients.

Constructing it

To construct an Optimizer you have to give it an iterable containing the parameters (all should be Variable s) to optimize. Then, you can specify optimizer-specific options such as the learning rate, weight decay, etc.

Note

If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects from those before the call.

In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
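For instance, a minimal ordering sketch (the nn.Linear stand-in and the CUDA availability check are illustrative assumptions, not part of the original docs):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # hypothetical stand-in for any nn.Module

# Move the model to the GPU *before* constructing the optimizer, so the
# optimizer holds references to the CUDA parameters, not stale CPU copies.
if torch.cuda.is_available():
    model = model.cuda()

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)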

Example:

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)

Per-parameter options

Optimizer s also support specifying per-parameter options. To do this, instead of passing an iterable of Variable s, pass in an iterable of dict s. Each of them will define a separate parameter group, and should contain a params key, containing a list of parameters belonging to it. Other keys should match the keyword arguments accepted by the optimizers, and will be used as optimization options for this group.

Note

You can still pass options as keyword arguments. They will be used as defaults, in the groups that didn’t override them. This is useful when you only want to vary a single option, while keeping all others consistent between parameter groups.

For example, this is very useful when one wants to specify per-layer learning rates:

optim.SGD([
                {'params': model.base.parameters()},
                {'params': model.classifier.parameters(), 'lr': 1e-3}
            ], lr=1e-2, momentum=0.9)

This means that model.base’s parameters will use the default learning rate of 1e-2, model.classifier’s parameters will use a learning rate of 1e-3, and a momentum of 0.9 will be used for all parameters.

Taking an optimization step

All optimizers implement a step() method that updates the parameters. It can be used in two ways:

optimizer.step()

This is a simplified version supported by most optimizers. The function can be called once the gradients are computed using e.g. backward().

Example:

for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()

optimizer.step(closure)

Some optimization algorithms such as Conjugate Gradient and LBFGS need to reevaluate the function multiple times, so you have to pass in a closure that allows them to recompute your model. The closure should clear the gradients, compute the loss, and return it.

Example:

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)

Algorithms

class torch.optim.Optimizer(params, defaults)[source]

Base class for all optimizers.

Warning

Parameters need to be specified as collections that have a deterministic ordering that is consistent between runs. Examples of objects that don’t satisfy those properties are sets and iterators over values of dictionaries.

Parameters

  • params (iterable) – an iterable of torch.Tensor s or dict s. Specifies what Tensors should be optimized.
  • defaults (dict) – a dict containing default values of optimization options (used when a parameter group doesn’t specify them).

add_param_group(param_group)[source]

Add a param group to the Optimizer’s param_groups.

This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses (see the sketch after the parameter list).

Parameters

  • param_group (dict) – Specifies what Tensors should be optimized, along with group-specific optimization options.
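As an illustrative sketch of the fine-tuning use case above (the two-stage Sequential model is a hypothetical stand-in), one might unfreeze a backbone mid-training and register it with its own learning rate:

import torch.nn as nn
import torch.optim as optim

# Hypothetical model: a frozen backbone followed by a trainable head.
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
for p in model[0].parameters():
    p.requires_grad = False

optimizer = optim.SGD(model[1].parameters(), lr=1e-2, momentum=0.9)

# Later in training: unfreeze the backbone and add it as a new param group
# with a smaller learning rate; other options fall back to the defaults.
for p in model[0].parameters():
    p.requires_grad = True
optimizer.add_param_group({'params': model[0].parameters(), 'lr': 1e-4})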

load_state_dict(state_dict)[source]

Loads the optimizer state.

Parameters

state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

state_dict()[source]

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content differs between optimizer classes.
  • param_groups - a dict containing all parameter groups
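Together with load_state_dict(), this enables checkpointing. A minimal sketch (model and optimizer are assumed to be already constructed; the file name is arbitrary):

import torch

# Save model and optimizer state side by side.
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()},
           'checkpoint.pth')

# Restore: build model and optimizer first, then load the saved state.
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])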

step(closure)[source]

Performs a single optimization step (parameter update).

Parameters

closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

zero_grad()[source]

Clears the gradients of all optimized torch.Tensor s.

class torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)[source]

Implements Adadelta algorithm.

It has been proposed in ADADELTA: An Adaptive Learning Rate Method.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9)
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
  • lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0)[source]

Implements Adagrad algorithm.

It has been proposed in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-2)
  • lr_decay (float, optional) – learning rate decay (default: 0)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)[source]

Implements Adam algorithm.

It has been proposed in Adam: A Method for Stochastic Optimization.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
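A minimal construction sketch spelling out the defaults listed above (model is assumed to be any nn.Module):

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)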

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)[source]

Implements AdamW algorithm.

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
  • weight_decay (float, optional) – weight decay coefficient (default: 1e-2)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
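Since AdamW decouples the decay from the gradient-based update, the following side-by-side sketch (model is assumed to be any nn.Module) highlights the distinction:

# Adam folds weight_decay into the gradient as an L2 penalty; AdamW instead
# applies the decay directly to the weights, so the two are not equivalent.
adam  = torch.optim.Adam(model.parameters(),  lr=1e-3, weight_decay=1e-2)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)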

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)[source]

Implements a lazy version of the Adam algorithm suitable for sparse tensors.

In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
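An illustrative sketch with a sparse embedding layer, the usual source of sparse gradients (the sizes and indices are arbitrary):

import torch
import torch.nn as nn

# sparse=True makes the embedding produce sparse gradients.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=32, sparse=True)
optimizer = torch.optim.SparseAdam(embedding.parameters(), lr=1e-3)

ids = torch.tensor([1, 5, 42])
loss = embedding(ids).sum()
loss.backward()     # embedding.weight.grad is a sparse tensor
optimizer.step()    # only rows 1, 5 and 42 are updated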

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]

Implements Adamax algorithm (a variant of Adam based on infinity norm).

It has been proposed in Adam: A Method for Stochastic Optimization.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 2e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)[source]

Implements Averaged Stochastic Gradient Descent.

It has been proposed in Acceleration of stochastic approximation by averaging.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-2)
  • lambd (float, optional) – decay term (default: 1e-4)
  • alpha (float, optional) – power for eta update (default: 0.75)
  • t0 (float, optional) – point at which to start averaging (default: 1e6)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)[source]

Implements the L-BFGS algorithm, heavily inspired by minFunc (https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html).

Warning

This optimizer doesn’t support per-parameter options and parameter groups (there can be only one).

Warning

Right now all parameters have to be on a single device. This will be improved in the future.

Note

This is a very memory intensive optimizer (it requires additional param_bytes * (history_size + 1) bytes). If it doesn’t fit in memory, try reducing the history size, or use a different algorithm.

Parameters

  • lr (float) – learning rate (default: 1)
  • max_iter (int) – maximal number of iterations per optimization step (default: 20)
  • max_eval (int) – maximal number of function evaluations per optimization step (default: max_iter * 1.25).
  • tolerance_grad (float) – termination tolerance on first order optimality (default: 1e-5).
  • tolerance_change (float) – termination tolerance on function value/parameter changes (default: 1e-9).
  • history_size (int) – update history size (default: 100).
  • line_search_fn (str) – either ‘strong_wolfe’ or None (default: None).
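Because step() requires a closure here, a minimal usage sketch (model, loss_fn, input and target are assumed to exist) looks like:

optimizer = torch.optim.LBFGS(model.parameters(), lr=1, history_size=10,
                              line_search_fn='strong_wolfe')

def closure():
    # The closure must clear the gradients, recompute the loss, and return it.
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    loss.backward()
    return loss

optimizer.step(closure)  # L-BFGS may evaluate the closure several times per step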

step(closure)[source]

Performs a single optimization step.

Parameters

closure (callable) – A closure that reevaluates the model and returns the loss.

class torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)[source]

Implements RMSprop algorithm.

Proposed by G. Hinton in his course.

The centered version first appears in Generating Sequences With Recurrent Neural Networks.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-2)
  • momentum (float, optional) – momentum factor (default: 0)
  • alpha (float, optional) – smoothing constant (default: 0.99)
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
  • centered (bool, optional) – if True, compute the centered RMSprop, in which the gradient is normalized by an estimate of its variance
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))[source]

Implements the resilient backpropagation algorithm.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-2)
  • etas (Tuple[float, float], optional) – pair of (etaminus, etaplus): multiplicative decrease and increase factors (default: (0.5, 1.2))
  • step_sizes (Tuple[float, float], optional) – a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]

Implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float) – learning rate
  • momentum (float, optional) – momentum factor (default: 0)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • dampening (float, optional) – dampening for momentum (default: 0)
  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

Example

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()

Note

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

$$v = \rho v + g, \qquad p = p - \mathrm{lr} \cdot v$$

where $p$, $g$, $v$ and $\rho$ denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et al. and other frameworks which employ an update of the form

$$v = \rho v + \mathrm{lr} \cdot g, \qquad p = p - v$$

The Nesterov version is analogously modified.
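To make the contrast concrete, here is a scalar sketch of a single update under both conventions (plain Python, illustrative only):

# Momentum rho = 0.9, learning rate lr = 0.1.
rho, lr = 0.9, 0.1
p, v, g = 1.0, 0.0, 2.0   # parameter, velocity, gradient

# PyTorch convention: lr scales the whole velocity at update time.
v_pt = rho * v + g        # v_pt = 2.0
p_pt = p - lr * v_pt      # p_pt = 0.8

# Sutskever et al. convention: lr scales only the fresh gradient.
v_su = rho * v + lr * g   # v_su = 0.2
p_su = p - v_su           # p_su = 0.8; equal here only because v starts at 0,
                          # the two trajectories diverge on subsequent steps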

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

How to adjust Learning Rate

torch.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs. torch.optim.lr_scheduler.ReduceLROnPlateau allows dynamic learning rate reduction based on some validation measurements.

Learning rate scheduling should be applied after optimizer’s update; e.g., you should write your code this way:

>>> scheduler = ...
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

Warning

Prior to PyTorch 1.1.0, the learning rate scheduler was expected to be called before the optimizer’s update; 1.1.0 changed this behavior in a BC-breaking way. If you use the learning rate scheduler (calling scheduler.step()) before the optimizer’s update (calling optimizer.step()), this will skip the first value of the learning rate schedule. If you are unable to reproduce results after upgrading to PyTorch 1.1.0, please check if you are calling scheduler.step() at the wrong time.

class torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)[source]

Sets the learning rate of each parameter group to the initial lr times a given function. When last_epoch=-1, sets initial lr as lr.

Parameters

  • optimizer (Optimizer) – Wrapped optimizer.
  • lr_lambda (function or list) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups.
  • last_epoch (int) – The index of last epoch. Default: -1.

Example

>>> # Assuming optimizer has two groups.
>>> lambda1 = lambda epoch: epoch // 30
>>> lambda2 = lambda epoch: 0.95 ** epoch
>>> scheduler = LambdaLR(optimizer, lr_lambda=[lambda1, lambda2])
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

load_state_dict(state_dict)[source]

Loads the scheduler’s state.

Parameters

state_dict (dict) – scheduler state. Should be an object returned from a call to state_dict().

state_dict()[source]

Returns the state of the scheduler as a dict.

It contains an entry for every variable in self.__dict__ which is not the optimizer. The learning rate lambda functions will only be saved if they are callable objects and not if they are functions or lambdas.

class torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)[source]

Sets the learning rate of each parameter group to the initial lr decayed by gamma every step_size epochs. When last_epoch=-1, sets initial lr as lr.

Parameters

  • optimizer (Optimizer) – Wrapped optimizer.
  • step_size (int) – Period of learning rate decay.
  • gamma (float) – Multiplicative factor of learning rate decay. Default: 0.1.
  • last_epoch (int) – The index of last epoch. Default: -1.

Example

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 60
>>> # lr = 0.0005   if 60 <= epoch < 90
>>> # ...
>>> scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

class torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)[source]

Set the learning rate of each parameter group to the initial lr decayed by gamma once the number of epochs reaches one of the milestones. When last_epoch=-1, sets initial lr as lr.

Parameters

  • optimizer (Optimizer) – Wrapped optimizer.
  • milestones (list) – List of epoch indices. Must be increasing.
  • gamma (float) – Multiplicative factor of learning rate decay. Default: 0.1.
  • last_epoch (int) – The index of last epoch. Default: -1.

Example

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 80
>>> # lr = 0.0005   if epoch >= 80
>>> scheduler = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

class torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)[source]

Set the learning rate of each parameter group to the initial lr decayed by gamma every epoch. When last_epoch=-1, sets initial lr as lr.

Parameters

  • optimizer (Optimizer) – Wrapped optimizer.
  • gamma (float) – Multiplicative factor of learning rate decay.
  • last_epoch (int) – The index of last epoch. Default: -1.
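The original gives no example here; a minimal sketch in the style of the other schedulers (gamma chosen arbitrarily):

>>> # lr is multiplied by 0.9 after every epoch
>>> scheduler = ExponentialLR(optimizer, gamma=0.9)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()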

class torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)[source]

Set the learning rate of each parameter group using a cosine annealing schedule, where $\eta_{max}$ is set to the initial lr and $T_{cur}$ is the number of epochs since the last restart in SGDR:

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$

When last_epoch=-1, sets initial lr as lr.

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

Parameters

  • optimizer (Optimizer) – Wrapped optimizer.
  • T_max (int) – Maximum number of iterations.
  • eta_min (float) – Minimum learning rate. Default: 0.
  • last_epoch (int) – The index of last epoch. Default: -1.
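A minimal usage sketch in the style of the other schedulers (T_max and eta_min chosen arbitrarily):

>>> # lr anneals from its initial value toward eta_min over T_max epochs
>>> scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.001)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()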

class torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)[source]

Reduce learning rate when a metric has stopped improving. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metric quantity and, if no improvement is seen for a ‘patience’ number of epochs, reduces the learning rate.

Parameters

  • optimizer (Optimizer) – Wrapped optimizer.
  • mode (str) – One of min, max. In min mode, lr will be reduced when the quantity monitored has stopped decreasing; in max mode it will be reduced when the quantity monitored has stopped increasing. Default: ‘min’.
  • factor (float) – Factor by which the learning rate will be reduced. new_lr = lr * factor. Default: 0.1.
  • patience (int) – Number of epochs with no improvement after which learning rate will be reduced. For example, if patience = 2, then we will ignore the first 2 epochs with no improvement, and will only decrease the LR after the 3rd epoch if the loss still hasn’t improved then. Default: 10.
  • verbose (bool) – If True, prints a message to stdout for each update. Default: False.
  • threshold (float) – Threshold for measuring the new optimum, to only focus on significant changes. Default: 1e-4.
  • threshold_mode (str) – One of rel, abs. In rel mode, dynamic_threshold = best * ( 1 + threshold ) in ‘max’ mode or best * ( 1 - threshold ) in min mode. In abs mode, dynamic_threshold = best + threshold in max mode or best - threshold in min mode. Default: ‘rel’.
  • cooldown (int) – Number of epochs to wait before resuming normal operation after lr has been reduced. Default: 0.
  • min_lr (float or list) – A scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. Default: 0.
  • eps (float) – Minimal decay applied to lr. If the difference between new and old lr is smaller than eps, the update is ignored. Default: 1e-8.

Example

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = ReduceLROnPlateau(optimizer, 'min')
>>> for epoch in range(10):
>>>     train(...)
>>>     val_loss = validate(...)
>>>     # Note that step should be called after validate()
>>>     scheduler.step(val_loss)

class torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr, max_lr, step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0, scale_fn=None, scale_mode='cycle', cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)[source]

Sets the learning rate of each parameter group according to cyclical learning rate policy (CLR). The policy cycles the learning rate between two boundaries with a constant frequency, as detailed in the paper Cyclical Learning Rates for Training Neural Networks. The distance between the two boundaries can be scaled on a per-iteration or per-cycle basis.

The cyclical learning rate policy changes the learning rate after every batch, so step() should be called after each batch has been used for training.

This class has three built-in policies, as put forth in the paper:

  • “triangular”: A basic triangular cycle with no amplitude scaling.
  • “triangular2”: A basic triangular cycle that scales initial amplitude by half each cycle.
  • “exp_range”: A cycle that scales initial amplitude by gamma**(cycle iterations) at each cycle iteration.

This implementation was adapted from the GitHub repo bckenstler/CLR.

Parameters

  • optimizer (Optimizer) – Wrapped optimizer.
  • base_lr (float or list) – Initial learning rate which is the lower boundary in the cycle for each parameter group.
  • max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on scaling function.
  • step_size_up (int) – Number of training iterations in the increasing half of a cycle. Default: 2000
  • step_size_down (int) – Number of training iterations in the decreasing half of a cycle. If step_size_down is None, it is set to step_size_up. Default: None
  • mode (str) – One of {triangular, triangular2, exp_range}. Values correspond to policies detailed above. If scale_fn is not None, this argument is ignored. Default: ‘triangular’
  • gamma (float) – Constant in ‘exp_range’ scaling function: gamma**(cycle iterations) Default: 1.0
  • scale_fn (function) – Custom scaling policy defined by a single argument lambda function, where 0 <= scale_fn(x) <= 1 for all x >= 0. If specified, then ‘mode’ is ignored. Default: None
  • scale_mode (str) – {‘cycle’, ‘iterations’}. Defines whether scale_fn is evaluated on cycle number or cycle iterations (training iterations since start of cycle). Default: ‘cycle’
  • cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True
  • base_momentum (float or list) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’. Default: 0.8
  • max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any cycle is the difference of max_momentum and some scaling of the amplitude; therefore base_momentum may not actually be reached depending on scaling function. Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’ Default: 0.9
  • last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1

Example

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.01, max_lr=0.1)
>>> data_loader = torch.utils.data.DataLoader(...)
>>> for epoch in range(10):
>>>     for batch in data_loader:
>>>         train_batch(...)
>>>         scheduler.step()

get_lr()[source]

Calculates the learning rate at batch index. This function treats self.last_epoch as the last batch index.

If self.cycle_momentum is True, this function has a side effect of updating the optimizer’s momentum.
