
Optimizers in deep learning

February 28, 2024 13 min read

In this post I briefly describe the concepts behind the most popular optimizer algorithms in deep learning. I cover SGD, RMSprop, AdaGrad, Adam, AdamW, AMSGrad etc.

Given a loss function $\mathcal{L}(\theta)$, neural networks aim to find its (ideally global, in practice local) minimum, attained at some parameter vector $\theta^*$.

In all iterative methods we are going to start from some initialization of the parameters $\theta_0$ and make steps of size $\eta$ in a certain search direction $p$: $\theta_{i+1} = \theta_i + \eta \cdot p$.

The way we define the search direction $p$ and the step size $\eta$ differs significantly between methods.

First-order method: Stochastic Gradient Descent (SGD)

From a practical standpoint, a simple way to implement this minimization is to consider the Taylor expansion of $\mathcal{L}(\theta)$:

$$\mathcal{L}_X(\theta_{i+1}) = \mathcal{L}_X(\theta_i + \eta p) = \mathcal{L}_X(\theta_i) + \nabla \mathcal{L}_X(\theta_i) (\theta_{i+1} - \theta_i) + O((\theta_{i+1} - \theta_i)^2) = \mathcal{L}_X(\theta_i) + \eta \nabla \mathcal{L}_X(\theta_i)\, p + O(\eta^2 p^2),$$

where $\nabla \mathcal{L}_X(\theta_i)$ is the gradient of the loss function, calculated over the whole set of samples $X$.

The simplest optimizer, based on first-order approximation, is known as Stochastic Gradient Descent (SGD).

In SGD the search direction $p$ ideally corresponds to the negative gradient direction: $p = -\nabla \mathcal{L}_X(\theta_i)$.

In practice we approximate the true gradient $\nabla \mathcal{L}_X(\theta_i)$ with $g_x(\theta_i) \approx \nabla \mathcal{L}_X(\theta_i)$, calculated over a mini-batch of samples $x$.

This leads to the following method:

$$\theta_{i+1} = \theta_i - \eta\, g_x(\theta_i),$$ where $\eta$ is the learning rate.
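As a minimal sketch (not part of the original post), one SGD step in NumPy-style Python could look like this; `grad_fn` is a hypothetical helper that returns the mini-batch gradient $g_x(\theta)$:

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, lr=0.01):
    """One mini-batch SGD step: theta <- theta - lr * g_x(theta)."""
    g = grad_fn(theta, batch)  # approximate gradient on the mini-batch x
    return theta - lr * g
```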

Second-order methods: Newton-Raphson method and quasi-Newton methods

In first-order methods the search direction $p$ is determined by the gradient alone. Second-order methods are based on the Newton-Raphson formula and also use curvature information. Consider the Taylor approximation of the loss function up to the second-order term:

$$\mathcal{L}_X(\theta_{i+1}) = \mathcal{L}_X(\theta_i) + \nabla \mathcal{L}_X(\theta_i) (\theta_{i+1} - \theta_i) + \frac{1}{2} (\theta_{i+1} - \theta_i)^T H (\theta_{i+1} - \theta_i) + O((\theta_{i+1} - \theta_i)^3),$$ where $H$ is the Hessian matrix. Equivalently:

$$\mathcal{L}_X(\theta_i + p) = \mathcal{L}_X(\theta_i) + \nabla \mathcal{L}_X(\theta_i)\, p + \frac{1}{2} p^T H p + O(p^3)$$

Again, substitute the whole dataset $X$ with mini-batches $x$, so that $g_x(\theta_i) \approx \nabla \mathcal{L}_X(\theta_i)$.

Find the direction that minimizes $\mathcal{L}(\theta_i + p)$ by taking the derivative of the second-order Taylor approximation with respect to $p$:

$$0 = \frac{\partial \mathcal{L}_x}{\partial p} \approx 0 + g_x(\theta_i) + H p$$

$$p = -H^{-1} g_x(\theta_i)$$

Hence, $\theta_{i+1} = \theta_i + p$, i.e. $\theta_{i+1} = \theta_i - H^{-1} g_x(\theta_i)$.
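To make the formula concrete, here is a sketch of one Newton-Raphson step, assuming hypothetical `grad_fn` and `hessian_fn` helpers; in deep learning the full Hessian is almost never materialized like this:

```python
import numpy as np

def newton_step(theta, grad_fn, hessian_fn, batch):
    """One Newton-Raphson step: theta <- theta - H^{-1} g_x(theta)."""
    g = grad_fn(theta, batch)     # mini-batch gradient g_x(theta)
    H = hessian_fn(theta, batch)  # full Hessian: expensive for large models
    p = -np.linalg.solve(H, g)    # search direction p = -H^{-1} g
    return theta + p
```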

In practice, full calculation of the Hessian at each step of the algorithm is computationally expensive. A whole family of methods, known as quasi-Newton methods, stems from the idea of a computationally efficient approximation of the Hessian or inverse Hessian. The simplest way to achieve that is to approximate just the diagonal of the Hessian with the difference of consecutive gradient computations. Such an approximation was suggested by LeCun in 1988, and a mixture of this approach with AdaGrad was suggested by his team in 2012.

Natural Gradient Descent (NGD)

TODO: Fisher information matrix is the negative expected Hessian of the log-likelihood.

TODO: Fisher information matrix is the Hessian of the KL-divergence between distributions $p(x|\theta)$ and $p(x|\theta')$

TODO: natural gradient is the direction of steepest descent in distribution space

Adaptive learning rate gradient methods

A whole different class of methods stems from the observation that for some batches and in some parts of the loss landscape the gradient is huge in absolute value, while at other points it is small.

With that in mind, we cannot choose one learning rate to rule them all: we need it to adapt to the absolute value of the gradient.

Momentum and Nesterov Accelerated Gradient

It was found quite soon that keeping the learning rate the same leads to very slow training.

Hence, momentum-based methods appeared, which add an extra momentum term to the weight updates:

$$\theta_{t+1} = \theta_t - \eta g_t + \mu \Delta \theta_t,$$ where the last term represents momentum.

Alternatively, one might denote the momentum as $v_t = \theta_t - \theta_{t-1}$ and write the momentum method as follows:

$$v_{t+1} = -\eta g_t + \mu v_t \quad \text{and} \quad \theta_{t+1} = \theta_t + v_{t+1}$$
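In code, the classical momentum update is a small sketch on top of SGD (`g` is the current mini-batch gradient):

```python
def momentum_step(theta, v, g, lr=0.01, mu=0.9):
    """Classical momentum: v <- mu * v - lr * g, then theta <- theta + v."""
    v = mu * v - lr * g
    return theta + v, v
```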

In 1983 Yuri Nesterov suggested the Nesterov Accelerated Gradient method, which is similar to the momentum method, but with an alteration:

$$v_{t+1} = -\eta \hat{g}_t + \mu v_t, \quad \text{where } \hat{g}_t = \nabla \mathcal{L}(\theta_t + \mu v_t), \quad \text{and} \quad \theta_{t+1} = \theta_t + v_{t+1}.$$

They say that Nesterov momentum looks ahead into the future: it uses the momentum to predict where the parameters will land and evaluates the gradient there.
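A sketch of the Nesterov variant, again assuming a hypothetical `grad_fn(theta, batch)` helper; the only change versus classical momentum is where the gradient is evaluated:

```python
def nesterov_step(theta, v, grad_fn, batch, lr=0.01, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point theta + mu * v."""
    g_hat = grad_fn(theta + mu * v, batch)  # look-ahead gradient
    v = mu * v - lr * g_hat
    return theta + v, v
```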

Rprop

The 1992 algorithm Rprop was one of the early optimizers designed with this notion in mind. It works for full-batch gradient descent, i.e. when the gradient is computed over the whole dataset, rather than for mini-batch stochastic gradient descent.

The idea of this algorithm is as follows: consider one coordinate of the gradient. If over two consecutive iterations this coordinate has the same sign, we increase the step size in this coordinate by some factor; if the signs differ, we decrease it.
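A simplified sketch of this sign-based rule (the original Rprop has a few more details, such as zeroing the stored gradient after a sign change; the increase/decrease factors and step bounds below are typical values, not taken from this post):

```python
import numpy as np

def rprop_step(theta, g, prev_g, step, inc=1.2, dec=0.5,
               step_min=1e-6, step_max=50.0):
    """Rprop: adapt a per-coordinate step size from the sign of the gradient."""
    same_sign = g * prev_g > 0
    step = np.where(same_sign,
                    np.minimum(step * inc, step_max),   # same sign: grow the step
                    np.maximum(step * dec, step_min))   # sign flipped: shrink it
    theta = theta - np.sign(g) * step                   # only the sign of g is used
    return theta, step
```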

AdaGrad

AdaGrad was suggested in 2011 and sparked a series of copycat methods with tiny adjustments, such as RMSprop, AdaDelta and Adam.

It adapts the step size differently than Rprop or momentum algorithms.

During the calculation of gradients AdaGrad maintains the history of outer products of gradients:

$$G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^T$$

In practice this matrix of outer products is never used; only its diagonal (which represents second moments of the gradients) is maintained:

$$\mathrm{diag}(G_t) = \begin{pmatrix} \sum\limits_{\tau=1}^{t} g_\tau(1)^2 & 0 & 0 \\ 0 & \sum\limits_{\tau=1}^{t} g_\tau(2)^2 & 0 \\ 0 & 0 & \sum\limits_{\tau=1}^{t} g_\tau(3)^2 \end{pmatrix}$$ (shown here for a 3-dimensional $\theta$; $g_\tau(k)$ is the $k$-th coordinate of $g_\tau$).

Finally, we use this matrix to adapt the effective step size in each coordinate:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\mathrm{diag}(G_t)}} g(\theta_t)$$
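A sketch of the per-coordinate AdaGrad update; the small `eps` in the denominator is a standard numerical-stability addition, not part of the formula above:

```python
import numpy as np

def adagrad_step(theta, G_diag, g, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients per coordinate and scale the step."""
    G_diag = G_diag + g ** 2                          # diagonal of G_t
    theta = theta - lr * g / (np.sqrt(G_diag) + eps)
    return theta, G_diag
```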

RMSprop

For some reason classical Rprop does not work for mini-batches.

Hence, in his 2012 lecture Hinton came up with a modification of Rprop, called RMSprop, adapted for mini-batch stochastic optimization.

Actually, RMSprop is much closer to AdaGrad than to Rprop, the difference being that it uses an exponential moving average (EMA) instead of accumulating the second moment of the gradient over the whole history. I guess the observation that where a simple moving average (SMA) works, an EMA usually works better, was an inspiration for this method.

During the calculation of gradients RMSprop maintains moving averages of squares of gradients:

$$E[g^2]_{t+1} = \beta E[g^2]_t + (1 - \beta) g(\theta_t)^2$$

The gradient is then divided by the square root of this average (or, equivalently, multiplied by the square root of its inverse):

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_{t+1}}} g(\theta_t)$$

Here, instead of altering the effective learning rate by hand as in Rprop, we simply accumulate an EMA of squared gradients and divide by its square root.
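The same thing as a sketch in code; as with AdaGrad, `eps` is only there for numerical stability:

```python
import numpy as np

def rmsprop_step(theta, ema_g2, g, lr=0.001, beta=0.9, eps=1e-8):
    """RMSprop: divide the gradient by the square root of an EMA of squared gradients."""
    ema_g2 = beta * ema_g2 + (1 - beta) * g ** 2
    theta = theta - lr * g / (np.sqrt(ema_g2) + eps)
    return theta, ema_g2
```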

There's a hint of a quasi-Newton vibe to this method, if $\sqrt{E[g^2]_{t+1}}$ were a diagonal approximation of the Hessian, but I cannot see how to give it such an interpretation.

AdaDelta

Matt Zeiler came up with this method in late 2012, drawing inspiration from AdaGrad.

He kept the EMA of the second moment of the gradient in the denominator, as in RMSprop (I am not sure who came up with this earlier, he or Hinton). This term reminded him of the diagonal approximation of the Hessian from quasi-Newton methods, too.

He also noted that the metaphorical units, in which $\theta_{t+1} - \theta_t$ is measured, should be the same as those of $\theta$. This is not the case for AdaGrad and the like, while it is for quasi-Newton methods, as $H^{-1} g_x(\theta_i) = \frac{\partial \mathcal{L}}{\partial \theta} / \frac{\partial^2 \mathcal{L}}{\partial \theta^2}$, where $\partial \mathcal{L}$ is dimensionless, while $\partial \theta$ is measured in some units (e.g. meters, kilograms etc.). Thus, in quasi-Newton methods $\delta \theta = \frac{\partial \mathcal{L}}{\partial \theta} / \frac{\partial^2 \mathcal{L}}{\partial \theta^2} \propto \frac{1}{\partial \theta} / \frac{1}{\partial \theta^2} \propto \partial \theta$, and dimensionality is maintained: $\delta \theta_i \propto \theta_i$, so if the $i$-th coordinate of $\theta$ is in meters, its update is also in meters.

This is not the case for AdaGrad, where the update of the $i$-th coordinate is dimensionless. Hence, Zeiler suggested using an EMA of the updates themselves in the numerator as well, to maintain dimensionality:

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{\sqrt{E[\Delta \theta^2]_t}}{\sqrt{E[g^2]_{t+1}}} g(\theta_t),$$ where

$E[g^2]_{t+1} = \beta E[g^2]_t + (1 - \beta) g(\theta_t)^2$ is the accumulated second moment of the gradient and

$E[\Delta \theta^2]_{t+1} = \beta E[\Delta \theta^2]_t + (1 - \beta) (\Delta \theta_t)^2$ is the accumulated squared update of theta.
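A sketch of one AdaDelta step; note that the original method effectively has no learning rate (here `lr=1.0` by default), and `eps` is the usual stability constant:

```python
import numpy as np

def adadelta_step(theta, ema_g2, ema_dx2, g, lr=1.0, beta=0.95, eps=1e-6):
    """AdaDelta: the ratio of the two EMAs keeps the update in the same 'units' as theta."""
    ema_g2 = beta * ema_g2 + (1 - beta) * g ** 2
    delta = -lr * np.sqrt(ema_dx2 + eps) / np.sqrt(ema_g2 + eps) * g
    ema_dx2 = beta * ema_dx2 + (1 - beta) * delta ** 2
    return theta + delta, ema_g2, ema_dx2
```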

Adam

Adam is a 2014 improvement over AdaGrad and RMSprop. It is actually very similar to AdaDelta, except that the numerator holds the accumulated (first-moment) gradient instead of accumulated squared updates of theta:

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{E[g]_{t+1}}{\sqrt{E[g^2]_{t+1}}},$$ where

$E[g^2]_{t+1} = \beta E[g^2]_t + (1 - \beta) g(\theta_t)^2$ is the accumulated second moment of the gradient and

$E[g]_{t+1} = \beta E[g]_t + (1 - \beta) g$ is the accumulated gradient (I am slightly vague here and omit the bias-correction of both numerator and denominator by $(1 - \beta^t)$; obviously the betas for the numerator and denominator can be different).
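A sketch of one Adam step with the bias-correction terms included (they are omitted in the formulas above); `t` is the 1-based step counter:

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: EMA of gradients (m) divided by the square root of an EMA of squared gradients (v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction of the numerator
    v_hat = v / (1 - beta2 ** t)   # bias correction of the denominator
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```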

There's an interesting discussion of the interaction of Adam with the Fisher information matrix, natural gradient and quasi-Newton methods.

TODO: Connection to Hessian, natural gradient and Fisher information.

AMSGrad

AMSGrad is a 2018 modification of Adam by Reddi et al. The authors found that Adam sometimes fails to converge and thus slightly changed its formulation. It is almost exactly the same, except that the denominator uses the maximum of the current second-moment estimate and the previously largest one (i.e. a running maximum):

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{E[g]_{t+1}}{\sqrt{\hat{E}[g^2]_{t+1}}}, \quad \text{where } \hat{E}[g^2]_{t+1} = \max\left(\hat{E}[g^2]_t, E[g^2]_{t+1}\right),$$

$E[g^2]_{t+1} = \beta E[g^2]_t + (1 - \beta) g(\theta_t)^2$ is the accumulated second moment of the gradient and

$E[g]_{t+1} = \beta E[g]_t + (1 - \beta) g$ is the accumulated gradient.
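A sketch of the AMSGrad step; the only change versus Adam is the extra running-maximum state `v_max` (bias correction is omitted here for brevity):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """AMSGrad: use a running maximum of the second-moment estimate in the denominator."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)    # the denominator can never decrease
    theta = theta - lr * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```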

AdamW and Adam-l2

AdamW is a 2019 modification of Adam.

Oftentimes the Adam optimizer is combined with l2-regularization of the weights, so that instead of minimizing $\mathcal{L}(\theta)$ one minimizes $\hat{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$.

From here one can come up with different constructions of an Adam-like optimizer. The naive one, called Adam-l2, works as follows:

$g_{t+1} = \nabla \mathcal{L}(\theta_t) + \lambda \theta_t$ is the analogue of the gradient, which includes the l2-regularization term;

$\theta_{t+1} = \theta_t - \eta \cdot \frac{E[g]_{t+1}}{\sqrt{E[g^2]_{t+1}}}$ is the update rule, analogous to Adam, where as usual:

$E[g^2]_{t+1} = \beta E[g^2]_t + (1 - \beta) g_{t+1}^2$ is the accumulated second moment of the (regularized) gradient and

$E[g]_{t+1} = \beta E[g]_t + (1 - \beta) g_{t+1}$ is the accumulated gradient.

An alternative, called AdamW, stems from a different problem statement:

$g_{t+1} = \nabla \mathcal{L}(\theta_t)$: here we keep the gradient the same, but add the regularization term to the Adam update rule;

$\theta_{t+1} = \theta_t - \eta \cdot \frac{E[g]_{t+1}}{\sqrt{E[g^2]_{t+1}}} - \eta \lambda \theta_t$; note the last term, corresponding to regularization (decoupled weight decay); the others are as usual:

$E[g^2]_{t+1} = \beta E[g^2]_t + (1 - \beta) g(\theta_t)^2$ is the accumulated second moment of the gradient and

$E[g]_{t+1} = \beta E[g]_t + (1 - \beta) g$ is the accumulated gradient.

These two alternative views of regularization stem from the same idea, but lead to somewhat different performance: in Adam-l2 the regularization term gets rescaled by the adaptive denominator together with the gradient, while in AdamW the weight decay is applied to the weights directly.
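A side-by-side sketch of the two variants (bias correction omitted); the only difference is where the $\lambda \theta$ term enters:

```python
import numpy as np

def adam_l2_step(theta, m, v, g, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-l2: lam * theta is folded into the gradient and rescaled by the adaptive denominator."""
    g = g + lam * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    return theta - lr * m / (np.sqrt(v) + eps), m, v

def adamw_step(theta, m, v, g, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: decoupled weight decay, applied to the weights outside the adaptive rescaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    return theta - lr * m / (np.sqrt(v) + eps) - lr * lam * theta, m, v
```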



Boris Burkov

Written by Boris Burkov, who lives in Moscow, Russia, loves to take part in the development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me on Telegram.