Gradient descent is an optimisation method for finding the minimum of a function. It is commonly used in deep learning to update the weights of a neural network, using gradients computed through backpropagation.
In this post, I will summarise the common gradient descent optimisation algorithms that are used in popular deep learning frameworks (e.g. TensorFlow, Keras, PyTorch, Caffe). This post aims to be easy to read and digest, since there aren’t many such summaries out there, and to serve as a cheat sheet if you want to implement these optimisers from scratch.
There are 3 main ways in which these optimisers can act upon gradient descent:
(1) modifying the learning rate component, α, or
(2) modifying the gradient component, ∂L/∂w, or
(3) both (1) and (2).
See the last term in Eqn. 1 below:
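The equation itself does not appear in this text. Assuming Eqn. 1 is the standard vanilla gradient descent update, written in the notation of this post, it would read:

```latex
w_{t+1} = w_t - \alpha \, \frac{\partial L}{\partial w_t}
```

The last term is the product of the learning rate component, α, and the gradient component, ∂L/∂w, which are the two pieces that the approaches listed above modify.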
Learning rate schedulers vs. Gradient descent optimisers
The main difference between these two is that gradient descent optimisers adapt the learning rate component by multiplying the learning rate by a factor that is a function of the gradients, whereas learning rate schedulers multiply the learning rate by a factor that is either a constant or a function of the time step.
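To make the distinction concrete, here is a minimal sketch. The function names `step_decay` and `adagrad_style`, the default values and the gradient values are all made up for illustration and are not taken from any framework.

```python
import math

def step_decay(alpha, t, drop=0.5, every=10):
    """Scheduler-style: the factor depends only on the time step t."""
    return alpha * drop ** (t // every)

def adagrad_style(alpha, grad_history, eps=1e-8):
    """Optimiser-style: the factor depends on the gradients seen so far."""
    sum_sq = sum(g * g for g in grad_history)
    return alpha / (math.sqrt(sum_sq) + eps)

alpha = 0.1
grads = [4.0, 3.0]  # hypothetical gradient values

scheduled = step_decay(alpha, t=20)    # halved twice after 20 steps
adapted = adagrad_style(alpha, grads)  # scaled by 1 / sqrt(4^2 + 3^2)
```

The scheduler never looks at the gradients, while the AdaGrad-style factor never looks at the time step directly. In practice the two are often combined.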
For (1), these optimisers multiply the learning rate by a positive factor that depends on the gradients, typically making it smaller (e.g. RMSprop). For (2), optimisers usually use a moving average of the gradients (momentum) instead of the single current gradient used in vanilla gradient descent. Optimisers that do both, i.e. (3), include Adam and AMSGrad.
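The three categories can be sketched as single-step update functions for one scalar weight. This is one common formulation of each rule, not necessarily the exact form used by any particular framework, and the default hyperparameters are assumptions chosen for illustration.

```python
import math

def momentum_step(w, grad, m, alpha=0.1, beta=0.9):
    """(2) Gradient component: use a moving average of the gradients."""
    m = beta * m + (1 - beta) * grad
    return w - alpha * m, m

def rmsprop_step(w, grad, v, alpha=0.1, beta=0.9, eps=1e-8):
    """(1) Learning rate component: scale alpha by a gradient-dependent factor."""
    v = beta * v + (1 - beta) * grad ** 2
    return w - alpha / (math.sqrt(v) + eps) * grad, v

def adam_step(w, grad, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """(3) Both: moving average of the gradients and a scaled learning rate."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return w - alpha / (math.sqrt(v_hat) + eps) * m_hat, m, v

# One step from w = 1.0 with a hypothetical gradient of 2.0
w_mom, m = momentum_step(1.0, 2.0, m=0.0)
w_rms, v = rmsprop_step(1.0, 2.0, v=0.0)
w_adam, m_a, v_a = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

Each function returns the updated weight together with the state it needs for the next step, so the caller is responsible for carrying `m` and `v` forward.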
Fig. 3 is an evolutionary map of how these optimisers evolved (not necessarily in chronological order) from the simple vanilla stochastic gradient descent (SGD), down to the variants of Adam. SGD initially branched out into two main types of optimisers: those which act on (i) the gradient component, through momentum, and (ii) the learning rate component, through AdaGrad. Down the generation line, we see the birth of Adam (pun intended 😬), a combination of momentum and RMSprop, which is itself a successor of AdaGrad. You don’t have to agree with me, but this is how I see them 🤭.
- t — time step
- w — weight/parameter which we want to update
- α — learning rate
- ∂L/∂w — gradient of L, the loss function to minimise, w.r.t. w
- I have also standardised the notations and Greek letters used in this post (so they might differ from the papers) so that we can explore how optimisers ‘evolve’ as we scroll.
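As a quick illustration of how this notation maps onto code, here is a minimal sketch of vanilla gradient descent on a toy loss. The loss L(w) = (w - 3)², the helper `grad_L`, the starting weight and the number of steps are all assumptions made for this example.

```python
def grad_L(w):
    """Gradient dL/dw of the toy loss L(w) = (w - 3)**2."""
    return 2 * (w - 3)

alpha = 0.1  # learning rate
w = 0.0      # initial weight

for t in range(100):           # t is the time step
    w = w - alpha * grad_L(w)  # w_{t+1} = w_t - alpha * dL/dw
```

After the loop, `w` sits very close to 3, the minimiser of the toy loss.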