Visualizing various gradient descent algorithms
Recently, I implemented several gradient-descent-based optimization algorithms and applied them to a simple loss function for visualization purposes.
The results look like this:
Legend:
- The contour shows the loss surface given the sampled data and the model, a quadratic function with two parameters (w1 and w2).
- The star marks the true parameters of the quadratic function from which the data is sampled.
- The white line is the trajectory of gradient descent with the corresponding algorithm.
bgd: batch gradient descent; sgd: stochastic gradient descent.
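To make the legend concrete, here is a minimal sketch of the setup. The specifics are assumptions for illustration only: the exact model form (here $y = w_1 x + w_2 x^2$), the true parameters, the noise level, the starting point, and the plot styling all differ from the actual notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical setup: model form, constants, and plot styling are assumptions.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])                  # the "star": true (w1, w2)
x = rng.uniform(-2, 2, size=100)
y = w_true[0] * x + w_true[1] * x**2 + rng.normal(0, 0.3, size=x.shape)

def loss(w1, w2):
    """Mean squared error of the quadratic model w1*x + w2*x**2 on the sample."""
    return np.mean((w1 * x + w2 * x**2 - y) ** 2)

def grad(w):
    """Gradient of the MSE loss with respect to (w1, w2)."""
    err = w[0] * x + w[1] * x**2 - y
    return np.array([np.mean(2 * err * x), np.mean(2 * err * x**2)])

# Loss surface (contour) plus one trajectory, here plain batch gradient descent.
w1s, w2s = np.meshgrid(np.linspace(-1, 5, 200), np.linspace(-4, 2, 200))
Z = np.vectorize(loss)(w1s, w2s)

w, eta, path = np.array([4.5, 1.5]), 0.05, []
for _ in range(100):
    path.append(w.copy())
    w = w - eta * grad(w)
path = np.array(path)

plt.contourf(w1s, w2s, Z, levels=30)
plt.plot(path[:, 0], path[:, 1], "w-", lw=1)    # white line: the trajectory
plt.plot(*w_true, "r*", markersize=12)          # star: true parameters
plt.xlabel("w1"); plt.ylabel("w2"); plt.title("bgd")
plt.show()
```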
Some comments:
- Batch gradient descent is the smoothest.
- Stochastic gradient descent is jumpier.
- Plain momentum and Nesterov accelerated momentum (NAM) overshoot and then correct themselves (both update rules are sketched after this list).
- Adagrad gets stuck: $G$ keeps growing as squared gradients are added at every time step, so the effective learning rate shrinks toward zero and learning halts (see the sketch below).
- Adadelta fixes Adagrad's diminishing learning rate, resulting in a very jumpy trajectory.
- RMSprop is quite similar to Adadelta, just replacing $RMS[\Delta \theta]_{t-1}$ with a fixed $\eta$ (both are sketched below), but it doesn't seem to do as well as Adadelta on this loss surface.
- Adam looks good.
- AdaMax doesn't look as good as Adam on this loss function.
- Nadam performs similarly to Adam (Adam, AdaMax, and Nadam are sketched below).
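To go with the momentum comment above, here is a rough sketch of the two update rules. It assumes a gradient function `grad(w)` and a starting point like those in the setup sketch; the hyperparameter values are illustrative, not the ones used in the notebook.

```python
import numpy as np

def momentum_path(grad, w0, eta=0.05, gamma=0.9, steps=200):
    """Classical momentum: v <- gamma*v + eta*grad(w), then w <- w - v."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    path = []
    for _ in range(steps):
        path.append(w.copy())
        v = gamma * v + eta * grad(w)
        w = w - v
    return np.array(path)

def nag_path(grad, w0, eta=0.05, gamma=0.9, steps=200):
    """Nesterov momentum: evaluate the gradient at the look-ahead point w - gamma*v."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    path = []
    for _ in range(steps):
        path.append(w.copy())
        v = gamma * v + eta * grad(w - gamma * v)
        w = w - v
    return np.array(path)
```

Each of these returns the trajectory that gets drawn as the white line, e.g. `momentum_path(grad, w0=[4.5, 1.5])` with the `grad` from the setup sketch.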
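The Adagrad comment in words is exactly the `G += g * g` line below; this is a minimal sketch under the same assumed `grad(w)` interface.

```python
import numpy as np

def adagrad_path(grad, w0, eta=0.5, eps=1e-8, steps=200):
    """Adagrad: G accumulates squared gradients, so the effective step size
    eta / sqrt(G + eps) only ever shrinks, which is why the trajectory stalls."""
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)
    path = []
    for _ in range(steps):
        path.append(w.copy())
        g = grad(w)
        G += g * g                              # monotonically increasing accumulator
        w = w - eta / np.sqrt(G + eps) * g
    return np.array(path)
```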
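Here is a sketch of Adadelta next to RMSprop to show the single difference called out above: Adadelta's numerator is the running $RMS[\Delta \theta]_{t-1}$ of past updates, while RMSprop keeps a fixed $\eta$. Hyperparameter values are again illustrative.

```python
import numpy as np

def adadelta_path(grad, w0, rho=0.95, eps=1e-6, steps=200):
    """Adadelta: step size is RMS[dw]_{t-1} / RMS[g]_t, with no global learning rate."""
    w = np.asarray(w0, dtype=float).copy()
    Eg2 = np.zeros_like(w)     # running average of squared gradients
    Edw2 = np.zeros_like(w)    # running average of squared updates
    path = []
    for _ in range(steps):
        path.append(w.copy())
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dw = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw
        w = w + dw
    return np.array(path)

def rmsprop_path(grad, w0, eta=0.05, rho=0.9, eps=1e-8, steps=200):
    """RMSprop: same decaying average of squared gradients, but the numerator
    is a fixed eta instead of RMS[dw]_{t-1}."""
    w = np.asarray(w0, dtype=float).copy()
    Eg2 = np.zeros_like(w)
    path = []
    for _ in range(steps):
        path.append(w.copy())
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        w = w - eta / np.sqrt(Eg2 + eps) * g
    return np.array(path)
```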
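Finally, a sketch of Adam, with the AdaMax and Nadam modifications noted in comments. The beta values are the commonly quoted defaults; the learning rate is just an illustrative choice, not necessarily what the notebook uses.

```python
import numpy as np

def adam_path(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam: bias-corrected first and second moment estimates of the gradient."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    path = []
    for t in range(1, steps + 1):
        path.append(w.copy())
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)              # bias correction
        v_hat = v / (1 - beta2**t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return np.array(path)

# AdaMax replaces sqrt(v_hat) with an infinity-norm accumulator:
#   u = max(beta2 * u, abs(g));  w -= eta / (1 - beta1**t) * m / u
# Nadam adds a Nesterov-style look-ahead to the first moment:
#   w -= eta * (beta1 * m_hat + (1 - beta1) * g / (1 - beta1**t)) / (sqrt(v_hat) + eps)
```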
For details, please see this notebook and the code here.
Please let me know if you have any comments or bug reports.