In the previous blog, I discussed regression and the types of regression (https://tinyurl.com/simpleregression). The aim in regression is to find the best-fit line to the training data by identifying the best model parameters. But how do we find the best model parameters? Using optimization algorithms. Gradient descent, AdaGrad, Adam, etc. are a few optimization algorithms. Let’s understand what the gradient descent algorithm is and how it works.
Gradient Descent
Gradient descent is a generic optimization algorithm that is iterative in nature, meaning it performs ‘N’ iterations to identify the best model parameters. Recall the hypothesis and cost function of linear regression.
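For reference, one standard way to write them is shown below, using the usual notation with m training examples and n features (some texts include an extra factor of 1/2 in the cost to simplify the derivative):

```latex
% Hypothesis (prediction) of linear regression with n features
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n

% Mean squared error cost function over m training examples
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```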
For simplicity, we assume there are only two parameters in this equation, theta0 and theta1. The aim is to identify the parameters for which the cost is minimized.
How does gradient descent work?
Gradient descent works in the following way:
- Start with a random value for theta.
- For each iteration, simultaneously update the value of each parameter, theta0 and theta1, by taking the partial derivative of the cost function (the update rule is shown below).
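In symbols, the simultaneous update performed in each iteration can be written as follows (a standard form of the rule; alpha is the learning rate discussed in the next section):

```latex
% Repeat until convergence, updating both parameters simultaneously
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
\qquad \text{for } j = 0, 1
```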
derivative: Slope of the function at a given point.
partial derivative: When a function has multiple variables, the derivative taken with respect to a single variable while the other variables are held constant.
convergence: If we plot a graph of the number of iterations vs the cost, gradient descent is said to have converged when the graph flattens out.
From Fig 1, it can be seen that we initialize theta with a random value. In each iteration, the partial derivative of the cost function is calculated and plugged into the update rule above. A minimum value of the cost is reached for a particular value of theta (assuming there is a single model parameter), and that value is chosen as the model parameter.
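Here is a minimal sketch of this loop for a single parameter, assuming a tiny made-up dataset and a fixed number of iterations (the variable names and values are illustrative, not from the figures):

```python
import numpy as np

# Tiny illustrative dataset: y is roughly 3 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

def cost(theta):
    """Mean squared error for the single-parameter hypothesis h(x) = theta * x."""
    return np.mean((theta * x - y) ** 2)

def gradient(theta):
    """Derivative of the MSE cost with respect to theta."""
    return np.mean(2 * (theta * x - y) * x)

theta = np.random.randn()   # start with a random value for theta
alpha = 0.01                # learning rate
for i in range(100):        # 'N' iterations
    theta = theta - alpha * gradient(theta)

print(f"theta = {theta:.3f}, cost = {cost(theta):.4f}")
```

Recording cost(theta) after every iteration and plotting it against the iteration number would reproduce the flattening curve described under convergence above.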
Learning rate
The learning rate (alpha) is a hyperparameter which determines the size of the steps taken to reach the minimum.
If the value of alpha is too small, the algorithm will take a long time to reach the minimum.
If the value of alpha is too large, the algorithm might take very large steps, overshooting the minimum and jumping across to the other side of the valley; in the worst case it diverges instead of converging.
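One way to see both effects is to run the update rule on the simple convex cost J(theta) = theta^2 with different learning rates; this is an illustrative toy example, not taken from the referenced figures:

```python
def run_gradient_descent(alpha, steps=50):
    """Minimize J(theta) = theta**2 (gradient 2*theta), starting from theta = 5."""
    theta = 5.0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run_gradient_descent(alpha=0.01))  # ~1.8: small steps, still far from the minimum at 0
print(run_gradient_descent(alpha=0.5))   # 0.0: reaches the minimum quickly
print(run_gradient_descent(alpha=1.1))   # huge value: overshoots on every step and diverges
```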
The cost function of linear regression is convex in nature, which means it does not have any local minima, only a single global minimum. Fig 4 shows a local and a global minimum. If a cost function has local minima in addition to the global minimum, gradient descent might get stuck in a local minimum, depending on the learning rate and the initial values of theta.
Gradient descent works well when the features are on the same scale. When this condition is satisfied, the cost function looks like a nice, regular bowl. Otherwise the bowl is elongated along the parameter whose feature takes smaller values compared to the other features.
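If scikit-learn is available, a common way to bring features onto the same scale before running gradient descent is StandardScaler; the feature values below are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales: house area (m^2) and number of rooms
X = np.array([[120.0, 3],
              [250.0, 5],
              [80.0,  2],
              [300.0, 6]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```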
Batch Gradient Descent
In batch gradient descent, the whole batch of training examples is used in each iteration to calculate the gradient. This can be very slow when there is a large number of training examples. On the other hand, batch gradient descent handles a huge number of features with ease. How do we decide the number of iterations? We can set the number of iterations to a high value and then stop the algorithm early, once the gradient becomes smaller than a prespecified value called the tolerance, which is a tiny number.
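Below is a minimal sketch of batch gradient descent with this kind of early stopping, on a small synthetic dataset; the learning rate, iteration cap, and tolerance values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100                                   # number of training examples
X = np.c_[np.ones(m), rng.random(m)]      # add x0 = 1 so theta0 acts as the intercept
y = 4 + 3 * X[:, 1] + rng.normal(0, 0.1, m)

theta = rng.standard_normal(2)            # random initialization
alpha, max_iters, tolerance = 0.1, 10_000, 1e-6

for i in range(max_iters):
    gradients = (2 / m) * X.T @ (X @ theta - y)   # gradient over the whole batch
    theta = theta - alpha * gradients
    if np.linalg.norm(gradients) < tolerance:     # stop once the gradient is tiny
        break

print(i, theta)   # theta should end up close to [4, 3]
```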
Stochastic Gradient Descent
In Stochastic Gradient Descent (SGD), only a single instance of the training set is used in each iteration. This makes it much faster than batch gradient descent. As a consequence, we do not see a smooth descent to the global minimum. The final values obtained by SGD are close to the minimum but not optimal, as SGD keeps bouncing around after reaching the neighbourhood of the minimum. With an irregular cost function that has local minima, SGD's randomness can help it escape a local minimum and find the global minimum.
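For comparison, here is a sketch of the stochastic variant on the same kind of synthetic data: each step uses the gradient computed from one randomly chosen instance. The data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.c_[np.ones(m), rng.random(m)]
y = 4 + 3 * X[:, 1] + rng.normal(0, 0.1, m)

theta = rng.standard_normal(2)
alpha, n_epochs = 0.05, 50

for epoch in range(n_epochs):
    for _ in range(m):
        i = rng.integers(m)                       # pick one random training instance
        xi, yi = X[i:i + 1], y[i:i + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  # gradient from that single instance
        theta = theta - alpha * gradients

print(theta)   # close to [4, 3], but it keeps jittering around the minimum
```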
Mini Batch Gradient Descent
In batch gradient descent the whole training set is used, and in SGD one training example is used. In mini-batch gradient descent, a random subset of instances from the training set is used to compute each gradient. Two main steps are involved in building mini-batches: shuffling and partitioning. Usually, the mini-batch size is a power of 2, e.g., 16, 32, 64, 128.
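A sketch of those two steps, shuffling and partitioning, assuming NumPy arrays X and y and an illustrative batch size of 32 (the helper name make_mini_batches is hypothetical, not from any library):

```python
import numpy as np

def make_mini_batches(X, y, batch_size=32, rng=None):
    """Shuffle the training set, then partition it into mini-batches."""
    rng = rng if rng is not None else np.random.default_rng()
    m = len(X)
    indices = rng.permutation(m)              # step 1: shuffle
    X_shuffled, y_shuffled = X[indices], y[indices]
    batches = []
    for start in range(0, m, batch_size):     # step 2: partition
        end = start + batch_size
        batches.append((X_shuffled[start:end], y_shuffled[start:end]))
    return batches

# Usage: one gradient step per mini-batch in every epoch (same update rule as before)
# for X_batch, y_batch in make_mini_batches(X, y, batch_size=32):
#     gradients = (2 / len(X_batch)) * X_batch.T @ (X_batch @ theta - y_batch)
#     theta = theta - alpha * gradients
```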
Mini-batch gradient descent ends up closer to the minimum than SGD does. Batch gradient descent actually stops at the minimum, whereas SGD and mini-batch gradient descent keep moving around close to it.
Thank you for reading !!
References:
- Géron, A. (2019). Gradient Descent [Image], Learning rate too small [Image], Learning rate too large [Image], Gradient Descent pitfalls [Image]. In Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (pp. 120-122). O'Reilly Media.