Difference between Batch Gradient Descent and Stochastic Gradient Descent

By Aerin Kim

Let’s take the simplest example, which is Linear Regression.

As always, we start with the cost function.
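For linear regression with hypothesis θᵀx over m training examples, the cost function is the mean squared error:

J(θ) = (1/2m) · Σᵢ₌₁…ₘ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²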

That's the linear regression recap done.

Now, what was the Gradient Descent algorithm?
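It repeatedly takes a step in the direction of the negative gradient of J:

θ := θ − α · ∇J(θ) = θ − α · (1/m) · Σᵢ₌₁…ₘ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) · x⁽ⁱ⁾

where α is the learning rate.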

The algorithm above says that, to perform GD, we need to calculate the gradient of the cost function J. And to calculate that gradient, we need to sum the per-sample gradients over the entire training set. If we have 3 million samples, we have to loop through all 3 million of them (or, equivalently, use a vectorized dot product) just to take a single step.

Here is Python code:

import numpy as np

def gradientDescent(X, y, theta, alpha, num_iters):
    """
    Performs batch gradient descent to learn theta.
    """
    m = y.size  # number of training examples
    for i in range(num_iters):
        y_hat = np.dot(X, theta)  # predictions for ALL m samples
        theta = theta - alpha * (1.0 / m) * np.dot(X.T, y_hat - y)  # one update touches every sample
    return theta

Do you see np.dot(X.T, y_hat-y) above? That’s the vectorized version of “looping through (summing) 3 million samples”.
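To make that concrete, here is a minimal usage sketch on synthetic data (the shapes and values below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # bias column + 2 features
true_theta = np.array([1.0, 2.0, -3.0])
y = X.dot(true_theta) + rng.normal(scale=0.1, size=100)        # noisy linear targets

theta = gradientDescent(X, y, theta=np.zeros(3), alpha=0.1, num_iters=1000)
# theta should now be close to [1.0, 2.0, -3.0]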

Wait… just to move a single step towards the minimum, do we really have to compute the cost gradient over all 3 million samples?

Yes, if you insist on using batch GD.

But if you use Stochastic GD, you don’t have to!

Basically, in SGD, we use the cost gradient of one example at each iteration, instead of the sum of the cost gradients of ALL examples.

def SGD(f, theta0, alpha, num_iters):
    """
    Arguments:
    f -- the function to optimize; it takes a single argument
         and yields two outputs, a cost and the gradient
         with respect to the argument
    theta0 -- the initial point to start SGD from
    alpha -- the learning rate
    num_iters -- total iterations to run SGD for

    Return:
    theta -- the parameter value after SGD finishes
    """
    theta = theta0
    for i in range(num_iters):
        _, grad = f(theta)            # cost gradient of a SINGLE example
        theta = theta - alpha * grad  # there is NO dot product over all samples!
    return theta
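Note that f is left abstract above. As a minimal sketch (make_single_example_objective is an illustrative name I made up, not a standard function), here is one way to build it for linear regression, sampling one random example per call:

import numpy as np

def make_single_example_objective(X, y):
    """Hypothetical helper: returns an f(theta) that picks one random
    example and returns that example's squared-error cost and gradient."""
    m = y.size
    def f(theta):
        i = np.random.randint(m)            # pick a single example
        residual = X[i].dot(theta) - y[i]   # scalar prediction error
        cost = 0.5 * residual ** 2
        grad = residual * X[i]              # gradient w.r.t. theta for this ONE example
        return cost, grad
    return f

theta = SGD(make_single_example_objective(X, y), np.zeros(X.shape[1]),
            alpha=0.01, num_iters=3000)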

Well, Stochastic Gradient Descent has a fancy name, but it's a pretty simple algorithm!

A few things to note:

a) In SGD, before the for-loop, you need to randomly shuffle the training examples.

b) In SGD, because it's using only one example at a time, its path to the minimum is noisier (more random) than that of batch gradient descent. But that's fine, because we are indifferent to the path, as long as it reaches the minimum with a shorter training time.

c) Mini-batch gradient descent uses n data points (instead of a single sample as in SGD) at each iteration; see the sketch below.
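As a minimal sketch (minibatchGD and batch_size are illustrative names, not from a library), mini-batch GD looks just like the batch version, but averages the gradient over a small random subset:

import numpy as np

def minibatchGD(X, y, theta, alpha, num_iters, batch_size=32):
    """Mini-batch gradient descent: each update averages the
    gradient over batch_size randomly chosen samples."""
    m = y.size
    for i in range(num_iters):
        idx = np.random.choice(m, batch_size, replace=False)  # pick n points
        X_batch, y_batch = X[idx], y[idx]
        y_hat = np.dot(X_batch, theta)
        theta = theta - alpha * (1.0 / batch_size) * np.dot(X_batch.T, y_hat - y_batch)
    return theta

With batch_size = 1 this reduces to SGD, and with batch_size = m it's exactly batch GD.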

If you like my post, could you please clap? It gives me motivation to write more. :)