
By Aerin Kim 🙏

Let’s take the simplest example, which is Linear Regression.

Linear regression recap: done.

Now, what was the Gradient Descent algorithm?

The algorithm above says that to perform GD, we need to calculate the gradient of the cost function J. And to calculate the gradient of the cost function, we need to sum up (the yellow circle!) the cost gradient of each sample. If we have 3 million samples, we have to loop through all 3 million of them, or use the dot product.
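In case the figure doesn't load: for linear regression, the cost and the GD update it shows are typically written as below. This reconstruction is an assumption on my part, chosen to match the code that follows:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta^{T}x^{(i)} - y^{(i)}\bigr)^{2}$$

$$\theta := \theta - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(\theta^{T}x^{(i)} - y^{(i)}\bigr)\,x^{(i)}$$

The sum over all $m$ samples is the yellow circle.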

Here is the Python code:

```python
import numpy as np

def gradientDescent(X, y, theta, alpha, num_iters):
    """Performs gradient descent to learn theta."""
    m = y.size  # number of training examples
    for i in range(num_iters):
        y_hat = np.dot(X, theta)
        theta = theta - alpha * (1.0 / m) * np.dot(X.T, y_hat - y)
    return theta
```

Do you see `np.dot(X.T, y_hat-y)` above? That’s the vectorized version of “looping through (summing) 3 million samples”.
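You can sanity-check this update on tiny synthetic data. The snippet below is just an illustrative sketch (the data, learning rate, and iteration count are all made-up choices); it inlines the same update rule as `gradientDescent`:

```python
import numpy as np

# Made-up, noiseless data: y = 3*x, so the true theta is [0, 3]
# (intercept 0, slope 3).
np.random.seed(0)
m = 100
x = np.random.rand(m)
X = np.column_stack([np.ones(m), x])  # prepend an intercept column
y = 3 * x

theta = np.zeros(2)
alpha = 0.5  # arbitrary learning rate for this sketch
for _ in range(2000):
    y_hat = np.dot(X, theta)
    # the vectorized "sum over all m samples" -- one dot product
    theta = theta - alpha * (1.0 / m) * np.dot(X.T, y_hat - y)

print(theta)  # approaches [0, 3]
```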

Wait... just to move a single step towards the minimum, do we really have to calculate the cost of all 3 million samples?

Yes. If you insist on using GD.

But if you use Stochastic GD, you don’t have to!

Basically, in SGD, we use the cost gradient of 1 example at each iteration, instead of the sum of the cost gradients of ALL examples.

```python
def SGD(f, theta0, alpha, num_iters):
    """
    Arguments:
    f -- the function to optimize; it takes a single argument
         and returns two outputs, a cost and the gradient
         with respect to the argument
    theta0 -- the initial point to start SGD from
    alpha -- the learning rate
    num_iters -- total iterations to run SGD for

    Returns:
    theta -- the parameter value after SGD finishes
    """
    theta = theta0
    for i in range(num_iters):
        _, grad = f(theta)
        theta = theta - alpha * grad  # there is NO dot product over all samples!
    return theta
```

Well, Stochastic Gradient Descent has a fancy name but I guess it’s a pretty simple algorithm!
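To make the abstract `f` concrete, here is a hedged sketch that plugs a one-sample cost/gradient function into the same loop. The data, step size, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Made-up, noiseless data: y = 3*x, so the true theta is [0, 3].
np.random.seed(0)
m = 200
x = np.random.rand(m)
X = np.column_stack([np.ones(m), x])  # intercept column + feature
y = 3 * x

def f(theta):
    """Cost and gradient from ONE randomly picked example --
    no sum (and no dot product) over all m samples."""
    i = np.random.randint(m)
    err = X[i].dot(theta) - y[i]
    return 0.5 * err ** 2, err * X[i]

# The SGD loop itself, inlined so this snippet runs on its own.
theta = np.zeros(2)
alpha = 0.05
for _ in range(20000):
    _, grad = f(theta)
    theta = theta - alpha * grad
```

Because the data here is noiseless, the one-sample updates still settle close to the true [0, 3], just along a noisier path than full-batch GD would take.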

A few things to note:

a) In SGD, before the for-loop, you need to randomly shuffle the training examples.

b) In SGD, because it uses only one example at a time, its path to the minimum is noisier (more random) than that of batch gradient descent. But that's OK, since we are indifferent to the path, as long as it gives us the minimum AND a shorter training time.

c) Mini-batch gradient descent uses n data points (instead of a single sample, as in SGD) at each iteration.
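Notes (a) and (c) can be sketched together: shuffle the examples at the start of each epoch, then step through them in small batches. Everything concrete below (the data, `batch_size = 32`, the learning rate, the epoch count) is an illustrative assumption, not something from this post:

```python
import numpy as np

# Made-up, noiseless data: y = 3*x, so the true theta is [0, 3].
np.random.seed(0)
m = 256
x = np.random.rand(m)
X = np.column_stack([np.ones(m), x])  # intercept column + feature
y = 3 * x

theta = np.zeros(2)
alpha = 0.5
batch_size = 32  # the "n data points" per step; 32 is an arbitrary choice

for epoch in range(200):
    perm = np.random.permutation(m)  # note (a): reshuffle before each pass
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        y_hat = np.dot(Xb, theta)
        # same vectorized update as full-batch GD, but over only 32 samples
        theta = theta - alpha * (1.0 / batch_size) * np.dot(Xb.T, y_hat - yb)
```

With batch_size = 1 this reduces to SGD, and with batch_size = m it is exactly the full-batch gradientDescent from earlier.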

If you like my post, could you please clap? It gives me motivation to write more. :)