A Short Machine Learning Explanation — in terms of Linear Algebra, Probability and Calculus

By Pranoy Radhakrishnan

In some cases we will need an array with more than two axes.
In the general case, an array of numbers arranged on a regular grid with a
variable number of axes is known as a tensor.

Tensors and multidimensional arrays are different types of objects: the first is a type of function, the second is a data structure suitable for representing a tensor in a coordinate system.

A scalar is just a single number, in contrast to most of the other
objects studied in linear algebra, which are usually arrays of multiple numbers.

In terms of tensors — a tensor that contains only one number is called a scalar (or scalar tensor, or 0-dimensional tensor, or 0D tensor).

A vector is an array of numbers or a 1D Tensor.

Note: a 5D vector is different from a 5D tensor. A 5D vector is a 1D tensor with 5 elements. The dimensionality of a vector refers to the number of entries along a specific axis, whereas the dimensionality of a tensor refers to the number of axes.

A matrix is a 2D array of numbers, or a 2D tensor. A matrix has two axes (often referred to as rows and columns). You can visually interpret a matrix as a rectangular grid of numbers.
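The distinction between a tensor's number of axes and a vector's number of entries can be checked directly in code. A minimal sketch, assuming NumPy is available:

```python
# Tensors of different ranks represented as NumPy arrays.
import numpy as np

scalar = np.array(5)                  # 0D tensor (a scalar)
vector = np.array([1, 2, 3, 4, 5])    # 1D tensor: a 5D *vector*
matrix = np.array([[1, 2], [3, 4]])   # 2D tensor (a matrix)
tensor3 = np.zeros((2, 3, 4))         # 3D tensor: three axes

# ndim counts axes (tensor dimensionality);
# shape counts entries along each axis (vector dimensionality).
print(scalar.ndim, vector.ndim, matrix.ndim, tensor3.ndim)  # 0 1 2 3
print(vector.shape)  # (5,) -- five entries along a single axis
```

So the 5D vector above is still a 1D tensor: `ndim` is 1, while its single axis has 5 entries.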

A transformation converts an input from one set (the domain) to an output in the same or another set (the range).

A transformation is nothing but a function. Transformations are functions that operate on vectors.

For example:-

Consider the function f(x)= x+1.

We can think of f as a function which maps x to x+1, or as a transformation applied to x to get an output.

A linear transformation is a transformation T for which the following holds, for all vectors u, v and all scalars c:

T(u + v) = T(u) + T(v)  (additivity)
T(cu) = cT(u)  (homogeneity)
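These two conditions — additivity, T(u + v) = T(u) + T(v), and homogeneity, T(cu) = cT(u) — can be verified numerically. A sketch, assuming NumPy and an arbitrarily chosen matrix A:

```python
# Check the linearity conditions for the matrix transform T(v) = A @ v.
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])
T = lambda v: A @ v

u = np.array([1.0, -2.0])
v = np.array([0.5, 4.0])
c = 3.0

assert np.allclose(T(u + v), T(u) + T(v))  # additivity holds
assert np.allclose(T(c * u), c * T(u))     # homogeneity holds

# By contrast, f(x) = x + 1 from the earlier example fails additivity,
# so it is an affine transformation, not a linear one:
f = lambda x: x + 1
assert f(1 + 2) != f(1) + f(2)  # 4 != 5
```

Every matrix defines a linear transformation in this way, which is why linear algebra and transformations are studied together.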

A machine-learning model transforms its input data into meaningful outputs, a process that is “learned” from exposure to known examples of inputs and outputs. Therefore, the central problem in machine learning and deep learning is to meaningfully transform data: in other words, to learn useful representations of the input data at hand — representations that get us closer to the expected output.

Most supervised learning algorithms are based on estimating a probability distribution p(y | x). We can do this by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ). That is, the maximum likelihood estimator is generalized from estimating a single distribution to estimating a conditional probability p(y | x; θ) in order to predict y given x. This is the most common situation, because it forms the basis for most supervised learning.

Mean Squared Error (the loss function), over m training examples:

MSE = (1/m) Σ_i (ŷ_i − y_i)²

The conditional log-likelihood is given by:

Σ_i log p(y_i | x_i; θ)

If we model p(y | x) as a Gaussian with mean ŷ(x; w) and fixed variance σ², the conditional log-likelihood becomes

−m log σ − (m/2) log(2π) − Σ_i (ŷ_i − y_i)² / (2σ²)

Maximizing the log-likelihood with respect to w yields the same estimate of the parameters w as does minimizing the mean squared error.
The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure.
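This "same location of the optimum" claim can be checked numerically. A sketch, assuming NumPy, synthetic data with true slope 2.0, and a unit-variance Gaussian noise model:

```python
# For a one-parameter model yhat = w * x, sweep candidate values of w and
# confirm that MSE and the Gaussian negative log-likelihood (NLL) are
# minimized at the same w.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)  # true slope is 2.0

ws = np.linspace(0.0, 4.0, 401)  # candidate parameter values
mse = [np.mean((w * x - y) ** 2) for w in ws]
# Gaussian NLL with sigma = 1: same squared-error term, different constants.
nll = [0.5 * np.sum((w * x - y) ** 2) + 0.5 * len(x) * np.log(2 * np.pi)
       for w in ws]

# Different values, same argmin: the optimum sits at the same w.
assert np.argmin(mse) == np.argmin(nll)
print(ws[np.argmin(mse)])  # close to 2.0
```

The two curves differ by a scale factor and an additive constant, neither of which moves the location of the minimum.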

The derivative is the instantaneous rate of change, or the slope of the tangent to a curve.

[Figure: the slope of a curve at a point A, measured as rise over run]

We can find the slope of the curve at point A by taking rise/run, i.e. taking delta_y/delta_x. If both deltas are infinitesimally small, we get a more accurate value for the slope. In calculus we denote this as

dy/dx

which means an infinitesimally small change in y over an infinitesimally small change in x. This quantity is called the derivative.

When we take the derivative of a function f(x), we get another function f’(x). To get the slope of the tangent at a particular point, substitute that point into the function f’(x).

Suppose we have a function y = f (x).
The derivative of this function is denoted as f’(x) . The derivative f’(x) gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output.
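This idea — the derivative as the limit of Δy/Δx for small deltas — can be sketched in plain Python, comparing a finite-difference estimate against the known analytic derivative:

```python
# For f(x) = x**2, the analytic derivative is f'(x) = 2x.
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # Central difference: (f(x + h) - f(x - h)) / (2h) with a small h
    # approximates the slope of the tangent at x.
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_derivative(f, 3.0))  # approximately 6.0, since f'(3) = 6
```

As h shrinks, the estimate approaches the true slope of the tangent, exactly as the rise-over-run picture suggests.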

The derivative is therefore useful for minimizing a function because it tells
us how to change x in order to make a small improvement in y.

The gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.
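Gradient descent in one dimension can be sketched in a few lines of plain Python. The function, starting point, and learning rate below are arbitrary choices for illustration:

```python
# Gradient descent on f(x) = (x - 5)**2, whose minimum is at x = 5.
# The derivative grad(x) = 2 * (x - 5) points uphill, so we step the
# other way.
def grad(x):
    return 2 * (x - 5)

x = 0.0    # initial guess
lr = 0.1   # learning rate (step size)
for _ in range(100):
    x = x - lr * grad(x)  # move in the direction of the negative gradient

print(x)  # approximately 5.0
```

Each step moves x against the gradient, so f decreases until x settles near the minimum; too large a learning rate would overshoot instead.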

If a variable z depends on the variable y, which itself depends on the variable x, so that y and z are therefore dependent variables, then z depends on x as well.

i.e. if z = f(y) and y = g(x), then z = f(g(x)).

The chain rule then states:

dz/dx = (dz/dy) · (dy/dx)

The chain rule is used to find the gradients of the weights in a neural network. The application of the chain rule in a neural network to find these gradients is termed backpropagation.
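A minimal sketch of this, with made-up numbers and a single "neuron": the loss depends on the weight only through a chain of intermediate variables, and backpropagation multiplies one local derivative per link in the chain.

```python
# z = w * x, y = sigmoid(z), loss = (y - t)**2.
# Backpropagation: dL/dw = dL/dy * dy/dz * dz/dw (the chain rule).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w, t = 2.0, 0.5, 1.0  # input, weight, target (arbitrary values)

# Forward pass
z = w * x
y = sigmoid(z)
loss = (y - t) ** 2

# Backward pass: one local derivative per step, multiplied together
dloss_dy = 2 * (y - t)
dy_dz = y * (1 - y)      # derivative of the sigmoid
dz_dw = x
dloss_dw = dloss_dy * dy_dz * dz_dw

# Sanity check against a finite-difference estimate of the same gradient
h = 1e-6
fd = ((sigmoid((w + h) * x) - t) ** 2
      - (sigmoid((w - h) * x) - t) ** 2) / (2 * h)
assert abs(dloss_dw - fd) < 1e-6
```

A real network applies exactly this pattern layer by layer, reusing each intermediate derivative as it propagates the gradient backward from the loss to every weight.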