In some cases we will need an array with more than two axes.

In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.

**Tensors** and **multidimensional arrays** are different types of objects: the first is a *type of function*, the second is a *data structure* suitable for representing a tensor in a coordinate system.

A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers.

In tensor terms, a tensor that contains only one number is called a scalar (or scalar tensor, or 0-dimensional tensor, or 0D tensor).

A vector is an array of numbers, i.e. a 1D tensor.

Note: a 5D vector is different from a 5D tensor. A 5D vector is a 1D tensor with 5 elements. The dimensionality of a vector refers to the number of entries along a single axis, whereas the dimensionality of a tensor refers to the number of axes.

A matrix is a 2D array of numbers, or a 2D tensor. A matrix has two axes (often referred to as rows and columns). You can visually interpret a matrix as a rectangular grid of numbers.
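To make the distinction concrete, here is a minimal NumPy sketch (the array values are arbitrary, for illustration only) showing that `ndim` counts axes, not entries:

```python
import numpy as np

x = np.array(12)                  # a scalar: 0D tensor, zero axes
v = np.array([1, 2, 3, 4, 5])     # a "5D vector": a 1D tensor with 5 entries
m = np.array([[1, 2, 3],
              [4, 5, 6]])         # a matrix: a 2D tensor with two axes

print(x.ndim, v.ndim, m.ndim)     # -> 0 1 2   (number of axes)
print(v.shape, m.shape)           # -> (5,) (2, 3)
```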

A transformation converts an input from one set (the domain) to an output in the same or another set (the range).

A transformation is nothing but a function. **Transformations are functions that operate on vectors.**

For example:

Consider the function f(x) = x + 1.

We can think of f as a function that maps x to x + 1, or as a transformation applied to x to produce an output.

A linear transformation is a transformation T for which the following holds for all vectors u, v and any scalar c:

$$T(u + v) = T(u) + T(v), \qquad T(cu) = c\,T(u)$$
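A matrix acting on vectors by multiplication is the canonical linear transformation. Here is a minimal NumPy sketch (the matrix and vectors are arbitrary illustrative values) that checks both properties numerically; note that f(x) = x + 1 from the earlier example fails them, since it is affine rather than linear:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])        # T(v) = A @ v is a linear transformation

u = np.array([1.0, -2.0])
v = np.array([0.5, 4.0])
c = 3.0

# Additivity: T(u + v) == T(u) + T(v)
print(np.allclose(A @ (u + v), A @ u + A @ v))   # True

# Homogeneity: T(c*u) == c * T(u)
print(np.allclose(A @ (c * u), c * (A @ u)))     # True
```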

A machine-learning model transforms its input data into meaningful outputs, a process that is "learned" from exposure to known examples of inputs and outputs. Therefore, **the central problem in machine learning and deep learning is to meaningfully transform data: in other words, to learn useful representations of the input data at hand, representations that get us closer to the expected output.**

Most supervised learning algorithms are based on estimating a probability distribution p(y | x). We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).

**The maximum likelihood estimator can be generalized to the case where our goal is to estimate a conditional probability p(y | x; θ) in order to predict y given x.** This is actually the most common situation, because it forms the basis for most supervised learning.

Mean Squared Error -

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^{2}$$

Assuming the model outputs a Gaussian, p(y | x) = N(y; ŷ(x; w), σ²), the conditional log-likelihood is given by

$$\sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right) = -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{2\sigma^{2}}$$

**Maximizing the log-likelihood with respect to w yields the same estimate of the parameters w as does minimizing the mean squared error.**

The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure.
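Here is a minimal NumPy sketch (synthetic data and a one-parameter model, purely for illustration) showing that the w minimizing the MSE is also the w maximizing the Gaussian conditional log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)   # data generated with slope 2

def mse(w):
    return np.mean((w * x - y) ** 2)

def log_likelihood(w, sigma=0.5):
    # sum over examples of log N(y_i; w * x_i, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (y - w * x) ** 2 / (2 * sigma ** 2))

ws = np.linspace(1.0, 3.0, 2001)
w_mse = ws[np.argmin([mse(w) for w in ws])]
w_mle = ws[np.argmax([log_likelihood(w) for w in ws])]
print(w_mse, w_mle)   # identical: different criteria, same optimum location
```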

A derivative is the **instantaneous rate of change** of a function, or the **slope of the tangent to a curve.**

We can find the **slope of the curve at point A** by taking **rise/run**, i.e. **Δy/Δx**. If both deltas are infinitesimally small, we get a more accurate value for the slope. In calculus we denote this as

$$\frac{dy}{dx} = \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x}$$

**which means an infinitesimally small change in "y" over an infinitesimally small change in "x", also called the derivative.**

**When we take the derivative of a function f(x), we get another function f'(x). To get the slope of the tangent at a particular point, substitute the value of that point into the function f'(x).**
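A quick sketch (using f(x) = x², chosen for illustration): the analytic derivative f'(x) = 2x agrees with the rise-over-run estimate as the deltas shrink:

```python
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x              # analytic derivative of x^2

x = 3.0
for dx in (1.0, 0.1, 0.001):
    slope = (f(x + dx) - f(x)) / dx   # rise / run with a finite delta
    print(dx, slope)                  # approaches f_prime(3.0) = 6.0
```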

Suppose we have a function y = f(x).

The derivative of this function is denoted f'(x). The derivative f'(x) gives the slope of f(x) at the point x. In other words, **it specifies how to scale a small change in the input in order to obtain the corresponding change in the output.**

The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y.

**The gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.**
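A minimal gradient-descent sketch (the function, starting point, and learning rate are arbitrary illustrative choices), minimizing f(x) = (x - 3)² by repeatedly stepping along the negative gradient:

```python
def grad(x):
    return 2 * (x - 3)        # derivative of f(x) = (x - 3)^2

x = 10.0                      # arbitrary starting point
learning_rate = 0.1

for _ in range(50):
    x -= learning_rate * grad(x)   # move in the direction of the negative gradient

print(x)   # converges toward the minimizer x = 3
```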

If a variable *z* depends on the variable *y*, which itself depends on the variable *x* (so that *y* and *z* are dependent variables), then *z* depends on *x* as well.

i.e. if z = f(y) and y = g(x), then the chain rule states

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

The chain rule is used to find the gradients of the weights in a neural network. Applying the chain rule in a neural network to compute these gradients is termed backpropagation.
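A minimal backpropagation sketch on a single-weight "network" (all values arbitrary, for illustration): the forward pass composes y = w·x with a squared-error loss, and the backward pass multiplies the local derivatives per the chain rule:

```python
# Forward pass: x -> y = w * x -> loss = (y - target)^2
x, w, target = 2.0, 0.5, 3.0

y = w * x
loss = (y - target) ** 2

# Backward pass (chain rule): dloss/dw = dloss/dy * dy/dw
dloss_dy = 2 * (y - target)   # derivative of (y - target)^2 w.r.t. y
dy_dw = x                     # derivative of w * x w.r.t. w
dloss_dw = dloss_dy * dy_dw

print(dloss_dw)   # -8.0; this gradient would be used to update w
```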