# The Math Behind ML (the important stuff)

## Important mathematical prerequisites for getting into Machine Learning, Deep Learning, or any of the other space

| UPDATED

Important mathematical prerequisites for getting into Machine Learning, Deep Learning, or any of the other space

### Table of Contents

- What is broadcasting in connection to Linear Algebra?
- What are scalars, vectors, matrices, and tensors?
- What is the Hadamard product of two matrices?
- What is an inverse matrix?
- If inverse of a matrix exists, how do you calculate it?
- What is the determinant of a square matrix? How is it calculated? What is the connection of determinant to eigenvalues?
- Discuss span and linear dependence.
- What is Ax = b? When does Ax =b has a unique solution?
- In Ax = b, what happens when A is fat or tall?
- When does inverse of A exist?
- What is a norm? What is L1, L2 and L infinity norm?
- What are the conditions a norm has to satisfy?
- Why is squared of L2 norm preferred in ML than just L2 norm?
- When L1 norm is preferred over L2 norm?
- Can the number of nonzero elements in a vector be defined as L0 norm? If no, why?
- What is Frobenius norm?
- What is a diagonal matrix?
- Why is multiplication by diagonal matrix computationally cheap? How is the multiplication different for square vs. non-square diagonal matrix?
- At what conditions does the inverse of a diagonal matrix exist?
- What is a symmetrix matrix?
- What is a unit vector?
- When are two vectors x and y orthogonal?
- At R^n what is the maximum possible number of orthogonal vectors with non-zero norm?
- When are two vectors x and y orthonormal?
- What is an orthogonal matrix? Why is computationally preferred?
- What is eigendecomposition, eigenvectors and eigenvalues?
- How to find eigen values of a matrix?
- Write the eigendecomposition formula for a matrix. If the matrix is real symmetric, how will this change?
- Is the Eigendecomposition guaranteed to be unique? If not, then how do we represent it?
- What are positive definite, negative definite, positive semi definite and negative semi definite matrices?
- What is Singular Value Decomposition? Why do we use it? Why not just use ED?
- Given a matrix A, how will you calculate its Singular Value Decomposition?
- What are singular values, left singulars and right singulars?
- What is the connection of Singular Value Decomposition of A with functions of A?
- Why are singular values always non-negative?
- What is the Moore Penrose pseudo inverse and how to calculate it?
- If we do Moore Penrose pseudo inverse on Ax = b, what solution is provided is A is fat? Moreover, what solution is provided if A is tall?
- Which matrices can be decomposed by ED?
- Which matrices can be decomposed by SVD?
- What is the trace of a matrix?
- How to write Frobenius norm of a matrix A in terms of trace?
- Why is trace of a multiplication of matrices invariant to cyclic permutations?
- What is the trace of a scalar?
- Write the frobenius norm of a matrix in terms of trace?

- What is underflow and overflow?
- How to tackle the problem of underflow or overflow for softmax function or log softmax function?
- What is poor conditioning?
- What is the condition number?
- What are grad, div and curl?
- What are critical or stationary points in multi-dimensions?
- Why should you do gradient descent when you want to minimize a function?
- What is line search?
- What is hill climbing?
- What is a Jacobian matrix?
- What is curvature?
- What is a Hessian matrix?

**Basics of Probability and Information Theory**

- Compare “Frequentist probability” vs. “Bayesian probability”?
- What is a random variable?
- What is a probability distribution?
- What is a probability mass function?
- What is a probability density function?
- What is a joint probability distribution?
- What are the conditions for a function to be a probability mass function?
- What are the conditions for a function to be a probability density function?
- What is a marginal probability? Given the joint probability function, how will you calculate it?
- What is conditional probability? Given the joint probability function, how will you calculate it?
- State the Chain rule of conditional probabilities.
- What are the conditions for independence and conditional independence of two random variables?
- What are expectation, variance and covariance?
- Compare covariance and independence.
- What is the covariance for a vector of random variables?
- What is a Bernoulli distribution? Calculate the expectation and variance of a random variable that follows Bernoulli distribution?
- What is a multinoulli distribution?
- What is a normal distribution?
- Why is the normal distribution a default choice for a prior over a set of real numbers?
- What is the central limit theorem?
- What are exponential and Laplace distribution?
- What are Dirac distribution and Empirical distribution?
- What is mixture of distributions?
- Name two common examples of mixture of distributions.
- Is Gaussian mixture model a universal approximator of densities?
- Write the formulae for logistic and softplus function.
- Write the formulae for Bayes rule
- What do you mean by measure zero and almost everywhere?
- If two random variables are related in a deterministic way, how are the PDFs related?
- Define self-information. What are its units?
- What are Shannon entropy and differential entropy?
- What is Kullback-Leibler (KL) divergence?
- Can KL divergence be used as a distance measure?
- Define cross-entropy
- What are structured probabilistic models or graphical models?
- In the context of structured probabilistic models, what are directed and undirected models? How are they represented? What are cliques in undirected structured probabilistic models?

- What is population mean and sample mean?
- What is population standard deviation and sample standard deviation?
- Why population s.d. has $N$ degrees of freedom while sample s.d. has $N-1$ degrees of freedom? In other words, why $1/N$ inside root for pop. s.d. and $1/(N-1)$ inside root for sample s.d.?
- What is the formula for calculating the s.d. of the sample mean?
- What is confidence interval?
- What is standard error?

## Linear Algebra

#### What is broadcasting in connection to Linear Algebra?

Arrays with different sizes cannot be added, subtracted, or generally be used in arithmetic. Two matrices are compatible if the corresponding dimensions in each matrix (rows vs rows, columns vs columns) meet the following requirements:

- The dimensions are equal, or
- One dimension is of size 1

A way to overcome this requirement is to duplicate the smaller array so that it is the dimensionality and size as the larger array.

#### What are scalars, vectors, matrices, and tensors?

- A scalar is a single number
- A vector is an array of numbers. $\mathbf{x} =\begin{bmatrix} x_1 \\\\ x_2 \\\\ \vdots \\\\ x_n \end{bmatrix}$
- A matrix is a 2-D array $\mathbf{A}= \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\\\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\\\ \vdots & \vdots & \ddots & \vdots \\\\ A_{m,1} & A_{m,2} & \cdots & A_{m,n} \end{bmatrix}$
- A tensor is a $n$-dimensional array with $n>2$

#### What is the Hadamard product of two matrices?

Hadamard product of matrices is an elementwise operation. Values that correspond positionally are multiplied to produce a new matrix.

#### What is an inverse matrix?

The inverse of a square matrix $A$, sometimes called a reciprocal matrix, is a matrix $A^{-1}$ such that $AA^{-1}=I$, where $I$ is the identity matrix.

A square matrix $A$ has an inverse iff the determinant $|A|\neq 0$ (Lipschutz 1991, p. 45). The so-called invertible matrix theorem is major result in linear algebra which associates the existence of a matrix inverse with a number of other equivalent properties. A matrix possessing an inverse is called nonsingular, or invertible.

#### If inverse of a matrix exists, how do you calculate it?

For a $2×2$ matrix $A= \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the matrix inverse is

$A^{-1} = \frac{1}{|A|}\begin{bmatrix}d & -b \\ -c & a\end{bmatrix} \\ = \frac{1}{ad-bc}\begin{bmatrix}d & -b \\ -c & a\end{bmatrix}$

For a $3×3$ matrix

$A= \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$

the matrix inverse is

$A^{-1}=\frac{1}{|A|} \begin{bmatrix} \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} & \begin{vmatrix} a_{13} & a_{12} \\ a_{33} & a_{32} \end{vmatrix} & \begin{vmatrix} a_{12} & a_{13} \\ a_{22} & a_{23} \end{vmatrix} \\ \begin{vmatrix} a_{23} & a_{21} \\ a_{33} & a_{31} \end{vmatrix} & \begin{vmatrix} a_{11} & a_{13} \\ a_{31} & a_{33} \end{vmatrix} & \begin{vmatrix} a_{13} & a_{11} \\ a_{23} & a_{21}\end{vmatrix} \\ \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} & \begin{vmatrix} a_{12} & a_{11} \\ a_{32} & a_{31} \end{vmatrix} & \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}\end{bmatrix}$

A general $n×n$ matrix can be inverted using methods such as the *Gauss-Jordan elimination*, *Gaussian elimination*, or *LU decomposition*.

#### What is the determinant of a square matrix? How is it calculated? What is the connection of determinant to eigenvalues?

Determinants are mathematical objects that are very useful in the analysis and solution of systems of linear equations. As shown by Cramer’s rule, a nonhomogeneous system of linear equations has a unique solution iff the determinant of the system’s matrix is nonzero (i.e., the matrix is nonsingular). For example, eliminating $x$, $y$, and $z$ from the equations

$a_1x+a_2y+a_3z = 0 \\ b_1x+b_2y+b_3z = 0 \\ c_1x+c_2y+c_3z = 0$

gives the expression

$a_1b_2c_3-a_1b_3c_2+a_2b_3c_1-a_2b_1c_3+a_3b_1c_2-a_3b_2c_1=0$, which is called the determinant for this system of equation. Determinants are defined only for square matrices.

If the determinant of a matrix is $0$, the matrix is said to be singular, and if the determinant is $1$, the matrix is said to be unimodular.

The determinant of a matrix $A$,

$\begin{vmatrix}a_1 & a_2 & \dots & a_n \\ b_1 & b_2 & \dots & b_n \\ \vdots & \vdots & \ddots & \vdots\\ z_1 & z_2 & \dots & z_n\end{vmatrix}$

is commonly denoted $\det(A)$, $|A|$, or in component notation as $\sum{+/-a_1b_2c_3 ...}$, $D(a_1b_2c_3 ...)$, or $|a_1b_2c_3...|$ (Muir 1960, p. 17). Note that the notation $\det(A)$ may be more convenient when indicating the absolute value of a determinant, i.e., $|\det(A)|$ instead of $||A||$.

A $2×2$ determinant is defined to be

$det[a b; c d]=|a b; c d|=ad-bc$.

A $k×k$ determinant can be expanded “by minors” to obtain

$\begin{vmatrix}a_{11} & a_{12} & a_{13} & \dots & a_{1k} \\ a_{21} & a_{22} & a_{23} & \dots & a_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \\ a_{k1} & a_{k2} & a_{k3} & \dots & a_{kk} \end{vmatrix}=a_{11} \begin{vmatrix}a_{22} & a_{23} & \dots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k2} & a_{k3} & \dots & a_{kk} \end{vmatrix} \\ -a_{12} \begin{vmatrix} a_{21} & a_{23} & \dots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k3} & \dots & a_{kk} \end{vmatrix} +... \pm a_{1k} \begin{vmatrix}a_{21} & a_{22} & \dots & a_{2(k-1)} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & \dots a_{k2} & \dots & a_{k(k-1)} \end{vmatrix}$.

A general determinant for a matrix $A$ has a value $|A|=\sum_{i=1}^ka_{ij}C_{ij}$, with no implied summation over $j$ and where $C_{ij}$ (also denoted $a^{ij}$) is the cofactor of $a_{ij}$ defined by $C_{ij}={-1}^{i+j}M_{ij}$ and $M_{ij}$ is the minor of matrix $A$ formed by eliminating row $i$ and column $j$ from $A$. This process is called determinant expansion by minors (or “Laplacian expansion by minors,” sometimes further shortened to simply “Laplacian expansion”).

#### Discuss span and linear dependence.

The span of vectors $v_1$, $v_2$, …, $v_n$ is the set of linear combinations $c_1v_1 + c_2v_2 + ... + c_nv_n$, and that this is a vector space.

We can take this idea further. If $V$ is a vector space and $S = \{v_1, v_2, ... , v_n\}$ is a subset of $V$, then is $\text{Span}(S)$ equal to $V$? We say that **$S$ spans $V$** if every vector $v$ in $V$ can be written as a linear combination of vectors in $S$ (e.g., $v = c_1v_1 + c_2v_2 + ... + c_nv_n$).

The next question we ask after this is whether there are any reduncancies. I.e., is there a smaller subset of $S$ that also spans $\text{Span}(S)$. If so, then we can write one of the vectors as a linear combination of the others (like so: $vi = c1v1 + c2v2 + ... + ci -1vi -1 + ci+1vi+1 + ... + cnvn$)

If one of the vectors within $S$ can be written as a linear combination like this, then $S$ is a linearly dependent set. If we can’t write any of the vectors as a linear combination of the others in $S$, then $S$ is linearly *in*dependent

#### What is $Ax = b$? When does $Ax =b$ have a unique solution?

Suppose we want to solve a system of equation like $3x + 2y = 2, x-y = 4, 5y + z = -1$ (where we want to find solutions for $x$, $y$, and $z$). We can solve this using methods like substitution or elimination, but these are only tractable when we have few equations or few unknowns. If we want to use a more efficient method for larger numbers of unknowns or equations, we can write the system in the form of several matrices:

$\begin{bmatrix}
3 & 2 & 0 \\
1 & -1 & 0 \\
0 & 5 & 1
\end{bmatrix}
\begin{bmatrix}
x\\
y\\
z
\end{bmatrix}
=
\begin{bmatrix}
2\\
4\\
-1
\end{bmatrix}$
This form is called the **$Ax = b$ form** of a system where $A$ is called the coefficient matrix, $x$ is the unknowns or solution vector, and $b$ is the constant vector. The general form of such a system with $m$ equations and $n$ unknowns is given by the following:
$\begin{bmatrix}
a_{11} & a_{21} & \cdots & a_{m1}\\
a_{12} & a_{22} & \cdots & a_{m2}\\
\vdots & \vdots & \ddots & \vdots\\
a_{1n} & a_{2n} & \cdots & a_{mn}
\end{bmatrix}\begin{bmatrix}
x_1\\
x_2\\
\vdots \\
x_n
\end{bmatrix}=\begin{bmatrix}
b_1\\
b_2\\
\vdots \\
b_m
\end{bmatrix}$

Whether a $Ax = b$ system has a unique solution or not is determined by Cramer’s Rule: A unique solution $x$ exists $\text{iff } A$ is invertible. The solution itself can be found using the rule $x_j = \frac{\text{det}{A_j}}{\text{det}{A}}$ where $A_j$ is the matrix $A$ with the $j$th column replaced by the constant vector $b$.

#### In $\mathbf{Ax = b}$, what happens when $\mathbf{A}$ is fat or tall?

For a system of equations $\mathbf{Ax}$, we first want to describe the fundamental theorem:

- For any system of equations, $\mathbf{Ax = b}$, only one of the following three possibilities hold: a) The system has no solution, b) The system has a unique solution, or c) The system has infinitely many solutions.
- For a homogeneous system of equations, $\mathbf{Ax}= 0$, only one of the following two possibilities hold: a) The system has a unique solution, or b) The system has infinitely many solutions.

When a system of equations $\mathbf{Ax = b}$ has at least one solution, we say that the system is **consistent**, otherwise **inconsistent**. We are concerned with determining when a system of equations is consistent and studying the number of solutions of a system of equations.

**Fat $\mathbf{A}$ Matrix**

Let $\mathbf{A}$ be an $m \times n$ fat matrix, that is, $n > m$. Then the following assertions hold.
a. $\text{rank}(\mathbf{A}) \leq m$.
b. The **homogeneous** system of equations $\mathbf{Ax}= 0$ has infinitely many solutions.
c. The system of equations $\mathbf{Ax = b}$ is either inconsistent or has infinitely many solutions (in other words, it *never has a unique solution*).
d. The system of equations $\mathbf{Ax = b}$ is consistent for every $\mathbf{b}$ in $\mathbb{R}^m$ if and only if $\text{rank}(\mathbf{A}) = m$.

**Tall $\mathbf{A}$ Matrix**

Let $\mathbf{A}$ be an $m \times n$ tall matrix, that is, $n < m$. Then the following assertions hold. a. $\text{rank}(\mathbf{A}) \leq n$. b. The homogeneous system of equations $\mathbf{Ax}= 0$ has a unique (namely, the trivial) solution if and only if $\text{rank}(\mathbf{A}) = n$ (in other words, it has infinitely many solutions if and only if $\text{rank}(\mathbf{A}) < n$). c. The system of equations $\mathbf{Ax = b}$ is inconsistent for some $\mathbf{b}$ in $\mathbb{R}^m$ (in other words, the system is never consistent for all $\mathbf{b}$ in $\mathbb{R}^m$). d. The system of equations $\mathbf{Ax = b}$ is has at most one solution for every $\mathbf{b}$ in $\mathbb{R}^m$ if and only if $\text{rank}(\mathbf{A}) = n$.

#### When does inverse of $A$ exist?

Suppose $\mathbf{A}$ is an $n \times n$ matrix. The inverse of $\mathbf{A}$ is another $n \times n$ matrix, denoted $\mathbf{A}^{-1}$, that satisfies the following conditions. $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n$ where $\mathbf{I}_n$ is the identity matrix. Not every square matrix has an inverse; but if a matrix does have an inverse, it is unique.

If the rank of an $n \times n$ matrix is less than n, the matrix does not have an inverse. When the determinant for a square matrix is equal to zero, the inverse for that matrix does not exist. A square matrix that has an inverse is said to be nonsingular or invertible.

#### What is a norm? What is L1, L2 and L infinity norm?

In Linear Algebra, a Norm refers to the total length of all the vectors in a space. There are multiple ways to measure the magnitude of vectors.

The $L_0$ norm is not actually a norm. It just corresponds to the total number of nonzero elements in a vector.

$L_1$ Norm is the sum of the magnitudes of the vectors in a space. It is the most natural way of measure distance between vectors, that is the sum of absolute difference of the components of the vectors. In this norm, all the components of the vector are weighted equally.

$L_2$ norm is the most popular norm, also known as the Euclidean norm. It is the shortest distance to go from one point to another.

The $L_\infty$ norm simply gives the largest magnitude among each element of a vector. For example, if we have the vector Having the vector $X = [-6, 4, 2]$, the $L_\infty$ norm would be 6.

#### What are the conditions a norm has to satisfy?

Given a vector space $V$ over a field $F$ of the real numbers $\mathbb{R}$ or complex numbers $\mathbb{C}$, a norm on $V$ is a nonnegative-valued function $p: V \rightarrow \mathbb{R}$ with the following properties:

For all $a \in F$ and all $\textbf{u}, \textbf{v} \in V$,

- $p(\textbf{u} + \textbf{v}) \leq p(\textbf{u}) + p(\textbf{v})$ (being
*subadditive*or satisfying the*triangle inequality*). - $p(a\textbf{v}) = | a | p(\textbf{v})$ (being
*absolutely homogeneous*or*absolutely scalable*). - If $p(\textbf{v}) = 0$ then $\textbf{v} = 0$ is the zero vector (being
*positive definite*or being*point-separating*).

#### Why is squared of L2 norm preferred in ML than just L2 norm?

The $L_2$ norm corresponds to Hilbert space. On raising power to cube ($L_3$), quad ($L_4$) or higher, the function becomes sensitive to the influence of outliers (since it gives more weight to the anomalous points) and thus introduces unwanted bias into distance calculation (you will get skewed results). Using squared $L_2$ norm, optimization functions in ML become tractable.

#### When L1 norm is preferred over L2 norm?

The $L_1$ norm likes sparse vectors (hence sets some components exactly to 0). Now “sets” doesn’t really mean anything since it’s not like the $L_1$ norm takes decisions, but in several optimisation problems if you use the $L_1$ norm you end up with sparse vectors. A prime example is the lasso regression, or in general any optimizations problem where a regularizer in the form $\lambda \| x \|_1$ is added. The solutions to the latter will be usually sparser than when using a regularizer in the form $\lambda \| x \|_2$

#### Can the number of nonzero elements in a vector be defined as L0 norm? If no, why?

The $L_0$ norm corresponds to the total number of nonzero elements in a vector, but it cannot actually be defined as a norm because it does not satisfy all the conditions a norm must have.

#### What is Frobenius norm?

The Frobenius norm $L_F$ measures the size of a Matrix $A$ as $L_F = \Big(\sum_{i,j} A_{ij}^2\Big)^{\frac{1}{2}}$

#### What is a diagonal matrix?

A diagonal matrix is a matrix with all non-diagonal elements being zero. We form a square diagonal matrix by moving vector elements into the diagonal position of the matrix. $M=\text{diag}(v)$ Provided $v$ has no element with zero value, we replace each diagonal element with $\frac{1}{v_i}$ to form its inverse M^{-1}.

#### Why is multiplication by diagonal matrix computationally cheap? How is the multiplication different for square vs. non-square diagonal matrix?

Machine learning may approximate solutions with diagonal matrices because finding inverse or performing matrix multiplication is easy. Non-square diagonal matrices do not have an inverse matrix.

#### At what conditions does the inverse of a diagonal matrix exist?

The invert of a square diagonal matrix exists if all entries of the diagonal are non-zeros. If it is the case, the invert is easy to find. Also, the inverse doen’t exist if the matrix is non-square.

#### What is a symmetric matrix?

A symmetric matrix is a matrix with $A_{ij}=A_{ji}$. $A^T=A$ In machine learning, many equations in calculating elements between $i \leftrightarrow j$ are not directional ($f(x,y)=f(y,x)$). For example, the distance between 2 points is not directional. Hence, many matrices in machine learning is symmetrical. The inverse of a symmetric matrix is also symmetric. Any real $n \times n$ symmetric matrices can be decomposed into n eigenvalues and eigenvectors which is very desirable in matrix factorization. Symmetric matrix can easily decompose into orthogonal matrices: $A=Q \Lambda Q^T$ which its inverse equals to its transpose.

#### What is a unit vector?

A unit vector is a vector with $\|x\|=1$.

#### When are two vectors x and y orthogonal?

In Euclidean space, two vectors are orthogonal if and only if their dot product is zero, i.e. they make an angle of $90^{\circ}$ ($\pi /2$ radians), or one of the vectors is zero. Hence orthogonality of vectors is an extension of the concept of perpendicular vectors to spaces of any dimension.

#### At $R^n$ what is the maximum possible number of orthogonal vectors with non-zero norm?

Orthogonal vectors are perpendicular. It’s easy to see that if you are in $R^2$ you can only have two vectors which are perpendicular. Similarly, in $R^3$ you can only have three perpendicular vectors. Orthogonality extends this notion of perpendicular to higher dimensions. So given that interpretation you can see that in $R^n$ you can only have $n$ orthogonal vectors. You would need to add another dimension to the space to have an additional orthogonal vector, thus $n$ is maximal.

#### When are two vectors x and y orthonormal?

A set of vectors are orthonormal if and only if the inner product of any two vectors are zero.

#### What is an orthogonal matrix? Why is computationally preferred?

An orthogonal matrix is a square matrix whose rows (columns) are mutually orthonormal. i.e. no dot products of 2 row vectors (column vectors) are 0. $A^T A = A A^T = I$ For an orthogonal matrix, there is one important property. Its inverse is the transpose which is very easy to compute: $A A^T = I \Rightarrow A^{-1} A A^T = A^{-1} I \Rightarrow A^T = A^{-1}$ $A^T = A^{-1}$ Also orthogonal matrices $Q$ does not amplify errors which is very desirable. $\| Qx \|^2_2 = (Qx)^T Qx = x^TQ^T Qx = x^T x = \| x \|^2_2$ The size of the multiplication result $\|Qx\|$ has the same size as $\|x\|$. If we multiple x with orthogonal matrices, the errors present in $x$ will not be amplified by the multiplication. i.e. it is more numerically stable.

Decomposing matrices into smaller components helps us to solve some problems faster with better numeric stability. Here is the pseudo code to use SVD to decompose matrix into orthogonal matrices and solve the linear equation with the result. $Ax =b \\ (U,D,V) \leftarrow \text{SVD}(A) \\ A^{+} = V D^{+}U^T \\ x = A^{+} y$ where $D^+$ takes the reciprocal $\frac{1}{x_i}$ of the non-zero elements of $D$ and then transpose.

#### What are eigendecomposition, eigenvectors and eigenvalues?

Scalar $\lambda$ and vector $v$ are the eigenvalue and eigenvector of $A$ respectively if $Av= \lambda v$. A $n \times n$ matrix has at most $n$ eigenvalues and eigenvectors. A matrix is singular iff any eigenvalues are $0$.

In eigendecomposition, we decompose an initial matrix into the product of its eigenvectors and eigenvalues.

#### How to find eigenvalues of a matrix?

To find the eigenvalues and eigenvectors for $A$,

$Av = \lambda v$

$(A - \lambda I) v = 0$

$\implies \text{det}(A - \lambda I) = 0$

#### Write the eigendecomposition formula for a matrix. If the matrix is real symmetric, how will this change?

Matrix decomposition decompose a matrix into special type of matrices for easy manipulation in solving problems like linear equations. But eigendecomposition is only defined for square matrices.

Say $A$ has $n$ eigenvalues $\lambda_1, \lambda_2, ..., \lambda_n$. We concatenate all values to form a column vector $\lambda=[\lambda_1, \lambda_2, ... , \lambda_n]^T$. $A$ also has $n$ eigenvectors $v_1,v_2, ...,v_n$. We compose a matrix $V$ with $v_i$ as the column $i$ of the matrix. $V=[v^{(1)},...,v^{(n)}]$. The eigen decomposition of $A$ is:

$A = V \text{diag}(\lambda)V^{-1}$

Not every $A$ has eigendecomposition. But in deep learning, we often due with real symmetric metrics. Real symmetric metrics are eigendecomposable and the equation can be further simplify to:

$A=Q \Lambda Q^T$

which $Q$ is an orthogonal matrix composed of eigenvectors of $A$. $\Lambda$ is a diagonal matrix. The value of diagonal element $\Lambda_{ii}$ is the eigenvalue of the corresponding eigenvector in column $i$ of $Q$. We do not specify the order of eigenvectors. Therefore different order of $v^{(i)}$ creates different $V$. By convention, we re-arrange the column order $v^{(i)}$ in $v$ by the descending sorting order of its eigenvalue $\lambda_i$. It helps us to decompose $A$ in a more deterministic way.

In eigendecomposition, we decompose the matrix $A$ into different eigenvectors $v^{(i)}$ scaled by the eigenvalue $\lambda_i$. Therefore, for any vectors pointing at the same direction of eigenvector $v^{(i)}$, $Ax$ scales $x$ by the corresponding eigenvalue $\lambda_i$.

For a quadratic equation in the form of $f(x)=x^TAx$, if $x$ is an unit vector equal to $v^{(i)}$, $f(x)$ will be equal to the eigenvalues $\lambda_i$. Therefore, the max and min of $f(x)$ is the max and min of the eigenvalues.

#### Is the Eigendecomposition guaranteed to be unique? If not, then how do we represent it?

A matrix does not necessarily have distinct eigenvalues (although almost all do), and a matrix does not necessarily have a single eigenvalue with multipicity n. In fact, given any set of n values, you can construct a matrix with those values as eigenvalues (indeed just take the corresponding diagonal matrix).

#### What are positive definite, negative definite, positive semi definite and negative semi definite matrices?

If all eigenvalues of $A$ are positive, the matrix is positive definite.

If all eigenvalues of $A$ are positive or zero, the matrix is positive semi-deﬁnite.

If all eigenvalues of $A$ are negative, the matrix is negative definite.

If a matrix is positive semi-deﬁnite, $x^TAx \geq 0$. If a matrix is positive definite and $x^TAx=0$, it implies $x=0$.

Positive definiteness or negative definiteness helps us to solve optimization problem. Quadratic equation $x^TAx$ with positive definite matrices $A$ are always positive for non-zero $x$ and the function is convex. i.e. it guarantees the existences of the global minimum. This allows us to use Hessian matrix to solve the optimization problem. Similar arguments hold true for negative definite.

#### What is Singular Value Decomposition? Why do we use it? Why not just use ED?

SVD factorizes a matrix into singular vectors and singular values. $A=UDV^T$

- A is a m×n matrix. (Does not need to be a square matrix like eigendecomposition.)
- Left-singular vector: $U$ is an $m \times m$ orthogonal matrix (the eigenvectors of $AA^T$)
- Right-singular vector: $V$ is a $n \times n$ orthogonal matrix (the eigenvectors of $A^TA$)
- Singular values: $D$ is a $m \times n$ diagonal matrix (square roots of the eigenvalues of $AA^T$ and $A^TA$)

SVD is a powerful but expensive matrix factorization method. In numerical linear algebra, many problems can be solved to represent $A$ in this form.

Every real matrix has an SVD, but this is not true for eigendecomposition.

#### Given a matrix A, how will you calculate its Singular Value Decomposition?

$A$ is a matrix that can be seen as a linear transformation. This transformation can be decomposed in three sub-transformations: 1. rotation, 2. re-scaling, 3. rotation. These three steps correspond to the three matrices $U$, $D$, and $V$.

The matrices $U$, $D$ and $V$ can be found by transforming $A$ in a square matrix and by computing the eigenvectors of this square matrix. The square matrix can be obtain by multiplying the matrix A by its transpose in one way or the other:

- $U$ corresponds to the eigenvectors of $AA^T$
- $V$ corresponds to the eigenvectors of $A^TA$
- $D$ corresponds to the eigenvalues $AA^T$ or $A^TA$ which are the same.

#### What are singular values, left singulars and right singulars?

We will decompose A into 3 matrices (instead of two with eigendecomposition): $A=UDV^T$

The matrices $U$, $D$, and $V$ have the following properties:

- $U$ and $V$ are orthogonal matrices ($U^T=U^{-1}$ and $V^T=V^{-1}$)
- $D$ is a diagonal matrix. However $D$ is not necessarily square.

The columns of $U$ are called the left-singular vectors of $A$ while the columns of $V$ are the right-singular vectors of $A$. The values along the diagonal of $D$ are the singular values of $A$.

#### What is the connection of Singular Value Decomposition of A with functions of A?

In higher mathematics, a function is said to be linear if $f(x+y)=f(x)+f(y)$. The process of SVD decomposition can be seen as a function, which we will call $S$, which take in a matrix and returns three matrices: $S(A)=(U,\Sigma,V)$. The three matrices that are returned have the property that $A=U \Sigma V^T$. However, given two matrices of the same size, say $A$ and $B$, $S(A+B)\neq S(A)+S(B)$. In other words, the $U$ in the SVD decomposition for $A+B$ is not equal to the $U$ in the SVD decomposition for $A$ plus the $U$ in the SVD decomposition for $B$. The same is true for the $\Sigma$ and the $V$.

Linear algebra has the moniker “linear” because all linear functions that act on vectors of finite dimension can be represented by matrices, and all matrices represent linear functions acting on vectors. However, certain aspects of linear algebra are nonlinear. For example, the function that accepts a matrix and returns its square: $f(A)=A^2$ is nonlinear because $(A+B)^2\neq A^2+B^2$ in general. So SVD is a linear algebra topic because it involves breaking up a single linear action into three simpler linear actions, but the function that takes each matrix to its decomposition is nonlinear.

#### Why are singular values always non-negative?

If $A$ has real entries then $A^TA$ is positive semidefinite, since $\langle A^TAv,v\rangle=\langle Av,Av\rangle\geq 0$ for all $v$. Therefore the eigenvalues of $A^TA$ are non-negative.

#### What is the Moore Penrose pseudoinverse and how to calculate it?

The Moore-Penrose pseudoinverse is a direct application of the SVD. But before all, we have to remind that systems of equations can be expressed under the matrix form.

The inverse of a matrix $A$ can be used to solve the equation $Ax=b$:

$A^{-1}Ax=A^{-1}b \\ I_nx=A^{-1}b \\ x=A^{-1}b$

But in the case where the set of equations have $0$ or many solutions the inverse cannot be found and the equation cannot be solved. The pseudoinverse is $A^+$ such as:

$AA^+ \approx I_n$

minimizing

$||AA^+-I_n||2$

The following formula can be used to find the pseudoinverse:

$A^+=VD^+UT$

with $U$, $D$ and $V$ respectively the left singular vectors, the singular values and the right singular vectors of $A$. $A^+$ is the pseudoinverse of $A$ and $D^+$ the pseudoinverse of $D$. We saw that $D$ is a diagonal matrix and thus $D^+$ can be calculated by taking the reciprocal of the non zero values of $D$.

#### If we do Moore Penrose pseudo inverse on Ax = b, what solution is provided is A is fat? Moreover, what solution is provided if A is tall?

A non-square matrix does not have an inverse, but if it is full rank ($\text{rank}(A)=n$), it can have either a right pseudo-inverse or left pseudo-inverse:

Given a matrix $A \in R^{m \times n}$, let us define:

- $A$ is a
**fat matrix**if $m \leq n$ and $\text{null}(A^T)={0}$ - $A$ is a
**tall matrix**is $m \geq n$ and $\text{range}(A)=R^n$

Using the finite rank lemma, we can find that:

- When $A$ is a fat matrix, there exists a right inverse $A_R^{-1} \in R^{n \times m}$ such that $AA_R^{-1} = I_m$. The right pseudo-inverse is not unique, but an example is $A_R^{-1}=A^T(AA^T)^{-1}$
- When $A$ is a tall matrix, there exists a left inverse $A_L^{-1} \in R^{n \times m}$ such that $A_L^{-1}A = I_n$.The left pseudo-inverse is not unique, but an example is $A_L^{-1}=(A^TA)^{-1}A^T$

#### Which matrices can be decomposed by ED?

The eigendecomposition can only exists for square matrices, and even among square matrices sometimes it doesn’t exist.

#### Which matrices can be decomposed by SVD?

The SVD always exists for any sort of rectangular or square matrix.

#### What is the trace of a matrix?

The trace is the sum of all values in the diagonal of a square matrix.

For example: $A= \begin{bmatrix} 2 & 9 & 8 \\\\ 4 & 7 & 1 \\\\ 8 & 2 & 5 \end{bmatrix}$

$\mathrm{Tr}(A) = 2 + 7 + 5 = 14$

#### How to write Frobenius norm of a matrix A in terms of trace?

$\text{tr}(\mathrm A \mathrm B) = \left< \mathrm A^T, \mathrm B \right> = \left< \text{vec} (\mathrm A^T), \text{vec} (\mathrm B) \right> \leq \|\text{vec} (\mathrm A^T)\|_2 \|\text{vec} (\mathrm B)\|_2 = \|\mathrm A\|_F \|\mathrm B\|_F$

#### Why is trace of a multiplication of matrices invariant to cyclic permutations?

More generally, the trace is invariant under cyclic permutations, that is,

$\operatorname {tr} (\mathbf {A} \mathbf {B} \mathbf {C} \mathbf {D} )=\operatorname {tr} (\mathbf {B} \mathbf {C} \mathbf {D} \mathbf {A} )=\operatorname {tr} (\mathbf {C} \mathbf {D} \mathbf {A} \mathbf {B} )=\operatorname {tr} (\mathbf {D} \mathbf {A} \mathbf {B} \mathbf {C} ).$

This is known as the cyclic property.

Arbitrary permutations are not allowed: in general, $\operatorname {tr} (\mathbf {A} \mathbf {B} \mathbf {C} )\neq \operatorname {tr} (\mathbf {A} \mathbf {C} \mathbf {B} ).$

However, if products of three symmetric matrices are considered, any permutation is allowed, since: $\operatorname {tr} (\mathbf {A} \mathbf {B} \mathbf {C} )=\operatorname {tr} \left(\left(\mathbf {A} \mathbf {B} \mathbf {C} \right)^{\mathsf {T}}\right)=\operatorname {tr} (\mathbf {C} \mathbf {B} \mathbf {A} )=\operatorname {tr} (\mathbf {A} \mathbf {C} \mathbf {B} ),$

where the first equality is because the traces of a matrix and its transpose are equal.

#### What is the trace of a scalar?

The trace of a square matrix is the sum of its diagonal elements.

The trace enjoys several properties that are often very useful when proving results in matrix algebra and its applications.

Let us start with a formal definition.

Definition Let $A$ be a $K \times K$ matrix. Then, its trace, denoted by $\text{trace}(A)$ or $\text{tr}(A)$, is the sum of its diagonal elements:

$tr(A) = \sum_{k=1}^KA_{KK}$

#### Write the frobenius norm of a matrix in terms of trace?

The Frobenius norm, sometimes also called the Euclidean norm (a term unfortunately also used for the vector $L^2$-norm), is matrix norm of an $m×n$ matrix $A$ defined as the square root of the sum of the absolute squares of its elements,

$||A||_F=\sqrt{\sum_{i=1}^m\sum_{j=1}^n|a_{ij}|^2}$

The Frobenius norm can also be considered as a vector norm.

## Numerical Optimization

#### What is underflow and overflow?

Overflow is when the absolute value of the number is too high for the computer to represent it.

Underflow is when the absolute value of the number is too close to zero for the computer to represent it.

You can get overflow with both integers and floating point numbers. You can only get underflow with floating point numbers.

To get an overflow, repeatedly multiply a number by ten. To get an underflow repeatedly divide it by ten.

If the variable x is a signed byte it can have values in the range `-128`

to `+127`

, then

```
x = 127
x = x + 1
```

will result in an overflow. `+128`

is not a valid value for x.

For floating point numbers, the range depends on their representation. If x is a single precision (32-bit IEEE) number, then

```
x = 1e-38
x = x / 1000
```

will result in an underflow as `1e-42`

is not a valid value for x.

Consider softmax function, commonly used to transform a group of real values to “probabilities”: $\text{softmax}(x)_i = \frac{e^{x_i}}{\sum^n_{j=1} e^{x_j}}$. In applied use-cases, $e^x$ can be very very large or very very small. Analytically, if $x_i = c$ for all $i$, then $\text{softmax}(x)_i = \frac{1}{d}$. Numerically, this may not occur when $|c|$ is large. A positive $c$ causes overflow. A negative $c$ causes underflow and divide-by-zero error.

#### How to tackle the problem of underflow or overflow for softmax function or log softmax function?

To avoid overflow or underflow, we can transform $x_j$ with $\max(x)$:

$m = \max(x) \\ x_j \rightarrow x_j - m = x^{'}_j$

Here is the proof that this transformation is valid: $\text{softmax}(x)_i = \frac{e^{x_i}}{\sum e^{x_j}} = \frac{e^{-m} e^{x_i}}{e^{-m} \sum e^{x_j}} = \frac{e^{x_i - m}}{\sum e^{x_j - m}} = \frac{e^{x^{'}_i}}{\sum e^{x^{'}_j}}$

There can be no overflow, as $\exp(\text{largest attribute of }x) = 1$. The denominator is at least $1$, so there’s no divide-by-zero error.

#### What is poor conditioning?

Conditioning measures how rapidly the output changed with tiny changes in input.

For example, in a linear equation, we can use the inverse matrix $A^{-1}$ to solve $x$.

$Ax = b \\ x = A^{-1} b$

Nevertheless it is not commonly done in machine learning because $A^{-1}$ is slow to compute, and worse, $A^{-1}$ may amplify input errors rapidly.

For the function:

$f(b) = A^{-1}b \quad \text{where } A \in \mathbb{R}^{n \times n}$.

Condition number is defined as: $\underset{i, j}{\max} \left| \frac{\lambda_i}{\lambda_j} \right| \quad \text{where } \lambda_i, \lambda_j \text{ are the eigenvalues for } A.$ Poorly conditioned matrix $A$ is a matrix with a high condition number. $A^{-1}$ amplifies input errors. Small errors in $x$ can change the output of $A^{-1}x$ rapidly.

Other methods including matrix factorization can replace the matrix inversion method to avoid the poor conditioning to improve the numerical stability.

#### What is the condition number?

The ratio $C$ of the largest to smallest singular value in the singular value decomposition of a matrix. The base-$b$ logarithm of $C$ is an estimate of how many base-$b$ digits are lost in solving a linear system with that matrix. In other words, it estimates worst-case loss of precision. A system is said to be singular if the condition number is infinite, and ill-conditioned if it is too large, where “too large” means roughly $log(C)\>~$ the precision of matrix entries.

#### What are grad, div and curl?

If $U(r) = U(x, y, z)$ is a scalar field, ie a scalar function of position $r = [x, y, z]$ in 3 dimensions, then its gradient at any point is defined in Cartesian co-ordinates by $\text{grad}(U) = \nabla U = \frac{\partial U}{\partial x} \mathbf{\hat{i}} + \frac{\partial U}{\partial y} \mathbf{\hat{j}} + \frac{\partial U}{\partial z} \mathbf{\hat{k}}$. It’s important to note that $\nabla$ is a vector operator that makes $\nabla U$ a vector field. What this means is that the gradient of a scalar field tends to point in the direction of greatest change of the field.

The divergence computes a scalar quantity from a vector field by differentiation (the reverse of what was done with the gradient).

If $a(x, y, z)$ is a vector function of position in 3 dimensions, that is $\mathbf{a} = a_i \mathbf{\hat{i}} + a_2 \mathbf{\hat{j}} + a_3 \mathbf{\hat{k}}$, then its divergence at any point is defined in Cartesian co-ordinates by $\text{div} \mathbf{a} = \frac{\partial a_1}{\partial x} + \frac{\partial a_2}{\partial y} + \frac{\partial a_3}{\partial z} = \left ( \mathbf{\hat{i}}\frac{\partial}{\partial x} + \mathbf{\hat{j}}\frac{\partial}{\partial y} + \mathbf{\hat{k}}\frac{\partial}{\partial z} \right ) \cdot \mathbf{a} = \nabla \cdot \mathbf{a}$ It’s worth noting that the divergence of a vector field is a scalar field.

If grad is the $\nabla$ operator applied to a scalar field as $\nabla U$, and div is the dot product of $\nabla$ and a vector field as $\nabla \cdot \mathbf{a}$, then the curl is the cross product of $\nabla$ and a vector field $\nabla \times \mathbf{a}$. The curl can be described using the pseudo-determinant recipe for vector products as $\text{curl}(\textbf{a}) \equiv \nabla \times \textbf{a} = \begin{vmatrix} \mathbf{\hat{i}} & \mathbf{\hat{j}} & \mathbf{\hat{k}}\\ \frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z} \\ a_x & a_y & a_z \end{vmatrix} = \left ( \frac{\partial a_z}{\partial y} - \frac{\partial a_y}{\partial z} \right ) \mathbf{\hat{i}} + \left ( \frac{\partial a_x}{\partial z} - \frac{\partial a_z}{\partial x} \right ) \mathbf{\hat{j}} + \left ( \frac{\partial a_y}{\partial x} - \frac{\partial a_x }{\partial y} \right ) \mathbf{\hat{k}}$

#### What are critical or stationary points in multi-dimensions?

A critical point is a place where there is potentially a maximum or a minimum. This can happen if the derivative is zero, or if the function is not differentiable at a point (there could be a vertex as in the absolute value function). A stationary point is just where the derivative is zero. If you think of the derivative as a velocity, then those are places where the velocity is zero, and something with zero velocity is stationary.

All stationary points are critical points but not all critical points are stationary points.

#### Why should you do gradient descent when you want to minimize a function?

There are two main reasons for using gradient descent for minimizing a function:

- Sometimes we cannot solve the minimization problem by hand. This includes situations where we have an explicit form for the function but the equations for the local minima are not able to be solved in closed form and also situations where we don’t have an explicit form for the function (but of course we can still compute it).
- Sometimes even in situations where we have a nice formula for the minimum, it can be computationally more efficient to use gradient descent than to compute the formula. This includes very high-dimensional linear regressions. The formula for the least squares estimator is well-known but it involves inverting a potentially large matrix. At a certain point gradient descent starts to be faster, cause it contains less expensive operations even though you must iterate them many times to find the minimum.

#### What is line search?

In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum $\mathbf {x}^{*}$ of an objective function $f: \mathbb {R} ^{n}\to \mathbb {R}$.

The line search approach first finds a descent direction along which the objective function $f$ will be reduced and then computes a step size that determines how far $\mathbf{x}$ should move along that direction. The descent direction can be computed by various methods, such as gradient descent, Newton’s method and quasi-Newton method. The step size can be determined either exactly or inexactly.

#### What is hill climbing?

Hill Climbing is a heuristic search used for mathematical optimization problems in the field of Artificial Intelligence.

Given a large set of inputs and a good heuristic function, it tries to find a sufficiently good solution to the problem. This solution may not be the global optimal maximum.

- In the above definition,
**mathematical optimization problems**implies that hill-climbing solves the problems where we need to maximize or minimize a given real function by choosing values from the given inputs. Example-Travelling salesman problem where we need to minimize the distance traveled by the salesman. - ‘Heuristic search’ means that this search algorithm may not find the optimal solution to the problem. However, it will give a good solution in
**reasonable time.** - A
**heuristic function**is a function that will rank all the possible alternatives at any branching step in search algorithm based on the available information. It helps the algorithm to select the best route out of possible routes.

Hill Clibing has two main features: 1) It is a variant of generate and test algorithm:

- Generate possible solutions.
- Test to see if this is the expected solution.
- If the solution has been found quit else go to step 1.

as it takes the feedback from the test procedure while the generator utilizes the feedback in deciding the next move in search space. 2) It uses a greedy approach. At any point in state space, the search moves in that direction only which optimizes the cost of function with the hope of finding the optimal solution at the end.

#### What is a Jacobian matrix?

$f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ $J = \begin{pmatrix} \frac{\partial f_1}{ \partial x_1} & \frac{\partial f_1}{ \partial x_2} & \dots & \frac{\partial f_1}{ \partial x_n} \\ \frac{\partial f_2}{ \partial x_1} & \frac{\partial f_2}{ \partial x_2} & \dots & \frac{\partial f_2}{ \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{ \partial x_1} & \frac{\partial f_m}{ \partial x_2} & \dots & \frac{\partial f_m}{ \partial x_n} \\ \end{pmatrix}$

For example: $f: \mathbb{R}^{3} \rightarrow \mathbb{R}^{2} \\ f(x,y,z)=(xy+ y, 2xz) \\$

$f_1 = xy+ y f_2 = 2xyz$

$\frac{\partial f_2}{ \partial x_2} = \frac{\partial f_2}{ \partial y} = 2xz$

#### What is curvature?

The curvature is related to the second derivative.

For a two-dimensional curve written in the form $y=f(x)$, the equation of curvature becomes $\kappa=\frac{\frac{d^2y}{dx^2}}{[1+(\frac{dy}{dx})^2]^\frac{3}{2}}$.

If the two-dimensional curve is instead parameterized in polar coordinates, then $\kappa=\frac{r^2+2r_\theta^2-rr_{\theta\theta}}{(r^2+r_\theta^2)^\frac{3}{2}},$

#### What is a Hessian matrix?

The first derivative measures the gradient and the second derivative measures the curvature.

When $f''(x)<0$, $f(x)$ curves down and when $f''(x)>0$, $f(x) curves up$.

The second derivative indicates whether a gradient step drops the cost as much as the gradient alone may imply. For example, at $x=1.0$ (the orange dot on the right below), the gradient is positive and the cost