The Math Behind ML (the important stuff)

Important mathematical prerequisites for getting into Machine Learning, Deep Learning, or any of the other space

Important mathematical prerequisites for getting into Machine Learning, Deep Learning, or any of the other space

Table of Contents

Linear Algebra

Numerical Optimization

Basics of Probability and Information Theory

Confidence interval

Linear Algebra

What is broadcasting in connection to Linear Algebra?

Arrays with different sizes cannot be added, subtracted, or generally be used in arithmetic. Two matrices are compatible if the corresponding dimensions in each matrix (rows vs rows, columns vs columns) meet the following requirements:

  1. The dimensions are equal, or
  2. One dimension is of size 1

A way to overcome this requirement is to duplicate the smaller array so that it is the dimensionality and size as the larger array.

What are scalars, vectors, matrices, and tensors?

  • A scalar is a single number
  • A vector is an array of numbers. x=[x1x2xn]\mathbf{x} =\begin{bmatrix} x_1 \\\\ x_2 \\\\ \vdots \\\\ x_n \end{bmatrix}
  • A matrix is a 2-D array A=[A1,1A1,2A1,nA2,1A2,2A2,nAm,1Am,2Am,n]\mathbf{A}= \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\\\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\\\ \vdots & \vdots & \ddots & \vdots \\\\ A_{m,1} & A_{m,2} & \cdots & A_{m,n} \end{bmatrix}
  • A tensor is a nn-dimensional array with n>2n>2

What is the Hadamard product of two matrices?

Hadamard product of matrices is an elementwise operation. Values that correspond positionally are multiplied to produce a new matrix.

What is an inverse matrix?

The inverse of a square matrix AA, sometimes called a reciprocal matrix, is a matrix A1A^{-1} such that AA1=IAA^{-1}=I, where II is the identity matrix.

A square matrix AA has an inverse iff the determinant A0|A|\neq 0 (Lipschutz 1991, p. 45). The so-called invertible matrix theorem is major result in linear algebra which associates the existence of a matrix inverse with a number of other equivalent properties. A matrix possessing an inverse is called nonsingular, or invertible.

If inverse of a matrix exists, how do you calculate it?

For a 2×22×2 matrix A=[abcd]A= \begin{bmatrix} a & b \\ c & d \end{bmatrix}, the matrix inverse is

A1=1A[dbca]=1adbc[dbca]A^{-1} = \frac{1}{|A|}\begin{bmatrix}d & -b \\ -c & a\end{bmatrix} \\ = \frac{1}{ad-bc}\begin{bmatrix}d & -b \\ -c & a\end{bmatrix}

For a 3×33×3 matrix

A=[a11a12a13a21a22a23a31a32a33]A= \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}

the matrix inverse is

A1=1A[a22a23a32a33a13a12a33a32a12a13a22a23a23a21a33a31a11a13a31a33a13a11a23a21a21a22a31a32a12a11a32a31a11a12a21a22]A^{-1}=\frac{1}{|A|} \begin{bmatrix} \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} & \begin{vmatrix} a_{13} & a_{12} \\ a_{33} & a_{32} \end{vmatrix} & \begin{vmatrix} a_{12} & a_{13} \\ a_{22} & a_{23} \end{vmatrix} \\ \begin{vmatrix} a_{23} & a_{21} \\ a_{33} & a_{31} \end{vmatrix} & \begin{vmatrix} a_{11} & a_{13} \\ a_{31} & a_{33} \end{vmatrix} & \begin{vmatrix} a_{13} & a_{11} \\ a_{23} & a_{21}\end{vmatrix} \\ \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} & \begin{vmatrix} a_{12} & a_{11} \\ a_{32} & a_{31} \end{vmatrix} & \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}\end{bmatrix}

A general n×nn×n matrix can be inverted using methods such as the Gauss-Jordan elimination, Gaussian elimination, or LU decomposition.

What is the determinant of a square matrix? How is it calculated? What is the connection of determinant to eigenvalues?

Determinants are mathematical objects that are very useful in the analysis and solution of systems of linear equations. As shown by Cramer’s rule, a nonhomogeneous system of linear equations has a unique solution iff the determinant of the system’s matrix is nonzero (i.e., the matrix is nonsingular). For example, eliminating xx, yy, and zz from the equations

a1x+a2y+a3z=0b1x+b2y+b3z=0c1x+c2y+c3z=0a_1x+a_2y+a_3z = 0 \\ b_1x+b_2y+b_3z = 0 \\ c_1x+c_2y+c_3z = 0

gives the expression

a1b2c3a1b3c2+a2b3c1a2b1c3+a3b1c2a3b2c1=0a_1b_2c_3-a_1b_3c_2+a_2b_3c_1-a_2b_1c_3+a_3b_1c_2-a_3b_2c_1=0, which is called the determinant for this system of equation. Determinants are defined only for square matrices.

If the determinant of a matrix is 00, the matrix is said to be singular, and if the determinant is 11, the matrix is said to be unimodular.

The determinant of a matrix AA,

a1a2anb1b2bnz1z2zn\begin{vmatrix}a_1 & a_2 & \dots & a_n \\ b_1 & b_2 & \dots & b_n \\ \vdots & \vdots & \ddots & \vdots\\ z_1 & z_2 & \dots & z_n\end{vmatrix}

is commonly denoted det(A)\det(A), A|A|, or in component notation as +/a1b2c3...\sum{+/-a_1b_2c_3 ...}, D(a1b2c3...)D(a_1b_2c_3 ...), or a1b2c3...|a_1b_2c_3...| (Muir 1960, p. 17). Note that the notation det(A)\det(A) may be more convenient when indicating the absolute value of a determinant, i.e., det(A)|\det(A)| instead of A||A||.

A 2×22×2 determinant is defined to be

det[ab;cd]=ab;cd=adbcdet[a b; c d]=|a b; c d|=ad-bc.

A k×kk×k determinant can be expanded “by minors” to obtain

a11a12a13a1ka21a22a23a2kak1ak2ak3akk=a11a22a23a2kak2ak3akka12a21a23a2kak1ak3akk+...±a1ka21a22a2(k1)ak1ak2ak(k1)\begin{vmatrix}a_{11} & a_{12} & a_{13} & \dots & a_{1k} \\ a_{21} & a_{22} & a_{23} & \dots & a_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \\ a_{k1} & a_{k2} & a_{k3} & \dots & a_{kk} \end{vmatrix}=a_{11} \begin{vmatrix}a_{22} & a_{23} & \dots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k2} & a_{k3} & \dots & a_{kk} \end{vmatrix} \\ -a_{12} \begin{vmatrix} a_{21} & a_{23} & \dots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k3} & \dots & a_{kk} \end{vmatrix} +... \pm a_{1k} \begin{vmatrix}a_{21} & a_{22} & \dots & a_{2(k-1)} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & \dots a_{k2} & \dots & a_{k(k-1)} \end{vmatrix}.

A general determinant for a matrix AA has a value A=i=1kaijCij|A|=\sum_{i=1}^ka_{ij}C_{ij}, with no implied summation over jj and where CijC_{ij} (also denoted aija^{ij}) is the cofactor of aija_{ij} defined by Cij=1i+jMijC_{ij}={-1}^{i+j}M_{ij} and MijM_{ij} is the minor of matrix AA formed by eliminating row ii and column jj from AA. This process is called determinant expansion by minors (or “Laplacian expansion by minors,” sometimes further shortened to simply “Laplacian expansion”).

Discuss span and linear dependence.

The span of vectors v1v_1, v2v_2, …, vnv_n is the set of linear combinations c1v1+c2v2+...+cnvnc_1v_1 + c_2v_2 + ... + c_nv_n, and that this is a vector space.

We can take this idea further. If VV is a vector space and S={v1,v2,...,vn}S = \{v_1, v_2, ... , v_n\} is a subset of VV, then is Span(S)\text{Span}(S) equal to VV? We say that SS spans VV if every vector vv in VV can be written as a linear combination of vectors in SS (e.g., v=c1v1+c2v2+...+cnvnv = c_1v_1 + c_2v_2 + ... + c_nv_n).

The next question we ask after this is whether there are any reduncancies. I.e., is there a smaller subset of SS that also spans Span(S)\text{Span}(S). If so, then we can write one of the vectors as a linear combination of the others (like so: vi=c1v1+c2v2+...+ci1vi1+ci+1vi+1+...+cnvnvi = c1v1 + c2v2 + ... + ci -1vi -1 + ci+1vi+1 + ... + cnvn)

If one of the vectors within SS can be written as a linear combination like this, then SS is a linearly dependent set. If we can’t write any of the vectors as a linear combination of the others in SS, then SS is linearly independent

What is Ax=bAx = b? When does Ax=bAx =b have a unique solution?

Suppose we want to solve a system of equation like 3x+2y=2,xy=4,5y+z=13x + 2y = 2, x-y = 4, 5y + z = -1 (where we want to find solutions for xx, yy, and zz). We can solve this using methods like substitution or elimination, but these are only tractable when we have few equations or few unknowns. If we want to use a more efficient method for larger numbers of unknowns or equations, we can write the system in the form of several matrices:

[320110051][xyz]=[241]\begin{bmatrix} 3 & 2 & 0 \\ 1 & -1 & 0 \\ 0 & 5 & 1 \end{bmatrix} \begin{bmatrix} x\\ y\\ z \end{bmatrix} = \begin{bmatrix} 2\\ 4\\ -1 \end{bmatrix} This form is called the Ax=bAx = b form of a system where AA is called the coefficient matrix, xx is the unknowns or solution vector, and bb is the constant vector. The general form of such a system with mm equations and nn unknowns is given by the following: [a11a21am1a12a22am2a1na2namn][x1x2xn]=[b1b2bm]\begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1}\\ a_{12} & a_{22} & \cdots & a_{m2}\\ \vdots & \vdots & \ddots & \vdots\\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix}\begin{bmatrix} x_1\\ x_2\\ \vdots \\ x_n \end{bmatrix}=\begin{bmatrix} b_1\\ b_2\\ \vdots \\ b_m \end{bmatrix}

Whether a Ax=bAx = b system has a unique solution or not is determined by Cramer’s Rule: A unique solution xx exists iff A\text{iff } A is invertible. The solution itself can be found using the rule xj=detAjdetAx_j = \frac{\text{det}{A_j}}{\text{det}{A}} where AjA_j is the matrix AA with the jjth column replaced by the constant vector bb.

In Ax=b\mathbf{Ax = b}, what happens when A\mathbf{A} is fat or tall?

For a system of equations Ax\mathbf{Ax}, we first want to describe the fundamental theorem:

  1. For any system of equations, Ax=b\mathbf{Ax = b}, only one of the following three possibilities hold: a) The system has no solution, b) The system has a unique solution, or c) The system has infinitely many solutions.
  2. For a homogeneous system of equations, Ax=0\mathbf{Ax}= 0, only one of the following two possibilities hold: a) The system has a unique solution, or b) The system has infinitely many solutions.

When a system of equations Ax=b\mathbf{Ax = b} has at least one solution, we say that the system is consistent, otherwise inconsistent. We are concerned with determining when a system of equations is consistent and studying the number of solutions of a system of equations.

Fat A\mathbf{A} Matrix

Let A\mathbf{A} be an m×nm \times n fat matrix, that is, n>mn > m. Then the following assertions hold. a. rank(A)m\text{rank}(\mathbf{A}) \leq m. b. The homogeneous system of equations Ax=0\mathbf{Ax}= 0 has infinitely many solutions. c. The system of equations Ax=b\mathbf{Ax = b} is either inconsistent or has infinitely many solutions (in other words, it never has a unique solution). d. The system of equations Ax=b\mathbf{Ax = b} is consistent for every b\mathbf{b} in Rm\mathbb{R}^m if and only if rank(A)=m\text{rank}(\mathbf{A}) = m.

Tall A\mathbf{A} Matrix

Let A\mathbf{A} be an m×nm \times n tall matrix, that is, n<mn < m. Then the following assertions hold. a. rank(A)n\text{rank}(\mathbf{A}) \leq n. b. The homogeneous system of equations Ax=0\mathbf{Ax}= 0 has a unique (namely, the trivial) solution if and only if rank(A)=n\text{rank}(\mathbf{A}) = n (in other words, it has infinitely many solutions if and only if rank(A)<n\text{rank}(\mathbf{A}) < n). c. The system of equations Ax=b\mathbf{Ax = b} is inconsistent for some b\mathbf{b} in Rm\mathbb{R}^m (in other words, the system is never consistent for all b\mathbf{b} in Rm\mathbb{R}^m). d. The system of equations Ax=b\mathbf{Ax = b} is has at most one solution for every b\mathbf{b} in Rm\mathbb{R}^m if and only if rank(A)=n\text{rank}(\mathbf{A}) = n.

When does inverse of AA exist?

Suppose A\mathbf{A} is an n×nn \times n matrix. The inverse of A\mathbf{A} is another n×nn \times n matrix, denoted A1\mathbf{A}^{-1}, that satisfies the following conditions. AA1=A1A=In\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n where In\mathbf{I}_n is the identity matrix. Not every square matrix has an inverse; but if a matrix does have an inverse, it is unique.

If the rank of an n×nn \times n matrix is less than n, the matrix does not have an inverse. When the determinant for a square matrix is equal to zero, the inverse for that matrix does not exist. A square matrix that has an inverse is said to be nonsingular or invertible.

What is a norm? What is L1, L2 and L infinity norm?

In Linear Algebra, a Norm refers to the total length of all the vectors in a space. There are multiple ways to measure the magnitude of vectors.

The L0L_0 norm is not actually a norm. It just corresponds to the total number of nonzero elements in a vector.

L1L_1 Norm is the sum of the magnitudes of the vectors in a space. It is the most natural way of measure distance between vectors, that is the sum of absolute difference of the components of the vectors. In this norm, all the components of the vector are weighted equally.

L2L_2 norm is the most popular norm, also known as the Euclidean norm. It is the shortest distance to go from one point to another.

The LL_\infty norm simply gives the largest magnitude among each element of a vector. For example, if we have the vector Having the vector X=[6,4,2]X = [-6, 4, 2], the LL_\infty norm would be 6.

What are the conditions a norm has to satisfy?

Given a vector space VV over a field FF of the real numbers R\mathbb{R} or complex numbers C\mathbb{C}, a norm on VV is a nonnegative-valued function p:VRp: V \rightarrow \mathbb{R} with the following properties:

For all aFa \in F and all u,vV\textbf{u}, \textbf{v} \in V,

  1. p(u+v)p(u)+p(v)p(\textbf{u} + \textbf{v}) \leq p(\textbf{u}) + p(\textbf{v}) (being subadditive or satisfying the triangle inequality).
  2. p(av)=ap(v)p(a\textbf{v}) = | a | p(\textbf{v}) (being absolutely homogeneous or absolutely scalable).
  3. If p(v)=0p(\textbf{v}) = 0 then v=0\textbf{v} = 0 is the zero vector (being positive definite or being point-separating).

Why is squared of L2 norm preferred in ML than just L2 norm?

The L2L_2 norm corresponds to Hilbert space. On raising power to cube (L3L_3), quad (L4L_4) or higher, the function becomes sensitive to the influence of outliers (since it gives more weight to the anomalous points) and thus introduces unwanted bias into distance calculation (you will get skewed results). Using squared L2L_2 norm, optimization functions in ML become tractable.

When L1 norm is preferred over L2 norm?

The L1L_1 norm likes sparse vectors (hence sets some components exactly to 0). Now “sets” doesn’t really mean anything since it’s not like the L1L_1 norm takes decisions, but in several optimisation problems if you use the L1L_1 norm you end up with sparse vectors. A prime example is the lasso regression, or in general any optimizations problem where a regularizer in the form λx1\lambda \| x \|_1 is added. The solutions to the latter will be usually sparser than when using a regularizer in the form λx2\lambda \| x \|_2

Can the number of nonzero elements in a vector be defined as L0 norm? If no, why?

The L0L_0 norm corresponds to the total number of nonzero elements in a vector, but it cannot actually be defined as a norm because it does not satisfy all the conditions a norm must have.

What is Frobenius norm?

The Frobenius norm LFL_F measures the size of a Matrix AA as LF=(i,jAij2)12L_F = \Big(\sum_{i,j} A_{ij}^2\Big)^{\frac{1}{2}}

What is a diagonal matrix?

A diagonal matrix is a matrix with all non-diagonal elements being zero. We form a square diagonal matrix by moving vector elements into the diagonal position of the matrix. M=diag(v)M=\text{diag}(v) Provided vv has no element with zero value, we replace each diagonal element with 1vi\frac{1}{v_i} to form its inverse M^{-1}.

Why is multiplication by diagonal matrix computationally cheap? How is the multiplication different for square vs. non-square diagonal matrix?

Machine learning may approximate solutions with diagonal matrices because finding inverse or performing matrix multiplication is easy. Non-square diagonal matrices do not have an inverse matrix.

At what conditions does the inverse of a diagonal matrix exist?

The invert of a square diagonal matrix exists if all entries of the diagonal are non-zeros. If it is the case, the invert is easy to find. Also, the inverse doen’t exist if the matrix is non-square.

What is a symmetric matrix?

A symmetric matrix is a matrix with Aij=AjiA_{ij}=A_{ji}. AT=AA^T=A In machine learning, many equations in calculating elements between iji \leftrightarrow j are not directional (f(x,y)=f(y,x)f(x,y)=f(y,x)). For example, the distance between 2 points is not directional. Hence, many matrices in machine learning is symmetrical. The inverse of a symmetric matrix is also symmetric. Any real n×nn \times n symmetric matrices can be decomposed into n eigenvalues and eigenvectors which is very desirable in matrix factorization. Symmetric matrix can easily decompose into orthogonal matrices: A=QΛQTA=Q \Lambda Q^T which its inverse equals to its transpose.

What is a unit vector?

A unit vector is a vector with x=1\|x\|=1.

When are two vectors x and y orthogonal?

In Euclidean space, two vectors are orthogonal if and only if their dot product is zero, i.e. they make an angle of 9090^{\circ} (π/2\pi /2 radians), or one of the vectors is zero. Hence orthogonality of vectors is an extension of the concept of perpendicular vectors to spaces of any dimension.

At RnR^n what is the maximum possible number of orthogonal vectors with non-zero norm?

Orthogonal vectors are perpendicular. It’s easy to see that if you are in R2R^2 you can only have two vectors which are perpendicular. Similarly, in R3R^3 you can only have three perpendicular vectors. Orthogonality extends this notion of perpendicular to higher dimensions. So given that interpretation you can see that in RnR^n you can only have nn orthogonal vectors. You would need to add another dimension to the space to have an additional orthogonal vector, thus nn is maximal.

When are two vectors x and y orthonormal?

A set of vectors are orthonormal if and only if the inner product of any two vectors are zero.

What is an orthogonal matrix? Why is computationally preferred?

An orthogonal matrix is a square matrix whose rows (columns) are mutually orthonormal. i.e. no dot products of 2 row vectors (column vectors) are 0. ATA=AAT=IA^T A = A A^T = I For an orthogonal matrix, there is one important property. Its inverse is the transpose which is very easy to compute: AAT=IA1AAT=A1IAT=A1A A^T = I \Rightarrow A^{-1} A A^T = A^{-1} I \Rightarrow A^T = A^{-1} AT=A1A^T = A^{-1} Also orthogonal matrices QQ does not amplify errors which is very desirable. Qx22=(Qx)TQx=xTQTQx=xTx=x22\| Qx \|^2_2 = (Qx)^T Qx = x^TQ^T Qx = x^T x = \| x \|^2_2 The size of the multiplication result Qx\|Qx\| has the same size as x\|x\|. If we multiple x with orthogonal matrices, the errors present in xx will not be amplified by the multiplication. i.e. it is more numerically stable.

Decomposing matrices into smaller components helps us to solve some problems faster with better numeric stability. Here is the pseudo code to use SVD to decompose matrix into orthogonal matrices and solve the linear equation with the result. Ax=b(U,D,V)SVD(A)A+=VD+UTx=A+yAx =b \\ (U,D,V) \leftarrow \text{SVD}(A) \\ A^{+} = V D^{+}U^T \\ x = A^{+} y where D+D^+ takes the reciprocal 1xi\frac{1}{x_i} of the non-zero elements of DD and then transpose.

What are eigendecomposition, eigenvectors and eigenvalues?

Scalar λ\lambda and vector vv are the eigenvalue and eigenvector of AA respectively if Av=λvAv= \lambda v. A n×nn \times n matrix has at most nn eigenvalues and eigenvectors. A matrix is singular iff any eigenvalues are 00.

In eigendecomposition, we decompose an initial matrix into the product of its eigenvectors and eigenvalues.

How to find eigenvalues of a matrix?

To find the eigenvalues and eigenvectors for AA,

Av=λvAv = \lambda v

(AλI)v=0(A - \lambda I) v = 0

    det(AλI)=0\implies \text{det}(A - \lambda I) = 0

Write the eigendecomposition formula for a matrix. If the matrix is real symmetric, how will this change?

Matrix decomposition decompose a matrix into special type of matrices for easy manipulation in solving problems like linear equations. But eigendecomposition is only defined for square matrices.

Say AA has nn eigenvalues λ1,λ2,...,λn\lambda_1, \lambda_2, ..., \lambda_n. We concatenate all values to form a column vector λ=[λ1,λ2,...,λn]T\lambda=[\lambda_1, \lambda_2, ... , \lambda_n]^T. AA also has nn eigenvectors v1,v2,...,vnv_1,v_2, ...,v_n. We compose a matrix VV with viv_i as the column ii of the matrix. V=[v(1),...,v(n)]V=[v^{(1)},...,v^{(n)}]. The eigen decomposition of AA is:

A=Vdiag(λ)V1A = V \text{diag}(\lambda)V^{-1}

Not every AA has eigendecomposition. But in deep learning, we often due with real symmetric metrics. Real symmetric metrics are eigendecomposable and the equation can be further simplify to:

A=QΛQTA=Q \Lambda Q^T

which QQ is an orthogonal matrix composed of eigenvectors of AA. Λ\Lambda is a diagonal matrix. The value of diagonal element Λii\Lambda_{ii} is the eigenvalue of the corresponding eigenvector in column ii of QQ. We do not specify the order of eigenvectors. Therefore different order of v(i)v^{(i)} creates different VV. By convention, we re-arrange the column order v(i)v^{(i)} in vv by the descending sorting order of its eigenvalue λi\lambda_i. It helps us to decompose AA in a more deterministic way.

In eigendecomposition, we decompose the matrix AA into different eigenvectors v(i)v^{(i)} scaled by the eigenvalue λi\lambda_i. Therefore, for any vectors pointing at the same direction of eigenvector v(i)v^{(i)}, AxAx scales xx by the corresponding eigenvalue λi\lambda_i.

For a quadratic equation in the form of f(x)=xTAxf(x)=x^TAx, if xx is an unit vector equal to v(i)v^{(i)}, f(x)f(x) will be equal to the eigenvalues λi\lambda_i. Therefore, the max and min of f(x)f(x) is the max and min of the eigenvalues.

Is the Eigendecomposition guaranteed to be unique? If not, then how do we represent it?

A matrix does not necessarily have distinct eigenvalues (although almost all do), and a matrix does not necessarily have a single eigenvalue with multipicity n. In fact, given any set of n values, you can construct a matrix with those values as eigenvalues (indeed just take the corresponding diagonal matrix).

What are positive definite, negative definite, positive semi definite and negative semi definite matrices?

If all eigenvalues of AA are positive, the matrix is positive definite.

If all eigenvalues of AA are positive or zero, the matrix is positive semi-definite.

If all eigenvalues of AA are negative, the matrix is negative definite.

If a matrix is positive semi-definite, xTAx0x^TAx \geq 0. If a matrix is positive definite and xTAx=0x^TAx=0, it implies x=0x=0.

Positive definiteness or negative definiteness helps us to solve optimization problem. Quadratic equation xTAxx^TAx with positive definite matrices AA are always positive for non-zero xx and the function is convex. i.e. it guarantees the existences of the global minimum. This allows us to use Hessian matrix to solve the optimization problem. Similar arguments hold true for negative definite.

What is Singular Value Decomposition? Why do we use it? Why not just use ED?

SVD factorizes a matrix into singular vectors and singular values. A=UDVTA=UDV^T

  • A is a m×n matrix. (Does not need to be a square matrix like eigendecomposition.)
  • Left-singular vector: UU is an m×mm \times m orthogonal matrix (the eigenvectors of AATAA^T)
  • Right-singular vector: VV is a n×nn \times n orthogonal matrix (the eigenvectors of ATAA^TA)
  • Singular values: DD is a m×nm \times n diagonal matrix (square roots of the eigenvalues of AATAA^T and ATAA^TA)

SVD is a powerful but expensive matrix factorization method. In numerical linear algebra, many problems can be solved to represent AA in this form.

Every real matrix has an SVD, but this is not true for eigendecomposition.

Given a matrix A, how will you calculate its Singular Value Decomposition?

AA is a matrix that can be seen as a linear transformation. This transformation can be decomposed in three sub-transformations: 1. rotation, 2. re-scaling, 3. rotation. These three steps correspond to the three matrices UU, DD, and VV.

The matrices UU, DD and VV can be found by transforming AA in a square matrix and by computing the eigenvectors of this square matrix. The square matrix can be obtain by multiplying the matrix A by its transpose in one way or the other:

  • UU corresponds to the eigenvectors of AATAA^T
  • VV corresponds to the eigenvectors of ATAA^TA
  • DD corresponds to the eigenvalues AATAA^T or ATAA^TA which are the same.

What are singular values, left singulars and right singulars?

We will decompose A into 3 matrices (instead of two with eigendecomposition): A=UDVTA=UDV^T

The matrices UU, DD, and VV have the following properties:

  • UU and VV are orthogonal matrices (UT=U1U^T=U^{-1} and VT=V1V^T=V^{-1})
  • DD is a diagonal matrix. However DD is not necessarily square.

The columns of UU are called the left-singular vectors of AA while the columns of VV are the right-singular vectors of AA. The values along the diagonal of DD are the singular values of AA.

What is the connection of Singular Value Decomposition of A with functions of A?

In higher mathematics, a function is said to be linear if f(x+y)=f(x)+f(y)f(x+y)=f(x)+f(y). The process of SVD decomposition can be seen as a function, which we will call SS, which take in a matrix and returns three matrices: S(A)=(U,Σ,V)S(A)=(U,\Sigma,V). The three matrices that are returned have the property that A=UΣVTA=U \Sigma V^T. However, given two matrices of the same size, say AA and BB, S(A+B)S(A)+S(B)S(A+B)\neq S(A)+S(B). In other words, the UU in the SVD decomposition for A+BA+B is not equal to the UU in the SVD decomposition for AA plus the UU in the SVD decomposition for BB. The same is true for the Σ\Sigma and the VV.

Linear algebra has the moniker “linear” because all linear functions that act on vectors of finite dimension can be represented by matrices, and all matrices represent linear functions acting on vectors. However, certain aspects of linear algebra are nonlinear. For example, the function that accepts a matrix and returns its square: f(A)=A2f(A)=A^2 is nonlinear because (A+B)2A2+B2(A+B)^2\neq A^2+B^2 in general. So SVD is a linear algebra topic because it involves breaking up a single linear action into three simpler linear actions, but the function that takes each matrix to its decomposition is nonlinear.

Why are singular values always non-negative?

If AA has real entries then ATAA^TA is positive semidefinite, since ATAv,v=Av,Av0\langle A^TAv,v\rangle=\langle Av,Av\rangle\geq 0 for all vv. Therefore the eigenvalues of ATAA^TA are non-negative.

What is the Moore Penrose pseudoinverse and how to calculate it?

The Moore-Penrose pseudoinverse is a direct application of the SVD. But before all, we have to remind that systems of equations can be expressed under the matrix form.

The inverse of a matrix AA can be used to solve the equation Ax=bAx=b:

A1Ax=A1bInx=A1bx=A1bA^{-1}Ax=A^{-1}b \\ I_nx=A^{-1}b \\ x=A^{-1}b

But in the case where the set of equations have 00 or many solutions the inverse cannot be found and the equation cannot be solved. The pseudoinverse is A+A^+ such as:

AA+InAA^+ \approx I_n

minimizing

AA+In2||AA^+-I_n||2

The following formula can be used to find the pseudoinverse:

A+=VD+UTA^+=VD^+UT

with UU, DD and VV respectively the left singular vectors, the singular values and the right singular vectors of AA. A+A^+ is the pseudoinverse of AA and D+D^+ the pseudoinverse of DD. We saw that DD is a diagonal matrix and thus D+D^+ can be calculated by taking the reciprocal of the non zero values of DD.

If we do Moore Penrose pseudo inverse on Ax = b, what solution is provided is A is fat? Moreover, what solution is provided if A is tall?

A non-square matrix does not have an inverse, but if it is full rank (rank(A)=n\text{rank}(A)=n), it can have either a right pseudo-inverse or left pseudo-inverse:

Given a matrix ARm×nA \in R^{m \times n}, let us define:

  • AA is a fat matrix if mnm \leq n and null(AT)=0\text{null}(A^T)={0}
  • AA is a tall matrix is mnm \geq n and range(A)=Rn\text{range}(A)=R^n

Using the finite rank lemma, we can find that:

  • When AA is a fat matrix, there exists a right inverse AR1Rn×mA_R^{-1} \in R^{n \times m} such that AAR1=ImAA_R^{-1} = I_m. The right pseudo-inverse is not unique, but an example is AR1=AT(AAT)1A_R^{-1}=A^T(AA^T)^{-1}
  • When AA is a tall matrix, there exists a left inverse AL1Rn×mA_L^{-1} \in R^{n \times m} such that AL1A=InA_L^{-1}A = I_n.The left pseudo-inverse is not unique, but an example is AL1=(ATA)1ATA_L^{-1}=(A^TA)^{-1}A^T

Which matrices can be decomposed by ED?

The eigendecomposition can only exists for square matrices, and even among square matrices sometimes it doesn’t exist.

Which matrices can be decomposed by SVD?

The SVD always exists for any sort of rectangular or square matrix.

What is the trace of a matrix?

The trace is the sum of all values in the diagonal of a square matrix.

For example: A=[298471825]A= \begin{bmatrix} 2 & 9 & 8 \\\\ 4 & 7 & 1 \\\\ 8 & 2 & 5 \end{bmatrix}

Tr(A)=2+7+5=14\mathrm{Tr}(A) = 2 + 7 + 5 = 14

How to write Frobenius norm of a matrix A in terms of trace?

tr(AB)=<AT,B>=<vec(AT),vec(B)>vec(AT)2vec(B)2=AFBF\text{tr}(\mathrm A \mathrm B) = \left< \mathrm A^T, \mathrm B \right> = \left< \text{vec} (\mathrm A^T), \text{vec} (\mathrm B) \right> \leq \|\text{vec} (\mathrm A^T)\|_2 \|\text{vec} (\mathrm B)\|_2 = \|\mathrm A\|_F \|\mathrm B\|_F

Why is trace of a multiplication of matrices invariant to cyclic permutations?

More generally, the trace is invariant under cyclic permutations, that is,

tr(ABCD)=tr(BCDA)=tr(CDAB)=tr(DABC).{\displaystyle \operatorname {tr} (\mathbf {A} \mathbf {B} \mathbf {C} \mathbf {D} )=\operatorname {tr} (\mathbf {B} \mathbf {C} \mathbf {D} \mathbf {A} )=\operatorname {tr} (\mathbf {C} \mathbf {D} \mathbf {A} \mathbf {B} )=\operatorname {tr} (\mathbf {D} \mathbf {A} \mathbf {B} \mathbf {C} ).}

This is known as the cyclic property.

Arbitrary permutations are not allowed: in general, tr(ABC)tr(ACB).{\displaystyle \operatorname {tr} (\mathbf {A} \mathbf {B} \mathbf {C} )\neq \operatorname {tr} (\mathbf {A} \mathbf {C} \mathbf {B} ).}

However, if products of three symmetric matrices are considered, any permutation is allowed, since: tr(ABC)=tr((ABC)T)=tr(CBA)=tr(ACB),{\displaystyle \operatorname {tr} (\mathbf {A} \mathbf {B} \mathbf {C} )=\operatorname {tr} \left(\left(\mathbf {A} \mathbf {B} \mathbf {C} \right)^{\mathsf {T}}\right)=\operatorname {tr} (\mathbf {C} \mathbf {B} \mathbf {A} )=\operatorname {tr} (\mathbf {A} \mathbf {C} \mathbf {B} ),}

where the first equality is because the traces of a matrix and its transpose are equal.

What is the trace of a scalar?

The trace of a square matrix is the sum of its diagonal elements.

The trace enjoys several properties that are often very useful when proving results in matrix algebra and its applications.

Let us start with a formal definition.

Definition Let AA be a K×KK \times K matrix. Then, its trace, denoted by trace(A)\text{trace}(A) or tr(A)\text{tr}(A), is the sum of its diagonal elements:

tr(A)=k=1KAKKtr(A) = \sum_{k=1}^KA_{KK}

Write the frobenius norm of a matrix in terms of trace?

The Frobenius norm, sometimes also called the Euclidean norm (a term unfortunately also used for the vector L2L^2-norm), is matrix norm of an m×nm×n matrix AA defined as the square root of the sum of the absolute squares of its elements,

AF=i=1mj=1naij2||A||_F=\sqrt{\sum_{i=1}^m\sum_{j=1}^n|a_{ij}|^2}

The Frobenius norm can also be considered as a vector norm.

Numerical Optimization

What is underflow and overflow?

Overflow is when the absolute value of the number is too high for the computer to represent it.

Underflow is when the absolute value of the number is too close to zero for the computer to represent it.

You can get overflow with both integers and floating point numbers. You can only get underflow with floating point numbers.

To get an overflow, repeatedly multiply a number by ten. To get an underflow repeatedly divide it by ten.

If the variable x is a signed byte it can have values in the range -128 to +127, then

x = 127
x = x + 1

will result in an overflow. +128 is not a valid value for x.

For floating point numbers, the range depends on their representation. If x is a single precision (32-bit IEEE) number, then

x = 1e-38
x = x / 1000

will result in an underflow as 1e-42 is not a valid value for x.

Consider softmax function, commonly used to transform a group of real values to “probabilities”: softmax(x)i=exij=1nexj\text{softmax}(x)_i = \frac{e^{x_i}}{\sum^n_{j=1} e^{x_j}}. In applied use-cases, exe^x can be very very large or very very small. Analytically, if xi=cx_i = c for all ii, then softmax(x)i=1d\text{softmax}(x)_i = \frac{1}{d}. Numerically, this may not occur when c|c| is large. A positive cc causes overflow. A negative cc causes underflow and divide-by-zero error.

How to tackle the problem of underflow or overflow for softmax function or log softmax function?

To avoid overflow or underflow, we can transform xjx_j with max(x)\max(x):

m=max(x)xjxjm=xjm = \max(x) \\ x_j \rightarrow x_j - m = x^{'}_j

Here is the proof that this transformation is valid: softmax(x)i=exiexj=emexiemexj=eximexjm=exiexj\text{softmax}(x)_i = \frac{e^{x_i}}{\sum e^{x_j}} = \frac{e^{-m} e^{x_i}}{e^{-m} \sum e^{x_j}} = \frac{e^{x_i - m}}{\sum e^{x_j - m}} = \frac{e^{x^{'}_i}}{\sum e^{x^{'}_j}}

There can be no overflow, as exp(largest attribute of x)=1\exp(\text{largest attribute of }x) = 1. The denominator is at least 11, so there’s no divide-by-zero error.

What is poor conditioning?

Conditioning measures how rapidly the output changed with tiny changes in input.

For example, in a linear equation, we can use the inverse matrix A1A^{-1} to solve xx.

Ax=bx=A1bAx = b \\ x = A^{-1} b

Nevertheless it is not commonly done in machine learning because A1A^{-1} is slow to compute, and worse, A1A^{-1} may amplify input errors rapidly.

For the function:

f(b)=A1bwhere ARn×nf(b) = A^{-1}b \quad \text{where } A \in \mathbb{R}^{n \times n}.

Condition number is defined as: maxi,jλiλjwhere λi,λj are the eigenvalues for A.\underset{i, j}{\max} \left| \frac{\lambda_i}{\lambda_j} \right| \quad \text{where } \lambda_i, \lambda_j \text{ are the eigenvalues for } A. Poorly conditioned matrix AA is a matrix with a high condition number. A1A^{-1} amplifies input errors. Small errors in xx can change the output of A1xA^{-1}x rapidly.

Other methods including matrix factorization can replace the matrix inversion method to avoid the poor conditioning to improve the numerical stability.

What is the condition number?

The ratio CC of the largest to smallest singular value in the singular value decomposition of a matrix. The base-bb logarithm of CC is an estimate of how many base-bb digits are lost in solving a linear system with that matrix. In other words, it estimates worst-case loss of precision. A system is said to be singular if the condition number is infinite, and ill-conditioned if it is too large, where “too large” means roughly log(C) log(C)\>~ the precision of matrix entries.

What are grad, div and curl?

If U(r)=U(x,y,z)U(r) = U(x, y, z) is a scalar field, ie a scalar function of position r=[x,y,z]r = [x, y, z] in 3 dimensions, then its gradient at any point is defined in Cartesian co-ordinates by grad(U)=U=Uxi^+Uyj^+Uzk^\text{grad}(U) = \nabla U = \frac{\partial U}{\partial x} \mathbf{\hat{i}} + \frac{\partial U}{\partial y} \mathbf{\hat{j}} + \frac{\partial U}{\partial z} \mathbf{\hat{k}}. It’s important to note that \nabla is a vector operator that makes U\nabla U a vector field. What this means is that the gradient of a scalar field tends to point in the direction of greatest change of the field.

The divergence computes a scalar quantity from a vector field by differentiation (the reverse of what was done with the gradient).

If a(x,y,z)a(x, y, z) is a vector function of position in 3 dimensions, that is a=aii^+a2j^+a3k^\mathbf{a} = a_i \mathbf{\hat{i}} + a_2 \mathbf{\hat{j}} + a_3 \mathbf{\hat{k}}, then its divergence at any point is defined in Cartesian co-ordinates by diva=a1x+a2y+a3z=(i^x+j^y+k^z)a=a\text{div} \mathbf{a} = \frac{\partial a_1}{\partial x} + \frac{\partial a_2}{\partial y} + \frac{\partial a_3}{\partial z} = \left ( \mathbf{\hat{i}}\frac{\partial}{\partial x} + \mathbf{\hat{j}}\frac{\partial}{\partial y} + \mathbf{\hat{k}}\frac{\partial}{\partial z} \right ) \cdot \mathbf{a} = \nabla \cdot \mathbf{a} It’s worth noting that the divergence of a vector field is a scalar field.

If grad is the \nabla operator applied to a scalar field as U\nabla U, and div is the dot product of \nabla and a vector field as a\nabla \cdot \mathbf{a}, then the curl is the cross product of \nabla and a vector field ×a\nabla \times \mathbf{a}. The curl can be described using the pseudo-determinant recipe for vector products as curl(a)×a=i^j^k^xyzaxayaz=(azyayz)i^+(axzazx)j^+(ayxaxy)k^\text{curl}(\textbf{a}) \equiv \nabla \times \textbf{a} = \begin{vmatrix} \mathbf{\hat{i}} & \mathbf{\hat{j}} & \mathbf{\hat{k}}\\ \frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z} \\ a_x & a_y & a_z \end{vmatrix} = \left ( \frac{\partial a_z}{\partial y} - \frac{\partial a_y}{\partial z} \right ) \mathbf{\hat{i}} + \left ( \frac{\partial a_x}{\partial z} - \frac{\partial a_z}{\partial x} \right ) \mathbf{\hat{j}} + \left ( \frac{\partial a_y}{\partial x} - \frac{\partial a_x }{\partial y} \right ) \mathbf{\hat{k}}

What are critical or stationary points in multi-dimensions?

A critical point is a place where there is potentially a maximum or a minimum. This can happen if the derivative is zero, or if the function is not differentiable at a point (there could be a vertex as in the absolute value function). A stationary point is just where the derivative is zero. If you think of the derivative as a velocity, then those are places where the velocity is zero, and something with zero velocity is stationary.

All stationary points are critical points but not all critical points are stationary points.

Why should you do gradient descent when you want to minimize a function?

There are two main reasons for using gradient descent for minimizing a function:

  1. Sometimes we cannot solve the minimization problem by hand. This includes situations where we have an explicit form for the function but the equations for the local minima are not able to be solved in closed form and also situations where we don’t have an explicit form for the function (but of course we can still compute it).
  2. Sometimes even in situations where we have a nice formula for the minimum, it can be computationally more efficient to use gradient descent than to compute the formula. This includes very high-dimensional linear regressions. The formula for the least squares estimator is well-known but it involves inverting a potentially large matrix. At a certain point gradient descent starts to be faster, cause it contains less expensive operations even though you must iterate them many times to find the minimum.

What is line search?

In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum x\mathbf {x}^{*} of an objective function f:RnRf: \mathbb {R} ^{n}\to \mathbb {R}.

The line search approach first finds a descent direction along which the objective function ff will be reduced and then computes a step size that determines how far x\mathbf{x} should move along that direction. The descent direction can be computed by various methods, such as gradient descent, Newton’s method and quasi-Newton method. The step size can be determined either exactly or inexactly.

What is hill climbing?

Hill Climbing is a heuristic search used for mathematical optimization problems in the field of Artificial Intelligence.

Given a large set of inputs and a good heuristic function, it tries to find a sufficiently good solution to the problem. This solution may not be the global optimal maximum.

  • In the above definition, mathematical optimization problems implies that hill-climbing solves the problems where we need to maximize or minimize a given real function by choosing values from the given inputs. Example-Travelling salesman problem where we need to minimize the distance traveled by the salesman.
  • ‘Heuristic search’ means that this search algorithm may not find the optimal solution to the problem. However, it will give a good solution in reasonable time.
  • A heuristic function is a function that will rank all the possible alternatives at any branching step in search algorithm based on the available information. It helps the algorithm to select the best route out of possible routes.

Hill Clibing has two main features: 1) It is a variant of generate and test algorithm:

  1. Generate possible solutions.
  2. Test to see if this is the expected solution.
  3. If the solution has been found quit else go to step 1.

as it takes the feedback from the test procedure while the generator utilizes the feedback in deciding the next move in search space. 2) It uses a greedy approach. At any point in state space, the search moves in that direction only which optimizes the cost of function with the hope of finding the optimal solution at the end.

What is a Jacobian matrix?

f:RnRmf: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m} J=(f1x1f1x2f1xnf2x1f2x2f2xnfmx1fmx2fmxn)J = \begin{pmatrix} \frac{\partial f_1}{ \partial x_1} & \frac{\partial f_1}{ \partial x_2} & \dots & \frac{\partial f_1}{ \partial x_n} \\ \frac{\partial f_2}{ \partial x_1} & \frac{\partial f_2}{ \partial x_2} & \dots & \frac{\partial f_2}{ \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{ \partial x_1} & \frac{\partial f_m}{ \partial x_2} & \dots & \frac{\partial f_m}{ \partial x_n} \\ \end{pmatrix}

For example: f:R3R2f(x,y,z)=(xy+y,2xz)f: \mathbb{R}^{3} \rightarrow \mathbb{R}^{2} \\ f(x,y,z)=(xy+ y, 2xz) \\

f1=xy+yf2=2xyzf_1 = xy+ y f_2 = 2xyz

f2x2=f2y=2xz\frac{\partial f_2}{ \partial x_2} = \frac{\partial f_2}{ \partial y} = 2xz

What is curvature?

The curvature is related to the second derivative.

For a two-dimensional curve written in the form y=f(x)y=f(x), the equation of curvature becomes κ=d2ydx2[1+(dydx)2]32\kappa=\frac{\frac{d^2y}{dx^2}}{[1+(\frac{dy}{dx})^2]^\frac{3}{2}}.

If the two-dimensional curve is instead parameterized in polar coordinates, then κ=r2+2rθ2rrθθ(r2+rθ2)32,\kappa=\frac{r^2+2r_\theta^2-rr_{\theta\theta}}{(r^2+r_\theta^2)^\frac{3}{2}},

What is a Hessian matrix?

The first derivative measures the gradient and the second derivative measures the curvature.

When f(x)<0f''(x)<0, f(x)f(x) curves down and when f(x)>0f''(x)>0, f(x)curvesupf(x) curves up.

The second derivative indicates whether a gradient step drops the cost as much as the gradient alone may imply. For example, at x=1.0x=1.0 (the orange dot on the right below), the gradient is positive and the cost