The Math Behind ML (the important stuff)
Important mathematical prerequisites for getting into Machine Learning, Deep Learning, or any of the related spaces
This post is Part 1 of the 4-part Machine Learning Research Interview Handbook series (you can see the rest of the series here).
Table of Contents
- What is broadcasting in connection to Linear Algebra?
- What are scalars, vectors, matrices, and tensors?
- What is the Hadamard product of two matrices?
- What is an inverse matrix?
- If inverse of a matrix exists, how do you calculate it?
- What is the determinant of a square matrix? How is it calculated? What is the connection of determinant to eigenvalues?
- Discuss span and linear dependence.
- What is Ax = b? When does Ax = b have a unique solution?
- In Ax = b, what happens when A is fat or tall?
- When does inverse of A exist?
- What is a norm? What is L1, L2 and L infinity norm?
- What are the conditions a norm has to satisfy?
- Why is the squared L2 norm preferred in ML over the plain L2 norm?
- When is the L1 norm preferred over the L2 norm?
- Can the number of nonzero elements in a vector be defined as the L0 norm? If not, why?
- What is Frobenius norm?
- What is a diagonal matrix?
- Why is multiplication by diagonal matrix computationally cheap? How is the multiplication different for square vs. non-square diagonal matrix?
- At what conditions does the inverse of a diagonal matrix exist?
- What is a symmetric matrix?
- What is a unit vector?
- When are two vectors x and y orthogonal?
- In R^n, what is the maximum possible number of orthogonal vectors with non-zero norm?
- When are two vectors x and y orthonormal?
- What is an orthogonal matrix? Why is it computationally preferred?
- What are eigendecomposition, eigenvectors and eigenvalues?
- How to find eigenvalues of a matrix?
- Write the eigendecomposition formula for a matrix. If the matrix is real symmetric, how will this change?
- Is the Eigendecomposition guaranteed to be unique? If not, then how do we represent it?
- What are positive definite, negative definite, positive semi definite and negative semi definite matrices?
- What is Singular Value Decomposition? Why do we use it? Why not just use ED?
- Given a matrix A, how will you calculate its Singular Value Decomposition?
- What are singular values, left singulars and right singulars?
- What is the connection of Singular Value Decomposition of A with functions of A?
- Why are singular values always non-negative?
- What is the Moore Penrose pseudo inverse and how to calculate it?
- If we do Moore Penrose pseudo inverse on Ax = b, what solution is provided if A is fat? Moreover, what solution is provided if A is tall?
- Which matrices can be decomposed by ED?
- Which matrices can be decomposed by SVD?
- What is the trace of a matrix?
- How to write Frobenius norm of a matrix A in terms of trace?
- Why is trace of a multiplication of matrices invariant to cyclic permutations?
- What is the trace of a scalar?
- Write the Frobenius norm of a matrix in terms of trace.
- What is underflow and overflow?
- How to tackle the problem of underflow or overflow for softmax function or log softmax function?
- What is poor conditioning?
- What is the condition number?
- What are grad, div and curl?
- What are critical or stationary points in multi-dimensions?
- Why should you do gradient descent when you want to minimize a function?
- What is line search?
- What is hill climbing?
- What is a Jacobian matrix?
- What is curvature?
- What is a Hessian matrix?
Basics of Probability and Information Theory
- Compare “Frequentist probability” vs. “Bayesian probability”?
- What is a random variable?
- What is a probability distribution?
- What is a probability mass function?
- What is a probability density function?
- What is a joint probability distribution?
- What are the conditions for a function to be a probability mass function?
- What are the conditions for a function to be a probability density function?
- What is a marginal probability? Given the joint probability function, how will you calculate it?
- What is conditional probability? Given the joint probability function, how will you calculate it?
- State the Chain rule of conditional probabilities.
- What are the conditions for independence and conditional independence of two random variables?
- What are expectation, variance and covariance?
- Compare covariance and independence.
- What is the covariance for a vector of random variables?
- What is a Bernoulli distribution? Calculate the expectation and variance of a random variable that follows Bernoulli distribution?
- What is a multinoulli distribution?
- What is a normal distribution?
- Why is the normal distribution a default choice for a prior over a set of real numbers?
- What is the central limit theorem?
- What are exponential and Laplace distribution?
- What are Dirac distribution and Empirical distribution?
- What is mixture of distributions?
- Name two common examples of mixture of distributions.
- Is Gaussian mixture model a universal approximator of densities?
- Write the formulae for logistic and softplus function.
- Write the formulae for Bayes rule
- What do you mean by measure zero and almost everywhere?
- If two random variables are related in a deterministic way, how are the PDFs related?
- Define self-information. What are its units?
- What are Shannon entropy and differential entropy?
- What is Kullback-Leibler (KL) divergence?
- Can KL divergence be used as a distance measure?
- Define cross-entropy
- What are structured probabilistic models or graphical models?
- In the context of structured probabilistic models, what are directed and undirected models? How are they represented? What are cliques in undirected structured probabilistic models?
- What is population mean and sample mean?
- What is population standard deviation and sample standard deviation?
- Why does population s.d. have n degrees of freedom while sample s.d. has n-1 degrees of freedom? In other words, why is there n inside the root for pop. s.d. and n-1 inside the root for sample s.d.?
- What is the formula for calculating the s.d. of the sample mean?
- What is confidence interval?
- What is standard error?
Linear Algebra
What is broadcasting in connection to Linear Algebra?
(click the triangle to expand and reveal the full answer)
Arrays with different sizes cannot be added, subtracted, or generally be used in arithmetic. Two matrices are compatible if the corresponding dimensions in each matrix (rows vs rows, columns vs columns) meet the following requirements:
- The dimensions are equal, or
- One dimension is of size 1
A way to overcome this requirement is to duplicate the smaller array so that it has the same dimensionality and size as the larger array; broadcasting does this implicitly, without actually copying the data.
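For example, a minimal NumPy sketch of broadcasting (the shapes and values here are illustrative, not from the original post):

```python
import numpy as np

A = np.ones((3, 4))              # shape (3, 4)
b = np.array([1., 2., 3., 4.])   # shape (4,), compatible with A's last axis

# b is conceptually "stretched" along the first axis so its shape matches A;
# no actual copy of the data is made.
C = A + b                        # shape (3, 4)
print(C)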
What are scalars, vectors, matrices, and tensors?
(click the triangle to expand and reveal the full answer)
- A scalar is a single number
- A vector is an array of numbers.
- A matrix is a 2-D array
- A tensor is an $n$-dimensional array with $n > 2$
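In NumPy terms (an illustrative sketch, not from the original post), these objects differ only in their number of axes:

```python
import numpy as np

scalar = np.float64(3.0)           # 0-D: a single number
vector = np.array([1., 2., 3.])    # 1-D array
matrix = np.ones((2, 3))           # 2-D array
tensor = np.zeros((2, 3, 4))       # n-D array with n > 2

print(vector.ndim, matrix.ndim, tensor.ndim)  # 1 2 3
```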
What is the Hadamard product of two matrices?
(click the triangle to expand and reveal the full answer)
Hadamard product of matrices is an elementwise operation. Values that correspond positionally are multiplied to produce a new matrix.
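A quick NumPy sketch of the Hadamard (elementwise) product, contrasted with the ordinary matrix product; the example matrices are made up:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20], [30, 40]])

hadamard = A * B     # elementwise: [[10, 40], [90, 160]]
matmul   = A @ B     # ordinary matrix product, for contrast

print(hadamard)
print(matmul)
```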
What is an inverse matrix?
(click the triangle to expand and reveal the full answer)
The inverse of a square matrix $A$, sometimes called a reciprocal matrix, is a matrix $A^{-1}$ such that $AA^{-1} = I$, where $I$ is the identity matrix.
A square matrix has an inverse iff its determinant is nonzero (Lipschutz 1991, p. 45). The so-called invertible matrix theorem is a major result in linear algebra which associates the existence of a matrix inverse with a number of other equivalent properties. A matrix possessing an inverse is called nonsingular, or invertible.
If inverse of a matrix exists, how do you calculate it?
(click the triangle to expand and reveal the full answer)
For a $2 \times 2$ matrix
$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix},$$
the matrix inverse is
$$A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}.$$
For a $3 \times 3$ (or larger) matrix, the inverse is the transpose of the cofactor matrix divided by the determinant (the adjugate formula).
A general matrix can be inverted using methods such as the Gauss-Jordan elimination, Gaussian elimination, or LU decomposition.
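A hedged NumPy sketch checking the 2×2 closed-form inverse against np.linalg.inv; the matrix is an arbitrary invertible example:

```python
import numpy as np

A = np.array([[4., 7.], [2., 6.]])
a, b, c, d = A.ravel()

# Closed-form 2x2 inverse: (1/det) * [[d, -b], [-c, a]]
inv_closed_form = np.array([[d, -b], [-c, a]]) / (a * d - b * c)
inv_numpy = np.linalg.inv(A)      # LU-based general-purpose inverse

print(np.allclose(inv_closed_form, inv_numpy))  # True
print(np.allclose(A @ inv_numpy, np.eye(2)))    # A @ A^{-1} = I
```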
What is the determinant of a square matrix? How is it calculated? What is the connection of determinant to eigenvalues?
(click the triangle to expand and reveal the full answer)
Determinants are mathematical objects that are very useful in the analysis and solution of systems of linear equations. As shown by Cramer's rule, a nonhomogeneous system of linear equations has a unique solution iff the determinant of the system's matrix is nonzero (i.e., the matrix is nonsingular). For example, eliminating the three unknowns from a system of three linear equations produces a single expression in the coefficients, which is called the determinant for that system of equations. Determinants are defined only for square matrices.
If the determinant of a matrix is $0$, the matrix is said to be singular, and if the determinant is $1$, the matrix is said to be unimodular.
The determinant of a matrix $A$ is commonly denoted $\det(A)$, $|A|$, or in component notation as $|a_{ij}|$ (Muir 1960, p. 17).
A $2 \times 2$ determinant is defined to be
$$\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc.$$
A $3 \times 3$ determinant can be expanded "by minors" to obtain
$$\det A = a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}.$$
A general $n \times n$ determinant has the value
$$\det A = \sum_{j=1}^{n} a_{ij} C_{ij},$$
with no implied summation over $i$, where $C_{ij} = (-1)^{i+j} M_{ij}$ is the cofactor of $a_{ij}$ and $M_{ij}$ is the minor of $A$ formed by eliminating row $i$ and column $j$ from $A$. This process is called determinant expansion by minors (or "Laplacian expansion by minors," sometimes further shortened to simply "Laplacian expansion").
Connection to eigenvalues: the determinant of a square matrix equals the product of its eigenvalues, $\det A = \prod_i \lambda_i$, so $A$ is singular iff it has a zero eigenvalue.
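A small NumPy check, on an arbitrary example matrix, that the determinant equals the product of the eigenvalues:

```python
import numpy as np

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 4.]])

det = np.linalg.det(A)               # determinant via LU factorization
eigvals = np.linalg.eigvals(A)       # eigenvalues of A

print(det)
print(np.prod(eigvals).real)                    # product of eigenvalues
print(np.isclose(det, np.prod(eigvals).real))   # True
```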
Discuss span and linear dependence.
(click the triangle to expand and reveal the full answer)
The span of vectors $v_1, v_2, \dots, v_n$ is the set of all linear combinations $c_1 v_1 + c_2 v_2 + \dots + c_n v_n$, and this set is itself a vector space.
We can take this idea further. If $V$ is a vector space and $S$ is a subset of $V$, is $\operatorname{span}(S)$ equal to $V$? We say that $S$ spans $V$ if every vector in $V$ can be written as a linear combination of vectors in $S$.
The next question we ask after this is whether there are any redundancies, i.e., is there a smaller subset of $S$ that also spans $V$? If so, then we can write one of the vectors as a linear combination of the others, e.g. $v_k = c_1 v_1 + \dots + c_{k-1} v_{k-1}$.
If one of the vectors within $S$ can be written as a linear combination like this, then $S$ is a linearly dependent set. If we can't write any of the vectors as a linear combination of the others in $S$, then $S$ is linearly independent.
What is Ax = b? When does Ax = b have a unique solution?
(click the triangle to expand and reveal the full answer)
Suppose we want to solve a system of equations in several unknowns (say $x_1, x_2, \dots, x_n$). We can solve this using methods like substitution or elimination, but these are only tractable when we have few equations or few unknowns. If we want to use a more efficient method for larger numbers of unknowns or equations, we can write the system in the form of several matrices:
$$Ax = b$$
This is called the $Ax = b$ form of a system, where $A$ is the coefficient matrix, $x$ is the vector of unknowns (the solution vector), and $b$ is the constant vector. The general form of such a system with $m$ equations and $n$ unknowns has $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^{n}$, and $b \in \mathbb{R}^{m}$.
Whether a system has a unique solution or not is determined by Cramer's Rule: a unique solution exists iff $A$ is invertible. The solution itself can be found using the rule $x_i = \det(A_i)/\det(A)$, where $A_i$ is the matrix $A$ with the $i$-th column replaced by the constant vector $b$.
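A minimal NumPy sketch for a small illustrative system (Cramer's rule is the conceptual criterion above; in practice an LU-based solver is used):

```python
import numpy as np

A = np.array([[ 2.,  1., -1.],
              [-3., -1.,  2.],
              [-2.,  1.,  2.]])
b = np.array([8., -11., -3.])

# np.linalg.solve uses an LU factorization; it is preferred over computing
# A^{-1} explicitly or applying Cramer's rule.
x = np.linalg.solve(A, b)
print(x)                       # [ 2.  3. -1.]
print(np.allclose(A @ x, b))   # True
```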
In Ax = b, what happens when A is fat or tall?
(click the triangle to expand and reveal the full answer)
For a system of equations $Ax = b$, we first want to describe the fundamental theorem:
- For any system of equations, $Ax = b$, only one of the following three possibilities holds: a) The system has no solution, b) The system has a unique solution, or c) The system has infinitely many solutions.
- For a homogeneous system of equations, $Ax = 0$, only one of the following two possibilities holds: a) The system has a unique solution, or b) The system has infinitely many solutions.
When a system of equations has at least one solution, we say that the system is consistent, otherwise inconsistent. We are concerned with determining when a system of equations is consistent and studying the number of solutions of a system of equations.
Fat Matrix
Let $A$ be an $m \times n$ fat matrix, that is, $m < n$. Then the following assertions hold. a. $\operatorname{rank}(A) \le m < n$. b. The homogeneous system of equations $Ax = 0$ has infinitely many solutions. c. The system of equations $Ax = b$ is either inconsistent or has infinitely many solutions (in other words, it never has a unique solution). d. The system of equations $Ax = b$ is consistent for every $b$ in $\mathbb{R}^m$ if and only if $\operatorname{rank}(A) = m$.
Tall Matrix
Let $A$ be an $m \times n$ tall matrix, that is, $m > n$. Then the following assertions hold. a. $\operatorname{rank}(A) \le n < m$. b. The homogeneous system of equations $Ax = 0$ has a unique (namely, the trivial) solution if and only if $\operatorname{rank}(A) = n$ (in other words, it has infinitely many solutions if and only if $\operatorname{rank}(A) < n$). c. The system of equations $Ax = b$ is inconsistent for some $b$ in $\mathbb{R}^m$ (in other words, the system is never consistent for all $b$ in $\mathbb{R}^m$). d. The system of equations $Ax = b$ has at most one solution for every $b$ in $\mathbb{R}^m$ if and only if $\operatorname{rank}(A) = n$.
When does the inverse of A exist?
(click the triangle to expand and reveal the full answer)
Suppose $A$ is an $n \times n$ matrix. The inverse of $A$ is another $n \times n$ matrix, denoted $A^{-1}$, that satisfies $AA^{-1} = A^{-1}A = I$, where $I$ is the identity matrix. Not every square matrix has an inverse; but if a matrix does have an inverse, it is unique.
If the rank of an $n \times n$ matrix is less than $n$, the matrix does not have an inverse. When the determinant of a square matrix is equal to zero, the inverse of that matrix does not exist. A square matrix that has an inverse is said to be nonsingular or invertible.
What is a norm? What is L1, L2 and L infinity norm?
(click the triangle to expand and reveal the full answer)
In linear algebra, a norm is a function that measures the length or magnitude of a vector in a space. There are multiple ways to measure the magnitude of vectors.
The $L^0$ "norm" is not actually a norm. It just corresponds to the total number of nonzero elements in a vector.
The $L^1$ norm is the sum of the magnitudes of the vector's components. It is the most natural way of measuring distance between vectors: the sum of the absolute differences of the components of the vectors. In this norm, all the components of the vector are weighted equally.
The $L^2$ norm is the most popular norm, also known as the Euclidean norm. It is the shortest distance to go from one point to another.
The $L^\infty$ norm simply gives the largest magnitude among the elements of a vector. For example, for a vector such as $[1, -6, 3]$, the $L^\infty$ norm would be 6.
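A quick NumPy sketch computing these norms for the example vector above:

```python
import numpy as np

x = np.array([1., -6., 3.])

l0 = np.count_nonzero(x)              # "L0 norm": number of nonzero entries
l1 = np.linalg.norm(x, ord=1)         # sum of absolute values -> 10.0
l2 = np.linalg.norm(x, ord=2)         # Euclidean length -> sqrt(46)
linf = np.linalg.norm(x, ord=np.inf)  # largest magnitude -> 6.0

print(l0, l1, l2, linf)
```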
What are the conditions a norm has to satisfy?
(click the triangle to expand and reveal the full answer)
Given a vector space $V$ over a field $F$ of the real numbers $\mathbb{R}$ or complex numbers $\mathbb{C}$, a norm on $V$ is a nonnegative-valued function $p: V \to \mathbb{R}$ with the following properties:
For all $a \in F$ and all $u, v \in V$,
- $p(u + v) \le p(u) + p(v)$ (being subadditive or satisfying the triangle inequality).
- $p(av) = |a|\, p(v)$ (being absolutely homogeneous or absolutely scalable).
- If $p(v) = 0$ then $v$ is the zero vector (being positive definite or being point-separating).
Why is the squared L2 norm preferred in ML over the plain L2 norm?
(click the triangle to expand and reveal the full answer)
The $L^2$ norm corresponds to a Hilbert space. On raising the power to cube ($L^3$), quad ($L^4$) or higher, the function becomes sensitive to the influence of outliers (since it gives more weight to anomalous points) and thus introduces unwanted bias into distance calculations (you will get skewed results). Using the squared $L^2$ norm, optimization functions in ML become more tractable: it avoids the square root, is differentiable everywhere, and its derivative with respect to each element of $x$ depends only on that element, whereas the derivative of the plain $L^2$ norm depends on the entire vector.
When is the L1 norm preferred over the L2 norm?
(click the triangle to expand and reveal the full answer)
The $L^1$ norm "likes" sparse vectors (hence sets some components exactly to 0). Now "sets" doesn't really mean anything, since it's not like the norm takes decisions, but in several optimisation problems, if you use the $L^1$ norm you end up with sparse solutions. A prime example is lasso regression, or in general any optimization problem where a regularizer of the form $\lambda \|w\|_1$ is added. The solutions to the latter will usually be sparser than when using a regularizer of the form $\lambda \|w\|_2^2$.
Can the number of nonzero elements in a vector be defined as the L0 norm? If not, why?
(click the triangle to expand and reveal the full answer)
The $L^0$ "norm" corresponds to the total number of nonzero elements in a vector, but it cannot actually be defined as a norm because it does not satisfy all the conditions a norm must have. In particular, it is not absolutely homogeneous: scaling a vector by $\alpha \neq 0$ does not change the number of nonzero elements, so $\|\alpha x\|_0 \neq |\alpha|\,\|x\|_0$ in general.
What is Frobenius norm?
(click the triangle to expand and reveal the full answer)
The Frobenius norm measures the size of a matrix as $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$, the square root of the sum of the squares of all its entries.
What is a diagonal matrix?
(click the triangle to expand and reveal the full answer)
A diagonal matrix is a matrix with all non-diagonal elements being zero. We form a square diagonal matrix $M = \operatorname{diag}(v)$ by moving the elements of a vector $v$ into the diagonal positions of the matrix.
Provided $v$ has no element with zero value, we replace each diagonal element $v_i$ with $1/v_i$ to form its inverse $M^{-1}$.
Why is multiplication by diagonal matrix computationally cheap? How is the multiplication different for square vs. non-square diagonal matrix?
(click the triangle to expand and reveal the full answer)
Machine learning may approximate solutions with diagonal matrices because finding the inverse or performing matrix multiplication is cheap: multiplying a vector $x$ by a square diagonal matrix $\operatorname{diag}(v)$ just scales each element, $\operatorname{diag}(v)\,x = v \odot x$. For a non-square diagonal matrix, multiplication additionally truncates the result (if the output is shorter than $x$) or pads it with zeros (if it is longer). Non-square diagonal matrices do not have an inverse matrix.
At what conditions does the inverse of a diagonal matrix exist?
(click the triangle to expand and reveal the full answer)
The inverse of a square diagonal matrix exists if all entries of the diagonal are non-zero. If that is the case, the inverse is easy to find. Also, the inverse doesn't exist if the matrix is non-square.
What is a symmetric matrix?
(click the triangle to expand and reveal the full answer)
A symmetric matrix is a matrix with $A = A^T$.
In machine learning, many quantities computed between pairs of elements are not directional ($d(i, j) = d(j, i)$). For example, the distance between 2 points is not directional. Hence, many matrices in machine learning are symmetric. The inverse of a symmetric matrix is also symmetric. Any real symmetric $n \times n$ matrix can be decomposed into $n$ eigenvalues and eigenvectors, which is very desirable in matrix factorization. A symmetric matrix can easily be decomposed into orthogonal matrices, $A = Q \Lambda Q^T$, where $Q$'s inverse equals its transpose.
What is a unit vector?
(click the triangle to expand and reveal the full answer)
A unit vector is a vector with unit norm, $\|x\|_2 = 1$.
When are two vectors x and y orthogonal?
(click the triangle to expand and reveal the full answer)
In Euclidean space, two vectors are orthogonal if and only if their dot product is zero, i.e. they make an angle of $90°$ ($\pi/2$ radians), or one of the vectors is zero. Hence orthogonality of vectors is an extension of the concept of perpendicular vectors to spaces of any dimension.
In R^n, what is the maximum possible number of orthogonal vectors with non-zero norm?
(click the triangle to expand and reveal the full answer)
Orthogonal vectors are perpendicular. It's easy to see that if you are in $\mathbb{R}^2$ you can only have two vectors which are mutually perpendicular. Similarly, in $\mathbb{R}^3$ you can only have three mutually perpendicular vectors. Orthogonality extends this notion of perpendicularity to higher dimensions. So given that interpretation you can see that in $\mathbb{R}^n$ you can only have $n$ orthogonal vectors with non-zero norm. You would need to add another dimension to the space to have an additional orthogonal vector, thus $n$ is maximal.
When are two vectors x and y orthonormal?
(click the triangle to expand and reveal the full answer)
Two vectors $x$ and $y$ are orthonormal if and only if they are orthogonal (their inner product $x^T y$ is zero) and each has unit norm ($\|x\|_2 = \|y\|_2 = 1$). More generally, a set of vectors is orthonormal if the vectors are pairwise orthogonal and all have unit norm.
What is an orthogonal matrix? Why is it computationally preferred?
(click the triangle to expand and reveal the full answer)
An orthogonal matrix is a square matrix whose rows (and columns) are mutually orthonormal, i.e. the dot product of any two distinct row vectors (or column vectors) is 0 and each row (column) has unit norm.
For an orthogonal matrix, there is one important property: its inverse is its transpose, which is very easy to compute, $A^{-1} = A^T$, so $A A^T = A^T A = I$.
Also, orthogonal matrices do not amplify errors, which is very desirable.
The result $Ax$ has the same $L^2$ norm as $x$. If we multiply $x$ by orthogonal matrices, the errors present in $x$ will not be amplified by the multiplication, i.e. it is more numerically stable.
Decomposing matrices into smaller components helps us to solve some problems faster with better numeric stability. For example, we can use the SVD to decompose a matrix $A$ into orthogonal matrices and solve the linear equation $Ax = b$ with the result (see the sketch below):
$$x = V \Sigma^{+} U^T b,$$
where $\Sigma^{+}$ takes the reciprocal of the non-zero elements of $\Sigma$ and then transposes the result.
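The original post's pseudo code is not reproduced here; a minimal NumPy sketch of the same idea (solving $Ax = b$ through the SVD) might look like this, where the 1e-12 cutoff is an arbitrary tolerance choice:

```python
import numpy as np

def svd_solve(A, b):
    """Solve (or least-squares solve) A x = b using the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Reciprocal of the non-zero singular values (pseudoinverse of the
    # diagonal factor); tiny values are zeroed for numerical stability.
    s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ b))

A = np.array([[3., 1.], [1., 2.]])
b = np.array([9., 8.])
print(svd_solve(A, b))                                       # [2. 3.]
print(np.allclose(svd_solve(A, b), np.linalg.solve(A, b)))   # True
```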
What are eigendecomposition, eigenvectors and eigenvalues?
(click the triangle to expand and reveal the full answer)
A scalar $\lambda$ and a (non-zero) vector $v$ are an eigenvalue and eigenvector of $A$ respectively if $Av = \lambda v$. An $n \times n$ matrix has at most $n$ distinct eigenvalues and at most $n$ linearly independent eigenvectors. A matrix is singular iff any of its eigenvalues are $0$.
In eigendecomposition, we decompose an initial matrix into the product of its eigenvectors and eigenvalues.
How to find eigenvalues of a matrix?
(click the triangle to expand and reveal the full answer)
To find the eigenvalues of $A$, solve the characteristic equation $\det(A - \lambda I) = 0$ for $\lambda$. For each eigenvalue $\lambda_i$, the corresponding eigenvectors are the non-zero solutions $v$ of $(A - \lambda_i I)v = 0$.
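In practice, eigenvalues are computed numerically rather than by expanding the characteristic polynomial; a short NumPy sketch on an arbitrary example matrix:

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors
print(eigvals)                        # [5. 2.] (order may vary)

# Verify A v = lambda v for the first eigenpair
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))    # True
```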
Write the eigendecomposition formula for a matrix. If the matrix is real symmetric, how will this change?
(click the triangle to expand and reveal the full answer)
Matrix decomposition breaks a matrix into special types of matrices for easy manipulation in solving problems like linear equations. But eigendecomposition is only defined for square matrices.
Say $A$ has $n$ eigenvalues $\lambda_1, \dots, \lambda_n$. We concatenate all the values to form a column vector $\lambda = [\lambda_1, \dots, \lambda_n]^T$. $A$ also has $n$ eigenvectors $v_1, \dots, v_n$. We compose a matrix $V$ with $v_i$ as the $i$-th column of the matrix, $V = [v_1, \dots, v_n]$. The eigendecomposition of $A$ is:
$$A = V \operatorname{diag}(\lambda)\, V^{-1}$$
Not every $A$ has an eigendecomposition. But in deep learning, we often deal with real symmetric matrices. Real symmetric matrices are always eigendecomposable and the equation can be further simplified to:
$$A = Q \Lambda Q^T$$
where $Q$ is an orthogonal matrix composed of eigenvectors of $A$ and $\Lambda$ is a diagonal matrix. The diagonal element $\Lambda_{ii}$ is the eigenvalue of the corresponding eigenvector in column $i$ of $Q$. We do not specify the order of eigenvectors; therefore different orderings of the $v_i$ create different $V$. By convention, we re-arrange the column order in $V$ by the descending sorting order of the eigenvalues $\lambda_i$. This helps us to decompose $A$ in a more deterministic way.
In eigendecomposition, we decompose the matrix $A$ into different eigenvectors scaled by their eigenvalues. Therefore, for any vector pointing in the same direction as an eigenvector $v_i$, multiplication by $A$ scales it by the corresponding eigenvalue $\lambda_i$.
For a quadratic form $f(x) = x^T A x$ where $x$ is a unit vector, if $x$ is equal to an eigenvector $v_i$, $f(x)$ will be equal to the corresponding eigenvalue $\lambda_i$. Therefore, the max and min of $f(x)$ over unit vectors are the max and min of the eigenvalues.
Is the Eigendecomposition guaranteed to be unique? If not, then how do we represent it?
(click the triangle to expand and reveal the full answer)
An $n \times n$ matrix does not necessarily have $n$ distinct eigenvalues (although almost all do), and a matrix does not necessarily have a single eigenvalue with multiplicity $n$. In fact, given any set of $n$ values, you can construct a matrix with those values as eigenvalues (indeed, just take the corresponding diagonal matrix). The eigendecomposition is therefore not unique: whenever two or more eigenvectors share the same eigenvalue, any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue and could be chosen instead. By convention, we sort the entries of $\Lambda$ in descending order, which makes the representation deterministic up to such ties (and up to the sign/scale of each eigenvector).
What are positive definite, negative definite, positive semi definite and negative semi definite matrices?
(click the triangle to expand and reveal the full answer)
If all eigenvalues of $A$ are positive, the matrix is positive definite.
If all eigenvalues of $A$ are positive or zero, the matrix is positive semi-definite.
If all eigenvalues of $A$ are negative, the matrix is negative definite; if they are all negative or zero, the matrix is negative semi-definite.
If a matrix is positive semi-definite, $x^T A x \ge 0$ for all $x$. If a matrix is positive definite, then additionally $x^T A x = 0$ implies $x = 0$.
Positive definiteness or negative definiteness helps us to solve optimization problems. A quadratic form $x^T A x$ with a positive definite matrix $A$ is always positive for non-zero $x$ and the function is convex, i.e. it guarantees the existence of a global minimum. This allows us to use the Hessian matrix to solve the optimization problem. Similar arguments hold for negative definite matrices.
What is Singular Value Decomposition? Why do we use it? Why not just use ED?
(click the triangle to expand and reveal the full answer)
SVD factorizes a matrix into singular vectors and singular values: $A = U D V^T$.
- $A$ is an $m \times n$ matrix (it does not need to be a square matrix, unlike eigendecomposition).
- Left-singular vectors: $U$ is an $m \times m$ orthogonal matrix (the eigenvectors of $AA^T$).
- Right-singular vectors: $V$ is an $n \times n$ orthogonal matrix (the eigenvectors of $A^T A$).
- Singular values: $D$ is an $m \times n$ diagonal matrix (the square roots of the eigenvalues of $A^T A$ and $AA^T$).
SVD is a powerful but expensive matrix factorization method. In numerical linear algebra, many problems can be solved by representing the matrix in this form.
Every real matrix has an SVD, but this is not true for eigendecomposition.
Given a matrix A, how will you calculate its Singular Value Decomposition?
(click the triangle to expand and reveal the full answer)
$A$ is a matrix that can be seen as a linear transformation. This transformation can be decomposed into three sub-transformations: 1. rotation, 2. re-scaling, 3. rotation. These three steps correspond to the three matrices $U$, $D$, and $V$.
The matrices $U$, $D$ and $V$ can be found by transforming $A$ into a square matrix and by computing the eigenvectors of this square matrix. The square matrix can be obtained by multiplying the matrix $A$ by its transpose in one way or the other:
- $U$ corresponds to the eigenvectors of $AA^T$
- $V$ corresponds to the eigenvectors of $A^T A$
- $D$ corresponds to the square roots of the eigenvalues of $AA^T$ or $A^T A$, which are the same.
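A hedged NumPy sketch, on a made-up matrix, relating np.linalg.svd to the eigendecomposition of $AA^T$:

```python
import numpy as np

A = np.array([[3., 2., 2.],
              [2., 3., -2.]])

U, s, Vt = np.linalg.svd(A)                    # A = U @ D @ Vt

# Singular values are the square roots of the eigenvalues of A A^T (or A^T A).
eigvals_AAt = np.linalg.eigvalsh(A @ A.T)      # ascending order
print(np.allclose(sorted(s**2), eigvals_AAt))  # True

# Reconstruct A from its SVD
D = np.zeros_like(A)
np.fill_diagonal(D, s)
print(np.allclose(U @ D @ Vt, A))              # True
```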
What are singular values, left singulars and right singulars?
(click the triangle to expand and reveal the full answer)
We will decompose $A$ into 3 matrices (instead of two with eigendecomposition): $A = U D V^T$.
The matrices $U$, $D$, and $V$ have the following properties:
- $U$ and $V$ are orthogonal matrices ($U^T = U^{-1}$ and $V^T = V^{-1}$)
- $D$ is a diagonal matrix. However, $D$ is not necessarily square.
The columns of $U$ are called the left-singular vectors of $A$, while the columns of $V$ are the right-singular vectors of $A$. The values along the diagonal of $D$ are the singular values of $A$.
What is the connection of Singular Value Decomposition of A with functions of A?
(click the triangle to expand and reveal the full answer)
In higher mathematics, a function $f$ is said to be linear if $f(x + y) = f(x) + f(y)$ (and $f(cx) = c\,f(x)$). The process of SVD decomposition can be seen as a function, which we will call $\operatorname{SVD}$, which takes in a matrix $A$ and returns three matrices: $U$, $D$, $V$. The three matrices that are returned have the property that $A = U D V^T$. However, given two matrices of the same size, say $A$ and $B$, $\operatorname{SVD}(A + B) \neq \operatorname{SVD}(A) + \operatorname{SVD}(B)$ in general. In other words, the $U$ in the SVD decomposition for $A + B$ is not equal to the $U$ in the SVD decomposition for $A$ plus the $U$ in the SVD decomposition for $B$. The same is true for the $D$'s and the $V$'s.
Linear algebra has the moniker "linear" because all linear functions that act on vectors of finite dimension can be represented by matrices, and all matrices represent linear functions acting on vectors. However, certain aspects of linear algebra are nonlinear. For example, the function that accepts a matrix and returns its square, $f(A) = A^2$, is nonlinear because $(A + B)^2 \neq A^2 + B^2$ in general. So SVD is a linear algebra topic because it involves breaking up a single linear action into three simpler linear actions, but the function that takes each matrix to its decomposition is nonlinear.
Why are singular values always non-negative?
(click the triangle to expand and reveal the full answer)
If $A$ has real entries then $A^T A$ is positive semidefinite, since $x^T A^T A x = \|Ax\|_2^2 \ge 0$ for all $x$. Therefore the eigenvalues of $A^T A$ are non-negative, and the singular values of $A$, being their square roots, are non-negative as well.
What is the Moore Penrose pseudoinverse and how to calculate it?
(click the triangle to expand and reveal the full answer)
The Moore-Penrose pseudoinverse is a direct application of the SVD. But before all, we have to remind that systems of equations can be expressed under the matrix form.
The inverse $A^{-1}$ of a matrix $A$ can be used to solve the equation $Ax = b$: $x = A^{-1}b$.
But in the case where the set of equations has zero or many solutions, the inverse cannot be found and the equation cannot be solved this way. The pseudoinverse $A^{+}$ is the matrix such that $AA^{+}$ is as close as possible to the identity matrix,
minimizing $\|AA^{+} - I\|_2$.
The following formula can be used to find the pseudoinverse: $A^{+} = V D^{+} U^T$,
with $U$, $D$ and $V$ respectively the left singular vectors, the singular values and the right singular vectors of $A$. $A^{+}$ is the pseudoinverse of $A$ and $D^{+}$ the pseudoinverse of $D$. We saw that $D$ is a diagonal matrix, and thus $D^{+}$ can be calculated by taking the reciprocal of the non-zero values of $D$ (and transposing the result).
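A minimal NumPy sketch, on an illustrative tall matrix, comparing a hand-rolled SVD-based pseudoinverse with np.linalg.pinv:

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])          # tall: more equations than unknowns
b = np.array([1., 2., 3.])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv_manual = Vt.T @ np.diag(1.0 / s) @ U.T   # A^+ = V D^+ U^T

print(np.allclose(A_pinv_manual, np.linalg.pinv(A)))  # True
x = A_pinv_manual @ b        # least-squares solution to A x = b
print(x)
```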
If we do Moore Penrose pseudo inverse on Ax = b, what solution is provided if A is fat? Moreover, what solution is provided if A is tall?
(click the triangle to expand and reveal the full answer)
A non-square matrix does not have an inverse, but if it is full rank ($\operatorname{rank}(A) = \min(m, n)$), it can have either a right pseudo-inverse or a left pseudo-inverse:
Given a matrix $A \in \mathbb{R}^{m \times n}$, let us define:
- $A$ is a fat matrix if $m < n$ and $\operatorname{rank}(A) = m$
- $A$ is a tall matrix if $m > n$ and $\operatorname{rank}(A) = n$
Using the finite rank lemma, we can find that:
- When $A$ is a fat matrix, there exists a right inverse $A_R^{-1}$ such that $A A_R^{-1} = I$. The right pseudo-inverse is not unique, but an example is $A_R^{-1} = A^T (AA^T)^{-1}$. In this case $Ax = b$ has infinitely many solutions, and $x = A^{+}b$ gives the one with minimal Euclidean norm.
- When $A$ is a tall matrix, there exists a left inverse $A_L^{-1}$ such that $A_L^{-1} A = I$. The left pseudo-inverse is not unique, but an example is $A_L^{-1} = (A^T A)^{-1} A^T$. In this case $Ax = b$ may have no exact solution, and $x = A^{+}b$ gives the least-squares solution, minimizing $\|Ax - b\|_2$.
Which matrices can be decomposed by ED?
(click the triangle to expand and reveal the full answer)
The eigendecomposition only exists for square matrices, and even among square matrices it sometimes doesn't exist (only diagonalizable matrices have one; every real symmetric matrix does).
Which matrices can be decomposed by SVD?
(click the triangle to expand and reveal the full answer)
The SVD always exists for any sort of rectangular or square matrix.
What is the trace of a matrix?
(click the triangle to expand and reveal the full answer)
The trace is the sum of all values on the diagonal of a square matrix: $\operatorname{tr}(A) = \sum_{i} A_{ii}$.
For example, $\operatorname{tr}\!\left(\begin{smallmatrix} 1 & 2 \\ 3 & 4 \end{smallmatrix}\right) = 1 + 4 = 5$.
How to write Frobenius norm of a matrix A in terms of trace?
(click the triangle to expand and reveal the full answer)
The Frobenius norm can be written in terms of the trace as $\|A\|_F = \sqrt{\operatorname{tr}(AA^T)}$, since $\operatorname{tr}(AA^T) = \sum_{i,j} A_{ij}^2$.
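A quick NumPy check of this identity on a random example matrix:

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(3, 4))

frob_direct = np.linalg.norm(A, ord='fro')     # sqrt of sum of squared entries
frob_trace = np.sqrt(np.trace(A @ A.T))        # sqrt(tr(A A^T))

print(np.isclose(frob_direct, frob_trace))     # True
```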
Why is trace of a multiplication of matrices invariant to cyclic permutations?
(click the triangle to expand and reveal the full answer)
More generally, the trace is invariant under cyclic permutations, that is,
$$\operatorname{tr}(ABCD) = \operatorname{tr}(BCDA) = \operatorname{tr}(CDAB) = \operatorname{tr}(DABC).$$
This is known as the cyclic property. It follows from the fact that $\operatorname{tr}(XY) = \sum_{i,j} X_{ij} Y_{ji} = \operatorname{tr}(YX)$: applying this repeatedly, moving the last factor of a product to the front never changes the trace.
Arbitrary permutations are not allowed: in general, $\operatorname{tr}(ABC) \neq \operatorname{tr}(ACB)$.
However, if products of three symmetric matrices are considered, any permutation is allowed, since
$$\operatorname{tr}(ABC) = \operatorname{tr}\!\left((ABC)^T\right) = \operatorname{tr}(CBA) = \operatorname{tr}(ACB),$$
where the first equality is because the traces of a matrix and its transpose are equal.
What is the trace of a scalar?
(click the triangle to expand and reveal the full answer)
The trace of a square matrix is the sum of its diagonal elements.
The trace enjoys several properties that are often very useful when proving results in matrix algebra and its applications.
Let us start with a formal definition.
Definition. Let $A$ be a $K \times K$ matrix. Then its trace, denoted by $\operatorname{tr}(A)$, is the sum of its diagonal elements: $\operatorname{tr}(A) = \sum_{k=1}^{K} A_{kk}$. A scalar can be viewed as a $1 \times 1$ matrix, so the trace of a scalar is simply the scalar itself: $\operatorname{tr}(a) = a$.
Write the Frobenius norm of a matrix in terms of trace.
(click the triangle to expand and reveal the full answer)
The Frobenius norm, sometimes also called the Euclidean norm (a term unfortunately also used for the vector $L^2$-norm), is a matrix norm of an $m \times n$ matrix $A$ defined as the square root of the sum of the absolute squares of its elements. In terms of the trace,
$$\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2} = \sqrt{\operatorname{tr}(AA^H)},$$
where $A^H$ is the conjugate transpose (simply $A^T$ for real matrices).
The Frobenius norm can also be considered as a vector norm on the entries of the matrix.
Numerical Optimization
What is underflow and overflow?
(click the triangle to expand and reveal the full answer)
Overflow is when the absolute value of the number is too high for the computer to represent it.
Underflow is when the absolute value of the number is too close to zero for the computer to represent it.
You can get overflow with both integers and floating point numbers. You can only get underflow with floating point numbers.
To get an overflow, repeatedly multiply a number by ten. To get an underflow repeatedly divide it by ten.
If the variable x is a signed byte, it can have values in the range -128 to +127; then
x = 127
x = x + 1
will result in an overflow, because +128 is not a valid value for x.
For floating point numbers, the range depends on their representation. If x is a single precision (32-bit IEEE) number, then
x = 1e-38
x = x / 1000
will result in an underflow, as 1e-42 is not a valid (normal) value for x.
Consider the softmax function, commonly used to transform a group of real values into "probabilities": $\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$. In applied use-cases, the $x_i$ can be very, very large or very, very small. Analytically, if $x_i = c$ for all $i$, then $\operatorname{softmax}(x)_i = 1/n$. Numerically, this may not occur when $|c|$ is large: a large positive $c$ causes $e^{c}$ to overflow, while a very negative $c$ causes $e^{c}$ to underflow, making the denominator zero and producing a divide-by-zero error.
How to tackle the problem of underflow or overflow for softmax function or log softmax function?
(click the triangle to expand and reveal the full answer)
To avoid overflow or underflow, we can transform $x$ with $z = x - \max_i x_i$ and evaluate $\operatorname{softmax}(z)$ instead.
Here is the proof that this transformation is valid: with $m = \max_i x_i$,
$$\operatorname{softmax}(z)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i}\, e^{-m}}{e^{-m} \sum_j e^{x_j}} = \operatorname{softmax}(x)_i.$$
There can be no overflow, as $z_i \le 0$ and hence $e^{z_i} \le 1$. The denominator is at least $1$ (the term with $z_i = 0$ contributes $e^{0} = 1$), so there's no divide-by-zero error. For log softmax, compute $\log \operatorname{softmax}(x)_i = z_i - \log \sum_j e^{z_j}$ directly (the log-sum-exp trick) rather than taking the log of the softmax output, which could underflow to $\log 0$.
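A minimal NumPy sketch of the stabilized softmax and log-softmax described above; the test input is an arbitrary example that would overflow the naive formula:

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)              # z_i <= 0, so exp(z_i) cannot overflow
    e = np.exp(z)
    return e / e.sum()             # denominator >= 1, no divide-by-zero

def log_softmax(x):
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))   # log-sum-exp trick

x = np.array([1000., 1000., 1000.])  # naive exp(1000) would overflow to inf
print(softmax(x))                    # [0.333... 0.333... 0.333...]
print(log_softmax(x))                # [-1.0986 ...] == log(1/3)
```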
What is poor conditioning?
(click the triangle to expand and reveal the full answer)
Conditioning measures how rapidly the output changes with tiny changes in the input.
For example, in a linear equation $Ax = b$, we can use the inverse matrix $A^{-1}$ to solve for $x = A^{-1}b$.
Nevertheless, it is not commonly done in machine learning because $A^{-1}$ is slow to compute, and worse, may amplify input errors rapidly.
For the function
$$f(x) = A^{-1}x,$$
the condition number is defined as $\max_{i,j} \left| \lambda_i / \lambda_j \right|$, the ratio of the magnitudes of the largest and smallest eigenvalues of $A$. A poorly conditioned matrix is a matrix with a high condition number: $A^{-1}$ then amplifies input errors, and small errors in $x$ can change the output of $f(x)$ rapidly.
Other methods, including matrix factorization, can replace matrix inversion to avoid the poor conditioning and improve numerical stability.
What is the condition number?
(click the triangle to expand and reveal the full answer)
The condition number $\kappa$ is the ratio of the largest to smallest singular value in the singular value decomposition of a matrix. The base-$b$ logarithm of $\kappa$ is an estimate of how many base-$b$ digits are lost in solving a linear system with that matrix. In other words, it estimates worst-case loss of precision. A system is said to be singular if the condition number is infinite, and ill-conditioned if it is too large, where "too large" means roughly that $\log(\kappa)$ approaches or exceeds the precision of the matrix entries.
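An illustrative NumPy sketch of the condition number as the ratio of the largest to smallest singular value; the nearly singular matrix is made up for the example:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])       # nearly singular -> poorly conditioned

s = np.linalg.svd(A, compute_uv=False)   # singular values only
kappa_manual = s.max() / s.min()

print(kappa_manual)
print(np.linalg.cond(A))            # same value (2-norm condition number)
print(np.log10(kappa_manual))       # roughly how many decimal digits are lost
```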
What are grad, div and curl?
(click the triangle to expand and reveal the full answer)
If $\phi(x, y, z)$ is a scalar field, i.e. a scalar function of position in 3 dimensions, then its gradient at any point is defined in Cartesian co-ordinates by
$$\operatorname{grad}\phi = \nabla\phi = \frac{\partial \phi}{\partial x}\,\mathbf{i} + \frac{\partial \phi}{\partial y}\,\mathbf{j} + \frac{\partial \phi}{\partial z}\,\mathbf{k}.$$
It's important to note that $\nabla$ is a vector operator, so $\nabla\phi$ is a vector field. The gradient of a scalar field points in the direction of greatest change of the field.
The divergence computes a scalar quantity from a vector field by differentiation (the reverse of what was done with the gradient).
If $\mathbf{F}(x, y, z)$ is a vector function of position in 3 dimensions, that is $\mathbf{F} = F_1\,\mathbf{i} + F_2\,\mathbf{j} + F_3\,\mathbf{k}$, then its divergence at any point is defined in Cartesian co-ordinates by
$$\operatorname{div}\mathbf{F} = \nabla \cdot \mathbf{F} = \frac{\partial F_1}{\partial x} + \frac{\partial F_2}{\partial y} + \frac{\partial F_3}{\partial z}.$$
It's worth noting that the divergence of a vector field is a scalar field.
If grad is the $\nabla$ operator applied to a scalar field as $\nabla\phi$, and div is the dot product of $\nabla$ and a vector field as $\nabla \cdot \mathbf{F}$, then the curl is the cross product of $\nabla$ and a vector field, $\nabla \times \mathbf{F}$. The curl can be described using the pseudo-determinant recipe for vector products as
$$\nabla \times \mathbf{F} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ \partial/\partial x & \partial/\partial y & \partial/\partial z \\ F_1 & F_2 & F_3 \end{vmatrix}.$$
What are critical or stationary points in multi-dimensions?
(click the triangle to expand and reveal the full answer)
A critical point is a place where there is potentially a maximum or a minimum. This can happen where the derivative is zero, or where the function is not differentiable (there could be a vertex, as in the absolute value function). A stationary point is simply a point where the derivative is zero. If you think of the derivative as a velocity, then those are places where the velocity is zero, and something with zero velocity is stationary.
In multiple dimensions, a stationary point is a point where the gradient is the zero vector, $\nabla f(x) = 0$; such a point can be a local minimum, a local maximum, or a saddle point.
All stationary points are critical points, but not all critical points are stationary points.
Why should you do gradient descent when you want to minimize a function?
(click the triangle to expand and reveal the full answer)
There are two main reasons for using gradient descent for minimizing a function:
- Sometimes we cannot solve the minimization problem by hand. This includes situations where we have an explicit form for the function but the equations for the local minima are not able to be solved in closed form and also situations where we don’t have an explicit form for the function (but of course we can still compute it).
- Sometimes even in situations where we have a nice formula for the minimum, it can be computationally more efficient to use gradient descent than to compute the formula. This includes very high-dimensional linear regressions: the formula for the least squares estimator, $\hat{w} = (X^T X)^{-1} X^T y$, is well-known, but it involves inverting a potentially large matrix. At a certain point gradient descent starts to be faster, because it consists of less expensive operations, even though you must iterate them many times to find the minimum.
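A minimal gradient-descent sketch in NumPy for least squares; the data, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.01                                    # step size (learning rate)
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
    w -= lr * grad

print(w)                                        # close to true_w
print(np.linalg.lstsq(X, y, rcond=None)[0])     # closed-form comparison
```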
What is line search?
(click the triangle to expand and reveal the full answer)
In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum of an objective function $f: \mathbb{R}^n \to \mathbb{R}$.
The line search approach first finds a descent direction along which the objective function will be reduced and then computes a step size that determines how far $x$ should move along that direction. The descent direction can be computed by various methods, such as gradient descent, Newton's method and quasi-Newton methods. The step size can be determined either exactly or inexactly.
What is hill climbing?
(click the triangle to expand and reveal the full answer)
Hill Climbing is a heuristic search used for mathematical optimization problems in the field of Artificial Intelligence.
Given a large set of inputs and a good heuristic function, it tries to find a sufficiently good solution to the problem. This solution may not be the global optimum.
- In the above definition, mathematical optimization problems implies that hill-climbing solves the problems where we need to maximize or minimize a given real function by choosing values from the given inputs. Example-Travelling salesman problem where we need to minimize the distance traveled by the salesman.
- ‘Heuristic search’ means that this search algorithm may not find the optimal solution to the problem. However, it will give a good solution in reasonable time.
- A heuristic function is a function that will rank all the possible alternatives at any branching step in search algorithm based on the available information. It helps the algorithm to select the best route out of possible routes.
Hill Climbing has two main features: 1) It is a variant of the generate-and-test algorithm:
- Generate possible solutions.
- Test to see if this is the expected solution.
- If the solution has been found quit else go to step 1.
It takes the feedback from the test procedure, and the generator uses that feedback to decide the next move in the search space. 2) It uses a greedy approach: at any point in the state space, the search moves only in the direction that improves the cost function, with the hope of finding the optimal solution at the end.
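A toy hill-climbing sketch in Python; the objective function, step size, and neighbourhood rule are invented for illustration, and a real problem would define these over its own state space:

```python
import random

def objective(x):
    """Toy 1-D objective to maximize: a single smooth hill at x = 3."""
    return -(x - 3.0) ** 2

def hill_climb(start, step=0.1, max_iters=1000):
    current = start
    for _ in range(max_iters):
        # Generate a neighbouring candidate and test it (generate-and-test).
        candidate = current + random.choice([-step, step])
        if objective(candidate) > objective(current):
            current = candidate           # greedy: only accept improvements
    return current

random.seed(0)
print(hill_climb(start=-5.0))             # ends up near 3.0 (a local maximum)
```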
What is a Jacobian matrix?
(click the triangle to expand and reveal the full answer)
The Jacobian matrix of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ is the $m \times n$ matrix of all its first-order partial derivatives, $J_{ij} = \frac{\partial f_i}{\partial x_j}$. It generalizes the gradient (the Jacobian of a scalar-valued function is the transpose of its gradient) and gives the best local linear approximation of $f$.
For example, for $f(x, y) = (x^2 y,\; 5x + \sin y)$,
$$J = \begin{bmatrix} 2xy & x^2 \\ 5 & \cos y \end{bmatrix}.$$
What is curvature?
(click the triangle to expand and reveal the full answer)
The curvature is related to the second derivative.
For a two-dimensional curve written in the form $y = f(x)$, the curvature is $\kappa = \frac{|f''(x)|}{\left(1 + f'(x)^2\right)^{3/2}}$, so at points where the slope is small the curvature is essentially the magnitude of the second derivative.