# Machine Learning Fundamentals

## Fundamentals of all types of machine learning, deep or otherwise

This post is Part 2 of the 4-part Machine Learning Research Interview Handbook series (you can see the rest of the series here).

In this part of the series we go into the fundamentals of machine learning systems. After all, there’s more than just neural networks, and you should ideally be able to understand the general rules and properties that hold for all frameworks.

Learning Theory

Model and feature selection

Curse of dimensionality

Universal approximation of neural networks

Deep Learning motivation

Support Vector Machines

Bayesian Machine Learning

Regularization

Evaluation of Machine Learning systems

Clustering

Dimensionality Reduction

Basics of Natural Language Processing

# Machine Learning

## Learning Theory #### Describe bias and variance with examples.

Variance: refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. More flexible statistical methods have higher variance.

• Explanation: different training data sets will result in a different $\hat{f}$. But ideally the estimate for $f$ should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in $\hat{f}$

Bias: refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

• Explanation: As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases.

Decomposition: The expected test MSE, for a given value $x_0$ can always be decomposed into the sum of three fundamental quantities: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error variance terms $\epsilon$.

$E(y_0-\hat{f}(x_0))^2=\text{Var}(\hat{f}(x_0))+[\text{Bias}(\hat{f}(x_0))]^2+\text{Var}(\epsilon)$

The overall expected test MSE can be computed by averaging $E(y_0-\hat{f}(x_0))^2$ over all possible values of $x_0$ in the test set. Blue squared bias curve, orange variance curve, Dashed Var(epsilon) line, and red test MSE curve for three different datasets. The vertical dotted line indicates the flexibility corresponding to the smallest test MSE.

#### What is Empirical Risk Minimization?

ERM: the function that minimizes loss on the training set.

In many machine learning task, we have data $Z$ from some distribution $p$ and the task is to minimize the risk: $R_(f)=E_{Z \sim p}[\text{loss}(f(Z),Z)]$ Loss function: In classification $Z=(X,Y)$ and we use 0/1 loss $\text{loss}(f(Z), Z) = I_{f(X) \neq Y}$, in regression $Z=(X,Y)$ and we use squared error $\text{loss}(f(Z), Z) = (f(X) - Y )^2$ and in density estimation $Z=X$ and we use negative log likelihood loss $\text{loss}(f(Z), Z) = - \log f(X)$. We are interested in finding the optimal predictor $f^*=\arg\min_f R(f)$ In practice, we compute the empirical risk: $\hat{R}(f)=\frac{1}{n} \sum_{i=1}^n \text{loss}(f(X_i),Y_I)$ We choose the $\hat{f}$ that minimizes the empirical risk over some class $F$, such as parametric models, histogram classifiers, decision trees or linear/polynomial functions, etc. $\hat{f}^{ERM}=\arg \min_{f \in F} \hat{R}(f)$

#### What is Hoeffding’s inequality?

Let $Z_1, ..., Z_n$ be $n$ number of random independent, identically distributed random variables, such that $0 \leq Z_i \leq 1$. Then, $P\left ( \left | \frac{1}{n} \sum_{i=1}^n Z_i - E[Z] \right | > \epsilon \right ) \leq \delta = 2 \exp(-2n \epsilon^2)$

Basically, we have a bunch of variables $Z_i$. We know that when we average a bunch of them up, we should usually get something close to the expected value $E[Z]$. Hoeffding quantifies “usually” and “close” for us

What the Hoeffding Inequality gives us is a probabilistic guarantee that $v$ doesn’t stray too far from $E[Z]$. $\epsilon$ is some small value which we use to measure the average deviation of $Z_1, ..., Z_n$ from $E[Z]$. We claim that the probability of $Z_i$ being more than $\epsilon$ away from $E[Z]$ is less than or equal to some bound which shrinks exponentially as $\epsilon$ and/or our sample size $n$ increases. In other words, the larger your sample size and the wider your margin of error, the less likely you are to step over that margin of error with your best guess.

#### Write the formulae for training error and generalization error. Point out the differences.

Generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice. [DL Book]

Training error: when training a machine learning model, we have access to a training set, we can compute some error measure on the training set called the training error, and we reduce this training error.

In our linear regression example, we trained the model by minimizing the training error:

$\frac{1}{m^{(train)}}||\mathbf{X}^{(train)}\mathbf{w}-\mathbf{y}^{(train)}||^2_2$

but we actually care about the test error:

$\frac{1}{m^{(test)}}||\mathbf{X}^{(test)}\mathbf{w}-\mathbf{y}^{(test)}||^2_2$

Error v.s. Loss v.s. Risk:

Error is the difference between the actual / true value ($\hat{\theta}$) and the predicted / estimated value ($\hat{\theta}$). $Error=\theta-\hat{\theta}$ Loss (𝐿) is a measurement of how well our model perform against your training data.

• Regression Losses

• Mean Square Error/Quadratic Loss/$L_2$ Loss $\text{MSE}=\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n}$
• Mean Absolute Error/$L_1$ Loss $\text{MAE}=\frac{\sum_{i=1}^n|y_i-\hat{y}_i|}{n}$
• Mean Bias Error $\text{MBE}=\frac{\sum_{i=1}^n(y_i-\hat{y}_i)}{n}$
• Classification Losses

• Hinge Loss/Multi class SVM Loss $\text{SVM Loss}=\sum_{j \neq y_i} \max(0,s_j-s_{y_i}+1) \\ L(\mathbf{X},\mathbf{y},\beta)=\sum_{i=1}^n \max[0,1-y_i(\beta_0+\beta_1x_{i1}+,,,+\beta_px_{ip})]$
• Cross Entropy Loss/Negative Log Likelihood $\text{CrossEntropyLoss}=-\left[y_i\log(\hat{y_i})+(1-y_i)\log (1-y_i)\right]$

Risk is the average measure of loss, or expected loss, across your whole data distribution.

$R(\theta,\hat{\theta})=E(L(\theta,\hat{\theta}))$

Empirical Risk: when we train our model, we do not have the full distribution of the data. This may be because some of our data is used for validation and testing, or that new data points are produced in real-time. The best we can do is to pick our training data in a random way and assume that our training data is representative of the real data.

Therefore, because we don’t have all the data, the best we can do is to minimize the empirical risk, from data that we do have (our training data), and use regularization techniques to generalize (i.e. avoid overfitting). This is why minimizing loss and minimizing empirical risk are roughly the same thing.

#### State the uniform convergence theorem.

A sequence $\{f_n\}_{n=1}^{\infty}$ of real-valued functions on a set $X, f_n : X \to \mathbb{R}$, is said to be uniformly convergent on $X$ to a limit function $f : X \to \mathbb{R}$ if for each $\epsilon > 0$ there exists an $\mathbb{N} \in N$ such that $|f(x) - f_n(x)| < \epsilon$ holds for all $n \geq N$ and $x \in X$.

Uniform convergence simplifies certain calculations, for instance by interchanging the integral and the limit sign in integration.

Difficulties which arise when the convergence is pointwise but not uniform can be seen in the example of the non Riemann integrable indicator function of rational numbers in $[0,1]$ and provide partial explanations of some other anomalies such as the Gibbs phenomenon. Many theorems of functional analysis use uniform convergence in their formulation, such as the Weierstrass approximation theorem and some results of Fourier analysis. Uniform convergence can be used to construct a nowhere-differentiable continuous function.

#### What is the sample complexity bound of uniform convergence theorem?

For the sample complexity of a uniformly convergent function: Let $F_H = \{f_h | h \in H \}$, where $f_h$ is the loss function for hypothesis $h$.

• $F_H$ has the uniform convergence property. As such, an ERM (Empirical Risk Minimization) algorithm “learns” $H$.
• The sample complexity of learning $H$ is bounded by $m_{F_H}(\eta, \epsilon)$

#### What is the error bound of uniform convergence theorem?

We measure the quality of a hypothesis by its generalization error, the expected loss of a hypothesis on a new test example. In online learning, we can bound the expected generalization error of online gradient descent using online-to-batch conversion. This section allows us to derive high probability bounds as well as more general results for hypothesis classes without relying on convexity.

Uniform convergence only gives us upper bounds, so we can’t directly compare the generalization error of two algorithms. Also, it is worst case over all distributions p∗, so it lacks sensitivity to the exact problem structure.

#### What is the bias-variance trade-off theorem?

If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand if our model has large number of parameters then it’s going to have high variance and low bias. So we need to find the right/good balance without overfitting and underfitting the data. $\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$

Underfitting happens when a model unable to capture the underlying pattern of the data $\Rightarrow$ high bias, low variance.

Overfitting happens when our model captures the noise along with the underlying pattern in data. It happens when we train our model a lot over noisy dataset $\Rightarrow$ low bias and high variance. #### From the bias-variance trade-off, can you derive the bound on training set size?

Increasing training data size leads to decreased variance. Decreasing the variance leads to an increase the bias. So, increase the training data size leads to decreasing variance and increasing bias. From this, it’s possible to derive a training set size at $\min(\text{Bias}^2 + \text{Variance} + \text{Irreducible Error})$

#### What is the VC dimension?

In learning theory, the VC dimension is a measure of capacity of a class of hypotheses $\mathcal{H}$ (e.g., a set of classifiers). This notion of capacity indicates how complicated $\mathcal{H}$ is. Although complicated $\mathcal{H}$ may be able to fit well to the dataset at hand, yielding a low training error, there is a possibility that it overfits and gives high generalization error. The VC dimension provides a tool to analyze the generalization error of a class of hypotheses based on how complicated it is, independent of the input distribution, the target function, and the learning algorithm (i.e., a systematic approach to choose the best hypothesis $g \in \mathcal{H}$).

Capacity in VC theory is captured by the concept called shattering. Here we focus only on binary classification problems.

A hypothesis class H is said to shatter $n$ points if there exists a dataset $X=\{x_1,...,x_n\}$ such that for any label assignment $y=(y_1,...,y_n)$ where $y_i \in \{-1,+1\}$, there exists a hypothesis $h \in \mathcal{H}$ which can produce $y=h(X)$.

In a simpler terms, we say $H$ shatters $n$ points if there exists a configuration of $X=\{x_1,...,x_n\}$ such that $\mathcal{H}$ can produce all possible $2^n$ assignments of $y$. Things worth noting are

• If $\mathcal{H}$ can produce any assignment of $y$ on just even one configuration of $n$ points, then we say $\mathcal{H}$ can shatter $n$ points. So, when constructing an example, it makes sense to imagine a configuration of points such that $\mathcal{H}$ can shatter easily.
• If $\mathcal{H}$ can shatter $n$ points, then obviously it can shatter less than $n$ points.
• Likewise, if $\mathcal{H}$ cannot shatter n points, then it cannot shatter more than $n$ points.

The VC dimension of $\mathcal{H}$, denoted by $d_{VC}(\mathcal{H})$, is the largest number of points $\mathcal{H}$ can shatter.

As an example, the VC dimension of a linear classifier in two-dimensional space is 3. That is, three is the highest number of points a line can produce all possible $\{-1,+1\}$ assignments. With four points, there are two cases out of 16 possible assignments a line cannot produce. In general, $d_{VC}(\text{linear classifiers})=d+1$ where $d$ is the input dimension.

The VC dimension can be used to bound probabilistically the difference between the training and test errors. This result is known as VC inequality.

#### What does the training set size depend on for a finite and infinite hypothesis set? Compare and contrast.

Most of the tools in machine learning are designed to make better use data using $P_{AC}$ (Probably Approximately Correct), first introduced by Professor Valiant. The $P_{AC}$ analyses assume that the true answer/concept is in the given hypothesis space $\mathcal{H}$. A machine learning algorithm $L$ with hypothesis space $\mathcal{H}$ is one that, given a training data set $D$, will always return a hypothesis $\mathcal{H}$ consistent with $D$ if one exists, otherwise it will indicate that no such hypothesis exists. In a finite machine learning hypothesis $\mathcal{H}$ does not have polynomial sample complexity. If $\mathcal{H}$ has polynomial sample complexity it is called infinite hypothesis.

#### What is the VC dimension for an n-dimensional linear classifier?

Let d be the dimension of the input data and $\mathcal{H} = \{f ~|~ f(\boldsymbol{x}) = \text{sign}(\boldsymbol{w}^\top \boldsymbol{x} + b), b \in \mathbb{R} \}$ be the set of all linear classifiers. $d_{VC}(\mathcal{H}) \geq d+1.$ Proof We need to prove that $\mathcal{H}$ can shatter at least $d+1$ points. That is, it suffices to show that there exists a set of $d+1$ points such that $\mathcal{H}$ can produce any pre-specified $\{-1,+1\}$ assignment $y=(y_1,...,y_d+1)$. Let $\boldsymbol{X} = \left[ \begin{array}{c} \boldsymbol{x}_1^\top \\ \vdots \\ \boldsymbol{x}_{d+1}^\top \end{array} \right] = \left[ \begin{array}{ccccc} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{array} \right] \in \mathbb{R}^{d+1 \times d+1}$

The first coordinate of each $x_i$ is 1 to allow the bias term $b$ to be produced. Note that $X$ is invertible. With this definition, for any $y$ we can always have $sign(Xw)=y$ by choosing $\boldsymbol{w} = \boldsymbol{X}^{-1} \boldsymbol{y}$ since $Xw=y$ implies sign $\boldsymbol{X}\boldsymbol{w} = \boldsymbol{y}$. $d_{VC}(\mathcal{H}) \leq d + 1.$ Proof We need to show that no $d+2$ points can be shattered by $\mathcal{H}$. That is to show that there exists a certain assignment $\boldsymbol{y} \in \mathbb{R}^{d+2}$ not achievable by $\mathcal{H}$. We will construct one such assignment.

Assume we have $d+2$ points $x_1,...,x_{d+2}$ in $(d+1)$-dimensional space. Then, the set is linearly dependent. So, there exists $\boldsymbol{x}_j = \sum_{i\neq j} a_i \boldsymbol{x}_i$ such that not all $a_i$ are $0$. Set $y_j=-1$ and $y_i = \text{sign}(a_i) = \text{sign}(\boldsymbol{w}^\top \boldsymbol{x}_i)$. If $a_i=0$, $y_i$ can be arbitrary. $\text{sign}(\boldsymbol{w}^\top \boldsymbol{x}_j) = \text{sign}(\sum_{i \neq j} a_i \boldsymbol{w}^\top \boldsymbol{x}_i) > 0 \neq y_j$ Since there exists an assignment of d+2 points not achievable by H, we can say that $d_{VC}(\mathcal{H}) \leq d + 1$.

$d_{VC}(\mathcal{H}) = d + 1$

Proof Combining the two lemmas.

#### How is the VC dimension of a SVM bounded although it is projected to an infinite dimension?

Choose some $\eta$ between 0 and 1. Vapnik (1995) showed that with probability $1- \eta$: $R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h)+1)-\log(\eta/4)}{l}}$

• $h$ is VC dimension and is a measure of the capacity or complexity of the machine.
• Note the bound is independent of $P(x,y)$
• If we know $h$, can readily compute the RHS. This provides a principled way to choose a learning machine.

Finding VC dimensions of machines with different kernels is non-trivial. Some kernels (e.g. RBF) have infinite VC dimension but still work well in practice.

It is possible to derive a bound based on the margin $\left ( \sum^N_{i=1} \alpha_i \right )^{-1/2}$ and the “radius” $R=\max(K(x_i, x_i))$ but the bound tends to be unrealistic.

#### Considering that Empirical Risk Minimization is an NP-hard problem, how does logistic regression and SVM loss work?

Both logistic regression and SVMs can be seen under an empirical risk minimization light, where one is interested in minimizing the following function with regards to the coefficients $w$: $L(x,y,w)=\sum_i^n f(x_i,y_i,w)+R(w)$ When $f \equiv \text{hinge loss}$ and $R(w) =w^Tw$ we have the SVM, and when $f \equiv \text{logistic loss}$ and $R(w) =0$ we have logistic regression.

So you can treat the loss and regularization as parameters of a model, which is empirical risk minimization.

Most, if not all, linear (kernelized/regularized) models can be put under this umbrella as well, it’s a really broad definition.

Of course, the SVM and logistic regression can be seen as different models, because they have different functional forms, having completely different specialized techniques for optimization.

## Model and feature selection #### Why are model selection methods needed?

• Model Interpretability: Irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables—that is, by setting the corresponding coefficient estimates to zero—we can obtain a model that is more easily interpreted.
• We want to estimate the generalization performance, the predictive performance of our model on future (unseen) data.
• We want to increase the Prediction Accuracy by tweaking the learning algorithm and selecting the best performing model from a given hypothesis space.
• We want to identify the machine learning algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best performing model from the algorithm’s hypothesis space.
• Fast (to train and test)
• Scalable (it can be applied to a large dataset)

The idea of model selection method is intuitive. It answers the following question:

How to select the right input variables for an optimal model?

An Optimal model is a model that fits the data with best values for the evaluation metrics.

(Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used.)

#### How do you do a trade-off between bias and variance?

Shrinkage methods: By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.

Bagging and other resampling techniques can be used to reduce the variance in model predictions.

#### What are the different attributes that can be selected by model selection methods?

Model selection can be used (at least indirectly) to select for attributes like model robustness, additional engineered features, data transformation steps, feature filtering steps, and data split

Some of the attributes that can be seleted include, but are not limited to:

• Accuracy & Loss metrics
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
• Minimum Description Length (MDL)
• Structural Risk Minimization (SRM)

Because of how tightly linked and entangled the

#### Why is cross-validation required?

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.

Cross Validation is a very useful technique for assessing the effectiveness of your model, particularly in cases where you need to mitigate overfitting. You need some kind of assurance that your model has got most of the patterns from the data correct, and its not picking up too much on the noise, or in other words its low on bias and variance.

Evaluation of residuals only gives us an idea about how well our model does on data used to train it. So, the problem with this evaluation technique is that it does not give an indication of how well the learner will generalize to an independent/ unseen data set. Getting this idea about our model is known as Cross Validation.

It is also of use in determining the hyper parameters of your model, in the sense that which parameters will result in lowest test error.

#### Describe different cross-validation techniques.

• Hold-out cross validation
• K-fold cross validation
• Leave-one-out cross validation

#### What is hold-out cross validation? What are its advantages and disadvantages?

Hold-out cross validation:

1. Randomly dividing the available set of observations into two parts, a training set and a validation set or hold-out set.
2. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
3. The resulting validation set error rate—typically assessed using MSE in the case of a quantitative response—provides an estimate of the test error rate.

• High variance: because it is not certain which data points will end up in the validation set and the result might be entirely different for different sets.
• Removing a part of it for validation poses a problem of underfitting

#### What is k-fold cross validation? What are its advantages and disadvantages?

K-fold cross validation:

1. Randomly k-fold CV dividing the set of observations into k groups, or folds, of approximately equal size.
2. The first fold is treated as a validation set, and the method is fit on the remaining k - 1 folds.
3. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set.
4. This process results in k estimates of the test error, MSE1,MSE2, … ,MSEk.
5. The k-fold CV estimate is computed by averaging these values, $CV_{(k)}=\frac{1}{k}\sum_{i=1}^kMSE_i$ Advantage: significantly reduces bias as we are using most of the data for fitting, and also significantly reduces variance as most of the data is also being used in validation set.

Disadvantage: the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation.

#### What is leave-one-out cross validation? What are its advantages and disadvantages?

Leave-one-out cross validation: leaves $1$ data points out of training data as validation set. The statistical learning method is fit on the $n - 1$ training observations, and a prediction $y_1$ is made for the excluded observation, using its value $x_1$. and then the error is averaged for all trials, to give overall effectiveness.

Since $(x_1,y_1)$ was not used in the fitting process, $MSE_1 =(y_1 - \hat{y}_1)^2$ provides an approximately unbiased estimate for the test error. But even though $MSE_1$ is unbiased for the test error, it is a poor estimate because it is highly variable, since it is based upon a single observation $(x_1,y_1)$.

• k-Fold more biased than LOOCV; k-Fold less variance than LOOCV
• When we perform LOOCV, we are in effect averaging the outputs of n fitted models, each of which is trained on an almost identical set of observations; therefore, these outputs are highly (positively) correlated with each other.
• very expensive to compute

#### Why is feature selection required?

• Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
• Improves Accuracy: Less misleading data means modeling accuracy improves.
• Reduces Training Time: fewer data points reduce algorithm complexity and algorithms train faster.

#### Describe some feature selection methods.

• Filter Methods: apply a statistical measure to assign a scoring to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset.

• distance metrics, correlation, mutual information, and consistency metrics
• Wrapper Methods: consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.

• Best Subset, forward and backward, recursive feature elimination
• Embedded Methods: learn which features best contribute to the accuracy of the model while the model is being created.

• regularization methods: LASSO, Elastic Net and Ridge Regression

#### What is forward feature selection method? What are its advantages and disadvantages?

Forward stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the ﬁt.

algorithm:

1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
2. for $k=0,...,p-1$:
3. Consider all $p-k$ models that audment the predictors in $\mathcal{M}_k$ with one additional predictor
4. Choose the best among these $p-k$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having smallest RSS or highest $R^2$.
5. Select a single best model from among $\mathcal{M}_0,...,\mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$

• Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence
• Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias

#### What is backward feature selection method? What are its advantages and disadvantages?

Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the ﬁt. The candidate for dropping is the variable with the smallest Z-score

algorithm:

1. Let $\mathcal{M}_p$ denote the full model, which contains all $p$ predictors.
2. for $k=p, p-1,...,1$:
3. Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k-1$ predictors.
4. Choose the best among these $k$ models, and call it $\mathcal{M}_{k-1}$. Here best is defined as having smallest RSS or highest $R^2$.
5. Select a single best model from among $\mathcal{M}_0,...,\mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$

• Backward-stepwise selection can only be used when $N>p$, while forward stepwise can always be used.
• Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the $p$ predictors.

#### What is filter feature selection method and describe two of them?

Filter Methods: apply a statistical measure to assign a scoring to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset.

• Chi-squared: Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with best Chi-square scores. to evaluate how likely it is that any observed difference between the sets arose by chance or if the association between two categorical variables of the sample would reflect their real association in the population.
• Correlation: The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: “Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other
• Entropy: Entropy measures the amount of information in a random variable; it’s the average length of the message needed to transmit an outcome of that variable using the optimal code.

• Information content: $I(E)=-\log [\Pr(E)]=-\log (P)$
• Define the entropy as the expected value of information: $H(X)=E[I(X)]=E[-\log (P(X))]=-\sum_{i=1}^nP(x_i)\log (P(x_i))$
• One-attribute-rule(OneR): The idea of the OneR (one-attribute-rule) algorithm is to find the one attribute to use that makes fewest prediction errors.

#### What are Cross Entropy and KL divergence? Describe KL divergence intuitively.

Cross entropy is, at its core, a way of measuring the “distance” between two probability distributions P and Q. As you observed, entropy on its own is just a measure of a single probability distribution. As such, if we are trying to find a way to model a true probability distribution, P, using, say, a neural network to produce an approximate probability distribution Q, then there is the need for some sort of distance or difference measure which can be minimized.

Cross entropy function: $H(p,q)=H(p)+D_{KL}(p||q)$

• The first term, the entropy of the true probability distribution p, during optimization is fixed – it reduces to an additive constant during optimization.
• Only the parameters of the second, approximation distribution, q that can be varied during optimization – and hence the core of the cross entropy measure of distance is the KL divergence function.

KL divergence is an expression of “surprise” – under the assumption that P and Q are close, it is surprising if it turns out that they are NOT CLOSE, hence in those cases the KL divergence will be high. If they are CLOSE together, then the KL divergence will be low.

KL divergence is the information gained when we move from a prior distribution Q to a posterior distribution P.

Derivation of KL divergence:

The expression for KL divergence can also be derived by using a likelihood ratio approach.

The likelihood ratio $LR=\frac{p(x)}{q(x)}$

• Interpretation: if a value x is sampled from some unknown distribution, the likelihood ratio expresses how much more likely the sample has come from distribution $p$ than from distribution $q$. If it is more likely from $p$, the $LR > 1$, otherwise if it is more likely from $q$, the $LR < 1$.

Let’s say we have lots of independent samples and we want to estimate the likelihood function taking into account all this evidence – it then becomes: $LR=\prod_{i=1}^n\frac{p(x_i)}{q(x_i)}$ If we convert the ratio to $\log$, it’s possible to turn the product in the above definition to a summation: $LR=\sum_{i=1}^n\log \frac{p(x_i)}{q(x_i)}$ So now we have the likelihood ratio as a summation. Let’s say we want to answer the question of how much, on average, each sample gives evidence of $p(x)$ over $q(x)$. To do this, we can take the expected value of the likelihood ratio and arrive at: $D_{KL}(P||Q)=\sum_{i=1}^n p(x_i)\log \frac{p(x_i)}{q(x_i)}$ The expression above is the definition of KL divergence. It is basically the expected value of the likelihood ratio – where the likelihood ratio expresses how much more likely the sampled data is from distribution P instead of distribution Q. Another way of expressing the above definition is as follows (using log rules): $D_{KL}(P||Q)=\sum_{i=1}^n p(x_i)\log p(x_i)-\sum_{i=1}^n p(x_i)\log q(x_i)$

• The first term in the above equation is the entropy of the distribution P. As you can recall it is the expected value of the information content of P.
• The second term is the information content of Q, but instead weighted by the distribution P.
• This yields the interpretation of the KL divergence to be something like the following: if P is the “true” distribution, then the KL divergence is the amount of information “lost” when expressing it via Q.

Cross entropy:

$H(p,q) =H(p)+D_{KL}(p||q) \\ = -\sum_{i=1}^nP(x_i)\log (P(x_i))+\sum_{i=1}^n p(x_i)\log p(x_i)-\sum_{i=1}^n p(x_i)\log q(x_i) \\= -\sum_{i=1}^n p(x_i)\log q(x_i)$

If we have two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence: $D_{KL}(P||Q)=E_{x \sim P}[\log \frac{P(x)}{Q(x)}]=E_{x \sim P}[\log P(x)- \log Q(x)]$

The KL divergence has many useful properties

• Nonnegative. The KL divergence is $0$ if and only if $P$ and $Q$ are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables.
• Asymmetric: $D_{KL}(P||Q) \neq D_{KL}(Q||P)$, this asymmetry means that there are important consequences to the choice of whether to use $D_{KL}(P||Q)$ or $D_{KL}(Q||P)$.

Cross-entropy: $H(P, Q) = H(P) +D_{KL}(P||Q)$ $H(P,Q)=-E_{x \sim P} \log Q(x)$ Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because $Q$ does not participate in the omitted term.

## Curse of dimensionality #### Describe the curse of dimensionality with examples.

Curse of dimensionality: as the dimensionality of the features space increases, the number configurations can grow exponentially, and thus the number of configurations covered by an observation decreases.

As the number of feature or dimensions grows, the amount of data we need to generalise accurately grows exponentially.

fun example: It’s easy to hunt a dog and maybe catch it if it were running around on the plain (two dimensions). It’s much harder to hunt birds, which now have an extra dimension they can move in. If we pretend that ghosts are higher-dimensional beings

#### What is local constancy or smoothness prior or regularization?

(See DL Book 5.11.2)

Smoothness prior or local constancy prior: This prior states that the function we learn should not change very much within a small region.

Many simpler algorithms rely exclusively on this prior to generalize well, and as a result they fail to scale to the statistical challenges involved in solving AIlevel tasks.

• KNN, decision trees, local kernel

All of these different methods are designed to encourage the learning process to learn a function $f^*$ that satisfies the condition $f^*(x)\approx f^*(x+\epsilon)$ In other words, if we know a good answer for an input $x$ (for example, if $x$ is a labeled training example) then that answer is probably good in the neighborhood of $x$.

Assuming only smoothness of the underlying function will not allow a learnerto represent a complex function that has many more regions to be distinguished than the number of training examples

## Universal approximation of neural networks #### State the universal approximation theorem? What is the technique used to prove that?

Universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.

The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function.

However, we are not guaranteed that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons.

• The optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function.
• The training algorithm might choose the wrong function due to overfitting

The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be.

#### What is a Borel measurable function?

Any continuous function on a closed and bounded subset of $R^n$ is Borel measurable and therefore may be approximated by a neural network.

## Deep Learning motivation #### What is the mathematical motivation of Deep Learning as opposed to standard Machine Learning techniques?

In deep learning (or neural networks) each node $i$ the network creates a linear function. Hence a neural network, which is combination of multiple nodes arranged in multiple layers, is capable of learning complex non-linear functions. In simplest terms one can think of it as a two dimensional array of logistic regression functions.

In standard Machine learning, the order of the regions that can be recognized in the function space is smaller. This effectively means that it’s harder to learn more complex features in the samples. However, this also means that the numbers of samples needed can be orders of magnitude smaller than for deep learning.

For deep learning, the nested linear functions in a neural network allow for much higher-order feature regions to be recognized. The tradeoff is that this often requires many times more samples than are used in standard machine learning.

#### What are the reasons for choosing a deep model as opposed to shallow model?

In short, “shallow” neural networks is a term used to describe NN that usually have only one hidden layer as opposed to deep NN which have several hidden layers, often of various types.

There are papers that highlight that deep NN with the right architectures achieve better results than shallow ones that have the same computational power (e.g. number of neurons or connections).

The main explanation is that the deep models are able to extract/build better features than shallow models and to achieve this they are using the intermediate hidden layers

#### How does Deep Learning tackle the curse of dimensionality?

The curse of dimensionality normally comes about because in data there are relevant and too many irrelevant (noise) features. The neurons in deep learning (DL) architectures, use lots of data in order to model a problem and thereby a DL system reduces the influence of irrelevant features while increasing the influence of relevant features during learning.

Let me explain this a little further by focusing on a single processing unit (neuron).

Given a raw high-dimensional feature vector v, $v=[v_1, v_2, ..., v_n]$ For $n = \text{very large number}$ such as an image $n=\text{width} \times \text{height}$

We know that the actual information exists in a much lower dimensional space than $n$. That is why dimensionality reduction works well because it eliminates curse of dimensionality by projecting the data into a much lower relevant representational space. The process of learning in machine learning (ML) algorithms finds that smaller dimensional representation space in the large raw vector $v$.

For simplicity consider a single node

$y= \phi (\sum_{i=1}^n v_iw_i+b)$

The node makes a decision by weighing each feature $v_i$ thus after training the weights for corresponding relevant features will be high. Further, consider $v$ is a concatenation of the relevant feature vector $v_{\text{relevant}}$ and the irrelevant feature vector $v_{\text{irrelevant}}$ such as

$v=[v_{\text{relevant}}, v_{\text{irrelevant}}]$

The weight vector can also be seen as a concatenated vector:

$w=[w_{\text{relevant}}, w_{\text{irrelevant}}]$

So we can further write

$y= \phi (v^Tw+b)$

$y= \phi(v_{\text{relevant}}^T w_{\text{relevant}}+v_{\text{irrelevant}}^T w_{\text{irrelevant}}+b)$

After learning

$w_{\text{irrelevant}} \approx 0$

So that reduces the effective dimensionality of the problem to the dimensionality of $w_{\text{relevant}}$. This is a form of dimensionality reduction. This process occurs at every layer of the deep neural nets (DNN) because each of the neurons in the DNN will only be sensitive to a particular relevant feature.

## Support Vector Machines #### What is a large margin classifier?

Margin: the smallest (perpendicular) distance from each training observation to a given separating hyperplane $\Rightarrow$ the minimal distance from the observations to the hyperplane.

Maximal margin hyperplane: the separating hyperplane that is farthest from the training observations.

• The maximal margin hyperplane is the separating hyperplane for which the margin is largest
• Overfitting when $p$ is large.

Maximal margin classifier: classify a test observation based on which side of the maximal margin hyperplane it lies.

The maximal margin hyperplane is the solution to the optimization problem $\max_{\beta_0,...\beta_p} M \\ s.t. \sum_{j=1}^p \beta_j^2=1, \quad (9.10) \\ y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2},...+\beta_px_{ip})>M \quad \forall i=1,..,n. \quad (9.11)$

• The constraint in $(9.11)$ in fact requires that each observation be on the correct side of the hyperplane, with some cushion, provided that margin $M$ is positive.)
• The constraint in $(9.10)$ makes sure the perpendicular distance from the $i$-th observation to the hyperplane is given by $y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2},...+\beta_px_{ip})$

#### Why SVM is an example of a large margin classifier?

Support Vector Classifier (Soft Margin Classifier): Rather than seeking the largest possible margin that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, we instead allow some observationsto be on the incorrect side of the margin, or even the incorrect side of the hyperplane.

Optimization problem: $\max_{\beta_0,...\beta_p,\epsilon_1,..,\epsilon_n} M \\ s.t. \sum_{j=1}^p \beta_j^2=1, \quad (9.13) \\ y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2},...+\beta_px_{ip})>M(1-\epsilon_i) \quad \forall i=1,..,n. \quad (9.14) \\ \epsilon_i\geq0,\sum_{i=1}^p\epsilon_i \leq C, \quad (9.15)$

• Slack variables: $\epsilon_1,..,\epsilon_n$ - allow individual observations to be on the wrong side of the margin or the hyperplane

• $\epsilon_i=0$: the $i$-th observation is on the correct side of the margin
• $\epsilon_i>0$: the $i$-th observation is on the wrong side of the margin $\Rightarrow$ $i$-th observation violated the margin.
• $\epsilon_i>1$: the $i$-th observation is on the wrong side of the hyperplane
• Classify the test observation based on the sign of $f(x^∗) = \beta_0+\beta_1x_{1}^*+\beta_2x_{2}^*,...+\beta_px_{p}^*$

The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space using kernels.

#### SVM being a large margin classifier, is it influenced by outliers?

Yes, if $C$ is large, otherwise it is not.

#### What is the role of C in SVM?

• Tuning parameter $C$: $C$ bounds the sum of the $\epsilon_i$’s, and so it determines the number and severity of the violations to the margin(and to the hyperplane) that we will tolerate.
• budget for the amount that the margin can be violated by the $n$ observations.
• Generally chosen via cross-validation.
• C controls the bias-variance trade-off of the support vector classifier.

#### What is a kernel in SVM? Why do we use kernels in SVM?

Kernel: Kernel is a function that quantifies the similarity of two observations.

• Linear kernel: $K(x_i,x_{i^{'}})=\sum_{j=1}^px_{ij}x_{i^{'}j}$

• Linear kernel essentially quantifies the similarity of a pair of observations using Pearson (standard) correlation.
• Polynomial kernel of degree d: $K(x_i,x_{i^{'}})=(1+\sum_{j=1}^px_{ij}x_{i^{'}j})^d$

• fitting a support vector classifier in a higher-dimensional space involving polynomials of degree d.
• Radial kernel: $K(x_i,x_{i^{'}})=\exp(-\gamma \sum_{j=1}^p(x_{ij}-x_{i^{'}j})^2)$

• Radial kernel has very local behavior: only nearby training observations have an effect on the class label of a test observation

• If a given test observation $x^∗ = (x^∗_1 . . .x^∗_p)^T$ is far from a training observation $x_i$ in terms of Euclidean distance; $\Rightarrow \sum_{j=1}p(x_{ij}-x_{i{'}j})^2$ will be large $\Rightarrow K(x_i,x_{i^{'}})=\exp(-\gamma \sum_{j=1}^p(x_{ij}-x_{i^{'}j})^2)$ will be very tiny. $\Rightarrow x_i$ will play virtually no role in $f(x∗)$.

Support Vector Machine: When the support vector classifieris combined with a non-linear kernel, the resulting classifier is known as a support vector machine.

$f(x)=\beta_0+\sum_{i \in S}^n \alpha_i K(x,x_i)$

In machine learning, a “kernel” is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem. The kernel function is what is applied on each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.

Advantage of Kernel over enlarging the feature space using functions of the original features:

• Computational: one need only compute $K(x_i,x_{i^{'}})$ for all $\left(\begin{array}{c}n\\ 2\end{array}\right)$ distinct pairs $i, i^{'}$. This can be done without explicitly working in the enlarged feature space.

• Curse of dimensionality: for some kernels, such as the radial kernel, the feature space is implicit and infinite-dimensional.

#### Can we apply the kernel trick to logistic regression? Why is it not used in practice then?

While converting the primal SVM formulation into its dual form (which gives us the kernel version of the SVM), we notice that one of the equations we get is, $\beta=\sum_{i=1}^n \alpha_iy_ix_i$ This, in the kernelized version where the kernel $k$ has an implicit representation for points given by $x \rightarrow \phi(x)$, $\beta=\sum_{i=1}^n \alpha_iy_i\phi(x_i)$ Surprisingly, one can also get this form from the representer theorem . This suggests something general about the classifiers we learn.

Now, let us get back to logistic regression, which is modeled as, $p(y=1|x)=\frac{1}{1+\exp(-\beta^Tx)}$ First of all, let us map the $x$ to the space of implicit representation of the kernel. So, our model in the implicit representation space would look like, $p(y=1|x)=\frac{1}{1+\exp(-\beta^T\phi(x))}$ Next, let us use the form of $\beta$, we observed from SVM and the representer theorem. This will give us, $p(y=1|x)=\frac{1}{1+\exp(-\sum_{i=1}^n\alpha_iy_i\phi(x_i)^T\phi(x))} \\ =\frac{1}{1+\exp(-\sum_{i=1}^n\alpha_iy_iK(x_i,x))}$ This gives us the kernelized logistic regression model.

Why is it not used in practice then? logistic regression with kernels is merely an SVM without maximum margins

#### How does the SVM parameter C affect the bias/variance trade off?

• if $C$ is small: highly fit to the data, fewer support vectors $\Rightarrow$ low bias, high variance;
• if $C$ is large: margin wider, many support vectors $\Rightarrow$ high bias, low variance;

#### Logistic regression vs. SVMs: When to use which one?

SVM Optimization problem: $\max_{\beta_0,...\beta_p,\epsilon_1,..,\epsilon_n} M \\ s.t. \sum_{j=1}^p \beta_j^2=1, \quad (9.13) \\ y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2},...+\beta_px_{ip})>M(1-\epsilon_i) \quad \forall i=1,..,n. \quad (9.14) \\ \epsilon_i\geq0,\sum_{i=1}^p\epsilon_i \leq C, \quad (9.15)$

Rewrite the criterion (9.12)–(9.15) for fitting the support vector classifier $f(X) = \beta_0 + \beta_1X_1 + . . . + \beta_pX_p$ as $\min_{\beta_0,...,\beta_p}\left\{ \sum_{i=1}^n\max[0,1-y_if(x_i)]+\lambda\sum_{j=1}^p\beta_j^2 \right\}$

• $\lambda$ is small: few violations to the margin ; high-variance, low-bias; $\Leftrightarrow$ small $C$;

“Loss + Penalty” form: $\min_{\beta_0,...,\beta_p}\left\{ L(\mathbf{X},\mathbf{y},\beta)+\lambda P(\beta) \right\}$

• $L(\mathbf{X},\mathbf{y},\beta)$: loss function
• $P(\beta)$: penalty function

Ridge regression and the lasso:

$L(\mathbf{X},\mathbf{y},\beta)=\sum_{i=1}^n \left( y_i-\beta_0-\sum_{j=1}^p x_{ij}\beta_j \right)^2 \\ P(\beta) = \sum_{j=1}^p \beta_j^2 \quad \text{ridge} \, \text{regression} \\ P(\beta) = \sum_{j=1}^p |\beta_j| \quad \text{lasso}$

SVM: hinge loss

$L(\mathbf{X},\mathbf{y},\beta)=\sum_{i=1}^n \max[0,1-y_i(\beta_0+\beta_1x_{i1}+,,,+\beta_px_{ip})]$

Optimization problems of linear SVM and (regularized) LR: $\min_\beta \lambda||\beta||^2+\sum_{i=1}^n \max[0,1-y_i(\beta_0+\beta_1x_{i1}+,,,+\beta_px_{ip})] \\ \min_\beta \lambda||\beta||^2+\sum_{i=1}^n \log(1+\exp(1-y_i(\beta_0+\beta_1x_{i1}+,,,+\beta_px_{ip})))$ That is, they only differ in the loss function — SVM minimizes hinge loss while logistic regression minimizes logistic loss.

• Logistic loss diverges faster than hinge loss. So, in general, it will be more sensitive to outliers.
• Logistic loss does not go to zero even if the point is classified sufficiently confidently. This might lead to minor degradation in accuracy.

Main Difference:

• SVM try to maximize the margin between the closest support vectors while LR the posterior class probability. Thus, SVM find a solution which is as fare as possible for the two categories while LR has not this property. • LR is more sensitive to outliers than SVM because the cost function of LR diverges faster than those of SVM. So putting an outlier on above picture would give below picture: • Logistic Regression produces probabilistic values while SVM produces 1 or 0. So in a few words LR makes not absolute prediction and it does not assume data is enough to give a final decision. This maybe be good property when what we want is an estimation or we do not have high confidence into data.

• In order to get discrete values 1 or 0 for the LR we can say that when a function value is greater than a threshold we classify as 1 and when a function value is smaller than the threshold we classify as 0.

## Bayesian Machine Learning #### What are the differences between “Bayesian” and “Freqentist” approach for Machine Learning?

The Bayesian approach differs from the frequentist (aka “standard”) method for inference in its use of a prior distribution to express the uncertainty present before seeing the data, and to allow the uncertainty remaining after seeing the data to be expressed in the form of a posterior distribution.

Given a specific set of data, the frequentist believes that there is a true, underlying distribution from which said data was generated. The inability to get the exact parameters is a function of finite sample size. The Bayesian, on the other hand, think that we start with some assumption about the parameters (even if unknowingly) and use the data to refine our opinion about those parameters.

• The Bayesian probability measures a “degree of belief”. It pretty much matches our every-day intuitive understanding of probability,
• The frequentists can assign probabilities only to events/obervations that come from repeatable experiments. With ”probability of an event” they mean the relative frequency of the event occuring in an infinitively long series of repetitions. For instance, when a frequentists says that the probability for “heads” in a coin toss is $0.5$ ($50\%$) he means that in infinititively many such coin tosses, $50\%$ of the coins will show “head”.

Frequentist: best suited to falsify a hypothesis. Bayesian: best suited to (re)allocate the credibility of a statement

Downsides of Frequentists

• Frequentists approach relies on data more than Bayesian as we totally ignore our knowledge or logical thinking which have been introduced in a form of prior probability.
• P-value does not provide the probability of your hypothesis to be collect. It only avoids the most extreme value that seems to be rare. It sometimes make the situation difficult as you may find it challenging to explain the actual meaning of value. Whereas, the posterior probability describes in percentage how likely your hypothesis is correct based on our prior knowledge.

Downsides of Bayesians

• Bayesian Statistic requires more mathematical knowledge since the formula requires us to deduce two probability distributions.
• What if your prior has become meaningless as the logic we have is no longer valid? (Some articles suggest that the prior at early stage can be any number as it can be updated as more information comes in)

#### Compare and contrast maximum likelihood and maximum a posteriori estimation.

maximum likelihood estimation (MLE):

Likelihood function: $P(X|\theta)$

MLE for $\theta$, the parameter we want to infer:

$\theta_{MLE}=\arg \max_\theta P(X|\theta) \\ =\arg \max_\theta \prod_i P(x_i|\theta)$

As taking a product of some numbers less than $1$ would approaching $0$ as the number of those numbers goes to infinity, it would be not practical to compute, because of computation underflow. Hence, we will instead work in the log space, as logarithm is monotonically increasing, so maximizing a function is equal to maximizing the log of that function.

$\theta_{MLE}=\arg \max_\theta \log P(X|\theta) \\ =\arg \max_\theta \log \prod_i P(x_i|\theta) \\ =\arg \max_\theta \sum_i \log P(x_i|\theta)$

To use this framework, we just need to derive the log likelihood of our model, then maximizing it with regard of $\theta$ using our favorite optimization algorithm like gradient descent.

maximum a posteriori estimation (MAP):

MAP usually comes up in Bayesian setting. Because, as the name suggests, it works on a posterior distribution, not only the likelihood.

Bayes’ rule: $\theta_{MAP}=\arg \max_\theta P(X|\theta) P(\theta) \\ =\arg \max_\theta \log P(X|\theta) P(\theta) \\ =\arg \max_\theta \log \prod_i P(x_i|\theta) P(\theta)\\ =\arg \max_\theta \sum_i \log P(x_i|\theta)P(\theta)$

Comparing both MLE and MAP equation, the only thing differs is the inclusion of prior $P(\theta)$ in MAP. What it means is that, the likelihood is now weighted with some weight coming from the prior.

Let’s consider what if we use the simplest prior in our MAP estimation, i.e. uniform prior. This means, we assign equal weights everywhere, on all possible values of the $\theta$. For example ,our prior $P(\theta)$ is $\frac{1}{6}$

$\theta_{MAP} =\arg \max_\theta \sum_i \log P(x_i|\theta)P(\theta) \\ = \arg \max_\theta \sum_i \log P(x_i|\theta) const \\ = \arg \max_\theta \sum_i \log P(x_i|\theta) \\ = \theta_{MLE}$

Comparison:

What we could conclude then, is that MLE is a special case of MAP, where the prior is uniform

#### How does Bayesian methods do automatic feature selection?

One type of Bayesian method is Bayesian inference and feature selection has to do with $L^1$ regularization because it is used extensively for this purpose. So, for $L^1$ regularization, the penalty $\alpha \Omega(w)=\alpha \sum_i |w_i|$ used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution.

$L^1$ regularization finds the specific subset of the available features to be used.

#### What do you mean by Bayesian regularization?

We usually use regularization to guard against overfitting (or a network having “high variance”). One of the techniques to reduce variance and improve generalization is to apply weight decay and weight constraints. If we manage to trim the growing weights on a Neural Network to some meaningful degree, then we can control the variance of the network and avoid overfitting.

We want to be able to use Bayesian inference in such a way that the weight distribution is made optimal to learn the correct function that relevantly maps the input to the output. In likelihood terms, we can also state that we want to find the weight vectors that maximizes the log probability density towards a correct answer. Minimizing the squared error is the same as maximizing the log probability density of the correct answer. This is called Maximum Likelihood Estimation (MLE).

If we want to use Bayesian inference to regularize the maximum likelihood, the solution would be to apply Maximum A Posteriori estimation (or MAP). MAP tries to find the mode of the posterior distribution by employing Bayes’s Theorem. So for Neural Networks, this can be written as:

$P(\mathbf{w} | \mathcal{D}) = \frac{P(\mathbf{w})P(\mathcal{D} | \mathbf{w})}{\int P(\mathbf{w})P(\mathcal{D} | \mathbf{w})}$

Where, $P(\mathbf{w} | \mathcal{D})$ is the posterior probability of the weight vector $\mathbf{w}$ given the training data set $\mathcal{D}$, $P(\mathbf{w})$ is the prior probability of the weight vector, $P(\mathcal{D} | \mathbf{w})$ is the probability of the observed data given weight vector $\mathbf{w}$, and the denominator is the integral of all possible weight vectors.

We can convert the above equation to a cost function again applying the negative log likelihood as follows:

$\text{cost} = - \log P(\mathbf{w} | \mathcal{D}) \\ - \log P(\mathbf{w} | \mathcal{D}) = -\log P(\mathbf{w} | \mathcal{D}) = -\log P(\mathbf{w}) - \log P( \mathcal{D} | \mathbf{w}) + \log P(\mathcal{D}) \\ \text{where }P(\mathcal{D}) \text{ is the integral}$

Here, $P(\mathcal{D})$ is an integral over all possible weights and hence $\log P(\mathcal{D})$ converts to some constant. From the Maximum Likelihood, we already know the equation for $\log P(\mathcal{D} | \mathbf{w})$

$\log P(\mathbf{w})$ is the log probability of the prior weights. This is based on how we initialize the weights. One of the best ways to do this is to sample from a zero-mean Gaussian distribution.

$P(\mathbf{w}) = \frac{1}{\sigma \sqrt{2 \pi}} e^\frac{-\mathbf{w}^2}{2\sigma_\mathbf{w}^2}$

$-\log P(\mathbf{w}) = K + \frac{\mathbf{w}^2}{2\sigma_\mathbf{w}^2} \log e$

So, the Bayesian Inference for MAP is as follows:

$\text{cost} = - \log P(\mathbf{w} ) - \log P(\mathcal{D} | \mathbf{w}) +\log P( \mathcal{D}) \\ = \frac{1}{2\sigma_\mathbf{w}^2} \sum_i \mathbf{w}_i^2 + \frac{1}{2\sigma_\mathcal{D}^2} \sum_c (t_c - y_c)^2 + K \\ = \frac{1}{2\sigma_\mathbf{w}^2} \sum_i \mathbf{w}_i^2 + \frac{1}{2\sigma_\mathcal{D}^2} \sum_c (t_c - y_c)^2 \\ = \frac{1}{\sigma_\mathcal{D} ^2}\begin{bmatrix} \frac{\sigma_\mathcal{D} ^2}{2\sigma_\mathbf{w}^2} \sum_i \mathbf{w}_i^2 + E \end{bmatrix}$

$\text{cost approximation} = \begin{bmatrix} \frac{-\mathbf{w}^2}{2\sigma_\mathbf{w}^2} \sum_i \mathbf{w}_i^2 + E \end{bmatrix}$

This cost approximation is similar to the loss function in $L_2$ regularization. Also note that we started we a randomly initialized zero-mean-gaussian weight vector for MAP and then started working towards fixing it to improve $P(\mathbf{w} | \mathcal{D})$. This has the same side-effect as $L_2$ regularizers which can get stuck in local minima.

We take the MAP approach because a full bayesian approach over all possible weights is computational intensive and is not tractable. There are tricks with MCMC which can help approximate a unbiased sample from true posteriors over the entire weights.

#### When will you use Bayesian methods instead of Frequentist methods?

Bayesian methods are ideal in situations where you have a very small dataset, especially when each instance in that dataset has a large number of features.

## Regularization #### What is $L_1$ regularization?

$L_1$ lasso penalty: $\sum_{j=1}^p |\beta_j|$

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, $L_1$ regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model.

#### What is $L_2$ regularization?

$L_2$ ridge penalty: $\sum_{j=1}^p\beta_j^2$