Machine Learning Fundamentals
Fundamentals of all types of machine learning, deep learning or otherwise
 UPDATED
This post is Part 2 of the 4part Machine Learning Research Interview Handbook series (you can see the rest of the series here).
In this part of the series we go into the fundamentals of machine learning systems. After all, there’s more than just neural networks, and you should ideally be able to understand the general rules and properties that hold for all frameworks.
Table of Contents
 Describe bias and variance with examples.
 What is Empirical Risk Minimization?
 What is Hoeffding’s inequality?
 Write the formulae for training error and generalization error. Point out the differences.
 State the uniform convergence theorem.
 What is sample complexity bound of uniform convergence theorem?
 What is error bound of uniform convergence theorem?
 What is the biasvariance tradeoff theorem?
 From the biasvariance tradeoff, can you derive the bound on training set size?
 What is the VC dimension?
 What does the training set size depend on for a finite and infinite hypothesis set? Compare and contrast.
 What is the VC dimension for an ndimensional linear classifier?
 How is the VC dimension of a SVM bounded although it is projected to an infinite dimension?
 Considering that Empirical Risk Minimization is a NPhard problem, how does logistic regression and SVM loss work?
 Why are model selection methods needed?
 How do you do a tradeoff between bias and variance?
 What are the different attributes that can be selected by model selection methods?
 Why is crossvalidation required?
 Describe different crossvalidation techniques.
 What is holdout cross validation? What are its advantages and disadvantages?
 What is kfold cross validation? What are its advantages and disadvantages?
 What is leaveoneout cross validation? What are its advantages and disadvantages?
 Why is feature selection required?
 Describe some feature selection methods.
 What is forward feature selection method? What are its advantages and disadvantages?
 What is backward feature selection method? What are its advantages and disadvantages?
 What is filter feature selection method and describe two of them?
 What is mutual information and KL divergence?
 What are Cross Entropy and KL divergence? Describe KL divergence intuitively.
 Describe the curse of dimensionality with examples.
 What is local constancy or smoothness prior or regularization?
Universal approximation of neural networks
 State the universal approximation theorem? What is the technique used to prove that?
 What is a Borel measurable function?
 What is the mathematical motivation of Deep Learning as opposed to standard Machine Learning techniques?
 In standard Machine Learning vs. Deep Learning, how is the order of number of samples related to the order of regions that can be recognized in the function space?
 What are the reasons for choosing a deep model as opposed to shallow model?
 How does Deep Learning tackle the curse of dimensionality?
 How can the SVM optimization function be derived from the logistic regression optimization function?
 What is a large margin classifier?
 Why SVM is an example of a large margin classifier?
 SVM being a large margin classifier, is it influenced by outliers?
 What is the role of C in SVM?
 In SVM, what is the angle between the decision boundary and theta?
 What is the mathematical intuition of a large margin classifier?
 What is a kernel in SVM? Why do we use kernels in SVM?
 What is a similarity function in SVM? Why it is named so?
 How are the landmarks initially chosen in an SVM? How many and where?
 Can we apply the kernel trick to logistic regression? Why is it not used in practice then?
 What is the difference between logistic regression and SVM without a kernel?
 How does the SVM parameter C affect the bias/variance trade off?
 How does the SVM kernel parameter sigma² affect the bias/variance trade off?
 Can any similarity function be used for SVM?
 Logistic regression vs. SVMs: When to use which one?
 What are the differences between “Bayesian” and “Freqentist” approach for Machine Learning?
 Compare and contrast maximum likelihood and maximum a posteriori estimation.
 How does Bayesian methods do automatic feature selection?
 What do you mean by Bayesian regularization?
 When will you use Bayesian methods instead of Frequentist methods?
 What is $L_1$ regularization?
 What is $L_2$ regularization?
 Compare $L_1$ and $L_2$ regularization.
 Why does $L_1$ regularization result in sparse models?
 What is dropout?
 How will you implement dropout during forward and backward pass?
Evaluation of Machine Learning systems
 What are What are accuracy, sensitivity, specificity, ROC, AUC, Confusion matrix, F1Score?
 What are precision and recall?
 Describe ttest in the context of Machine Learning.
 Describe the kmeans algorithm.
 What is distortion function? Is it convex or nonconvex?
 Tell me about the convergence of the distortion function.
 What is the EM algorithm
 What is the Gaussian Mixture Model?
 Describe the EM algorithm intuitively.
 What are the two steps of the EM algorithm
 Compare Gaussian Mixture Model and Gaussian Discriminant Analysis.
 Why do we need dimensionality reduction techniques?
 What do we need PCA and what does it do?
 What is the difference between logistic regression and PCA?
 What are the two preprocessing steps that should be applied before doing PCA?
Basics of Natural Language Processing
 What is WORD2VEC?
 What is tSNE?
 Why do we use PCA instead of tSNE?
 What is sampled softmax?
 Why is it difficult to train a RNN with SGD?
 How do you tackle the problem of exploding gradients?
 What is the problem of vanishing gradients?
 How do you tackle the problem of vanishing gradients?
 Explain the memory cell of a LSTM.
 What type of regularization do one use in LSTM?
 What is Beam Search?
 How to automatically caption an image?
Machine Learning
Learning Theory
Describe bias and variance with examples.
Variance: refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. More flexible statistical methods have higher variance.
 Explanation: different training data sets will result in a different $\hat{f}$. But ideally the estimate for $f$ should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in $\hat{f}$
Bias: refers to the error that is introduced by approximating a reallife problem, which may be extremely complicated, by a much simpler model.
 Explanation: As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases.
Decomposition: The expected test MSE, for a given value $x_0$ can always be decomposed into the sum of three fundamental quantities: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error variance terms $\epsilon$.
$E(y_0\hat{f}(x_0))^2=\text{Var}(\hat{f}(x_0))+[\text{Bias}(\hat{f}(x_0))]^2+\text{Var}(\epsilon)$
The overall expected test MSE can be computed by averaging $E(y_0\hat{f}(x_0))^2$ over all possible values of $x_0$ in the test set.
What is Empirical Risk Minimization?
ERM: the function that minimizes loss on the training set.
In many machine learning task, we have data $Z$ from some distribution $p$ and the task is to minimize the risk: $R_(f)=E_{Z \sim p}[\text{loss}(f(Z),Z)]$ Loss function: In classification $Z=(X,Y)$ and we use 0/1 loss $\text{loss}(f(Z), Z) = I_{f(X) \neq Y}$, in regression $Z=(X,Y)$ and we use squared error $\text{loss}(f(Z), Z) = (f(X)  Y )^2$ and in density estimation $Z=X$ and we use negative log likelihood loss $\text{loss}(f(Z), Z) =  \log f(X)$. We are interested in finding the optimal predictor $f^*=\arg\min_f R(f)$ In practice, we compute the empirical risk: $\hat{R}(f)=\frac{1}{n} \sum_{i=1}^n \text{loss}(f(X_i),Y_I)$ We choose the $\hat{f}$ that minimizes the empirical risk over some class $F$, such as parametric models, histogram classifiers, decision trees or linear/polynomial functions, etc. $\hat{f}^{ERM}=\arg \min_{f \in F} \hat{R}(f)$
What is Hoeffding’s inequality?
Let $Z_1, ..., Z_n$ be $n$ number of random independent, identically distributed random variables, such that $0 \leq Z_i \leq 1$. Then, $P\left ( \left  \frac{1}{n} \sum_{i=1}^n Z_i  E[Z] \right  > \epsilon \right ) \leq \delta = 2 \exp(2n \epsilon^2)$
Basically, we have a bunch of variables $Z_i$. We know that when we average a bunch of them up, we should usually get something close to the expected value $E[Z]$. Hoeffding quantifies “usually” and “close” for us
What the Hoeffding Inequality gives us is a probabilistic guarantee that $v$ doesn’t stray too far from $E[Z]$. $\epsilon$ is some small value which we use to measure the average deviation of $Z_1, ..., Z_n$ from $E[Z]$. We claim that the probability of $Z_i$ being more than $\epsilon$ away from $E[Z]$ is less than or equal to some bound which shrinks exponentially as $\epsilon$ and/or our sample size $n$ increases. In other words, the larger your sample size and the wider your margin of error, the less likely you are to step over that margin of error with your best guess.
Write the formulae for training error and generalization error. Point out the differences.
Generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice. [DL Book]
Training error: when training a machine learning model, we have access to a training set, we can compute some error measure on the training set called the training error, and we reduce this training error.
In our linear regression example, we trained the model by minimizing the training error:
$\frac{1}{m^{(train)}}\mathbf{X}^{(train)}\mathbf{w}\mathbf{y}^{(train)}^2_2$
but we actually care about the test error:
$\frac{1}{m^{(test)}}\mathbf{X}^{(test)}\mathbf{w}\mathbf{y}^{(test)}^2_2$
Error v.s. Loss v.s. Risk:
Error is the difference between the actual / true value ($\hat{\theta}$) and the predicted / estimated value ($\hat{\theta}$). $Error=\theta\hat{\theta}$ Loss (𝐿) is a measurement of how well our model perform against your training data.

Regression Losses
 Mean Square Error/Quadratic Loss/$L_2$ Loss $\text{MSE}=\frac{\sum_{i=1}^n(y_i\hat{y}_i)^2}{n}$
 Mean Absolute Error/$L_1$ Loss $\text{MAE}=\frac{\sum_{i=1}^ny_i\hat{y}_i}{n}$
 Mean Bias Error $\text{MBE}=\frac{\sum_{i=1}^n(y_i\hat{y}_i)}{n}$

Classification Losses
 Hinge Loss/Multi class SVM Loss $\text{SVM Loss}=\sum_{j \neq y_i} \max(0,s_js_{y_i}+1) \\ L(\mathbf{X},\mathbf{y},\beta)=\sum_{i=1}^n \max[0,1y_i(\beta_0+\beta_1x_{i1}+,,,+\beta_px_{ip})]$
 Cross Entropy Loss/Negative Log Likelihood $\text{CrossEntropyLoss}=\left[y_i\log(\hat{y_i})+(1y_i)\log (1y_i)\right]$
Risk is the average measure of loss, or expected loss, across your whole data distribution.
$R(\theta,\hat{\theta})=E(L(\theta,\hat{\theta}))$
Empirical Risk: when we train our model, we do not have the full distribution of the data. This may be because some of our data is used for validation and testing, or that new data points are produced in realtime. The best we can do is to pick our training data in a random way and assume that our training data is representative of the real data.
Therefore, because we don’t have all the data, the best we can do is to minimize the empirical risk, from data that we do have (our training data), and use regularization techniques to generalize (i.e. avoid overfitting). This is why minimizing loss and minimizing empirical risk are roughly the same thing.
State the uniform convergence theorem.
A sequence $\{f_n\}_{n=1}^{\infty}$ of realvalued functions on a set $X, f_n : X \to \mathbb{R}$, is said to be uniformly convergent on $X$ to a limit function $f : X \to \mathbb{R}$ if for each $\epsilon > 0$ there exists an $\mathbb{N} \in N$ such that $f(x)  f_n(x) < \epsilon$ holds for all $n \geq N$ and $x \in X$.
Uniform convergence simplifies certain calculations, for instance by interchanging the integral and the limit sign in integration.
Difficulties which arise when the convergence is pointwise but not uniform can be seen in the example of the non Riemann integrable indicator function of rational numbers in $[0,1]$ and provide partial explanations of some other anomalies such as the Gibbs phenomenon. Many theorems of functional analysis use uniform convergence in their formulation, such as the Weierstrass approximation theorem and some results of Fourier analysis. Uniform convergence can be used to construct a nowheredifferentiable continuous function.
What is the sample complexity bound of uniform convergence theorem?
For the sample complexity of a uniformly convergent function: Let $F_H = \{f_h  h \in H \}$, where $f_h$ is the loss function for hypothesis $h$.
 $F_H$ has the uniform convergence property. As such, an ERM (Empirical Risk Minimization) algorithm “learns” $H$.
 The sample complexity of learning $H$ is bounded by $m_{F_H}(\eta, \epsilon)$
What is the error bound of uniform convergence theorem?
We measure the quality of a hypothesis by its generalization error, the expected loss of a hypothesis on a new test example. In online learning, we can bound the expected generalization error of online gradient descent using onlinetobatch conversion. This section allows us to derive high probability bounds as well as more general results for hypothesis classes without relying on convexity.
Uniform convergence only gives us upper bounds, so we can’t directly compare the generalization error of two algorithms. Also, it is worst case over all distributions p∗, so it lacks sensitivity to the exact problem structure.
What is the biasvariance tradeoff theorem?
If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand if our model has large number of parameters then it’s going to have high variance and low bias. So we need to find the right/good balance without overfitting and underfitting the data. $\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
Underfitting happens when a model unable to capture the underlying pattern of the data $\Rightarrow$ high bias, low variance.
Overfitting happens when our model captures the noise along with the underlying pattern in data. It happens when we train our model a lot over noisy dataset $\Rightarrow$ low bias and high variance.
From the biasvariance tradeoff, can you derive the bound on training set size?
Increasing training data size leads to decreased variance. Decreasing the variance leads to an increase the bias. So, increase the training data size leads to decreasing variance and increasing bias. From this, it’s possible to derive a training set size at $\min(\text{Bias}^2 + \text{Variance} + \text{Irreducible Error})$
What is the VC dimension?
In learning theory, the VC dimension is a measure of capacity of a class of hypotheses $\mathcal{H}$ (e.g., a set of classifiers). This notion of capacity indicates how complicated $\mathcal{H}$ is. Although complicated $\mathcal{H}$ may be able to fit well to the dataset at hand, yielding a low training error, there is a possibility that it overfits and gives high generalization error. The VC dimension provides a tool to analyze the generalization error of a class of hypotheses based on how complicated it is, independent of the input distribution, the target function, and the learning algorithm (i.e., a systematic approach to choose the best hypothesis $g \in \mathcal{H}$).
Capacity in VC theory is captured by the concept called shattering. Here we focus only on binary classification problems.
A hypothesis class H is said to shatter $n$ points if there exists a dataset $X=\{x_1,...,x_n\}$ such that for any label assignment $y=(y_1,...,y_n)$ where $y_i \in \{1,+1\}$, there exists a hypothesis $h \in \mathcal{H}$ which can produce $y=h(X)$.
In a simpler terms, we say $H$ shatters $n$ points if there exists a configuration of $X=\{x_1,...,x_n\}$ such that $\mathcal{H}$ can produce all possible $2^n$ assignments of $y$. Things worth noting are
 If $\mathcal{H}$ can produce any assignment of $y$ on just even one configuration of $n$ points, then we say $\mathcal{H}$ can shatter $n$ points. So, when constructing an example, it makes sense to imagine a configuration of points such that $\mathcal{H}$ can shatter easily.
 If $\mathcal{H}$ can shatter $n$ points, then obviously it can shatter less than $n$ points.
 Likewise, if $\mathcal{H}$ cannot shatter n points, then it cannot shatter more than $n$ points.
The VC dimension of $\mathcal{H}$, denoted by $d_{VC}(\mathcal{H})$, is the largest number of points $\mathcal{H}$ can shatter.
As an example, the VC dimension of a linear classifier in twodimensional space is 3. That is, three is the highest number of points a line can produce all possible $\{1,+1\}$ assignments. With four points, there are two cases out of 16 possible assignments a line cannot produce. In general, $d_{VC}(\text{linear classifiers})=d+1$ where $d$ is the input dimension.
The VC dimension can be used to bound probabilistically the difference between the training and test errors. This result is known as VC inequality.
What does the training set size depend on for a finite and infinite hypothesis set? Compare and contrast.
Most of the tools in machine learning are designed to make better use data using $P_{AC}$ (Probably Approximately Correct), first introduced by Professor Valiant. The $P_{AC}$ analyses assume that the true answer/concept is in the given hypothesis space $\mathcal{H}$. A machine learning algorithm $L$ with hypothesis space $\mathcal{H}$ is one that, given a training data set $D$, will always return a hypothesis $\mathcal{H}$ consistent with $D$ if one exists, otherwise it will indicate that no such hypothesis exists. In a finite machine learning hypothesis $\mathcal{H}$ does not have polynomial sample complexity. If $\mathcal{H}$ has polynomial sample complexity it is called infinite hypothesis.
What is the VC dimension for an ndimensional linear classifier?
Let d be the dimension of the input data and $\mathcal{H} = \{f ~~ f(\boldsymbol{x}) = \text{sign}(\boldsymbol{w}^\top \boldsymbol{x} + b), b \in \mathbb{R} \}$ be the set of all linear classifiers. $d_{VC}(\mathcal{H}) \geq d+1.$ Proof We need to prove that $\mathcal{H}$ can shatter at least $d+1$ points. That is, it suffices to show that there exists a set of $d+1$ points such that $\mathcal{H}$ can produce any prespecified $\{1,+1\}$ assignment $y=(y_1,...,y_d+1)$. Let $\boldsymbol{X} = \left[ \begin{array}{c} \boldsymbol{x}_1^\top \\ \vdots \\ \boldsymbol{x}_{d+1}^\top \end{array} \right] = \left[ \begin{array}{ccccc} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{array} \right] \in \mathbb{R}^{d+1 \times d+1}$
The first coordinate of each $x_i$ is 1 to allow the bias term $b$ to be produced. Note that $X$ is invertible. With this definition, for any $y$ we can always have $sign(Xw)=y$ by choosing $\boldsymbol{w} = \boldsymbol{X}^{1} \boldsymbol{y}$ since $Xw=y$ implies sign $\boldsymbol{X}\boldsymbol{w} = \boldsymbol{y}$. $d_{VC}(\mathcal{H}) \leq d + 1.$ Proof We need to show that no $d+2$ points can be shattered by $\mathcal{H}$. That is to show that there exists a certain assignment $\boldsymbol{y} \in \mathbb{R}^{d+2}$ not achievable by $\mathcal{H}$. We will construct one such assignment.
Assume we have $d+2$ points $x_1,...,x_{d+2}$ in $(d+1)$dimensional space. Then, the set is linearly dependent. So, there exists $\boldsymbol{x}_j = \sum_{i\neq j} a_i \boldsymbol{x}_i$ such that not all $a_i$ are $0$. Set $y_j=1$ and $y_i = \text{sign}(a_i) = \text{sign}(\boldsymbol{w}^\top \boldsymbol{x}_i)$. If $a_i=0$, $y_i$ can be arbitrary. $\text{sign}(\boldsymbol{w}^\top \boldsymbol{x}_j) = \text{sign}(\sum_{i \neq j} a_i \boldsymbol{w}^\top \boldsymbol{x}_i) > 0 \neq y_j$ Since there exists an assignment of d+2 points not achievable by H, we can say that $d_{VC}(\mathcal{H}) \leq d + 1$.
$d_{VC}(\mathcal{H}) = d + 1$
Proof Combining the two lemmas.
How is the VC dimension of a SVM bounded although it is projected to an infinite dimension?
Choose some $\eta$ between 0 and 1. Vapnik (1995) showed that with probability $1 \eta$: $R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h)+1)\log(\eta/4)}{l}}$
 $h$ is VC dimension and is a measure of the capacity or complexity of the machine.
 Note the bound is independent of $P(x,y)$
 If we know $h$, can readily compute the RHS. This provides a principled way to choose a learning machine.
Finding VC dimensions of machines with different kernels is nontrivial. Some kernels (e.g. RBF) have infinite VC dimension but still work well in practice.
It is possible to derive a bound based on the margin $\left ( \sum^N_{i=1} \alpha_i \right )^{1/2}$ and the “radius” $R=\max(K(x_i, x_i))$ but the bound tends to be unrealistic.
Considering that Empirical Risk Minimization is an NPhard problem, how does logistic regression and SVM loss work?
Both logistic regression and SVMs can be seen under an empirical risk minimization light, where one is interested in minimizing the following function with regards to the coefficients $w$: $L(x,y,w)=\sum_i^n f(x_i,y_i,w)+R(w)$ When $f \equiv \text{hinge loss}$ and $R(w) =w^Tw$ we have the SVM, and when $f \equiv \text{logistic loss}$ and $R(w) =0$ we have logistic regression.
So you can treat the loss and regularization as parameters of a model, which is empirical risk minimization.
Most, if not all, linear (kernelized/regularized) models can be put under this umbrella as well, it’s a really broad definition.
Of course, the SVM and logistic regression can be seen as different models, because they have different functional forms, having completely different specialized techniques for optimization.
Model and feature selection
Why are model selection methods needed?
 Model Interpretability: Irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables—that is, by setting the corresponding coefficient estimates to zero—we can obtain a model that is more easily interpreted.
 We want to estimate the generalization performance, the predictive performance of our model on future (unseen) data.
 We want to increase the Prediction Accuracy by tweaking the learning algorithm and selecting the best performing model from a given hypothesis space.
 We want to identify the machine learning algorithm that is bestsuited for the problem at hand; thus, we want to compare different algorithms, selecting the bestperforming one as well as the best performing model from the algorithm’s hypothesis space.
 Fast (to train and test)
 Scalable (it can be applied to a large dataset)
The idea of model selection method is intuitive. It answers the following question:
How to select the right input variables for an optimal model?
An Optimal model is a model that fits the data with best values for the evaluation metrics.
(Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used.)
How do you do a tradeoff between bias and variance?
Shrinkage methods: By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.
Bagging and other resampling techniques can be used to reduce the variance in model predictions.
What are the different attributes that can be selected by model selection methods?
Model selection can be used (at least indirectly) to select for attributes like model robustness, additional engineered features, data transformation steps, feature filtering steps, and data split
Some of the attributes that can be seleted include, but are not limited to:
 Accuracy & Loss metrics
 Akaike Information Criterion (AIC)
 Bayesian Information Criterion (BIC)
 Minimum Description Length (MDL)
 Structural Risk Minimization (SRM)
Because of how tightly linked and entangled the
Why is crossvalidation required?
Crossvalidation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
Cross Validation is a very useful technique for assessing the effectiveness of your model, particularly in cases where you need to mitigate overfitting. You need some kind of assurance that your model has got most of the patterns from the data correct, and its not picking up too much on the noise, or in other words its low on bias and variance.
Evaluation of residuals only gives us an idea about how well our model does on data used to train it. So, the problem with this evaluation technique is that it does not give an indication of how well the learner will generalize to an independent/ unseen data set. Getting this idea about our model is known as Cross Validation.
It is also of use in determining the hyper parameters of your model, in the sense that which parameters will result in lowest test error.
Describe different crossvalidation techniques.
 Holdout cross validation
 Kfold cross validation
 Leaveoneout cross validation
What is holdout cross validation? What are its advantages and disadvantages?
Holdout cross validation:
 Randomly dividing the available set of observations into two parts, a training set and a validation set or holdout set.
 The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
 The resulting validation set error rate—typically assessed using MSE in the case of a quantitative response—provides an estimate of the test error rate.
Advantage: this method doesn’t take any overhead to compute and is better than traditional validation
Disadvantage:
 High variance: because it is not certain which data points will end up in the validation set and the result might be entirely different for different sets.
 Removing a part of it for validation poses a problem of underfitting
What is kfold cross validation? What are its advantages and disadvantages?
Kfold cross validation:
 Randomly kfold CV dividing the set of observations into k groups, or folds, of approximately equal size.
 The first fold is treated as a validation set, and the method is fit on the remaining k  1 folds.
 The mean squared error, MSE1, is then computed on the observations in the heldout fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set.
 This process results in k estimates of the test error, MSE1,MSE2, … ,MSEk.
 The kfold CV estimate is computed by averaging these values, $CV_{(k)}=\frac{1}{k}\sum_{i=1}^kMSE_i$ Advantage: significantly reduces bias as we are using most of the data for fitting, and also significantly reduces variance as most of the data is also being used in validation set.
Disadvantage: the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation.
What is leaveoneout cross validation? What are its advantages and disadvantages?
Leaveoneout cross validation: leaves $1$ data points out of training data as validation set. The statistical learning method is fit on the $n  1$ training observations, and a prediction $y_1$ is made for the excluded observation, using its value $x_1$. and then the error is averaged for all trials, to give overall effectiveness.
Since $(x_1,y_1)$ was not used in the fitting process, $MSE_1 =(y_1  \hat{y}_1)^2$ provides an approximately unbiased estimate for the test error. But even though $MSE_1$ is unbiased for the test error, it is a poor estimate because it is highly variable, since it is based upon a single observation $(x_1,y_1)$.
Advantages and disadvantages:
 kFold more biased than LOOCV; kFold less variance than LOOCV
 When we perform LOOCV, we are in effect averaging the outputs of n fitted models, each of which is trained on an almost identical set of observations; therefore, these outputs are highly (positively) correlated with each other.
 very expensive to compute
Why is feature selection required?
 Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
 Improves Accuracy: Less misleading data means modeling accuracy improves.
 Reduces Training Time: fewer data points reduce algorithm complexity and algorithms train faster.
Describe some feature selection methods.

Filter Methods: apply a statistical measure to assign a scoring to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset.
 distance metrics, correlation, mutual information, and consistency metrics

Wrapper Methods: consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.
 Best Subset, forward and backward, recursive feature elimination

Embedded Methods: learn which features best contribute to the accuracy of the model while the model is being created.
 regularization methods: LASSO, Elastic Net and Ridge Regression
What is forward feature selection method? What are its advantages and disadvantages?
Forward stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the ﬁt.
algorithm:
 Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
 for $k=0,...,p1$:
 Consider all $pk$ models that audment the predictors in $\mathcal{M}_k$ with one additional predictor
 Choose the best among these $pk$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having smallest RSS or highest $R^2$.
 Select a single best model from among $\mathcal{M}_0,...,\mathcal{M}_p$ using crossvalidated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$
Advantages and Disadvantages:
 Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence
 Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias
What is backward feature selection method? What are its advantages and disadvantages?
Backwardstepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the ﬁt. The candidate for dropping is the variable with the smallest Zscore
algorithm:
 Let $\mathcal{M}_p$ denote the full model, which contains all $p$ predictors.
 for $k=p, p1,...,1$:
 Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k1$ predictors.
 Choose the best among these $k$ models, and call it $\mathcal{M}_{k1}$. Here best is defined as having smallest RSS or highest $R^2$.
 Select a single best model from among $\mathcal{M}_0,...,\mathcal{M}_p$ using crossvalidated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$
Advantages and Disadvantages:
 Backwardstepwise selection can only be used when $N>p$, while forward stepwise can always be used.
 Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the $p$ predictors.
What is filter feature selection method and describe two of them?
Filter Methods: apply a statistical measure to assign a scoring to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset.
 Chisquared: Chisquare test is used for categorical features in a dataset. We calculate Chisquare between each feature and the target and select the desired number of features with best Chisquare scores. to evaluate how likely it is that any observed difference between the sets arose by chance or if the association between two categorical variables of the sample would reflect their real association in the population.
 Correlation: The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: “Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other

Entropy: Entropy measures the amount of information in a random variable; it’s the average length of the message needed to transmit an outcome of that variable using the optimal code.
 Information content: $I(E)=\log [\Pr(E)]=\log (P)$
 Define the entropy as the expected value of information: $H(X)=E[I(X)]=E[\log (P(X))]=\sum_{i=1}^nP(x_i)\log (P(x_i))$
 Oneattributerule(OneR): The idea of the OneR (oneattributerule) algorithm is to find the one attribute to use that makes fewest prediction errors.
What is mutual information and KL divergence?
What are Cross Entropy and KL divergence? Describe KL divergence intuitively.
Cross entropy is, at its core, a way of measuring the “distance” between two probability distributions P and Q. As you observed, entropy on its own is just a measure of a single probability distribution. As such, if we are trying to find a way to model a true probability distribution, P, using, say, a neural network to produce an approximate probability distribution Q, then there is the need for some sort of distance or difference measure which can be minimized.
Cross entropy function: $H(p,q)=H(p)+D_{KL}(pq)$
 The first term, the entropy of the true probability distribution p, during optimization is fixed – it reduces to an additive constant during optimization.
 Only the parameters of the second, approximation distribution, q that can be varied during optimization – and hence the core of the cross entropy measure of distance is the KL divergence function.
KL divergence is an expression of “surprise” – under the assumption that P and Q are close, it is surprising if it turns out that they are NOT CLOSE, hence in those cases the KL divergence will be high. If they are CLOSE together, then the KL divergence will be low.
KL divergence is the information gained when we move from a prior distribution Q to a posterior distribution P.
Derivation of KL divergence:
The expression for KL divergence can also be derived by using a likelihood ratio approach.
The likelihood ratio $LR=\frac{p(x)}{q(x)}$
 Interpretation: if a value x is sampled from some unknown distribution, the likelihood ratio expresses how much more likely the sample has come from distribution $p$ than from distribution $q$. If it is more likely from $p$, the $LR > 1$, otherwise if it is more likely from $q$, the $LR < 1$.
Let’s say we have lots of independent samples and we want to estimate the likelihood function taking into account all this evidence – it then becomes: $LR=\prod_{i=1}^n\frac{p(x_i)}{q(x_i)}$ If we convert the ratio to $\log$, it’s possible to turn the product in the above definition to a summation: $LR=\sum_{i=1}^n\log \frac{p(x_i)}{q(x_i)}$ So now we have the likelihood ratio as a summation. Let’s say we want to answer the question of how much, on average, each sample gives evidence of $p(x)$ over $q(x)$. To do this, we can take the expected value of the likelihood ratio and arrive at: $D_{KL}(PQ)=\sum_{i=1}^n p(x_i)\log \frac{p(x_i)}{q(x_i)}$ The expression above is the definition of KL divergence. It is basically the expected value of the likelihood ratio – where the likelihood ratio expresses how much more likely the sampled data is from distribution P instead of distribution Q. Another way of expressing the above definition is as follows (using log rules): $D_{KL}(PQ)=\sum_{i=1}^n p(x_i)\log p(x_i)\sum_{i=1}^n p(x_i)\log q(x_i)$
 The first term in the above equation is the entropy of the distribution P. As you can recall it is the expected value of the information content of P.
 The second term is the information content of Q, but instead weighted by the distribution P.
 This yields the interpretation of the KL divergence to be something like the following: if P is the “true” distribution, then the KL divergence is the amount of information “lost” when expressing it via Q.
Cross entropy:
$H(p,q) =H(p)+D_{KL}(pq) \\ = \sum_{i=1}^nP(x_i)\log (P(x_i))+\sum_{i=1}^n p(x_i)\log p(x_i)\sum_{i=1}^n p(x_i)\log q(x_i) \\= \sum_{i=1}^n p(x_i)\log q(x_i)$
If we have two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable x, we can measure how different these two distributions are using the KullbackLeibler (KL) divergence: $D_{KL}(PQ)=E_{x \sim P}[\log \frac{P(x)}{Q(x)}]=E_{x \sim P}[\log P(x) \log Q(x)]$
The KL divergence has many useful properties
 Nonnegative. The KL divergence is $0$ if and only if $P$ and $Q$ are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables.
 Asymmetric: $D_{KL}(PQ) \neq D_{KL}(QP)$, this asymmetry means that there are important consequences to the choice of whether to use $D_{KL}(PQ)$ or $D_{KL}(QP)$.
Crossentropy: $H(P, Q) = H(P) +D_{KL}(PQ)$ $H(P,Q)=E_{x \sim P} \log Q(x)$ Minimizing the crossentropy with respect to $Q$ is equivalent to minimizing the KL divergence, because $Q$ does not participate in the omitted term.
Curse of dimensionality
Describe the curse of dimensionality with examples.
Curse of dimensionality: as the dimensionality of the features space increases, the number configurations can grow exponentially, and thus the number of configurations covered by an observation decreases.
As the number of feature or dimensions grows, the amount of data we need to generalise accurately grows exponentially.
fun example: It’s easy to hunt a dog and maybe catch it if it were running around on the plain (two dimensions). It’s much harder to hunt birds, which now have an extra dimension they can move in. If we pretend that ghosts are higherdimensional beings
What is local constancy or smoothness prior or regularization?
(See DL Book 5.11.2)
Smoothness prior or local constancy prior: This prior states that the function we learn should not change very much within a small region.
Many simpler algorithms rely exclusively on this prior to generalize well, and as a result they fail to scale to the statistical challenges involved in solving AIlevel tasks.
 KNN, decision trees, local kernel
All of these different methods are designed to encourage the learning process to learn a function $f^*$ that satisfies the condition $f^*(x)\approx f^*(x+\epsilon)$ In other words, if we know a good answer for an input $x$ (for example, if $x$ is a labeled training example) then that answer is probably good in the neighborhood of $x$.
Assuming only smoothness of the underlying function will not allow a learnerto represent a complex function that has many more regions to be distinguished than the number of training examples
Universal approximation of neural networks
State the universal approximation theorem? What is the technique used to prove that?
Universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finitedimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.
The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function.
However, we are not guaranteed that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons.
 The optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function.
 The training algorithm might choose the wrong function due to overfitting
The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be.
What is a Borel measurable function?
Any continuous function on a closed and bounded subset of $R^n$ is Borel measurable and therefore may be approximated by a neural network.
Deep Learning motivation
What is the mathematical motivation of Deep Learning as opposed to standard Machine Learning techniques?
In deep learning (or neural networks) each node $i$ the network creates a linear function. Hence a neural network, which is combination of multiple nodes arranged in multiple layers, is capable of learning complex nonlinear functions. In simplest terms one can think of it as a two dimensional array of logistic regression functions.
In standard Machine Learning vs. Deep Learning, how is the order of number of samples related to the order of regions that can be recognized in the function space?
In standard Machine learning, the order of the regions that can be recognized in the function space is smaller. This effectively means that it’s harder to learn more complex features in the samples. However, this also means that the numbers of samples needed can be orders of magnitude smaller than for deep learning.
For deep learning, the nested linear functions in a neural network allow for much higherorder feature regions to be recognized. The tradeoff is that this often requires many times more samples than are used in standard machine learning.
What are the reasons for choosing a deep model as opposed to shallow model?
In short, “shallow” neural networks is a term used to describe NN that usually have only one hidden layer as opposed to deep NN which have several hidden layers, often of various types.
There are papers that highlight that deep NN with the right architectures achieve better results than shallow ones that have the same computational power (e.g. number of neurons or connections).
The main explanation is that the deep models are able to extract/build better features than shallow models and to achieve this they are using the intermediate hidden layers
How does Deep Learning tackle the curse of dimensionality?
The curse of dimensionality normally comes about because in data there are relevant and too many irrelevant (noise) features. The neurons in deep learning (DL) architectures, use lots of data in order to model a problem and thereby a DL system reduces the influence of irrelevant features while increasing the influence of relevant features during learning.
Let me explain this a little further by focusing on a single processing unit (neuron).
Given a raw highdimensional feature vector v, $v=[v_1, v_2, ..., v_n]$ For $n = \text{very large number}$ such as an image $n=\text{width} \times \text{height}$
We know that the actual information exists in a much lower dimensional space than $n$. That is why dimensionality reduction works well because it eliminates curse of dimensionality by projecting the data into a much lower relevant representational space. The process of learning in machine learning (ML) algorithms finds that smaller dimensional representation space in the large raw vector $v$.
For simplicity consider a single node
$y= \phi (\sum_{i=1}^n v_iw_i+b)$
The node makes a decision by weighing each feature $v_i$ thus after training the weights for corresponding relevant features will be high. Further, consider $v$ is a concatenation of the relevant feature vector $v_{\text{relevant}}$ and the irrelevant feature vector $v_{\text{irrelevant}}$ such as
$v=[v_{\text{relevant}}, v_{\text{irrelevant}}]$
The weight vector can also be seen as a concatenated vector:
$w=[w_{\text{relevant}}, w_{\text{irrelevant}}]$
So we can further write
$y= \phi (v^Tw+b)$
$y= \phi(v_{\text{relevant}}^T w_{\text{relevant}}+v_{\text{irrelevant}}^T w_{\text{irrelevant}}+b)$
After learning
$w_{\text{irrelevant}} \approx 0$
So that reduces the effective dimensionality of the problem to the dimensionality of $w_{\text{relevant}}$. This is a form of dimensionality reduction. This process occurs at every layer of the deep neural nets (DNN) because each of the neurons in the DNN will only be sensitive to a particular relevant feature.
Support Vector Machines
How can the SVM optimization function be derived from the logistic regression optimization function?
What is a large margin classifier?
Margin: the smallest (perpendicular) distance from each training observation to a given separating hyperplane $\Rightarrow$ the minimal distance from the observations to the hyperplane.
Maximal margin hyperplane: the separating hyperplane that is farthest from the training observations.
 The maximal margin hyperplane is the separating hyperplane for which the margin is largest
 Overfitting when $p$ is large.
Maximal margin classifier: classify a test observation based on which side of the maximal margin hyperplane it lies.
The maximal margin hyperplane is the solution to the optimization problem $\max_{\beta_0,...\beta_p} M \\ s.t. \sum_{j=1}^p \beta_j^2=1, \quad (9.10) \\ y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2},...+\beta_px_{ip})>M \quad \forall i=1,..,n. \quad (9.11)$
 The constraint in $(9.11)$ in fact requires that each observation be on the correct side of the hyperplane, with some cushion, provided that margin $M$ is positive.)
 The constraint in $(9.10)$ makes sure the perpendicular distance from the $i$th observation to the hyperplane is given by $y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2},...+\beta_px_{ip})$
Why SVM is an example of a large margin classifier?
Support Vector Classifier (Soft Margin Classifier): Rather than seeking the largest possible margin that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, we instead allow some observationsto be on the incorrect side of the margin, or even the incorrect side of the hyperplane.
Optimization problem: $\max_{\beta_0,...\beta_p,\epsilon_1,..,\epsilon_n} M \\ s.t. \sum_{j=1}^p \beta_j^2=1, \quad (9.13) \\ y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2},...+\beta_px_{ip})>M(1\epsilon_i) \quad \forall i=1,..,n. \quad (9.14) \\ \epsilon_i\geq0,\sum_{i=1}^p\epsilon_i \leq C, \quad (9.15)$

Slack variables: $\epsilon_1,..,\epsilon_n$  allow individual observations to be on the wrong side of the margin or the hyperplane
 $\epsilon_i=0$: the $i$th observation is on the correct side of the margin
 $\epsilon_i>0$: the $i$th observation is on the wrong side of the margin $\Rightarrow$ $i$th observation violated the margin.
 $\epsilon_i>1$: the $i$th observation is on the wrong side of the hyperplane
 Classify the test observation based on the sign of $f(x^∗) = \beta_0+\beta_1x_{1}^*+\beta_2x_{2}^*,...+\beta_px_{p}^*$
The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space using kernels.
SVM being a large margin classifier, is it influenced by outliers?
Yes, if $C$ is large, otherwise it is not.
What is the role of C in SVM?
 Tuning parameter $C$: $C$ bounds the sum of the $\epsilon_i$’s, and so it determines the number and severity of the violations to the margin(and to the hyperplane) that we will tolerate.
 budget for the amount that the margin can be violated by the $n$ observations.
 Generally chosen via crossvalidation.
 C controls the biasvariance tradeoff of the support vector classifier.
In SVM, what is the angle between the decision boundary and theta?
What is the mathematical intuition of a large margin classifier?
What is a kernel in SVM? Why do we use kernels in SVM?
Kernel: Kernel is a function that quantifies the similarity of two observations.

Linear kernel: $K(x_i,x_{i^{'}})=\sum_{j=1}^px_{ij}x_{i^{'}j}$
 Linear kernel essentially quantifies the similarity of a pair of observations using Pearson (standard) correlation.

Polynomial kernel of degree d: $K(x_i,x_{i^{'}})=(1+\sum_{j=1}^px_{ij}x_{i^{'}j})^d$
 fitting a support vector classifier in a higherdimensional space involving polynomials of degree d.

Radial kernel: $K(x_i,x_{i^{'}})=\exp(\gamma \sum_{j=1}^p(x_{ij}x_{i^{'}j})^2)$

Radial kernel has very local behavior: only nearby training observations have an effect on the class label of a test observation
 If a given test observation $x^∗ = (x^∗_1 . . .x^∗_p)^T$ is far from a training observation $x_i$ in terms of Euclidean distance; $\Rightarrow \sum_{j=1}p(x_{ij}x_{i{'}j})^2$ will be large $\Rightarrow K(x_i,x_{i^{'}})=\exp(\gamma \sum_{j=1}^p(x_{ij}x_{i^{'}j})^2)$ will be very tiny. $\Rightarrow x_i$ will play virtually no role in $f(x∗)$.

Support Vector Machine: When the support vector classifieris combined with a nonlinear kernel, the resulting classifier is known as a support vector machine.
$f(x)=\beta_0+\sum_{i \in S}^n \alpha_i K(x,x_i)$
In machine learning, a “kernel” is usually used to refer to the kernel trick, a method of using a linear classifier to solve a nonlinear problem. The kernel function is what is applied on each data instance to map the original nonlinear observations into a higherdimensional space in which they become separable.
Advantage of Kernel over enlarging the feature space using functions of the original features:

Computational: one need only compute $K(x_i,x_{i^{'}})$ for all $\left(\begin{array}{c}n\\ 2\end{array}\right)$ distinct pairs $i, i^{'}$. This can be done without explicitly working in the enlarged feature space.
 Curse of dimensionality: for some kernels, such as the radial kernel, the feature space is implicit and infinitedimensional.
What is a similarity function in SVM? Why it is named so?
How are the landmarks initially chosen in an SVM? How many and where?
Can we apply the kernel trick to logistic regression? Why is it not used in practice then?
While converting the primal SVM formulation into its dual form (which gives us the kernel version of the SVM), we notice that one of the equations we get is, $\beta=\sum_{i=1}^n \alpha_iy_ix_i$ This, in the kernelized version where the kernel $k$ has an implicit representation for points given by $x \rightarrow \phi(x)$, $\beta=\sum_{i=1}^n \alpha_iy_i\phi(x_i)$ Surprisingly, one can also get this form from the representer theorem [2]. This suggests something general about the classifiers we learn.
Now, let us get back to logistic regression, which is modeled as, $p(y=1x)=\frac{1}{1+\exp(\beta^Tx)}$ First of all, let us map the $x$ to the space of implicit representation of the kernel. So, our model in the implicit representation space would look like, $p(y=1x)=\frac{1}{1+\exp(\beta^T\phi(x))}$ Next, let us use the form of $\beta$, we observed from SVM and the representer theorem. This will give us, $p(y=1x)=\frac{1}{1+\exp(\sum_{i=1}^n\alpha_iy_i\phi(x_i)^T\phi(x))} \\ =\frac{1}{1+\exp(\sum_{i=1}^n\alpha_iy_iK(x_i,x))}$ This gives us the kernelized logistic regression model.
Why is it not used in practice then? logistic regression with kernels is merely an SVM without maximum margins
What is the difference between logistic regression and SVM without a kernel?
How does the SVM parameter C affect the bias/variance trade off?
 if $C$ is small: highly fit to the data, fewer support vectors $\Rightarrow$ low bias, high variance;
 if $C$ is large: margin wider, many support vectors $\Rightarrow$ high bias, low variance;
How does the SVM kernel parameter sigma² affect the bias/variance trade off?
Can any similarity function be used for SVM?
Logistic regression vs. SVMs: When to use which one?
SVM Optimization problem: $\max_{\beta_0,...\beta_p,\epsilon_1,..,\epsilon_n} M \\ s.t. \sum_{j=1}^p \beta_j^2=1, \quad (9.13) \\ y_i(\beta_0+\beta_1x_{i1}+\beta_2x_{i2},...+\beta_px_{ip})>M(1\epsilon_i) \quad \forall i=1,..,n. \quad (9.14) \\ \epsilon_i\geq0,\sum_{i=1}^p\epsilon_i \leq C, \quad (9.15)$
Rewrite the criterion (9.12)–(9.15) for fitting the support vector classifier $f(X) = \beta_0 + \beta_1X_1 + . . . + \beta_pX_p$ as $\min_{\beta_0,...,\beta_p}\left\{ \sum_{i=1}^n\max[0,1y_if(x_i)]+\lambda\sum_{j=1}^p\beta_j^2 \right\}$
 $\lambda$ is small: few violations to the margin ; highvariance, lowbias; $\Leftrightarrow$ small $C$;
“Loss + Penalty” form: $\min_{\beta_0,...,\beta_p}\left\{ L(\mathbf{X},\mathbf{y},\beta)+\lambda P(\beta) \right\}$
 $L(\mathbf{X},\mathbf{y},\beta)$: loss function
 $P(\beta)$: penalty function
Ridge regression and the lasso:
$L(\mathbf{X},\mathbf{y},\beta)=\sum_{i=1}^n \left( y_i\beta_0\sum_{j=1}^p x_{ij}\beta_j \right)^2 \\ P(\beta) = \sum_{j=1}^p \beta_j^2 \quad \text{ridge} \, \text{regression} \\ P(\beta) = \sum_{j=1}^p \beta_j \quad \text{lasso}$
SVM: hinge loss
$L(\mathbf{X},\mathbf{y},\beta)=\sum_{i=1}^n \max[0,1y_i(\beta_0+\beta_1x_{i1}+,,,+\beta_px_{ip})]$
Optimization problems of linear SVM and (regularized) LR: $\min_\beta \lambda\beta^2+\sum_{i=1}^n \max[0,1y_i(\beta_0+\beta_1x_{i1}+,,,+\beta_px_{ip})] \\ \min_\beta \lambda\beta^2+\sum_{i=1}^n \log(1+\exp(1y_i(\beta_0+\beta_1x_{i1}+,,,+\beta_px_{ip})))$ That is, they only differ in the loss function — SVM minimizes hinge loss while logistic regression minimizes logistic loss.
 Logistic loss diverges faster than hinge loss. So, in general, it will be more sensitive to outliers.
 Logistic loss does not go to zero even if the point is classified sufficiently confidently. This might lead to minor degradation in accuracy.
Main Difference:
 SVM try to maximize the margin between the closest support vectors while LR the posterior class probability. Thus, SVM find a solution which is as fare as possible for the two categories while LR has not this property.
 LR is more sensitive to outliers than SVM because the cost function of LR diverges faster than those of SVM. So putting an outlier on above picture would give below picture:

Logistic Regression produces probabilistic values while SVM produces 1 or 0. So in a few words LR makes not absolute prediction and it does not assume data is enough to give a final decision. This maybe be good property when what we want is an estimation or we do not have high confidence into data.
 In order to get discrete values 1 or 0 for the LR we can say that when a function value is greater than a threshold we classify as 1 and when a function value is smaller than the threshold we classify as 0.
Bayesian Machine Learning
What are the differences between “Bayesian” and “Freqentist” approach for Machine Learning?
The Bayesian approach differs from the frequentist (aka “standard”) method for inference in its use of a prior distribution to express the uncertainty present before seeing the data, and to allow the uncertainty remaining after seeing the data to be expressed in the form of a posterior distribution.
Given a specific set of data, the frequentist believes that there is a true, underlying distribution from which said data was generated. The inability to get the exact parameters is a function of finite sample size. The Bayesian, on the other hand, think that we start with some assumption about the parameters (even if unknowingly) and use the data to refine our opinion about those parameters.
 The Bayesian probability measures a “degree of belief”. It pretty much matches our everyday intuitive understanding of probability,
 The frequentists can assign probabilities only to events/obervations that come from repeatable experiments. With ”probability of an event” they mean the relative frequency of the event occuring in an infinitively long series of repetitions. For instance, when a frequentists says that the probability for “heads” in a coin toss is $0.5$ ($50\%$) he means that in infinititively many such coin tosses, $50\%$ of the coins will show “head”.
Frequentist: best suited to falsify a hypothesis. Bayesian: best suited to (re)allocate the credibility of a statement
Downsides of Frequentists
 Frequentists approach relies on data more than Bayesian as we totally ignore our knowledge or logical thinking which have been introduced in a form of prior probability.
 Pvalue does not provide the probability of your hypothesis to be collect. It only avoids the most extreme value that seems to be rare. It sometimes make the situation difficult as you may find it challenging to explain the actual meaning of value. Whereas, the posterior probability describes in percentage how likely your hypothesis is correct based on our prior knowledge.
Downsides of Bayesians
 Bayesian Statistic requires more mathematical knowledge since the formula requires us to deduce two probability distributions.
 What if your prior has become meaningless as the logic we have is no longer valid? (Some articles suggest that the prior at early stage can be any number as it can be updated as more information comes in)
Compare and contrast maximum likelihood and maximum a posteriori estimation.
maximum likelihood estimation (MLE):
Likelihood function: $P(X\theta)$
MLE for $\theta$, the parameter we want to infer:
$\theta_{MLE}=\arg \max_\theta P(X\theta) \\ =\arg \max_\theta \prod_i P(x_i\theta)$
As taking a product of some numbers less than $1$ would approaching $0$ as the number of those numbers goes to infinity, it would be not practical to compute, because of computation underflow. Hence, we will instead work in the log space, as logarithm is monotonically increasing, so maximizing a function is equal to maximizing the log of that function.
$\theta_{MLE}=\arg \max_\theta \log P(X\theta) \\ =\arg \max_\theta \log \prod_i P(x_i\theta) \\ =\arg \max_\theta \sum_i \log P(x_i\theta)$
To use this framework, we just need to derive the log likelihood of our model, then maximizing it with regard of $\theta$ using our favorite optimization algorithm like gradient descent.
maximum a posteriori estimation (MAP):
MAP usually comes up in Bayesian setting. Because, as the name suggests, it works on a posterior distribution, not only the likelihood.
Bayes’ rule: $\theta_{MAP}=\arg \max_\theta P(X\theta) P(\theta) \\ =\arg \max_\theta \log P(X\theta) P(\theta) \\ =\arg \max_\theta \log \prod_i P(x_i\theta) P(\theta)\\ =\arg \max_\theta \sum_i \log P(x_i\theta)P(\theta)$
Comparing both MLE and MAP equation, the only thing differs is the inclusion of prior $P(\theta)$ in MAP. What it means is that, the likelihood is now weighted with some weight coming from the prior.
Let’s consider what if we use the simplest prior in our MAP estimation, i.e. uniform prior. This means, we assign equal weights everywhere, on all possible values of the $\theta$. For example ,our prior $P(\theta)$ is $\frac{1}{6}$
$\theta_{MAP} =\arg \max_\theta \sum_i \log P(x_i\theta)P(\theta) \\ = \arg \max_\theta \sum_i \log P(x_i\theta) const \\ = \arg \max_\theta \sum_i \log P(x_i\theta) \\ = \theta_{MLE}$
Comparison:
What we could conclude then, is that MLE is a special case of MAP, where the prior is uniform
How does Bayesian methods do automatic feature selection?
One type of Bayesian method is Bayesian inference and feature selection has to do with $L^1$ regularization because it is used extensively for this purpose. So, for $L^1$ regularization, the penalty $\alpha \Omega(w)=\alpha \sum_i w_i$ used to regularize a cost function is equivalent to the logprior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution.
$L^1$ regularization finds the specific subset of the available features to be used.
What do you mean by Bayesian regularization?
We usually use regularization to guard against overfitting (or a network having “high variance”). One of the techniques to reduce variance and improve generalization is to apply weight decay and weight constraints. If we manage to trim the growing weights on a Neural Network to some meaningful degree, then we can control the variance of the network and avoid overfitting.
We want to be able to use Bayesian inference in such a way that the weight distribution is made optimal to learn the correct function that relevantly maps the input to the output. In likelihood terms, we can also state that we want to find the weight vectors that maximizes the log probability density towards a correct answer. Minimizing the squared error is the same as maximizing the log probability density of the correct answer. This is called Maximum Likelihood Estimation (MLE).
If we want to use Bayesian inference to regularize the maximum likelihood, the solution would be to apply Maximum A Posteriori estimation (or MAP). MAP tries to find the mode of the posterior distribution by employing Bayes’s Theorem. So for Neural Networks, this can be written as:
$P(\mathbf{w}  \mathcal{D}) = \frac{P(\mathbf{w})P(\mathcal{D}  \mathbf{w})}{\int P(\mathbf{w})P(\mathcal{D}  \mathbf{w})}$
Where, $P(\mathbf{w}  \mathcal{D})$ is the posterior probability of the weight vector $\mathbf{w}$ given the training data set $\mathcal{D}$, $P(\mathbf{w})$ is the prior probability of the weight vector, $P(\mathcal{D}  \mathbf{w})$ is the probability of the observed data given weight vector $\mathbf{w}$, and the denominator is the integral of all possible weight vectors.
We can convert the above equation to a cost function again applying the negative log likelihood as follows:
$\text{cost} =  \log P(\mathbf{w}  \mathcal{D}) \\  \log P(\mathbf{w}  \mathcal{D}) = \log P(\mathbf{w}  \mathcal{D}) = \log P(\mathbf{w})  \log P( \mathcal{D}  \mathbf{w}) + \log P(\mathcal{D}) \\ \text{where }P(\mathcal{D}) \text{ is the integral}$
Here, $P(\mathcal{D})$ is an integral over all possible weights and hence $\log P(\mathcal{D})$ converts to some constant. From the Maximum Likelihood, we already know the equation for $\log P(\mathcal{D}  \mathbf{w})$
$\log P(\mathbf{w})$ is the log probability of the prior weights. This is based on how we initialize the weights. One of the best ways to do this is to sample from a zeromean Gaussian distribution.
$P(\mathbf{w}) = \frac{1}{\sigma \sqrt{2 \pi}} e^\frac{\mathbf{w}^2}{2\sigma_\mathbf{w}^2}$
$\log P(\mathbf{w}) = K + \frac{\mathbf{w}^2}{2\sigma_\mathbf{w}^2} \log e$
So, the Bayesian Inference for MAP is as follows:
$\text{cost} =  \log P(\mathbf{w} )  \log P(\mathcal{D}  \mathbf{w}) +\log P( \mathcal{D}) \\ = \frac{1}{2\sigma_\mathbf{w}^2} \sum_i \mathbf{w}_i^2 + \frac{1}{2\sigma_\mathcal{D}^2} \sum_c (t_c  y_c)^2 + K \\ = \frac{1}{2\sigma_\mathbf{w}^2} \sum_i \mathbf{w}_i^2 + \frac{1}{2\sigma_\mathcal{D}^2} \sum_c (t_c  y_c)^2 \\ = \frac{1}{\sigma_\mathcal{D} ^2}\begin{bmatrix} \frac{\sigma_\mathcal{D} ^2}{2\sigma_\mathbf{w}^2} \sum_i \mathbf{w}_i^2 + E \end{bmatrix}$
$\text{cost approximation} = \begin{bmatrix} \frac{\mathbf{w}^2}{2\sigma_\mathbf{w}^2} \sum_i \mathbf{w}_i^2 + E \end{bmatrix}$
This cost approximation is similar to the loss function in $L_2$ regularization. Also note that we started we a randomly initialized zeromeangaussian weight vector for MAP and then started working towards fixing it to improve $P(\mathbf{w}  \mathcal{D})$. This has the same sideeffect as $L_2$ regularizers which can get stuck in local minima.
We take the MAP approach because a full bayesian approach over all possible weights is computational intensive and is not tractable. There are tricks with MCMC which can help approximate a unbiased sample from true posteriors over the entire weights.
When will you use Bayesian methods instead of Frequentist methods?
Bayesian methods are ideal in situations where you have a very small dataset, especially when each instance in that dataset has a large number of features.
Regularization
What is $L_1$ regularization?
$L_1$ lasso penalty: $\sum_{j=1}^p \beta_j$
A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, $L_1$ regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model.
What is $L_2$ regularization?
$L_2$ ridge penalty: $\sum_{j=1}^p\beta_j^2$