Deep Learning Concepts every practicioner should know

A deep dive into the important 'deep' learning concepts

This post is Part 3 of the 4-part Machine Learning Research Interview Handbook series (you can see the rest of the series here).

Deep Learning is the subfield within machine learning involving the use of nested models for more complext tasks. This becomes especially relevant in fields such as reinforcement learning.

Table of Contents

Machine Learning in General

Optimization procedures

Sequence Modeling


Representation Learning

Monte Carlo Methods

Adversarial Attacks

Machine Learning in General

Can you state Tom Mitchell’s definition of learning and discuss TT, PP and EE?

(click the triangle to expand and reveal the full answer)

Mitchell (1997) provides the definition: “A computer program is said to learn from experience EE with respect to some class of tasks TT and performance measure PP, if its performance at tasks in TT, as measured by PP, improves with experience EE.”

What can be different types of tasks encountered in Machine Learning?

(click the triangle to expand and reveal the full answer)

This is a non-exhaustive list. If you want more examples, look at the list of categories of SOTA research at

  • Classification
  • Regression
  • Semantic Segmentation
  • Object Detection
  • Super Resolution
  • Image Generation, Text Generation
  • Domain Adaptation
  • DeNoising
  • Style Transfer
  • Machine translation
  • Anomaly detection
  • Density estimation or probability mass function estimation

What are supervised, unsupervised, semi-supervised, self-supervised, multi-instance learning, and reinforcement learning?

(click the triangle to expand and reveal the full answer)

Supervised learning: Training a model from input data and its corresponding labels.

  • Self-supervised learning (supervised learning sub-type): Automatically labelling the training data during labelling. The datasets do not need to be manually labelled by human, though they can. Methods include (among others) labelling by finding and exploiting the relations (or correlations) between different input signals (input coming e.g. from different sensor modalities).
  • Semi-supervised learning (supervised learning sub-type): Training a model on data where some of the training examples have labels but others don’t. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.
  • Multi-instance learning (supervised learning sub-type): Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag may be labeled negative if all the instances in it are negative. On the other hand, a bag is labeled positive if there is at least one instance in it which is positive. From a collection of labeled bags, the learner tries to either 1) induce a concept that will label individual instances correctly or 2) learn how to label bags without inducing the concept.

Unsupervised learning: Training a model to find patterns in a dataset, typically an unlabeled dataset.

Reinforcement learning: A machine learning approach to maximize an ultimate reward through feedback (rewards and punishments) after a sequence of actions. For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses. Reinforcement learning is learning what to do---how to map situations to actions---so as to maximize a numerical reward signal

Loosely how can supervised learning be converted into unsupervised learning and vice-versa?

(click the triangle to expand and reveal the full answer)

Any unsupervised algorithm can be converted into a supervised algorithm by either adding

  • Supervised Loss to the Unsupervised Objective OR
  • Supervision as constraints to the original objective

Both are isomorphic to a certain degree, that is, there is very little difference in whether you add it as a loss or add it as a constraint. Such classes of algorithms that mix unsupervised objective with a supervised loss/constraint are called semi-supervised algorithms (also a way to convert from supervised to unsupervised).

Such an algorithm is useful when you have DD and LL, where DD is set of instances and LL is a subset of DD which has labels available. Often the size of L in these cases are much smaller than DD.

An example of such an algorithm is ICML 2001 paper by Kiri Wagstaff et. al. called Constrained K-Means Clustering with Background Knowledge, where they modify the unsupervised K-means algorithm by adding supervised constraints. These constraints tell the algorithm, which two points must belong to a cluster and which two don’t. They observe that they get improved clustering from such supervision.

Consider linear regression. What are TT, PP and EE?

(click the triangle to expand and reveal the full answer)

Task TT: predicting yy from xx by outputting y^=wTx\hat{y} = \mathbf{w}^T\mathbf{x}.

Performance PP: computing the mean squared error of the model on the test set.


Experience EE: The training set containing the data XtrainX^{\text{train}} and the labels ytrainy^{\text{train}}.

Derive the normal equation for linear regression.

(click the triangle to expand and reveal the full answer)

Here is one (but not the only) strategy for deriving the normal equation.

Linear function: hθ(x)=θ0+θ1x1+θ2x2=θTxh_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2=\theta^Tx

Least Squares Cost Function: J(θ)=12i=1n(hθ(x(i))y(i))2J(\theta)=\frac{1}{2}\sum_{i=1}^n(h_\theta(x^{(i)})-y^{(i)})^2

X=[(x(1))T(x(2))T...(x(n))T]y=[y(1)y(2)...y(n)]Xθy=[(x(1))Tθy(1)(x(2))Tθy(2)...(x(n))Tθy(n)]=[(hθ(x(1))y(1)hθ(x(2))y(2)...hθ(x(1))y(n)]\mathbf{X}=\begin{bmatrix}- (x^{(1)})^T - \\- (x^{(2)})^T - \\ ...\\- (x^{(n)})^T -\end{bmatrix} \\ \mathbf{y}=\begin{bmatrix} y^{(1)} \\y^{(2)} \\... \\y^{(n)}\end{bmatrix} \\ \mathbf{X}\theta-\mathbf{y}=\begin{bmatrix} (x^{(1)})^T\theta-y^{(1)} \\(x^{(2)})^T\theta-y^{(2)} \\... \\(x^{(n)})^T\theta-y^{(n)}\end{bmatrix} \\ =\begin{bmatrix} (h_\theta(x^{(1)})-y^{(1)} \\h_\theta(x^{(2)})-y^{(2)} \\... \\h_\theta(x^{(1)})-y^{(n)}\end{bmatrix}

For a vector zz, we have that: zTz=izi2z^Tz=\sum_i z^2_i

12(Xθy)T(Xθy)=12i=1n(hθ(x(i))y(i))2=J(θ)\frac{1}{2}(\mathbf{X}\theta-\mathbf{y})^T(\mathbf{X}\theta-\mathbf{y})=\frac{1}{2}\sum_{i=1}^n(h_\theta(x^{(i)})-y^{(i)})^2 =J(\theta)

We know that:

$\frac{\partial f(A)}{\partial A^T}=(\frac{\partial f(A)}{\partial A})^T \\ \frac{\partial \mathbf{y}^T\mathbf{A}\mathbf{x}}{\partial \mathbf{x}}=\mathbf{y}^T\mathbf{A} \\ \frac{\partial \mathbf{y}^T\mathbf{A}\mathbf{x}}{\partial \mathbf{y}}=\frac{\partial \mathbf{x}^T\mathbf{A}^T\mathbf{y}}{\partial \mathbf{y}}=\mathbf{x}^T\mathbf{A}^T \\ \frac{\partial \mathbf{x}^T\mathbf{A}\mathbf{x}}{\partial \mathbf{x}}=\mathbf{x}^T\mathbf{A}^T +\mathbf{x}^T\mathbf{A}=\mathbf{x}^T(\mathbf{A}^T +\mathbf{A})\$

To minimize JJ, let’s find its derivatives with respect to θ\theta:

J(θ)θ=12(Xθy)T(Xθy)θ=12(θTXTXθθTXTyyTXθ+yTy)θ=12(θTXTXθθTXTyyTXθ+yTy)θ=12(θTXTX+θTXTXyTXyTX)=12(XTXθ2yTX)=XTXθXTy=0\frac{\partial J(\theta)}{\partial \theta}= \frac{\partial \frac{1}{2}(\mathbf{X}\theta-\mathbf{y})^T(\mathbf{X}\theta-\mathbf{y})}{\partial \theta}\\ = \frac{1}{2}\frac{\partial (\theta^T\mathbf{X}^T\mathbf{X}\theta-\theta^T\mathbf{X}^T\mathbf{y}-\mathbf{y}^T\mathbf{X}\theta+\mathbf{y}^T\mathbf{y})}{\partial \theta} \\ =\frac{1}{2}\frac{\partial (\theta^T\mathbf{X}^T\mathbf{X}\theta-\theta^T\mathbf{X}^T\mathbf{y}-\mathbf{y}^T\mathbf{X}\theta+\mathbf{y}^T\mathbf{y})}{\partial \theta} \\ =\frac{1}{2} (\theta^T\mathbf{X}^T\mathbf{X}+\theta^T\mathbf{X}^T\mathbf{X}-\mathbf{y}^T\mathbf{X}-\mathbf{y}^T\mathbf{X} )\\ =\frac{1}{2}(\mathbf{X}^T\mathbf{X}\theta-2\mathbf{y}^T\mathbf{X}) \\ =\mathbf{X}^T\mathbf{X}\theta-\mathbf{X}^T\mathbf{y}=0

Normal Equation: θ=(XTX)1XTy\theta=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

What do you mean by affine transformation? Discuss affine vs. linear transformation.

(click the triangle to expand and reveal the full answer)

A function ff is linear if f(ax+by)=af(x)+bf(y)f(ax+by)=af(x)+bf(y) for all relevant values of aa, bb, xx and yy.

A function gg is affine if g(x)=f(x)+cg(x)=f(x)+c for some linear function aa and constant cc. Note that we allow c=0c=0, which implies that every linear function is an affine function.

  1. All linear transformations are affine transformations.
  2. Not all affine transformations are linear transformations.
  3. It can be shown that any affine transformation A:UVA:U \rightarrow V can be written as A(x)=Lx)+v0A(x)=Lx)+v_0, where v0v_0 is some vector from VV and L:UVL:U \rightarrow V is a linear transformation.

Discuss training error, test error, generalization error, overfitting, and underfitting.

(click the triangle to expand and reveal the full answer)

overfitting: What happens when the gap between the training error and test error is too large

underfitting: What happens when the model is not able to obtain a sufficiently low error value on the training set

training error: When training a machine learning model, we have access to a training set, we can compute some error measure on the training set

generalization error/test error: The expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice

Visual summary

Compare representational capacity vs. effective capacity of a model.

(click the triangle to expand and reveal the full answer)

Representational capacity - the functions which the model can learn; The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective.

Effective capacity - in practice, a learning algorithm is not likely to find the best function out of the possible functions it can learn, though it can learn one that performs exceptionally well - those functions that the learning algorithm is capable of finding defines the model’s effective capacity.

These additional limitations, such as the imperfection of the optimization algorithm, mean that the learning algorithm’s effective capacity may be less than the representational capacity of the model family

What is VC dimension?

(click the triangle to expand and reveal the full answer)

The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of mm for which there exists a training set of mm different xx points that the classifier can label arbitrarily.

What are nonparametric models? What is nonparametric learning?

(click the triangle to expand and reveal the full answer)

Parametric models: learn a function described by a parameter vector whose size is finite and fixed before any data is observed (linear regression)

Non-parametric models: assume that the data distribution cannot be defined in terms of a finite set of parameters. But they can often be defined by assuming an infinite dimensional θ\theta. Usually we think of θ\theta as a function (nearest neighbor regression)

What is an ideal model? What is Bayes error? What is/are the source(s) of Bayes error occur?

(click the triangle to expand and reveal the full answer)

The ideal model: is an oracle that simply knows the true probability distribution that generates the data.

  • Even such a model will still incur some error on many problems, because there may still be some noise in the distribution. In the case of supervised learning, the mapping from xx to yy may be inherently stochastic, or yy may be a deterministic function that involves other variables besides those included in xx.

Bayes error: the lowest possible prediction error that can be achieved and is the same as irreducible error. ; The error incurred by an oracle making predictions from the true distribution p(x,y)p(x, y).

Source(s) of Bayes error: noise in the distribution if the process is random

What is the “no free lunch” theorem in connection to Machine Learning?

(click the triangle to expand and reveal the full answer)

The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.

Fortunately, these results hold only when we average over all possible data generating distributions. **If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.

This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the “real world” that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.

What is regularization? Intuitively, what does regularization do during the optimization procedure?

(click the triangle to expand and reveal the full answer)

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

We regularize a model that learns a function f(x;θ)f(x;\theta) by adding a penalty called a regularizer to the cost function. Expressing preferences for one function over another implicitly and explicitly is a more general way of controlling a model’s capacity than including or excluding members from the hypothesis space.

What is weight decay? What is it added?

(click the triangle to expand and reveal the full answer)

Weight decay is an additional term that causes the weights to exponentially decay to zero.

To perform linear regression with weight decay, we minimize a sum comprising both the mean squared error on the training and a criterion J(w)J(w) that expresses a preference for the weights to have smaller squared L2L_2 norm. Specifically,

J(w)=MSEtrain+λwTwJ(w) = MSE_{train} + \lambda \mathbf{w}^T\mathbf{w}

Minimizing J(w)J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or put weight on fewer of the features.

What is a hyperparameter? How do you choose which settings are going to be hyperparameters and which are going to be learned?

(click the triangle to expand and reveal the full answer)

Hyperparameter: Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm.

  • The values of hyperparameters are not adapted by the learning algorithm itself

Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to optimize or it is not appropriate to learn that hyperparameter on the training set. This applies to all hyperparameters that control model capacity. If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in overfitting.

Why is a validation set necessary?

(click the triangle to expand and reveal the full answer)

Let’s assume that you are training a model whose performance depends on a set of hyperparameters. In the case of a neural network, these parameters may be for instance the learning rate or the number of training iterations.

Given a choice of hyperparameter values, you use the training set to train the model. But, how do you set the values for the hyperparameters? That’s what the validation set is for. You can use it to evaluate the performance of your model for different combinations of hyperparameter values (e.g. by means of a grid search process) and keep the best trained model.

But, how does your selected model compares to other different models? Is your neural network performing better than, let’s say, a random forest trained with the same combination of training/test data? You cannot compare based on the validation set, because that validation set was part of the fitting of your model. You used it to select the hyperparameter values!

The test set allows you to compare different models in an unbiased way, by basing your comparisons in data that were not use in any part of your training/hyperparameter selection process.

What are the different types of cross-validation? When do you use which one?

(click the triangle to expand and reveal the full answer)
  • kk-fold cross-validation randomly divides the data into k blocks of roughly equal size. Each of the blocks is left out in turn and the other k1k-1 blocks are used to train the model. The held out block is predicted and these predictions are summarized into some type of performance measure (e.g. accuracy, root mean squared error (RMSE), etc.). The kk estimates of performance are averaged to get the overall resampled estimate. kk is 1010 or sometimes 55. Why? I have no idea. When kk is equal to the sample size, this procedure is known as Leave-One-Out CV.
  • Repeated k-fold CV does the same as above but more than once. For example, five repeats of 10-fold CV would give 50 total resamples that are averaged. Note this is not the same as 50-fold CV.
  • Leave Group Out cross-validation (LGOCV), aka Monte Carlo CV, randomly leaves out some set percentage of the data BB times. It is similar to min-training and hold-out splits but only uses the training set.
  • The bootstrap takes a random sample with replacement from the training set BB times. Since the sampling is with replacement, there is a very strong likelihood that some training set samples will be represented more than once. As a consequence of this, some training set data points will not be contained in the bootstrap sample. The model is trained on the bootstrap sample and those data points not in that sample are predicted as hold-outs.

What are point estimation and function estimation in the context of Machine Learning? What is the relation between them?

(click the triangle to expand and reveal the full answer)

Point estimation refers to obtaining estimates for specific data points (such as estimating yy from xx, or estimating yy from a distribution). In machine learning, this is the main prediction task. Machine learning models do this by approximating the function that maps xx to yy. I.e., a model approximates the function f(x)=yf(x) = y

What is the maximal likelihood of a parameter vector θ\theta? Where does the log come from?

(click the triangle to expand and reveal the full answer)

The MLE for a vector parameter θ=[θ1,θ2,...,θp]T\theta = [\theta_1, \theta_2, ... , \theta_p]^T is defined as the value that maximizes the likelihood function p(x;θ)p(x; \theta), i.e.,

θ^ML=argmaxθp(x;θ)\hat{\theta}_{\text{ML}} = \text{arg}\max_{\theta} p(x; \theta)

In practise the MLE is found from the root of the log-likelihood function lnθ=0\frac{\partial \ln}{\partial \theta}=0

If multiple solutions exist, then the one that maximizes the likelihood function is the MLE.

Prove that for linear regression MSE can be derived from maximal likelihood by proper assumptions.

(click the triangle to expand and reveal the full answer)

Probabilistic assumption:

Assume that the target variables and the inputs are related via the equation:y(i)=θTx(i)+ϵ(i)y(i)= \theta Tx(i)+ \epsilon(i)

where ϵ(i)\epsilon^{(i)} is an error term that captures either unmodeled effects (such as if there are some features very pertinent to predicting housing price, but that we’d left out of the regression), or random noise.

Assume ϵ(i)\epsilon^{(i)} are distributed IID (independently and identically distributed) according to a Gaussian distribution (also called a Normal distribution) mean zero and some variance σ2\sigma^2

The density of ϵ(i)\epsilon^{(i)} is: p(ϵ(i))=12πσexp((ϵ(i))22σ2)p(\epsilon^{(i)})=\frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)

This implies that: p(y(i)x(i);θ)=12πσexp((y(i)θTx(i))22σ2)p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)

Likelihood function: L(θ)=L(θX,y)=p(yX;θ)L(\theta)=L(\theta|\mathbf{X},\mathbf{y})=p(\mathbf{y}|\mathbf{X};\theta)

p(yX;θ)p(y|X;\theta): This quantity is typically viewed a function of yy (and perhaps XX), for a fixed value of θ\theta.

By the independence assumption on the ϵ(i)\epsilon^{(i)}’s (and hence also the y(i)y(i)’s given the x(i)x(i)’s), this can also be written:

L(θ)=i=1np(y(i)x(i);θ)=i=1n12πσexp((y(i)θTx(i))22σ2)L(\theta)= \prod_{i=1}^n p(y^{(i)}|x^{(i)};\theta) \\ =\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)

The principal of maximum likelihood says that we should choose θ\theta so as to make the data as high probability as possible \rightarrow maximize L(θ)L(\theta).

Instead of maximizing L(θ)L(\theta), we can also maximize any strictly increasing function of L(θ)L(\theta) \rightarrow log likelihood (θ)\ell(\theta):

(θ)=logL(θ)=logi=1np(y(i)x(i);θ)=i=1nlog12πσexp((y(i)θTx(i))22σ2)=nlog12πσ12σ2i=1n(y(i)θTx(i))2\ell(\theta)=\log L(\theta)=\log \prod_{i=1}^n p(y^{(i)}|x^{(i)};\theta) \\ =\sum_{i=1}^n \log \frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right) \\ = n\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{2\sigma^2} \sum_{i=1}^n (y^{(i)}-\theta^Tx^{(i)})^2

Hence, maximizing (θ)\ell(\theta) gives the same answer as minimizing

12i=1n(hθ(x(i))y(i))2=J(θ)\frac{1}{2}\sum_{i=1}^n(h_\theta(x^{(i)})-y^{(i)})^2 =J(\theta)

To summarize: Under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ\theta. Note also that, in our previous discussion, our final choice of θ\theta did not depend on what was σ2\sigma^2, and indeed we’d have arrived at the same result even if σ2\sigma^2 were unknown.

Why is maximal likelihood the preferred estimator in ML?

(click the triangle to expand and reveal the full answer)

The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples mm \rightarrow \infty, in terms of its rate of convergence as mm increases

Under appropriate conditions, the maximum likelihood estimator has the properties of:

  • consistency: as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. θ^θn\hat{\theta} \rightarrow \theta^{n \rightarrow \infty}.
  • efficiency: the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower mean squared error Var(θn)\text{Var}(\theta^n) than the maximum likelihood estimator

When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.

Under what conditions do the maximal likelihood estimator guarantee consistency?

(click the triangle to expand and reveal the full answer)

There are two conditions:

Condition #1: The true distribution pdatap_{\text{data}} must lie within the model family pmodel(;θ)p_{\text{model}}(·; \theta). Otherwise, no estimator can recover pdatap_{\text{data}}.

Condition #2: The true distribution pdatap_{\text{data}} must correspond to exactly one value of θ\theta. Otherwise, maximum likelihood can recover the correct pdatap_{\text{data}}, but will not be able to determine which value of θ\theta was used by the data generating processing.

What is cross-entropy loss?

(click the triangle to expand and reveal the full answer)

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.

Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.

def CrossEntropy(yHat, y):
    if y == 1:
      return -log(yHat)
      return -log(1 - yHat)

In binary classification, where the number of classes MM equals 22, cross-entropy can be calculated as:

(ylog(p)+(1y)log(1p))-{(y\log(p) + (1 - y)\log(1 - p))}

If M>2M>2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.


What is the difference between loss function, cost function and objective function?

(click the triangle to expand and reveal the full answer)

These are not very strict terms and they are highly related. However:

Loss function is usually a function defined on a data point, prediction and label, and measures the penalty. For example:

  • square loss l(f(xiθ),yi)=(f(xiθ)yi)2l(f(x_i|\theta),y_i) = \left (f(x_i|\theta)-y_i \right )^2, used in linear regression
  • hinge loss l(f(xiθ),yi)=max(0,1f(xiθ)yi)l(f(x_i|\theta), y_i) = \max(0, 1-f(x_i|\theta)y_i), used in SVM
  • 0/1 loss l(f(xiθ),yi)=1    f(xiθ)yil(f(x_i|\theta), y_i) = 1 \iff f(x_i|\theta) \neq y_i, used in theoretical analysis and definition of accuracy

Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). For example:

  • Mean Squared Error MSE(θ)=1Ni=1N(f(xiθ)yi)2MSE(\theta) = \frac{1}{N} \sum_{i=1}^N \left (f(x_i|\theta)-y_i \right )^2
  • SVM cost function SVM(θ)=θ2+Ci=1NξiSVM(\theta) = \|\theta\|^2 + C \sum_{i=1}^N \xi_i (there are additional constraints connecting ξi\xi_i with CC and with training set)

Objective function is the most general term for any function that you optimize during training. For example, a probability of generating training set in maximum likelihood approach is a well defined objective function, but it is not a loss function nor cost function (however you could define an equivalent cost function). For example:

  • MLE is a type of objective function (which you maximize)
  • Divergence between classes can be an objective function but it is barely a cost function, unless you define something artificial, like 1-Divergence, and name it a cost

In short, a loss function is a part of a cost function, which is a type of objective function.

Optimization procedures

What is the difference between an optimization problem and a Machine Learning problem?

(click the triangle to expand and reveal the full answer)

The difference is very slim between machine learning (ML) and optimization theory.

In ML the idea is to learn a function that minimizes an error or one that maximizes reward over punishment.

Yes a lot of learning can be seen as optimization. In fact learning is an optimization problem.

For example, if you want to learn to play Chess, you will probably suck at first and then the idea (objective) is to be less bad. So you need to learn a series of actions that should hopefully minimize loses over wins.

In the process you are basically optimizing your chances of winning and that is what we call learning.

How can a learning problem be converted into an optimization problem?

(click the triangle to expand and reveal the full answer)

All supervised learning methods try to solve:

w^=argmaxinL(xi,yi;w)\hat{w} = \arg \max \sum_{i}^{n} L(x_i, y_i; w)

Where w^=optimal parameters\hat{w} = \text{optimal parameters}, n=number of training pairsn=\text{number of training pairs}, x=inputx=\text{input}, y=desired outputy=\text{desired output}, w=model parametersw=\text{model parameter}s and L=lossL= \text{loss}, cost or objective function.

This is from optimization theory and the same formulation can be used for basic curve fitting in maths just as it can also be used for complex problems like image and speech recognition.

The goal for ML is similarly to optimize the performance of a model given an objective and the training data.

However, the manner in which ML does optimization is somewhat weak especially considering that the majority of ML models are supervised and require differentiable objective functions.

Many complex problems are not differentiable or are hard to cast in a differentiable manner, especially the so called AI-complete problems such as vision and natural language understanding (NLU).

If the wider artificial intelligence (AI) research community thinks that backpropagation (backprop) + stochastic gradient descent (SGD) algorithms and their variants will lead us to artificial general intelligence (AGI) then they are gravely mistaken.

Of course we can solve some interesting problems by training models with backprop + SGD (or other variants) as evident in the recent progress ML has enjoyed over the past few years but we should, sooner or later, realize that those models are not going to turn into an AGI model, we need better learning (optimization) algorithms, preferably those not limited to differentiable objectives, in order to move further.

What is empirical risk minimization? Why the term empirical? Why do we rarely use it in the context of deep learning?

(click the triangle to expand and reveal the full answer)

Empirical risk minimization refers techniques used to define the theoretical performance bounds of a learning algorithm. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true “risk”) because we don’t know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the “empirical” risk).

The ultimate goal of a learning algorithm is to find a hypothesis hh^{*} among a fixed class of functions H\mathcal{H} for which the risk R(h)R(h) is minimal:

h=argminhHR(h)h^{*} =\arg \min_{h \in {\mathcal {H}}}R(h).

In general, the risk R(h)R(h) cannot be computed because the distribution P(x,y)P(x,y) is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called empirical risk, by averaging the loss function on the training set:


The empirical risk minimization principle states that the learning algorithm should choose a hypothesis h^\hat{h} which minimizes the empirical risk:

h^=argminhHRemp(h)\hat{h} = \arg \min_{h\in {\mathcal {H}}}R_{\text{emp}}(h)

Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.

The main reason it’s rarely used in the context of deep learning is computational complexity.

Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem, even for such a relatively simple class of functions as linear classifiers. t can be solved efficiently when the minimal empirical risk is zero (i.e. data is linearly separable), but this is not always verifiable with the types of problems deep learning is applied to.

In practice, machine learning algorithms cope with that either by employing a convex approximation to the 0-1 loss function (like hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x,y)P(x,y) (and thus stop being agnostic learning algorithms to which the above result applies).

Name some typical loss functions used for regression. Compare and contrast.

(click the triangle to expand and reveal the full answer)

1. Mean Square Error, Quadratic loss, L2L_2 Loss Mean Square Error (MSE) is the most commonly used regression loss function. MSE is the sum of squared distances between our target variable and predicted values.

MSE=i=1n(yiyip)2n\text{MSE}=\frac{\sum_{i=1}^{n} (y_i - y_i^p)^2}{n}

Below is a plot of an MSE function where the true target value is 100, and the predicted values range between -10,000 to 10,000. The MSE loss (Y-axis) reaches its minimum value at prediction (X-axis) = 100. The range is 0 to \infty.

Plot of MSE Loss (Y-axis) vs. Predictions (X-axis)

2. Mean Absolute Error, L1L_1 Loss Mean Absolute Error (MAE) is another loss function used for regression models. MAE is the sum of absolute differences between our target and predicted variables. So it measures the average magnitude of errors in a set of predictions, without considering their directions. (If we consider directions also, that would be called Mean Bias Error (MBE), which is a sum of residuals/errors). The range is also 0 to \infty.

MAE=i=1nyiyip2n\text{MAE}=\frac{\sum_{i=1}^{n} |y_i - y_i^p |^2}{n}

Plot of MAE Loss (Y-axis) vs. Predictions (X-axis)

2.5. MSE vs. MAE (L2L_2 loss vs L1L_1 loss) In short, using the squared error is easier to solve, but using the absolute error is more robust to outliers. But let’s understand why!

3. Huber Loss, Smooth Mean Absolute Error Huber loss is less sensitive to outliers in data than the squared error loss. It’s also differentiable at 0. It’s basically absolute error, which becomes quadratic when error is small. How small that error has to be to make it quadratic depends on a hyperparameter, 𝛿 (delta), which can be tuned. Huber loss approaches MAE when δ0\delta \sim 0 and MSE when δ\delta \sim \infty (large numbers.)

Lδ(y,f(x))={(yf(x))2/2for yf(x)δδyf(x)δ2/2otherwise.L_\delta(y, f(x)) = \left\{\begin{matrix} (y-f(x))^2/2 & \text{for }|y-f(x)| \leq \delta \\ \delta |y-f(x)| - \delta^2/2 & \text{otherwise.} \end{matrix}\right.

Plot of Hoss Loss (Y-axis) vs. Predictions (X-axis). True value = 0

The choice of delta is critical because it determines what you’re willing to consider as an outlier. Residuals larger than delta are minimized with L1L_1 (which is less sensitive to large outliers), while residuals smaller than delta are minimized “appropriately” with L2L_2.

3.5. Why use Huber Loss?

One big problem with using MAE for training of neural nets is its constantly large gradient, which can lead to missing minima at the end of training using gradient descent. For MSE, gradient decreases as the loss gets close to its minima, making it more precise.

Huber loss can be really helpful in such cases, as it curves around the minima which decreases the gradient. And it’s more robust to outliers than MSE. Therefore, it combines good properties from both MSE and MAE. However, the problem with Huber loss is that we might need to train hyperparameter delta which is an iterative process.

4. Log-Cosh Loss Log-cosh is another function used in regression tasks that’s smoother than L2L_2. Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.

L(y,yp)=i=1nlog(cosh(yipyi))L(y,y^p)=\sum_{i=1}^{n} \log(\cosh(y_i^p - y_i))

Plot of Log-cosh Loss (Y-axis) vs. Predictions (X-axis). True value = 0

Advantage: log(cosh(x))\log(\cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x. This means that logcosh works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction. It has all the advantages of Huber loss, and it’s twice differentiable everywhere, unlike Huber loss.

5. Quantile Loss

In most of the real world prediction problems, we are often interested to know about the uncertainty in our predictions. Knowing about the range of predictions as opposed to only point estimates can significantly improve decision making processes for many business problems.

Quantile loss functions turn out to be useful when we are interested in predicting an interval instead of only point predictions. Prediction interval from least square regression is based on an assumption that residuals (yy^)(y - \hat{y}) have constant variance across values of independent variables. We can not trust linear regression models which violate this assumption. We can not also just throw away the idea of fitting linear regression model as baseline by saying that such situations would always be better modeled using non-linear functions or tree based models. This is where quantile loss and quantile regression come to rescue as regression based on quantile loss provides sensible prediction intervals even for residuals with non-constant variance or non-normal distribution.

Let’s see a working example to better understand why regression based on quantile loss performs well with heteroscedastic data.

Quantile regression vs. Ordinary Least Square regression

Left: Linear relationship b/w X1 and Y. With constant variance of residuals. Right: Linear relationship b/w X2 and Y but variance of Y increases with X2. (Heteroscedasticity)

Orange line represents OLS estimates for both cases

Quantile Regression. Dotted lines represent regression based 0.05 and 0.95 quantile loss functions

Understanding the quantile loss function

Quantile-based regression aims to estimate the conditional “quantile” of a response variable given certain values of predictor variables. Quantile loss is actually just an extension of MAE (when quantile is 50th percentile, it’s MAE).

The idea is to choose the quantile value based on whether we want to give more value to positive errors or negative errors. Loss function tries to give different penalties to overestimation and underestimation based on the value of chosen quantile (γ\gamma). For example, a quantile loss function of γ=0.25\gamma = 0.25 gives more penalty to overestimation and tries to keep prediction values a little below median

Lγ(y,yp)=i=yi<yip(γ1)yiyip+i=yiyip(γ)yiyipL_\gamma(y,y^p)=\sum_{i=y_i < y_i^p} (\gamma - 1) \cdot |y_i - y_i^p| + \sum_{i=y_i \geq y_i^p} (\gamma) \cdot |y_i - y_i^p|

γ\gamma is the required quantile and has value between 0 and 1.

Plot of Quantile Loss (Y-axis) vs. Predictions (X-axis). True value of Y = 0

We can also use this loss function to calculate prediction intervals in neural nets or tree based models. Below is an example of Sklearn implementation for gradient boosted tree regressors.

Prediction Intervals using Quantile loss (Gradient Boosting Regressor) from Scikit Learn

Above figure shows 90% prediction interval calculated using quantile loss function available in GradientBoostingRegression of sklearn library. The upper bound is constructed γ=0.95\gamma = 0.95 and lower bound using γ=0.05\gamma = 0.05.

Comparison Study:

A nice comparison simulation is provided in “Gradient boosting machines, a tutorial”. To demonstrate the properties of the all the above loss functions, they’ve simulated a dataset sampled from a sinc(x) function with two sources of artificially simulated noise: the gaussian noise component εN(0,σ2)\varepsilon \sim \mathcal{N}(0, \sigma^2) and the impulsive noise component ξBern(p)\xi \sim \text{Bern}(p). The impulsive noise term is added to illustrate the robustness effects. Below are the results of fitting a GBM regressor using different loss functions.

Continuous loss functions: (A) MSE loss function; (B) MAE loss function; (C) Huber loss function; (D) Quantile loss function. Demonstration of fitting a smooth GBM to a noisy sinc(x) data: (E) original sinc(x) function; (F) smooth GBM fitted with MSE and MAE loss; (G) smooth GBM fitted with Huber loss with delta = {4, 2, 1}; (H) smooth GBM fitted with Quantile loss with alpha = {0.5, 0.1, 0.9}.

Some observations from the simulations:

  • The predictions from model with MAE loss is less affected by the impulsive noise whereas the predictions with MSE loss function is slightly biased due to the caused deviations.
  • The predictions are little sensitive to value of hyperparameter chosen in case of model with huber loss.
  • The quantile losses give a good estimation of the corresponding confidence levels.

All the loss functions in single plot.

What is the 0-1 loss function? Why can’t the 0-1 loss function or classification error be used as a loss function for optimizing a deep neural network?

(click the triangle to expand and reveal the full answer)

Loss functions are for introducing some kind of metric that we can measure the “cost” of an incorrect decision with.

So let’s say I have a dataset of 30 objects, I divided them to training/testing\text{training} / \text{testing} sets like 20/1020 / 10. I will be using 0-1 loss function, so lets say my set of class labels is M and the function looks like this:

L(i, j) = \right \begin{Bmatrix} 0 \qquad i = j \ 1 \qquad i \ne j \end{matrix} \qquad i,j \in M

So I builded some model on my training data, lets say I am using Naive Bayes classifier, and this model classified 7 objects correctly (assigned them the correct class labels) and 3 objects were classified incorrectly.

So my loss function would return “0” 7 times and “1” 3 times - i.e. our model classified 30%30 \% of the objects incorrectly.

The 010-1 loss function is non-convex and discontinuous, so (sub)gradient methods cannot be applied. For binary classification with a linear separator, this loss function can be formulated as finding the β\beta that minimizes the average value of the indicator function 1(yiβxi0)\mathbf{1}(y_i \beta x_i \leq \mathbf{0}) over all ii samples. This is exponential in the inputs, as since there are two possible values for each pair, there are 2n2^n possible configurations to check for n total sample points. This is known to be NP-hard. Knowing the current value of your loss function doesn’t provide any clue as to how you should possibly modify your current solution to improve, as you could derive if gradient methods for convex or continuous functions were available.

Sequence Modeling

Write the equation describing a dynamical system. Can you unfold it? Now, can you use this to describe a RNN?

(click the triangle to expand and reveal the full answer)

Let us consider a dynamic system whose present state is a function of (depends on) the previous state. It can be expressed compactly, with cycles, as below:

state,st=f(st1;θ)\text{state}, s_{t} = f( s_{t-1} ; \theta )

This is a recursive/recurrent definition. The state at time ’tt‘, sts_t is a function (ff) of previous state st1s_{t-1}, parameterized by θ\theta. This equation needs to be unfolded as follows:

s3=f(s2;θ)=f(f(s1;θ)θ)s_{3} = f( s_{2}; \theta ) = f( f( s_{1}; \theta ) \theta )

Consider a little more complex system, whose state not only depends on the previous state, but also on an external signal ’xx‘.

st=f(st1,xt;θ)s_{t} = f( s_{t-1}, x_{t}; \theta )

The system takes a signal xtx_t as input during time step ’tt’, updates it’s state based on the influence of xtx_t and the previous state, st1s_{t-1}. Now, let’s try to think of such a system as a neural network. The state of the system can be seen as the hidden units of a neural network. The signal ’xx’ at each time step, can be seen as a sequence of inputs given to the neural network, one input per time step. At this point, you should know that we are using the term “time step” interchangeably with steps in a sequence.

The state of such a neural network can be represented with hidden layer hth_t, as follows:

ht=f(ht1,xt;θ)h_{t} = f( h_{t-1}, x_{t}; \theta )

The neural network we just described, has recursion/recurrence in it. Let’s start calling it an RNN. It can be compactly represented as a cyclic circuit diagram, where the input signal is processed with a delay. This cyclic circuit can be unfolded into a graph as follows:

Notice the repeated appliation of function ‘f’.

Let gtg_t be the function that represents the unfolded graph after ’tt’ time steps.

Now we can express the hidden state at time ’tt’, two ways:

  • as a recurrence relation, as we have seen before: ht=f(ht1,xt;θ)h_t=f(h_t-1,x_t;\theta)
  • as an unfolded graph: ht=gt(xt,xt1,...x1)h_t=g_t(x_t,x_{t-1},...x_1)

Suriyadeepan Ram’s Blog gives excellent examples of extending this to forward propagation through RNNs, Markov properties, converting vectors to sequences, Seq2Seq, bidirectional RNNs, and deep RNNs.

What determines the size of an unfolded graph?

(click the triangle to expand and reveal the full answer)

‘unfolding’ is dependent on the length of the input sequence. To understand this, suppose you want to lay down the exact computations that are happening in an RNN, in that case, you have to ‘unfold’ your network and the size of your ‘unfolded’ graph would depend on the size of the input sequence. For more information refer to this page. It says that “By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word.”

What are the advantages of an unfolded graph?

(click the triangle to expand and reveal the full answer)

The unfolding process introduces two major advantages:

  1. Regardless of sequence length, learned model has same input size (because it is specified in terms of transition from one state to another state rather than specified in terms of a variable length history of states)
  2. Possible to use same function ff with same parameters at every step

What does the output of the hidden layer of a RNN at any arbitrary time t represent?

(click the triangle to expand and reveal the full answer)

The output of the hidden layer at time tt (hth_t) represents the outputs of a function on the input xx at time tt and the ouput of the hidden layer from time t1t-1 (ht1h_{t-1})

Are the output of hidden layers of RNNs lossless? If not, why?

(click the triangle to expand and reveal the full answer)

Like standard backpropagation, [backpropagation through time] consists of a repeated application of the chain rule. The subtlety is that, for recurrent networks, the loss function depends on the activation of the hidden layer not only through its influence on the output layer, but also through its influence on the hidden layer at the next timestep.

Supervised Sequence Labelling with Recurrent Neural Networks, 2008

RNNs are used for various tasks. From a RNNs point of view, what tasks are more demanding than others?

(click the triangle to expand and reveal the full answer)

Compared to architectures like 1D Convolutional layers, looking back and looking ahead are tasks that are more demanding for RNNs (this may require constructing a Bi-directional RNN).

RNNs are limited in how far back they can remember information from. While LSTMs are an improvement, remembering information from more than 10,000 time steps back is extremely difficult. This is made more difficult if the time series data is extremely varied (and could subject the RNN to gradient instability earlier in training).

Another issue of RNNs is that they are not hardware friendly. It takes a lot of resources we do not have to train these network fast. Also it takes much resources to run these model in the cloud, and given that the demand for speech-to-text is growing rapidly, the cloud is not scalable.

RNNs and LSTM are difficult to train because they require memory-bandwidth-bound computation, which is the worst nightmare for hardware designer and ultimately limits the applicability of neural networks solutions. In short, LSTM require 4 linear layer (MLP layer) per cell to run at and for each sequence time-step. Linear layers require large amounts of memory bandwidth to be computed, in fact they cannot use many compute unit often because the system has not enough memory bandwidth to feed the computational units. And it is easy to add more computational units, but hard to add more memory bandwidth (note enough lines on a chip, long wires from processors to memory, etc). As a result, RNN/LSTM and variants are not a good match for hardware acceleration.

Discuss some examples of important design patterns of classical RNNs.

(click the triangle to expand and reveal the full answer)
RNN Design PatternIllustrationExample
One-to-one (Tx=Ty=1T_x=T_y=1)Traditional neural network
One-to-many Tx=1,Ty>1T_x=1,T_y>1Music generation
Many-to-one Tx>1,Ty=1T_x>1,T_y=1Sentiment classification
Many-to-many Tx=TyT_x=T_yName entity recognition
Many-to-many TxTyTx \neq TyMachine translation

Write the equations for a classical RNN where hidden layer has recurrence. How would you define the loss in this case? What problems you might face while training it?

(click the triangle to expand and reveal the full answer)

Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:

For each timestep tt, the activation a<t>a^{< t >} and the output y<t>y^{< t >} are expressed as follows:

a<t>=g1(Waaa<t1>+Waxx<t>+ba)andy<t>=g2(Wyaa<t>+by)\boxed{a^{< t >}=g_1(W_{aa}a^{< t-1 >}+W_{ax}x^{< t >}+b_a)}\quad \\ \text{and} \\ \quad \boxed{y^{< t >}=g_2(W_{ya}a^{< t >}+b_y)}

where Wax,Waa,Wya,ba,byW_{ax}, W_{aa}, W_{ya}, b_a, b_y are coefficients that are shared temporally and g1,g2g_1, g_2 activation functions.

the loss function L\mathcal{L} of all time steps is defined based on the loss at every time step as follows: L(y^,y)=t=1TyL(y^<t>,y<t>)\boxed{\mathcal{L}(\widehat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}(\widehat{y}^{< t >},y^{< t >})}

One challenge of this loss is that it is difficult to capture long term dependencies because of multiplicative gradients that can be exponentially decreasing/increasing with respect to the number of layers

What is backpropagation through time?

(click the triangle to expand and reveal the full answer)

Backpropagation is done at each point in time in the update process for an RNN. At timestep TT, the derivative of the loss LL with respect to weight matrix WW is expressed as follows:

L(T)W=t=1TL(T)W(t)\boxed{\frac{\partial \mathcal{L}^{(T)}}{\partial W}=\sum_{t=1}^T\left.\frac{\partial\mathcal{L}^{(T)}}{\partial W}\right|_{(t)}}

Consider a RNN that has only output to hidden layer recurrence. What are its advantages or disadvantages compared to a RNN having only hidden to hidden recurrence?

(click the triangle to expand and reveal the full answer)

An RNN that has only output-to-hidden layer recurrence is what’s known as a one-to-many RNN. Such RNNs may take in an input at only the beginning time-step, and for the rest o the network use only its own outputs as sources for subsequent hidden layers. Such networks are extremely useful for generating sequences from a very small or short sequence. However, compared to hidden-to-hidden recurrence, they may forget data from time-steps further in the past (i.e., more gradient instability). As time goes on, more of the data the network learns from may consist of the network’s own outputs (hence more limited in applications).

What is Teacher forcing? Compare and contrast with BPTT.

(click the triangle to expand and reveal the full answer)

Teacher forcing is a strategy for training recurrent neural networks that uses model output from a prior time step as an input. Teacher forcing works by using the actual or expected output from the training dataset at the current time step y(t)y(t) as input in the next time step X(t+1)X(t+1), rather than the output generated by the network.

When we have both hidden-to-hidden and output-to-hidden feedbacks in recurrent neural networks, we can use both back propagation through time (BPTT) and teacher forcing learning methods.The main advantage of teacher forcing is to parallelize training of different time steps, where BPTT cannot exploit such parallelization.

What is the disadvantage of using a strict teacher forcing technique? How to solve this?

(click the triangle to expand and reveal the full answer)

Teacher forcing is a strategy for training recurrent neural networks that uses model output from a prior time step as an input. Teacher forcing works by using the actual or expected output from the training dataset at the current time step y(t)y(t) as input in the next time step X(t+1)X(t+1), rather than the output generated by the network.

When we have both hidden-to-hidden and output-to-hidden feedbacks in recurrent neural networks, we can use both back propagation through time (BPTT) and teacher forcing learning methods.The main advantage of teacher forcing is to parallelize training of different time steps, where BPTT cannot exploit such parallelization.

Explain the vanishing/exploding gradient phenomenon for recurrent neural networks.

(click the triangle to expand and reveal the full answer)

To train an RNN, people usually use backpropagation through time (BPTT), which means that you choose a number of time steps NN, and unroll your network so that it becomes a feedforward network made of NN duplicates of the original network, while each of them represents the original network in another time step.

unrolling an RNN

So BPTT is just unrolling your RNN, and then using backpropagation to calculate the gradient (as one would do to train a normal feedforward network).

Because our feedforward network was created by unrolling, it is NN times as deep as the original RNN. Thus the unrolled network is often very deep.

In deep feedforward neural networks, backpropagation has “the unstable gradient problem”, as Michael Nielsen explains in the chapter Why are deep neural networks hard to train? (in his book Neural Networks and Deep Learning):

[…] the gradient in early layers is the product of terms from all the later layers. When there are many layers, that’s an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out.

I.e. the earlier the layer, the longer the product becomes, and the more unstable the gradient becomes. (For a more rigorous explanation, see this answer.)

Another issue is that the product that gives the gradient contains many instances of the same term. I.e., The product that gives the gradient includes the weights of every later layer.

So in a normal feedforward neural network, this product for the ddth-to-last layer might look like:

w1α1w2α2  wdαdw_1\cdot\alpha_{1}\cdot w_2\cdot\alpha_{2}\cdot\ \cdots\ \cdot w_d\cdot\alpha_{d}

Nielsen explains that (with regard to absolute value) this product tends to be either very big or very small (for a large dd).

But in an unrolled RNN, this product would look like:

wα1wα2  wαdw\cdot\alpha_{1}\cdot w\cdot\alpha_{2}\cdot\ \cdots\ \cdot w\cdot\alpha_{d}

as the unrolled network is composed of duplicates of the same network.

Whether we are dealing with numbers or matrices, the appearance of the same term dd times means that the product is much more unstable (as the chances are much smaller that “all those products of terms come close to balancing out”).

And so the product (with regard to absolute value) tends to be either exponentially small or exponentially big (for a large dd).

In other words, the fact that the unrolled RNN is composed of duplicates of the same network makes the unrolled network’s “unstable gradient problem” more severe than in a normal deep feedforward network.

Why don’t we see the vanishing/exploding gradient phenomenon in feedforward networks?

(click the triangle to expand and reveal the full answer)

In non-recurrent feedforward networks, it is possible to produce unstable gradient problems with a sufficiently large network (and one without regularization). That being said, this is rarely the case, and the number of layers gradients have to be passed through in a very deep MLP is typically far less than the number of time-steps for which gradients in an RNN need to be computed.

The main reasons for the vanishing/exploding gradient problem are the following traits of BPTT in RNNs:

  • An unrolled RNN tends to be a very VERY deep network.
  • In an unrolled RNN the gradient in an early layer is a product that (also) contains many instances of the same term.

What is the key difference in architecture of LSTMs/GRUs compared to traditional RNNs?

(click the triangle to expand and reveal the full answer)

All RNNs have feedback loops in the recurrent layer. This lets them maintain information in ‘memory’ over time. But, it can be difficult to train standard RNNs to solve problems that require learning long-term temporal dependencies. This is because the gradient of the loss function decays exponentially with time (called the vanishing gradient problem).

LSTM networks are a type of RNN that uses special units in addition to standard units. LSTM units include a ‘memory cell’ that can maintain information in memory for long periods of time. A set of gates is used to control when information enters the memory, when it’s output, and when it’s forgotten. This architecture lets them learn longer-term dependencies. GRUs are similar to LSTMs, but use a simplified structure. They also use a set of gates to control the flow of information, but they don’t use separate memory cells, and they use fewer gates.

What is the difference between LSTM and GRU?

(click the triangle to expand and reveal the full answer)

Structural Differences

While both GRUs and LSTMs contain gates, the main difference between these two structures lies in the number of gates and their specific roles. The role of the Update gate in the GRU is very similar to the Input and Forget gates in the LSTM. However, the control of new memory content added to the network differs between these two.

In the LSTM, while the Forget gate determines which part of the previous cell state to retain, the Input gate determines the amount of new memory to be added. These two gates are independent of each other, meaning that the amount of new information added through the Input gate is completely independent of the information retained through the Forget gate.

As for the GRU, the Update gate is responsible for determining which information from the previous memory to retain and is also responsible for controlling the new memory to be added. This means that the retention of previous memory and addition of new information to the memory in the GRU is NOT independent.

Another key difference between the structures is the lack of the cell state in the GRU, as mentioned earlier. While the LSTM stores its longer-term dependencies in the cell state and short-term memory in the hidden state, the GRU stores both in a single hidden state. However, in terms of effectiveness in retaining long-term information, both architectures have been proven to achieve this goal effectively.

Speed Differences

GRUs are faster to train as compared to LSTMs due to the fewer number of weights and parameters to update during training. This can be attributed to the fewer number of gates in the GRU cell (two gates) as compared to the LSTM’s three gates.

In the code walkthrough further down in this article, we’ll be directly comparing the speed of training an LSTM against a GRU on the exact same task.

Performance Evaluations

The accuracy of a model, whether it is measured by the margin of error or proportion of correct classifications, is usually the main factor when deciding which type of model to use for a task. Both GRUs and LSTMs are variants of RNNS and can be plugged in interchangeably to achieve similar results.

Explain Gradient Clipping.

(click the triangle to expand and reveal the full answer)

Gradient clipping is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.

Adam and RMSProp adjust the size of gradients based on previously seen gradients. Do they inherently perform gradient clipping? If no, why?

(click the triangle to expand and reveal the full answer)

Adam and RMSProp make multiplicative adjustments to gradients, but this is not the same as Gradient clipping. Gradient clipping involves setting predetermined cutoffs beyond which the gradients cannot go.

Discuss RNNs in the context of Bayesian Machine Learning.

(click the triangle to expand and reveal the full answer)

Traditional RNNs and Similarity to Bayes Rule In essence, Bayesian means probabilistic. The specific term exists because there are two approaches to probability (bayesian and frequentist). Bayesians think of it as a measure of belief, so that probability is subjective and refers to the future. Specificaly, Bayesians use Bayes’ Rule to update a probability based on a predictor prior probability, likelihood, and class prior probability using the sum rule (p(X)=Yp(X,Y)p(X)= \sum_Y p(X, Y)) and product rule (p(X,Y)=p(YX)p(X)p(X, Y) = p(Y | X)p(X)) of probability. Changing probabilities with new updates through time is similar to what RNNs do.

For example, if we want to predict a given category cc of data XX (contains data points x1,....,xnx_1, ...., x_n)that will be chosen from a sample next (for example, cc could be one of a variety of words from a dictionary), after already having done nn samples (or having passed through nn words already), we could use Bayes rule as follows:

P(cx))Posterior Probability=P(xc)likelihoodP(c)Class Prior ProbabilityP(x)predictor prior probability\underset{\text{Posterior Probability}}{P(c | x))} = \frac{\overset{\text{likelihood}}{P(x|c)} \overset{\text{Class Prior Probability}}{P(c)}}{\underset{\text{predictor prior probability}}{P(x)}}


p(cX)=p(x1c)×p(x2c)×...×p(xnc)×p(c)p(c | X) = p(x_1 | c) \times p(x_2 | c) \times ... \times p(x_n | c) \times p(c)

In short, the way classical RNNs update (and by extension LSTMs and GRUs) is very similar to how Bayes’ rule would be applied to time-series data.

Bayesian RNNs

It is also possible to update RNNs through variational Bayes. This is discussed further in a paper out of DeepMind: “Bayesian Recurrent Neural Networks”

Can we do Batch Normalization in RNNs? If not, what is the alternative?

(click the triangle to expand and reveal the full answer)

No, you cannot use Batch Normalization on a recurrent neural network, as the statistics are computed per batch, this does not consider the recurrent part of the network. Weights are shared in an RNN, and the activation response for each “recurrent loop” might have completely different statistical properties

Other techniques similar to Batch Normalization that take these limitations into account have been developed, for example Layer Normalization. There are also reparametrizations of the LSTM layer that allow Batch Normalization to be used, for example as described in Recurrent Batch Normalization by Coijmaans et al. 2016.

Some DL frameworks make it seem like you can use batch normalization in RNNs, but these typically involve reparametrized LSTMs and not vanilla recurrent networks (which need modifications or a different form of batch normalization)


What is an Autoencoder? What does it “auto-encode”?

(click the triangle to expand and reveal the full answer)

In machine learning, Autoencoding is reducing the dimensions of the dataset while learning how to ignore noise. An Autoencoder is an unsupervised artificial neural network used for learning. Specifically, it learns how to compress data then reconstructing it back to a representation close to the original form.

What were Autoencoders traditionally used for? Why there has been a resurgence of Autoencoders for generative modeling?

(click the triangle to expand and reveal the full answer)

Autoencoders were traditionally used for pre-training of Artificial Neural Networks. According to the history provided in Schmidhuber, “Deep learning in neural networks: an overview,” Neural Networks (2015), auto-encoders were proposed as a method for unsupervised pre-training in Ballard, “Modular learning in neural networks,” Proceedings AAAI (1987). It’s not clear if that’s the first time auto-encoders were used, however; it’s just the first time that they were used for the purpose of pre-training ANNs.

Generative models typicaly fall into either unconditional models (sampling xx from a probability distribution p(x)p(x)) or conditional models (sampling xx from a probability distribution p(xy)p(x | y)). In both cases, being able to generate complex outputs from a learnable function (or learnable latent variables) is extremely useful. Autoencoders are useful for generating these lower-dimensional representations from which diverse-yet-realistic outputs can be obtained.

What is recirculation?

(click the triangle to expand and reveal the full answer)

Unlike general feedforward networks, autoencoders may also be trained using recirculation (Hinton and McClelland, 1988), a learning algorithm based on comparing the activations of the network on the original input to the activations on the reconstructed input. Recirculation is regarded as more biologically plausible than back-propagation but is rarely used for machine learning applications. Specific recirculation algorithms vary between some autoencoder types

What loss functions are used for Autoencoders?

(click the triangle to expand and reveal the full answer)

Many autoencoders work well with loss functions like the mean squared error loss function. Adding regularization to the loss is useful for giving the autoencoder other properties besides the ability to directly copy its input to its output. Many other regression loss functions also work well with autoencoders.

Cross-entropy and other classification loss functions are generaly poor choices for most autoencoders.

What is a linear autoencoder? Can it be optimal (lowest training reconstruction error)? If yes, under what conditions?

(click the triangle to expand and reveal the full answer)

In its simplest form, the autoencoder is a three layers net, i.e. a neural net with one hidden layer. The input and output are the same, and we learn how to reconstruct the input.

Such an architecture can be optimal (i.e., lowest training error for a given hidden layer size) if it has the following properties:

  • Tied Weights
  • Weight Orthognoality
  • Feature non-correlation
  • Unit norm of the weights being 1

What is the difference between Autoencoders and PCA?

(click the triangle to expand and reveal the full answer)

PCA is restricted to a linear map, while auto encoders can have nonlinear enoder/decoders.

A single layer auto encoder with linear transfer function is nearly equivalent to PCA, where nearly means that the WW found by an autoencoder and PCA won’t be the same—but the subspace spanned by the respective WW’s will.

What is the impact of the size of the hidden layer in Autoencoders?

(click the triangle to expand and reveal the full answer)

We can an autoencoder very powerful by increasing the size of the hidden layers. Increasing these hyperparameters will let the autoencoder learn more complex codings. But, we should be careful to not make it too powerful, otherwise the autoencoder will simply learn to copy its inputs to the output, without learning any meaningful representation. It will just mimic the identity function. The autoencoder will reconstruct the training data perfectly, but it will be overfitting without being able to generalize to new instances, which is not what we want. This is why we use a “sandwich” architecture with hidden layer sizes smaller than the inputs or outputs (also known as a “bottleneck” architecture).

What is an undercomplete Autoencoder? Why is it typically used for?

(click the triangle to expand and reveal the full answer)

An undercomplete autoencoder has no explicit regularization term - we simply train our model according to the reconstruction loss. Thus, our only way to ensure that the model isn’t memorizing the input data is the ensure that we’ve sufficiently restricted the number of nodes in the hidden layer(s). As such, it is typically used in dimensionality reduction cases.

What problems might a nonlinear undercomplete Autoencoder face?

(click the triangle to expand and reveal the full answer)

If too much of the input to the undercomplete autoencoder is random noise or irrelevant data, the autoencoder might not be able to learn any meaningful representations of the data.

What are overcomplete Autoencoders? What problems might they face? Does the scenario change for linear overcomplete autoencoders?

(click the triangle to expand and reveal the full answer)

In overcomplete autoencoders, where the dimension of the latent representation is greater than the input. In these cases, even a linear encoder and linear decoder can learn to copy the input to the output without learning anything useful about the data distribution.

Discuss the importance of regularization in the context of Autoencoders.

(click the triangle to expand and reveal the full answer)

We can regularize the autoencoder by using a sparsity constraint such that only a fraction of the nodes would have nonzero values, called active nodes. In particular, we add a penalty term to the loss function such that only a fraction of the nodes become active. This forces the autoencoder to represent each input as a combination of small number of nodes, and demands it to discover interesting structure in the data. This method works even if the code size is large, since only a small subset of the nodes will be active at any time.

Why does generative autoencoders not require regularization?

(click the triangle to expand and reveal the full answer)

The generative section of a generative autoencoder (i.e., the decoder) mimics the process by which the input data was created. Rather than reconstructing input data as closely as possible, the training process forces the generative autoencoder to produce outputs that fit within a distribution of the traning data features. As such, regularization through sparsity constraints is redundant.

What are sparse autoencoders?

(click the triangle to expand and reveal the full answer)

A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty. In most cases, we would construct our loss function by penalizing activations of hidden layers so that only a few nodes are encouraged to activate when a single sample is fed into the network. We can construct our sparsity penalty with techniques like L1L_1 regularization or KL-divergence.

The intuition behind this method is that fewer nodes activating while still keeping the autoencoder’s performance during training would guarantee that the autoencoder is actually learning latent representations instead of redundant information in our input data.

What is a denoising autoencoder? What are its advantages? How does it solve the overcomplete problem?

(click the triangle to expand and reveal the full answer)

Denoising autoencoders purposefully corrupt the data they are trained on by randomly setting some of the input values to zero. This makes them more robust against noise and very useful for feature selection and extraction.

With the overcomplete problem, an autoencoder risks learning the “identity function” or “null function”. This means the output and input are identical. Randomly setting some of the inputs to 0 (usually around 50%, or sometimes as low as 30% depending on the amount of data) makes it much harder to learn an exact, uncompressed representation of all the input data.

What is score matching? Discuss it’s connections to DAEs.

(click the triangle to expand and reveal the full answer)

Score matching (SM) is an alternative to the maximum likelihood principle suitable for unnormalized probability density models whose partition function is intractable. Score matching searches for parameters that are more robust to small-noise perturbations of the training data.

Pascal Vincent from the University of Montreal proved that raining an denoising autoencoder defined is equivalent to performing score matching (explicit or implicit) with a described energy function on Parzen density estimate qσq_\sigma. Such a training would typically use stochastic gradient descent, whereby samples from qσq_\sigma are obtained by corrupting samples from the input dataset. This can all be carried out with a variety of optimization objective formulations. In other words, this paper demonstrates the equivalence between SM & certain DAEs.

Are there any connections between Autoencoders and RBMs?

(click the triangle to expand and reveal the full answer)

Restricted Boltzman Machines (RBMs) share a similar idea as autoencoders, but use a stochastic approach. Instead of deterministic (e.g. logistic or ReLU) it uses stochastic units with particular (usually binary of Gaussian) distribution. The learning procedure consists of several steps of Gibbs sampling (propagate: sample hiddens given visibles; reconstruct: sample visibles given hiddens; repeat) and adjusting the weights to minimize reconstruction error.

The intuition behind RBMs is that there are some visible random variables (e.g. film reviews from different users) and some hidden variables (like film genres or other internal features), and the task of training is to find out how these two sets of variables are actually connected to each other

Both models have their use cases, pros and cons, but probably the most important properties are:

  1. Autoencoders are simplest ones. They are intuitively understandable, easy to implement and to reason about (e.g. it’s much easier to find good meta-parameters for them than for RBMs).
  2. RBMs are generative. That is, unlike autoencoders that only discriminate some data vectors in favour of others, RBMs can also generate new data with given joined distribution. They are also considered more feature-rich and flexible.

What is manifold learning? How are denoising and contractive autoencoders equipped to do manifold learning?

(click the triangle to expand and reveal the full answer)

Even though input data for AI tasks such as image, text, audio reside in a high dimensional space (e.g. 28×2828 \times 28 black and white image has 784784 degrees of freedom yielding 27842^{784} possible images), most uniformly sampled output (for instance sampling from those 27842^{784} possible images) would not be naturally occurring images.

The basic idea of the manifold hypothesis is that there exists a lower dimensional manifold in which these naturally occurring images actually lie. So the model learning task becomes learning to output representations that map the naturally occurring images in the high dimensional input space to the low dimensional manifold. The idea is that the small variations of the naturally occurring images (e.g. rotations) etc are mapping to corresponding changes in the learned representation (see figure above) in the low dimensional manifold. PCA is an example of a manifold mapping algorithm where the manifold is linear. Autoencoders are inspired by the manifold hypothesis and learn lower dimensional representations of high dimensional data. Even though autoencoders are known to perform dimensionality reduction, the manifold view gives a deeper understanding of this mapping. Both denoising and contractive autoencoders, by resisting learning non-representative noise, are equiped to learn lower-dimensional latent representations that approximate such a manifold.

What is a contractive autoencoder? Discuss its advantages. How does it solve the overcomplete problem?

(click the triangle to expand and reveal the full answer)

Contractive autoencoders are a type of regularized autoencoder. Contractive autoencoders penalize large derivatives in the encoding hh of input data xx, with a pentalty term λ\lambda (x\nabla_x is a scaling factor).

Ω(h,x)=λixhi2\Omega(h, x) = \lambda \sum_i \| \nabla_xh_i \|^2

Optimal values have a very small (specifically, 0) derivative. In gradient descent, we take large steps when our loss function’s derivative is large, and small steps when our loss function’s derivative is small. Contractive autoencoders are similar; they grow your loss function when your derivative is large, encouraging your model to take larger steps during gradient descent.

L(x,g(f(x)))+Ω(h,x)L(x, g(f(x))) + \Omega(h, x)

Penalizing large derivatives in the encoding (like the kinds that would appear in learning noise that deviates from the latent representation) reduces the likelihood of learning an “identity function” of the input data.

Why is a contractive autoencoder named so?

(click the triangle to expand and reveal the full answer)

The name “contractive” comes from the fact that the CAE is encouraged to map a neighborhood of input points to a smaller neighborhood of output points.

What are the practical issues with CAEs? How to tackle them?

(click the triangle to expand and reveal the full answer)

Since the goal of CAEs is robustness of representation (i.e., feature extraction), having training data that’s too uniform risks undoing the regularizing safeguards of the architecture. This can be averted with larger amounts of training data.

When training a contractive autoencoder, contractive loss usually needs to be defined separately. An example of a CAE implementation (in Keras) is as follows:

from keras.layers import Input, Dense
from keras.models import Model
import keras.backend as K

lam = 1e-4
inputs = Input(shape=(N,))
encoded = Dense(N_hidden, activation='sigmoid', name='encoded')(inputs)
outputs = Dense(N, activation='linear')(encoded)

model = Model(input=inputs, output=outputs)

def contractive_loss(y_pred, y_true):
    mse = K.mean(K.square(y_true - y_pred), axis=1)

    W = K.variable(value=model.get_layer('encoded').get_weights()[0])  # N x N_hidden
    W = K.transpose(W)  # N_hidden x N
    h = model.get_layer('encoded').output
    dh = h * (1 - h)  # N_batch x N_hidden

    # N_batch x N_hidden * N_hidden x 1 = N_batch x 1
    contractive = lam * K.sum(dh**2 * K.sum(W**2, axis=1), axis=1)
    return mse + contractive
model.compile(optimizer='adam', loss=contractive_loss), X, batch_size=N_batch, nb_epoch=5)

What is a stacked autoencoder? What is a deep autoencoder? Compare and contrast.

(click the triangle to expand and reveal the full answer)

“Stacking” is to literally feed the output of one autoencoder to the input of the next autoencoder. As the name suggests, a stacked autoencoder is a bunch of smaller autoencoders stacked on top of each other. Stacked Denoising Autoencoders are a tool for unsupervised/semisupervised learning.

Any deep network is created by stacking layers. It’s true that if there were no non-linearities in the layers you could collapse the entire network to a single layer, but there are non-linearities and you can’t. “Stacking” isn’t generally used to describe connecting simple layers, but that’s what it is, and stacking autoencoders — or other blocks of layers — is just a way of making more complex networks.

Compare the reconstruction quality of a deep autoencoder vs. PCA.

(click the triangle to expand and reveal the full answer)

There are a lot of comparisons to be made between autoencoders and PCA is essentially a linear transformation but Auto-encoders are capable of modelling complex non linear functions. PCA features are totally linearly uncorrelated with each other since features are projections onto the orthogonal basis, but autoencoded features might have correlations since they are just trained for accurate reconstruction.

A single layered autoencoder with a linear activation function is very similar to PCA. However, a deep autoencoder can easily beat the reconstruction quality of PCA (though it’s important to make sure regularization is used so the reconstruction isn’t just a complex identity function).

What is predictive sparse decomposition?

(click the triangle to expand and reveal the full answer)

From Yann LeCunn’s lab at NYU

One problem with traditional sparse coding is that inference is somewhat slow. Give an input vector, finding the corresponding code vector requires an L2/L1L_2/L_1 optimization. Having to do this for every patch in an image would preclude the use of sparse coding for high-speed image recognition.

Predictive Sparse Decomposition (PSD) alleviates this problem by training a simple, feed-forward encoder module to predict an approximation to the optimal sparse code. The factor graph is shown below.

This could be seen as a kind of auto-encoder. The energy function has an additional term, which measure the discrepancy between the predicted code and actual code. The encoder can take several function forms. In the simplest instance, it is simply a linear transform, followed by a sigmoid non-linearity, and a trainable diagonal gain matrix.

The animation below shows the filters being learned by the PSD learning procedure, when trained on 12x12 pixel patches from natural images.

Discuss some applications of Autoencoders.

(click the triangle to expand and reveal the full answer)

One popular application of autoencoding is Data denoising, which is doable on data ranging from image inputs to audio data.

Another application is Dimensionality reduction. Working with high dimensional data presents lots of challenges, one of which is visualization. Autoencoders are usually used as a preprocessing stage to visualization methods such as t-SNE.

Other examples of applications include data compression, feature extraction, image generation, and sequence-to-sequence prediction

Representation Learning

What is representation learning? Why is it useful?

(click the triangle to expand and reveal the full answer)

Representation learning is learning representations of input data typically by transforming it or extracting features from it(by some means), that makes it easier to perform a task like classification or prediction. There are various ways of learning different representations. For instance,

  • in the case of probabilistic models, the goal is to learn a representation that captures the probability distribution of the underlying explanatory features for the observed input. Such a learnt representation can then be used for prediction.
  • in deep learning, the representations are formed by composition of multiple non-linear transformations of the input data with the goal of yielding abstract and useful representations for tasks like classification, prediction etc.

Focussing specifically on deep learning, representation learning is the consequence of the function a model learns where the learning is captured in the parameters of the model, as the function transforms input to output, during training. Representation learning here is referring to the nature/characteristics of the transformed input - not the model parameters/ function that is causal to it. The casual role is played both by the architecture of the model, and the learned parameters (e.g. does a parameter play a role in representing part or all of the input etc.) in mapping input to output.

What is the relation between Representation Learning and Deep Learning?

(click the triangle to expand and reveal the full answer)

Deep learning is but one of the many ways to learn representations and “depth” in deep learning just happens to be one of the many factors to learning a good representation, even though it is an important one.

In deep learning, “depth” refers to a hierarchical organization of explanatory features. As humans we describe the world using a hierarchy of concepts, with more abstract concepts layered on top of less abstract ones. Similarly, deep learning models learn functions that transform input to output using a composition of non-linear functions stacked in layers, where the output of layers form hierarchy of distributed representations with increasing levels of abstraction as input flows through them.

In addition to outputting progressive levels of abstract feature representations, deep learning architectures also enable feature reuse. Just as features are reused to represent different input regions in distributed representation, depth allows for feature reuse across layers by the multiple circuit paths in the computational graph from input to output through the nodes in the layers of the network.

What is one-shot and zero-shot learning? Give examples.

(click the triangle to expand and reveal the full answer)

They are both classification problems but with different requirements on how many training examples you get. Normally, you get many training examples (e.g. thousands or more) per category before you are tested on things that you haven’t seen before.

In one-shot learning (first defined in “One-Shot learning of object categories”), you get only 1 or a few training examples in some categories. For example, in some classification tasks, there may be little-to-no data or the number of each category may be severely imbalanced (such as in medical diagnoses, where healthy examples may be far more common than a rare disease). For machine learning areas like semi-supervised classification, ,some of the data may be only partly labelled. The model will need to label new incoming data, and correctly use this newly labelled data to learn features of the category more. This whole process is dependent on how well the model can learn the category features in the beginning with few examples to draw on.

In zero-shot learning (first defined in “Zero-Shot Learning with Semantic Output Codes” ), you are not presented with every class label in training. So in some categories, you get 0 training examples. One real-world example of this is classification of fMRI data. Because human thoughts can take so many diverse forms, it’s more than likely interpreting fMRI patterns would require making sense of previously unseen fMRI patterns.

What trade-offs does representation learning have to consider?

(click the triangle to expand and reveal the full answer)

From Deep Learning (by Ian Goodfellow, Yoshua Bengio, and Aaron Courville):

Most representation learning problems face a trade-off between preserving as much information about the input as possible and attaining nice properties (such as independence).

Yoshua Bengio’s 2014 paper goes into more detail:

One of the challenges of representation learning that distinguishes it from other machine learning tasks such as classification is the difficulty in establishing a clear objective, or target for training. In the case of classification, the objective is (at least conceptually) obvious, we want to minimize the number of misclassifications on the training dataset. In the case of representation learning, our objective is far-removed from the ultimate objective, which is typically learning a classifier or some other predictor. Our problem is reminiscent of the credit assignment problem encountered in reinforcement learning. We have proposed that a good representation is one that disentangles the underlying factors of variation, but how do we translate that into appropriate training criteria? Is it even necessary to do anything but maximize likelihood under a good model or can we introduce priors such as those enumerated above (possibly data-dependent ones) that help the representation better do this disentangling? This question remains clearly open…

From a practitioners perspective, one can have at least a qualitative sense if not a quantitative one of how good a model’s representations are by the different tasks it can be used for. For instance,

  • the distributed representation of words (word embeddings) output by a simple model like word2vec with just two matrices as it basic architecture, has shown the power of non-local representation learning. Word embeddings have become the de facto representation of words for downstream NLP tasks
  • The recent BERT model mentioned earlier is another example of how rich distributed representations output by the model can be used for a variety of NLP tasks with very little task specific data and hardly any task specific architecture, to obtain state-of-art results on NLP tasks.
  • While BERT’s use case is one of reusing representations learnt from reconstruction of XX from a masked version of it, in P(YX,task)P(\frac{Y}{X}, \text{task}), GPT-2 is another recent model that learns representations from a language modeling task and these representations are reused for tasks without any labeled data for tasks that are typically supervised. This is done by cleverly crafting the supervised task as a language modeling task (predicting the next word given current sentence). While the performance of the model is not state-of-art yet on these tasks, it underscores the power and versatility of learnt representations, particularly distributed representations, which is largely responsible for the current successes in NLP.

What is greedy layer-wise unsupervised pretraining (GLUP)? Why greedy? Why layer-wise? Why unsupervised? Why pretraining?

(click the triangle to expand and reveal the full answer)

See Jason Brownlee’s blog for more in-depth details

Traditionally, training deep neural networks with many layers was challenging. As the number of hidden layers is increased, the amount of error information propagated back to earlier layers is dramatically reduced. This means that weights in hidden layers close to the output layer are updated normally, whereas weights in hidden layers close to the input layer are updated minimally or not at all. Generally, this problem prevented the training of very deep neural networks and was referred to as the vanishing gradient problem.

An important milestone in the resurgence of neural networking that initially allowed the development of deeper neural network models was the technique of greedy layer-wise pretraining, often simply referred to as ”pretraining.”

The technique is referred to as ”greedy” because the piecewise or layer-wise approach to solving the harder problem of training a deep network. As an optimization process, dividing the training process into a succession of layer-wise training processes is seen as a greedy shortcut that likely leads to an aggregate of locally optimal solutions, a shortcut to a good enough global solution.

Pretraining involves successively adding a new hidden layer to a model and refitting, allowing the newly added model to learn the inputs from the existing hidden layer, often while keeping the weights for the existing hidden layers fixed. This gives the technique the name ”layer-wise” as the model is trained one layer at a time.

Broadly, supervised pretraining involves successively adding hidden layers to a model trained on a supervised learning task. Unsupervised pretraining involves using the greedy layer-wise process to build up an unsupervised autoencoder model, to which a supervised output layer is later added.

What were/are the purposes of the above technique?

(click the triangle to expand and reveal the full answer)

We can expect the unsupervised pretraining to be more effective when the initial representation is poor. The use of word embeddings is a great example, where learned word embeddings naturally encode similarity between words. Other factors also matter. For example, unsupervised pretraining is likely to be most useful when the function to be learned is extremely complicated.

In short, the technique above reduces the time and search space needed to find the optimal weights of a deep learning model

Why does unsupervised pretraining work?

(click the triangle to expand and reveal the full answer)

Unsupervised pretraining combines two ideas:

  1. The choice of initial parameters of a deep neural network can have a significant regularizing effect;
  2. learning about the input distribution can help with learning about the mapping from inputs to outputs.

But both ideas are not fully understood.

From the representation view, the idea is that some features that are useful for unsupervised task are also useful for supervised tasks. But this is not understood at a mathematical or theoretical level.

When does unsupervised training work? Under which circumstances?

(click the triangle to expand and reveal the full answer)

Unsupervised pretraining may be appropriate when you have a significantly larger number of unlabeled examples that can be used to initialize a model prior to using a much smaller number of examples to fine tune the model weights for a supervised task.

Pretraining can be used to iteratively deepen a supervised model or an unsupervised model that can be repurposed as a supervised model. Pretraining may be useful for problems with small amounts labeled data and large amounts of unlabeled data.

Why might unsupervised pretraining act as a regularizer?

(click the triangle to expand and reveal the full answer)

From the regularizing view, it’s possible that pretraining initializes the model in a location that would otherwise be in accessible.

Thinking of this process as a regularizer, i.e. to avoid overfitting, we can expect unsupervised pretraining to be most helpful when there is few labeled data and large number of unlabeled data.

What is the disadvantage of unsupervised pretraining compared to other forms of unsupervised learning?

(click the triangle to expand and reveal the full answer)

Unsupervised pretraining has a very obvious disadvantage of operating with two separate phases. It has also way more hyperparameters, whose effect may be measured after training but are often difficult to predict ahead.

Another disadvantage of having two separate phases is that each phase has its own hyper parameters. The performance of the second phase usually cannot be predicted during the first phase, so there is a long delay between proposing the hyperparameters and updating them.

How do you control the regularizing effect of unsupervised pretraining?

(click the triangle to expand and reveal the full answer)

It’s always difficult to tell what aspects of the pretrained parameters are retained during the supervised training stage, making this kind of intractable. There are two ways to overcome this:

  • train supervised and unsupervised learning simultaneously.
  • freeze the parameters for the feature extractors and using supervised learning only to add a classifier on top of them.

How to select the hyperparameters of each stage of GLUP?

(click the triangle to expand and reveal the full answer)

we can train unsupervised and supervised learning simultaneously. Through this we can reduce the tuning to a single hyperparameter, usually a coefficient attached to the weight of the unsupervised cost.

We can also use validation set error in the supervised phase to select the hyperparameters of the pretraining phase. In practice, some hyperparameters, like the number of iterations, are more conveniently set during the pretraining phase using early stopping.

Monte Carlo Methods

What are deterministic algorithms?

(click the triangle to expand and reveal the full answer)

In computer science, a deterministic algorithm is an algorithm which, given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states.

This is why when we use algorithms like neural networks that are not necessarily deterministic, we make them deterministic by setting a seed value.

What are Las vegas algorithms?

(click the triangle to expand and reveal the full answer)

An example of a Las Vegas Algorithm would be a Randomized Quicksort, in which the pivot is chosen randomly, and divides the elements into three partitions: elements less than pivot, elements equal to pivot, and elements greater than pivot. The randomized QuickSort requires a lot of resources (and can have veried runtime depending on which pivot element is randomly chosen first) but it will always generate the sorted array as an output.

# Input A is an array of n elements
def randomized_quicksort(A):
    Partition A into elements < x, x, and >x  # as shown in the figure above.
    Execute Quicksort on A[1 to i-1] and A[i+1 to n].
    Combine the responses in order to obtain a sorted array.
    if n = 1:
        return A  # A is sorted.
        i = random.randrange(1, n)  # Will take a random number in the range 1~n
        X = A[i]  # The pivot element

Las Vegas algorithms arise frequently in search problems. For example, one looking for some information online might search related websites for the desired information. The time complexity thus ranges from getting “lucky” and finding the content immediately, to being “unlucky” and spending large amounts of time. Once the right website is found, then there is no possibility of error.

What are deterministic approximate algorithms?

(click the triangle to expand and reveal the full answer)

Approximation algorithms are efficient algorithms that find approximate solutions to optimization problems (in particular NP-hard problems) with provable guarantees on the distance of the returned solution to the optimal one. Many of these algorithms are randomized, but can be seeded with a predetermined seed number for a pseudorandom number generator. Hence, despite the “random sampling” , these are deterministic algorithms that will always produce the same output from the same input.

What are Monte Carlo algorithms?

(click the triangle to expand and reveal the full answer)

Monte Carlo Algorithms are randomized algorithms whose output may be incorrect with a certain (typically small) probability. If there is a procedure for verifying whether the answer given by a Monte Carlo algorithm is correct, and the probability of a correct answer is bounded above zero, then with probability one running the algorithm repeatedly while testing the answers will eventually give a correct answer.

A common example of a Monte Carlo Algorithm would be estimating the value of π\pi from points randomly selected (x,y)(x, y) points in a circle contained within a square with sides of length 1. We know that area of the square is 1 unit sq while that of circle is π(12)2=π4\pi \ast (\frac{1}{2})^{2} = \frac{\pi}{4}. Now for a very large number of generated points,

area of the circlearea of the square=no. of points generated inside the circletotal no. of points generated or no. of points generated inside the square\frac{\textrm{area of the circle}}{\textrm{area of the square}} = \frac{\textrm{no. of points generated inside the circle}}{\textrm{total no. of points generated or no. of points generated inside the square}}

that is,

π=4no. of points generated inside the circleno. of points generated inside the square\pi = 4 \ast \frac{\textrm{no. of points generated inside the circle}}{\textrm{no. of points generated inside the square}}

The beauty of this algorithm is that we don’t need any graphics or simulation to display the generated points. We simply generate random (x, y) pairs and then check if x2+y21x^{2} + y^{2} \leqslant 1. If yes, we increment the number of points that appears inside the circle. In randomized and simulation algorithms like Monte Carlo, the more the number of iterations, the more accurate the result is. This is why care was take to specify this example as ”estimating the value of Pi” and not “calculating the value of Pi”. The code looks like the following:

import random 

circle_points, square_points = 0, 0

# Total Random numbers generated= possible x 
# values* possible y values 
for i in range(INTERVAL**2): 
  # Randomly generated x and y values from a 
  # uniform distribution 
  # Rannge of x and y values is -1 to 1 
  rand_x= random.uniform(-1, 1) 
  rand_y= random.uniform(-1, 1) 
  # Distance between (x, y) from the origin 
  origin_dist= rand_x**2 + rand_y**2
  # Checking if (x, y) lies inside the circle 
  if origin_dist<= 1: 
    circle_points+= 1
  square_points+= 1
  # Estimating value of pi, 
  # pi= 4*(no. of points generated inside the 
  # circle)/ (no. of points generated inside the square) 
  pi = 4* circle_points/ square_points 
## print(rand_x, rand_y, circle_points, square_points, "-", pi) 
## print("\n") 

print("Final Estimation of Pi=", pi)

Here’s another visual example with calculating Euler’s constant. Being an irrational number the calculation can never reach e exactly, but it does get closer and closer to the true value.

Las Vegas algorithms can be contrasted with Monte Carlo algorithms, in which the resources used are bounded but the answer may be incorrect with a certain (typically small) probability. By an application of Markov’s inequality, a Las Vegas algorithm can be converted into a Monte Carlo algorithm by running it for set time and generating a random answer when it fails to terminate. We can also use the same method to can set the bound on the probability that the Las Vegas algorithm would go over the fixed limit.

Running TimeCorrectness
Las Vegas Algorithmprobabilisticcertain
Monte Carlo Algorithmcertainprobabilistic

If a deterministic way to test for correctness is available, then it is possible to turn a Monte Carlo algorithm into a Las Vegas algorithm. However, it is hard to convert Monte Carlo algorithm to Las Vegas algorithm without a way to test the algorithm. On the other hand, changing Las Vegas algorithm to Monte Carlo algorithm is easy. This can be done by running a Las Vegas algorithm for a specific period of time given by confidence parameter. If the algorithm finds the solution within the time, then it is success and if not then output can simply be “sorry”.

Adversarial Attacks

What are adversarial examples?

(click the triangle to expand and reveal the full answer)

Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines.

Discuss state-of-the-art adversarial attack techniques.

(click the triangle to expand and reveal the full answer)

The space of adversarial attacks is much larger than the space of provably robust defences. Some of the more successful attack types include the following (taken from this review paper):

  • L-BFGS Attack
  • Fast Gradient Sign Method (FGSM)
  • One-step Target Class Method (OTCM)
  • Basic Iterative Method (BIM) and Iterative Least-Likely Class Method (ILLC)
  • Jacobian-based Saliency Map Attack (JSMA)
  • DeepFool
  • CPPN EA Fool
  • C&W’s Attack
  • Zeroth Order Optimization (ZOO)
  • Universal Perturbation
  • One Pixel Attack
  • Feature Adversary
  • Hot/Cold method
  • Natural GAN
  • Model-based Ensembling Attack
  • Ground-Truth Attack
  • GPU Memory leaks

Discuss state-of-the-art defense techniques for adversarial models.

(click the triangle to expand and reveal the full answer)

Traditional techniques for making machine learning models more robust, such as weight decay and dropout, generally do not provide a practical defense against adversarial examples. So far, only two methods have provided a significant defense.

Adversarial training: This is a brute force solution where we simply generate a lot of adversarial examples and explicitly train the model not to be fooled by each of them. An open-source implementation of adversarial training is available in the cleverhans library and its use illustrated in the following tutorial.

Defensive distillation: This is a strategy where we train the model to output probabilities of different classes, rather than hard decisions about which class to output. The probabilities are supplied by an earlier model, trained on the same task using hard class labels. This creates a model whose surface is smoothed in the directions an adversary will typically try to exploit, making it difficult for them to discover adversarial input tweaks that lead to incorrect categorization. (Distillation was originally introduced in Distilling the Knowledge in a Neural Network as a technique for model compression, where a small model is trained to imitate a large one, in order to obtain computational savings.)

Yet even these specialized algorithms can easily be broken by giving more computational firepower to the attacker.

Why is it hard to defend against adversarial examples?

(click the triangle to expand and reveal the full answer)

Adversarial examples are hard to defend against because it is difficult to construct a theoretical model of the adversarial example crafting process. Adversarial examples are solutions to an optimization problem that is non-linear and non-convex for many ML models, including neural networks. Because we don’t have good theoretical tools for describing the solutions to these complicated optimization problems, it is very hard to make any kind of theoretical argument that a defense will rule out a set of adversarial examples.

Adversarial examples are also hard to defend against because they require machine learning models to produce good outputs for every possible input. Most of the time, machine learning models work very well but only work on a very small amount of all the many possible inputs they might encounter.

Most strategies that have been tested so far fail because they is not adaptive: they may block one kind of attack, but they leave another vulnerability open to an attacker who knows about the defense being used. Designing a defense that can protect against a powerful, adaptive attacker is an important research area.

References and Resources

In the ML community, it’s important to give credit where credit is due, and not just slap our name on someone else’s hard work cough. Here is a list of the references and resources I used for various equations, concepts, explanations, examples, and inspiration for visualizations:

Cited as:

  title   = "Deep Learning Concepts every practicioner should know",
  author  = "McAteer, Matthew",
  journal = "",
  year    = "2019",
  url     = ""

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatily, you can follow me on Twitter and reach out to me there.

See you in the next post 😄

I write about AI, Biotech, and a bunch of other topics. Subscribe to get new posts by email!

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

At least this isn't a full-screen popup

That'd be more annoying. Anyways, subscribe to my newsletter to get new posts by email! I write about AI, Biotech, and a bunch of other topics.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.