A Selective Overview of Deep Learning
Jianqing Fan∗
Cong Ma‡
Yiqiao Zhong∗
April 16, 2019
Abstract
Deep learning has arguably achieved tremendous success in recent years. In simple words, deep
learning uses the composition of many nonlinear functions to model the complex dependency between
input features and labels. While neural networks have a long history, recent advances have greatly
improved their performance in computer vision, natural language processing, etc. From the statistical
and scientific perspective, it is natural to ask: What is deep learning? What are the new characteristics of
deep learning, compared with classical methods? What are the theoretical foundations of deep learning?
To answer these questions, we introduce common neural network models (e.g., convolutional neural
nets, recurrent neural nets, generative adversarial nets) and training techniques (e.g., stochastic gradient
descent, dropout, batch normalization) from a statistical point of view. Along the way, we highlight new
characteristics of deep learning (including depth and over-parametrization) and explain their practical
and theoretical benefits. We also sample recent results on theories of deep learning, many of which are
only suggestive. While a complete understanding of deep learning remains elusive, we hope that our
perspectives and discussions serve as a stimulus for new statistical research.
Keywords: neural networks, over-parametrization, stochastic gradient descent, approximation theory, gen-
eralization error.
Contents
1 Introduction
  1.1 Intriguing new characteristics of deep learning
  1.2 Towards theory of deep learning
  1.3 Roadmap of the paper
2 Feed-forward neural networks
  2.1 Model setup
  2.2 Back-propagation in computational graphs
3 Popular models
  3.1 Convolutional neural networks
  3.2 Recurrent neural networks
  3.3 Modules
4 Deep unsupervised learning
  4.1 Autoencoders
  4.2 Generative adversarial networks
5 Representation power: approximation theory
  5.1 Universal approximation theory for shallow NNs
  5.2 Approximation theory for multi-layer NNs
Author names are sorted alphabetically.
∗Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; Email:
{jqfan, congm, yiqiaoz}@princeton.edu.
arXiv:1904.05526v2 [stat.ML] 15 Apr 2019
6 Training deep neural nets
  6.1 Stochastic gradient descent
  6.2 Easing numerical instability
  6.3 Regularization techniques
7 Generalization power
  7.1 Algorithm-independent controls: uniform convergence
  7.2 Algorithm-dependent controls
8 Discussion
1 Introduction
Modern machine learning and statistics deal with the problem of learning from data: given a training dataset {(y_i, x_i)}_{1≤i≤n}, where x_i ∈ R^d is the input and y_i ∈ R is the output¹, one seeks a function f : R^d → R from a certain function class F that has good prediction performance on test data. This problem is of fundamental significance and finds applications in numerous scenarios. For instance, in image recognition, the input x (resp. the output y) corresponds to the raw image (resp. its category), and the goal is to find a mapping f(·) that can classify future images accurately. Decades of research efforts in statistical machine learning have been devoted to developing methods that find f(·) efficiently with provable guarantees. Prominent examples include linear classifiers (e.g., linear / logistic regression, linear discriminant analysis), kernel methods (e.g., support vector machines), tree-based methods (e.g., decision trees, random forests), nonparametric regression (e.g., nearest neighbors, local kernel smoothing), etc. Roughly speaking, each aforementioned method corresponds to a different function class F from which the final classifier f(·) is chosen.
Deep learning [70], in its simplest form, proposes the following compositional function class:

    F = { f(x; θ) = W_L σ_L(W_{L−1} · · · σ_2(W_2 σ_1(W_1 x))) : θ = {W_1, . . . , W_L} }.   (1)

Here, for each 1 ≤ ℓ ≤ L, σ_ℓ(·) is some nonlinear function, and θ = {W_1, . . . , W_L} consists of matrices with appropriate sizes. Though simple, deep learning has made significant progress towards addressing the problem of learning from data over the past decade. Specifically, it has performed close to or better than humans in various important tasks in artificial intelligence, including image recognition [50], game playing [114], and machine translation [132]. Owing to its great promise, the impact of deep learning is also growing rapidly in areas beyond artificial intelligence; examples include statistics [15, 111, 76, 104, 41], applied mathematics [130, 22], clinical research [28], etc.
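To make (1) concrete, here is a minimal NumPy sketch of such a compositional function, assuming ReLU nonlinearities σ_ℓ and randomly drawn weight matrices (both are illustrative choices, not prescribed by (1)):

```python
import numpy as np

def deep_model(x, weights, sigma=lambda z: np.maximum(z, 0.0)):
    """Evaluate f(x; theta) = W_L sigma(... sigma(W_1 x)) as in (1)."""
    h = x
    for l, W in enumerate(weights):
        h = W @ h
        if l < len(weights) - 1:      # no nonlinearity after the last layer
            h = sigma(h)
    return h

# Example: d = 4 input features, two hidden layers of width 8, scalar output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))]
x = rng.normal(size=4)
print(deep_model(x, weights))        # a scalar prediction f(x; theta)
```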
Table 1: Winning models for the ILSVRC image classification challenge.

Model        Year     # Layers   # Params   Top-5 error
Shallow      < 2012   —          —          > 25%
AlexNet      2012     8          61M        16.4%
VGG19        2014     19         144M       7.3%
GoogleNet    2014     22         7M         6.7%
ResNet-152   2015     152        60M        3.6%
To get a better idea of the success of deep learning, let us take the ImageNet Challenge [107] (also known as ILSVRC) as an example. In the classification task, one is given a training dataset consisting of 1.2 million color images in 1000 categories, and the goal is to classify images based on the input pixels. The performance of a classifier is then evaluated on a test dataset of 100 thousand images, and in the end the top-5 error² is reported. Table 1 highlights a few popular models and their corresponding performance.
¹When the label y is given, this problem is often known as supervised learning. We mainly focus on this paradigm throughout this paper and remark sparingly on its counterpart, unsupervised learning, where y is not given.
²The algorithm makes an error if the true label is not contained in the 5 predictions made by the algorithm.
Figure 1: Visualization of trained filters in the first layer of AlexNet. The model is pre-trained on ImageNet
and is downloadable via PyTorch package torchvision.models. Each filter contains 11×11×3 parameters
and is shown as an RGB color map of size 11 × 11.
As can be seen, deep learning models (the second to the last rows) have a clear edge over shallow models (the first row) that fit linear or tree-based models on handcrafted features. This significant improvement raises a foundational question:

Why is deep learning better than classical methods on tasks like image recognition?
1.1 Intriguing new characteristics of deep learning
It is widely acknowledged that two indispensable factors contribute to the success of deep learning, namely
(1) huge datasets that often contain millions of samples and (2) immense computing power resulting from
clusters of graphics processing units (GPUs). Admittedly, these resources have only recently become available: the latter makes it possible to train larger neural networks, which reduces bias, while the former enables variance reduction. However, these two alone are not sufficient to explain the mystery of deep learning, due to some of its “dreadful” characteristics: (1) over-parametrization: the number of parameters in state-of-the-art deep learning models is often much larger than the sample size (see Table 1), which gives them the potential to overfit the training data; and (2) nonconvexity: even with the help of GPUs, training deep learning models is still NP-hard [8] in the worst case, due to the highly nonconvex loss function to minimize. In reality, these characteristics are far from nightmares. This sharp difference motivates us to take a closer look at the salient features of deep learning, a few of which we single out below.
1.1.1 Depth
Deep learning expresses complicated nonlinearity through composing many nonlinear functions; see (1).
The rationale for this multilayer structure is that, in many real-world datasets such as images, there are
different levels of features and lower-level features are building blocks of higher-level ones. See [134] for a
visualization of trained features of convolutional neural nets; here in Figure 1, we sample and visualize weights
from a pre-trained AlexNet model. This intuition is also supported by empirical results from physiology and
neuroscience [56, 2]. The use of function composition marks a sharp difference from traditional statistical
methods such as projection pursuit models [38] and multi-index models [73, 27]. It is often observed that
depth helps efficiently extract features that are representative of a dataset. In comparison, increasing width
(e.g., number of basis functions) in a shallow model leads to less improvement. This suggests that deep
learning models excel at representing a very different function space that is suitable for complex datasets.
1.1.2 Algorithmic regularization
The statistical performance of neural networks (e.g., test accuracy) depends heavily on the particular optimization algorithms used for training [131]. This is very different from many classical statistical problems, where the related optimization problems are less complicated.
(a) MNIST images (b) training and test accuracies
Figure 2: (a) shows the images in the public dataset MNIST; and (b) depicts the training and test accuracies
along the training dynamics. Note that the training accuracy is approaching 100% and the test accuracy is
still high (no overfitting).
For instance, when the associated optimization problem has a relatively simple structure (e.g., convex objective functions, linear constraints), the solution to the optimization problem can often be unambiguously computed and analyzed. However, in deep neural networks, due to over-parametrization, there are usually many local minima with different statistical performance [72]. Nevertheless, common practice runs stochastic gradient descent with random initialization and finds model parameters with very good prediction accuracy.
1.1.3 Implicit prior learning
It is well observed that deep neural networks trained with only the raw inputs (e.g., pixels of images) can
provide a useful representation of the data. This means that after training, the units of deep neural networks
can represent features such as edges, corners, wheels, eyes, etc.; see [134]. Importantly, the training process
is automatic in the sense that no human knowledge is involved (other than hyper-parameter tuning). This
is very different from traditional methods, where algorithms are designed after structural assumptions are
posited. It is likely that training an over-parametrized model efficiently learns and incorporates the prior
distribution p(x) of the input, even though deep learning models are themselves discriminative models. With
automatic representation of the prior distribution, deep learning typically performs well on similar datasets
(but not very different ones) via transfer learning.
1.2 Towards theory of deep learning
Despite the empirical success, theoretical support for deep learning is still in its infancy. Setting the stage, for any classifier f, denote by E(f) the expected risk on a fresh sample (a.k.a. test error, prediction error or generalization error), and by E_n(f) the empirical risk / training error averaged over a training dataset. Arguably, the key theoretical question in deep learning is

    why is E(f̂_n) small, where f̂_n is the classifier returned by the training algorithm?

We follow the conventional approximation-estimation decomposition (sometimes also called the bias-variance tradeoff) to decompose the term E(f̂_n) into two parts. Let F be the function space expressible by a family of neural nets. Define f* = argmin_f E(f) to be the best possible classifier and f*_F = argmin_{f∈F} E(f) to be the best classifier in F. Then we can decompose the excess error E ≜ E(f̂_n) − E(f*) into two parts:

    E = [E(f*_F) − E(f*)] + [E(f̂_n) − E(f*_F)],   (2)
        approximation error    estimation error
Both errors can be small for deep learning (cf. Figure 2), which we explain below.
• The approximation error is determined by the function class F. Intuitively, the larger the class, the smaller the approximation error. Deep learning models use many layers of nonlinear functions (Figure 3) that can drive this error small. Indeed, in Section 5, we review recent theoretical progress on their representation power. For example, deep models allow efficient representation of interactions among variables, while shallow models cannot.
• The estimation error reflects the generalization power, which is influenced by both the complexity of the function class F and the properties of the training algorithms. Interestingly, for over-parametrized deep neural nets, stochastic gradient descent typically results in a near-zero training error (i.e., E_n(f̂_n) ≈ 0; see, e.g., Figure 2(b)). Moreover, its generalization error E(f̂_n) remains small or moderate. This “counterintuitive” behavior suggests that for over-parametrized models, gradient-based algorithms enjoy benign statistical properties; we shall see in Section 7 that gradient descent enjoys implicit regularization in the over-parametrized regime even without explicit regularization (e.g., ℓ2 regularization).
The above two points lead to the following heuristic explanation of the success of deep learning models. The large depth of deep neural nets and heavy over-parametrization lead to small or zero training errors, even when running simple algorithms with a moderate number of iterations. In addition, these simple algorithms with a moderate number of steps do not explore the entire function space and thus have limited complexity, which results in a small generalization error when the sample size is large. Combining the two aspects heuristically explains why the test error is also small.
1.3 Roadmap of the paper
We first introduce basic deep learning models in Sections 2–4, and then examine their representation power
via the lens of approximation theory in Section 5. Section 6 is devoted to training algorithms and their ability to drive the training error small. Then we sample recent theoretical progress towards demystifying
the generalization power of deep learning in Section 7. Along the way, we provide our own perspectives, and
at the end we identify a few interesting questions for future research in Section 8. The goal of this paper
is to present suggestive methods and results, rather than giving conclusive arguments (which is currently
unlikely) or a comprehensive survey. We hope that our discussion serves as a stimulus for new statistics
research.
2 Feed-forward neural networks
Before introducing the vanilla feed-forward neural nets, let us set up the necessary notation for the rest of this section. We focus primarily on classification problems, as regression problems can be addressed similarly. Given the training dataset {(y_i, x_i)}_{1≤i≤n}, where y_i ∈ [K] ≜ {1, 2, . . . , K} and x_i ∈ R^d are independent across i ∈ [n], supervised learning aims at finding a (possibly random) function f̂(x) that predicts the outcome y for a new input x, assuming (y, x) follows the same distribution as (y_i, x_i). In the terminology of machine learning, the input x_i is often called the feature, the output y_i the label, and the pair (y_i, x_i) an example. The function f̂ is called the classifier, and estimation of f̂ is training or learning. The performance of f̂ is evaluated through the prediction error P(y ≠ f̂(x)), which can often be estimated from a separate test dataset.
As with classical statistical estimation, for each k ∈ [K], a classifier approximates the conditional probability P(y = k | x) using a function f_k(x; θ_k) parametrized by θ_k. Then the category with the highest probability is predicted. Thus, learning is essentially estimating the parameters θ_k. In statistics, one of the most popular methods is (multinomial) logistic regression, which stipulates a specific form for the functions f_k(x; θ_k): let z_k = x^⊤ β_k + α_k and f_k(x; θ_k) = Z^{−1} exp(z_k), where Z = Σ_{k=1}^K exp(z_k) is a normalization factor that makes {f_k(x; θ_k)}_{1≤k≤K} a valid probability distribution. It is clear that logistic regression induces linear decision boundaries in R^d, and hence it is restrictive in modeling the nonlinear dependency between y and x. The deep neural networks we introduce below provide a flexible framework for modeling nonlinearity in a fairly general way.
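As a minimal illustration of the multinomial logistic model above, the following NumPy sketch computes the class probabilities f_k(x; θ_k) for a single input; the weights β_k and intercepts α_k are random placeholders:

```python
import numpy as np

def softmax_probs(x, beta, alpha):
    """Multinomial logistic regression: f_k(x) = exp(z_k) / sum_j exp(z_j), z_k = beta_k^T x + alpha_k."""
    z = beta @ x + alpha                       # shape (K,)
    z = z - z.max()                            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
K, d = 3, 5                                    # 3 classes, 5 features
beta, alpha = rng.normal(size=(K, d)), rng.normal(size=K)
x = rng.normal(size=d)
probs = softmax_probs(x, beta, alpha)
print(probs, probs.argmax())                   # predicted class = highest probability
```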
Figure 3: A feed-forward neural network with an input layer, two hidden layers and an output layer. The input layer represents the raw features {x_i}_{1≤i≤n}. Both hidden layers compute an affine transformation of the input and then apply an element-wise activation function σ(·). Finally, the output layer returns a linear transform followed by the softmax activation (resp. simply a linear transform) of the hidden layers for the classification (resp. regression) problem.
2.1 Model setup
At a high level, deep neural networks (DNNs) use the composition of a series of simple nonlinear functions to model nonlinearity:

    h^(L) = g^(L) ∘ g^(L−1) ∘ · · · ∘ g^(1)(x),

where ∘ denotes the composition of two functions and L is the number of hidden layers, usually called the depth of a NN model. Letting h^(0) ≜ x, one can recursively define h^(ℓ) = g^(ℓ)(h^(ℓ−1)) for all ℓ = 1, 2, . . . , L. Feed-forward neural networks, also called multilayer perceptrons (MLPs), are neural nets with a specific choice of g^(ℓ): for ℓ = 1, . . . , L, define

    h^(ℓ) = g^(ℓ)(h^(ℓ−1)) ≜ σ(W^(ℓ) h^(ℓ−1) + b^(ℓ)),   (3)

where W^(ℓ) and b^(ℓ) are the weight matrix and the bias / intercept, respectively, associated with the ℓ-th layer, and σ(·) is usually a simple given (known) nonlinear function called the activation function. In words, in each layer ℓ, the input vector h^(ℓ−1) goes through an affine transformation first and then passes through a fixed nonlinear function σ(·). See Figure 3 for an illustration of a simple MLP with two hidden layers. The activation function σ(·) is usually applied element-wise, and a popular choice is the ReLU (Rectified Linear Unit) function:

    [σ(z)]_j = max{z_j, 0}.   (4)

Other choices of activation functions include the leaky ReLU, the tanh function [79] and the classical sigmoid function (1 + e^{−z})^{−1}, which is less used now.
Given an output h^(L) from the final hidden layer and a label y, we can define a loss function to minimize. A common loss function for classification problems is the multinomial logistic loss. Using the terminology of deep learning, we say that h^(L) goes through an affine transformation and then the softmax function:

    f_k(x; θ) ≜ exp(z_k) / Σ_k exp(z_k),  ∀ k ∈ [K],  where z = W^(L+1) h^(L) + b^(L+1) ∈ R^K.

Then the loss is defined to be the cross-entropy between the label y (in the form of an indicator vector) and the score vector (f_1(x; θ), . . . , f_K(x; θ))^⊤, which is exactly the negative log-likelihood of the multinomial logistic regression model:

    L(f(x; θ), y) = − Σ_{k=1}^K 1{y = k} log f_k(x; θ),   (5)

where θ ≜ {W^(ℓ), b^(ℓ) : 1 ≤ ℓ ≤ L + 1}. As a final remark, the number of parameters scales with both the depth L and the width (i.e., the dimensionality of W^(ℓ)), and hence it can be quite large for deep neural nets.
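A minimal NumPy sketch of the forward pass (3) together with the softmax and cross-entropy loss (5); the layer widths and the random parameter initialization are illustrative assumptions:

```python
import numpy as np

def mlp_forward(x, params):
    """Forward pass of an MLP: h^(l) = ReLU(W^(l) h^(l-1) + b^(l)), followed by softmax scores."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)          # hidden layers, eq. (3) with ReLU
    W_out, b_out = params[-1]
    z = W_out @ h + b_out                        # final affine transformation
    z = z - z.max()                              # numerical stability
    return np.exp(z) / np.exp(z).sum()           # softmax probabilities f_k(x; theta)

def cross_entropy(p, y):
    """Multinomial logistic loss (5): negative log-probability of the true class y."""
    return -np.log(p[y])

rng = np.random.default_rng(0)
dims = [4, 16, 16, 3]                            # input dim 4, two hidden layers, K = 3 classes
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m)) for n, m in zip(dims[:-1], dims[1:])]
x, y = rng.normal(size=4), 2
print(cross_entropy(mlp_forward(x, params), y))
```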
2.2 Back-propagation in computational graphs
Training neural networks follows the empirical risk minimization paradigm that minimizes the loss (e.g.,
(5)) over all the training data. This minimization is usually done via stochastic gradient descent (SGD). In a way similar to gradient descent, SGD starts from a certain initial value θ^0 and then iteratively updates the parameters θ^t by moving them in the direction of the negative gradient. The difference is that, in each update, a small subsample B ⊂ [n] called a mini-batch—which is typically of size 32–512—is randomly drawn and the gradient is calculated only on B instead of the full batch [n]. This considerably reduces the computational cost of calculating the gradient. By the law of large numbers, this stochastic gradient should be close to the full-sample one, albeit with some random fluctuations. A pass over the whole training set is called an epoch.
Usually, after several or tens of epochs, the error on a validation set levels off and training is complete. See
Section 6 for more details and variants on training algorithms.
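A generic sketch of this mini-batch SGD loop in NumPy; the toy least-squares problem, batch size, and learning rate are placeholders chosen only for illustration:

```python
import numpy as np

def sgd(theta, grad_minibatch, data, lr=0.1, batch_size=64, epochs=10, seed=0):
    """Mini-batch SGD: repeatedly update theta <- theta - lr * gradient on a random mini-batch."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):                     # one epoch = one pass over the training set
        for idx in np.array_split(rng.permutation(n), max(n // batch_size, 1)):
            theta = theta - lr * grad_minibatch(theta, [data[i] for i in idx])
    return theta

# Toy usage: least squares on synthetic data, where the mini-batch gradient is known in closed form.
rng = np.random.default_rng(1)
X, w_true = rng.normal(size=(500, 3)), np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)
data = list(zip(X, y))
grad = lambda w, batch: np.mean([2 * (xi @ w - yi) * xi for xi, yi in batch], axis=0)
print(sgd(np.zeros(3), grad, data))             # should be close to w_true
```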
The key to the above training procedure, namely SGD, is the calculation of the gradient ∇ℓ_B(θ), where

    ℓ_B(θ) ≜ |B|^{−1} Σ_{i∈B} L(f(x_i; θ), y_i).   (6)
Gradient computation, however, is in general nontrivial for complex models, and it is susceptible to numerical
instability for a model with large depth. Here, we introduce an efficient approach, namely back-propagation,
for computing gradients in neural networks.
Back-propagation [106] is a direct application of the chain rule in networks. As the name suggests, the calculation is performed in a backward fashion: one first computes ∂ℓ_B/∂h^(L), then ∂ℓ_B/∂h^(L−1), . . ., and finally ∂ℓ_B/∂h^(1). For example, in the case of the ReLU activation function³, we have the following recursive / backward relation:

    ∂ℓ_B/∂h^(ℓ−1) = (∂h^(ℓ)/∂h^(ℓ−1))^⊤ · ∂ℓ_B/∂h^(ℓ) = (W^(ℓ))^⊤ diag(1{W^(ℓ) h^(ℓ−1) + b^(ℓ) ≥ 0}) ∂ℓ_B/∂h^(ℓ),   (7)
where diag(·) denotes a diagonal matrix with elements given by the argument. Note that the calculation of ∂ℓ_B/∂h^(ℓ−1) depends on ∂ℓ_B/∂h^(ℓ), which is the partial derivative from the next layer. In this way, the derivatives are “back-propagated” from the last layer to the first layer. These derivatives {∂ℓ_B/∂h^(ℓ)} are then used to update the parameters. For instance, the gradient update for W^(ℓ) is given by

    W^(ℓ) ← W^(ℓ) − η ∂ℓ_B/∂W^(ℓ),  where  ∂ℓ_B/∂W^(ℓ)_{jm} = (∂ℓ_B/∂h^(ℓ)_j) · σ′ · h^(ℓ−1)_m,   (8)

where σ′ = 1 if the j-th element of W^(ℓ) h^(ℓ−1) + b^(ℓ) is nonnegative, and σ′ = 0 otherwise. The step size η > 0, also called the learning rate, controls how much the parameters are changed in a single update.
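To make (7)–(8) concrete, here is a small NumPy sketch of one back-propagation step for a one-hidden-layer ReLU network with the softmax cross-entropy loss; the network sizes and data are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K, eta = 4, 8, 3, 0.1                      # input dim, hidden width, classes, learning rate
W1, b1 = rng.normal(scale=0.1, size=(m, d)), np.zeros(m)
W2, b2 = rng.normal(scale=0.1, size=(K, m)), np.zeros(K)
x, y = rng.normal(size=d), 1                     # a single training example

# Forward pass.
u1 = W1 @ x + b1
h1 = np.maximum(u1, 0.0)                         # ReLU hidden layer
z = W2 @ h1 + b2
p = np.exp(z - z.max()); p /= p.sum()            # softmax probabilities

# Backward pass (chain rule, as in (7)).
dz = p - np.eye(K)[y]                            # d loss / d z for softmax + cross-entropy
dh1 = W2.T @ dz                                  # propagate to the hidden layer
du1 = dh1 * (u1 >= 0)                            # multiply by diag(1{W1 x + b1 >= 0})

# Gradient updates, as in (8): the (j, m) entry is (d loss / d h_j) * sigma' * h_m.
W2 -= eta * np.outer(dz, h1); b2 -= eta * dz
W1 -= eta * np.outer(du1, x); b1 -= eta * du1
```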
A more general way to think about neural network models and training is to consider computational
graphs. Computational graphs are directed acyclic graphs that represent functional relations between vari-
ables. They are very convenient and flexible to represent function composition, and moreover, they also
allow an efficient way of computing gradients. Consider an MLP with a single hidden layer and an ℓ2 regularization:

    ℓ^λ_B(θ) = ℓ_B(θ) + r_λ(θ) = ℓ_B(θ) + λ ( Σ_{j,j′} (W^(1)_{j,j′})^2 + Σ_{j,j′} (W^(2)_{j,j′})^2 ),   (9)

where ℓ_B(θ) is the same as (6), and λ ≥ 0 is a tuning parameter. A similar example is considered in [45]. The corresponding computational graph is shown in Figure 4.
³The issue of non-differentiability at the origin is often ignored in implementation.
Figure 4: The computational graph illustrates the loss (9). For simplicity, we omit the bias terms. Symbols
inside nodes represent functions, and symbols outside nodes represent function outputs (vectors/scalars).
matmul is matrix multiplication, relu is the ReLU activation, cross entropy is the cross entropy loss, and
SoS is the sum of squares.
Each node of the graph represents a function (inside a circle), which is associated with an output of that function (outside a circle). For example, we view the term ℓ_B(θ) as the result of 4 compositions: first the input data x multiplies the weight matrix W^(1), resulting in u^(1); then u^(1) goes through the ReLU activation function relu, resulting in h^(1); then h^(1) multiplies another weight matrix W^(2), leading to p; and finally p produces the cross-entropy with the label y as in (5). The regularization term is incorporated in the graph similarly.
A forward pass is complete when all nodes are evaluated starting from the input x. A backward pass then calculates the gradients of ℓ^λ_B with respect to all other nodes in the reverse direction. Due to the chain rule, the gradient calculation for a variable (say, ∂ℓ_B/∂u^(1)) is simple: it only depends on the gradient values of the variables the current node points to (here, ∂ℓ_B/∂h^(1)), and the derivative of the function evaluated at the current variable value (σ′(u^(1))). Thus, in each iteration, a computational graph only needs to (1) calculate and store the function evaluations at each node in the forward pass, and then (2) calculate all derivatives in the backward pass.
Back-propagation in computational graphs forms the foundation of popular deep learning software libraries, including TensorFlow [1] and PyTorch [92], which allow efficient building and training of complex neural net models.
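As an illustration of how such software handles this automatically, the following PyTorch sketch builds a version of the regularized one-hidden-layer loss (9) and obtains all gradients with a single backward pass; the sizes and data are placeholders:

```python
import torch

torch.manual_seed(0)
d, m, K, lam = 4, 8, 3, 1e-3
W1 = torch.randn(m, d, requires_grad=True)
W2 = torch.randn(K, m, requires_grad=True)
x = torch.randn(d)
y = torch.tensor(1)                              # true class label

# Forward pass: build the computational graph of the loss (9), without bias terms.
u1 = W1 @ x                                      # matmul node
h1 = torch.relu(u1)                              # relu node
p = W2 @ h1                                      # matmul node (logits)
loss = torch.nn.functional.cross_entropy(p.unsqueeze(0), y.unsqueeze(0))
loss = loss + lam * ((W1 ** 2).sum() + (W2 ** 2).sum())   # l2 regularization term

# Backward pass: gradients with respect to W1 and W2 are filled in automatically.
loss.backward()
print(W1.grad.shape, W2.grad.shape)
```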
3 Popular models
Moving beyond vanilla feed-forward neural networks, we introduce two other popular deep learning models,
namely, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). One important characteristic shared by the two models is weight sharing, that is, some model parameters are identical across locations in CNNs or across time in RNNs. This is related to the notion of translational invariance in CNNs and stationarity in RNNs. At the end of this section, we introduce a modular way of thinking for constructing more flexible neural nets.
3.1 Convolutional neural networks
The convolutional neural network (CNN) [71, 40] is a special type of feed-forward neural network tailored for image processing. More generally, it is suitable for analyzing data with salient spatial structures.
In this subsection, we focus on image classification using CNNs, where the raw input (image pixels) and
features of each hidden layer are represented by a 3D tensor X ∈ Rd1×d2×d3
. Here, the first two dimensions
d1, d2 of X indicate spatial coordinates of an image while the third d3 indicates the number of channels. For
instance, d3 is 3 for the raw inputs due to the red, green and blue channels, and d3 can be much larger (say,
256) for hidden layers. Each channel is also called a feature map, because each feature map is specialized to
detect the same feature at different locations of the input, which we will soon explain. We now introduce
two building blocks of CNNs, namely the convolutional layer and the pooling layer.
1. Convolutional layer (CONV). A convolutional layer has the same functionality as described in (3), where the input feature X ∈ R^{d1×d2×d3} goes through an affine transformation first and then an element-wise nonlinear activation.
Figure 5: X ∈ R^{28×28×3} represents the input feature consisting of 28 × 28 spatial coordinates in a total of 3 channels / feature maps. F_k ∈ R^{5×5×3} denotes the k-th filter with size 5 × 5. The third dimension 3 of the filter automatically matches the number 3 of channels in the previous input. Every 3D patch of X gets convolved with the filter F_k, and this as a whole results in a single output feature map X̃_{:,:,k} of size 24 × 24 × 1. Stacking the outputs of all the filters {F_k}_{1≤k≤K} leads to the output feature of size 24 × 24 × K.
The difference lies in the specific form of the affine transformation. A convolutional layer uses a number of filters to extract local features from the previous input. More precisely, each filter is represented by a 3D tensor F_k ∈ R^{w×w×d3} (1 ≤ k ≤ d̃3), where w is the size of the filter (typically 3 or 5) and d̃3 denotes the total number of filters. Note that the third dimension d3 of F_k is equal to that of the input feature X. For this reason, one usually says that the filter has size w × w, while suppressing the third dimension d3. Each filter F_k then convolves with the input feature X to obtain one single feature map O^k ∈ R^{(d1−w+1)×(d2−w+1)}, where⁴

    O^k_{ij} = ⟨[X]_{ij}, F_k⟩ = Σ_{i′=1}^w Σ_{j′=1}^w Σ_{l=1}^{d3} [X]_{i+i′−1, j+j′−1, l} [F_k]_{i′, j′, l}.   (10)

Here [X]_{ij} ∈ R^{w×w×d3} is a small “patch” of X starting at location (i, j). See Figure 5 for an illustration of the convolution operation, and the code sketch after this list for a naive implementation. If we view the 3D tensors [X]_{ij} and F_k as vectors, then each filter essentially computes their inner product with a part of X indexed by i, j (which can also be viewed as a convolution, as its name suggests). One then packs the resulting feature maps {O^k} into a 3D tensor O of size (d1 − w + 1) × (d2 − w + 1) × d̃3, where

    [O]_{ijk} = [O^k]_{ij}.   (11)

The outputs of convolutional layers are then followed by nonlinear activation functions. In the ReLU case, we have

    X̃_{ijk} = σ(O_{ijk}),  ∀ i ∈ [d1 − w + 1], j ∈ [d2 − w + 1], k ∈ [d̃3].   (12)

The convolution operation (10) and the ReLU activation (12) work together to extract features X̃ from the input X. Different from feed-forward neural nets, the filters F_k are shared across all locations (i, j). A patch [X]_{ij} of an input responds strongly (that is, produces a large value) to a filter F_k if they are positively correlated. Therefore, intuitively, each filter F_k serves to extract features similar to F_k.
As a side note, after the convolution (10), the spatial size d1 × d2 of the input X shrinks to (d1 − w + 1) × (d2 − w + 1) for X̃. However, one may want the spatial size to remain unchanged.
⁴To simplify notation, we omit the bias/intercept term associated with each filter.
Figure 6: A 2 × 2 max pooling layer extracts the maximum of 2 by 2 neighboring pixels / features across the
spatial dimension.
Figure 7: LeNet is composed of an input layer, two convolutional layers, two pooling layers and three fully-
connected layers. Both convolutions are valid and use filters with size 5 × 5. In addition, the two pooling
layers use 2 × 2 average pooling.
This can be achieved via padding, where one appends zeros to the margins of the input X to enlarge the spatial size to (d1 + w − 1) × (d2 + w − 1). In addition, a stride in the convolutional layer determines the gap i′ − i and j′ − j between two patches X_{ij} and X_{i′j′}: in (10) the stride is 1, and a larger stride would lead to feature maps with smaller sizes.
2. Pooling layer (POOL). A pooling layer aggregates the information of nearby features into a single one. This downsampling operation reduces the size of the features for subsequent layers and saves computation. One common form of pooling layer is the 2 × 2 max-pooling filter. It computes max{X_{i,j,k}, X_{i+1,j,k}, X_{i,j+1,k}, X_{i+1,j+1,k}}, that is, the maximum over the 2 × 2 neighborhood in the spatial coordinates; see Figure 6 for an illustration and the sketch after this list for a naive implementation. Note that the pooling operation is done separately for each feature map k. As a consequence, a 2 × 2 max-pooling filter acting on X ∈ R^{d1×d2×d3} will result in an output of size d1/2 × d2/2 × d3. In addition, the pooling layer does not involve any parameters to optimize. Pooling layers serve to reduce redundancy, since a small neighborhood around a location (i, j) in a feature map is likely to contain the same information.
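The following NumPy sketch gives a naive, loop-based implementation of the convolution (10)–(12) and of 2 × 2 max pooling, purely for illustration; real libraries use much faster implementations:

```python
import numpy as np

def conv_layer(X, filters):
    """Valid convolution (10)-(11) followed by ReLU (12). X: (d1, d2, d3); filters: (K, w, w, d3)."""
    d1, d2, _ = X.shape
    K, w = filters.shape[0], filters.shape[1]
    O = np.zeros((d1 - w + 1, d2 - w + 1, K))
    for k in range(K):
        for i in range(d1 - w + 1):
            for j in range(d2 - w + 1):
                O[i, j, k] = np.sum(X[i:i + w, j:j + w, :] * filters[k])   # inner product with the patch
    return np.maximum(O, 0.0)                                              # ReLU activation

def max_pool_2x2(X):
    """2 x 2 max pooling with stride 2, applied separately to each feature map."""
    d1, d2, d3 = X.shape
    return X[:d1 - d1 % 2, :d2 - d2 % 2, :].reshape(d1 // 2, 2, d2 // 2, 2, d3).max(axis=(1, 3))

rng = np.random.default_rng(0)
X = rng.normal(size=(28, 28, 3))                   # a 28 x 28 image with 3 channels
F = rng.normal(size=(6, 5, 5, 3))                  # six 5 x 5 filters
out = max_pool_2x2(conv_layer(X, F))
print(out.shape)                                   # (12, 12, 6)
```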
In addition, we also use fully-connected layers as building blocks, which we have already seen in Section 2.
Each fully-connected layer treats input tensor X as a vector Vec(X), and computes X̃ = σ(WVec(X)).
A fully-connected layer does not use weight sharing and is often used in the last few layers of a CNN. As
an example, Figure 7 depicts the well-known LeNet 5 [71], which is composed of two sets of CONV-POOL
layers and three fully-connected layers.
3.2 Recurrent neural networks
Recurrent neural nets (RNNs) are another family of powerful models, which are designed to process time
series data and other sequence data. RNNs have successful applications in speech recognition [108], machine
translation [132], genome sequencing [21], etc. The structure of an RNN naturally forms a computational graph, and can be easily combined with other structures such as CNNs to build large computational graph models for complex tasks.
(a) One-to-many (b) Many-to-one (c) Many-to-many
Figure 8: Vanilla RNNs with different input/output settings. (a) has one input but multiple outputs; (b) has multiple inputs but one output; (c) has multiple inputs and outputs. Note that the parameters are shared across time steps.
Here we introduce vanilla RNNs and improved variants such as long short-term memory (LSTM).
3.2.1 Vanilla RNNs
Suppose we have general time series inputs x1, x2, . . . , xT . A vanilla RNN models the “hidden state” at time
t by a vector ht, which is subject to the recursive formula
ht = fθ(ht−1, xt). (13)
Here, f_θ is generally a nonlinear function parametrized by θ. Concretely, a vanilla RNN with one hidden layer has the following form⁵:

    h_t = tanh(W_hh h_{t−1} + W_xh x_t + b_h),  where tanh(a) = (e^{2a} − 1) / (e^{2a} + 1),
    z_t = σ(W_hy h_t + b_z),

where W_hh, W_xh, W_hy are trainable weight matrices, b_h, b_z are trainable bias vectors, and z_t is the output at time t. Like many classical time series models, those parameters are shared across time. Note that in different applications, we may have different input/output settings (cf. Figure 8). Examples include
• One-to-many: a single input with multiple outputs; see Figure 8(a). A typical application is image
captioning, where the input is an image and outputs are a series of words.
• Many-to-one: multiple inputs with a single output; see Figure 8(b). One application is text sentiment
classification, where the input is a series of words in a sentence and the output is a label (e.g., positive
vs. negative).
• Many-to-many: multiple inputs and outputs; see Figure 8(c). This is adopted in machine translation,
where inputs are words of a source language (say Chinese) and outputs are words of a target language
(say English).
As in the case of feed-forward neural nets, we minimize a loss function using back-propagation, where the loss is typically

    ℓ_T(θ) = Σ_{t∈T} L(y_t, z_t) = − Σ_{t∈T} Σ_{k=1}^K 1{y_t = k} log( exp([z_t]_k) / Σ_{k′} exp([z_t]_{k′}) ),

where K is the number of categories for classification (e.g., the size of the vocabulary in machine translation), and T ⊂ [T] indexes the time steps at which outputs are produced. During training, the gradients ∂ℓ_T/∂h_t are computed in the reverse time order (from T to t). For this reason, the training process is often called back-propagation through time.
⁵Similar to the activation function σ(·), the function tanh(·) is applied element-wise.
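A minimal NumPy sketch of the vanilla recursion above, in a many-to-one setting that returns only the final output z_T; dimensions and random parameters are placeholders:

```python
import numpy as np

def vanilla_rnn(xs, Whh, Wxh, Why, bh, bz):
    """Run h_t = tanh(Whh h_{t-1} + Wxh x_t + bh) over the sequence, return the final output z_T."""
    h = np.zeros(Whh.shape[0])
    for x_t in xs:                         # parameters are shared across all time steps
        h = np.tanh(Whh @ h + Wxh @ x_t + bh)
    return Why @ h + bz                    # output layer (here the identity plays the role of sigma)

rng = np.random.default_rng(0)
d, m, K, T = 3, 8, 2, 10                   # input dim, hidden dim, output dim, sequence length
xs = rng.normal(size=(T, d))
z_T = vanilla_rnn(xs, rng.normal(size=(m, m)) * 0.1, rng.normal(size=(m, d)) * 0.1,
                  rng.normal(size=(K, m)) * 0.1, np.zeros(m), np.zeros(K))
print(z_T.shape)                           # (2,)
```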
Figure 9: A vanilla RNN with two hidden layers. Higher-level hidden states h^ℓ_t are determined by the old states h^ℓ_{t−1} and the lower-level hidden states h^{ℓ−1}_t. Multilayer RNNs generalize both feed-forward neural nets and one-hidden-layer RNNs.
One notable drawback of vanilla RNNs is that they have difficulty capturing long-range dependencies in sequence data when the length of the sequence is large. This is sometimes due to the phenomenon of exploding / vanishing gradients. Take Figure 8(c) as an example. Computing ∂ℓ_T/∂h_1 involves the product ∏_{t=1}^{3} (∂h_{t+1}/∂h_t) by the chain rule. However, if the sequence is long, the product will be the multiplication of many Jacobian matrices, which usually results in exponentially large or small singular values. To alleviate this issue, in practice, the forward pass and backward pass are implemented over a shorter sliding window {t_1, t_1 + 1, . . . , t_2} instead of the full sequence {1, 2, . . . , T}. Though effective in some cases, this technique alone does not fully address the issue of long-term dependency.
3.2.2 GRUs and LSTM
There are two improved variants that alleviate the above issue: gated recurrent units (GRUs) [26] and long
short-term memory (LSTM) [54].
• A GRU refines the recursive formula (13) by introducing gates, which are vectors of the same length as
ht. The gates, which take values in [0, 1] elementwise, multiply with ht−1 elementwise and determine how
much they keep the old hidden states.
• An LSTM similarly uses gates in its recursive formula. In addition to h_t, an LSTM maintains a cell state, whose entries take values in R and are analogous to counters.
Here we only discuss the LSTM in detail. Denote by ⊙ the element-wise multiplication. We have the following recursive formula in place of (13):

    (i_t, f_t, o_t, g_t) = (σ, σ, σ, tanh) applied block-wise to W [h_{t−1}; x_t; 1],
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t,
    h_t = o_t ⊙ tanh(c_t),
where W is a big weight matrix with appropriate dimensions. The cell state vector ct carries information of
the sequence (e.g., singular/plural form in a sentence). The forget gate ft determines how much the values
of ct−1 are kept for time t, the input gate it controls the amount of update to the cell state, and the output
gate ot gives how much ct reveals to ht. Ideally, the elements of these gates have nearly binary values.
For example, an element of ft being close to 1 may suggest the presence of a feature in the sequence data.
Similar to the skip connections in residual nets, the cell state ct has an additive recursive formula, which
helps back-propagation and thus captures long-range dependencies.
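A single LSTM step, written directly from the three equations above in NumPy; the weight matrix W is a random placeholder:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x_t, W):
    """One LSTM update: gates from W [h_{t-1}; x_t; 1], then the cell-state and hidden-state updates."""
    m = h_prev.size
    pre = W @ np.concatenate([h_prev, x_t, [1.0]])     # shape (4m,): pre-activations of i, f, o, g
    i, f, o = sigmoid(pre[:m]), sigmoid(pre[m:2*m]), sigmoid(pre[2*m:3*m])
    g = np.tanh(pre[3*m:])
    c_t = f * c_prev + i * g                           # additive cell-state recursion
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
m, d = 8, 3
W = rng.normal(scale=0.1, size=(4 * m, m + d + 1))
h, c = np.zeros(m), np.zeros(m)
for x_t in rng.normal(size=(5, d)):                    # run five time steps
    h, c = lstm_step(h, c, x_t, W)
print(h.shape, c.shape)
```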
3.2.3 Multilayer RNNs
Multilayer RNNs are a generalization of the one-hidden-layer RNN discussed above. Figure 9 shows a vanilla RNN with two hidden layers. In place of (13), the recursive formula for an RNN with L hidden layers now reads

    h^ℓ_t = tanh( W^ℓ [h^{ℓ−1}_t ; h^ℓ_{t−1} ; 1] ),  for all ℓ ∈ [L],  with h^0_t ≜ x_t.
Note that a multilayer RNN has two dimensions: the sequence length T and depth L. Two special cases are
the feed-forward neural nets (where T = 1) introduced in Section 2, and RNNs with one hidden layer (where
L = 1). Multilayer RNNs usually do not have very large depth (e.g., 2–5), since T is already very large.
Finally, we remark that CNNs, RNNs, and other neural nets can be easily combined to tackle tasks that involve different sources of input data. For example, in image captioning, the images are first processed through a CNN, and then the high-level features are fed into an RNN as inputs. These neural nets combined together form a large computational graph, so they can be trained using back-propagation. This generic training method provides much flexibility in various applications.
3.3 Modules
Deep neural nets are essentially compositions of many nonlinear functions. A component function may be designed to have specific properties in a given task, and it may itself result from composing a few simpler functions. In the LSTM, we have seen that the building block consists of several intermediate variables, including cell states and forget gates that can capture long-term dependencies and alleviate numerical issues.
This leads to the idea of designing modules for building more complex neural net models. Desirable
modules usually have low computational costs, alleviate numerical issues in training, and lead to good
statistical accuracy. Since modules and the resulting neural net models form computational graphs, training
follows the same principle briefly described in Section 2.
Here, we use the examples of Inception modules and skip connections to illustrate the ideas behind modules. Figure 10(a) is an example of the “Inception” module used in GoogleNet [123]. As before, all the convolutional layers are followed by the ReLU activation function. The concatenation of information from filters with different sizes gives the model great flexibility to capture spatial information. Note that a 1 × 1 filter is a 1 × 1 × d3 tensor (where d3 is the number of feature maps), so its convolution does not interact with other spatial coordinates; it only serves to aggregate information from different feature maps at the same coordinate. This reduces the number of parameters and speeds up the computation. Similar ideas appear in other work [78, 57].
(a) “Inception” module (b) Skip connections
Figure 10: (a) The “Inception” module from GoogleNet. Concat means combining all feature maps into a tensor. (b) Skip connections are added every two layers in ResNets.
Another module, usually called skip connections, is widely used to alleviate numerical issues in very deep neural nets, with additional benefits in optimization efficiency and statistical accuracy. Training very deep neural nets is generally more difficult, but the introduction of skip connections in residual networks [50, 51] has greatly eased the task.
The high level idea of skip connections is to add an identity map to an existing nonlinear function. Let
F(x) be an arbitrary nonlinear function represented by a (fragment of) neural net, then the idea of skip
connections is simply replacing F(x) with x+F(x). Figure 10(b) shows a well-known structure from residual
networks [50]—for every two layers, an identity map is added:
    x ↦ σ(x + F(x)) = σ(x + W′ σ(W x + b) + b′),   (14)

where x can be hidden nodes from any layer and W, W′, b, b′ are the corresponding parameters. By repeating (namely, composing) this structure throughout all layers, [50, 51] are able to train neural nets with hundreds of layers easily, which overcomes well-observed training difficulties in deep neural nets. Moreover, deep residual networks also improve statistical accuracy, as the classification error on the ImageNet challenge was reduced by 46% from 2014 to 2015. As a side note, skip connections can be used flexibly. They are not restricted to the form in (14), and can be used between any pair of layers ℓ, ℓ′ [55].
4 Deep unsupervised learning
In supervised learning, given the labelled training set {(y_i, x_i)}, we focus on discriminative models, which essentially represent P(y | x) by a deep neural net f(x; θ) with parameters θ. Unsupervised learning, in contrast, aims at extracting information from unlabeled data {x_i}, where the labels {y_i} are absent. This information can be a low-dimensional embedding of the data {x_i} or a generative model with latent variables that approximates the distribution P_X(x). To achieve these goals, we introduce two popular unsupervised deep learning models, namely, autoencoders and generative adversarial networks (GANs). The first can be viewed as a dimension reduction technique, and the second as a density estimation method. DNNs are the key elements of both models.
4.1 Autoencoders
Recall that in dimension reduction, the goal is to reduce the dimensionality of the data and at the same time preserve its salient features. In particular, in principal component analysis (PCA), the goal is to embed the data {x_i}_{1≤i≤n} into a low-dimensional space via a linear function f such that maximum variance can be explained. Equivalently, we want to find linear functions f : R^d → R^k and g : R^k → R^d (k ≤ d) such that the difference between x_i and g(f(x_i)) is minimized. Formally, we let

    f(x) = W_f x ≜ h  and  g(h) = W_g h,  where W_f ∈ R^{k×d} and W_g ∈ R^{d×k}.

Here, for simplicity, we assume that the intercept/bias terms for f and g are zero. Then, PCA amounts to minimizing the quadratic loss function

    minimize_{W_f, W_g}  (1/n) Σ_{i=1}^n ‖x_i − W_g W_f x_i‖_2^2.   (15)
This is the same as minimizing ‖X − WX‖_F^2 subject to rank(W) ≤ k, where X ∈ R^{d×n} is the design matrix. The solution is given by the singular value decomposition of X [44, Thm. 2.4.8], which is exactly what PCA does. It turns out that PCA is a special case of autoencoders, often known as the undercomplete linear autoencoder.
More broadly, autoencoders are neural network models for (nonlinear) dimension reduction, which generalize PCA. An autoencoder has two key components, namely, the encoder function f(·), which maps the input x ∈ R^d to a hidden code/representation h ≜ f(x) ∈ R^k, and the decoder function g(·), which maps the hidden representation h to a point g(h) ∈ R^d. Both functions can be multilayer neural networks as in (3). See Figure 11 for an illustration of autoencoders. Let L(x_1, x_2) be a loss function that measures the difference between x_1 and x_2 in R^d.
Figure 11: First, an input x goes through the encoder f(·), and we obtain its hidden representation h = f(x). Then, we use the decoder g(·) to get g(h) as a reconstruction of x. Finally, the loss is determined by the difference between the original input x and its reconstruction g(f(x)).
Similar to PCA, an autoencoder is used to find an encoder f and a decoder g such that L(x, g(f(x))) is as small as possible. Mathematically, this amounts to solving the following minimization problem:

    minimize_{f,g}  (1/n) Σ_{i=1}^n L(x_i, g(h_i))  with  h_i = f(x_i),  for all i ∈ [n].   (16)
One needs to make structural assumptions on the functions f and g in order to find useful representations of the data, which leads to different types of autoencoders. Indeed, if no assumption is made, choosing f and g to be identity functions clearly minimizes the above objective. To avoid this trivial solution, one natural approach is to require that the encoder f map the data onto a space of smaller dimension, i.e., k < d. This is the undercomplete autoencoder, which includes PCA as a special case. There are other structured autoencoders which add desired properties, such as sparsity or robustness, to the model, mainly through regularization terms. Below we present two other common types of autoencoders.
• Sparse autoencoders. One may believe that the dimension k of the hidden code h_i is larger than the input dimension d, and that h_i admits a sparse representation. As with the LASSO [126] or SCAD [36], one may add a regularization term to the reconstruction loss L in (16) to encourage sparsity [98]. A sparse autoencoder solves

    min_{f,g}  (1/n) Σ_{i=1}^n [ L(x_i, g(h_i)) + λ ‖h_i‖_1 ]  with  h_i = f(x_i),  for all i ∈ [n],

where the first term is the reconstruction loss and the second term is the sparsity-inducing regularizer. This is similar to dictionary learning, where one aims at finding a sparse representation of the input data on an overcomplete basis. Due to the imposed sparsity, the model can potentially learn useful features of the data.
• Denoising autoencoders. One may hope that the model is robust to noise in the data: even if the input data x_i are corrupted by small noise ξ_i or miss some components (the noise level or the missing probability is typically small), an ideal autoencoder should faithfully recover the original data. A denoising autoencoder [128] achieves this robustness by explicitly feeding the noisy data x̃_i = x_i + ξ_i as the new input,
Figure 12: GANs consist of two components, a generator G which generates fake samples and a discriminator D which differentiates the true samples from the fake ones.
and then solves an optimization problem similar to (16) where L (xi, g (hi)) is replaced by L (xi, g (f(x̃i))).
A denoising autoencoder encourages the encoder/decoder to be stable in the neighborhood of an input,
which is generally a good statistical property. An alternative way could be constraining f and g in the
optimization problem, but that would be very difficult to optimize. Instead, sampling by adding small
perturbations in the input provides a simple implementation. We shall see similar ideas in Section 6.3.3.
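A minimal PyTorch sketch of an undercomplete autoencoder trained on the reconstruction objective (16); the architecture sizes, data, and optimizer settings are illustrative choices. A denoising variant would simply feed X plus small noise to the encoder while keeping X in the loss:

```python
import torch
from torch import nn

torch.manual_seed(0)
d, k = 20, 3                                     # input dimension and code dimension (k < d)
encoder = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, k))
decoder = nn.Sequential(nn.Linear(k, 32), nn.ReLU(), nn.Linear(32, d))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

X = torch.randn(256, d)                          # unlabeled data {x_i}
for _ in range(200):
    h = encoder(X)                               # hidden codes h_i = f(x_i)
    X_hat = decoder(h)                           # reconstructions g(h_i)
    loss = ((X - X_hat) ** 2).mean()             # quadratic reconstruction loss, as in (16)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```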
4.2 Generative adversarial networks
Given unlabeled data {xi}1≤i≤n, density estimation aims to estimate the underlying probability density
function PX from which the data is generated. Both parametric and nonparametric estimators [115] have
been proposed and studied under various assumptions on the underlying distribution. Different from these
classical density estimators, where the density function is explicitly defined in relatively low dimension,
generative adversarial networks (GANs) [46] can be categorized as an implicit density estimator in much
higher dimension. The reasons are twofold: (1) GANs put more emphasis on sampling from the distribution
PX than estimation; (2) GANs define the density estimation implicitly through a source distribution PZ and
a generator function g(·), which is usually a deep neural network. We introduce GANs from the perspective
of sampling from PX and later we will generalize the vanilla GANs using its relation to density estimators.
4.2.1 Sampling view of GANs
Suppose the data {x_i}_{1≤i≤n} at hand are all real images, and we want to generate new natural images. With this goal in mind, a GAN models a zero-sum game between two players, namely, the generator G and the discriminator D. The generator G tries to generate fake images akin to the true images {x_i}_{1≤i≤n}, while the discriminator D aims at differentiating the fake ones from the true ones. Intuitively, one hopes to learn a generator G whose images the best discriminator D cannot distinguish from real ones. Therefore the payoff of the generator G is higher if the discriminator D is more likely to be wrong, and correspondingly the payoff of the discriminator correlates positively with its ability to tell fake from real.
Mathematically, the generator G consists of two components, a source distribution P_Z (usually a standard multivariate Gaussian distribution with hundreds of dimensions) and a function g(·) which maps a sample z from P_Z to a point g(z) living in the same space as x. For generating images, g(z) would be a 3D tensor. Here g(z) is the fake sample generated from G. Similarly, the discriminator D is composed of a single function which takes an image x (real or fake) and returns a number d(x) ∈ [0, 1], interpreted as the probability that x is a real sample from P_X. Oftentimes, both the generating function g(·) and the discriminating function d(·) are realized by deep neural networks, e.g., the CNNs introduced in Section 3.1. See Figure 12 for an illustration of GANs. Denote by θ_G and θ_D the parameters in g(·) and d(·), respectively. Then a GAN tries to solve the following min-max problem:
    min_{θ_G} max_{θ_D}  E_{x∼P_X}[log(d(x))] + E_{z∼P_Z}[log(1 − d(g(z)))].   (17)
Recall that d(x) models the belief / probability that the discriminator assigns to x being a true sample. Fix the parameters θ_G, and hence the generator G, and consider the inner maximization problem: the goal of the discriminator is to maximize its ability to differentiate real from fake. Similarly, if we fix θ_D (and hence the discriminator), the generator tries to generate more realistic images g(z) to fool the discriminator.
4.2.2 Density estimation view of GANs
Let us now take a density-estimation view of GANs. Fixing the source distribution PZ, any generator G
induces a distribution PG over the space of images. Removing the restrictions on d(·), one can then rewrite
(17) as
min_{PG} max_{d(·)}   E_{x∼PX}[log d(x)] + E_{x∼PG}[log(1 − d(x))].   (18)
Observe that the inner maximization problem is solved by the likelihood ratio, i.e.
d*(x) = PX(x) / (PX(x) + PG(x)).
As a result, (18) can be simplified as
min_{PG}   JS(PX ‖ PG),   (19)
where JS(·k·) denotes the Jensen–Shannon divergence between two distributions
JS(PX ‖ PG) = (1/2) KL(PX ‖ (PX + PG)/2) + (1/2) KL(PG ‖ (PX + PG)/2).
In words, the vanilla GAN (17) seeks a density PG that is closest to PX in terms of the Jensen–Shannon
divergence. This view allows one to generalize GANs to other variants by changing the distance measure. Examples
include f-GAN [90], Wasserstein GAN (W-GAN) [6], MMD GAN [75], etc. Due to its popularity, we single out the
Wasserstein GAN (W-GAN) [6]. As the name suggests, it minimizes the Wasserstein
distance between PX and PG:
min_{θG} WS(PX ‖ PG) = min_{θG} sup_{f: f is 1-Lipschitz}   E_{x∼PX}[f(x)] − E_{x∼PG}[f(x)],   (20)
where f(·) is taken over all Lipschitz functions with coefficient 1. Comparing W-GAN (20) with the original
formulation of GAN (17), one finds that the Lipschitz function f in (20) corresponds to the discriminator D
in (17) in the sense that they share similar objectives to differentiate the true distribution PX from the fake
one PG. In the end, we would like to mention that GANs are more difficult to train than supervised deep
learning models such as CNNs [110]. Apart from the training difficulty, how to evaluate GANs objectively
and effectively remains an open research question.
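Before moving on, here is a small numerical check of the density-estimation view above (a sketch with two illustrative discrete distributions; none of the numbers come from the text): plugging the optimal discriminator d*(x) = PX(x)/(PX(x) + PG(x)) into (18) gives 2·JS(PX ‖ PG) − log 4, confirming that minimizing (18) over PG amounts to minimizing the Jensen–Shannon divergence up to constants.

```python
import numpy as np

# Two illustrative discrete distributions on a common support (chosen arbitrarily for the demo).
p_x = np.array([0.50, 0.30, 0.15, 0.05])   # "true" distribution P_X
p_g = np.array([0.25, 0.25, 0.25, 0.25])   # "generated" distribution P_G

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions with positive entries."""
    return np.sum(p * np.log(p / q))

# Jensen-Shannon divergence via its definition.
m = 0.5 * (p_x + p_g)
js = 0.5 * kl(p_x, m) + 0.5 * kl(p_g, m)

# Optimal discriminator d*(x) = P_X(x) / (P_X(x) + P_G(x)).
d_star = p_x / (p_x + p_g)

# Inner objective of (18) evaluated at d*: E_{P_X}[log d*] + E_{P_G}[log(1 - d*)].
objective = np.sum(p_x * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# The identity: objective = 2 * JS(P_X || P_G) - log 4.
assert np.isclose(objective, 2.0 * js - np.log(4.0))
print(objective, 2.0 * js - np.log(4.0))
```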
5 Representation power: approximation theory
Having seen the building blocks of deep learning models in the previous sections, it is natural to ask: what
are the benefits of composing multiple layers of nonlinear functions? In this section, we address this question from
an approximation-theoretic point of view. Mathematically, letting H be the space of functions representable
by neural nets (NNs), we ask how well a function f (with certain properties) can be approximated by functions in
H. We first revisit universal approximation theories, which are mostly developed for shallow neural nets
(neural nets with a single hidden layer), and then provide recent results that demonstrate the benefits of
depth in neural nets. Other notable works include Kolmogorov-Arnold superposition theorem [7, 120], and
circuit complexity for neural nets [91].
5.1 Universal approximation theory for shallow NNs
The universal approximation theories study the approximation of f in a space F by a function represented
by a one-hidden-layer neural net
g(x) = Σ_{j=1}^{N} cj σ∗(wj⊤ x − bj),   (21)
where σ∗ : R → R is a given activation function and N is the number of hidden units in the neural net. For
different spaces F and activation functions σ∗, there are upper and lower bounds on the approximation
error ‖f − g‖. See [93] for a comprehensive overview. Here we present representative results.
First, as N → ∞, any continuous function f can be approximated by some g under mild conditions.
Loosely speaking, this is because each component σ∗(wj⊤ x − bj) behaves like a basis function, and functions
in a suitable space F admit a basis expansion. Given the above heuristics, the next natural question is:
what is the rate of approximation for a finite N?
Let us restrict the domain of x to the unit ball B^d in R^d. For p ∈ [1, ∞) and integer m ≥ 1, consider the
L^p space and the Sobolev space with standard norms
‖f‖_p = [ ∫_{B^d} |f(x)|^p dx ]^{1/p},   ‖f‖_{m,p} = [ Σ_{0≤|k|≤m} ‖D^k f‖_p^p ]^{1/p},
where D^k f denotes the partial derivative indexed by k ∈ Z_+^d. Let F := F_p^m be the space of functions f in the
Sobolev space with ‖f‖_{m,p} ≤ 1. Note that functions in F have bounded derivatives up to m-th order, and
that the smoothness of functions is controlled by m (larger m means smoother). Denote by H_N the space of
functions of the form (21). The following general upper bound is due to [85].
Theorem 1 (Theorem 2.1 in [85]). Assume σ∗ : R → R is such that σ∗ has arbitrary order derivatives in an
open interval I, and that σ∗ is not a polynomial on I. Then, for any p ∈ [1, ∞), d ≥ 2, and integer m ≥ 1,
sup_{f ∈ F_p^m}  inf_{g ∈ H_N}  ‖f − g‖_p ≤ C_{d,m,p} N^{−m/d},
where C_{d,m,p} is independent of N, the number of hidden units.
In the above theorem, the condition on σ∗(·) is mainly technical. This upper bound is useful when the
dimension d is not large. It clearly implies that the one-hidden-layer neural net is able to approximate any
smooth function with enough hidden units. However, it is unclear how to find a good approximator g; nor
do we have control over the magnitude of the parameters (huge weights are impractical). While increasing
the number of hidden units N leads to better approximation, the exponent −m/d suggests the presence of
the curse of dimensionality. The following (nearly) matching lower bound is stated in [80].
Theorem 2 (Theorem 5 in [80]). Let p ≥ 1, m ≥ 1 and N ≥ 2. If the activation function is the standard
sigmoid function σ(t) = (1 + e^{−t})^{−1}, then
sup_{f ∈ F_p^m}  inf_{g ∈ H_N}  ‖f − g‖_p ≥ C′_{d,m,p} (N log N)^{−m/d},   (22)
where C′_{d,m,p} is independent of N.
Results for other activation functions are also obtained by [80]. Moreover, the term log N can be removed
if we assume an additional continuity condition [85].
For the natural space F_p^m of smooth functions, the exponential dependence on d in the upper and lower
bounds may look unappealing. However, [12] showed that for a different function space, there is a good
dimension-free approximation by neural nets. Suppose that a function f : R^d → R has a Fourier
representation
f(x) = ∫_{R^d} e^{i⟨ω, x⟩} f̃(ω) dω,   (23)
where f̃(ω) ∈ C. Assume that f(0) = 0 and that the following quantity is finite:
C_f = ∫_{R^d} ‖ω‖_2 |f̃(ω)| dω.   (24)
[12] uncovers the following dimension-free approximation guarantee.
Theorem 3 (Proposition 1 in [12]). Fix a C > 0 and an arbitrary probability measure µ on the unit ball B^d
in R^d. For every function f with C_f ≤ C and every N ≥ 1, there exists some g ∈ H_N such that
[ ∫_{B^d} (f(x) − g(x))^2 µ(dx) ]^{1/2} ≤ 2C / √N.
Moreover, the coefficients of g may be restricted to satisfy Σ_{j=1}^{N} |cj| ≤ 2C.
The upper bound is now independent of the dimension d. However, C_f may implicitly depend on d, as
the formula in (24) involves an integration over R^d (so for some functions C_f may depend exponentially
on d). Nevertheless, this theorem does characterize an interesting function space with an improved upper
bound. Details of the function space are discussed by [12]. This theorem can be generalized; see [81] for an
example.
To help understand why a dimension-free approximation is possible, let us appeal to a heuristic argument
based on Monte Carlo simulation. It is well known that Monte Carlo approximation errors are independent
of the dimension when evaluating high-dimensional integrals. Let us generate {ωj}_{1≤j≤N} randomly from a
given density p(·) on R^d, and consider the following approximation to (23):
g_N(x) = (1/N) Σ_{j=1}^{N} cj e^{i⟨ωj, x⟩},   where cj = f̃(ωj) / p(ωj).   (25)
Then g_N(x) is a one-hidden-layer neural network with N units and the sinusoid activation function. Note
that E g_N(x) = f(x), where the expectation is taken with respect to the randomness in {ωj}. Now, by
independence, we have
E (g_N(x) − f(x))^2 = (1/N) Var(cj e^{i⟨ωj, x⟩}) ≤ (1/N) E cj^2,
provided that E cj^2 < ∞. Therefore, the rate is independent of the dimension d, though the constant E cj^2 may depend on it.
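The following small simulation illustrates the heuristic above (a sketch with an arbitrary bounded integrand and Gaussian sampling; none of these choices come from the text): the Monte Carlo error of an average of N i.i.d. terms decays like 1/√N regardless of the ambient dimension d.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = np.exp(-0.5)            # E[cos(<theta, w>)] = e^{-1/2} when ||theta|| = 1 and w ~ N(0, I_d)

for d in (2, 20, 100):               # ambient dimension
    theta = np.zeros(d)
    theta[0] = 1.0                   # any unit vector works
    for N in (100, 10_000):          # number of Monte Carlo samples
        reps = 100                   # replications, used to estimate the root-mean-square error
        errs = []
        for _ in range(reps):
            w = rng.standard_normal((N, d))
            estimate = np.mean(np.cos(w @ theta))
            errs.append(estimate - true_value)
        rmse = np.sqrt(np.mean(np.square(errs)))
        print(f"d = {d:4d}, N = {N:6d}, RMSE ~ {rmse:.4f}")   # shrinks like 1/sqrt(N), not with d
```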
5.2 Approximation theory for multi-layer NNs
The approximation theory for multilayer neural nets is less understood compared with neural nets with one
hidden layer. Driven by the success of deep learning, there are many recent papers focusing on expressivity
of deep neural nets. As studied by [125, 35, 84, 94, 15, 111, 77, 103], deep neural nets excel at representing
composition of functions. This is perhaps not surprising, since deep neural nets are themselves defined by
composing layers of functions. Nevertheless, it points to a new territory rarely studied in statistics before.
Below we present a result based on [77, 103].
Suppose that the inputs x have a bounded domain [−1, 1]^d for simplicity. As before, let σ∗ : R → R be a
generic activation function, and let σ∗ = (σ∗, · · · , σ∗)⊤ denote its element-wise application. Consider a neural net
similar to (3) but with scalar output: g(x) = W_ℓ σ∗(· · · σ∗(W_2 σ∗(W_1 x)) · · · ). A unit or neuron refers to an
element of the vectors σ∗(W_k · · · σ∗(W_2 σ∗(W_1 x)) · · · ) for any k = 1, . . . , ℓ − 1. For a multivariate polynomial
p, define m_k(p) to be the smallest integer such that, for any ε > 0, there exists a neural net g(x) satisfying
sup_x |p(x) − g(x)| < ε, with k hidden layers (i.e., ℓ = k + 1) and no more than m_k(p) neurons in total.
Essentially, m_k(p) is the minimum number of neurons required to approximate p arbitrarily well.
Theorem 4 (Theorem 4.1 in [103]). Let p(x) be the monomial x_1^{r_1} x_2^{r_2} · · · x_d^{r_d} with q = Σ_{j=1}^{d} rj. Suppose that
σ∗ has derivatives of order 2q at the origin, and that they are nonzero. Then,
(i) m_1(p) = Π_{j=1}^{d} (rj + 1);
(ii) min_k m_k(p) ≤ Σ_{j=1}^{d} (7⌈log_2(rj)⌉ + 4).
This theorem reveals a sharp distinction between shallow networks (one hidden layer) and deep networks.
To represent a monomial function, a shallow network requires exponentially many neurons in terms of the
dimension d, whereas linearly many neurons suffice for a deep network (with bounded rj). The exponential
dependence on d, as shown in Theorem 4(i), is resonant with the curse of dimensionality widely seen in
many fields; see [30]. One may ask: how does depth help? Depth circumvents this issue, at least for certain
functions, by allowing us to represent function composition efficiently. Indeed, Theorem 4(ii) offers a nice
result with clear intuitions: it is known that the product of two scalar inputs can be represented using 4
neurons [77], so by composing multiple products, we can express monomials with O(d) neurons.
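To make this intuition concrete, the sketch below follows the standard four-neuron construction of a product: for an activation σ∗ with nonzero second derivative at the origin (softplus is an illustrative choice here, not prescribed by the text), a one-hidden-layer net with four hidden units approximates xy up to an O(ε²) error.

```python
import numpy as np

def softplus(t):
    # softplus activation; its second derivative at 0 equals 1/4 (nonzero, as Theorem 4 requires)
    return np.log1p(np.exp(t))

def product_net(x, y, eps=1e-2):
    """Approximate x*y with a one-hidden-layer net that has exactly 4 hidden neurons.

    Hidden pre-activations: +-eps*(x+y) and +-eps*(x-y); output weights +-1/(4*sigma''(0)*eps^2).
    A Taylor expansion of softplus around 0 shows the output equals x*y + O(eps^2).
    """
    sigma_pp0 = 0.25  # second derivative of softplus at 0
    h = np.array([softplus(eps * (x + y)),
                  softplus(-eps * (x + y)),
                  softplus(eps * (x - y)),
                  softplus(-eps * (x - y))])
    c = np.array([1.0, 1.0, -1.0, -1.0]) / (4.0 * sigma_pp0 * eps**2)
    return float(c @ h)

x, y = 1.3, -0.7
print(product_net(x, y), x * y)                 # both close to -0.91
assert abs(product_net(x, y) - x * y) < 1e-3
```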
Recent advances in nonparametric regression also support the idea that deep neural nets excel at representing
compositions of functions [15, 111]. In particular, [15] considered the nonparametric regression setting
where we want to estimate a function f̂n(x) from i.i.d. data Dn = {(yi, xi)}_{1≤i≤n}. If the true regression
function f(x) has a certain hierarchical structure with intrinsic dimensionality6 d*, then the error
E_{Dn} E_x [ f̂n(x) − f(x) ]^2
has the optimal minimax convergence rate O(n^{−2q/(2q+d*)}), rather than the usual rate O(n^{−2q/(2q+d)}) that depends on
the ambient dimension d. Here q is the smoothness parameter. This provides another justification for deep
neural nets: if the data are truly hierarchical, then the quality of approximation by deep neural nets depends on
the intrinsic dimensionality, which avoids the curse of dimensionality.
We point out that the approximation theory for deep learning is far from complete. For example, in
Theorem 4, the condition on σ∗ excludes the widely used ReLU activation function, and there are no constraints
on the magnitude of the weights (so they can be unreasonably large).
6 Training deep neural nets
The existence of a good function approximator in the NN function class does not explain why in practice
we can find one easily. In this section, we introduce standard methods, namely stochastic gradient descent
(SGD) and its variants, to train deep neural networks (or to find such a good approximator). As with many
statistical machine learning tasks, training DNNs follows the empirical risk minimization (ERM) paradigm
which solves the following optimization problem
minimize_{θ∈R^p}   ℓn(θ) := (1/n) Σ_{i=1}^{n} L(f(xi; θ), yi).   (26)
Here L(f(xi; θ), yi) measures the discrepancy between the prediction f(xi; θ) of the neural network and the
true label yi. Correspondingly, denote by ℓ(θ) := E_{(x,y)∼D}[L(f(x; θ), y)] the out-of-sample error, where D
is the joint distribution over (y, x). Solving ERM (26) for deep neural nets faces various challenges that
roughly fall into the following three categories.
• Scalability and nonconvexity. Both the sample size n and the number of parameters p can be huge for
modern deep learning applications, as we have seen in Table 1. Many optimization algorithms are not
practical due to the computational costs and memory constraints. What is worse, the empirical loss
function ℓn(θ) in deep learning is often nonconvex. It is a priori not clear whether an optimization
algorithm can make the empirical loss (26) small.
• Numerical stability. With a large number of layers in DNNs, the magnitudes of the hidden nodes can be
drastically different, which may result in the “exploding gradients” or “vanishing gradients” issue during
the training process. This is because the recursive relations across layers often lead to exponentially
increasing / decreasing values in both forward passes and backward passes.
• Generalization performance. Our ultimate goal is to find a parameter θ̂ such that the out-of-sample error
`(θ̂) is small. However, in the over-parametrized regime where p is much larger than n, the underlying
6Roughly speaking, the true regression function can be represented by a tree where each node has at most d∗ children.
See [15] for the precise definition.
neural network has the potential to fit the training data perfectly while performing poorly on the test
data. To avoid this overfitting issue, proper regularization, whether explicit or implicit, is needed in the
training process for the neural nets to generalize.
In the following three subsections, we discuss practical solutions / proposals to address these challenges.
6.1 Stochastic gradient descent
Stochastic gradient descent (SGD) [101] is by far the most popular optimization algorithm to solve ERM (26)
for large-scale problems. It has the following simple update rule:
θ^{t+1} = θ^t − ηt G(θ^t),   with G(θ^t) = ∇L(f(x_{it}; θ^t), y_{it}),   (27)
for t = 0, 1, 2, . . ., where ηt > 0 is the step size (or learning rate), θ^0 ∈ R^p is an initial point, and the index
it is chosen randomly from {1, 2, · · · , n}. It is easy to verify that G(θ^t) is an unbiased estimate of ∇ℓn(θ^t). The
advantage of SGD is clear: compared with gradient descent, which goes over the entire dataset in every
update, SGD uses a single example in each update and hence is considerably more efficient in terms of both
computation and memory (especially in the first few iterations).
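As a concrete illustration of the update rule (27), the following sketch runs plain SGD on a synthetic least-squares problem (the data, loss, and step-size schedule are illustrative choices, not taken from the text).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = X @ theta_star + 0.1 * rng.standard_normal(n)    # synthetic regression data

def loss(theta):
    # empirical loss l_n(theta) with the squared-error loss L
    return 0.5 * np.mean((X @ theta - y) ** 2)

theta = np.zeros(p)                                  # initial point theta^0
for t in range(20000):
    i = rng.integers(n)                              # index i_t drawn uniformly from {1,...,n}
    grad = (X[i] @ theta - y[i]) * X[i]              # stochastic gradient G(theta^t), unbiased for the full gradient
    eta = 0.05 / (1 + 0.01 * t)                      # decaying step size
    theta = theta - eta * grad                       # SGD update (27)

print(loss(theta), loss(theta_star))                 # the two losses should be comparable after training
```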
Apart from the practical benefits of SGD, how well does SGD perform theoretically in terms of minimizing
ℓn(θ)? We begin with the convex case, i.e., the case where the loss function is convex w.r.t. θ. It is well
understood in the literature that with proper choices of the step sizes {ηt}, SGD is guaranteed to achieve both
consistency and asymptotic normality.
• Consistency. If ℓ(θ) is a strongly convex function7, then under some mild conditions8, learning rates that
satisfy
Σ_{t=0}^{∞} ηt = +∞   and   Σ_{t=0}^{∞} ηt^2 < +∞   (28)
guarantee almost sure convergence to the unique minimizer θ* := argmin_θ ℓ(θ), i.e., θ^t → θ* almost surely as
t → ∞ [101, 64, 16, 69]. The requirements in (28) can be viewed from the perspective of bias-variance tradeoff:
the first condition ensures that the iterates can reach the minimizer (controlled bias), and the second
ensures that stochasticity does not prevent convergence (controlled variance).
• Asymptotic normality. It is proved by [97] that for robust linear regression with fixed dimension p, under
the choice ηt = t^{−1}, √t (θ^t − θ*) is asymptotically normal under some regularity conditions (but θ^t is not
asymptotically efficient in general). Moreover, by averaging the iterates of SGD, [96] proved that even
with a larger step size ηt ∝ t^{−α}, α ∈ (1/2, 1), the averaged iterate θ̄^t = t^{−1} Σ_{s=1}^{t} θ^s is asymptotically efficient
for robust linear regression. These strong results show that SGD with averaging performs as well as the
MLE asymptotically, in addition to its computational efficiency.
These classical results, however, fail to explain the effectiveness of SGD when dealing with nonconvex
loss functions in deep learning. Admittedly, finding global minima of nonconvex functions is computationally
infeasible in the worst case. Nevertheless, recent work [4, 32] bypasses the worst case scenario by focusing
on losses incurred by over-parametrized deep learning models. In particular, they show that (stochastic)
gradient descent converges linearly towards the global minimizer of ℓn(θ) as long as the neural network is
sufficiently over-parametrized. This phenomenon is formalized below.
Theorem 5 (Theorem 2 in [4]). Let {(yi, xi)}_{1≤i≤n} be a training set satisfying min_{i,j: i≠j} ‖xi − xj‖_2 ≥ δ > 0.
Consider fitting the data using a feed-forward neural network (1) with ReLU activations. Denote by L
(resp. W) the depth (resp. width) of the network. Suppose that the neural network is sufficiently over-
parametrized, i.e.,
W ≳ poly(n, L, 1/δ),   (29)
7 For results on consistency and asymptotic normality, we consider the case where in each step of SGD, the stochastic
gradient is computed using a fresh sample (y, x) from D. This allows us to view SGD as an optimization algorithm to minimize
the population loss ℓ(θ).
8 One example of such a condition is a constraint on the second moment of the gradients: E[‖∇L(xi, yi; θ^t)‖_2^2] ≤ C1 +
C2 ‖θ^t − θ*‖_2^2 for some C1, C2 > 0. See [16] for details.
where poly means a polynomial function. Then with high probability, running SGD (27) with certain random
initialization and properly chosen step sizes yields ℓn(θ^t) ≤ ε in t ∝ log(1/ε) iterations.
Two notable features are worth mentioning: (1) the network under consideration is sufficiently over-
parametrized (cf. (29)) in which the number of parameters is much larger than the number of samples, and
(2) one needs to initialize the weight matrices to be in near-isometry such that the magnitudes of the hidden
nodes do not blow up or vanish. In a nutshell, over-parametrization and random initialization together
ensure that the loss function (26) has a benign landscape9
around the initial point, which in turn implies
fast convergence of SGD iterates.
There are certainly other challenges for vanilla SGD to train deep neural nets: (1) training algorithms
are often implemented in GPUs, and therefore it is important to tailor the algorithm to the infrastructure,
(2) vanilla SGD might converge very slowly for deep neural networks, despite good theoretical guarantees
for well-behaved problems, and (3) the learning rates {ηt} can be difficult to tune in practice. To address
the aforementioned challenges, three important variants of SGD, namely mini-batch SGD, momentum-based
SGD, and SGD with adaptive learning rates are introduced.
6.1.1 Mini-batch SGD
Modern computational infrastructures (e.g., GPUs) can evaluate the gradient on a number (say 64) of
examples as efficiently as evaluating that on a single example. To utilize this advantage, mini-batch SGD
with batch size K ≥ 1 forms the stochastic gradient through K random samples:
θ^{t+1} = θ^t − ηt G(θ^t),   with G(θ^t) = (1/K) Σ_{k=1}^{K} ∇L(f(x_{i_t^k}; θ^t), y_{i_t^k}),   (30)
where for each 1 ≤ k ≤ K, the index i_t^k is sampled uniformly from {1, 2, · · · , n}. Mini-batch SGD, which is an
“interpolation” between gradient descent and stochastic gradient descent, achieves the best of both worlds:
(1) using 1 ≪ K ≪ n samples to estimate the gradient, one effectively reduces the variance and hence
accelerates the convergence, and (2) by taking the batch size K appropriately (say 64 or 128), the stochastic
gradient G(θ^t) can be efficiently computed using the matrix computation toolboxes on GPUs.
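The variance-reduction effect of averaging over a mini-batch can be checked directly. In the sketch below (synthetic least-squares data, again an illustrative choice), the mean squared deviation of the mini-batch gradient (30) from the full gradient shrinks roughly by a factor of 1/K.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)
theta = rng.standard_normal(p)                       # an arbitrary current iterate

def minibatch_grad(K):
    # stochastic gradient (30): average of K per-example gradients of the squared loss
    idx = rng.integers(n, size=K)
    residuals = X[idx] @ theta - y[idx]
    return (X[idx] * residuals[:, None]).mean(axis=0)

full_grad = X.T @ (X @ theta - y) / n                # full gradient of l_n at theta
for K in (1, 16, 256):
    errs = np.stack([minibatch_grad(K) - full_grad for _ in range(2000)])
    print(K, np.mean(np.sum(errs ** 2, axis=1)))     # mean squared deviation, roughly proportional to 1/K
```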
6.1.2 Momentum-based SGD
While mini-batch SGD forms the foundation of training neural networks, it can sometimes be slow to converge
due to its oscillation behavior [122]. The optimization community has long investigated how to accelerate the
convergence of gradient descent, resulting in an elegant technique called momentum methods [95, 88].
Similar to gradient descent with momentum, momentum-based SGD, instead of moving the iterate θ^t in the
direction of the current stochastic gradient G(θ^t), smooths the past (stochastic) gradients {G(θ^t)} to stabilize
the update directions. Mathematically, let v^t ∈ R^p be the direction of update in the t-th iteration, i.e.,
θ^{t+1} = θ^t − ηt v^t.
Here v^0 = G(θ^0), and for t = 1, 2, · · ·,
v^t = ρ v^{t−1} + G(θ^t),   (31)
with 0 < ρ < 1. A typical choice of ρ is 0.9. Notice that ρ = 0 recovers the mini-batch SGD (30), where
no past information of gradients is used. A simple unrolling of v^t reveals that v^t is actually an exponentially weighted
average of the past gradients, i.e., v^t = Σ_{j=0}^{t} ρ^{t−j} G(θ^j). Compared with vanilla mini-batch SGD, the
inclusion of the momentum “smoothes” the oscillation direction and accumulates the persistent descent
direction. We want to emphasize that the theoretical justification of momentum in the stochastic setting is not
fully understood [63, 60].
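Here is a short sketch of the momentum recursion (31), verifying numerically that unrolling it gives the exponentially weighted sum v^t = Σ_{j≤t} ρ^{t−j} G(θ^j) (the gradients below are random placeholders, used purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
rho, p, T = 0.9, 5, 50
grads = [rng.standard_normal(p) for _ in range(T)]   # stand-ins for the stochastic gradients G(theta^t)

# Momentum recursion (31): v^t = rho * v^{t-1} + G(theta^t), with v^0 = G(theta^0).
v = grads[0]
for t in range(1, T):
    v = rho * v + grads[t]

# Unrolled form: v^{T-1} = sum_j rho^{T-1-j} * G(theta^j).
v_unrolled = sum(rho ** (T - 1 - j) * grads[j] for j in range(T))
assert np.allclose(v, v_unrolled)
print(v[:3], v_unrolled[:3])
```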
9 In [4], the loss function ℓn(θ) satisfies the PL condition.
6.1.3 SGD with adaptive learning rates
In optimization, preconditioning is often used to accelerate first-order optimization algorithms. In principle,
one can apply this to SGD, which yields the following update rule:
θ^{t+1} = θ^t − ηt P_t^{−1} G(θ^t),   (32)
with P_t ∈ R^{p×p} being a preconditioner at the t-th step. Newton’s method can be viewed as one type
of preconditioning where P_t = ∇²ℓ(θ^t). The advantages of preconditioning are two-fold: first, a good
preconditioner reduces the condition number by changing the local geometry to be more homogeneous, which
is amenable to fast convergence; second, a good preconditioner frees practitioners from laborious tuning of the
step sizes, as is the case with Newton’s method. AdaGrad, an adaptive gradient method proposed by [33],
builds a preconditioner P_t based on information of the past gradients:
P_t = { diag( Σ_{j=0}^{t} G(θ^j) G(θ^j)⊤ ) }^{1/2}.   (33)
Since we only require the diagonal part, this preconditioner (and its inverse) can be efficiently computed in
practice. In addition, investigating (32) and (33), one can see that AdaGrad adapts to the importance of each
coordinate of the parameters by setting smaller learning rates for frequent features and larger learning
rates for infrequent ones. In practice, one adds a small quantity δ > 0 (say 10^{−8}) to the diagonal
entries to avoid singularity (numerical underflow). A notable drawback of AdaGrad is that the effective
learning rate vanishes quickly along the learning process. This is because the historical sum of the gradients
can only increase with time. RMSProp [52] is a popular remedy for this problem which incorporates the
idea of exponential averaging:
P_t = { diag( ρ P_{t−1} + (1 − ρ) G(θ^t) G(θ^t)⊤ ) }^{1/2}.   (34)
Again, the decaying parameter ρ is usually set to be 0.9. Later, Adam [65, 100] combines the momentum
method and adaptive learning rates and has become the default training algorithm in many deep learning
applications.
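A minimal sketch of the diagonal preconditioners above (element-wise forms corresponding to (33) and (34); the badly scaled quadratic objective, step size, and constants are illustrative assumptions, and delta plays the role of the small quantity added to avoid division by zero).

```python
import numpy as np

def grad(theta):
    # gradient of an illustrative badly scaled quadratic: 0.5 * (100*theta_0^2 + theta_1^2)
    return np.array([100.0, 1.0]) * theta

theta_ada = np.array([1.0, 1.0])
theta_rms = np.array([1.0, 1.0])
s_ada = np.zeros(2)      # running sum of squared gradients (diagonal of (33), AdaGrad)
s_rms = np.zeros(2)      # exponential moving average of squared gradients (diagonal of (34), RMSProp)
eta, rho, delta = 0.1, 0.9, 1e-8

for t in range(500):
    g = grad(theta_ada)
    s_ada += g ** 2
    theta_ada -= eta * g / (np.sqrt(s_ada) + delta)        # per-coordinate learning rates

    g = grad(theta_rms)
    s_rms = rho * s_rms + (1 - rho) * g ** 2
    theta_rms -= eta * g / (np.sqrt(s_rms) + delta)

print("AdaGrad:", theta_ada, "RMSProp:", theta_rms)         # both move toward the minimizer (0, 0)
```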
6.2 Easing numerical instability
For very deep neural networks or RNNs with long dependencies, training difficulties often arise when the val-
ues of nodes have different magnitudes or when the gradients “vanish” or “explode” during back-propagation.
Here we discuss three partial solutions to alleviate this problem.
6.2.1 ReLU activation function
One useful characteristic of the ReLU function is that its derivative is either 0 or 1, and the derivative remains
1 even for a large input. This is in sharp contrast with the standard sigmoid function (1 + e^{−t})^{−1}, which
which
results in a very small derivative when inputs have large magnitude. The consequence of small derivatives
across many layers is that gradients tend to be “killed”, which means that gradients become approximately
zero in deep nets.
The popularity of the ReLU activation function and its variants (e.g., leaky ReLU) is largely attributable
to the above reason. It has been well observed that the ReLU activation function has superior training
performance over the sigmoid function [68, 79].
6.2.2 Skip connections
We have introduced skip connections in Section 3.3. Why are skip connections helpful for reducing numerical
instability? This structure does not introduce a larger function space, since the identity map can be also
represented with ReLU activations: x = σ(x) − σ(−x).
One explanation is that skip connections ease the training / optimization process. Suppose
that we have a general nonlinear function F(x_ℓ; θ_ℓ). With a skip connection, we represent the map as
x_{ℓ+1} = x_ℓ + F(x_ℓ; θ_ℓ) instead. Now the gradient ∂x_{ℓ+1}/∂x_ℓ becomes
∂x_{ℓ+1}/∂x_ℓ = I + ∂F(x_ℓ; θ_ℓ)/∂x_ℓ   instead of   ∂F(x_ℓ; θ_ℓ)/∂x_ℓ,   (35)
where I is an identity matrix. By the chain rule, the gradient update requires computing products of many
components, e.g., ∂x_L/∂x_1 = Π_{ℓ=1}^{L−1} ∂x_{ℓ+1}/∂x_ℓ, so it is desirable to keep the spectra (singular values) of each component
∂x_{ℓ+1}/∂x_ℓ close to 1. In neural nets with skip connections, this is easily achieved if the parameters have small
values; otherwise, this may not be achievable even with careful initialization and tuning. Notably, training
neural nets with hundreds of layers is possible with the help of skip connections.
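A quick numerical illustration of why the identity term in (35) helps (the random layer-wise Jacobians below are purely illustrative stand-ins for ∂F/∂x): products of matrices of the form I plus a small perturbation keep their singular values at a moderate scale, whereas products of the perturbations alone shrink toward zero, mimicking vanishing gradients.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, depth, scale = 50, 64, 0.03

prod_with_skip = np.eye(dim)
prod_without_skip = np.eye(dim)
for _ in range(depth):
    J = scale * rng.standard_normal((dim, dim))      # stand-in for dF/dx at one layer
    prod_with_skip = (np.eye(dim) + J) @ prod_with_skip
    prod_without_skip = J @ prod_without_skip

print(np.linalg.norm(prod_with_skip, 2))             # remains of moderate size: the gradient signal survives
print(np.linalg.norm(prod_without_skip, 2))          # shrinks to a negligible value: vanishing gradients
```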
6.2.3 Batch normalization
Recall that in regression analysis, one often standardizes the design matrix so that the features have zero
mean and unit variance. Batch normalization extends this standardization procedure from the input layer
to all the hidden layers. Mathematically, fix a mini-batch of input data {(xi, yi)}_{i∈B}, where B ⊂ [n]. Let
h_i^{(ℓ)} be the feature of the i-th example in the ℓ-th layer (ℓ = 0 corresponds to the input xi). The batch
normalization layer computes the normalized version of h_i^{(ℓ)} via the following steps:
µ := (1/|B|) Σ_{i∈B} h_i^{(ℓ)},   σ² := (1/|B|) Σ_{i∈B} ( h_i^{(ℓ)} − µ )²,   and   h_{i,norm}^{(ℓ)} := ( h_i^{(ℓ)} − µ ) / σ.
Here all the operations are element-wise. In words, batch normalization computes the z-score of each feature
over the mini-batch B and uses that as the input to subsequent layers. To make it more versatile, a typical batch
normalization layer has two additional learnable parameters γ^{(ℓ)} and β^{(ℓ)} such that
h_{i,new}^{(ℓ)} = γ^{(ℓ)} ⊙ h_{i,norm}^{(ℓ)} + β^{(ℓ)}.
Again, ⊙ denotes element-wise multiplication. As can be seen, γ^{(ℓ)} and β^{(ℓ)} set the new feature h_{i,new}^{(ℓ)}
to have mean β^{(ℓ)} and standard deviation γ^{(ℓ)}. The introduction of batch normalization makes the training
of neural networks much easier and smoother. More importantly, it allows the neural nets to perform well
over a large family of hyper-parameters including the number of layers, the number of hidden units, etc. At
test time, the batch normalization layer needs more care. For brevity we omit the details and refer to [58].
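Below is a sketch of the batch-normalization computation above for one layer's features over a mini-batch (NumPy, with illustrative shapes; the small constant added inside the square root mirrors common practice and is an assumption, not part of the formulas above).

```python
import numpy as np

rng = np.random.default_rng(4)
B, d = 32, 8                                  # mini-batch size |B| and feature dimension
h = 5.0 + 2.0 * rng.standard_normal((B, d))   # features h_i^(l) for i in the mini-batch

def batch_norm(h, gamma, beta, eps=1e-5):
    mu = h.mean(axis=0)                       # per-feature mean over the mini-batch
    var = h.var(axis=0)                       # per-feature variance over the mini-batch
    h_norm = (h - mu) / np.sqrt(var + eps)    # z-score of each feature
    return gamma * h_norm + beta              # learnable rescaling and shift

gamma = np.ones(d)
beta = np.zeros(d)
h_new = batch_norm(h, gamma, beta)
print(h_new.mean(axis=0).round(6))            # approximately beta (here 0)
print(h_new.std(axis=0).round(3))             # approximately gamma (here 1)
```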
6.3 Regularization techniques
So far we have focused on training techniques that make the empirical loss (26) small efficiently. Here we
proceed to discuss common practices that improve the generalization power of trained neural nets.
6.3.1 Weight decay
One natural regularization idea is to add an ℓ2 penalty to the loss function. This regularization technique
is known as weight decay in deep learning. We have seen one example in (9). For general deep neural
nets, the loss to optimize is ℓn^λ(θ) = ℓn(θ) + rλ(θ), where
rλ(θ) = λ Σ_{ℓ=1}^{L} Σ_{j,j′} ( W_{j,j′}^{(ℓ)} )².
Note that the bias (intercept) terms are not penalized. If ℓn(θ) is a least-squares loss, then regularization
with weight decay gives precisely ridge regression. The penalty rλ(θ) is a smooth function and thus can
also be implemented efficiently with back-propagation.
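A quick numerical check of the ridge-regression connection just mentioned (a sketch with synthetic data, a single linear layer with no bias, and a hypothetical λ): gradient descent on the weight-decayed least-squares loss recovers the closed-form ridge solution.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 200, 5, 0.1
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Closed-form ridge solution: argmin (1/2n)||X theta - y||^2 + lam * ||theta||^2.
theta_ridge = np.linalg.solve(X.T @ X / n + 2 * lam * np.eye(p), X.T @ y / n)

# Gradient descent on the same weight-decayed loss.
theta = np.zeros(p)
for _ in range(5000):
    grad = X.T @ (X @ theta - y) / n + 2 * lam * theta
    theta -= 0.1 * grad

assert np.allclose(theta, theta_ridge, atol=1e-6)
print(theta, theta_ridge)
```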
6.3.2 Dropout
Dropout, introduced by [53], prevents overfitting by randomly dropping out subsets of features during train-
ing. Take the l-th layer of the feed-forward neural network as an example. Instead of propagating all the
features in h^{(ℓ)} for later computations, dropout randomly omits some of its entries by
h_drop^{(ℓ)} = h^{(ℓ)} ⊙ mask_ℓ,
where ⊙ denotes element-wise multiplication as before, and mask_ℓ is a vector of Bernoulli variables with
success probability p. It is sometimes useful to rescale the features, h_{inv drop}^{(ℓ)} = h_drop^{(ℓ)} / p, which is called
inverted dropout. During training, mask_ℓ
are i.i.d. vectors across mini-batches and layers. However, when
testing on fresh samples, dropout is disabled and the original features h^{(ℓ)}
are used to compute the output
label y. It has been nicely shown by [129] that for generalized linear models, dropout serves as adaptive
regularization. In the simplest case of linear regression, it is equivalent to `2 regularization. Another possible
way to understand the regularization effect of dropout is through the lens of bagging [45]. Since different
mini-batches have different masks, dropout can be viewed as training a large ensemble of classifiers at the same
time, with a further constraint that the parameters are shared. Theoretical justification remains elusive.
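The sketch below applies inverted dropout to one layer's features (illustrative shapes; rescaling by 1/p keeps the expected value of each feature unchanged, which is why no rescaling is needed at test time).

```python
import numpy as np

rng = np.random.default_rng(6)
p_keep = 0.8                                   # success probability p of the Bernoulli mask

def inverted_dropout(h, p_keep, training=True):
    if not training:
        return h                               # dropout is disabled at test time
    mask = rng.binomial(1, p_keep, size=h.shape)   # Bernoulli(p) mask, redrawn every call
    return h * mask / p_keep                   # element-wise masking plus 1/p rescaling

h = rng.standard_normal((4, 6))                # features h^(l) for a small mini-batch
print(inverted_dropout(h, p_keep))             # roughly a fraction 1 - p of the entries are zeroed out
print(inverted_dropout(h, p_keep, training=False))   # unchanged at test time
```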
6.3.3 Data augmentation
Data augmentation is a technique of enlarging the dataset when we have knowledge about invariance structure
of data. It implicitly increases the sample size and usually regularizes the model effectively. For example,
in image classification, we have strong prior knowledge about what invariance properties a good classifier
should possess. The label of an image should not be affected by translation, rotation, flipping, and even
crops of the image. Hence one can augment the dataset by randomly translating, rotating and cropping the
images in the original dataset.
Formally, during training we want to minimize the loss ℓn(θ) = Σ_i L(f(xi; θ), yi) w.r.t. parameters θ,
and we know a priori that certain transformations T ∈ T, where T : R^d → R^d (e.g., affine transformations),
should not change the category / label of a training sample. In principle, if computation costs were not a
consideration, we could convert this knowledge into the constraint fθ(T xi) = fθ(xi), ∀ T ∈ T, in the minimization
formulation. Instead of solving a constrained optimization problem, data augmentation enlarges the training
dataset by sampling T ∈ T and generating new data {(T xi, yi)}. In this sense, data augmentation induces
invariance properties through sampling, which results in a much bigger dataset than the original one.
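A minimal sketch of this idea for image data (NumPy arrays standing in for images; horizontal flips and small translations are illustrative members of the transformation class T, not prescribed by the text).

```python
import numpy as np

rng = np.random.default_rng(7)

def augment(image):
    """Sample a label-preserving transformation T and apply it to one image (an H x W array)."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                     # horizontal flip
    shift = rng.integers(-2, 3)                # small horizontal translation
    out = np.roll(out, shift, axis=1)
    return out

# Enlarge the training set: each original (x_i, y_i) spawns several augmented copies (T x_i, y_i).
images = rng.random((10, 28, 28))              # a toy dataset of 10 "images"
labels = rng.integers(0, 2, size=10)
augmented = [(augment(x), y) for x, y in zip(images, labels) for _ in range(4)]
print(len(augmented))                          # 40 augmented training examples
```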
7 Generalization power
Section 6 has focused on the in-sample / training error obtained via SGD, but this alone does not guarantee
good performance with respect to the out-of-sample / test error. The gap between the in-sample error and
the out-of-sample error, namely the generalization gap, has been the focus of statistical learning theory since
its birth; see [112] for an excellent introduction to this topic.
While understanding the generalization power of deep neural nets is difficult [135, 99], we sample re-
cent endeavors in this section. From a high level point of view, these approaches can be divided into
two categories, namely algorithm-independent controls and algorithm-dependent controls. More specifically,
algorithm-independent controls focus solely on bounding the complexity of the function class represented
by certain deep neural networks. In contrast, algorithm-dependent controls take into account the algorithm
(e.g., SGD) used to train the neural network.
7.1 Algorithm-independent controls: uniform convergence
The key to algorithm-independent controls is the notion of complexity of the function class parametrized
by certain neural networks. Informally, as long as the complexity is not too large, the generalization gap of
any function in the function class is well-controlled. However, the standard complexity measure (e.g., VC
dimension [127]) is at least proportional to the number of weights in a neural network [5, 112], which fails to
explain the practical success of deep learning. The caveat here is that the function class under consideration
is all the functions realized by certain neural networks, with no restrictions on the size of the weights at all.
On the other hand, for the class of linear functions with bounded norm, i.e., {x ↦ w⊤x : ‖w‖_2 ≤ M}, it is
well understood that the complexity of this function class (measured in terms of the empirical Rademacher
complexity) with respect to a random sample {xi}_{1≤i≤n} is upper bounded by max_i ‖xi‖_2 M / √n, which is
independent of the number of parameters in w. This motivates researchers to investigate the complexity
of norm-controlled deep neural networks10
[89, 14, 43, 74]. Setting the stage, we introduce a few necessary
notations and facts. The key object under study is the function class parametrized by the following fully-
connected neural network with depth L:
F_L := { x ↦ W_L σ(W_{L−1} σ(· · · W_2 σ(W_1 x))) : (W_1, · · · , W_L) ∈ W }.   (36)
Here (W_1, W_2, · · · , W_L) ∈ W represents a certain constraint on the parameters. For instance, one can
restrict the Frobenius norm of each parameter W_l through the constraint ‖W_l‖_F ≤ M_F(l), where M_F(l) is
some positive quantity. With regard to the complexity measure, it is standard to use Rademacher complexity
some positive quantity. With regard to the complexity measure, it is standard to use Rademacher complexity
to control the capacity of the function class of interest.
Definition 1 (Empirical Rademacher complexity). The empirical Rademacher complexity of a function
class F w.r.t. a dataset S := {xi}_{1≤i≤n} is defined as
R_S(F) = E_ε [ sup_{f∈F} (1/n) Σ_{i=1}^{n} εi f(xi) ],   (37)
where ε := (ε1, ε2, · · · , εn) is composed of i.i.d. Rademacher random variables, i.e., P(εi = 1) = P(εi = −1) =
1/2.
In words, Rademacher complexity measures the ability of the function class to fit the random noise rep-
resented by ε. Intuitively, a function class with a larger Rademacher complexity is more prone to overfitting.
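To make Definition 1 concrete, the sketch below estimates R_S(F) by Monte Carlo for the bounded-norm linear class {x ↦ w⊤x : ‖w‖_2 ≤ M} mentioned earlier, using the fact (by Cauchy–Schwarz) that for this class the supremum in (37) equals (M/n)‖Σ_i εi xi‖_2; the estimate indeed falls below the bound max_i ‖xi‖_2 M/√n (the data and constants are illustrative).

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, M = 100, 20, 2.0
X = rng.standard_normal((n, d))                # the sample S = {x_i}

# Monte Carlo estimate of (37): average, over random Rademacher vectors, of the closed-form supremum.
reps = 2000
vals = []
for _ in range(reps):
    eps = rng.choice([-1.0, 1.0], size=n)      # i.i.d. Rademacher signs
    vals.append(M / n * np.linalg.norm(X.T @ eps))   # sup over ||w|| <= M of (1/n) sum_i eps_i w^T x_i
rademacher_estimate = np.mean(vals)

bound = np.max(np.linalg.norm(X, axis=1)) * M / np.sqrt(n)
print(rademacher_estimate, bound)              # the estimate is below the dimension-free bound
assert rademacher_estimate <= bound
```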
We now formalize the connection between the empirical Rademacher complexity and the out-of-sample error;
see Chapter 24 in [112].
Theorem 6. Assume that for all f ∈ F and all (y, x) we have |L(f(x), y)| ≤ 1. In addition, assume that
for any fixed y, the univariate function L(·, y) is Lipschitz with constant 1. Then with probability at least
1 − δ over the sample S := {(yi, xi)}_{1≤i≤n} drawn i.i.d. from D, one has for all f ∈ F
E_{(y,x)∼D}[L(f(x), y)]  ≤  (1/n) Σ_{i=1}^{n} L(f(xi), yi) + 2 R_S(F) + 4 √( log(4/δ) / n ),
where the left-hand side is the out-of-sample error and the first term on the right-hand side is the in-sample error.
In English, the generalization gap of any function f that lies in F is well-controlled as long as the
Rademacher complexity of F is not too large. With this connection in place, we single out the following
complexity bound.
Theorem 7 (Theorem 1 in [43]). Consider the function class FL in (36), where each parameter Wl has
Frobenius norm at most MF(l). Further suppose that the element-wise activation function σ(·) is 1-Lipschitz
and positive-homogeneous (i.e., σ(c · x) = cσ(x) for all c ≥ 0). Then the empirical Rademacher complexity (37)
w.r.t. S := {xi}_{1≤i≤n} satisfies
R_S(F_L) ≤ max_i ‖xi‖_2 · 4 √L Π_{l=1}^{L} M_F(l) / √n.   (38)
The upper bound of the empirical Rademacher complexity (38) is in a similar vein to that of linear
functions with bounded norm, i.e., max_i ‖xi‖_2 M / √n, where √L Π_{l=1}^{L} M_F(l) plays the role of M in the
latter case. Moreover, ignoring the term √L, the upper bound (38) does not depend on the size of the
network in an explicit way if M_F(l) sharply concentrates around 1. This reveals that the capacity of the
10Such attempts have been made in the seminal work [13].
neural network is well-controlled, regardless of the number of parameters, as long as the Frobenius norm
of the parameters is bounded. Extensions to other norm constraints, e.g., spectral norm constraints, path
norm constraints have been considered by [89, 14, 74, 67, 34]. This line of work improves upon traditional
capacity analysis of neural networks in the over-parametrized setting, because the upper bounds derived
are often size-independent. Having said this, two important remarks are in order: (1) the upper bounds
(e.g., Π_{l=1}^{L} M_F(l)) involve implicit dependence on the size of the weight matrix and the depth of the neural
network, which is hard to characterize; (2) the upper bound on the Rademacher complexity offers a uniform
bound over all functions in the function class, which is a pure statistical result. However, it stays silent
about how and why standard training algorithms like SGD can obtain a function whose parameters have
small norms.
7.2 Algorithm-dependent controls
In this subsection, we bring computational thinking into statistics and investigate the role of algorithms in the
generalization power of deep learning. The consideration of algorithms is quite natural and well motivated:
(1) local/global minima reached by different algorithms can exhibit totally different generalization behaviors
due to extreme nonconvexity, which marks a huge difference from traditional models, (2) the effective capacity
of neural nets is possibly not large, since a particular algorithm does not explore the entire parameter space.
These demonstrate the fact that on top of the complexity of the function class, the inherent property of
the algorithm we use plays an important role in the generalization ability of deep learning. In what follows,
we survey three different ways to obtain upper bounds on the generalization errors by exploiting properties
of the algorithms.
7.2.1 Mean field view of neural nets
As we have emphasized, modern deep learning models are highly over-parametrized. A line of work [83, 117,
105, 25, 82, 61] approximates the ensemble of weights by an asymptotic limit as the number of hidden units
tends to infinity, so that the dynamics of SGD can be studied via certain partial differential equations.
More specifically, let f̂(x; θ) = N^{−1} Σ_{i=1}^{N} σ(θi⊤ x) be a function given by a one-hidden-layer neural net
with N hidden units, where σ(·) is the ReLU activation function and the parameters θ := [θ1, . . . , θN]⊤ ∈ R^{N×d}
are suitably randomly initialized. Consider the regression setting where we want to minimize the population
risk R_N(θ) = E[(y − f̂(x; θ))²] over the parameters θ. A key observation is that this population risk depends
on the parameters θ only through their empirical distribution, i.e., ρ̂^{(N)} = N^{−1} Σ_{i=1}^{N} δ_{θi}, where δ_{θi} is a point
mass at θi. This motivates us to express R_N(θ) equivalently as R(ρ̂^{(N)}), where R(·) is a functional
that maps distributions to real numbers. Running SGD for R_N(·), in a suitable scaling limit, results in
a gradient flow on the space of distributions endowed with the Wasserstein metric that minimizes R(·). It
turns out that the empirical distribution ρ̂_k^{(N)} of the parameters after k steps of SGD is well approximated
by the gradient flow, as long as the neural net is over-parametrized (i.e., N ≫ d) and the number of
steps is not too large. In particular, [83] have shown that under certain regularity conditions,
sup_{k∈[0,T/ε]∩N} | R(ρ̂_k^{(N)}) − R(ρ_{kε}) | ≲ e^T √(1/N ∨ ε) · √( d + log(N/ε) ),
where ε > 0 is a proxy for the step size of SGD and ρ_{kε} is the distribution of the gradient flow at time kε.
In words, the out-of-sample error under the iterate θ^k generated by SGD is well-approximated by that of ρ_{kε}. Viewing
the optimization problem from the distributional aspect greatly simplifies the problem conceptually, as
the complicated optimization problem is now passed into its limit version—for this reason, this analytical
approach is called the mean field perspective. In particular, [83] further demonstrated that in some simple
settings, the out-of-sample error R(ρ_{kε}) of the distributional limit can be fully characterized. Nevertheless,
how well R(ρ_{kε}) performs and how fast it converges remain largely open questions for general problems.
7.2.2 Stability
A second way to understand the generalization ability of deep learning is through the stability of SGD. An
algorithm is considered stable if a slight change of the input does not alter the output much. It has long been
observed that a stable algorithm has a small generalization gap; examples include k nearest neighbors [102,
29], bagging [18, 19], etc. The precise connection between stability and generalization gap is stated by [17,
113]. In what follows, we formalize the idea of stability and its connection with the generalization gap. Let
A denote an algorithm (possibly randomized) which takes a sample S := {(yi, xi)}_{1≤i≤n} of size n and returns
an estimated parameter θ̂ := A(S). Following [49], we have the following definition for stability.
Definition 2. An algorithm (possibly randomized) A is ε-uniformly stable with respect to the loss function
L(·, ·) if for all datasets S, S′ of size n which differ in at most one example, one has
sup_{x,y} E_A [ L(f(x; A(S)), y) − L(f(x; A(S′)), y) ] ≤ ε.
Here the expectation is taken w.r.t. the randomness in the algorithm A, and ε might depend on n. The loss
function L(·, ·) takes an example (say (x, y)) and the estimated parameter (say A(S)) as inputs and outputs
a real value.
Surprisingly, an ε-uniformly stable algorithm incurs small generalization gap in expectation, which is
stated in the following lemma.
Lemma 1 (Theorem 2.2 in [49]). Let A be ε-uniformly stable. Then the expected generalization gap is no
larger than ε, i.e.,
E_{A,S} [ (1/n) Σ_{i=1}^{n} L(f(xi; A(S)), yi) − E_{(x,y)∼D}[L(f(x; A(S)), y)] ] ≤ ε.   (39)
With Lemma 1 in hand, it suffices to prove stability bound on specific algorithms. It turns out that SGD
introduced in Section 6 is uniformly stable when solving smooth nonconvex functions.
Theorem 8 (Theorem 3.12 in [49]). Assume that for any fixed (y, x), the loss function L(f(x; θ), y), viewed
as a function of θ, is L-Lipschitz and β-smooth. Consider running SGD on the empirical loss function with
decaying step size αt ≤ c/t, where c is some small absolute constant. Then SGD is uniformly stable with
ε ≲ T^{1 − 1/(βc+1)} / n,
where we have ignored the dependency on β, c and L.
Theorem 8 reveals that SGD operating on nonconvex loss functions is indeed uniformly stable as long
as the number of steps T is not large compared with n. This together with Lemma 1 demonstrates the
generalization ability of SGD in expectation. Nevertheless, two important limitations are worth mentioning.
First, Lemma 1 provides an upper bound on the out-of-sample error in expectation, but ideally, instead of
an on-average guarantee under EA,S, we would like to have a high probability guarantee as in the convex
case [37]. Second, controlling the generalization gap alone is not enough to achieve a small out-of-sample
error, since it is unclear whether SGD can achieve a small training error within T steps.
7.2.3 Implicit regularization
In the presence of over-parametrization (number of parameters larger than the sample size), conventional
wisdom informs us that we should apply some regularization techniques (e.g., ℓ1 / ℓ2 regularization) so that
the model will not overfit the data. However, in practice, neural networks without explicit regularization
generalize well. This phenomenon motivates researchers to look at the regularization effects introduced by
training algorithms (e.g., SGD) in this over-parametrized regime. While there might exist multiple, if not
infinitely many, global minima of the empirical loss (26), it is possible that practical algorithms tend to converge to
solutions with better generalization power.
Take the underdetermined linear system Xθ = y as a starting point. Here X ∈ R^{n×p} and θ ∈ R^p with p
much larger than n. Running gradient descent on the loss (1/2)‖Xθ − y‖_2^2 from the origin (i.e., θ^0 = 0) results
in the solution with the minimum Euclidean norm, that is, GD converges to
min_{θ∈R^p} ‖θ‖_2   subject to   Xθ = y.
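A quick numerical check of this claim (a sketch with synthetic data; the minimum-ℓ2-norm solution is obtained via the pseudoinverse purely for comparison).

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 20, 100                                  # underdetermined: p much larger than n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta = np.zeros(p)                             # start gradient descent from the origin
eta = 5e-3
for _ in range(20000):
    theta -= eta * X.T @ (X @ theta - y)        # gradient of (1/2) ||X theta - y||_2^2

theta_min_norm = np.linalg.pinv(X) @ y          # minimum Euclidean norm solution of X theta = y
print(np.linalg.norm(X @ theta - y))            # the linear system is (approximately) solved
assert np.allclose(theta, theta_min_norm, atol=1e-6)   # ... and GD lands on the min-norm solution
```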
In words, without any ℓ2 regularization in the loss function, gradient descent automatically finds the solution
with the least ℓ2 norm. This phenomenon, often called implicit regularization, has not only been empirically
observed in training neural networks, but also been theoretically understood in some simplified cases,
e.g., logistic regression with separable data. In logistic regression, given a training set {(yi, xi)}_{1≤i≤n} with
xi ∈ R^p and yi ∈ {1, −1}, one aims to fit a logistic regression model by solving the following program:
min_{θ∈R^p}   (1/n) Σ_{i=1}^{n} ℓ(yi xi⊤ θ).   (40)
Here, ℓ(u) := log(1 + e^{−u}) denotes the logistic loss. Further assume that the data is separable, i.e., there
exists θ* ∈ R^p such that yi θ*⊤ xi > 0 for all i. Under this condition, the loss function (40) can be arbitrarily
close to zero for certain θ with ‖θ‖_2 → ∞. What happens when we minimize (40) using gradient descent?
[119] uncovers a striking phenomenon.
Theorem 9 (Theorem 3 in [119]). Consider the logistic regression (40) with separable data. If we run GD
θ^{t+1} = θ^t − η (1/n) Σ_{i=1}^{n} yi xi ℓ′(yi xi⊤ θ^t)
from any initialization θ^0 with an appropriate step size η > 0, then the normalized iterate converges to the
maximum ℓ2 margin direction. That is,
lim_{t→∞} θ^t / ‖θ^t‖_2 = θ̂ / ‖θ̂‖_2,   (41)
where θ̂ is the solution to the hard margin support vector machine:
θ̂ := argmin_{θ∈R^p} ‖θ‖_2,   subject to   yi xi⊤ θ ≥ 1 for all 1 ≤ i ≤ n.   (42)
The above theorem reveals that gradient descent, when solving logistic regression with separable data,
implicitly regularizes the iterates towards the ℓ2 max margin vector (cf. (41)), without any explicit regularization
as in (42). Similar results have been obtained by [62]. In addition, [47] studied algorithms other than
gradient descent and showed that coordinate descent produces a solution with the maximum ℓ1 margin.
Moving beyond logistic regression, which can be viewed as a one-layer neural net, the theoretical under-
standing of implicit regularization in deeper neural networks is still limited; see [48] for an illustration in
deep linear convolutional neural networks.
8 Discussion
Due to space limitations, we have omitted several important deep learning models; notable examples include
deep reinforcement learning [86], deep probabilistic graphical models [109], variational autoencoders [66],
transfer learning [133], etc. Apart from the modeling aspect, interesting theories on generative adversarial
networks [10, 11], recurrent neural networks [3], connections with kernel methods [59, 9] are also emerging.
We have also omitted the inverse-problem view of deep learning where the data are assumed to be generated
from a certain neural net and the goal is to recover the weights in the NN with as few examples as possible.
Various algorithms (e.g., GD with spectral initialization) have been shown to recover the weights successfully
in some simplified settings [136, 118, 42, 87, 23, 39].
In the end, we identify a few important directions for future research.
• New characterization of data distributions. The success of deep learning relies on its power of efficiently
representing complex functions relevant to real data. Comparatively, classical methods often have optimal
guarantees if a problem has a certain known structure, such as smoothness, sparsity, and low-rankness [121,
31, 20, 24], but they are insufficient for complex data such as images. How to characterize high-dimensional
real data in a way that frees us from known barriers, such as the curse of dimensionality, is an
interesting open question.
• Understanding various computational algorithms for deep learning. As we have emphasized throughout this
survey, computational algorithms (e.g., variants of SGD) play a vital role in the success of deep learning.
They allow fast training of deep neural nets and probably contribute towards the good generalization
behavior of deep learning in practice. Understanding these computational algorithms and devising better
ones are crucial components in understanding deep learning.
• Robustness. It has been well documented that DNNs are sensitive to small adversarial perturbations that
are indistinguishable to humans [124]. This raises serious safety issues once we deploy deep learning models
in applications such as self-driving cars, healthcare, etc. It is therefore crucial to refine current training
practice to enhance robustness in a principled way [116].
• Low SNRs. Arguably, for image data and audio data where the signal-to-noise ratio (SNR) is high, deep
learning has achieved great success. In many other statistical problems, the SNR may be very low. For
example, in financial applications, the firm characteristic and covariates may only explain a small part of
the financial returns; in healthcare systems, the uncertainty of an illness may not be predicted well from
a patient’s medical history. How to adapt deep learning models to excel at such tasks is an interesting
direction to pursue.
Acknowledgements
J. Fan is supported in part by the NSF grants DMS-1712591 and DMS-1662139, the NIH grant R01-
GM072611 and the ONR grant N00014-19-1-2120. We thank Ruying Bao, Yuxin Chen, Chenxi Liu, Weijie
Su, Qingcan Wang and Pengkun Yang for helpful comments and discussions.
References
[1] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.
[2] Reza Abbasi-Asl, Yuansi Chen, Adam Bloniarz, Michael Oliver, Ben DB Willmore, Jack L Gallant,
and Bin Yu. The deeptune framework for modeling and characterizing neurons in visual cortex area
v4. bioRxiv, page 465534, 2018.
[3] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD Learn Recurrent Neural Networks with Provable Gener-
alization? ArXiv e-prints, abs/1902.01028, 2019.
[4] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-
parameterization. arXiv preprint arXiv:1811.03962, 2018.
[5] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. cambridge
university press, 2009.
[6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks.
70:214–223, 06–11 Aug 2017.
[7] Vladimir I Arnold. On functions of three variables. Collected Works: Representations of Functions,
Celestial Mechanics and KAM Theory, 1957–1965, pages 5–8, 2009.
[8] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cambridge University
Press, 2009.
[9] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of
optimization and generalization for overparameterized two-layer neural networks. arXiv preprint
arXiv:1901.08584, 2019.
[10] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in
generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 224–232. JMLR. org, 2017.
[11] Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs.
arXiv preprint arXiv:1806.10586, 2018.
[12] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Transactions on Information theory, 39(3):930–945, 1993.
[13] Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the
weights is more important than the size of the network. IEEE transactions on Information Theory,
44(2):525–536, 1998.
[14] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for
neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6240–6249. Curran
Associates, Inc., 2017.
[15] Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in
nonparametric regression. Technical report, 2017.
[16] Léon Bottou. Online learning and stochastic approximations. On-line learning in neural networks,
17(9):142, 1998.
[17] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning
research, 2(Mar):499–526, 2002.
[18] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[19] Leo Breiman et al. Heuristics of instability and stabilization in model selection. The annals of statistics,
24(6):2350–2383, 1996.
[20] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix comple-
tion. arXiv preprint arXiv:0903.1476, 2009.
[21] Chensi Cao, Feng Liu, Hai Tan, Deshou Song, Wenjie Shu, Weizhong Li, Yiming Zhou, Xiaochen Bo,
and Zhi Xie. Deep learning and its applications in biomedicine. Genomics, Proteomics & Bioinformatics,
16(1):17–32, 2018.
[22] Tianqi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential
equations. arXiv preprint arXiv:1806.07366, 2018.
[23] Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization:
Fast global convergence for nonconvex phase retrieval. Mathematical Programming, pages 1–33, 2019.
[24] Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Noisy matrix completion: Un-
derstanding statistical guarantees for convex relaxation via nonconvex optimization. arXiv preprint
arXiv:1902.07698, 2019.
[25] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized
models using optimal transport. In Advances in neural information processing systems, pages 3040–
3050, 2018.
[26] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078, 2014.
[27] R Dennis Cook et al. Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1):1–26,
2007.
[28] Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev,
Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, et al. Clinically
applicable deep learning for diagnosis and referral in retinal disease. Nature medicine, 24(9):1342, 2018.
[29] Luc Devroye and Terry Wagner. Distribution-free performance bounds for potential function rules.
IEEE Transactions on Information Theory, 25(5):601–604, 1979.
[30] David L Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS
math challenges lecture, 1(2000):32, 2000.
[31] David L Donoho and Jain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. biometrika,
81(3):425–455, 1994.
[32] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global
minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
[33] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[34] Weinan E, Chao Ma, and Qingcan Wang. A priori estimates of the population risk for residual networks.
arXiv preprint arXiv:1903.02154, 2019.
[35] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference
on Learning Theory, pages 907–940, 2016.
[36] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
[37] Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algo-
rithms with nearly optimal rate. arXiv preprint arXiv:1902.10710, 2019.
[38] Jerome H Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American
statistical Association, 76(376):817–823, 1981.
[39] Haoyu Fu, Yuejie Chi, and Yingbin Liang. Local geometry of one-hidden-layer neural networks for
logistic regression. arXiv preprint arXiv:1802.06463, 2018.
[40] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a
mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pages 267–
285. Springer, 1982.
[41] Chao Gao, Jiyi Liu, Yuan Yao, and Weizhi Zhu. Robust estimation and generative adversarial nets.
arXiv preprint arXiv:1810.02030, 2018.
[42] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping
patches. arXiv preprint arXiv:1802.02547, 2018.
[43] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural
networks. arXiv preprint arXiv:1712.06541, 2017.
[44] Gene H Golub and Charles F Van Loan. Matrix computations. JHU Press, 4 edition, 2013.
[45] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[46] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information pro-
cessing systems, pages 2672–2680, 2014.
[47] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms
of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.
[48] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on
linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9482–
9491, 2018.
[49] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochas-
tic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recogni-
tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778,
2016.
[51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual net-
works. In European conference on computer vision, pages 630–645. Springer, 2016.
[52] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture
6a overview of mini-batch gradient descent. 2012.
[53] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdi-
nov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint
arXiv:1207.0580, 2012.
[54] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–
1780, 1997.
[55] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected con-
volutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4700–4708, 2017.
[56] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture
in the cat’s visual cortex. The Journal of physiology, 160(1):106–154, 1962.
[57] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt
Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5 mb model size.
arXiv preprint arXiv:1602.07360, 2016.
[58] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[59] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and gener-
alization in neural networks. In Advances in neural information processing systems, pages 8580–8589,
2018.
[60] Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating
stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017.
[61] Adel Javanmard, Marco Mondelli, and Andrea Montanari. Analysis of a two-layer neural network via
displacement convexity. arXiv preprint arXiv:1901.01375, 2019.
[62] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint
arXiv:1803.07300, 2018.
[63] Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insufficiency of ex-
isting momentum schemes for stochastic optimization. In 2018 Information Theory and Applications
Workshop (ITA), pages 1–9. IEEE, 2018.
[64] Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function.
The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[65] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
33
[66] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
[67] Jason M Klusowski and Andrew R Barron. Risk bounds for high-dimensional ridge function combina-
tions including neural networks. arXiv preprint arXiv:1607.01434, 2016.
[68] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo-
lutional neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.
[69] Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications,
volume 35. Springer Science  Business Media, 2003.
[70] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
[71] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[72] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape
of neural nets. In Advances in Neural Information Processing Systems, pages 6391–6401, 2018.
[73] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical
Association, 86(414):316–327, 1991.
[74] Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter generalization bound
for deep neural networks: Cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159, 2018.
[75] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International
Conference on Machine Learning, pages 1718–1727, 2015.
[76] Tengyuan Liang. How well can generative adversarial networks (GAN) learn densities: A nonparametric
view. arXiv preprint arXiv:1712.08244, 2017.
[77] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well?
Journal of Statistical Physics, 168(6):1223–1247, 2017.
[78] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[79] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network
acoustic models. In Proc. icml, volume 30, page 3, 2013.
[80] VE Maiorov and Ron Meir. On the near optimality of the stochastic approximation of smooth functions
by neural networks. Advances in Computational Mathematics, 13(1):79–103, 2000.
[81] Yuly Makovoz. Random approximants and neural networks. Journal of Approximation Theory,
85(1):98–109, 1996.
[82] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural
networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019.
[83] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer
neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
[84] Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning functions: when is deep better than
shallow. arXiv preprint arXiv:1603.00988, 2016.
[85] Hrushikesh N Mhaskar. Neural networks for optimal approximation of smooth and analytic functions.
Neural computation, 8(1):164–177, 1996.
[86] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control
through deep reinforcement learning. Nature, 518(7540):529, 2015.
34
[87] Marco Mondelli and Andrea Montanari. On the connection between learning two-layers neural networks
and tensor decomposition. arXiv preprint arXiv:1802.07301, 2018.
[88] Yurii E Nesterov. A method for solving the convex programming problem with convergence rate o
(1/kˆ 2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
[89] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural
networks. In Conference on Learning Theory, pages 1376–1401, 2015.
[90] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers
using variational divergence minimization. In Advances in Neural Information Processing Systems,
pages 271–279, 2016.
[91] Ian Parberry. Circuit complexity and neural networks. MIT press, 1994.
[92] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming
Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[93] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143–195,
1999.
[94] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and
when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International
Journal of Automation and Computing, 14(5):503–519, 2017.
[95] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Compu-
tational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[96] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM
Journal on Control and Optimization, 30(4):838–855, 1992.
[97] Boris Teodorovich Polyak and Yakov Zalmanovich Tsypkin. Adaptive estimation algorithms: conver-
gence, optimality, stability. Avtomatika i Telemekhanika, (3):71–84, 1979.
[98] Christopher Poultney, Sumit Chopra, Yann LeCun, et al. Efficient learning of sparse representations
with an energy-based model. In Advances in neural information processing systems, pages 1137–1144,
2007.
[99] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do cifar-10 classifiers
generalize to cifar-10? arXiv preprint arXiv:1806.00451, 2018.
[100] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. 2018.
[101] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical
Statistics, 22(3):400–407, 1951.
[102] William H Rogers and Terry J Wagner. A finite sample distribution-free performance bound for local
discrimination rules. The Annals of Statistics, pages 506–514, 1978.
[103] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions.
arXiv preprint arXiv:1705.05502, 2017.
[104] Yaniv Romano, Matteo Sesia, and Emmanuel J Candès. Deep knockoffs. arXiv preprint
arXiv:1811.06687, 2018.
[105] Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymp-
totic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint
arXiv:1805.00915, 2018.
35
[106] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by
error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science,
1985.
[107] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Ima-
geNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV),
115(3):211–252, 2015.
[108] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network
architectures for large scale acoustic modeling. In Fifteenth annual conference of the international
speech communication association, 2014.
[109] Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. In Artificial intelligence and
statistics, pages 448–455, 2009.
[110] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Im-
proved techniques for training GANs. In Advances in Neural Information Processing Systems, pages
2234–2242, 2016.
[111] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation
function. arXiv preprint arXiv:1708.06633, 2017.
[112] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms.
Cambridge university press, 2014.
[113] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and
uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.
[114] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without
human knowledge. Nature, 550(7676):354, 2017.
[115] Bernard W Silverman. Density estimation for statistics and data analysis. Chapman  Hall, CRC,
1998.
[116] Chandan Singh, W James Murdoch, and Bin Yu. Hierarchical interpretations for neural network
predictions. arXiv preprint arXiv:1806.05337, 2018.
[117] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv preprint
arXiv:1805.01053, 2018.
[118] Mahdi Soltanolkotabi. Learning relus via gradient descent. In Advances in Neural Information Pro-
cessing Systems, pages 2007–2017, 2017.
[119] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit
bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–
2878, 2018.
[120] David A Sprecher. On the structure of continuous functions of several variables. Transactions of the
American Mathematical Society, 115:340–355, 1965.
[121] Charles J Stone. Optimal global rates of convergence for nonparametric regression. The annals of
statistics, pages 1040–1053, 1982.
[122] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization
and momentum in deep learning. In International conference on machine learning, pages 1139–1147,
2013.
36
[123] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[124] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[125] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.
[126] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society: Series B (Methodological), 58(1):267–288, 1996.
[127] VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to
their probabilities. Theory of Probability  Its Applications, 16(2):264–280, 1971.
[128] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and com-
posing robust features with denoising autoencoders. In Proceedings of the 25th international conference
on Machine learning, pages 1096–1103. ACM, 2008.
[129] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances
in neural information processing systems, pages 351–359, 2013.
[130] E Weinan, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-
dimensional parabolic partial differential equations and backward stochastic differential equations.
Communications in Mathematics and Statistics, 5(4):349–380, 2017.
[131] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal
value of adaptive gradient methods in machine learning. In I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems 30, pages 4148–4158. Curran Associates, Inc., 2017.
[132] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation
system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,
2016.
[133] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep
neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
[134] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural
networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
[135] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep
learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[136] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees
for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 4140–4149. JMLR. org, 2017.
37
Top Deep Learning Interview Questions You Must Know
Kurt
Last updated on May 22, 2019
Deep Learning is one of the hottest topics of 2018-19, and for good reason. There have been so many advancements in the industry that machines and computer programs are starting to replace humans in some tasks. Artificial Intelligence is expected to create 2.3 million jobs by 2020, and to help you crack those job interviews I have come up with a set of Deep Learning Interview Questions. I have divided this article into two sections:
Basic Deep Learning Interview Questions
Advanced Deep Learning Interview Questions
Basic Deep Learning Interview Questions
Q1. Differentiate between AI, Machine Learning and Deep Learning.
Artificial Intelligence is a technique which enables machines to mimic human behavior.
Machine Learning is a subset of AI technique which uses statistical methods to enable machines to improve with experience.
Deep learning is a subset of ML which makes the computation of multi-layer neural networks feasible. It uses neural networks to simulate human-like decision making.
Q2. Do you think Deep Learning is Better than Machine Learning? If so, why?
Though traditional ML algorithms solve a lot of our cases, they are not useful while working with high-dimensional data, that is, where we have a large number of inputs and outputs. For example, in
the case of handwriting recognition, we have a very large amount of input data, with different types of inputs associated with different styles of handwriting.
The second major challenge is to tell the computer which features it should look for, ones that will play an important role in predicting the outcome, as well as how to achieve better accuracy while
doing so. Deep learning addresses both challenges by learning features automatically from the raw data.
Q3. What is Perceptron? And How does it Work?
If we focus on the structure of a biological neuron, it has dendrites which are used to receive inputs. These inputs are summed in the cell body and using the Axon it is passed on to the next
biological neuron as shown below.
Dendrite: Receives signals from other neurons
Cell Body: Sums all the inputs
Axon: It is used to transmit signals to the other cells
Similarly, a perceptron receives multiple inputs, applies various transformations and functions and provides an output. A Perceptron is a linear model used for binary classification. It models a
neuron which has a set of inputs, each of which is given a specific weight. The neuron computes some function on these weighted inputs and gives the output.
Q4. What is the role of weights and bias?
For a perceptron, there can be one more input called bias. While the weights determine the slope of the classifier line, the bias allows us to shift the line to the left or right. Normally the bias is treated
as another weighted input with a constant input value of 1.
Q5. What are activation functions?
An activation function translates inputs into outputs. It decides whether a neuron should be activated or not by calculating the weighted sum of the inputs and adding a bias to it. The
purpose of the activation function is to introduce non-linearity into the output of a neuron.
There can be many activation functions, for example (see the sketch in code after the list):
Linear or Identity
Unit or Binary Step
Sigmoid or Logistic
Tanh
ReLU
Softmax
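As a minimal NumPy sketch (the function names are ours, not from the original article), these activations can be written as:

import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes any real number into (-1, 1)
    return np.tanh(z)

def relu(z):
    # keeps positive values, zeroes out negative ones
    return np.maximum(0.0, z)

def softmax(z):
    # turns a vector of scores into probabilities that sum to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()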
Q6. Explain Learning of a Perceptron.
1. Initialize the weights and threshold.
2. Provide the input and calculate the output.
3. Update the weights.
4. Repeat steps 2 and 3.
The standard update rule is Wj(t+1) = Wj(t) + (d – y) · x, where:
Wj (t+1) – Updated Weight
Wj (t) – Old Weight
d – Desired Output
y – Actual Output
x – Input
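To make the rule concrete, here is a minimal NumPy sketch of perceptron learning on a toy AND problem (the data and variable names are illustrative assumptions, not from the article):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # toy inputs
d = np.array([0, 0, 0, 1])                       # desired outputs (AND gate)

w = np.zeros(2)   # weights
b = 0.0           # bias / threshold term

for epoch in range(10):
    for x_i, d_i in zip(X, d):
        y_i = 1 if np.dot(w, x_i) + b > 0 else 0   # step 2: compute the output
        w = w + (d_i - y_i) * x_i                  # step 3: update the weights
        b = b + (d_i - y_i)                        # update the bias the same way

print(w, b)   # a separating line for the AND problem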
Q7. What is the significance of a Cost/Loss function?
A cost function is a measure of the accuracy of the neural network with respect to a given training sample and expected output. It provides the performance of a neural network as a whole. In
deep learning, the goal is to minimize the cost function. For that, we use the concept of gradient descent.
Q8. What is gradient descent?
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
Stochastic Gradient Descent: Uses only a single training example to calculate the gradient and update parameters.
Batch Gradient Descent: Calculate the gradients for the whole dataset and perform just one update at each iteration.
Mini-batch Gradient Descent: Mini-batch gradient descent is a variation of stochastic gradient descent where, instead of a single training example, a mini-batch of samples is used. It's one of the most
popular optimization algorithms.
Q9. What are the benefits of mini-batch gradient descent?
It is more efficient compared to stochastic gradient descent.
It generalizes better by finding flat minima.
Mini-batches help to approximate the gradient of the entire training set, which helps us avoid local minima.
Q10.What are the steps for using a gradient descent algorithm?
Initialize weights and biases randomly.
Pass an input through the network and get values from the output layer.
Calculate the error between the actual value and the predicted value.
Go to each neuron which contributes to the error and then change its respective values to reduce the error.
Reiterate until you find the best weights of the network.
Q11. Create a Gradient Descent in python.
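The code from the original article is not reproduced here; the following is a minimal NumPy sketch of batch gradient descent fitting a one-variable linear model (all names and numbers are illustrative):

import numpy as np

# toy data generated from y = 2x + 1 plus a little noise
X = np.linspace(0, 1, 50)
y = 2 * X + 1 + 0.1 * np.random.randn(50)

w, b = 0.0, 0.0   # initialize weight and bias
lr = 0.1          # learning rate

for step in range(2000):
    y_hat = w * X + b               # forward pass
    error = y_hat - y               # difference between prediction and target
    dw = 2 * np.mean(error * X)     # gradient of the mean squared error w.r.t. w
    db = 2 * np.mean(error)         # gradient w.r.t. b
    w -= lr * dw                    # move against the gradient
    b -= lr * db

print(w, b)   # should end up close to 2 and 1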
Q12. What are the shortcomings of a single layer perceptron?
Well, there are two major problems:
Single-Layer Perceptrons cannot classify non-linearly separable data points.
Complex problems that involve a lot of parameters cannot be solved by Single-Layer Perceptrons.
Q13. What is a Multi-Layer Perceptron?
A multilayer perceptron (MLP) is a deep, artificial neural network. It is composed of more than one perceptron. They are composed of an input layer to receive the signal, an output layer that makes
a decision or prediction about the input, and in between those two, an arbitrary number of hidden layers that are the true computational engine of the MLP.
Q14. What are the different parts of a multi-layer perceptron?
Input Nodes: The Input nodes provide information from the outside world to the network and are together referred to as the “Input Layer”. No computation is performed in any of the Input
nodes – they just pass on the information to the hidden nodes.
Hidden Nodes: The Hidden nodes perform computations and transfer information from the input nodes to the output nodes. A collection of hidden nodes forms a “Hidden Layer”. While a network
will only have a single input layer and a single output layer, it can have zero or multiple Hidden Layers.
Output Nodes: The Output nodes are collectively referred to as the “Output Layer” and are responsible for computations and transferring information from the network to the outside world.
Q15. What Is Data Normalization And Why Do We Need It?
Data normalization is a very important preprocessing step, used to rescale values to fit in a specific range and to assure better convergence during backpropagation. In general, it boils down to
subtracting the mean from each data point and dividing by the standard deviation.
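As a short NumPy sketch of this standardization step (assuming a data matrix X with samples in rows and features in columns):

import numpy as np

X = 10 * np.random.rand(100, 3)       # toy data: 100 samples, 3 features
mean = X.mean(axis=0)                 # per-feature mean
std = X.std(axis=0)                   # per-feature standard deviation
X_normalized = (X - mean) / std       # zero mean, unit variance per feature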
These were some basic Deep Learning Interview Questions. Now, let’s move on to some advanced ones.
Advanced Interview Questions
Q16. Which is Better Deep Networks or Shallow ones? and Why?
Both networks, be they shallow or deep, are capable of approximating any function. But what matters is how precise that network is in terms of getting the results. A shallow network works with
only a few features, as it can't extract more. A deep network, in contrast, computes efficiently and works on many more features/parameters by stacking layers.
Q17. Why is Weight Initialization important in Neural Networks?
Weight initialization is one of the very important steps. A bad weight initialization can prevent a network from learning but good weight initialization helps in giving a quicker convergence and a
better overall error.
# Theano-style SGD update rule; assumes weights_hidden, weights_output,
# bias_hidden, bias_output and cost are already defined elsewhere as Theano
# shared variables / symbolic expressions.
import theano.tensor as T

params = [weights_hidden, weights_output, bias_hidden, bias_output]

def sgd(cost, params, lr=0.05):
    grads = T.grad(cost=cost, wrt=params)   # gradient of the cost w.r.t. each parameter
    updates = []
    for p, g in zip(params, grads):
        updates.append([p, p - g * lr])     # vanilla gradient descent step
    return updates

updates = sgd(cost, params)
Biases can be generally initialized to zero. The rule for setting the weights is to be close to zero without being too small.
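As a hedged illustration of this rule (not from the original article), one common scheme is Xavier/Glorot initialization, sketched here with NumPy:

import numpy as np

def xavier_init(n_in, n_out):
    # draw weights from a small range that scales with the layer sizes,
    # so they are close to zero without being too small
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

weights_hidden = xavier_init(784, 128)   # e.g. an input -> hidden layer
bias_hidden = np.zeros(128)              # biases initialized to zero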
Q18. What’s the difference between a feed-forward and a backpropagation neural network?
A Feed-Forward Neural Network is a type of Neural Network architecture where the connections are “fed forward”, i.e. do not form cycles. The term “Feed-Forward” is also used when you input
something at the input layer and it travels from input to hidden and from hidden to the output layer.
Backpropagation is a training algorithm consisting of 2 steps:
Feed-Forward the values.
Calculate the error and propagate it back to the earlier layers.
So to be precise, forward-propagation is part of the backpropagation algorithm but comes before back-propagating.
Q19. What are Hyperparameters? Name a few used in any Neural Network.
Hyperparameters are the variables which determine the network structure (e.g. the number of hidden units) and the variables which determine how the network is trained (e.g. the learning rate).
Hyperparameters are set before training. A few examples:
Number of Hidden Layers
Network Weight Initialization
Activation Function
Learning Rate
Momentum
Number of Epochs
Batch Size
Q20. Explain the different Hyperparameters related to Network and Training.
Network Hyperparameters
The number of Hidden Layers: Many hidden units within a layer, combined with regularization techniques, can increase accuracy. A smaller number of units may cause underfitting.
Network Weight Initialization: Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. Mostly uniform distribution is used.
Activation function: Activation functions are used to introduce nonlinearity to models, which allows deep learning models to learn nonlinear prediction boundaries.
Training Hyperparameters
Learning Rate: The learning rate defines how quickly a network updates its parameters. A low learning rate slows down the learning process but converges smoothly. A larger learning rate speeds up
the learning but may not converge.
Momentum: Momentum helps to know the direction of the next step with the knowledge of the previous steps. It helps to prevent oscillations. A typical choice of momentum is between 0.5 and 0.9.
The number of epochs: The number of epochs is the number of times the whole training data is shown to the network while training. Increase the number of epochs until the validation accuracy
starts decreasing even while training accuracy is increasing (overfitting).
Batch size: Mini-batch size is the number of sub-samples given to the network after which a parameter update happens. A good default for batch size might be 32; also try 64, 128, 256, and so
on.
Q21. What is Dropout?
Dropout is a regularization technique to avoid overfitting, thus increasing the generalizing power. Generally, we should use a small dropout value of 20%-50% of neurons, with 20% providing a good
starting point. A probability too low has minimal effect and a value too high results in under-learning by the network.
Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
Q22. In training a neural network, you notice that the loss does not decrease in the few starting epochs. What could be the reason?
The reasons for this could be:
The learning rate is low
Regularization parameter is high
Stuck at local minima
Q23. Name a few deep learning frameworks
TensorFlow
Caffe
The Microsoft Cognitive Toolkit/CNTK
Torch/PyTorch
MXNet
Chainer
Keras
Q24. What are Tensors?
Tensors are the de facto representation of data in deep learning. They are just multidimensional arrays that allow you to represent data having higher dimensions. In general, in Deep
Learning you deal with high-dimensional data sets, where dimensions refer to the different features present in the data set.
Q25. List a few advantages of TensorFlow?
It has platform flexibility
It is easily trainable on CPU as well as GPU for distributed computing.
TensorFlow has auto differentiation capabilities
It has advanced support for threads, asynchronous computation, and queues.
It is customizable and open source.
Q26. What is Computational Graph?
A computational graph is a series of TensorFlow operations arranged as nodes in the graph. Each node takes zero or more tensors as input and produces a tensor as output.
Basically, one can think of a Computational Graph as an alternative way of conceptualizing the mathematical calculations that take place in a TensorFlow program. The operations assigned to different
nodes of a Computational Graph can be performed in parallel, thus providing better performance in terms of computations.
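As a tiny illustrative sketch of the idea (plain Python, not TensorFlow's actual API), a computational graph is just nodes whose outputs feed other nodes:

class Node:
    def __init__(self, op, *inputs):
        self.op = op          # the operation this node performs
        self.inputs = inputs  # parent nodes or constant inputs

    def eval(self):
        # evaluate the parents first, then apply this node's operation
        values = [i.eval() if isinstance(i, Node) else i for i in self.inputs]
        return self.op(*values)

# graph for (a + b) * c with a=2, b=3, c=4
add_node = Node(lambda x, y: x + y, 2.0, 3.0)
mul_node = Node(lambda x, y: x * y, add_node, 4.0)
print(mul_node.eval())   # 20.0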
Q27. What is a CNN?
Convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. Unlike regular neural networks, where the input is a vector, here
the input is a multi-channel image. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.
Q28. Explain the different Layers of CNN.
There are four layer concepts we should understand in Convolutional Neural Networks (a minimal sketch in code follows the list):
Convolution: The convolution layer comprises a set of independent filters. All these filters are initialized randomly and become our parameters, which will be learned by the network
subsequently.
ReLu: This layer applies an element-wise non-linearity to the output of the convolutional layer.
Pooling: Its function is to progressively reduce the spatial size of the representation to reduce the number of parameters and computation in the network. Pooling layer operates on each feature
map independently.
Full Connectedness: Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed
with a matrix multiplication followed by a bias offset.
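A minimal sketch of these four layer types stacked together, assuming tf.keras is available (the layer sizes are arbitrary):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                            # pooling
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),                                 # fully connected layer
])
model.summary()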
Q29. What is an RNN?
Recurrent Networks are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, or numerical time series data.
Recurrent Neural Networks use the backpropagation algorithm for training. Because of their internal memory, RNNs are able to remember important things about the input they received, which
enables them to be very precise in predicting what's coming next.
Q30. What are some issues faced while training an RNN?
Recurrent Neural Networks use the backpropagation algorithm for training, but it is applied for every timestamp. It is commonly known as Back-propagation Through Time (BPTT).
There are some issues with Back-propagation such as:
Vanishing Gradient
Exploding Gradient
Q31. What is Vanishing Gradient? And how is this harmful?
When we do Back-propagation, the gradients tend to get smaller and smaller as we keep on moving backward in the Network. This means that the neurons in the Earlier layers learn very slowly as
compared to the neurons in the later layers in the Hierarchy.
Earlier layers in the Network are important because they are responsible for learning and detecting the simple patterns and are actually the building blocks of our Network.
Obviously, if they give improper and inaccurate results, then how can we expect the next layers and the complete Network to perform nicely and produce accurate results? The training process
takes too long and the prediction accuracy of the model will decrease.
Q32. What is the Exploding Gradient problem?
Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.
The gradient descent process works best when these updates are small and controlled. When the magnitudes of the gradients accumulate, an unstable network is likely to occur, which can cause poor
prediction results or even a model that reports nothing useful whatsoever.
Q33. Explain the importance of LSTM.
Long short-term memory (LSTM) is an artificial recurrent neural network architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback
connections that make it a "general purpose computer". It can not only process single data points, but also entire sequences of data.
They are a special kind of Recurrent Neural Networks which are capable of learning long-term dependencies.
Q34. What are capsules in Capsule Neural Network?
A capsule is a vector specifying the features of an object and its likelihood. These features can be any of the instantiation parameters like position, size, orientation, deformation, velocity, hue,
texture and much more.
A capsule can also specify its attributes like angle and size so that it can represent the same generic information. Now, just like a neural network has layers of neurons, a capsule network can have
layers of capsules.
Now, let's continue these Deep Learning Interview Questions and move to the section on autoencoders and RBMs.
Q35. Explain Autoencoders and their uses.
An autoencoder neural network is an unsupervised machine learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. Autoencoders are used to reduce
the size of our inputs into a smaller representation. If anyone needs the original data, they can reconstruct it from the compressed data.
Q36. In terms of Dimensionality Reduction, How does Autoencoder differ from PCAs?
An autoencoder can learn non-linear transformations with a non-linear activation function and multiple layers.
It doesn't have to learn dense layers; it can use convolutional layers instead, which is better for video, image and series data.
It is more efficient to learn several layers with an autoencoder rather than learn one huge transformation with PCA.
An autoencoder provides a representation of each layer as the output.
It can make use of pre-trained layers from another model to apply transfer learning to enhance the encoder/decoder.
Q37. Give some real-life examples where autoencoders can be applied.
Image Coloring: Autoencoders are used for converting any black and white picture into a colored image. Depending on what is in the picture, it is possible to tell what the color should be.
Feature variation: It extracts only the required features of an image and generates the output by removing any noise or unnecessary interruption.
Dimensionality Reduction: The reconstructed image is the same as our input but with reduced dimensions. It helps in providing a similar image with a reduced pixel value.
Denoising Image: The input seen by the autoencoder is not the raw input but a stochastically corrupted version. A denoising autoencoder is thus trained to reconstruct the original input from the
noisy version.
Q38. What are the different layers of Autoencoders?
An Autoencoder consists of three layers:
Encoder
Code
Decoder
Q39. Explain the architecture of an Autoencoder.
Encoder: This part of the network compresses the input into a latent space representation. The encoder layer encodes the input image as a compressed representation in a reduced dimension.
The compressed image is the distorted version of the original image.
Code: This part of the network represents the compressed input which is fed to the decoder.
Decoder: This layer decodes the encoded image back to the original dimension. The decoded image is a lossy reconstruction of the original image and it is reconstructed from the latent space
representation.
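A minimal sketch of this encoder–code–decoder structure using tf.keras dense layers (the dimensions are arbitrary assumptions):

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(784,))                        # e.g. a flattened 28x28 image
code = layers.Dense(32, activation='relu')(inputs)           # encoder -> code (bottleneck)
decoded = layers.Dense(784, activation='sigmoid')(code)      # decoder reconstructs the input

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# autoencoder.fit(x_train, x_train, ...)   # note: the targets are the inputs themselves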
Q40. What is a Bottleneck in autoencoder and why is it used?
The layer between the encoder and decoder, i.e. the code, is also known as the Bottleneck. This is a well-designed approach to decide which aspects of observed data are relevant information and what
aspects can be discarded.
It does this by balancing two criteria:
Compactness of representation, measured as the compressibility.
It retains some behaviourally relevant variables from the input.
Q41. Are there any variations of Autoencoders?
Convolution Autoencoders
Sparse Autoencoders
Deep Autoencoders
Contractive Autoencoders
Q42. What are Deep Autoencoders?
The extension of the simple Autoencoder is the Deep Autoencoder. The first layer of the Deep Autoencoder is used for first-order features in the raw input. The second layer is used for second-
order features corresponding to patterns in the appearance of first-order features. Deeper layers of the Deep Autoencoder tend to learn even higher-order features.
A deep autoencoder is composed of two, symmetrical deep-belief networks:
First four or five shallow layers representing the encoding half of the net.
The second set of four or five layers that make up the decoding half.
Q43. What is a Restricted Boltzmann Machine?
Restricted Boltzmann Machine is an undirected graphical model that plays a major role in Deep Learning Framework in recent times.
It is an algorithm which is useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling.
Q44. How Does RBM differ from Autoencoders?
Autoencoder is a simple 3-layer neural network where output units are directly connected back to input units. Typically, the number of hidden units is much less than the number of visible ones.
The task of training is to minimize an error or reconstruction, i.e. find the most efficient compact representation for input data.
RBM shares a similar idea, but it uses stochastic units with a particular distribution instead of deterministic units. The task of training is to find out how these two sets of variables (visible and hidden) are actually related to each other.
Math of Deep Learning Neural Networks – Simplified
(Part 2)
Roopam Upadhyay
The Math of Deep Learning Neural Networks – by Roopam
Welcome back to this series of articles on deep learning and neural networks. In the last part, you
learned how training a deep learning network is similar to a plumbing job. This time you will learn the
math of deep learning. We will continue to use the plumbing analogy to simplify the seemingly
complicated math. I believe you will find this highly intuitive. Moreover, understanding this will provide
you with a good idea about the inner workings of deep learning networks and artificial intelligence (AI)
to build an AI of your own. We will use the math of deep learning to make an image recognition AI in
the next part. But before that let’s create the links between…
The Math of Deep Learning and Plumbing
Last time we noticed that neural networks are like the networks of water pipes. The goal of neural
networks is to identify the right settings for the knobs (6 in this schematic) to get the right output given
the input.
Shown below is a familiar schematic of neural networks almost identical to the water pipelines above.
The only exception is the additional bias terms (b1, b2, and b3) added to the nodes.
In this post, we will solve this network to understand the math of deep learning. Note that a deep
learning model has multiple hidden layers, unlike this simple neural network. However, this simple
neural network can easily be generalized to the deep learning models. The math of deep learning
does not change a lot with additional complexity and hidden layers. Here, our objective is to identify
the values of the parameters {W (W1, …, W6) and b (b1, b2, and b3)}. We will soon use the
backpropagation algorithm along with gradient descent optimization to solve this network and
identify the optimal values of these weights.
Backpropagation and Gradient Descent
In the previous post, we discussed that the
backpropagation algorithm works similar to me shouting
back at my plumber while he was working in the duct.
Remember, I was telling the plumber about the difference
in actual water pressure from the expected. The plumber
of neural networks, unlike my building’s plumber, learns
from this information to optimize the positions of the knobs.
The method that the neural networks plumber uses to
iteratively correct the weights or settings of the knobs is called gradient descent.
We have discussed the gradient descent algorithm in an earlier post to solve a logistic regression
model. I recommend that you read that article to get a good grasp of the things we will discuss in this
post. Essentially, the idea is to iteratively correct the value of the weights (Wi) to produce the least
difference between the actual and the expected values of the output. This difference is measured
mathematically by the loss function, i.e. L. The weights (Wi and bi) are then iteratively improved using
the gradient of the loss function with respect to the weights, using this expression:

Wi := Wi − α · ∂L/∂Wi

Here, α is called the learning rate – it's a hyperparameter and stays constant. Hence, the overall
problem boils down to the identification of partial derivatives of the loss function with respect to the
weights, i.e. ∂L/∂Wi. For our problem, we just need to solve the partial derivatives for W5 and W1. The
partial derivatives for other weights can then be easily derived using the same method used for W5
and W1.
Before we solve these partial derivatives, let's do some more plumbing jobs and look at a tap to
develop intuitions about the results we will get from the gradient descent optimization.
Intuitive Math of Deep Learning for W5 – A Tap
We will use this simple tap to identify an optimal setting for its knob. In this process, we will develop
intuitions about gradient descent and the math of deep learning. Here, the input is the water coming
from the pipe on the left of the image. Moreover, the output is the water coming out of the tap. You
use the knob, on the top of the tap, to regulate the quantity of the output water given the
input. Remember, you want to turn the knob in such a way that you get the desired output (i.e the
quantity of water) to wash your hands. Keep in mind, the position of the knob is similar to the weight of
a neural networks’ parameters. Moreover, the input/output water is similar to the input/output
variables.
Essentially, in math terms, you are trying to identify how the position of the knob influences the
output water. The mathematical equation for the same is the derivative of the output with respect to
the knob position, i.e. ∂(output)/∂(knob).
If you understand the influence of the knob on the output
flow of water you can easily turn it to get the desired
output. Now, let’s develop an intuition about how much to
twist the knob. When you use a tap you twist the knob
until you get the right flow or the output. When the
difference between the desired output and the actual
output is large then you need a lot of twisting. On the other
hand, when the difference is less then you turn the knob
gently.
Moreover, the other factor on which your decision
depends on is the input from the left pipe. If there is no
water flowing from the left pipe then no matter how much
you twist the knob it won’t help. Essentially, your action
depends on these two factors.
Your decision to turn the knob depends on
Factor 1: Difference between the actual output and the desired output, and
Factor 2: Input from the grey pipe on the left
Soon you will get the same result by doing a seemingly complicated math for the gradient descent to
solve the neural network.
For our network, the output difference is the gap between the actual and the desired output, and the input is the signal flowing into that weight from the previous node. Hence, ∂L/∂W5 is the product of these two factors.
Disclaimer
Please note, to make the concepts easy for you to understand, I had taken a few liberties while
defining the factors in the previous section. I will make these factors much more theoretically
grounded at the end of this article when I will discuss the chain rule to solve derivatives. For now, I will
continue to take more liberties in the next section when I discuss the other weight modification for
other parameters of neural networks.
Add More Knobs to Solve W1 – Intuitive Math of Deep Learning
Neural networks, as discussed earlier, have several parameters (Ws and bs). To develop an intuition
about the math to estimate the other parameters further away from the output (i.e. W1), let's add
another knob to the tap.
Here, we have added a red regulator knob to the tap we saw in the earlier section. Now, the output
from the tap is governed by both these knobs. Referring to the neural network's image shown earlier,
the red knob is similar to the parameters (W1, W2, W3, W4, b1 and b2) added to the hidden layers. The
knob on top of the brass tap is like the parameters of the output layer (i.e. W5, W6, and b3).
Now, you are also using the red knob, in addition to the knob on the tap, to get the desired output
from the tap. Your effort of the red knob will depend on these factors.
Your decision to turn the red knob depends on
Factor 1: Difference between the actual and the desired final output,
Factor 2: Position / setting of the knob on the brass tap,
Factor 3: Change in input to the brass tap caused by the red knob, and
Factor 4: Input from the pipe on the left into the red knob
Here, as already discussed earlier, factor 1 is the output difference. W5 is the setting/weight for the knob of the
brass tap. Factor 3 is the change in the input to the brass tap caused by the red knob. Finally, the last
factor is the input, X1. Multiplying these factors together completes our equation for ∂L/∂W1.
Now, before we do the math to get these results, we just need to discuss the components of our
neural network in mathematical terms. We already know how it relates to the water pipelines
discussed earlier.
Let’s start with the nodes or the orange circles in the network diagram.
Nodes of Neural Networks
Here, these two networks are equivalent except the additional b or bias for the neural networks.
The node for the neural network has two components, i.e. sum and non-linear. The sum component
(Z1) is just a linear combination of the inputs and the weights, plus the bias.
The next term, i.e. non-linear, is the non-linear sigmoid activation function (σ). As discussed earlier, it
is like a regulator of a fan that keeps the value of the node's output between 0 and 1, or on/off.
The mathematical expression for this sigmoid activation function (σ) is:

σ(z) = 1 / (1 + e^(−z))

The nodes in both the hidden and output layer behave the same as described above. Now, the last
thing is to define the loss function (L), which measures the difference between the expected and
actual output. We will define the loss function for the most common business problems.
Classification Problem – Math of Deep Learning
In practice, most business problems are about classification. They have binary or categorical
outputs/answers such as:
Is the last credit card transaction fraudulent or not?
Will the borrower return the money or not?
Was the last email in your mailbox a spam or ham?
Is that a picture of a dog or cat? (this is not a business problem but a famous problem for deep
learning)
Is there an object in front of an autonomous car to generate a signal to apply the brake?
Will the person surfing the web respond to the ad of a luxury condo?
Hence, we will design the loss function of our neural network for similar binary outputs. This binary
loss function, aka binary cross entropy, can easily be extended for multiclass problems with minor
modifications.
Loss Function and Cross Entropy
The loss function for binary output problems, writing y for the actual label and ŷ for the network's predicted output, is:

L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]

This expression is also referred to as binary cross entropy. We can easily extend this binary cross-
entropy to multi-class entropy if the output has many classes such as images of dog, cat, bus, car etc.
We will learn about multiclass cross entropy and the softmax function in the next part of this series. Now
that we have identified all the components of the neural network, we are ready to solve it using the
chain rule of derivatives.
Chain Rule for W5 – Math of Deep Learning
We discussed the outcome for the change observed in the loss function (L) with respect to a change in
W5 earlier using the single-knob analogy: the answer is the output difference times the input flowing
into that weight. Now, let's derive the same thing using the chain rule of derivatives. Essentially, this is
similar to the change in water pressure observed at the output by turning the knob on the top of the
tap. Writing Z for the sum at the output node and h1, h2 for the outputs of the two hidden nodes, the
chain rule states this:

∂L/∂W5 = (∂L/∂ŷ) · (∂ŷ/∂Z) · (∂Z/∂W5)

The above equation for the chain rule is fairly simple, since the terms on the right-hand side cancel to
give the one on the left-hand side. More importantly, these equations suggest that the change in the
output is essentially the change observed at the different components of the pipeline because of
turning the knob.
Moreover, we already discussed the loss function, which is the binary cross entropy, i.e.
L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]
The first component of the chain rule is ∂L/∂ŷ, which is

∂L/∂ŷ = −y/ŷ + (1 − y)/(1 − ŷ)

This was fairly easy to compute if you only know that the derivative of the natural log function is
d ln(x)/dx = 1/x.
The second component of the chain rule is ∂ŷ/∂Z. This derivative of the sigmoid function (σ) is
slightly more complicated. You could find here a detailed solution to the derivative of the sigmoid
function. This implies

∂ŷ/∂Z = ŷ · (1 − ŷ)

Finally, the third component of the chain rule is again very easy to compute, i.e. ∂Z/∂W5 = h1,
since we know Z = W5·h1 + W6·h2 + b3.
Now, we just multiply these three components of the chain rule and we get the output, i.e.

∂L/∂W5 = (ŷ − y) · h1
Chain Rule for W1 – Math of Deep Learning
The chain rule for the red knob or the additional layer is just an extension of the chain rule of the knob
on the top of the tap. This one has a few more components because the water has to travel through
more components, i.e.

∂L/∂W1 = (∂L/∂ŷ) · (∂ŷ/∂Z) · (∂Z/∂h1) · (∂h1/∂Z1) · (∂Z1/∂W1)

The first two components are exactly the same as for the knob of the tap, i.e. W5. This makes sense since
the water is flowing through the same pipeline towards the end. Hence, we will calculate the third
component, which is ∂Z/∂h1 = W5.
The fourth component is the derivative of the sigmoid function, this time for the hidden node, i.e.
∂h1/∂Z1 = h1 · (1 − h1).
The fifth and the final component is again easy to calculate: ∂Z1/∂W1 = X1.
That's it. We now multiply these five components to get the result we have already seen for the
additional red knob: ∂L/∂W1 = (ŷ − y) · W5 · h1 · (1 − h1) · X1.
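Putting the pieces together, here is a minimal NumPy sketch of the forward pass and these chain-rule gradients for a 2-input, 2-hidden-node, 1-output network; the W1–W6, b1–b3 layout and the h1, h2 notation are our reading of the schematic, not code from the article:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one training example (X1, X2) with label y
x = np.array([0.5, 0.8])
y = 1.0

W_hidden = 0.1 * np.random.randn(2, 2)   # W1..W4
b_hidden = np.zeros(2)                   # b1, b2
W_out = 0.1 * np.random.randn(2)         # W5, W6
b_out = 0.0                              # b3

# forward pass
Z_hidden = W_hidden.T @ x + b_hidden     # Z1, Z2
h = sigmoid(Z_hidden)                    # hidden outputs h1, h2
Z_out = W_out @ h + b_out
y_hat = sigmoid(Z_out)

# backward pass: gradients of the binary cross entropy
dZ_out = y_hat - y                       # the "output difference"
dW_out = dZ_out * h                      # dL/dW5, dL/dW6 = (y_hat - y) * h
db_out = dZ_out                          # dL/db3
dh = dZ_out * W_out                      # back through the output weights
dZ_hidden = dh * h * (1 - h)             # times the sigmoid derivative
dW_hidden = np.outer(x, dZ_hidden)       # dL/dW1..W4 = input times local gradient
db_hidden = dZ_hidden                    # dL/db1, dL/db2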
Sign-off Node
This part of the series became a little math heavy. All this, however, will help us a lot when we will
build an artificial intelligence to recognize images. See you then.
CS 230 – Deep Learning
Super VIP Cheatsheet: Deep Learning
Afshine Amidi and Shervine Amidi
November 25, 2018
Contents
1 Convolutional Neural Networks
  1.1 Overview
  1.2 Types of layer
  1.3 Filter hyperparameters
  1.4 Tuning hyperparameters
  1.5 Commonly used activation functions
  1.6 Object detection
    1.6.1 Face verification and recognition
    1.6.2 Neural style transfer
    1.6.3 Architectures using computational tricks
2 Recurrent Neural Networks
  2.1 Overview
  2.2 Handling long term dependencies
  2.3 Learning word representation
    2.3.1 Motivation and notations
    2.3.2 Word embeddings
  2.4 Comparing words
  2.5 Language model
  2.6 Machine translation
  2.7 Attention
3 Deep Learning Tips and Tricks
  3.1 Data processing
  3.2 Training a neural network
    3.2.1 Definitions
    3.2.2 Finding optimal weights
  3.3 Parameter tuning
    3.3.1 Weights initialization
    3.3.2 Optimizing convergence
  3.4 Regularization
  3.5 Good practices
1 Convolutional Neural Networks
1.1 Overview
Architecture of a traditional CNN – Convolutional neural networks, also known as CNNs,
are a specific type of neural networks that are generally composed of convolution (CONV), pooling (POOL) and fully connected (FC) layers.
The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters
that are described in the next sections.
1.2 Types of layer
Convolutional layer (CONV) – The convolution layer (CONV) uses filters that perform
convolution operations as it is scanning the input I with respect to its dimensions. Its hyperpa-
rameters include the filter size F and stride S. The resulting output O is called feature map or
activation map.
Remark: the convolution step can be generalized to the 1D and 3D cases as well.
Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically applied
after a convolution layer, which does some spatial invariance. In particular, max and average
pooling are special kinds of pooling where the maximum and average value is taken, respectively.
Max pooling – each pooling operation selects the maximum value of the current view. It preserves detected features and is the most commonly used.
Average pooling – each pooling operation averages the values of the current view. It downsamples the feature map and is used in LeNet.
Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where
each input is connected to all neurons. If present, FC layers are usually found towards the end
of CNN architectures and can be used to optimize objectives such as class scores.
1.3 Filter hyperparameters
The convolution layer contains filters for which it is important to know the meaning behind its
hyperparameters.
Dimensions of a filter – A filter of size F × F applied to an input containing C channels is
a F × F × C volume that performs convolutions on an input of size I × I × C and produces an
output feature map (also called activation map) of size O × O × 1.
Remark: the application of K filters of size F × F results in an output feature map of size
O × O × K.
Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels
by which the window moves after each operation.
Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the
boundaries of the input. This value can either be manually specified or automatically set through
one of the three modes detailed below:
Valid: P = 0. No padding; drops the last convolution if dimensions do not match.
Same: Pstart = ⌊(S⌈I/S⌉ − I + F − S)/2⌋ and Pend = ⌈(S⌈I/S⌉ − I + F − S)/2⌉. Padding such that the feature map has size ⌈I/S⌉; the output size is mathematically convenient; also called 'half' padding.
Full: Pstart ∈ [[0, F − 1]] and Pend = F − 1. Maximum padding such that end convolutions are applied on the limits of the input; the filter 'sees' the input end-to-end.
1.4 Tuning hyperparameters
Parameter compatibility in convolution layer – By noting I the length of the input
volume size, F the length of the filter, P the amount of zero padding, S the stride, then the
output size O of the feature map along that dimension is given by:

O = (I − F + Pstart + Pend) / S + 1

Remark: oftentimes, Pstart = Pend = P, in which case we can replace Pstart + Pend by 2P in
the formula above.
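A small helper that applies this formula (a sketch, not part of the cheatsheet):

def conv_output_size(I, F, S, P_start=0, P_end=0):
    # O = (I - F + P_start + P_end) / S + 1
    return (I - F + P_start + P_end) // S + 1

print(conv_output_size(I=32, F=5, S=1, P_start=2, P_end=2))   # 32 ('same'-style padding)
print(conv_output_size(I=32, F=5, S=1))                       # 28 ('valid' padding)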
Understanding the complexity of the model – In order to assess the complexity of a
model, it is often useful to determine the number of parameters that its architecture will have.
In a given layer of a convolutional neural network, it is done as follows:
CONV: input size I × I × C, output size O × O × K, number of parameters (F × F × C + 1) · K. Remarks: one bias parameter per filter; in most cases S < F; a common choice for K is 2C.
POOL: input size I × I × C, output size O × O × C, number of parameters 0. Remarks: pooling operation done channel-wise; in most cases S = F.
FC: input size Nin, output size Nout, number of parameters (Nin + 1) × Nout. Remarks: input is flattened; one bias parameter per neuron; the number of FC neurons is free of structural constraints.
Receptive field – The receptive field at layer k is the area denoted Rk × Rk of the input
that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and
Si the stride value of layer i, and with the convention S0 = 1, the receptive field at layer k can
be computed with the formula:

Rk = 1 + Σ_{j=1}^{k} (Fj − 1) · Π_{i=0}^{j−1} Si

In the example below, we have F1 = F2 = 3 and S1 = S2 = 1, which gives R2 = 1 + 2·1 + 2·1 = 5.
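A short sketch that evaluates this formula for a stack of layers, given their filter sizes and strides (illustrative, not from the cheatsheet):

def receptive_field(filters, strides):
    # filters[j-1], strides[j-1] are F_j and S_j; S_0 = 1 by convention
    R = 1
    jump = 1                           # running product of the previous strides
    for F, S in zip(filters, strides):
        R += (F - 1) * jump
        jump *= S
    return R

print(receptive_field([3, 3], [1, 1]))   # 5, matching the example above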
1.5 Commonly used activation functions
r Rectified Linear Unit – The rectified linear unit layer (ReLU) is an activation function g
that is used on all elements of the volume. It aims at introducing non-linearities to the network.
Its variants are summarized in the table below:
- ReLU: g(z) = max(0, z). Non-linearity complexities biologically interpretable.
- Leaky ReLU: g(z) = max(εz, z) with ε ≪ 1. Addresses the dying ReLU issue for negative values.
- ELU: g(z) = max(α(e^z − 1), z) with α ≪ 1. Differentiable everywhere.
r Softmax – The softmax step can be seen as a generalized logistic function that takes as input
a vector of scores x ∈ Rn and outputs a vector of output probability p ∈ Rn through a softmax
function at the end of the architecture. It is defined as follows:
p = [p1, ..., pn]^T where pi = e^{xi} / Σ_{j=1..n} e^{xj}
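A minimal NumPy sketch of the softmax step (the max-shift for numerical stability is an implementation convention, not part of the definition above):

    import numpy as np

    def softmax(x):
        """Turn a score vector x into a probability vector p."""
        e = np.exp(x - np.max(x))  # shift by the max for numerical stability
        return e / e.sum()

    print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities summing to 1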
1.6 Object detection
r Types of models – There are 3 main types of object recognition algorithms, for which the
nature of what is predicted is different. They are described in the table below:
Image classification:
- Classifies a picture
- Predicts probability of object
- Example model: traditional CNN

Classification with localization:
- Detects an object in a picture
- Predicts probability of object and where it is located
- Example model: simplified YOLO, R-CNN

Detection:
- Detects up to several objects in a picture
- Predicts probabilities of objects and where they are located
- Example model: YOLO, R-CNN
r Detection – In the context of object detection, different methods are used depending on
whether we just want to locate the object or detect a more complex shape in the image. The
two main ones are summed up in the table below:
Bounding box detection:
- Detects the part of the image where the object is located
- Box of center (bx, by), height bh and width bw

Landmark detection:
- Detects a shape or characteristics of an object (e.g. eyes)
- More granular
- Reference points (l1x, l1y), ..., (lnx, lny)
r Intersection over Union – Intersection over Union, also known as IoU, is a function that
quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding
box Ba. It is defined as:
IoU(Bp, Ba) = (Bp ∩ Ba) / (Bp ∪ Ba)
Remark: we always have IoU ∈ [0,1]. By convention, a predicted bounding box Bp is considered
as being reasonably good if IoU(Bp,Ba) ⩾ 0.5.
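A minimal sketch of IoU in Python; it assumes boxes given as (x1, y1, x2, y2) corner coordinates rather than the center/height/width parametrization used above, which is only a convenience for this example:

    def iou(box_p, box_a):
        """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
        x1 = max(box_p[0], box_a[0]); y1 = max(box_p[1], box_a[1])
        x2 = min(box_p[2], box_a[2]); y2 = min(box_p[3], box_a[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        return inter / (area_p + area_a - inter)

    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.14, i.e. a poor overlap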
r Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes.
In practice, the network is allowed to predict more than one box simultaneously, where each box
prediction is constrained to have a given set of geometrical properties. For instance, the first
prediction can potentially be a rectangular box of a given form, while the second will be another
rectangular box of a different geometrical form.
r Non-max suppression – The non-max suppression technique aims at removing duplicate
overlapping bounding boxes of a same object by selecting the most representative ones. After
having removed all boxes having a probability prediction lower than 0.6, the following steps are
repeated while there are boxes remaining:
• Step 1: Pick the box with the largest prediction probability.
• Step 2: Discard any box having an IoU ⩾ 0.5 with the previous box.
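A minimal sketch of these two steps in Python, reusing the iou() helper sketched earlier; the thresholds 0.6 and 0.5 follow the text, everything else is an illustrative assumption:

    def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
        """Greedy non-max suppression over (x1, y1, x2, y2) boxes and their scores."""
        candidates = [(s, b) for s, b in zip(scores, boxes) if s >= score_thresh]
        candidates.sort(key=lambda sb: sb[0], reverse=True)
        kept = []
        while candidates:
            best_score, best_box = candidates.pop(0)      # Step 1: largest probability
            kept.append((best_score, best_box))
            candidates = [(s, b) for s, b in candidates   # Step 2: drop overlapping boxes
                          if iou(b, best_box) < iou_thresh]
        return kept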
r YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the
following steps:
• Step 1: Divide the input image into a G × G grid.
• Step 2: For each grid cell, run a CNN that predicts y of the following form:
y = [pc, bx, by, bh, bw, c1, c2, ..., cp, ...]^T ∈ R^{G×G×k×(5+p)}
(the block pc, bx, by, bh, bw, c1, ..., cp is repeated k times, once per anchor box)
where pc is the probability of detecting an object, bx, by, bh, bw are the properties of the detected bounding box, c1, ..., cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.
• Step 3: Run the non-max suppression algorithm to remove any potential duplicate over-
lapping bounding boxes.
Remark: when pc = 0, then the network does not detect any object. In that case, the corre-
sponding predictions bx, ..., cp have to be ignored.
r R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.
Remark: although the original algorithm is computationally expensive and slow, newer archi-
tectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.
1.6.1 Face verification and recognition
r Types of models – Two main types of model are summed up in the table below:
Face verification:
- Is this the correct person?
- One-to-one lookup
Face recognition:
- Is this one of the K persons in the database?
- One-to-many lookup
r One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited
training set to learn a similarity function that quantifies how different two given images are. The
similarity function applied to two images is often noted d(image 1, image 2).
r Siamese Network – Siamese Networks aim at learning how to encode images to then quantify
how different two images are. For a given input image x(i), the encoded output is often noted
as f(x(i)).
r Triplet loss – The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example belongs to another one. By calling α ∈ R+ the margin parameter, this loss is defined as follows:
ℓ(A,P,N) = max(d(A,P) − d(A,N) + α, 0)
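A minimal sketch of the triplet loss on precomputed embeddings; taking d as the squared Euclidean distance and α = 0.2 are assumptions for this example only:

    import numpy as np

    def triplet_loss(f_a, f_p, f_n, alpha=0.2):
        """Triplet loss on the embeddings of anchor, positive and negative images."""
        d_ap = np.sum((f_a - f_p) ** 2)  # distance anchor-positive
        d_an = np.sum((f_a - f_n) ** 2)  # distance anchor-negative
        return max(d_ap - d_an + alpha, 0.0)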
1.6.2 Neural style transfer
r Motivation – The goal of neural style transfer is to generate an image G based on a given
content C and a given style S.
r Activation – In a given layer l, the activation is noted a[l] and is of dimensions nH × nw × nc.
r Content cost function – The content cost function Jcontent(C,G) is used to determine how
the generated image G differs from the original content image C. It is defined as follows:
Jcontent(C,G) = (1/2) ||a[l](C) − a[l](G)||^2
r Style matrix – The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]_{kk'} quantifies how correlated the channels k and k' are. It is defined with respect to activations a[l] as follows:
G[l]_{kk'} = Σ_{i=1..nH[l]} Σ_{j=1..nw[l]} a[l]_{ijk} a[l]_{ijk'}
Remark: the style matrices of the style image and the generated image are noted G[l](S) and G[l](G) respectively.
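A minimal NumPy sketch of the style (Gram) matrix for an activation tensor of shape (nH, nw, nc); the function name style_matrix is ours:

    import numpy as np

    def style_matrix(a):
        """Gram matrix G[k, k'] = sum_{i,j} a[i, j, k] * a[i, j, k']."""
        n_H, n_w, n_c = a.shape
        flat = a.reshape(n_H * n_w, n_c)   # one column per channel
        return flat.T @ flat               # (n_c, n_c) matrix of channel correlations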
r Style cost function – The style cost function Jstyle(S,G) is used to determine how the
generated image G differs from the style S. It is defined as follows:
J[l]_style(S,G) = 1/(2 nH nw nc)^2 ||G[l](S) − G[l](G)||_F^2 = 1/(2 nH nw nc)^2 Σ_{k,k'=1..nc} (G[l](S)_{kk'} − G[l](G)_{kk'})^2
r Overall cost function – The overall cost function is defined as being a combination of the
content and style cost functions, weighted by parameters α,β, as follows:
J(G) = αJcontent(C,G) + βJstyle(S,G)
Remark: a higher value of α will make the model care more about the content while a higher
value of β will make it care more about the style.
1.6.3 Architectures using computational tricks
r Generative Adversarial Network – Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output possible, which is then fed into the discriminative model, whose aim is to differentiate the generated image from the true image.
Remark: use cases of GAN variants include text-to-image generation, and music generation and synthesis.
r ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a
high number of layers meant to decrease the training error. The residual block has the following
characterizing equation:
a[l+2] = g(a[l] + z[l+2])
r Inception Network – This architecture uses inception modules and aims at trying out different convolutions in order to increase its performance. In particular, it uses the 1 × 1 convolution trick to lower the computational burden.
⋆ ⋆ ⋆
2 Recurrent Neural Networks
2.1 Overview
r Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs,
are a class of neural networks that allow previous outputs to be used as inputs while having
hidden states. They are typically as follows:
For each timestep t, the activation a<t> and the output y<t> are expressed as follows:
a<t> = g1(Waa a<t−1> + Wax x<t> + ba)   and   y<t> = g2(Wya a<t> + by)
where Wax, Waa, Wya, ba, by are coefficients that are shared temporally and g1, g2 are activation functions.
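A minimal NumPy sketch of one RNN timestep; taking g1 = tanh and g2 = softmax is a common choice assumed here, and the function name rnn_step is ours:

    import numpy as np

    def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
        """One timestep of a vanilla RNN.
        Shapes: a_prev (n_a,), x_t (n_x,), W_aa (n_a, n_a), W_ax (n_a, n_x), W_ya (n_y, n_a)."""
        a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)  # new hidden state
        z = W_ya @ a_t + b_y
        e = np.exp(z - z.max())
        y_t = e / e.sum()                                # softmax output
        return a_t, y_t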
The pros and cons of a typical RNN architecture are summed up in the table below:
Advantages:
- Possibility of processing input of any length
- Model size not increasing with size of input
- Computation takes into account historical information
- Weights are shared across time
Drawbacks:
- Computation being slow
- Difficulty of accessing information from a long time ago
- Cannot consider any future input for the current state
r Applications of RNNs – RNN models are mostly used in the fields of natural language
processing and speech recognition. The different applications are summed up in the table below:
Type of RNN and typical application:
- One-to-one (Tx = Ty = 1): traditional neural network
- One-to-many (Tx = 1, Ty > 1): music generation
- Many-to-one (Tx > 1, Ty = 1): sentiment classification
- Many-to-many (Tx = Ty): named entity recognition
- Many-to-many (Tx ≠ Ty): machine translation
r Loss function – In the case of a recurrent neural network, the loss function L of all time
steps is defined based on the loss at every time step as follows:
L(ŷ, y) = Σ_{t=1..Ty} L(ŷ<t>, y<t>)
r Backpropagation through time – Backpropagation is done at each point in time. At
timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:
∂L(T)/∂W = Σ_{t=1..T} ∂L(T)/∂W |_(t)
2.2 Handling long term dependencies
r Commonly used activation functions – The most common activation functions used in
RNN modules are described below:
- Sigmoid: g(z) = 1 / (1 + e^{−z})
- Tanh: g(z) = (e^z − e^{−z}) / (e^z + e^{−z})
- ReLU: g(z) = max(0, z)
r Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They happen because it is difficult to capture long term dependencies: the multiplicative gradient can decrease or increase exponentially with the number of layers.
r Gradient clipping – It is a technique used to cope with the exploding gradient problem
sometimes encountered when performing backpropagation. By capping the maximum value for
the gradient, this phenomenon is controlled in practice.
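A minimal sketch of norm-based gradient clipping (the threshold 5.0 and the choice of clipping the norm rather than each entry are assumptions for illustration):

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        """Rescale the gradient so that its norm never exceeds max_norm."""
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad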
r Types of gates – In order to remedy the vanishing gradient problem, specific gates are used
in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and
are equal to:
Γ = σ(W x<t> + U a<t−1> + b)
where W, U, b are coefficients specific to the gate and σ is the sigmoid function. The main ones
are summed up in the table below:
Type of gate Role Used in
Update gate Γu How much past should matter now? GRU, LSTM
Relevance gate Γr Drop previous information? GRU, LSTM
Forget gate Γf Erase a cell or not? LSTM
Output gate Γo How much to reveal of a cell? LSTM
r GRU/LSTM – Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM)
deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being
a generalization of GRU. Below is a table summing up the characterizing equations of each
architecture:
Gated Recurrent Unit (GRU):
- c̃<t> = tanh(Wc[Γr ⋆ a<t−1>, x<t>] + bc)
- c<t> = Γu ⋆ c̃<t> + (1 − Γu) ⋆ c<t−1>
- a<t> = c<t>
Long Short-Term Memory (LSTM):
- c̃<t> = tanh(Wc[Γr ⋆ a<t−1>, x<t>] + bc)
- c<t> = Γu ⋆ c̃<t> + Γf ⋆ c<t−1>
- a<t> = Γo ⋆ c<t>
Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.
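A minimal NumPy sketch of one GRU timestep following the equations above; the weight names (Wu, Wr, Wc), the concatenation convention and the shapes are ours, not from any library:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(a_prev, x_t, Wu, Wr, Wc, bu, br, bc):
        """One GRU timestep; each weight matrix acts on the concatenation [a_prev, x_t]."""
        concat = np.concatenate([a_prev, x_t])
        gamma_u = sigmoid(Wu @ concat + bu)                          # update gate
        gamma_r = sigmoid(Wr @ concat + br)                          # relevance gate
        c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * a_prev, x_t]) + bc)
        c_t = gamma_u * c_tilde + (1 - gamma_u) * a_prev             # for a GRU, a<t> = c<t>
        return c_t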
r Variants of RNNs – The table below sums up the other commonly used RNN architectures:
- Bidirectional (BRNN)
- Deep (DRNN)
2.3 Learning word representation
In this section, we note V the vocabulary and |V | its size.
2.3.1 Motivation and notations
r Representation techniques – The two main ways of representing words are summed up in
the table below:
1-hot representation:
- Noted ow
- Naive approach, no similarity information
Word embedding:
- Noted ew
- Takes into account word similarity
r Embedding matrix – For a given word w, the embedding matrix E is a matrix that maps
its 1-hot representation ow to its embedding ew as follows:
ew = Eow
Remark: learning the embedding matrix can be done using target/context likelihood models.
2.3.2 Word embeddings
r Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the
likelihood that a given word is surrounded by other words. Popular models include skip-gram,
negative sampling and CBOW.
r Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word
embeddings by assessing the likelihood of any given target word t happening with a context
word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:
P(t|c) = exp(θt^T ec) / Σ_{j=1..|V|} exp(θj^T ec)
Remark: summing over the whole vocabulary in the denominator of the softmax part makes
this model computationally expensive. CBOW is another word2vec model using the surrounding
words to predict a given word.
r Negative sampling – It is a set of binary classifiers using logistic regression that aim at assessing how likely a given context word and a given target word are to appear together, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:
P(y = 1|c,t) = σ(θt^T ec)
Remark: this method is less computationally expensive than the skip-gram model.
r GloVe – The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:
J(θ) = (1/2) Σ_{i,j=1..|V|} f(Xij) (θi^T ej + bi + b'j − log(Xij))^2
where f is a weighting function such that Xi,j = 0 ⇒ f(Xi,j) = 0.
Given the symmetry that e and θ play in this model, the final word embedding e_w^(final) is given by:
e_w^(final) = (ew + θw) / 2
Remark: the individual components of the learned word embeddings are not necessarily inter-
pretable.
2.4 Comparing words
r Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows:
similarity = (w1 · w2) / (||w1|| ||w2||) = cos(θ)
Remark: θ is the angle between words w1 and w2.
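A minimal NumPy sketch of this similarity between two word vectors:

    import numpy as np

    def cosine_similarity(w1, w2):
        """Cosine similarity between two word vectors."""
        return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

    print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.707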
r t-SNE – t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at re-
ducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly
used to visualize word vectors in the 2D space.
2.5 Language model
r Overview – A language model aims at estimating the probability of a sentence P(y).
r n-gram model – This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearances in the training data.
r Perplexity – Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The lower the perplexity, the better; it is defined as follows:
PP = Π_{t=1..T} ( 1 / Σ_{j=1..|V|} y_j^(t) · ŷ_j^(t) )^(1/T)
Remark: PP is commonly used in t-SNE.
2.6 Machine translation
r Overview – A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:
y = arg max_{y<1>,...,y<Ty>} P(y<1>, ..., y<Ty> | x)
r Beam search – It is a heuristic search algorithm used in machine translation and speech
recognition to find the likeliest sentence y given an input x.
• Step 1: Find top B likely words y<1>
• Step 2: Compute conditional probabilities y<k>|x, y<1>, ..., y<k−1>
• Step 3: Keep top B combinations x, y<1>, ..., y<k>
Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.
r Beam width – The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory usage. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.
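A toy sketch of beam search in Python. The callable step_log_probs, which returns a dictionary of next-token log-probabilities given a prefix, is an assumption made only so the example is self-contained; in practice it would wrap the decoder RNN:

    def beam_search(step_log_probs, B=3, max_len=10, eos=None):
        """Keep the B best partial sentences at each step; with B = 1 this reduces to greedy search."""
        beams = [([], 0.0)]                      # (prefix, cumulative log-probability)
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                if eos is not None and prefix and prefix[-1] == eos:
                    candidates.append((prefix, score))        # finished sentence stays as-is
                    continue
                for token, logp in step_log_probs(prefix).items():
                    candidates.append((prefix + [token], score + logp))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]  # keep top B
        return beams[0][0]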
r Length normalization – In order to improve numerical stability, beam search is usually ap-
plied on the following normalized objective, often called the normalized log-likelihood objective,
defined as:
Objective = (1 / Ty^α) Σ_{t=1..Ty} log [ p(y<t> | x, y<1>, ..., y<t−1>) ]
Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.
r Error analysis – When obtaining a predicted translation ŷ that is bad, one can wonder why we did not get a good translation y* by performing the following error analysis:
Case P(y*|x) > P(ŷ|x):
- Root cause: beam search faulty
- Remedies: increase beam width
Case P(y*|x) ⩽ P(ŷ|x):
- Root cause: RNN faulty
- Remedies: try a different architecture, regularize, get more data
r Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine
translation is by computing a similarity score based on n-gram precision. It is defined as follows:
bleu score = exp( (1/n) Σ_{k=1..n} pk )
where pn is the bleu score on n-grams only, defined as follows:
pn = Σ_{n-gram ∈ ŷ} countclip(n-gram) / Σ_{n-gram ∈ ŷ} count(n-gram)
Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially
inflated bleu score.
2.7 Attention
r Attention model – This model allows an RNN to pay attention to specific parts of the input that are considered important, which improves the performance of the resulting model in practice. By noting α<t,t'> the amount of attention that the output y<t> should pay to the activation a<t'> and c<t> the context at time t, we have:
c<t> = Σ_{t'} α<t,t'> a<t'>   with   Σ_{t'} α<t,t'> = 1
Remark: the attention scores are commonly used in image captioning and machine translation.
r Attention weight – The amount of attention that the output y<t> should pay to the activation a<t'> is given by α<t,t'>, computed as follows:
α<t,t'> = exp(e<t,t'>) / Σ_{t''=1..Tx} exp(e<t,t''>)
Remark: computation complexity is quadratic with respect to Tx.
⋆ ⋆ ⋆
3 Deep Learning Tips and Tricks
3.1 Data processing
r Data augmentation – Deep learning models usually need a lot of data to be properly trained. It is often useful to generate more data from the existing data using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:
- Original: image without any modification
- Flip: flipped with respect to an axis for which the meaning of the image is preserved
- Rotation: rotation with a slight angle; simulates incorrect horizon calibration
- Random crop: random focus on one part of the image; several random crops can be done in a row
- Color shift: nuances of RGB are slightly changed; captures noise that can occur with light exposure
- Noise addition: addition of noise; more tolerance to quality variation of inputs
- Information loss: parts of image ignored; mimics potential loss of parts of image
- Contrast change: luminosity changes; controls difference in exposition due to time of day
r Batch normalization – It is a step with hyperparameters γ, β that normalizes the batch {xi}. By noting µB, σB^2 the mean and variance of the batch that we want to correct, it is done as follows:
xi ←− γ (xi − µB) / √(σB^2 + ε) + β
It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.
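A minimal NumPy sketch of the normalization step at training time (at inference time, running averages of the batch statistics would typically be used instead, which is not shown here):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """Normalize a mini-batch x of shape (batch_size, n_features)."""
        mu = x.mean(axis=0)                              # batch mean
        var = x.var(axis=0)                              # batch variance
        return gamma * (x - mu) / np.sqrt(var + eps) + beta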
3.2 Training a neural network
3.2.1 Definitions
r Epoch – In the context of training a model, epoch is a term used to refer to one iteration
where the model sees the whole training set to update its weights.
r Mini-batch gradient descent – During the training phase, updating weights is usually based neither on the whole training set at once, due to computation complexity, nor on a single data point, due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.
r Loss function – In order to quantify how a given model performs, the loss function L is
usually used to evaluate to what extent the actual outputs y are correctly predicted by the
model outputs z.
r Cross-entropy loss – In the context of binary classification in neural networks, the cross-
entropy loss L(z,y) is commonly used and is defined as follows:
L(z,y) = − [ y log(z) + (1 − y) log(1 − z) ]
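A minimal NumPy sketch of this loss; clipping the prediction away from 0 and 1 is only a numerical safeguard added for the example:

    import numpy as np

    def cross_entropy(z, y, eps=1e-12):
        """Binary cross-entropy between the predicted probability z and the label y in {0, 1}."""
        z = np.clip(z, eps, 1 - eps)   # avoid log(0)
        return -(y * np.log(z) + (1 - y) * np.log(1 - z))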
3.2.2 Finding optimal weights
r Backpropagation – Backpropagation is a method to update the weights in the neural network
by taking into account the actual output and the desired output. The derivative with respect
to each weight w is computed using the chain rule.
Using this method, each weight is updated with the rule:
w ←− w − α ∂L(z,y)/∂w
r Updating weights – In a neural network, weights are updated as follows:
• Step 1: Take a batch of training data and perform forward propagation to compute the
loss.
• Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight.
• Step 3: Use the gradients to update the weights of the network.
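A sketch of these three steps as a training loop; model.forward, model.backward and model.weights are hypothetical helpers introduced only for illustration, not a real library API:

    def train(model, batches, alpha=0.01):
        """Forward pass, backpropagation, then gradient descent update, batch by batch."""
        for x_batch, y_batch in batches:
            loss = model.forward(x_batch, y_batch)     # Step 1: forward propagation and loss
            grads = model.backward()                   # Step 2: backpropagate the loss
            for w, g in zip(model.weights, grads):     # Step 3: update the weights
                w -= alpha * g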
3.3 Parameter tuning
3.3.1 Weights initialization
r Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier initialization makes it possible to have initial weights that take into account characteristics that are unique to the architecture.
r Transfer learning – Training a deep learning model requires a lot of data and more impor-
tantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets
that took days/weeks to train, and leverage it towards our use case. Depending on how much
data we have at hand, here are the different ways to leverage this:
- Small training size: freeze all layers, train weights on the softmax
- Medium training size: freeze most layers, train weights on the last layers and the softmax
- Large training size: train weights on all layers and the softmax, initializing weights with the pre-trained ones
3.3.2 Optimizing convergence
r Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the
weights get updated. It can be fixed or adaptively changed. The current most popular method
is called Adam, which is a method that adapts the learning rate.
r Adaptive learning rates – Letting the learning rate vary when training a model can reduce the training time and improve the numerical solution. While the Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:
Momentum:
- Dampens oscillations; improvement to SGD; 2 parameters to tune
- Update of w: w ←− w − α v_dw
- Update of b: b ←− b − α v_db
RMSprop:
- Root Mean Square propagation; speeds up the learning algorithm by controlling oscillations
- Update of w: w ←− w − α dw / √s_dw
- Update of b: b ←− b − α db / √s_db
Adam:
- Adaptive Moment estimation; most popular method; 4 parameters to tune
- Update of w: w ←− w − α v_dw / (√s_dw + ε)
- Update of b: b ←− b − α v_db / (√s_db + ε)
Remark: other methods include Adadelta, Adagrad and SGD.
3.4 Regularization
r Dropout – Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p > 0. It forces the model to avoid relying too much on particular sets of features.
Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1 − p.
r Weight regularization – In order to make sure that the weights are not too large and that
the model is not overfitting the training set, regularization techniques are usually performed on
the model weights. The main ones are summed up in the table below:
LASSO:
- Shrinks coefficients to 0; good for variable selection
- Penalty: ... + λ||θ||_1, with λ ∈ R
Ridge:
- Makes coefficients smaller
- Penalty: ... + λ||θ||_2^2, with λ ∈ R
Elastic Net:
- Tradeoff between variable selection and small coefficients
- Penalty: ... + λ[(1 − α)||θ||_1 + α||θ||_2^2], with λ ∈ R, α ∈ [0,1]
r Early stopping – This regularization technique stops the training process as soon as the
validation loss reaches a plateau or starts to increase.
3.5 Good practices
r Overfitting small batch – When debugging a model, it is often useful to make quick tests
to see if there is any major issue with the architecture of the model itself. In particular, in order
to make sure that the model can be properly trained, a mini-batch is passed inside the network
to see if it can overfit on it. If it cannot, it means that the model is either too complex or not
complex enough to even overfit on a small batch, let alone a normal-sized training set.
r Gradient checking – Gradient checking is a method used during the implementation of
the backward pass of a neural network. It compares the value of the analytical gradient to the
numerical gradient at given points and plays the role of a sanity-check for correctness.
Numerical gradient:
- Formula: df/dx (x) ≈ (f(x + h) − f(x − h)) / (2h)
- Comments: expensive; the loss has to be computed two times per dimension; used to verify the correctness of the analytical implementation; trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)
Analytical gradient:
- Formula: df/dx (x) = f'(x)
- Comments: 'exact' result; direct computation; used in the final implementation
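A minimal NumPy sketch of the centered-difference check; the quadratic test function at the end is only an example, not part of the cheatsheet:

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        """Centered-difference estimate of df/dx at x, one coordinate at a time."""
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e.flat[i] = h
            grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
        return grad

    # Sanity-check an analytical gradient: for f(x) = sum(x**2), the gradient is 2x.
    x = np.array([1.0, -2.0, 3.0])
    assert np.allclose(numerical_gradient(lambda v: np.sum(v ** 2), x), 2 * x, atol=1e-4)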
⋆ ⋆ ⋆
Easy to learn deep learning guide - elementry

  • 29. A Selective Overview of Deep Learning Jianqing Fan∗ Cong Ma‡ Yiqiao Zhong∗ April 16, 2019 Abstract Deep learning has arguably achieved tremendous success in recent years. In simple words, deep learning uses the composition of many nonlinear functions to model the complex dependency between input features and labels. While neural networks have a long history, recent advances have greatly improved their performance in computer vision, natural language processing, etc. From the statistical and scientific perspective, it is natural to ask: What is deep learning? What are the new characteristics of deep learning, compared with classical methods? What are the theoretical foundations of deep learning? To answer these questions, we introduce common neural network models (e.g., convolutional neural nets, recurrent neural nets, generative adversarial nets) and training techniques (e.g., stochastic gradient descent, dropout, batch normalization) from a statistical point of view. Along the way, we highlight new characteristics of deep learning (including depth and over-parametrization) and explain their practical and theoretical benefits. We also sample recent results on theories of deep learning, many of which are only suggestive. While a complete understanding of deep learning remains elusive, we hope that our perspectives and discussions serve as a stimulus for new statistical research. Keywords: neural networks, over-parametrization, stochastic gradient descent, approximation theory, gen- eralization error. Contents 1 Introduction 2 1.1 Intriguing new characteristics of deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Towards theory of deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Roadmap of the paper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Feed-forward neural networks 5 2.1 Model setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Back-propagation in computational graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 Popular models 8 3.1 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.2 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.3 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4 Deep unsupervised learning 14 4.1 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.2 Generative adversarial networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 5 Representation power: approximation theory 17 5.1 Universal approximation theory for shallow NNs . . . . . . . . . . . . . . . . . . . . . . . . . 18 5.2 Approximation theory for multi-layer NNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Author names are sorted alphabetically. ∗Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; Email: {jqfan, congm, yiqiaoz}@princeton.edu. 1 arXiv:1904.05526v2 [stat.ML] 15 Apr 2019
  • 30. 6 Training deep neural nets 20 6.1 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 6.2 Easing numerical instability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 6.3 Regularization techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7 Generalization power 25 7.1 Algorithm-independent controls: uniform convergence . . . . . . . . . . . . . . . . . . . . . . 25 7.2 Algorithm-dependent controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 8 Discussion 29 1 Introduction Modern machine learning and statistics deal with the problem of learning from data: given a training dataset {(yi, xi)}1≤i≤n where xi ∈ Rd is the input and yi ∈ R is the output1 , one seeks a function f : Rd 7→ R from a certain function class F that has good prediction performance on test data. This problem is of fundamental significance and finds applications in numerous scenarios. For instance, in image recognition, the input x (reps. the output y) corresponds to the raw image (reps. its category) and the goal is to find a mapping f(·) that can classify future images accurately. Decades of research efforts in statistical machine learning have been devoted to developing methods to find f(·) efficiently with provable guarantees. Prominent examples include linear classifiers (e.g., linear / logistic regression, linear discriminant analysis), kernel methods (e.g., support vector machines), tree-based methods (e.g., decision trees, random forests), nonparametric regression (e.g., nearest neighbors, local kernel smoothing), etc. Roughly speaking, each aforementioned method corresponds to a different function class F from which the final classifier f(·) is chosen. Deep learning [70], in its simplest form, proposes the following compositional function class: f(x; θ) = WLσL(WL−1 · · · σ2(W2σ1(W1x))) θ = {W1, . . . , WL} . (1) Here, for each 1 ≤ l ≤ L, σ`(·) is some nonlinear function, and θ = {W1, . . . , WL} consists of matrices with appropriate sizes. Though simple, deep learning has made significant progress towards addressing the problem of learning from data over the past decade. Specifically, it has performed close to or better than humans in various important tasks in artificial intelligence, including image recognition [50], game playing [114], and machine translation [132]. Owing to its great promise, the impact of deep learning is also growing rapidly in areas beyond artificial intelligence; examples include statistics [15, 111, 76, 104, 41], applied mathematics [130, 22], clinical research [28], etc. Table 1: Winning models for ILSVRC image classification challenge. Model Year # Layers # Params Top-5 error Shallow 2012 — — 25% AlexNet 2012 8 61M 16.4% VGG19 2014 19 144M 7.3% GoogleNet 2014 22 7M 6.7% ResNet-152 2015 152 60M 3.6% To get a better idea of the success of deep learning, let us take the ImageNet Challenge [107] (also known as ILSVRC) as an example. In the classification task, one is given a training dataset consisting of 1.2 million color images with 1000 categories, and the goal is to classify images based on the input pixels. The performance of a classifier is then evaluated on a test dataset of 100 thousand images, and in the end the top-5 error2 is reported. Table 1 highlights a few popular models and their corresponding performance. As 1When the label y is given, this problem is often known as supervised learning. 
We mainly focus on this paradigm throughout this paper and remark sparingly on its counterpart, unsupervised learning, where y is not given. 2The algorithm makes an error if the true label is not contained in the 5 predictions made by the algorithm. 2
  • 31. Figure 1: Visualization of trained filters in the first layer of AlexNet. The model is pre-trained on ImageNet and is downloadable via PyTorch package torchvision.models. Each filter contains 11×11×3 parameters and is shown as an RGB color map of size 11 × 11. can be seen, deep learning models (the second to the last rows) have a clear edge over shallow models (the first row) that fit linear models / tree-based models on handcrafted features. This significant improvement raises a foundational question: Why is deep learning better than classical methods on tasks like image recognition? 1.1 Intriguing new characteristics of deep learning It is widely acknowledged that two indispensable factors contribute to the success of deep learning, namely (1) huge datasets that often contain millions of samples and (2) immense computing power resulting from clusters of graphics processing units (GPUs). Admittedly, these resources are only recently available: the latter allows to train larger neural networks which reduces biases and the former enables variance reduction. However, these two alone are not sufficient to explain the mystery of deep learning due to some of its “dreadful” characteristics: (1) over-parametrization: the number of parameters in state-of-the-art deep learning models is often much larger than the sample size (see Table 1), which gives them the potential to overfit the training data, and (2) nonconvexity: even with the help of GPUs, training deep learning models is still NP-hard [8] in the worst case due to the highly nonconvex loss function to minimize. In reality, these characteristics are far from nightmares. This sharp difference motivates us to take a closer look at the salient features of deep learning, which we single out a few below. 1.1.1 Depth Deep learning expresses complicated nonlinearity through composing many nonlinear functions; see (1). The rationale for this multilayer structure is that, in many real-world datasets such as images, there are different levels of features and lower-level features are building blocks of higher-level ones. See [134] for a visualization of trained features of convolutional neural nets; here in Figure 1, we sample and visualize weights from a pre-trained AlexNet model. This intuition is also supported by empirical results from physiology and neuroscience [56, 2]. The use of function composition marks a sharp difference from traditional statistical methods such as projection pursuit models [38] and multi-index models [73, 27]. It is often observed that depth helps efficiently extract features that are representative of a dataset. In comparison, increasing width (e.g., number of basis functions) in a shallow model leads to less improvement. This suggests that deep learning models excel at representing a very different function space that is suitable for complex datasets. 1.1.2 Algorithmic regularization The statistical performance of neural networks (e.g., test accuracy) depends heavily on the particular opti- mization algorithms used for training [131]. This is very different from many classical statistical problems, where the related optimization problems are less complicated. For instance, when the associated optimization 3
  • 32. (a) MNIST images (b) training and test accuracies Figure 2: (a) shows the images in the public dataset MNIST; and (b) depicts the training and test accuracies along the training dynamics. Note that the training accuracy is approaching 100% and the test accuracy is still high (no overfitting). problem has a relatively simple structure (e.g., convex objective functions, linear constraints), the solution to the optimization problem can often be unambiguously computed and analyzed. However, in deep neural networks, due to over-parametrization, there are usually many local minima with different statistical perfor- mance [72]. Nevertheless, common practice runs stochastic gradient descent with random initialization and finds model parameters with very good prediction accuracy. 1.1.3 Implicit prior learning It is well observed that deep neural networks trained with only the raw inputs (e.g., pixels of images) can provide a useful representation of the data. This means that after training, the units of deep neural networks can represent features such as edges, corners, wheels, eyes, etc.; see [134]. Importantly, the training process is automatic in the sense that no human knowledge is involved (other than hyper-parameter tuning). This is very different from traditional methods, where algorithms are designed after structural assumptions are posited. It is likely that training an over-parametrized model efficiently learns and incorporates the prior distribution p(x) of the input, even though deep learning models are themselves discriminative models. With automatic representation of the prior distribution, deep learning typically performs well on similar datasets (but not very different ones) via transfer learning. 1.2 Towards theory of deep learning Despite the empirical success, theoretical support for deep learning is still in its infancy. Setting the stage, for any classifier f, denote by E(f) the expected risk on fresh sample (a.k.a. test error, prediction error or generalization error), and by En(f) the empirical risk / training error averaged over a training dataset. Arguably, the key theoretical question in deep learning is why is E( ˆ fn) small, where ˆ fn is the classifier returned by the training algorithm? We follow the conventional approximation-estimation decomposition (sometimes, also bias-variance trade- off) to decompose the term E( ˆ fn) into two parts. Let F be the function space expressible by a family of neural nets. Define f∗ = argminf E(f) to be the best possible classifier and f∗ F = argminf∈F E(f) to be the best classifier in F. Then, we can decompose the excess error E , E( ˆ fn) − E(f∗ ) into two parts: E = E(f∗ F ) − E(f∗ ) | {z } approximation error + E( ˆ fn) − E(f∗ F ) | {z } estimation error . (2) Both errors can be small for deep learning (cf. Figure 2), which we explain below. 4
  • 33. • The approximation error is determined by the function class F. Intuitively, the larger the class, the smaller the approximation error. Deep learning models use many layers of nonlinear functions (Figure 3)that can drive this error small. Indeed, in Section 5, we provide recent theoretical progress of its representation power. For example, deep models allow efficient representation of interactions among variable while shallow models cannot. • The estimation error reflects the generalization power, which is influenced by both the complexity of the function class F and the properties of the training algorithms. Interestingly, for over-parametrized deep neural nets, stochastic gradient descent typically results in a near-zero training error (i.e., En( ˆ fn) ≈ 0; see e.g. left panel of Figure 2). Moreover, its generalization error E( ˆ fn) remains small or moderate. This “counterintuitive” behavior suggests that for over-parametrized models, gradient-based algorithms enjoy benign statistical properties; we shall see in Section 7 that gradient descent enjoys implicit regularization in the over-parametrized regime even without explicit regularization (e.g., `2 regularization). The above two points lead to the following heuristic explanation of the success of deep learning models. The large depth of deep neural nets and heavy over-parametrization lead to small or zero training errors, even when running simple algorithms with moderate number of iterations. In addition, these simple algorithms with moderate number of steps do not explore the entire function space and thus have limited complexities, which results in small generalization error with a large sample size. Thus, by combining the two aspects, it explains heuristically that the test error is also small. 1.3 Roadmap of the paper We first introduce basic deep learning models in Sections 2–4, and then examine their representation power via the lens of approximation theory in Section 5. Section 6 is devoted to training algorithms and their ability of driving the training error small. Then we sample recent theoretical progress towards demystifying the generalization power of deep learning in Section 7. Along the way, we provide our own perspectives, and at the end we identify a few interesting questions for future research in Section 8. The goal of this paper is to present suggestive methods and results, rather than giving conclusive arguments (which is currently unlikely) or a comprehensive survey. We hope that our discussion serves as a stimulus for new statistics research. 2 Feed-forward neural networks Before introducing the vanilla feed-forward neural nets, let us set up necessary notations for the rest of this section. We focus primarily on classification problems, as regression problems can be addressed similarly. Given the training dataset {(yi, xi)}1≤i≤n where yi ∈ [K] , {1, 2, . . . , K} and xi ∈ Rd are independent across i ∈ [n], supervised learning aims at finding a (possibly random) function ˆ f(x) that predicts the outcome y for a new input x, assuming (y, x) follows the same distribution as (yi, xi). In the terminology of machine learning, the input xi is often called the feature, the output yi called the label, and the pair (yi, xi) is an example. The function ˆ f is called the classifier, and estimation of ˆ f is training or learning. The performance of ˆ f is evaluated through the prediction error P(y 6= ˆ f(x)), which can be often estimated from a separate test dataset. 
As with classical statistical estimation, for each k ∈ [K], a classifier approximates the conditional prob- ability P(y = k|x) using a function fk(x; θk) parametrized by θk. Then the category with the highest probability is predicted. Thus, learning is essentially estimating the parameters θk. In statistics, one of the most popular methods is (multinomial) logistic regression, which stipulates a specific form for the functions fk(x; θk): let zk = x βk + αk and fk(x; θk) = Z−1 exp(zk) where Z = PK k=1 exp(zk) is a normalization factor to make {fk(x; θk)}1≤k≤K a valid probability distribution. It is clear that logistic regression induces linear decision boundaries in Rd , and hence it is restrictive in modeling nonlinear dependency between y and x. The deep neural networks we introduce below provide a flexible framework for modeling nonlinearity in a fairly general way. 5
  • 34. hidden layer input layer output layer hidden layer input layer output layer hidden layer input layer output layer hidden layer input layer output layer n layer input layer output layer x y W y en layer input layer output layer x y W y en layer input layer output layer x y W y n layer input layer output layer x y W y n layer input layer output layer x y W y hidden layer input layer output layer x y W y hidden layer input layer output layer x y W y hidden layer input layer output layer x y W y hidden layer input layer output layer x y W y hidden layer input layer output layer x y W y Figure 3: A feed-forward neural network with an input layer, two hidden layers and an output layer. The input layer represents raw features {xi}1≤i≤n. Both hidden layers compute an affine transform (a.k.s. indices) of the input and then apply an element-wise activation function σ(·). Finally, the output returns a linear transform followed by the softmax activation (resp. simply a linear transform) of the hidden layers for the classification (resp. regression) problem. 2.1 Model setup From the high level, deep neural networks (DNNs) use composition of a series of simple nonlinear functions to model nonlinearity h(L) = g(L) ◦ g(L−1) ◦ . . . ◦ g(1) (x), where ◦ denotes composition of two functions and L is the number of hidden layers, and is usually called depth of a NN model. Letting h(0) , x, one can recursively define h(l) = g(l) h(l−1) for all ` = 1, 2, . . . , L. The feed-forward neural networks, also called the multilayer perceptrons (MLPs), are neural nets with a specific choice of g(l) : for ` = 1, . . . , L, define h(`) = g(l) h(l−1) , σ W(`) h(`−1) + b(`) , (3) where W(l) and b(l) are the weight matrix and the bias / intercept, respectively, associated with the l-th layer, and σ(·) is usually a simple given (known) nonlinear function called the activation function. In words, in each layer `, the input vector h(`−1) goes through an affine transformation first and then passes through a fixed nonlinear function σ(·). See Figure 3 for an illustration of a simple MLP with two hidden layers. The activation function σ(·) is usually applied element-wise, and a popular choice is the ReLU (Rectified Linear Unit) function: [σ(z)]j = max{zj, 0}. (4) Other choices of activation functions include leaky ReLU, tanh function [79] and the classical sigmoid function (1 + e−z )−1 , which is less used now. Given an output h(L) from the final hidden layer and a label y, we can define a loss function to minimize. A common loss function for classification problems is the multinomial logistic loss. Using the terminology of deep learning, we say that h(L) goes through an affine transformation and then the soft-max function: fk(x; θ) , exp(zk) P k exp(zk) , ∀ k ∈ [K], where z = W(L+1) h(L) + b(L+1) ∈ RK . Then the loss is defined to be the cross-entropy between the label y (in the form of an indicator vector) and the score vector (f1(x; θ), . . . , fK(x; θ)) , which is exactly the negative log-likelihood of the multinomial logistic regression model: L(f(x; θ), y) = − K X k=1 1{y = k} log pk, (5) 6
  • 35. where θ , {W(`) , b(`) : 1 ≤ ` ≤ L + 1}. As a final remark, the number of parameters scales with both the depth L and the width (i.e., the dimensionality of W(`) ), and hence it can be quite large for deep neural nets. 2.2 Back-propagation in computational graphs Training neural networks follows the empirical risk minimization paradigm that minimizes the loss (e.g., (5)) over all the training data. This minimization is usually done via stochastic gradient descent (SGD). In a way similar to gradient descent, SGD starts from a certain initial value θ0 and then iteratively updates the parameters θt by moving it in the direction of the negative gradient. The difference is that, in each update, a small subsample B ⊂ [n] called a mini-batch—which is typically of size 32–512—is randomly drawn and the gradient calculation is only on B instead of the full batch [n]. This saves considerably the computational cost in calculation of gradient. By the law of large numbers, this stochastic gradient should be close to the full sample one, albeit with some random fluctuations. A pass of the whole training set is called an epoch. Usually, after several or tens of epochs, the error on a validation set levels off and training is complete. See Section 6 for more details and variants on training algorithms. The key to the above training procedure, namely SGD, is the calculation of the gradient ∇`B(θ), where `B(θ) , |B|−1 X i∈B L(f(xi; θ), yi). (6) Gradient computation, however, is in general nontrivial for complex models, and it is susceptible to numerical instability for a model with large depth. Here, we introduce an efficient approach, namely back-propagation, for computing gradients in neural networks. Back-propagation [106] is a direct application of the chain rule in networks. As the name suggests, the calculation is performed in a backward fashion: one first computes ∂`B/∂h(L) , then ∂`B/∂h(L−1) , . . ., and finally ∂`B/∂h(1) . For example, in the case of the ReLU activation function3 , we have the following recursive / backward relation ∂`B ∂h(`−1) = ∂h(`) ∂h(`−1) · ∂`B ∂h(`) = (W(`) ) diag 1{W(`) h(`−1) + b(`) ≥ 0} ∂`B ∂h(`) (7) where diag(·) denotes a diagonal matrix with elements given by the argument. Note that the calculation of ∂`B/∂h(`−1) depends on ∂`B/∂h(`) , which is the partial derivatives from the next layer. In this way, the derivatives are “back-propagated” from the last layer to the first layer. These derivatives {∂`B/∂h(`) } are then used to update the parameters. For instance, the gradient update for W(`) is given by W(`) ← W(`) − η ∂`B ∂W(`) , where ∂`B ∂W (`) jm = ∂`B ∂h (`) j · σ0 · h(`−1) m , (8) where σ0 = 1 if the j-th element of W(`) h(`−1) + b(`) is nonnegative, and σ0 = 0 otherwise. The step size η 0, also called the learning rate, controls how much parameters are changed in a single update. A more general way to think about neural network models and training is to consider computational graphs. Computational graphs are directed acyclic graphs that represent functional relations between vari- ables. They are very convenient and flexible to represent function composition, and moreover, they also allow an efficient way of computing gradients. Consider an MLP with a single hidden layer and an `2 regularization: `λ B(θ) = `B(θ) + rλ(θ) = `B(θ) + λ X j,j0 W (1) j,j0 2 + X j,j0 W (2) j,j0 2 , (9) where `B(θ) is the same as (6), and λ ≥ 0 is a tuning parameter. A similar example is considered in [45]. The corresponding computational graph is shown in Figure 4. 
Each node represents a function (inside a circle), which is associated with an output of that function (outside a circle). For example, we view the term `B(θ) as a result of 4 compositions: first the input data x multiplies the weight matrix W(1) resulting in u(1) , 3The issue of non-differentiability at the origin is often ignored in implementation. 7
  • 36. matmul relu matmul + # SoS $ %(') )(') * 12 12 , -(') -(.) cross entropy /, 0 Figure 4: The computational graph illustrates the loss (9). For simplicity, we omit the bias terms. Symbols inside nodes represent functions, and symbols outside nodes represent function outputs (vectors/scalars). matmul is matrix multiplication, relu is the ReLU activation, cross entropy is the cross entropy loss, and SoS is the sum of squares. then it goes through the ReLU activation function relu resulting in h(1) , then it multiplies another weight matrix W(2) leading to p, and finally it produces the cross-entropy with label y as in (5). The regularization term is incorporated in the graph similarly. A forward pass is complete when all nodes are evaluated starting from the input x. A backward pass then calculates the gradients of `λ B with respect to all other nodes in the reverse direction. Due to the chain rule, the gradient calculation for a variable (say, ∂`B/∂u(1) ) is simple: it only depends on the gradient value of the variables (∂`B/∂h) the current node points to, and the function derivative evaluated at the current variable value (σ0 (u(1) )). Thus, in each iteration, a computation graph only needs to (1) calculate and store the function evaluations at each node in the forward pass, and then (2) calculate all derivatives in the backward pass. Back-propagation in computational graphs forms the foundations of popular deep learning programming softwares, including TensorFlow [1] and PyTorch [92], which allows more efficient building and training of complex neural net models. 3 Popular models Moving beyond vanilla feed-forward neural networks, we introduce two other popular deep learning models, namely, the convolutional neural networks (CNNs) and the recurrent neural networks (RNNs). One impor- tant characteristic shared by the two models is weight sharing, that is some model parameters are identical across locations in CNNs or across time in RNNs. This is related to the notion of translational invariance in CNNs and stationarity in RNNs. At the end of this section, we introduce a modular thinking for constructing more flexible neural nets. 3.1 Convolutional neural networks The convolutional neural network (CNN) [71, 40] is a special type of feed-forward neural networks that is tailored for image processing. More generally, it is suitable for analyzing data with salient spatial structures. In this subsection, we focus on image classification using CNNs, where the raw input (image pixels) and features of each hidden layer are represented by a 3D tensor X ∈ Rd1×d2×d3 . Here, the first two dimensions d1, d2 of X indicate spatial coordinates of an image while the third d3 indicates the number of channels. For instance, d3 is 3 for the raw inputs due to the red, green and blue channels, and d3 can be much larger (say, 256) for hidden layers. Each channel is also called a feature map, because each feature map is specialized to detect the same feature at different locations of the input, which we will soon explain. We now introduce two building blocks of CNNs, namely the convolutional layer and the pooling layer. 1. Convolutional layer (CONV). A convolutional layer has the same functionality as described in (3), where 8
Figure 5: $X \in \mathbb{R}^{28\times 28\times 3}$ represents the input feature consisting of $28 \times 28$ spatial coordinates in a total of 3 channels / feature maps. $F_k \in \mathbb{R}^{5\times 5\times 3}$ denotes the $k$-th filter with size $5 \times 5$. The third dimension 3 of the filter automatically matches the number 3 of channels in the previous input. Every 3D patch of $X$ gets convolved with the filter $F_k$, and this as a whole results in a single output feature map $\tilde X_{:,:,k}$ of size $24 \times 24 \times 1$. Stacking the outputs of all the filters $\{F_k\}_{1\le k\le K}$ leads to the output feature of size $24 \times 24 \times K$.

the input feature $X \in \mathbb{R}^{d_1\times d_2\times d_3}$ goes through an affine transformation first and then an element-wise nonlinear activation. The difference lies in the specific form of the affine transformation. A convolutional layer uses a number of filters to extract local features from the previous input. More precisely, each filter is represented by a 3D tensor $F_k \in \mathbb{R}^{w\times w\times d_3}$ ($1 \le k \le \tilde d_3$), where $w$ is the size of the filter (typically 3 or 5) and $\tilde d_3$ denotes the total number of filters. Note that the third dimension $d_3$ of $F_k$ equals that of the input feature $X$. For this reason, one usually says that the filter has size $w \times w$, while suppressing the third dimension $d_3$. Each filter $F_k$ then convolves with the input feature $X$ to obtain one single feature map $O^k \in \mathbb{R}^{(d_1-w+1)\times(d_2-w+1)}$, where⁴

\[
O^k_{ij} = \big\langle [X]_{ij}, F_k \big\rangle = \sum_{i'=1}^{w} \sum_{j'=1}^{w} \sum_{l=1}^{d_3} [X]_{i+i'-1,\, j+j'-1,\, l}\, [F_k]_{i',j',l}. \tag{10}
\]

Here $[X]_{ij} \in \mathbb{R}^{w\times w\times d_3}$ is a small “patch” of $X$ starting at location $(i, j)$. See Figure 5 for an illustration of the convolution operation. If we view the 3D tensors $[X]_{ij}$ and $F_k$ as vectors, then each filter essentially computes their inner product with a part of $X$ indexed by $i, j$ (which can also be viewed as a convolution, as its name suggests). One then packs the resulting feature maps $\{O^k\}$ into a 3D tensor $O$ of size $(d_1 - w + 1) \times (d_2 - w + 1) \times \tilde d_3$, where

\[
[O]_{ijk} = [O^k]_{ij}. \tag{11}
\]

The outputs of convolutional layers are then followed by nonlinear activation functions. In the ReLU case, we have

\[
\tilde X_{ijk} = \sigma(O_{ijk}), \qquad \forall\, i \in [d_1 - w + 1],\; j \in [d_2 - w + 1],\; k \in [\tilde d_3]. \tag{12}
\]

The convolution operation (10) and the ReLU activation (12) work together to extract features $\tilde X$ from the input $X$. Different from feed-forward neural nets, the filters $F_k$ are shared across all locations $(i, j)$. A patch $[X]_{ij}$ of an input responds strongly (that is, produces a large value) to a filter $F_k$ if they are positively correlated. Intuitively, therefore, each filter $F_k$ serves to extract features similar to $F_k$. As a side note, after the convolution (10), the spatial size $d_1 \times d_2$ of the input $X$ shrinks to $(d_1 - w + 1) \times (d_2 - w + 1)$ for $\tilde X$. However, one may want the spatial size to remain unchanged. This can be achieved via padding, where one appends zeros to the margins of the input $X$ to enlarge the spatial size to $(d_1 + w - 1) \times (d_2 + w - 1)$. In addition, a stride in the convolutional layer determines the gaps $i' - i$ and $j' - j$ between two patches $[X]_{ij}$ and $[X]_{i'j'}$: in (10) the stride is 1, and a larger stride would lead to feature maps with smaller sizes.

⁴To simplify notation, we omit the bias/intercept term associated with each filter.
Figure 6: A 2 × 2 max-pooling layer extracts the maximum of 2-by-2 neighboring pixels / features across the spatial dimensions.

Figure 7: LeNet is composed of an input layer, two convolutional layers, two pooling layers and three fully-connected layers. Both convolutions are valid and use filters of size 5 × 5. In addition, the two pooling layers use 2 × 2 average pooling.

2. Pooling layer (POOL). A pooling layer aggregates the information of nearby features into a single one. This downsampling operation reduces the size of the features for subsequent layers and saves computation. One common form of the pooling layer is composed of the 2 × 2 max-pooling filter. It computes $\max\{X_{i,j,k}, X_{i+1,j,k}, X_{i,j+1,k}, X_{i+1,j+1,k}\}$, that is, the maximum over the 2 × 2 neighborhood in the spatial coordinates; see Figure 6 for an illustration. Note that the pooling operation is done separately for each feature map $k$. As a consequence, a 2 × 2 max-pooling filter acting on $X \in \mathbb{R}^{d_1\times d_2\times d_3}$ results in an output of size $d_1/2 \times d_2/2 \times d_3$. In addition, the pooling layer does not involve any parameters to optimize. Pooling layers serve to reduce redundancy, since a small neighborhood around a location $(i, j)$ in a feature map is likely to contain the same information.
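For concreteness, the convolution (10), the ReLU activation (12), and 2 × 2 max pooling can be written in a few lines of deliberately naive NumPy. This is a didactic sketch of the definitions above, not the optimized implementation found in deep learning libraries.

```python
import numpy as np

def conv_relu(X, F):
    """Valid convolution (10) followed by ReLU (12).
    X: (d1, d2, d3) input feature; F: (K, w, w, d3) stack of K filters."""
    d1, d2, d3 = X.shape
    K, w, _, _ = F.shape
    O = np.zeros((d1 - w + 1, d2 - w + 1, K))
    for k in range(K):
        for i in range(d1 - w + 1):
            for j in range(d2 - w + 1):
                patch = X[i:i + w, j:j + w, :]       # [X]_ij in (10)
                O[i, j, k] = np.sum(patch * F[k])    # inner product with filter F_k
    return np.maximum(O, 0.0)                        # element-wise ReLU

def max_pool_2x2(X):
    """2 x 2 max pooling, applied separately to each feature map."""
    d1, d2, d3 = X.shape
    X = X[:d1 - d1 % 2, :d2 - d2 % 2, :]             # drop an odd trailing row/column
    return X.reshape(d1 // 2, 2, d2 // 2, 2, d3).max(axis=(1, 3))

X = np.random.randn(28, 28, 3)       # e.g. the input in Figure 5
F = np.random.randn(16, 5, 5, 3)     # sixteen 5 x 5 filters
Xt = conv_relu(X, F)                 # shape (24, 24, 16)
P = max_pool_2x2(Xt)                 # shape (12, 12, 16)
```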
In addition, we also use fully-connected layers as building blocks, which we have already seen in Section 2. Each fully-connected layer treats the input tensor $X$ as a vector $\mathrm{Vec}(X)$ and computes $\tilde X = \sigma(W\,\mathrm{Vec}(X))$. A fully-connected layer does not use weight sharing and is often used in the last few layers of a CNN. As an example, Figure 7 depicts the well-known LeNet 5 [71], which is composed of two sets of CONV-POOL layers and three fully-connected layers.

3.2 Recurrent neural networks

Recurrent neural nets (RNNs) are another family of powerful models, designed to process time series and other sequence data. RNNs have been applied successfully in speech recognition [108], machine translation [132], genome sequencing [21], etc. The structure of an RNN naturally forms a computational graph, and can be easily combined with other structures such as CNNs to build large computational graph
Figure 8: Vanilla RNNs with different input/output settings. (a) has one input but multiple outputs; (b) has multiple inputs but one output; (c) has multiple inputs and outputs. Note that the parameters are shared across time steps.

models for complex tasks. Here we introduce vanilla RNNs and improved variants such as long short-term memory (LSTM).

3.2.1 Vanilla RNNs

Suppose we have general time series inputs $x_1, x_2, \ldots, x_T$. A vanilla RNN models the “hidden state” at time $t$ by a vector $h_t$, which is subject to the recursive formula

\[
h_t = f_\theta(h_{t-1}, x_t). \tag{13}
\]

Here, $f_\theta$ is generally a nonlinear function parametrized by $\theta$. Concretely, a vanilla RNN with one hidden layer has the following form⁵

\[
h_t = \tanh\!\big(W_{hh} h_{t-1} + W_{xh} x_t + b_h\big), \quad \text{where } \tanh(a) = \frac{e^{2a}-1}{e^{2a}+1}, \qquad
z_t = \sigma\!\big(W_{hy} h_t + b_z\big),
\]

where $W_{hh}, W_{xh}, W_{hy}$ are trainable weight matrices, $b_h, b_z$ are trainable bias vectors, and $z_t$ is the output at time $t$. As in many classical time series models, these parameters are shared across time. Note that different applications call for different input/output settings (cf. Figure 8). Examples include

• One-to-many: a single input with multiple outputs; see Figure 8(a). A typical application is image captioning, where the input is an image and the outputs are a series of words.

• Many-to-one: multiple inputs with a single output; see Figure 8(b). One application is text sentiment classification, where the input is a series of words in a sentence and the output is a label (e.g., positive vs. negative).

• Many-to-many: multiple inputs and outputs; see Figure 8(c). This is adopted in machine translation, where the inputs are words of a source language (say Chinese) and the outputs are words of a target language (say English).

As in the case of feed-forward neural nets, we minimize a loss function using back-propagation, where the loss is typically

\[
\ell_{\mathcal{T}}(\theta) = \sum_{t \in \mathcal{T}} \mathcal{L}(y_t, z_t) = -\sum_{t \in \mathcal{T}} \sum_{k=1}^{K} \mathbf{1}\{y_t = k\} \log\!\Big( \frac{\exp([z_t]_k)}{\sum_{k'} \exp([z_t]_{k'})} \Big),
\]

where $K$ is the number of categories for classification (e.g., the size of the vocabulary in machine translation), and $\mathcal{T} \subseteq [T]$ is the set of time steps at which outputs are produced. During training, the gradients $\partial \ell_{\mathcal{T}} / \partial h_t$ are computed in reverse time order (from $T$ to $t$). For this reason, the training process is often called back-propagation through time.

⁵Similar to the activation function $\sigma(\cdot)$, the function $\tanh(\cdot)$ is applied element-wise.
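The recursion above is short enough to write out directly. Below is a minimal NumPy sketch, assuming tanh hidden units and a linear read-out $z_t$ at every time step (a softmax or sigmoid would be applied on top for classification); it is an illustration, not a full training routine.

```python
import numpy as np

def rnn_forward(xs, Whh, Wxh, bh, Why, bz):
    """Run h_t = tanh(Whh h_{t-1} + Wxh x_t + b_h) and z_t = Why h_t + b_z for t = 1..T.
    xs is a sequence of input vectors; the same parameters are reused at every step."""
    h = np.zeros(Whh.shape[0])
    hs, zs = [], []
    for x_t in xs:
        h = np.tanh(Whh @ h + Wxh @ x_t + bh)   # hidden-state recursion (13)
        zs.append(Why @ h + bz)                 # per-step output z_t
        hs.append(h)
    return hs, zs

# Example: a length-5 sequence of 8-dimensional inputs, 32 hidden units, 3 output classes.
rng = np.random.default_rng(0)
T, d, m, K = 5, 8, 32, 3
hs, zs = rnn_forward(rng.normal(size=(T, d)),
                     0.1 * rng.normal(size=(m, m)), 0.1 * rng.normal(size=(m, d)),
                     np.zeros(m), 0.1 * rng.normal(size=(K, m)), np.zeros(K))
```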
Figure 9: A vanilla RNN with two hidden layers. Higher-level hidden states $h_t^\ell$ are determined by the old states $h_{t-1}^\ell$ and the lower-level hidden states $h_t^{\ell-1}$. Multilayer RNNs generalize both feed-forward neural nets and one-hidden-layer RNNs.

One notable drawback of vanilla RNNs is that they have difficulty capturing long-range dependencies in sequence data when the length of the sequence is large. This is sometimes due to the phenomenon of exploding or vanishing gradients. Take Figure 8(c) as an example. Computing $\partial \ell_{\mathcal{T}} / \partial h_1$ involves the product $\prod_{t=1}^{3} (\partial h_{t+1} / \partial h_t)$ by the chain rule. However, if the sequence is long, the product is a multiplication of many Jacobian matrices, which usually results in exponentially large or small singular values. To alleviate this issue, in practice, the forward and backward passes are implemented over a shorter sliding window $\{t_1, t_1 + 1, \ldots, t_2\}$ instead of the full sequence $\{1, 2, \ldots, T\}$. Though effective in some cases, this technique alone does not fully address the issue of long-term dependency.

3.2.2 GRUs and LSTM

There are two improved variants that alleviate the above issue: gated recurrent units (GRUs) [26] and long short-term memory (LSTM) [54].

• A GRU refines the recursive formula (13) by introducing gates, which are vectors of the same length as $h_t$. The gates, which take values in $[0, 1]$ element-wise, multiply with $h_{t-1}$ element-wise and determine how much of the old hidden states they keep.

• An LSTM similarly uses gates in the recursive formula. In addition to $h_t$, an LSTM maintains a cell state, which takes values in $\mathbb{R}$ element-wise and is analogous to a counter.

Here we only discuss the LSTM in detail. Denote by $\odot$ the element-wise multiplication. We have the following recursive formula in place of (13):

\[
\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix}
=
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
W \begin{pmatrix} h_{t-1} \\ x_t \\ 1 \end{pmatrix},
\qquad
c_t = f_t \odot c_{t-1} + i_t \odot g_t,
\qquad
h_t = o_t \odot \tanh(c_t),
\]

where $W$ is a big weight matrix with appropriate dimensions. The cell state vector $c_t$ carries information about the sequence (e.g., singular/plural form in a sentence). The forget gate $f_t$ determines how much of $c_{t-1}$ is kept for time $t$, the input gate $i_t$ controls the amount of update to the cell state, and the output gate $o_t$ determines how much $c_t$ reveals to $h_t$. Ideally, the elements of these gates take nearly binary values. For example, an element of $f_t$ being close to 1 may suggest the presence of a feature in the sequence data. Similar to the skip connections in residual nets, the cell state $c_t$ has an additive recursive formula, which helps back-propagation and thus captures long-range dependencies.
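A sketch of one LSTM step corresponding to the display above, assuming hidden dimension $m$, the gate ordering $(i_t, f_t, o_t, g_t)$, a logistic sigmoid for the gates, and a single weight matrix $W$ acting on the concatenation $(h_{t-1}, x_t, 1)$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x_t, W):
    """One LSTM step. W maps [h_{t-1}; x_t; 1] to the stacked pre-activations of
    the input, forget, output gates and the candidate g_t (4*m rows in total)."""
    m = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x_t, [1.0]])   # the single big matrix in the text
    i, f, o = sigmoid(a[:m]), sigmoid(a[m:2*m]), sigmoid(a[2*m:3*m])
    g = np.tanh(a[3*m:])
    c = f * c_prev + i * g                         # additive cell-state recursion
    h = o * np.tanh(c)
    return h, c

m, d = 16, 8
rng = np.random.default_rng(0)
h, c = lstm_step(np.zeros(m), np.zeros(m), rng.normal(size=d),
                 0.1 * rng.normal(size=(4 * m, m + d + 1)))
```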
3.2.3 Multilayer RNNs

Multilayer RNNs are a generalization of the one-hidden-layer RNN discussed above. Figure 9 shows a vanilla RNN with two hidden layers. In place of (13), the recursive formula for an RNN with $L$ hidden layers now reads

\[
h_t^\ell = \tanh\!\left( W^\ell \begin{pmatrix} h_t^{\ell-1} \\ h_{t-1}^{\ell} \\ 1 \end{pmatrix} \right), \quad \text{for all } \ell \in [L], \qquad h_t^0 := x_t.
\]

Note that a multilayer RNN has two dimensions: the sequence length $T$ and the depth $L$. Two special cases are the feed-forward neural nets (where $T = 1$) introduced in Section 2, and RNNs with one hidden layer (where $L = 1$). Multilayer RNNs usually do not have very large depth (e.g., 2–5), since $T$ is already very large.

Finally, we remark that CNNs, RNNs, and other neural nets can be easily combined to tackle tasks that involve different sources of input data. For example, in image captioning, the images are first processed through a CNN, and then the high-level features are fed into an RNN as inputs. These neural nets combined together form a large computational graph, so they can be trained using back-propagation. This generic training method provides much flexibility in various applications.

3.3 Modules

Deep neural nets are essentially compositions of many nonlinear functions. A component function may be designed to have specific properties in a given task, and it can itself result from composing a few simpler functions. In the LSTM, we have seen that the building block consists of several intermediate variables, including cell states and forget gates that can capture long-term dependency and alleviate numerical issues. This leads to the idea of designing modules for building more complex neural net models. Desirable modules usually have low computational costs, alleviate numerical issues in training, and lead to good statistical accuracy. Since modules and the resulting neural net models form computational graphs, training follows the same principles briefly described in Section 2. Here, we use the examples of Inception and skip connections to illustrate the ideas behind modules.

Figure 10(a) is an example of the “Inception” module used in GoogleNet [123]. As before, all the convolutional layers are followed by the ReLU activation function. The concatenation of information from filters with different sizes gives the model great flexibility to capture spatial information. Note that a 1 × 1 filter is a $1 \times 1 \times d_3$ tensor (where $d_3$ is the number of feature maps), so its convolution does not interact with other spatial coordinates and only serves to aggregate information from different feature maps at the same coordinate. This reduces the number of parameters and speeds up the computation. Similar ideas appear in other work [78, 57].

Figure 10: (a) The “Inception” module from GoogleNet. Concat means combining all feature maps into a tensor. (b) Skip connections are added every two layers in ResNets.

Another module, usually called a skip connection, is widely used to alleviate numerical issues in very deep neural nets, with additional benefits in optimization efficiency and statistical accuracy. Training very deep
neural nets is generally more difficult, but the introduction of skip connections in residual networks [50, 51] has greatly eased the task. The high-level idea of skip connections is to add an identity map to an existing nonlinear function. Let $F(x)$ be an arbitrary nonlinear function represented by a (fragment of a) neural net; then the idea of skip connections is simply to replace $F(x)$ with $x + F(x)$. Figure 10(b) shows a well-known structure from residual networks [50]—for every two layers, an identity map is added:

\[
x \longmapsto \sigma(x + F(x)) = \sigma\big(x + W'\sigma(Wx + b) + b'\big), \tag{14}
\]

where $x$ can be the hidden nodes from any layer and $W, W', b, b'$ are the corresponding parameters. By repeating (namely, composing) this structure throughout all layers, [50, 51] are able to train neural nets with hundreds of layers easily, which overcomes well-observed training difficulties in deep neural nets. Moreover, deep residual networks also improve statistical accuracy, as the classification error on the ImageNet challenge was reduced by 46% from 2014 to 2015. As a side note, skip connections can be used flexibly. They are not restricted to the form in (14), and can be used between any pair of layers $\ell, \ell'$ [55].

4 Deep unsupervised learning

In supervised learning, given a labelled training set $\{(y_i, x_i)\}$, we focus on discriminative models, which essentially represent $\mathbb{P}(y \mid x)$ by a deep neural net $f(x;\theta)$ with parameters $\theta$. Unsupervised learning, in contrast, aims at extracting information from unlabeled data $\{x_i\}$, where the labels $\{y_i\}$ are absent. This information can be a low-dimensional embedding of the data $\{x_i\}$ or a generative model with latent variables that approximates the distribution $P_X(x)$. To achieve these goals, we introduce two popular unsupervised deep learning models, namely, autoencoders and generative adversarial networks (GANs). The first can be viewed as a dimension reduction technique, and the second as a density estimation method. DNNs are the key elements of both models.

4.1 Autoencoders

Recall that in dimension reduction, the goal is to reduce the dimensionality of the data while preserving its salient features. In particular, in principal component analysis (PCA), the goal is to embed the data $\{x_i\}_{1\le i\le n}$ into a low-dimensional space via a linear map $f$ such that maximum variance is explained. Equivalently, we want to find linear maps $f : \mathbb{R}^d \to \mathbb{R}^k$ and $g : \mathbb{R}^k \to \mathbb{R}^d$ ($k \le d$) such that the difference between $x_i$ and $g(f(x_i))$ is minimized. Formally, we let

\[
f(x) = W_f x := h \quad \text{and} \quad g(h) = W_g h,
\]

where $W_f \in \mathbb{R}^{k\times d}$ and $W_g \in \mathbb{R}^{d\times k}$. Here, for simplicity, we assume that the intercept/bias terms for $f$ and $g$ are zero. Then, PCA amounts to minimizing the quadratic loss function

\[
\operatorname*{minimize}_{W_f,\, W_g}\; \frac{1}{n} \sum_{i=1}^{n} \big\| x_i - W_g W_f x_i \big\|_2^2. \tag{15}
\]

This is the same as minimizing $\|X - WX\|_F^2$ subject to $\operatorname{rank}(W) \le k$, where $X \in \mathbb{R}^{d\times n}$ is the design matrix. The solution is given by the singular value decomposition of $X$ [44, Thm. 2.4.8], which is exactly what PCA does. It turns out that PCA is a special case of autoencoders, often known as the undercomplete linear autoencoder.

More broadly, autoencoders are neural network models for (nonlinear) dimension reduction, which generalize PCA. An autoencoder has two key components, namely, the encoder function $f(\cdot)$, which maps the input $x \in \mathbb{R}^d$ to a hidden code/representation $h := f(x) \in \mathbb{R}^k$, and the decoder function $g(\cdot)$, which maps the hidden representation $h$ to a point $g(h) \in \mathbb{R}^d$. Both functions can be multilayer neural networks as in (3).
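As a quick numerical illustration of the PCA connection, the sketch below (NumPy only, data assumed centered) solves the linear problem (15) via the SVD: the optimal encoder/decoder are given by the top-$k$ left singular vectors, and the attained objective value equals the energy in the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 200, 2
X = rng.normal(size=(d, n))                    # design matrix, samples as columns
Xc = X - X.mean(axis=1, keepdims=True)         # center the data (zero-intercept assumption)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Wf = U[:, :k].T                                # encoder  f(x) = Wf x,  Wf in R^{k x d}
Wg = U[:, :k]                                  # decoder  g(h) = Wg h,  Wg in R^{d x k}

obj = np.mean(np.sum((Xc - Wg @ (Wf @ Xc)) ** 2, axis=0))     # objective in (15)
print(np.isclose(obj, np.sum(s[k:] ** 2) / n))                # True: the optimal value
```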
See Figure 11 for an illustration of autoencoders. Let $\mathcal{L}(x_1, x_2)$ be a loss function that measures the difference between $x_1$ and $x_2$ in $\mathbb{R}^d$. Similar to PCA, an autoencoder is used to find the encoder $f$ and
Figure 11: First an input $x$ goes through the encoder $f(\cdot)$, and we obtain its hidden representation $h = f(x)$. Then, we use the decoder $g(\cdot)$ to get $g(h)$ as a reconstruction of $x$. Finally, the loss is determined from the difference between the original input $x$ and its reconstruction $g(f(x))$.

decoder $g$ such that $\mathcal{L}(x, g(f(x)))$ is as small as possible. Mathematically, this amounts to solving the following minimization problem:

\[
\operatorname*{minimize}_{f,\,g}\; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(x_i, g(h_i)\big) \quad \text{with } h_i = f(x_i), \text{ for all } i \in [n]. \tag{16}
\]

One needs to make structural assumptions on the functions $f$ and $g$ in order to find useful representations of the data, which leads to different types of autoencoders. Indeed, if no assumption is made, choosing $f$ and $g$ to be identity functions clearly minimizes the above optimization problem. To avoid this trivial solution, one natural way is to require that the encoder $f$ maps the data onto a space with a smaller dimension, i.e., $k < d$. This is the undercomplete autoencoder that includes PCA as a special case. There are other structured autoencoders which add desired properties, such as sparsity or robustness, to the model, mainly through regularization terms. Below we present two other common types of autoencoders.

• Sparse autoencoders. One may believe that the dimension $k$ of the hidden code $h_i$ is larger than the input dimension $d$, and that $h_i$ admits a sparse representation. As with the LASSO [126] or SCAD [36], one may add a regularization term to the reconstruction loss $\mathcal{L}$ in (16) to encourage sparsity [98]. A sparse autoencoder solves

\[
\min_{f,\,g}\; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(x_i, g(h_i)\big)}_{\text{loss}} + \underbrace{\lambda \|h_i\|_1}_{\text{regularizer}} \quad \text{with } h_i = f(x_i), \text{ for all } i \in [n].
\]

This is similar to dictionary learning, where one aims at finding a sparse representation of the input data on an overcomplete basis. Due to the imposed sparsity, the model can potentially learn useful features of the data.

• Denoising autoencoders. One may hope that the model is robust to noise in the data: even if the input data $x_i$ are corrupted by small noise $\xi_i$ or miss some components (the noise level or the missing probability is typically small), an ideal autoencoder should faithfully recover the original data. A denoising autoencoder [128] achieves this robustness by explicitly building noisy data $\tilde x_i = x_i + \xi_i$ as the new input,
Figure 12: GANs consist of two components, a generator G which generates fake samples and a discriminator D which differentiates the true samples from the fake ones.

and then solving an optimization problem similar to (16), where $\mathcal{L}(x_i, g(h_i))$ is replaced by $\mathcal{L}(x_i, g(f(\tilde x_i)))$. A denoising autoencoder encourages the encoder/decoder to be stable in the neighborhood of an input, which is generally a good statistical property. An alternative would be to constrain $f$ and $g$ in the optimization problem, but this would be very difficult to optimize. Instead, sampling by adding small perturbations to the input provides a simple implementation. We shall see similar ideas in Section 6.3.3.

4.2 Generative adversarial networks

Given unlabeled data $\{x_i\}_{1\le i\le n}$, density estimation aims to estimate the underlying probability density function $P_X$ from which the data are generated. Both parametric and nonparametric estimators [115] have been proposed and studied under various assumptions on the underlying distribution. Different from these classical density estimators, where the density function is explicitly defined in relatively low dimension, generative adversarial networks (GANs) [46] can be categorized as an implicit density estimator in much higher dimension. The reasons are twofold: (1) GANs put more emphasis on sampling from the distribution $P_X$ than on estimation; (2) GANs define the density estimate implicitly through a source distribution $P_Z$ and a generator function $g(\cdot)$, which is usually a deep neural network. We introduce GANs from the perspective of sampling from $P_X$, and later we will generalize the vanilla GAN using its relation to density estimators.

4.2.1 Sampling view of GANs

Suppose the data $\{x_i\}_{1\le i\le n}$ at hand are all real images, and we want to generate new natural images.
With this goal in mind, the GAN models a zero-sum game between two players, namely, the generator G and the discriminator D. The generator G tries to generate fake images akin to the true images $\{x_i\}_{1\le i\le n}$, while the discriminator D aims at differentiating the fake ones from the true ones. Intuitively, one hopes to learn a generator G whose images the best discriminator D cannot distinguish from real ones. Therefore the payoff for the generator G is higher if the probability of the discriminator D being wrong is higher, and correspondingly the payoff for the discriminator correlates positively with its ability to tell fake from real.

Mathematically, the generator G consists of two components: a source distribution $P_Z$ (usually a standard multivariate Gaussian distribution with hundreds of dimensions) and a function $g(\cdot)$ which maps a sample $z$ from $P_Z$ to a point $g(z)$ living in the same space as $x$. For generating images, $g(z)$ would be a 3D tensor. Here $g(z)$ is the fake sample generated from G. Similarly, the discriminator D is composed of a single function which takes an image $x$ (real or fake) and returns a number $d(x) \in [0, 1]$, the probability that $x$ is a real sample from $P_X$. Oftentimes, both the generating function $g(\cdot)$ and the discriminating function $d(\cdot)$ are realized by deep neural networks, e.g., the CNNs introduced in Section 3.1. See Figure 12 for an illustration of GANs. Denote by $\theta_G$ and $\theta_D$ the parameters in $g(\cdot)$ and $d(\cdot)$, respectively. Then the GAN tries to solve the following min-max problem:

\[
\min_{\theta_G} \max_{\theta_D}\; \mathbb{E}_{x\sim P_X}\big[\log\big(d(x)\big)\big] + \mathbb{E}_{z\sim P_Z}\big[\log\big(1 - d(g(z))\big)\big]. \tag{17}
\]
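In practice, (17) is typically attacked by alternating stochastic gradient steps on $\theta_D$ and $\theta_G$. The PyTorch sketch below is illustrative only: the fully-connected architectures, the dimensions, and the common "non-saturating" generator loss are assumptions rather than part of the formulation above.

```python
import torch
from torch import nn

# Hypothetical generator/discriminator for 28x28 grey-scale images, flattened to 784 dims.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(x_real):                       # x_real: a mini-batch of real samples, shape (B, 784)
    B = x_real.shape[0]
    z = torch.randn(B, 100)                 # sample from the source distribution P_Z
    x_fake = G(z)

    # Discriminator ascent step on (17): push d(x) toward 1 on real data, 0 on fakes.
    opt_D.zero_grad()
    loss_D = bce(D(x_real), torch.ones(B, 1)) + bce(D(x_fake.detach()), torch.zeros(B, 1))
    loss_D.backward(); opt_D.step()

    # Generator step: fool the (now fixed) discriminator; non-saturating surrogate loss.
    opt_G.zero_grad()
    loss_G = bce(D(x_fake), torch.ones(B, 1))
    loss_G.backward(); opt_G.step()

# gan_step(torch.randn(64, 784))  # one update with a (here random) batch of "real" data
```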
Recall that $d(x)$ models the belief / probability that the discriminator thinks $x$ is a true sample. Fix the parameters $\theta_G$, and hence the generator G, and consider the inner maximization problem. We can see that the goal of the discriminator is to maximize its ability to differentiate. Similarly, if we fix $\theta_D$ (and hence the discriminator), the generator tries to generate more realistic images $g(z)$ to fool the discriminator.

4.2.2 Density estimation view of GANs

Let us now take a density-estimation view of GANs. Fixing the source distribution $P_Z$, any generator G induces a distribution $P_G$ over the space of images. Removing the restrictions on $d(\cdot)$, one can then rewrite (17) as

\[
\min_{P_G} \max_{d(\cdot)}\; \mathbb{E}_{x\sim P_X}\big[\log\big(d(x)\big)\big] + \mathbb{E}_{x\sim P_G}\big[\log\big(1 - d(x)\big)\big]. \tag{18}
\]

Observe that the inner maximization problem is solved by the likelihood ratio, i.e.,

\[
d^*(x) = \frac{P_X(x)}{P_X(x) + P_G(x)}.
\]

As a result, (18) can be simplified as

\[
\min_{P_G}\; \mathrm{JS}(P_X \,\|\, P_G), \tag{19}
\]

where $\mathrm{JS}(\cdot\,\|\,\cdot)$ denotes the Jensen–Shannon divergence between two distributions,

\[
\mathrm{JS}(P_X \,\|\, P_G) = \frac{1}{2}\,\mathrm{KL}\Big(P_X \,\Big\|\, \frac{P_X + P_G}{2}\Big) + \frac{1}{2}\,\mathrm{KL}\Big(P_G \,\Big\|\, \frac{P_X + P_G}{2}\Big).
\]

In words, the vanilla GAN (17) seeks a density $P_G$ that is closest to $P_X$ in terms of the Jensen–Shannon divergence. This view allows us to generalize GANs to other variants by changing the distance metric. Examples include f-GAN [90], Wasserstein GAN (W-GAN) [6], MMD GAN [75], etc. We single out the Wasserstein GAN (W-GAN) [6] due to its popularity. As the name suggests, it minimizes the Wasserstein distance between $P_X$ and $P_G$:

\[
\min_{\theta_G}\; \mathrm{WS}(P_X \,\|\, P_G) = \min_{\theta_G}\; \sup_{f:\, f \text{ is 1-Lipschitz}}\; \mathbb{E}_{x\sim P_X}[f(x)] - \mathbb{E}_{x\sim P_G}[f(x)], \tag{20}
\]

where the supremum is taken over all Lipschitz functions with coefficient 1. Comparing W-GAN (20) with the original formulation of GAN (17), one finds that the Lipschitz function $f$ in (20) corresponds to the discriminator D in (17), in the sense that they share similar objectives to differentiate the true distribution $P_X$ from the fake one $P_G$. In the end, we would like to mention that GANs are more difficult to train than supervised deep learning models such as CNNs [110]. Apart from the training difficulty, how to evaluate GANs objectively and effectively remains an active research question.

5 Representation power: approximation theory

Having seen the building blocks of deep learning models in the previous sections, it is natural to ask: what are the benefits of composing multiple layers of nonlinear functions? In this section, we address this question from an approximation-theoretic point of view. Mathematically, letting $\mathcal{H}$ be the space of functions representable by neural nets (NNs), how well can a function $f$ (with certain properties) be approximated by functions in $\mathcal{H}$? We first revisit universal approximation theories, which are mostly developed for shallow neural nets (neural nets with a single hidden layer), and then present recent results that demonstrate the benefits of depth. Other notable works include the Kolmogorov–Arnold superposition theorem [7, 120] and circuit complexity for neural nets [91].
  • 46. 5.1 Universal approximation theory for shallow NNs The universal approximation theories study the approximation of f in a space F by a function represented by a one-hidden-layer neural net g(x) = N X j=1 cjσ∗(w j x − bj), (21) where σ∗ : R → R is certain activation function and N is the number of hidden units in the neural net. For different space F and activation function σ∗, there are upper bounds and lower bounds on the approximation error kf − gk. See [93] for a comprehensive overview. Here we present representative results. First, as N → ∞, any continuous function f can be approximated by some g under mild conditions. Loosely speaking, this is because each component σ∗(w j x − bj) behaves like a basis function and functions in a suitable space F admits a basis expansion. Given the above heuristics, the next natural question is: what is the rate of approximation for a finite N? Let us restrict the domain of x to a unit ball Bd in Rd . For p ∈ [1, ∞) and integer m ≥ 1, consider the Lp space and the Sobolev space with standard norms kfkp = h Z Bn |g(x)|p dx i1/p , kfkm,p = h X 0≤|k|≤m kDk fkp p i1/p , where Dk f denotes partial derivatives indexed by k ∈ Zd +. Let F , Fm p be the space of functions f in the Sobolev space with kfkm,p ≤ 1. Note that functions in F have bounded derivatives up to m-th order, and that smoothness of functions is controlled by m (larger m means smoother). Denote by HN the space of functions with the form (21). The following general upper bound is due to [85]. Theorem 1 (Theorem 2.1 in [85]). Assume σ∗ : R → R is such that σ∗ has arbitrary order derivatives in an open interval I, and that σ∗ is not a polynomial on I. Then, for any p ∈ [1, ∞), d ≥ 2, and integer m ≥ 1, sup f∈Fm p inf g∈HN kf − gkp ≤ Cd,m,p N−m/d , where Cd,m,p is independent of N, the number of hidden units. In the above theorem, the condition on σ∗(·) is mainly technical. This upper bound is useful when the dimension d is not large. It clearly implies that the one-hidden-layer neural net is able to approximate any smooth function with enough hidden units. However, it is unclear how to find a good approximator g; nor do we have control over the magnitude of the parameters (huge weights are impractical). While increasing the number of hidden units N leads to better approximation, the exponent −m/d suggests the presence of the curse of dimensionality. The following (nearly) matching lower bound is stated in [80]. Theorem 2 (Theorem 5 in [80]). Let p ≥ 1, m ≥ 1 and N ≥ 2. If the activation function is the standard sigmoid function σ(t) = (1 + e−t )−1 , then sup f∈Fm p inf g∈HN kf − gkp ≥ C0 d,m,p (N log N)−m/d , (22) where C0 d,m,p is independent of N. Results for other activation functions are also obtained by [80]. Moreover, the term log N can be removed if we assume an additional continuity condition [85]. For the natural space Fm p of smooth functions, the exponential dependence on d in the upper and lower bounds may look unappealing. However, [12] showed that for a different function space, there is a good dimension-free approximation by the neural nets. Suppose that a function f : Rd 7→ R has a Fourier representation f(x) = Z Rd eihω,xi ˜ f(ω) dω, (23) 18
  • 47. where ˜ f(ω) ∈ C. Assume that f(0) = 0 and that the following quantity is finite Cf = Z Rd kωk2| ˜ f(ω)| dω. (24) [12] uncovers the following dimension-free approximation guarantee. Theorem 3 (Proposition 1 in [12]). Fix a C 0 and an arbitrary probability measure µ on the unit ball Bd in Rd . For every function f with Cf ≤ C and every N ≥ 1, there exists some g ∈ HN such that Z Bd (f(x) − g(x))2 µ(dx) 1/2 ≤ 2C √ N . Moreover, the coefficients of g may be restricted to satisfy PN j=1 |cj| ≤ 2C. The upper bound is now independent of the dimension d. However, Cf may implicitly depend on d, as the formula in (24) involves an integration over Rd (so for some functions Cf may depend exponentially on d). Nevertheless, this theorem does characterize an interesting function space with an improved upper bound. Details of the function space are discussed by [12]. This theorem can be generalized; see [81] for an example. To help understand why a dimensionality-free approximation holds, let us appeal to a heuristic argument given by Monte Carlo simulations. It is well-known that Monte Carlo approximation errors are independent of dimensionality in evaluation of high-dimensional integrals. Let us generate {ωj}1≤j≤N randomly from a given density p(·) in Rd . Consider the approximation to (23) by gN (x) = 1 N N X j=1 cjeihωj ,xi , cj = ˜ f(ωj) p(ωj) . (25) Then, gN (x) is a one-hidden-layer neural network with N units and the sinusoid activation function. Note that EgN (x) = f(x), where the expectation is taken with respect to randomness {ωj}. Now, by indepen- dence, we have E(gN (x) − f(x))2 = 1 N Var(cjeihωj ,xi ) ≤ 1 N Ec2 j , if Ec2 j ∞. Therefore, the rate is independent of the dimensionality d, though the constant can be. 5.2 Approximation theory for multi-layer NNs The approximation theory for multilayer neural nets is less understood compared with neural nets with one hidden layer. Driven by the success of deep learning, there are many recent papers focusing on expressivity of deep neural nets. As studied by [125, 35, 84, 94, 15, 111, 77, 103], deep neural nets excel at representing composition of functions. This is perhaps not surprising, since deep neural nets are themselves defined by composing layers of functions. Nevertheless, it points to a new territory rarely studied in statistics before. Below we present a result based on [77, 103]. Suppose that the inputs x have a bounded domain [−1, 1]d for simplicity. As before, let σ∗ : R → R be a generic function, and σ∗ = (σ∗, · · · , σ∗) be element-wise application of σ∗. Consider a neural net which is similar to (3) but with scaler output: g(x) = W`σ∗(· · · σ∗(W2σ∗(W1x)) · · · ). A unit or neuron refers to an element of vectors σ∗(Wk · · · σ∗(W2σ∗(W1x)) · · · ) for any k = 1, . . . , ` − 1. For a multivariate polynomial p, define mk(p) to be the smallest integer such that, for any 0, there exists a neural net g(x) satisfying supx |p(x) − g(x)| , with k hidden layers (i.e., ` = k + 1) and no more than mk(p) neurons in total. Essentially, mk(p) is the minimum number of neurons required to approximate p arbitrarily well. Theorem 4 (Theorem 4.1 in [103]). Let p(x) be a monomial xr1 1 xr2 2 · · · xrd d with q = Pd j=1 rj. Suppose that σ∗ has derivatives of order 2q at the origin, and that they are nonzero. Then, (i) m1(p) = Qd j=1(rj + 1); (ii) mink mk(p) ≤ Pd j=1 (7dlog2(rj)e + 4). 19
This theorem reveals a sharp distinction between shallow networks (one hidden layer) and deep networks. To represent a monomial function, a shallow network requires exponentially many neurons in terms of the dimension $d$, whereas linearly many neurons suffice for a deep network (with bounded $r_j$). The exponential dependence on $d$, as shown in Theorem 4(i), is resonant with the curse of dimensionality widely seen in many fields; see [30]. One may ask: how does depth help? Depth circumvents this issue, at least for certain functions, by allowing us to represent function composition efficiently. Indeed, Theorem 4(ii) offers a nice result with a clear intuition: it is known that the product of two scalar inputs can be represented using 4 neurons [77], so by composing multiple products, we can express monomials with $O(d)$ neurons.

Recent advances in nonparametric regression also support the idea that deep neural nets excel at representing compositions of functions [15, 111]. In particular, [15] considered the nonparametric regression setting where we want to estimate a function $\hat f_n(x)$ from i.i.d. data $\mathcal{D}_n = \{(y_i, x_i)\}_{1\le i\le n}$. If the true regression function $f(x)$ has a certain hierarchical structure with intrinsic dimensionality⁶ $d^*$, then the error $\mathbb{E}_{\mathcal{D}_n}\mathbb{E}_x\big(\hat f_n(x) - f(x)\big)^2$ has an optimal minimax convergence rate $O\big(n^{-\frac{2q}{2q+d^*}}\big)$, rather than the usual rate $O\big(n^{-\frac{2q}{2q+d}}\big)$ that depends on the ambient dimension $d$. Here $q$ is the smoothness parameter. This provides another justification for deep neural nets: if the data are truly hierarchical, then the quality of approximation by deep neural nets depends on the intrinsic dimensionality, which avoids the curse of dimensionality.

We point out that the approximation theory for deep learning is far from complete. For example, in Theorem 4, the condition on $\sigma_*$ excludes the widely used ReLU activation function, and there are no constraints on the magnitude of the weights (so they can be unreasonably large).

6 Training deep neural nets

The existence of a good function approximator in the NN function class does not explain why in practice we can easily find one. In this section, we introduce standard methods, namely stochastic gradient descent (SGD) and its variants, to train deep neural networks (i.e., to find such a good approximator). As with many statistical machine learning tasks, training DNNs follows the empirical risk minimization (ERM) paradigm, which solves the following optimization problem:

\[
\operatorname*{minimize}_{\theta\in\mathbb{R}^p}\; \ell_n(\theta) := \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f(x_i;\theta), y_i\big). \tag{26}
\]

Here $\mathcal{L}(f(x_i;\theta), y_i)$ measures the discrepancy between the prediction $f(x_i;\theta)$ of the neural network and the true label $y_i$. Correspondingly, denote by $\ell(\theta) := \mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathcal{L}(f(x;\theta), y)]$ the out-of-sample error, where $\mathcal{D}$ is the joint distribution over $(y, x)$. Solving the ERM (26) for deep neural nets faces various challenges that roughly fall into the following three categories.

• Scalability and nonconvexity. Both the sample size $n$ and the number of parameters $p$ can be huge for modern deep learning applications, as we have seen in Table 1. Many optimization algorithms are not practical due to the computational costs and memory constraints. What is worse, the empirical loss function $\ell_n(\theta)$ in deep learning is often nonconvex. It is a priori not clear whether an optimization algorithm can drive the empirical loss (26) small.

• Numerical stability.
With a large number of layers in DNNs, the magnitudes of the hidden nodes can be drastically different, which may result in the “exploding gradients” or “vanishing gradients” issue during the training process. This is because the recursive relations across layers often lead to exponentially increasing or decreasing values in both forward and backward passes.

• Generalization performance. Our ultimate goal is to find a parameter $\hat\theta$ such that the out-of-sample error $\ell(\hat\theta)$ is small. However, in the over-parametrized regime where $p$ is much larger than $n$, the underlying

⁶Roughly speaking, the true regression function can be represented by a tree where each node has at most $d^*$ children. See [15] for the precise definition.
  • 49. neural network has the potential to fit the training data perfectly while performing poorly on the test data. To avoid this overfitting issue, proper regularization, whether explicit or implicit, is needed in the training process for the neural nets to generalize. In the following three subsections, we discuss practical solutions / proposals to address these challenges. 6.1 Stochastic gradient descent Stochastic gradient descent (SGD) [101] is by far the most popular optimization algorithm to solve ERM (26) for large-scale problems. It has the following simple update rule: θt+1 = θt − ηtG(θt ) with G θt = ∇L f xit ; θt , yit (27) for t = 0, 1, 2, . . ., where ηt 0 is the step size (or learning rate), θ0 ∈ Rp is an initial point and it is chosen randomly from {1, 2, · · · , n}. It is easy to verify that G(θt ) is an unbiased estimate of ∇`n(θt ). The advantage of SGD is clear: compared with gradient descent, which goes over the entire dataset in every update, SGD uses a single example in each update and hence is considerably more efficient in terms of both computation and memory (especially in the first few iterations). Apart from practical benefits of SGD, how well does SGD perform theoretically in terms of minimizing `n(θ)? We begin with the convex case, i.e., the case where the loss function is convex w.r.t. θ. It is well understood in literature that with proper choices of the step sizes {ηt}, SGD is guaranteed to achieve both consistency and asymptotic normality. • Consistency. If `(θ) is a strongly convex function7 , then under some mild conditions8 , learning rates that satisfy ∞ X t=0 ηt = +∞ and ∞ X t=0 η2 t +∞ (28) guarantee almost sure convergence to the unique minimizer θ∗ , argminθ`(θ), i.e., θt a.s. − − → θ∗ as t → ∞ [101, 64, 16, 69]. The requirements in (28) can be viewed from the perspective of bias-variance tradeoff: the first condition ensures that the iterates can reach the minimizer (controlled bias), and the second ensures that stochasticity does not prevent convergence (controlled variance). • Asymptotic normality. It is proved by [97] that for robust linear regression with fixed dimension p, under the choice ηt = t−1 , √ t (θt − θ∗ ) is asymptotically normal under some regularity conditions (but θt is not asymptotically efficient in general). Moreover, by averaging the iterates of SGD, [96] proved that even with a larger step size ηt ∝ t−α , α ∈ (1/2, 1), the averaged iterate θ̄ t = t−1 Pt s=1 θs is asymptotic efficient for robust linear regression. These strong results show that SGD with averaging performs as well as the MLE asymptotically, in addition to its computational efficiency. These classical results, however, fail to explain the effectiveness of SGD when dealing with nonconvex loss functions in deep learning. Admittedly, finding global minima of nonconvex functions is computationally infeasible in the worst case. Nevertheless, recent work [4, 32] bypasses the worst case scenario by focusing on losses incurred by over-parametrized deep learning models. In particular, they show that (stochastic) gradient descent converges linearly towards the global minimizer of `n(θ) as long as the neural network is sufficiently over-parametrized. This phenomenon is formalized below. Theorem 5 (Theorem 2 in [4]). Let {(yi, xi)}1≤i≤n be a training set satisfying mini,j:i6=j kxi −xjk2 ≥ δ 0. Consider fitting the data using a feed-forward neural network (1) with ReLU activations. Denote by L (resp. W) the depth (resp. width) of the network. 
Suppose that the neural network is sufficiently over-parametrized, i.e.,

\[
W \gtrsim \mathrm{poly}\Big(n, L, \frac{1}{\delta}\Big), \tag{29}
\]

⁷For results on consistency and asymptotic normality, we consider the case where, in each step of SGD, the stochastic gradient is computed using a fresh sample $(y, x)$ from $\mathcal{D}$. This allows us to view SGD as an optimization algorithm that minimizes the population loss $\ell(\theta)$.

⁸One example of such a condition is a constraint on the second moment of the gradients: $\mathbb{E}\,\|\nabla \mathcal{L}(x_i, y_i; \theta^t)\|_2^2 \le C_1 + C_2\,\|\theta^t - \theta^*\|_2^2$ for some $C_1, C_2 > 0$. See [16] for details.
  • 50. where poly means a polynomial function. Then with high probability, running SGD (27) with certain random initialization and properly chosen step sizes yields `n(θt ) ≤ ε in t log 1 ε iterations. Two notable features are worth mentioning: (1) first, the network under consideration is sufficiently over- parametrized (cf. (29)) in which the number of parameters is much larger than the number of samples, and (2) one needs to initialize the weight matrices to be in near-isometry such that the magnitudes of the hidden nodes do not blow up or vanish. In a nutshell, over-parametrization and random initialization together ensure that the loss function (26) has a benign landscape9 around the initial point, which in turn implies fast convergence of SGD iterates. There are certainly other challenges for vanilla SGD to train deep neural nets: (1) training algorithms are often implemented in GPUs, and therefore it is important to tailor the algorithm to the infrastructure, (2) the vanilla SGD might converge very slowly for deep neural networks, albeit good theoretical guarantees for well-behaved problems, and (3) the learning rates {ηt} can be difficult to tune in practice. To address the aforementioned challenges, three important variants of SGD, namely mini-batch SGD, momentum-based SGD, and SGD with adaptive learning rates are introduced. 6.1.1 Mini-batch SGD Modern computational infrastructures (e.g., GPUs) can evaluate the gradient on a number (say 64) of examples as efficiently as evaluating that on a single example. To utilize this advantage, mini-batch SGD with batch size K ≥ 1 forms the stochastic gradient through K random samples: θt+1 = θt − ηtG(θt ) with G(θt ) = 1 K K X k=1 ∇L f xik t ; θt , yik t , (30) where for each 1 ≤ k ≤ K, ik t is sampled uniformly from {1, 2, · · · , n}. Mini-batch SGD, which is an “interpolation” between gradient descent and stochastic gradient descent, achieves the best of both worlds: (1) using 1 K n samples to estimate the gradient, one effectively reduces the variance and hence accelerates the convergence, and (2) by taking the batch size K appropriately (say 64 or 128), the stochastic gradient G(θt ) can be efficiently computed using the matrix computation toolboxes on GPUs. 6.1.2 Momentum-based SGD While mini-batch SGD forms the foundation of training neural networks, it can sometimes be slow to converge due to its oscillation behavior [122]. Optimization community has long investigated how to accelerate the convergence of gradient descent, which results in a beautiful technique called momentum methods [95, 88]. Similar to gradient descent with moment, momentum-based SGD, instead of moving the iterate θt in the direction of the current stochastic gradient G(θt ), smooth the past (stochastic) gradients {G(θt )} to stabilize the update directions. Mathematically, let vt ∈ Rp be the direction of update in the tth iteration, i.e., θt+1 = θt − ηtvt . Here v0 = G(θ0 ) and for t = 1, 2, · · · vt = ρvt−1 + G(θt ) (31) with 0 ρ 1. A typical choice of ρ is 0.9. Notice that ρ = 0 recovers the mini-batch SGD (30), where no past information of gradients is used. A simple unrolling of vt reveals that vt is actually an exponential averaging of the past gradients, i.e., vt = Pt j=0 ρt−j G(θj ). Compared with vanilla mini-batch SGD, the inclusion of the momentum “smoothes” the oscillation direction and accumulates the persistent descent direction. 
We want to emphasize that the theoretical justification of momentum in the stochastic setting is not fully understood [63, 60].

⁹In [4], the loss function $\ell_n(\theta)$ satisfies the PL condition.
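A compact sketch combining the mini-batch update (30) with the momentum recursion (31) is given below; here grad_fn is a hypothetical user-supplied routine returning a stochastic gradient on a mini-batch, and setting rho = 0 recovers plain mini-batch SGD.

```python
import numpy as np

def sgd_momentum(theta0, grad_fn, data, lr=0.1, rho=0.9, batch_size=64, steps=1000, seed=0):
    """Mini-batch SGD with momentum:
    v_t = rho * v_{t-1} + G(theta_t),   theta_{t+1} = theta_t - lr * v_t."""
    rng = np.random.default_rng(seed)
    theta, v = theta0.copy(), np.zeros_like(theta0)
    n = len(data)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)       # sample a mini-batch
        g = grad_fn(theta, [data[i] for i in idx])      # stochastic gradient G(theta_t)
        v = rho * v + g                                 # accumulate past gradient directions
        theta = theta - lr * v
    return theta

# Example: least squares, per-sample gradient (a_i^T theta - b_i) a_i averaged over the batch.
A, b = np.random.randn(500, 10), np.random.randn(500)
g = lambda th, batch: np.mean([(A[i] @ th - b[i]) * A[i] for i in batch], axis=0)
theta_hat = sgd_momentum(np.zeros(10), g, list(range(500)), lr=0.05)
```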
6.1.3 SGD with adaptive learning rates

In optimization, preconditioning is often used to accelerate first-order optimization algorithms. In principle, one can apply this to SGD, which yields the following update rule:

\[
\theta^{t+1} = \theta^t - \eta_t P_t^{-1} G(\theta^t) \tag{32}
\]

with $P_t \in \mathbb{R}^{p\times p}$ being a preconditioner at the $t$-th step. Newton's method can be viewed as one type of preconditioning, where $P_t = \nabla^2 \ell(\theta^t)$. The advantages of preconditioning are two-fold: first, a good preconditioner reduces the condition number by changing the local geometry to be more homogeneous, which is amenable to fast convergence; second, a good preconditioner frees practitioners from laborious tuning of the step sizes, as is the case with Newton's method. AdaGrad, an adaptive gradient method proposed by [33], builds a preconditioner $P_t$ based on information from the past gradients:

\[
P_t = \Big\{ \mathrm{diag}\Big( \sum_{j=0}^{t} G(\theta^j)\, G(\theta^j)^\top \Big) \Big\}^{1/2}. \tag{33}
\]

Since we only require the diagonal part, this preconditioner (and its inverse) can be efficiently computed in practice. In addition, inspecting (32) and (33), one can see that AdaGrad adapts to the importance of each coordinate of the parameters by setting smaller learning rates for frequent features and larger learning rates for infrequent ones. In practice, one adds a small quantity $\delta > 0$ (say $10^{-8}$) to the diagonal entries to avoid singularity (numerical underflow). A notable drawback of AdaGrad is that the effective learning rate vanishes quickly during the learning process, because the historical sum of the gradients can only increase with time. RMSProp [52] is a popular remedy for this problem, which incorporates the idea of exponential averaging:

\[
P_t = \Big\{ \mathrm{diag}\Big( \rho P_{t-1} + (1-\rho)\, G(\theta^t)\, G(\theta^t)^\top \Big) \Big\}^{1/2}. \tag{34}
\]

Again, the decay parameter $\rho$ is usually set to 0.9. Later, Adam [65, 100] combined the momentum method with adaptive learning rates and has become the default training algorithm in many deep learning applications.

6.2 Easing numerical instability

For very deep neural networks or RNNs with long dependencies, training difficulties often arise when the values of nodes have very different magnitudes or when the gradients "vanish" or "explode" during back-propagation. Here we discuss three partial solutions to alleviate this problem.

6.2.1 ReLU activation function

One useful characteristic of the ReLU function is that its derivative is either 0 or 1, and the derivative remains 1 even for a large input. This is in sharp contrast with the standard sigmoid function $(1 + e^{-t})^{-1}$, which results in a very small derivative when the input has large magnitude. The consequence of small derivatives across many layers is that gradients tend to be "killed", which means that gradients become approximately zero in deep nets. The popularity of the ReLU activation function and its variants (e.g., leaky ReLU) is largely attributable to this reason. It has been well observed that the ReLU activation function has superior training performance over the sigmoid function [68, 79].

6.2.2 Skip connections

We have introduced skip connections in Section 3.3. Why are skip connections helpful for reducing numerical instability? This structure does not introduce a larger function space, since the identity map can also be represented with ReLU activations: $x = \sigma(x) - \sigma(-x)$.
One explanation is that skip connections ease the training / optimization process. Suppose that we have a general nonlinear function $F(x_\ell; \theta_\ell)$. With a skip connection, we represent the map as $x_{\ell+1} = x_\ell + F(x_\ell; \theta_\ell)$ instead. Now the gradient $\partial x_{\ell+1}/\partial x_\ell$ becomes

\[
\frac{\partial x_{\ell+1}}{\partial x_\ell} = I + \frac{\partial F(x_\ell;\theta_\ell)}{\partial x_\ell} \quad \text{instead of} \quad \frac{\partial F(x_\ell;\theta_\ell)}{\partial x_\ell}, \tag{35}
\]

where $I$ is an identity matrix. By the chain rule, a gradient update requires computing products of many components, e.g., $\frac{\partial x_L}{\partial x_1} = \prod_{\ell=1}^{L-1} \frac{\partial x_{\ell+1}}{\partial x_\ell}$, so it is desirable to keep the spectra (singular values) of each component $\frac{\partial x_{\ell+1}}{\partial x_\ell}$ close to 1. With skip connections, this is easily achieved if the parameters have small values; otherwise, it may not be achievable even with careful initialization and tuning. Notably, training neural nets with hundreds of layers is possible with the help of skip connections.

6.2.3 Batch normalization

Recall that in regression analysis, one often standardizes the design matrix so that the features have zero mean and unit variance. Batch normalization extends this standardization procedure from the input layer to all the hidden layers. Mathematically, fix a mini-batch of input data $\{(x_i, y_i)\}_{i\in\mathcal{B}}$, where $\mathcal{B} \subset [n]$. Let $h_i^{(\ell)}$ be the feature of the $i$-th example in the $\ell$-th layer ($\ell = 0$ corresponds to the input $x_i$). The batch normalization layer computes the normalized version of $h_i^{(\ell)}$ via the following steps:

\[
\mu := \frac{1}{|\mathcal{B}|} \sum_{i\in\mathcal{B}} h_i^{(\ell)}, \qquad
\sigma^2 := \frac{1}{|\mathcal{B}|} \sum_{i\in\mathcal{B}} \big(h_i^{(\ell)} - \mu\big)^2, \qquad
h_{i,\mathrm{norm}}^{(\ell)} := \frac{h_i^{(\ell)} - \mu}{\sigma}.
\]

Here all the operations are element-wise. In words, batch normalization computes the z-score of each feature over the mini-batch $\mathcal{B}$ and uses that as the input to subsequent layers. To make it more versatile, a typical batch normalization layer has two additional learnable parameters $\gamma^{(\ell)}$ and $\beta^{(\ell)}$ such that

\[
h_{i,\mathrm{new}}^{(\ell)} = \gamma^{(\ell)} \odot h_{i,\mathrm{norm}}^{(\ell)} + \beta^{(\ell)}.
\]

Again, $\odot$ denotes the element-wise multiplication. As can be seen, $\gamma^{(\ell)}$ and $\beta^{(\ell)}$ set the new feature $h_{i,\mathrm{new}}^{(\ell)}$ to have mean $\beta^{(\ell)}$ and standard deviation $\gamma^{(\ell)}$. The introduction of batch normalization makes the training of neural networks much easier and smoother. More importantly, it allows the neural nets to perform well over a large family of hyper-parameters, including the number of layers, the number of hidden units, etc. At test time, the batch normalization layer needs more care; for brevity we omit the details and refer to [58].

6.3 Regularization techniques

So far we have focused on training techniques that drive the empirical loss (26) small efficiently. Here we proceed to discuss common practices that improve the generalization power of trained neural nets.

6.3.1 Weight decay

One natural regularization idea is to add an $\ell_2$ penalty to the loss function. This regularization technique is known as weight decay in deep learning. We have seen one example in (9). For general deep neural nets, the loss to optimize is $\ell_n^\lambda(\theta) = \ell_n(\theta) + r_\lambda(\theta)$, where

\[
r_\lambda(\theta) = \lambda \sum_{\ell=1}^{L} \sum_{j,j'} \big(W_{j,j'}^{(\ell)}\big)^2.
\]

Note that the bias (intercept) terms are not penalized. If $\ell_n(\theta)$ is a least-squares loss, then regularization with weight decay gives precisely ridge regression. The penalty $r_\lambda(\theta)$ is a smooth function and thus can also be handled efficiently with back-propagation.
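As a sketch of how the penalty enters training, the gradient of $r_\lambda(\theta)$ with respect to each weight matrix is simply $2\lambda W^{(\ell)}$, so a gradient step on the penalized loss shrinks ("decays") the weights before applying the usual update; bias terms are left untouched, matching the convention above.

```python
import numpy as np

def sgd_step_weight_decay(Ws, grads, lr, lam):
    """One SGD step on l_n(theta) + r_lambda(theta).
    Ws: list of weight matrices; grads: gradients of the *unpenalized* loss l_n."""
    # d r_lambda / d W_l = 2 * lam * W_l, i.e. each weight is multiplied by
    # (1 - 2 * lr * lam) before the ordinary gradient step; biases are not decayed.
    return [W - lr * (g + 2.0 * lam * W) for W, g in zip(Ws, grads)]
```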
  • 53. 6.3.2 Dropout Dropout, introduced by [53], prevents overfitting by randomly dropping out subsets of features during train- ing. Take the l-th layer of the feed-forward neural network as an example. Instead of propagating all the features in h(`) for later computations, dropout randomly omits some of its entries by h (`) drop = h(`) mask` , where denotes element-wise multiplication as before, and mask` is a vector of Bernoulli variables with success probability p. It is sometimes useful to rescale the features h (`) inv drop = h (`) drop/p, which is called inverted dropout. During training, mask` are i.i.d. vectors across mini-batches and layers. However, when testing on fresh samples, dropout is disabled and the original features h(`) are used to compute the output label y. It has been nicely shown by [129] that for generalized linear models, dropout serves as adaptive regularization. In the simplest case of linear regression, it is equivalent to `2 regularization. Another possible way to understand the regularization effect of dropout is through the lens of bagging [45]. Since different mini-batches has different masks, dropout can be viewed as training a large ensemble of classifiers at the same time, with a further constraint that the parameters are shared. Theoretical justification remains elusive. 6.3.3 Data augmentation Data augmentation is a technique of enlarging the dataset when we have knowledge about invariance structure of data. It implicitly increases the sample size and usually regularizes the model effectively. For example, in image classification, we have strong prior knowledge about what invariance properties a good classifier should possess. The label of an image should not be affected by translation, rotation, flipping, and even crops of the image. Hence one can augment the dataset by randomly translating, rotating and cropping the images in the original dataset. Formally, during training we want to minimize the loss `n(θ) = P i L(f(xi; θ), yi) w.r.t. parameters θ, and we know a priori that certain transformation T ∈ T where T : Rd → Rd (e.g., affine transformation) should not change the category / label of a training sample. In principle, if computation costs were not a consideration, we could convert this knowledge to a constraint fθ(Txi) = fθ(xi), ∀ T ∈ T in the minimization formulation. Instead of solving a constrained optimization problem, data augmentation enlarges the training dataset by sampling T ∈ T and generating new data {(Txi, yi)}. In this sense, data augmentation induces invariance properties through sampling, which results in a much bigger dataset than the original one. 7 Generalization power Section 6 has focused on the in-sample / training error obtained via SGD, but this alone does not guarantee good performance with respect to the out-of-sample / test error. The gap between the in-sample error and the out-of-sample error, namely the generalization gap, has been the focus of statistical learning theory since its birth; see [112] for an excellent introduction to this topic. While understanding the generalization power of deep neural nets is difficult [135, 99], we sample re- cent endeavors in this section. From a high level point of view, these approaches can be divided into two categories, namely algorithm-independent controls and algorithm-dependent controls. More specifically, algorithm-independent controls focus solely on bounding the complexity of the function class represented by certain deep neural networks. 
In contrast, algorithm-dependent controls take into account the algorithm (e.g., SGD) used to train the neural network.

7.1 Algorithm-independent controls: uniform convergence

The key to algorithm-independent controls is the notion of complexity of the function class parametrized by certain neural networks. Informally, as long as the complexity is not too large, the generalization gap of any function in the function class is well-controlled. However, the standard complexity measure (e.g., the VC dimension [127]) is at least proportional to the number of weights in a neural network [5, 112], which fails to explain the practical success of deep learning. The caveat here is that the function class under consideration
• 54. is all the functions realized by certain neural networks, with no restrictions on the size of the weights at all. On the other hand, for the class of linear functions with bounded norm, i.e., $\{x \mapsto w^\top x \mid \|w\|_2 \le M\}$, it is well understood that the complexity of this function class (measured in terms of the empirical Rademacher complexity) with respect to a random sample $\{x_i\}_{1 \le i \le n}$ is upper bounded by $\max_i \|x_i\|_2 \, M / \sqrt{n}$, which is independent of the number of parameters in $w$. This motivates researchers to investigate the complexity of norm-controlled deep neural networks [89, 14, 43, 74] (such attempts have been made in the seminal work [13]). Setting the stage, we introduce a few necessary notations and facts. The key object under study is the function class parametrized by the following fully-connected neural network with depth $L$:
$$\mathcal{F}_L \triangleq \big\{ x \mapsto W_L \,\sigma\big(W_{L-1}\,\sigma(\cdots W_2\,\sigma(W_1 x))\big) \;\big|\; (W_1, \ldots, W_L) \in \mathcal{W} \big\}. \qquad (36)$$
Here $(W_1, W_2, \ldots, W_L) \in \mathcal{W}$ represents a certain constraint on the parameters. For instance, one can restrict the Frobenius norm of each parameter $W_l$ through the constraint $\|W_l\|_F \le M_F(l)$, where $M_F(l)$ is some positive quantity. With regard to the complexity measure, it is standard to use the Rademacher complexity to control the capacity of the function class of interest.

Definition 1 (Empirical Rademacher complexity). The empirical Rademacher complexity of a function class $\mathcal{F}$ w.r.t. a dataset $S \triangleq \{x_i\}_{1 \le i \le n}$ is defined as
$$R_S(\mathcal{F}) = \mathbb{E}_{\varepsilon}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i) \Big], \qquad (37)$$
where $\varepsilon \triangleq (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)$ is composed of i.i.d. Rademacher random variables, i.e., $\mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = 1/2$.

In words, the Rademacher complexity measures the ability of the function class to fit the random noise represented by $\varepsilon$. Intuitively, a function class with larger Rademacher complexity is more prone to overfitting. We now formalize the connection between the empirical Rademacher complexity and the out-of-sample error; see Chapter 24 in [112].

Theorem 6. Assume that for all $f \in \mathcal{F}$ and all $(y, x)$ we have $|\mathcal{L}(f(x), y)| \le 1$. In addition, assume that for any fixed $y$, the univariate function $\mathcal{L}(\cdot, y)$ is Lipschitz with constant 1. Then with probability at least $1 - \delta$ over the sample $S \triangleq \{(y_i, x_i)\}_{1 \le i \le n} \stackrel{\text{i.i.d.}}{\sim} \mathcal{D}$, one has for all $f \in \mathcal{F}$
$$\underbrace{\mathbb{E}_{(y,x)\sim\mathcal{D}}\big[\mathcal{L}(f(x), y)\big]}_{\text{out-of-sample error}} \;\le\; \underbrace{\frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f(x_i), y_i)}_{\text{in-sample error}} \;+\; 2 R_S(\mathcal{F}) + 4\sqrt{\frac{\log(4/\delta)}{n}}.$$
In English, the generalization gap of any function $f$ that lies in $\mathcal{F}$ is well-controlled as long as the Rademacher complexity of $\mathcal{F}$ is not too large. With this connection in place, we single out the following complexity bound.

Theorem 7 (Theorem 1 in [43]). Consider the function class $\mathcal{F}_L$ in (36), where each parameter $W_l$ has Frobenius norm at most $M_F(l)$. Further suppose that the element-wise activation function $\sigma(\cdot)$ is 1-Lipschitz and positive-homogeneous (i.e., $\sigma(c \cdot x) = c\,\sigma(x)$ for all $c \ge 0$). Then the empirical Rademacher complexity (37) w.r.t. $S \triangleq \{x_i\}_{1 \le i \le n}$ satisfies
$$R_S(\mathcal{F}_L) \;\le\; \max_i \|x_i\|_2 \cdot \frac{4\sqrt{L}\,\prod_{l=1}^{L} M_F(l)}{\sqrt{n}}. \qquad (38)$$
The upper bound (38) on the empirical Rademacher complexity is in a similar vein to that for linear functions with bounded norm, i.e., $\max_i \|x_i\|_2 \, M / \sqrt{n}$, where $\sqrt{L}\,\prod_{l=1}^{L} M_F(l)$ plays the role of $M$ in the latter case. Moreover, ignoring the factor $\sqrt{L}$, the upper bound (38) does not depend on the size of the network in an explicit way if $M_F(l)$ concentrates sharply around 1. This reveals that the capacity of the
• 55. neural network is well-controlled, regardless of the number of parameters, as long as the Frobenius norms of the parameters are bounded. Extensions to other norm constraints, e.g., spectral norm constraints and path norm constraints, have been considered by [89, 14, 74, 67, 34]. This line of work improves upon traditional capacity analysis of neural networks in the over-parametrized setting, because the upper bounds derived are often size-independent. Having said this, two important remarks are in order: (1) the upper bounds (e.g., $\prod_{l=1}^{L} M_F(l)$) involve an implicit dependence on the size of the weight matrices and the depth of the neural network, which is hard to characterize; (2) the upper bound on the Rademacher complexity offers a uniform bound over all functions in the function class, which is a purely statistical result. However, it stays silent about how and why standard training algorithms like SGD can obtain a function whose parameters have small norms.

7.2 Algorithm-dependent controls

In this subsection, we bring computational thinking into statistics and investigate the role of algorithms in the generalization power of deep learning. The consideration of algorithms is quite natural and well motivated: (1) local/global minima reached by different algorithms can exhibit totally different generalization behaviors due to extreme nonconvexity, which marks a huge difference from traditional models; (2) the effective capacity of neural nets is possibly not large, since a particular algorithm does not explore the entire parameter space. These observations demonstrate that, on top of the complexity of the function class, the inherent properties of the algorithm we use play an important role in the generalization ability of deep learning. In what follows, we survey three different ways to obtain upper bounds on the generalization error by exploiting properties of the algorithms.

7.2.1 Mean field view of neural nets

As we have emphasized, modern deep learning models are highly over-parametrized. A line of work [83, 117, 105, 25, 82, 61] approximates the ensemble of weights by an asymptotic limit as the number of hidden units tends to infinity, so that the dynamics of SGD can be studied via certain partial differential equations. More specifically, let $\hat f(x; \theta) = N^{-1} \sum_{i=1}^{N} \sigma(\theta_i^\top x)$ be a function given by a one-hidden-layer neural net with $N$ hidden units, where $\sigma(\cdot)$ is the ReLU activation function and the parameters $\theta \triangleq [\theta_1, \ldots, \theta_N]^\top \in \mathbb{R}^{N \times d}$ are suitably randomly initialized. Consider the regression setting where we want to minimize the population risk $R_N(\theta) = \mathbb{E}[(y - \hat f(x; \theta))^2]$ over the parameters $\theta$. A key observation is that this population risk depends on the parameters $\theta$ only through their empirical distribution, i.e., $\hat\rho^{(N)} = N^{-1} \sum_{i=1}^{N} \delta_{\theta_i}$, where $\delta_{\theta_i}$ is a point mass at $\theta_i$. This motivates us to express $R_N(\theta)$ equivalently as $R(\hat\rho^{(N)})$, where $R(\cdot)$ is a functional that maps distributions to real numbers. Running SGD for $R_N(\cdot)$—in a suitable scaling limit—results in a gradient flow on the space of distributions endowed with the Wasserstein metric that minimizes $R(\cdot)$. It turns out that the empirical distribution $\hat\rho_k^{(N)}$ of the parameters after $k$ steps of SGD is well approximated by this gradient flow, as long as the neural net is over-parametrized (i.e., $N \gg d$) and the number of steps is not too large. In particular, [83] have shown that under certain regularity conditions,
$$\sup_{k \in [0, T/\varepsilon] \cap \mathbb{N}} \Big| R\big(\hat\rho_k^{(N)}\big) - R\big(\rho_{k\varepsilon}\big) \Big| \;\lesssim\; e^{T} \sqrt{\frac{1}{N} \vee \varepsilon} \cdot \sqrt{d + \log\frac{N}{\varepsilon}},$$
where $\varepsilon > 0$ is a proxy for the step size of SGD and $\rho_{k\varepsilon}$ is the distribution of the gradient flow at time $k\varepsilon$. In words, the out-of-sample error under $\theta^k$ generated by SGD is well approximated by that of $\rho_{k\varepsilon}$. Viewing the optimization problem from this distributional aspect greatly simplifies the problem conceptually, as the complicated optimization problem is passed to its limit version—for this reason, this analytical approach is called the mean field perspective. In particular, [83] further demonstrated that in some simple settings, the out-of-sample error $R(\rho_{k\varepsilon})$ of the distributional limit can be fully characterized. Nevertheless, how well $R(\rho_{k\varepsilon})$ performs and how fast it converges remain largely open for general problems.

7.2.2 Stability

A second way to understand the generalization ability of deep learning is through the stability of SGD. An algorithm is considered stable if a slight change of the input does not alter the output much. It has long been
• 56. observed that a stable algorithm has a small generalization gap; examples include $k$-nearest neighbors [102, 29], bagging [18, 19], etc. The precise connection between stability and the generalization gap is stated by [17, 113]. In what follows, we formalize the idea of stability and its connection with the generalization gap. Let $\mathcal{A}$ denote an algorithm (possibly randomized) which takes a sample $S \triangleq \{(y_i, x_i)\}_{1 \le i \le n}$ of size $n$ and returns an estimated parameter $\hat\theta \triangleq \mathcal{A}(S)$. Following [49], we have the following definition of stability.

Definition 2. An algorithm (possibly randomized) $\mathcal{A}$ is $\varepsilon$-uniformly stable with respect to the loss function $\mathcal{L}(\cdot, \cdot)$ if for all datasets $S, S'$ of size $n$ which differ in at most one example, one has
$$\sup_{x, y}\; \mathbb{E}_{\mathcal{A}}\big[ \mathcal{L}\big(f(x; \mathcal{A}(S)), y\big) - \mathcal{L}\big(f(x; \mathcal{A}(S')), y\big) \big] \;\le\; \varepsilon.$$
Here the expectation is taken w.r.t. the randomness in the algorithm $\mathcal{A}$, and $\varepsilon$ might depend on $n$. The loss function $\mathcal{L}(\cdot, \cdot)$ takes an example (say $(x, y)$) and the estimated parameter (say $\mathcal{A}(S)$) as inputs and outputs a real value.

Surprisingly, an $\varepsilon$-uniformly stable algorithm incurs a small generalization gap in expectation, which is stated in the following lemma.

Lemma 1 (Theorem 2.2 in [49]). Let $\mathcal{A}$ be $\varepsilon$-uniformly stable. Then the expected generalization gap is no larger than $\varepsilon$, i.e.,
$$\mathbb{E}_{\mathcal{A}, S}\Big[ \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f(x_i; \mathcal{A}(S)), y_i\big) - \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}\big(f(x; \mathcal{A}(S)), y\big)\big] \Big] \;\le\; \varepsilon. \qquad (39)$$
With Lemma 1 in hand, it suffices to prove a stability bound for specific algorithms. It turns out that the SGD introduced in Section 6 is uniformly stable when solving smooth nonconvex functions.

Theorem 8 (Theorem 3.12 in [49]). Assume that for any fixed $(y, x)$, the loss function $\mathcal{L}(f(x;\theta), y)$, viewed as a function of $\theta$, is $L$-Lipschitz and $\beta$-smooth. Consider running SGD on the empirical loss function with decaying step size $\alpha_t \le c/t$, where $c$ is some small absolute constant. Then SGD is uniformly stable with
$$\varepsilon \;\lesssim\; \frac{T^{1 - \frac{1}{\beta c + 1}}}{n},$$
where we have ignored the dependence on $\beta$, $c$ and $L$.

Theorem 8 reveals that SGD operating on nonconvex loss functions is indeed uniformly stable as long as the number of steps $T$ is not too large compared with $n$. This, together with Lemma 1, demonstrates the generalization ability of SGD in expectation. Nevertheless, two important limitations are worth mentioning. First, Lemma 1 provides an upper bound on the out-of-sample error in expectation, but ideally, instead of an on-average guarantee under $\mathbb{E}_{\mathcal{A}, S}$, we would like to have a high-probability guarantee as in the convex case [37]. Second, controlling the generalization gap alone is not enough to achieve a small out-of-sample error, since it is unclear whether SGD can achieve a small training error within $T$ steps.

7.2.3 Implicit regularization

In the presence of over-parametrization (number of parameters larger than the sample size), conventional wisdom informs us that we should apply some regularization techniques (e.g., $\ell_1$ / $\ell_2$ regularization) so that the model will not overfit the data. However, in practice, neural networks without explicit regularization generalize well. This phenomenon motivates researchers to look at the regularization effects introduced by training algorithms (e.g., SGD) in this over-parametrized regime. While there might exist multiple, if not infinitely many, global minima of the empirical loss (26), it is possible that practical algorithms tend to converge to solutions with better generalization power. Take the underdetermined linear system $X\theta = y$ as a starting point. Here $X \in \mathbb{R}^{n \times p}$ and $\theta \in \mathbb{R}^p$ with $p$ much larger than $n$.
Running gradient descent on the loss $\frac{1}{2}\|X\theta - y\|_2^2$ from the origin (i.e., $\theta^0 = 0$) results in the solution with the minimum Euclidean norm; that is, GD converges to
$$\min_{\theta \in \mathbb{R}^p} \|\theta\|_2 \qquad \text{subject to} \qquad X\theta = y.$$
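As a quick sanity check of this claim, the following NumPy sketch runs gradient descent from the origin on a random underdetermined least-squares problem and compares the result with the minimum-norm solution given by the pseudoinverse. The problem sizes, step size, and iteration count are arbitrary choices for the illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 100                       # underdetermined: p >> n
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)

    theta = np.zeros(p)                  # start gradient descent from the origin
    eta = 1e-3                           # small, hand-picked step size
    for _ in range(5000):
        grad = X.T @ (X @ theta - y)     # gradient of 0.5 * ||X theta - y||_2^2
        theta -= eta * grad

    theta_min_norm = np.linalg.pinv(X) @ y          # minimum l2-norm interpolating solution
    print(np.linalg.norm(X @ theta - y))            # ~0: theta interpolates the data
    print(np.linalg.norm(theta - theta_min_norm))   # ~0: GD found the min-norm solution

Because the iterates started at zero never leave the row space of X, the interpolating solution they converge to is exactly the minimum-norm one, matching the display above.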
• 57. In words, without any $\ell_2$ regularization in the loss function, gradient descent automatically finds the solution with the least $\ell_2$ norm. This phenomenon, often called implicit regularization, has not only been observed empirically in training neural networks, but has also been understood theoretically in some simplified cases, e.g., logistic regression with separable data. In logistic regression, given a training set $\{(y_i, x_i)\}_{1 \le i \le n}$ with $x_i \in \mathbb{R}^p$ and $y_i \in \{1, -1\}$, one aims to fit a logistic regression model by solving the following program:
$$\min_{\theta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i x_i^\top \theta\big). \qquad (40)$$
Here, $\ell(u) \triangleq \log(1 + e^{-u})$ denotes the logistic loss. Further assume that the data are separable, i.e., there exists $\theta^* \in \mathbb{R}^p$ such that $y_i {\theta^*}^\top x_i > 0$ for all $i$. Under this condition, the loss function (40) can be made arbitrarily close to zero for certain $\theta$ with $\|\theta\|_2 \to \infty$. What happens when we minimize (40) using gradient descent? [119] uncovers a striking phenomenon.

Theorem 9 (Theorem 3 in [119]). Consider the logistic regression (40) with separable data. If we run GD
$$\theta^{t+1} = \theta^t - \eta \, \frac{1}{n} \sum_{i=1}^{n} y_i x_i \, \ell'\big(y_i x_i^\top \theta^t\big)$$
from any initialization $\theta^0$ with an appropriate step size $\eta > 0$, then the normalized $\theta^t$ converges to a solution with the maximum $\ell_2$ margin. That is,
$$\lim_{t \to \infty} \frac{\theta^t}{\|\theta^t\|_2} = \frac{\hat\theta}{\|\hat\theta\|_2}, \qquad (41)$$
where $\hat\theta$ is the solution to the hard-margin support vector machine:
$$\hat\theta \triangleq \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_2, \qquad \text{subject to } y_i x_i^\top \theta \ge 1 \text{ for all } 1 \le i \le n. \qquad (42)$$
The above theorem reveals that gradient descent, when solving logistic regression with separable data, implicitly regularizes the iterates towards the $\ell_2$ max-margin vector (cf. (41)), without any explicit regularization as in (42). Similar results have been obtained by [62]. In addition, [47] studied algorithms other than gradient descent and showed that coordinate descent produces a solution with the maximum $\ell_1$ margin. Moving beyond logistic regression, which can be viewed as a one-layer neural net, the theoretical understanding of implicit regularization in deeper neural networks is still limited; see [48] for an illustration in deep linear convolutional neural networks.

8 Discussion

Due to space limitations, we have omitted several important deep learning models; notable examples include deep reinforcement learning [86], deep probabilistic graphical models [109], variational autoencoders [66], transfer learning [133], etc. Apart from the modeling aspect, interesting theories on generative adversarial networks [10, 11], recurrent neural networks [3], and connections with kernel methods [59, 9] are also emerging. We have also omitted the inverse-problem view of deep learning, where the data are assumed to be generated from a certain neural net and the goal is to recover the weights in the NN with as few examples as possible. Various algorithms (e.g., GD with spectral initialization) have been shown to recover the weights successfully in some simplified settings [136, 118, 42, 87, 23, 39]. In the end, we identify a few important directions for future research.

• New characterization of data distributions. The success of deep learning relies on its power of efficiently representing complex functions relevant to real data. Comparatively, classical methods often have optimal guarantees if a problem has a certain known structure, such as smoothness, sparsity, or low-rankness [121, 31, 20, 24], but they are insufficient for complex data such as images. How to characterize high-dimensional real data in a way that frees us from known barriers, such as the curse of dimensionality, is an interesting open question.
• 58. • Understanding various computational algorithms for deep learning. As we have emphasized throughout this survey, computational algorithms (e.g., variants of SGD) play a vital role in the success of deep learning. They allow fast training of deep neural nets and probably contribute towards the good generalization behavior of deep learning in practice. Understanding these computational algorithms and devising better ones are crucial components in understanding deep learning.

• Robustness. It has been well documented that DNNs are sensitive to small adversarial perturbations that are indistinguishable to humans [124]. This raises serious safety issues once we deploy deep learning models in applications such as self-driving cars, healthcare, etc. It is therefore crucial to refine current training practice to enhance robustness in a principled way [116].

• Low SNRs. Arguably, for image data and audio data, where the signal-to-noise ratio (SNR) is high, deep learning has achieved great success. In many other statistical problems, the SNR may be very low. For example, in financial applications, the firm characteristics and covariates may only explain a small part of the financial returns; in healthcare systems, the uncertainty of an illness may not be predicted well from a patient's medical history. How to adapt deep learning models to excel at such tasks is an interesting direction to pursue.

Acknowledgements

J. Fan is supported in part by the NSF grants DMS-1712591 and DMS-1662139, the NIH grant R01-GM072611 and the ONR grant N00014-19-1-2120. We thank Ruying Bao, Yuxin Chen, Chenxi Liu, Weijie Su, Qingcan Wang and Pengkun Yang for helpful comments and discussions.

References

[1] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Reza Abbasi-Asl, Yuansi Chen, Adam Bloniarz, Michael Oliver, Ben DB Willmore, Jack L Gallant, and Bin Yu. The deeptune framework for modeling and characterizing neurons in visual cortex area v4. bioRxiv, page 465534, 2018.

[3] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD learn recurrent neural networks with provable generalization? ArXiv e-prints, abs/1902.01028, 2019.

[4] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[5] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

[6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[7] Vladimir I Arnold. On functions of three variables. Collected Works: Representations of Functions, Celestial Mechanics and KAM Theory, 1957–1965, pages 5–8, 2009.

[8] Sanjeev Arora and Boaz Barak. Computational complexity: A modern approach. Cambridge University Press, 2009.

[9] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
  • 59. [10] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232. JMLR. org, 2017. [11] Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018. [12] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993. [13] Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE transactions on Information Theory, 44(2):525–536, 1998. [14] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6240–6249. Curran Associates, Inc., 2017. [15] Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Technical report, Technical report, 2017. [16] Léon Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142, 1998. [17] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002. [18] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996. [19] Leo Breiman et al. Heuristics of instability and stabilization in model selection. The annals of statistics, 24(6):2350–2383, 1996. [20] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix comple- tion. arXiv preprint arXiv:0903.1476, 2009. [21] Chensi Cao, Feng Liu, Hai Tan, Deshou Song, Wenjie Shu, Weizhong Li, Yiming Zhou, Xiaochen Bo, and Zhi Xie. Deep learning and its applications in biomedicine. Genomics, proteomics bioinformatics, 16(1):17–32, 2018. [22] Tianqi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018. [23] Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, pages 1–33, 2019. [24] Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Noisy matrix completion: Un- derstanding statistical guarantees for convex relaxation via nonconvex optimization. arXiv preprint arXiv:1902.07698, 2019. [25] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems, pages 3040– 3050, 2018. [26] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. [27] R Dennis Cook et al. Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1):1–26, 2007. 31
  • 60. [28] Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature medicine, 24(9):1342, 2018. [29] Luc Devroye and Terry Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979. [30] David L Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS math challenges lecture, 1(2000):32, 2000. [31] David L Donoho and Jain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. biometrika, 81(3):425–455, 1994. [32] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018. [33] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011. [34] Weinan E, Chao Ma, and Qingcan Wang. A priori estimates of the population risk for residual networks. arXiv preprint arXiv:1903.02154, 2019. [35] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016. [36] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001. [37] Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algo- rithms with nearly optimal rate. arXiv preprint arXiv:1902.10710, 2019. [38] Jerome H Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American statistical Association, 76(376):817–823, 1981. [39] Haoyu Fu, Yuejie Chi, and Yingbin Liang. Local geometry of one-hidden-layer neural networks for logistic regression. arXiv preprint arXiv:1802.06463, 2018. [40] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pages 267– 285. Springer, 1982. [41] Chao Gao, Jiyi Liu, Yuan Yao, and Weizhi Zhu. Robust estimation and generative adversarial nets. arXiv preprint arXiv:1810.02030, 2018. [42] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547, 2018. [43] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017. [44] Gene H Golub and Charles F Van Loan. Matrix computations. JHU Press, 4 edition, 2013. [45] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. [46] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information pro- cessing systems, pages 2672–2680, 2014. [47] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018. 32
  • 61. [48] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9482– 9491, 2018. [49] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochas- tic gradient descent. arXiv preprint arXiv:1509.01240, 2015. [50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recogni- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual net- works. In European conference on computer vision, pages 630–645. Springer, 2016. [52] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 2012. [53] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdi- nov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012. [54] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735– 1780, 1997. [55] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected con- volutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. [56] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106–154, 1962. [57] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016. [58] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. [59] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and gener- alization in neural networks. In Advances in neural information processing systems, pages 8580–8589, 2018. [60] Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017. [61] Adel Javanmard, Marco Mondelli, and Andrea Montanari. Analysis of a two-layer neural network via displacement convexity. arXiv preprint arXiv:1901.01375, 2019. [62] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018. [63] Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insufficiency of ex- isting momentum schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018. [64] Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952. [65] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 33
  • 62. [66] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [67] Jason M Klusowski and Andrew R Barron. Risk bounds for high-dimensional ridge function combina- tions including neural networks. arXiv preprint arXiv:1607.01434, 2016. [68] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo- lutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. [69] Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science Business Media, 2003. [70] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015. [71] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [72] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6391–6401, 2018. [73] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991. [74] Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter generalization bound for deep neural networks: Cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159, 2018. [75] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015. [76] Tengyuan Liang. How well can generative adversarial networks (GAN) learn densities: A nonparametric view. arXiv preprint arXiv:1712.08244, 2017. [77] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017. [78] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013. [79] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3, 2013. [80] VE Maiorov and Ron Meir. On the near optimality of the stochastic approximation of smooth functions by neural networks. Advances in Computational Mathematics, 13(1):79–103, 2000. [81] Yuly Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85(1):98–109, 1996. [82] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019. [83] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018. [84] Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning functions: when is deep better than shallow. arXiv preprint arXiv:1603.00988, 2016. [85] Hrushikesh N Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996. [86] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015. 34
  • 63. [87] Marco Mondelli and Andrea Montanari. On the connection between learning two-layers neural networks and tensor decomposition. arXiv preprint arXiv:1802.07301, 2018. [88] Yurii E Nesterov. A method for solving the convex programming problem with convergence rate o (1/kˆ 2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983. [89] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015. [90] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016. [91] Ian Parberry. Circuit complexity and neural networks. MIT press, 1994. [92] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. [93] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143–195, 1999. [94] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017. [95] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Compu- tational Mathematics and Mathematical Physics, 4(5):1–17, 1964. [96] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992. [97] Boris Teodorovich Polyak and Yakov Zalmanovich Tsypkin. Adaptive estimation algorithms: conver- gence, optimality, stability. Avtomatika i Telemekhanika, (3):71–84, 1979. [98] Christopher Poultney, Sumit Chopra, Yann LeCun, et al. Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, pages 1137–1144, 2007. [99] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do cifar-10 classifiers generalize to cifar-10? arXiv preprint arXiv:1806.00451, 2018. [100] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. 2018. [101] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. [102] William H Rogers and Terry J Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pages 506–514, 1978. [103] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502, 2017. [104] Yaniv Romano, Matteo Sesia, and Emmanuel J Candès. Deep knockoffs. arXiv preprint arXiv:1811.06687, 2018. [105] Grant M Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymp- totic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018. 35
  • 64. [106] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985. [107] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Ima- geNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. [108] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association, 2014. [109] Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. In Artificial intelligence and statistics, pages 448–455, 2009. [110] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Im- proved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016. [111] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. arXiv preprint arXiv:1708.06633, 2017. [112] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [113] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010. [114] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017. [115] Bernard W Silverman. Density estimation for statistics and data analysis. Chapman Hall, CRC, 1998. [116] Chandan Singh, W James Murdoch, and Bin Yu. Hierarchical interpretations for neural network predictions. arXiv preprint arXiv:1806.05337, 2018. [117] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018. [118] Mahdi Soltanolkotabi. Learning relus via gradient descent. In Advances in Neural Information Pro- cessing Systems, pages 2007–2017, 2017. [119] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822– 2878, 2018. [120] David A Sprecher. On the structure of continuous functions of several variables. Transactions of the American Mathematical Society, 115:340–355, 1965. [121] Charles J Stone. Optimal global rates of convergence for nonparametric regression. The annals of statistics, pages 1040–1053, 1982. [122] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013. 36
  • 65. [123] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015. [124] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. [125] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016. [126] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. [127] VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability Its Applications, 16(2):264–280, 1971. [128] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and com- posing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008. [129] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances in neural information processing systems, pages 351–359, 2013. [130] E Weinan, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high- dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017. [131] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4148–4158. Curran Associates, Inc., 2017. [132] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. [133] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014. [134] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015. [135] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016. [136] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4140–4149. JMLR. org, 2017. 37
• 66. Top Deep Learning Interview Questions You Must Know

Kurt | Last updated on May 22, 2019

Deep Learning is one of the hottest topics of 2018-19, and for a good reason. There have been so many advancements in the industry that the time has come when machines or computer programs are actually replacing humans. Artificial Intelligence is going to create 2.3 million jobs by 2020, and to crack those job interviews I have come up with a set of Deep Learning interview questions. I have divided this article into two sections:

Basic Deep Learning Interview Questions
Advanced Deep Learning Interview Questions

Basic Deep Learning Interview Questions

Q1. Differentiate between AI, Machine Learning and Deep Learning.

Artificial Intelligence is a technique which enables machines to mimic human behavior. Machine Learning is a subset of AI which uses statistical methods to enable machines to improve with experience. Deep Learning is a subset of ML which makes the computation of multi-layer neural networks feasible; it uses neural networks to simulate human-like decision making.

Q2. Do you think Deep Learning is better than Machine Learning? If so, why?

Though traditional ML algorithms solve a lot of our cases, they are not useful while working with high-dimensional data, that is, where we have a large number of inputs and outputs. For example, in the case of handwriting recognition, we have a large amount of input where we will have different types of inputs associated with different types of handwriting. The second major challenge is to tell the computer which features it should look for that will play an important role in predicting the outcome, as well as achieving better accuracy while doing so.

Q3. What is a Perceptron? And how does it work?

If we focus on the structure of a biological neuron, it has dendrites which are used to receive inputs. These inputs are summed in the cell body and, via the axon, passed on to the next biological neuron, as shown below.

Dendrite: Receives signals from other neurons
Cell Body: Sums all the inputs
Axon: Used to transmit signals to the other cells

Similarly, a perceptron receives multiple inputs, applies various transformations and functions, and provides an output. A perceptron is a linear model used for binary classification. It models a neuron which has a set of inputs, each of which is given a specific weight. The neuron computes some function on these weighted inputs and gives the output.
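To complement Q3, here is a minimal NumPy sketch of a perceptron: a weighted sum of the inputs plus a bias, followed by a step activation. The AND-gate weights and threshold below are our own illustrative choices, not taken from the article.

    import numpy as np

    def perceptron_predict(x, w, b):
        # weighted sum of the inputs plus bias, followed by a step activation
        return 1 if np.dot(w, x) + b > 0 else 0

    # a perceptron that behaves like a logical AND gate (illustrative weights)
    w, b = np.array([1.0, 1.0]), -1.5
    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(x, perceptron_predict(np.array(x, dtype=float), w, b))

Only the last input pattern produces 1, because only there does the weighted sum exceed the threshold set by the bias.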
• 67. Q4. What is the role of weights and bias?

For a perceptron, there can be one more input called bias. While the weights determine the slope of the classifier line, the bias allows us to shift the line towards the left or right. Normally the bias is treated as another weighted input with the input value x₀.

Q5. What are activation functions?

An activation function translates the inputs into outputs. It decides whether a neuron should be activated or not by calculating the weighted sum and further adding the bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron. There can be many activation functions, such as:

Linear or Identity
Unit or Binary Step
Sigmoid or Logistic
Tanh
ReLU
Softmax

Q6. Explain the learning of a perceptron.

1. Initialize the weights and threshold.
2. Provide the input and calculate the output.
3. Update the weights: w_j(t+1) = w_j(t) + (d − y)·x.
4. Repeat steps 2 and 3.

w_j(t+1) – Updated weight
w_j(t) – Old weight
d – Desired output
y – Actual output
x – Input

Q7. What is the significance of a cost/loss function?

A cost function is a measure of the accuracy of the neural network with respect to a given training sample and the expected output. It quantifies the performance of the neural network as a whole. In deep learning, the goal is to minimize the cost function; for that, we use the concept of gradient descent.

Q8. What is gradient descent?

Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient.

Stochastic Gradient Descent: Uses only a single training example to calculate the gradient and update the parameters.
Batch Gradient Descent: Calculates the gradient over the whole dataset and performs just one update at each iteration.
Mini-batch Gradient Descent: A variation of stochastic gradient descent where, instead of a single training example, a mini-batch of samples is used. It is one of the most popular optimization algorithms.

Q9. What are the benefits of mini-batch gradient descent?

It is more efficient than stochastic gradient descent.
It improves generalization by finding flat minima.
Mini-batches help to approximate the gradient of the entire training set, which helps us to avoid poor local minima.
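As a rough illustration of the mini-batch variant described in Q8 and Q9, the following NumPy sketch runs mini-batch gradient descent on a simple linear-regression loss. The synthetic data, batch size of 32, and learning rate are arbitrary choices for the example; using a batch size of 1 would give stochastic gradient descent, and using the full dataset would give batch gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ true_w + 0.1 * rng.normal(size=1000)

    w, lr, batch_size = np.zeros(5), 0.1, 32
    for epoch in range(20):
        idx = rng.permutation(len(y))                      # shuffle once per epoch
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]          # one mini-batch of samples
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient on the mini-batch only
            w -= lr * grad                                 # one parameter update per mini-batch
    print(w)                                               # close to true_w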
• 68. Q10. What are the steps for using a gradient descent algorithm?

Initialize random weights and biases.
Pass an input through the network and get values from the output layer.
Calculate the error between the actual value and the predicted value.
Go to each neuron which contributes to the error and change its respective values to reduce the error.
Reiterate until you find the best weights for the network.

Q11. Create a gradient descent in Python.

The snippet below assumes Theano shared variables weights_hidden, weights_output, bias_hidden, bias_output and a symbolic cost have already been defined.

    import theano.tensor as T

    params = [weights_hidden, weights_output, bias_hidden, bias_output]

    def sgd(cost, params, lr=0.05):
        grads = T.grad(cost=cost, wrt=params)
        updates = []
        for p, g in zip(params, grads):
            updates.append([p, p - g * lr])
        return updates

    updates = sgd(cost, params)

Q12. What are the shortcomings of a single-layer perceptron?

Well, there are two major problems:
Single-layer perceptrons cannot classify non-linearly separable data points.
Complex problems that involve many parameters cannot be solved by single-layer perceptrons.

Q13. What is a Multi-Layer Perceptron?

A multilayer perceptron (MLP) is a deep, artificial neural network. It is composed of more than one perceptron. It consists of an input layer to receive the signal, an output layer that makes a decision or prediction about the input, and, in between those two, an arbitrary number of hidden layers that are the true computational engine of the MLP.

Q14. What are the different parts of a multi-layer perceptron?

Input Nodes: The input nodes provide information from the outside world to the network and are together referred to as the "Input Layer". No computation is performed in any of the input nodes; they just pass the information on to the hidden nodes.
Hidden Nodes: The hidden nodes perform computations and transfer information from the input nodes to the output nodes. A collection of hidden nodes forms a "Hidden Layer". While a network has only a single input layer and a single output layer, it can have zero or multiple hidden layers.
Output Nodes: The output nodes are collectively referred to as the "Output Layer" and are responsible for computations and for transferring information from the network to the outside world.

Q15. What is data normalization and why do we need it?

Data normalization is a very important preprocessing step, used to rescale values to fit in a specific range so as to ensure better convergence during backpropagation. In general, it boils down to subtracting the mean of each data point and dividing by its standard deviation.

These were some basic Deep Learning interview questions. Now, let's move on to some advanced ones.

Advanced Interview Questions

Q16. Which is better, deep networks or shallow ones? And why?

Both networks, be they shallow or deep, are capable of approximating any function. But what matters is how precise that network is in terms of getting the results. A shallow network works with only a few features, as it can't extract more. But a deep network goes deeper by computing efficiently and working on more features/parameters.

Q17. Why is weight initialization important in neural networks?

Weight initialization is one of the very important steps. A bad weight initialization can prevent a network from learning, while a good weight initialization helps in giving quicker convergence and a better overall error.
• 69. Biases can generally be initialized to zero. The rule for setting the weights is to be close to zero without being too small.

Q18. What's the difference between a feed-forward and a backpropagation neural network?

A feed-forward neural network is a type of neural network architecture where the connections are "fed forward", i.e., they do not form cycles. The term "feed-forward" is also used when you input something at the input layer and it travels from the input to the hidden layer and from the hidden to the output layer. Backpropagation is a training algorithm consisting of two steps:
Feed-forward the values.
Calculate the error and propagate it back to the earlier layers.
So, to be precise, forward propagation is part of the backpropagation algorithm but comes before back-propagating.

Q19. What are hyperparameters? Name a few used in any neural network.

Hyperparameters are the variables which determine the network structure (e.g., the number of hidden units) and the variables which determine how the network is trained (e.g., the learning rate). Hyperparameters are set before training. Examples include:
Number of hidden layers
Network weight initialization
Activation function
Learning rate
Momentum
Number of epochs
Batch size

Q20. Explain the different hyperparameters related to the network and to training.

Network hyperparameters
The number of hidden layers: Many hidden units within a layer, combined with regularization techniques, can increase accuracy. A smaller number of units may cause underfitting.
Network weight initialization: Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. Mostly, a uniform distribution is used.
Activation function: Activation functions are used to introduce nonlinearity into models, which allows deep learning models to learn nonlinear prediction boundaries.

Training hyperparameters
Learning rate: The learning rate defines how quickly a network updates its parameters. A low learning rate slows down the learning process but converges smoothly; a larger learning rate speeds up the learning but may not converge.
Momentum: Momentum helps to choose the direction of the next step using knowledge of the previous steps. It helps to prevent oscillations. A typical choice of momentum is between 0.5 and 0.9.
Number of epochs: The number of epochs is the number of times the whole training data is shown to the network while training. Increase the number of epochs until the validation accuracy starts decreasing even while the training accuracy is increasing (overfitting).
Batch size: The mini-batch size is the number of sub-samples given to the network after which a parameter update happens. A good default for the batch size might be 32; also try 64, 128, 256, and so on.

Q21. What is Dropout?

Dropout is a regularization technique used to avoid overfitting and thus increase generalizing power. Generally, we should use a small dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability that is too low has minimal effect, and a value that is too high results in under-learning by the network. Use a larger network: you are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.

Q22. In training a neural network, you notice that the loss does not decrease in the first few epochs. What could be the reason?

The reasons for this could be:
The learning rate is low
The regularization parameter is high
The optimizer is stuck at a local minimum
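To see the first reason listed in Q22 in action, here is a tiny NumPy experiment, with synthetic data and hand-picked step sizes of our own choosing, comparing a learning rate that is too low with a moderate one on the same least-squares problem.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = X @ rng.normal(size=10)

    def train(lr, steps=200):
        w = np.zeros(10)
        for _ in range(steps):
            w -= lr * 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        return np.mean((X @ w - y) ** 2)               # final training loss

    print(train(lr=1e-5))   # learning rate too low: the loss has barely moved
    print(train(lr=1e-1))   # a moderate learning rate: the loss is close to zero

With the tiny learning rate, the loss after 200 steps is still close to its initial value, which is exactly the "loss does not decrease" symptom described in Q22.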
• 70. Q23. Name a few deep learning frameworks.

TensorFlow
Caffe
The Microsoft Cognitive Toolkit/CNTK
Torch/PyTorch
MXNet
Chainer
Keras

Q24. What are Tensors?

Tensors are the de facto standard for representing data in deep learning. They are just multidimensional arrays, which allow you to represent data having higher dimensions. In general, in deep learning you deal with high-dimensional data sets, where the dimensions refer to the different features present in the data set.

Q25. List a few advantages of TensorFlow.

It has platform flexibility.
It is easily trainable on CPU as well as GPU for distributed computing.
TensorFlow has auto-differentiation capabilities.
It has advanced support for threads, asynchronous computation, and queues.
It is customizable and open source.

Q26. What is a Computational Graph?

A computational graph is a series of TensorFlow operations arranged as nodes in a graph. Each node takes zero or more tensors as input and produces a tensor as output. Basically, one can think of a computational graph as an alternative way of conceptualizing the mathematical calculations that take place in a TensorFlow program. The operations assigned to different nodes of a computational graph can be performed in parallel, thus providing better performance in terms of computation.

Q27. What is a CNN?

A convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. Unlike plain neural networks, where the input is a vector, here the input is a multi-channeled image. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.

Q28. Explain the different layers of a CNN.

There are four layered concepts we should understand in convolutional neural networks:
Convolution: The convolution layer comprises a set of independent filters. All these filters are initialized randomly and become the parameters which will be learned by the network subsequently.
ReLU: This layer is used together with the convolutional layer.
• 71. Pooling: Its function is to progressively reduce the spatial size of the representation, to reduce the number of parameters and the amount of computation in the network. The pooling layer operates on each feature map independently.
Full Connectedness: Neurons in a fully connected layer have full connections to all activations in the previous layer, as in regular neural networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.

Q29. What is an RNN?

Recurrent neural networks are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, or numerical time-series data. Recurrent neural networks use the backpropagation algorithm for training. Because of their internal memory, RNNs are able to remember important things about the input they received, which enables them to be very precise in predicting what's coming next.

Q30. What are some issues faced while training an RNN?

Recurrent neural networks use the backpropagation algorithm for training, but it is applied at every timestamp. This is commonly known as Back-propagation Through Time (BPTT). There are some issues with back-propagation, such as:
Vanishing gradients
Exploding gradients

Q31. What is a vanishing gradient? And how is it harmful?

When we do back-propagation, the gradients tend to get smaller and smaller as we keep moving backward through the network. This means that the neurons in the earlier layers learn very slowly compared to the neurons in the later layers of the hierarchy. The earlier layers in the network are important because they are responsible for learning and detecting the simple patterns and are actually the building blocks of our network. Obviously, if they give improper and inaccurate results, how can we expect the next layers and the complete network to perform nicely and produce accurate results? The training process takes too long and the prediction accuracy of the model decreases.

Q32. What is the exploding gradient problem?

Exploding gradients are a problem where large error gradients accumulate and result in very large updates to the neural network model weights during training. Gradient descent works best when these updates are small and controlled. When the magnitudes of the gradients accumulate, an unstable network is likely to occur, which can cause poor prediction results or even a model that reports nothing useful whatsoever.

Q33. Explain the importance of LSTM.

Long short-term memory (LSTM) is an artificial recurrent neural network architecture used in the field of deep learning. Unlike standard feed-forward neural networks, an LSTM has feedback connections that make it a "general purpose computer". It can process not only single data points but also entire sequences of data. LSTMs are a special kind of recurrent neural network capable of learning long-term dependencies.

Q34. What are capsules in a Capsule Neural Network?

Capsules are vectors specifying the features of an object and its likelihood. These features can be any of the instantiation parameters, such as position, size, orientation, deformation, velocity, hue, texture and much more.
• 72. A capsule can also specify its attributes, such as angle and size, so that it can represent the same generic information. Now, just like a neural network has layers of neurons, a capsule network can have layers of capsules. Now, let's continue these Deep Learning interview questions and move to the section on autoencoders and RBMs. Q35. Explain Autoencoders and their uses. An autoencoder neural network is an unsupervised machine learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. Autoencoders are used to reduce the size of our inputs into a smaller representation. If anyone needs the original data, they can reconstruct it from the compressed data. Q36. In terms of dimensionality reduction, how does an Autoencoder differ from PCA? An autoencoder can learn non-linear transformations, thanks to a non-linear activation function and multiple layers. It does not have to rely on dense layers alone; it can use convolutional layers, which are better suited to video, image and sequential data. It is more efficient to learn several layers with an autoencoder than to learn one huge transformation with PCA. An autoencoder provides a representation of each layer as the output. It can make use of pre-trained layers from another model, applying transfer learning to enhance the encoder/decoder. Q37. Give some real-life examples where autoencoders can be applied. Image coloring: Autoencoders are used for converting a black-and-white picture into a colored image. Depending on what is in the picture, it is possible to tell what the color should be. Feature variation: It extracts only the required features of an image and generates the output by removing any noise or unnecessary interruption. Dimensionality reduction: The reconstructed image is the same as our input but with reduced dimensions. It helps in providing a similar image with a reduced number of pixels. Denoising images: The input seen by the autoencoder is not the raw input but a stochastically corrupted version. A denoising autoencoder is thus trained to reconstruct the original input from the noisy version. Q38. What are the different layers of Autoencoders? An autoencoder consists of three layers: Encoder, Code, Decoder. Q39. Explain the architecture of an Autoencoder. Encoder: This part of the network compresses the input into a latent-space representation. The encoder layer encodes the input image as a compressed representation in a reduced dimension. The compressed image is a distorted version of the original image.
• 73. Code: This part of the network represents the compressed input which is fed to the decoder. Decoder: This layer decodes the encoded image back to the original dimension. The decoded image is a lossy reconstruction of the original image, and it is reconstructed from the latent-space representation. Q40. What is a Bottleneck in an autoencoder and why is it used? The layer between the encoder and decoder, i.e. the code, is also known as the bottleneck. This is a well-designed approach to deciding which aspects of the observed data are relevant information and which aspects can be discarded. It does this by balancing two criteria: compactness of the representation, measured as its compressibility, and retention of the behaviourally relevant variables from the input. Q41. Are there any variations of Autoencoders? Convolutional autoencoders, Sparse autoencoders, Deep autoencoders, Contractive autoencoders. Q42. What are Deep Autoencoders? The extension of the simple autoencoder is the deep autoencoder. The first layer of the deep autoencoder is used for first-order features in the raw input. The second layer is used for second-order features corresponding to patterns in the appearance of first-order features. Deeper layers of the deep autoencoder tend to learn even higher-order features. A deep autoencoder is composed of two symmetrical deep-belief networks: the first four or five shallow layers represent the encoding half of the net, and the second set of four or five layers makes up the decoding half. Q43. What is a Restricted Boltzmann Machine? A Restricted Boltzmann Machine is an undirected graphical model that has played a major role in deep learning frameworks in recent times. It is an algorithm which is useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. Q44. How does an RBM differ from Autoencoders? An autoencoder is a simple 3-layer neural network where the output units are directly connected back to the input units. Typically, the number of hidden units is much smaller than the number of visible ones. The task of training is to minimize a reconstruction error, i.e. find the most efficient compact representation for the input data. An RBM shares a similar idea, but it uses stochastic units with a particular distribution instead of deterministic units. The task of training is to find out how these two sets of variables are actually related to each other.
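To illustrate the encoder–code–decoder structure described in Q38 and Q39, here is a minimal Keras-style sketch of a dense autoencoder; the layer sizes and the 784-dimensional input are illustrative assumptions, not values from the slides.

```python
# Minimal dense autoencoder sketch (illustrative sizes), assuming TensorFlow/Keras is installed.
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 784, 32                                 # e.g. flattened 28x28 images

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(128, activation="relu")(inputs)        # encoder
code = layers.Dense(code_dim, activation="relu")(encoded)     # bottleneck / code
decoded = layers.Dense(128, activation="relu")(code)          # decoder
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, outputs)
# Target values are set equal to the inputs, so training is unsupervised:
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
```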
• 74. Math of Deep Learning Neural Networks – Simplified (Part 2) · Roopam Upadhyay The Math of Deep Learning Neural Networks – by Roopam Welcome back to this series of articles on deep learning and neural networks. In the last part, you learned how training a deep learning network is similar to a plumbing job. This time you will learn the math of deep learning. We will continue to use the plumbing analogy to simplify the seemingly complicated math. I believe you will find this highly intuitive. Moreover, understanding this will provide you with a good idea about the inner workings of deep learning networks and artificial intelligence (AI)
• 75. to build an AI of your own. We will use the math of deep learning to make an image recognition AI in the next part. But before that let's create the links between… The Math of Deep Learning and Plumbing Last time we noticed that neural networks are like networks of water pipes. The goal of neural networks is to identify the right settings for the knobs (6 in this schematic) to get the right output given the input. Shown below is a familiar schematic of neural networks, almost identical to the water pipelines above. The only exception is the additional bias terms (b1, b2, and b3) added to the nodes. In this post, we will solve this network to understand the math of deep learning. Note that a deep learning model has multiple hidden layers, unlike this simple neural network. However, this simple neural network can easily be generalized to deep learning models. The math of deep learning does not change a lot with additional complexity and hidden layers. Here, our objective is to identify the values of the parameters {W (W1, …, W6) and b (b1, b2, and b3)}. We will soon use the
• 76. backpropagation algorithm along with gradient descent optimization to solve this network and identify the optimal values of these weights. Backpropagation and Gradient Descent In the previous post, we discussed that the backpropagation algorithm works similarly to me shouting back at my plumber while he was working in the duct. Remember, I was telling the plumber about the difference in actual water pressure from the expected. The plumber of neural networks, unlike my building's plumber, learns from this information to optimize the positions of the knobs. The method that the neural networks plumber uses to iteratively correct the weights or settings of the knobs is called gradient descent. We discussed the gradient descent algorithm in an earlier post to solve a logistic regression model. I recommend that you read that article to get a good grasp of the things we will discuss in this post. Essentially, the idea is to iteratively correct the values of the weights (Wi) to produce the least difference between the actual and the expected values of the output. This difference is measured mathematically by the loss function L. The weights (Wi and bi) are then iteratively improved using the gradient of the loss function with respect to the weights, using this expression: Wi ← Wi − α · ∂L/∂Wi (and similarly for bi). Here, α is called the learning rate – it's a hyperparameter and stays constant. Hence, the overall problem boils down to the identification of the partial derivatives of the loss function with respect to the weights, i.e. ∂L/∂Wi. For our problem, we just need to solve the partial derivatives for W5 and W1. The partial derivatives for the other weights can then be easily derived using the same method used for W5 and W1. Before we solve these partial derivatives, let's do some more plumbing jobs and look at a tap to develop intuitions about the results we will get from the gradient descent optimization. Intuitive Math of Deep Learning for W5 – A Tap We will use this simple tap to identify an optimal setting for its knob. In this process, we will develop intuitions about gradient descent and the math of deep learning. Here, the input is the water coming from the pipe on the left of the image. Moreover, the output is the water coming out of the tap. You use the knob on the top of the tap to regulate the quantity of the output water given the input. Remember, you want to turn the knob in such a way that you get the desired output (i.e. the quantity of water) to wash your hands. Keep in mind, the position of the knob is similar to the weight of a neural network's parameters. Moreover, the input/output water is similar to the input/output variables.
• 77. Essentially, in math terms, you are trying to identify how the position of the knob influences the output water. The mathematical equation for the same is the partial derivative of the output with respect to the knob's setting. If you understand the influence of the knob on the output flow of water, you can easily turn it to get the desired output. Now, let's develop an intuition about how much to twist the knob. When you use a tap you twist the knob until you get the right flow, or output. When the difference between the desired output and the actual output is large, you need a lot of twisting. On the other hand, when the difference is small, you turn the knob gently. Moreover, the other factor on which your decision depends is the input from the left pipe. If there is no water flowing from the left pipe then no matter how much you twist the knob, it won't help. Essentially, your action depends on these two factors. Your decision to turn the knob depends on Factor 1: the difference between the actual output and the desired output, and Factor 2: the input from the grey pipe on the left. Soon you will get the same result by doing the seemingly complicated math of gradient descent to solve the neural network. For our network, the output difference is (ŷ − y), the gap between the predicted and the actual output, and the input is the hidden-layer activation a1 feeding this knob. Hence, ∂L/∂W5 = (ŷ − y) · a1. Disclaimer Please note, to make the concepts easy for you to understand, I have taken a few liberties while defining the factors in the previous section. I will make these factors much more theoretically grounded at the end of this article when I discuss the chain rule to solve the derivatives. For now, I will continue to take more liberties in the next section when I discuss the weight modification for the other parameters of the neural network.
• 78. Add More Knobs to Solve W1 – Intuitive Math of Deep Learning Neural networks, as discussed earlier, have several parameters (Ws and bs). To develop an intuition about the math to estimate the other parameters further away from the output (i.e. W1), let's add another knob to the tap. Here, we have added a red regulator knob to the tap we saw in the earlier section. Now, the output from the tap is governed by both these knobs. Referring to the neural network's image shown earlier, the red knob is similar to the parameters (W1, W2, W3, W4, b1, and b2) added to the hidden layers. The knob on top of the brass tap is like the parameters of the output layer (i.e. W5, W6, and b3). Now, you are also using the red knob, in addition to the knob on the tap, to get the desired output from the tap. Your twisting of the red knob will depend on these factors. Your decision to turn the red knob depends on Factor 1: the difference between the actual and the desired final output, and Factor 2: the position/setting of the knob on the brass tap, and Factor 3: the change in input to the brass tap caused by the red knob, and Factor 4: the input from the pipe on the left into the red knob. Here, as already discussed, factor 1 is the output difference (ŷ − y). W5 is the setting/weight for the knob of the brass tap. Factor 3 is the sensitivity of the hidden node's activation to its input, i.e. the derivative of the sigmoid, a1(1 − a1). Finally, the last factor is the input, or X1. This completes our equation as: ∂L/∂W1 = (ŷ − y) · W5 · a1(1 − a1) · X1.
• 79. Now, before we do the math to get these results, we just need to discuss the components of our neural network in mathematical terms. We already know how it relates to the water pipelines discussed earlier. Let's start with the nodes, or the orange circles, in the network diagram. Nodes of Neural Networks Here, these two networks are equivalent except for the additional b, or bias, for the neural network. The node of the neural network has two components, i.e. sum and non-linear. The sum component (Z1) is just a linear combination of the inputs and the weights. The next term, i.e. non-linear, is the non-linear sigmoid activation function (σ), which produces the node's output a1 = σ(Z1). As discussed earlier, it is like the regulator of a fan that keeps the value of a1 between 0 and 1, or on/off.
• 80. The mathematical expression for this sigmoid activation function (σ) is: σ(z) = 1 / (1 + e^(−z)). The nodes in both the hidden and the output layer behave the same as described above. Now, the last thing is to define the loss function (L), which measures the difference between the expected and the actual output. We will define the loss function for the most common business problems. Classification Problem – Math of Deep Learning In practice, most business problems are about classification. They have binary or categorical outputs/answers such as: Is the last credit card transaction fraudulent or not? Will the borrower return the money or not? Was the last email in your mailbox spam or ham? Is that a picture of a dog or a cat? (This is not a business problem, but a famous problem for deep learning.) Is there an object in front of an autonomous car, so that a signal should be generated to push the brake? Will the person surfing the web respond to the ad for a luxury condo? Hence, we will design the loss function of our neural network for similar binary outputs. This binary loss function, aka binary cross entropy, can easily be extended to multiclass problems with minor modifications. Loss Function and Cross Entropy The loss function for binary output problems is: L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ] (a small numerical check follows below).
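As a quick numerical check of the sigmoid and binary cross-entropy formulas above, here is a small NumPy sketch; the sample values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy loss L = -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = 0.8                                    # sum component of a node (illustrative value)
y_hat = sigmoid(z)                         # predicted probability, about 0.69
print(binary_cross_entropy(1.0, y_hat))    # small loss when the true label is 1
print(binary_cross_entropy(0.0, y_hat))    # larger loss when the true label is 0
```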
• 81. This expression is also referred to as binary cross entropy. We can easily extend this binary cross-entropy to multi-class entropy if the output has many classes, such as images of dog, cat, bus, car etc. We will learn about multiclass cross entropy and the softmax function in the next part of this series. Now that we have identified all the components of the neural network, we are ready to solve it using the chain rule of derivatives. Chain Rule for W5 – Math of Deep Learning We discussed the outcome for the change observed in the loss function (L) with respect to a change in W5 earlier, using the single-knob analogy. We know the answer, ∂L/∂W5, is equal to (ŷ − y) · a1. Now, let's derive the same thing using the chain rule of derivatives. Essentially, this is similar to the change in water pressure observed at the output by turning the knob on the top of the tap. Writing Z for the sum component of the output node, so that ŷ = σ(Z), the chain rule states this: ∂L/∂W5 = (∂L/∂ŷ) · (∂ŷ/∂Z) · (∂Z/∂W5). The above equation for the chain rule is fairly simple, since the right-hand side becomes the left-hand side by simple cancellation of terms. More importantly, these equations suggest that the change in the output is essentially the change observed at the different components of the pipeline because of turning the knob. Moreover, we already discussed the loss function, which is the binary cross entropy, i.e. L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]. The first component of the chain rule is ∂L/∂ŷ, which is −y/ŷ + (1 − y)/(1 − ŷ). This was fairly easy to compute if you only know that the derivative of the natural log function is d(ln x)/dx = 1/x. The second component of the chain rule is ∂ŷ/∂Z, the derivative of the sigmoid function (σ), which is slightly more complicated. You can find a detailed solution to the derivative of the sigmoid function online. This implies ∂ŷ/∂Z = σ(Z)(1 − σ(Z)) = ŷ(1 − ŷ). Finally, the third component of the chain rule is again very easy to compute: since we know Z is the linear combination Z = W5 · a1 + W6 · a2 + b3, we get ∂Z/∂W5 = a1.
• 82. Now, we just multiply these three components of the chain rule and we get the result, i.e. ∂L/∂W5 = (ŷ − y) · a1. Chain Rule for W1 – Math of Deep Learning The chain rule for the red knob, or the additional layer, is just an extension of the chain rule for the knob on the top of the tap. This one has a few more components because the water has to travel through more components, i.e. ∂L/∂W1 = (∂L/∂ŷ) · (∂ŷ/∂Z) · (∂Z/∂a1) · (∂a1/∂Z1) · (∂Z1/∂W1). The first two components are exactly the same as for the knob of the tap, i.e. W5. This makes sense, since the water is flowing through the same pipeline towards the end. Hence, we will calculate the third component: ∂Z/∂a1 = W5. The fourth component is the derivative of the sigmoid function, i.e. ∂a1/∂Z1 = a1(1 − a1). The fifth and final component is again easy to calculate: since Z1 is a linear combination of the inputs, ∂Z1/∂W1 = X1. That's it. We now multiply these five components to get the result we have already seen for the additional red knob: ∂L/∂W1 = (ŷ − y) · W5 · a1(1 − a1) · X1. Sign-off Node This part of the series became a little math heavy. All this, however, will help us a lot when we build an artificial intelligence to recognize images. See you then.
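Putting the derivation together, here is a minimal NumPy sketch of the forward pass and the gradients for a toy network of this shape (two inputs, two hidden sigmoid nodes, one sigmoid output); the variable names, random initialization, and sample data are illustrative assumptions, not the author's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 2 inputs -> 2 hidden sigmoid nodes -> 1 sigmoid output.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(2, 2))   # roughly the W1..W4 knobs
b_hidden = np.zeros(2)               # b1, b2
W_out = rng.normal(size=2)           # W5, W6
b_out = 0.0                          # b3

def forward(x):
    Z_hidden = W_hidden @ x + b_hidden           # sum components of the hidden nodes
    a = sigmoid(Z_hidden)                        # hidden activations a1, a2
    y_hat = sigmoid(W_out @ a + b_out)           # predicted output
    return a, y_hat

def gradients(x, y):
    a, y_hat = forward(x)
    delta_out = y_hat - y                        # (y_hat - y): factor shared by all gradients
    dW_out = delta_out * a                       # dL/dW5, dL/dW6 = (y_hat - y) * a_i
    db_out = delta_out
    delta_hidden = delta_out * W_out * a * (1 - a)   # (y_hat - y) * W_out * sigmoid'
    dW_hidden = np.outer(delta_hidden, x)        # dL/dW1..W4 = delta_hidden_i * x_j
    db_hidden = delta_hidden
    return dW_hidden, db_hidden, dW_out, db_out

# One gradient-descent step with learning rate alpha (made-up sample data)
x, y, alpha = np.array([0.5, -1.2]), 1.0, 0.1
dW_h, db_h, dW_o, db_o = gradients(x, y)
W_hidden -= alpha * dW_h; b_hidden -= alpha * db_h
W_out -= alpha * dW_o;    b_out -= alpha * db_o
```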
• 83. CS 230 – Deep Learning Super VIP Cheatsheet: Deep Learning Afshine Amidi and Shervine Amidi November 25, 2018 Contents: 1 Convolutional Neural Networks (1.1 Overview, 1.2 Types of layer, 1.3 Filter hyperparameters, 1.4 Tuning hyperparameters, 1.5 Commonly used activation functions, 1.6 Object detection, 1.6.1 Face verification and recognition, 1.6.2 Neural style transfer, 1.6.3 Architectures using computational tricks); 2 Recurrent Neural Networks (2.1 Overview, 2.2 Handling long term dependencies, 2.3 Learning word representation, 2.3.1 Motivation and notations, 2.3.2 Word embeddings, 2.4 Comparing words, 2.5 Language model, 2.6 Machine translation, 2.7 Attention); 3 Deep Learning Tips and Tricks (3.1 Data processing, 3.2 Training a neural network, 3.2.1 Definitions, 3.2.2 Finding optimal weights, 3.3 Parameter tuning, 3.3.1 Weights initialization, 3.3.2 Optimizing convergence, 3.4 Regularization, 3.5 Good practices). 1 Convolutional Neural Networks 1.1 Overview r Architecture of a traditional CNN – Convolutional neural networks, also known as CNNs, are a specific type of neural network that are generally composed of the following layers: The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections. 1.2 Types of layer r Convolutional layer (CONV) – The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called a feature map or activation map. Remark: the convolution step can be generalized to the 1D and 3D cases as well. r Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which introduces some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.
• 84. Max pooling: each pooling operation selects the maximum value of the current view; preserves detected features; most commonly used. Average pooling: each pooling operation averages the values of the current view; downsamples the feature map; used in LeNet. r Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores. 1.3 Filter hyperparameters The convolution layer contains filters for which it is important to know the meaning behind their hyperparameters. r Dimensions of a filter – A filter of size F × F applied to an input containing C channels is an F × F × C volume that performs convolutions on an input of size I × I × C and produces an output feature map (also called activation map) of size O × O × 1. Remark: the application of K filters of size F × F results in an output feature map of size O × O × K. r Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation. r Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below. Valid: P = 0; no padding; drops the last convolution if dimensions do not match. Same: Pstart = ⌊(S⌈I/S⌉ − I + F − S)/2⌋ and Pend = ⌈(S⌈I/S⌉ − I + F − S)/2⌉; padding such that the feature map has size ⌈I/S⌉; output size is mathematically convenient; also called 'half' padding. Full: Pstart ∈ [[0, F − 1]] and Pend = F − 1; maximum padding such that end convolutions are applied on the limits of the input; the filter 'sees' the input end-to-end. 1.4 Tuning hyperparameters r Parameter compatibility in convolution layer – By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by: O = (I − F + Pstart + Pend)/S + 1. Remark: often, Pstart = Pend = P, in which case we can replace Pstart + Pend by 2P in the formula above.
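To make the output-size formula O = (I − F + Pstart + Pend)/S + 1 and the 'same' padding rule concrete, here is a small Python sketch; the function names and example sizes are illustrative.

```python
import math

def conv_output_size(I, F, S, P_start=0, P_end=0):
    """Spatial output size of a CONV/POOL layer: O = (I - F + P_start + P_end) / S + 1."""
    return (I - F + P_start + P_end) // S + 1

def same_padding(I, F, S):
    """'Same' mode: pad so the feature map has size ceil(I / S)."""
    total = S * math.ceil(I / S) - I + F - S
    return total // 2, total - total // 2   # (P_start, P_end)

# Example: a 32x32 input, 5x5 filter, stride 1
print(conv_output_size(32, 5, 1))                  # valid padding -> 28
p_start, p_end = same_padding(32, 5, 1)
print(conv_output_size(32, 5, 1, p_start, p_end))  # same padding -> 32
```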
• 85. r Understanding the complexity of the model – In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows. CONV: input size I × I × C, output size O × O × K, number of parameters (F × F × C + 1) · K; remarks: one bias parameter per filter, in most cases S < F, and a common choice for K is 2C. POOL: input size I × I × C, output size O × O × C, number of parameters 0; remarks: pooling operation done channel-wise, and in most cases S = F. FC: input size Nin, output size Nout, number of parameters (Nin + 1) × Nout; remarks: input is flattened, one bias parameter per neuron, and the number of FC neurons is free of structural constraints. r Receptive field – The receptive field at layer k is the area denoted Rk × Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i, and with the convention S0 = 1, the receptive field at layer k can be computed with the formula: Rk = 1 + Σj=1..k (Fj − 1) · Πi=0..j−1 Si. In the example below, we have F1 = F2 = 3 and S1 = S2 = 1, which gives R2 = 1 + 2 · 1 + 2 · 1 = 5. 1.5 Commonly used activation functions r Rectified Linear Unit – The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized below. ReLU: g(z) = max(0, z); non-linearity complexities biologically interpretable. Leaky ReLU: g(z) = max(εz, z) with ε ≪ 1; addresses the dying ReLU issue for negative values. ELU: g(z) = max(α(e^z − 1), z) with α ≪ 1; differentiable everywhere. r Softmax – The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x ∈ Rn and outputs a vector of output probabilities p ∈ Rn through a softmax function at the end of the architecture. It is defined as follows: p = (p1, ..., pn) where pi = e^(xi) / Σj=1..n e^(xj). 1.6 Object detection r Types of models – There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described below. Image classification: classifies a picture, predicts the probability of an object; traditional CNN. Classification with localization: detects an object in a picture, predicts the probability of the object and where it is located; simplified YOLO, R-CNN. Detection: detects up to several objects in a picture, predicts the probabilities of objects and where they are located; YOLO, R-CNN. r Detection – In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:
  • 86. CS 230 – Deep Learning Shervine Amidi Afshine Amidi Bounding box detection Landmark detection Detects the part of the image where the object is located - Detects a shape or characteristics of an object (e.g. eyes) - More granular Box of center (bx,by), height bh and width bw Reference points (l1x,l1y), ...,(lnx,lny) r Intersection over Union – Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as: IoU(Bp,Ba) = Bp ∩ Ba Bp ∪ Ba Remark: we always have IoU ∈ [0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba) ⩾ 0.5. r Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form. r Non-max suppression – The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining: • Step 1: Pick the box with the largest prediction probability. • Step 2: Discard any box having an IoU ⩾ 0.5 with the previous box. r YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the following steps: • Step 1: Divide the input image into a G × G grid. • Step 2: For each grid cell, run a CNN that predicts y of the following form: y = pc,bx,by,bh,bw,c1,c2,...,cp | {z } repeated k times ,... T ∈ RG×G×k×(5+p) where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bouding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes. • Step 3: Run the non-max suppression algorithm to remove any potential duplicate over- lapping bounding boxes. Remark: when pc = 0, then the network does not detect any object. In that case, the corre- sponding predictions bx, ..., cp have to be ignored. r R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algo- rithm that first segments the image to find potential relevant bounding boxes and then run the detection algorithm to find most probable objects in those bounding boxes. Remark: although the original algorithm is computationally expensive and slow, newer archi- tectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN. Stanford University 4 Winter 2019
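The IoU definition above (area of intersection over area of union) translates into a short Python function; the corner-coordinate box format (x1, y1, x2, y2) is an assumption made for illustration.

```python
def iou(box_p, box_a):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    x1 = max(box_p[0], box_a[0]); y1 = max(box_p[1], box_a[1])
    x2 = min(box_p[2], box_a[2]); y2 = min(box_p[3], box_a[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)          # intersection area (0 if disjoint)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    union = area_p + area_a - inter
    return inter / union if union > 0 else 0.0

# A predicted box is usually considered reasonably good when IoU >= 0.5:
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175, about 0.14
```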
  • 87. CS 230 – Deep Learning Shervine Amidi Afshine Amidi 1.6.1 Face verification and recognition r Types of models – Two main types of model are summed up in table below: Face verification Face recognition - Is this the correct person? - One-to-one lookup - Is this one of the K persons in the database? - One-to-many lookup r One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1, image 2). r Siamese Network – Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)). r Triplet loss – The triplet loss ` is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α ∈ R+ the margin parameter, this loss is defined as follows: `(A,P,N) = max (d(A,P) − d(A,N) + α,0) 1.6.2 Neural style transfer r Motivation – The goal of neural style transfer is to generate an image G based on a given content C and a given style S. r Activation – In a given layer l, the activation is noted a[l] and is of dimensions nH ×nw ×nc r Content cost function – The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows: Jcontent(C,G) = 1 2 ||a[l](C) − a[l](G) ||2 r Style matrix – The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G [l] kk0 quantifies how correlated the channels k and k0 are. It is defined with respect to activations a[l] as follows: G [l] kk0 = n [l] H X i=1 n [l] w X j=1 a [l] ijk a [l] ijk0 Remark: the style matrix for the style image and the generated image are noted G[l](S) and G[l](G) respectively. r Style cost function – The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows: J [l] style(S,G) = 1 (2nH nwnc)2 ||G[l](S) − G[l](G) ||2 F = 1 (2nH nwnc)2 nc X k,k0=1 G [l](S) kk0 − G [l](G) kk0 2 r Overall cost function – The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows: J(G) = αJcontent(C,G) + βJstyle(S,G) Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style. 1.6.3 Architectures using computational tricks r Generative Adversarial Network – Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image. Stanford University 5 Winter 2019
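As a quick illustration of the style (Gram) matrix G[l] defined above, whose entries sum a[l]ijk · a[l]ijk' over the spatial positions, here is a NumPy sketch; the activation shape is an illustrative assumption.

```python
import numpy as np

# Activations of one layer: height x width x channels (illustrative shape)
a = np.random.rand(4, 4, 3)

# G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']  -> channel-by-channel correlations
G = np.einsum("ijk,ijl->kl", a, a)
print(G.shape)                      # (3, 3), symmetric Gram matrix

# Equivalent computation via reshaping the spatial dimensions into one axis:
flat = a.reshape(-1, a.shape[-1])
assert np.allclose(G, flat.T @ flat)
```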
  • 88. CS 230 – Deep Learning Shervine Amidi Afshine Amidi Remark: use cases using variants of GANs include text to image, music generation and syn- thesis. r ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation: a[l+2] = g(a[l] + z[l+2] ) r Inception Network – This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance. In particular, it uses the 1 × 1 convolution trick to lower the burden of computation. ? ? ? 2 Recurrent Neural Networks 2.1 Overview r Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows: For each timestep t, the activation at and the output yt are expressed as follows: at = g1(Waaat−1 + Waxxt + ba) and yt = g2(Wyaat + by) where Wax, Waa, Wya, ba, by are coefficients that are shared temporally and g1, g2 activation functions The pros and cons of a typical RNN architecture are summed up in the table below: Advantages Drawbacks - Possibility of processing input of any length - Model size not increasing with size of input - Computation takes into account historical information - Weights are shared across time - Computation being slow - Difficulty of accessing information from a long time ago - Cannot consider any future input for the current state r Applications of RNNs – RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below: Stanford University 6 Winter 2019
  • 89. CS 230 – Deep Learning Shervine Amidi Afshine Amidi Type of RNN Illustration Example One-to-one Tx = Ty = 1 Traditional neural network One-to-many Tx = 1, Ty 1 Music generation Many-to-one Tx 1, Ty = 1 Sentiment classification Many-to-many Tx = Ty Name entity recognition Many-to-many Tx 6= Ty Machine translation r Loss function – In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows: L(b y,y) = Ty X t=1 L(b yt ,yt ) r Backpropagation through time – Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows: ∂L(T ) ∂W = T X t=1 ∂L(T ) ∂W (t) 2.2 Handling long term dependencies r Commonly used activation functions – The most common activation functions used in RNN modules are described below: Sigmoid Tanh RELU g(z) = 1 1 + e−z g(z) = ez − e−z ez + e−z g(z) = max(0,z) r Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers. r Gradient clipping – It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice. r Types of gates – In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to: Γ = σ(Wxt + Uat−1 + b) where W, U, b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below: Stanford University 7 Winter 2019
  • 90. CS 230 – Deep Learning Shervine Amidi Afshine Amidi Type of gate Role Used in Update gate Γu How much past should matter now? GRU, LSTM Relevance gate Γr Drop previous information? GRU, LSTM Forget gate Γf Erase a cell or not? LSTM Output gate Γo How much to reveal of a cell? LSTM r GRU/LSTM – Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture: Gated Recurrent Unit (GRU) Long Short-Term Memory (LSTM) c̃t tanh(Wc[Γr ? at−1,xt] + bc) tanh(Wc[Γr ? at−1,xt] + bc) ct Γu ? c̃t + (1 − Γu) ? ct−1 Γu ? c̃t + Γf ? ct−1 at ct Γo ? ct Dependencies Remark: the sign ? denotes the element-wise multiplication between two vectors. r Variants of RNNs – The table below sums up the other commonly used RNN architectures: Bidirectional (BRNN) Deep (DRNN) 2.3 Learning word representation In this section, we note V the vocabulary and |V | its size. 2.3.1 Motivation and notations r Representation techniques – The two main ways of representing words are summed up in the table below: 1-hot representation Word embedding - Noted ow - Naive approach, no similarity information - Noted ew - Takes into account words similarity r Embedding matrix – For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows: ew = Eow Remark: learning the embedding matrix can be done using target/context likelihood models. 2.3.2 Word embeddings r Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW. r Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by: Stanford University 8 Winter 2019
  • 91. CS 230 – Deep Learning Shervine Amidi Afshine Amidi P(t|c) = exp(θT t ec) |V | X j=1 exp(θT j ec) Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word. r Negative sampling – It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by: P(y = 1|c,t) = σ(θT t ec) Remark: this method is less computationally expensive than the skip-gram model. r GloVe – The GloVe model, short for global vectors for word representation, is a word em- bedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows: J(θ) = 1 2 |V | X i,j=1 f(Xij)(θT i ej + bi + b0 j − log(Xij))2 here f is a weighting function such that Xi,j = 0 =⇒ f(Xi,j) = 0. Given the symmetry that e and θ play in this model, the final word embedding e (final) w is given by: e (final) w = ew + θw 2 Remark: the individual components of the learned word embeddings are not necessarily inter- pretable. 2.4 Comparing words r Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows: similarity = w1 · w2 ||w1|| ||w2|| = cos(θ) Remark: θ is the angle between words w1 and w2. r t-SNE – t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at re- ducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space. 2.5 Language model r Overview – A language model aims at estimating the probability of a sentence P(y). r n-gram model – This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data. r Perplexity – Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows: PP = T Y t=1 1 P|V | j=1 y (t) j · b y (t) j ! 1 T Remark: PP is commonly used in t-SNE. 2.6 Machine translation r Overview – A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that: y = arg max y1,...,yTy P(y1 ,...,yTy |x) r Beam search – It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x. • Step 1: Find top B likely words y1 • Step 2: Compute conditional probabilities yk|x,y1,...,yk−1 • Step 3: Keep top B combinations x,y1,...,yk Stanford University 9 Winter 2019
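The cosine similarity formula above is straightforward to compute; here is a small NumPy sketch with made-up toy embeddings.

```python
import numpy as np

def cosine_similarity(w1, w2):
    """cos(theta) = (w1 . w2) / (||w1|| * ||w2||)."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

# Toy word embeddings (made-up values for illustration)
king = np.array([0.80, 0.65, 0.10])
queen = np.array([0.75, 0.70, 0.12])
apple = np.array([0.10, 0.05, 0.90])
print(cosine_similarity(king, queen))   # close to 1: similar words
print(cosine_similarity(king, apple))   # much smaller: dissimilar words
```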
  • 92. CS 230 – Deep Learning Shervine Amidi Afshine Amidi Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search. r Beam width – The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10. r Length normalization – In order to improve numerical stability, beam search is usually ap- plied on the following normalized objective, often called the normalized log-likelihood objective, defined as: Objective = 1 Tα y Ty X t=1 log h p(yt |x,y1 , ..., yt−1 ) i Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1. r Error analysis – When obtaining a predicted translation b y that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis: Case P(y∗|x) P(b y|x) P(y∗|x) ⩽ P(b y|x) Root cause Beam search faulty RNN faulty Remedies Increase beam width - Try different architecture - Regularize - Get more data r Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows: bleu score = exp 1 n n X k=1 pk ! where pn is the bleu score on n-gram only defined as follows: pn = X n-gram∈b y countclip(n-gram) X n-gram∈b y count(n-gram) Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score. 2.7 Attention r Attention model – This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting αt,t0 the amount of attention that the output yt should pay to the activation at0 and ct the context at time t, we have: ct = X t0 αt,t0 at0 with X t0 αt,t0 = 1 Remark: the attention scores are commonly used in image captioning and machine translation. r Attention weight – The amount of attention that the output yt should pay to the activation at0 is given by αt,t0 computed as follows: αt,t0 = exp(et,t0 ) Tx X t00=1 exp(et,t00 ) Remark: computation complexity is quadratic with respect to Tx. ? ? ? Stanford University 10 Winter 2019
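To illustrate the attention weights and context vector defined above (αt,t' is a softmax over the scores et,t', and ct is the weighted sum of the activations at'), here is a small NumPy sketch; the scores and activations are made-up values.

```python
import numpy as np

def attention_context(scores, activations):
    """alpha_{t,t'} = softmax over t' of e_{t,t'}; context c_t = sum_{t'} alpha_{t,t'} * a_{t'}."""
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights = weights / weights.sum()
    return weights, weights @ activations

# Toy example: 4 encoder activations of dimension 3, made-up attention scores e_{t,t'}
activations = np.random.randn(4, 3)
scores = np.array([2.0, 0.5, -1.0, 0.1])
alpha, c_t = attention_context(scores, activations)
print(alpha.round(3), alpha.sum())   # attention weights, summing to 1
print(c_t)                           # context vector at time t
```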
• 93. 3 Deep Learning Tips and Tricks 3.1 Data processing r Data augmentation – Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up below. More precisely, given an input image, here are the techniques that we can apply. Original: image without any modification. Flip: flipped with respect to an axis for which the meaning of the image is preserved. Rotation: rotation with a slight angle; simulates incorrect horizon calibration. Random crop: random focus on one part of the image; several random crops can be done in a row. Color shift: nuances of RGB are slightly changed; captures noise that can occur with light exposure. Noise addition: addition of noise; more tolerance to quality variation of inputs. Information loss: parts of the image ignored; mimics potential loss of parts of the image. Contrast change: luminosity changes; controls the difference in exposition due to the time of day. r Batch normalization – It is a step, with hyperparameters γ, β, that normalizes the batch {xi}. By noting µB, σB² the mean and variance of the batch that we want to correct, it is done as follows: xi ← γ · (xi − µB)/√(σB² + ε) + β. It is usually done after a fully connected/convolutional layer and before a non-linearity layer, and aims at allowing higher learning rates and reducing the strong dependence on initialization. 3.2 Training a neural network 3.2.1 Definitions r Epoch – In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights. r Mini-batch gradient descent – During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities, nor on one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune. r Loss function – In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z. r Cross-entropy loss – In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows: L(z,y) = −[ y log(z) + (1 − y) log(1 − z) ]. 3.2.2 Finding optimal weights r Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule. Using this method, each weight is updated with the rule: w ← w − α · ∂L(z,y)/∂w. r Updating weights – In a neural network, weights are updated as follows: • Step 1: Take a batch of training data and perform forward propagation to compute the loss. • Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight. • Step 3: Use the gradients to update the weights of the network.
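A minimal NumPy sketch of the batch normalization step xi ← γ(xi − µB)/√(σB² + ε) + β at training time; the γ, β, and ε values and the toy batch are illustrative.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch feature-wise, then rescale by gamma and shift by beta."""
    mu = x.mean(axis=0)                     # batch mean, per feature
    var = x.var(axis=0)                     # batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(32, 8) * 5 + 3      # 32 examples, 8 features, off-center on purpose
out = batch_norm(batch)
print(out.mean(axis=0).round(3))            # approximately 0 per feature
print(out.std(axis=0).round(3))             # approximately 1 per feature
```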
  • 94. CS 230 – Deep Learning Shervine Amidi Afshine Amidi 3.3 Parameter tuning 3.3.1 Weights initialization r Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture. r Transfer learning – Training a deep learning model requires a lot of data and more impor- tantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this: Training size Illustration Explanation Small Freezes all layers, trains weights on softmax Medium Freezes most layers, trains weights on last layers and softmax Large Trains weights on layers and softmax by initializing weights on pre-trained ones 3.3.2 Optimizing convergence r Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate. r Adaptive learning rates – Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below: Method Explanation Update of w Update of b Momentum - Dampens oscillations - Improvement to SGD - 2 parameters to tune w − αvdw b − αvdb RMSprop - Root Mean Square propagation - Speeds up learning algorithm by controlling oscillations w − α dw √ sdw b ←− b − α db √ sdb Adam - Adaptive Moment estimation - Most popular method - 4 parameters to tune w − α vdw √ sdw + b ←− b − α vdb √ sdb + Remark: other methods include Adadelta, Adagrad and SGD. 3.4 Regularization r Dropout – Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p 0. It forces the model to avoid relying too much on particular sets of features. Remark: most deep learning frameworks parametrize dropout through the ’keep’ parameter 1−p. r Weight regularization – In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below: LASSO Ridge Elastic Net - Shrinks coefficients to 0 - Good for variable selection Makes coefficients smaller Tradeoff between variable selection and small coefficients ... + λ||θ||1 λ ∈ R ... + λ||θ||2 2 λ ∈ R ... + λ h (1 − α)||θ||1 + α||θ||2 2 i λ ∈ R,α ∈ [0,1] Stanford University 12 Winter 2019
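A small NumPy sketch of the momentum, RMSprop, and Adam update rules summarized above; the hyperparameter defaults (e.g. β1 = 0.9, β2 = 0.999) are common choices stated here as assumptions, and the Adam sketch includes the usual bias-correction step, which the table omits.

```python
import numpy as np

def momentum_update(w, dw, v, alpha=0.01, beta=0.9):
    v = beta * v + (1 - beta) * dw           # exponentially averaged gradient
    return w - alpha * v, v

def rmsprop_update(w, dw, s, alpha=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * dw**2        # exponentially averaged squared gradient
    return w - alpha * dw / (np.sqrt(s) + eps), s

def adam_update(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw         # first moment
    s = beta2 * s + (1 - beta2) * dw**2      # second moment
    v_hat, s_hat = v / (1 - beta1**t), s / (1 - beta2**t)   # bias correction
    return w - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s

# Example usage (made-up weights and gradients)
w, dw = np.array([0.5, -0.3]), np.array([0.1, -0.2])
v = np.zeros_like(w)
w, v = momentum_update(w, dw, v)
print(w)
```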
• 95. r Early stopping – This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase. 3.5 Good practices r Overfitting small batch – When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set. r Gradient checking – Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity check for correctness. Numerical gradient: formula df/dx(x) ≈ (f(x + h) − f(x − h)) / (2h); expensive, as the loss has to be computed two times per dimension; used to verify the correctness of the analytical implementation; there is a trade-off in choosing h neither too small (numerical instability) nor too large (poor gradient approximation). Analytical gradient: formula df/dx(x) = f′(x); 'exact' result; direct computation; used in the final implementation. * * *
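A short NumPy sketch of the gradient check above, comparing the centered difference (f(x + h) − f(x − h)) / (2h) to an analytical gradient; the quadratic test function is an illustrative choice.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference gradient: df/dx_i ≈ (f(x + h·e_i) - f(x - h·e_i)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# Test function f(x) = sum(x^2), whose analytical gradient is 2x
f = lambda x: np.sum(x**2)
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))   # approximately [ 2. -4.  6.]
print(2 * x)                      # analytical gradient for comparison
```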