Deep Learning 101, Dec. 9th, 2016
By Jason Tsai (蔡志順) & Frank Wu (吳孟倫)
*Note: Most materials from this presentation are taken
from the book “Deep Learning” authored by Ian
Goodfellow, Yoshua Bengio, & Aaron Courville. The other
quoted sources are mentioned in the respective slides.
Activation Function
Action Potential (動作電位)
Figure from https://en.wikipedia.org/wiki/Action_potential
Softmax
Sigmoid
tanh
ReLU
Commonly Used
Activation Functions
Rounding Error
Underflow: It occurs when numbers
near zero are rounded to zero.
Overflow: It occurs when numbers
with large magnitude are
approximated as ∞ or -∞ .
Softmax Function
To avoid rounding error, evaluate softmax on z = x − maxᵢ xᵢ: subtracting the maximum leaves the softmax output unchanged, guarantees the largest argument to exp is 0 (so exp cannot overflow), and ensures at least one term in the denominator equals 1 (so underflow cannot cause division by zero).
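As a concrete illustration, here is a minimal NumPy sketch of the max-subtraction trick (not from the slides; `stable_softmax` is just an illustrative name):

```python
import numpy as np

def stable_softmax(x):
    """Softmax computed on z = x - max(x) to avoid overflow in exp."""
    z = x - np.max(x)            # largest entry of z is 0, so exp never overflows
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1000.0, 1001.0, 1002.0])   # naive exp(x) would overflow to inf
print(stable_softmax(x))                  # ~[0.090, 0.245, 0.665]
```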
Cross-Entropy Loss
Cross-entropy: H(p, q) = − Σₓ p(x) log q(x), where p is the target (empirical) distribution and q is the model's predicted distribution.
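A hedged NumPy sketch of this formula for a single classification example (the clipping constant is an illustrative safeguard against log(0), not part of the definition):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); eps guards against log(0)."""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

# one-hot target vs. predicted softmax output
target = np.array([0.0, 1.0, 0.0])
pred   = np.array([0.1, 0.7, 0.2])
print(cross_entropy(target, pred))   # -log(0.7) ≈ 0.357
```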
Logistic Sigmoid
(Sigmoid Function)
Figure from “A Tutorial on Deep Learning, Part 1: Nonlinear Classifiers
and The Backpropagation Algorithm, by Quoc V. Le”
Hyperbolic Tangent
(tanh)
Figure from http://mathworld.wolfram.com/HyperbolicTangent.html
Rectified Linear Unit
(ReLU)
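For reference, minimal NumPy versions of the three activation functions above (illustrative sketches, not code from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)         # identity for x > 0, zero otherwise

x = np.linspace(-3, 3, 7)
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```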
Stochastic Gradient Descent
(SGD)
Objective / Cost / Loss Function
To minimize or maximize, J:
Objective function / Criterion
Specific to minimize, L:
Cost Function
/ Loss Function
/ Error Function
x* = arg min f(x)
Gradient Descent (GD)
Epoch / Batch / Iteration
*One Epoch: One forward pass and one backward pass of all
the training examples
*Batch Size: The number of training examples in one
forward/backward pass. The higher the batch size, the more
memory space you'll need.
Minibatch: Take a small number of examples at a time,
ranging from 1 to a few hundred, during one iteration.
*Number of Iterations: Number of passes, each pass using
[batch size] number of examples. To be clear, one pass = one
forward pass + one backward pass.
*Quote from http://stackoverflow.com/questions/4752626/epoch-vs-iteration-
when-training-neural-networks/31842945#31842945
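A quick worked example of this bookkeeping, assuming 1,000 training examples and a batch size of 100 (numbers chosen purely for illustration):

```python
import math

num_examples = 1000      # assumed dataset size for illustration
batch_size   = 100
num_epochs   = 5

iterations_per_epoch = math.ceil(num_examples / batch_size)   # 10 passes per epoch
total_iterations     = iterations_per_epoch * num_epochs      # 50 forward+backward passes
print(iterations_per_epoch, total_iterations)
```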
SGD cont’d
minibatch
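Since the update formulas on this slide are shown as images, here is a hedged sketch of one way a minibatch SGD loop can look, using linear regression with squared loss; the toy data, learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                  # toy inputs
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta, lr, batch_size = np.zeros(3), 0.1, 32

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)    # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ theta - yb)           # gradient of mean squared error
    theta -= lr * grad                                           # SGD update: θ ← θ − ε·g
print(theta)   # approaches [2.0, -1.0, 0.5]
```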
A Proof for an Equation
What is Machine Learning
Machine Learning Algorithms
Algorithms that are able to learn from data
From Machine Learning (1997), page 2, by
Tom M. Mitchell:
“A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P , if its
performance at tasks in T , as measured by P,
improves with experience E.”
Tasks in Machine Learning
(Categories of Applications)
Classification
Classification with
missing inputs
Regression
Transcription
Machine Translation
Structured Output
Anomaly detection
Synthesis and
sampling
Imputation of missing
values
Denoising
Density Estimation or
Probability Mass
Function Estimation
…
Refer to chapter 12 for details of several practical applications
The Difficulties of Performance
Measure
In some cases, it is difficult to
decide what should be measured, e.g.
transcription.
In some other cases, we know what
quantity we would ideally like to
measure, but measuring it is
impractical, e.g. density estimation.
Supervised learning algorithms experience a dataset
containing features, but each example is also
associated with a label or target.
Unsupervised learning algorithms experience a
dataset containing many features, then learn useful
properties of the structure of this dataset.
Reinforcement Learning algorithms interact with an
environment and learn what to do by trial and error to
maximize a numerical reward, without being told beforehand
which actions to take.
Experience:
Supervised vs. Unsupervised vs.
Reinforcement
One example is a collection of features.
1 example (data point) → features
1 image → the values of the pixels in that image
A dataset is a collection of many examples.
Experience:
Dataset / Examples / Features
Examples: Linear Regression
Model Generalization
Underfitting occurs when the model is
not able to obtain a sufficiently low error
value on the training set. (It has not learned the pattern well enough.)
Overfitting occurs when the gap between
the training error and test error is too
large. (It has learned the training data so closely that it cannot reflect the real pattern.)
Test Error a.k.a Generalization Error
Choose Relevant Learning
Algorithms
The No Free Lunch theorem for Machine Learning:
Averaged over all possible data generating
distributions, every classification algorithm has the
same error rate when classifying previously
unobserved points.
As the training set size increases to infinity, the training
error of any fixed-capacity model must rise to at least
the Bayes error.
As the training set size increases, the optimal
capacity increases. Choose the optimal capacity after
reaching sufficient complexity to solve the task.
Regularization: Refer to chapter 7
Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z(i)
Require: A, the learning algorithm, seen as a function that
takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a
learned function f and an example z(i) ∈ D to a scalar ∈ R
Require: k, the number of folds
Split D into k mutually exclusive subsets Di, whose union is D.
for i from 1 to k do
fi = A(D \ Di)
for z(j) in Di do
ej = L(fi, z(j))
end for
end for
Return e
K-fold Cross-Validation
Algorithm
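A runnable Python sketch of the procedure above; the learning algorithm A (fit the training-set mean) and the squared-error loss L are toy stand-ins chosen for illustration:

```python
import numpy as np

def k_fold_xv(D, A, L, k):
    """Return the vector of per-example losses e, following KFoldXV(D, A, L, k)."""
    D = np.asarray(D)
    folds = np.array_split(np.random.default_rng(0).permutation(len(D)), k)  # k mutually exclusive subsets
    e = np.empty(len(D))
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])   # D \ D_i
        f_i = A(D[train_idx])                 # learn on everything except fold i
        for j in folds[i]:
            e[j] = L(f_i, D[j])               # evaluate on held-out examples
    return e

# toy stand-ins: A fits the mean, L is squared error
A = lambda data: (lambda z: data.mean())
L = lambda f, z: (f(z) - z) ** 2

data = np.random.default_rng(1).normal(loc=3.0, size=50)
print(k_fold_xv(data, A, L, k=5).mean())   # cross-validated estimate of generalization error
```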
*Estimation of the relation between input and target variables.
*We want to estimate a parameter.
*The true value of the parameter: θ (fixed and unknown)
*The estimated value of the parameter: θ̂ (a function of the data, hence a random variable)
*Let {x^(1), …, x^(m)} be a set of m independent and identically distributed (i.i.d.) data points.
*A point estimator or statistic is any function of the data:
θ̂_m = g(x^(1), …, x^(m))
Point Estimation
*Predict a variable y given an input vector x.
*Assume that y = f(x) + ε
*ε stands for the part of y that is not predictable from x.
*Approximate f with a model or estimate f̂.
*The estimator f̂ is a point estimator in function space.
*Estimating a parameter 𝑤 or estimating a function 𝑓 mapping from 𝑥 to 𝑦
*Linear regression
*Polynomial regression
Function Estimation
*Error = Bias + Variance
*Example:
*Shoot at 10 targets with a gun; 7 targets are hit.
*Error = 10 - 7 = 3
*The gun (standing in for the model) was only aimed at 9 of the targets…
*Bias = 10 - 9 = 1 (the difference between the model and the real targets)
*Variance = 9 - 7 = 2 (the model is not stable enough to hit what it aims at)
Error
*The variance of an estimator is denoted as: Var(θ̂)
*The standard error is denoted as: SE(θ̂) (the square root of the variance)
*Standard error of the mean:
SE(μ̂_m) = sqrt( Var[ (1/m) Σ_{i=1}^{m} x^(i) ] ) = σ / √m
*The bias of an estimator is defined as:
bias(θ̂_m) = E[θ̂_m] − θ
Variance and Standard Error
*MSE = E[(θ̂_m − θ)²] = Bias(θ̂_m)² + Var(θ̂_m)
Mean Squared Error
Bias(f̂(x)) = E[f̂(x)] − f(x)
Var(f̂(x)) = E[f̂(x)²] − (E[f̂(x)])²
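A small simulation (illustrative, not from the slides) that checks these identities for the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m, trials = 2.0, 1.0, 25, 100_000

# estimate θ̂_m = sample mean on many independent datasets of size m
estimates = rng.normal(mu, sigma, size=(trials, m)).mean(axis=1)

bias = estimates.mean() - mu                 # ≈ 0 (the sample mean is unbiased)
var  = estimates.var()                       # ≈ σ²/m
mse  = ((estimates - mu) ** 2).mean()        # ≈ bias² + var
print(bias, var, mse)
print("SE:", estimates.std(), "theory σ/√m:", sigma / np.sqrt(m))   # both ≈ 0.2
```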
*A set of examples X = {x^(1), …, x^(m)} drawn independently from the true but unknown data-generating distribution p_data(x).
*p_model(x; θ) maps any configuration of x to a real number estimating the true probability p_data(x).
*The maximum likelihood estimator for θ is defined as
θ_ML = argmax_θ p_model(X; θ)
     = argmax_θ Π_{i=1}^{m} p_model(x^(i); θ)
     = argmax_θ Σ_{i=1}^{m} log p_model(x^(i); θ)
Maximum Likelihood Estimation
*Example:
*5 bags, each holding an unlimited supply of cookies in two flavors, with the following proportions:
*Cherry 100%
*Cherry 75% + Lemon 25%
*Cherry 50% + Lemon 50%
*Cherry 25% + Lemon 75%
*Lemon 100%
*If we draw two cookies and both are lemon, which bag is the most likely source?
*0% * 0% = 0%
*25% * 25% = 6.25%
*50% * 50% = 25%
*75% * 75% = 56.25%
*100% * 100% = 100% ← this bag
Maximum Likelihood Estimation
*θ_MAP = argmax_θ p(θ|x) = argmax_θ p(x|θ) p(θ)
*Example:
*5 bags, each holding an unlimited supply of cookies in two flavors; the prior probability of choosing each bag and its flavor proportions are:
*10% → Cherry 100%
*20% → Cherry 75% + Lemon 25%
*40% → Cherry 50% + Lemon 50%
*20% → Cherry 25% + Lemon 75%
*10% → Lemon 100%
*If we draw two cookies and both are lemon, which bag is the most likely source?
*0% * 0% * 10% = 0%
*25% * 25% * 20% = 1.25%
*50% * 50% * 40% = 10%
*75% * 75% * 20% = 11.25% ← this bag
*100% * 100% * 10% = 10%
Maximum-a-Posteriori (MAP)
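Both cookie examples reduce to a few lines of arithmetic; this sketch recomputes the likelihoods and the unnormalized posteriors:

```python
lemon_fraction = [0.00, 0.25, 0.50, 0.75, 1.00]   # lemon proportion per bag
prior          = [0.10, 0.20, 0.40, 0.20, 0.10]   # probability of picking each bag

likelihood = [p ** 2 for p in lemon_fraction]                 # two lemon cookies observed
posterior  = [l * pr for l, pr in zip(likelihood, prior)]     # unnormalized p(bag | data)

print("ML  choice: bag", 1 + max(range(5), key=lambda i: likelihood[i]))  # bag 5 (100% lemon)
print("MAP choice: bag", 1 + max(range(5), key=lambda i: posterior[i]))   # bag 4 (75% lemon)
```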
*p(θ | x^(1), …, x^(m)) = p(x^(1), …, x^(m) | θ) · p(θ) / p(x^(1), …, x^(m))
*Example: Female + age 31~40 + worker + medium income → buys car?
Age Gender Occupation Income Buy car
< 21 Male Student Low No
21 ~ 30 Female Student Low Yes
21 ~ 30 Male Worker Low Yes
> 41 Male Worker High Yes
31 ~ 40 Male No Job Low No
31 ~ 40 Male Worker Medium No
31 ~ 40 Female No Job Low Yes
31 ~ 40 Female Worker Medium Yes
31 ~ 40 Female Worker High No
< 21 Female Student Low Yes
p(Yes) = 6/10   p(Female|Yes) = 4/6   p(Worker|Yes) = 3/6   p(Medium|Yes) = 1/6   p(31~40|Yes) = 2/6
p(Female) = 5/10   p(31~40) = 5/10   p(Worker) = 5/10   p(Medium) = 2/10
p(Yes | x^(1), …, x^(m)) = (4/6 × 2/6 × 3/6 × 1/6) × 6/10 / (5/10 × 5/10 × 5/10 × 2/10)
*argmax_θ p(θ | x^(1), …, x^(m))
*Example: Female + age 31~40 + worker + medium income → buys car?
Age Gender Occupation Income Buy car
< 21 Male Student Low No
21 ~ 30 Female Student Low Yes
21 ~ 30 Male Worker Low Yes
> 41 Male Worker High Yes
31 ~ 40 Male No Job Low No
31 ~ 40 Male Worker Medium No
31 ~ 40 Female No Job Low Yes
31 ~ 40 Female Worker Medium Yes
31 ~ 40 Female Worker High No
< 21 Female Student Low Yes
p(Yes) = 6/10   p(Female|Yes) = 4/6   p(Worker|Yes) = 3/6   p(Medium|Yes) = 1/6   p(31~40|Yes) = 2/6
p(Yes | x^(1), …, x^(m)) ∝ 4/6 × 2/6 × 3/6 × 1/6 × 6/10 ≈ 0.0111
p(No) = 4/10   p(Female|No) = 1/4   p(Worker|No) = 2/4   p(Medium|No) = 1/4   p(31~40|No) = 3/4
p(No | x^(1), …, x^(m)) ∝ 1/4 × 3/4 × 2/4 × 1/4 × 4/10 ≈ 0.0094
Since 0.0111 > 0.0094, the predicted class is Yes.
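A sketch that recomputes the two unnormalized naive Bayes scores directly from the table above, using exact fractions; it confirms that Yes is the higher-scoring class:

```python
from fractions import Fraction as F

# conditional probabilities read off the 10-row table above
p_yes = F(6, 10) * F(4, 6) * F(2, 6) * F(3, 6) * F(1, 6)   # Female, 31~40, Worker, Medium | Yes
p_no  = F(4, 10) * F(1, 4) * F(3, 4) * F(2, 4) * F(1, 4)   # same features | No

print(float(p_yes), float(p_no))                  # ≈ 0.0111 vs ≈ 0.0094
print("prediction:", "Yes" if p_yes > p_no else "No")
```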
Use maximum likelihood estimation to find the best
parameter vector 𝜃 for a parametric family of distributions
𝑝(𝑦|𝑥; 𝜃)
Linear regression corresponds to the family
p(y | x; θ) = N(y; θ^T x, I)
Logistic regression:
p(y = 1 | x; θ) = σ(θ^T x)
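As an illustration of the conditional-likelihood view, a hedged sketch of the logistic-regression negative log-likelihood on assumed toy data (the parameter values and dataset are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """- sum_i log p(y_i | x_i; θ) for logistic regression (binary cross-entropy)."""
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)      # labels from an assumed "true" θ

print(neg_log_likelihood(np.array([1.5, -2.0]), X, y) <
      neg_log_likelihood(np.array([0.0, 0.0]), X, y))   # True: the assumed θ has higher likelihood
```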
*Find the optimal hyperplane for classification
*Driven by a linear function 𝑤 𝑇 𝑥 + 𝑏
*w^T x + b = b + Σ_{i=1}^{m} α_i · x^T x^(i)
*x^(i) is a training example.
*α is a vector of coefficients.
*Predicted as the positive class when w^T x + b is positive
*Predicted as the negative class when w^T x + b is negative
*Replace 𝑥 by the output of a given feature function 𝜙(𝑥)
*Replacing dot products with kernel evaluations
f(x) = Σᵢ αᵢ k(x, x^(i)) + b
k(x, x^(i)) = φ(x)^T φ(x^(i))
*Commonly used kernel
*Gaussian kernel
k(u, v) = N(u − v; 0, σ² I)
*N(x; μ, Σ) is the standard normal density.
*Its value decreases along lines in v space radiating outward from u.
*Also known as the radial basis function (RBF) kernel
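A minimal sketch of a kernel machine using the (unnormalized) Gaussian/RBF kernel; the support vectors and coefficients below are illustrative placeholders, not a trained SVM:

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    """Unnormalized Gaussian (RBF) kernel: decreases as ||u - v|| grows."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def kernel_predict(x, support_vectors, alphas, b, sigma=1.0):
    """f(x) = sum_i alpha_i * k(x, x_i) + b; the sign gives the predicted class."""
    return sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alphas, support_vectors)) + b

# illustrative support vectors and coefficients
sv     = [np.array([2.0, 1.0]), np.array([1.0, 2.0])]
alphas = [1.0, -1.0]
print(np.sign(kernel_predict(np.array([2.0, 0.5]), sv, alphas, b=0.0)))   # closer to the first SV → +1
```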
x1  x2  class
2   1   Yes
3   2   Yes
4   1   Yes
1   2   No
2   3   No
1   4   No
Trained model using decision tree (axis-aligned splits):
if x1 ≤ 2, test x2: predict Yes when x2 ≤ 1, No when x2 > 1; if x1 > 2, predict Yes.
[Figure: ground-truth decision boundary vs. the axis-aligned result learned by the tree]
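The axis-aligned splits above translate directly into nested comparisons; this sketch reproduces the table's labels under the reconstructed tree:

```python
def tree_predict(x1, x2):
    """Decision tree from the slide: split on x1 at 2, then on x2 at 1."""
    if x1 <= 2:
        return "Yes" if x2 <= 1 else "No"
    return "Yes"

data = [(2, 1, "Yes"), (3, 2, "Yes"), (4, 1, "Yes"),
        (1, 2, "No"), (2, 3, "No"), (1, 4, "No")]
print(all(tree_predict(x1, x2) == label for x1, x2, label in data))   # True
```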
Unsupervised Learning
Algorithms
Simpler Representation
Lower dimensional representations
Sparse representations
(Refer to chapter 7.10)
Independent representations
K-means Clustering
The k-means algorithm works by initializing
k different centroids {µ(1), . . . , µ(k)} to
different values, then alternating between
two different steps until convergence.
A. In one step, each training example is
assigned to cluster i, where i is the index of
the nearest centroid µ(i).
B. In the other step, each centroid µ(i) is
updated to the mean of all training
examples x(j) assigned to cluster i.
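A compact sketch of the two alternating steps (it assumes no cluster goes empty, which holds for this toy data; a production version would need to handle that case):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate between (A) assigning points to the nearest centroid and (B) recomputing centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize from k data points
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                          # step A: nearest-centroid assignment
        centroids = np.array([X[assign == i].mean(axis=0)      # step B: centroid = mean of its points
                              for i in range(k)])
    return centroids, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, assign = kmeans(X, k=2)
print(centroids)   # one centroid near (0, 0), the other near (5, 5)
```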
Principal Components Analysis
(PCA)
Apply lossy compression to a collection of
m points {x(1), . . . , x(m)} in R^n.
For each point x(i) ∈ R^n we will find a
corresponding code vector c(i) ∈ R^l.
Encoding function: f(x) = c
Reconstructing (decoding) function:
g(f(x)) = g(c) = Dc, where D ∈ R^(n×l)
All columns of D are orthogonal to each other.
PCA cont’d
Find a linear transformation W such that the representation z = W^T x
has diagonal covariance Var[z], given the covariance of the (centered) data x.
PCA cont’d
When we project the data x to z via the linear transformation
W, the resulting representation has a diagonal covariance
matrix (given by Σ²), which immediately implies that the
individual elements of z are mutually uncorrelated.
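A short sketch that runs PCA on centered toy data and checks that Var[z] comes out diagonal; the mixing matrix used to create correlated data is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])   # correlated toy data
X -= X.mean(axis=0)                                            # center

# D's columns: top-l unit-norm eigenvectors of the sample covariance
l = 2
cov = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
D = eigvecs[:, ::-1][:, :l]                  # n x l, orthonormal columns

Z = X @ D                                    # encode: z = D^T x for each example
X_rec = Z @ D.T                              # decode: g(c) = Dc (lossy reconstruction)

print(np.round(Z.T @ Z / (len(X) - 1), 3))   # Var[z] is diagonal (top eigenvalues on the diagonal)
```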
Challenges motivate Deep
Learning
The Curse of Dimensionality (維度的詛咒)
Local Constancy and Smoothness
Regularization
Manifold Learning
Curse of Dimensionality
The number of possible distinct configurations
of a set of variables increases exponentially
as the number of variables increases.
The number of possible configurations is
much larger than the available number of
training examples as the number of
dimensions increases.
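A one-line count makes the exponential growth concrete (d binary features admit 2^d distinct configurations; the values of d are illustrative):

```python
# d binary features admit 2**d distinct configurations
for d in (10, 20, 50, 100):
    print(d, 2 ** d)   # 1024, ~1.0e6, ~1.1e15, ~1.3e30 — quickly dwarfing any training set
```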
Smoothness Prior /
Local Constancy Prior
The smoothness prior or local constancy prior
states that the function we learn should not
change very much within a small region.
Many simpler algorithms rely exclusively on
this prior to generalize well, and as a result
they fail to scale to the statistical challenges
involved in solving AI-level tasks.
The core idea in deep learning is that we
assume that the data was generated by the
composition of factors or features, potentially
at multiple levels in a hierarchy.
Manifold Learning
Manifold (流形): A set of points associated with
a neighborhood around each point
From any given point, the manifold locally
appears to be a Euclidean space (歐基里德空間).
Manifold hypothesis: Assuming most of R^n
consists of invalid inputs and interesting inputs
occur only along a collection of manifolds
containing a small subset of points.
The probability distribution over images, text strings, and
sounds that occur in real life is highly concentrated.
Examples are connected in neighborhoods and may be
reachable by transformations.
Data sampled in 2D space but actually
concentrated like a 1-D twisted string.
Examples of a face training dataset (QMUL dataset)
that subject moved in two angles of rotation.
The goal is to get learning algorithms to discover and
disentangle such manifold coordinates.
Visualizations of learned data manifold for generative
models with two-dimensional latent space. Learned by
a variational autoencoder (VAE).
Figure from “Auto-encoding Variational Bayes (2014), by D. Kingma & M. Welling”