Deep Learning 101, Dec. 9th, 2016
By Jason Tsai (蔡志順) & Frank Wu (吳孟倫)
*Note: Most materials from this presentation are taken
from the book “Deep Learning” authored by Ian
Goodfellow, Yoshua Bengio, & Aaron Courville. The other
quoted sources are mentioned in the respective slides.
Activation Function
Action Potential (動作電位)
Figure from https://en.wikipedia.org/wiki/Action_potential
Softmax
Sigmoid
tanh
ReLU
Commonly Used
Activation Functions
Rounding Error
Underflow: It occurs when numbers
near zero are rounded to zero.
Overflow: It occurs when numbers
with large magnitude are
approximated as ∞ or -∞ .
Softmax Function
To avoid rounding error, evaluate softmax on z = x − maxᵢ xᵢ: subtracting the maximum leaves the softmax output unchanged, guarantees the largest argument to exp is 0 (so exp cannot overflow), and ensures at least one term in the denominator equals 1 (so underflow cannot cause division by zero).
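As a concrete illustration, here is a minimal NumPy sketch of the max-subtraction trick (not from the slides; `stable_softmax` is just an illustrative name):

```python
import numpy as np

def stable_softmax(x):
    """Softmax computed on z = x - max(x) to avoid overflow in exp."""
    z = x - np.max(x)            # largest entry of z is 0, so exp never overflows
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1000.0, 1001.0, 1002.0])   # naive exp(x) would overflow to inf
print(stable_softmax(x))                  # ~[0.090, 0.245, 0.665]
```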
Cross-Entropy Loss
Cross-entropy: H(p, q) = − Σₓ p(x) log q(x), where p is the target (empirical) distribution and q is the model's predicted distribution.
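A hedged NumPy sketch of this formula for a single classification example (the clipping constant is an illustrative safeguard against log(0), not part of the definition):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); eps guards against log(0)."""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

# one-hot target vs. predicted softmax output
target = np.array([0.0, 1.0, 0.0])
pred   = np.array([0.1, 0.7, 0.2])
print(cross_entropy(target, pred))   # -log(0.7) ≈ 0.357
```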
Logistic Sigmoid
(Sigmoid Function)
Figure from “A Tutorial on Deep Learning, Part 1: Nonlinear Classifiers
and The Backpropagation Algorithm, by Quoc V. Le”
Hyperbolic Tangent
(tanh)
Figure from http://mathworld.wolfram.com/HyperbolicTangent.html
Rectified Linear Unit
(ReLU)
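For reference, minimal NumPy versions of the three activation functions above (illustrative sketches, not code from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)         # identity for x > 0, zero otherwise

x = np.linspace(-3, 3, 7)
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```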
Stochastic Gradient Descent
(SGD)
Objective / Cost / Loss Function
To minimize or maximize, J:
Objective function / Criterion
Specific to minimize, L:
Cost Function
/ Loss Function
/ Error Function
x* = arg min f(x)
Gradient Descent (GD)
Epoch / Batch / Iteration
*One Epoch: One forward pass and one backward pass of all
the training examples
*Batch Size: The number of training examples in one
forward/backward pass. The higher the batch size, the more
memory space you'll need.
Minibatch: Take a small number of examples at a time,
ranging from 1 to a few hundred, during one iteration.
*Number of Iterations: Number of passes, each pass using
[batch size] number of examples. To be clear, one pass = one
forward pass + one backward pass.
*Quote from http://stackoverflow.com/questions/4752626/epoch-vs-iteration-
when-training-neural-networks/31842945#31842945
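A quick worked example of this bookkeeping, assuming 1,000 training examples and a batch size of 100 (numbers chosen purely for illustration):

```python
import math

num_examples = 1000      # assumed dataset size for illustration
batch_size   = 100
num_epochs   = 5

iterations_per_epoch = math.ceil(num_examples / batch_size)   # 10 passes per epoch
total_iterations     = iterations_per_epoch * num_epochs      # 50 forward+backward passes
print(iterations_per_epoch, total_iterations)
```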
SGD cont’d
minibatch
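Since the update formulas on this slide are shown as images, here is a hedged sketch of one way a minibatch SGD loop can look, using linear regression with squared loss; the toy data, learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                  # toy inputs
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta, lr, batch_size = np.zeros(3), 0.1, 32

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)    # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ theta - yb)           # gradient of mean squared error
    theta -= lr * grad                                           # SGD update: θ ← θ − ε·g
print(theta)   # approaches [2.0, -1.0, 0.5]
```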
A Proof for an Equation
What is Machine Learning
Machine Learning Algorithms
Algorithms that are able to learn from data
From Machine Learning (1997), page 2, by
Tom M. Mitchell:
“A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P , if its
performance at tasks in T , as measured by P,
improves with experience E.”
Tasks in Machine Learning
(Categories of Applications)
Classification
Classification with
missing inputs
Regression
Transcription
Machine Translation
Structured Output
Anomaly detection
Synthesis and
sampling
Imputation of missing
values
Denoising
Density Estimation or
Probability Mass
Function Estimation
…
Refer to chapter 12 for details of several practical applications
The Difficulties of Performance
Measure
In some cases, it is difficult to
decide what should be measured, e.g.
transcription.
In some other cases, we know what
quantity we would ideally like to
measure, but measuring it is
impractical, e.g. density estimation.
Supervised learning algorithms experience a dataset
containing features, but each example is also
associated with a label or target.
Unsupervised learning algorithms experience a
dataset containing many features, then learn useful
properties of the structure of this dataset.
Reinforcement Learning algorithms interact with an
environment and learn what to do by trial and error to
maximize a numerical reward, without being told beforehand
which actions to take.
Experience:
Supervised vs. Unsupervised vs.
Reinforcement
One example is a collection of features.
1 example (data point) → features
1 image → the values of the pixels in that image
A dataset is a collection of many examples.
Experience:
Dataset / Examples / Features
Examples: Linear Regression
Model Generalization
Underfitting occurs when the model is
not able to obtain a sufficiently low error
value on the training set. (It has not learned the pattern well enough.)
Overfitting occurs when the gap between
the training error and test error is too
large. (It has learned the training data so closely that it cannot reflect the real pattern.)
Test Error a.k.a Generalization Error
Choose Relevant Learning
Algorithms
The No Free Lunch theorem for Machine Learning:
Averaged over all possible data generating
distributions, every classification algorithm has the
same error rate when classifying previously
unobserved points.
As the training set size increases to infinity, the training
error of any fixed-capacity model must rise to at least
the Bayes error.
As the training set size increases, the optimal
capacity increases. Choose the optimal capacity after
reaching sufficient complexity to solve the task.
Regularization: Refer to chapter 7
Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z(i)
Require: A, the learning algorithm, seen as a function that
takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a
learned function f and an example z(i) ∈ D to a scalar ∈ R
Require: k, the number of folds
Split D into k mutually exclusive subsets Di, whose union is D.
for i from 1 to k do
fi = A(D \ Di)
for z(j) in Di do
ej = L(fi, z(j))
end for
end for
Return e
K-fold Cross-Validation
Algorithm
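A runnable Python sketch of the procedure above; the learning algorithm A (fit the training-set mean) and the squared-error loss L are toy stand-ins chosen for illustration:

```python
import numpy as np

def k_fold_xv(D, A, L, k):
    """Return the vector of per-example losses e, following KFoldXV(D, A, L, k)."""
    D = np.asarray(D)
    folds = np.array_split(np.random.default_rng(0).permutation(len(D)), k)  # k mutually exclusive subsets
    e = np.empty(len(D))
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])   # D \ D_i
        f_i = A(D[train_idx])                 # learn on everything except fold i
        for j in folds[i]:
            e[j] = L(f_i, D[j])               # evaluate on held-out examples
    return e

# toy stand-ins: A fits the mean, L is squared error
A = lambda data: (lambda z: data.mean())
L = lambda f, z: (f(z) - z) ** 2

data = np.random.default_rng(1).normal(loc=3.0, size=50)
print(k_fold_xv(data, A, L, k=5).mean())   # cross-validated estimate of generalization error
```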
*Estimation of the relation between input and target variables.
*We want to estimate a parameter.
*The true value of the parameter: θ (fixed and unknown)
*The estimated value of the parameter: θ̂ (a function of the data, hence a random variable)
*Let {x^(1), …, x^(m)} be a set of m independent and identically distributed (i.i.d.) data points.
*A point estimator or statistic is any function of the data:
θ̂_m = g(x^(1), …, x^(m))
Point Estimation
*Predict a variable y given an input vector x.
*Assume that y = f(x) + ε
*ε stands for the part of y that is not predictable from x.
*Approximate f with a model or estimate f̂.
*The estimator f̂ is a point estimator in function space.
*Estimating a parameter 𝑤 or estimating a function 𝑓 mapping from 𝑥 to 𝑦
*Linear regression
*Polynomial regression
Function Estimation
*Error = Bias + Variance
*Example:
*Shoot at 10 targets with a gun; 7 targets are hit.
*Error = 10 - 7 = 3
*The gun (standing in for the model) was only aimed at 9 of the targets…
*Bias = 10 - 9 = 1 (the difference between the model and the real targets)
*Variance = 9 - 7 = 2 (the model is not stable enough to hit what it aims at)
Error
*The variance of an estimator is denoted as: Var(θ̂)
*The standard error is denoted as: SE(θ̂) (the square root of the variance)
*Standard error of the mean:
SE(μ̂_m) = sqrt( Var[ (1/m) Σ_{i=1}^{m} x^(i) ] ) = σ / √m
*The bias of an estimator is defined as:
bias(θ̂_m) = E[θ̂_m] − θ
Variance and Standard Error
*MSE = E[(θ̂_m − θ)²] = Bias(θ̂_m)² + Var(θ̂_m)
Mean Squared Error
Bias(f̂(x)) = E[f̂(x)] − f(x)
Var(f̂(x)) = E[f̂(x)²] − (E[f̂(x)])²
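A small simulation (illustrative, not from the slides) that checks these identities for the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m, trials = 2.0, 1.0, 25, 100_000

# estimate θ̂_m = sample mean on many independent datasets of size m
estimates = rng.normal(mu, sigma, size=(trials, m)).mean(axis=1)

bias = estimates.mean() - mu                 # ≈ 0 (the sample mean is unbiased)
var  = estimates.var()                       # ≈ σ²/m
mse  = ((estimates - mu) ** 2).mean()        # ≈ bias² + var
print(bias, var, mse)
print("SE:", estimates.std(), "theory σ/√m:", sigma / np.sqrt(m))   # both ≈ 0.2
```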
*A set of examples X = {x^(1), …, x^(m)} drawn independently from the true but unknown data-generating distribution p_data(x).
*p_model(x; θ) maps any configuration of x to a real number estimating the true probability p_data(x).
*The maximum likelihood estimator for θ is defined as
θ_ML = argmax_θ p_model(X; θ)
     = argmax_θ Π_{i=1}^{m} p_model(x^(i); θ)
     = argmax_θ Σ_{i=1}^{m} log p_model(x^(i); θ)
Maximum Likelihood Estimation
*Example:
*5 bags, each holding an unlimited supply of cookies in two flavors, with the following proportions:
*Cherry 100%
*Cherry 75% + Lemon 25%
*Cherry 50% + Lemon 50%
*Cherry 25% + Lemon 75%
*Lemon 100%
*If we draw two cookies and both are lemon, which bag is the most likely source?
*0% * 0% = 0%
*25% * 25% = 6.25%
*50% * 50% = 25%
*75% * 75% = 56.25%
*100% * 100% = 100% ← this bag
Maximum Likelihood Estimation
*θ_MAP = argmax_θ p(θ|x) = argmax_θ p(x|θ) p(θ)
*Example:
*5 bags, each holding an unlimited supply of cookies in two flavors; the prior probability of choosing each bag and its flavor proportions are:
*10% → Cherry 100%
*20% → Cherry 75% + Lemon 25%
*40% → Cherry 50% + Lemon 50%
*20% → Cherry 25% + Lemon 75%
*10% → Lemon 100%
*If we draw two cookies and both are lemon, which bag is the most likely source?
*0% * 0% * 10% = 0%
*25% * 25% * 20% = 1.25%
*50% * 50% * 40% = 10%
*75% * 75% * 20% = 11.25% ← this bag
*100% * 100% * 10% = 10%
Maximum-a-Posteriori (MAP)
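Both cookie examples reduce to a few lines of arithmetic; this sketch recomputes the likelihoods and the unnormalized posteriors:

```python
lemon_fraction = [0.00, 0.25, 0.50, 0.75, 1.00]   # lemon proportion per bag
prior          = [0.10, 0.20, 0.40, 0.20, 0.10]   # probability of picking each bag

likelihood = [p ** 2 for p in lemon_fraction]                 # two lemon cookies observed
posterior  = [l * pr for l, pr in zip(likelihood, prior)]     # unnormalized p(bag | data)

print("ML  choice: bag", 1 + max(range(5), key=lambda i: likelihood[i]))  # bag 5 (100% lemon)
print("MAP choice: bag", 1 + max(range(5), key=lambda i: posterior[i]))   # bag 4 (75% lemon)
```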
*p(θ | x^(1), …, x^(m)) = p(x^(1), …, x^(m) | θ) · p(θ) / p(x^(1), …, x^(m))
*Example: Female + age 31~40 + worker + medium income → buys car?
Age Gender Occupation Income Buy car
< 21 Male Student Low No
21 ~ 30 Female Student Low Yes
21 ~ 30 Male Worker Low Yes
> 41 Male Worker High Yes
31 ~ 40 Male No Job Low No
31 ~ 40 Male Worker Medium No
31 ~ 40 Female No Job Low Yes
31 ~ 40 Female Worker Medium Yes
31 ~ 40 Female Worker High No
< 21 Female Student Low Yes
p(Yes) = 6/10   p(Female|Yes) = 4/6   p(Worker|Yes) = 3/6   p(Medium|Yes) = 1/6   p(31~40|Yes) = 2/6
p(Female) = 5/10   p(31~40) = 5/10   p(Worker) = 5/10   p(Medium) = 2/10
p(Yes | x^(1), …, x^(m)) = (4/6 × 2/6 × 3/6 × 1/6) × 6/10 / (5/10 × 5/10 × 5/10 × 2/10)
*argmax_θ p(θ | x^(1), …, x^(m))
*Example: Female + age 31~40 + worker + medium income → buys car?
Age Gender Occupation Income Buy car
< 21 Male Student Low No
21 ~ 30 Female Student Low Yes
21 ~ 30 Male Worker Low Yes
> 41 Male Worker High Yes
31 ~ 40 Male No Job Low No
31 ~ 40 Male Worker Medium No
31 ~ 40 Female No Job Low Yes
31 ~ 40 Female Worker Medium Yes
31 ~ 40 Female Worker High No
< 21 Female Student Low Yes
p(Yes) = 6/10   p(Female|Yes) = 4/6   p(Worker|Yes) = 3/6   p(Medium|Yes) = 1/6   p(31~40|Yes) = 2/6
p(Yes | x^(1), …, x^(m)) ∝ 4/6 × 2/6 × 3/6 × 1/6 × 6/10 ≈ 0.0111
p(No) = 4/10   p(Female|No) = 1/4   p(Worker|No) = 2/4   p(Medium|No) = 1/4   p(31~40|No) = 3/4
p(No | x^(1), …, x^(m)) ∝ 1/4 × 3/4 × 2/4 × 1/4 × 4/10 ≈ 0.0094
Since 0.0111 > 0.0094, the predicted class is Yes.
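A sketch that recomputes the two unnormalized naive Bayes scores directly from the table above, using exact fractions; it confirms that Yes is the higher-scoring class:

```python
from fractions import Fraction as F

# conditional probabilities read off the 10-row table above
p_yes = F(6, 10) * F(4, 6) * F(2, 6) * F(3, 6) * F(1, 6)   # Female, 31~40, Worker, Medium | Yes
p_no  = F(4, 10) * F(1, 4) * F(3, 4) * F(2, 4) * F(1, 4)   # same features | No

print(float(p_yes), float(p_no))                  # ≈ 0.0111 vs ≈ 0.0094
print("prediction:", "Yes" if p_yes > p_no else "No")
```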
Use maximum likelihood estimation to find the best
parameter vector 𝜃 for a parametric family of distributions
𝑝(𝑦|𝑥; 𝜃)
Linear regression corresponds to the family
p(y | x; θ) = N(y; θ^T x, I)
Logistic regression:
p(y = 1 | x; θ) = σ(θ^T x)
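As an illustration of the conditional-likelihood view, a hedged sketch of the logistic-regression negative log-likelihood on assumed toy data (the parameter values and dataset are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """- sum_i log p(y_i | x_i; θ) for logistic regression (binary cross-entropy)."""
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)      # labels from an assumed "true" θ

print(neg_log_likelihood(np.array([1.5, -2.0]), X, y) <
      neg_log_likelihood(np.array([0.0, 0.0]), X, y))   # True: the assumed θ has higher likelihood
```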
*Find the optimal hyperplane for classification
*Driven by a linear function 𝑤 𝑇 𝑥 + 𝑏
*w^T x + b = b + Σ_{i=1}^{m} α_i · x^T x^(i)
*x^(i) is a training example.
*α is a vector of coefficients.
*Predicted as the positive class when w^T x + b is positive
*Predicted as the negative class when w^T x + b is negative
*Replace 𝑥 by the output of a given feature function 𝜙(𝑥)
*Replacing dot products with kernel evaluations
f(x) = Σᵢ αᵢ k(x, x^(i)) + b
k(x, x^(i)) = φ(x)^T φ(x^(i))
*Commonly used kernel
*Gaussian kernel
k(u, v) = N(u − v; 0, σ² I)
*N(x; μ, Σ) is the standard normal density.
*Its value decreases along lines in v space radiating outward from u.
*Also known as the radial basis function (RBF) kernel
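A minimal sketch of a kernel machine using the (unnormalized) Gaussian/RBF kernel; the support vectors and coefficients below are illustrative placeholders, not a trained SVM:

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    """Unnormalized Gaussian (RBF) kernel: decreases as ||u - v|| grows."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def kernel_predict(x, support_vectors, alphas, b, sigma=1.0):
    """f(x) = sum_i alpha_i * k(x, x_i) + b; the sign gives the predicted class."""
    return sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alphas, support_vectors)) + b

# illustrative support vectors and coefficients
sv     = [np.array([2.0, 1.0]), np.array([1.0, 2.0])]
alphas = [1.0, -1.0]
print(np.sign(kernel_predict(np.array([2.0, 0.5]), sv, alphas, b=0.0)))   # closer to the first SV → +1
```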
x1  x2  class
2   1   Yes
3   2   Yes
4   1   Yes
1   2   No
2   3   No
1   4   No
Trained model using decision tree (axis-aligned splits):
if x1 ≤ 2, test x2: predict Yes when x2 ≤ 1, No when x2 > 1; if x1 > 2, predict Yes.
[Figure: ground-truth decision boundary vs. the axis-aligned result learned by the tree]
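The axis-aligned splits above translate directly into nested comparisons; this sketch reproduces the table's labels under the reconstructed tree:

```python
def tree_predict(x1, x2):
    """Decision tree from the slide: split on x1 at 2, then on x2 at 1."""
    if x1 <= 2:
        return "Yes" if x2 <= 1 else "No"
    return "Yes"

data = [(2, 1, "Yes"), (3, 2, "Yes"), (4, 1, "Yes"),
        (1, 2, "No"), (2, 3, "No"), (1, 4, "No")]
print(all(tree_predict(x1, x2) == label for x1, x2, label in data))   # True
```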
Unsupervised Learning
Algorithms
Simpler Representation
Lower dimensional representations
Sparse representations
(Refer to chapter 7.10)
Independent representations
K-means Clustering
The k-means algorithm works by initializing
k different centroids {µ(1), . . . , µ(k)} to
different values, then alternating between
two different steps until convergence.
A. In one step, each training example is
assigned to cluster i, where i is the index of
the nearest centroid µ(i).
B. In the other step, each centroid µ(i) is
updated to the mean of all training
examples x(j) assigned to cluster i.
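A compact sketch of the two alternating steps (it assumes no cluster goes empty, which holds for this toy data; a production version would need to handle that case):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate between (A) assigning points to the nearest centroid and (B) recomputing centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize from k data points
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                          # step A: nearest-centroid assignment
        centroids = np.array([X[assign == i].mean(axis=0)      # step B: centroid = mean of its points
                              for i in range(k)])
    return centroids, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, assign = kmeans(X, k=2)
print(centroids)   # one centroid near (0, 0), the other near (5, 5)
```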
Principal Components Analysis
(PCA)
Apply lossy compression to a collection of
m points {x(1), . . . , x(m)} in R^n.
For each point x(i) ∈ R^n we will find a
corresponding code vector c(i) ∈ R^l.
Encoding function: f(x) = c
Reconstructing (decoding) function:
g(f(x)) = g(c) = Dc, where D ∈ R^(n×l)
All columns of D are orthogonal to each other.
PCA cont’d
Find a linear transformation W such that the representation z = W^T x
has diagonal covariance Var[z], given the covariance of the (centered) data x.
PCA cont’d
When we project the data x to z via the linear transformation
W, the resulting representation has a diagonal covariance
matrix (given by Σ²), which immediately implies that the
individual elements of z are mutually uncorrelated.
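A short sketch that runs PCA on centered toy data and checks that Var[z] comes out diagonal; the mixing matrix used to create correlated data is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])   # correlated toy data
X -= X.mean(axis=0)                                            # center

# D's columns: top-l unit-norm eigenvectors of the sample covariance
l = 2
cov = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
D = eigvecs[:, ::-1][:, :l]                  # n x l, orthonormal columns

Z = X @ D                                    # encode: z = D^T x for each example
X_rec = Z @ D.T                              # decode: g(c) = Dc (lossy reconstruction)

print(np.round(Z.T @ Z / (len(X) - 1), 3))   # Var[z] is diagonal (top eigenvalues on the diagonal)
```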
Challenges motivate Deep
Learning
The Curse of Dimensionality (維度的詛咒)
Local Constancy and Smoothness
Regularization
Manifold Learning
Curse of Dimensionality
The number of possible distinct configurations
of a set of variables increases exponentially
as the number of variables increases.
The number of possible configurations is
much larger than the available number of
training examples as the number of
dimensions increases.
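A one-line count makes the exponential growth concrete (d binary features admit 2^d distinct configurations; the values of d are illustrative):

```python
# d binary features admit 2**d distinct configurations
for d in (10, 20, 50, 100):
    print(d, 2 ** d)   # 1024, ~1.0e6, ~1.1e15, ~1.3e30 — quickly dwarfing any training set
```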
Smoothness Prior /
Local Constancy Prior
The smoothness prior or local constancy prior
states that the function we learn should not
change very much within a small region.
Many simpler algorithms rely exclusively on
this prior to generalize well, and as a result
they fail to scale to the statistical challenges
involved in solving AI-level tasks.
The core idea in deep learning is that we
assume that the data was generated by the
composition of factors or features, potentially
at multiple levels in a hierarchy.
Manifold Learning
Manifold (流形): A set of points associated with
a neighborhood around each point
From any given point, the manifold locally
appears to be a Euclidean space (歐基里德空間).
Manifold hypothesis: Assuming most of R^n
consists of invalid inputs and interesting inputs
occur only along a collection of manifolds
containing a small subset of points.
The probability distribution over images, text strings, and
sounds that occur in real life is highly concentrated.
Examples are connected in neighborhoods and may be
reachable by transformations.
Data sampled in 2D space but actually
concentrated like a 1-D twisted string.
Examples of a face training dataset (QMUL dataset)
that subject moved in two angles of rotation.
The goal is to get learning algorithms to discover and
disentangle such manifold coordinates.
Visualizations of learned data manifold for generative
models with two-dimensional latent space. Learned by
a variational autoencoder (VAE).
Figure from “Auto-encoding Variational Bayes (2014), by D. Kingma & M. Welling”