Lecture artificial neural networks and pattern recognition

The University of Texas
Health Science Center at Houston
School of Health Information Sciences
Artificial Neural Networks and
Pattern Recognition
For students of HI 5323
“Image Processing”
Willy Wriggers, Ph.D.
School of Health Information Sciences
http://biomachina.org/courses/processing/13.html

What are Neural Networks?
• Models of the brain and nervous system
• Highly parallel
ƒ Process information much more like the brain than a serial computer
• Learning
• Very simple principles
• Very complex behaviours
• Applications
ƒ As powerful problem solvers
ƒ As biological models
© torsten.reil@zoo.ox.ac.uk, users.ox.ac.uk/~quee0818/teaching/Neural_Networks.ppt

Neurophysiological Background
• 10 billion neurons in human cortex
• 60 trillion synapses
• In first two years from birth ~1 million synapses / sec. formed
[Figure: pyramidal cell]

Modeling the Neuron
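[Diagram: inputs x1, …, xi, …, xn with weights w1, …, wi, …, wn, plus a bias input 1 with weight w0, feed the neuron, which produces the output y]
h(w0, wi, xi) : combine wi & xi
y = f(h), f : activation function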
© Leonard Studer, humanresources.web.cern.ch/humanresources/external/training/ tech/special/DISP2003/DISP-2003_L21A_30Apr03.ppt

Common Activation Functions
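• Sigmoidal Function:
$y = f\left(h = w_0 \cdot 1 + \sum_{i=1}^{n} w_i \cdot x_i ;\ \rho\right) = \frac{1}{1 + e^{-h/\rho}}$
• Radial Function, e.g. Gaussian:
$y = f\left(h = \sqrt{\sum_{i=1}^{n} (x_i - w_i)^2} ;\ \sigma = w_0\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{h^2}{2\sigma^2}}$
• Linear Function:
$y = w_0 \cdot 1 + \sum_{i=1}^{n} w_i \cdot x_i$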
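
The three activation functions can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are not from the slides, and the radial function assumes that h is the Euclidean distance between the input x and the center w, with σ playing the role of w0.

```python
import math

def weighted_sum(x, w, w0):
    """h = w0*1 + sum_i w_i*x_i (the bias input is fixed at 1)."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(x, w, w0, rho=1.0):
    """Sigmoidal neuron: 1 / (1 + exp(-h/rho))."""
    h = weighted_sum(x, w, w0)
    return 1.0 / (1.0 + math.exp(-h / rho))

def gaussian_radial(x, w, sigma):
    """Radial neuron: Gaussian of the distance h between x and w (assumed Euclidean)."""
    h = math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))
    return math.exp(-h * h / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def linear(x, w, w0):
    """Linear neuron: the weighted sum itself."""
    return weighted_sum(x, w, w0)
```

The sigmoid saturates toward 0 and 1, the radial function peaks when the input coincides with the center, and the linear function is unbounded.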

Artificial Neural Networks
• ANNs incorporate the two fundamental components of
biological neural nets:
1. Neurones (nodes)
2. Synapses (weights)
[Figure: network diagram, Input → Output]

“Pigeon” ANNs
• Pigeons as art experts (Watanabe et al. 1995)
• Experiment:
- Pigeon in Skinner box
- Present paintings of two different artists (e.g. Chagall / Van Gogh)
- Reward for pecking when presented a particular artist (e.g. Van Gogh)

Training Set:
(etc…)

Predictive Power:
• Pigeons were able to discriminate between Van Gogh and Chagall with
95% accuracy (when presented with pictures they had been trained on)
• Discrimination still 85% successful for previously unseen paintings of
the artists.
• Pigeons do not simply memorise the pictures
• They can extract and recognise patterns (the ‘style’)
• They generalise from the already seen to make predictions
• This is what neural networks (biological and artificial) are good at
(unlike a conventional computer)

Real ANN Applications
• Recognition of hand-written letters
• Predicting on-line the quality of welding spots
• Identifying relevant documents in corpus
• Visualizing high-dimensional space
• Tracking on-line the position of robot arms
• …etc

ANN Design
1. Get a large amount of data: inputs and outputs
2. Analyze data on the PC
• Relevant inputs?
• Linear correlations (ANN necessary)?
• Transform and scale variables
• Other useful preprocessing?
• Divide in 3 data sets:
- Training set
- Test set
- Validation set
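
The three-way split can be sketched as follows. The 60/20/20 proportions, the function name and the toy data are assumptions chosen for illustration; the slides do not prescribe any of them.

```python
import random

def split_dataset(samples, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle the samples and cut them into training, test and validation sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation

# Toy (input, output) pairs, purely hypothetical
data = [(x, 2 * x) for x in range(10)]
train, test, validation = split_dataset(data)
```

Shuffling before cutting avoids a split that follows the order in which the data were recorded.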

ANN Design
3. Set the ANN architecture: What type of ANN?
• Number of inputs, outputs?
• Number of hidden layers
• Number of neurons
• Learning schema « details »
4. Tune/optimize internal parameters by presenting training data set to ANN
5. Validate on test / validation dataset

Main Types of ANN
Supervised Learning:
ƒ Feed-forward ANN
- Multi-Layer Perceptron (with sigmoid hidden neurons)
ƒ Recurrent Networks
- Neurons are connected to self and others
- Time delay of signal transfer
- Multidirectional information flow
Unsupervised Learning:
ƒ Self-organizing ANN
- Kohonen Maps
- Vector Quantization
- Neural Gas

Feed-Forward ANN
• Information flow is unidirectional
• Data is presented to Input layer
• Passed on to Hidden Layer
• Passed on to Output layer
• Information is distributed
• Information processing is parallel
Internal representation (interpretation) of data

Supervised Learning
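Training set: $\{(x^{\mu}_{in}, t^{\mu}_{out});\ 1 \le \mu \le P\}$
[Diagram: input $x^{\mu}_{in}$ → network → output $x^{\mu}_{out}$, compared with the desired output (supervisor) $t^{\mu}_{out}$]
error = $x^{\mu}_{out} - t^{\mu}_{out}$
Typically: backprop. of errors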

Important Properties of FFN
• Assume
ƒ g(x): bounded and sufficiently regular function
ƒ FFN with 1 hidden layer of finite N neurons
(transfer function is identical for every neuron)
• => FFN is a universal approximator of g(x)
Theorem by Cybenko et al. in 1989
In the sense of uniform approximation
For arbitrary precision ε

Important Properties of FFN
• Assume
ƒ FFN as before
(1 hidden layer of finite N neurons, nonlinear transfer function)
ƒ Approximation precision ε
• => #{wi} ~ # inputs
Theorem by Barron in 1993
ANN is more parsimonious in #{wi} than a linear approximator
[linear approximator: #{wi} ~ exp(# inputs)]

Roughness of Output
• The output depends on the whole set of
weighted links {wij}
• Example: output unit versus input 1 and
input 2 for a 2*10*1 ANN with random
weights

Feeding Data Through the FNN
(1 × 0.25) + (0.5 × (-1.5)) = 0.25 + (-0.75) = -0.5
Squashing: $\frac{1}{1 + e^{0.5}} = 0.3775$
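
The same arithmetic, written as a short Python sketch: inputs 1 and 0.5 are weighted by 0.25 and -1.5, summed, and squashed by a sigmoid. The variable names are illustrative.

```python
import math

inputs = [1.0, 0.5]
weights = [0.25, -1.5]

# Weighted sum: (1 * 0.25) + (0.5 * -1.5) = -0.5
h = sum(x * w for x, w in zip(inputs, weights))

# Squashing with the sigmoid: 1 / (1 + e^{0.5}), about 0.3775
y = 1.0 / (1.0 + math.exp(-h))
```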

Feeding Data Through the FNN
• Data is presented to the network in the form of activations in the input layer
• Examples
ƒ Pixel intensity (for pictures)
ƒ Molecule concentrations (for artificial nose)
ƒ Share prices (for stock market prediction)
• Data usually requires preprocessing
ƒ Analogous to senses in biology
• How to represent more abstract data, e.g. a name?
ƒ Choose a pattern, e.g.
- 0-0-1 for “Chris”
- 0-1-0 for “Becky”

Training the Network
How do we adjust the weights?
• Backpropagation
ƒ Requires training set (input / output pairs)
ƒ Starts with small random weights
ƒ Error is used to adjust weights (supervised learning)
→ Gradient descent on error landscape
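
A minimal sketch of backpropagation for a network with one hidden layer of sigmoid units and one sigmoid output. Everything specific here is an assumption made for illustration: the 2-2-1 architecture, the squared-error measure, the learning rate, the number of epochs and the logical-OR training set.

```python
import math
import random

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def forward(x, W1, b1, W2, b2):
    """Propagate x through the hidden layer to the single output unit."""
    hidden = [sigmoid(b1[j] + sum(W1[j][i] * x[i] for i in range(len(x))))
              for j in range(len(W1))]
    out = sigmoid(b2 + sum(W2[j] * hidden[j] for j in range(len(hidden))))
    return hidden, out

def train(samples, n_hidden=2, rate=0.5, epochs=2000, seed=1):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Start with small random weights
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    b2 = rng.uniform(-0.1, 0.1)
    history = []  # summed squared error per epoch
    for _ in range(epochs):
        total = 0.0
        for x, t in samples:
            hidden, out = forward(x, W1, b1, W2, b2)
            err = out - t
            total += 0.5 * err * err
            # Error signal at the output, then propagated back to each hidden unit
            delta_out = err * out * (1.0 - out)
            for j in range(n_hidden):
                delta_h = delta_out * W2[j] * hidden[j] * (1.0 - hidden[j])
                W2[j] -= rate * delta_out * hidden[j]
                b1[j] -= rate * delta_h
                for i in range(n_in):
                    W1[j][i] -= rate * delta_h * x[i]
            b2 -= rate * delta_out
        history.append(total)
    return (W1, b1, W2, b2), history

# Hypothetical training set: logical OR as input / output pairs
or_samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
params, history = train(or_samples)
```

The list `history` records the error after each pass over the training set, so the descent on the error landscape can be inspected directly.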

Backpropagation
• Advantages
ƒ It works!
ƒ Relatively fast
• Downsides
ƒ Requires a training set
ƒ Can be slow to converge
ƒ Probably not biologically realistic
• Alternatives to Backpropagation
ƒ Hebbian learning
- Not successful in feed-forward nets
ƒ Reinforcement learning
- Only limited success in FFN
ƒ Artificial evolution
- More general, but can be even slower than backprop

Applications of FFN
ƒ Pattern recognition
- Character recognition
- Face Recognition
ƒ Sonar mine/rock recognition (Gorman & Sejnowski, 1988)
ƒ Navigation of a car (Pomerleau, 1989)
ƒ Stock-market prediction
ƒ Pronunciation (NETtalk)
(Sejnowski & Rosenberg, 1987)

Protein Secondary Structure Prediction
(Holley-Karplus, Ph.D., etc):
Supervised learning:
ƒ Adjust weight vectors so
output of network matches
desired result
α-helical
coil
amino acid sequence

Recurrent Networks
• Feed forward networks:
ƒ Information only flows one way
ƒ One input pattern produces one output
ƒ No sense of time (or memory of previous state)
• Recurrency
ƒ Nodes connect back to other nodes or themselves
ƒ Information flow is multidirectional
ƒ Sense of time and memory of previous state(s)
• Biological nervous systems show high levels of recurrency (but feed-forward
structures exist too)

Elman Nets
• Elman nets are feed forward networks with partial
recurrency
• Unlike feed forward nets, Elman nets have a memory or
sense of time

Elman Nets
Classic experiment on language acquisition and processing (Elman, 1990)
• Task
ƒ Elman net to predict successive words in sentences.
• Data
ƒ Suite of sentences, e.g.
- “The boy catches the ball.”
- “The girl eats an apple.”
ƒ Words are input one at a time
• Representation
ƒ Binary representation for each word, e.g.
- 0-1-0-0-0 for “girl”
• Training method
ƒ Backpropagation

Elman Nets
Internal
representation
of words

Hopfield Networks
• Sub-type of recurrent neural nets
ƒ Fully recurrent
ƒ Weights are symmetric
ƒ Nodes can only be on or off
ƒ Random updating
• Learning: Hebb rule (cells that fire together
wire together)
• Can recall a memory, if presented with a
corrupt or incomplete version
→ auto-associative or
content-addressable memory

Hopfield Networks
Task: store images with resolution of 20x20 pixels
→ Hopfield net with 400 nodes
Memorise:
1. Present image
2. Apply Hebb rule (cells that fire together, wire together)
- Increase weight between two nodes if both have same
activity, otherwise decrease
3. Go to 1
Recall:
1. Present incomplete pattern
2. Pick random node, update
3. Go to 2 until settled
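
The memorise and recall loops above can be sketched as follows. This sketch assumes node states of +1 (on) and -1 (off), so that the product of two states is +1 when both nodes have the same activity and -1 otherwise; the 16-node patterns are hypothetical and much smaller than the 400-node example.

```python
import random

def memorise(patterns):
    """Hebb rule: raise w_ij if nodes i and j agree in a pattern, lower it otherwise."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, state, seed=0, max_sweeps=100):
    """Update nodes in random order until no node changes (network has settled)."""
    rng = random.Random(seed)
    s = list(state)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            h = sum(W[i][j] * s[j] for j in range(n))
            new = 1 if h >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s

# Two hypothetical 16-node patterns (orthogonal to each other)
p1 = [1] * 8 + [-1] * 8
p2 = [1, -1] * 8
W = memorise([p1, p2])

# Corrupt two nodes of the first pattern, then let the network settle
corrupt = list(p1)
corrupt[0] = -1
corrupt[9] = 1
restored = recall(W, corrupt)
```

Because the weights are symmetric and the nodes are updated one at a time, the state settles into a stable pattern, here the stored pattern closest to the corrupted input.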

Hopfield Networks

Hopfield Networks
• Memories are attractors in state space

Catastrophic Forgetting
• Problem: memorising new patterns corrupts the memory of older ones
→ Old memories cannot be recalled, or spurious memories arise
• Solution: allow Hopfield net to sleep

Solutions
ƒ Unlearning (Hopfield, 1986)
- Recall old memories by random stimulation, but use an inverse
Hebb rule
→ ‘Makes room’ for new memories (basins of attraction shrink)
ƒ Pseudorehearsal (Robins, 1995)
- While learning new memories, recall old memories by random
stimulation
- Use standard Hebb rule on new and old memories
→ Restructure memory
• Needs short-term + long-term memory
- Mammals: hippocampus plays back new memories to neo-cortex,
which is randomly stimulated at the same time

Unsupervised (Self-Organized) Learning
[Diagram: three architectures, each drawn from input layer to output layer]
• feed-forward (supervised)
• feed-forward + lateral feedback (recurrent network, still supervised)
• self-organizing network (unsupervised): continuous input space → discrete output space

Self Organizing Map (SOM)
neural lattice
input signal space
Kohonen, 1984

Illustration of Kohonen Learning
Inputs: coordinates (x,y) of points
drawn from a square
Display neuron j at position xj,yj where
its sj is maximum
random initial positions
100 inputs 200 inputs
1000 inputs
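
A minimal sketch of the Kohonen learning illustrated above: inputs are points drawn from the unit square, and each neuron of a square lattice carries a position (its weight vector) that is pulled toward the input, most strongly for the winner and its lattice neighbors. The lattice size, the Gaussian neighborhood function and the linear decay of learning rate and radius are assumptions chosen for illustration.

```python
import math
import random

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def train_som(n_side=5, n_inputs=1000, seed=0):
    rng = random.Random(seed)
    # Random initial positions for the neurons of an n_side x n_side lattice
    weights = {(a, b): [rng.random(), rng.random()]
               for a in range(n_side) for b in range(n_side)}
    for t in range(n_inputs):
        frac = t / n_inputs
        rate = 0.5 * (1.0 - frac) + 0.01                   # assumed decay schedule
        radius = 0.5 + (n_side / 2.0) * (1.0 - frac)       # assumed decay schedule
        x = [rng.random(), rng.random()]                   # point drawn from the square
        winner = min(weights, key=lambda k: dist2(weights[k], x))
        for (a, b), w in weights.items():
            # Neighborhood is measured on the lattice, not in input space
            lattice_d2 = (a - winner[0]) ** 2 + (b - winner[1]) ** 2
            g = math.exp(-lattice_d2 / (2.0 * radius ** 2))
            w[0] += rate * g * (x[0] - w[0])
            w[1] += rate * g * (x[1] - w[1])
    return weights

som = train_som()
```

Plotting the positions in `som` and connecting lattice neighbors after 100, 200 and 1000 inputs would give pictures of the kind described on the slide.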

Why use Kohonen Maps?
• Image Analysis
- Image Classification
• Data Visualization
- By projection from high D -> 2D
Preserving neighborhood relationships
• Partitioning Input Space
Vector Quantization (Coding)

Example:
Modeling of the somatosensory map of the hand (Ritter, Martinetz &
Schulten, 1992).

Representing Topology
with the Kohonen SOM
• free neurons from lattice…
• stimulus–dependent connectivities

The “Neural Gas” Algorithm
(Martinetz & Schulten, 1992)
connectivity matrix: $C_{ij} \in \{0, 1\}$
age matrix: $T_{ij} \in \{0, \ldots, T\}$
[Figure: stimulus presented to the network]

More Examples: Torus and Myosin S1

Growing Neural Gas
GNG = Neural gas &
dynamical creation/removal of links
© http://www.neuroinformatik.ruhr-uni-bochum.de

Why use GNG ?
• Adaptability to Data Topology
ƒ Both dynamically and spatially
• Data Analysis
• Data Visualization

Radial Basis Function Networks
[Diagram: Inputs (fan in) → hidden layer of RBF neurons → Outputs as linear combination of the hidden layer of RBF neurons]
Usually apply an unsupervised learning procedure
• Set number of neurons and then adjust:
1. Gaussian centers
2. Gaussian widths
3. weights
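
The forward pass of such a network can be sketched as below: each hidden neuron responds with a Gaussian of the distance between the input and its center, and the output is a weighted sum of these responses. The unnormalized Gaussian and the example values are assumptions; the slides do not fix them.

```python
import math

def rbf_output(x, centers, widths, weights):
    """Output = sum_k weight_k * exp(-||x - center_k||^2 / (2 * width_k^2))."""
    total = 0.0
    for c, s, w in zip(centers, widths, weights):
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        total += w * math.exp(-d2 / (2.0 * s ** 2))
    return total

# Hypothetical network with two RBF neurons
centers = [[0.0, 0.0], [10.0, 10.0]]
widths = [1.0, 1.0]
weights = [2.0, 3.0]
at_first_center = rbf_output([0.0, 0.0], centers, widths, weights)
```

At the first center the first neuron responds fully and the second is negligible, so the output is essentially the first weight.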

Why use RBF ?
• Density estimation
• Discrimination
• Regression
• Good to know:
ƒ Can be described as Bayesian Networks
ƒ Close to some Fuzzy Systems

Demo
Internet Java demo http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
• HebbRule
• LBG / k-means
• Neural Gas
• GNG
• Kohonen SOM

Vector Quantization
Lloyd (1957), Linde, Buzo, & Gray (1980): Digital Signal Processing, Speech and Image Compression.
Martinetz & Schulten (1993): Neural Gas.
Encode data (in $\Re^D$) using a finite set $\{w_j\}$ $(j = 1, \ldots, k)$ of codebook vectors.
Delaunay triangulation divides $\Re^D$ into k Voronoi polyhedra (“receptive fields”):
$V_i = \{\, v \in \Re^D \mid \|v - w_i\| \le \|v - w_j\| \ \ \forall j \,\}$

k-Means a.k.a. Linde, Buzo & Gray (LBG)
Encoding Distortion Error:
$E = \sum_{i\,(\text{data points})} \| v_i - w_{j(i)} \|^2 \, d_i$
Lower $E(\{w_j(t)\})$ iteratively: Gradient descent
$\forall r:\quad \Delta w_r(t) \equiv w_r(t) - w_r(t-1) = -\frac{\varepsilon}{2} \cdot \frac{\partial E}{\partial w_r} = \varepsilon \cdot \sum_i \delta_{r j(i)} \, (v_i - w_r) \, d_i .$
Inline (Monte Carlo) approach for a sequence $v_i(t)$ selected at random
according to probability density function $d_i$:
$\Delta w_r(t) = \tilde{\varepsilon} \cdot \delta_{r j(i)} \cdot (v_i(t) - w_r) .$
Advantage: fast, reasonable clustering.
Limitations: depends on initial random positions,
difficult to avoid getting trapped in the many local minima of E
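
The Monte Carlo update can be sketched in Python: a data point is drawn at random, the closest codebook vector (the winner j(i)) is found, and only that vector is moved toward the data point. The sketch assumes a uniform density d_i, a fixed step size and initialization from randomly chosen data points; the data set is hypothetical.

```python
import random

def nearest(v, codebook):
    """Index j(i) of the codebook vector closest to v."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(v, codebook[j])))

def online_kmeans(data, k, eps=0.05, steps=5000, seed=0):
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(data, k)]  # initial random positions
    for _ in range(steps):
        v = rng.choice(data)                           # uniform d_i assumed
        r = nearest(v, codebook)
        # Only the winner moves: delta_w_r = eps * (v - w_r)
        codebook[r] = [w + eps * (x - w) for w, x in zip(codebook[r], v)]
    return codebook

# Hypothetical data: two small groups of points
data = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [10.0, 10.0], [10.2, 9.9], [9.8, 10.1]]
codebook = online_kmeans(data, k=2)
```

Changing `seed` changes both the initial positions and the sampling order, which is one way to observe the dependence on initial conditions noted above.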

Neural Gas Revisited
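Avoid local minima traps of k-means by smoothing of energy function:
$\forall r:\quad \Delta w_r(t) = \tilde{\varepsilon} \cdot e^{-s_r/\lambda} \cdot (v_i(t) - w_r) ,$
where $s_r(v_i(t), \{w_j\})$ is the closeness rank:
$\| v_i - w_{j_0} \| \le \| v_i - w_{j_1} \| \le \ldots \le \| v_i - w_{j_{(k-1)}} \|$
$s_{r = j_0} = 0, \quad s_{r = j_1} = 1, \quad \ldots, \quad s_{r = j_{(k-1)}} = k - 1$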

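Note:
$\lambda \to 0$: k-means.
$\lambda \ne 0$: not only “winner” $w_{j(i)}$, also second, third, ... closest are updated.
Can show that this corresponds to stochastic gradient descent on
$\tilde{E}(\{w_j\}, \lambda) = \sum_{r=1}^{k} \sum_i e^{-s_r/\lambda} \, \| v_i - w_r \|^2 \, d_i .$
Note:
$\lambda \to 0$: $\tilde{E} \to E$ (k-means).
$\lambda \to \infty$: $\tilde{E}$ parabolic (single minimum).
Together these limits motivate a time-dependent $\lambda(t)$.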
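
A single neural gas update can be sketched as follows: all codebook vectors are ranked by their distance to the stimulus, and each is moved toward it by a step that decays exponentially with its rank. The function name and the example values are illustrative assumptions.

```python
import math

def neural_gas_step(v, codebook, eps=0.1, lam=1.0):
    """Return the updated codebook after presenting stimulus v once."""
    dist2 = [sum((a - b) ** 2 for a, b in zip(v, w)) for w in codebook]
    order = sorted(range(len(codebook)), key=lambda r: dist2[r])
    rank = {r: s for s, r in enumerate(order)}  # closeness rank s_r: 0 for the winner
    return [[wi + eps * math.exp(-rank[r] / lam) * (vi - wi) for wi, vi in zip(w, v)]
            for r, w in enumerate(codebook)]

# Hypothetical one-dimensional codebook and stimulus
codebook = [[0.0], [2.0], [5.0]]
smooth = neural_gas_step([0.5], codebook, eps=0.1, lam=1.0)   # all vectors move
sharp = neural_gas_step([0.5], codebook, eps=0.1, lam=1e-6)   # only the winner moves
```

With a very small λ the factor for every rank above zero vanishes, which recovers the winner-only k-means update.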

Q: How do we know that we have found the global minimum of E?
A: We don’t (in general).
But we can compute the statistical variability of the $\{w_j\}$ by repeating the
calculation with different seeds for the random number generator.
Codebook vector variability arises due to:
• statistical uncertainty,
• spread of local minima.
A small variability indicates good convergence behavior.
Optimum choice of # of vectors k: variability is minimal.

Pattern Recognition
Definition: “The assignment of a physical object or event to one
of several prespecified categories” -- Duda & Hart
• A pattern is an object, process or event that can be given a name.
• A pattern class (or category) is a set of patterns sharing common attributes and
usually originating from the same source.
• During recognition (or classification) given objects are assigned to prescribed
classes.
• A classifier is a machine which performs classification.
© Voitech Franc, cmp.felk.cvut.cz/~xfrancv/talks/franc-printro03.ppt

PR Applications
• Optical Character Recognition (OCR)
ƒ Handwritten: sorting letters by postal code, input device for PDAs.
ƒ Printed texts: reading machines for blind people, digitization of text documents.
• Biometrics
ƒ Face recognition, verification, retrieval.
ƒ Fingerprint recognition.
ƒ Speech recognition.
• Diagnostic systems
ƒ Medical diagnosis: X-ray, EKG analysis.
ƒ Machine diagnostics, waster detection.

Approaches
• Statistical PR: based on underlying statistical model of patterns and pattern
classes.
• Structural (or syntactic) PR: pattern classes represented by means of formal
structures as grammars, automata, strings, etc.
• Neural networks: classifier is represented as a network of cells modeling
neurons of the human brain (connectionist approach).

Basic Concepts
Pattern
Feature vector $x \in X$
- A vector of observations (measurements): $x = [x_1, x_2, \ldots, x_n]^T$.
- $x$ is a point in feature space $X$.
Hidden state $y \in Y$
- Cannot be directly measured.
- Patterns with equal hidden state belong to the same class.
Task
- To design a classifier (decision rule) $q : X \to Y$
which decides about a hidden state based on an observation.

Example
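Task: jockey-hoopster recognition.
Feature vector: $x = [x_1, x_2]^T$, where $x_1$ is height and $x_2$ is weight.
The set of hidden states is $Y = \{H, J\}$.
The feature space is $X = \Re^2$.
Training examples: $\{(x_1, y_1), \ldots, (x_l, y_l)\}$
Linear classifier:
$q(x) = \begin{cases} H & \text{if } (w \cdot x) + b \ge 0 \\ J & \text{if } (w \cdot x) + b < 0 \end{cases}$
[Figure: training examples in the $(x_1, x_2)$ plane; the line $(w \cdot x) + b = 0$ separates the region $y = H$ from the region $y = J$]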
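
The linear decision rule can be sketched in a few lines. The weight vector, the offset and the two test persons are hypothetical values chosen only to make the example concrete; in practice w and b would be learned from the training examples.

```python
def classify(x, w, b):
    """Linear classifier: 'H' (hoopster) if (w . x) + b >= 0, otherwise 'J' (jockey)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "H" if score >= 0 else "J"

w = [1.0, 0.5]   # hypothetical weights for height (cm) and weight (kg)
b = -215.0       # hypothetical offset

tall_person = classify([200.0, 95.0], w, b)    # score 32.5
small_person = classify([160.0, 52.0], w, b)   # score -29.0
```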

Components of a PR System
[Diagram: Pattern → Sensors and preprocessing → Feature extraction → Classifier → Class assignment; Teacher → Learning algorithm]
• Sensors and preprocessing.
• Feature extraction aims to create discriminative features good for classification.
• A classifier.
• A teacher provides information about hidden state -- supervised learning.
• A learning algorithm sets PR from training examples.

Feature Extraction
Task: to extract features which are good for classification.
Good features: • Objects from the same class have similar feature values.
• Objects from different classes have different values.
“Good” features “Bad” features

Feature Extraction Methods
Feature extraction: measurements $[m_1, m_2, \ldots, m_k]^T$ are mapped by functions $\phi_1, \phi_2, \ldots, \phi_n$ to features $[x_1, x_2, \ldots, x_n]^T$.
Feature selection: a subset of the measurements $[m_1, m_2, m_3, \ldots, m_k]^T$ is kept as the features $[x_1, x_2, \ldots, x_n]^T$.
Problem can be expressed as optimization of parameters of feature extractor $\phi(\theta)$.
Supervised methods: objective function is a criterion of separability
(discriminability) of labeled examples, e.g., linear discriminant analysis (LDA).
Unsupervised methods: a lower-dimensional representation which preserves important
characteristics of input data is sought, e.g., principal component analysis (PCA).

Classifier
A classifier partitions feature space X into class-labeled regions such that
$X = X_1 \cup X_2 \cup \ldots \cup X_{|Y|}$ and $X_1 \cap X_2 \cap \ldots \cap X_{|Y|} = \emptyset$
[Figure: two example partitions of feature space into regions $X_1$, $X_2$, $X_3$]
The classification consists of determining to which region a feature vector x belongs.
Borders between decision regions are called decision boundaries.

Representation of a Classifier
A classifier is typically represented as a set of discriminant functions
$f_i(x) : X \to \Re, \quad i = 1, \ldots, |Y|$
The classifier assigns a feature vector x to the i-th class if $f_i(x) > f_j(x) \ \ \forall j \ne i$
[Diagram: feature vector x → discriminant functions $f_1(x), f_2(x), \ldots, f_{|Y|}(x)$ → max → class identifier y]
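
The max-selection over discriminant functions can be sketched as below. The choice of discriminants is an assumption made for illustration: each one is the negative squared distance to a hypothetical class prototype, so the largest value belongs to the nearest prototype.

```python
def classify(x, discriminants):
    """Return the index i of the largest discriminant value f_i(x)."""
    scores = [f(x) for f in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])

def make_prototype_discriminant(prototype):
    """f(x) = -||x - prototype||^2 (hypothetical choice of discriminant)."""
    return lambda x: -sum((xi - pi) ** 2 for xi, pi in zip(x, prototype))

prototypes = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]
discriminants = [make_prototype_discriminant(p) for p in prototypes]
label = classify([4.0, 4.5], discriminants)
```

Any other family of functions could be plugged in without changing `classify`, which is what makes this representation general.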

Bayesian Decision Making
• Bayesian decision making is a fundamental statistical approach which
allows the optimal classifier to be designed if the complete statistical model is known.
Definition:
Observations $X$
Hidden states $Y$
Decisions $D$
A loss function $W : Y \times D \to R$
A decision rule $q : X \to D$
A joint probability $p(x, y)$
Task: to design decision rule q which minimizes Bayesian risk
$R(q) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \, W(q(x), y)$

Example of a Bayesian Task
Task: minimization of classification error.
A set of decisions D is the same as set of hidden states Y.
0/1 - loss function used:
$W(q(x), y) = \begin{cases} 0 & \text{if } q(x) = y \\ 1 & \text{if } q(x) \ne y \end{cases}$
The Bayesian risk R(q) corresponds to probability of
misclassification.
The solution of Bayesian task is
$q^* = \arg\min_q R(q) \ \Rightarrow\ y^* = \arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)}$
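
For a discrete model the Bayes-optimal rule under 0/1 loss can be computed directly. The priors and likelihoods below are made-up numbers for a toy version of the jockey-hoopster task with a single binary observation; they are assumptions, not data from the slides.

```python
def bayes_decision(x, prior, likelihood):
    """argmax_y p(x|y) p(y); the factor 1/p(x) does not change the argmax."""
    return max(prior, key=lambda y: likelihood[y][x] * prior[y])

def bayes_risk(rule, prior, likelihood):
    """With 0/1 loss the risk is the total probability of misclassification."""
    return sum(likelihood[y][x] * prior[y]
               for y in prior for x in likelihood[y] if rule(x) != y)

# Hypothetical statistical model p(x, y) = p(x|y) p(y)
prior = {"H": 0.5, "J": 0.5}
likelihood = {"H": {"tall": 0.9, "short": 0.1},
              "J": {"tall": 0.2, "short": 0.8}}

rule = lambda x: bayes_decision(x, prior, likelihood)
risk = bayes_risk(rule, prior, likelihood)
```

Here the rule decides H for "tall" and J for "short", and the remaining risk is the probability mass of short hoopsters plus tall jockeys.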

Limitations of the Bayesian Approach
• The statistical model p(x,y) is mostly not known, therefore
learning must be employed to estimate p(x,y) from training
examples {(x1,y1),…,(xA,yA)} -- plug-in Bayes.
• Non-Bayesian methods offer further task formulations:
• Only a partial statistical model is available:
• p(y) is not known or does not exist.
• p(x|y,θ) is influenced by a non-random intervention θ.
• The loss function is not defined.
• Examples: Neyman-Pearson's task, Minimax task, etc.

Discriminative Approaches
Given a class of classification rules q(x;θ) parametrized by θ∈Ξ
the task is to find the “best” parameter θ* based on a set of
training examples {(x1,y1),…,(xA,yA)} -- supervised learning.
The task of learning: recognizing which classification rule is
to be used.
How the learning is performed is determined by a
selected inductive principle.

Empirical Risk Minimization Principle
The true expected risk R(q) is approximated by empirical risk
$R_{emp}(q(x; \theta)) = \frac{1}{A} \sum_{i=1}^{A} W(q(x_i; \theta), y_i)$
with respect to a given labeled training set {(x1,y1),…,(xA,yA)}.
The learning based on the empirical risk minimization principle is
defined as
$\theta^* = \arg\min_{\theta} R_{emp}(q(x; \theta))$
Examples of algorithms: Perceptron, Back-propagation, etc.
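
Empirical risk minimization can be sketched with a deliberately simple rule family. The one-parameter threshold rule, the 0/1 loss, the candidate thresholds and the five labeled examples are all assumptions made for illustration.

```python
def empirical_risk(rule, theta, examples):
    """Average 0/1 loss of rule(.; theta) over the labeled training set."""
    return sum(0 if rule(x, theta) == y else 1 for x, y in examples) / len(examples)

def threshold_rule(x, theta):
    """Hypothetical rule family q(x; theta): 'H' if height >= theta, else 'J'."""
    return "H" if x >= theta else "J"

# Hypothetical training set of (height, label) pairs
examples = [(150, "J"), (160, "J"), (185, "H"), (200, "H"), (170, "H")]
candidates = [155, 165, 175, 190]

# theta* = argmin over the candidate parameters of the empirical risk
best_theta = min(candidates, key=lambda t: empirical_risk(threshold_rule, t, examples))
```

A search over a handful of candidates stands in for the optimization that the Perceptron or back-propagation would perform over continuous parameters.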

Overfitting and Underfitting
Problem: how rich a class of classification rules q(x;θ) to use.
[Figure panels: underfitting, good fit, overfitting]
Problem of generalization: a small empirical risk $R_{emp}$ does not
imply a small true expected risk R.

Structural Risk Minimization Principle
Statistical learning theory -- Vapnik & Chervonenkis.
An upper bound on the expected risk of a classification rule q∈Q
$R(q) \le R_{emp}(q) + R_{str}\left(\frac{1}{A},\ h,\ \log\frac{1}{\sigma}\right)$
where A is the number of training examples, h is the VC-dimension of the class
of functions Q and 1-σ is the confidence of the upper bound.
SRM principle: from given nested function classes Q1,Q2,…,Qm,
such that
$h_1 \le h_2 \le \ldots \le h_m$
select a rule q* which minimizes the upper bound on the expected risk.

Unsupervised Learning
Input: training examples {x1,…,xA} without information about the
hidden state.
Clustering: goal is to find clusters of data sharing similar properties.
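A broad class of unsupervised learning algorithms:
[Diagram: loop in which the classifier labels {x1,…,xA} with {y1,…,yA}, and the learning algorithm returns the parameters θ to the classifier]
Classifier: $q : X \times \Theta \to Y$
Learning algorithm (supervised): $L : (X \times Y)^A \to \Theta$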

Example
k-Means Clustering:
Classifier
$y = q(x) = \arg\min_{i = 1, \ldots, k} \| x - w_i \|$
Goal is to minimize
$\sum_{i=1}^{A} \| x_i - w_{q(x_i)} \|^2$
Learning algorithm
$w_i = \frac{1}{|I_i|} \sum_{j \in I_i} x_j , \qquad I_i = \{\, j : q(x_j) = i \,\}$
[Figure: data {x1,…,xA} clustered around $w_1$, $w_2$, $w_3$, with $\theta = \{w_1, \ldots, w_k\}$ and labels {y1,…,yA}]
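
The alternation between the classifier and the learning algorithm can be sketched as batch k-means. The initial centers, the fixed number of iterations and the four data points are assumptions chosen for illustration.

```python
def q(x, centers):
    """Classifier: index of the nearest center w_i."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))

def kmeans(points, centers, iterations=20):
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        labels = [q(x, centers) for x in points]            # classifier step
        for i in range(len(centers)):                       # learning step
            members = [x for x, y in zip(points, labels) if y == i]
            if members:                                     # keep w_i if I_i is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [q(x, centers) for x in points]

# Hypothetical data and initial centers
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
```

Each center ends up at the mean of the points assigned to it, which is the fixed point of the two alternating steps.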

Neural Network References
• Neural Networks, a Comprehensive Foundation, S. Haykin, Prentice Hall
(1999)
• Neural Networks for Pattern Recognition, C. M. Bishop, Clarendon Press,
Oxford (1997)
• Self Organizing Maps, T. Kohonen, Springer (2001)

Some ANN Toolboxes
• Free software
ƒ SNNS: Stuttgart Neural Network Simulator, JavaNNS
ƒ GNG at Uni Bochum
• Matlab toolboxes
ƒ Fuzzy Logic
ƒ Artificial Neural Networks
ƒ Signal Processing

Pattern Recognition /
Vector Quantization References
Textbooks
Duda, Hart: Pattern Classification and Scene Analysis. J. Wiley & Sons, New York,
1982. (2nd edition 2000).
Fukunaga: Introduction to Statistical Pattern Recognition. Academic Press, 1990.
Bishop: Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1997.
Schlesinger, Hlaváč: Ten lectures on statistical and structural pattern recognition.
Kluwer Academic Publisher, 2002.
