EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA)

Duc-Hieu Tran
tdh.net [at] gmail.com
Nanyang Technological University

July 27, 2010






  Outline



   The parameter estimation problem


   EM algorithm


   Probabilistic Latent Semantic Analysis


   Reference




The parameter estimation problem


  Introduction



   Given the prior probabilities P(ωi) and the class-conditional densities p(x|ωi)
   =⇒ optimal classifier
           P(ωj|x) ∝ p(x|ωj)P(ωj)
           decide ωi if P(ωi|x) > P(ωj|x), ∀j ≠ i
   In practice, p(x|ωi) is unknown – it must be estimated from training samples
   (e.g., by assuming p(x|ωi) ∼ N(µi, Σi)).
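
   A minimal sketch of the resulting decision rule, assuming Gaussian class-conditional
   densities whose parameters have already been estimated from training samples (all
   names and numbers below are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def classify(x, priors, means, covs):
    """Pick the class maximizing P(omega_i | x) ∝ p(x | omega_i) P(omega_i)."""
    posteriors = [gaussian_pdf(x, m, c) * p for p, m, c in zip(priors, means, covs)]
    return int(np.argmax(posteriors))

# Toy usage with two classes in 2-D.
priors = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
print(classify(np.array([1.8, 1.5]), priors, means, covs))  # likely class 1
```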






  Frequentist vs. Bayesian schools


   Frequentist
           parameters – quantities whose values are fixed but unknown.
           the best estimate of their values – the one that maximizes the
           probability of obtaining the observed samples.
   Bayesian
           parameters – random variables having some known prior distribution.
           observation of the samples converts this to a posterior density;
           revising our opinion about the true values of the parameters.






  Examples

           training samples: S = {(x^(1), y^(1)), . . . , (x^(m), y^(m))}
           frequentist: maximum likelihood

               \max_\theta \prod_i p(y^{(i)} | x^{(i)}; \theta)

           bayesian: P(θ) – prior, e.g., P(θ) ∼ N(0, I)

               P(\theta | S) \propto \Big[ \prod_{i=1}^{m} P(y^{(i)} | x^{(i)}, \theta) \Big] P(\theta)

               \theta_{MAP} = \arg\max_\theta P(\theta | S)
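
   A small sketch contrasting the two estimates (illustrative, not from the slides):
   the mean of a Gaussian with known variance σ², estimated by maximum likelihood and
   by MAP under the prior θ ∼ N(0, τ²); both estimates have closed forms.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 1.0
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=10)   # observed samples

theta_mle = x.mean()                                       # maximizes the likelihood
theta_map = x.sum() / (len(x) + sigma2 / tau2)             # posterior mode, shrunk toward 0

print(f"MLE: {theta_mle:.3f}  MAP: {theta_map:.3f}")
```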




EM algorithm


  An estimation problem

           training set of m independent samples: {x^(1), x^(2), . . . , x^(m)}
           goal: fit the parameters of a model p(x, z) to the data
           the log-likelihood:

               \ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z} p(x^{(i)}, z; \theta)

           explicitly maximizing ℓ(θ) might be difficult (see the small illustration below).
           z – latent random variable
           if z^(i) were observed, then maximum likelihood estimation would be easy.
           strategy: repeatedly construct a lower bound on ℓ (E-step) and
           optimize that lower bound (M-step).
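
   A tiny illustration of why ℓ(θ) is awkward to maximize directly (illustrative, not
   from the slides): for a two-component Gaussian mixture the sum over the latent z
   sits inside the log, so the log-likelihood does not decompose per component.

```python
import numpy as np

def mixture_log_likelihood(x, pi, mu, sigma):
    """log p(x; theta) = sum_i log sum_z pi_z N(x_i; mu_z, sigma_z^2)."""
    x = np.asarray(x)[:, None]                      # shape (m, 1)
    comp = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(np.log(comp.sum(axis=1)))         # the log of a sum over z

x = [-1.2, 0.3, 4.8, 5.1, 0.1]
print(mixture_log_likelihood(x, pi=np.array([0.5, 0.5]),
                             mu=np.array([0.0, 5.0]),
                             sigma=np.array([1.0, 1.0])))
```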




  EM algorithm (1)
           digression: Jensen’s inequality.
           f – convex function; E[f(X)] ≥ f(E[X])
           for each i, Qi – a distribution over z: \sum_z Q_i(z) = 1, Q_i(z) ≥ 0

               \ell(\theta) = \sum_i \log p(x^{(i)}; \theta)
                            = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)
                            = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}          (1)

           applying Jensen’s inequality to the concave function log:

                            \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}        (2)

      More detail . . .
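
   A quick numeric check of the inequality being used (illustrative): for the concave
   function log, E[log X] ≤ log E[X] for any positive random variable X, so the bound
   in (2) never exceeds (1).

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])          # a distribution Q over three outcomes
x = np.array([0.4, 1.0, 3.0])          # positive values taken by X

lhs = np.sum(q * np.log(x))            # E[log X]
rhs = np.log(np.sum(q * x))            # log E[X]
assert lhs <= rhs
print(lhs, rhs)
```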


  EM algorithm (2)
           for any set of distributions Qi, formula (2) gives a lower bound on ℓ(θ)
           how to choose Qi?
           strategy: make the inequality hold with equality at our particular
           value of θ.
           require:

               \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c

           c – a constant that does not depend on z^(i)
           choose: Qi(z^(i)) ∝ p(x^(i), z^(i); θ)
           we know \sum_{z} Q_i(z) = 1, so

               Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} | x^{(i)}; \theta)



  EM algorithm (3)


           Qi – posterior distribution of z^(i) given x^(i) and the parameter θ

   EM algorithm: repeat until convergence
           E-step: for each i,

               Q_i(z^{(i)}) := p(z^{(i)} | x^{(i)}; \theta)

           M-step:

               \theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}

   The algorithm will converge, since ℓ(θ^(t)) ≤ ℓ(θ^(t+1))
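
   A minimal sketch of these two steps for a one-dimensional mixture of two Gaussians
   (illustrative, not from the slides): the E-step computes Qi as the posterior
   "responsibility" p(z | x_i; θ), and the M-step maximizes the lower bound in closed
   form.

```python
import numpy as np

def em_gmm(x, n_iter=50):
    m = len(x)
    pi = np.array([0.5, 0.5])                   # mixing weights
    mu = np.array([x.min(), x.max()])           # crude initialization
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: Q_i(z) = p(z | x_i; theta)
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        q = dens / dens.sum(axis=1, keepdims=True)           # shape (m, 2)
        # M-step: maximize the lower bound w.r.t. pi, mu, var (closed form)
        nk = q.sum(axis=0)
        pi = nk / m
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 300)])
print(em_gmm(x))   # roughly (0.4, 0.6), (0, 5), (1, 1)
```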





  EM algorithm (4)
   Digression: coordinate ascent algorithm (a tiny numeric sketch follows below).

               \max_\alpha W(\alpha_1, . . . , \alpha_m)

           loop until convergence:
               for i ∈ 1, . . . , m:

                   \alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, . . . , \hat{\alpha}_i, . . . , \alpha_m)

   EM-algorithm as coordinate ascent

               J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}

           ℓ(θ) ≥ J(Q, θ)
           EM algorithm can be viewed as coordinate ascent on J
           E-step: maximize w.r.t Q
           M-step: maximize w.r.t θ
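
   A tiny illustration of coordinate ascent itself (illustrative, not from the slides):
   alternately maximizing a concave two-variable function over one coordinate at a time,
   each step in closed form.

```python
# Coordinate ascent on W(a, b) = -(a - 1)^2 - (b + 2)^2 - a*b (concave).
a, b = 0.0, 0.0
for _ in range(100):
    a = (2 - b) / 2          # argmax_a W(a, b): dW/da = -2(a - 1) - b = 0
    b = (-4 - a) / 2         # argmax_b W(a, b): dW/db = -2(b + 2) - a = 0
print(a, b)                  # converges to the maximizer (8/3, -10/3)
```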
Probabilistic Latent Semantic Analysis


  Probabilistic Latent Semantic Analysis (1)

           set of documents D = {d1 , . . . , dN }
           set of words W = {w1 , . . . , wM }
           set of unobserved classes Z = {z1 , . . . , zK }
           conditional independence assumption:

               P(d_i, w_j | z_k) = P(d_i | z_k) P(w_j | z_k)                                  (3)

           so,

               P(w_j | d_i) = \sum_{k=1}^{K} P(z_k | d_i) P(w_j | z_k)                        (4)

               P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i)
      More detail . . .





  Probabilistic Latent Semantic Analysis (2)

           n(di, wj) – # of occurrences of word wj in doc. di
           Likelihood

               L = \prod_{i=1}^{N} P(d_i) = \prod_{i=1}^{N} \prod_{j=1}^{M} [P(d_i, w_j)]^{n(d_i, w_j)}

                 = \prod_{i=1}^{N} \prod_{j=1}^{M} \Big[ P(d_i) \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i) \Big]^{n(d_i, w_j)}

           log-likelihood ℓ = log(L)

               \ell = \sum_{i=1}^{N} \sum_{j=1}^{M} \Big[ n(d_i, w_j) \log P(d_i) + n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i) \Big]
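
   A short sketch of evaluating this log-likelihood from a term-document count matrix
   (illustrative; the toy matrix, parameter arrays, and the small constant added before
   the log are assumptions of the example, not part of the slides):

```python
import numpy as np

def plsa_log_likelihood(n, p_w_given_z, p_z_given_d):
    """Second term of the log-likelihood: sum_ij n(d_i,w_j) log sum_k P(w_j|z_k)P(z_k|d_i)."""
    p_w_given_d = p_z_given_d @ p_w_given_z            # N x M matrix of P(w_j | d_i), formula (4)
    return np.sum(n * np.log(p_w_given_d + 1e-12))     # small constant guards against log(0)

# Toy corpus: 2 documents, 3 words, 2 topics.
n = np.array([[3, 0, 1],
              [0, 2, 2]])
p_w_given_z = np.array([[0.7, 0.1, 0.2],
                        [0.1, 0.5, 0.4]])               # K x M
p_z_given_d = np.array([[0.9, 0.1],
                        [0.2, 0.8]])                    # N x K
print(plsa_log_likelihood(n, p_w_given_z, p_z_given_d))
```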





  Probabilistic Latent Semantic Analysis (3)
           maximize w.r.t P(wj|zk), P(zk|di)
           ≈ maximize

               \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j|z_k) P(z_k|d_i)

               = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} Q_k(z_k) \frac{P(w_j|z_k) P(z_k|d_i)}{Q_k(z_k)}

               \geq \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} Q_k(z_k) \log \frac{P(w_j|z_k) P(z_k|d_i)}{Q_k(z_k)}

           choose

               Q_k(z_k) = \frac{P(w_j|z_k) P(z_k|d_i)}{\sum_{l=1}^{K} P(w_j|z_l) P(z_l|d_i)} = P(z_k | d_i, w_j)
              More detail . . .



  Probabilistic Latent Semantic Analysis (4)


           ≈ maximize (w.r.t P(wj|zk), P(zk|di))

               \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k|d_i, w_j) \log \frac{P(w_j|z_k) P(z_k|d_i)}{P(z_k|d_i, w_j)}

           ≈ maximize

               \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k|d_i, w_j) \log [P(w_j|z_k) P(z_k|d_i)]






  Probabilistic Latent Semantic Analysis (5)
   EM-algorithm
           E-step: update

               P(z_k | d_i, w_j) = \frac{P(w_j|z_k) P(z_k|d_i)}{\sum_{l=1}^{K} P(w_j|z_l) P(z_l|d_i)}

           M-step: maximize w.r.t P(wj|zk), P(zk|di)

               \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k|d_i, w_j) \log [P(w_j|z_k) P(z_k|d_i)]

           subject to

               \sum_{j=1}^{M} P(w_j|z_k) = 1,  k ∈ {1 . . . K}

               \sum_{k=1}^{K} P(z_k|d_i) = 1,  i ∈ {1 . . . N}


  Probabilistic Latent Semantic Analysis (6)


   Solution of maximization problem in M-step:

               P(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k|d_i, w_j)}{\sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m) P(z_k|d_n, w_m)}

               P(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k|d_i, w_j)}{n(d_i)}

   where n(d_i) = \sum_{j=1}^{M} n(d_i, w_j)
      More detail . . .






  Probabilistic Latent Semantic Analysis (7)

   All together
           E-step:

               P(z_k | d_i, w_j) = \frac{P(w_j|z_k) P(z_k|d_i)}{\sum_{l=1}^{K} P(w_j|z_l) P(z_l|d_i)}

           M-step:

               P(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k|d_i, w_j)}{\sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m) P(z_k|d_n, w_m)}

               P(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k|d_i, w_j)}{n(d_i)}
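
   A compact sketch of these updates on a toy term-document matrix (illustrative, not
   Hofmann's original implementation; the initialization, iteration count, and the small
   constants guarding against division by zero are assumptions of the example):

```python
import numpy as np

def plsa_em(n, K, n_iter=100, seed=0):
    """Fit pLSA by the E-step / M-step above. n is the N x M count matrix."""
    rng = np.random.default_rng(seed)
    N, M = n.shape
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # P(w_j | z_k)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # P(z_k | d_i)
    for _ in range(n_iter):
        # E-step: posterior P(z_k | d_i, w_j), shape (N, M, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        weighted = n[:, :, None] * post                       # n(d_i, w_j) P(z_k | d_i, w_j)
        p_w_z = weighted.sum(axis=0).T                        # numerator of P(w_j | z_k)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)                          # numerator of P(z_k | d_i)
        p_z_d /= n.sum(axis=1, keepdims=True) + 1e-12         # divide by n(d_i)
    return p_w_z, p_z_d

# Toy corpus: 4 documents over 6 words, K = 2 latent topics.
n = np.array([[5, 3, 0, 0, 1, 0],
              [4, 2, 1, 0, 0, 0],
              [0, 0, 4, 5, 0, 2],
              [0, 1, 3, 4, 0, 3]])
p_w_z, p_z_d = plsa_em(n, K=2)
print(np.round(p_z_d, 2))   # documents 0-1 and 2-3 should load on different topics
```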




Reference




           R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification,
           Wiley-Interscience, 2001.
           T. Hofmann, “Unsupervised learning by probabilistic latent semantic
           analysis,” Machine Learning, vol. 42, 2001, pp. 177–196.
           A. Ng, “Machine Learning (CS229),” course notes, Stanford University.




Appendix

   Generative model for word/document co-occurrence
           select a document di with probability (w.p.) P(di)
           pick a latent class zk w.p. P(zk|di)
           generate a word wj w.p. P(wj|zk)

               P(d_i, w_j) = \sum_{k=1}^{K} P(d_i, w_j | z_k) P(z_k) = \sum_{k=1}^{K} P(w_j|z_k) P(d_i|z_k) P(z_k)

                           = \sum_{k=1}^{K} P(w_j|z_k) P(z_k|d_i) P(d_i)

                           = P(d_i) \sum_{k=1}^{K} P(w_j|z_k) P(z_k|d_i)

               P(d_i, w_j) = P(w_j|d_i) P(d_i)

               =⇒ P(w_j|d_i) = \sum_{k=1}^{K} P(z_k|d_i) P(w_j|z_k)
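
   A short sketch of sampling (document, word) pairs from this generative process
   (illustrative; the probability tables are toy values, not estimated from data):

```python
import numpy as np

rng = np.random.default_rng(0)
p_d = np.array([0.5, 0.5])                       # P(d_i), N = 2
p_z_d = np.array([[0.9, 0.1], [0.2, 0.8]])       # P(z_k | d_i), K = 2
p_w_z = np.array([[0.6, 0.3, 0.1],               # P(w_j | z_k), M = 3
                  [0.1, 0.2, 0.7]])

def sample_pair():
    d = rng.choice(len(p_d), p=p_d)              # select a document
    z = rng.choice(p_z_d.shape[1], p=p_z_d[d])   # pick a latent class
    w = rng.choice(p_w_z.shape[1], p=p_w_z[z])   # generate a word
    return d, w

print([sample_pair() for _ in range(5)])
```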





               P(w_j|d_i) = \sum_{k=1}^{K} P(z_k|d_i) P(w_j|z_k)

           since \sum_{k=1}^{K} P(z_k|d_i) = 1, P(wj|di) is a convex combination of the P(wj|zk)
           ≈ each document is modelled as a mixture of topics








               P(z_k | d_i, w_j) = \frac{P(d_i, w_j | z_k) P(z_k)}{P(d_i, w_j)}                                (5)

                                 = \frac{P(w_j|z_k) P(d_i|z_k) P(z_k)}{P(d_i, w_j)}                            (6)

                                 = \frac{P(w_j|z_k) P(z_k|d_i)}{P(w_j|d_i)}                                    (7)

                                 = \frac{P(w_j|z_k) P(z_k|d_i)}{\sum_{l=1}^{K} P(w_j|z_l) P(z_l|d_i)}          (8)

   From (5) to (6) by conditional independence assumption (3). From (7) to
   (8) by (4).








   Lagrange multipliers τk, ρi

               H = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k|d_i, w_j) \log [P(w_j|z_k) P(z_k|d_i)]

                   + \sum_{k=1}^{K} \tau_k \Big( 1 - \sum_{j=1}^{M} P(w_j|z_k) \Big) + \sum_{i=1}^{N} \rho_i \Big( 1 - \sum_{k=1}^{K} P(z_k|d_i) \Big)

               \frac{\partial H}{\partial P(w_j|z_k)} = \frac{\sum_{i=1}^{N} P(z_k|d_i, w_j) n(d_i, w_j)}{P(w_j|z_k)} - \tau_k = 0

               \frac{\partial H}{\partial P(z_k|d_i)} = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k|d_i, w_j)}{P(z_k|d_i)} - \rho_i = 0








   from \sum_{j=1}^{M} P(w_j|z_k) = 1:

               \tau_k = \sum_{j=1}^{M} \sum_{i=1}^{N} P(z_k|d_i, w_j) n(d_i, w_j)

   from \sum_{k=1}^{K} P(z_k|d_i, w_j) = 1:

               \rho_i = n(d_i)

   =⇒ the M-step solutions for P(wj|zk) and P(zk|di) given earlier






  Applying Jensen’s inequality

           f(x) = log(x), a concave function

               f\Big( E_{z^{(i)} \sim Q_i}\Big[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \Big] \Big) \geq E_{z^{(i)} \sim Q_i}\Big[ f\Big( \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \Big) \Big]



